doc/nestedvm.ivme04.tex

   1 %
   2 % FIXME: - Add something about size limits on the constant pool
   3 %          how we worked around that and the performance impact
   4 %          of -o lessconstants
   5 %        - Add something about encoding data sections as string constants
   6 %          and the UTF8 non-7-bit-ascii penalty
   7 %
   8
   9 \documentclass{acmconf}
  10 \usepackage{graphicx}
  11 \usepackage{amssymb,amsmath,epsfig,alltt}
  12 \sloppy         % better line breaks
  13 \usepackage{palatino}
  14 \usepackage{parskip}
  15 \usepackage{tabularx}
  16 \usepackage{alltt}
  17 \bibliographystyle{alpha}
  18
  19 \title{\textbf{\textsf{
  20 Running Legacy C/C++ Libraries in a Pure Java Environment
  21 }}}
  22 \date{}
  23 \author{\begin{tabular}{@{}c@{}}
  24         {\em {Brian Alliet}} \\ \\
  25         {{\it Affiliation Goes Here}}\relax
  26    \end{tabular}\hskip 1in\begin{tabular}{@{}c@{}}
  27         {\em {Adam Megacz}} \\ \\
  28         {UC Berkeley}\relax
  29 \end{tabular}}
  30 \begin{document}
  31
  32 \maketitle
  33
  34 \begin{abstract}
  35 Abstract
  36 \end{abstract}
  37
  38 \section{Introduction}
  39
  40 \subsection{Why would you want to do this?}
  41
  42 The C programming language has been around for over 30 years now.
  43 There is a huge library of software written in this language.  By
  44 contrast, Java has been around for less than ten years.  Although it
  45 offers substantial advantages over C, the set of libraries written in
  46 this language still lags behind C/C++.
  47
  48 The typical solution to this dilemma is to use JNI or CNI to invoke C
  49 code from within a Java VM.  Unfortunately, there are a number of
  50 situations in which this is not an acceptable solution:
  51
  52 \begin{itemize}
  53 \item Java Applets are not permitted to invoke {\tt Runtime.loadLibrary()}
  54 \item Java Servlet containers with a {\tt SecurityManager} will not
  55       permit loading new JNI libraries.  This configuration is popular
  56       with {\it shared hosting} providers and corporate intranets
  57       where a number of different parties contribute individual web
  58       applications which are run together in a single container.
  59 \item JNI requires the native library to be compiled ahead of time,
  60       separately, for every architecture on which it will be deployed.
  61       This is unworkable for situations in which the full set of
  62       target architectures is not known at deployment time.
  63 \item The increasingly popular J2ME platform does not support JNI or CNI.
  64 \item Unlike Java Bytecode, JNI code is susceptible to buffer overflow
  65       and heap corruption attacks.  This can be a major security
  66       vulnerability.
  67 \item JNI often introduces undesirable added complexity to an
  68       application.
  69 \end{itemize}
  70
  71 The technique we present here is based on using a typical ANSI C
  72 compiler to compile C/C++ code into a MIPS binary, and then using a
  73 tool to translate that binary on an instruction-by-instruction basis
  74 into Java bytecode.
  75
  76 The technique presented here is general; we anticipate that it can be
  77 applied to other secure virtual machines such as Microsoft's .NET.
  78
  79
  80 \section{Existing Work: Source-to-Source Translation}
  81
  82 \begin{itemize}
  83 \item c2java
  84 \item commercial products
  85 \end{itemize}
  86
  87 \section{Mips2Java: Binary-to-Binary Translation}
  88
  89 We present Mips2Java, a binary-to-binary translation tool to convert
  90 MIPS binaries into Java bytecode files.
  91
  92 The process of utilizing Mips2Java begins by using any compiler for
  93 any language to compile the source library into a statically linked
  94 MIPS binary.  We used {\tt gcc 3.3.3}, but any compiler which can
  95 target the MIPS platform should be acceptable.  The binary is
  96 statically linked with a system library (in the case of C code this is
  97 {\tt libc}) which translates system requests (such as {\tt open()},
  98 {\tt read()}, or {\tt write()}) into appropriate invocations of the
  99 MIPS {\tt SYSCALL} instruction.
 100
 101 The statically linked MIPS binary is then fed to the Mips2Java tool
 102 (which is itself written in Java), which emits a sequence of Java
 103 Bytecodes in the form of a {\tt .class} file equivalent to the
 104 provided binary.  This {\tt .class} file contains a single class which
 105 implements the {\tt Runtime} interface.  This class may then be
 106 instantiated; invoking the {\tt execute()} method is equivalent to
 107 invoking the {\tt main()} function in the original binary.
 108
 109
 110 \subsection{Why MIPS?}
 111
 112 We chose MIPS as a source format for two primary reasons: the
 113 availability of tools to translate legacy code into MIPS binaries, and
 114 the close similarity between the MIPS ISA and the Java Virtual Machine.
 115
 116 The MIPS architecture has been around for quite some time, and is well
 117 supported by the GNU Compiler Collection, which is capable of
 118 compiling C, C++, Java, Fortran, Pascal (with p2c), and Objective C
 119 into MIPS binaries.
 120
 121 The MIPS R1000 ISA bears a striking similarity to the Java Virtual
 122 Machine.  This early revision of the MIPS supports only 32-bit aligned
 123 loads and stores from memory, which is precisely the access model
 124 supported by Java for {\tt int[]}s.
 125
 126 Cover dynamic page allocation.
 127
 128 Cover stack setup.
 129
 130 Brian, are there any other fortunate similarities we should mention
 131 here?  I seem to remember there being a bunch, but I can't recall them
 132 right now; it's been a while since I dealt with this stuff in detail.
 133
 134
 135 \subsection{Interpreter}
 136
 137 \begin{itemize}
 138 \item slow, but easy to write
 139 \end{itemize}
 140
 141
 142 \subsection{Compiling to Java Source}
 143 \begin{itemize}
 144 \item performance boost
 145 \item pushes the limits of {\tt javac} and {\tt jikes}
 146 \end{itemize}
 147
 148 \subsection{Compiling directly to Java Bytecode}
 149 \begin{itemize}
 150 \item further performance boost (quantify)
 151 \item brian, can you add any comments here?
 152 \end{itemize}
 153
 154 \subsection{Interfacing with Java Code}
 155
 156 Java source code can create a copy of the translated binary by
 157 instantiating the corresponding class, which extends {\tt Runtime}.
 158 Invoking the {\tt main()} method on this class is equivalent to
 159 calling the {\tt main()} function within the binary; the {\tt String}
 160 arguments to this function are copied into the binary's memory space
 161 and made available as {\tt argv**} and {\tt argc}.
 162
 163 The translated binary communicates with the rest of the VM by
 164 executing MIPS {\tt SYSCALL} instructions, which are translated into
 165 invocations of the {\tt syscall()} method.  This calls back to the
 166 native Java world, which can manipulate the binary's environment by
 167 reading and writing to its memory space, checking its exit status,
 168 pausing the VM, and restarting the VM.
 169
 170
 171 \subsection{Virtualization}
 172
 173 The {\tt Runtime} class implements the majority of the standard {\tt
 174 libc} syscalls, providing a complete interface to the filesystem,
 175 network socket library, time of day, (Brian: what else goes here?).
 176
 177 \begin{itemize}
 178 \item ability to provide the same interface to CNI code and mips2javaified code
 179 \item security advantages (chroot the {\tt fork()}ed process)
 180 \end{itemize}
 181
 182 \section{Related Work}
 183
 184 \subsection{Source-to-Source translators}
 185
 186 A number of commercial products and research projects attempt to
 187 translate C++ code to Java code, preserving the mapping of C++ classes
 188 to Java classes.  Unfortunately this is problematic since there is no
 189 way to do pointer arithmetic except within arrays, and even in that
 190 case, arithmetic cannot be done in terms of fractional objects.
 191
 192 Mention gcc backend
 193
 194 c2java
 195
 196 Many of these products advise the user to tweak the code which results
 197 from the translation.  Unfortunately, hand-modifying machine-generated
 198 code is generally a bad idea, since this modification cannot be
 199 automated.  This means that every time the origin code changes, the
 200 code generator must be re-run, and the hand modifications must be
 201 performed yet again.  This is an error-prone process.
 202
 203 Furthermore, Mips2Java does not attempt to read C code directly.  This
 204 frees it from the complex task of faithfully implementing the ANSI C
 205 standard (or, in the case of non ANSI-C compliant code, some other
 206 interface).  This also saves the user the chore of altering their
 207 build process to accomodate Mips2Java.
 208
 209
 210 \section{Performance}
 211
 212 \subsection{Charts}
 213
 214 (Note that none of these libraries have pure-Java equivalents.)
 215
 216 \begin{itemize}
 217 \item libjpeg
 218 \item libfreetype
 219 \item libmspack
 220 \end{itemize}
 221
 222
 223 \subsection{Optimizations}
 224
 225 Brian, can you write something to go here?  Just mention which
 226 optimizations helped and which ones hurt.
 227
 228 \begin{itemize}
 229 \item trampoline
 230 \item optimal method size
 231 \item -msingle-float
 232 \item -mmemcpy
 233 \item fastmem
 234 \item local vars for registers (useless)
 235 \item -fno-rename-registers
 236 \item -ffast-math
 237 \item -fno-trapping-math
 238 \item -fsingle-precision-constant
 239 \item -mfused-madd
 240 \item -freg-struct-return
 241 \item -freduce-all-givs
 242 \item -fno-peephole
 243 \item -fno-peephole2
 244 \item -fmove-all-movables
 245 \item -fno-sched-spec-load
 246 \item -fno-sched-spec
 247 \item -fno-schedule-insns
 248 \item -fno-schedule-insns2
 249 \item -fno-delayed-branch
 250 \item -fno-function-cse
 251 \item -ffunction-sections
 252 \item -fdata-sections
 253 \item array bounds checking
 254 \item -falign-functions=n
 255 \item -falign-labels=n
 256 \item -falign-loops=n
 257 \item -falign-jumps=n
 258 \item -fno-function-cse
 259 \end{itemize}
 260
 261 \section{Future Directions}
 262
 263 World domination.
 264
 265 \section{Conclusion}
 266
 267 We rock the hizzouse.
 268
 269 \section{References}
 270
 271 Yer mom.
 272
 273 \section{stuff}
 274 \begin{verbatim}
 275
 276 libjpeg (render thebride_1280.jpg)
 277 Native -  0.235s
 278 JavaSource - 1.86s
 279 ClassFile - 1.37s
 280
 281 freetype (rendering characters 32-127 of Comic.TTF at sizes from 8 to
 282 48 incrementing by 4)
 283 Native - 0.201s
 284 JavaSource - 2.02s
 285 ClassFile - 1.46s
 286
 287 Section 3.2  - Interpreter
 288 The Interpreter was the first part of Mips2Java to be written.  This was the most
 289 straightforward and simple way to run MIPS binaries inside the JVM.  The simplicity of the
 290 interpreter also made it very simple to debug. Debugging machine-generated code is a pain.
 291 Most importantly, the interpreter provided a great reference to use when developing the
 292 compiler. With known working implementations of each MIPS instruction in Java writing a
 293 compiler became a matter of simple doing the same thing ahead of time.
 294 With the completion of the compiler the interpreter in Mips2Java has become less useful.
 295 However, it may still have some life left in it. One possible use is remote debugging with
 296 GDB. Although debugging the compiler generated JVM code is theoretically possible, it
 297 would be far easier to do in the interpreter. The interpreter may also be useful in cases
 298 where size is far more important than speed. The interpreter is very small. The interpreter
 299 and MIPS binary combined are smaller than the compiled classfiles.
 300 Section 3.3 - Compiling to Java Source
 301 The next major step in Mips2JavaÕs development was the Java source compiler. This
 302 generated Java source code (compliable with javac or Jikes) from a MIPS binary.
 303 Generating Java source code was preferred initially over JVM bytecode for two reasons.
 304 The authors werenÕt very familiar with JVM bytecode and therefore generating Java source
 305 code was simpler. Generating source code also eliminated the need to do trivial
 306 optimizations in the Mips2java compiler that javac and Jikes already do. This mainly
 307 includes 2+2=4 stuff. For example, the MIPS register r0 is immutable and always 0. This
 308 register is represented by a static final int in the Java source compiler. Javac and Jikes
 309 automatically handle optimizing this away when possible. In the JVM bytecode compiler
 310 these optimizations needs to be done in Mips2Java.
 311 Early versions of the Mips2Java compiler were very simple. All 32 MIPS GPRs and a
 312 special PC register were fields in the generated java class. There was a run() method
 313 containing all the instructions in the .text segment converted to Java source code. A switch
 314 statement was used to allow jumps from instruction to instruction. The generated code
 315 looked something like this.
 316 private final static int r0 = 0;
 317 private int r1, r2, r3,...,r30;
 318 private int r31 = 0xdeadbeef;
 319 private int pc = ENTRY_POINT;
 320
 321 public void run() {
 322         for(;;) {
 323         switch(pc) {
 324                 case 0x10000:
 325                         r29 = r29 Ð 32;
 326                 case 0x10004:
 327                         r1 = r4 + r5;
 328                 case 0x10008:
 329                         if(r1 == r6) {
 330                                 /* delay slot */
 331                                 r1 = r1 + 1;
 332                                 pc = 0x10018:
 333                                 continue;
 334                         }
 335                 case 0x1000C:
 336                         r1 = r1 + 1;
 337                 case 0x10010:
 338                         r31 = 0x10018;
 339                         pc = 0x10210;
 340                         continue;
 341                 case 0x10014:
 342                         /* nop */
 343                 case 0x10018:
 344                         pc = r31;
 345                         continue;
 346                 ...
 347                 case 0xdeadbeef:
 348                         System.err.println("Exited from ENTRY_POINT");
 349                         System.err.println("R2: " + r2);
 350                         System.exit(1);
 351         }
 352         }
 353 }
 354
 355 This worked fine for small binaries but as soon as anything substantial was fed to it the 64k
 356 JVM method size limit was soon hit. The solution to this was to break up the code into
 357 many smaller methods. This required a trampoline to dispatch  jumps to the appropriate
 358 method. With the addition of the trampoline the generated code looked something like this:
 359 public void run_0x10000() {
 360         for(;;) {
 361         switch(pc) {
 362                 case 0x10000:
 363                         ...
 364                 case 0x10004:
 365                         ...
 366                 ...
 367                 case 0x10010:
 368                         r31 = 0x10018;
 369                         pc = 0x10210;
 370                         return;
 371                 ...
 372         }
 373         }
 374 }
 375
 376 pubic void run_0x10200() {
 377         for(;;) {
 378         switch(pc) {
 379                 case 0x10200:
 380                         ...
 381                 case 0x10204:
 382                         ...
 383         }
 384         }
 385 }
 386
 387 public void trampoline() {
 388         for(;;) {
 389         switch(pc&0xfffffe00) {
 390                         case 0x10000: run_0x10000(); break;
 391                         case 0x10200: run_0x10200(); break;
 392                         case 0xdeadbe00:
 393                                 ...
 394                 }
 395         }
 396 }
 397 With this trampoline in place somewhat large binaries could be handled without much
 398 difficulty. There is no limit on the size of a classfile as a whole, just individual methods.
 399 This method should scale well. However, there are other classfile limitations that will limit
 400 the size of compiled binaries.
 401 Another interesting problem that was discovered while creating the trampoline method was
 402 javac and JikesÕ incredible stupidity when dealing with switch statements. The follow code
 403 fragment gets compiled into a lookupswich by javac:
 404 Switch(pc&0xffffff00) {
 405         Case 0x00000100: run_100(); break;
 406         Case 0x00000200: run_200(); break;
 407         Case 0x00000300: run_300(); break;
 408 }
 409 while this nearly identical code fragment gets compiled to a tableswitch
 410 switch(pc>>>8) {
 411         case 0x1: run_100(); break
 412         case 0x2: run_200(); break;
 413         case 0x3: run_300(); break;
 414 }
 415 Javac isnÕt smart enough to see the patter in the case values and generates very suboptimal
 416 bytecode. Manually doing the shifts convinces javac to emit a tableswitch statement, which
 417 is significantly faster. This change alone nearly doubled the speed of the compiled binary.
 418 Finding the optimal method size lead to the next big performance increase.  It was
 419 determined with experimentation that the optimal number of MIPS instructions per method
 420 is 128 (considering only power of two options). Going above or below that lead to
 421 performance decreases. This is most likely due to a combination of two factors.
 422 _ The two levels  of switch statements jumps have to pass though Ð The first switch
 423 statement jumps go through is the trampoline switch statement. This is implemented
 424 as a TABLESWITCH in JVM bytecode so it is very fast. The second level switch
 425 statement in the individual run_ methods is implemented as a LOOKUPSWITCH,
 426 which is much slower. Using smaller methods puts more burden on the faster
 427 TABLESWITCH and less on the slower LOOKUPSWITCH.
 428 _ JIT compilers probably favor smaller methods smaller methods are easier to compile
 429 and are probably better candidates for JIT compilation than larger methods.
 430 FIXME: Put a chart here
 431 The next big optimization was eliminating unnecessary case statements. Having case
 432 statements before each instruction prevents JIT compilers from being able to optimize
 433 across instruction boundaries. In order to eliminate unnecessary case statements every
 434 possible address that could be jumped to directly needed to be identified. The sources for
 435 possible jump targets come from 3 places.
 436 _ The .text segment Ð Every instruction in the text segment in scanned for jump
 437 targets. Every branch instruction (BEQ, JAL, etc) has its destination added to the list
 438 of possible branch targets. In addition, functions that set the link register have
 439 theirpc+8 added to the list (the address that wouldÕve been put to the link register).
 440 Finally, combinations of LUI (Load Upper Immediate) of ADDIU (Add Immediate
 441 Unsigned) are scanned for possible addresses in the text segment. This combination
 442 of instructions is often used to load a 32-bit word into a register.
 443 _ The .data segment Ð When GCC generates switch() statements it often uses a jump
 444 table stored in the .data segment. Unfortunately gcc doesnÕt identify these jump
 445 tables in any way. Therefore, the entire .data segment is conservatively scanned for
 446 possible addresses in the .text segment.
 447 _ The symbol table Ð This is mainly used as a backup. Scanning the .text and .data
 448 segments should identify any possible jump targets but adding every function in the
 449 symbol table in the ELF binary doesnÕt hurt. This will also catch functions that are
 450 never called directly from the MIPS binary (for example, functions called with the
 451 call() method in the runtime).
 452 Eliminating unnecessary case statements provided a 10-25% speed increase .
 453 Despite all the above optimizations and workaround an impossible to workaround hard
 454 classfile limit was eventually hit, the constant pool. The constant pool in classfiles is limited
 455 to 65536 entries. Every Integer with a magnitude greater than 32767 requires an entry in the
 456 constant pool. Every time the compiler emits a jump/branch instruction the PC field is set to
 457 the branch target. This means nearly every branch instruction requires an entry in the
 458 constant pool. Large binaries hit this limit fairly quickly. One workaround that was
 459 employed in the Java source compiler was to express constants as offsets from a few central
 460 values. For example: "pc = N_0x00010000 + 0x10" where N_0x000100000 is a non-
 461 final field to prevent javac from inlining it. This was sufficient to get reasonable large
 462 binaries to compile. It has a small (approximately 5%) performance impact on the generated
 463 code. It also makes the generated classfile somewhat larger. Fortunately, the classfile
 464 compiler eliminates this problem.
 465 3.4 Ð Bytecode compiler
 466 The next step in the evolution of Mips2Java was to compile directly to JVM bytecode
 467 eliminating the intermediate javac step. This had several advantages
 468 _ There are little tricks that can be done in JVM bytecode that canÕt be done in Java
 469 source code.
 470 _ Eliminates the time-consuming javac step Ð Javac takes a long time to parse and
 471 compile the output from the java source compiler.
 472 _ Allows for MIPS binaries to be compiled and loaded into a running VM using a
 473 class loader. This eliminates the need to compile the binaries ahead of time.
 474 By generating code at the bytecode level  there are many areas where the compiler can be
 475 smarter than javac. Most of the areas where improvements where made where in the
 476 handling of branch instructions and in taking advantage of the JVM stack to eliminate
 477 unnecessary LOADs and STOREs to local variables.
 478 The first obvious optimization that generating bytecode allows for is the use of GOTO.
 479 Despite the fact that java doesnÕt have a GOTO keyword  a GOTO bytecode does exist and
 480 is used heavily in the code generates by javac. Unfortunately the java language doesnÕt
 481 provide any way to take advantage of this. As result of this jumps within a method were
 482 implemented by setting the PC field to the new address and making a trip back to the initial
 483 switch statement.  In the classfile compiler these jumps are implemented as GOTOs directly
 484 to the target instruction. This saves a costly trip back through the LOOKUPSWITCH
 485 statement and is a huge win for small loops within a method.
 486 Somewhat related to using GOTO is the ability to optimize branch statements. In the Java
 487 source compiler branch statements are implemented as follows (delay slots are ignored for
 488 the purpose of this example)
 489 If(condition) { pc = TARGET; continue; }
 490 This requires a branch in the JVM regardless of whether the MIPS branch is actually taken.
 491 If condition is false the JVM has to jump over the code to set the PC and go back to the
 492 switch block. If condition is true the JVM as to jump to the switch block. By generating
 493 bytecode directly we can make the target of the JVM branch statement the actual bytecode
 494 of the final destination. In the case where the branch isnÕt taken the JVM doesnÕt need to
 495 branch at all.
 496 A side affect of the above two optimizations is a solution to the excess constant pool entries
 497 problem. When jumps are implemented as GOTOs and direct branches to the target the PC
 498 field doesnÕt need to be set. This eliminates many of the constant pool entries the java
 499 source compiler requires. The limit is still there however, and given a large enough binary it
 500 will still be reached.
 501 Delay slots are another area where things are done somewhat inefficiently in the Java source
 502 compiler. In order to take advantage of instructions already in the pipeline MIPS cpu have a
 503 "delay slot". That is, an instruction after a branch or jump instruction that is executed
 504 regardless of whether the branch is taken. This is done because by the time the branch or
 505 jump instruction is finished being processes the next instruction is already ready to be
 506 executed and it is wasteful to discard it. (However, newer  MIPS CPUs have pipelines that
 507 are much larger than early MIPS CPUs so they have to discard many instructions anyway.)
 508 As a result of this the instruction in the delay slot is actually executed BEFORE the branch
 509 is taken. To make things even more difficult, values from the register file are loaded
 510 BEFORE the delay slot is executed.  Here is a small piece of MIPS assembly:
 511 ADDIU r2,r0,-1
 512 BLTZ r2, target
 513 ADDIU r2,r2,10
 514 ...
 515 :target
 516 This piece of code is executed as follows
 517 1. r2 is set to Ð1
 518 2. r2 is loaded from the register file by the BLTEZ instruction
 519 3. 10 is added to r2 by the ADDIU instruction
 520 4. The branch is taken because at the time the BLTZ instruction was executed r2 was
 521 Ð1, but r2 is now 9 (-1 + 10)
 522 There is a very element solution to this problem when using JVM bytecode. When a branch
 523 instruction is encountered the registers needed for the comparison are pushed onto the stack
 524 to prepare for the JVM branch instruction. Then, AFTER the values are on the stack the
 525 delay slot is emitted, and then finally the actual JVM branch instruction. Because the values
 526 were pushed to the stack before the delay slot was executed any changes the delay slot made
 527 to the registers are not visible to the branch bytecode. This allows delay slots to be used
 528 with no performance penalty or size penalty.
 529 One final advantage that generating bytecode directly allows is smaller more compact
 530 bytecode. All the optimization above lead to smaller bytecode as a side effect. There are also
 531 a few other areas where the generated bytecode can be optimized for size with more
 532 knowledge of the program as a whole.
 533 When encountering the following switch block both javac and Jikes generate redundant
 534 bytecode.
 535 Switch(pc>>>8) {
 536         Case 0x1: run_1(); break;
 537         Case 0x2: run_2(); break
 538         ...
 539         case 0x100: run_100(); break;
 540 }
 541 The first bytecode in each case arm in the switch statement is ALOAD_0 to prepare for a
 542 invoke special call. By simple moving this outside the switch statement  each case arm was
 543 reduced in size by one instruction. Similar optimizations were also done in other parts of the
 544 compiler.
 545
 546 Section 3
 547 - Adam - The method is run(), not execute. Execute() is only used when you need to
 548 resume from a pause syscall.
 549
 550 Section 3.1
 551 - Adam - Even the R1000 supports LB/SB/LH/SH/LWL/LWR Ð saying it only supports
 552 32-bit aligned loads is incorrect.
 553 - Other similarities
 554 o All the branching instructions in MIPS directly map to single JVM instructions.
 555 o Most of the ALU instructions map to single JVM instructions.
 556
 557 (I can write up some stuff for each of these next several sections if you want)
 558 Section 3.2  - Interpreter
 559 - Originally written mainly to understand the MIPS instruction set
 560 - Far easier to debug than an ahead of time compiler (no waiting, can throw in quick
 561 hacks like if(pc >= 0xbadc0de && pc <= 0xbadcfff) debugOutput() ), donÕt need to
 562 understand most of the ELF format)
 563 - Still could be useful
 564 o for GDB remote debugging
 565 o cases where size is more important than speed (size of interpreter + size of mips
 566 binary < size of compiled binary or size of compiler + mips binary)
 567 o code which dynamically generates code (JIT compilers, etc). The ahead of time
 568 compiler canÕt possibly handle this
 569
 570 Section 3.3 Ð Compiling to Java Source
 571 - First version of an ahead of time compiler
 572 - Huge performance boost
 573 - Java source output preferred for the 2+2=4 type optimizations java compilers do
 574 - Worked well for small binaries Ð large MIPS binaries required ridiculous amounts of
 575 memory to compile and often created  invalid classfiles
 576 - Too many constants Ð every jump operation required an entry in the constant pool (pc =
 577 0xabcd1234; continue; )
 578
 579 Section 3.4 Ð Compiling directly to JVM Bytecode
 580 - Another jump in performance
 581 - More efficient code can be created at the bytecode level
 582 o Information can be stored on the JVM stack rather than in local variables
 583 _ Javac/jikes often unnecessarily use local variables
 584 long tmp = a*b;
 585 lo = (int)tmp;
 586 hi = (int)(tmp>>>32)
 587 does a store and two loads when a simple DUP would suffice
 588 o GOTO can be used to branch directly  to branch destinations in the same method
 589 rather than going through the switch block again.
 590 o BEQ, BGTZ, BLE, etc can jump directly to destination rather than doing
 591 if(condition) { pc=0xabcd1234; continue; }
 592 o Eliminates excess constant pool entries (only jump outside the current method
 593 require a constant pool entry)
 594 o Delay slots implemented more efficiently.
 595 _ Java source compiler does:
 596 if(condition) { /* delay slot /; pc = 0xabcd1234; continue; }
 597 /* delay slot again */
 598 _ This is required because the delay slot can change the registers used in
 599 condition. The registers need to be read BEFORE the delay slot in executed.
 600 _ In the bytecode compiler the registers used in the condition are pushed to the
 601 stack, then the delay slot is executed, and finally the comparison is done.
 602 This eliminates the needs to output the delay slot twice.
 603 - Smaller bytecode
 604 o Everything mentioned above makes it smaller and faster
 605 o Javac/jikes add redundant code
 606 _ Switch(a) {
 607    Case 1: Run_1000(); break;
 608    Case 2: run_2000(); break;
 609 }
 610 Gets compiled into
 611 1 ÐTableswitch É.
 612 2 Ð ALOAD_0
 613 3 Ð invokespecial  run_1000
 614 4 Ð goto end
 615 5 Ð ALOAD_0
 616 6 Ð invokespecial run_2000
 617 10 Ð goto end
 618 ALOAD_0 is unnecessarily put in each switch arm
 619
 620 3.5 Interfacing with Java Code
 621 - _call_java ()/Runtime.call()
 622 o _call_java () - Call a java method from mips
 623 o Runime.call() Ð call a mips method from java
 624 o Easily allocate memory within the binaryÕs memory space by calling libcÕs malloc()
 625 o Can go back and forth between mips and java (java calls mips which calls java which
 626 calls back into mips)
 627 - Java Strings can easily be converted to and from null terminated strings in the processÕ
 628 memory space
 629 - Java InputStreams, OutputStreams, and Sockets can easily be turned in to standard
 630 UNIX file descriptors (and ANSI FILE*s)
 631 - Can easily create custom filedescriptors and have full control over all operations on
 632 them (read, write, seek, close, fstat, etc)
 633 - Int User_info[] Ð optional chunk of memory can very easily be accessed from java
 634 (Runtime.getUserInfo/setUserInfo)
 635
 636 3.6 Virtualization
 637 - Adam Ð we actually donÕt support sockets directly yet Ð you should probably take that
 638 out. (But you can create a socket in java and expose it to mips as a filedescriptor)
 639 - Runtime services
 640 o Provides a easy to use interface to subclasses (Interpreter or compiles binaries)
 641 _ Subclasses only know how to execute instructions
 642 _ Runtime handles setting up registers/stack for execution to begin and
 643 extracting return values and the exit status from the process
 644 o Memory management
 645 _ Sets up stack and guard pages
 646 _ Allocates pages with the sbrk syscall
 647 _ Provide easy an memcpy like interface for accessing the processes memory
 648 from java
 649 o Runtime.call() support Ð sets up registers,etc to prepare the process for a call into it
 650 from java
 651 o Filesystem Ð open/close/read/write/seek/fcntl s syscalls
 652 o Time related functions Ð sleep, gettimeofday, times syscall
 653 o UnixRuntime provides a more complete unix-like environment (Runtime smaller
 654 though)
 655 _ Supports fork() and waitpid()
 656 _ Pipe() for IPC
 657 _ More advocated filesystem interface
 658 _ All filesystem operations abstracted away into a FileSystem class
 659 o FileSystem class can be written that exposes a zip file,
 660 directory on an http server, etc as a filesystem
 661 _ Chdir/getcwd
 662 _ /dev filesystem
 663 _ stat()
 664 _ directory listing support
 665 5.1 Charts
 666 IÕll put together some charts tonight
 667 5.2 Optimizations
 668 And finish this part
 669
 670 Let me know if this was what you were looking for
 671
 672                                           libjpeg  libmspack libfreetype
 673 Interpreted MIPS Binary                   22.2      12.9      21.4
 674 Compled MIPS Binary (fastest options)     3.39      2.23      4.31
 675 Native -O3                                0.235    0.084     0.201
 676
 677 Compled - with all case statements        3.50      2.30      4.99
 678 Compiled - with pruned case statement     3.39      2.23      4.31
 679
 680 Compiled - 512 instructions/method        62.7      27.7      56.9
 681 Compiled - 256 instructions/method        3.54      2.55      4.43
 682 Compiled - 128 instructions/method        3.39      2.23      4.31
 683 Compiled - 64 instructions/method         3.56      2.31      4.40
 684 Compiled - 32 instruction/method          3.71      2.46      4.64
 685
 686 Compild MIPS Binary (Server VM)           3.21      2.00      4.54
 687 Compiled MIPS Binary (Client VM)          3.39      2.23      4.31
 688
 689 All times are measured in seconds. These were all run on a dual 1ghz G4
 690 running OS X 10.3.1 with Apple's latest VM (JDK 1.4.1_01-27). Each test
 691 was run 8 times within a single VM. The highest and lowest times were
 692 removed and the remaining 6 were averaged. In each case only the first
 693 run differed significantly from the rest.
 694
 695 The libjpeg test consisted of decoding a 1280x1024 jpeg
 696 (thebride_1280.jpg) and writing a tga. The mspack test consisted of
 697 extracting all members from arial32.exe, comic32.exe, times32.exe, and
 698 verdan32.exe. The freetype test consisted of rendering characters
 699 32-127 of Comic.TTF at sizes from 8 to 48 incrementing by 4. (That is
 700 about 950 individual glyphs).
 701
 702 I can provide you with the source for any of these test if you'd like.
 703
 704 -Brian
 705 \end{verbatim}
 706
 707 \end{document}
 708