2 % FIXME: - Add something about size limits on the constant pool
3 % how we worked around that and the performance impact
5 % - Add something about encoding data sections as string constants
6 % and the UTF8 non-7-bit-ascii penalty
9 \documentclass{acmconf}
11 \usepackage{amssymb,amsmath,epsfig,alltt}
12 \sloppy % better line breaks
17 \bibliographystyle{alpha}
19 \title{\textbf{\textsf{
20 Running Legacy C/C++ Libraries in a Pure Java Environment
23 \author{\begin{tabular}{@{}c@{}}
24 {\em {Brian Alliet}} \\ \\
26 \end{tabular}\hskip 1in\begin{tabular}{@{}c@{}}
27 {\em {Adam Megacz}} \\ \\
38 \section{Introduction}
40 \subsection{Why would you want to do this?}
42 The C programming language has been around for over 30 years now.
43 There is a huge library of software written in this language. By
44 contrast, Java has been around for less than ten years. Although it
45 offers substantial advantages over C, the set of libraries written in
46 this language still lags behind C/C++.
48 The typical solution to this dilemma is to use JNI or CNI to invoke C
49 code from within a Java VM. Unfortunately, there are a number of
50 situations in which this is not an acceptable solution:
53 \item Java Applets are not permitted to invoke {\tt Runtime.loadLibrary()}
54 \item Java Servlet containers with a {\tt SecurityManager} will not
55 permit loading new JNI libraries. This configuration is popular
56 with {\it shared hosting} providers and corporate intranets
57 where a number of different parties contribute individual web
58 applications which are run together in a single container.
59 \item JNI requires the native library to be compiled ahead of time,
60 separately, for every architecture on which it will be deployed.
61 This is unworkable for situations in which the full set of
62 target architectures is not known at deployment time.
63 \item The increasingly popular J2ME platform does not support JNI or CNI.
64 \item Unlike Java Bytecode, JNI code is susceptible to buffer overflow
65 and heap corruption attacks. This can be a major security
67 \item JNI often introduces undesirable added complexity to an
71 The technique we present here is based on using a typical ANSI C
72 compiler to compile C/C++ code into a MIPS binary, and then using a
73 tool to translate that binary on an instruction-by-instruction basis
76 The technique presented here is general; we anticipate that it can be
77 applied to other secure virtual machines such as Microsoft's .NET.
80 \section{Existing Work: Source-to-Source Translation}
84 \item commercial products
87 \section{Mips2Java: Binary-to-Binary Translation}
89 We present Mips2Java, a binary-to-binary translation tool to convert
90 MIPS binaries into Java bytecode files.
92 The process of utilizing Mips2Java begins by using any compiler for
93 any language to compile the source library into a statically linked
94 MIPS binary. We used {\tt gcc 3.3.3}, but any compiler which can
95 target the MIPS platform should be acceptable. The binary is
96 statically linked with a system library (in the case of C code this is
97 {\tt libc}) which translates system requests (such as {\tt open()},
98 {\tt read()}, or {\tt write()}) into appropriate invocations of the
99 MIPS {\tt SYSCALL} instruction.
101 The statically linked MIPS binary is then fed to the Mips2Java tool
102 (which is itself written in Java), which emits a sequence of Java
103 Bytecodes in the form of a {\tt .class} file equivalent to the
104 provided binary. This {\tt .class} file contains a single class which
105 implements the {\tt Runtime} interface. This class may then be
106 instantiated; invoking the {\tt execute()} method is equivalent to
107 invoking the {\tt main()} function in the original binary.
110 \subsection{Why MIPS?}
112 We chose MIPS as a source format for two primary reasons: the
113 availability of tools to translate legacy code into MIPS binaries, and
114 the close similarity between the MIPS ISA and the Java Virtual Machine.
116 The MIPS architecture has been around for quite some time, and is well
117 supported by the GNU Compiler Collection, which is capable of
118 compiling C, C++, Java, Fortran, Pascal (with p2c), and Objective C
121 The MIPS R1000 ISA bears a striking similarity to the Java Virtual
122 Machine. This early revision of the MIPS supports only 32-bit aligned
123 loads and stores from memory, which is precisely the access model
124 supported by Java for {\tt int[]}s.
126 Cover dynamic page allocation.
130 Brian, are there any other fortunate similarities we should mention
131 here? I seem to remember there being a bunch, but I can't recall them
132 right now; it's been a while since I dealt with this stuff in detail.
135 \subsection{Interpreter}
137 The Interpreter was the first part of Mips2Java to be written. This was the most straightforward and simple way to run MIPS binaries inside the JVM. The simplicity of the interpreter also made it very simple to debug. Debugging machine-generated code is a pain. Most importantly, the interpreter provided a great reference to use when developing the compiler. With known working implementations of each MIPS instruction in Java writing a compiler became a matter of simple doing the same thing ahead of time.
139 With the completion of the compiler the interpreter in Mips2Java has become less useful. However, it may still have some life left in it. One possible use is remote debugging with GDB. Although debugging the compiler generated JVM code is theoretically possible, it would be far easier to do in the interpreter. The interpreter may also be useful in cases where size is far more important than speed. The interpreter is very small. The interpreter and MIPS binary combined are smaller than the compiled classfiles.
142 \subsection{Compiling to Java Source}
144 The next major step in Mips2Java?s development was the Java source compiler. This generated Java source code (compliable with javac or Jikes) from a MIPS binary. Generating Java source code was preferred initially over JVM bytecode for two reasons. The authors weren?t very familiar with JVM bytecode and therefore generating Java source code was simpler. Generating source code also eliminated the need to do trivial optimizations in the Mips2java compiler that javac and Jikes already do. This mainly includes 2+2=4 stuff. For example, the MIPS register r0 is immutable and always 0. This register is represented by a static final int in the Java source compiler. Javac and Jikes automatically handle optimizing this away when possible. In the JVM bytecode compiler these optimizations needs to be done in Mips2Java.
146 Early versions of the Mips2Java compiler were very simple. All 32 MIPS GPRs and a special PC register were fields in the generated java class. There was a run() method containing all the instructions in the .text segment converted to Java source code. A switch statement was used to allow jumps from instruction to instruction. The generated code looked something like this.
148 %private final static int r0 = 0;
149 %private int r1, r2, r3,...,r30;
150 %private int r31 = 0xdeadbeef;
151 %private int pc = ENTRY_POINT;
180 % System.err.println(?Exited from ENTRY_POINT?);
181 % System.err.println(?R2: ? + r2);
187 This worked fine for small binaries but as soon as anything substantial was fed to it the 64k JVM method size limit was soon hit. The solution to this was to break up the code into many smaller methods. This required a trampoline to dispatch jumps to the appropriate method. With the addition of the trampoline the generated code looked something like this:
189 %public void run_0x10000() {
206 %pubic void run_0x10200() {
217 %public void trampoline() {
219 % switch(pc&0xfffffe00) {
220 % case 0x10000: run_0x10000(); break;
221 % case 0x10200: run_0x10200(); break;
228 With this trampoline in place somewhat large binaries could be handled without much difficulty. There is no limit on the size of a classfile as a whole, just individual methods. This method should scale well. However, there are other classfile limitations that will limit the size of compiled binaries.
230 Another interesting problem that was discovered while creating the trampoline method was javac and Jikes? inability to properly optimize switch statements. The follow code fragment gets compiled into a lookupswich by javac:
232 %Switch(pc&0xffffff00) {
233 % Case 0x00000100: run_100(); break;
234 % Case 0x00000200: run_200(); break;
235 % Case 0x00000300: run_300(); break;
238 while this nearly identical code fragment gets compiled to a tableswitch
240 Javac isn?t smart enough to see the patter in the case values and generates very suboptimal bytecode. Manually doing the shifts convinces javac to emit a tableswitch statement, which is significantly faster. This change alone nearly doubled the speed of the compiled binary.
242 Finding the optimal method size lead to the next big performance increase. It was determined with experimentation that the optimal number of MIPS instructions per method is 128 (considering only power of two options). Going above or below that lead to performance decreases. This is most likely due to a combination of two factors.
245 \item The two levels of switch statements jumps have to pass though - The first
246 switch statement jumps go through is the trampoline switch statement. This
247 is implemented as a TABLESWITCH in JVM bytecode so it is very fast. The
248 second level switch statement in the individual run\_ methods is implemented
249 as a LOOKUPSWITCH, which is much slower. Using smaller methods puts more
250 burden on the faster TABLESWITCH and less on the slower LOOKUPSWITCH.
252 \item JIT compilers probably favor smaller methods smaller methods are easier to compile and are probably better candidates for JIT compilation than larger methods.
257 The next big optimization was eliminating unnecessary case statements. Having case statements before each instruction prevents JIT compilers from being able to optimize across instruction boundaries. In order to eliminate unnecessary case statements every possible address that could be jumped to directly needed to be identified. The sources for possible jump targets come from 3 places.
259 \item The .text segment ? Every instruction in the text segment in scanned for jump targets. Every branch instruction (BEQ, JAL, etc) has its destination added to the list of possible branch targets. In addition, functions that set the link register have theirpc+8 added to the list (the address that would?ve been put to the link register). Finally, combinations of LUI (Load Upper Immediate) of ADDIU (Add Immediate Unsigned) are scanned for possible addresses in the text segment. This combination of instructions is often used to load a 32-bit word into a register.
260 \item The .data segment ? When GCC generates switch() statements it often uses a jump table stored in the .data segment. Unfortunately gcc doesn?t identify these jump tables in any way. Therefore, the entire .data segment is conservatively scanned for possible addresses in the .text segment.
261 \item The symbol table ? This is mainly used as a backup. Scanning the .text and .data segments should identify any possible jump targets but adding every function in the symbol table in the ELF binary doesn?t hurt. This will also catch functions that are never called directly from the MIPS binary (for example, functions called with the call() method in the runtime).
264 Eliminating unnecessary case statements provided a 10-25\% speed increase
266 Despite all the above optimizations and workaround an impossible to
267 workaround hard classfile limit was eventually hit, the constant pool. The
268 constant pool in classfiles is limited to 65536 entries. Every Integer with
269 a magnitude greater than 32767 requires an entry in the constant pool. Every
270 time the compiler emits a jump/branch instruction the PC field is set to the
271 branch target. This means nearly every branch instruction requires an entry
272 in the constant pool. Large binaries hit this limit fairly quickly. One
273 workaround that was employed in the Java source compiler was to express
274 constants as offsets from a few central values. For example: ``pc =
275 N\_0x00010000 + 0x10'' where N\_0x000100000 is a non-final field to prevent
276 javac from inlining it. This was sufficient to get reasonable large binaries
277 to compile. It has a small (approximately 5\%) performance impact on the
278 generated code. It also makes the generated classfile somewhat larger.
279 Fortunately, the classfile compiler eliminates this problem.
281 \subsection{Compiling directly to Java Bytecode}
283 The next step in the evolution of Mips2Java was to compile directly to JVM bytecode eliminating the intermediate javac step. This had several advantages:
285 \item There are little tricks that can be done in JVM bytecode that can?t be done in Java source code.
286 \item Eliminates the time-consuming javac step ? Javac takes a long time to parse and compile the output from the java source compiler.
287 \item Allows for MIPS binaries to be compiled and loaded into a running VM using a class loader. This eliminates the need to compile the binaries ahead of time.
290 By generating code at the bytecode level there are many areas where the compiler can be smarter than javac. Most of the areas where improvements where made where in the handling of branch instructions and in taking advantage of the JVM stack to eliminate unnecessary LOADs and STOREs to local variables.
292 The first obvious optimization that generating bytecode allows for is the use of GOTO. Despite the fact that java doesn?t have a GOTO keyword a GOTO bytecode does exist and is used heavily in the code generates by javac. Unfortunately the java language doesn?t provide any way to take advantage of this. As result of this jumps within a method were implemented by setting the PC field to the new address and making a trip back to the initial switch statement. In the classfile compiler these jumps are implemented as GOTOs directly to the target instruction. This saves a costly trip back through the LOOKUPSWITCH statement and is a huge win for small loops within a method.
294 Somewhat related to using GOTO is the ability to optimize branch statements. In the Java source compiler branch statements are implemented as follows (delay slots are ignored for the purpose of this example):
295 %if(condition) { pc = TARGET; continue; }
297 This requires a branch in the JVM regardless of whether the MIPS branch is actually taken. If condition is false the JVM has to jump over the code to set the PC and go back to the switch block. If condition is true the JVM as to jump to the switch block. By generating bytecode directly we can make the target of the JVM branch statement the actual bytecode of the final destination. In the case where the branch isn?t taken the JVM doesn?t need to branch at all.
299 A side affect of the above two optimizations is a solution to the excess constant pool entries problem. When jumps are implemented as GOTOs and direct branches to the target the PC field doesn?t need to be set. This eliminates many of the constant pool entries the java source compiler requires. The limit is still there however, and given a large enough binary it will still be reached.
301 Delay slots are another area where things are done somewhat inefficiently in the Java source compiler. In order to take advantage of instructions already in the pipeline MIPS cpu have a ?delay slot?. That is, an instruction after a branch or jump instruction that is executed regardless of whether the branch is taken. This is done because by the time the branch or jump instruction is finished being processes the next instruction is already ready to be executed and it is wasteful to discard it. (However, newer MIPS CPUs have pipelines that are much larger than early MIPS CPUs so they have to discard many instructions anyway.) As a result of this the instruction in the delay slot is actually executed BEFORE the branch is taken. To make things even more difficult, values from the register file are loaded BEFORE the delay slot is executed. Here is a small piece of MIPS assembly:
309 This piece of code is executed as follows
311 \item r2 is set to ?1
312 \item r2 is loaded from the register file by the BLTEZ instruction
313 \item 10 is added to r2 by the ADDIU instruction
314 \item The branch is taken because at the time the BLTZ instruction was executed r2 was ?1, but r2 is now 9 (-1 + 10)
317 There is a very element solution to this problem when using JVM bytecode. When a branch instruction is encountered the registers needed for the comparison are pushed onto the stack to prepare for the JVM branch instruction. Then, AFTER the values are on the stack the delay slot is emitted, and then finally the actual JVM branch instruction. Because the values were pushed to the stack before the delay slot was executed any changes the delay slot made to the registers are not visible to the branch bytecode. This allows delay slots to be used with no performance penalty or size penalty.
319 One final advantage that generating bytecode directly allows is smaller more compact bytecode. All the optimization above lead to smaller bytecode as a side effect. There are also a few other areas where the generated bytecode can be optimized for size with more knowledge of the program as a whole.
321 When encountering the following switch block both javac and Jikes generate redundant bytecode.
323 % Case 0x1: run_1(); break;
324 % Case 0x2: run_2(); break
326 % case 0x100: run_100(); break;
329 The first bytecode in each case arm in the switch statement is ALOAD\_0 to
330 prepare for a invoke special call. By simple moving this outside the switch
331 statement each case arm was reduced in size by one instruction. Similar
332 optimizations were also done in other parts of the compiler.
335 \subsection{Interfacing with Java Code}
337 Java source code can create a copy of the translated binary by
338 instantiating the corresponding class, which extends {\tt Runtime}.
339 Invoking the {\tt main()} method on this class is equivalent to
340 calling the {\tt main()} function within the binary; the {\tt String}
341 arguments to this function are copied into the binary's memory space
342 and made available as {\tt argv**} and {\tt argc}.
344 The translated binary communicates with the rest of the VM by
345 executing MIPS {\tt SYSCALL} instructions, which are translated into
346 invocations of the {\tt syscall()} method. This calls back to the
347 native Java world, which can manipulate the binary's environment by
348 reading and writing to its memory space, checking its exit status,
349 pausing the VM, and restarting the VM.
352 \subsection{Virtualization}
354 The {\tt Runtime} class implements the majority of the standard {\tt
355 libc} syscalls, providing a complete interface to the filesystem,
356 network socket library, time of day, (Brian: what else goes here?).
359 \item ability to provide the same interface to CNI code and mips2javaified code
360 \item security advantages (chroot the {\tt fork()}ed process)
363 \section{Related Work}
365 \subsection{Source-to-Source translators}
367 A number of commercial products and research projects attempt to
368 translate C++ code to Java code, preserving the mapping of C++ classes
369 to Java classes. Unfortunately this is problematic since there is no
370 way to do pointer arithmetic except within arrays, and even in that
371 case, arithmetic cannot be done in terms of fractional objects.
377 Many of these products advise the user to tweak the code which results
378 from the translation. Unfortunately, hand-modifying machine-generated
379 code is generally a bad idea, since this modification cannot be
380 automated. This means that every time the origin code changes, the
381 code generator must be re-run, and the hand modifications must be
382 performed yet again. This is an error-prone process.
384 Furthermore, Mips2Java does not attempt to read C code directly. This
385 frees it from the complex task of faithfully implementing the ANSI C
386 standard (or, in the case of non ANSI-C compliant code, some other
387 interface). This also saves the user the chore of altering their
388 build process to accomodate Mips2Java.
391 \section{Performance}
395 (Note that none of these libraries have pure-Java equivalents.)
404 \subsection{Optimizations}
406 Brian, can you write something to go here? Just mention which
407 optimizations helped and which ones hurt.
411 \item optimal method size
415 \item local vars for registers (useless)
416 \item -fno-rename-registers
418 \item -fno-trapping-math
419 \item -fsingle-precision-constant
421 \item -freg-struct-return
422 \item -freduce-all-givs
425 \item -fmove-all-movables
426 \item -fno-sched-spec-load
427 \item -fno-sched-spec
428 \item -fno-schedule-insns
429 \item -fno-schedule-insns2
430 \item -fno-delayed-branch
431 \item -fno-function-cse
432 \item -ffunction-sections
433 \item -fdata-sections
434 \item array bounds checking
435 \item -falign-functions=n
436 \item -falign-labels=n
437 \item -falign-loops=n
438 \item -falign-jumps=n
439 \item -fno-function-cse
442 \section{Future Directions}
448 We rock the hizzouse.
457 libjpeg (render thebride_1280.jpg)
462 freetype (rendering characters 32-127 of Comic.TTF at sizes from 8 to
463 48 incrementing by 4)
468 Section 3.2 - Interpreter
469 The Interpreter was the first part of Mips2Java to be written. This was the most
470 straightforward and simple way to run MIPS binaries inside the JVM. The simplicity of the
471 interpreter also made it very simple to debug. Debugging machine-generated code is a pain.
472 Most importantly, the interpreter provided a great reference to use when developing the
473 compiler. With known working implementations of each MIPS instruction in Java writing a
474 compiler became a matter of simple doing the same thing ahead of time.
475 With the completion of the compiler the interpreter in Mips2Java has become less useful.
476 However, it may still have some life left in it. One possible use is remote debugging with
477 GDB. Although debugging the compiler generated JVM code is theoretically possible, it
478 would be far easier to do in the interpreter. The interpreter may also be useful in cases
479 where size is far more important than speed. The interpreter is very small. The interpreter
480 and MIPS binary combined are smaller than the compiled classfiles.
481 Section 3.3 - Compiling to Java Source
482 The next major step in Mips2JavaÕs development was the Java source compiler. This
483 generated Java source code (compliable with javac or Jikes) from a MIPS binary.
484 Generating Java source code was preferred initially over JVM bytecode for two reasons.
485 The authors werenÕt very familiar with JVM bytecode and therefore generating Java source
486 code was simpler. Generating source code also eliminated the need to do trivial
487 optimizations in the Mips2java compiler that javac and Jikes already do. This mainly
488 includes 2+2=4 stuff. For example, the MIPS register r0 is immutable and always 0. This
489 register is represented by a static final int in the Java source compiler. Javac and Jikes
490 automatically handle optimizing this away when possible. In the JVM bytecode compiler
491 these optimizations needs to be done in Mips2Java.
492 Early versions of the Mips2Java compiler were very simple. All 32 MIPS GPRs and a
493 special PC register were fields in the generated java class. There was a run() method
494 containing all the instructions in the .text segment converted to Java source code. A switch
495 statement was used to allow jumps from instruction to instruction. The generated code
496 looked something like this.
497 private final static int r0 = 0;
498 private int r1, r2, r3,...,r30;
499 private int r31 = 0xdeadbeef;
500 private int pc = ENTRY_POINT;
529 System.err.println("Exited from ENTRY_POINT");
530 System.err.println("R2: " + r2);
536 This worked fine for small binaries but as soon as anything substantial was fed to it the 64k
537 JVM method size limit was soon hit. The solution to this was to break up the code into
538 many smaller methods. This required a trampoline to dispatch jumps to the appropriate
539 method. With the addition of the trampoline the generated code looked something like this:
540 public void run_0x10000() {
557 pubic void run_0x10200() {
568 public void trampoline() {
570 switch(pc&0xfffffe00) {
571 case 0x10000: run_0x10000(); break;
572 case 0x10200: run_0x10200(); break;
578 With this trampoline in place somewhat large binaries could be handled without much
579 difficulty. There is no limit on the size of a classfile as a whole, just individual methods.
580 This method should scale well. However, there are other classfile limitations that will limit
581 the size of compiled binaries.
582 Another interesting problem that was discovered while creating the trampoline method was
583 javac and JikesÕ incredible stupidity when dealing with switch statements. The follow code
584 fragment gets compiled into a lookupswich by javac:
585 Switch(pc&0xffffff00) {
586 Case 0x00000100: run_100(); break;
587 Case 0x00000200: run_200(); break;
588 Case 0x00000300: run_300(); break;
590 while this nearly identical code fragment gets compiled to a tableswitch
592 case 0x1: run_100(); break
593 case 0x2: run_200(); break;
594 case 0x3: run_300(); break;
596 Javac isnÕt smart enough to see the patter in the case values and generates very suboptimal
597 bytecode. Manually doing the shifts convinces javac to emit a tableswitch statement, which
598 is significantly faster. This change alone nearly doubled the speed of the compiled binary.
599 Finding the optimal method size lead to the next big performance increase. It was
600 determined with experimentation that the optimal number of MIPS instructions per method
601 is 128 (considering only power of two options). Going above or below that lead to
602 performance decreases. This is most likely due to a combination of two factors.
603 _ The two levels of switch statements jumps have to pass though Ð The first switch
604 statement jumps go through is the trampoline switch statement. This is implemented
605 as a TABLESWITCH in JVM bytecode so it is very fast. The second level switch
606 statement in the individual run_ methods is implemented as a LOOKUPSWITCH,
607 which is much slower. Using smaller methods puts more burden on the faster
608 TABLESWITCH and less on the slower LOOKUPSWITCH.
609 _ JIT compilers probably favor smaller methods smaller methods are easier to compile
610 and are probably better candidates for JIT compilation than larger methods.
611 FIXME: Put a chart here
612 The next big optimization was eliminating unnecessary case statements. Having case
613 statements before each instruction prevents JIT compilers from being able to optimize
614 across instruction boundaries. In order to eliminate unnecessary case statements every
615 possible address that could be jumped to directly needed to be identified. The sources for
616 possible jump targets come from 3 places.
617 _ The .text segment Ð Every instruction in the text segment in scanned for jump
618 targets. Every branch instruction (BEQ, JAL, etc) has its destination added to the list
619 of possible branch targets. In addition, functions that set the link register have
620 theirpc+8 added to the list (the address that wouldÕve been put to the link register).
621 Finally, combinations of LUI (Load Upper Immediate) of ADDIU (Add Immediate
622 Unsigned) are scanned for possible addresses in the text segment. This combination
623 of instructions is often used to load a 32-bit word into a register.
624 _ The .data segment Ð When GCC generates switch() statements it often uses a jump
625 table stored in the .data segment. Unfortunately gcc doesnÕt identify these jump
626 tables in any way. Therefore, the entire .data segment is conservatively scanned for
627 possible addresses in the .text segment.
628 _ The symbol table Ð This is mainly used as a backup. Scanning the .text and .data
629 segments should identify any possible jump targets but adding every function in the
630 symbol table in the ELF binary doesnÕt hurt. This will also catch functions that are
631 never called directly from the MIPS binary (for example, functions called with the
632 call() method in the runtime).
633 Eliminating unnecessary case statements provided a 10-25% speed increase .
634 Despite all the above optimizations and workaround an impossible to workaround hard
635 classfile limit was eventually hit, the constant pool. The constant pool in classfiles is limited
636 to 65536 entries. Every Integer with a magnitude greater than 32767 requires an entry in the
637 constant pool. Every time the compiler emits a jump/branch instruction the PC field is set to
638 the branch target. This means nearly every branch instruction requires an entry in the
639 constant pool. Large binaries hit this limit fairly quickly. One workaround that was
640 employed in the Java source compiler was to express constants as offsets from a few central
641 values. For example: "pc = N_0x00010000 + 0x10" where N_0x000100000 is a non-
642 final field to prevent javac from inlining it. This was sufficient to get reasonable large
643 binaries to compile. It has a small (approximately 5%) performance impact on the generated
644 code. It also makes the generated classfile somewhat larger. Fortunately, the classfile
645 compiler eliminates this problem.
646 3.4 Ð Bytecode compiler
647 The next step in the evolution of Mips2Java was to compile directly to JVM bytecode
648 eliminating the intermediate javac step. This had several advantages
649 _ There are little tricks that can be done in JVM bytecode that canÕt be done in Java
651 _ Eliminates the time-consuming javac step Ð Javac takes a long time to parse and
652 compile the output from the java source compiler.
653 _ Allows for MIPS binaries to be compiled and loaded into a running VM using a
654 class loader. This eliminates the need to compile the binaries ahead of time.
655 By generating code at the bytecode level there are many areas where the compiler can be
656 smarter than javac. Most of the areas where improvements where made where in the
657 handling of branch instructions and in taking advantage of the JVM stack to eliminate
658 unnecessary LOADs and STOREs to local variables.
659 The first obvious optimization that generating bytecode allows for is the use of GOTO.
660 Despite the fact that java doesnÕt have a GOTO keyword a GOTO bytecode does exist and
661 is used heavily in the code generates by javac. Unfortunately the java language doesnÕt
662 provide any way to take advantage of this. As result of this jumps within a method were
663 implemented by setting the PC field to the new address and making a trip back to the initial
664 switch statement. In the classfile compiler these jumps are implemented as GOTOs directly
665 to the target instruction. This saves a costly trip back through the LOOKUPSWITCH
666 statement and is a huge win for small loops within a method.
667 Somewhat related to using GOTO is the ability to optimize branch statements. In the Java
668 source compiler branch statements are implemented as follows (delay slots are ignored for
669 the purpose of this example)
670 If(condition) { pc = TARGET; continue; }
671 This requires a branch in the JVM regardless of whether the MIPS branch is actually taken.
672 If condition is false the JVM has to jump over the code to set the PC and go back to the
673 switch block. If condition is true the JVM as to jump to the switch block. By generating
674 bytecode directly we can make the target of the JVM branch statement the actual bytecode
675 of the final destination. In the case where the branch isnÕt taken the JVM doesnÕt need to
677 A side affect of the above two optimizations is a solution to the excess constant pool entries
678 problem. When jumps are implemented as GOTOs and direct branches to the target the PC
679 field doesnÕt need to be set. This eliminates many of the constant pool entries the java
680 source compiler requires. The limit is still there however, and given a large enough binary it
681 will still be reached.
682 Delay slots are another area where things are done somewhat inefficiently in the Java source
683 compiler. In order to take advantage of instructions already in the pipeline MIPS cpu have a
684 "delay slot". That is, an instruction after a branch or jump instruction that is executed
685 regardless of whether the branch is taken. This is done because by the time the branch or
686 jump instruction is finished being processes the next instruction is already ready to be
687 executed and it is wasteful to discard it. (However, newer MIPS CPUs have pipelines that
688 are much larger than early MIPS CPUs so they have to discard many instructions anyway.)
689 As a result of this the instruction in the delay slot is actually executed BEFORE the branch
690 is taken. To make things even more difficult, values from the register file are loaded
691 BEFORE the delay slot is executed. Here is a small piece of MIPS assembly:
697 This piece of code is executed as follows
699 2. r2 is loaded from the register file by the BLTEZ instruction
700 3. 10 is added to r2 by the ADDIU instruction
701 4. The branch is taken because at the time the BLTZ instruction was executed r2 was
702 Ð1, but r2 is now 9 (-1 + 10)
703 There is a very element solution to this problem when using JVM bytecode. When a branch
704 instruction is encountered the registers needed for the comparison are pushed onto the stack
705 to prepare for the JVM branch instruction. Then, AFTER the values are on the stack the
706 delay slot is emitted, and then finally the actual JVM branch instruction. Because the values
707 were pushed to the stack before the delay slot was executed any changes the delay slot made
708 to the registers are not visible to the branch bytecode. This allows delay slots to be used
709 with no performance penalty or size penalty.
710 One final advantage that generating bytecode directly allows is smaller more compact
711 bytecode. All the optimization above lead to smaller bytecode as a side effect. There are also
712 a few other areas where the generated bytecode can be optimized for size with more
713 knowledge of the program as a whole.
714 When encountering the following switch block both javac and Jikes generate redundant
717 Case 0x1: run_1(); break;
718 Case 0x2: run_2(); break
720 case 0x100: run_100(); break;
722 The first bytecode in each case arm in the switch statement is ALOAD_0 to prepare for a
723 invoke special call. By simple moving this outside the switch statement each case arm was
724 reduced in size by one instruction. Similar optimizations were also done in other parts of the
728 - Adam - The method is run(), not execute. Execute() is only used when you need to
729 resume from a pause syscall.
732 - Adam - Even the R1000 supports LB/SB/LH/SH/LWL/LWR Ð saying it only supports
733 32-bit aligned loads is incorrect.
735 o All the branching instructions in MIPS directly map to single JVM instructions.
736 o Most of the ALU instructions map to single JVM instructions.
738 (I can write up some stuff for each of these next several sections if you want)
739 Section 3.2 - Interpreter
740 - Originally written mainly to understand the MIPS instruction set
741 - Far easier to debug than an ahead of time compiler (no waiting, can throw in quick
742 hacks like if(pc >= 0xbadc0de && pc <= 0xbadcfff) debugOutput() ), donÕt need to
743 understand most of the ELF format)
744 - Still could be useful
745 o for GDB remote debugging
746 o cases where size is more important than speed (size of interpreter + size of mips
747 binary < size of compiled binary or size of compiler + mips binary)
748 o code which dynamically generates code (JIT compilers, etc). The ahead of time
749 compiler canÕt possibly handle this
751 Section 3.3 Ð Compiling to Java Source
752 - First version of an ahead of time compiler
753 - Huge performance boost
754 - Java source output preferred for the 2+2=4 type optimizations java compilers do
755 - Worked well for small binaries Ð large MIPS binaries required ridiculous amounts of
756 memory to compile and often created invalid classfiles
757 - Too many constants Ð every jump operation required an entry in the constant pool (pc =
758 0xabcd1234; continue; )
760 Section 3.4 Ð Compiling directly to JVM Bytecode
761 - Another jump in performance
762 - More efficient code can be created at the bytecode level
763 o Information can be stored on the JVM stack rather than in local variables
764 _ Javac/jikes often unnecessarily use local variables
768 does a store and two loads when a simple DUP would suffice
769 o GOTO can be used to branch directly to branch destinations in the same method
770 rather than going through the switch block again.
771 o BEQ, BGTZ, BLE, etc can jump directly to destination rather than doing
772 if(condition) { pc=0xabcd1234; continue; }
773 o Eliminates excess constant pool entries (only jump outside the current method
774 require a constant pool entry)
775 o Delay slots implemented more efficiently.
776 _ Java source compiler does:
777 if(condition) { /* delay slot /; pc = 0xabcd1234; continue; }
778 /* delay slot again */
779 _ This is required because the delay slot can change the registers used in
780 condition. The registers need to be read BEFORE the delay slot in executed.
781 _ In the bytecode compiler the registers used in the condition are pushed to the
782 stack, then the delay slot is executed, and finally the comparison is done.
783 This eliminates the needs to output the delay slot twice.
785 o Everything mentioned above makes it smaller and faster
786 o Javac/jikes add redundant code
788 Case 1: Run_1000(); break;
789 Case 2: run_2000(); break;
794 3 Ð invokespecial run_1000
797 6 Ð invokespecial run_2000
799 ALOAD_0 is unnecessarily put in each switch arm
801 3.5 Interfacing with Java Code
802 - _call_java ()/Runtime.call()
803 o _call_java () - Call a java method from mips
804 o Runime.call() Ð call a mips method from java
805 o Easily allocate memory within the binaryÕs memory space by calling libcÕs malloc()
806 o Can go back and forth between mips and java (java calls mips which calls java which
807 calls back into mips)
808 - Java Strings can easily be converted to and from null terminated strings in the processÕ
810 - Java InputStreams, OutputStreams, and Sockets can easily be turned in to standard
811 UNIX file descriptors (and ANSI FILE*s)
812 - Can easily create custom filedescriptors and have full control over all operations on
813 them (read, write, seek, close, fstat, etc)
814 - Int User_info[] Ð optional chunk of memory can very easily be accessed from java
815 (Runtime.getUserInfo/setUserInfo)
818 - Adam Ð we actually donÕt support sockets directly yet Ð you should probably take that
819 out. (But you can create a socket in java and expose it to mips as a filedescriptor)
821 o Provides a easy to use interface to subclasses (Interpreter or compiles binaries)
822 _ Subclasses only know how to execute instructions
823 _ Runtime handles setting up registers/stack for execution to begin and
824 extracting return values and the exit status from the process
826 _ Sets up stack and guard pages
827 _ Allocates pages with the sbrk syscall
828 _ Provide easy an memcpy like interface for accessing the processes memory
830 o Runtime.call() support Ð sets up registers,etc to prepare the process for a call into it
832 o Filesystem Ð open/close/read/write/seek/fcntl s syscalls
833 o Time related functions Ð sleep, gettimeofday, times syscall
834 o UnixRuntime provides a more complete unix-like environment (Runtime smaller
836 _ Supports fork() and waitpid()
838 _ More advocated filesystem interface
839 _ All filesystem operations abstracted away into a FileSystem class
840 o FileSystem class can be written that exposes a zip file,
841 directory on an http server, etc as a filesystem
845 _ directory listing support
847 IÕll put together some charts tonight
851 Let me know if this was what you were looking for
853 libjpeg libmspack libfreetype
854 Interpreted MIPS Binary 22.2 12.9 21.4
855 Compled MIPS Binary (fastest options) 3.39 2.23 4.31
856 Native -O3 0.235 0.084 0.201
858 Compled - with all case statements 3.50 2.30 4.99
859 Compiled - with pruned case statement 3.39 2.23 4.31
861 Compiled - 512 instructions/method 62.7 27.7 56.9
862 Compiled - 256 instructions/method 3.54 2.55 4.43
863 Compiled - 128 instructions/method 3.39 2.23 4.31
864 Compiled - 64 instructions/method 3.56 2.31 4.40
865 Compiled - 32 instruction/method 3.71 2.46 4.64
867 Compild MIPS Binary (Server VM) 3.21 2.00 4.54
868 Compiled MIPS Binary (Client VM) 3.39 2.23 4.31
870 All times are measured in seconds. These were all run on a dual 1ghz G4
871 running OS X 10.3.1 with Apple's latest VM (JDK 1.4.1_01-27). Each test
872 was run 8 times within a single VM. The highest and lowest times were
873 removed and the remaining 6 were averaged. In each case only the first
874 run differed significantly from the rest.
876 The libjpeg test consisted of decoding a 1280x1024 jpeg
877 (thebride_1280.jpg) and writing a tga. The mspack test consisted of
878 extracting all members from arial32.exe, comic32.exe, times32.exe, and
879 verdan32.exe. The freetype test consisted of rendering characters
880 32-127 of Comic.TTF at sizes from 8 to 48 incrementing by 4. (That is
881 about 950 individual glyphs).
883 I can provide you with the source for any of these test if you'd like.