-public void trampoline() {
- for(;;) {
- switch(pc&0xfffffe00) {
- case 0x10000: run_0x10000(); break;
- case 0x10200: run_0x10200(); break;
- case 0xdeadbe00:
- ...
- }
- }
-}
-With this trampoline in place somewhat large binaries could be handled without much
-difficulty. There is no limit on the size of a classfile as a whole, just individual methods.
-This method should scale well. However, there are other classfile limitations that will limit
-the size of compiled binaries.
-Another interesting problem that was discovered while creating the trampoline method was
-javac and JikesÕ incredible stupidity when dealing with switch statements. The follow code
-fragment gets compiled into a lookupswich by javac:
-Switch(pc&0xffffff00) {
- Case 0x00000100: run_100(); break;
- Case 0x00000200: run_200(); break;
- Case 0x00000300: run_300(); break;
-}
-while this nearly identical code fragment gets compiled to a tableswitch
-switch(pc>>>8) {
- case 0x1: run_100(); break
- case 0x2: run_200(); break;
- case 0x3: run_300(); break;
-}
-Javac isnÕt smart enough to see the patter in the case values and generates very suboptimal
-bytecode. Manually doing the shifts convinces javac to emit a tableswitch statement, which
-is significantly faster. This change alone nearly doubled the speed of the compiled binary.
-Finding the optimal method size lead to the next big performance increase. It was
-determined with experimentation that the optimal number of MIPS instructions per method
-is 128 (considering only power of two options). Going above or below that lead to
-performance decreases. This is most likely due to a combination of two factors.
-_ The two levels of switch statements jumps have to pass though Ð The first switch
-statement jumps go through is the trampoline switch statement. This is implemented
-as a TABLESWITCH in JVM bytecode so it is very fast. The second level switch
-statement in the individual run_ methods is implemented as a LOOKUPSWITCH,
-which is much slower. Using smaller methods puts more burden on the faster
-TABLESWITCH and less on the slower LOOKUPSWITCH.
-_ JIT compilers probably favor smaller methods smaller methods are easier to compile
-and are probably better candidates for JIT compilation than larger methods.
-FIXME: Put a chart here
-The next big optimization was eliminating unnecessary case statements. Having case
-statements before each instruction prevents JIT compilers from being able to optimize
-across instruction boundaries. In order to eliminate unnecessary case statements every
-possible address that could be jumped to directly needed to be identified. The sources for
-possible jump targets come from 3 places.
-_ The .text segment Ð Every instruction in the text segment in scanned for jump
-targets. Every branch instruction (BEQ, JAL, etc) has its destination added to the list
-of possible branch targets. In addition, functions that set the link register have
-theirpc+8 added to the list (the address that wouldÕve been put to the link register).
-Finally, combinations of LUI (Load Upper Immediate) of ADDIU (Add Immediate
-Unsigned) are scanned for possible addresses in the text segment. This combination
-of instructions is often used to load a 32-bit word into a register.
-_ The .data segment Ð When GCC generates switch() statements it often uses a jump
-table stored in the .data segment. Unfortunately gcc doesnÕt identify these jump
-tables in any way. Therefore, the entire .data segment is conservatively scanned for
-possible addresses in the .text segment.
-_ The symbol table Ð This is mainly used as a backup. Scanning the .text and .data
-segments should identify any possible jump targets but adding every function in the
-symbol table in the ELF binary doesnÕt hurt. This will also catch functions that are
-never called directly from the MIPS binary (for example, functions called with the
-call() method in the runtime).
-Eliminating unnecessary case statements provided a 10-25% speed increase .
-Despite all the above optimizations and workaround an impossible to workaround hard
-classfile limit was eventually hit, the constant pool. The constant pool in classfiles is limited
-to 65536 entries. Every Integer with a magnitude greater than 32767 requires an entry in the
-constant pool. Every time the compiler emits a jump/branch instruction the PC field is set to
-the branch target. This means nearly every branch instruction requires an entry in the
-constant pool. Large binaries hit this limit fairly quickly. One workaround that was
-employed in the Java source compiler was to express constants as offsets from a few central
-values. For example: "pc = N_0x00010000 + 0x10" where N_0x000100000 is a non-
-final field to prevent javac from inlining it. This was sufficient to get reasonable large
-binaries to compile. It has a small (approximately 5%) performance impact on the generated
-code. It also makes the generated classfile somewhat larger. Fortunately, the classfile
-compiler eliminates this problem.
-3.4 Ð Bytecode compiler
-The next step in the evolution of Mips2Java was to compile directly to JVM bytecode
-eliminating the intermediate javac step. This had several advantages
-_ There are little tricks that can be done in JVM bytecode that canÕt be done in Java
-source code.
-_ Eliminates the time-consuming javac step Ð Javac takes a long time to parse and
-compile the output from the java source compiler.
-_ Allows for MIPS binaries to be compiled and loaded into a running VM using a
-class loader. This eliminates the need to compile the binaries ahead of time.
-By generating code at the bytecode level there are many areas where the compiler can be
-smarter than javac. Most of the areas where improvements where made where in the
-handling of branch instructions and in taking advantage of the JVM stack to eliminate
-unnecessary LOADs and STOREs to local variables.
-The first obvious optimization that generating bytecode allows for is the use of GOTO.
-Despite the fact that java doesnÕt have a GOTO keyword a GOTO bytecode does exist and
-is used heavily in the code generates by javac. Unfortunately the java language doesnÕt
-provide any way to take advantage of this. As result of this jumps within a method were
-implemented by setting the PC field to the new address and making a trip back to the initial
-switch statement. In the classfile compiler these jumps are implemented as GOTOs directly
-to the target instruction. This saves a costly trip back through the LOOKUPSWITCH
-statement and is a huge win for small loops within a method.
-Somewhat related to using GOTO is the ability to optimize branch statements. In the Java
-source compiler branch statements are implemented as follows (delay slots are ignored for
-the purpose of this example)
-If(condition) { pc = TARGET; continue; }
-This requires a branch in the JVM regardless of whether the MIPS branch is actually taken.
-If condition is false the JVM has to jump over the code to set the PC and go back to the
-switch block. If condition is true the JVM as to jump to the switch block. By generating
-bytecode directly we can make the target of the JVM branch statement the actual bytecode
-of the final destination. In the case where the branch isnÕt taken the JVM doesnÕt need to
-branch at all.
-A side affect of the above two optimizations is a solution to the excess constant pool entries
-problem. When jumps are implemented as GOTOs and direct branches to the target the PC
-field doesnÕt need to be set. This eliminates many of the constant pool entries the java
-source compiler requires. The limit is still there however, and given a large enough binary it
-will still be reached.
-Delay slots are another area where things are done somewhat inefficiently in the Java source
-compiler. In order to take advantage of instructions already in the pipeline MIPS cpu have a
-"delay slot". That is, an instruction after a branch or jump instruction that is executed
-regardless of whether the branch is taken. This is done because by the time the branch or
-jump instruction is finished being processes the next instruction is already ready to be
-executed and it is wasteful to discard it. (However, newer MIPS CPUs have pipelines that
-are much larger than early MIPS CPUs so they have to discard many instructions anyway.)
-As a result of this the instruction in the delay slot is actually executed BEFORE the branch
-is taken. To make things even more difficult, values from the register file are loaded
-BEFORE the delay slot is executed. Here is a small piece of MIPS assembly:
-ADDIU r2,r0,-1
-BLTZ r2, target
-ADDIU r2,r2,10
-...
-:target
-This piece of code is executed as follows
-1. r2 is set to Ð1
-2. r2 is loaded from the register file by the BLTEZ instruction
-3. 10 is added to r2 by the ADDIU instruction
-4. The branch is taken because at the time the BLTZ instruction was executed r2 was
-Ð1, but r2 is now 9 (-1 + 10)
-There is a very element solution to this problem when using JVM bytecode. When a branch
-instruction is encountered the registers needed for the comparison are pushed onto the stack
-to prepare for the JVM branch instruction. Then, AFTER the values are on the stack the
-delay slot is emitted, and then finally the actual JVM branch instruction. Because the values
-were pushed to the stack before the delay slot was executed any changes the delay slot made
-to the registers are not visible to the branch bytecode. This allows delay slots to be used
-with no performance penalty or size penalty.
-One final advantage that generating bytecode directly allows is smaller more compact
-bytecode. All the optimization above lead to smaller bytecode as a side effect. There are also
-a few other areas where the generated bytecode can be optimized for size with more
-knowledge of the program as a whole.
-When encountering the following switch block both javac and Jikes generate redundant
-bytecode.
-Switch(pc>>>8) {
- Case 0x1: run_1(); break;
- Case 0x2: run_2(); break
- ...
- case 0x100: run_100(); break;
-}
-The first bytecode in each case arm in the switch statement is ALOAD_0 to prepare for a
-invoke special call. By simple moving this outside the switch statement each case arm was
-reduced in size by one instruction. Similar optimizations were also done in other parts of the
-compiler.
-
-Section 3
-- Adam - The method is run(), not execute. Execute() is only used when you need to
-resume from a pause syscall.
-
-Section 3.1
-- Adam - Even the R1000 supports LB/SB/LH/SH/LWL/LWR Ð saying it only supports
-32-bit aligned loads is incorrect.
-- Other similarities
-o All the branching instructions in MIPS directly map to single JVM instructions.
-o Most of the ALU instructions map to single JVM instructions.
-
-(I can write up some stuff for each of these next several sections if you want)
-Section 3.2 - Interpreter
-- Originally written mainly to understand the MIPS instruction set
-- Far easier to debug than an ahead of time compiler (no waiting, can throw in quick
-hacks like if(pc >= 0xbadc0de && pc <= 0xbadcfff) debugOutput() ), donÕt need to
-understand most of the ELF format)
-- Still could be useful
-o for GDB remote debugging
-o cases where size is more important than speed (size of interpreter + size of mips
-binary < size of compiled binary or size of compiler + mips binary)
-o code which dynamically generates code (JIT compilers, etc). The ahead of time
-compiler canÕt possibly handle this
-
-Section 3.3 Ð Compiling to Java Source
-- First version of an ahead of time compiler
-- Huge performance boost
-- Java source output preferred for the 2+2=4 type optimizations java compilers do
-- Worked well for small binaries Ð large MIPS binaries required ridiculous amounts of
-memory to compile and often created invalid classfiles
-- Too many constants Ð every jump operation required an entry in the constant pool (pc =
-0xabcd1234; continue; )
-
-Section 3.4 Ð Compiling directly to JVM Bytecode
-- Another jump in performance
-- More efficient code can be created at the bytecode level
-o Information can be stored on the JVM stack rather than in local variables
-_ Javac/jikes often unnecessarily use local variables
-long tmp = a*b;
-lo = (int)tmp;
-hi = (int)(tmp>>>32)
-does a store and two loads when a simple DUP would suffice
-o GOTO can be used to branch directly to branch destinations in the same method
-rather than going through the switch block again.
-o BEQ, BGTZ, BLE, etc can jump directly to destination rather than doing
-if(condition) { pc=0xabcd1234; continue; }
-o Eliminates excess constant pool entries (only jump outside the current method
-require a constant pool entry)
-o Delay slots implemented more efficiently.
-_ Java source compiler does:
-if(condition) { /* delay slot /; pc = 0xabcd1234; continue; }
-/* delay slot again */
-_ This is required because the delay slot can change the registers used in
-condition. The registers need to be read BEFORE the delay slot in executed.
-_ In the bytecode compiler the registers used in the condition are pushed to the
-stack, then the delay slot is executed, and finally the comparison is done.
-This eliminates the needs to output the delay slot twice.
-- Smaller bytecode
-o Everything mentioned above makes it smaller and faster
-o Javac/jikes add redundant code
-_ Switch(a) {
- Case 1: Run_1000(); break;
- Case 2: run_2000(); break;
+v v v v v v v
+These two methods can even be combined. MIPS can call Java through the
+CALL\_JAVA syscall, which can in turn invoke a MIPS function in the
+binary with the call() method.\r\r Users preferring a simpler
+communication mechanism can also use Java StreamÕs and file
+descriptors. Runtime provides a simple interface for mapping a Java
+Input or OutputStream to a File Descriptor.
+*************
+These two methods can even be combined. MIPS can call Java through the
+CALL\_JAVA syscall, which can in turn invoke a MIPS function in the binary
+with the call() method.
+^ ^ ^ ^ ^ ^ ^
+
+Users preferring a simpler communication mechanism can also use Java
+Stream's and file descriptors. Runtime provides a simple interface for
+mapping a Java Input or OutputStream to a File Descriptor.
+
+%Java source code can create a copy of the translated binary by
+%instantiating the corresponding class, which extends {\tt Runtime}.
+%Invoking the {\tt main()} method on this class is equivalent to
+%calling the {\tt main()} function within the binary; the {\tt String}
+%arguments to this function are copied into the binary's memory space
+%and made available as {\tt **argv} and {\tt argc}.
+
+%The translated binary communicates with the rest of the VM by
+%executing MIPS {\tt SYSCALL} instructions, which are translated into
+%invocations of the {\tt syscall()} method. This calls back to the
+%native Java world, which can manipulate the binary's environment by
+%reading and writing to its memory space, checking its exit status,
+%pausing the VM, and restarting the VM.
+
+
+%\subsection{Virtualization}
+
+%The {\tt Runtime} class implements the majority of the standard {\tt
+%libc} syscalls, providing a complete interface to the filesystem,
+%network socket library, time of day, (Brian: what else goes here?).
+
+%\begin{itemize}
+
+%\item ability to provide the same interface to CNI code and
+ % NestedVMified code
+
+%\item security advantages (chroot the {\tt fork()}ed process)
+ %
+%\end{itemize}
+
+
+\section{Future Directions}
+
+\section{Conclusion}
+
+\section{Appendix A: Testing Environment}
+
+The MIPS binaries can also invoke a special method of Runtime called
+callJava().When the MIPS binary invokes the {\tt CALL\_JAVA} syscall
+(usually done through the {\tt \_call\_java()} function provided by
+the NestedVM support library) the callJava() method in Runtime is
+invoked with the arguments passes to the syscall.
+
+v v v v v v v
+Although NestedVM perfectly emulates a MIPS R2000 CPU its performance
+characteristics are not anything like an actual MIPS R2000 CPU. GCC makes
+several optimizations that increase performance on an actually MIPS CPU but
+actually decrease performance when run through the NestedVM binary
+translator. Fortunately, GCC provides many options to customize its code
+generations and eliminate these optimizations. GCC also has optimization
+options that are not helpful on a real MIPS CPU but are very helpful under
+NestedVM
+*************
+{\footnotesize\begin{verbatim}
+^ ^ ^ ^ ^ ^ ^
+
+// Java
+private Runtime rt = new MyBinary() {
+ pubilc int callJava(int a, int b, int c, int d) {
+ System.err.println("Got " + a + " " + b);
+ }
+};
+public void foo() { rt.run(); }
+
+/* C */
+void main(int argc, char **argv) {
+ _call_java(1,2);