-public void trampoline() {
- for(;;) {
- switch(pc&0xfffffe00) {
- case 0x10000: run_0x10000(); break;
- case 0x10200: run_0x10200(); break;
- case 0xdeadbe00:
- ...
- }
- }
-}
-With this trampoline in place somewhat large binaries could be handled without much
-difficulty. There is no limit on the size of a classfile as a whole, just individual methods.
-This method should scale well. However, there are other classfile limitations that will limit
-the size of compiled binaries.
-Another interesting problem that was discovered while creating the trampoline method was
-javac and JikesÕ incredible stupidity when dealing with switch statements. The follow code
-fragment gets compiled into a lookupswich by javac:
-Switch(pc&0xffffff00) {
- Case 0x00000100: run_100(); break;
- Case 0x00000200: run_200(); break;
- Case 0x00000300: run_300(); break;
-}
-while this nearly identical code fragment gets compiled to a tableswitch
-switch(pc>>>8) {
- case 0x1: run_100(); break
- case 0x2: run_200(); break;
- case 0x3: run_300(); break;
-}
-Javac isnÕt smart enough to see the patter in the case values and generates very suboptimal
-bytecode. Manually doing the shifts convinces javac to emit a tableswitch statement, which
-is significantly faster. This change alone nearly doubled the speed of the compiled binary.
-Finding the optimal method size lead to the next big performance increase. It was
-determined with experimentation that the optimal number of MIPS instructions per method
-is 128 (considering only power of two options). Going above or below that lead to
-performance decreases. This is most likely due to a combination of two factors.
-_ The two levels of switch statements jumps have to pass though Ð The first switch
-statement jumps go through is the trampoline switch statement. This is implemented
-as a TABLESWITCH in JVM bytecode so it is very fast. The second level switch
-statement in the individual run_ methods is implemented as a LOOKUPSWITCH,
-which is much slower. Using smaller methods puts more burden on the faster
-TABLESWITCH and less on the slower LOOKUPSWITCH.
-_ JIT compilers probably favor smaller methods smaller methods are easier to compile
-and are probably better candidates for JIT compilation than larger methods.
-FIXME: Put a chart here
-The next big optimization was eliminating unnecessary case statements. Having case
-statements before each instruction prevents JIT compilers from being able to optimize
-across instruction boundaries. In order to eliminate unnecessary case statements every
-possible address that could be jumped to directly needed to be identified. The sources for
-possible jump targets come from 3 places.
-_ The .text segment Ð Every instruction in the text segment in scanned for jump
-targets. Every branch instruction (BEQ, JAL, etc) has its destination added to the list
-of possible branch targets. In addition, functions that set the link register have
-theirpc+8 added to the list (the address that wouldÕve been put to the link register).
-Finally, combinations of LUI (Load Upper Immediate) of ADDIU (Add Immediate
-Unsigned) are scanned for possible addresses in the text segment. This combination
-of instructions is often used to load a 32-bit word into a register.
-_ The .data segment Ð When GCC generates switch() statements it often uses a jump
-table stored in the .data segment. Unfortunately gcc doesnÕt identify these jump
-tables in any way. Therefore, the entire .data segment is conservatively scanned for
-possible addresses in the .text segment.
-_ The symbol table Ð This is mainly used as a backup. Scanning the .text and .data
-segments should identify any possible jump targets but adding every function in the
-symbol table in the ELF binary doesnÕt hurt. This will also catch functions that are
-never called directly from the MIPS binary (for example, functions called with the
-call() method in the runtime).
-Eliminating unnecessary case statements provided a 10-25% speed increase .
-Despite all the above optimizations and workaround an impossible to workaround hard
-classfile limit was eventually hit, the constant pool. The constant pool in classfiles is limited
-to 65536 entries. Every Integer with a magnitude greater than 32767 requires an entry in the
-constant pool. Every time the compiler emits a jump/branch instruction the PC field is set to
-the branch target. This means nearly every branch instruction requires an entry in the
-constant pool. Large binaries hit this limit fairly quickly. One workaround that was
-employed in the Java source compiler was to express constants as offsets from a few central
-values. For example: "pc = N_0x00010000 + 0x10" where N_0x000100000 is a non-
-final field to prevent javac from inlining it. This was sufficient to get reasonable large
-binaries to compile. It has a small (approximately 5%) performance impact on the generated
-code. It also makes the generated classfile somewhat larger. Fortunately, the classfile
-compiler eliminates this problem.
-3.4 Ð Bytecode compiler
-The next step in the evolution of Mips2Java was to compile directly to JVM bytecode
-eliminating the intermediate javac step. This had several advantages
-_ There are little tricks that can be done in JVM bytecode that canÕt be done in Java
-source code.
-_ Eliminates the time-consuming javac step Ð Javac takes a long time to parse and
-compile the output from the java source compiler.
-_ Allows for MIPS binaries to be compiled and loaded into a running VM using a
-class loader. This eliminates the need to compile the binaries ahead of time.
-By generating code at the bytecode level there are many areas where the compiler can be
-smarter than javac. Most of the areas where improvements where made where in the
-handling of branch instructions and in taking advantage of the JVM stack to eliminate
-unnecessary LOADs and STOREs to local variables.
-The first obvious optimization that generating bytecode allows for is the use of GOTO.
-Despite the fact that java doesnÕt have a GOTO keyword a GOTO bytecode does exist and
-is used heavily in the code generates by javac. Unfortunately the java language doesnÕt
-provide any way to take advantage of this. As result of this jumps within a method were
-implemented by setting the PC field to the new address and making a trip back to the initial
-switch statement. In the classfile compiler these jumps are implemented as GOTOs directly
-to the target instruction. This saves a costly trip back through the LOOKUPSWITCH
-statement and is a huge win for small loops within a method.
-Somewhat related to using GOTO is the ability to optimize branch statements. In the Java
-source compiler branch statements are implemented as follows (delay slots are ignored for
-the purpose of this example)
-If(condition) { pc = TARGET; continue; }
-This requires a branch in the JVM regardless of whether the MIPS branch is actually taken.
-If condition is false the JVM has to jump over the code to set the PC and go back to the
-switch block. If condition is true the JVM as to jump to the switch block. By generating
-bytecode directly we can make the target of the JVM branch statement the actual bytecode
-of the final destination. In the case where the branch isnÕt taken the JVM doesnÕt need to
-branch at all.
-A side affect of the above two optimizations is a solution to the excess constant pool entries
-problem. When jumps are implemented as GOTOs and direct branches to the target the PC
-field doesnÕt need to be set. This eliminates many of the constant pool entries the java
-source compiler requires. The limit is still there however, and given a large enough binary it
-will still be reached.
-Delay slots are another area where things are done somewhat inefficiently in the Java source
-compiler. In order to take advantage of instructions already in the pipeline MIPS cpu have a
-"delay slot". That is, an instruction after a branch or jump instruction that is executed
-regardless of whether the branch is taken. This is done because by the time the branch or
-jump instruction is finished being processes the next instruction is already ready to be
-executed and it is wasteful to discard it. (However, newer MIPS CPUs have pipelines that
-are much larger than early MIPS CPUs so they have to discard many instructions anyway.)
-As a result of this the instruction in the delay slot is actually executed BEFORE the branch
-is taken. To make things even more difficult, values from the register file are loaded
-BEFORE the delay slot is executed. Here is a small piece of MIPS assembly:
-ADDIU r2,r0,-1
-BLTZ r2, target
-ADDIU r2,r2,10
-...
-:target
-This piece of code is executed as follows
-1. r2 is set to Ð1
-2. r2 is loaded from the register file by the BLTEZ instruction
-3. 10 is added to r2 by the ADDIU instruction
-4. The branch is taken because at the time the BLTZ instruction was executed r2 was
-Ð1, but r2 is now 9 (-1 + 10)
-There is a very element solution to this problem when using JVM bytecode. When a branch
-instruction is encountered the registers needed for the comparison are pushed onto the stack
-to prepare for the JVM branch instruction. Then, AFTER the values are on the stack the
-delay slot is emitted, and then finally the actual JVM branch instruction. Because the values
-were pushed to the stack before the delay slot was executed any changes the delay slot made
-to the registers are not visible to the branch bytecode. This allows delay slots to be used
-with no performance penalty or size penalty.
-One final advantage that generating bytecode directly allows is smaller more compact
-bytecode. All the optimization above lead to smaller bytecode as a side effect. There are also
-a few other areas where the generated bytecode can be optimized for size with more
-knowledge of the program as a whole.
-When encountering the following switch block both javac and Jikes generate redundant
-bytecode.
-Switch(pc>>>8) {
- Case 0x1: run_1(); break;
- Case 0x2: run_2(); break
- ...
- case 0x100: run_100(); break;
-}
-The first bytecode in each case arm in the switch statement is ALOAD_0 to prepare for a
-invoke special call. By simple moving this outside the switch statement each case arm was
-reduced in size by one instruction. Similar optimizations were also done in other parts of the
-compiler.
-
-Section 3
-- Adam - The method is run(), not execute. Execute() is only used when you need to
-resume from a pause syscall.
-
-Section 3.1
-- Adam - Even the R1000 supports LB/SB/LH/SH/LWL/LWR Ð saying it only supports
-32-bit aligned loads is incorrect.
-- Other similarities
-o All the branching instructions in MIPS directly map to single JVM instructions.
-o Most of the ALU instructions map to single JVM instructions.
-
-(I can write up some stuff for each of these next several sections if you want)
-Section 3.2 - Interpreter
-- Originally written mainly to understand the MIPS instruction set
-- Far easier to debug than an ahead of time compiler (no waiting, can throw in quick
-hacks like if(pc >= 0xbadc0de && pc <= 0xbadcfff) debugOutput() ), donÕt need to
-understand most of the ELF format)
-- Still could be useful
-o for GDB remote debugging
-o cases where size is more important than speed (size of interpreter + size of mips
-binary < size of compiled binary or size of compiler + mips binary)
-o code which dynamically generates code (JIT compilers, etc). The ahead of time
-compiler canÕt possibly handle this
-
-Section 3.3 Ð Compiling to Java Source
-- First version of an ahead of time compiler
-- Huge performance boost
-- Java source output preferred for the 2+2=4 type optimizations java compilers do
-- Worked well for small binaries Ð large MIPS binaries required ridiculous amounts of
-memory to compile and often created invalid classfiles
-- Too many constants Ð every jump operation required an entry in the constant pool (pc =
-0xabcd1234; continue; )
-
-Section 3.4 Ð Compiling directly to JVM Bytecode
-- Another jump in performance
-- More efficient code can be created at the bytecode level
-o Information can be stored on the JVM stack rather than in local variables
-_ Javac/jikes often unnecessarily use local variables
-long tmp = a*b;
-lo = (int)tmp;
-hi = (int)(tmp>>>32)
-does a store and two loads when a simple DUP would suffice
-o GOTO can be used to branch directly to branch destinations in the same method
-rather than going through the switch block again.
-o BEQ, BGTZ, BLE, etc can jump directly to destination rather than doing
-if(condition) { pc=0xabcd1234; continue; }
-o Eliminates excess constant pool entries (only jump outside the current method
-require a constant pool entry)
-o Delay slots implemented more efficiently.
-_ Java source compiler does:
-if(condition) { /* delay slot /; pc = 0xabcd1234; continue; }
-/* delay slot again */
-_ This is required because the delay slot can change the registers used in
-condition. The registers need to be read BEFORE the delay slot in executed.
-_ In the bytecode compiler the registers used in the condition are pushed to the
-stack, then the delay slot is executed, and finally the comparison is done.
-This eliminates the needs to output the delay slot twice.
-- Smaller bytecode
-o Everything mentioned above makes it smaller and faster
-o Javac/jikes add redundant code
-_ Switch(a) {
- Case 1: Run_1000(); break;
- Case 2: run_2000(); break;
-}
-Gets compiled into
-1 ÐTableswitch É.
-2 Ð ALOAD_0
-3 Ð invokespecial run_1000
-4 Ð goto end
-5 Ð ALOAD_0
-6 Ð invokespecial run_2000
-10 Ð goto end
-ALOAD_0 is unnecessarily put in each switch arm
-
-3.5 Interfacing with Java Code
-- _call_java ()/Runtime.call()
-o _call_java () - Call a java method from mips
-o Runime.call() Ð call a mips method from java
-o Easily allocate memory within the binaryÕs memory space by calling libcÕs malloc()
-o Can go back and forth between mips and java (java calls mips which calls java which
-calls back into mips)
-- Java Strings can easily be converted to and from null terminated strings in the processÕ
-memory space
-- Java InputStreams, OutputStreams, and Sockets can easily be turned in to standard
-UNIX file descriptors (and ANSI FILE*s)
-- Can easily create custom filedescriptors and have full control over all operations on
-them (read, write, seek, close, fstat, etc)
-- Int User_info[] Ð optional chunk of memory can very easily be accessed from java
-(Runtime.getUserInfo/setUserInfo)
-
-3.6 Virtualization
-- Adam Ð we actually donÕt support sockets directly yet Ð you should probably take that
-out. (But you can create a socket in java and expose it to mips as a filedescriptor)
-- Runtime services
-o Provides a easy to use interface to subclasses (Interpreter or compiles binaries)
-_ Subclasses only know how to execute instructions
-_ Runtime handles setting up registers/stack for execution to begin and
-extracting return values and the exit status from the process
-o Memory management
-_ Sets up stack and guard pages
-_ Allocates pages with the sbrk syscall
-_ Provide easy an memcpy like interface for accessing the processes memory
-from java
-o Runtime.call() support Ð sets up registers,etc to prepare the process for a call into it
-from java
-o Filesystem Ð open/close/read/write/seek/fcntl s syscalls
-o Time related functions Ð sleep, gettimeofday, times syscall
-o UnixRuntime provides a more complete unix-like environment (Runtime smaller
-though)
-_ Supports fork() and waitpid()
-_ Pipe() for IPC
-_ More advocated filesystem interface
-_ All filesystem operations abstracted away into a FileSystem class
-o FileSystem class can be written that exposes a zip file,
-directory on an http server, etc as a filesystem
-_ Chdir/getcwd
-_ /dev filesystem
-_ stat()
-_ directory listing support
-5.1 Charts
-IÕll put together some charts tonight
-5.2 Optimizations
-And finish this part
-
-Let me know if this was what you were looking for
-
- libjpeg libmspack libfreetype
-Interpreted MIPS Binary 22.2 12.9 21.4
-Compled MIPS Binary (fastest options) 3.39 2.23 4.31
-Native -O3 0.235 0.084 0.201
-
-Compled - with all case statements 3.50 2.30 4.99
-Compiled - with pruned case statement 3.39 2.23 4.31
-
-Compiled - 512 instructions/method 62.7 27.7 56.9
-Compiled - 256 instructions/method 3.54 2.55 4.43
-Compiled - 128 instructions/method 3.39 2.23 4.31
-Compiled - 64 instructions/method 3.56 2.31 4.40
-Compiled - 32 instruction/method 3.71 2.46 4.64
-
-Compild MIPS Binary (Server VM) 3.21 2.00 4.54
-Compiled MIPS Binary (Client VM) 3.39 2.23 4.31
-
-All times are measured in seconds. These were all run on a dual 1ghz G4
-running OS X 10.3.1 with Apple's latest VM (JDK 1.4.1_01-27). Each test
-was run 8 times within a single VM. The highest and lowest times were
-removed and the remaining 6 were averaged. In each case only the first
-run differed significantly from the rest.
-
-The libjpeg test consisted of decoding a 1280x1024 jpeg
-(thebride_1280.jpg) and writing a tga. The mspack test consisted of
-extracting all members from arial32.exe, comic32.exe, times32.exe, and
-verdan32.exe. The freetype test consisted of rendering characters
-32-127 of Comic.TTF at sizes from 8 to 48 incrementing by 4. (That is
-about 950 individual glyphs).
-
-I can provide you with the source for any of these test if you'd like.
-
--Brian
-\end{verbatim}