doc/nestedvm.ivme04.tex

   1 %
   2 % FIXME: - Add something about size limits on the constant pool
   3 %          how we worked around that and the performance impact
   4 %          of -o lessconstants
   5 %        - Add something about encoding data sections as string constants
   6 %          and the UTF8 non-7-bit-ascii penalty
   7 %
   8
   9 \documentclass{acmconf}
  10 \usepackage{graphicx}
  11 \usepackage{amssymb,amsmath,epsfig,alltt}
  12 \sloppy         % better line breaks
  13 \usepackage{palatino}
  14 \usepackage{parskip}
  15 \usepackage{tabularx}
  16 \usepackage{alltt}
  17 \bibliographystyle{alpha}
  18
  19 \title{\textbf{\textsf{
  20 Running Legacy C/C++ Libraries in a Pure Java Environment
  21 }}}
  22 \date{}
  23 \author{\begin{tabular}{@{}c@{}}
  24         {\em {Brian Alliet}} \\
  25         {Rochester Institute of Technology}\\
  26         {\tt brian@ibex.org}
  27    \end{tabular}\hskip 1in\begin{tabular}{@{}c@{}}
  28         {\em {Adam Megacz}} \\
  29         {UC Berkeley Statistical Computing Facility} \\
  30         {\tt adam@ibex.org}
  31 \end{tabular}}
  32 \begin{document}
  33
  34 \maketitle
  35
  36 \begin{abstract}
  37 Abstract
  38 \end{abstract}
  39
  40 \section{Introduction}
  41
  42 \subsection{Why would you want to do this?}
  43
  44 The C programming language has been around for over 30 years now.
  45 There is a huge library of software written in this language.  By
  46 contrast, Java has been around for less than ten years.  Although it
  47 offers substantial advantages over C, the set of libraries written in
  48 this language still lags behind C/C++.
  49
  50 The typical solution to this dilemma is to use JNI or CNI to invoke C
  51 code from within a Java VM.  Unfortunately, there are a number of
  52 situations in which this is not an acceptable solution:
  53
  54 \begin{itemize}
  55 \item Java Applets are not permitted to invoke {\tt Runtime.loadLibrary()}
  56 \item Java Servlet containers with a {\tt SecurityManager} will not
  57       permit loading new JNI libraries.  This configuration is popular
  58       with {\it shared hosting} providers and corporate intranets
  59       where a number of different parties contribute individual web
  60       applications which are run together in a single container.
  61 \item JNI requires the native library to be compiled ahead of time,
  62       separately, for every architecture on which it will be deployed.
  63       This is unworkable for situations in which the full set of
  64       target architectures is not known at deployment time.
  65 \item The increasingly popular J2ME platform does not support JNI or CNI.
  66 \item Unlike Java Bytecode, JNI code is susceptible to buffer overflow
  67       and heap corruption attacks.  This can be a major security
  68       vulnerability.
  69 \item JNI often introduces undesirable added complexity to an
  70       application.
  71 \end{itemize}
  72
  73 The technique we present here is based on using a typical ANSI C
  74 compiler to compile C/C++ code into a MIPS binary, and then using a
  75 tool to translate that binary on an instruction-by-instruction basis
  76 into Java bytecode.
  77
  78 The technique presented here is general; we anticipate that it can be
  79 applied to other secure virtual machines such as Microsoft's .NET.
  80
  81
  82 \section{Existing Work: Source-to-Source Translation}
  83
  84 \begin{itemize}
  85 \item c2java
  86 \item commercial products
  87 \end{itemize}
  88
  89 \section{Mips2Java: Binary-to-Binary Translation}
  90
  91 We present Mips2Java, a binary-to-binary translation tool to convert
  92 MIPS binaries into Java bytecode files.
  93
  94 The process of utilizing Mips2Java begins by using any compiler for
  95 any language to compile the source library into a statically linked
  96 MIPS binary.  We used {\tt gcc 3.3.3}, but any compiler which can
  97 target the MIPS platform should be acceptable.  The binary is
  98 statically linked with a system library (in the case of C code this is
  99 {\tt libc}) which translates system requests (such as {\tt open()},
 100 {\tt read()}, or {\tt write()}) into appropriate invocations of the
 101 MIPS {\tt SYSCALL} instruction.
 102
 103 The statically linked MIPS binary is then fed to the Mips2Java tool
 104 (which is itself written in Java), which emits a sequence of Java
 105 Bytecodes in the form of a {\tt .class} file equivalent to the
 106 provided binary.  This {\tt .class} file contains a single class which
 107 implements the {\tt Runtime} interface.  This class may then be
 108 instantiated; invoking the {\tt execute()} method is equivalent to
 109 invoking the {\tt main()} function in the original binary.
 110
 111
 112 \subsection{Why MIPS?}
 113
 114 We chose MIPS as a source format for two primary reasons: the
 115 availability of tools to translate legacy code into MIPS binaries, and
 116 the close similarity between the MIPS ISA and the Java Virtual Machine.
 117
 118 The MIPS architecture has been around for quite some time, and is well
 119 supported by the GNU Compiler Collection, which is capable of
 120 compiling C, C++, Java, Fortran, Pascal (with p2c), and Objective C
 121 into MIPS binaries.
 122
 123 The MIPS R1000 ISA bears a striking similarity to the Java Virtual
 124 Machine.  This early revision of the MIPS supports only 32-bit aligned
 125 loads and stores from memory, which is precisely the access model
 126 supported by Java for {\tt int[]}s.
 127
 128 Cover dynamic page allocation.
 129
 130 Cover stack setup.
 131
 132 Brian, are there any other fortunate similarities we should mention
 133 here?  I seem to remember there being a bunch, but I can't recall them
 134 right now; it's been a while since I dealt with this stuff in detail.
 135
 136
 137 \subsection{Interpreter}
 138
 139 The Interpreter was the first part of Mips2Java to be written.  This was the most straightforward and simple way to run MIPS binaries inside the JVM.  The simplicity of the interpreter also made it very simple to debug. Debugging machine-generated code is a pain. Most importantly, the interpreter provided a great reference to use when developing the compiler. With known working implementations of each MIPS instruction in Java writing a compiler became a matter of simple doing the same thing ahead of time.
 140
 141 With the completion of the compiler the interpreter in Mips2Java has become less useful. However, it may still have some life left in it. One possible use is remote debugging with GDB. Although debugging the compiler generated JVM code is theoretically possible, it would be far easier to do in the interpreter. The interpreter may also be useful in cases where size is far more important than speed. The interpreter is very small. The interpreter and MIPS binary combined are smaller than the compiled classfiles.
 142
 143
 144 \subsection{Compiling to Java Source}
 145
 146 The next major step in Mips2Java?s development was the Java source compiler. This generated Java source code (compliable with javac or Jikes) from a MIPS binary. Generating Java source code was preferred initially over JVM bytecode for two reasons. The authors weren?t very familiar with JVM bytecode and therefore generating Java source code was simpler. Generating source code also eliminated the need to do trivial optimizations in the Mips2java compiler that javac and Jikes already do. This mainly includes 2+2=4 stuff. For example, the MIPS register r0 is immutable and always 0. This register is represented by a static final int in the Java source compiler. Javac and Jikes automatically handle optimizing this away when possible. In the JVM bytecode compiler these optimizations needs to be done in Mips2Java.
 147
 148 Early versions of the Mips2Java compiler were very simple. All 32 MIPS GPRs and a special PC register were fields in the generated java class. There was a run() method containing all the instructions in the .text segment converted to Java source code. A switch statement was used to allow jumps from instruction to instruction. The generated code looked something like this.
 149
 150 %private final static int r0 = 0;
 151 %private int r1, r2, r3,...,r30;
 152 %private int r31 = 0xdeadbeef;
 153 %private int pc = ENTRY_POINT;
 154 %
 155 %public void run() {
 156 %       for(;;) {
 157 %               switch(pc) {
 158 %                       case 0x10000:
 159 %                               r29 = r29 ? 32;
 160 %                       case 0x10004:
 161 %                               r1 = r4 + r5;
 162 %                       case 0x10008:
 163 %                               if(r1 == r6) {
 164 %                                       /* delay slot */
 165 %                                       r1 = r1 + 1;
 166 %                                       pc = 0x10018:
 167 %                                       continue;
 168 %                               }
 169 %                       case 0x1000C:
 170 %                               r1 = r1 + 1;
 171 %                       case 0x10010:
 172 %                               r31 = 0x10018;
 173 %                               pc = 0x10210;
 174 %                               continue;
 175 %                       case 0x10014:
 176 %                               /* nop */
 177 %                       case 0x10018:
 178 %                               pc = r31;
 179 %                               continue;
 180 %                       ...
 181 %                       case 0xdeadbeef:
 182 %                               System.err.println(?Exited from ENTRY_POINT?);
 183 %                               System.err.println(?R2: ? + r2);
 184 %                               System.exit(1);
 185 %               }
 186 %       }
 187 %}
 188
 189 This worked fine for small binaries but as soon as anything substantial was fed to it the 64k JVM method size limit was soon hit. The solution to this was to break up the code into many smaller methods. This required a trampoline to dispatch  jumps to the appropriate method. With the addition of the trampoline the generated code looked something like this:
 190
 191 %public void run_0x10000() {
 192 %       for(;;) {
 193 %       switch(pc) {
 194 %               case 0x10000:
 195 %                       ...
 196 %               case 0x10004:
 197 %                       ...
 198 %               ...
 199 %               case 0x10010:
 200 %                       r31 = 0x10018;
 201 %                       pc = 0x10210;
 202 %                       return;
 203 %               ...
 204 %       }
 205 %       }
 206 %}
 207 %
 208 %pubic void run_0x10200() {
 209 %       for(;;) {
 210 %       switch(pc) {
 211 %               case 0x10200:
 212 %                       ...
 213 %               case 0x10204:
 214 %                       ...
 215 %       }
 216 %       }
 217 %}
 218 %
 219 %public void trampoline() {
 220 %       for(;;) {
 221 %       switch(pc&0xfffffe00) {
 222 %                       case 0x10000: run_0x10000(); break;
 223 %                       case 0x10200: run_0x10200(); break;
 224 %                       case 0xdeadbe00:
 225 %                               ...
 226 %               }
 227 %       }
 228 %}
 229
 230 With this trampoline in place somewhat large binaries could be handled without much difficulty. There is no limit on the size of a classfile as a whole, just individual methods. This method should scale well. However, there are other classfile limitations that will limit the size of compiled binaries.
 231
 232 Another interesting problem that was discovered while creating the trampoline method was javac and Jikes? inability to properly optimize switch statements. The follow code fragment gets compiled into a lookupswich by javac:
 233
 234 %Switch(pc&0xffffff00) {
 235 %       Case 0x00000100: run_100(); break;
 236 %       Case 0x00000200: run_200(); break;
 237 %       Case 0x00000300: run_300(); break;
 238 %}
 239
 240 while this nearly identical code fragment gets compiled to a tableswitch
 241
 242 Javac isn?t smart enough to see the patter in the case values and generates very suboptimal bytecode. Manually doing the shifts convinces javac to emit a tableswitch statement, which is significantly faster. This change alone nearly doubled the speed of the compiled binary.
 243
 244 Finding the optimal method size lead to the next big performance increase.  It was determined with experimentation that the optimal number of MIPS instructions per method is 128 (considering only power of two options). Going above or below that lead to performance decreases. This is most likely due to a combination of two factors.
 245
 246 \begin{itemize}
 247 \item The two levels of switch statements jumps have to pass though - The first
 248 switch statement jumps go through is the trampoline switch statement. This
 249 is implemented as a TABLESWITCH in JVM bytecode so it is very fast. The
 250 second level switch statement in the individual run\_ methods is implemented
 251 as a LOOKUPSWITCH, which is much slower. Using smaller methods puts more
 252 burden on the faster TABLESWITCH and less on the slower LOOKUPSWITCH.
 253
 254 \item JIT compilers probably favor smaller methods smaller methods are easier to compile and are probably better candidates for JIT compilation than larger methods.
 255 \end{itemize}
 256
 257 Put a chart in here
 258
 259 The next big optimization was eliminating unnecessary case statements. Having case statements before each instruction prevents JIT compilers from being able to optimize across instruction boundaries. In order to eliminate unnecessary case statements every possible address that could be jumped to directly needed to be identified. The sources for possible jump targets come from 3 places.
 260 \begin{itemize}
 261 \item The .text segment ? Every instruction in the text segment in scanned for jump targets. Every branch instruction (BEQ, JAL, etc) has its destination added to the list of possible branch targets. In addition, functions that set the link register have theirpc+8 added to the list (the address that would?ve been put to the link register). Finally, combinations of LUI (Load Upper Immediate) of ADDIU (Add Immediate Unsigned) are scanned for possible addresses in the text segment. This combination of instructions is often used to load a 32-bit word into a register.
 262 \item The .data segment ? When GCC generates switch() statements it often uses a jump table stored in the .data segment. Unfortunately gcc doesn?t identify these jump tables in any way. Therefore, the entire .data segment is conservatively scanned for possible addresses in the .text segment.
 263 \item The symbol table ? This is mainly used as a backup. Scanning the .text and .data segments should identify any possible jump targets but adding every function in the symbol table in the ELF binary doesn?t hurt. This will also catch functions that are never called directly from the MIPS binary (for example, functions called with the call() method in the runtime).
 264 \end{itemize}
 265
 266 Eliminating unnecessary case statements provided a 10-25\% speed increase
 267
 268 Despite all the above optimizations and workaround an impossible to
 269 workaround hard classfile limit was eventually hit, the constant pool. The
 270 constant pool in classfiles is limited to 65536 entries. Every Integer with
 271 a magnitude greater than 32767 requires an entry in the constant pool. Every
 272 time the compiler emits a jump/branch instruction the PC field is set to the
 273 branch target. This means nearly every branch instruction requires an entry
 274 in the constant pool. Large binaries hit this limit fairly quickly. One
 275 workaround that was employed in the Java source compiler was to express
 276 constants as offsets from a few central values. For example: ``pc =
 277 N\_0x00010000 + 0x10'' where N\_0x000100000 is a non-final field to prevent
 278 javac from inlining it. This was sufficient to get reasonable large binaries
 279 to compile. It has a small (approximately 5\%) performance impact on the
 280 generated code. It also makes the generated classfile somewhat larger.
 281 Fortunately, the classfile compiler eliminates this problem.
 282
 283 \subsection{Compiling directly to Java Bytecode}
 284
 285 The next step in the evolution of Mips2Java was to compile directly to JVM bytecode eliminating the intermediate javac step. This had several advantages:
 286 \begin{itemize}
 287 \item There are little tricks that can be done in JVM bytecode that can?t be done in Java source code.
 288 \item Eliminates the time-consuming javac step ? Javac takes a long time to parse and compile the output from the java source compiler.
 289 \item Allows for MIPS binaries to be compiled and loaded into a running VM using a class loader. This eliminates the need to compile the binaries ahead of time.
 290 \end{itemize}
 291
 292 By generating code at the bytecode level  there are many areas where the compiler can be smarter than javac. Most of the areas where improvements where made where in the handling of branch instructions and in taking advantage of the JVM stack to eliminate unnecessary LOADs and STOREs to local variables.
 293
 294 The first obvious optimization that generating bytecode allows for is the use of GOTO. Despite the fact that java doesn?t have a GOTO keyword  a GOTO bytecode does exist and is used heavily in the code generates by javac. Unfortunately the java language doesn?t provide any way to take advantage of this. As result of this jumps within a method were implemented by setting the PC field to the new address and making a trip back to the initial switch statement.  In the classfile compiler these jumps are implemented as GOTOs directly to the target instruction. This saves a costly trip back through the LOOKUPSWITCH statement and is a huge win for small loops within a method.
 295
 296 Somewhat related to using GOTO is the ability to optimize branch statements. In the Java source compiler branch statements are implemented as follows (delay slots are ignored for the purpose of this example):
 297 %if(condition) { pc = TARGET; continue; }
 298
 299 This requires a branch in the JVM regardless of whether the MIPS branch is actually taken. If condition is false the JVM has to jump over the code to set the PC and go back to the switch block. If condition is true the JVM as to jump to the switch block. By generating bytecode directly we can make the target of the JVM branch statement the actual bytecode of the final destination. In the case where the branch isn?t taken the JVM doesn?t need to branch at all.
 300
 301 A side affect of the above two optimizations is a solution to the excess constant pool entries problem. When jumps are implemented as GOTOs and direct branches to the target the PC field doesn?t need to be set. This eliminates many of the constant pool entries the java source compiler requires. The limit is still there however, and given a large enough binary it will still be reached.
 302
 303 Delay slots are another area where things are done somewhat inefficiently in the Java source compiler. In order to take advantage of instructions already in the pipeline MIPS cpu have a ?delay slot?. That is, an instruction after a branch or jump instruction that is executed regardless of whether the branch is taken. This is done because by the time the branch or jump instruction is finished being processes the next instruction is already ready to be executed and it is wasteful to discard it. (However, newer  MIPS CPUs have pipelines that are much larger than early MIPS CPUs so they have to discard many instructions anyway.) As a result of this the instruction in the delay slot is actually executed BEFORE the branch is taken. To make things even more difficult, values from the register file are loaded BEFORE the delay slot is executed.  Here is a small piece of MIPS assembly:
 304
 305 %ADDIU r2,r0,-1
 306 %BLTZ r2, target
 307 %ADDIU r2,r2,10
 308 %...
 309 %:target
 310
 311 This piece of code is executed as follows
 312 \begin{enumerate}
 313 \item r2 is set to ?1
 314 \item r2 is loaded from the register file by the BLTEZ instruction
 315 \item 10 is added to r2 by the ADDIU instruction
 316 \item The branch is taken because at the time the BLTZ instruction was executed r2 was ?1, but r2 is now 9 (-1 + 10)
 317 \end{enumerate}
 318
 319 There is a very element solution to this problem when using JVM bytecode. When a branch instruction is encountered the registers needed for the comparison are pushed onto the stack to prepare for the JVM branch instruction. Then, AFTER the values are on the stack the delay slot is emitted, and then finally the actual JVM branch instruction. Because the values were pushed to the stack before the delay slot was executed any changes the delay slot made to the registers are not visible to the branch bytecode. This allows delay slots to be used with no performance penalty or size penalty.
 320
 321 One final advantage that generating bytecode directly allows is smaller more compact bytecode. All the optimization above lead to smaller bytecode as a side effect. There are also a few other areas where the generated bytecode can be optimized for size with more knowledge of the program as a whole.
 322
 323 When encountering the following switch block both javac and Jikes generate redundant bytecode.
 324 %Switch(pc>>>8) {
 325 %       Case 0x1: run_1(); break;
 326 %       Case 0x2: run_2(); break
 327 %       ...
 328 %       case 0x100: run_100(); break;
 329 %}
 330
 331 The first bytecode in each case arm in the switch statement is ALOAD\_0 to
 332 prepare for a invoke special call. By simple moving this outside the switch
 333 statement each case arm was reduced in size by one instruction. Similar
 334 optimizations were also done in other parts of the compiler.
 335
 336
 337 \subsection{Interfacing with Java Code}
 338
 339 Java source code can create a copy of the translated binary by
 340 instantiating the corresponding class, which extends {\tt Runtime}.
 341 Invoking the {\tt main()} method on this class is equivalent to
 342 calling the {\tt main()} function within the binary; the {\tt String}
 343 arguments to this function are copied into the binary's memory space
 344 and made available as {\tt argv**} and {\tt argc}.
 345
 346 The translated binary communicates with the rest of the VM by
 347 executing MIPS {\tt SYSCALL} instructions, which are translated into
 348 invocations of the {\tt syscall()} method.  This calls back to the
 349 native Java world, which can manipulate the binary's environment by
 350 reading and writing to its memory space, checking its exit status,
 351 pausing the VM, and restarting the VM.
 352
 353
 354 \subsection{Virtualization}
 355
 356 The {\tt Runtime} class implements the majority of the standard {\tt
 357 libc} syscalls, providing a complete interface to the filesystem,
 358 network socket library, time of day, (Brian: what else goes here?).
 359
 360 \begin{itemize}
 361 \item ability to provide the same interface to CNI code and mips2javaified code
 362 \item security advantages (chroot the {\tt fork()}ed process)
 363 \end{itemize}
 364
 365 \section{Related Work}
 366
 367 \subsection{Source-to-Source translators}
 368
 369 A number of commercial products and research projects attempt to
 370 translate C++ code to Java code, preserving the mapping of C++ classes
 371 to Java classes.  Unfortunately this is problematic since there is no
 372 way to do pointer arithmetic except within arrays, and even in that
 373 case, arithmetic cannot be done in terms of fractional objects.
 374
 375 Mention gcc backend
 376
 377 c2java
 378
 379 Many of these products advise the user to tweak the code which results
 380 from the translation.  Unfortunately, hand-modifying machine-generated
 381 code is generally a bad idea, since this modification cannot be
 382 automated.  This means that every time the origin code changes, the
 383 code generator must be re-run, and the hand modifications must be
 384 performed yet again.  This is an error-prone process.
 385
 386 Furthermore, Mips2Java does not attempt to read C code directly.  This
 387 frees it from the complex task of faithfully implementing the ANSI C
 388 standard (or, in the case of non ANSI-C compliant code, some other
 389 interface).  This also saves the user the chore of altering their
 390 build process to accomodate Mips2Java.
 391
 392
 393 \section{Performance}
 394
 395 \subsection{Charts}
 396
 397 (Note that none of these libraries have pure-Java equivalents.)
 398
 399 \begin{itemize}
 400 \item libjpeg
 401 \item libfreetype
 402 \item libmspack
 403 \end{itemize}
 404
 405
 406 \subsection{Optimizations}
 407
 408 Brian, can you write something to go here?  Just mention which
 409 optimizations helped and which ones hurt.
 410
 411 \begin{itemize}
 412 \item trampoline
 413 \item optimal method size
 414 \item -msingle-float
 415 \item -mmemcpy
 416 \item fastmem
 417 \item local vars for registers (useless)
 418 \item -fno-rename-registers
 419 \item -ffast-math
 420 \item -fno-trapping-math
 421 \item -fsingle-precision-constant
 422 \item -mfused-madd
 423 \item -freg-struct-return
 424 \item -freduce-all-givs
 425 \item -fno-peephole
 426 \item -fno-peephole2
 427 \item -fmove-all-movables
 428 \item -fno-sched-spec-load
 429 \item -fno-sched-spec
 430 \item -fno-schedule-insns
 431 \item -fno-schedule-insns2
 432 \item -fno-delayed-branch
 433 \item -fno-function-cse
 434 \item -ffunction-sections
 435 \item -fdata-sections
 436 \item array bounds checking
 437 \item -falign-functions=n
 438 \item -falign-labels=n
 439 \item -falign-loops=n
 440 \item -falign-jumps=n
 441 \item -fno-function-cse
 442 \end{itemize}
 443
 444 \section{Future Directions}
 445
 446 World domination.
 447
 448 \section{Conclusion}
 449
 450 We rock the hizzouse.
 451
 452 \section{References}
 453
 454 Yer mom.
 455
 456 \section{stuff}
 457 \begin{verbatim}
 458
 459 libjpeg (render thebride_1280.jpg)
 460 Native -  0.235s
 461 JavaSource - 1.86s
 462 ClassFile - 1.37s
 463
 464 freetype (rendering characters 32-127 of Comic.TTF at sizes from 8 to
 465 48 incrementing by 4)
 466 Native - 0.201s
 467 JavaSource - 2.02s
 468 ClassFile - 1.46s
 469
 470 Section 3.2  - Interpreter
 471 The Interpreter was the first part of Mips2Java to be written.  This was the most
 472 straightforward and simple way to run MIPS binaries inside the JVM.  The simplicity of the
 473 interpreter also made it very simple to debug. Debugging machine-generated code is a pain.
 474 Most importantly, the interpreter provided a great reference to use when developing the
 475 compiler. With known working implementations of each MIPS instruction in Java writing a
 476 compiler became a matter of simple doing the same thing ahead of time.
 477 With the completion of the compiler the interpreter in Mips2Java has become less useful.
 478 However, it may still have some life left in it. One possible use is remote debugging with
 479 GDB. Although debugging the compiler generated JVM code is theoretically possible, it
 480 would be far easier to do in the interpreter. The interpreter may also be useful in cases
 481 where size is far more important than speed. The interpreter is very small. The interpreter
 482 and MIPS binary combined are smaller than the compiled classfiles.
 483 Section 3.3 - Compiling to Java Source
 484 The next major step in Mips2JavaÕs development was the Java source compiler. This
 485 generated Java source code (compliable with javac or Jikes) from a MIPS binary.
 486 Generating Java source code was preferred initially over JVM bytecode for two reasons.
 487 The authors werenÕt very familiar with JVM bytecode and therefore generating Java source
 488 code was simpler. Generating source code also eliminated the need to do trivial
 489 optimizations in the Mips2java compiler that javac and Jikes already do. This mainly
 490 includes 2+2=4 stuff. For example, the MIPS register r0 is immutable and always 0. This
 491 register is represented by a static final int in the Java source compiler. Javac and Jikes
 492 automatically handle optimizing this away when possible. In the JVM bytecode compiler
 493 these optimizations needs to be done in Mips2Java.
 494 Early versions of the Mips2Java compiler were very simple. All 32 MIPS GPRs and a
 495 special PC register were fields in the generated java class. There was a run() method
 496 containing all the instructions in the .text segment converted to Java source code. A switch
 497 statement was used to allow jumps from instruction to instruction. The generated code
 498 looked something like this.
 499 private final static int r0 = 0;
 500 private int r1, r2, r3,...,r30;
 501 private int r31 = 0xdeadbeef;
 502 private int pc = ENTRY_POINT;
 503
 504 public void run() {
 505         for(;;) {
 506         switch(pc) {
 507                 case 0x10000:
 508                         r29 = r29 Ð 32;
 509                 case 0x10004:
 510                         r1 = r4 + r5;
 511                 case 0x10008:
 512                         if(r1 == r6) {
 513                                 /* delay slot */
 514                                 r1 = r1 + 1;
 515                                 pc = 0x10018:
 516                                 continue;
 517                         }
 518                 case 0x1000C:
 519                         r1 = r1 + 1;
 520                 case 0x10010:
 521                         r31 = 0x10018;
 522                         pc = 0x10210;
 523                         continue;
 524                 case 0x10014:
 525                         /* nop */
 526                 case 0x10018:
 527                         pc = r31;
 528                         continue;
 529                 ...
 530                 case 0xdeadbeef:
 531                         System.err.println("Exited from ENTRY_POINT");
 532                         System.err.println("R2: " + r2);
 533                         System.exit(1);
 534         }
 535         }
 536 }
 537
 538 This worked fine for small binaries but as soon as anything substantial was fed to it the 64k
 539 JVM method size limit was soon hit. The solution to this was to break up the code into
 540 many smaller methods. This required a trampoline to dispatch  jumps to the appropriate
 541 method. With the addition of the trampoline the generated code looked something like this:
 542 public void run_0x10000() {
 543         for(;;) {
 544         switch(pc) {
 545                 case 0x10000:
 546                         ...
 547                 case 0x10004:
 548                         ...
 549                 ...
 550                 case 0x10010:
 551                         r31 = 0x10018;
 552                         pc = 0x10210;
 553                         return;
 554                 ...
 555         }
 556         }
 557 }
 558
 559 pubic void run_0x10200() {
 560         for(;;) {
 561         switch(pc) {
 562                 case 0x10200:
 563                         ...
 564                 case 0x10204:
 565                         ...
 566         }
 567         }
 568 }
 569
 570 public void trampoline() {
 571         for(;;) {
 572         switch(pc&0xfffffe00) {
 573                         case 0x10000: run_0x10000(); break;
 574                         case 0x10200: run_0x10200(); break;
 575                         case 0xdeadbe00:
 576                                 ...
 577                 }
 578         }
 579 }
 580 With this trampoline in place somewhat large binaries could be handled without much
 581 difficulty. There is no limit on the size of a classfile as a whole, just individual methods.
 582 This method should scale well. However, there are other classfile limitations that will limit
 583 the size of compiled binaries.
 584 Another interesting problem that was discovered while creating the trampoline method was
 585 javac and JikesÕ incredible stupidity when dealing with switch statements. The follow code
 586 fragment gets compiled into a lookupswich by javac:
 587 Switch(pc&0xffffff00) {
 588         Case 0x00000100: run_100(); break;
 589         Case 0x00000200: run_200(); break;
 590         Case 0x00000300: run_300(); break;
 591 }
 592 while this nearly identical code fragment gets compiled to a tableswitch
 593 switch(pc>>>8) {
 594         case 0x1: run_100(); break
 595         case 0x2: run_200(); break;
 596         case 0x3: run_300(); break;
 597 }
 598 Javac isnÕt smart enough to see the patter in the case values and generates very suboptimal
 599 bytecode. Manually doing the shifts convinces javac to emit a tableswitch statement, which
 600 is significantly faster. This change alone nearly doubled the speed of the compiled binary.
 601 Finding the optimal method size lead to the next big performance increase.  It was
 602 determined with experimentation that the optimal number of MIPS instructions per method
 603 is 128 (considering only power of two options). Going above or below that lead to
 604 performance decreases. This is most likely due to a combination of two factors.
 605 _ The two levels  of switch statements jumps have to pass though Ð The first switch
 606 statement jumps go through is the trampoline switch statement. This is implemented
 607 as a TABLESWITCH in JVM bytecode so it is very fast. The second level switch
 608 statement in the individual run_ methods is implemented as a LOOKUPSWITCH,
 609 which is much slower. Using smaller methods puts more burden on the faster
 610 TABLESWITCH and less on the slower LOOKUPSWITCH.
 611 _ JIT compilers probably favor smaller methods smaller methods are easier to compile
 612 and are probably better candidates for JIT compilation than larger methods.
 613 FIXME: Put a chart here
 614 The next big optimization was eliminating unnecessary case statements. Having case
 615 statements before each instruction prevents JIT compilers from being able to optimize
 616 across instruction boundaries. In order to eliminate unnecessary case statements every
 617 possible address that could be jumped to directly needed to be identified. The sources for
 618 possible jump targets come from 3 places.
 619 _ The .text segment Ð Every instruction in the text segment in scanned for jump
 620 targets. Every branch instruction (BEQ, JAL, etc) has its destination added to the list
 621 of possible branch targets. In addition, functions that set the link register have
 622 theirpc+8 added to the list (the address that wouldÕve been put to the link register).
 623 Finally, combinations of LUI (Load Upper Immediate) of ADDIU (Add Immediate
 624 Unsigned) are scanned for possible addresses in the text segment. This combination
 625 of instructions is often used to load a 32-bit word into a register.
 626 _ The .data segment Ð When GCC generates switch() statements it often uses a jump
 627 table stored in the .data segment. Unfortunately gcc doesnÕt identify these jump
 628 tables in any way. Therefore, the entire .data segment is conservatively scanned for
 629 possible addresses in the .text segment.
 630 _ The symbol table Ð This is mainly used as a backup. Scanning the .text and .data
 631 segments should identify any possible jump targets but adding every function in the
 632 symbol table in the ELF binary doesnÕt hurt. This will also catch functions that are
 633 never called directly from the MIPS binary (for example, functions called with the
 634 call() method in the runtime).
 635 Eliminating unnecessary case statements provided a 10-25% speed increase .
 636 Despite all the above optimizations and workaround an impossible to workaround hard
 637 classfile limit was eventually hit, the constant pool. The constant pool in classfiles is limited
 638 to 65536 entries. Every Integer with a magnitude greater than 32767 requires an entry in the
 639 constant pool. Every time the compiler emits a jump/branch instruction the PC field is set to
 640 the branch target. This means nearly every branch instruction requires an entry in the
 641 constant pool. Large binaries hit this limit fairly quickly. One workaround that was
 642 employed in the Java source compiler was to express constants as offsets from a few central
 643 values. For example: "pc = N_0x00010000 + 0x10" where N_0x000100000 is a non-
 644 final field to prevent javac from inlining it. This was sufficient to get reasonable large
 645 binaries to compile. It has a small (approximately 5%) performance impact on the generated
 646 code. It also makes the generated classfile somewhat larger. Fortunately, the classfile
 647 compiler eliminates this problem.
 648 3.4 Ð Bytecode compiler
 649 The next step in the evolution of Mips2Java was to compile directly to JVM bytecode
 650 eliminating the intermediate javac step. This had several advantages
 651 _ There are little tricks that can be done in JVM bytecode that canÕt be done in Java
 652 source code.
 653 _ Eliminates the time-consuming javac step Ð Javac takes a long time to parse and
 654 compile the output from the java source compiler.
 655 _ Allows for MIPS binaries to be compiled and loaded into a running VM using a
 656 class loader. This eliminates the need to compile the binaries ahead of time.
 657 By generating code at the bytecode level  there are many areas where the compiler can be
 658 smarter than javac. Most of the areas where improvements where made where in the
 659 handling of branch instructions and in taking advantage of the JVM stack to eliminate
 660 unnecessary LOADs and STOREs to local variables.
 661 The first obvious optimization that generating bytecode allows for is the use of GOTO.
 662 Despite the fact that java doesnÕt have a GOTO keyword  a GOTO bytecode does exist and
 663 is used heavily in the code generates by javac. Unfortunately the java language doesnÕt
 664 provide any way to take advantage of this. As result of this jumps within a method were
 665 implemented by setting the PC field to the new address and making a trip back to the initial
 666 switch statement.  In the classfile compiler these jumps are implemented as GOTOs directly
 667 to the target instruction. This saves a costly trip back through the LOOKUPSWITCH
 668 statement and is a huge win for small loops within a method.
 669 Somewhat related to using GOTO is the ability to optimize branch statements. In the Java
 670 source compiler branch statements are implemented as follows (delay slots are ignored for
 671 the purpose of this example)
 672 If(condition) { pc = TARGET; continue; }
 673 This requires a branch in the JVM regardless of whether the MIPS branch is actually taken.
 674 If condition is false the JVM has to jump over the code to set the PC and go back to the
 675 switch block. If condition is true the JVM as to jump to the switch block. By generating
 676 bytecode directly we can make the target of the JVM branch statement the actual bytecode
 677 of the final destination. In the case where the branch isnÕt taken the JVM doesnÕt need to
 678 branch at all.
 679 A side affect of the above two optimizations is a solution to the excess constant pool entries
 680 problem. When jumps are implemented as GOTOs and direct branches to the target the PC
 681 field doesnÕt need to be set. This eliminates many of the constant pool entries the java
 682 source compiler requires. The limit is still there however, and given a large enough binary it
 683 will still be reached.
 684 Delay slots are another area where things are done somewhat inefficiently in the Java source
 685 compiler. In order to take advantage of instructions already in the pipeline MIPS cpu have a
 686 "delay slot". That is, an instruction after a branch or jump instruction that is executed
 687 regardless of whether the branch is taken. This is done because by the time the branch or
 688 jump instruction is finished being processes the next instruction is already ready to be
 689 executed and it is wasteful to discard it. (However, newer  MIPS CPUs have pipelines that
 690 are much larger than early MIPS CPUs so they have to discard many instructions anyway.)
 691 As a result of this the instruction in the delay slot is actually executed BEFORE the branch
 692 is taken. To make things even more difficult, values from the register file are loaded
 693 BEFORE the delay slot is executed.  Here is a small piece of MIPS assembly:
 694 ADDIU r2,r0,-1
 695 BLTZ r2, target
 696 ADDIU r2,r2,10
 697 ...
 698 :target
 699 This piece of code is executed as follows
 700 1. r2 is set to Ð1
 701 2. r2 is loaded from the register file by the BLTEZ instruction
 702 3. 10 is added to r2 by the ADDIU instruction
 703 4. The branch is taken because at the time the BLTZ instruction was executed r2 was
 704 Ð1, but r2 is now 9 (-1 + 10)
 705 There is a very element solution to this problem when using JVM bytecode. When a branch
 706 instruction is encountered the registers needed for the comparison are pushed onto the stack
 707 to prepare for the JVM branch instruction. Then, AFTER the values are on the stack the
 708 delay slot is emitted, and then finally the actual JVM branch instruction. Because the values
 709 were pushed to the stack before the delay slot was executed any changes the delay slot made
 710 to the registers are not visible to the branch bytecode. This allows delay slots to be used
 711 with no performance penalty or size penalty.
 712 One final advantage that generating bytecode directly allows is smaller more compact
 713 bytecode. All the optimization above lead to smaller bytecode as a side effect. There are also
 714 a few other areas where the generated bytecode can be optimized for size with more
 715 knowledge of the program as a whole.
 716 When encountering the following switch block both javac and Jikes generate redundant
 717 bytecode.
 718 Switch(pc>>>8) {
 719         Case 0x1: run_1(); break;
 720         Case 0x2: run_2(); break
 721         ...
 722         case 0x100: run_100(); break;
 723 }
 724 The first bytecode in each case arm in the switch statement is ALOAD_0 to prepare for a
 725 invoke special call. By simple moving this outside the switch statement  each case arm was
 726 reduced in size by one instruction. Similar optimizations were also done in other parts of the
 727 compiler.
 728
 729 Section 3
 730 - Adam - The method is run(), not execute. Execute() is only used when you need to
 731 resume from a pause syscall.
 732
 733 Section 3.1
 734 - Adam - Even the R1000 supports LB/SB/LH/SH/LWL/LWR Ð saying it only supports
 735 32-bit aligned loads is incorrect.
 736 - Other similarities
 737 o All the branching instructions in MIPS directly map to single JVM instructions.
 738 o Most of the ALU instructions map to single JVM instructions.
 739
 740 (I can write up some stuff for each of these next several sections if you want)
 741 Section 3.2  - Interpreter
 742 - Originally written mainly to understand the MIPS instruction set
 743 - Far easier to debug than an ahead of time compiler (no waiting, can throw in quick
 744 hacks like if(pc >= 0xbadc0de && pc <= 0xbadcfff) debugOutput() ), donÕt need to
 745 understand most of the ELF format)
 746 - Still could be useful
 747 o for GDB remote debugging
 748 o cases where size is more important than speed (size of interpreter + size of mips
 749 binary < size of compiled binary or size of compiler + mips binary)
 750 o code which dynamically generates code (JIT compilers, etc). The ahead of time
 751 compiler canÕt possibly handle this
 752
 753 Section 3.3 Ð Compiling to Java Source
 754 - First version of an ahead of time compiler
 755 - Huge performance boost
 756 - Java source output preferred for the 2+2=4 type optimizations java compilers do
 757 - Worked well for small binaries Ð large MIPS binaries required ridiculous amounts of
 758 memory to compile and often created  invalid classfiles
 759 - Too many constants Ð every jump operation required an entry in the constant pool (pc =
 760 0xabcd1234; continue; )
 761
 762 Section 3.4 Ð Compiling directly to JVM Bytecode
 763 - Another jump in performance
 764 - More efficient code can be created at the bytecode level
 765 o Information can be stored on the JVM stack rather than in local variables
 766 _ Javac/jikes often unnecessarily use local variables
 767 long tmp = a*b;
 768 lo = (int)tmp;
 769 hi = (int)(tmp>>>32)
 770 does a store and two loads when a simple DUP would suffice
 771 o GOTO can be used to branch directly  to branch destinations in the same method
 772 rather than going through the switch block again.
 773 o BEQ, BGTZ, BLE, etc can jump directly to destination rather than doing
 774 if(condition) { pc=0xabcd1234; continue; }
 775 o Eliminates excess constant pool entries (only jump outside the current method
 776 require a constant pool entry)
 777 o Delay slots implemented more efficiently.
 778 _ Java source compiler does:
 779 if(condition) { /* delay slot /; pc = 0xabcd1234; continue; }
 780 /* delay slot again */
 781 _ This is required because the delay slot can change the registers used in
 782 condition. The registers need to be read BEFORE the delay slot in executed.
 783 _ In the bytecode compiler the registers used in the condition are pushed to the
 784 stack, then the delay slot is executed, and finally the comparison is done.
 785 This eliminates the needs to output the delay slot twice.
 786 - Smaller bytecode
 787 o Everything mentioned above makes it smaller and faster
 788 o Javac/jikes add redundant code
 789 _ Switch(a) {
 790    Case 1: Run_1000(); break;
 791    Case 2: run_2000(); break;
 792 }
 793 Gets compiled into
 794 1 ÐTableswitch É.
 795 2 Ð ALOAD_0
 796 3 Ð invokespecial  run_1000
 797 4 Ð goto end
 798 5 Ð ALOAD_0
 799 6 Ð invokespecial run_2000
 800 10 Ð goto end
 801 ALOAD_0 is unnecessarily put in each switch arm
 802
 803 3.5 Interfacing with Java Code
 804 - _call_java ()/Runtime.call()
 805 o _call_java () - Call a java method from mips
 806 o Runime.call() Ð call a mips method from java
 807 o Easily allocate memory within the binaryÕs memory space by calling libcÕs malloc()
 808 o Can go back and forth between mips and java (java calls mips which calls java which
 809 calls back into mips)
 810 - Java Strings can easily be converted to and from null terminated strings in the processÕ
 811 memory space
 812 - Java InputStreams, OutputStreams, and Sockets can easily be turned in to standard
 813 UNIX file descriptors (and ANSI FILE*s)
 814 - Can easily create custom filedescriptors and have full control over all operations on
 815 them (read, write, seek, close, fstat, etc)
 816 - Int User_info[] Ð optional chunk of memory can very easily be accessed from java
 817 (Runtime.getUserInfo/setUserInfo)
 818
 819 3.6 Virtualization
 820 - Adam Ð we actually donÕt support sockets directly yet Ð you should probably take that
 821 out. (But you can create a socket in java and expose it to mips as a filedescriptor)
 822 - Runtime services
 823 o Provides a easy to use interface to subclasses (Interpreter or compiles binaries)
 824 _ Subclasses only know how to execute instructions
 825 _ Runtime handles setting up registers/stack for execution to begin and
 826 extracting return values and the exit status from the process
 827 o Memory management
 828 _ Sets up stack and guard pages
 829 _ Allocates pages with the sbrk syscall
 830 _ Provide easy an memcpy like interface for accessing the processes memory
 831 from java
 832 o Runtime.call() support Ð sets up registers,etc to prepare the process for a call into it
 833 from java
 834 o Filesystem Ð open/close/read/write/seek/fcntl s syscalls
 835 o Time related functions Ð sleep, gettimeofday, times syscall
 836 o UnixRuntime provides a more complete unix-like environment (Runtime smaller
 837 though)
 838 _ Supports fork() and waitpid()
 839 _ Pipe() for IPC
 840 _ More advocated filesystem interface
 841 _ All filesystem operations abstracted away into a FileSystem class
 842 o FileSystem class can be written that exposes a zip file,
 843 directory on an http server, etc as a filesystem
 844 _ Chdir/getcwd
 845 _ /dev filesystem
 846 _ stat()
 847 _ directory listing support
 848 5.1 Charts
 849 IÕll put together some charts tonight
 850 5.2 Optimizations
 851 And finish this part
 852
 853 Let me know if this was what you were looking for
 854
 855                                           libjpeg  libmspack libfreetype
 856 Interpreted MIPS Binary                   22.2      12.9      21.4
 857 Compled MIPS Binary (fastest options)     3.39      2.23      4.31
 858 Native -O3                                0.235    0.084     0.201
 859
 860 Compled - with all case statements        3.50      2.30      4.99
 861 Compiled - with pruned case statement     3.39      2.23      4.31
 862
 863 Compiled - 512 instructions/method        62.7      27.7      56.9
 864 Compiled - 256 instructions/method        3.54      2.55      4.43
 865 Compiled - 128 instructions/method        3.39      2.23      4.31
 866 Compiled - 64 instructions/method         3.56      2.31      4.40
 867 Compiled - 32 instruction/method          3.71      2.46      4.64
 868
 869 Compild MIPS Binary (Server VM)           3.21      2.00      4.54
 870 Compiled MIPS Binary (Client VM)          3.39      2.23      4.31
 871
 872 All times are measured in seconds. These were all run on a dual 1ghz G4
 873 running OS X 10.3.1 with Apple's latest VM (JDK 1.4.1_01-27). Each test
 874 was run 8 times within a single VM. The highest and lowest times were
 875 removed and the remaining 6 were averaged. In each case only the first
 876 run differed significantly from the rest.
 877
 878 The libjpeg test consisted of decoding a 1280x1024 jpeg
 879 (thebride_1280.jpg) and writing a tga. The mspack test consisted of
 880 extracting all members from arial32.exe, comic32.exe, times32.exe, and
 881 verdan32.exe. The freetype test consisted of rendering characters
 882 32-127 of Comic.TTF at sizes from 8 to 48 incrementing by 4. (That is
 883 about 950 individual glyphs).
 884
 885 I can provide you with the source for any of these test if you'd like.
 886
 887 -Brian
 888 \end{verbatim}
 889
 890 \end{document}
 891