doc/nestedvm.ivme04.tex

   1 \documentclass{acmconf}
   2 \usepackage{graphicx}
   3 \usepackage{multicol}
   4 \usepackage{amssymb,amsmath,epsfig,alltt}
   5 \sloppy
   6 \usepackage{palatino}
   7 \usepackage{pdftricks}
   8 \begin{psinputs}
   9   \usepackage{pstricks}
  10   \usepackage{pst-node}
  11 \end{psinputs}
  12 \usepackage{parskip}
  13 \usepackage{tabularx}
  14 \usepackage{alltt}
  15 \bibliographystyle{alpha}
  16
  17 \title{\textbf{\textsf{
  18 NestedVM: Total Translation of Native Code into Safe Bytecode
  19 }}}
  20 \date{}
  21 \author{\begin{tabular}{@{}c@{}}
  22         {\em {Brian Alliet}} \\
  23         {Rochester Institute of Technology}\\
  24         {\tt brian@ibex.org}
  25    \end{tabular}\hskip 1in\begin{tabular}{@{}c@{}}
  26         {\em {Adam Megacz}} \\
  27         {UC Berkeley Statistical Computing Facility} \\
  28         {\tt adam@ibex.org}
  29 \end{tabular}}
  30 \begin{document}
  31
  32 \maketitle
  33
  34 \begin{abstract}
  35
  36 We present a new approach to utilizing unsafe legacy code
  37 within safe virtual machines by compiling to MIPS machine code as an
  38 intermediate language.  This approach carries N key benefits over
  39 existing techniques:
  40
  41 \begin{itemize}
  42 \item total coverage of all language features, unlike source translation
  43 \item no build process modifications
  44 \item no post-translation human intervention
  45 \item efficient bytecode
  46 \end{itemize}
  47
  48 We also present NestedVM, a complete system in production use which
  49 implements this technique.  We conclude with quantitative performance
  50 measurements and suggestions for VM acceleration of the resulting
  51 bytecodes.
  52
  53
  54 \end{abstract}
  55
  56 \section{Introduction}
  57
  58 The C programming language \cite{KR} has been in use for over 30
  59 years.  Consequently, there is a huge library of software written in
  60 this language.  Although Java offers substantial benefits \cite{} over
  61 C (and C++), its comparatively young age means that it often lacks
  62 equivalents of many C/C++ libraries.
  63
  64 The typical solution to this dilemma is to use JNI \cite{} or CNI
  65 \cite{} to invoke C code from within a Java VM.  Unfortunately, there
  66 are a number of situations in which this is not an acceptable
  67 solution due to security concerns:
  68
  69 \begin{itemize}
  70
  71 \item Java Applets are not permitted to invoke {\tt
  72       Runtime.loadLibrary()}
  73
  74 \item Java Servlet containers with a {\tt SecurityManager} will not
  75       permit loading new JNI libraries.  This configuration is popular
  76       with {\it shared hosting} providers and corporate intranets
  77       where a number of different parties contribute individual web
  78       applications which are run together in a single container.
  79
  80 \item Unlike Java Bytecode, JNI code is susceptible to buffer overflow
  81       and heap corruption attacks.  This can be a major security
  82       vulnerability.
  83
  84 \end{itemize}
  85
  86 In addition to security concerns, JNI and CNI carry other
  87 disadvantages:
  88
  89 \begin{itemize}
  90
  91 \item JNI requires the native library to be compiled ahead of time,
  92       separately, for every architecture on which it will be deployed.
  93       This is unworkable for situations in which the full set of
  94       target architectures is not known at deployment time.
  95
  96 \item The increasingly popular J2ME \cite{} platform does not support
  97       JNI or CNI.
  98
  99 \item JNI often introduces undesirable added complexity to an
 100       application.
 101
 102 \end{itemize}
 103
 104 The technique we present here is based on using a typical ANSI C
 105 compiler to compile C/C++ code into a MIPS binary, and then using a
 106 tool to translate that binary on an instruction-by-instruction basis
 107 into Java bytecode.
 108
 109 The technique presented here is general; we anticipate that it can be
 110 applied to other secure virtual machines such as Microsoft's .NET
 111 \cite{}, Perl Parrot \cite{}, or Python bytecode \cite{}.
 112
 113 \section{Approaches to Translation}
 114
 115 Techniques for translating unsafe code into VM bytecode generally fall
 116 into four categories:
 117
 118 \begin{itemize}
 119 \item source-to-source translation
 120 \item source-to-binary translation
 121 \item binary-to-source translation
 122 \item binary-to-binary translation
 123 \end{itemize}
 124
 125 \begin{figure}[h]
 126 \begin{pdfpic}
 127 \newlength{\MyLength}
 128 \settowidth{\MyLength}{machine code}
 129 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 130 \begin{psmatrix}[colsep=3,rowsep=3]
 131   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 132   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\
 133   \psset{nodesep=5pt,arrows=->}
 134   \ncline{s0}{b0}<{\it gcc}
 135   \ncline{s0}{s1}\aput{:U}{\it c2java}
 136   \ncline{s0}{b1}\aput{:U}{\it gcc bytecode backend}
 137   \ncline{s1}{b1}>{\it javac}
 138 \end{psmatrix}
 139 \end{pdfpic}
 140 \caption{\label{lattice} Conversion Lattice with examples of tools specific to a C/JVM scenario}
 141 \end{figure}
 142
 143 \begin{figure}[h]
 144 \begin{pdfpic}
 145 \newlength{\MyLength}
 146 \settowidth{\MyLength}{machine code}
 147 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 148 \begin{psmatrix}[colsep=3,rowsep=3,nrot=:U]
 149   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 150   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\
 151   \psset{nodesep=5pt,arrows=->}
 152   \ncline{s0}{b0}<{\it gcc}
 153   \ncline{s1}{b1}>{\it javac}
 154   \ncline{b0}{s1}\naput{\it NestedVM}
 155   \ncline{b0}{s1}\nbput{\it binary-to-source}
 156 \end{psmatrix}
 157 \end{pdfpic}
 158 \caption{\label{lattice2} Conversion Lattice including NestedVM in {\it source-output} mode}
 159 \end{figure}
 160
 161 \begin{figure}[h]
 162 \begin{pdfpic}
 163 \newlength{\MyLength}
 164 \settowidth{\MyLength}{machine code}
 165 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 166 \begin{psmatrix}[colsep=3,rowsep=3,nrot=:U]
 167   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 168   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\
 169   \psset{nodesep=5pt,arrows=->}
 170   \ncline{s0}{b0}<{\it gcc}
 171   \ncline{s1}{b1}>{\it javac}
 172   \ncline{b0}{b1}\naput{\it NestedVM}
 173   \ncline{b0}{b1}\nbput{\it binary-to-binary}
 174 \end{psmatrix}
 175 \end{pdfpic}
 176 \caption{\label{lattice3} Conversion Lattice including NestedVM in {\it bytecode-output} mode}
 177 \end{figure}
 178
 179 A diagram showing these four translation approaches in the context of
 180 running C/C++ code within a Java VM is shown in Figure~\ref{lattice}.
 181
 182 \subsection{Existing Work}
 183 \subsubsection{Source-to-Source Translation}
 184
 185 \begin{itemize}
 186 \item c2java
 187 \item commercial products
 188 \end{itemize}
 189
 190 A number of commercial products and research projects attempt to
 191 translate C++ code to Java code, preserving the mapping of C++ classes
 192 to Java classes.  Unfortunately this is problematic since there is no
 193 way to do pointer arithmetic except within arrays, and even in that
 194 case, arithmetic cannot be done in terms of fractional objects.
 195
 196 Mention gcc backend
 197
 198 Many of these products advise the user to tweak the code which results
 199 from the translation.  Unfortunately, hand-modifying machine-generated
 200 code is generally a bad idea, since this modification cannot be
 201 automated.  This means that every time the origin code changes, the
 202 code generator must be re-run, and the hand modifications must be
 203 performed yet again.  This is an error-prone process.
 204
 205 Furthermore, NestedVM does not attempt to read C code directly.  This
 206 frees it from the complex task of faithfully implementing the ANSI C
 207 standard (or, in the case of non ANSI-C compliant code, some other
 208 interface).  This also saves the user the chore of altering their
 209 build process to accomodate NestedVM.
 210
 211 \section{NestedVM}
 212
 213 NestedVM takes a novel approach; it uses compiled machine code as a
 214 starting point for the translation process.  NestedVM has gone through
 215 two iterations:
 216
 217 \begin{itemize}
 218 \item binary-to-source compilation  (Figure~\ref{lattice2})
 219 \item binary-to-binary compilation  (Figure~\ref{lattice3})
 220 \end{itemize}
 221
 222 \subsection{Translation Process}
 223
 224 Translating a legacy library for use within a JVM proceeds as follows:
 225
 226 \begin{enumerate}
 227
 228 \item Compile the source code to a statically linked binary, targeting
 229       the MIPS R2000 ISA.
 230
 231 \item Invoke {\tt NestedVM} on the statically linked binary.
 232       Typically this will involve linking against {\tt libc}, which
 233       translates system requests (such as {\tt open()}, {\tt read()},
 234       or {\tt write()}) into appropriate invocations of the MIPS
 235       {\tt SYSCALL} instruction.
 236
 237 \item (If using binary-to-source translation) compile the resulting
 238       {\tt .java} code using {\tt jikes} or {\tt javac}.
 239
 240 \item (Optional) compile the resulting bytecode into a {\it safe}
 241       native binary using {\tt gcj}.
 242
 243 \item From java code, invoke the {\tt run()} method on the generated
 244       class.  This is equivalent to the {\tt main()} entry point.
 245
 246 \end{enumerate}
 247
 248
 249 \subsection{Why MIPS?}
 250
 251 We chose MIPS as a source format for two primary reasons: the
 252 availability of tools to translate legacy code into MIPS binaries, and
 253 the close similarity between the MIPS ISA and the Java Virtual Machine.
 254
 255 The MIPS architecture has been around for quite some time, and is well
 256 supported by the GNU Compiler Collection, which is capable of
 257 compiling C, C++, Java, Fortran, Pascal (with p2c), and Objective C
 258 into MIPS binaries.
 259
 260 The MIPS R2000 ISA bears a striking similarity to the Java Virtual
 261 Machine:
 262
 263 \begin{itemize}
 264
 265 %\item The original MIPS ISA supports only 32-bit aligned memory loads
 266 %      and stores.  This allows NestedVM to represent memory as a Java
 267 %      {\tt int[]} without introducing additional overhead.
 268 \item Most of the instructions in the original MIPS ISA operate only on
 269       32-bit aligned memory locations. This allows NestedVM to represent
 270       memory as a Java {\tt int[]} array without introducing additional
 271       overhead.
 272
 273 \item Unlike its predecessor, the R2000 supports 32-bit by 32-bit
 274       multiply and divide instructions as well as a single and double
 275       precision floating point unit.  These capabilities map nicely
 276       onto Java's arithmetic instructions.
 277
 278 \end{itemize}
 279
 280
 281 \subsection{Binary-to-Source Compilation}
 282
 283 The first incarnation of NestedVM was a binary-to-source compiler.
 284 This version reads in a MIPS binary and emits Java source code, which
 285 can be compiled with {\tt javac}, {\tt jikes}, or {\tt gcj}.
 286
 287 This implementation was primarily a first step towards the
 288 binary-to-binary compiler.  Conveniently, generating Java source code
 289 frees NestedVM from having to perform simple constant propagation
 290 optimizations, since most Java compilers already do this.  A recurring
 291 example is the treatment of the {\tt r0} register, which is fixed as
 292 {\tt 0} in the MIPS ISA.
 293
 294 Lacking the ability to generate specially optimized bytecode
 295 sequences, a straightforward mapping of the general purpose hardware
 296 registers to 32 {\tt int} fields was optimal.
 297
 298 \begin{figure*}[t]
 299 \begin{minipage}[c]{7in}%
 300 \begin{multicols}{2}
 301 {\footnotesize\begin{verbatim}
 302 private final static int r0 = 0;
 303 private int r1, r2, r3,...,r30;
 304 private int r31 = 0xdeadbeef;
 305 private int pc = ENTRY_POINT;
 306
 307 public void run() {
 308     for(;;) {
 309         switch(pc) {
 310             case 0x10000:
 311                 r29 = r29 - 32;
 312             case 0x10004:
 313                 r1 = r4 + r5;
 314             case 0x10008:
 315                 if(r1 == r6) {
 316                     /* delay slot */
 317                     r1 = r1 + 1;
 318                     pc = 0x10018:
 319                     continue;
 320                 }
 321             case 0x1000C:
 322                 r1 = r1 + 1;
 323             case 0x10010:
 324                 r31 = 0x10018;
 325                 pc = 0x10210;
 326                 continue;
 327             case 0x10014:
 328                 /* nop */
 329             case 0x10018:
 330                 pc = r31;
 331                 continue;
 332             ...
 333             case 0xdeadbeef:
 334                 System.err.println(``Exited.'');
 335                 System.exit(1);
 336         }
 337     }
 338 }
 339 \end{verbatim}}
 340 \vspace{1in}
 341 {\footnotesize\begin{verbatim}
 342 public void run_0x10000() {
 343     for(;;) {
 344     switch(pc) {
 345         case 0x10000:
 346             ...
 347         case 0x10004:
 348             ...
 349         ...
 350         case 0x10010:
 351             r31 = 0x10018;
 352             pc = 0x10210;
 353             return;
 354         ...
 355     }
 356     }
 357 }
 358
 359 pubic void run_0x10200() {
 360     for(;;) {
 361     switch(pc) {
 362         case 0x10200:
 363             ...
 364         case 0x10204:
 365             ...
 366     }
 367     }
 368 }
 369
 370 public void trampoline() {
 371     for(;;) {
 372     switch(pc&0xfffffe00) {
 373             case 0x10000: run_0x10000(); break;
 374             case 0x10200: run_0x10200(); break;
 375             case 0xdeadbe00:
 376                 ...
 377         }
 378     }
 379 }
 380 \end{verbatim}}
 381 \end{multicols}
 382 \end{minipage}
 383 \caption{\label{code1} Trampoline transformation necessitated by Java's 64kb method size limit}
 384 \end{figure*}
 385
 386 Unfortunately Java imposes a 64kb limit on the size of the bytecode
 387 for a single method.  This presents a problem for NestedVM, and
 388 necessitates a {\it trampoline transformation}, as shown in
 389 Figure~\ref{code1}.  With this trampoline in place somewhat large
 390 binaries can be handled without much difficulty -- fortunately there
 391 is no corresponding limit on the size of a classfile as a whole.
 392
 393 Another interesting problem that was discovered while creating the
 394 trampoline method was javac and Jikes' inability to properly optimize
 395 switch statements.  The code in Figure~\ref{lookupswitch} is compiled
 396 into a comparatively inefficient {\tt LOOKUPSWITCH}, while the code in
 397 Figure~\ref{tableswitch} is optimized into a {\tt TABLESWITCH}.
 398
 399 \begin{figure}
 400 {\footnotesize\begin{verbatim}
 401 switch(pc&0xffffff00) {
 402     case 0x00000100: run_100(); break;
 403     case 0x00000200: run_200(); break;
 404     case 0x00000300: run_300(); break;
 405 }
 406 \end{verbatim}}
 407 \caption{\label{lookupswitch} Code which {\it is not} optimized into a tableswitch}
 408 \end{figure}
 409
 410 \begin{figure}
 411 {\footnotesize\begin{verbatim}
 412 switch(pc>>>8) {
 413     case 0x1: run_100();
 414     case 0x2: run_200();
 415     case 0x3: run_300();
 416 }
 417 \end{verbatim}}
 418 \caption{\label{tableswitch} Code which {\it is} optimized into a tableswitch}
 419 \end{figure}
 420
 421 Javac isn't smart enough to see the pattern in the case values and
 422 generates very suboptimal bytecode. Manually doing the shifts
 423 convinces javac to emit a tableswitch statement, which is
 424 significantly faster. This change alone nearly doubled the speed of
 425 the compiled binary.
 426
 427 Finding the optimal method size lead to the next big performance
 428 increase.  It was determined through experimentation that the optimal
 429 number of MIPS instructions per method is 128 (considering only power
 430 of two options). Going above or below that lead to performance
 431 decreases. This is most likely due to a combination of two factors.
 432
 433 \begin{itemize}
 434
 435 \item The two levels of switch statements jumps have to pass though -
 436       The first switch statement jumps go through is the trampoline
 437       switch statement. This is implemented as a {\tt TABLESWITCH} in JVM
 438       bytecode so it is very fast. The second level switch statement
 439       in the individual run\_ methods is implemented as a
 440       {\tt LOOKUPSWITCH}, which is much slower. Using smaller methods puts
 441       more burden on the faster {\tt TABLESWITCH} and less on the slower
 442       {\tt LOOKUPSWITCH}.
 443
 444 \item JIT compilers probably favor smaller methods smaller methods are
 445       easier to compile and are probably better candidates for JIT
 446       compilation than larger methods.
 447
 448 \end{itemize}
 449
 450 Put a chart in here
 451
 452 The next big optimization was eliminating unnecessary case
 453 statements. Having case statements before each instruction prevents
 454 JIT compilers from being able to optimize across instruction
 455 boundaries. In order to eliminate unnecessary case statements every
 456 possible address that could be jumped to directly needed to be
 457 identified. The sources for possible jump targets come from 3 places.
 458
 459 \begin{itemize}
 460
 461 \item The .text segment - Every instruction in the text segment is
 462       scanned for jump targets. Every branch instruction (BEQ, JAL,
 463       etc) has its destination added to the list of possible branch
 464       targets. In addition, functions that set the link register have
 465       theirpc+8 added to the list (the address that would've been put
 466       to the link register). Finally, combinations of LUI (Load Upper
 467       Immediate) of ADDIU (Add Immediate Unsigned) are scanned for
 468       possible addresses in the text segment. This combination of
 469       instructions is often used to load a 32-bit word into a
 470       register.
 471
 472 \item The .data segment - When GCC generates switch() statements it
 473       often uses a jump table stored in the .data
 474       segment. Unfortunately gcc doesn't identify these jump tables in
 475       any way. Therefore, the entire .data segment is conservatively
 476       scanned for possible addresses in the .text segment.
 477
 478 \item The symbol table - This is mainly used as a backup. Scanning the
 479       .text and .data segments should identify any possible jump
 480       targets but adding every function in the symbol table in the ELF
 481       binary doesn't hurt. This will also catch functions that are
 482       never called directly from the MIPS binary (for example,
 483       functions called with the call() method in the runtime).
 484
 485 \end{itemize}
 486
 487 Eliminating unnecessary case statements provided a 10-25\% speed
 488 increase.
 489
 490 Despite all the above optimizations and workaround an impossible to
 491 workaround hard classfile limit was eventually hit, the constant
 492 pool. The constant pool in classfiles is limited to 65536
 493 entries. Every Integer with a magnitude greater than 32767 requires an
 494 entry in the constant pool. Every time the compiler emits a
 495 jump or branch instruction the PC field is set to the branch target. This
 496 means nearly every branch instruction requires an entry in the
 497 constant pool. Large binaries hit this limit fairly quickly. One
 498 workaround that was employed in the Java source compiler was to
 499 express constants as offsets from a few central values. For example:
 500 ``pc = N\_0x00010000 + 0x10'' where N\_0x000100000 is a non-final
 501 field to prevent javac from inlining it. This was sufficient to get
 502 reasonable large binaries to compile. It has a small (approximately
 503 5\%) performance impact on the generated code. It also makes the
 504 generated classfile somewhat larger.  Fortunately, the classfile
 505 compiler eliminates this problem.
 506
 507
 508 \subsection{Binary-to-Binary Translation}
 509
 510 The next step in the evolution of NestedVM was to compile directly to
 511 JVM bytecode eliminating the intermediate javac step. This had several
 512 advantages:
 513
 514 \begin{itemize}
 515
 516 \item There are little tricks that can be done in JVM bytecode that
 517       can't be done in Java source code.
 518
 519 \item Eliminates the time-consuming javac step - Javac takes a long
 520       time to parse and compile the output from the java source
 521       compiler.
 522
 523 \item Allows for MIPS binaries to be compiled and loaded into a
 524       running VM using a class loader. This eliminates the need to
 525       compile the binaries ahead of time.
 526
 527 \end{itemize}
 528
 529 By generating code at the bytecode level there are many areas where
 530 the compiler can be smarter than javac. Most of the areas where
 531 improvements where made where in the handling of branch instructions
 532 and in taking advantage of the JVM stack to eliminate unnecessary
 533 LOADs and STOREs to local variables.
 534
 535 The first obvious optimization that generating bytecode allows for is the
 536 use of GOTO. Despite the fact that java doesn't have a GOTO keyword a GOTO
 537 bytecode does exist and is used heavily in the code generates by javac.
 538 Unfortunately the java language doesn't provide any way to take advantage of
 539 this. As result of this, jumps within a method were implemented in the
 540 binary-to-source compiler by setting the PC field to the new address and
 541 making a trip back to the initial switch statement.  In the classfile
 542 compiler these jumps are implemented as GOTOs directly to the target
 543 instruction. This saves a costly trip back through the LOOKUPSWITCH
 544 statement and is a huge win for small loops within a method.
 545
 546 Somewhat related to using GOTO is the ability to optimize branch
 547 statements. In the Java source compiler branch statements are
 548 implemented as follows (delay slots are ignored for the purpose of
 549 this example):
 550
 551 {\footnotesize\begin{verbatim}
 552 if(condition) { pc = TARGET; continue; }
 553 \end{verbatim}}
 554
 555 This requires a branch in the JVM regardless of whether the MIPS
 556 branch is actually taken. If condition is false the JVM has to jump
 557 over the code to set the PC and go back to the switch block. If
 558 condition is true the JVM as to jump to the switch block. By
 559 generating bytecode directly we can make the target of the JVM branch
 560 statement the actual bytecode of the final destination. In the case
 561 where the branch isn't taken the JVM doesn't need to branch at all.
 562
 563 A side affect of the above two optimizations is a solution to the
 564 excess constant pool entries problem. When jumps are implemented as
 565 GOTOs and direct branches to the target the PC field doesn't need to
 566 be set. This eliminates many of the constant pool entries the java
 567 source compiler requires. The limit is still there however, and given
 568 a large enough binary it will still be reached.
 569
 570 Delay slots are another area where things are done somewhat
 571 inefficiently in the Java source compiler. In order to take advantage
 572 of instructions already in the pipeline MIPS cpu have a ``delay
 573 slot''. That is, an instruction after a branch or jump instruction that
 574 is executed regardless of whether the branch is taken. This is done
 575 because by the time the branch or jump instruction is finished being
 576 processes the next instruction is already ready to be executed and it
 577 is wasteful to discard it. (However, newer MIPS CPUs have pipelines
 578 that are much larger than early MIPS CPUs so they have to discard many
 579 instructions anyway.) As a result of this the instruction in the delay
 580 slot is actually executed BEFORE the branch is taken. To make things
 581 even more difficult, values from the register file are loaded BEFORE
 582 the delay slot is executed.  Here is a small piece of MIPS assembly:
 583
 584 {\footnotesize\begin{verbatim}
 585 ADDIU r2,r0,-1
 586 BLTZ r2, target
 587 ADDIU r2,r2,10
 588 ...
 589 :target
 590 \end{verbatim}}
 591
 592 This piece of code is executed as follows
 593
 594 \begin{enumerate}
 595
 596 \item r2 is set to -1
 597
 598 \item r2 is loaded from the register file by the BLTEZ instruction
 599
 600 \item 10 is added to r2 by the ADDIU instruction
 601
 602 \item The branch is taken because at the time the BLTZ instruction was
 603       executed r2 was -1, but r2 is now 9 (-1 + 10)
 604
 605 \end{enumerate}
 606
 607 There is a very elegent solution to this problem when using JVM
 608 bytecode. When a branch instruction is encountered the registers
 609 needed for the comparison are pushed onto the stack to prepare for the
 610 JVM branch instruction. Then, AFTER the values are on the stack the
 611 delay slot is emitted, and then finally the actual JVM branch
 612 instruction. Because the values were pushed to the stack before the
 613 delay slot was executed any changes the delay slot made to the
 614 registers are not visible to the branch bytecode. This allows delay
 615 slots to be used with no performance penalty or size penalty.
 616
 617 One final advantage that generating bytecode directly allows is
 618 smaller more compact bytecode. All the optimization above lead to
 619 smaller bytecode as a side effect. There are also a few other areas
 620 where the generated bytecode can be optimized for size with more
 621 knowledge of the program as a whole.
 622
 623 When encountering the following switch block both javac and Jikes
 624 generate redundant bytecode.
 625
 626 {\footnotesize\begin{verbatim}
 627 switch(pc>>>8) {
 628     case 0x1: run_1(); break;
 629     case 0x2: run_2(); break
 630     ...
 631     case 0x100: run_100(); break;
 632 }
 633 \end{verbatim}}
 634
 635 The first bytecode in each case arm in the switch statement is ALOAD\_0 to
 636 prepare for a invoke special call. By simple moving this outside the switch
 637 statement each case arm was reduced in size by one instruction. Similar
 638 optimizations were also done in other parts of the compiler.
 639
 640 \section{Interfacing with Java Code}
 641
 642 NestedVM has two primary ways of executing code, the interpreter, and the binary translators. Both the interpreter and the output from the binary translators sit on top of a Runtime class. This class provides the public interface to both the interpreter and the translated binaries.
 643
 644 \subsection{The Runtime Class}
 645
 646 The Runtime class does the work that the operating system usually does. Conceptually the Runtime class can be though of as the operating system and itÕs subclasses (translated binaries and the interpreter) the CPU. The Runtime fulfills 5 primary goals:
 647
 648 \begin{itemize}
 649
 650 \item Provides a consistent external interface - Because only Runtime exposes a public interface the method of actually executing the code (currently only translated binaries and the interpreter) can be changed without any code changes.
 651
 652 \item Provide an easy to use interface - The interpreter and the output from the binary translators only know how to execute code. The Runtime class provides an easy to use interface to the code. It contains methods to pass arguments to the main() function, read and write from memory, and call individual functions in the binary.
 653
 654 \item Manage the processÕs memory - The Runtime class contains large int[] arrays that represent the processÕs entire memory space.  Subclasses read and write to these arrays as required by the instructions they are executing.  Subclasses can expend their memory space using the sbrk syscall.
 655
 656 \item Provide access to the file system and streams - Subclasses access the file system through standard UNIX syscalls (read, write,  open, etc). The Runtime manages the file descriptor table that maps UNIX file descriptors to Java RandomAccessFiles, InputStreams, OutputStreams, and sockets.
 657
 658 \item Miscellaneous other syscalls - In additions to those mentioned above the Runtime class implements a variety of other syscalls (sleep, gettimeofday, getpagesize, sysconf, fcntl, etc).
 659
 660 \end{itemize}
 661
 662 \subsection{Interacting with the Binary}
 663
 664 Java source code can create a copy of the translated binary by instantiating the class generated by the binary translator  or instantiating the interpreter. It can then interact with the process through the many facilities provided by the Runtime interface.  Invoking the run() method of the Runtime interface will load the given arguments into the processÕs memory as invoke the binaries entry point (typically \_start() in crt0.o). This will pass control on to the main() function which will have the arguments passed to run() loaded into argv and argc.
 665
 666 As the binary executes it often passes control back to the Runtime class through the MIPS {\tt SYSCALL} instruction. The interpreter and translated binaries invoke the {\tt syscall()} method of the Runtime class when the {\tt SYSCALL} instruction is executed. The Runtime class then can manipulate the processÕs environment (read and write to memory, modify the file descriptor table, etc) and interact with the rest of the JVM on behalf of the process (read and write to a file or stream, etc). There is even a syscall to pause the VM and temporarily return control to the caller.
 667
 668 In addition to the interfaces provided by NestedVM, users can create their own interfaces between  the MIPS and Java world. The Runtime provides a method called call() that will call a function by name in the MIPS binary. The call() method looks up the function name in the binaryÕs ELF symbol table and manipulating the stack and registers accordingly to execute the given function. This allows Java code to seamlessly invoke functions in the binary.
 669
 670 {\footnotesize\begin{verbatim}
 671 // Java
 672 private Runtime rt = new MyBinary();
 673 public void foo(int n) {
 674     for(int i=0;i<10;i++) {
 675         int result = rt.call("do_work",i);
 676         System.err.println("do_work(i) = " + result);
 677     }
 678 }
 679 // C
 680 void do_work(int n) {
 681     int i;
 682     int ret=0;
 683     for(i=0;i<n;i++) ret += i;
 684     return n;
 685 }
 686 \end{verbatim}}
 687
 688 The MIPS binaries can also invoke a special method of Runtime called callJava().When the MIPS binary invokes the {\tt CALL\_JAVA}  syscall (usually done through the {\tt \_call\_java()} function provided by the NestedVM support library) the callJava() method in Runtime is invoked with the arguments passes to the syscall.
 689
 690 {\footnotesize\begin{verbatim}
 691 // Java
 692 private Runtime rt = new MyBinary() {
 693     pubilc int callJava(int a, int b, int c, int d) { System.err.println("Got " + a + " " + b);
 694 };
 695 public void foo() { rt.run(); }
 696 // C
 697 void main(int argc, char **argv) {
 698     \_call\_java(1,2);
 699 }
 700 \end{verbatim}}
 701
 702 These two methods can even be combined. MIPS can call Java through the CALL\_JAVA syscall, which can in turn invoke a MIPS function in the binary with the call() method.\r\r
 703 Users preferring a simpler communication mechanism can also use Java StreamÕs and file descriptors. Runtime provides a simple interface for mapping a Java Input or OutputStream to a File Descriptor.
 704
 705 %Java source code can create a copy of the translated binary by
 706 %instantiating the corresponding class, which extends {\tt Runtime}.
 707 %Invoking the {\tt main()} method on this class is equivalent to
 708 %calling the {\tt main()} function within the binary; the {\tt String}
 709 %arguments to this function are copied into the binary's memory space
 710 %and made available as {\tt **argv} and {\tt argc}.
 711
 712 %The translated binary communicates with the rest of the VM by
 713 %executing MIPS {\tt SYSCALL} instructions, which are translated into
 714 %invocations of the {\tt syscall()} method.  This calls back to the
 715 %native Java world, which can manipulate the binary's environment by
 716 %reading and writing to its memory space, checking its exit status,
 717 %pausing the VM, and restarting the VM.
 718
 719
 720 %\subsection{Virtualization}
 721
 722 %The {\tt Runtime} class implements the majority of the standard {\tt
 723 %libc} syscalls, providing a complete interface to the filesystem,
 724 %network socket library, time of day, (Brian: what else goes here?).
 725
 726 %\begin{itemize}
 727
 728 %\item ability to provide the same interface to CNI code and
 729 %      NestedVMified code
 730
 731 %\item security advantages (chroot the {\tt fork()}ed process)
 732 %
 733 %\end{itemize}
 734
 735
 736 \section{Quantitative Performance}
 737
 738 \subsection{Charts}
 739
 740 (Note that none of these libraries have pure-Java equivalents.)
 741
 742 \begin{itemize}
 743 \item libjpeg
 744 \item libfreetype
 745 \item libmspack
 746 \end{itemize}
 747
 748
 749 \subsection{Optimizations}
 750
 751 Although NestedVM perfectly emulates a MIPS R2000 CPU its performance characteristics aren't anything like an actual MIPS R2000 CPU. GCC makes several optimizations that increase performance on an actually MIPS CPU but actually decrease performance when run through the NestedVM binary translator. Fortunately, GCC provides many options to customize its code generations and eliminate these optimizations. GCC also has optimization options that aren't helpful on a real MIPS CPU but are very helpful under NestedVM
 752
 753 Adam, we should cite "Using the GNU Compiler Collection" somewhere in here.
 754
 755 \begin{itemize}
 756
 757 \item {\tt -falign-functions}
 758 Normally a function's location in memory has no effect on its execution speed. However, in the NestedVM binary translator the .text segment is split up on a power of two boundary. If a function is unlucky enough to start near the end of one of these boundaries a performance critical part of the function could end up spanning two methods. There is a significant amount of overhead in switching between two methods so this must be avoided at all costs. By telling GCC to align all functions to the boundary that the .text segment is split on the chances of a critical part of a function spanning two methods is significantly reduced.
 759
 760 \item {\tt -fno-rename-registers}
 761 Some processors can better schedule code when registers aren't reused for two different purposes. By default GCC will try to use as many registers as possibly when it can. This excess use of registers just confuses JIT's trying to compile the output from the binary translator. All the JIT compilers we tested do much better with a few frequently used registers.
 762
 763 \item {\tt -fno-delayed-branch}
 764 The MIPS CPU has a delay slot (see above). Earlier versions of NestedVM didn't efficiently emulate delay slots. This option causes GCC to avoid using delay slots for anything (a NOP is simply placed in the delay slot). This had a small performance benefit. However, recent versions of NestedVM emulate delay slots with no performance overhead so this options has little effect. Nonetheless, these delay slots provide no benefit under NestedVM either so they are avoided with this option.
 765
 766 \item {\tt -fno-schedule-insns}
 767 Load operations in the MIPS ISA also have a delay slot. The results of a load operation are not available for use until one instruction later. Several other instructions also have similar delay slots. GCC tries to do useful work wile waiting for the results of one of these operations by default. However, this, like register renaming, tends to confuse JIT compilers. This option prevents GCC from going out of its way to take advantage of these delay slots and makes the code generated by NestedVM easier for JIT compilers to handle.
 768
 769 \item {\tt -mmemcpy}
 770 GCC sometimes has to copy somewhat large areas of memory. The most common example of this is assigning one struct to another. Memory copying can be done far more efficiently in Java than under NestedVM. Calls to the memcpy libc function are treated specially by the binary translator. They are turned into calls to a memcpy method in Runtime. The {\tt -mmemcpy} option causes GCC to invoke libc's memcpy() function when it needs to copy a region of memory rather than generating its own memcpy code. This call in then turned into a call to this Java memcpy function which is significantly faster than the MIPS implementation.
 771
 772 \item {\tt -ffunction-sections -fdata-sections}
 773 These two options are used in conjunction with the {\tt --gc-section} linker option. These three options cause the linker to aggressively discard unused functions and data sections. In some cases this leads to significantly smaller binaries.
 774
 775 %\item {\tt trampoline}
 776 %\item {\tt optimal method size}
 777 %\item {\tt -msingle-float}
 778 %\item {\tt -mmemcpy}
 779 %\item {\tt fastmem}
 780 %\item {\tt local vars for registers (useless)}
 781 %\item {\tt -fno-rename-registers}
 782 %\item {\tt -ffast-math}
 783 %\item {\tt -fno-trapping-math}
 784 %\item {\tt -fsingle-precision-constant}
 785 %\item {\tt -mfused-madd}
 786 %\item {\tt -freg-struct-return}
 787 %\item {\tt -freduce-all-givs}
 788 %\item {\tt -fno-peephole}
 789 %\item {\tt -fno-peephole2}
 790 %\item {\tt -fmove-all-movables}
 791 %\item {\tt -fno-sched-spec-load}
 792 %\item {\tt -fno-sched-spec}
 793 %\item {\tt -fno-schedule-insns}
 794 %\item {\tt -fno-schedule-insns2}
 795 %\item {\tt -fno-delayed-branch}
 796 %\item {\tt -fno-function-cse}
 797 %\item {\tt -ffunction-sections}
 798 %\item {\tt -fdata-sections}
 799 %\item {\tt array bounds checking}
 800 %\item {\tt -falign-functions=n}
 801 %\item {\tt -falign-labels=n}
 802 %\item {\tt -falign-loops=n}
 803 %\item {\tt -falign-jumps=n}
 804 %\item {\tt -fno-function-cse}
 805 \end{itemize}
 806
 807 \section{Future Directions}
 808
 809 World domination.
 810
 811 \section{Conclusion}
 812
 813 We rock the hizzouse.
 814
 815 \section{References}
 816
 817 Yer mom.
 818
 819 \section{stuff}
 820 \begin{onecolumn}
 821 {\footnotesize\begin{verbatim}
 822
 823 libjpeg (render thebride_1280.jpg)
 824 Native -  0.235s
 825 JavaSource - 1.86s
 826 ClassFile - 1.37s
 827
 828 freetype (rendering characters 32-127 of Comic.TTF at sizes from 8 to
 829 48 incrementing by 4)
 830 Native - 0.201s
 831 JavaSource - 2.02s
 832 ClassFile - 1.46s
 833
 834                                           libjpeg  libmspack libfreetype
 835 Interpreted MIPS Binary                   22.2      12.9      21.4
 836 Compled MIPS Binary (fastest options)     3.39      2.23      4.31
 837 Native -O3                                0.235    0.084     0.201
 838
 839 Compled - with all case statements        3.50      2.30      4.99
 840 Compiled - with pruned case statement     3.39      2.23      4.31
 841
 842 Compiled - 512 instructions/method        62.7      27.7      56.9
 843 Compiled - 256 instructions/method        3.54      2.55      4.43
 844 Compiled - 128 instructions/method        3.39      2.23      4.31
 845 Compiled - 64 instructions/method         3.56      2.31      4.40
 846 Compiled - 32 instruction/method          3.71      2.46      4.64
 847
 848 Compild MIPS Binary (Server VM)           3.21      2.00      4.54
 849 Compiled MIPS Binary (Client VM)          3.39      2.23      4.31
 850
 851 All times are measured in seconds. These were all run on a dual 1ghz G4
 852 running OS X 10.3.1 with Apple's latest VM (JDK 1.4.1_01-27). Each test
 853 was run 8 times within a single VM. The highest and lowest times were
 854 removed and the remaining 6 were averaged. In each case only the first
 855 run differed significantly from the rest.
 856
 857 The libjpeg test consisted of decoding a 1280x1024 jpeg
 858 (thebride_1280.jpg) and writing a tga. The mspack test consisted of
 859 extracting all members from arial32.exe, comic32.exe, times32.exe, and
 860 verdan32.exe. The freetype test consisted of rendering characters
 861 32-127 of Comic.TTF at sizes from 8 to 48 incrementing by 4. (That is
 862 about 950 individual glyphs).
 863
 864 I can provide you with the source for any of these test if you'd like.
 865
 866 -Brian
 867 \end{verbatim}}
 868 \end{onecolumn}
 869 \end{document}
 870