doc/ivme04.tex

   1 \documentclass{acmconf}
   2 \usepackage{graphicx}
   3 \usepackage{multicol}
   4 \usepackage{amssymb,amsmath,epsfig,alltt}
   5 \sloppy
   6 \usepackage{palatino}
   7 \usepackage{pdftricks}
   8 \begin{psinputs}
   9   \usepackage{pstricks}
  10   \usepackage{pst-node}
  11 \end{psinputs}
  12 \usepackage{parskip}
  13 \usepackage{tabularx}
  14 \usepackage{alltt}
  15 \bibliographystyle{amsplain}
  16
  17 \title{\textbf{\textsf{
  18 Complete Translation of Unsafe Native Code to Safe Bytecode
  19 }}}
  20 \date{}
  21 \author{\begin{tabular}{@{}c@{}}
  22         {\em {Brian Alliet}} \\
  23         {Rochester Institute of Technology}\\
  24         {\tt bja8464@cs.rit.edu}
  25    \end{tabular}\hskip 1in\begin{tabular}{@{}c@{}}
  26         {\em {Adam Megacz}} \\
  27         {University of California, Berkeley} \\
  28         {\tt megacz@cs.berkeley.edu}
  29 \end{tabular}}
  30 \begin{document}
  31
  32 \maketitle
  33
  34 {\it This document was typeset using D. E. Knuth's original \TeX 89
  35      Pascal source code, which was both compiled and executed entirely
  36      within a Java Virtual Machine.  No native code was utilized.}
  37
  38 \begin{abstract}
  39
  40 Most existing techniques for using code written in an unsafe language
  41 within a safe virtual machine involve transformations from one source
  42 code language (such as C, Pascal, or Fortran) to another (such as
  43 Java) and then to virtual machine bytecodes.  We present an
  44 alternative approach which can translate MIPS binaries produced by any
  45 compiler into safe virtual machine bytecodes.  This approach offers
  46 four key advantages over existing techniques:
  47
  48 \begin{itemize}
  49 \item Language-agnostic
  50 \item Bug-for-bug compiler compatability
  51 \item No post-translation human intervention
  52 \item No build process modifications
  53 \end{itemize}
  54
  55 We also present NestedVM, a complete implementation of this technique,
  56 which we have used to translate a number of programs.  Six examples
  57 are discussed in this paper: LINPACK (Fortran source), which was used
  58 as one of our performance tests, \TeX (Pascal source), which was used
  59 to typeset this document, {\tt libjpeg}, {\tt libmspack}, and FreeType
  60 (all C source), which are currently in production use as part of the
  61 Ibex Project, and {\tt gcc}, which was used to compile all of the
  62 aforementioned.
  63
  64 Performance measurements indicate a best case performance within 3x of
  65 native code and worst case typically within 10x, making it an
  66 attractive solution for code which is not performance-critical.
  67
  68 \end{abstract}
  69
  70 \section{Introduction}
  71
  72 Unsafe languages such as C \cite{KR} and C++ \cite{soustroup} have
  73 been in use much longer than any of today's widely accepted safe
  74 languages such as Java \cite{java} and C\# \cite{csharp}.  Consequently, there is
  75 a huge library of software written in these languages.  Although safe
  76 languages offer substantial benefits, their comparatively young age
  77 often puts them at a disadvantage when breadth of existing support
  78 code is an important criterion.
  79
  80 The typical solution to this dilemma is to use a native interface such
  81 as JNI \cite{jni} or CNI \cite{cni} to invoke unsafe code from within a
  82 virtual machine or otherwise safe environment.  Unfortunately, there
  83 are a number of situations in which this is not an acceptable
  84 solution.  These situations can be broadly classified into two
  85 categories: {\it security concerns} and {\it portability concerns}.
  86
  87 Using Java as an example, JNI and CNI are prohibited in a number of
  88 contexts, including applets environments and servlet containers with a
  89 {\tt SecurityManager}.  Additionally, even in the context of trusted
  90 code, {\tt native} methods invoked via JNI are susceptible to buffer
  91 overflow and heap corruption attacks which are not a concern for
  92 verified bytecode.
  93
  94 The second class of disadvantages revolves around portability
  95 concerns; native interfaces require the native library to be compiled
  96 ahead of time, for every architecture on which they will be
  97 deployed.  This is unworkable for situations in which the full set of
  98 target architectures is not known at deployment time.  Additionally,
  99 some JVM platform variants such as J2ME \cite{j2me} simply do not offer
 100 support for native code.
 101
 102 The technique we present here uses typical compiler to compile unsafe
 103 code into a MIPS binary, which is then translated on an
 104 instruction-by-instruction basis into Java bytecode.  The technique
 105 presented here is general; we anticipate that it can be applied to
 106 other secure virtual machines such as Microsoft's .NET \cite{msil}, Perl
 107 Parrot \cite{parrot}, or Python bytecode \cite{python}.
 108
 109 \section{Approaches to Translation}
 110
 111 The four program representations of interest in this context, along
 112 with their specific types in the C-to-JVM instantiation of the
 113 problem are shown in the following diagram:
 114
 115 \begin{pdfpic}
 116 \newlength{\MyLength}
 117 \settowidth{\MyLength}{machine code}
 118 \newcommand{\MyBox}[1]{\makebox[\MyLength][c]{#1}}
 119 \begin{psmatrix}[colsep=2,rowsep=0]
 120   & \\[0pt]
 121   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 122   [name=s00]\MyBox{\tt (.c)} & [name=s11]\MyBox{\tt (.java)}   \\[0pt]
 123   & \\[0pt]
 124   & \\[0pt]
 125   & \\[0pt]
 126   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 127   [name=b00]\MyBox{\tt (.o)}  & [name=b11]\MyBox{\tt (.class)} \\
 128   & \\[0pt]
 129   \psset{nodesep=5pt,arrows=->}
 130 \end{psmatrix}
 131 \end{pdfpic}
 132
 133 To illustrate the context of this diagram, the following arcs show the
 134 translations performed by a few familiar tools:
 135
 136 \begin{pdfpic}
 137 \newlength{\MyLength}
 138 \settowidth{\MyLength}{xmachine codex}
 139 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 140 \psmatrix[colsep=2,rowsep=0,nrot=:D]
 141   & \\[0pt]
 142   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 143   & \\[0pt]
 144   & \\[0pt]
 145   & \\[0pt]
 146   & \\[0pt]
 147   & \\[0pt]
 148   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 149   & \\[0pt]
 150   \psset{nodesep=5pt,arrows=->}
 151   \ncline{s0}{b0}\bput{:U}{\tt gcc}
 152   \ncline{s1}{b0}\bput{:D}{\tt gcj}
 153   \ncline{s1}{b1}\aput{:U}{\tt javac}
 154   \ncline{b1}{b0}\aput{:D}{\tt gcj}\bput{:D}{JITs}
 155 \endpsmatrix
 156 \end{pdfpic}
 157
 158 Techniques for translating unsafe code into VM bytecode generally fall
 159 into four categories, which we expand upon in the next two sections:
 160
 161 \begin{itemize}
 162 \item source-to-source translation
 163 \item source-to-binary translation
 164 \item binary-to-source translation
 165 \item binary-to-binary translation
 166 \end{itemize}
 167
 168 \section{Existing Work}
 169
 170 \subsection{Source-to-Source Translation}
 171
 172 The most common technique employed is partial translation from unsafe
 173 source code to safe source code:
 174
 175 \begin{pdfpic}
 176 \newlength{\MyLength}
 177 \settowidth{\MyLength}{xmachine codex}
 178 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 179 \psmatrix[colsep=2,rowsep=0,nrot=:U]
 180   & \\[0pt]
 181   & \\[0pt]
 182   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 183   & \\[0pt]
 184   & \\[0pt]
 185   & \\[0pt]
 186   & \\[0pt]
 187   & \\[0pt]
 188   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 189   & \\[0pt]
 190   \psset{nodesep=5pt,arrows=->}
 191   \ncline{s0}{s1}\aput{:U}{source-to}\bput{:U}{source}
 192   \ncline{s1}{b1}\aput{:U}{\tt javac}
 193 \endpsmatrix
 194 \end{pdfpic}
 195
 196 A number of existing systems employ this technique; they can
 197 be divided into two categories: those which perform a partial
 198 translation which is completed by a human, and those which perform a
 199 total translation but fail (yield an error) on a large class of input
 200 programs.
 201
 202
 203 \subsubsection{Incomplete Translation}
 204
 205 Jazillian \cite{jazillian} is a commercial solution which produces
 206 extremely readable Java source code from C source code, but ony
 207 translates a small portion of the C language.  Jazillian is unique in
 208 that in addition to {\it language migration}, it also performs {\it
 209 API migration}; for example, Jazillian is intelligent enough
 210 to translate {\tt char*~s1~=~strcpy(s2)} into {\tt String~s1~=~s2}.
 211
 212 Unfortunately such deep analysis is intractible for most of the C
 213 language and standard library; Jazillian's documentation notes that
 214 {\it ``This is not your father's language translator.  It's not
 215 generating ugly code that's guaranteed to work out of the
 216 box... Jazillian does not always produce code that works correctly.''}
 217
 218 MoHCA-Java \cite{mohca} is the other major tool in this category, and steps
 219 beyond Jazillian by providing tools for analysis of the source C++
 220 abstract syntax tree.  Additionally, MoHCA-Java's analysis engine is
 221 extensible, making it a platform for constructing application-specific
 222 translators rather than a single translation tool.  However,
 223 MoHCA-Java does not always generate complete Java code for all of the C++
 224 programs which it accepts.
 225
 226
 227 \subsubsection{Partial Domain Translation}
 228
 229 The c2j \cite{c2j}, c2j++ \cite{c2jpp}, Cappucinno \cite{capp},
 230 and Ephedra \cite{ephedra} systems each provide support for complete
 231 translation of a {\it subset} of the source language (C or C++).  Each
 232 of the four tools supports a progressively greater subset than the one
 233 preceding it; however none covers the entire input language.
 234
 235 Ephedra, the most advanced of the four, supports most of the C++
 236 language, and claims to produce ``human readable'' Java code as
 237 output.  Notable omissions from the input domain include support for
 238 fully general pointer arithmetic, casting between unrelated types, and
 239 reading from a {\tt union} via a different member than the one most
 240 recently written.
 241
 242 Unfortunately, when the program being translated is large and complex,
 243 it is quite likely that it will use an unsupported feature in at least
 244 one place.  In the absence of a programmer who understands the source
 245 program, a single anomoly is often enough to render the entire
 246 translation process useless.  As a result, these tools are mainly
 247 useful as an {\it aid} to programmers who could normally perform the
 248 conversion themselves, but want to save time by automating most of the
 249 process.
 250
 251
 252 \subsection{Source-to-Binary Translation}
 253
 254 Source-to-binary translation involves a compiler for the unsafe
 255 language which has been modified to emit safe bytecode.
 256
 257 \begin{pdfpic}
 258 \newlength{\MyLength}
 259 \settowidth{\MyLength}{xmachine codex}
 260 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 261 \psmatrix[colsep=2,rowsep=0,nrot=:U]
 262   & \\[0pt]
 263   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 264   & \\[0pt]
 265   & \\[0pt]
 266   & \\[0pt]
 267   & \\[0pt]
 268   & \\[0pt]
 269   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 270   & \\[0pt]
 271   \psset{nodesep=5pt,arrows=->}
 272   \ncline{s0}{b1}\bput{:U}{source-to-binary}
 273 \endpsmatrix
 274 \end{pdfpic}
 275
 276 The primary occupant of this category is {\tt egcs-jvm}
 277 \cite{egcsjvm}, an experimental ``JVM backend'' for the GNU Compiler
 278 Collection ( {\tt gcc} ) \cite{gcc}.  Since {\tt gcc} employs a highlym
 279 odular architecture, it {\it is} possible to add RTL code generators
 280 for nonstandard processors.  However, {\tt gcc}'s parsing, RTL
 281 generation, and optimization layers make fundamental assumptions (such
 282 as the availability of pointer math) which cannot be directly
 283 supported; thus the compiler still fails for a substantial class of
 284 input programs.
 285
 286
 287
 288 \section{NestedVM}
 289
 290 The principal difference between NestedVM and other approaches is that
 291 NestedVM {\it does not} attempt to deal with source code as an input.
 292 This leads immediately to three advantages:
 293
 294 \begin{itemize}
 295 \item {\bf Total coverage of all language features}
 296
 297       Because NestedVM does not attempt to implement the parsing and
 298       code generation steps of compilation, it is freed from the
 299       extremely complex task of faithfully implementing languages
 300       which are often not fully or formally specified (such as C and
 301       C++).
 302
 303 \item {\bf No build process modifications}
 304
 305       NestedVM does not modify existing build processes, which can be
 306       extremely complex and dependent on strange preprocessor usage as
 307       well as the complex interplay between compiler switches and
 308       header file locations.
 309
 310 \item {\bf Bug-for-bug compiler compatability}
 311
 312       Since NestedVM uses the compiler's {\it output} as its own {\it
 313       input}, it ensures that programs which are inadvertently
 314       dependent on the vagaries of a particular compiler can still be
 315       used.
 316
 317 \end{itemize}
 318
 319 NestedVM's approach carries a fourth benefit as well, arising from its
 320 totality:
 321
 322 \begin{itemize}
 323 \item {\bf No post-translation human intervention}
 324
 325       NestedVM offers total support for all non-privileged
 326       instructions, registers, and resources found on a MIPS {\tt
 327       R2000} CPU, including the add/multiply unit and floating point
 328       coprocessor.  As such, it constitutes a total function mapping
 329       from the entire domain of (non-kernel-mode) programs onto (a
 330       subset of) the semantics of the Java Virtual Machine.  This
 331       ensures that the translation process is fully automated and
 332       always succeeds for valid input binaries.
 333 \end{itemize}
 334
 335 This is a much more important factor than is obvious at first glance.
 336 If post-translation human intervention is required, then the {\it
 337 human becomes part of the build process}.  This means that if a third
 338 party library used in the project needs to be upgraded, {\it a human
 339 must intervene} in the rebuild process.  In addition to slowing the
 340 process and introducing opportunities for error, this task often
 341 requires specialized knowledge which becomes tied to the particular
 342 individual performing this task, rather than being encoded in build
 343 scripts which persist throughout the lifetime of the project.
 344
 345 \subsection{Why MIPS?}
 346
 347 We chose MIPS as a source format for three reasons: the availability
 348 of tools to compile legacy code into MIPS binaries, the close
 349 similarity between the MIPS ISA and the Java Virtual Machine, and the
 350 relatively high degree of program structure that can be inferred from
 351 ABI-adherent binaries.
 352
 353 The MIPS architecture has been around for quite some time, and is well
 354 supported by the GNU Compiler Collection, which is capable of
 355 compiling C, C++, Java, Fortran, Pascal, and Objective C
 356 into MIPS binaries.
 357
 358 The MIPS R2000 ISA bears a striking similarity to the Java Virtual
 359 Machine:
 360
 361 \begin{itemize}
 362
 363 \item Most of the instructions in the original MIPS ISA operate only
 364       on 32-bit aligned memory locations. This allows NestedVM to
 365       represent memory as a Java {\tt int[]} array without introducing
 366       additional overhead.  The remaining non-aligned memory load
 367       instructions are only rarely emitted by most compilers since
 368       they carry a performance penalty on physical MIPS
 369       implementations.
 370
 371 \item Unlike its predecessor, the R2000 supports 32-bit by 32-bit
 372       multiply and divide instructions as well as a single and double
 373       precision floating point unit.  These capabilities map nicely
 374       onto Java's arithmetic instructions.
 375
 376 \end{itemize}
 377
 378 Finally, the MIPS ISA and ABI convey quite a bit of information about
 379 program structure.  This information can be used for optimization
 380 purposes:
 381
 382 \begin{itemize}
 383
 384 \item The structure of MIPS branching and jump instructions make it
 385       easy to infer the set of likely target instructions.
 386
 387 \item The MIPS ABI specifies particular registers as caller-save and
 388       callee-save, as well as designating a register for the return
 389       address after a function call.  This allows NestedVM to optimize
 390       many operations for the common case of ABI-adherent binaries.
 391
 392 \item All MIPS instructions are exactly 32 bits long.
 393
 394 \end{itemize}
 395
 396
 397
 398 \subsection{Binary-to-Source}
 399
 400 The simplest operational mode for NestedVM is binary-to-source
 401 translation.  In this mode, NestedVM translates MIPS binaries into
 402 Java source code, which is then fed to a Java compiler in order to
 403 produce bytecode files:
 404
 405 \begin{pdfpic}
 406 \newlength{\MyLength}
 407 \settowidth{\MyLength}{xmachine codex}
 408 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 409 \psmatrix[colsep=2,rowsep=0,nrot=:U]
 410   & \\[0pt]
 411   & \\[0pt]
 412   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 413   & \\[0pt]
 414   & \\[0pt]
 415   & \\[0pt]
 416   & \\[0pt]
 417   & \\[0pt]
 418   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 419   \psset{nodesep=5pt,arrows=->}
 420   \ncline{s0}{b0}\bput{:U}{\tt gcc}
 421   \ncline{s1}{b1}\aput{:U}{\tt javac}
 422   \ncline{b0}{s1}\naput{\tt NestedVM}
 423 \endpsmatrix
 424 \end{pdfpic}
 425
 426 \begin{figure*}[t]
 427 \begin{minipage}[c]{7in}%
 428 \begin{multicols}{2}
 429 {\footnotesize\begin{verbatim}
 430 private final static int r0 = 0;
 431 private int r1, r2, r3,...,r30;
 432 private int r31 = 0xdeadbeef;
 433 private int pc = ENTRY_POINT;
 434
 435 public void run() {
 436     for(;;) {
 437         switch(pc) {
 438             case 0x10000:
 439                 r29 = r29 - 32;
 440             case 0x10004:
 441                 r1 = r4 + r5;
 442             case 0x10008:
 443                 if(r1 == r6) {
 444                     /* delay slot */
 445                     r1 = r1 + 1;
 446                     pc = 0x10018:
 447                     continue;
 448                 }
 449             case 0x1000C:
 450                 r1 = r1 + 1;
 451             case 0x10010:
 452                 r31 = 0x10018;
 453                 pc = 0x10210;
 454                 continue;
 455             case 0x10014:
 456                 /* nop */
 457             case 0x10018:
 458                 pc = r31;
 459                 continue;
 460             ...
 461             case 0xdeadbeef:
 462                 System.err.println(``Exited.'');
 463                 System.exit(1);
 464         }
 465     }
 466 }
 467 \end{verbatim}}
 468 \vspace{1in}
 469 {\footnotesize\begin{verbatim}
 470 public void run_0x10000() {
 471     for(;;) {
 472     switch(pc) {
 473         case 0x10000:
 474             ...
 475         case 0x10004:
 476             ...
 477         ...
 478         case 0x10010:
 479             r31 = 0x10018;
 480             pc = 0x10210;
 481             return;
 482         ...
 483     }
 484     }
 485 }
 486
 487 pubic void run_0x10200() {
 488     for(;;) {
 489     switch(pc) {
 490         case 0x10200:
 491             ...
 492         case 0x10204:
 493             ...
 494     }
 495     }
 496 }
 497
 498 public void trampoline() {
 499     for(;;) {
 500     switch(pc&0xfffffe00) {
 501             case 0x10000: run_0x10000(); break;
 502             case 0x10200: run_0x10200(); break;
 503             case 0xdeadbe00:
 504                 ...
 505         }
 506     }
 507 }
 508 \end{verbatim}}
 509 \end{multicols}
 510 \end{minipage}
 511 \caption{\label{code1} Trampoline transformation necessitated by Java's 64kb method size limit}
 512 \end{figure*}
 513
 514 Translating unsafe code for use within a JVM proceeds as follows:
 515
 516 \begin{enumerate}
 517
 518 \item Compile the source code to a statically linked binary, targeting
 519       the MIPS R2000 ISA.  Typically this will involve linking against
 520       {\tt libc}, which translates system requests (such as {\tt
 521       open()}, {\tt read()}, or {\tt write()}) into appropriate
 522       invocations of the MIPS {\tt SYSCALL} instruction.
 523
 524 \item Invoke {\tt NestedVM} on the statically linked binary.
 525
 526 \item Compile the resulting {\tt .java} code using {\tt jikes}
 527       \cite{jikes} or {\tt javac}.
 528
 529 \item From java code, invoke the {\tt run()} method on the generated
 530       class.  This is equivalent to the {\tt main()} entry point.
 531
 532 \end{enumerate}
 533
 534 \subsubsection{Optimizations}
 535
 536 Generating Java source code instead of bytecode frees NestedVM from
 537 having to perform simple constant propagation optimizations, as most
 538 Java compilers already do this.  A recurring example is the treatment
 539 of the {\tt r0} register, which is fixed as {\tt 0} in the MIPS ISA.
 540
 541 Lacking the ability to generate specially optimized bytecode
 542 sequences, a straightforward mapping of the general purpose hardware
 543 registers to 32 {\tt int} fields turned out to be optimal.
 544
 545
 546 \epsfig{file=charts/chart1,width=3in}
 547
 548 Unfortunately, Java imposes a 64kb limit on the size of the bytecode
 549 for a single method.  This presents a problem for NestedVM, and
 550 necessitates a {\it trampoline transformation}, as shown in
 551 Figure~\ref{code1}.  With this trampoline in place, large binaries can
 552 be handled without much difficulty -- fortunately, there is no
 553 corresponding limit on the size of a classfile as a whole.
 554
 555 One difficulty that arose as a result of using the trampoline
 556 transformation was the fact that {\tt javac} and {\tt jikes} are
 557 unable to properly optimize its switch statements.  For example, the
 558 following code is compiled into a comparatively inefficient {\tt
 559 LOOKUPSWITCH}:
 560
 561 {\footnotesize
 562 \begin{verbatim}
 563     switch(pc&0xffffff00) {
 564         case 0x00000100: run_100(); break;
 565         case 0x00000200: run_200(); break;
 566         case 0x00000300: run_300(); break;
 567     }
 568 \end{verbatim}}
 569
 570 Whereas the next block of code code optimized into a {\tt
 571 TABLESWITCH}:
 572
 573 {\footnotesize
 574 \begin{verbatim}
 575     switch(pc>>>8) {
 576         case 0x1: run_100();
 577         case 0x2: run_200();
 578         case 0x3: run_300();
 579     }
 580 \end{verbatim}}
 581
 582 This problem was surmounted by switching on a denser set of {\tt case}
 583 values, which is more amenable to the {\tt TABLESWITCH} structure.
 584 This change alone nearly doubled the speed of the compiled binary.
 585
 586 The next performance improvement came from tuning the size of the
 587 methods invoked from the trampoline.  Trial and error led to the
 588 onclusion that HotSpot \cite{hotspot} -- the most widely deployed JVM
 589 -- performs best when 128 MIPS instructions are mapped to each method.
 590
 591 \epsfig{file=chart5,width=3in}
 592
 593 \epsfig{file=chart6,width=3in}
 594
 595 This phenomenon is due to two factors:
 596
 597 \begin{itemize}
 598
 599 \item While the trampoline method's {\tt switch} statement can be
 600       coded as a {\tt TABLESWITCH}, the {\tt switch} statement
 601       within the individual methods is to sparse to encode this way.
 602
 603 \item Hybrid Interpretive-JIT compilers such as HotSpot generally
 604       favor smaller methods since they are easier to compile and are
 605       better candidates for compilation in ``normal'' programs (unlike
 606       NestedVM programs).
 607
 608 \end{itemize}
 609
 610 After tuning method sizes, our next performance boost came from
 611 eliminating exraneous case branches.  Having case statements before
 612 each instruction prevents JIT compilers from being able to optimize
 613 across instruction boundaries, since control flow can enter the body
 614 of a {\tt switch} statement at any of the {\tt case}s.  In order to
 615 eliminate unnecessary case statements we needed to identify all
 616 possible jump targets.  Jump targets can come from three sources:
 617
 618 \begin{itemize}
 619
 620 \item {\bf The {\tt .text} segment}
 621
 622       Every instruction in the text segment is scanned, and every
 623       branch instruction's destination is added to the list of
 624       possible branch targets.  In addition, any function that sets
 625       the link register is added to the list \footnote{actually {\tt addr+8}}.
 626       Finally, combinations of {\tt LUI} (Load Upper Immediate) and
 627       {\tt ADDIU} (Add Immediate Unsigned) are scanned for possible
 628       addresses in the {\tt .text} segment since this combination of
 629       instructions is often used to load a 32-bit word into a
 630       register.
 631
 632 \item {\bf The {\tt .data} segment}
 633
 634       When compiling {\tt switch} statements, compilers often use a
 635       jump table stored in the {\tt .data} segment.  Unfortunately
 636       they typically do not identify these jump tables in any way.
 637       Therefore, the entire {\tt .data} segment is conservatively
 638       scanned for possible addresses in the {\tt .text} segment.
 639
 640 \item {\bf The symbol table}
 641
 642       This is mainly used as a backup.  Scanning the {\tt .text} and
 643       {\tt .data} segments should identify any possible jump targets;
 644       however, adding all function symbols in the ELF symbol table
 645       also catches functions that are never called directly from the
 646       MIPS binary, such as those invoked only via the NestedVM
 647       runtime's {\tt call()} method.
 648
 649 \end{itemize}
 650
 651 Eliminating unnecessary {\tt case} statements provided a 10-25\% speed
 652 increase.
 653
 654 Despite all the above optimizations, one insurmountable obstacle
 655 remained: the Java {\tt .class} file format limits the constant pool
 656 to 65535 entries.  Every integer literal greater than {\tt 32767}
 657 requires an entry in this pool, and each branch instruction generates
 658 one of these.
 659
 660 One suboptimal solution was to express constants as offsets from a few
 661 central values; for example ``{\tt pc~=~N\_0x00010000~+~0x10}'' (where
 662 {\tt N\_0x000100000} is a non-final field to prevent {\tt javac} from
 663 inlining it).  This was sufficient to get reasonably large binaries to
 664 compile, and caused only a small (approximately 5\%) performance
 665 degredation and a similarly small increase in the size of the {\tt
 666 .class} file.  However, as we will see in the next section, compiling
 667 directly to {\tt .class} files (without the intermediate {\tt .java}
 668 file) eliminates this problem entirely.
 669
 670
 671 \subsection{Binary-to-Binary}
 672
 673 After implementing the binary-to-source compiler, a binary-to-binary
 674 translation mode was added.
 675
 676 \begin{pdfpic}
 677 \newlength{\MyLength}
 678 \settowidth{\MyLength}{xmachine codex}
 679 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 680 \psmatrix[colsep=2,rowsep=0,nrot=:U]
 681   & \\[0pt]
 682   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 683   & \\[0pt]
 684   & \\[0pt]
 685   & \\[0pt]
 686   & \\[0pt]
 687   & \\[0pt]
 688   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 689   & \\[0pt]
 690   \psset{nodesep=5pt,arrows=->}
 691   \ncline{s0}{b0}\bput{:U}{\tt gcc}
 692   \ncline{b0}{b1}\naput{\tt NestedVM}
 693 \endpsmatrix
 694 \end{pdfpic}
 695
 696 This mode has several advantages:
 697
 698 \begin{itemize}
 699
 700 \item There are quite a few interesting bytecode sequences that cannot
 701       be generated as a result of compiling Java source code.
 702
 703 \item Directly generating {\tt .class} files Eliminates the
 704       time-consuming {\tt javac} step.
 705
 706 \item Direct compilation to {\tt .class} files opens up the
 707       interesting possibility of dynamically translating MIPS binaries
 708       and loading them via {\tt ClassLoader.fromBytes()} {\it at
 709       deployment time}, eliminating the need to compile binaries ahead
 710       of time.
 711
 712 \end{itemize}
 713
 714 Most of the performance improvemen where made where in the handling of
 715 branch instructions and in taking advantage of the JVM stack to
 716 eliminate unnecessary {\tt LOAD}s and {\tt STORE}s to local variables.
 717
 718 \epsfig{file=chart7,width=3in}
 719
 720 The first optimization gained by direct bytecode generation came from
 721 the use of the JVM {\tt GOTO} instruction.  Despite the fact that the
 722 Java {\it language} does not have a {\tt goto} keyword, the VM does in
 723 fact have a corresponding instruction which is used quite heavily by
 724 {\tt javac}.  NestedVM's binary-to-binary mode exploits this
 725 instruction to avoid emitting inefficient {\tt switch..case}
 726 structures.
 727
 728 Related to the {\tt GOTO} instruction is branch statement
 729 optimization.  When emitting source code, NestedVM translates branches
 730 into Java source code like this:
 731
 732 {\footnotesize\begin{verbatim}
 733     if (condition) {
 734         pc = TARGET;
 735         continue;
 736     }
 737 \end{verbatim}}
 738
 739 This requires a branch in the JVM {\it regardless} of whether the MIPS
 740 branch is actually taken.  If {\tt condition} is false the JVM has to
 741 jump over the code to set {\tt pc} and go back to the {\tt switch}
 742 statemenmt; if {\tt condition} is true the JVM has to jump to the {\tt
 743 switch} block.  By generating bytecode directly, NestedVM is able to
 744 emit a JVM bytecode branching directly to the address corresponding to
 745 the target of the MIPS branch.  In the case where the branch is not
 746 taken the JVM doesn't branch at all.
 747
 748 A side effect of the previous two optimizations is a solution to the
 749 excess constant pool entries problem.  When jumps are implemented as
 750 {\tt GOTO}s and branches are taken directly, the {\tt pc} field does
 751 not need to be set.  This eliminates a huge number of constant pool
 752 entries.  The {\tt .class} file constant pool size limit is still
 753 present, but it is less likely to be encountered.
 754
 755 Implementation of the MIPS delay slot offers another opportunity for
 756 bytecode-level optimization.  In order to take advantage of
 757 instructions already in the pipeline, the MIPS ISA specifies that the
 758 instruction after a jump or branch is always executed, even if the
 759 jump/branch is taken.  This instruction is referred to as the ``delay
 760 slot\footnote{Newer MIPS CPUs have pipelines that are much larger than
 761 early MIPS CPUs, so they have to discard instructions anyways}.''  The
 762 instruction in the delay slot is actually executed {\it before} the
 763 branch is taken.  To further complicate matters, values from the
 764 register file are loaded {\it before} the delay slot is executed.
 765
 766 Fortunately there is a very elegent solution to this problem which can
 767 be expressed in JVM bytecode.  When a branch instruction is
 768 encountered, the registers needed for the comparison are pushed onto
 769 the stack to prepare for the JVM branch instruction.  Then, {\it
 770 after} the values are on the stack the delay slot instruction is
 771 emitted, followed by the actual JVM branch instruction.  Because the
 772 values were pushed to the stack before the delay slot was executed, any
 773 changes the delay slot made to the registers are not visible to the
 774 branch bytecode.
 775
 776 One final advantage that generating bytecode directly allows is a
 777 reduction in the size of the ultimate {\tt .class} file.  All the
 778 optimizations above lead to more compact bytecode as a beneficial side
 779 effect; in addition, NestedVM performs a few additional optimizations.
 780
 781 When encountering the following {\tt switch} block, both {\tt javac}
 782 and {\tt jikes} generate redundant bytecode.
 783
 784 {\footnotesize\begin{verbatim}
 785     switch(pc>>>8) {
 786         case 0x1: run_1(); break;
 787         case 0x2: run_2(); break
 788         ...
 789         case 0x100: run_100(); break;
 790     }
 791 \end{verbatim}}
 792
 793 The first bytecode in each case arm in the switch statement is {\tt
 794 ALOAD\_0} to prepare for a {\tt INVOKESPECIAL} call.  By simply
 795 lifting this bytecode outside of the {\tt switch} statement, each {\tt
 796 case} arm shrinks by one instruction.
 797
 798 \subsubsection{Compiler Flags}
 799
 800 Although NestedVM perfectly emulates a MIPS R2000 CPU, its performance
 801 profile is nothing like that of actual silicon.  In particular, {\tt
 802 gcc} makes several optimizations that increase performance on an
 803 actually MIPS CPU but actually decrease the performance of
 804 NestedVM-generated bytecode.  We found the following compiler options
 805 could be used to improve performance:
 806
 807 \begin{itemize}
 808
 809 \item {\tt -falign-functions}
 810
 811       Normally a function's location in memory has no effect on its
 812       execution speed.  However, in the NestedVM binary translator,
 813       the {\tt .text} segment is split on power-of-two boundaries.  If
 814       a function starts near the end of one of these boundaries, a
 815       performance critical part of the function winds up spanning two
 816       Java methods.  Telling {\tt gcc} to align all functions along
 817       these boundaries decreases the chance of this sort of splitting.
 818
 819 \item {\tt -fno-rename-registers}
 820
 821       On an actual silicon chip, using additional registers carries no
 822       performance penalty (as long as none are spilled to the stack).
 823       However, when generating bytecode, using {\it fewer}
 824       ``registers'' helps the JVM optimize the machine code it
 825       generates by simplifying the constraints it needs to deal with.
 826       Disabling register renaming has this effect.
 827
 828 \item {\tt -fno-schedule-insns}
 829
 830       Results of MIPS load operations are not available until {\it
 831       two} instructions after the load.  Without the {\tt
 832       -fno-schedule-insns} instruction, {\tt gcc} will attempt to
 833       reorder instructions to do other useful work during this period
 834       of unavailability.  NestedVM is under no such constraint, so
 835       removing this reordering typically generates simpler machine
 836       code.
 837
 838 \item {\tt -mmemcpy}
 839
 840       Enabling this instruction causes {\tt gcc} to use the system
 841       {\tt memcpy()} routine instead of generating loads and stores.
 842       As explained in the next section, the NestedVM runtime
 843       implements {\tt memcpy()} using {\tt System.arraycopy()}, which
 844       is substantially more efficient.
 845
 846 \item {\tt -ffunction-sections -fdata-sections}
 847
 848       These two options are used in conjunction with the {\tt
 849       --gc-section} linker option, prompting the linker to more
 850       aggressively prune dead code.
 851
 852 \end{itemize}
 853
 854 The effects of the various optimizations presented in this chapter are
 855 summarized in the table below.
 856
 857 \epsfig{file=chart4,width=3in}
 858
 859 \epsfig{file=chart3,width=3in}
 860
 861 \section{The NestedVM Runtime}
 862
 863 In addition to binary-to-source and binary-to-binary translation,
 864 NestedVM also includes a MIPS binary interpreter.  All three
 865 translation approaches expose the same API to both the translated
 866 binary and the surrounding VM (including peer Java code).
 867
 868 \subsection{The Runtime Class}
 869
 870 The runtime fulfills four roles:
 871
 872 \begin{itemize}
 873
 874 \item It provides a simple, consistent external interface.  The method
 875       of actually executing the code (currently only translated
 876       binaries and the interpreter) can be changed without any code
 877       changes to the caller because only runtime exposes a public
 878       interface.  This includes methods to pass arguments to the
 879       binary's {\tt main()} function, read and write from memory, and
 880       call individual functions in the binary.
 881
 882 \item It manages the process's memory.  The runtime class contains
 883       large {\tt int[]} arrays that represent the process`s entire
 884       memory space.  Subclasses read and write to these arrays as
 885       required by the instructions they are executing, and can expand
 886       their memory space using the {\tt sbrk} system call.
 887
 888 \item The runtime provides access to the host file system and standard
 889       I/O streams.  Subclasses of {\tt runtime} can access the file
 890       system through standard UNIX syscalls ({\tt read()}, {\tt
 891       write()}, {\tt open()}, etc).  The runtime manages the file
 892       descriptor table that maps UNIX file descriptors to Java {\tt
 893       RandomAccessFile}s, {\tt InputStream}s, {\tt OutputStream}s, and
 894       {\tt Socket}s.
 895
 896 \item It provides general OS services, including {\tt sleep()}, {\tt
 897       gettimeofday()}, {\tt getpagesize()}, {\tt sysconf()}, {\tt
 898       fcntl()}, and so on.
 899
 900 \end{itemize}
 901
 902 \section{Future Directions}
 903
 904 Although we have only implemented it for the Java Virtual Machine, our
 905 technique generalizes to other safe bytecode architectures.  In
 906 particular we would like to demonstrate this generality by retargeting
 907 the translator to the Microsoft Intermediate Language \cite{msil}.
 908
 909 Additionally, we would like to explore other uses for dynamic loading
 910 of translated MIPS binaries by combining NestedVM (which itself is
 911 written in Java) and the {\tt ClassLoader.fromBytes()} mechanism.
 912
 913
 914 \section{Conclusion}
 915
 916 We have presented a novel technique for using libraries written in
 917 unsafe languages within a safe virtual machine without resorting to
 918 native interfaces.  We have implemented this technique in NestedVM,
 919 which is currently used by the Ibex project\footnote{{\tt
 920 http://www.ibex.org}} to perform font rasterization (via {\tt
 921 libfreetype}), JPEG decoding (via {\tt libjpeg}), and CAB archive
 922 extraction (via {\tt libmspack}), three libraries for which no
 923 equivalent Java classes exist.
 924
 925 NestedVM is available under an open source license, and can be
 926 obtained from
 927 \begin{verbatim}
 928     http://nestedvm.ibex.org
 929 \end{verbatim}
 930
 931
 932 \section{Appendix: Testing Methodology}
 933
 934 All times are measured in seconds. These were all run on a dual 1Ghz
 935 Macintosh G4 running Apple's latest JVM (Sun HotSpot JDK 1.4.1). Each
 936 test was run 8 times within a single VM. The highest and lowest times
 937 were removed and the remaining 6 were averaged.  In each case only the
 938 first run differed significantly from the rest.
 939
 940 The {\tt libjpeg} test consisted of decoding a 1280x1024 jpeg and
 941 writing a tga.  The {\tt mspack} test consisted of extracting all
 942 members from {\tt arial32.exe}, {\tt comic32.exe}, {\tt times32.exe},
 943 and {\tt verdan32.exe}. The {\tt libfreetype} test consisted of
 944 rendering ASCII characters 32-127 of {\tt Comic.TTF} at sizes from 8
 945 to 48 incrementing by 4 for a total of 950 glyphs.
 946
 947 \bibliography{nestedvm}
 948
 949 \end{document}
 950