doc/ivme04.tex

   1 \documentclass{acmconf}
   2 \usepackage{graphicx}
   3 \usepackage{multicol}
   4 \usepackage{amssymb,amsmath,epsfig,alltt}
   5 \sloppy
   6 \usepackage{palatino}
   7 \usepackage{pdftricks}
   8 \begin{psinputs}
   9   \usepackage{pstricks}
  10   \usepackage{pst-node}
  11 \end{psinputs}
  12 \usepackage{parskip}
  13 \usepackage{tabularx}
  14 \usepackage{alltt}
  15 \bibliographystyle{amsplain}
  16
  17 \title{\textbf{\textsf{
  18 Complete Translation of Unsafe Native Code to Safe Bytecode
  19 }}}
  20 \date{}
  21 \author{\begin{tabular}{@{}c@{}}
  22         {\em {Brian Alliet}} \\
  23         {Rochester Institute of Technology}\\
  24         {\tt bja8464@cs.rit.edu}
  25    \end{tabular}\hskip 1in\begin{tabular}{@{}c@{}}
  26         {\em {Adam Megacz}} \\
  27         {University of California, Berkeley} \\
  28         {\tt megacz@cs.berkeley.edu}
  29 \end{tabular}}
  30 \begin{document}
  31
  32 \maketitle
  33
  34 {\it This document was typeset using D. E. Knuth's original \TeX 89,
  35      which was both compiled and executed entirely within a Java
  36      Virtual Machine without the use of native code.}
  37
  38 \begin{abstract}
  39
  40 FIXME: mention unsigned arithmetic, threading, memory model.
  41
  42 Existing techniques for using code written in an unsafe language
  43 within a safe virtual machine generally involve transformations from
  44 one source code language (such as C, Pascal, or Fortran) to another
  45 (such as Java) which is the compiled into virtual machine bytecodes.
  46
  47 We present an alternative approach which translate MIPS binaries
  48 produced by any compiler into safe virtual machine bytecodes.  This
  49 approach offers four key advantages over existing techniques:
  50
  51 \begin{itemize}
  52 \item Language-agnostic
  53 \item Bug-for-bug compiler compatability
  54 \item No post-translation human intervention
  55 \item No build process modifications
  56 \end{itemize}
  57
  58 We also present NestedVM, an implementation of this technique, and
  59 discuss six examples: LINPACK (Fortran), which was used as one of our
  60 performance tests, \TeX\ (Pascal), which was used to typeset this
  61 document, {\tt libjpeg}, {\tt libmspack}, and FreeType (all C source),
  62 which are currently in production use as part of the Ibex Project, and
  63 {\tt gcc}, which was used to compile all of the aforementioned.
  64
  65 Performance measurements indicate a best case performance within 3x of
  66 native code and worst case typically within 10x, making it an
  67 attractive solution for code which is not performance-critical.
  68
  69 \end{abstract}
  70
  71 \section{Introduction}
  72
  73 Unsafe languages such as C \cite{KR} and C++ \cite{soustroup} have
  74 been in use much longer than any of today's widely accepted safe
  75 languages such as Java \cite{java} and C\# \cite{csharp}.  Consequently, there is
  76 a huge library of software written in these languages.  Although safe
  77 languages offer substantial benefits, their comparatively young age
  78 often puts them at a disadvantage when breadth of existing support
  79 code is an important criterion.
  80
  81 The typical solution to this dilemma is to use a native interface such
  82 as JNI \cite{jni} or CNI \cite{cni} to invoke unsafe code from within a
  83 virtual machine or otherwise safe environment.  Unfortunately, there
  84 are a number of situations in which this is not an acceptable
  85 solution.  These situations can be broadly classified into two
  86 categories: {\it security concerns} and {\it portability concerns}.
  87
  88 Using Java as an example, JNI and CNI are prohibited in a number of
  89 contexts, including applets environments and servlet containers with a
  90 {\tt SecurityManager}.  Additionally, even in the context of trusted
  91 code, {\tt native} methods invoked via JNI are susceptible to buffer
  92 overflow and heap corruption attacks which are not a concern for
  93 verified bytecode.
  94
  95 The second class of disadvantages revolves around portability
  96 concerns; native interfaces require the native library to be compiled
  97 ahead of time, for every architecture on which they will be
  98 deployed.  This is unworkable for situations in which the full set of
  99 target architectures is not known at deployment time.  Additionally,
 100 some JVM platform variants such as J2ME \cite{j2me} simply do not offer
 101 support for native code.
 102
 103 The technique we present here uses typical compiler to compile unsafe
 104 code into a MIPS binary, which is then translated on an
 105 instruction-by-instruction basis into Java bytecode.  The technique
 106 presented here is general; we anticipate that it can be applied to
 107 other secure virtual machines such as Microsoft's .NET \cite{msil}, Perl
 108 Parrot \cite{parrot}, or Python bytecode \cite{python}.
 109
 110 \section{Approaches to Translation}
 111
 112 The four program representations of interest in this context, along
 113 with their specific types in the C-to-JVM instantiation of the
 114 problem are shown in the following diagram:
 115
 116 \begin{pdfpic}
 117 \newlength{\MyLength}
 118 \settowidth{\MyLength}{machine code}
 119 \newcommand{\MyBox}[1]{\makebox[\MyLength][c]{#1}}
 120 \begin{psmatrix}[colsep=2,rowsep=0]
 121   & \\[0pt]
 122   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 123   [name=s00]\MyBox{\tt (.c)} & [name=s11]\MyBox{\tt (.java)}   \\[0pt]
 124   & \\[0pt]
 125   & \\[0pt]
 126   & \\[0pt]
 127   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 128   [name=b00]\MyBox{\tt (.o)}  & [name=b11]\MyBox{\tt (.class)} \\
 129   & \\[0pt]
 130   \psset{nodesep=5pt,arrows=->}
 131 \end{psmatrix}
 132 \end{pdfpic}
 133
 134 To illustrate the context of this diagram, the following arcs show the
 135 translations performed by a few familiar tools:
 136
 137 \begin{pdfpic}
 138 \newlength{\MyLength}
 139 \settowidth{\MyLength}{xmachine codex}
 140 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 141 \psmatrix[colsep=2,rowsep=0,nrot=:D]
 142   & \\[0pt]
 143   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 144   & \\[0pt]
 145   & \\[0pt]
 146   & \\[0pt]
 147   & \\[0pt]
 148   & \\[0pt]
 149   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 150   & \\[0pt]
 151   \psset{nodesep=5pt,arrows=->}
 152   \ncline{s0}{b0}\bput{:U}{\tt gcc}
 153   \ncline{s1}{b0}\bput{:D}{\tt gcj}
 154   \ncline{s1}{b1}\aput{:U}{\tt javac}
 155   \ncline{b1}{b0}\aput{:D}{\tt gcj}\bput{:D}{JITs}
 156 \endpsmatrix
 157 \end{pdfpic}
 158
 159 Techniques for translating unsafe code into VM bytecode generally fall
 160 into four categories, which we expand upon in the next two sections:
 161
 162 \begin{itemize}
 163 \item source-to-source translation
 164 \item source-to-binary translation
 165 \item binary-to-source translation
 166 \item binary-to-binary translation
 167 \end{itemize}
 168
 169 \section{Existing Work}
 170
 171 \subsection{Source-to-Source Translation}
 172
 173 The most common technique employed is partial translation from unsafe
 174 source code to safe source code:
 175
 176 \begin{pdfpic}
 177 \newlength{\MyLength}
 178 \settowidth{\MyLength}{xmachine codex}
 179 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 180 \psmatrix[colsep=2,rowsep=0,nrot=:U]
 181   & \\[0pt]
 182   & \\[0pt]
 183   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 184   & \\[0pt]
 185   & \\[0pt]
 186   & \\[0pt]
 187   & \\[0pt]
 188   & \\[0pt]
 189   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 190   & \\[0pt]
 191   \psset{nodesep=5pt,arrows=->}
 192   \ncline{s0}{s1}\aput{:U}{source-to}\bput{:U}{source}
 193   \ncline{s1}{b1}\aput{:U}{\tt javac}
 194 \endpsmatrix
 195 \end{pdfpic}
 196
 197 A number of existing systems employ this technique; they can
 198 be divided into two categories: those which perform a partial
 199 translation which is completed by a human, and those which perform a
 200 total translation but fail (yield an error) on a large class of input
 201 programs.
 202
 203
 204 \subsubsection{Incomplete Translation}
 205
 206 Jazillian \cite{jazillian} is a commercial solution which produces
 207 extremely readable Java source code from C source code, but ony
 208 translates a small portion of the C language.  Jazillian is unique in
 209 that in addition to {\it language migration}, it also performs {\it
 210 API migration}; for example, Jazillian is intelligent enough
 211 to translate {\tt char*~s1~=~strcpy(s2)} into {\tt String~s1~=~s2}.
 212
 213 Unfortunately such deep analysis is intractible for most of the C
 214 language and standard library; Jazillian's documentation notes that
 215 {\it ``This is not your father's language translator.  It's not
 216 generating ugly code that's guaranteed to work out of the
 217 box... Jazillian does not always produce code that works correctly.''}
 218
 219 MoHCA-Java \cite{mohca} is the other major tool in this category, and steps
 220 beyond Jazillian by providing tools for analysis of the source C++
 221 abstract syntax tree.  Additionally, MoHCA-Java's analysis engine is
 222 extensible, making it a platform for constructing application-specific
 223 translators rather than a single translation tool.  However,
 224 MoHCA-Java does not always generate complete Java code for all of the C++
 225 programs which it accepts.
 226
 227
 228 \subsubsection{Partial Domain Translation}
 229
 230 The c2j \cite{c2j}, c2j++ \cite{c2jpp}, Cappucinno \cite{capp},
 231 and Ephedra \cite{ephedra} systems each provide support for complete
 232 translation of a {\it subset} of the source language (C or C++).  Each
 233 of the four tools supports a progressively greater subset than the one
 234 preceding it; however none covers the entire input language.
 235
 236 Ephedra, the most advanced of the four, supports most of the C++
 237 language, and claims to produce ``human readable'' Java code as
 238 output.  Notable omissions from the input domain include support for
 239 fully general pointer arithmetic, casting between unrelated types, and
 240 reading from a {\tt union} via a different member than the one most
 241 recently written.
 242
 243 Unfortunately, when the program being translated is large and complex,
 244 it is quite likely that it will use an unsupported feature in at least
 245 one place.  In the absence of a programmer who understands the source
 246 program, a single anomoly is often enough to render the entire
 247 translation process useless.  As a result, these tools are mainly
 248 useful as an {\it aid} to programmers who could normally perform the
 249 conversion themselves, but want to save time by automating most of the
 250 process.
 251
 252
 253 \subsection{Source-to-Binary Translation}
 254
 255 Source-to-binary translation involves a compiler for the unsafe
 256 language which has been modified to emit safe bytecode.
 257
 258 \begin{pdfpic}
 259 \newlength{\MyLength}
 260 \settowidth{\MyLength}{xmachine codex}
 261 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 262 \psmatrix[colsep=2,rowsep=0,nrot=:U]
 263   & \\[0pt]
 264   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 265   & \\[0pt]
 266   & \\[0pt]
 267   & \\[0pt]
 268   & \\[0pt]
 269   & \\[0pt]
 270   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 271   & \\[0pt]
 272   \psset{nodesep=5pt,arrows=->}
 273   \ncline{s0}{b1}\bput{:U}{source-to-binary}
 274 \endpsmatrix
 275 \end{pdfpic}
 276
 277 The primary occupant of this category is {\tt egcs-jvm}
 278 \cite{egcsjvm}, an experimental ``JVM backend'' for the GNU Compiler
 279 Collection ( {\tt gcc} ) \cite{gcc}.  Since {\tt gcc} employs a highly
 280 modular architecture, it {\it is} possible to add RTL code generators
 281 for nonstandard processors.  However, {\tt gcc}'s parsing, RTL
 282 generation, and optimization layers make fundamental assumptions (such
 283 as the availability of pointer math) which cannot be directly
 284 supported; thus the compiler still fails for a substantial class of
 285 input programs.
 286
 287 A Java backend for the {\tt lcc} [CITE] compiler, known as {\tt
 288 lcc-java} [CITE], but was not completed.  {\tt lcc-java} also lacks
 289 any form of system library ({\tt libc}), so very few C programs will
 290 run without custom modification, which would cause them to diverge
 291 from the upstream sources.  Finally, {\tt lcc-java}'s memory model is
 292 much more restricted; it uses a fixed-size array to represent all
 293 memory, and expands the array by allocating a new array and copying,
 294 which is extremely inefficient.  No attempt is made to take advantage
 295 of {\tt NullPointerException} checking (which costs nothing if the
 296 exception is not thrown since most JVMs use the MMU to detect this).
 297 Finally, {\tt lcc-java} targets Java source code, which places the
 298 vast majority of NestedVM's optimizations beyond its reach, and
 299 severely restricts the maximum program size {\tt lcc-java} can handle.
 300
 301 Finally, {\tt lcc-java} maintains a separate memory area for the
 302 stack, which appears to limit the exchange of stack pointers and heap
 303 pointers.  It is unclear from the documentation how this is handled.
 304
 305
 306 \section{NestedVM}
 307
 308 The principal difference between NestedVM and other approaches is that
 309 NestedVM {\it does not} attempt to deal with source code as an input.
 310 This leads immediately to three advantages:
 311
 312 \begin{itemize}
 313 \item {\bf Language Agnostic}
 314
 315       Because NestedVM does not attempt to implement the parsing and
 316       code generation steps of compilation, it is freed from the
 317       extremely complex task of faithfully implementing languages
 318       which are often not fully or formally specified (such as C and
 319       C++), and is able to support any langage for which a
 320       MIPS-targeted compiler exists.
 321
 322 \item {\bf Bug-for-bug compiler compatability}
 323
 324       Since NestedVM uses the compiler's {\it output} as its own {\it
 325       input}, it ensures that programs which are inadvertently
 326       dependent on the vagaries of a particular compiler can still be
 327       used.
 328
 329 \item {\bf No build process modifications}
 330
 331       NestedVM does not modify existing build processes, which can be
 332       extremely complex and dependent on strange preprocessor usage as
 333       well as the complex interplay between compiler switches and
 334       header file locations.
 335
 336 \end{itemize}
 337
 338 NestedVM's approach carries a fourth benefit as well, arising from its
 339 totality:
 340
 341 \begin{itemize}
 342 \item {\bf No post-translation human intervention}
 343
 344       NestedVM offers total support for all non-privileged
 345       instructions, registers, and resources found on a MIPS {\tt
 346       R2000} CPU, including the add/multiply unit and floating point
 347       coprocessor.  As such, it constitutes a total function mapping
 348       from the entire domain of (non-kernel-mode) programs onto (a
 349       subset of) the semantics of the Java Virtual Machine.  This
 350       ensures that the translation process is fully automated and
 351       always succeeds for valid input binaries.
 352 \end{itemize}
 353
 354 This is a much more important factor than is obvious at first glance.
 355 If post-translation human intervention is required, then the {\it
 356 human becomes part of the build process}.  This means that if a third
 357 party library used in the project needs to be upgraded, {\it a human
 358 must intervene} in the rebuild process.  In addition to slowing the
 359 process and introducing opportunities for error, this task often
 360 requires specialized knowledge which becomes tied to the particular
 361 individual performing this task, rather than being encoded in build
 362 scripts which persist throughout the lifetime of the project.
 363
 364 \subsection{Why MIPS?}
 365
 366 We chose MIPS as a source format for three reasons: the availability
 367 of tools to compile legacy code into MIPS binaries, the close
 368 similarity (FIXME: explain) between the MIPS ISA and the Java Virtual
 369 Machine, and the relatively high degree of program structure that can
 370 be inferred from ABI-adherent binaries.
 371
 372 The MIPS architecture has been around for quite some time, and is well
 373 supported by the GNU Compiler Collection, which is capable of
 374 compiling C, C++, Java, Fortran, Pascal, and Objective C
 375 into MIPS binaries.
 376
 377 The MIPS R2000 ISA bears a striking similarity to the Java Virtual
 378 Machine:
 379
 380 \begin{itemize}
 381
 382 \item Most of the instructions in the original MIPS ISA operate only
 383       on 32-bit aligned memory locations. This allows NestedVM to
 384       represent memory as a Java {\tt int[]} array without introducing
 385       additional overhead.  The remaining non-aligned memory load
 386       instructions are only rarely emitted by most compilers since
 387       they carry a performance penalty on physical MIPS
 388       implementations.
 389
 390 \item Unlike its predecessor, the R2000 supports 32-bit by 32-bit
 391       multiply and divide instructions as well as a single and double
 392       precision floating point unit.  These capabilities map nicely
 393       onto Java's arithmetic instructions.
 394
 395 \item Although MIPS offers unsigned arithmetic and Java does not, few
 396       MIPS instructions actually depend on non-two's-complement
 397       handling of integer math.  Furthermore, since most high-level
 398       languages such as C do not expose access to arithmetic-overflow
 399       exceptions, these instructions are rarely found except in
 400       hand-coded assembler.  In the few situations where these
 401       instructions {\it are} encountered, the {\tt unsigned int} is
 402       cast (bitwise) to a Java {\tt long}, the operation is performed,
 403       and the result is cast back.  On architectures offering 64-bit
 404       integer math this conversion carries no overhead.
 405
 406 \end{itemize}
 407
 408 Finally, the MIPS ISA and ABI convey quite a bit of information about
 409 program structure.  This information can be used for optimization
 410 purposes:
 411
 412 \begin{itemize}
 413
 414 \item The structure of MIPS branching and jump instructions make it
 415       easy to infer the set of likely target instructions.
 416
 417 \item The MIPS ABI specifies particular registers as caller-save and
 418       callee-save, as well as designating a register for the return
 419       address after a function call.  This allows NestedVM to optimize
 420       many operations for the common case of ABI-adherent binaries.
 421
 422 \item All MIPS instructions are exactly 32 bits long.
 423
 424 \end{itemize}
 425
 426
 427
 428 \subsection{Binary-to-Source}
 429
 430 The simplest operational mode for NestedVM is binary-to-source
 431 translation.  In this mode, NestedVM translates MIPS binaries into
 432 Java source code, which is then fed to a Java compiler in order to
 433 produce bytecode files:
 434
 435 \begin{pdfpic}
 436 \newlength{\MyLength}
 437 \settowidth{\MyLength}{xmachine codex}
 438 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 439 \psmatrix[colsep=2,rowsep=0,nrot=:U]
 440   & \\[0pt]
 441   & \\[0pt]
 442   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 443   & \\[0pt]
 444   & \\[0pt]
 445   & \\[0pt]
 446   & \\[0pt]
 447   & \\[0pt]
 448   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 449   \psset{nodesep=5pt,arrows=->}
 450   \ncline{s0}{b0}\bput{:U}{\tt gcc}
 451   \ncline{s1}{b1}\aput{:U}{\tt javac}
 452   \ncline{b0}{s1}\naput{\tt NestedVM}
 453 \endpsmatrix
 454 \end{pdfpic}
 455
 456 \begin{figure*}[t]
 457 \begin{minipage}[c]{7in}%
 458 \begin{multicols}{2}
 459 {\footnotesize\begin{verbatim}
 460 private final static int r0 = 0;
 461 private int r1, r2, r3,...,r30;
 462 private int r31 = 0xdeadbeef;
 463 private int pc = ENTRY_POINT;
 464
 465 public void run() {
 466     for(;;) {
 467         switch(pc) {
 468             case 0x10000:
 469                 r29 = r29 - 32;
 470             case 0x10004:
 471                 r1 = r4 + r5;
 472             case 0x10008:
 473                 if(r1 == r6) {
 474                     /* delay slot */
 475                     r1 = r1 + 1;
 476                     pc = 0x10018:
 477                     continue;
 478                 }
 479             case 0x1000C:
 480                 r1 = r1 + 1;
 481             case 0x10010:
 482                 r31 = 0x10018;
 483                 pc = 0x10210;
 484                 continue;
 485             case 0x10014:
 486                 /* nop */
 487             case 0x10018:
 488                 pc = r31;
 489                 continue;
 490             ...
 491             case 0xdeadbeef:
 492                 System.err.println(``Exited.'');
 493                 System.exit(1);
 494         }
 495     }
 496 }
 497 \end{verbatim}}
 498 \vspace{1in}
 499 {\footnotesize\begin{verbatim}
 500 public void run_0x10000() {
 501     for(;;) {
 502     switch(pc) {
 503         case 0x10000:
 504             ...
 505         case 0x10004:
 506             ...
 507         ...
 508         case 0x10010:
 509             r31 = 0x10018;
 510             pc = 0x10210;
 511             return;
 512         ...
 513     }
 514     }
 515 }
 516
 517 pubic void run_0x10200() {
 518     for(;;) {
 519     switch(pc) {
 520         case 0x10200:
 521             ...
 522         case 0x10204:
 523             ...
 524     }
 525     }
 526 }
 527
 528 public void trampoline() {
 529     for(;;) {
 530     switch(pc&0xfffffe00) {
 531             case 0x10000: run_0x10000(); break;
 532             case 0x10200: run_0x10200(); break;
 533             case 0xdeadbe00:
 534                 ...
 535         }
 536     }
 537 }
 538 \end{verbatim}}
 539 \end{multicols}
 540 \end{minipage}
 541 \caption{\label{code1} Trampoline transformation necessitated by Java's 64kb method size limit}
 542 \end{figure*}
 543
 544 Translating unsafe code for use within a JVM proceeds as follows:
 545
 546 \begin{enumerate}
 547
 548 \item Compile the source code to a statically linked binary, targeting
 549       the MIPS R2000 ISA.  Typically this will involve linking against
 550       {\tt libc}, which translates system requests (such as {\tt
 551       open()}, {\tt read()}, or {\tt write()}) into appropriate
 552       invocations of the MIPS {\tt SYSCALL} instruction.  Any other
 553       libraries which are not tied to a particular OS kernel can be
 554       linked in (even in binary form) using standard linker commands.
 555
 556 \item Invoke {\tt NestedVM} on the statically linked binary.
 557
 558 \item Compile the resulting {\tt .java} code using {\tt jikes}
 559       \cite{jikes} or {\tt javac}.
 560
 561 \item From java code, invoke the {\tt run()} method on the generated
 562       class.  This is equivalent to the {\tt main()} entry point.
 563
 564 \end{enumerate}
 565
 566 \subsubsection{Optimizations}
 567
 568 Generating Java source code instead of bytecode frees NestedVM from
 569 having to perform simple constant propagation optimizations, as most
 570 Java compilers already do this.  A recurring example is the treatment
 571 of the {\tt r0} register, which is fixed as {\tt 0} in the MIPS ISA.
 572
 573 Lacking the ability to generate specially optimized bytecode
 574 sequences, a straightforward mapping of the general purpose hardware
 575 registers to 32 {\tt int} fields turned out to be optimal.
 576
 577
 578 \epsfig{file=charts/chart1,width=3in}
 579
 580 Unfortunately, Java imposes a 64kb limit on the size of the bytecode
 581 for a single method.  This presents a problem for NestedVM, and
 582 necessitates a {\it trampoline transformation}, as shown in
 583 Figure~\ref{code1}.  With this trampoline in place, large binaries can
 584 be handled without much difficulty -- fortunately, there is no
 585 corresponding limit on the size of a classfile as a whole.
 586
 587 One difficulty that arose as a result of using the trampoline
 588 transformation was the fact that {\tt javac} and {\tt jikes} are
 589 unable to properly optimize its switch statements.  For example, the
 590 following code is compiled into a comparatively inefficient {\tt
 591 LOOKUPSWITCH}:
 592
 593 {\footnotesize
 594 \begin{verbatim}
 595     switch(pc&0xffffff00) {
 596         case 0x00000100: run_100(); break;
 597         case 0x00000200: run_200(); break;
 598         case 0x00000300: run_300(); break;
 599     }
 600 \end{verbatim}}
 601
 602 Whereas the next block of code code optimized into a {\tt
 603 TABLESWITCH}:
 604
 605 {\footnotesize
 606 \begin{verbatim}
 607     switch(pc>>>8) {
 608         case 0x1: run_100();
 609         case 0x2: run_200();
 610         case 0x3: run_300();
 611     }
 612 \end{verbatim}}
 613
 614 This problem was surmounted by switching on a denser set of {\tt case}
 615 values, which is more amenable to the {\tt TABLESWITCH} structure.
 616 This change alone nearly doubled the speed of the compiled binary.
 617
 618 The next performance improvement came from tuning the size of the
 619 methods invoked from the trampoline.  Trial and error led to the
 620 onclusion that HotSpot \cite{hotspot} -- the most widely deployed JVM
 621 -- performs best when 128 MIPS instructions are mapped to each method.
 622
 623 \epsfig{file=chart5,width=3in}
 624
 625 \epsfig{file=chart6,width=3in}
 626
 627 This phenomenon is due to two factors:
 628
 629 \begin{itemize}
 630
 631 \item While the trampoline method's {\tt switch} statement can be
 632       coded as a {\tt TABLESWITCH}, the {\tt switch} statement
 633       within the individual methods is to sparse to encode this way.
 634
 635 \item Hybrid Interpretive-JIT compilers such as HotSpot generally
 636       favor smaller methods since they are easier to compile and are
 637       better candidates for compilation in ``normal'' programs (unlike
 638       NestedVM programs).
 639
 640 \end{itemize}
 641
 642 After tuning method sizes, our next performance boost came from
 643 eliminating exraneous case branches.  Having case statements before
 644 each instruction prevents JIT compilers from being able to optimize
 645 across instruction boundaries, since control flow can enter the body
 646 of a {\tt switch} statement at any of the {\tt case}s.  In order to
 647 eliminate unnecessary case statements we needed to identify all
 648 possible jump targets.  Jump targets can come from three sources:
 649
 650 \begin{itemize}
 651
 652 \item {\bf The {\tt .text} segment}
 653
 654       Every instruction in the text segment is scanned, and every
 655       branch instruction's destination is added to the list of
 656       possible branch targets.  In addition, any function that sets
 657       the link register is added to the list \footnote{actually {\tt addr+8}}.
 658       Finally, combinations of {\tt LUI} (Load Upper Immediate) and
 659       {\tt ADDIU} (Add Immediate Unsigned) are scanned for possible
 660       addresses in the {\tt .text} segment since this combination of
 661       instructions is often used to load a 32-bit word into a
 662       register.
 663
 664 \item {\bf The {\tt .data} segment}
 665
 666       When compiling {\tt switch} statements, compilers often use a
 667       jump table stored in the {\tt .data} segment.  Unfortunately
 668       they typically do not identify these jump tables in any way.
 669       Therefore, the entire {\tt .data} segment is conservatively
 670       scanned for possible addresses in the {\tt .text} segment.
 671
 672 \item {\bf The symbol table}
 673
 674       This is mainly used as a backup.  Scanning the {\tt .text} and
 675       {\tt .data} segments should identify any possible jump targets;
 676       however, adding all function symbols in the ELF symbol table
 677       also catches functions that are never called directly from the
 678       MIPS binary, such as those invoked only via the NestedVM
 679       runtime's {\tt call()} method.
 680
 681 \end{itemize}
 682
 683 Eliminating unnecessary {\tt case} statements provided a 10-25\% speed
 684 increase.
 685
 686 Despite all the above optimizations, one insurmountable obstacle
 687 remained: the Java {\tt .class} file format limits the constant pool
 688 to 65535 entries.  Every integer literal greater than {\tt 32767}
 689 requires an entry in this pool, and each branch instruction generates
 690 one of these.
 691
 692 One suboptimal solution was to express constants as offsets from a few
 693 central values; for example ``{\tt pc~=~N\_0x00010000~+~0x10}'' (where
 694 {\tt N\_0x000100000} is a non-final field to prevent {\tt javac} from
 695 inlining it).  This was sufficient to get reasonably large binaries to
 696 compile, and caused only a small (approximately 5\%) performance
 697 degredation and a similarly small increase in the size of the {\tt
 698 .class} file.  However, as we will see in the next section, compiling
 699 directly to {\tt .class} files (without the intermediate {\tt .java}
 700 file) eliminates this problem entirely.
 701
 702
 703 \subsection{Binary-to-Binary}
 704
 705 After implementing the binary-to-source compiler, a binary-to-binary
 706 translation mode was added.
 707
 708 \begin{pdfpic}
 709 \newlength{\MyLength}
 710 \settowidth{\MyLength}{xmachine codex}
 711 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 712 \psmatrix[colsep=2,rowsep=0,nrot=:U]
 713   & \\[0pt]
 714   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 715   & \\[0pt]
 716   & \\[0pt]
 717   & \\[0pt]
 718   & \\[0pt]
 719   & \\[0pt]
 720   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 721   & \\[0pt]
 722   \psset{nodesep=5pt,arrows=->}
 723   \ncline{s0}{b0}\bput{:U}{\tt gcc}
 724   \ncline{b0}{b1}\naput{\tt NestedVM}
 725 \endpsmatrix
 726 \end{pdfpic}
 727
 728 This mode has several advantages:
 729
 730 \begin{itemize}
 731
 732 \item There are quite a few interesting bytecode sequences that cannot
 733       be generated as a result of compiling Java source code.
 734
 735 \item Directly generating {\tt .class} files Eliminates the
 736       time-consuming {\tt javac} step.
 737
 738 \item Direct compilation to {\tt .class} files opens up the
 739       interesting possibility of dynamically translating MIPS binaries
 740       and loading them via {\tt ClassLoader.fromBytes()} {\it at
 741       deployment time}, eliminating the need to compile binaries ahead
 742       of time.
 743
 744 \end{itemize}
 745
 746 Most of the performance improvemen where made where in the handling of
 747 branch instructions and in taking advantage of the JVM stack to
 748 eliminate unnecessary {\tt LOAD}s and {\tt STORE}s to local variables.
 749
 750 \epsfig{file=chart7,width=3in}
 751
 752 The first optimization gained by direct bytecode generation came from
 753 the use of the JVM {\tt GOTO} instruction.  Despite the fact that the
 754 Java {\it language} does not have a {\tt goto} keyword, the VM does in
 755 fact have a corresponding instruction which is used quite heavily by
 756 {\tt javac}.  NestedVM's binary-to-binary mode exploits this
 757 instruction to avoid emitting inefficient {\tt switch..case}
 758 structures.
 759
 760 Related to the {\tt GOTO} instruction is branch statement
 761 optimization.  When emitting source code, NestedVM translates branches
 762 into Java source code like this:
 763
 764 {\footnotesize\begin{verbatim}
 765     if (condition) {
 766         pc = TARGET;
 767         continue;
 768     }
 769 \end{verbatim}}
 770
 771 This requires a branch in the JVM {\it regardless} of whether the MIPS
 772 branch is actually taken.  If {\tt condition} is false the JVM has to
 773 jump over the code to set {\tt pc} and go back to the {\tt switch}
 774 statemenmt; if {\tt condition} is true the JVM has to jump to the {\tt
 775 switch} block.  By generating bytecode directly, NestedVM is able to
 776 emit a JVM bytecode branching directly to the address corresponding to
 777 the target of the MIPS branch.  In the case where the branch is not
 778 taken the JVM doesn't branch at all.
 779
 780 A side effect of the previous two optimizations is a solution to the
 781 excess constant pool entries problem.  When jumps are implemented as
 782 {\tt GOTO}s and branches are taken directly, the {\tt pc} field does
 783 not need to be set.  This eliminates a huge number of constant pool
 784 entries.  The {\tt .class} file constant pool size limit is still
 785 present, but it is less likely to be encountered.
 786
 787 Implementation of the MIPS delay slot offers another opportunity for
 788 bytecode-level optimization.  In order to take advantage of
 789 instructions already in the pipeline, the MIPS ISA specifies that the
 790 instruction after a jump or branch is always executed, even if the
 791 jump/branch is taken.  This instruction is referred to as the ``delay
 792 slot\footnote{Newer MIPS CPUs have pipelines that are much larger than
 793 early MIPS CPUs, so they have to discard instructions anyways}.''  The
 794 instruction in the delay slot is actually executed {\it before} the
 795 branch is taken.  To further complicate matters, values from the
 796 register file are loaded {\it before} the delay slot is executed.
 797
 798 Fortunately there is a very elegent solution to this problem which can
 799 be expressed in JVM bytecode.  When a branch instruction is
 800 encountered, the registers needed for the comparison are pushed onto
 801 the stack to prepare for the JVM branch instruction.  Then, {\it
 802 after} the values are on the stack the delay slot instruction is
 803 emitted, followed by the actual JVM branch instruction.  Because the
 804 values were pushed to the stack before the delay slot was executed, any
 805 changes the delay slot made to the registers are not visible to the
 806 branch bytecode.
 807
 808 One final advantage that generating bytecode directly allows is a
 809 reduction in the size of the ultimate {\tt .class} file.  All the
 810 optimizations above lead to more compact bytecode as a beneficial side
 811 effect; in addition, NestedVM performs a few additional optimizations.
 812
 813 When encountering the following {\tt switch} block, both {\tt javac}
 814 and {\tt jikes} generate redundant bytecode.
 815
 816 {\footnotesize\begin{verbatim}
 817     switch(pc>>>8) {
 818         case 0x1: run_1(); break;
 819         case 0x2: run_2(); break
 820         ...
 821         case 0x100: run_100(); break;
 822     }
 823 \end{verbatim}}
 824
 825 The first bytecode in each case arm in the switch statement is {\tt
 826 ALOAD\_0} to prepare for a {\tt INVOKESPECIAL} call.  By simply
 827 lifting this bytecode outside of the {\tt switch} statement, each {\tt
 828 case} arm shrinks by one instruction.
 829
 830 \subsubsection{Compiler Flags}
 831
 832 Although NestedVM perfectly emulates a MIPS R2000 CPU, its performance
 833 profile is nothing like that of actual silicon.  In particular, {\tt
 834 gcc} makes several optimizations that increase performance on an
 835 actually MIPS CPU but actually decrease the performance of
 836 NestedVM-generated bytecode.  We found the following compiler options
 837 could be used to improve performance:
 838
 839 \begin{itemize}
 840
 841 \item {\tt -falign-functions}
 842
 843       Normally a function's location in memory has no effect on its
 844       execution speed.  However, in the NestedVM binary translator,
 845       the {\tt .text} segment is split on power-of-two boundaries.  If
 846       a function starts near the end of one of these boundaries, a
 847       performance critical part of the function winds up spanning two
 848       Java methods.  Telling {\tt gcc} to align all functions along
 849       these boundaries decreases the chance of this sort of splitting.
 850
 851 \item {\tt -fno-rename-registers}
 852
 853       On an actual silicon chip, using additional registers carries no
 854       performance penalty (as long as none are spilled to the stack).
 855       However, when generating bytecode, using {\it fewer}
 856       ``registers'' helps the JVM optimize the machine code it
 857       generates by simplifying the constraints it needs to deal with.
 858       Disabling register renaming has this effect.
 859
 860 \item {\tt -fno-schedule-insns}
 861
 862       Results of MIPS load operations are not available until {\it
 863       two} instructions after the load.  Without the {\tt
 864       -fno-schedule-insns} instruction, {\tt gcc} will attempt to
 865       reorder instructions to do other useful work during this period
 866       of unavailability.  NestedVM is under no such constraint, so
 867       removing this reordering typically generates simpler machine
 868       code.
 869
 870 \item {\tt -mmemcpy}
 871
 872       Enabling this instruction causes {\tt gcc} to use the system
 873       {\tt memcpy()} routine instead of generating loads and stores.
 874       As explained in the next section, the NestedVM runtime
 875       implements {\tt memcpy()} using {\tt System.arraycopy()}, which
 876       is substantially more efficient.
 877
 878 \item {\tt -ffunction-sections -fdata-sections}
 879
 880       These two options are used in conjunction with the {\tt
 881       --gc-section} linker option, prompting the linker to more
 882       aggressively prune dead code.
 883
 884 \end{itemize}
 885
 886 The effects of the various optimizations presented in this chapter are
 887 summarized in the table below.
 888
 889 \epsfig{file=chart4,width=3in}
 890
 891 \epsfig{file=chart3,width=3in}
 892
 893 \section{Experiences}
 894
 895 \subsection{FreeType, {\tt libmspack}, and {\tt libjpeg}}
 896
 897 The Ibex Project utilizes three libraries for which no Java-only
 898 equivalent exists.  The first is the FreeType font library, which
 899 parses, hints, and rasterizes TrueType and Postscript fonts with
 900 exceptional quality.  The project also needed an open source JPEG
 901 decompressor; surprisingly, none exist for Java.  While encoders are
 902 plentiful
 903
 904 These three libraries make minimal use of the standard library and OS
 905 services, and are all written in very portable ANSI C code, which made
 906 them easy targets for initial development.
 907
 908 \subsection{The GNU Compiler Collection}
 909
 910 Our next target, {\tt gcc}, was chosen in order to relieve developers
 911 from the time-consuming and complex task of building a compiler
 912 themselves.  The Ibex Project requires a specially configured and
 913 patched version of {\tt gcc} and its ahead-of-time Java compiler ({\tt
 914 gcj}) which is not distributed in binary form.
 915
 916 GCC was the first ``major'' application NestedVM was used on, and
 917 drove the development of most of the system library interface
 918 development; particularly support for {\tt fork()} and {\tt exec()},
 919 which require the NestedVM Runtime to perform binary-to-bytecode
 920 translation on the fly.
 921
 922 GCC also makes extensive use of 64-bit integers ({\tt long long}),
 923 which -- for performance reasons -- are typically manipulated using
 924 nonobvious instruction sequences on the 32-bit MIPS architecture.
 925 Dealing with these operations uncovered a number of bugs in the
 926 translator.
 927
 928 Despite our original goal, we have not yet been able to translate the
 929 {\tt C++} or Java front-ends, as the resulting binary produces a
 930 trampoline which exceeds the maximum size of a single method.  Future
 931 work will explore a multi-level trampoline to address this issue.
 932
 933
 934
 935 \subsection{\TeX and LINPACK}
 936
 937 In order to distinguish NestedVM from other single-language
 938 translators for the JVM, we undertook the task of translating \TeX 89
 939 (written in Pascal) and the Fortran source code for LINPACK into Java
 940 bytecodes.
 941
 942 Although actually producing the initial MIPS binaries from the \TeX\
 943 source code turned out to be an exceptionally tedious and frustrating
 944 task, the resulting binary translated and executed perfectly on the
 945 first run, as did LINPACK.  Our reward for this effort was typesetting
 946 our presentation of NestedVM using NestedVM itself.  We have also had
 947 initial successes running \TeX\ in a Java Applet, and intend to
 948 produce a {\tt jar} for embedding \TeX\ code (``\TeX lets'') in web
 949 pages without the use of a post-processing tool.
 950
 951 The LINPACK benchmark called our attention to Java's lack of an API
 952 for checking the ``cpu time'' of a process.  Unfortunately we had to
 953 substitute wall-clock time on an otherwise-quiescent machine as an
 954 approximation.
 955
 956
 957
 958 \section{The NestedVM Runtime}
 959
 960 In addition to binary-to-source and binary-to-binary translation,
 961 NestedVM also includes a MIPS binary interpreter.  All three
 962 translation approaches expose the same API to both the translated
 963 binary and the surrounding VM (including peer Java code).
 964
 965 \subsection{The Runtime Class}
 966
 967 The runtime fulfills four roles:
 968
 969 \begin{itemize}
 970
 971 \item It provides a simple, consistent external interface.  The method
 972       of actually executing the code (currently only translated
 973       binaries and the interpreter) can be changed without any code
 974       changes to the caller because only runtime exposes a public
 975       interface.  This includes methods to pass arguments to the
 976       binary's {\tt main()} function, read and write from memory, and
 977       call individual functions in the binary.
 978
 979 \item It manages the process's memory.  The runtime class contains
 980       large {\tt int[]} arrays that represent the process`s entire
 981       memory space.  Subclasses read and write to these arrays as
 982       required by the instructions they are executing, and can expand
 983       their memory space using the {\tt sbrk} system call.
 984
 985 \item The runtime provides access to the host file system and standard
 986       I/O streams.  Subclasses of {\tt runtime} can access the file
 987       system through standard UNIX syscalls ({\tt read()}, {\tt
 988       write()}, {\tt open()}, etc).  The runtime manages the file
 989       descriptor table that maps UNIX file descriptors to Java {\tt
 990       RandomAccessFile}s, {\tt InputStream}s, {\tt OutputStream}s, and
 991       {\tt Socket}s.
 992
 993 \item It provides general OS services, including {\tt sleep()}, {\tt
 994       gettimeofday()}, {\tt getpagesize()}, {\tt sysconf()}, {\tt
 995       fcntl()}, and so on.
 996
 997 \end{itemize}
 998
 999 \section{Future Directions}
1000
1001 Although we have only implemented it for the Java Virtual Machine, our
1002 technique generalizes to other safe bytecode architectures.  In
1003 particular we would like to demonstrate this generality by retargeting
1004 the translator to the Microsoft Intermediate Language \cite{msil}.
1005
1006 Additionally, we would like to explore other uses for dynamic loading
1007 of translated MIPS binaries by combining NestedVM (which itself is
1008 written in Java) and the {\tt ClassLoader.fromBytes()} mechanism.
1009
1010
1011 \section{Conclusion}
1012
1013 We have presented a novel technique for using libraries written in
1014 unsafe languages within a safe virtual machine without resorting to
1015 native interfaces.  We have implemented this technique in NestedVM,
1016 which is currently used by the Ibex project\footnote{{\tt
1017 http://www.ibex.org}} to perform font rasterization (via {\tt
1018 libfreetype}), JPEG decoding (via {\tt libjpeg}), and CAB archive
1019 extraction (via {\tt libmspack}), three libraries for which no
1020 equivalent Java classes exist.
1021
1022 NestedVM is available under an open source license, and can be
1023 obtained from
1024 \begin{verbatim}
1025     http://nestedvm.ibex.org
1026 \end{verbatim}
1027
1028
1029 \section{Appendix: Testing Methodology}
1030
1031 All times are measured in seconds. These were all run on a dual 1Ghz
1032 Macintosh G4 running Apple's latest JVM (Sun HotSpot JDK 1.4.1). Each
1033 test was run 8 times within a single VM. The highest and lowest times
1034 were removed and the remaining 6 were averaged.  In each case only the
1035 first run differed significantly from the rest.
1036
1037 The {\tt libjpeg} test consisted of decoding a 1280x1024 jpeg and
1038 writing a tga.  The {\tt mspack} test consisted of extracting all
1039 members from {\tt arial32.exe}, {\tt comic32.exe}, {\tt times32.exe},
1040 and {\tt verdan32.exe}. The {\tt libfreetype} test consisted of
1041 rendering ASCII characters 32-127 of {\tt Comic.TTF} at sizes from 8
1042 to 48 incrementing by 4 for a total of 950 glyphs.
1043
1044 \bibliography{nestedvm}
1045
1046 \end{document}
1047