doc/ivme04.tex

   1 \documentclass{acmconf}
   2 \usepackage{graphicx}
   3 \usepackage{multicol}
   4 \usepackage{amssymb,amsmath,epsfig,alltt}
   5 \sloppy
   6 \usepackage{palatino}
   7 \usepackage{pdftricks}
   8 \begin{psinputs}
   9   \usepackage{pstricks}
  10   \usepackage{pst-node}
  11 \end{psinputs}
  12 \usepackage{parskip}
  13 \usepackage{tabularx}
  14 \usepackage{alltt}
  15 \bibliographystyle{amsplain}
  16
  17 \title{\textbf{\textsf{
  18 Complete Translation of Unsafe Native Code to Safe Bytecode
  19 }}}
  20 \date{}
  21 \author{\begin{tabular}{@{}c@{}}
  22         {\em {Brian Alliet}} \\
  23         {Rochester Institute of Technology}\\
  24         {\tt bja8464@cs.rit.edu}
  25    \end{tabular}\hskip 1in\begin{tabular}{@{}c@{}}
  26         {\em {Adam Megacz}} \\
  27         {University of California, Berkeley} \\
  28         {\tt megacz@cs.berkeley.edu}
  29 \end{tabular}}
  30 \begin{document}
  31
  32 \maketitle
  33
  34 {\it This document was typeset using D. E. Knuth's original \TeX 89,
  35      which was both compiled and executed entirely within a Java
  36      Virtual Machine without the use of native code.}
  37
  38 \begin{abstract}
  39
  40 Existing techniques for using code written in an unsafe language
  41 within a safe virtual machine generally involve transformations from
  42 one source code language (such as C, Pascal, or Fortran) to another
  43 (such as Java) which is the compiled into virtual machine bytecodes.
  44
  45 We present an alternative approach which translate MIPS binaries
  46 produced by any compiler into safe virtual machine bytecodes.  This
  47 approach offers four key advantages over existing techniques:
  48
  49 \begin{itemize}
  50 \item Language-agnostic
  51 \item Bug-for-bug compiler compatability
  52 \item No post-translation human intervention
  53 \item No build process modifications
  54 \end{itemize}
  55
  56 We also present NestedVM, an implementation of this technique, and
  57 discuss six examples: LINPACK (Fortran), which was used as one of our
  58 performance tests, \TeX\ (Pascal), which was used to typeset this
  59 document, {\tt libjpeg}, {\tt libmspack}, and FreeType (all C source),
  60 which are currently in production use as part of the Ibex Project, and
  61 {\tt gcc}, which was used to compile all of the aforementioned.
  62
  63 Performance measurements indicate a best case performance within 3x of
  64 native code and worst case typically within 10x, making it an
  65 attractive solution for code which is not performance-critical.
  66
  67 \end{abstract}
  68
  69 \section{Introduction}
  70
  71 Unsafe languages such as C \cite{KR} and C++ \cite{soustroup} have
  72 been in use much longer than any of today's widely accepted safe
  73 languages such as Java \cite{java} and C\# \cite{csharp}.  Consequently, there is
  74 a huge library of software written in these languages.  Although safe
  75 languages offer substantial benefits, their comparatively young age
  76 often puts them at a disadvantage when breadth of existing support
  77 code is an important criterion.
  78
  79 The typical solution to this dilemma is to use a native interface such
  80 as JNI \cite{jni} or CNI \cite{cni} to invoke unsafe code from within a
  81 virtual machine or otherwise safe environment.  Unfortunately, there
  82 are a number of situations in which this is not an acceptable
  83 solution.  These situations can be broadly classified into two
  84 categories: {\it security concerns} and {\it portability concerns}.
  85
  86 Using Java as an example, JNI and CNI are prohibited in a number of
  87 contexts, including applets environments and servlet containers with a
  88 {\tt SecurityManager}.  Additionally, even in the context of trusted
  89 code, {\tt native} methods invoked via JNI are susceptible to buffer
  90 overflow and heap corruption attacks which are not a concern for
  91 verified bytecode.
  92
  93 The second class of disadvantages revolves around portability
  94 concerns; native interfaces require the native library to be compiled
  95 ahead of time, for every architecture on which they will be
  96 deployed.  This is unworkable for situations in which the full set of
  97 target architectures is not known at deployment time.  Additionally,
  98 some JVM platform variants such as J2ME \cite{j2me} simply do not offer
  99 support for native code.
 100
 101 The technique we present here uses typical compiler to compile unsafe
 102 code into a MIPS binary, which is then translated on an
 103 instruction-by-instruction basis into Java bytecode.  The technique
 104 presented here is general; we anticipate that it can be applied to
 105 other secure virtual machines such as Microsoft's .NET \cite{msil}, Perl
 106 Parrot \cite{parrot}, or Python bytecode \cite{python}.
 107
 108 \section{Approaches to Translation}
 109
 110 The four program representations of interest in this context, along
 111 with their specific types in the C-to-JVM instantiation of the
 112 problem are shown in the following diagram:
 113
 114 \begin{pdfpic}
 115 \newlength{\MyLength}
 116 \settowidth{\MyLength}{machine code}
 117 \newcommand{\MyBox}[1]{\makebox[\MyLength][c]{#1}}
 118 \begin{psmatrix}[colsep=2,rowsep=0]
 119   & \\[0pt]
 120   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 121   [name=s00]\MyBox{\tt (.c)} & [name=s11]\MyBox{\tt (.java)}   \\[0pt]
 122   & \\[0pt]
 123   & \\[0pt]
 124   & \\[0pt]
 125   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 126   [name=b00]\MyBox{\tt (.o)}  & [name=b11]\MyBox{\tt (.class)} \\
 127   & \\[0pt]
 128   \psset{nodesep=5pt,arrows=->}
 129 \end{psmatrix}
 130 \end{pdfpic}
 131
 132 To illustrate the context of this diagram, the following arcs show the
 133 translations performed by a few familiar tools:
 134
 135 \begin{pdfpic}
 136 \newlength{\MyLength}
 137 \settowidth{\MyLength}{xmachine codex}
 138 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 139 \psmatrix[colsep=2,rowsep=0,nrot=:D]
 140   & \\[0pt]
 141   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 142   & \\[0pt]
 143   & \\[0pt]
 144   & \\[0pt]
 145   & \\[0pt]
 146   & \\[0pt]
 147   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 148   & \\[0pt]
 149   \psset{nodesep=5pt,arrows=->}
 150   \ncline{s0}{b0}\bput{:U}{\tt gcc}
 151   \ncline{s1}{b0}\bput{:D}{\tt gcj}
 152   \ncline{s1}{b1}\aput{:U}{\tt javac}
 153   \ncline{b1}{b0}\aput{:D}{\tt gcj}\bput{:D}{JITs}
 154 \endpsmatrix
 155 \end{pdfpic}
 156
 157 Techniques for translating unsafe code into VM bytecode generally fall
 158 into four categories, which we expand upon in the next two sections:
 159
 160 \begin{itemize}
 161 \item source-to-source translation
 162 \item source-to-binary translation
 163 \item binary-to-source translation
 164 \item binary-to-binary translation
 165 \end{itemize}
 166
 167 \section{Existing Work}
 168
 169 \subsection{Source-to-Source Translation}
 170
 171 The most common technique employed is partial translation from unsafe
 172 source code to safe source code:
 173
 174 \begin{pdfpic}
 175 \newlength{\MyLength}
 176 \settowidth{\MyLength}{xmachine codex}
 177 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 178 \psmatrix[colsep=2,rowsep=0,nrot=:U]
 179   & \\[0pt]
 180   & \\[0pt]
 181   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 182   & \\[0pt]
 183   & \\[0pt]
 184   & \\[0pt]
 185   & \\[0pt]
 186   & \\[0pt]
 187   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 188   & \\[0pt]
 189   \psset{nodesep=5pt,arrows=->}
 190   \ncline{s0}{s1}\aput{:U}{source-to}\bput{:U}{source}
 191   \ncline{s1}{b1}\aput{:U}{\tt javac}
 192 \endpsmatrix
 193 \end{pdfpic}
 194
 195 A number of existing systems employ this technique; they can
 196 be divided into two categories: those which perform a partial
 197 translation which is completed by a human, and those which perform a
 198 total translation but fail (yield an error) on a large class of input
 199 programs.
 200
 201
 202 \subsubsection{Incomplete Translation}
 203
 204 Jazillian \cite{jazillian} is a commercial solution which produces
 205 extremely readable Java source code from C source code, but ony
 206 translates a small portion of the C language.  Jazillian is unique in
 207 that in addition to {\it language migration}, it also performs {\it
 208 API migration}; for example, Jazillian is intelligent enough
 209 to translate {\tt char*~s1~=~strcpy(s2)} into {\tt String~s1~=~s2}.
 210
 211 Unfortunately such deep analysis is intractible for most of the C
 212 language and standard library; Jazillian's documentation notes that
 213 {\it ``This is not your father's language translator.  It's not
 214 generating ugly code that's guaranteed to work out of the
 215 box... Jazillian does not always produce code that works correctly.''}
 216
 217 MoHCA-Java \cite{mohca} is the other major tool in this category, and steps
 218 beyond Jazillian by providing tools for analysis of the source C++
 219 abstract syntax tree.  Additionally, MoHCA-Java's analysis engine is
 220 extensible, making it a platform for constructing application-specific
 221 translators rather than a single translation tool.  However,
 222 MoHCA-Java does not always generate complete Java code for all of the C++
 223 programs which it accepts.
 224
 225
 226 \subsubsection{Partial Domain Translation}
 227
 228 The c2j \cite{c2j}, c2j++ \cite{c2jpp}, Cappucinno \cite{capp},
 229 and Ephedra \cite{ephedra} systems each provide support for complete
 230 translation of a {\it subset} of the source language (C or C++).  Each
 231 of the four tools supports a progressively greater subset than the one
 232 preceding it; however none covers the entire input language.
 233
 234 Ephedra, the most advanced of the four, supports most of the C++
 235 language, and claims to produce ``human readable'' Java code as
 236 output.  Notable omissions from the input domain include support for
 237 fully general pointer arithmetic, casting between unrelated types, and
 238 reading from a {\tt union} via a different member than the one most
 239 recently written.
 240
 241 Unfortunately, when the program being translated is large and complex,
 242 it is quite likely that it will use an unsupported feature in at least
 243 one place.  In the absence of a programmer who understands the source
 244 program, a single anomoly is often enough to render the entire
 245 translation process useless.  As a result, these tools are mainly
 246 useful as an {\it aid} to programmers who could normally perform the
 247 conversion themselves, but want to save time by automating most of the
 248 process.
 249
 250
 251 \subsection{Source-to-Binary Translation}
 252
 253 Source-to-binary translation involves a compiler for the unsafe
 254 language which has been modified to emit safe bytecode.
 255
 256 \begin{pdfpic}
 257 \newlength{\MyLength}
 258 \settowidth{\MyLength}{xmachine codex}
 259 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 260 \psmatrix[colsep=2,rowsep=0,nrot=:U]
 261   & \\[0pt]
 262   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 263   & \\[0pt]
 264   & \\[0pt]
 265   & \\[0pt]
 266   & \\[0pt]
 267   & \\[0pt]
 268   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 269   & \\[0pt]
 270   \psset{nodesep=5pt,arrows=->}
 271   \ncline{s0}{b1}\bput{:U}{source-to-binary}
 272 \endpsmatrix
 273 \end{pdfpic}
 274
 275 The primary occupant of this category is {\tt egcs-jvm}
 276 \cite{egcsjvm}, an experimental ``JVM backend'' for the GNU Compiler
 277 Collection ( {\tt gcc} ) \cite{gcc}.  Since {\tt gcc} employs a highly
 278 modular architecture, it {\it is} possible to add RTL code generators
 279 for nonstandard processors.  However, {\tt gcc}'s parsing, RTL
 280 generation, and optimization layers make fundamental assumptions (such
 281 as the availability of pointer math) which cannot be directly
 282 supported; thus the compiler still fails for a substantial class of
 283 input programs.
 284
 285 A Java backend for the {\tt lcc} [CITE] compiler, known as {\tt
 286 lcc-java} [CITE], but was not completed.  {\tt lcc-java} also lacks
 287 any form of system library ({\tt libc}), so very few C programs will
 288 run without custom modification, which would cause them to diverge
 289 from the upstream sources.  Finally, {\tt lcc-java}'s memory model is
 290 much more restricted; it uses a fixed-size array to represent all
 291 memory, and expands the array by allocating a new array and copying,
 292 which is extremely inefficient.  No attempt is made to take advantage
 293 of {\tt NullPointerException} checking (which costs nothing if the
 294 exception is not thrown since most JVMs use the MMU to detect this).
 295 Finally, {\tt lcc-java} targets Java source code, which places the
 296 vast majority of NestedVM's optimizations beyond its reach, and
 297 severely restricts the maximum program size {\tt lcc-java} can handle.
 298
 299 Finally, {\tt lcc-java} maintains a separate memory area for the
 300 stack, which appears to limit the exchange of stack pointers and heap
 301 pointers.  It is unclear from the documentation how this is handled.
 302
 303
 304 \section{NestedVM}
 305
 306 The principal difference between NestedVM and other approaches is that
 307 NestedVM {\it does not} attempt to deal with source code as an input.
 308 This leads immediately to three advantages:
 309
 310 \begin{itemize}
 311 \item {\bf Language Agnostic}
 312
 313       Because NestedVM does not attempt to implement the parsing and
 314       code generation steps of compilation, it is freed from the
 315       extremely complex task of faithfully implementing languages
 316       which are often not fully or formally specified (such as C and
 317       C++), and is able to support any langage for which a
 318       MIPS-targeted compiler exists.
 319
 320 \item {\bf Bug-for-bug compiler compatability}
 321
 322       Since NestedVM uses the compiler's {\it output} as its own {\it
 323       input}, it ensures that programs which are inadvertently
 324       dependent on the vagaries of a particular compiler can still be
 325       used.
 326
 327 \item {\bf No build process modifications}
 328
 329       NestedVM does not modify existing build processes, which can be
 330       extremely complex and dependent on strange preprocessor usage as
 331       well as the complex interplay between compiler switches and
 332       header file locations.
 333
 334 \end{itemize}
 335
 336 NestedVM's approach carries a fourth benefit as well, arising from its
 337 totality:
 338
 339 \begin{itemize}
 340 \item {\bf No post-translation human intervention}
 341
 342       NestedVM offers total support for all non-privileged
 343       instructions, registers, and resources found on a MIPS {\tt
 344       R2000} CPU, including the add/multiply unit and floating point
 345       coprocessor.  As such, it constitutes a total function mapping
 346       from the entire domain of (non-kernel-mode) programs onto (a
 347       subset of) the semantics of the Java Virtual Machine.  This
 348       ensures that the translation process is fully automated and
 349       always succeeds for valid input binaries.
 350 \end{itemize}
 351
 352 This is a much more important factor than is obvious at first glance.
 353 If post-translation human intervention is required, then the {\it
 354 human becomes part of the build process}.  This means that if a third
 355 party library used in the project needs to be upgraded, {\it a human
 356 must intervene} in the rebuild process.  In addition to slowing the
 357 process and introducing opportunities for error, this task often
 358 requires specialized knowledge which becomes tied to the particular
 359 individual performing this task, rather than being encoded in build
 360 scripts which persist throughout the lifetime of the project.
 361
 362 \subsection{Why MIPS?}
 363
 364 We chose MIPS as a source format for three reasons: the availability
 365 of tools to compile legacy code into MIPS binaries, the close
 366 similarity (FIXME: explain) between the MIPS ISA and the Java Virtual
 367 Machine, and the relatively high degree of program structure that can
 368 be inferred from ABI-adherent binaries.
 369
 370 The MIPS architecture has been around for quite some time, and is well
 371 supported by the GNU Compiler Collection, which is capable of
 372 compiling C, C++, Java, Fortran, Pascal, and Objective C
 373 into MIPS binaries.
 374
 375 The MIPS R2000 ISA bears a striking similarity to the Java Virtual
 376 Machine:
 377
 378 \begin{itemize}
 379
 380 \item Most of the instructions in the original MIPS ISA operate only
 381       on 32-bit aligned memory locations. This allows NestedVM to
 382       represent memory as a Java {\tt int[]} array without introducing
 383       additional overhead.  The remaining non-aligned memory load
 384       instructions are only rarely emitted by most compilers since
 385       they carry a performance penalty on physical MIPS
 386       implementations.
 387
 388 \item Unlike its predecessor, the R2000 supports 32-bit by 32-bit
 389       multiply and divide instructions as well as a single and double
 390       precision floating point unit.  These capabilities map nicely
 391       onto Java's arithmetic instructions.
 392
 393 \item Although MIPS offers unsigned arithmetic and Java does not, few
 394       MIPS instructions actually depend on non-two's-complement
 395       handling of integer math.  Furthermore, since most high-level
 396       languages such as C do not expose access to arithmetic-overflow
 397       exceptions, these instructions are rarely found except in
 398       hand-coded assembler.  In the few situations where these
 399       instructions {\it are} encountered, the {\tt unsigned int} is
 400       cast (bitwise) to a Java {\tt long}, the operation is performed,
 401       and the result is cast back.  On architectures offering 64-bit
 402       integer math this conversion carries no overhead.
 403
 404 \end{itemize}
 405
 406 Finally, the MIPS ISA and ABI convey quite a bit of information about
 407 program structure.  This information can be used for optimization
 408 purposes:
 409
 410 \begin{itemize}
 411
 412 \item The structure of MIPS branching and jump instructions make it
 413       easy to infer the set of likely target instructions.
 414
 415 \item The MIPS ABI specifies particular registers as caller-save and
 416       callee-save, as well as designating a register for the return
 417       address after a function call.  This allows NestedVM to optimize
 418       many operations for the common case of ABI-adherent binaries.
 419
 420 \item All MIPS instructions are exactly 32 bits long.
 421
 422 \end{itemize}
 423
 424
 425
 426 \subsection{Binary-to-Source}
 427
 428 The simplest operational mode for NestedVM is binary-to-source
 429 translation.  In this mode, NestedVM translates MIPS binaries into
 430 Java source code, which is then fed to a Java compiler in order to
 431 produce bytecode files:
 432
 433 \begin{pdfpic}
 434 \newlength{\MyLength}
 435 \settowidth{\MyLength}{xmachine codex}
 436 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 437 \psmatrix[colsep=2,rowsep=0,nrot=:U]
 438   & \\[0pt]
 439   & \\[0pt]
 440   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 441   & \\[0pt]
 442   & \\[0pt]
 443   & \\[0pt]
 444   & \\[0pt]
 445   & \\[0pt]
 446   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 447   \psset{nodesep=5pt,arrows=->}
 448   \ncline{s0}{b0}\bput{:U}{\tt gcc}
 449   \ncline{s1}{b1}\aput{:U}{\tt javac}
 450   \ncline{b0}{s1}\naput{\tt NestedVM}
 451 \endpsmatrix
 452 \end{pdfpic}
 453
 454 \begin{figure*}[t]
 455 \begin{minipage}[c]{7in}%
 456 \begin{multicols}{2}
 457 {\footnotesize\begin{verbatim}
 458 private final static int r0 = 0;
 459 private int r1, r2, r3,...,r30;
 460 private int r31 = 0xdeadbeef;
 461 private int pc = ENTRY_POINT;
 462
 463 public void run() {
 464     for(;;) {
 465         switch(pc) {
 466             case 0x10000:
 467                 r29 = r29 - 32;
 468             case 0x10004:
 469                 r1 = r4 + r5;
 470             case 0x10008:
 471                 if(r1 == r6) {
 472                     /* delay slot */
 473                     r1 = r1 + 1;
 474                     pc = 0x10018:
 475                     continue;
 476                 }
 477             case 0x1000C:
 478                 r1 = r1 + 1;
 479             case 0x10010:
 480                 r31 = 0x10018;
 481                 pc = 0x10210;
 482                 continue;
 483             case 0x10014:
 484                 /* nop */
 485             case 0x10018:
 486                 pc = r31;
 487                 continue;
 488             ...
 489             case 0xdeadbeef:
 490                 System.err.println(``Exited.'');
 491                 System.exit(1);
 492         }
 493     }
 494 }
 495 \end{verbatim}}
 496 \vspace{1in}
 497 {\footnotesize\begin{verbatim}
 498 public void run_0x10000() {
 499     for(;;) {
 500     switch(pc) {
 501         case 0x10000:
 502             ...
 503         case 0x10004:
 504             ...
 505         ...
 506         case 0x10010:
 507             r31 = 0x10018;
 508             pc = 0x10210;
 509             return;
 510         ...
 511     }
 512     }
 513 }
 514
 515 pubic void run_0x10200() {
 516     for(;;) {
 517     switch(pc) {
 518         case 0x10200:
 519             ...
 520         case 0x10204:
 521             ...
 522     }
 523     }
 524 }
 525
 526 public void trampoline() {
 527     for(;;) {
 528     switch(pc&0xfffffe00) {
 529             case 0x10000: run_0x10000(); break;
 530             case 0x10200: run_0x10200(); break;
 531             case 0xdeadbe00:
 532                 ...
 533         }
 534     }
 535 }
 536 \end{verbatim}}
 537 \end{multicols}
 538 \end{minipage}
 539 \caption{\label{code1} Trampoline transformation necessitated by Java's 64kb method size limit}
 540 \end{figure*}
 541
 542 Translating unsafe code for use within a JVM proceeds as follows:
 543
 544 \begin{enumerate}
 545
 546 \item Compile the source code to a statically linked binary, targeting
 547       the MIPS R2000 ISA.  Typically this will involve linking against
 548       {\tt libc}, which translates system requests (such as {\tt
 549       open()}, {\tt read()}, or {\tt write()}) into appropriate
 550       invocations of the MIPS {\tt SYSCALL} instruction.  Any other
 551       libraries which are not tied to a particular OS kernel can be
 552       linked in (even in binary form) using standard linker commands.
 553
 554 \item Invoke {\tt NestedVM} on the statically linked binary.
 555
 556 \item Compile the resulting {\tt .java} code using {\tt jikes}
 557       \cite{jikes} or {\tt javac}.
 558
 559 \item From java code, invoke the {\tt run()} method on the generated
 560       class.  This is equivalent to the {\tt main()} entry point.
 561
 562 \end{enumerate}
 563
 564 \subsubsection{Optimizations}
 565
 566 Generating Java source code instead of bytecode frees NestedVM from
 567 having to perform simple constant propagation optimizations, as most
 568 Java compilers already do this.  A recurring example is the treatment
 569 of the {\tt r0} register, which is fixed as {\tt 0} in the MIPS ISA.
 570
 571 Lacking the ability to generate specially optimized bytecode
 572 sequences, a straightforward mapping of the general purpose hardware
 573 registers to 32 {\tt int} fields turned out to be optimal.
 574
 575
 576 \epsfig{file=charts/chart1,width=3in}
 577
 578 Unfortunately, Java imposes a 64kb limit on the size of the bytecode
 579 for a single method.  This presents a problem for NestedVM, and
 580 necessitates a {\it trampoline transformation}, as shown in
 581 Figure~\ref{code1}.  With this trampoline in place, large binaries can
 582 be handled without much difficulty -- fortunately, there is no
 583 corresponding limit on the size of a classfile as a whole.
 584
 585 One difficulty that arose as a result of using the trampoline
 586 transformation was the fact that {\tt javac} and {\tt jikes} are
 587 unable to properly optimize its switch statements.  For example, the
 588 following code is compiled into a comparatively inefficient {\tt
 589 LOOKUPSWITCH}:
 590
 591 {\footnotesize
 592 \begin{verbatim}
 593     switch(pc&0xffffff00) {
 594         case 0x00000100: run_100(); break;
 595         case 0x00000200: run_200(); break;
 596         case 0x00000300: run_300(); break;
 597     }
 598 \end{verbatim}}
 599
 600 Whereas the next block of code code optimized into a {\tt
 601 TABLESWITCH}:
 602
 603 {\footnotesize
 604 \begin{verbatim}
 605     switch(pc>>>8) {
 606         case 0x1: run_100();
 607         case 0x2: run_200();
 608         case 0x3: run_300();
 609     }
 610 \end{verbatim}}
 611
 612 This problem was surmounted by switching on a denser set of {\tt case}
 613 values, which is more amenable to the {\tt TABLESWITCH} structure.
 614 This change alone nearly doubled the speed of the compiled binary.
 615
 616 The next performance improvement came from tuning the size of the
 617 methods invoked from the trampoline.  Trial and error led to the
 618 onclusion that HotSpot \cite{hotspot} -- the most widely deployed JVM
 619 -- performs best when 128 MIPS instructions are mapped to each method.
 620
 621 \epsfig{file=chart5,width=3in}
 622
 623 \epsfig{file=chart6,width=3in}
 624
 625 This phenomenon is due to two factors:
 626
 627 \begin{itemize}
 628
 629 \item While the trampoline method's {\tt switch} statement can be
 630       coded as a {\tt TABLESWITCH}, the {\tt switch} statement
 631       within the individual methods is to sparse to encode this way.
 632
 633 \item Hybrid Interpretive-JIT compilers such as HotSpot generally
 634       favor smaller methods since they are easier to compile and are
 635       better candidates for compilation in ``normal'' programs (unlike
 636       NestedVM programs).
 637
 638 \end{itemize}
 639
 640 After tuning method sizes, our next performance boost came from
 641 eliminating exraneous case branches.  Having case statements before
 642 each instruction prevents JIT compilers from being able to optimize
 643 across instruction boundaries, since control flow can enter the body
 644 of a {\tt switch} statement at any of the {\tt case}s.  In order to
 645 eliminate unnecessary case statements we needed to identify all
 646 possible jump targets.  Jump targets can come from three sources:
 647
 648 \begin{itemize}
 649
 650 \item {\bf The {\tt .text} segment}
 651
 652       Every instruction in the text segment is scanned, and every
 653       branch instruction's destination is added to the list of
 654       possible branch targets.  In addition, any function that sets
 655       the link register is added to the list \footnote{actually {\tt addr+8}}.
 656       Finally, combinations of {\tt LUI} (Load Upper Immediate) and
 657       {\tt ADDIU} (Add Immediate Unsigned) are scanned for possible
 658       addresses in the {\tt .text} segment since this combination of
 659       instructions is often used to load a 32-bit word into a
 660       register.
 661
 662 \item {\bf The {\tt .data} segment}
 663
 664       When compiling {\tt switch} statements, compilers often use a
 665       jump table stored in the {\tt .data} segment.  Unfortunately
 666       they typically do not identify these jump tables in any way.
 667       Therefore, the entire {\tt .data} segment is conservatively
 668       scanned for possible addresses in the {\tt .text} segment.
 669
 670 \item {\bf The symbol table}
 671
 672       This is mainly used as a backup.  Scanning the {\tt .text} and
 673       {\tt .data} segments should identify any possible jump targets;
 674       however, adding all function symbols in the ELF symbol table
 675       also catches functions that are never called directly from the
 676       MIPS binary, such as those invoked only via the NestedVM
 677       runtime's {\tt call()} method.
 678
 679 \end{itemize}
 680
 681 Eliminating unnecessary {\tt case} statements provided a 10-25\% speed
 682 increase.
 683
 684 Despite all the above optimizations, one insurmountable obstacle
 685 remained: the Java {\tt .class} file format limits the constant pool
 686 to 65535 entries.  Every integer literal greater than {\tt 32767}
 687 requires an entry in this pool, and each branch instruction generates
 688 one of these.
 689
 690 One suboptimal solution was to express constants as offsets from a few
 691 central values; for example ``{\tt pc~=~N\_0x00010000~+~0x10}'' (where
 692 {\tt N\_0x000100000} is a non-final field to prevent {\tt javac} from
 693 inlining it).  This was sufficient to get reasonably large binaries to
 694 compile, and caused only a small (approximately 5\%) performance
 695 degredation and a similarly small increase in the size of the {\tt
 696 .class} file.  However, as we will see in the next section, compiling
 697 directly to {\tt .class} files (without the intermediate {\tt .java}
 698 file) eliminates this problem entirely.
 699
 700
 701 \subsection{Binary-to-Binary}
 702
 703 After implementing the binary-to-source compiler, a binary-to-binary
 704 translation mode was added.
 705
 706 \begin{pdfpic}
 707 \newlength{\MyLength}
 708 \settowidth{\MyLength}{xmachine codex}
 709 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 710 \psmatrix[colsep=2,rowsep=0,nrot=:U]
 711   & \\[0pt]
 712   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 713   & \\[0pt]
 714   & \\[0pt]
 715   & \\[0pt]
 716   & \\[0pt]
 717   & \\[0pt]
 718   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 719   & \\[0pt]
 720   \psset{nodesep=5pt,arrows=->}
 721   \ncline{s0}{b0}\bput{:U}{\tt gcc}
 722   \ncline{b0}{b1}\naput{\tt NestedVM}
 723 \endpsmatrix
 724 \end{pdfpic}
 725
 726 This mode has several advantages:
 727
 728 \begin{itemize}
 729
 730 \item There are quite a few interesting bytecode sequences that cannot
 731       be generated as a result of compiling Java source code.
 732
 733 \item Directly generating {\tt .class} files Eliminates the
 734       time-consuming {\tt javac} step.
 735
 736 \item Direct compilation to {\tt .class} files opens up the
 737       interesting possibility of dynamically translating MIPS binaries
 738       and loading them via {\tt ClassLoader.fromBytes()} {\it at
 739       deployment time}, eliminating the need to compile binaries ahead
 740       of time.
 741
 742 \end{itemize}
 743
 744 Most of the performance improvemen where made where in the handling of
 745 branch instructions and in taking advantage of the JVM stack to
 746 eliminate unnecessary {\tt LOAD}s and {\tt STORE}s to local variables.
 747
 748 \epsfig{file=chart7,width=3in}
 749
 750 The first optimization gained by direct bytecode generation came from
 751 the use of the JVM {\tt GOTO} instruction.  Despite the fact that the
 752 Java {\it language} does not have a {\tt goto} keyword, the VM does in
 753 fact have a corresponding instruction which is used quite heavily by
 754 {\tt javac}.  NestedVM's binary-to-binary mode exploits this
 755 instruction to avoid emitting inefficient {\tt switch..case}
 756 structures.
 757
 758 Related to the {\tt GOTO} instruction is branch statement
 759 optimization.  When emitting source code, NestedVM translates branches
 760 into Java source code like this:
 761
 762 {\footnotesize\begin{verbatim}
 763     if (condition) {
 764         pc = TARGET;
 765         continue;
 766     }
 767 \end{verbatim}}
 768
 769 This requires a branch in the JVM {\it regardless} of whether the MIPS
 770 branch is actually taken.  If {\tt condition} is false the JVM has to
 771 jump over the code to set {\tt pc} and go back to the {\tt switch}
 772 statemenmt; if {\tt condition} is true the JVM has to jump to the {\tt
 773 switch} block.  By generating bytecode directly, NestedVM is able to
 774 emit a JVM bytecode branching directly to the address corresponding to
 775 the target of the MIPS branch.  In the case where the branch is not
 776 taken the JVM doesn't branch at all.
 777
 778 A side effect of the previous two optimizations is a solution to the
 779 excess constant pool entries problem.  When jumps are implemented as
 780 {\tt GOTO}s and branches are taken directly, the {\tt pc} field does
 781 not need to be set.  This eliminates a huge number of constant pool
 782 entries.  The {\tt .class} file constant pool size limit is still
 783 present, but it is less likely to be encountered.
 784
 785 Implementation of the MIPS delay slot offers another opportunity for
 786 bytecode-level optimization.  In order to take advantage of
 787 instructions already in the pipeline, the MIPS ISA specifies that the
 788 instruction after a jump or branch is always executed, even if the
 789 jump/branch is taken.  This instruction is referred to as the ``delay
 790 slot\footnote{Newer MIPS CPUs have pipelines that are much larger than
 791 early MIPS CPUs, so they have to discard instructions anyways}.''  The
 792 instruction in the delay slot is actually executed {\it before} the
 793 branch is taken.  To further complicate matters, values from the
 794 register file are loaded {\it before} the delay slot is executed.
 795
 796 Fortunately there is a very elegent solution to this problem which can
 797 be expressed in JVM bytecode.  When a branch instruction is
 798 encountered, the registers needed for the comparison are pushed onto
 799 the stack to prepare for the JVM branch instruction.  Then, {\it
 800 after} the values are on the stack the delay slot instruction is
 801 emitted, followed by the actual JVM branch instruction.  Because the
 802 values were pushed to the stack before the delay slot was executed, any
 803 changes the delay slot made to the registers are not visible to the
 804 branch bytecode.
 805
 806 One final advantage that generating bytecode directly allows is a
 807 reduction in the size of the ultimate {\tt .class} file.  All the
 808 optimizations above lead to more compact bytecode as a beneficial side
 809 effect; in addition, NestedVM performs a few additional optimizations.
 810
 811 When encountering the following {\tt switch} block, both {\tt javac}
 812 and {\tt jikes} generate redundant bytecode.
 813
 814 {\footnotesize\begin{verbatim}
 815     switch(pc>>>8) {
 816         case 0x1: run_1(); break;
 817         case 0x2: run_2(); break
 818         ...
 819         case 0x100: run_100(); break;
 820     }
 821 \end{verbatim}}
 822
 823 The first bytecode in each case arm in the switch statement is {\tt
 824 ALOAD\_0} to prepare for a {\tt INVOKESPECIAL} call.  By simply
 825 lifting this bytecode outside of the {\tt switch} statement, each {\tt
 826 case} arm shrinks by one instruction.
 827
 828 \subsubsection{Compiler Flags}
 829
 830 Although NestedVM perfectly emulates a MIPS R2000 CPU, its performance
 831 profile is nothing like that of actual silicon.  In particular, {\tt
 832 gcc} makes several optimizations that increase performance on an
 833 actually MIPS CPU but actually decrease the performance of
 834 NestedVM-generated bytecode.  We found the following compiler options
 835 could be used to improve performance:
 836
 837 \begin{itemize}
 838
 839 \item {\tt -falign-functions}
 840
 841       Normally a function's location in memory has no effect on its
 842       execution speed.  However, in the NestedVM binary translator,
 843       the {\tt .text} segment is split on power-of-two boundaries.  If
 844       a function starts near the end of one of these boundaries, a
 845       performance critical part of the function winds up spanning two
 846       Java methods.  Telling {\tt gcc} to align all functions along
 847       these boundaries decreases the chance of this sort of splitting.
 848
 849 \item {\tt -fno-rename-registers}
 850
 851       On an actual silicon chip, using additional registers carries no
 852       performance penalty (as long as none are spilled to the stack).
 853       However, when generating bytecode, using {\it fewer}
 854       ``registers'' helps the JVM optimize the machine code it
 855       generates by simplifying the constraints it needs to deal with.
 856       Disabling register renaming has this effect.
 857
 858 \item {\tt -fno-schedule-insns}
 859
 860       Results of MIPS load operations are not available until {\it
 861       two} instructions after the load.  Without the {\tt
 862       -fno-schedule-insns} instruction, {\tt gcc} will attempt to
 863       reorder instructions to do other useful work during this period
 864       of unavailability.  NestedVM is under no such constraint, so
 865       removing this reordering typically generates simpler machine
 866       code.
 867
 868 \item {\tt -mmemcpy}
 869
 870       Enabling this instruction causes {\tt gcc} to use the system
 871       {\tt memcpy()} routine instead of generating loads and stores.
 872       As explained in the next section, the NestedVM runtime
 873       implements {\tt memcpy()} using {\tt System.arraycopy()}, which
 874       is substantially more efficient.
 875
 876 \item {\tt -ffunction-sections -fdata-sections}
 877
 878       These two options are used in conjunction with the {\tt
 879       --gc-section} linker option, prompting the linker to more
 880       aggressively prune dead code.
 881
 882 \end{itemize}
 883
 884 The effects of the various optimizations presented in this chapter are
 885 summarized in the table below.
 886
 887 \epsfig{file=chart4,width=3in}
 888
 889 \epsfig{file=chart3,width=3in}
 890
 891 \section{Experiences}
 892
 893 \subsection{FreeType, {\tt libmspack}, and {\tt libjpeg}}
 894
 895 The Ibex Project utilizes three libraries for which no Java-only
 896 equivalent exists.  The first is the FreeType font library, which
 897 parses, hints, and rasterizes TrueType and Postscript fonts with
 898 exceptional quality.  The project also needed an open source JPEG
 899 decompressor; surprisingly, none exist for Java.  While encoders are
 900 plentiful
 901
 902 These three libraries make minimal use of the standard library and OS
 903 services, and are all written in very portable ANSI C code, which made
 904 them easy targets for initial development.
 905
 906 \subsection{The GNU Compiler Collection}
 907
 908 Our next target, {\tt gcc}, was chosen in order to relieve developers
 909 from the time-consuming and complex task of building a compiler
 910 themselves.  The Ibex Project requires a specially configured and
 911 patched version of {\tt gcc} and its ahead-of-time Java compiler ({\tt
 912 gcj}) which is not distributed in binary form.
 913
 914 GCC was the first ``major'' application NestedVM was used on, and
 915 drove the development of most of the system library interface
 916 development; particularly support for {\tt fork()} and {\tt exec()},
 917 which require the NestedVM Runtime to perform binary-to-bytecode
 918 translation on the fly.
 919
 920 GCC also makes extensive use of 64-bit integers ({\tt long long}),
 921 which -- for performance reasons -- are typically manipulated using
 922 nonobvious instruction sequences on the 32-bit MIPS architecture.
 923 Dealing with these operations uncovered a number of bugs in the
 924 translator.
 925
 926 Despite our original goal, we have not yet been able to translate the
 927 {\tt C++} or Java front-ends, as the resulting binary produces a
 928 trampoline which exceeds the maximum size of a single method.  Future
 929 work will explore a multi-level trampoline to address this issue.
 930
 931
 932
 933 \subsection{\TeX and LINPACK}
 934
 935 In order to distinguish NestedVM from other single-language
 936 translators for the JVM, we undertook the task of translating \TeX 89
 937 (written in Pascal) and the Fortran source code for LINPACK into Java
 938 bytecodes.
 939
 940 Although actually producing the initial MIPS binaries from the \TeX\
 941 source code turned out to be an exceptionally tedious and frustrating
 942 task, the resulting binary translated and executed perfectly on the
 943 first run, as did LINPACK.  Our reward for this effort was typesetting
 944 our presentation of NestedVM using NestedVM itself.  We have also had
 945 initial successes running \TeX\ in a Java Applet, and intend to
 946 produce a {\tt jar} for embedding \TeX\ code (``\TeX lets'') in web
 947 pages without the use of a post-processing tool.
 948
 949 The LINPACK benchmark called our attention to Java's lack of an API
 950 for checking the ``cpu time'' of a process.  Unfortunately we had to
 951 substitute wall-clock time on an otherwise-quiescent machine as an
 952 approximation.
 953
 954
 955
 956 \section{The NestedVM Runtime}
 957
 958 In addition to binary-to-source and binary-to-binary translation,
 959 NestedVM also includes a MIPS binary interpreter.  All three
 960 translation approaches expose the same API to both the translated
 961 binary and the surrounding VM (including peer Java code).
 962
 963 The NestedVM Runtime (various subclasses of {\tt
 964 org.ibex.nestedvm.Runtime}) fill the role of an OS Kernel.
 965 Communication between MIPS code and the outside world is via the MIPS
 966 {\tt SYSCALL} instruction, just as the {\tt libc}-kernel interface is
 967 on real MIPS implementations.
 968
 969 Two implemenations of the runtime are offered; a simple runtime with
 970 the minimum support required to comply with ANSI C, and a more
 971 sophisticated runtime which emulates a large portion of the POSIX API.
 972
 973 \subsection{The ANSI C Runtime}
 974
 975 The ANSI C runtime offers typical file I/O operations including {\tt
 976 open()}, {\tt close()}, {\tt read()}, {\tt write()}, and {\tt seek()}.
 977 File descriptors are implemented much as they are in OS kernels; a
 978 table of open files is maintained and descriptors act as an index into
 979 that table.  Each file is represented as a Java {\tt RandomAccessFile}
 980 in order to match the semantics of {\tt seek()}.
 981
 982 Process-level memory management is done through the {\tt sbrk()}
 983 system call, which extends the process heap by adding more pages to
 984 the memory page table.  Fast memory clear and copy operations can be
 985 performed with {\tt memset()} and {\tt memcpy()}, which invoke the
 986 Java {\tt System.arraycopy()} method, which is generally much faster
 987 than a {\tt for()} loop.
 988
 989 The {\tt exit()} call records the exit status, marks the VM instance
 990 as terminated and halts execution.  The {\tt pause()} syscall
 991 implements a crude form of Java-MIPS communication by returning
 992 control to the Java code which spawned the MIPS process.
 993
 994 \subsection{The Unix Runtime}
 995
 996 The Unix runtime extends the simple ANSI file I/O model to include a
 997 single-root filesystem, and device nodes, as well as {\tt fcntl()}
 998 APIs to manipulate these.  Device nodes are generally simulated by
 999 mapping reads, writes, and {\tt fcntl()}s on the device to the
1000 appropriate Java API.
1001
1002 MIPS processes can ``mount'' other filesystems within the virtual
1003 filesystem made visible to the MIPS process.  Each filesystem is
1004 implemented by a Java class, which could, for example, offer access to
1005 the host filesystem (including {\tt state()}, {\tt lstat()}, {\tt
1006 mkdir}, and {\tt unlink()}, and {\tt getdents()}), the contents of a
1007 zip archive, or even a remote HTTP server.
1008
1009 MIPS processes can also spawn subprocesses using the {\tt fork()} and
1010 {\tt exec()} calls, which create new Java threads to run the process.
1011 The {\tt fork()} call -- which is supposed to duplicate the memory
1012 image of a process -- is implemented in an elegant manner by calling
1013 the Java {\tt clone()} method (deep copy) on the VM object itself.
1014 Copy-on-write is not currently implemented.  The new instance is added
1015 to a static process table to facilitate interprocess communication.
1016
1017 The {\tt waitpid()} API allows a parent process to block pending the
1018 completion of a child process, which is done quite easily with the
1019 Java {\tt wait()} method.
1020
1021 The {\tt exec()} method actually loads a MIPS binary image from the
1022 filesystem, feeds it to the MIPS-to-bytecode translator, and then
1023 loads the resulting bytecode on the fly using {\tt
1024 ClassLoader.loadBytes()}.  The {\tt pipe()} system call permits
1025 parent-to-child IPC just as on a normal Unix system.
1026
1027 Simple networking support is provided by the {\tt opensocket()}, {\tt
1028 listensocket()}, and {\tt accept()} methods, which are not currently
1029 compatible with the usual Berkeley sockets API.
1030
1031
1032 \subsection{Security Concerns}
1033
1034 RuntimeExceptions don't escape, they care caught and turned into
1035 checked exceptions filesystem access does though security manager
1036 (NestedVM Runtime.SecurityManager, and the JVM's)
1037
1038
1039 \subsection{Threading}
1040
1041 The NestedVM runtime currently does not support threading.  Providing
1042 robust support for ``true threads'', whereby each MIPS thread maps to
1043 a Java thread is probably not possible as the Java Memory Model
1044 [CITE], since all MIPS memory is stored in a set of {\tt int[]}'s and
1045 the Java Memory Model does not permit varying treatment or coherency
1046 policies at the granularity of a single array element.
1047
1048 While this presents a major barrier for applications that use
1049 sophisticated locking schemes (such as {\it hash synchronization})
1050 which depend on atomic memory operations, it is probably possible to
1051 apply this threading model to ``well behaved'' multithreaded
1052 applications which depend only on OS-provided semaphores and mutexes
1053 for synchronization.
1054
1055 Complex synchronization and incorrectly synchronized applications can
1056 be supported by implementing a variant of {\it user threads} within a
1057 single Java thread by providing a timer interrupt (via a Java
1058 asynchronous exception).  Unfortunately this requires that the
1059 compiled binary be able to restart from any arbitrary instruction
1060 address, which would require a {\tt case} statement for every
1061 instruction (rather than every jump target), which would degrade
1062 performance and increase the size of the resulting class file.
1063
1064 Theoretical limitations: threads, reading from code, self-modifying
1065 code, COW?
1066
1067
1068 \section{Future Directions}
1069
1070 Although we have only implemented it for the Java Virtual Machine, our
1071 technique generalizes to other safe bytecode architectures.  In
1072 particular we would like to demonstrate this generality by retargeting
1073 the translator to the Microsoft Intermediate Language \cite{msil}.
1074
1075 Additionally, we would like to explore other uses for dynamic loading
1076 of translated MIPS binaries by combining NestedVM (which itself is
1077 written in Java) and the {\tt ClassLoader.fromBytes()} mechanism.
1078
1079
1080 \section{Conclusion}
1081
1082 We have presented a novel technique for using libraries written in
1083 unsafe languages within a safe virtual machine without resorting to
1084 native interfaces.  We have implemented this technique in NestedVM,
1085 which is currently used by the Ibex project\footnote{{\tt
1086 http://www.ibex.org}} to perform font rasterization (via {\tt
1087 libfreetype}), JPEG decoding (via {\tt libjpeg}), and CAB archive
1088 extraction (via {\tt libmspack}), three libraries for which no
1089 equivalent Java classes exist.
1090
1091 NestedVM is available under an open source license, and can be
1092 obtained from
1093 \begin{verbatim}
1094     http://nestedvm.ibex.org
1095 \end{verbatim}
1096
1097
1098 \section{Appendix: Testing Methodology}
1099
1100 All times are measured in seconds. These were all run on a dual 1Ghz
1101 Macintosh G4 running Apple's latest JVM (Sun HotSpot JDK 1.4.1). Each
1102 test was run 8 times within a single VM. The highest and lowest times
1103 were removed and the remaining 6 were averaged.  In each case only the
1104 first run differed significantly from the rest.
1105
1106 The {\tt libjpeg} test consisted of decoding a 1280x1024 jpeg and
1107 writing a tga.  The {\tt mspack} test consisted of extracting all
1108 members from {\tt arial32.exe}, {\tt comic32.exe}, {\tt times32.exe},
1109 and {\tt verdan32.exe}. The {\tt libfreetype} test consisted of
1110 rendering ASCII characters 32-127 of {\tt Comic.TTF} at sizes from 8
1111 to 48 incrementing by 4 for a total of 950 glyphs.
1112
1113 \bibliography{nestedvm}
1114
1115 \end{document}
1116