doc/ivme04.tex

   1 \documentclass{acmconf}
   2 \usepackage{graphicx}
   3 \usepackage{multicol}
   4 \usepackage{amssymb,amsmath,epsfig,alltt}
   5 \sloppy
   6 \usepackage{palatino}
   7 \usepackage{url}
   8 \usepackage{pdftricks}
   9 \begin{psinputs}
  10   \usepackage{pstricks}
  11   \usepackage{pst-node}
  12 \end{psinputs}
  13 \usepackage{parskip}
  14 \usepackage{tabularx}
  15 \usepackage{alltt}
  16 \bibliographystyle{amsplain}
  17
  18 \title{\textbf{\textsf{
  19 Complete Translation of Unsafe Native Code to Safe Bytecode
  20 }}}
  21 \date{}
  22 \author{\begin{tabular}{@{}c@{}}
  23         {\em {Brian Alliet}} \\
  24         {Rochester Institute of Technology}\\
  25         {\tt bja8464@cs.rit.edu}
  26    \end{tabular}\hskip 1in\begin{tabular}{@{}c@{}}
  27         {\em {Adam Megacz}} \\
  28         {University of California, Berkeley} \\
  29         {\tt megacz@cs.berkeley.edu}
  30 \end{tabular}}
  31 \begin{document}
  32
  33 \maketitle
  34
  35 {\it This document was typeset using D. E. Knuth's original \TeX ,
  36      which was both compiled and executed entirely within a Java
  37      Virtual Machine without the use of native code.}
  38
  39 \begin{abstract}
  40
  41 Existing techniques for using code written in an unsafe language
  42 within a safe virtual machine generally involve transformations from
  43 one source code language (such as C, Pascal, or Fortran) to another
  44 (such as Java) which is then compiled into virtual machine bytecodes.
  45
  46 We present an alternative approach which translate MIPS binaries
  47 produced by any compiler into safe virtual machine bytecodes.  This
  48 approach offers four key advantages over existing techniques: it is
  49 language agnostic, it offers bug-for-bug compiler compatibility,
  50 requires no post-translation human intervention, and introduces no
  51 build process modifications.
  52
  53 We also present NestedVM, an implementation of this technique, and
  54 discuss its application to six software packages: LINPACK (Fortran),
  55 which was used as one of our performance tests, \TeX\ (Pascal), which
  56 was used to typeset this document, {\tt libjpeg}, {\tt libmspack}, and
  57 FreeType (all C source), which are currently in production use as part
  58 of the Ibex Project \cite{ibex}, and {\tt gcc}, which was used to
  59 compile all of the aforementioned.
  60
  61 Performance measurements indicate a best case performance within 3x of
  62 native code and worst case typically within 10x, making it an
  63 attractive solution for code which is not performance-critical.
  64
  65 \end{abstract}
  66
  67 \section{Introduction}
  68
  69 Unsafe languages such as C and C++ have been in use much longer than
  70 any of today's widely accepted safe languages such as Java and C\#
  71 Consequently, there is a huge library of software written in these
  72 languages.  Although safe languages offer substantial benefits, their
  73 comparatively young age often puts them at a disadvantage when breadth
  74 of existing support code is an important criterion.
  75
  76 The typical solution to this dilemma is to use a native interface such
  77 as JNI \cite{jni} or CNI \cite{cni} to invoke unsafe code from within a
  78 virtual machine or otherwise safe environment.  Unfortunately, there
  79 are a number of situations in which this is not an acceptable
  80 solution.  These situations can be broadly classified into two
  81 categories: {\it security concerns} and {\it portability concerns}.
  82
  83 Security is often a major concern when integrating native code.  Using
  84 Java as an example, JNI and CNI are prohibited in a number of
  85 contexts, including applet environments and servlet containers with a
  86 {\tt SecurityManager}.  Additionally, even in the context of trusted
  87 code, {\tt native} methods invoked via JNI are susceptible to buffer
  88 overflow and heap corruption attacks which are not a concern for
  89 verified, bounds-checked bytecode.
  90
  91 The second class of disadvantages revolves around portability
  92 concerns; native interfaces require the native library to be compiled
  93 ahead of time for every architecture on which it will be deployed.
  94 This is unacceptable for scenarios in which the full set of target
  95 architectures is not known at deployment time.  Additionally, some JVM
  96 platform variants such as J2ME \cite{j2me} simply do not offer support
  97 for native code.
  98
  99 The technique we present here uses typical compiler to compile unsafe
 100 code into a MIPS binary, which is then translated on an
 101 instruction-by-instruction basis into Java bytecode.  The technique
 102 presented here is general; we anticipate that it can be applied to
 103 other secure virtual machines such as Microsoft's .NET \cite{msil},
 104 Perl Parrot \cite{parrot}, or Python bytecode \cite{python}.
 105
 106 The remainder of this paper is divided as follows: in the next section
 107 we review the relevant set of program representations (safe source,
 108 unsafe source, binary, and bytecode) and survey existing work for
 109 performing transformations between them.  In the third section we
 110 introduce NestedVM and cover its two primary translation modes in
 111 detail.  Section four describes the NestedVM runtime, which plays the
 112 role of the OS kernel.  Section five addresses the optimizations we
 113 employ and quantifies NestedVM's performance.  Section six reviews our
 114 experiences in applying NestedVM to various popular software packages.
 115 We conclude with an analysis of NestedVM's weaknesses and potential
 116 for future improvements.
 117
 118
 119 \section{Existing Work}
 120
 121 The four program representations of interest in this context, along
 122 with their specific types in the C-to-JVM instantiation of the
 123 problem are shown in the following diagram:
 124
 125 \begin{pdfpic}
 126 \newlength{\MyLength}
 127 \settowidth{\MyLength}{machine code}
 128 \newcommand{\MyBox}[1]{\makebox[\MyLength][c]{#1}}
 129 \begin{psmatrix}[colsep=2,rowsep=0]
 130   & \\[0pt]
 131   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 132   [name=s00]\MyBox{\tt (.c)} & [name=s11]\MyBox{\tt (.java)}   \\[0pt]
 133   & \\[0pt]
 134   & \\[0pt]
 135   & \\[0pt]
 136   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 137   [name=b00]\MyBox{\tt (.o)}  & [name=b11]\MyBox{\tt (.class)} \\
 138   & \\[0pt]
 139   \psset{nodesep=5pt,arrows=->}
 140 \end{psmatrix}
 141 \end{pdfpic}
 142
 143 To illustrate the context of this diagram, the following arcs show the
 144 translations performed by a few familiar tools:
 145
 146 \begin{pdfpic}
 147 \newlength{\MyLength}
 148 \settowidth{\MyLength}{xmachine codex}
 149 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 150 \psmatrix[colsep=2,rowsep=0,nrot=:D]
 151   & \\[0pt]
 152   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 153   & \\[0pt]
 154   & \\[0pt]
 155   & \\[0pt]
 156   & \\[0pt]
 157   & \\[0pt]
 158   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 159   & \\[0pt]
 160   \psset{nodesep=5pt,arrows=->}
 161   \ncline{s0}{b0}\bput{:U}{\tt gcc}
 162   \ncline{s1}{b0}\bput{:D}{\tt gcj}
 163   \ncline{s1}{b1}\aput{:U}{\tt javac}
 164   \ncline{b1}{b0}\aput{:D}{\tt gcj}\bput{:D}{JITs}
 165 \endpsmatrix
 166 \end{pdfpic}
 167
 168 Existing techniques for translating unsafe code into VM bytecode
 169 generally fall into two categories, which we expand upon in the
 170 remainder of this section: source-to-source translation and
 171 source-to-binary translation.
 172
 173 \subsection{Source-to-Source Translation}
 174
 175 The most common technique employed is partial translation from unsafe
 176 source code to safe source code:
 177
 178 \begin{pdfpic}
 179 \newlength{\MyLength}
 180 \settowidth{\MyLength}{xmachine codex}
 181 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 182 \psmatrix[colsep=2,rowsep=0,nrot=:U]
 183   & \\[0pt]
 184   & \\[0pt]
 185   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 186   & \\[0pt]
 187   & \\[0pt]
 188   & \\[0pt]
 189   & \\[0pt]
 190   & \\[0pt]
 191   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 192   & \\[0pt]
 193   \psset{nodesep=5pt,arrows=->}
 194   \ncline{s0}{s1}\aput{:U}{source-to}\bput{:U}{source}
 195   \ncline{s1}{b1}\aput{:U}{\tt javac}
 196 \endpsmatrix
 197 \end{pdfpic}
 198
 199 A number of existing systems employ this technique; they can
 200 be divided into two categories: those which perform a partial
 201 translation which is completed by a human, and those which perform a
 202 total translation but fail (yield an error) on a large class of input
 203 programs.
 204
 205 \subsubsection{Human-Assisted Translation}
 206
 207 Jazillian \cite{jazillian} is a commercial solution which produces
 208 extremely readable Java source code from C source code, but only
 209 translates a small portion of the C language.  Jazillian is unique in
 210 that in addition to {\it language migration}, it also performs {\it
 211 API migration}; for example, Jazillian is intelligent enough to
 212 translate ``{\tt char*~s1~=~strcpy(s2)}'' into ``{\tt
 213 String~s1~=~s2}''.  Unfortunately such deep analysis is intractable
 214 for most of the C language and standard library; indeed, Jazillian's
 215 documentation notes that {\it ``This is not your father's language
 216 translator... Jazillian does not always produce code that works
 217 correctly.''}
 218
 219 MoHCA-Java \cite{mohca} is the other major tool in this category, and
 220 steps beyond Jazillian by providing tools for analysis of the source
 221 C++ abstract syntax tree.  Additionally, MoHCA-Java's analysis engine
 222 is extensible, making it a platform for constructing
 223 application-specific translators rather than a single translation
 224 tool.  However, MoHCA-Java does not always generate complete Java code
 225 for all of the C++ programs which it accepts.
 226
 227 \subsubsection{Partial-Domain Translation}
 228
 229 The {\tt c2j} \cite{c2j}, {\tt c2j++}, Cappucinno
 230 \cite{capp}, and Ephedra \cite{ephedra} systems each provide support
 231 for complete translation of a {\it subset} of the source language (C
 232 or C++).  Each of the four tools supports a progressively greater
 233 subset than the one preceding it; however none covers the entire input
 234 language.
 235
 236 Ephedra, the most advanced of the four, supports most of the C++
 237 language, and claims to produce ``human readable'' Java code as
 238 output.  Notable omissions from the input domain include support for
 239 fully general pointer arithmetic, casting between unrelated types, and
 240 reading from a {\tt union} via a different member than the one most
 241 recently written.
 242
 243 Unfortunately, when the program being translated is large and complex,
 244 it is quite likely that it will use an unsupported feature in at least
 245 one place.  In the absence of a programmer who understands the source
 246 program, a single anomaly is often enough to render the entire
 247 translation process useless.  As a result, these tools are mainly
 248 useful as an {\it aid} to programmers who could normally perform the
 249 conversion themselves, but want to save time by automating most of the
 250 process.
 251
 252
 253 \subsection{Source-to-Binary Translation}
 254
 255 Source-to-binary translation involves a compiler for the unsafe
 256 language which has been modified to emit safe bytecode.
 257
 258 \begin{pdfpic}
 259 \newlength{\MyLength}
 260 \settowidth{\MyLength}{xmachine codex}
 261 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 262 \psmatrix[colsep=2,rowsep=0,nrot=:U]
 263   & \\[0pt]
 264   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 265   & \\[0pt]
 266   & \\[0pt]
 267   & \\[0pt]
 268   & \\[0pt]
 269   & \\[0pt]
 270   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 271   & \\[0pt]
 272   \psset{nodesep=5pt,arrows=->}
 273   \ncline{s0}{b1}\bput{:U}{source-to-binary}
 274 \endpsmatrix
 275 \end{pdfpic}
 276
 277 An experimental ``JVM backend'' for the {\tt gcc} compiler, known as
 278 {\tt egcs-jvm} \cite{egcsjvm}, attempts this approach.  Since {\tt
 279 gcc} employs a highly modular architecture, it {\it is} possible to
 280 add RTL code generators for nonstandard processors.  However, {\tt
 281 gcc}'s parsing, RTL generation, and optimization layers make
 282 fundamental assumptions (such as the availability of pointer math)
 283 which cannot be directly supported; thus the compiler still fails for
 284 a substantial class of input programs.
 285
 286 A Java backend for the {\tt lcc} compiler \cite{lcc}, known as {\tt
 287 lcc-java}, is also available.  Although this system is quite
 288 clean and elegantly designed, it lacks any form of system library
 289 ({\tt libc}), so very few C programs will run without custom
 290 modification (which would cause them to diverge from the upstream
 291 sources).  The memory model employed by {\tt lcc-java} is also
 292 somewhat awkward; a separate {\tt int[]} is maintained for the stack
 293 and heap, leading to difficulties mingling pointers to these two
 294 memory regions.  Additionally, the heap is a single {\tt int[]}, which
 295 means that it must be copied in order to be expanded, and prevents
 296 {\tt lcc-java} from taking advantage of {\tt NullPointerException}
 297 checking, which costs nothing in the ``common case'' since nearly all
 298 JVMs use the host CPU's MMU to detect this condition.
 299
 300
 301 \section{NestedVM}
 302
 303 The principal difference between NestedVM and other approaches is that
 304 NestedVM {\it does not} attempt to deal with source code as an input,
 305 instead opting for {\it binary-to-source} and {\it binary-to-binary}
 306 translation.  This offers three immediate advantages:
 307
 308 \begin{itemize}
 309 \item {\bf Language Agnosticism}
 310
 311       Because NestedVM does not attempt to implement the parsing and
 312       code generation steps of compilation, it is freed from the
 313       extremely complex task of faithfully implementing languages
 314       which are often not fully or formally specified (such as C and
 315       C++), and is able to support any language for which a
 316       MIPS-targeted compiler exists.
 317
 318 \item {\bf Bug-for-bug compiler compatibility}
 319
 320       Since NestedVM uses the compiler's {\it output} as its own {\it
 321       input}, it ensures that programs which are inadvertently
 322       dependent on the vagaries of a particular compiler can still be
 323       used.
 324
 325 \item {\bf No build process modifications}
 326
 327       NestedVM does not modify existing build processes, which can be
 328       extremely complex and dependent on strange preprocessor usage as
 329       well as the complex interplay between compiler switches and
 330       header file locations.
 331
 332 \end{itemize}
 333
 334 NestedVM's approach carries a fourth benefit as well, arising from its
 335 totality:
 336
 337 \begin{itemize}
 338 \item {\bf No post-translation human intervention}
 339
 340       NestedVM offers total support for all non-privileged
 341       instructions, registers, and resources found on a MIPS {\tt
 342       R2000} CPU, including the add/multiply unit and floating point
 343       coprocessor.  As such, it constitutes a total function mapping
 344       from the entire domain of (non-kernel-mode) programs onto (a
 345       subset of) the semantics of the Java Virtual Machine.  This
 346       ensures that the translation process is fully automated and
 347       always succeeds for valid input binaries.
 348 \end{itemize}
 349
 350 This last point has important software engineering implications.
 351 If post-translation human intervention is required, then the {\it
 352 human becomes part of the build process}.  This means that if a third
 353 party library used in the project needs to be upgraded, {\it a human
 354 must intervene} in the rebuild process.  In addition to slowing the
 355 process and introducing opportunities for error, this task often
 356 requires specialized knowledge which becomes tied to the particular
 357 individual performing this task, rather than being encoded in build
 358 scripts which persist throughout the lifetime of the project.
 359
 360 \subsection{Mapping the R2000 onto the JVM}
 361
 362 We chose MIPS as a source format for three reasons: the availability
 363 of tools to compile legacy code into MIPS binaries, the many
 364 similarities between the MIPS ISA and the Java Virtual Machine, and
 365 the relatively high degree of program structure that can be inferred
 366 from ABI-adherent binaries.
 367
 368 The MIPS architecture has been around for quite some time, and is well
 369 supported by the GNU Compiler Collection, which is capable of
 370 compiling C, C++, Java, Fortran, Pascal, and Objective C
 371 into MIPS binaries.
 372
 373 The MIPS R2000 ISA bears many similarities to the Java Virtual
 374 Machine.  Most of the instructions in the original MIPS ISA operate
 375 only on 32-bit aligned memory locations. This allows NestedVM to
 376 represent memory as a Java {\tt int[][]} array indexed by page (the
 377 top $n$ bits of the address) and offset (the remaining bits) without
 378 introducing additional overhead.  MIPS's non-aligned memory load
 379 instructions are only rarely emitted by most compilers since they
 380 carry a performance penalty on physical MIPS implementations.
 381
 382 Our choice of a paged representation for memory carries only a small
 383 performance disadvantage:
 384
 385 \epsfig{file=charts/chart5,width=3in}
 386
 387 Additionally, this representation lets us to take advantage of the
 388 fact that on most JVM's, checking for a {\tt NullPointerException}
 389 carries no performance penalty unless the exception is thrown (the
 390 host CPU's MMU is generally used to detect this condition).  This
 391 allows us to lazily expand the MIPS memory space as it is used.
 392 Additionally, we maintain two page arrays, one which is used for read
 393 operations and another for writes.  Most of the time these page arrays
 394 will have identical entries; however, we can simulate a portion of the
 395 MIPS MMU functionality by setting the appropriate entry in the write
 396 page table to {\tt null}, thereby write protecting the page while
 397 still allowing reads.
 398
 399 Unlike its predecessor, the R2000 supports 32-bit by 32-bit multiply
 400 and divide instructions as well as a single and double precision
 401 floating point unit.  These capabilities map nicely onto Java's
 402 arithmetic instructions and {\tt int}, {\tt long}, {\tt float}, and
 403 {\tt double} types.
 404
 405 Although MIPS offers unsigned arithmetic and Java does not, few MIPS
 406 instructions actually depend on non-two's-complement handling of
 407 integer math.  In the few situations where these instructions {\it
 408 are} encountered, the {\tt unsigned int} is cast (bitwise) to a Java
 409 {\tt long}, the operation is performed, and the result is cast back.
 410 On host architectures offering 64-bit arithmetic this operation
 411 carries no performance penalty.
 412
 413 In addition to its similarities to the JVM, the MIPS ISA and ABI
 414 convey quite a bit of information about program structure.  This
 415 information can be used for optimization purposes:
 416
 417 \begin{itemize}
 418
 419 \item The structure of MIPS branching and jump instructions make it
 420       easy to infer the set of likely target instructions.
 421
 422 \item The MIPS ABI specifies particular registers as caller-save and
 423       callee-save, as well as designating a register for the return
 424       address after a function call.  This allows NestedVM to optimize
 425       many operations for the common case of ABI-adherent binaries.
 426
 427 \item All MIPS instructions are exactly 32 bits long.
 428
 429 \end{itemize}
 430
 431
 432
 433 \subsection{Binary-to-Source Mode}
 434
 435 The simplest operational mode for NestedVM is binary-to-source
 436 translation.  In this mode, NestedVM translates MIPS binaries into
 437 Java source code, which is then fed to a Java compiler in order to
 438 produce bytecode files:
 439
 440 \begin{pdfpic}
 441 \newlength{\MyLength}
 442 \settowidth{\MyLength}{xmachine codex}
 443 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 444 \psmatrix[colsep=2,rowsep=0,nrot=:U]
 445   & \\[0pt]
 446   & \\[0pt]
 447   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 448   & \\[0pt]
 449   & \\[0pt]
 450   & \\[0pt]
 451   & \\[0pt]
 452   & \\[0pt]
 453   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 454   \psset{nodesep=5pt,arrows=->}
 455   \ncline{s0}{b0}\bput{:U}{\tt gcc}
 456   \ncline{s1}{b1}\aput{:U}{\tt javac}
 457   \ncline{b0}{s1}\naput{\tt NestedVM}
 458 \endpsmatrix
 459 \end{pdfpic}
 460
 461 \begin{figure*}[t]
 462 \begin{minipage}[c]{7in}%
 463 \begin{multicols}{2}
 464 {\footnotesize\begin{verbatim}
 465 private final static int r0 = 0;
 466 private int r1, r2, r3, /* ... */ r30;
 467 private int r31 = 0xdeadbeef;
 468 private int pc = ENTRY_POINT;
 469
 470 public void run() {
 471     while (true)
 472         switch(pc) {
 473             case 0x10000: r29 = r29 - 32;
 474             case 0x10004: r1 = r4 + r5;
 475             case 0x10008: if (r1 == r6) {
 476                             /* delay slot */
 477                             r1 = r1 + 1;
 478                             pc = 0x10018:
 479                             continue; }
 480             case 0x1000C: r1 = r1 + 1;
 481             case 0x10010: r31 = 0x10018;
 482                           pc = 0x10210;
 483                           continue;
 484             case 0x10014: /* nop */
 485             case 0x10018: pc = r31; continue;
 486             ...
 487             case 0xdeadbeef:  System.exit(1);
 488 ...
 489 \end{verbatim}}
 490 \vspace{1in}
 491 {\footnotesize\begin{verbatim}
 492 public void run_0x10000() {
 493     while (true) switch(pc) {
 494         case 0x10000: ...
 495         case 0x10004: ...
 496         case 0x10010: r31 = 0x10018;
 497                       pc = 0x10210;
 498                       continue;
 499 ....
 500
 501 pubic void run_0x10200() {
 502     while (true) switch(pc) {
 503         case 0x10200: ...
 504         case 0x10204: ...
 505 ....
 506
 507 public void trampoline() {
 508     while (true) switch(pc&0xfffffe00) {
 509         case 0x10000:    run_0x10000(); break;
 510         case 0x10200:    run_0x10200(); break;
 511         case 0xdeadbe00: ...
 512 ....
 513 \end{verbatim}}
 514 \end{multicols}
 515 \end{minipage}
 516 \caption{\label{code1} Trampoline transformation necessitated by Java's 64kb method size limit}
 517 \end{figure*}
 518
 519 Translating unsafe code for use within a JVM proceeds as follows:
 520
 521 \begin{enumerate}
 522 \item The unsafe source code is compiled to a statically linked
 523       binary, including any libraries (including {\tt libc}) it needs.
 524
 525 \item {\tt NestedVM} is invoked on the statically linked binary, and
 526       emits a {\tt .java} file.
 527
 528 \item The resulting {\tt .java} code is compiled into a {\tt .class}
 529       file using {\tt jikes} or {\tt javac}.
 530
 531 \item At runtime, the host Java code invokes the {\tt run()} method on
 532       the generated class.  This is equivalent to the {\tt main()}
 533       entry point.
 534 \end{enumerate}
 535
 536
 537 \subsection{Binary-to-Binary Mode}
 538
 539 After implementing the binary-to-source compiler, a binary-to-binary
 540 translation mode was added.
 541
 542 \begin{pdfpic}
 543 \newlength{\MyLength}
 544 \settowidth{\MyLength}{xmachine codex}
 545 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 546 \psmatrix[colsep=2,rowsep=0,nrot=:U]
 547   & \\[0pt]
 548   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 549   & \\[0pt]
 550   & \\[0pt]
 551   & \\[0pt]
 552   & \\[0pt]
 553   & \\[0pt]
 554   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 555   & \\[0pt]
 556   \psset{nodesep=5pt,arrows=->}
 557   \ncline{s0}{b0}\bput{:U}{\tt gcc}
 558   \ncline{b0}{b1}\naput{\tt NestedVM}
 559 \endpsmatrix
 560 \end{pdfpic}
 561
 562 This mode has several advantages:
 563
 564 \begin{itemize}
 565
 566 \item There are quite a few interesting bytecode sequences that cannot
 567       be generated as a result of compiling Java source code.
 568
 569 \item Directly generating {\tt .class} files Eliminates the
 570       time-consuming {\tt javac} step.
 571
 572 \item Direct compilation to {\tt .class} files opens up the
 573       interesting possibility of dynamically translating MIPS binaries
 574       and loading them via {\tt ClassLoader.fromBytes()} {\it at
 575       deployment time}, eliminating the need to compile binaries ahead
 576       of time.
 577
 578 \end{itemize}
 579
 580
 581 \section{The NestedVM Runtime}
 582
 583 The NestedVM runtime fills the role typically assumed by an OS Kernel.
 584 Communication between MIPS code and the runtime is mediated by the
 585 {\tt SYSCALL} instruction, just as the {\tt libc}-kernel interface is
 586 on other MIPS implementations.
 587
 588 Two implementations of the runtime are offered; a simple runtime with
 589 the minimum support required to comply with ANSI C, and a more
 590 sophisticated runtime which emulates a large portion of the POSIX API.
 591
 592 \subsection{The ANSI C Runtime}
 593
 594 The ANSI C runtime offers typical file I/O operations including {\tt
 595 open()}, {\tt close()}, {\tt read()}, {\tt write()}, and {\tt seek()}.
 596 File descriptors are implemented much as they are in OS kernels; a
 597 table of open files is maintained and descriptors act as an index into
 598 that table.  Each file descriptor is represented as a Java {\tt
 599 RandomAccessFile} in order to match the semantics of {\tt seek()}.
 600
 601 Process-level memory management is done through the {\tt sbrk()}
 602 system call, which extends the process heap by adding more pages to
 603 the memory page table.  Fast memory clear and copy operations can be
 604 performed with {\tt memset()} and {\tt memcpy()}, which invoke the
 605 Java {\tt System.arraycopy()} method.
 606
 607 The {\tt exit()} call records the process' exit status, marks the VM
 608 instance as terminated and halts execution.  The {\tt pause()} syscall
 609 implements a crude form of Java-MIPS communication by returning
 610 control to the Java code which spawned the MIPS process.
 611
 612
 613 \subsection{The Unix Runtime}
 614
 615 The Unix runtime extends the simple ANSI C file I/O model to include a
 616 unified-root filesystem, device nodes, and {\tt fcntl()} APIs.  Device
 617 nodes are generally simulated by mapping reads, writes, and {\tt
 618 fcntl()}s on the device to the appropriate Java API.
 619
 620 MIPS processes can ``mount'' other filesystems within the virtual
 621 filesystem made visible to the MIPS process.  Each filesystem is
 622 implemented by a Java class, which could, for example, offer access to
 623 the host filesystem (including {\tt state()}, {\tt lstat()}, {\tt
 624 mkdir}, and {\tt unlink()}, and {\tt getdents()}), the contents of a
 625 zip archive, or even a remote HTTP server.
 626
 627 The {\tt fork()} call is implemented in an elegant manner by calling
 628 the Java {\tt clone()} method (deep copy) on the VM object itself.
 629 The new instance is then added to a static process table to facilitate
 630 interprocess communication.
 631
 632 The {\tt exec()} method actually loads a MIPS binary image from the
 633 filesystem, feeds it to the MIPS-to-bytecode translator, and then
 634 loads the resulting bytecode on the fly using {\tt
 635 ClassLoader.loadBytes()}.
 636
 637 The {\tt waitpid()} API allows a parent process to block pending the
 638 completion of a child process, which is modeled quite easily with the
 639 Java {\tt wait()} method.  The {\tt pipe()} system call permits
 640 parent-to-child IPC just as on a normal Unix system.
 641
 642 Simple networking support is provided by the {\tt opensocket()}, {\tt
 643 listensocket()}, and {\tt accept()} methods, which are not yet fully
 644 compatible with the standard Berkeley sockets API.
 645
 646
 647 \subsection{Security Concerns}
 648
 649 NestedVM processes are completely isolated from the outside world
 650 except for the {\tt SYSCALL} instruction.  As previously mentioned,
 651 the programmer can choose from various runtime implementations which
 652 translate these invocations into operations in the outside world.  By
 653 default, none of these implementations allows file or network I/O;
 654 this must be explicitly enabled, typically on a per-file basis.
 655
 656 Wild writes within the MIPS VM have no effect on the larger JVM or the
 657 host OS; they can only cause the {\tt SYSCALL} instruction to be
 658 invoked.  A judicious choice of which system calls to enable offers
 659 extremely strong security; for example, the {\tt libjpeg} library does
 660 not need any host services whatsoever -- its sole task is to
 661 decompress one image (which is pre-written into memory before it
 662 starts up), write the decompressed image to another part of memory,
 663 and exit.  With all system calls disabled, {\tt libjpeg} will function
 664 correctly, and even if it is compromised (for example by a
 665 maliciously-constructed invalid image), the only effect it can have on
 666 the host is to generate an incorrect decompressed image.
 667
 668
 669 \subsection{Threading}
 670
 671 The NestedVM runtime currently does not support threading.  Providing
 672 robust support for ``true threads'', whereby each MIPS thread maps to
 673 a Java thread is probably not possible as the Java Memory Model
 674 \cite{jmm}, since all MIPS memory is stored in a set of {\tt
 675 int[]}'s and the Java Memory Model does not permit varying treatment
 676 or coherency policies at the granularity of a single array element.
 677
 678 While this presents a major barrier for applications that use
 679 sophisticated locking schemes such as {\it hash synchronization} and
 680 depend on atomic memory operations, it is probably possible to apply
 681 this threading model to ``well behaved'' multithreaded applications
 682 which make no concurrency assumptions other than those explicitly
 683 offered by OS-provided semaphores and mutexes.
 684
 685 Complex synchronization and incorrectly synchronized applications can
 686 be supported by implementing a variant of {\it user threads} within a
 687 single Java thread by providing a timer interrupt (via a Java
 688 asynchronous exception).  Unfortunately this requires that the
 689 compiled binary be able to restart at any arbitrary instruction
 690 address, which would require a {\tt case} statement for every
 691 instruction (rather than every jump target), which would degrade
 692 performance and increase the size of the resulting class file.
 693
 694 \section{Optimization and Performance}
 695
 696 \subsection{Binary-to-Source Mode}
 697
 698 Generating Java source code instead of bytecode frees NestedVM from
 699 having to perform simple constant propagation optimizations, as most
 700 Java compilers already do this.  A recurring example is the treatment
 701 of the {\tt r0} register, which is fixed as {\tt 0} in the MIPS ISA.
 702
 703 Lacking the ability to generate specially optimized bytecode
 704 sequences, a straightforward mapping of the general purpose hardware
 705 registers to 32 {\tt int} fields turned out to be optimal.  Using
 706 local variables for registers did not offer much of a performance
 707 advantage, presumably since the JVM's JIT is intelligent enough to
 708 register-allocate temporaries for fields.
 709
 710 \epsfig{file=charts/chart4,width=3in}
 711
 712 Unfortunately, Java imposes a 64kb limit on the size of the bytecode
 713 for a single method.  This presents a problem for NestedVM, and
 714 necessitates a {\it trampoline transformation}, as shown in
 715 Figure~\ref{code1}.  With this trampoline in place, large binaries can
 716 be handled without much difficulty -- fortunately, there is no
 717 corresponding limit on the size of a classfile as a whole.
 718
 719 One difficulty that arose as a result of using the trampoline
 720 transformation was the fact that {\tt javac} and {\tt jikes} are
 721 unable to properly optimize its switch statements.  For example, the
 722 following code is compiled into a comparatively inefficient {\tt
 723 LOOKUPSWITCH}:
 724
 725 {\footnotesize
 726 \begin{verbatim}
 727     switch(pc&0xffffff00) {
 728         case 0x00000100: run_100(); break;
 729         case 0x00000200: run_200(); break;
 730         case 0x00000300: run_300(); break;
 731     }
 732 \end{verbatim}}
 733
 734 Whereas the next block of code code optimized into a {\tt
 735 TABLESWITCH}:
 736
 737 {\footnotesize
 738 \begin{verbatim}
 739     switch(pc>>>8) {
 740         case 0x1: run_100();
 741         case 0x2: run_200();
 742         case 0x3: run_300();
 743     }
 744 \end{verbatim}}
 745
 746 This problem was surmounted by switching on a denser set of {\tt case}
 747 values, which is more amenable to the {\tt TABLESWITCH} structure.
 748 This change alone nearly doubled the speed of the compiled binary.
 749
 750 The next performance improvement came from tuning the size of the
 751 methods invoked from the trampoline.  Trial and error led to the
 752 conclusion that HotSpot \cite{hotspot} -- the most widely deployed JVM
 753 -- performs best when 128 MIPS instructions are mapped to each method.
 754
 755 \epsfig{file=charts/chart1,width=3in}
 756
 757 This phenomenon is due to two factors:
 758
 759 \begin{itemize}
 760
 761 \item While the trampoline method's {\tt switch} statement can be
 762       coded as a {\tt TABLESWITCH}, the {\tt switch} statement
 763       within the individual methods is to sparse to encode this way.
 764
 765 \item Hybrid Interpretive-JIT compilers such as HotSpot generally
 766       favor smaller methods since they are easier to compile and are
 767       better candidates for compilation in ``normal'' programs (unlike
 768       NestedVM programs).
 769
 770 \end{itemize}
 771
 772 After tuning method sizes, our next performance boost came from
 773 eliminating extraneous case branches, which yielded approximately a
 774 10\%-25\% performance improvement.  Having case statements before each
 775 instruction prevents JIT compilers from being able to optimize across
 776 instruction boundaries, since control flow can enter the body of a
 777 {\tt switch} statement at any of the {\tt case}s.  In order to
 778 eliminate unnecessary case statements we needed to identify all
 779 possible jump targets.  Jump targets can come from three sources:
 780
 781 \begin{itemize}
 782
 783 \item {\bf The {\tt .text} segment}
 784
 785       Every instruction in the text segment is scanned, and every
 786       branch instruction's destination is added to the list of
 787       possible branch targets.  In addition, the
 788       address\footnote{actually {\tt addr+8}} of any function that
 789       sets the link register is added to the list.  Finally,
 790       combinations of {\tt LUI} (Load Upper Immediate) and {\tt ADDIU}
 791       (Add Immediate Unsigned) are scanned for possible addresses in
 792       the {\tt .text} segment since this combination of instructions
 793       is often used to load a 32-bit word into a register.
 794
 795 \item {\bf The {\tt .data} segment}
 796
 797       When compiling {\tt switch} statements, compilers often use a
 798       jump table stored in the {\tt .data} segment.  Unfortunately
 799       they typically do not identify these jump tables in any way.
 800       Therefore, the entire {\tt .data} segment is conservatively
 801       scanned for possible addresses in the {\tt .text} segment.
 802
 803 \item {\bf The symbol table}
 804
 805       The symbol table is used primarily as a backup source of jump
 806       targets.  Scanning the {\tt .text} and {\tt .data} segments
 807       should identify any possible jump targets; however, adding all
 808       function symbols in the ELF symbol table also catches functions
 809       that are never called directly from the MIPS binary, such as
 810       those invoked only via the NestedVM runtime's {\tt call()}
 811       method.
 812
 813 \end{itemize}
 814
 815 Despite all the above optimizations, one insurmountable obstacle
 816 remained: the Java {\tt .class} file format limits the constant pool
 817 to 65535 entries.  Every integer literal greater than {\tt 32767}
 818 requires an entry in this pool, and each branch instruction generates
 819 one of these.
 820
 821 One suboptimal solution was to express constants as offsets from a few
 822 central values; for example ``{\tt pc~=~N\_0x00010000~+~0x10}'' (where
 823 {\tt N\_0x000100000} is a non-final field to prevent {\tt javac} from
 824 inlining it).  This was sufficient to get reasonably large binaries to
 825 compile, and caused only a small (approximately 5\%) performance
 826 degradation and a similarly small increase in the size of the {\tt
 827 .class} file.  However, as we will see in the next section, compiling
 828 directly to {\tt .class} files (without the intermediate {\tt .java}
 829 file) eliminates this problem entirely.
 830
 831 \subsection{Binary-to-Binary Mode}
 832
 833 Compiling directly to bytecode offers a substantial performance gain:
 834
 835 \epsfig{file=charts/chart3,width=3in}
 836
 837 Most of this improvement comes from the handling of branch
 838 instructions and from taking advantage of the JVM stack to eliminate
 839 unnecessary {\tt LOAD}s and {\tt STORE}s to local variables.
 840
 841 The first optimization gained by direct bytecode generation came from
 842 the use of the JVM {\tt GOTO} instruction.  Despite the fact that the
 843 Java {\it language} does not have a {\tt goto} keyword, the VM does in
 844 fact have a corresponding instruction which is used quite heavily by
 845 {\tt javac}.  NestedVM's binary-to-binary mode exploits this
 846 instruction to avoid emitting inefficient {\tt switch..case}
 847 structures.
 848
 849 Related to the {\tt GOTO} instruction is branch statement
 850 optimization.  When emitting source code, NestedVM translates branches
 851 into Java source code like this:
 852
 853 {\footnotesize\begin{verbatim}
 854     if (condition) {
 855         pc = TARGET;
 856         continue;
 857     }
 858 \end{verbatim}}
 859
 860 This requires a branch in the JVM {\it regardless} of whether the MIPS
 861 branch is actually taken.  If {\tt condition} is false the JVM has to
 862 jump over the code to set {\tt pc} and go back to the {\tt switch}
 863 statement; if {\tt condition} is true the JVM has to jump to the {\tt
 864 switch} block.  By generating bytecode directly, NestedVM is able to
 865 emit a JVM bytecode branching directly to the address corresponding to
 866 the target of the MIPS branch.  In the case where the branch is not
 867 taken the JVM doesn't branch at all.
 868
 869 A side effect of the previous two optimizations is a solution to the
 870 excess constant pool entries problem.  When jumps are implemented as
 871 {\tt GOTO}s and branches are taken directly, the {\tt pc} field does
 872 not need to be set.  This eliminates a huge number of constant pool
 873 entries.  The {\tt .class} file constant pool size limit is still
 874 present, but it is less likely to be encountered.
 875
 876 Implementation of the MIPS delay slot offers another opportunity for
 877 bytecode-level optimization.  In order to take advantage of
 878 instructions already in the pipeline, the MIPS ISA specifies that the
 879 instruction after a jump or branch is always executed, even if the
 880 jump/branch is taken.  This instruction is referred to as the ``delay
 881 slot\footnote{Newer MIPS CPUs have pipelines that are much larger than
 882 early MIPS CPUs, so they have to discard instructions anyways}.''  The
 883 instruction in the delay slot is actually executed {\it before} the
 884 branch is taken.  To further complicate matters, values from the
 885 register file are loaded {\it before} the delay slot is executed.
 886
 887 Fortunately there is a very elegant solution to this problem when
 888 directly emitting bytecode.  When a branch instruction is encountered,
 889 the registers needed for the comparison are pushed onto the stack to
 890 prepare for the JVM branch instruction.  Then, {\it after} the values
 891 are on the stack the delay slot instruction is emitted, followed by
 892 the actual JVM branch instruction.  Because the values were pushed to
 893 the stack before the delay slot was executed, any changes the delay
 894 slot instruction makes to the registers are not visible to the branch
 895 bytecode.
 896
 897 One final advantage of emitting bytecode is a reduction in the size of
 898 the ultimate {\tt .class} file.  All the optimizations above lead to
 899 more compact bytecode as a beneficial side effect in addition,
 900 NestedVM performs a few additional size optimizations.
 901
 902 When encountering the following {\tt switch} block, both {\tt javac}
 903 and {\tt jikes} generate redundant bytecode.
 904
 905 {\footnotesize\begin{verbatim}
 906     switch(pc>>>8) {
 907         case 0x1: run_1(); break;
 908         case 0x2: run_2(); break
 909         ...
 910         case 0x100: run_100(); break;
 911     }
 912 \end{verbatim}}
 913
 914 The first bytecode in each case arm in the switch statement is {\tt
 915 ALOAD\_0} to prepare for a {\tt INVOKESPECIAL} call.  By simply
 916 lifting this bytecode outside of the {\tt switch} statement, each {\tt
 917 case} arm shrinks by one instruction.
 918
 919 The net result is quite reasonably sized {\tt .class} files:
 920
 921 \epsfig{file=charts/chart9,width=3in}
 922
 923
 924 \subsection{Compiler Flags}
 925
 926 Although NestedVM perfectly emulates a MIPS R2000 CPU, its performance
 927 profile is nothing like that of actual silicon.  In particular, {\tt
 928 gcc} makes several optimizations that increase performance on an
 929 actually MIPS CPU but actually decrease the performance of
 930 NestedVM-generated bytecode.  We found the following compiler options
 931 generally improve performance when using {\tt gcc} as the
 932 source-to-MIPS compiler:
 933
 934 \begin{itemize}
 935
 936 \item {\tt -falign-functions}
 937
 938       Normally a function's location in memory has no effect on its
 939       execution speed.  However, in the NestedVM binary translator,
 940       the {\tt .text} segment is split on power-of-two boundaries due
 941       to the trampoline.  If a function starts near the end of one of
 942       these boundaries, a performance critical part of the function
 943       winds up spanning two Java methods.  Telling {\tt gcc} to align
 944       all functions along these boundaries decreases the chance of
 945       this sort of splitting.
 946
 947 \item {\tt -fno-rename-registers}
 948
 949       On an actual silicon chip, using additional registers carries no
 950       performance penalty (as long as none are spilled to the stack).
 951       However, when generating bytecode, using {\it fewer}
 952       ``registers'' helps the JVM optimize the machine code it
 953       generates by simplifying the constraints it needs to deal with.
 954       Disabling register renaming has this effect.
 955
 956 \item {\tt -fno-schedule-insns}
 957
 958       Results of MIPS load operations are not available until {\it
 959       two} instructions after the load.  Without the {\tt
 960       -fno-schedule-insns} instruction, {\tt gcc} will attempt to
 961       reorder instructions to do other useful work during this period
 962       of unavailability.  NestedVM is under no such constraint, and
 963       removing this reordering typically results in simpler bytecode.
 964
 965 \item {\tt -mmemcpy}
 966
 967       Enabling this instruction causes {\tt gcc} to use the system
 968       {\tt memcpy()} routine instead of generating loads and stores.
 969       As explained in the previous section, the NestedVM runtime
 970       implements {\tt memcpy()} using {\tt System.arraycopy()}, which
 971       is substantially more efficient.
 972
 973 \item {\tt -ffunction-sections -fdata-sections}
 974
 975       These two options are used in conjunction with the {\tt
 976       --gc-sections} linker option, prompting the linker to more
 977       aggressively prune dead code.
 978
 979 \end{itemize}
 980
 981 The following chart quantifies the performance gain conferred by each
 982 of the major optimizations outlined in this section:
 983
 984 \epsfig{file=charts/chart10,width=3in}
 985
 986
 987
 988 \subsection{Overall Performance}
 989
 990 All times are measured in seconds.  All tests were performed on a dual
 991 1Ghz Macintosh G4 running Apple's latest JVM (Sun HotSpot JDK
 992 1.4.1). Each test was run 8 times within a single VM. The highest and
 993 lowest times were removed and the remaining 6 were averaged.  In each
 994 case only the first run differed significantly from the rest.
 995
 996 The {\tt libjpeg} test consisted of decoding a 1280x1024 jpeg and
 997 writing a tga.  The {\tt mspack} test consisted of extracting all
 998 members from {\tt arial32.exe}, {\tt comic32.exe}, {\tt times32.exe},
 999 and {\tt verdan32.exe}. The {\tt libfreetype} test consisted of
1000 rendering ASCII characters 32-127 of {\tt Comic.TTF} at sizes from 8
1001 to 48 incrementing by 4 for a total of 950 glyphs.
1002
1003 \epsfig{file=charts/chart8,width=3in}
1004
1005 \epsfig{file=charts/chart7,width=3in}
1006
1007 \epsfig{file=charts/chart6,width=3in}
1008
1009 \section{Sample Applications}
1010
1011 \subsection{FreeType, {\tt libmspack}, and {\tt libjpeg}}
1012
1013 The Ibex Project utilizes three libraries for which no Java-only
1014 equivalent exists.  The first is the FreeType font library, which
1015 parses, hints, and rasterizes TrueType and Postscript fonts with
1016 exceptional quality.  The project also needed an open source JPEG
1017 decompresser; surprisingly, none exist for Java.  While encoders are
1018 plentiful, decoders are rare, since Sun's J2SE VM includes a native
1019 method to invoke {\tt libjpeg}.
1020
1021 These three libraries make minimal use of the standard library and OS
1022 services, and are all written in very portable ANSI C code, which made
1023 them easy targets for initial development.
1024
1025 \subsection{The GNU Compiler Collection}
1026
1027 Our next target, {\tt gcc}, was initially chosen in order to relieve
1028 developers from the time-consuming and complex task of building a
1029 compiler themselves; The Ibex Project requires a specially configured
1030 and patched version of {\tt gcc} and its ahead-of-time Java compiler
1031 ({\tt gcj}) which is cumbersome to build.
1032
1033 {\tt Gcc} was the first ``major'' application NestedVM was used on,
1034 and drove the development of most of the system library interface
1035 development; particularly support for {\tt fork()} and {\tt exec()},
1036 which require the NestedVM Runtime to perform binary-to-bytecode
1037 translation on the fly.
1038
1039 {\tt Gcc} also makes extensive use of 64-bit integers ({\tt long
1040 long}), which -- for performance reasons -- are typically manipulated
1041 using nonobvious instruction sequences on the 32-bit MIPS
1042 architecture.  Dealing with these operations uncovered a number of
1043 bugs in the translator.
1044
1045 Despite our original goal, we have not yet been able to translate the
1046 {\tt C++} or Java front-ends, as the resulting binary produces a
1047 trampoline which exceeds the maximum size of a single method.  Future
1048 work will explore a multi-level trampoline to address this issue.
1049
1050
1051 \subsection{\TeX and LINPACK}
1052
1053 In order to distinguish NestedVM from other single-language
1054 translators for the JVM, we undertook the task of translating \TeX\
1055 (written in Pascal) and the Fortran source code for LINPACK into Java
1056 bytecodes.
1057
1058 Although actually producing the initial MIPS binaries from the \TeX\
1059 source code turned out to be an exceptionally tedious and frustrating
1060 task, the resulting binary translated and executed perfectly on the
1061 first run, as did LINPACK.  Our reward for this effort was typesetting
1062 our presentation of NestedVM using NestedVM itself.  We have also had
1063 initial successes running \TeX\ in a Java Applet, and intend to
1064 produce a {\tt jar} for embedding \TeX\ code (``\TeX lets'') in web
1065 pages without the use of a post-processing tool.
1066
1067 The LINPACK benchmark called our attention to Java's lack of an API
1068 for checking the ``cpu time'' of a process.  Unfortunately we had to
1069 substitute wall-clock time on an otherwise-quiescent machine as an
1070 approximation.
1071
1072 \epsfig{file=charts/chart11,width=3in}
1073
1074
1075
1076 \section{Conclusion}
1077
1078 We have presented a novel technique for using libraries written in
1079 unsafe languages within a safe virtual machine without resorting to
1080 native interfaces.  We have implemented this technique in NestedVM and
1081 demonstrated its utility by translating six popular software
1082 applications.
1083
1084 \subsection{Future Directions}
1085
1086 Although we have only implemented it for the Java Virtual Machine, our
1087 technique generalizes to other safe bytecode architectures.  In
1088 particular we would like to demonstrate this generality by re-targeting
1089 the translator to the Microsoft Intermediate Language \cite{msil}.
1090
1091 Additionally, we would like to explore other uses for dynamic loading
1092 of translated MIPS binaries by combining NestedVM (which itself is
1093 written in Java) and the {\tt ClassLoader.defineClass()} mechanism.
1094
1095 \subsection{Availability}
1096
1097 NestedVM is available under an open source license, and can be
1098 obtained from
1099 \begin{verbatim}
1100     http://nestedvm.ibex.org
1101 \end{verbatim}
1102
1103
1104 \bibliography{ivme04}
1105
1106 \end{document}
1107