doc/ivme04.tex

   1 \documentclass{acmconf}
   2 \usepackage{graphicx}
   3 \usepackage{multicol}
   4 \usepackage{amssymb,amsmath,epsfig,alltt}
   5 \sloppy
   6 \usepackage{palatino}
   7 \usepackage{pdftricks}
   8 \begin{psinputs}
   9   \usepackage{pstricks}
  10   \usepackage{pst-node}
  11 \end{psinputs}
  12 \usepackage{parskip}
  13 \usepackage{tabularx}
  14 \usepackage{alltt}
  15 \bibliographystyle{amsplain}
  16
  17 \title{\textbf{\textsf{
  18 Complete Translation of Unsafe Native Code to Safe Bytecode
  19 }}}
  20 \date{}
  21 \author{\begin{tabular}{@{}c@{}}
  22         {\em {Brian Alliet}} \\
  23         {Rochester Institute of Technology}\\
  24         {\tt bja8464@cs.rit.edu}
  25    \end{tabular}\hskip 1in\begin{tabular}{@{}c@{}}
  26         {\em {Adam Megacz}} \\
  27         {University of California, Berkeley} \\
  28         {\tt megacz@cs.berkeley.edu}
  29 \end{tabular}}
  30 \begin{document}
  31
  32 \maketitle
  33
  34 {\it This document was typeset using D. E. Knuth's original \TeX 89,
  35      which was both compiled and executed entirely within a Java
  36      Virtual Machine without the use of native code.}
  37
  38 \begin{abstract}
  39
  40 Existing techniques for using code written in an unsafe language
  41 within a safe virtual machine generally involve transformations from
  42 one source code language (such as C, Pascal, or Fortran) to another
  43 (such as Java) which is then compiled into virtual machine bytecodes.
  44
  45 We present an alternative approach which translate MIPS binaries
  46 produced by any compiler into safe virtual machine bytecodes.  This
  47 approach offers four key advantages over existing techniques: it is
  48 language agnostic, it offers bug-for-bug compiler compatability,
  49 requires no post-translation human intervention, and introduces no
  50 build process modifications.
  51
  52 We also present NestedVM, an implementation of this technique, and
  53 discuss its application to six software packages: LINPACK (Fortran),
  54 which was used as one of our performance tests, \TeX\ (Pascal), which
  55 was used to typeset this document, {\tt libjpeg}, {\tt libmspack}, and
  56 FreeType (all C source), which are currently in production use as part
  57 of the Ibex Project, and {\tt gcc}, which was used to compile all of
  58 the aforementioned.
  59
  60 Performance measurements indicate a best case performance within 3x of
  61 native code and worst case typically within 10x, making it an
  62 attractive solution for code which is not performance-critical.
  63
  64 \end{abstract}
  65
  66 \section{Introduction}
  67
  68 Unsafe languages such as C \cite{KR} and C++ \cite{soustroup} have
  69 been in use much longer than any of today's widely accepted safe
  70 languages such as Java \cite{java} and C\# \cite{csharp}.
  71 Consequently, there is a huge library of software written in these
  72 languages.  Although safe languages offer substantial benefits, their
  73 comparatively young age often puts them at a disadvantage when breadth
  74 of existing support code is an important criterion.
  75
  76 The typical solution to this dilemma is to use a native interface such
  77 as JNI \cite{jni} or CNI \cite{cni} to invoke unsafe code from within a
  78 virtual machine or otherwise safe environment.  Unfortunately, there
  79 are a number of situations in which this is not an acceptable
  80 solution.  These situations can be broadly classified into two
  81 categories: {\it security concerns} and {\it portability concerns}.
  82
  83 Security is often a major concern when integrating native code.  Using
  84 Java as an example, JNI and CNI are prohibited in a number of
  85 contexts, including applet environments and servlet containers with a
  86 {\tt SecurityManager}.  Additionally, even in the context of trusted
  87 code, {\tt native} methods invoked via JNI are susceptible to buffer
  88 overflow and heap corruption attacks which are not a concern for
  89 verified, bounds-checked bytecode.
  90
  91 The second class of disadvantages revolves around portability
  92 concerns; native interfaces require the native library to be compiled
  93 ahead of time for every architecture on which it will be deployed.
  94 This is unacceptable for scenarios in which the full set of target
  95 architectures is not known at deployment time.  Additionally, some JVM
  96 platform variants such as J2ME \cite{j2me} simply do not offer support
  97 for native code.
  98
  99 The technique we present here uses typical compiler to compile unsafe
 100 code into a MIPS binary, which is then translated on an
 101 instruction-by-instruction basis into Java bytecode.  The technique
 102 presented here is general; we anticipate that it can be applied to
 103 other secure virtual machines such as Microsoft's .NET \cite{msil},
 104 Perl Parrot \cite{parrot}, or Python bytecode \cite{python}.
 105
 106 The remainder of this paper is divided as follows: in the next section
 107 we review the relevant set of program representations (safe source,
 108 unsafe source, binary, and bytecode) and survey existing work for
 109 performing transformations between them.  In the third section we
 110 introduce NestedVM and cover its two primary translation modes in
 111 detail.  Section four describes the NestedVM runtime, which plays the
 112 role of the OS kernel.  Section five adresses the optimizations we
 113 employ and quantifies NestedVM's performance.  Section six reviews our
 114 experiences in applying NestedVM to various popular software packages.
 115 We conclude with an analysis of NestedVM's weaknesses and potential
 116 for future improvements.
 117
 118
 119 \section{Existing Work}
 120
 121 The four program representations of interest in this context, along
 122 with their specific types in the C-to-JVM instantiation of the
 123 problem are shown in the following diagram:
 124
 125 \begin{pdfpic}
 126 \newlength{\MyLength}
 127 \settowidth{\MyLength}{machine code}
 128 \newcommand{\MyBox}[1]{\makebox[\MyLength][c]{#1}}
 129 \begin{psmatrix}[colsep=2,rowsep=0]
 130   & \\[0pt]
 131   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 132   [name=s00]\MyBox{\tt (.c)} & [name=s11]\MyBox{\tt (.java)}   \\[0pt]
 133   & \\[0pt]
 134   & \\[0pt]
 135   & \\[0pt]
 136   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 137   [name=b00]\MyBox{\tt (.o)}  & [name=b11]\MyBox{\tt (.class)} \\
 138   & \\[0pt]
 139   \psset{nodesep=5pt,arrows=->}
 140 \end{psmatrix}
 141 \end{pdfpic}
 142
 143 To illustrate the context of this diagram, the following arcs show the
 144 translations performed by a few familiar tools:
 145
 146 \begin{pdfpic}
 147 \newlength{\MyLength}
 148 \settowidth{\MyLength}{xmachine codex}
 149 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 150 \psmatrix[colsep=2,rowsep=0,nrot=:D]
 151   & \\[0pt]
 152   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 153   & \\[0pt]
 154   & \\[0pt]
 155   & \\[0pt]
 156   & \\[0pt]
 157   & \\[0pt]
 158   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 159   & \\[0pt]
 160   \psset{nodesep=5pt,arrows=->}
 161   \ncline{s0}{b0}\bput{:U}{\tt gcc}
 162   \ncline{s1}{b0}\bput{:D}{\tt gcj}
 163   \ncline{s1}{b1}\aput{:U}{\tt javac}
 164   \ncline{b1}{b0}\aput{:D}{\tt gcj}\bput{:D}{JITs}
 165 \endpsmatrix
 166 \end{pdfpic}
 167
 168 Existing techniques for translating unsafe code into VM bytecode
 169 generally fall into two categories, which we expand upon in the
 170 remainder of this section: source-to-source translation and
 171 source-to-binary translation.
 172
 173 \subsection{Source-to-Source Translation}
 174
 175 The most common technique employed is partial translation from unsafe
 176 source code to safe source code:
 177
 178 \begin{pdfpic}
 179 \newlength{\MyLength}
 180 \settowidth{\MyLength}{xmachine codex}
 181 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 182 \psmatrix[colsep=2,rowsep=0,nrot=:U]
 183   & \\[0pt]
 184   & \\[0pt]
 185   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 186   & \\[0pt]
 187   & \\[0pt]
 188   & \\[0pt]
 189   & \\[0pt]
 190   & \\[0pt]
 191   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 192   & \\[0pt]
 193   \psset{nodesep=5pt,arrows=->}
 194   \ncline{s0}{s1}\aput{:U}{source-to}\bput{:U}{source}
 195   \ncline{s1}{b1}\aput{:U}{\tt javac}
 196 \endpsmatrix
 197 \end{pdfpic}
 198
 199 A number of existing systems employ this technique; they can
 200 be divided into two categories: those which perform a partial
 201 translation which is completed by a human, and those which perform a
 202 total translation but fail (yield an error) on a large class of input
 203 programs.
 204
 205 \subsubsection{Human-Assisted Translation}
 206
 207 Jazillian \cite{jazillian} is a commercial solution which produces
 208 extremely readable Java source code from C source code, but ony
 209 translates a small portion of the C language.  Jazillian is unique in
 210 that in addition to {\it language migration}, it also performs {\it
 211 API migration}; for example, Jazillian is intelligent enough to
 212 translate ``{\tt char*~s1~=~strcpy(s2)}'' into ``{\tt
 213 String~s1~=~s2}''.  Unfortunately such deep analysis is intractible
 214 for most of the C language and standard library; indeed, Jazillian's
 215 documentation notes that {\it ``This is not your father's language
 216 translator... Jazillian does not always produce code that works
 217 correctly.''}
 218
 219 MoHCA-Java \cite{mohca} is the other major tool in this category, and
 220 steps beyond Jazillian by providing tools for analysis of the source
 221 C++ abstract syntax tree.  Additionally, MoHCA-Java's analysis engine
 222 is extensible, making it a platform for constructing
 223 application-specific translators rather than a single translation
 224 tool.  However, MoHCA-Java does not always generate complete Java code
 225 for all of the C++ programs which it accepts.
 226
 227 \subsubsection{Partial-Domain Translation}
 228
 229 The {\tt c2j} \cite{c2j}, {\tt c2j++} \cite{c2jpp}, Cappucinno
 230 \cite{capp}, and Ephedra \cite{ephedra} systems each provide support
 231 for complete translation of a {\it subset} of the source language (C
 232 or C++).  Each of the four tools supports a progressively greater
 233 subset than the one preceding it; however none covers the entire input
 234 language.
 235
 236 Ephedra, the most advanced of the four, supports most of the C++
 237 language, and claims to produce ``human readable'' Java code as
 238 output.  Notable omissions from the input domain include support for
 239 fully general pointer arithmetic, casting between unrelated types, and
 240 reading from a {\tt union} via a different member than the one most
 241 recently written.
 242
 243 Unfortunately, when the program being translated is large and complex,
 244 it is quite likely that it will use an unsupported feature in at least
 245 one place.  In the absence of a programmer who understands the source
 246 program, a single anomoly is often enough to render the entire
 247 translation process useless.  As a result, these tools are mainly
 248 useful as an {\it aid} to programmers who could normally perform the
 249 conversion themselves, but want to save time by automating most of the
 250 process.
 251
 252
 253 \subsection{Source-to-Binary Translation}
 254
 255 Source-to-binary translation involves a compiler for the unsafe
 256 language which has been modified to emit safe bytecode.
 257
 258 \begin{pdfpic}
 259 \newlength{\MyLength}
 260 \settowidth{\MyLength}{xmachine codex}
 261 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 262 \psmatrix[colsep=2,rowsep=0,nrot=:U]
 263   & \\[0pt]
 264   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 265   & \\[0pt]
 266   & \\[0pt]
 267   & \\[0pt]
 268   & \\[0pt]
 269   & \\[0pt]
 270   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 271   & \\[0pt]
 272   \psset{nodesep=5pt,arrows=->}
 273   \ncline{s0}{b1}\bput{:U}{source-to-binary}
 274 \endpsmatrix
 275 \end{pdfpic}
 276
 277 An experimental ``JVM backend'' for the {\tt gcc} compiler, known as
 278 {\tt egcs-jvm} \cite{egcsjvm}, attempts this approach.  Since {\tt
 279 gcc} employs a highly modular architecture, it {\it is} possible to
 280 add RTL code generators for nonstandard processors.  However, {\tt
 281 gcc}'s parsing, RTL generation, and optimization layers make
 282 fundamental assumptions (such as the availability of pointer math)
 283 which cannot be directly supported; thus the compiler still fails for
 284 a substantial class of input programs.
 285
 286 A Java backend for the {\tt lcc} [CITE] compiler, known as {\tt
 287 lcc-java} [CITE], is also available.  Although this system is quite
 288 clean and elegantly designed, it lacks any form of system library
 289 ({\tt libc}), so very few C programs will run without custom
 290 modification (which would cause them to diverge from the upstream
 291 sources).  The memory model employed by {\tt lcc-java} is also
 292 somewhat awkward; a separate {\tt int[]} is maintained for the stack
 293 and heap, leading to difficulties mingling pointers to these two
 294 memory regions.  Additionally, the heap is a single {\tt int[]}, which
 295 means that it must be copied in order to be expanded, and prevents
 296 {\tt lcc-java} from taking advantage of {\tt NullPointerException}
 297 checking, which costs nothing in the ``common case'' since nearly all
 298 JVMs use the host CPU's MMU to detect this condition.
 299
 300
 301 \section{NestedVM}
 302
 303 The principal difference between NestedVM and other approaches is that
 304 NestedVM {\it does not} attempt to deal with source code as an input,
 305 instead opting for {\it binary-to-source} and {\it binary-to-binary}
 306 translation.  This offers three immediate advantages:
 307
 308 \begin{itemize}
 309 \item {\bf Language Agnostic}
 310
 311       Because NestedVM does not attempt to implement the parsing and
 312       code generation steps of compilation, it is freed from the
 313       extremely complex task of faithfully implementing languages
 314       which are often not fully or formally specified (such as C and
 315       C++), and is able to support any langage for which a
 316       MIPS-targeted compiler exists.
 317
 318 \item {\bf Bug-for-bug compiler compatability}
 319
 320       Since NestedVM uses the compiler's {\it output} as its own {\it
 321       input}, it ensures that programs which are inadvertently
 322       dependent on the vagaries of a particular compiler can still be
 323       used.
 324
 325 \item {\bf No build process modifications}
 326
 327       NestedVM does not modify existing build processes, which can be
 328       extremely complex and dependent on strange preprocessor usage as
 329       well as the complex interplay between compiler switches and
 330       header file locations.
 331
 332 \end{itemize}
 333
 334 NestedVM's approach carries a fourth benefit as well, arising from its
 335 totality:
 336
 337 \begin{itemize}
 338 \item {\bf No post-translation human intervention}
 339
 340       NestedVM offers total support for all non-privileged
 341       instructions, registers, and resources found on a MIPS {\tt
 342       R2000} CPU, including the add/multiply unit and floating point
 343       coprocessor.  As such, it constitutes a total function mapping
 344       from the entire domain of (non-kernel-mode) programs onto (a
 345       subset of) the semantics of the Java Virtual Machine.  This
 346       ensures that the translation process is fully automated and
 347       always succeeds for valid input binaries.
 348 \end{itemize}
 349
 350 This is a much more important factor than is obvious at first glance.
 351 If post-translation human intervention is required, then the {\it
 352 human becomes part of the build process}.  This means that if a third
 353 party library used in the project needs to be upgraded, {\it a human
 354 must intervene} in the rebuild process.  In addition to slowing the
 355 process and introducing opportunities for error, this task often
 356 requires specialized knowledge which becomes tied to the particular
 357 individual performing this task, rather than being encoded in build
 358 scripts which persist throughout the lifetime of the project.
 359
 360 \subsection{Why MIPS?}
 361
 362 We chose MIPS as a source format for three reasons: the availability
 363 of tools to compile legacy code into MIPS binaries, the close
 364 similarity (FIXME: explain) between the MIPS ISA and the Java Virtual
 365 Machine, and the relatively high degree of program structure that can
 366 be inferred from ABI-adherent binaries.
 367
 368 The MIPS architecture has been around for quite some time, and is well
 369 supported by the GNU Compiler Collection, which is capable of
 370 compiling C, C++, Java, Fortran, Pascal, and Objective C
 371 into MIPS binaries.
 372
 373 The MIPS R2000 ISA bears a striking similarity to the Java Virtual
 374 Machine:
 375
 376 \begin{itemize}
 377
 378 \item Most of the instructions in the original MIPS ISA operate only
 379       on 32-bit aligned memory locations. This allows NestedVM to
 380       represent memory as a Java {\tt int[]} array without introducing
 381       additional overhead.  The remaining non-aligned memory load
 382       instructions are only rarely emitted by most compilers since
 383       they carry a performance penalty on physical MIPS
 384       implementations.
 385
 386 \item Unlike its predecessor, the R2000 supports 32-bit by 32-bit
 387       multiply and divide instructions as well as a single and double
 388       precision floating point unit.  These capabilities map nicely
 389       onto Java's arithmetic instructions.
 390
 391 \item Although MIPS offers unsigned arithmetic and Java does not, few
 392       MIPS instructions actually depend on non-two's-complement
 393       handling of integer math.  Furthermore, since most high-level
 394       languages such as C do not expose access to arithmetic-overflow
 395       exceptions, these instructions are rarely found except in
 396       hand-coded assembler.  In the few situations where these
 397       instructions {\it are} encountered, the {\tt unsigned int} is
 398       cast (bitwise) to a Java {\tt long}, the operation is performed,
 399       and the result is cast back.  On architectures offering 64-bit
 400       integer math this conversion carries no overhead.
 401
 402 \end{itemize}
 403
 404 Finally, the MIPS ISA and ABI convey quite a bit of information about
 405 program structure.  This information can be used for optimization
 406 purposes:
 407
 408 \begin{itemize}
 409
 410 \item The structure of MIPS branching and jump instructions make it
 411       easy to infer the set of likely target instructions.
 412
 413 \item The MIPS ABI specifies particular registers as caller-save and
 414       callee-save, as well as designating a register for the return
 415       address after a function call.  This allows NestedVM to optimize
 416       many operations for the common case of ABI-adherent binaries.
 417
 418 \item All MIPS instructions are exactly 32 bits long.
 419
 420 \end{itemize}
 421
 422
 423
 424 \subsection{Binary-to-Source}
 425
 426 The simplest operational mode for NestedVM is binary-to-source
 427 translation.  In this mode, NestedVM translates MIPS binaries into
 428 Java source code, which is then fed to a Java compiler in order to
 429 produce bytecode files:
 430
 431 \begin{pdfpic}
 432 \newlength{\MyLength}
 433 \settowidth{\MyLength}{xmachine codex}
 434 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 435 \psmatrix[colsep=2,rowsep=0,nrot=:U]
 436   & \\[0pt]
 437   & \\[0pt]
 438   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 439   & \\[0pt]
 440   & \\[0pt]
 441   & \\[0pt]
 442   & \\[0pt]
 443   & \\[0pt]
 444   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 445   \psset{nodesep=5pt,arrows=->}
 446   \ncline{s0}{b0}\bput{:U}{\tt gcc}
 447   \ncline{s1}{b1}\aput{:U}{\tt javac}
 448   \ncline{b0}{s1}\naput{\tt NestedVM}
 449 \endpsmatrix
 450 \end{pdfpic}
 451
 452 \begin{figure*}[t]
 453 \begin{minipage}[c]{7in}%
 454 \begin{multicols}{2}
 455 {\footnotesize\begin{verbatim}
 456 private final static int r0 = 0;
 457 private int r1, r2, r3, /* ... */ r30;
 458 private int r31 = 0xdeadbeef;
 459 private int pc = ENTRY_POINT;
 460
 461 public void run() {
 462     while (true)
 463         switch(pc) {
 464             case 0x10000: r29 = r29 - 32;
 465             case 0x10004: r1 = r4 + r5;
 466             case 0x10008: if (r1 == r6) {
 467                             /* delay slot */
 468                             r1 = r1 + 1;
 469                             pc = 0x10018:
 470                             continue; }
 471             case 0x1000C: r1 = r1 + 1;
 472             case 0x10010: r31 = 0x10018;
 473                           pc = 0x10210;
 474                           continue;
 475             case 0x10014: /* nop */
 476             case 0x10018: pc = r31; continue;
 477             ...
 478             case 0xdeadbeef:  System.exit(1);
 479 ...
 480 \end{verbatim}}
 481 \vspace{1in}
 482 {\footnotesize\begin{verbatim}
 483 public void run_0x10000() {
 484     while (true) switch(pc) {
 485         case 0x10000: ...
 486         case 0x10004: ...
 487         case 0x10010: r31 = 0x10018;
 488                       pc = 0x10210;
 489                       continue;
 490 ....
 491
 492 pubic void run_0x10200() {
 493     while (true) switch(pc) {
 494         case 0x10200: ...
 495         case 0x10204: ...
 496 ....
 497
 498 public void trampoline() {
 499     while (true) switch(pc&0xfffffe00) {
 500         case 0x10000:    run_0x10000(); break;
 501         case 0x10200:    run_0x10200(); break;
 502         case 0xdeadbe00: ...
 503 ....
 504 \end{verbatim}}
 505 \end{multicols}
 506 \end{minipage}
 507 \caption{\label{code1} Trampoline transformation necessitated by Java's 64kb method size limit}
 508 \end{figure*}
 509
 510 Translating unsafe code for use within a JVM proceeds as follows:
 511
 512 \begin{enumerate}
 513
 514 \item Compile the source code to a statically linked binary, targeting
 515       the MIPS R2000 ISA.  Typically this will involve linking against
 516       {\tt libc}, which translates system requests (such as {\tt
 517       open()}, {\tt read()}, or {\tt write()}) into appropriate
 518       invocations of the MIPS {\tt SYSCALL} instruction.  Any other
 519       libraries which are not tied to a particular OS kernel can be
 520       linked in (even in binary form) using standard linker commands.
 521
 522 \item Invoke {\tt NestedVM} on the statically linked binary.
 523
 524 \item Compile the resulting {\tt .java} code using {\tt jikes}
 525       \cite{jikes} or {\tt javac}.
 526
 527 \item From java code, invoke the {\tt run()} method on the generated
 528       class.  This is equivalent to the {\tt main()} entry point.
 529
 530 \end{enumerate}
 531
 532
 533 \subsection{Binary-to-Binary}
 534
 535 After implementing the binary-to-source compiler, a binary-to-binary
 536 translation mode was added.
 537
 538 \begin{pdfpic}
 539 \newlength{\MyLength}
 540 \settowidth{\MyLength}{xmachine codex}
 541 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 542 \psmatrix[colsep=2,rowsep=0,nrot=:U]
 543   & \\[0pt]
 544   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 545   & \\[0pt]
 546   & \\[0pt]
 547   & \\[0pt]
 548   & \\[0pt]
 549   & \\[0pt]
 550   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 551   & \\[0pt]
 552   \psset{nodesep=5pt,arrows=->}
 553   \ncline{s0}{b0}\bput{:U}{\tt gcc}
 554   \ncline{b0}{b1}\naput{\tt NestedVM}
 555 \endpsmatrix
 556 \end{pdfpic}
 557
 558 This mode has several advantages:
 559
 560 \begin{itemize}
 561
 562 \item There are quite a few interesting bytecode sequences that cannot
 563       be generated as a result of compiling Java source code.
 564
 565 \item Directly generating {\tt .class} files Eliminates the
 566       time-consuming {\tt javac} step.
 567
 568 \item Direct compilation to {\tt .class} files opens up the
 569       interesting possibility of dynamically translating MIPS binaries
 570       and loading them via {\tt ClassLoader.fromBytes()} {\it at
 571       deployment time}, eliminating the need to compile binaries ahead
 572       of time.
 573
 574 \end{itemize}
 575
 576 \section{The NestedVM Runtime}
 577
 578 In addition to binary-to-source and binary-to-binary translation,
 579 NestedVM also includes a MIPS binary interpreter.  All three
 580 translation approaches expose the same API to both the translated
 581 binary and the surrounding VM (including peer Java code).
 582
 583 The NestedVM Runtime (various subclasses of {\tt
 584 org.ibex.nestedvm.Runtime}) fill the role of an OS Kernel.
 585 Communication between MIPS code and the outside world is via the MIPS
 586 {\tt SYSCALL} instruction, just as the {\tt libc}-kernel interface is
 587 on real MIPS implementations.
 588
 589 Two implemenations of the runtime are offered; a simple runtime with
 590 the minimum support required to comply with ANSI C, and a more
 591 sophisticated runtime which emulates a large portion of the POSIX API.
 592
 593 \subsection{The ANSI C Runtime}
 594
 595 The ANSI C runtime offers typical file I/O operations including {\tt
 596 open()}, {\tt close()}, {\tt read()}, {\tt write()}, and {\tt seek()}.
 597 File descriptors are implemented much as they are in OS kernels; a
 598 table of open files is maintained and descriptors act as an index into
 599 that table.  Each file is represented as a Java {\tt RandomAccessFile}
 600 in order to match the semantics of {\tt seek()}.
 601
 602 Process-level memory management is done through the {\tt sbrk()}
 603 system call, which extends the process heap by adding more pages to
 604 the memory page table.  Fast memory clear and copy operations can be
 605 performed with {\tt memset()} and {\tt memcpy()}, which invoke the
 606 Java {\tt System.arraycopy()} method, which is generally much faster
 607 than a {\tt for()} loop.
 608
 609 The {\tt exit()} call records the exit status, marks the VM instance
 610 as terminated and halts execution.  The {\tt pause()} syscall
 611 implements a crude form of Java-MIPS communication by returning
 612 control to the Java code which spawned the MIPS process.
 613
 614 \subsection{The Unix Runtime}
 615
 616 The Unix runtime extends the simple ANSI file I/O model to include a
 617 single-root filesystem, and device nodes, as well as {\tt fcntl()}
 618 APIs to manipulate these.  Device nodes are generally simulated by
 619 mapping reads, writes, and {\tt fcntl()}s on the device to the
 620 appropriate Java API.
 621
 622 MIPS processes can ``mount'' other filesystems within the virtual
 623 filesystem made visible to the MIPS process.  Each filesystem is
 624 implemented by a Java class, which could, for example, offer access to
 625 the host filesystem (including {\tt state()}, {\tt lstat()}, {\tt
 626 mkdir}, and {\tt unlink()}, and {\tt getdents()}), the contents of a
 627 zip archive, or even a remote HTTP server.
 628
 629 MIPS processes can also spawn subprocesses using the {\tt fork()} and
 630 {\tt exec()} calls, which create new Java threads to run the process.
 631 The {\tt fork()} call -- which is supposed to duplicate the memory
 632 image of a process -- is implemented in an elegant manner by calling
 633 the Java {\tt clone()} method (deep copy) on the VM object itself.
 634 Copy-on-write is not currently implemented.  The new instance is added
 635 to a static process table to facilitate interprocess communication.
 636
 637 The {\tt waitpid()} API allows a parent process to block pending the
 638 completion of a child process, which is done quite easily with the
 639 Java {\tt wait()} method.
 640
 641 The {\tt exec()} method actually loads a MIPS binary image from the
 642 filesystem, feeds it to the MIPS-to-bytecode translator, and then
 643 loads the resulting bytecode on the fly using {\tt
 644 ClassLoader.loadBytes()}.  The {\tt pipe()} system call permits
 645 parent-to-child IPC just as on a normal Unix system.
 646
 647 Simple networking support is provided by the {\tt opensocket()}, {\tt
 648 listensocket()}, and {\tt accept()} methods, which are not currently
 649 compatible with the usual Berkeley sockets API.
 650
 651
 652 \subsection{Security Concerns}
 653
 654 RuntimeExceptions don't escape, they care caught and turned into
 655 checked exceptions filesystem access does though security manager
 656 (NestedVM Runtime.SecurityManager, and the JVM's)
 657
 658
 659 \subsection{Threading}
 660
 661 The NestedVM runtime currently does not support threading.  Providing
 662 robust support for ``true threads'', whereby each MIPS thread maps to
 663 a Java thread is probably not possible as the Java Memory Model
 664 [CITE], since all MIPS memory is stored in a set of {\tt int[]}'s and
 665 the Java Memory Model does not permit varying treatment or coherency
 666 policies at the granularity of a single array element.
 667
 668 While this presents a major barrier for applications that use
 669 sophisticated locking schemes (such as {\it hash synchronization})
 670 which depend on atomic memory operations, it is probably possible to
 671 apply this threading model to ``well behaved'' multithreaded
 672 applications which depend only on OS-provided semaphores and mutexes
 673 for synchronization.
 674
 675 Complex synchronization and incorrectly synchronized applications can
 676 be supported by implementing a variant of {\it user threads} within a
 677 single Java thread by providing a timer interrupt (via a Java
 678 asynchronous exception).  Unfortunately this requires that the
 679 compiled binary be able to restart from any arbitrary instruction
 680 address, which would require a {\tt case} statement for every
 681 instruction (rather than every jump target), which would degrade
 682 performance and increase the size of the resulting class file.
 683
 684 \section{Optimization and Performance}
 685
 686 \subsection{Binary-to-Source mode}
 687
 688 Generating Java source code instead of bytecode frees NestedVM from
 689 having to perform simple constant propagation optimizations, as most
 690 Java compilers already do this.  A recurring example is the treatment
 691 of the {\tt r0} register, which is fixed as {\tt 0} in the MIPS ISA.
 692
 693 Lacking the ability to generate specially optimized bytecode
 694 sequences, a straightforward mapping of the general purpose hardware
 695 registers to 32 {\tt int} fields turned out to be optimal.
 696
 697
 698 \epsfig{file=charts/chart1,width=3in}
 699
 700 Unfortunately, Java imposes a 64kb limit on the size of the bytecode
 701 for a single method.  This presents a problem for NestedVM, and
 702 necessitates a {\it trampoline transformation}, as shown in
 703 Figure~\ref{code1}.  With this trampoline in place, large binaries can
 704 be handled without much difficulty -- fortunately, there is no
 705 corresponding limit on the size of a classfile as a whole.
 706
 707 One difficulty that arose as a result of using the trampoline
 708 transformation was the fact that {\tt javac} and {\tt jikes} are
 709 unable to properly optimize its switch statements.  For example, the
 710 following code is compiled into a comparatively inefficient {\tt
 711 LOOKUPSWITCH}:
 712
 713 {\footnotesize
 714 \begin{verbatim}
 715     switch(pc&0xffffff00) {
 716         case 0x00000100: run_100(); break;
 717         case 0x00000200: run_200(); break;
 718         case 0x00000300: run_300(); break;
 719     }
 720 \end{verbatim}}
 721
 722 Whereas the next block of code code optimized into a {\tt
 723 TABLESWITCH}:
 724
 725 {\footnotesize
 726 \begin{verbatim}
 727     switch(pc>>>8) {
 728         case 0x1: run_100();
 729         case 0x2: run_200();
 730         case 0x3: run_300();
 731     }
 732 \end{verbatim}}
 733
 734 This problem was surmounted by switching on a denser set of {\tt case}
 735 values, which is more amenable to the {\tt TABLESWITCH} structure.
 736 This change alone nearly doubled the speed of the compiled binary.
 737
 738 The next performance improvement came from tuning the size of the
 739 methods invoked from the trampoline.  Trial and error led to the
 740 onclusion that HotSpot \cite{hotspot} -- the most widely deployed JVM
 741 -- performs best when 128 MIPS instructions are mapped to each method.
 742
 743 \epsfig{file=charts/chart5,width=3in}
 744
 745 \epsfig{file=charts/chart6,width=3in}
 746
 747 This phenomenon is due to two factors:
 748
 749 \begin{itemize}
 750
 751 \item While the trampoline method's {\tt switch} statement can be
 752       coded as a {\tt TABLESWITCH}, the {\tt switch} statement
 753       within the individual methods is to sparse to encode this way.
 754
 755 \item Hybrid Interpretive-JIT compilers such as HotSpot generally
 756       favor smaller methods since they are easier to compile and are
 757       better candidates for compilation in ``normal'' programs (unlike
 758       NestedVM programs).
 759
 760 \end{itemize}
 761
 762 After tuning method sizes, our next performance boost came from
 763 eliminating exraneous case branches.  Having case statements before
 764 each instruction prevents JIT compilers from being able to optimize
 765 across instruction boundaries, since control flow can enter the body
 766 of a {\tt switch} statement at any of the {\tt case}s.  In order to
 767 eliminate unnecessary case statements we needed to identify all
 768 possible jump targets.  Jump targets can come from three sources:
 769
 770 \begin{itemize}
 771
 772 \item {\bf The {\tt .text} segment}
 773
 774       Every instruction in the text segment is scanned, and every
 775       branch instruction's destination is added to the list of
 776       possible branch targets.  In addition, any function that sets
 777       the link register is added to the list \footnote{actually {\tt addr+8}}.
 778       Finally, combinations of {\tt LUI} (Load Upper Immediate) and
 779       {\tt ADDIU} (Add Immediate Unsigned) are scanned for possible
 780       addresses in the {\tt .text} segment since this combination of
 781       instructions is often used to load a 32-bit word into a
 782       register.
 783
 784 \item {\bf The {\tt .data} segment}
 785
 786       When compiling {\tt switch} statements, compilers often use a
 787       jump table stored in the {\tt .data} segment.  Unfortunately
 788       they typically do not identify these jump tables in any way.
 789       Therefore, the entire {\tt .data} segment is conservatively
 790       scanned for possible addresses in the {\tt .text} segment.
 791
 792 \item {\bf The symbol table}
 793
 794       This is mainly used as a backup.  Scanning the {\tt .text} and
 795       {\tt .data} segments should identify any possible jump targets;
 796       however, adding all function symbols in the ELF symbol table
 797       also catches functions that are never called directly from the
 798       MIPS binary, such as those invoked only via the NestedVM
 799       runtime's {\tt call()} method.
 800
 801 \end{itemize}
 802
 803 Eliminating unnecessary {\tt case} statements provided a 10-25\% speed
 804 increase.
 805
 806 Despite all the above optimizations, one insurmountable obstacle
 807 remained: the Java {\tt .class} file format limits the constant pool
 808 to 65535 entries.  Every integer literal greater than {\tt 32767}
 809 requires an entry in this pool, and each branch instruction generates
 810 one of these.
 811
 812 One suboptimal solution was to express constants as offsets from a few
 813 central values; for example ``{\tt pc~=~N\_0x00010000~+~0x10}'' (where
 814 {\tt N\_0x000100000} is a non-final field to prevent {\tt javac} from
 815 inlining it).  This was sufficient to get reasonably large binaries to
 816 compile, and caused only a small (approximately 5\%) performance
 817 degredation and a similarly small increase in the size of the {\tt
 818 .class} file.  However, as we will see in the next section, compiling
 819 directly to {\tt .class} files (without the intermediate {\tt .java}
 820 file) eliminates this problem entirely.
 821
 822 \subsection{Binary-to-Binary mode}
 823
 824 Most of the performance improvemen where made where in the handling of
 825 branch instructions and in taking advantage of the JVM stack to
 826 eliminate unnecessary {\tt LOAD}s and {\tt STORE}s to local variables.
 827
 828 \epsfig{file=charts/chart7,width=3in}
 829
 830 The first optimization gained by direct bytecode generation came from
 831 the use of the JVM {\tt GOTO} instruction.  Despite the fact that the
 832 Java {\it language} does not have a {\tt goto} keyword, the VM does in
 833 fact have a corresponding instruction which is used quite heavily by
 834 {\tt javac}.  NestedVM's binary-to-binary mode exploits this
 835 instruction to avoid emitting inefficient {\tt switch..case}
 836 structures.
 837
 838 Related to the {\tt GOTO} instruction is branch statement
 839 optimization.  When emitting source code, NestedVM translates branches
 840 into Java source code like this:
 841
 842 {\footnotesize\begin{verbatim}
 843     if (condition) {
 844         pc = TARGET;
 845         continue;
 846     }
 847 \end{verbatim}}
 848
 849 This requires a branch in the JVM {\it regardless} of whether the MIPS
 850 branch is actually taken.  If {\tt condition} is false the JVM has to
 851 jump over the code to set {\tt pc} and go back to the {\tt switch}
 852 statemenmt; if {\tt condition} is true the JVM has to jump to the {\tt
 853 switch} block.  By generating bytecode directly, NestedVM is able to
 854 emit a JVM bytecode branching directly to the address corresponding to
 855 the target of the MIPS branch.  In the case where the branch is not
 856 taken the JVM doesn't branch at all.
 857
 858 A side effect of the previous two optimizations is a solution to the
 859 excess constant pool entries problem.  When jumps are implemented as
 860 {\tt GOTO}s and branches are taken directly, the {\tt pc} field does
 861 not need to be set.  This eliminates a huge number of constant pool
 862 entries.  The {\tt .class} file constant pool size limit is still
 863 present, but it is less likely to be encountered.
 864
 865 Implementation of the MIPS delay slot offers another opportunity for
 866 bytecode-level optimization.  In order to take advantage of
 867 instructions already in the pipeline, the MIPS ISA specifies that the
 868 instruction after a jump or branch is always executed, even if the
 869 jump/branch is taken.  This instruction is referred to as the ``delay
 870 slot\footnote{Newer MIPS CPUs have pipelines that are much larger than
 871 early MIPS CPUs, so they have to discard instructions anyways}.''  The
 872 instruction in the delay slot is actually executed {\it before} the
 873 branch is taken.  To further complicate matters, values from the
 874 register file are loaded {\it before} the delay slot is executed.
 875
 876 Fortunately there is a very elegent solution to this problem which can
 877 be expressed in JVM bytecode.  When a branch instruction is
 878 encountered, the registers needed for the comparison are pushed onto
 879 the stack to prepare for the JVM branch instruction.  Then, {\it
 880 after} the values are on the stack the delay slot instruction is
 881 emitted, followed by the actual JVM branch instruction.  Because the
 882 values were pushed to the stack before the delay slot was executed, any
 883 changes the delay slot made to the registers are not visible to the
 884 branch bytecode.
 885
 886 One final advantage that generating bytecode directly allows is a
 887 reduction in the size of the ultimate {\tt .class} file.  All the
 888 optimizations above lead to more compact bytecode as a beneficial side
 889 effect; in addition, NestedVM performs a few additional optimizations.
 890
 891 When encountering the following {\tt switch} block, both {\tt javac}
 892 and {\tt jikes} generate redundant bytecode.
 893
 894 {\footnotesize\begin{verbatim}
 895     switch(pc>>>8) {
 896         case 0x1: run_1(); break;
 897         case 0x2: run_2(); break
 898         ...
 899         case 0x100: run_100(); break;
 900     }
 901 \end{verbatim}}
 902
 903 The first bytecode in each case arm in the switch statement is {\tt
 904 ALOAD\_0} to prepare for a {\tt INVOKESPECIAL} call.  By simply
 905 lifting this bytecode outside of the {\tt switch} statement, each {\tt
 906 case} arm shrinks by one instruction.
 907
 908 \subsection{Compiler Flags}
 909
 910 Although NestedVM perfectly emulates a MIPS R2000 CPU, its performance
 911 profile is nothing like that of actual silicon.  In particular, {\tt
 912 gcc} makes several optimizations that increase performance on an
 913 actually MIPS CPU but actually decrease the performance of
 914 NestedVM-generated bytecode.  We found the following compiler options
 915 could be used to improve performance:
 916
 917 \begin{itemize}
 918
 919 \item {\tt -falign-functions}
 920
 921       Normally a function's location in memory has no effect on its
 922       execution speed.  However, in the NestedVM binary translator,
 923       the {\tt .text} segment is split on power-of-two boundaries.  If
 924       a function starts near the end of one of these boundaries, a
 925       performance critical part of the function winds up spanning two
 926       Java methods.  Telling {\tt gcc} to align all functions along
 927       these boundaries decreases the chance of this sort of splitting.
 928
 929 \item {\tt -fno-rename-registers}
 930
 931       On an actual silicon chip, using additional registers carries no
 932       performance penalty (as long as none are spilled to the stack).
 933       However, when generating bytecode, using {\it fewer}
 934       ``registers'' helps the JVM optimize the machine code it
 935       generates by simplifying the constraints it needs to deal with.
 936       Disabling register renaming has this effect.
 937
 938 \item {\tt -fno-schedule-insns}
 939
 940       Results of MIPS load operations are not available until {\it
 941       two} instructions after the load.  Without the {\tt
 942       -fno-schedule-insns} instruction, {\tt gcc} will attempt to
 943       reorder instructions to do other useful work during this period
 944       of unavailability.  NestedVM is under no such constraint, so
 945       removing this reordering typically generates simpler machine
 946       code.
 947
 948 \item {\tt -mmemcpy}
 949
 950       Enabling this instruction causes {\tt gcc} to use the system
 951       {\tt memcpy()} routine instead of generating loads and stores.
 952       As explained in the next section, the NestedVM runtime
 953       implements {\tt memcpy()} using {\tt System.arraycopy()}, which
 954       is substantially more efficient.
 955
 956 \item {\tt -ffunction-sections -fdata-sections}
 957
 958       These two options are used in conjunction with the {\tt
 959       --gc-section} linker option, prompting the linker to more
 960       aggressively prune dead code.
 961
 962 \end{itemize}
 963
 964 The effects of the various optimizations presented in this chapter are
 965 summarized in the table below.
 966
 967 \epsfig{file=charts/chart4,width=3in}
 968
 969 \epsfig{file=charts/chart3,width=3in}
 970
 971 \epsfig{file=charts/chart8,width=3in}
 972
 973 \epsfig{file=charts/chart9,width=3in}
 974
 975 \section{Sample Applications}
 976
 977 \subsection{FreeType, {\tt libmspack}, and {\tt libjpeg}}
 978
 979 The Ibex Project utilizes three libraries for which no Java-only
 980 equivalent exists.  The first is the FreeType font library, which
 981 parses, hints, and rasterizes TrueType and Postscript fonts with
 982 exceptional quality.  The project also needed an open source JPEG
 983 decompressor; surprisingly, none exist for Java.  While encoders are
 984 plentiful
 985
 986 These three libraries make minimal use of the standard library and OS
 987 services, and are all written in very portable ANSI C code, which made
 988 them easy targets for initial development.
 989
 990 \subsection{The GNU Compiler Collection}
 991
 992 Our next target, {\tt gcc}, was chosen in order to relieve developers
 993 from the time-consuming and complex task of building a compiler
 994 themselves.  The Ibex Project requires a specially configured and
 995 patched version of {\tt gcc} and its ahead-of-time Java compiler ({\tt
 996 gcj}) which is not distributed in binary form.
 997
 998 GCC was the first ``major'' application NestedVM was used on, and
 999 drove the development of most of the system library interface
1000 development; particularly support for {\tt fork()} and {\tt exec()},
1001 which require the NestedVM Runtime to perform binary-to-bytecode
1002 translation on the fly.
1003
1004 GCC also makes extensive use of 64-bit integers ({\tt long long}),
1005 which -- for performance reasons -- are typically manipulated using
1006 nonobvious instruction sequences on the 32-bit MIPS architecture.
1007 Dealing with these operations uncovered a number of bugs in the
1008 translator.
1009
1010 Despite our original goal, we have not yet been able to translate the
1011 {\tt C++} or Java front-ends, as the resulting binary produces a
1012 trampoline which exceeds the maximum size of a single method.  Future
1013 work will explore a multi-level trampoline to address this issue.
1014
1015
1016
1017 \subsection{\TeX and LINPACK}
1018
1019 In order to distinguish NestedVM from other single-language
1020 translators for the JVM, we undertook the task of translating \TeX 89
1021 (written in Pascal) and the Fortran source code for LINPACK into Java
1022 bytecodes.
1023
1024 Although actually producing the initial MIPS binaries from the \TeX\
1025 source code turned out to be an exceptionally tedious and frustrating
1026 task, the resulting binary translated and executed perfectly on the
1027 first run, as did LINPACK.  Our reward for this effort was typesetting
1028 our presentation of NestedVM using NestedVM itself.  We have also had
1029 initial successes running \TeX\ in a Java Applet, and intend to
1030 produce a {\tt jar} for embedding \TeX\ code (``\TeX lets'') in web
1031 pages without the use of a post-processing tool.
1032
1033 The LINPACK benchmark called our attention to Java's lack of an API
1034 for checking the ``cpu time'' of a process.  Unfortunately we had to
1035 substitute wall-clock time on an otherwise-quiescent machine as an
1036 approximation.
1037
1038
1039
1040
1041
1042 \section{Conclusion}
1043
1044 \subsection{Theoretical Limitations}
1045
1046 Theoretical limitations: threads, reading from code, self-modifying
1047 code, COW?
1048
1049 \subsection{Future Directions}
1050
1051 Although we have only implemented it for the Java Virtual Machine, our
1052 technique generalizes to other safe bytecode architectures.  In
1053 particular we would like to demonstrate this generality by retargeting
1054 the translator to the Microsoft Intermediate Language \cite{msil}.
1055
1056 Additionally, we would like to explore other uses for dynamic loading
1057 of translated MIPS binaries by combining NestedVM (which itself is
1058 written in Java) and the {\tt ClassLoader.fromBytes()} mechanism.
1059
1060 \subsection{Availability}
1061
1062 NestedVM is available under an open source license, and can be
1063 obtained from
1064 \begin{verbatim}
1065     http://nestedvm.ibex.org
1066 \end{verbatim}
1067
1068 \appendix
1069 \section{Appendix: Testing Methodology}
1070
1071 All times are measured in seconds. These were all run on a dual 1Ghz
1072 Macintosh G4 running Apple's latest JVM (Sun HotSpot JDK 1.4.1). Each
1073 test was run 8 times within a single VM. The highest and lowest times
1074 were removed and the remaining 6 were averaged.  In each case only the
1075 first run differed significantly from the rest.
1076
1077 The {\tt libjpeg} test consisted of decoding a 1280x1024 jpeg and
1078 writing a tga.  The {\tt mspack} test consisted of extracting all
1079 members from {\tt arial32.exe}, {\tt comic32.exe}, {\tt times32.exe},
1080 and {\tt verdan32.exe}. The {\tt libfreetype} test consisted of
1081 rendering ASCII characters 32-127 of {\tt Comic.TTF} at sizes from 8
1082 to 48 incrementing by 4 for a total of 950 glyphs.
1083
1084 \bibliography{nestedvm}
1085
1086 \end{document}
1087