doc/ivme04.tex

   1 \documentclass{acmconf}
   2 \usepackage{graphicx}
   3 \usepackage{multicol}
   4 \usepackage{amssymb,amsmath,epsfig,alltt}
   5 \sloppy
   6 \usepackage{palatino}
   7 \usepackage{pdftricks}
   8 \begin{psinputs}
   9   \usepackage{pstricks}
  10   \usepackage{pst-node}
  11 \end{psinputs}
  12 \usepackage{parskip}
  13 \usepackage{tabularx}
  14 \usepackage{alltt}
  15 \bibliographystyle{amsplain}
  16
  17 \title{\textbf{\textsf{
  18 Complete Translation of Unsafe Native Code to Safe Bytecode
  19 }}}
  20 \date{}
  21 \author{\begin{tabular}{@{}c@{}}
  22         {\em {Brian Alliet}} \\
  23         {Rochester Institute of Technology}\\
  24         {\tt bja8464@cs.rit.edu}
  25    \end{tabular}\hskip 1in\begin{tabular}{@{}c@{}}
  26         {\em {Adam Megacz}} \\
  27         {University of California, Berkeley} \\
  28         {\tt megacz@cs.berkeley.edu}
  29 \end{tabular}}
  30 \begin{document}
  31
  32 \maketitle
  33
  34 {\it This document was typeset using D. E. Knuth's original \TeX 89,
  35      which was both compiled and executed entirely within a Java
  36      Virtual Machine without the use of native code.}
  37
  38 \begin{abstract}
  39
  40 Existing techniques for using code written in an unsafe language
  41 within a safe virtual machine generally involve transformations from
  42 one source code language (such as C, Pascal, or Fortran) to another
  43 (such as Java) which is then compiled into virtual machine bytecodes.
  44
  45 We present an alternative approach which translate MIPS binaries
  46 produced by any compiler into safe virtual machine bytecodes.  This
  47 approach offers four key advantages over existing techniques: it is
  48 language agnostic, it offers bug-for-bug compiler compatability,
  49 requires no post-translation human intervention, and introduces no
  50 build process modifications.
  51
  52 We also present NestedVM, an implementation of this technique, and
  53 discuss its application to six software packages: LINPACK (Fortran),
  54 which was used as one of our performance tests, \TeX\ (Pascal), which
  55 was used to typeset this document, {\tt libjpeg}, {\tt libmspack}, and
  56 FreeType (all C source), which are currently in production use as part
  57 of the Ibex Project, and {\tt gcc}, which was used to compile all of
  58 the aforementioned.
  59
  60 Performance measurements indicate a best case performance within 3x of
  61 native code and worst case typically within 10x, making it an
  62 attractive solution for code which is not performance-critical.
  63
  64 \end{abstract}
  65
  66 \section{Introduction}
  67
  68 Unsafe languages such as C \cite{KR} and C++ \cite{soustroup} have
  69 been in use much longer than any of today's widely accepted safe
  70 languages such as Java \cite{java} and C\# \cite{csharp}.
  71 Consequently, there is a huge library of software written in these
  72 languages.  Although safe languages offer substantial benefits, their
  73 comparatively young age often puts them at a disadvantage when breadth
  74 of existing support code is an important criterion.
  75
  76 The typical solution to this dilemma is to use a native interface such
  77 as JNI \cite{jni} or CNI \cite{cni} to invoke unsafe code from within a
  78 virtual machine or otherwise safe environment.  Unfortunately, there
  79 are a number of situations in which this is not an acceptable
  80 solution.  These situations can be broadly classified into two
  81 categories: {\it security concerns} and {\it portability concerns}.
  82
  83 Security is often a major concern when integrating native code.  Using
  84 Java as an example, JNI and CNI are prohibited in a number of
  85 contexts, including applet environments and servlet containers with a
  86 {\tt SecurityManager}.  Additionally, even in the context of trusted
  87 code, {\tt native} methods invoked via JNI are susceptible to buffer
  88 overflow and heap corruption attacks which are not a concern for
  89 verified, bounds-checked bytecode.
  90
  91 The second class of disadvantages revolves around portability
  92 concerns; native interfaces require the native library to be compiled
  93 ahead of time for every architecture on which it will be deployed.
  94 This is unacceptable for scenarios in which the full set of target
  95 architectures is not known at deployment time.  Additionally, some JVM
  96 platform variants such as J2ME \cite{j2me} simply do not offer support
  97 for native code.
  98
  99 The technique we present here uses typical compiler to compile unsafe
 100 code into a MIPS binary, which is then translated on an
 101 instruction-by-instruction basis into Java bytecode.  The technique
 102 presented here is general; we anticipate that it can be applied to
 103 other secure virtual machines such as Microsoft's .NET \cite{msil},
 104 Perl Parrot \cite{parrot}, or Python bytecode \cite{python}.
 105
 106 The remainder of this paper is divided as follows: in the next section
 107 we review the relevant set of program representations (safe source,
 108 unsafe source, binary, and bytecode) and survey existing work for
 109 performing transformations between them.  In the third section we
 110 introduce NestedVM and cover its two primary translation modes in
 111 detail.  Section four adresses the optimizations we employ and
 112 quantifies NestedVM's performance.  Section five describes the
 113 NestedVM runtime, which plays the role of the OS kernel.  We conclude
 114 with an analysis of NestedVM's weaknesses and potential for future
 115 improvements.
 116
 117
 118 \section{Approaches to Translation}
 119
 120 The four program representations of interest in this context, along
 121 with their specific types in the C-to-JVM instantiation of the
 122 problem are shown in the following diagram:
 123
 124 \begin{pdfpic}
 125 \newlength{\MyLength}
 126 \settowidth{\MyLength}{machine code}
 127 \newcommand{\MyBox}[1]{\makebox[\MyLength][c]{#1}}
 128 \begin{psmatrix}[colsep=2,rowsep=0]
 129   & \\[0pt]
 130   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 131   [name=s00]\MyBox{\tt (.c)} & [name=s11]\MyBox{\tt (.java)}   \\[0pt]
 132   & \\[0pt]
 133   & \\[0pt]
 134   & \\[0pt]
 135   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 136   [name=b00]\MyBox{\tt (.o)}  & [name=b11]\MyBox{\tt (.class)} \\
 137   & \\[0pt]
 138   \psset{nodesep=5pt,arrows=->}
 139 \end{psmatrix}
 140 \end{pdfpic}
 141
 142 To illustrate the context of this diagram, the following arcs show the
 143 translations performed by a few familiar tools:
 144
 145 \begin{pdfpic}
 146 \newlength{\MyLength}
 147 \settowidth{\MyLength}{xmachine codex}
 148 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 149 \psmatrix[colsep=2,rowsep=0,nrot=:D]
 150   & \\[0pt]
 151   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 152   & \\[0pt]
 153   & \\[0pt]
 154   & \\[0pt]
 155   & \\[0pt]
 156   & \\[0pt]
 157   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 158   & \\[0pt]
 159   \psset{nodesep=5pt,arrows=->}
 160   \ncline{s0}{b0}\bput{:U}{\tt gcc}
 161   \ncline{s1}{b0}\bput{:D}{\tt gcj}
 162   \ncline{s1}{b1}\aput{:U}{\tt javac}
 163   \ncline{b1}{b0}\aput{:D}{\tt gcj}\bput{:D}{JITs}
 164 \endpsmatrix
 165 \end{pdfpic}
 166
 167 Existing techniques for translating unsafe code into VM bytecode
 168 generally fall into two categories, which we expand upon in the
 169 remainder of this section: source-to-source translation and
 170 source-to-binary translation.
 171
 172 \subsection{Source-to-Source Translation}
 173
 174 The most common technique employed is partial translation from unsafe
 175 source code to safe source code:
 176
 177 \begin{pdfpic}
 178 \newlength{\MyLength}
 179 \settowidth{\MyLength}{xmachine codex}
 180 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 181 \psmatrix[colsep=2,rowsep=0,nrot=:U]
 182   & \\[0pt]
 183   & \\[0pt]
 184   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 185   & \\[0pt]
 186   & \\[0pt]
 187   & \\[0pt]
 188   & \\[0pt]
 189   & \\[0pt]
 190   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 191   & \\[0pt]
 192   \psset{nodesep=5pt,arrows=->}
 193   \ncline{s0}{s1}\aput{:U}{source-to}\bput{:U}{source}
 194   \ncline{s1}{b1}\aput{:U}{\tt javac}
 195 \endpsmatrix
 196 \end{pdfpic}
 197
 198 A number of existing systems employ this technique; they can
 199 be divided into two categories: those which perform a partial
 200 translation which is completed by a human, and those which perform a
 201 total translation but fail (yield an error) on a large class of input
 202 programs.
 203
 204
 205 Jazillian \cite{jazillian} is a commercial solution which produces
 206 extremely readable Java source code from C source code, but ony
 207 translates a small portion of the C language.  Jazillian is unique in
 208 that in addition to {\it language migration}, it also performs {\it
 209 API migration}; for example, Jazillian is intelligent enough
 210 to translate {\tt char*~s1~=~strcpy(s2)} into {\tt String~s1~=~s2}.
 211
 212 Unfortunately such deep analysis is intractible for most of the C
 213 language and standard library; Jazillian's documentation notes that
 214 {\it ``This is not your father's language translator.  It's not
 215 generating ugly code that's guaranteed to work out of the
 216 box... Jazillian does not always produce code that works correctly.''}
 217
 218 MoHCA-Java \cite{mohca} is the other major tool in this category, and steps
 219 beyond Jazillian by providing tools for analysis of the source C++
 220 abstract syntax tree.  Additionally, MoHCA-Java's analysis engine is
 221 extensible, making it a platform for constructing application-specific
 222 translators rather than a single translation tool.  However,
 223 MoHCA-Java does not always generate complete Java code for all of the C++
 224 programs which it accepts.
 225
 226 The c2j \cite{c2j}, c2j++ \cite{c2jpp}, Cappucinno \cite{capp},
 227 and Ephedra \cite{ephedra} systems each provide support for complete
 228 translation of a {\it subset} of the source language (C or C++).  Each
 229 of the four tools supports a progressively greater subset than the one
 230 preceding it; however none covers the entire input language.
 231
 232 Ephedra, the most advanced of the four, supports most of the C++
 233 language, and claims to produce ``human readable'' Java code as
 234 output.  Notable omissions from the input domain include support for
 235 fully general pointer arithmetic, casting between unrelated types, and
 236 reading from a {\tt union} via a different member than the one most
 237 recently written.
 238
 239 Unfortunately, when the program being translated is large and complex,
 240 it is quite likely that it will use an unsupported feature in at least
 241 one place.  In the absence of a programmer who understands the source
 242 program, a single anomoly is often enough to render the entire
 243 translation process useless.  As a result, these tools are mainly
 244 useful as an {\it aid} to programmers who could normally perform the
 245 conversion themselves, but want to save time by automating most of the
 246 process.
 247
 248
 249 \subsection{Source-to-Binary Translation}
 250
 251 Source-to-binary translation involves a compiler for the unsafe
 252 language which has been modified to emit safe bytecode.
 253
 254 \begin{pdfpic}
 255 \newlength{\MyLength}
 256 \settowidth{\MyLength}{xmachine codex}
 257 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 258 \psmatrix[colsep=2,rowsep=0,nrot=:U]
 259   & \\[0pt]
 260   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 261   & \\[0pt]
 262   & \\[0pt]
 263   & \\[0pt]
 264   & \\[0pt]
 265   & \\[0pt]
 266   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 267   & \\[0pt]
 268   \psset{nodesep=5pt,arrows=->}
 269   \ncline{s0}{b1}\bput{:U}{source-to-binary}
 270 \endpsmatrix
 271 \end{pdfpic}
 272
 273 The primary occupant of this category is {\tt egcs-jvm}
 274 \cite{egcsjvm}, an experimental ``JVM backend'' for the GNU Compiler
 275 Collection ( {\tt gcc} ) \cite{gcc}.  Since {\tt gcc} employs a highly
 276 modular architecture, it {\it is} possible to add RTL code generators
 277 for nonstandard processors.  However, {\tt gcc}'s parsing, RTL
 278 generation, and optimization layers make fundamental assumptions (such
 279 as the availability of pointer math) which cannot be directly
 280 supported; thus the compiler still fails for a substantial class of
 281 input programs.
 282
 283 A Java backend for the {\tt lcc} [CITE] compiler, known as {\tt
 284 lcc-java} [CITE], but was not completed.  {\tt lcc-java} also lacks
 285 any form of system library ({\tt libc}), so very few C programs will
 286 run without custom modification, which would cause them to diverge
 287 from the upstream sources.  Finally, {\tt lcc-java}'s memory model is
 288 much more restricted; it uses a fixed-size array to represent all
 289 memory, and expands the array by allocating a new array and copying,
 290 which is extremely inefficient.  No attempt is made to take advantage
 291 of {\tt NullPointerException} checking (which costs nothing if the
 292 exception is not thrown since most JVMs use the MMU to detect this).
 293 Finally, {\tt lcc-java} targets Java source code, which places the
 294 vast majority of NestedVM's optimizations beyond its reach, and
 295 severely restricts the maximum program size {\tt lcc-java} can handle.
 296
 297 Finally, {\tt lcc-java} maintains a separate memory area for the
 298 stack, which appears to limit the exchange of stack pointers and heap
 299 pointers.  It is unclear from the documentation how this is handled.
 300
 301
 302 \section{NestedVM}
 303
 304 The principal difference between NestedVM and other approaches is that
 305 NestedVM {\it does not} attempt to deal with source code as an input,
 306 instead opting for {\it binary-to-source} and {\it binary-to-binary}
 307 translation.  This offers three immediate advantages:
 308
 309 \begin{itemize}
 310 \item {\bf Language Agnostic}
 311
 312       Because NestedVM does not attempt to implement the parsing and
 313       code generation steps of compilation, it is freed from the
 314       extremely complex task of faithfully implementing languages
 315       which are often not fully or formally specified (such as C and
 316       C++), and is able to support any langage for which a
 317       MIPS-targeted compiler exists.
 318
 319 \item {\bf Bug-for-bug compiler compatability}
 320
 321       Since NestedVM uses the compiler's {\it output} as its own {\it
 322       input}, it ensures that programs which are inadvertently
 323       dependent on the vagaries of a particular compiler can still be
 324       used.
 325
 326 \item {\bf No build process modifications}
 327
 328       NestedVM does not modify existing build processes, which can be
 329       extremely complex and dependent on strange preprocessor usage as
 330       well as the complex interplay between compiler switches and
 331       header file locations.
 332
 333 \end{itemize}
 334
 335 NestedVM's approach carries a fourth benefit as well, arising from its
 336 totality:
 337
 338 \begin{itemize}
 339 \item {\bf No post-translation human intervention}
 340
 341       NestedVM offers total support for all non-privileged
 342       instructions, registers, and resources found on a MIPS {\tt
 343       R2000} CPU, including the add/multiply unit and floating point
 344       coprocessor.  As such, it constitutes a total function mapping
 345       from the entire domain of (non-kernel-mode) programs onto (a
 346       subset of) the semantics of the Java Virtual Machine.  This
 347       ensures that the translation process is fully automated and
 348       always succeeds for valid input binaries.
 349 \end{itemize}
 350
 351 This is a much more important factor than is obvious at first glance.
 352 If post-translation human intervention is required, then the {\it
 353 human becomes part of the build process}.  This means that if a third
 354 party library used in the project needs to be upgraded, {\it a human
 355 must intervene} in the rebuild process.  In addition to slowing the
 356 process and introducing opportunities for error, this task often
 357 requires specialized knowledge which becomes tied to the particular
 358 individual performing this task, rather than being encoded in build
 359 scripts which persist throughout the lifetime of the project.
 360
 361 \subsection{Why MIPS?}
 362
 363 We chose MIPS as a source format for three reasons: the availability
 364 of tools to compile legacy code into MIPS binaries, the close
 365 similarity (FIXME: explain) between the MIPS ISA and the Java Virtual
 366 Machine, and the relatively high degree of program structure that can
 367 be inferred from ABI-adherent binaries.
 368
 369 The MIPS architecture has been around for quite some time, and is well
 370 supported by the GNU Compiler Collection, which is capable of
 371 compiling C, C++, Java, Fortran, Pascal, and Objective C
 372 into MIPS binaries.
 373
 374 The MIPS R2000 ISA bears a striking similarity to the Java Virtual
 375 Machine:
 376
 377 \begin{itemize}
 378
 379 \item Most of the instructions in the original MIPS ISA operate only
 380       on 32-bit aligned memory locations. This allows NestedVM to
 381       represent memory as a Java {\tt int[]} array without introducing
 382       additional overhead.  The remaining non-aligned memory load
 383       instructions are only rarely emitted by most compilers since
 384       they carry a performance penalty on physical MIPS
 385       implementations.
 386
 387 \item Unlike its predecessor, the R2000 supports 32-bit by 32-bit
 388       multiply and divide instructions as well as a single and double
 389       precision floating point unit.  These capabilities map nicely
 390       onto Java's arithmetic instructions.
 391
 392 \item Although MIPS offers unsigned arithmetic and Java does not, few
 393       MIPS instructions actually depend on non-two's-complement
 394       handling of integer math.  Furthermore, since most high-level
 395       languages such as C do not expose access to arithmetic-overflow
 396       exceptions, these instructions are rarely found except in
 397       hand-coded assembler.  In the few situations where these
 398       instructions {\it are} encountered, the {\tt unsigned int} is
 399       cast (bitwise) to a Java {\tt long}, the operation is performed,
 400       and the result is cast back.  On architectures offering 64-bit
 401       integer math this conversion carries no overhead.
 402
 403 \end{itemize}
 404
 405 Finally, the MIPS ISA and ABI convey quite a bit of information about
 406 program structure.  This information can be used for optimization
 407 purposes:
 408
 409 \begin{itemize}
 410
 411 \item The structure of MIPS branching and jump instructions make it
 412       easy to infer the set of likely target instructions.
 413
 414 \item The MIPS ABI specifies particular registers as caller-save and
 415       callee-save, as well as designating a register for the return
 416       address after a function call.  This allows NestedVM to optimize
 417       many operations for the common case of ABI-adherent binaries.
 418
 419 \item All MIPS instructions are exactly 32 bits long.
 420
 421 \end{itemize}
 422
 423
 424
 425 \subsection{Binary-to-Source}
 426
 427 The simplest operational mode for NestedVM is binary-to-source
 428 translation.  In this mode, NestedVM translates MIPS binaries into
 429 Java source code, which is then fed to a Java compiler in order to
 430 produce bytecode files:
 431
 432 \begin{pdfpic}
 433 \newlength{\MyLength}
 434 \settowidth{\MyLength}{xmachine codex}
 435 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 436 \psmatrix[colsep=2,rowsep=0,nrot=:U]
 437   & \\[0pt]
 438   & \\[0pt]
 439   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 440   & \\[0pt]
 441   & \\[0pt]
 442   & \\[0pt]
 443   & \\[0pt]
 444   & \\[0pt]
 445   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 446   \psset{nodesep=5pt,arrows=->}
 447   \ncline{s0}{b0}\bput{:U}{\tt gcc}
 448   \ncline{s1}{b1}\aput{:U}{\tt javac}
 449   \ncline{b0}{s1}\naput{\tt NestedVM}
 450 \endpsmatrix
 451 \end{pdfpic}
 452
 453 \begin{figure*}[t]
 454 \begin{minipage}[c]{7in}%
 455 \begin{multicols}{2}
 456 {\footnotesize\begin{verbatim}
 457 private final static int r0 = 0;
 458 private int r1, r2, r3,...,r30;
 459 private int r31 = 0xdeadbeef;
 460 private int pc = ENTRY_POINT;
 461
 462 public void run() {
 463     for(;;) {
 464         switch(pc) {
 465             case 0x10000:
 466                 r29 = r29 - 32;
 467             case 0x10004:
 468                 r1 = r4 + r5;
 469             case 0x10008:
 470                 if(r1 == r6) {
 471                     /* delay slot */
 472                     r1 = r1 + 1;
 473                     pc = 0x10018:
 474                     continue;
 475                 }
 476             case 0x1000C:
 477                 r1 = r1 + 1;
 478             case 0x10010:
 479                 r31 = 0x10018;
 480                 pc = 0x10210;
 481                 continue;
 482             case 0x10014:
 483                 /* nop */
 484             case 0x10018:
 485                 pc = r31;
 486                 continue;
 487             ...
 488             case 0xdeadbeef:
 489                 System.err.println(``Exited.'');
 490                 System.exit(1);
 491         }
 492     }
 493 }
 494 \end{verbatim}}
 495 \vspace{1in}
 496 {\footnotesize\begin{verbatim}
 497 public void run_0x10000() {
 498     for(;;) {
 499     switch(pc) {
 500         case 0x10000:
 501             ...
 502         case 0x10004:
 503             ...
 504         ...
 505         case 0x10010:
 506             r31 = 0x10018;
 507             pc = 0x10210;
 508             return;
 509         ...
 510     }
 511     }
 512 }
 513
 514 pubic void run_0x10200() {
 515     for(;;) {
 516     switch(pc) {
 517         case 0x10200:
 518             ...
 519         case 0x10204:
 520             ...
 521     }
 522     }
 523 }
 524
 525 public void trampoline() {
 526     for(;;) {
 527     switch(pc&0xfffffe00) {
 528             case 0x10000: run_0x10000(); break;
 529             case 0x10200: run_0x10200(); break;
 530             case 0xdeadbe00:
 531                 ...
 532         }
 533     }
 534 }
 535 \end{verbatim}}
 536 \end{multicols}
 537 \end{minipage}
 538 \caption{\label{code1} Trampoline transformation necessitated by Java's 64kb method size limit}
 539 \end{figure*}
 540
 541 Translating unsafe code for use within a JVM proceeds as follows:
 542
 543 \begin{enumerate}
 544
 545 \item Compile the source code to a statically linked binary, targeting
 546       the MIPS R2000 ISA.  Typically this will involve linking against
 547       {\tt libc}, which translates system requests (such as {\tt
 548       open()}, {\tt read()}, or {\tt write()}) into appropriate
 549       invocations of the MIPS {\tt SYSCALL} instruction.  Any other
 550       libraries which are not tied to a particular OS kernel can be
 551       linked in (even in binary form) using standard linker commands.
 552
 553 \item Invoke {\tt NestedVM} on the statically linked binary.
 554
 555 \item Compile the resulting {\tt .java} code using {\tt jikes}
 556       \cite{jikes} or {\tt javac}.
 557
 558 \item From java code, invoke the {\tt run()} method on the generated
 559       class.  This is equivalent to the {\tt main()} entry point.
 560
 561 \end{enumerate}
 562
 563
 564 \subsection{Binary-to-Binary}
 565
 566 After implementing the binary-to-source compiler, a binary-to-binary
 567 translation mode was added.
 568
 569 \begin{pdfpic}
 570 \newlength{\MyLength}
 571 \settowidth{\MyLength}{xmachine codex}
 572 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
 573 \psmatrix[colsep=2,rowsep=0,nrot=:U]
 574   & \\[0pt]
 575   [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
 576   & \\[0pt]
 577   & \\[0pt]
 578   & \\[0pt]
 579   & \\[0pt]
 580   & \\[0pt]
 581   [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
 582   & \\[0pt]
 583   \psset{nodesep=5pt,arrows=->}
 584   \ncline{s0}{b0}\bput{:U}{\tt gcc}
 585   \ncline{b0}{b1}\naput{\tt NestedVM}
 586 \endpsmatrix
 587 \end{pdfpic}
 588
 589 This mode has several advantages:
 590
 591 \begin{itemize}
 592
 593 \item There are quite a few interesting bytecode sequences that cannot
 594       be generated as a result of compiling Java source code.
 595
 596 \item Directly generating {\tt .class} files Eliminates the
 597       time-consuming {\tt javac} step.
 598
 599 \item Direct compilation to {\tt .class} files opens up the
 600       interesting possibility of dynamically translating MIPS binaries
 601       and loading them via {\tt ClassLoader.fromBytes()} {\it at
 602       deployment time}, eliminating the need to compile binaries ahead
 603       of time.
 604
 605 \end{itemize}
 606
 607 \section{Optimization and Quantitative Performance}
 608
 609 \subsection{Binary-to-Source mode}
 610
 611 Generating Java source code instead of bytecode frees NestedVM from
 612 having to perform simple constant propagation optimizations, as most
 613 Java compilers already do this.  A recurring example is the treatment
 614 of the {\tt r0} register, which is fixed as {\tt 0} in the MIPS ISA.
 615
 616 Lacking the ability to generate specially optimized bytecode
 617 sequences, a straightforward mapping of the general purpose hardware
 618 registers to 32 {\tt int} fields turned out to be optimal.
 619
 620
 621 \epsfig{file=charts/chart1,width=3in}
 622
 623 Unfortunately, Java imposes a 64kb limit on the size of the bytecode
 624 for a single method.  This presents a problem for NestedVM, and
 625 necessitates a {\it trampoline transformation}, as shown in
 626 Figure~\ref{code1}.  With this trampoline in place, large binaries can
 627 be handled without much difficulty -- fortunately, there is no
 628 corresponding limit on the size of a classfile as a whole.
 629
 630 One difficulty that arose as a result of using the trampoline
 631 transformation was the fact that {\tt javac} and {\tt jikes} are
 632 unable to properly optimize its switch statements.  For example, the
 633 following code is compiled into a comparatively inefficient {\tt
 634 LOOKUPSWITCH}:
 635
 636 {\footnotesize
 637 \begin{verbatim}
 638     switch(pc&0xffffff00) {
 639         case 0x00000100: run_100(); break;
 640         case 0x00000200: run_200(); break;
 641         case 0x00000300: run_300(); break;
 642     }
 643 \end{verbatim}}
 644
 645 Whereas the next block of code code optimized into a {\tt
 646 TABLESWITCH}:
 647
 648 {\footnotesize
 649 \begin{verbatim}
 650     switch(pc>>>8) {
 651         case 0x1: run_100();
 652         case 0x2: run_200();
 653         case 0x3: run_300();
 654     }
 655 \end{verbatim}}
 656
 657 This problem was surmounted by switching on a denser set of {\tt case}
 658 values, which is more amenable to the {\tt TABLESWITCH} structure.
 659 This change alone nearly doubled the speed of the compiled binary.
 660
 661 The next performance improvement came from tuning the size of the
 662 methods invoked from the trampoline.  Trial and error led to the
 663 onclusion that HotSpot \cite{hotspot} -- the most widely deployed JVM
 664 -- performs best when 128 MIPS instructions are mapped to each method.
 665
 666 \epsfig{file=chart5,width=3in}
 667
 668 \epsfig{file=chart6,width=3in}
 669
 670 This phenomenon is due to two factors:
 671
 672 \begin{itemize}
 673
 674 \item While the trampoline method's {\tt switch} statement can be
 675       coded as a {\tt TABLESWITCH}, the {\tt switch} statement
 676       within the individual methods is to sparse to encode this way.
 677
 678 \item Hybrid Interpretive-JIT compilers such as HotSpot generally
 679       favor smaller methods since they are easier to compile and are
 680       better candidates for compilation in ``normal'' programs (unlike
 681       NestedVM programs).
 682
 683 \end{itemize}
 684
 685 After tuning method sizes, our next performance boost came from
 686 eliminating exraneous case branches.  Having case statements before
 687 each instruction prevents JIT compilers from being able to optimize
 688 across instruction boundaries, since control flow can enter the body
 689 of a {\tt switch} statement at any of the {\tt case}s.  In order to
 690 eliminate unnecessary case statements we needed to identify all
 691 possible jump targets.  Jump targets can come from three sources:
 692
 693 \begin{itemize}
 694
 695 \item {\bf The {\tt .text} segment}
 696
 697       Every instruction in the text segment is scanned, and every
 698       branch instruction's destination is added to the list of
 699       possible branch targets.  In addition, any function that sets
 700       the link register is added to the list \footnote{actually {\tt addr+8}}.
 701       Finally, combinations of {\tt LUI} (Load Upper Immediate) and
 702       {\tt ADDIU} (Add Immediate Unsigned) are scanned for possible
 703       addresses in the {\tt .text} segment since this combination of
 704       instructions is often used to load a 32-bit word into a
 705       register.
 706
 707 \item {\bf The {\tt .data} segment}
 708
 709       When compiling {\tt switch} statements, compilers often use a
 710       jump table stored in the {\tt .data} segment.  Unfortunately
 711       they typically do not identify these jump tables in any way.
 712       Therefore, the entire {\tt .data} segment is conservatively
 713       scanned for possible addresses in the {\tt .text} segment.
 714
 715 \item {\bf The symbol table}
 716
 717       This is mainly used as a backup.  Scanning the {\tt .text} and
 718       {\tt .data} segments should identify any possible jump targets;
 719       however, adding all function symbols in the ELF symbol table
 720       also catches functions that are never called directly from the
 721       MIPS binary, such as those invoked only via the NestedVM
 722       runtime's {\tt call()} method.
 723
 724 \end{itemize}
 725
 726 Eliminating unnecessary {\tt case} statements provided a 10-25\% speed
 727 increase.
 728
 729 Despite all the above optimizations, one insurmountable obstacle
 730 remained: the Java {\tt .class} file format limits the constant pool
 731 to 65535 entries.  Every integer literal greater than {\tt 32767}
 732 requires an entry in this pool, and each branch instruction generates
 733 one of these.
 734
 735 One suboptimal solution was to express constants as offsets from a few
 736 central values; for example ``{\tt pc~=~N\_0x00010000~+~0x10}'' (where
 737 {\tt N\_0x000100000} is a non-final field to prevent {\tt javac} from
 738 inlining it).  This was sufficient to get reasonably large binaries to
 739 compile, and caused only a small (approximately 5\%) performance
 740 degredation and a similarly small increase in the size of the {\tt
 741 .class} file.  However, as we will see in the next section, compiling
 742 directly to {\tt .class} files (without the intermediate {\tt .java}
 743 file) eliminates this problem entirely.
 744
 745 \subsection{Binary-to-Binary mode}
 746
 747 Most of the performance improvemen where made where in the handling of
 748 branch instructions and in taking advantage of the JVM stack to
 749 eliminate unnecessary {\tt LOAD}s and {\tt STORE}s to local variables.
 750
 751 \epsfig{file=chart7,width=3in}
 752
 753 The first optimization gained by direct bytecode generation came from
 754 the use of the JVM {\tt GOTO} instruction.  Despite the fact that the
 755 Java {\it language} does not have a {\tt goto} keyword, the VM does in
 756 fact have a corresponding instruction which is used quite heavily by
 757 {\tt javac}.  NestedVM's binary-to-binary mode exploits this
 758 instruction to avoid emitting inefficient {\tt switch..case}
 759 structures.
 760
 761 Related to the {\tt GOTO} instruction is branch statement
 762 optimization.  When emitting source code, NestedVM translates branches
 763 into Java source code like this:
 764
 765 {\footnotesize\begin{verbatim}
 766     if (condition) {
 767         pc = TARGET;
 768         continue;
 769     }
 770 \end{verbatim}}
 771
 772 This requires a branch in the JVM {\it regardless} of whether the MIPS
 773 branch is actually taken.  If {\tt condition} is false the JVM has to
 774 jump over the code to set {\tt pc} and go back to the {\tt switch}
 775 statemenmt; if {\tt condition} is true the JVM has to jump to the {\tt
 776 switch} block.  By generating bytecode directly, NestedVM is able to
 777 emit a JVM bytecode branching directly to the address corresponding to
 778 the target of the MIPS branch.  In the case where the branch is not
 779 taken the JVM doesn't branch at all.
 780
 781 A side effect of the previous two optimizations is a solution to the
 782 excess constant pool entries problem.  When jumps are implemented as
 783 {\tt GOTO}s and branches are taken directly, the {\tt pc} field does
 784 not need to be set.  This eliminates a huge number of constant pool
 785 entries.  The {\tt .class} file constant pool size limit is still
 786 present, but it is less likely to be encountered.
 787
 788 Implementation of the MIPS delay slot offers another opportunity for
 789 bytecode-level optimization.  In order to take advantage of
 790 instructions already in the pipeline, the MIPS ISA specifies that the
 791 instruction after a jump or branch is always executed, even if the
 792 jump/branch is taken.  This instruction is referred to as the ``delay
 793 slot\footnote{Newer MIPS CPUs have pipelines that are much larger than
 794 early MIPS CPUs, so they have to discard instructions anyways}.''  The
 795 instruction in the delay slot is actually executed {\it before} the
 796 branch is taken.  To further complicate matters, values from the
 797 register file are loaded {\it before} the delay slot is executed.
 798
 799 Fortunately there is a very elegent solution to this problem which can
 800 be expressed in JVM bytecode.  When a branch instruction is
 801 encountered, the registers needed for the comparison are pushed onto
 802 the stack to prepare for the JVM branch instruction.  Then, {\it
 803 after} the values are on the stack the delay slot instruction is
 804 emitted, followed by the actual JVM branch instruction.  Because the
 805 values were pushed to the stack before the delay slot was executed, any
 806 changes the delay slot made to the registers are not visible to the
 807 branch bytecode.
 808
 809 One final advantage that generating bytecode directly allows is a
 810 reduction in the size of the ultimate {\tt .class} file.  All the
 811 optimizations above lead to more compact bytecode as a beneficial side
 812 effect; in addition, NestedVM performs a few additional optimizations.
 813
 814 When encountering the following {\tt switch} block, both {\tt javac}
 815 and {\tt jikes} generate redundant bytecode.
 816
 817 {\footnotesize\begin{verbatim}
 818     switch(pc>>>8) {
 819         case 0x1: run_1(); break;
 820         case 0x2: run_2(); break
 821         ...
 822         case 0x100: run_100(); break;
 823     }
 824 \end{verbatim}}
 825
 826 The first bytecode in each case arm in the switch statement is {\tt
 827 ALOAD\_0} to prepare for a {\tt INVOKESPECIAL} call.  By simply
 828 lifting this bytecode outside of the {\tt switch} statement, each {\tt
 829 case} arm shrinks by one instruction.
 830
 831 \subsection{Compiler Flags}
 832
 833 Although NestedVM perfectly emulates a MIPS R2000 CPU, its performance
 834 profile is nothing like that of actual silicon.  In particular, {\tt
 835 gcc} makes several optimizations that increase performance on an
 836 actually MIPS CPU but actually decrease the performance of
 837 NestedVM-generated bytecode.  We found the following compiler options
 838 could be used to improve performance:
 839
 840 \begin{itemize}
 841
 842 \item {\tt -falign-functions}
 843
 844       Normally a function's location in memory has no effect on its
 845       execution speed.  However, in the NestedVM binary translator,
 846       the {\tt .text} segment is split on power-of-two boundaries.  If
 847       a function starts near the end of one of these boundaries, a
 848       performance critical part of the function winds up spanning two
 849       Java methods.  Telling {\tt gcc} to align all functions along
 850       these boundaries decreases the chance of this sort of splitting.
 851
 852 \item {\tt -fno-rename-registers}
 853
 854       On an actual silicon chip, using additional registers carries no
 855       performance penalty (as long as none are spilled to the stack).
 856       However, when generating bytecode, using {\it fewer}
 857       ``registers'' helps the JVM optimize the machine code it
 858       generates by simplifying the constraints it needs to deal with.
 859       Disabling register renaming has this effect.
 860
 861 \item {\tt -fno-schedule-insns}
 862
 863       Results of MIPS load operations are not available until {\it
 864       two} instructions after the load.  Without the {\tt
 865       -fno-schedule-insns} instruction, {\tt gcc} will attempt to
 866       reorder instructions to do other useful work during this period
 867       of unavailability.  NestedVM is under no such constraint, so
 868       removing this reordering typically generates simpler machine
 869       code.
 870
 871 \item {\tt -mmemcpy}
 872
 873       Enabling this instruction causes {\tt gcc} to use the system
 874       {\tt memcpy()} routine instead of generating loads and stores.
 875       As explained in the next section, the NestedVM runtime
 876       implements {\tt memcpy()} using {\tt System.arraycopy()}, which
 877       is substantially more efficient.
 878
 879 \item {\tt -ffunction-sections -fdata-sections}
 880
 881       These two options are used in conjunction with the {\tt
 882       --gc-section} linker option, prompting the linker to more
 883       aggressively prune dead code.
 884
 885 \end{itemize}
 886
 887 The effects of the various optimizations presented in this chapter are
 888 summarized in the table below.
 889
 890 \epsfig{file=chart4,width=3in}
 891
 892 \epsfig{file=chart3,width=3in}
 893
 894
 895 \subsection{FreeType, {\tt libmspack}, and {\tt libjpeg}}
 896
 897 The Ibex Project utilizes three libraries for which no Java-only
 898 equivalent exists.  The first is the FreeType font library, which
 899 parses, hints, and rasterizes TrueType and Postscript fonts with
 900 exceptional quality.  The project also needed an open source JPEG
 901 decompressor; surprisingly, none exist for Java.  While encoders are
 902 plentiful
 903
 904 These three libraries make minimal use of the standard library and OS
 905 services, and are all written in very portable ANSI C code, which made
 906 them easy targets for initial development.
 907
 908 \subsection{The GNU Compiler Collection}
 909
 910 Our next target, {\tt gcc}, was chosen in order to relieve developers
 911 from the time-consuming and complex task of building a compiler
 912 themselves.  The Ibex Project requires a specially configured and
 913 patched version of {\tt gcc} and its ahead-of-time Java compiler ({\tt
 914 gcj}) which is not distributed in binary form.
 915
 916 GCC was the first ``major'' application NestedVM was used on, and
 917 drove the development of most of the system library interface
 918 development; particularly support for {\tt fork()} and {\tt exec()},
 919 which require the NestedVM Runtime to perform binary-to-bytecode
 920 translation on the fly.
 921
 922 GCC also makes extensive use of 64-bit integers ({\tt long long}),
 923 which -- for performance reasons -- are typically manipulated using
 924 nonobvious instruction sequences on the 32-bit MIPS architecture.
 925 Dealing with these operations uncovered a number of bugs in the
 926 translator.
 927
 928 Despite our original goal, we have not yet been able to translate the
 929 {\tt C++} or Java front-ends, as the resulting binary produces a
 930 trampoline which exceeds the maximum size of a single method.  Future
 931 work will explore a multi-level trampoline to address this issue.
 932
 933
 934
 935 \subsection{\TeX and LINPACK}
 936
 937 In order to distinguish NestedVM from other single-language
 938 translators for the JVM, we undertook the task of translating \TeX 89
 939 (written in Pascal) and the Fortran source code for LINPACK into Java
 940 bytecodes.
 941
 942 Although actually producing the initial MIPS binaries from the \TeX\
 943 source code turned out to be an exceptionally tedious and frustrating
 944 task, the resulting binary translated and executed perfectly on the
 945 first run, as did LINPACK.  Our reward for this effort was typesetting
 946 our presentation of NestedVM using NestedVM itself.  We have also had
 947 initial successes running \TeX\ in a Java Applet, and intend to
 948 produce a {\tt jar} for embedding \TeX\ code (``\TeX lets'') in web
 949 pages without the use of a post-processing tool.
 950
 951 The LINPACK benchmark called our attention to Java's lack of an API
 952 for checking the ``cpu time'' of a process.  Unfortunately we had to
 953 substitute wall-clock time on an otherwise-quiescent machine as an
 954 approximation.
 955
 956
 957
 958 \section{The NestedVM Runtime}
 959
 960 In addition to binary-to-source and binary-to-binary translation,
 961 NestedVM also includes a MIPS binary interpreter.  All three
 962 translation approaches expose the same API to both the translated
 963 binary and the surrounding VM (including peer Java code).
 964
 965 The NestedVM Runtime (various subclasses of {\tt
 966 org.ibex.nestedvm.Runtime}) fill the role of an OS Kernel.
 967 Communication between MIPS code and the outside world is via the MIPS
 968 {\tt SYSCALL} instruction, just as the {\tt libc}-kernel interface is
 969 on real MIPS implementations.
 970
 971 Two implemenations of the runtime are offered; a simple runtime with
 972 the minimum support required to comply with ANSI C, and a more
 973 sophisticated runtime which emulates a large portion of the POSIX API.
 974
 975 \subsection{The ANSI C Runtime}
 976
 977 The ANSI C runtime offers typical file I/O operations including {\tt
 978 open()}, {\tt close()}, {\tt read()}, {\tt write()}, and {\tt seek()}.
 979 File descriptors are implemented much as they are in OS kernels; a
 980 table of open files is maintained and descriptors act as an index into
 981 that table.  Each file is represented as a Java {\tt RandomAccessFile}
 982 in order to match the semantics of {\tt seek()}.
 983
 984 Process-level memory management is done through the {\tt sbrk()}
 985 system call, which extends the process heap by adding more pages to
 986 the memory page table.  Fast memory clear and copy operations can be
 987 performed with {\tt memset()} and {\tt memcpy()}, which invoke the
 988 Java {\tt System.arraycopy()} method, which is generally much faster
 989 than a {\tt for()} loop.
 990
 991 The {\tt exit()} call records the exit status, marks the VM instance
 992 as terminated and halts execution.  The {\tt pause()} syscall
 993 implements a crude form of Java-MIPS communication by returning
 994 control to the Java code which spawned the MIPS process.
 995
 996 \subsection{The Unix Runtime}
 997
 998 The Unix runtime extends the simple ANSI file I/O model to include a
 999 single-root filesystem, and device nodes, as well as {\tt fcntl()}
1000 APIs to manipulate these.  Device nodes are generally simulated by
1001 mapping reads, writes, and {\tt fcntl()}s on the device to the
1002 appropriate Java API.
1003
1004 MIPS processes can ``mount'' other filesystems within the virtual
1005 filesystem made visible to the MIPS process.  Each filesystem is
1006 implemented by a Java class, which could, for example, offer access to
1007 the host filesystem (including {\tt state()}, {\tt lstat()}, {\tt
1008 mkdir}, and {\tt unlink()}, and {\tt getdents()}), the contents of a
1009 zip archive, or even a remote HTTP server.
1010
1011 MIPS processes can also spawn subprocesses using the {\tt fork()} and
1012 {\tt exec()} calls, which create new Java threads to run the process.
1013 The {\tt fork()} call -- which is supposed to duplicate the memory
1014 image of a process -- is implemented in an elegant manner by calling
1015 the Java {\tt clone()} method (deep copy) on the VM object itself.
1016 Copy-on-write is not currently implemented.  The new instance is added
1017 to a static process table to facilitate interprocess communication.
1018
1019 The {\tt waitpid()} API allows a parent process to block pending the
1020 completion of a child process, which is done quite easily with the
1021 Java {\tt wait()} method.
1022
1023 The {\tt exec()} method actually loads a MIPS binary image from the
1024 filesystem, feeds it to the MIPS-to-bytecode translator, and then
1025 loads the resulting bytecode on the fly using {\tt
1026 ClassLoader.loadBytes()}.  The {\tt pipe()} system call permits
1027 parent-to-child IPC just as on a normal Unix system.
1028
1029 Simple networking support is provided by the {\tt opensocket()}, {\tt
1030 listensocket()}, and {\tt accept()} methods, which are not currently
1031 compatible with the usual Berkeley sockets API.
1032
1033
1034 \subsection{Security Concerns}
1035
1036 RuntimeExceptions don't escape, they care caught and turned into
1037 checked exceptions filesystem access does though security manager
1038 (NestedVM Runtime.SecurityManager, and the JVM's)
1039
1040
1041 \subsection{Threading}
1042
1043 The NestedVM runtime currently does not support threading.  Providing
1044 robust support for ``true threads'', whereby each MIPS thread maps to
1045 a Java thread is probably not possible as the Java Memory Model
1046 [CITE], since all MIPS memory is stored in a set of {\tt int[]}'s and
1047 the Java Memory Model does not permit varying treatment or coherency
1048 policies at the granularity of a single array element.
1049
1050 While this presents a major barrier for applications that use
1051 sophisticated locking schemes (such as {\it hash synchronization})
1052 which depend on atomic memory operations, it is probably possible to
1053 apply this threading model to ``well behaved'' multithreaded
1054 applications which depend only on OS-provided semaphores and mutexes
1055 for synchronization.
1056
1057 Complex synchronization and incorrectly synchronized applications can
1058 be supported by implementing a variant of {\it user threads} within a
1059 single Java thread by providing a timer interrupt (via a Java
1060 asynchronous exception).  Unfortunately this requires that the
1061 compiled binary be able to restart from any arbitrary instruction
1062 address, which would require a {\tt case} statement for every
1063 instruction (rather than every jump target), which would degrade
1064 performance and increase the size of the resulting class file.
1065
1066
1067 \section{Conclusion}
1068
1069 \subsection{Theoretical Limitations}
1070
1071 Theoretical limitations: threads, reading from code, self-modifying
1072 code, COW?
1073
1074 \subsection{Future Directions}
1075
1076 Although we have only implemented it for the Java Virtual Machine, our
1077 technique generalizes to other safe bytecode architectures.  In
1078 particular we would like to demonstrate this generality by retargeting
1079 the translator to the Microsoft Intermediate Language \cite{msil}.
1080
1081 Additionally, we would like to explore other uses for dynamic loading
1082 of translated MIPS binaries by combining NestedVM (which itself is
1083 written in Java) and the {\tt ClassLoader.fromBytes()} mechanism.
1084
1085 \subsection{Availability}
1086
1087 NestedVM is available under an open source license, and can be
1088 obtained from
1089 \begin{verbatim}
1090     http://nestedvm.ibex.org
1091 \end{verbatim}
1092
1093 \appendix
1094 \section{Appendix: Testing Methodology}
1095
1096 All times are measured in seconds. These were all run on a dual 1Ghz
1097 Macintosh G4 running Apple's latest JVM (Sun HotSpot JDK 1.4.1). Each
1098 test was run 8 times within a single VM. The highest and lowest times
1099 were removed and the remaining 6 were averaged.  In each case only the
1100 first run differed significantly from the rest.
1101
1102 The {\tt libjpeg} test consisted of decoding a 1280x1024 jpeg and
1103 writing a tga.  The {\tt mspack} test consisted of extracting all
1104 members from {\tt arial32.exe}, {\tt comic32.exe}, {\tt times32.exe},
1105 and {\tt verdan32.exe}. The {\tt libfreetype} test consisted of
1106 rendering ASCII characters 32-127 of {\tt Comic.TTF} at sizes from 8
1107 to 48 incrementing by 4 for a total of 950 glyphs.
1108
1109 \bibliography{nestedvm}
1110
1111 \end{document}
1112