conflict merge

[nestedvm.git] / doc / nestedvm.ivme04.tex
diff --git a/doc/nestedvm.ivme04.tex b/doc/nestedvm.ivme04.tex

index 87be5a3..c6cccbb 100644 (file)
--- a/doc/nestedvm.ivme04.tex
+++ b/doc/nestedvm.ivme04.tex
@@ -12,20 +12,20 @@
  \usepackage{parskip}
  \usepackage{tabularx}
  \usepackage{alltt}
-\bibliographystyle{alpha}
+\bibliographystyle{amsplain}
  
  \title{\textbf{\textsf{
-NestedVM: Total Translation of Native Code into Safe Bytecode
+Complete Translation of Unsafe Native Code to Safe Bytecode
  }}}
  \date{}
  \author{\begin{tabular}{@{}c@{}}
          {\em {Brian Alliet}} \\
          {Rochester Institute of Technology}\\
-        {\tt brian@ibex.org}
+        {\tt bja8464@cs.rit.edu}
     \end{tabular}\hskip 1in\begin{tabular}{@{}c@{}}
          {\em {Adam Megacz}} \\
-        {UC Berkeley Statistical Computing Facility} \\
-        {\tt adam@ibex.org}
+        {University of California, Berkeley} \\
+        {\tt megacz@cs.berkeley.edu}
  \end{tabular}}
  \begin{document}
  
@@ -33,228 +33,318 @@ NestedVM: Total Translation of Native Code into Safe Bytecode
  
  \begin{abstract}
  
-We present a new approach to utilizing unsafe legacy code
-within safe virtual machines by compiling to MIPS machine code as an
-intermediate language.  This approach carries N key benefits over
-existing techniques:
+Most existing techniques for using code written in an unsafe language
+within a safe virtual machine involve transformations from one source
+code language (such as C) to another (such as Java) and then to
+virtual machine bytecodes.  We present an alternative approach which
+uses a standard compiler to turn unsafe source code into unsafe MIPS
+binaries, which are then translated into safe virtual machine
+bytecodes.  This approach offers four key advantages over existing
+techniques:
  
  \begin{itemize}
-\item total coverage of all language features, unlike source translation
-\item no build process modifications
-\item no post-translation human intervention
-\item efficient bytecode
+\item Total coverage of all language features
+\item No post-translation human intervention
+\item No build process modifications
+\item Bug-for-bug compiler compatability
  \end{itemize}
  
-We also present NestedVM, a complete system in production use which
-implements this technique.  We conclude with quantitative performance
-measurements and suggestions for VM acceleration of the resulting
-bytecodes.
-
+We have implemented this technique in NestedVM, a binary-to-source and
+binary-to-binary translator targeting the Java Virtual Machine.
+NestedVM-translated versions of the {\tt libfreetype}, {\tt libjpeg},
+and {\tt libmspack} libraries are currently in production use.
+Performance measurements indicate a best case performance within 3x of
+native code and worst case typically within 10x, making it an
+attractive solution for code which is not performance-critical.
  
  \end{abstract}
  
  \section{Introduction}
  
-The C programming language \cite{KR} has been in use for over 30
-years.  Consequently, there is a huge library of software written in
-this language.  Although Java offers substantial benefits \cite{} over
-C (and C++), its comparatively young age means that it often lacks
-equivalents of many C/C++ libraries.
-
-The typical solution to this dilemma is to use JNI \cite{} or CNI
-\cite{} to invoke C code from within a Java VM.  Unfortunately, there
+Unsafe languages such as C \cite{KR} and C++ \cite{soustroup} have
+been in use much longer than any of today's widely accepted safe
+languages such as Java \cite{java} and C\# \cite{csharp}.  Consequently, there is
+a huge library of software written in these languages.  Although safe
+languages offer substantial benefits, their comparatively young age
+often puts them at a disadvantage when breadth of existing support
+code is an important criterion.
+
+The typical solution to this dilemma is to use a native interface such
+as JNI \cite{jni} or CNI \cite{cni} to invoke unsafe code from within a
+virtual machine or otherwise safe environment.  Unfortunately, there
  are a number of situations in which this is not an acceptable
-solution due to security concerns:
-
-\begin{itemize}
-
-\item Java Applets are not permitted to invoke {\tt
-      Runtime.loadLibrary()}
-
-\item Java Servlet containers with a {\tt SecurityManager} will not
-      permit loading new JNI libraries.  This configuration is popular
-      with {\it shared hosting} providers and corporate intranets
-      where a number of different parties contribute individual web
-      applications which are run together in a single container.
-
-\item Unlike Java Bytecode, JNI code is susceptible to buffer overflow
-      and heap corruption attacks.  This can be a major security
-      vulnerability.
-
-\end{itemize}
-
-In addition to security concerns, JNI and CNI carry other
-disadvantages:
-
-\begin{itemize}
-
-\item JNI requires the native library to be compiled ahead of time,
-      separately, for every architecture on which it will be deployed.
-      This is unworkable for situations in which the full set of
-      target architectures is not known at deployment time.
-
-\item The increasingly popular J2ME \cite{} platform does not support
-      JNI or CNI.
-
-\item JNI often introduces undesirable added complexity to an
-      application.
-
-\end{itemize}
-
-The technique we present here is based on using a typical ANSI C
-compiler to compile C/C++ code into a MIPS binary, and then using a
-tool to translate that binary on an instruction-by-instruction basis
-into Java bytecode.
-
-The technique presented here is general; we anticipate that it can be
-applied to other secure virtual machines such as Microsoft's .NET
-\cite{}, Perl Parrot \cite{}, or Python bytecode \cite{}.
+solution.  These situations can be broadly classified into two
+categories: {\it security concerns} and {\it portability concerns}.
+
+Using Java as an example, JNI and CNI are prohibited in a number of
+contexts, including applets environments and servlet containers with a
+{\tt SecurityManager}.  Additionally, even in the context of trusted
+code, {\tt native} methods invoked via JNI are susceptible to buffer
+overflow and heap corruption attacks which are not a concern for
+verified bytecode.
+
+The second class of disadvantages revolves around portability
+concerns; native interfaces require the native library to be compiled
+ahead of time, for every architecture on which they will be
+deployed.  This is unworkable for situations in which the full set of
+target architectures is not known at deployment time.  Additionally,
+some JVM platform variants such as J2ME \cite{j2me} simply do not offer
+support for native code.
+
+The technique we present here uses typical compiler to compile unsafe
+code into a MIPS binary, which is then translated on an
+instruction-by-instruction basis into Java bytecode.  The technique
+presented here is general; we anticipate that it can be applied to
+other secure virtual machines such as Microsoft's .NET \cite{msil}, Perl
+Parrot \cite{parrot}, or Python bytecode \cite{python}.
  
  \section{Approaches to Translation}
  
-Techniques for translating unsafe code into VM bytecode generally fall
-into four categories:
-
-\begin{itemize}
-\item source-to-source translation
-\item source-to-binary translation
-\item binary-to-source translation
-\item binary-to-binary translation
-\end{itemize}
+The four program representations of interest in this context, along
+with their specific types in the C-to-JVM instantiation of the
+problem are shown in the following diagram:
  
-\begin{figure}[h]
  \begin{pdfpic}
  \newlength{\MyLength}
  \settowidth{\MyLength}{machine code}
-\newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
-\begin{psmatrix}[colsep=3,rowsep=3]
+\newcommand{\MyBox}[1]{\makebox[\MyLength][c]{#1}}
+\begin{psmatrix}[colsep=2,rowsep=0]
+  & \\[0pt]
    [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
-  [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\
+  [name=s00]\MyBox{\tt (.c)} & [name=s11]\MyBox{\tt (.java)}   \\[0pt]
+  & \\[0pt]
+  & \\[0pt]
+  & \\[0pt]
+  [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
+  [name=b00]\MyBox{\tt (.o)}  & [name=b11]\MyBox{\tt (.class)} \\
+  & \\[0pt]
    \psset{nodesep=5pt,arrows=->}
-  \ncline{s0}{b0}<{\it gcc}
-  \ncline{s0}{s1}\aput{:U}{\it c2java}
-  \ncline{s0}{b1}\aput{:U}{\it gcc bytecode backend}
-  \ncline{s1}{b1}>{\it javac}
  \end{psmatrix}
  \end{pdfpic}
-\caption{\label{lattice} Conversion Lattice with examples of tools specific to a C/JVM scenario}
-\end{figure}
  
-\begin{figure}[h]
+To illustrate the context of this diagram, the following arcs show the
+translations performed by a few familiar tools:
+
  \begin{pdfpic}
  \newlength{\MyLength}
-\settowidth{\MyLength}{machine code}
+\settowidth{\MyLength}{xmachine codex}
  \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
-\begin{psmatrix}[colsep=3,rowsep=3,nrot=:U]
+\psmatrix[colsep=2,rowsep=0,nrot=:D]
+  & \\[0pt]
    [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
-  [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\
+  & \\[0pt]
+  & \\[0pt]
+  & \\[0pt]
+  & \\[0pt]
+  & \\[0pt]
+  [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
+  & \\[0pt]
    \psset{nodesep=5pt,arrows=->}
-  \ncline{s0}{b0}<{\it gcc}
-  \ncline{s1}{b1}>{\it javac}
-  \ncline{b0}{s1}\naput{\it NestedVM}
-  \ncline{b0}{s1}\nbput{\it binary-to-source}
-\end{psmatrix}
+  \ncline{s0}{b0}\bput{:U}{\tt gcc}
+  \ncline{s1}{b0}\bput{:D}{\tt gcj}
+  \ncline{s1}{b1}\aput{:U}{\tt javac}
+  \ncline{b1}{b0}\aput{:D}{\tt gcj}\bput{:D}{JITs}
+\endpsmatrix
  \end{pdfpic}
-\caption{\label{lattice2} Conversion Lattice including NestedVM in {\it source-output} mode}
-\end{figure}
  
-\begin{figure}[h]
+Techniques for translating unsafe code into VM bytecode generally fall
+into four categories, which we expand upon in the next two sections:
+
+\begin{itemize}
+\item source-to-source translation
+\item source-to-binary translation
+\item binary-to-source translation
+\item binary-to-binary translation
+\end{itemize}
+
+\section{Existing Work}
+
+\subsection{Source-to-Source Translation}
+
+The most common technique employed is partial translation from unsafe
+source code to safe source code:
+
  \begin{pdfpic}
  \newlength{\MyLength}
-\settowidth{\MyLength}{machine code}
+\settowidth{\MyLength}{xmachine codex}
  \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
-\begin{psmatrix}[colsep=3,rowsep=3,nrot=:U]
+\psmatrix[colsep=2,rowsep=0,nrot=:U]
+  & \\[0pt]
+  & \\[0pt]
    [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
-  [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\
+  & \\[0pt]
+  & \\[0pt]
+  & \\[0pt]
+  & \\[0pt]
+  & \\[0pt]
+  [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
+  & \\[0pt]
    \psset{nodesep=5pt,arrows=->}
-  \ncline{s0}{b0}<{\it gcc}
-  \ncline{s1}{b1}>{\it javac}
-  \ncline{b0}{b1}\naput{\it NestedVM}
-  \ncline{b0}{b1}\nbput{\it binary-to-binary}
-\end{psmatrix}
+  \ncline{s0}{s1}\aput{:U}{source-to}\bput{:U}{source}
+  \ncline{s1}{b1}\aput{:U}{\tt javac}
+\endpsmatrix
  \end{pdfpic}
-\caption{\label{lattice3} Conversion Lattice including NestedVM in {\it bytecode-output} mode}
-\end{figure}
  
-A diagram showing these four translation approaches in the context of
-running C/C++ code within a Java VM is shown in Figure~\ref{lattice}.
+A number of existing systems employ this technique; they can
+be divided into two categories: those which perform a partial
+translation which is completed by a human, and those which perform a
+total translation but fail (yield an error) on a large class of input
+programs.
  
-\subsection{Existing Work}
-\subsubsection{Source-to-Source Translation}
  
-\begin{itemize}
-\item c2java
-\item commercial products
-\end{itemize}
+\subsubsection{Incomplete Translation}
+
+Jazillian \cite{jazillian} is a commercial solution which produces
+extremely readable Java source code from C source code, but ony
+translates a small portion of the C language.  Jazillian is unique in
+that in addition to {\it language migration}, it also performs {\it
+API migration}; for example, Jazillian is intelligent enough
+to translate {\tt char*~s1~=~strcpy(s2)} into {\tt String~s1~=~s2}.
+
+Unfortunately such deep analysis is intractible for most of the C
+language and standard library; Jazillian's documentation notes that
+{\it ``This is not your father's language translator.  It's not
+generating ugly code that's guaranteed to work out of the
+box... Jazillian does not always produce code that works correctly.''}
+
+MoHCA-Java \cite{mohca} is the other major tool in this category, and steps
+beyond Jazillian by providing tools for analysis of the source C++
+abstract syntax tree.  Additionally, MoHCA-Java's analysis engine is
+extensible, making it a platform for constructing application-specific
+translators rather than a single translation tool.  However,
+MoHCA-Java does not always generate complete Java code for all of the C++
+programs which it accepts.
+
+
+\subsubsection{Partial Domain Translation}
+
+The c2j \cite{c2j}, c2j++ \cite{c2jpp}, Cappucinno \cite{capp},
+and Ephedra \cite{ephedra} systems each provide support for complete
+translation of a {\it subset} of the source language (C or C++).  Each
+of the four tools supports a progressively greater subset than the one
+preceding it; however none covers the entire input language.
+
+Ephedra, the most advanced of the four, supports most of the C++
+language, and claims to produce ``human readable'' Java code as
+output.  Notable omissions from the input domain include support for
+fully general pointer arithmetic, casting between unrelated types, and
+reading from a {\tt union} via a different member than the one most
+recently written.
+
+Unfortunately, when the program being translated is large and complex,
+it is quite likely that it will use an unsupported feature in at least
+one place.  In the absence of a programmer who understands the source
+program, a single anomoly is often enough to render the entire
+translation process useless.  As a result, these tools are mainly
+useful as an {\it aid} to programmers who could normally perform the
+conversion themselves, but want to save time by automating most of the
+process.
  
-A number of commercial products and research projects attempt to
-translate C++ code to Java code, preserving the mapping of C++ classes
-to Java classes.  Unfortunately this is problematic since there is no
-way to do pointer arithmetic except within arrays, and even in that
-case, arithmetic cannot be done in terms of fractional objects.
  
-Mention gcc backend
+\subsection{Source-to-Binary Translation}
+
+Source-to-binary translation involves a compiler for the unsafe
+language which has been modified to emit safe bytecode.
+
+\begin{pdfpic}
+\newlength{\MyLength}
+\settowidth{\MyLength}{xmachine codex}
+\newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
+\psmatrix[colsep=2,rowsep=0,nrot=:U]
+  & \\[0pt]
+  [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
+  & \\[0pt]
+  & \\[0pt]
+  & \\[0pt]
+  & \\[0pt]
+  & \\[0pt]
+  [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
+  & \\[0pt]
+  \psset{nodesep=5pt,arrows=->}
+  \ncline{s0}{b1}\bput{:U}{source-to-binary}
+\endpsmatrix
+\end{pdfpic}
+
+The primary occupant of this category is {\tt egcs-jvm}
+\cite{egcsjvm}, an experimental ``JVM backend'' for the GNU Compiler
+Collection ( {\tt gcc} ) \cite{gcc}.  Since {\tt gcc} employs a highlym
+odular architecture, it {\it is} possible to add RTL code generators
+for nonstandard processors.  However, {\tt gcc}'s parsing, RTL
+generation, and optimization layers make fundamental assumptions (such
+as the availability of pointer math) which cannot be directly
+supported; thus the compiler still fails for a substantial class of
+input programs.
  
-Many of these products advise the user to tweak the code which results
-from the translation.  Unfortunately, hand-modifying machine-generated
-code is generally a bad idea, since this modification cannot be
-automated.  This means that every time the origin code changes, the
-code generator must be re-run, and the hand modifications must be
-performed yet again.  This is an error-prone process.
  
-Furthermore, NestedVM does not attempt to read C code directly.  This
-frees it from the complex task of faithfully implementing the ANSI C
-standard (or, in the case of non ANSI-C compliant code, some other
-interface).  This also saves the user the chore of altering their
-build process to accomodate NestedVM.
  
  \section{NestedVM}
  
-NestedVM takes a novel approach; it uses compiled machine code as a
-starting point for the translation process.  NestedVM has gone through
-two iterations:
+The principal difference between NestedVM and other approaches is that
+NestedVM {\it does not} attempt to deal with source code as an input.
+This leads immediately to three advantages:
  
  \begin{itemize}
-\item binary-to-source compilation  (Figure~\ref{lattice2})
-\item binary-to-binary compilation  (Figure~\ref{lattice3})
-\end{itemize}
+\item {\bf Total coverage of all language features}
  
-\subsection{Translation Process}
+      Because NestedVM does not attempt to implement the parsing and
+      code generation steps of compilation, it is freed from the
+      extremely complex task of faithfully implementing languages
+      which are often not fully or formally specified (such as C and
+      C++).
  
-Translating a legacy library for use within a JVM proceeds as follows:
+\item {\bf No build process modifications}
  
-\begin{enumerate}
+      NestedVM does not modify existing build processes, which can be
+      extremely complex and dependent on strange preprocessor usage as
+      well as the complex interplay between compiler switches and
+      header file locations.
  
-\item Compile the source code to a statically linked binary, targeting
-      the MIPS R2000 ISA.
-
-\item Invoke {\tt NestedVM} on the statically linked binary.
-      Typically this will involve linking against {\tt libc}, which
-      translates system requests (such as {\tt open()}, {\tt read()},
-      or {\tt write()}) into appropriate invocations of the MIPS
-      {\tt SYSCALL} instruction.
+\item {\bf Bug-for-bug compiler compatability}
  
-\item (If using binary-to-source translation) compile the resulting
-      {\tt .java} code using {\tt jikes} or {\tt javac}.
+      Since NestedVM uses the compiler's {\it output} as its own {\it
+      input}, it ensures that programs which are inadvertently
+      dependent on the vagaries of a particular compiler can still be
+      used.
  
-\item (Optional) compile the resulting bytecode into a {\it safe}
-      native binary using {\tt gcj}.
+\end{itemize}
  
-\item From java code, invoke the {\tt run()} method on the generated
-      class.  This is equivalent to the {\tt main()} entry point.
+NestedVM's approach carries a fourth benefit as well, arising from its
+totality:
  
-\end{enumerate}
+\begin{itemize}
+\item {\bf No post-translation human intervention}
+
+      NestedVM offers total support for all non-privileged
+      instructions, registers, and resources found on a MIPS {\tt
+      R2000} CPU, including the add/multiply unit and floating point
+      coprocessor.  As such, it constitutes a total function mapping
+      from the entire domain of (non-kernel-mode) programs onto (a
+      subset of) the semantics of the Java Virtual Machine.  This
+      ensures that the translation process is fully automated and
+      always succeeds for valid input binaries.
+\end{itemize}
  
+This is a much more important factor than is obvious at first glance.
+If post-translation human intervention is required, then the {\it
+human becomes part of the build process}.  This means that if a third
+party library used in the project needs to be upgraded, {\it a human
+must intervene} in the rebuild process.  In addition to slowing the
+process and introducing opportunities for error, this task often
+requires specialized knowledge which becomes tied to the particular
+individual performing this task, rather than being encoded in build
+scripts which persist throughout the lifetime of the project.
  
  \subsection{Why MIPS?}
  
-We chose MIPS as a source format for two primary reasons: the
-availability of tools to translate legacy code into MIPS binaries, and
-the close similarity between the MIPS ISA and the Java Virtual Machine.
+We chose MIPS as a source format for three reasons: the availability
+of tools to compile legacy code into MIPS binaries, the close
+similarity between the MIPS ISA and the Java Virtual Machine, and the
+relatively high degree of program structure that can be inferred from
+ABI-adherent binaries.
  
  The MIPS architecture has been around for quite some time, and is well
  supported by the GNU Compiler Collection, which is capable of
-compiling C, C++, Java, Fortran, Pascal (with p2c), and Objective C
+compiling C, C++, Java, Fortran, Pascal, and Objective C
  into MIPS binaries.
  
  The MIPS R2000 ISA bears a striking similarity to the Java Virtual
@@ -262,13 +352,13 @@ Machine:
  
  \begin{itemize}
  
-%\item The original MIPS ISA supports only 32-bit aligned memory loads
-%      and stores.  This allows NestedVM to represent memory as a Java
-%      {\tt int[]} without introducing additional overhead.
-\item Most of the instructions in the original MIPS ISA operate only on
-      32-bit aligned memory locations. This allows NestedVM to represent
-      memory as a Java {\tt int[]} array without introducing additional 
-      overhead.
+\item Most of the instructions in the original MIPS ISA operate only
+      on 32-bit aligned memory locations. This allows NestedVM to
+      represent memory as a Java {\tt int[]} array without introducing
+      additional overhead.  The remaining non-aligned memory load
+      instructions are only rarely emitted by most compilers since
+      they carry a performance penalty on physical MIPS
+      implementations.
  
  \item Unlike its predecessor, the R2000 supports 32-bit by 32-bit
        multiply and divide instructions as well as a single and double
@@ -277,23 +367,53 @@ Machine:
  
  \end{itemize}
  
+Finally, the MIPS ISA and ABI convey quite a bit of information about
+program structure.  This information can be used for optimization
+purposes:
  
-\subsection{Binary-to-Source Compilation}
+\begin{itemize}
  
-The first incarnation of NestedVM was a binary-to-source compiler.
-This version reads in a MIPS binary and emits Java source code, which
-can be compiled with {\tt javac}, {\tt jikes}, or {\tt gcj}.
+\item The structure of MIPS branching and jump instructions make it
+      easy to infer the set of likely target instructions.
  
-This implementation was primarily a first step towards the
-binary-to-binary compiler.  Conveniently, generating Java source code
-frees NestedVM from having to perform simple constant propagation
-optimizations, since most Java compilers already do this.  A recurring
-example is the treatment of the {\tt r0} register, which is fixed as
-{\tt 0} in the MIPS ISA.
+\item The MIPS ABI specifies particular registers as caller-save and
+      callee-save, as well as designating a register for the return
+      address after a function call.  This allows NestedVM to optimize
+      many operations for the common case of ABI-adherent binaries.
  
-Lacking the ability to generate specially optimized bytecode
-sequences, a straightforward mapping of the general purpose hardware
-registers to 32 {\tt int} fields was optimal.
+\item All MIPS instructions are exactly 32 bits long.
+
+\end{itemize}
+
+
+
+\subsection{Binary-to-Source}
+
+The simplest operational mode for NestedVM is binary-to-source
+translation.  In this mode, NestedVM translates MIPS binaries into
+Java source code, which is then fed to a Java compiler in order to
+produce bytecode files:
+
+\begin{pdfpic}
+\newlength{\MyLength}
+\settowidth{\MyLength}{xmachine codex}
+\newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
+\psmatrix[colsep=2,rowsep=0,nrot=:U]
+  & \\[0pt]
+  & \\[0pt]
+  [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
+  & \\[0pt]
+  & \\[0pt]
+  & \\[0pt]
+  & \\[0pt]
+  & \\[0pt]
+  [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
+  \psset{nodesep=5pt,arrows=->}
+  \ncline{s0}{b0}\bput{:U}{\tt gcc}
+  \ncline{s1}{b1}\aput{:U}{\tt javac}
+  \ncline{b0}{s1}\naput{\tt NestedVM}
+\endpsmatrix
+\end{pdfpic}
  
  \begin{figure*}[t]
  \begin{minipage}[c]{7in}%
@@ -383,402 +503,440 @@ public void trampoline() {
  \caption{\label{code1} Trampoline transformation necessitated by Java's 64kb method size limit}
  \end{figure*}
  
-Unfortunately Java imposes a 64kb limit on the size of the bytecode
+Translating unsafe code for use within a JVM proceeds as follows:
+
+\begin{enumerate}
+
+\item Compile the source code to a statically linked binary, targeting
+      the MIPS R2000 ISA.  Typically this will involve linking against
+      {\tt libc}, which translates system requests (such as {\tt
+      open()}, {\tt read()}, or {\tt write()}) into appropriate
+      invocations of the MIPS {\tt SYSCALL} instruction.
+
+\item Invoke {\tt NestedVM} on the statically linked binary.
+
+\item Compile the resulting {\tt .java} code using {\tt jikes}
+      \cite{jikes} or {\tt javac}.
+
+\item From java code, invoke the {\tt run()} method on the generated
+      class.  This is equivalent to the {\tt main()} entry point.
+
+\end{enumerate}
+
+\subsubsection{Optimizations}
+
+Generating Java source code instead of bytecode frees NestedVM from
+having to perform simple constant propagation optimizations, as most
+Java compilers already do this.  A recurring example is the treatment
+of the {\tt r0} register, which is fixed as {\tt 0} in the MIPS ISA.
+
+Lacking the ability to generate specially optimized bytecode
+sequences, a straightforward mapping of the general purpose hardware
+registers to 32 {\tt int} fields turned out to be optimal.
+
+
+\epsfig{file=chart1,width=3in}
+
+Unfortunately, Java imposes a 64kb limit on the size of the bytecode
  for a single method.  This presents a problem for NestedVM, and
  necessitates a {\it trampoline transformation}, as shown in
-Figure~\ref{code1}.  With this trampoline in place somewhat large
-binaries can be handled without much difficulty -- fortunately there
-is no corresponding limit on the size of a classfile as a whole.
+Figure~\ref{code1}.  With this trampoline in place, large binaries can
+be handled without much difficulty -- fortunately, there is no
+corresponding limit on the size of a classfile as a whole.
+
+One difficulty that arose as a result of using the trampoline
+transformation was the fact that {\tt javac} and {\tt jikes} are
+unable to properly optimize its switch statements.  For example, the
+following code is compiled into a comparatively inefficient {\tt
+LOOKUPSWITCH}:
+
+{\footnotesize
+\begin{verbatim}
+    switch(pc&0xffffff00) {
+        case 0x00000100: run_100(); break;
+        case 0x00000200: run_200(); break;
+        case 0x00000300: run_300(); break;
+    }
+\end{verbatim}}
  
-Another interesting problem that was discovered while creating the
-trampoline method was javac and Jikes' inability to properly optimize
-switch statements.  The code in Figure~\ref{lookupswitch} is compiled
-into a comparatively inefficient {\tt LOOKUPSWITCH}, while the code in
-Figure~\ref{tableswitch} is optimized into a {\tt TABLESWITCH}.
+Whereas the next block of code code optimized into a {\tt
+TABLESWITCH}:
  
-\begin{figure}
-{\footnotesize\begin{verbatim}
-switch(pc&0xffffff00) {
-    case 0x00000100: run_100(); break;
-    case 0x00000200: run_200(); break;
-    case 0x00000300: run_300(); break;
-}
+{\footnotesize
+\begin{verbatim}
+    switch(pc>>>8) {
+        case 0x1: run_100();
+        case 0x2: run_200();
+        case 0x3: run_300();
+    }
  \end{verbatim}}
-\caption{\label{lookupswitch} Code which {\it is not} optimized into a tableswitch}
-\end{figure}
  
-\begin{figure}
-{\footnotesize\begin{verbatim}
-Brian, we're missing the code here... can you put it in?
-\end{verbatim}}
-\caption{\label{tableswitch} Code which {\it is} optimized into a tableswitch}
-\end{figure}
+This problem was surmounted by switching on a denser set of {\tt case}
+values, which is more amenable to the {\tt TABLESWITCH} structure.
+This change alone nearly doubled the speed of the compiled binary.
+
+The next performance improvement came from tuning the size of the
+methods invoked from the trampoline.  Trial and error led to the
+onclusion that HotSpot \cite{hotspot} -- the most widely deployed JVM
+-- performs best when 128 MIPS instructions are mapped to each method.
+
+\epsfig{file=chart5,width=3in}
  
-Javac isn't smart enough to see the pattern in the case values and
-generates very suboptimal bytecode. Manually doing the shifts
-convinces javac to emit a tableswitch statement, which is
-significantly faster. This change alone nearly doubled the speed of
-the compiled binary.
+\epsfig{file=chart6,width=3in}
  
-Finding the optimal method size lead to the next big performance
-increase.  It was determined through experimentation that the optimal
-number of MIPS instructions per method is 128 (considering only power
-of two options). Going above or below that lead to performance
-decreases. This is most likely due to a combination of two factors.
+This phenomenon is due to two factors:
  
  \begin{itemize}
  
-\item The two levels of switch statements jumps have to pass though -
-      The first switch statement jumps go through is the trampoline
-      switch statement. This is implemented as a {\tt TABLESWITCH} in JVM
-      bytecode so it is very fast. The second level switch statement
-      in the individual run\_ methods is implemented as a
-      {\tt LOOKUPSWITCH}, which is much slower. Using smaller methods puts
-      more burden on the faster {\tt TABLESWITCH} and less on the slower
-      {\tt LOOKUPSWITCH}.
+\item While the trampoline method's {\tt switch} statement can be
+      coded as a {\tt TABLESWITCH}, the {\tt switch} statement
+      within the individual methods is to sparse to encode this way.
  
-\item JIT compilers probably favor smaller methods smaller methods are
-      easier to compile and are probably better candidates for JIT
-      compilation than larger methods.
+\item Hybrid Interpretive-JIT compilers such as HotSpot generally
+      favor smaller methods since they are easier to compile and are
+      better candidates for compilation in ``normal'' programs (unlike
+      NestedVM programs).
  
  \end{itemize}
  
-Put a chart in here
-
-The next big optimization was eliminating unnecessary case
-statements. Having case statements before each instruction prevents
-JIT compilers from being able to optimize across instruction
-boundaries. In order to eliminate unnecessary case statements every
-possible address that could be jumped to directly needed to be
-identified. The sources for possible jump targets come from 3 places.
+After tuning method sizes, our next performance boost came from
+eliminating exraneous case branches.  Having case statements before
+each instruction prevents JIT compilers from being able to optimize
+across instruction boundaries, since control flow can enter the body
+of a {\tt switch} statement at any of the {\tt case}s.  In order to
+eliminate unnecessary case statements we needed to identify all
+possible jump targets.  Jump targets can come from three sources:
  
  \begin{itemize}
  
-\item The .text segment - Every instruction in the text segment is
-      scanned for jump targets. Every branch instruction (BEQ, JAL,
-      etc) has its destination added to the list of possible branch
-      targets. In addition, functions that set the link register have
-      theirpc+8 added to the list (the address that would've been put
-      to the link register). Finally, combinations of LUI (Load Upper
-      Immediate) of ADDIU (Add Immediate Unsigned) are scanned for
-      possible addresses in the text segment. This combination of
+\item {\bf The {\tt .text} segment}
+
+      Every instruction in the text segment is scanned, and every
+      branch instruction's destination is added to the list of
+      possible branch targets.  In addition, any function that sets
+      the link register is added to the list \footnote{actually {\tt addr+8}}.
+      Finally, combinations of {\tt LUI} (Load Upper Immediate) and
+      {\tt ADDIU} (Add Immediate Unsigned) are scanned for possible
+      addresses in the {\tt .text} segment since this combination of
        instructions is often used to load a 32-bit word into a
        register.
  
-\item The .data segment - When GCC generates switch() statements it
-      often uses a jump table stored in the .data
-      segment. Unfortunately gcc doesn't identify these jump tables in
-      any way. Therefore, the entire .data segment is conservatively
-      scanned for possible addresses in the .text segment.
+\item {\bf The {\tt .data} segment}
+
+      When compiling {\tt switch} statements, compilers often use a
+      jump table stored in the {\tt .data} segment.  Unfortunately
+      they typically do not identify these jump tables in any way.
+      Therefore, the entire {\tt .data} segment is conservatively
+      scanned for possible addresses in the {\tt .text} segment.
        
-\item The symbol table - This is mainly used as a backup. Scanning the
-      .text and .data segments should identify any possible jump
-      targets but adding every function in the symbol table in the ELF
-      binary doesn't hurt. This will also catch functions that are
-      never called directly from the MIPS binary (for example,
-      functions called with the call() method in the runtime).
+\item {\bf The symbol table}
+
+      This is mainly used as a backup.  Scanning the {\tt .text} and
+      {\tt .data} segments should identify any possible jump targets;
+      however, adding all function symbols in the ELF symbol table
+      also catches functions that are never called directly from the
+      MIPS binary, such as those invoked only via the NestedVM
+      runtime's {\tt call()} method.
  
  \end{itemize}
  
-Eliminating unnecessary case statements provided a 10-25\% speed
+Eliminating unnecessary {\tt case} statements provided a 10-25\% speed
  increase.
  
-Despite all the above optimizations and workaround an impossible to
-workaround hard classfile limit was eventually hit, the constant
-pool. The constant pool in classfiles is limited to 65536
-entries. Every Integer with a magnitude greater than 32767 requires an
-entry in the constant pool. Every time the compiler emits a
-jump or branch instruction the PC field is set to the branch target. This
-means nearly every branch instruction requires an entry in the
-constant pool. Large binaries hit this limit fairly quickly. One
-workaround that was employed in the Java source compiler was to
-express constants as offsets from a few central values. For example:
-``pc = N\_0x00010000 + 0x10'' where N\_0x000100000 is a non-final
-field to prevent javac from inlining it. This was sufficient to get
-reasonable large binaries to compile. It has a small (approximately
-5\%) performance impact on the generated code. It also makes the
-generated classfile somewhat larger.  Fortunately, the classfile
-compiler eliminates this problem.
-
-
-\subsection{Binary-to-Binary Translation}
-
-The next step in the evolution of NestedVM was to compile directly to
-JVM bytecode eliminating the intermediate javac step. This had several
-advantages:
+Despite all the above optimizations, one insurmountable obstacle
+remained: the Java {\tt .class} file format limits the constant pool
+to 65535 entries.  Every integer literal greater than {\tt 32767}
+requires an entry in this pool, and each branch instruction generates
+one of these.
+
+One suboptimal solution was to express constants as offsets from a few
+central values; for example ``{\tt pc~=~N\_0x00010000~+~0x10}'' (where
+{\tt N\_0x000100000} is a non-final field to prevent {\tt javac} from
+inlining it).  This was sufficient to get reasonably large binaries to
+compile, and caused only a small (approximately 5\%) performance
+degredation and a similarly small increase in the size of the {\tt
+.class} file.  However, as we will see in the next section, compiling
+directly to {\tt .class} files (without the intermediate {\tt .java}
+file) eliminates this problem entirely.
+
+
+\subsection{Binary-to-Binary}
+
+After implementing the binary-to-source compiler, a binary-to-binary
+translation mode was added.
+
+\begin{pdfpic}
+\newlength{\MyLength}
+\settowidth{\MyLength}{xmachine codex}
+\newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
+\psmatrix[colsep=2,rowsep=0,nrot=:U]
+  & \\[0pt]
+  [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source}   \\[0pt]
+  & \\[0pt]
+  & \\[0pt]
+  & \\[0pt]
+  & \\[0pt]
+  & \\[0pt]
+  [name=b0]\MyBox{machine code}  & [name=b1]\MyBox{safe bytecode} \\[0pt]
+  & \\[0pt]
+  \psset{nodesep=5pt,arrows=->}
+  \ncline{s0}{b0}\bput{:U}{\tt gcc}
+  \ncline{b0}{b1}\naput{\tt NestedVM}
+\endpsmatrix
+\end{pdfpic}
+
+This mode has several advantages:
  
  \begin{itemize}
        
-\item There are little tricks that can be done in JVM bytecode that
-      can't be done in Java source code.
+\item There are quite a few interesting bytecode sequences that cannot
+      be generated as a result of compiling Java source code.
  
-\item Eliminates the time-consuming javac step - Javac takes a long
-      time to parse and compile the output from the java source
-      compiler.
+\item Directly generating {\tt .class} files Eliminates the
+      time-consuming {\tt javac} step.
  
-\item Allows for MIPS binaries to be compiled and loaded into a
-      running VM using a class loader. This eliminates the need to
-      compile the binaries ahead of time.
+\item Direct compilation to {\tt .class} files opens up the
+      interesting possibility of dynamically translating MIPS binaries
+      and loading them via {\tt ClassLoader.fromBytes()} {\it at
+      deployment time}, eliminating the need to compile binaries ahead
+      of time.
  
  \end{itemize}
  
-By generating code at the bytecode level there are many areas where
-the compiler can be smarter than javac. Most of the areas where
-improvements where made where in the handling of branch instructions
-and in taking advantage of the JVM stack to eliminate unnecessary
-LOADs and STOREs to local variables.
-
-The first obvious optimization that generating bytecode allows for is the
-use of GOTO. Despite the fact that java doesn't have a GOTO keyword a GOTO
-bytecode does exist and is used heavily in the code generates by javac.
-Unfortunately the java language doesn't provide any way to take advantage of
-this. As result of this, jumps within a method were implemented in the
-binary-to-source compiler by setting the PC field to the new address and
-making a trip back to the initial switch statement.  In the classfile
-compiler these jumps are implemented as GOTOs directly to the target
-instruction. This saves a costly trip back through the LOOKUPSWITCH
-statement and is a huge win for small loops within a method.
-
-Somewhat related to using GOTO is the ability to optimize branch
-statements. In the Java source compiler branch statements are
-implemented as follows (delay slots are ignored for the purpose of
-this example):
+Most of the performance improvemen where made where in the handling of
+branch instructions and in taking advantage of the JVM stack to
+eliminate unnecessary {\tt LOAD}s and {\tt STORE}s to local variables.
+
+\epsfig{file=chart7,width=3in}
+
+The first optimization gained by direct bytecode generation came from
+the use of the JVM {\tt GOTO} instruction.  Despite the fact that the
+Java {\it language} does not have a {\tt goto} keyword, the VM does in
+fact have a corresponding instruction which is used quite heavily by
+{\tt javac}.  NestedVM's binary-to-binary mode exploits this
+instruction to avoid emitting inefficient {\tt switch..case}
+structures.
+
+Related to the {\tt GOTO} instruction is branch statement
+optimization.  When emitting source code, NestedVM translates branches
+into Java source code like this:
  
  {\footnotesize\begin{verbatim}
-if(condition) { pc = TARGET; continue; }
+    if (condition) {
+        pc = TARGET;
+        continue;
+    }
  \end{verbatim}}
  
-This requires a branch in the JVM regardless of whether the MIPS
-branch is actually taken. If condition is false the JVM has to jump
-over the code to set the PC and go back to the switch block. If
-condition is true the JVM as to jump to the switch block. By
-generating bytecode directly we can make the target of the JVM branch
-statement the actual bytecode of the final destination. In the case
-where the branch isn't taken the JVM doesn't need to branch at all.
-
-A side affect of the above two optimizations is a solution to the
-excess constant pool entries problem. When jumps are implemented as
-GOTOs and direct branches to the target the PC field doesn't need to
-be set. This eliminates many of the constant pool entries the java
-source compiler requires. The limit is still there however, and given
-a large enough binary it will still be reached.
-
-Delay slots are another area where things are done somewhat
-inefficiently in the Java source compiler. In order to take advantage
-of instructions already in the pipeline MIPS cpu have a ``delay
-slot''. That is, an instruction after a branch or jump instruction that
-is executed regardless of whether the branch is taken. This is done
-because by the time the branch or jump instruction is finished being
-processes the next instruction is already ready to be executed and it
-is wasteful to discard it. (However, newer MIPS CPUs have pipelines
-that are much larger than early MIPS CPUs so they have to discard many
-instructions anyway.) As a result of this the instruction in the delay
-slot is actually executed BEFORE the branch is taken. To make things
-even more difficult, values from the register file are loaded BEFORE
-the delay slot is executed.  Here is a small piece of MIPS assembly:
+This requires a branch in the JVM {\it regardless} of whether the MIPS
+branch is actually taken.  If {\tt condition} is false the JVM has to
+jump over the code to set {\tt pc} and go back to the {\tt switch}
+statemenmt; if {\tt condition} is true the JVM has to jump to the {\tt
+switch} block.  By generating bytecode directly, NestedVM is able to
+emit a JVM bytecode branching directly to the address corresponding to
+the target of the MIPS branch.  In the case where the branch is not
+taken the JVM doesn't branch at all.
+
+A side effect of the previous two optimizations is a solution to the
+excess constant pool entries problem.  When jumps are implemented as
+{\tt GOTO}s and branches are taken directly, the {\tt pc} field does
+not need to be set.  This eliminates a huge number of constant pool
+entries.  The {\tt .class} file constant pool size limit is still
+present, but it is less likely to be encountered.
+
+Implementation of the MIPS delay slot offers another opportunity for
+bytecode-level optimization.  In order to take advantage of
+instructions already in the pipeline, the MIPS ISA specifies that the
+instruction after a jump or branch is always executed, even if the
+jump/branch is taken.  This instruction is referred to as the ``delay
+slot\footnote{Newer MIPS CPUs have pipelines that are much larger than
+early MIPS CPUs, so they have to discard instructions anyways}.''  The
+instruction in the delay slot is actually executed {\it before} the
+branch is taken.  To further complicate matters, values from the
+register file are loaded {\it before} the delay slot is executed.
+
+Fortunately there is a very elegent solution to this problem which can
+be expressed in JVM bytecode.  When a branch instruction is
+encountered, the registers needed for the comparison are pushed onto
+the stack to prepare for the JVM branch instruction.  Then, {\it
+after} the values are on the stack the delay slot instruction is
+emitted, followed by the actual JVM branch instruction.  Because the
+values were pushed to the stack before the delay slot was executed, any
+changes the delay slot made to the registers are not visible to the
+branch bytecode.
+
+One final advantage that generating bytecode directly allows is a
+reduction in the size of the ultimate {\tt .class} file.  All the
+optimizations above lead to more compact bytecode as a beneficial side
+effect; in addition, NestedVM performs a few additional optimizations.
+
+When encountering the following {\tt switch} block, both {\tt javac}
+and {\tt jikes} generate redundant bytecode.
  
  {\footnotesize\begin{verbatim}
-ADDIU r2,r0,-1
-BLTZ r2, target
-ADDIU r2,r2,10
-...
-:target
+    switch(pc>>>8) {
+        case 0x1: run_1(); break;
+        case 0x2: run_2(); break
+        ...
+        case 0x100: run_100(); break;
+    }
  \end{verbatim}}
  
-This piece of code is executed as follows
+The first bytecode in each case arm in the switch statement is {\tt
+ALOAD\_0} to prepare for a {\tt INVOKESPECIAL} call.  By simply
+lifting this bytecode outside of the {\tt switch} statement, each {\tt
+case} arm shrinks by one instruction.
  
-\begin{enumerate}
-
-\item r2 is set to -1
+\subsubsection{Compiler Flags}
  
-\item r2 is loaded from the register file by the BLTEZ instruction
-      
-\item 10 is added to r2 by the ADDIU instruction
+Although NestedVM perfectly emulates a MIPS R2000 CPU, its performance
+profile is nothing like that of actual silicon.  In particular, {\tt
+gcc} makes several optimizations that increase performance on an
+actually MIPS CPU but actually decrease the performance of
+NestedVM-generated bytecode.  We found the following compiler options
+could be used to improve performance:
  
-\item The branch is taken because at the time the BLTZ instruction was
-      executed r2 was -1, but r2 is now 9 (-1 + 10)
+\begin{itemize}
  
-\end{enumerate}
+\item {\tt -falign-functions}
  
-There is a very elegent solution to this problem when using JVM
-bytecode. When a branch instruction is encountered the registers
-needed for the comparison are pushed onto the stack to prepare for the
-JVM branch instruction. Then, AFTER the values are on the stack the
-delay slot is emitted, and then finally the actual JVM branch
-instruction. Because the values were pushed to the stack before the
-delay slot was executed any changes the delay slot made to the
-registers are not visible to the branch bytecode. This allows delay
-slots to be used with no performance penalty or size penalty.
-
-One final advantage that generating bytecode directly allows is
-smaller more compact bytecode. All the optimization above lead to
-smaller bytecode as a side effect. There are also a few other areas
-where the generated bytecode can be optimized for size with more
-knowledge of the program as a whole.
-
-When encountering the following switch block both javac and Jikes
-generate redundant bytecode.
+      Normally a function's location in memory has no effect on its
+      execution speed.  However, in the NestedVM binary translator,
+      the {\tt .text} segment is split on power-of-two boundaries.  If
+      a function starts near the end of one of these boundaries, a
+      performance critical part of the function winds up spanning two
+      Java methods.  Telling {\tt gcc} to align all functions along
+      these boundaries decreases the chance of this sort of splitting.
  
-{\footnotesize\begin{verbatim}
-switch(pc>>>8) {
-    case 0x1: run_1(); break;
-    case 0x2: run_2(); break
-    ...
-    case 0x100: run_100(); break;
-}
-\end{verbatim}}
+\item {\tt -fno-rename-registers}
  
-The first bytecode in each case arm in the switch statement is ALOAD\_0 to
-prepare for a invoke special call. By simple moving this outside the switch
-statement each case arm was reduced in size by one instruction. Similar
-optimizations were also done in other parts of the compiler.
+      On an actual silicon chip, using additional registers carries no
+      performance penalty (as long as none are spilled to the stack).
+      However, when generating bytecode, using {\it fewer}
+      ``registers'' helps the JVM optimize the machine code it
+      generates by simplifying the constraints it needs to deal with.
+      Disabling register renaming has this effect.
  
+\item {\tt -fno-schedule-insns}
  
-\section{Interfacing with Java Code}
+      Results of MIPS load operations are not available until {\it
+      two} instructions after the load.  Without the {\tt
+      -fno-schedule-insns} instruction, {\tt gcc} will attempt to
+      reorder instructions to do other useful work during this period
+      of unavailability.  NestedVM is under no such constraint, so
+      removing this reordering typically generates simpler machine
+      code.
  
-Java source code can create a copy of the translated binary by
-instantiating the corresponding class, which extends {\tt Runtime}.
-Invoking the {\tt main()} method on this class is equivalent to
-calling the {\tt main()} function within the binary; the {\tt String}
-arguments to this function are copied into the binary's memory space
-and made available as {\tt **argv} and {\tt argc}.
+\item {\tt -mmemcpy}
  
-The translated binary communicates with the rest of the VM by
-executing MIPS {\tt SYSCALL} instructions, which are translated into
-invocations of the {\tt syscall()} method.  This calls back to the
-native Java world, which can manipulate the binary's environment by
-reading and writing to its memory space, checking its exit status,
-pausing the VM, and restarting the VM.
+      Enabling this instruction causes {\tt gcc} to use the system
+      {\tt memcpy()} routine instead of generating loads and stores.
+      As explained in the next section, the NestedVM runtime
+      implements {\tt memcpy()} using {\tt System.arraycopy()}, which
+      is substantially more efficient.
  
+\item {\tt -ffunction-sections -fdata-sections}
  
-\subsection{Virtualization}
+      These two options are used in conjunction with the {\tt
+      --gc-section} linker option, prompting the linker to more
+      aggressively prune dead code.
  
-The {\tt Runtime} class implements the majority of the standard {\tt
-libc} syscalls, providing a complete interface to the filesystem,
-network socket library, time of day, (Brian: what else goes here?).
+\end{itemize}
  
-\begin{itemize}
+The effects of the various optimizations presented in this chapter are
+summarized in the table below.
  
-\item ability to provide the same interface to CNI code and
-      NestedVMified code
-      
-\item security advantages (chroot the {\tt fork()}ed process)
+\epsfig{file=chart4,width=3in}
  
-\end{itemize}
+\epsfig{file=chart3,width=3in}
  
+\section{The NestedVM Runtime}
  
-\section{Quantitative Performance}
+In addition to binary-to-source and binary-to-binary translation,
+NestedVM also includes a MIPS binary interpreter.  All three
+translation approaches expose the same API to both the translated
+binary and the surrounding VM (including peer Java code).
  
-\subsection{Charts}
+\subsection{The Runtime Class}
  
-(Note that none of these libraries have pure-Java equivalents.)
+The runtime fulfills four roles:
  
  \begin{itemize}
-\item libjpeg
-\item libfreetype
-\item libmspack
+      
+\item It provides a simple, consistent external interface.  The method
+      of actually executing the code (currently only translated
+      binaries and the interpreter) can be changed without any code
+      changes to the caller because only runtime exposes a public
+      interface.  This includes methods to pass arguments to the
+      binary's {\tt main()} function, read and write from memory, and
+      call individual functions in the binary.
+      
+\item It manages the process's memory.  The runtime class contains
+      large {\tt int[]} arrays that represent the process`s entire
+      memory space.  Subclasses read and write to these arrays as
+      required by the instructions they are executing, and can expand
+      their memory space using the {\tt sbrk} system call.
+      
+\item The runtime provides access to the host file system and standard
+      I/O streams.  Subclasses of {\tt runtime} can access the file
+      system through standard UNIX syscalls ({\tt read()}, {\tt
+      write()}, {\tt open()}, etc).  The runtime manages the file
+      descriptor table that maps UNIX file descriptors to Java {\tt
+      RandomAccessFile}s, {\tt InputStream}s, {\tt OutputStream}s, and
+      {\tt Socket}s.
+      
+\item It provides general OS services, including {\tt sleep()}, {\tt
+      gettimeofday()}, {\tt getpagesize()}, {\tt sysconf()}, {\tt
+      fcntl()}, and so on.
+      
  \end{itemize}
  
+\section{Future Directions}
  
-\subsection{Optimizations}
+Although we have only implemented it for the Java Virtual Machine, our
+technique generalizes to other safe bytecode architectures.  In
+particular we would like to demonstrate this generality by retargeting
+the translator to the Microsoft Intermediate Language \cite{msil}.
  
-Brian, can you write something to go here?  Just mention which
-optimizations helped and which ones hurt.
+Additionally, we would like to explore other uses for dynamic loading
+of translated MIPS binaries by combining NestedVM (which itself is
+written in Java) and the {\tt ClassLoader.fromBytes()} mechanism.
  
-\begin{itemize}
-\item {\tt trampoline}
-\item {\tt optimal method size}
-\item {\tt -msingle-float}
-\item {\tt -mmemcpy}
-\item {\tt fastmem}
-\item {\tt local vars for registers (useless)}
-\item {\tt -fno-rename-registers}
-\item {\tt -ffast-math}
-\item {\tt -fno-trapping-math}
-\item {\tt -fsingle-precision-constant}
-\item {\tt -mfused-madd}
-\item {\tt -freg-struct-return}
-\item {\tt -freduce-all-givs}
-\item {\tt -fno-peephole}
-\item {\tt -fno-peephole2}
-\item {\tt -fmove-all-movables}
-\item {\tt -fno-sched-spec-load}
-\item {\tt -fno-sched-spec}
-\item {\tt -fno-schedule-insns}
-\item {\tt -fno-schedule-insns2}
-\item {\tt -fno-delayed-branch}
-\item {\tt -fno-function-cse}
-\item {\tt -ffunction-sections}
-\item {\tt -fdata-sections}
-\item {\tt array bounds checking}
-\item {\tt -falign-functions=n}
-\item {\tt -falign-labels=n}
-\item {\tt -falign-loops=n}
-\item {\tt -falign-jumps=n}
-\item {\tt -fno-function-cse}
-\end{itemize}
  
-\section{Future Directions}
+\section{Conclusion}
  
-World domination.
+We have presented a novel technique for using libraries written in
+unsafe languages within a safe virtual machine without resorting to
+native interfaces.  We have implemented this technique in NestedVM,
+which is currently used by the Ibex project\footnote{{\tt
+http://www.ibex.org}} to perform font rasterization (via {\tt
+libfreetype}), JPEG decoding (via {\tt libjpeg}), and CAB archive
+extraction (via {\tt libmspack}), three libraries for which no
+equivalent Java classes exist.
  
-\section{Conclusion}
+NestedVM is available under an open source license, and can be
+obtained from
+\begin{verbatim}
+    http://nestedvm.ibex.org
+\end{verbatim}
  
-We rock the hizzouse.
  
-\section{References}
+\section{Appendix: Testing Methodology}
  
-Yer mom.
+All times are measured in seconds. These were all run on a dual 1Ghz
+Macintosh G4 running Apple's latest JVM (Sun HotSpot JDK 1.4.1). Each
+test was run 8 times within a single VM. The highest and lowest times
+were removed and the remaining 6 were averaged.  In each case only the
+first run differed significantly from the rest.
  
-\section{stuff}
-\begin{onecolumn}
-{\footnotesize\begin{verbatim}
+The {\tt libjpeg} test consisted of decoding a 1280x1024 jpeg and
+writing a tga.  The {\tt mspack} test consisted of extracting all
+members from {\tt arial32.exe}, {\tt comic32.exe}, {\tt times32.exe},
+and {\tt verdan32.exe}. The {\tt libfreetype} test consisted of
+rendering ASCII characters 32-127 of {\tt Comic.TTF} at sizes from 8
+to 48 incrementing by 4 for a total of 950 glyphs.
+
+\bibliography{nestedvm}
  
-libjpeg (render thebride_1280.jpg)
-Native -  0.235s
-JavaSource - 1.86s
-ClassFile - 1.37s
-
-freetype (rendering characters 32-127 of Comic.TTF at sizes from 8 to
-48 incrementing by 4)
-Native - 0.201s
-JavaSource - 2.02s
-ClassFile - 1.46s
-
-                                          libjpeg  libmspack libfreetype
-Interpreted MIPS Binary                   22.2      12.9      21.4
-Compled MIPS Binary (fastest options)     3.39      2.23      4.31
-Native -O3                                0.235    0.084     0.201
-
-Compled - with all case statements        3.50      2.30      4.99
-Compiled - with pruned case statement     3.39      2.23      4.31
-
-Compiled - 512 instructions/method        62.7      27.7      56.9
-Compiled - 256 instructions/method        3.54      2.55      4.43
-Compiled - 128 instructions/method        3.39      2.23      4.31
-Compiled - 64 instructions/method         3.56      2.31      4.40
-Compiled - 32 instruction/method          3.71      2.46      4.64
-
-Compild MIPS Binary (Server VM)           3.21      2.00      4.54
-Compiled MIPS Binary (Client VM)          3.39      2.23      4.31
-
-All times are measured in seconds. These were all run on a dual 1ghz G4
-running OS X 10.3.1 with Apple's latest VM (JDK 1.4.1_01-27). Each test
-was run 8 times within a single VM. The highest and lowest times were
-removed and the remaining 6 were averaged. In each case only the first
-run differed significantly from the rest.
-
-The libjpeg test consisted of decoding a 1280x1024 jpeg
-(thebride_1280.jpg) and writing a tga. The mspack test consisted of
-extracting all members from arial32.exe, comic32.exe, times32.exe, and
-verdan32.exe. The freetype test consisted of rendering characters
-32-127 of Comic.TTF at sizes from 8 to 48 incrementing by 4. (That is
-about 950 individual glyphs).
-
-I can provide you with the source for any of these test if you'd like.
-
--Brian
-\end{verbatim}}
-\end{onecolumn}
  \end{document}