X-Git-Url: http://git.megacz.com/?p=nestedvm.git;a=blobdiff_plain;f=doc%2Fnestedvm.ivme04.tex;h=9751cd98cbc1340ff6d9d3322a7b560690561e6d;hp=b4c96a1a1501cc0221ef878fb6e2141a506cf8ea;hb=a84e68b3afbccc743521349e13ea6fa508199e30;hpb=35b4af7ef98c04944208a52dd230dd1c8803eed7 diff --git a/doc/nestedvm.ivme04.tex b/doc/nestedvm.ivme04.tex index b4c96a1..9751cd9 100644 --- a/doc/nestedvm.ivme04.tex +++ b/doc/nestedvm.ivme04.tex @@ -1,708 +1,1386 @@ -% -% FIXME: - Add something about size limits on the constant pool -% how we worked around that and the performance impact -% of -o lessconstants -% - Add something about encoding data sections as string constants -% and the UTF8 non-7-bit-ascii penalty -% - \documentclass{acmconf} \usepackage{graphicx} +\usepackage{multicol} \usepackage{amssymb,amsmath,epsfig,alltt} -\sloppy % better line breaks +\sloppy \usepackage{palatino} +\usepackage{pdftricks} +\begin{psinputs} + \usepackage{pstricks} + \usepackage{pst-node} +\end{psinputs} \usepackage{parskip} \usepackage{tabularx} \usepackage{alltt} -\bibliographystyle{alpha} +\bibliographystyle{amsplain} \title{\textbf{\textsf{ -Running Legacy C/C++ Libraries in a Pure Java Environment +Complete Translation of Unsafe Native Code to Safe Bytecode }}} \date{} \author{\begin{tabular}{@{}c@{}} - {\em {Brian Alliet}} \\ \\ - {{\it Affiliation Goes Here}}\relax + {\em {Brian Alliet}} \\ + {Rochester Institute of Technology}\\ + {\tt bja8464@cs.rit.edu} \end{tabular}\hskip 1in\begin{tabular}{@{}c@{}} - {\em {Adam Megacz}} \\ \\ - {UC Berkeley}\relax + {\em {Adam Megacz}} \\ + {University of California, Berkeley} \\ + {\tt megacz@cs.berkeley.edu} \end{tabular}} \begin{document} \maketitle \begin{abstract} -Abstract -\end{abstract} -\section{Introduction} +Most existing techniques for using code written in an unsafe language +within a safe virtual machine involve transformations from one source +code language (such as C) to another (such as Java) and then to +virtual machine bytecodes. We present an alternative approach which +uses a standard compiler to turn unsafe source code into unsafe MIPS +binaries, which are then translated into safe virtual machine +bytecodes. This approach offers four key advantages over existing +techniques: + +\begin{itemize} +\item Total coverage of all language features +\item No post-translation human intervention +\item No build process modifications +\item Bug-for-bug compiler compatability +\end{itemize} -\subsection{Why would you want to do this?} +We have implemented this technique in NestedVM, a binary-to-source and +binary-to-binary translator targeting the Java Virtual Machine. +NestedVM-translated versions of the {\tt libfreetype}, {\tt libjpeg}, +and {\tt libmspack} libraries are currently in production use. +Performance measurements indicate a best case performance within 3x of +native code and worst case typically within 10x, making it an +attractive solution for code which is not performance-critical. -The C programming language has been around for over 30 years now. -There is a huge library of software written in this language. By -contrast, Java has been around for less than ten years. Although it -offers substantial advantages over C, the set of libraries written in -this language still lags behind C/C++. +\end{abstract} -The typical solution to this dilemma is to use JNI or CNI to invoke C -code from within a Java VM. Unfortunately, there are a number of -situations in which this is not an acceptable solution: +\section{Introduction} + +Unsafe languages such as C \cite{KR} and C++ \cite{soustroup} have +been in use much longer than any of today's widely accepted safe +languages such as Java \cite{java} and C\# \cite{csharp}. Consequently, there is +a huge library of software written in these languages. Although safe +languages offer substantial benefits, their comparatively young age +often puts them at a disadvantage when breadth of existing support +code is an important criterion. + +The typical solution to this dilemma is to use a native interface such +as JNI \cite{jni} or CNI \cite{cni} to invoke unsafe code from within a +virtual machine or otherwise safe environment. Unfortunately, there +are a number of situations in which this is not an acceptable +solution. These situations can be broadly classified into two +categories: {\it security concerns} and {\it portability concerns}. + +Using Java as an example, JNI and CNI are prohibited in a number of +contexts, including applets environments and servlet containers with a +{\tt SecurityManager}. Additionally, even in the context of trusted +code, {\tt native} methods invoked via JNI are susceptible to buffer +overflow and heap corruption attacks which are not a concern for +verified bytecode. + +The second class of disadvantages revolves around portability +concerns; native interfaces require the native library to be compiled +ahead of time, for every architecture on which they will be +deployed. This is unworkable for situations in which the full set of +target architectures is not known at deployment time. Additionally, +some JVM platform variants such as J2ME \cite{j2me} simply do not offer +support for native code. + +The technique we present here uses typical compiler to compile unsafe +code into a MIPS binary, which is then translated on an +instruction-by-instruction basis into Java bytecode. The technique +presented here is general; we anticipate that it can be applied to +other secure virtual machines such as Microsoft's .NET \cite{msil}, Perl +Parrot \cite{parrot}, or Python bytecode \cite{python}. + +\section{Approaches to Translation} + +The four program representations of interest in this context, along +with their specific types in the C-to-JVM instantiation of the +problem are shown in the following diagram: + +\begin{pdfpic} +\newlength{\MyLength} +\settowidth{\MyLength}{machine code} +\newcommand{\MyBox}[1]{\makebox[\MyLength][c]{#1}} +\begin{psmatrix}[colsep=2,rowsep=0] + & \\[0pt] + [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source} \\[0pt] + [name=s00]\MyBox{\tt (.c)} & [name=s11]\MyBox{\tt (.java)} \\[0pt] + & \\[0pt] + & \\[0pt] + & \\[0pt] + [name=b0]\MyBox{machine code} & [name=b1]\MyBox{safe bytecode} \\[0pt] + [name=b00]\MyBox{\tt (.o)} & [name=b11]\MyBox{\tt (.class)} \\ + & \\[0pt] + \psset{nodesep=5pt,arrows=->} +\end{psmatrix} +\end{pdfpic} + +To illustrate the context of this diagram, the following arcs show the +translations performed by a few familiar tools: + +\begin{pdfpic} +\newlength{\MyLength} +\settowidth{\MyLength}{xmachine codex} +\newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}} +\psmatrix[colsep=2,rowsep=0,nrot=:D] + & \\[0pt] + [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source} \\[0pt] + & \\[0pt] + & \\[0pt] + & \\[0pt] + & \\[0pt] + & \\[0pt] + [name=b0]\MyBox{machine code} & [name=b1]\MyBox{safe bytecode} \\[0pt] + & \\[0pt] + \psset{nodesep=5pt,arrows=->} + \ncline{s0}{b0}\bput{:U}{\tt gcc} + \ncline{s1}{b0}\bput{:D}{\tt gcj} + \ncline{s1}{b1}\aput{:U}{\tt javac} + \ncline{b1}{b0}\aput{:D}{\tt gcj}\bput{:D}{JITs} +\endpsmatrix +\end{pdfpic} + +Techniques for translating unsafe code into VM bytecode generally fall +into four categories, which we expand upon in the next two sections: \begin{itemize} -\item Java Applets are not permitted to invoke {\tt Runtime.loadLibrary()} -\item Java Servlet containers with a {\tt SecurityManager} will not - permit loading new JNI libraries. This configuration is popular - with {\it shared hosting} providers and corporate intranets - where a number of different parties contribute individual web - applications which are run together in a single container. -\item JNI requires the native library to be compiled ahead of time, - separately, for every architecture on which it will be deployed. - This is unworkable for situations in which the full set of - target architectures is not known at deployment time. -\item The increasingly popular J2ME platform does not support JNI or CNI. -\item Unlike Java Bytecode, JNI code is susceptible to buffer overflow - and heap corruption attacks. This can be a major security - vulnerability. -\item JNI often introduces undesirable added complexity to an - application. +\item source-to-source translation +\item source-to-binary translation +\item binary-to-source translation +\item binary-to-binary translation \end{itemize} -The technique we present here is based on using a typical ANSI C -compiler to compile C/C++ code into a MIPS binary, and then using a -tool to translate that binary on an instruction-by-instruction basis -into Java bytecode. +\section{Existing Work} + +\subsection{Source-to-Source Translation} + +The most common technique employed is partial translation from unsafe +source code to safe source code: + +\begin{pdfpic} +\newlength{\MyLength} +\settowidth{\MyLength}{xmachine codex} +\newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}} +\psmatrix[colsep=2,rowsep=0,nrot=:U] + & \\[0pt] + & \\[0pt] + [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source} \\[0pt] + & \\[0pt] + & \\[0pt] + & \\[0pt] + & \\[0pt] + & \\[0pt] + [name=b0]\MyBox{machine code} & [name=b1]\MyBox{safe bytecode} \\[0pt] + & \\[0pt] + \psset{nodesep=5pt,arrows=->} + \ncline{s0}{s1}\aput{:U}{source-to}\bput{:U}{source} + \ncline{s1}{b1}\aput{:U}{\tt javac} +\endpsmatrix +\end{pdfpic} + +A number of existing systems employ this technique; they can +be divided into two categories: those which perform a partial +translation which is completed by a human, and those which perform a +total translation but fail (yield an error) on a large class of input +programs. + + +\subsubsection{Incomplete Translation} + +Jazillian \cite{jazillian} is a commercial solution which produces +extremely readable Java source code from C source code, but ony +translates a small portion of the C language. Jazillian is unique in +that in addition to {\it language migration}, it also performs {\it +API migration}; for example, Jazillian is intelligent enough +to translate {\tt char*~s1~=~strcpy(s2)} into {\tt String~s1~=~s2}. + +Unfortunately such deep analysis is intractible for most of the C +language and standard library; Jazillian's documentation notes that +{\it ``This is not your father's language translator. It's not +generating ugly code that's guaranteed to work out of the +box... Jazillian does not always produce code that works correctly.''} + +MoHCA-Java \cite{mohca} is the other major tool in this category, and steps +beyond Jazillian by providing tools for analysis of the source C++ +abstract syntax tree. Additionally, MoHCA-Java's analysis engine is +extensible, making it a platform for constructing application-specific +translators rather than a single translation tool. However, +MoHCA-Java does not always generate complete Java code for all of the C++ +programs which it accepts. + + +\subsubsection{Partial Domain Translation} + +The c2j \cite{c2j}, c2j++ \cite{c2jpp}, Cappucinno \cite{capp}, +and Ephedra \cite{ephedra} systems each provide support for complete +translation of a {\it subset} of the source language (C or C++). Each +of the four tools supports a progressively greater subset than the one +preceding it; however none covers the entire input language. + +Ephedra, the most advanced of the four, supports most of the C++ +language, and claims to produce ``human readable'' Java code as +output. Notable omissions from the input domain include support for +fully general pointer arithmetic, casting between unrelated types, and +reading from a {\tt union} via a different member than the one most +recently written. + +Unfortunately, when the program being translated is large and complex, +it is quite likely that it will use an unsupported feature in at least +one place. In the absence of a programmer who understands the source +program, a single anomoly is often enough to render the entire +translation process useless. As a result, these tools are mainly +useful as an {\it aid} to programmers who could normally perform the +conversion themselves, but want to save time by automating most of the +process. + + +\subsection{Source-to-Binary Translation} + +Source-to-binary translation involves a compiler for the unsafe +language which has been modified to emit safe bytecode. + +\begin{pdfpic} +\newlength{\MyLength} +\settowidth{\MyLength}{xmachine codex} +\newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}} +\psmatrix[colsep=2,rowsep=0,nrot=:U] + & \\[0pt] + [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source} \\[0pt] + & \\[0pt] + & \\[0pt] + & \\[0pt] + & \\[0pt] + & \\[0pt] + [name=b0]\MyBox{machine code} & [name=b1]\MyBox{safe bytecode} \\[0pt] + & \\[0pt] + \psset{nodesep=5pt,arrows=->} + \ncline{s0}{b1}\bput{:U}{source-to-binary} +\endpsmatrix +\end{pdfpic} + +The primary occupant of this category is {\tt egcs-jvm} +\cite{egcsjvm}, an experimental ``JVM backend'' for the GNU Compiler +Collection ( {\tt gcc} ) \cite{gcc}. Since {\tt gcc} employs a highlym +odular architecture, it {\it is} possible to add RTL code generators +for nonstandard processors. However, {\tt gcc}'s parsing, RTL +generation, and optimization layers make fundamental assumptions (such +as the availability of pointer math) which cannot be directly +supported; thus the compiler still fails for a substantial class of +input programs. + + + +\section{NestedVM} + +The principal difference between NestedVM and other approaches is that +NestedVM {\it does not} attempt to deal with source code as an input. +This leads immediately to three advantages: -The technique presented here is general; we anticipate that it can be -applied to other secure virtual machines such as Microsoft's .NET. +\begin{itemize} +\item {\bf Total coverage of all language features} + Because NestedVM does not attempt to implement the parsing and + code generation steps of compilation, it is freed from the + extremely complex task of faithfully implementing languages + which are often not fully or formally specified (such as C and + C++). -\section{Existing Work: Source-to-Source Translation} +\item {\bf No build process modifications} -\begin{itemize} -\item c2java -\item commercial products -\end{itemize} + NestedVM does not modify existing build processes, which can be + extremely complex and dependent on strange preprocessor usage as + well as the complex interplay between compiler switches and + header file locations. -\section{Mips2Java: Binary-to-Binary Translation} +\item {\bf Bug-for-bug compiler compatability} -We present Mips2Java, a binary-to-binary translation tool to convert -MIPS binaries into Java bytecode files. + Since NestedVM uses the compiler's {\it output} as its own {\it + input}, it ensures that programs which are inadvertently + dependent on the vagaries of a particular compiler can still be + used. -The process of utilizing Mips2Java begins by using any compiler for -any language to compile the source library into a statically linked -MIPS binary. We used {\tt gcc 3.3.3}, but any compiler which can -target the MIPS platform should be acceptable. The binary is -statically linked with a system library (in the case of C code this is -{\tt libc}) which translates system requests (such as {\tt open()}, -{\tt read()}, or {\tt write()}) into appropriate invocations of the -MIPS {\tt SYSCALL} instruction. +\end{itemize} -The statically linked MIPS binary is then fed to the Mips2Java tool -(which is itself written in Java), which emits a sequence of Java -Bytecodes in the form of a {\tt .class} file equivalent to the -provided binary. This {\tt .class} file contains a single class which -implements the {\tt Runtime} interface. This class may then be -instantiated; invoking the {\tt execute()} method is equivalent to -invoking the {\tt main()} function in the original binary. +NestedVM's approach carries a fourth benefit as well, arising from its +totality: +\begin{itemize} +\item {\bf No post-translation human intervention} + + NestedVM offers total support for all non-privileged + instructions, registers, and resources found on a MIPS {\tt + R2000} CPU, including the add/multiply unit and floating point + coprocessor. As such, it constitutes a total function mapping + from the entire domain of (non-kernel-mode) programs onto (a + subset of) the semantics of the Java Virtual Machine. This + ensures that the translation process is fully automated and + always succeeds for valid input binaries. +\end{itemize} + +This is a much more important factor than is obvious at first glance. +If post-translation human intervention is required, then the {\it +human becomes part of the build process}. This means that if a third +party library used in the project needs to be upgraded, {\it a human +must intervene} in the rebuild process. In addition to slowing the +process and introducing opportunities for error, this task often +requires specialized knowledge which becomes tied to the particular +individual performing this task, rather than being encoded in build +scripts which persist throughout the lifetime of the project. \subsection{Why MIPS?} -We chose MIPS as a source format for two primary reasons: the -availability of tools to translate legacy code into MIPS binaries, and -the close similarity between the MIPS ISA and the Java Virtual Machine. +We chose MIPS as a source format for three reasons: the availability +of tools to compile legacy code into MIPS binaries, the close +similarity between the MIPS ISA and the Java Virtual Machine, and the +relatively high degree of program structure that can be inferred from +ABI-adherent binaries. The MIPS architecture has been around for quite some time, and is well supported by the GNU Compiler Collection, which is capable of -compiling C, C++, Java, Fortran, Pascal (with p2c), and Objective C +compiling C, C++, Java, Fortran, Pascal, and Objective C into MIPS binaries. -The MIPS R1000 ISA bears a striking similarity to the Java Virtual -Machine. This early revision of the MIPS supports only 32-bit aligned -loads and stores from memory, which is precisely the access model -supported by Java for {\tt int[]}s. +The MIPS R2000 ISA bears a striking similarity to the Java Virtual +Machine: -Cover dynamic page allocation. +\begin{itemize} -Cover stack setup. +\item Most of the instructions in the original MIPS ISA operate only + on 32-bit aligned memory locations. This allows NestedVM to + represent memory as a Java {\tt int[]} array without introducing + additional overhead. The remaining non-aligned memory load + instructions are only rarely emitted by most compilers since + they carry a performance penalty on physical MIPS + implementations. -Brian, are there any other fortunate similarities we should mention -here? I seem to remember there being a bunch, but I can't recall them -right now; it's been a while since I dealt with this stuff in detail. +\item Unlike its predecessor, the R2000 supports 32-bit by 32-bit + multiply and divide instructions as well as a single and double + precision floating point unit. These capabilities map nicely + onto Java's arithmetic instructions. +\end{itemize} -\subsection{Interpreter} +Finally, the MIPS ISA and ABI convey quite a bit of information about +program structure. This information can be used for optimization +purposes: \begin{itemize} -\item slow, but easy to write -\end{itemize} +\item The structure of MIPS branching and jump instructions make it + easy to infer the set of likely target instructions. -\subsection{Compiling to Java Source} -\begin{itemize} -\item performance boost -\item pushes the limits of {\tt javac} and {\tt jikes} -\end{itemize} +\item The MIPS ABI specifies particular registers as caller-save and + callee-save, as well as designating a register for the return + address after a function call. This allows NestedVM to optimize + many operations for the common case of ABI-adherent binaries. + +\item All MIPS instructions are exactly 32 bits long. -\subsection{Compiling directly to Java Bytecode} -\begin{itemize} -\item further performance boost (quantify) -\item brian, can you add any comments here? \end{itemize} -\subsection{Interfacing with Java Code} -Java source code can create a copy of the translated binary by -instantiating the corresponding class, which extends {\tt Runtime}. -Invoking the {\tt main()} method on this class is equivalent to -calling the {\tt main()} function within the binary; the {\tt String} -arguments to this function are copied into the binary's memory space -and made available as {\tt argv**} and {\tt argc}. -The translated binary communicates with the rest of the VM by -executing MIPS {\tt SYSCALL} instructions, which are translated into -invocations of the {\tt syscall()} method. This calls back to the -native Java world, which can manipulate the binary's environment by -reading and writing to its memory space, checking its exit status, -pausing the VM, and restarting the VM. +\subsection{Binary-to-Source} + +The simplest operational mode for NestedVM is binary-to-source +translation. In this mode, NestedVM translates MIPS binaries into +Java source code, which is then fed to a Java compiler in order to +produce bytecode files: + +\begin{pdfpic} +\newlength{\MyLength} +\settowidth{\MyLength}{xmachine codex} +\newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}} +\psmatrix[colsep=2,rowsep=0,nrot=:U] + & \\[0pt] + & \\[0pt] + [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source} \\[0pt] + & \\[0pt] + & \\[0pt] + & \\[0pt] + & \\[0pt] + & \\[0pt] + [name=b0]\MyBox{machine code} & [name=b1]\MyBox{safe bytecode} \\[0pt] + \psset{nodesep=5pt,arrows=->} + \ncline{s0}{b0}\bput{:U}{\tt gcc} + \ncline{s1}{b1}\aput{:U}{\tt javac} + \ncline{b0}{s1}\naput{\tt NestedVM} +\endpsmatrix +\end{pdfpic} + +\begin{figure*}[t] +\begin{minipage}[c]{7in}% +\begin{multicols}{2} +{\footnotesize\begin{verbatim} +private final static int r0 = 0; +private int r1, r2, r3,...,r30; +private int r31 = 0xdeadbeef; +private int pc = ENTRY_POINT; +public void run() { + for(;;) { + switch(pc) { + case 0x10000: + r29 = r29 - 32; + case 0x10004: + r1 = r4 + r5; + case 0x10008: + if(r1 == r6) { + /* delay slot */ + r1 = r1 + 1; + pc = 0x10018: + continue; + } + case 0x1000C: + r1 = r1 + 1; + case 0x10010: + r31 = 0x10018; + pc = 0x10210; + continue; + case 0x10014: + /* nop */ + case 0x10018: + pc = r31; + continue; + ... + case 0xdeadbeef: + System.err.println(``Exited.''); + System.exit(1); + } + } +} +\end{verbatim}} +\vspace{1in} +{\footnotesize\begin{verbatim} +public void run_0x10000() { + for(;;) { + switch(pc) { + case 0x10000: + ... + case 0x10004: + ... + ... + case 0x10010: + r31 = 0x10018; + pc = 0x10210; + return; + ... + } + } +} + +pubic void run_0x10200() { + for(;;) { + switch(pc) { + case 0x10200: + ... + case 0x10204: + ... + } + } +} + +public void trampoline() { + for(;;) { + switch(pc&0xfffffe00) { + case 0x10000: run_0x10000(); break; + case 0x10200: run_0x10200(); break; + case 0xdeadbe00: + ... + } + } +} +\end{verbatim}} +\end{multicols} +\end{minipage} +\caption{\label{code1} Trampoline transformation necessitated by Java's 64kb method size limit} +\end{figure*} + +Translating unsafe code for use within a JVM proceeds as follows: + +\begin{enumerate} + +\item Compile the source code to a statically linked binary, targeting + the MIPS R2000 ISA. Typically this will involve linking against + {\tt libc}, which translates system requests (such as {\tt + open()}, {\tt read()}, or {\tt write()}) into appropriate + invocations of the MIPS {\tt SYSCALL} instruction. + +\item Invoke {\tt NestedVM} on the statically linked binary. + +\item Compile the resulting {\tt .java} code using {\tt jikes} + \cite{jikes} or {\tt javac}. + +\item From java code, invoke the {\tt run()} method on the generated + class. This is equivalent to the {\tt main()} entry point. -\subsection{Virtualization} +\end{enumerate} -The {\tt Runtime} class implements the majority of the standard {\tt -libc} syscalls, providing a complete interface to the filesystem, -network socket library, time of day, (Brian: what else goes here?). +\subsubsection{Optimizations} + +Generating Java source code instead of bytecode frees NestedVM from +having to perform simple constant propagation optimizations, as most +Java compilers already do this. A recurring example is the treatment +of the {\tt r0} register, which is fixed as {\tt 0} in the MIPS ISA. + +Lacking the ability to generate specially optimized bytecode +sequences, a straightforward mapping of the general purpose hardware +registers to 32 {\tt int} fields turned out to be optimal. + + +\epsfig{file=chart1,width=3in} + +Unfortunately, Java imposes a 64kb limit on the size of the bytecode +for a single method. This presents a problem for NestedVM, and +necessitates a {\it trampoline transformation}, as shown in +Figure~\ref{code1}. With this trampoline in place, large binaries can +be handled without much difficulty -- fortunately, there is no +corresponding limit on the size of a classfile as a whole. + +One difficulty that arose as a result of using the trampoline +transformation was the fact that {\tt javac} and {\tt jikes} are +unable to properly optimize its switch statements. For example, the +following code is compiled into a comparatively inefficient {\tt +LOOKUPSWITCH}: + +{\footnotesize +\begin{verbatim} + switch(pc&0xffffff00) { + case 0x00000100: run_100(); break; + case 0x00000200: run_200(); break; + case 0x00000300: run_300(); break; + } +\end{verbatim}} + +v v v v v v v +\begin{figure} +{\footnotesize\begin{verbatim} +switch(pc>>>8) { + case 0x1: run_100(); break; + case 0x2: run_200(); break; + case 0x3: run_300(); break; +} +************* +{\footnotesize +\begin{verbatim} + switch(pc>>>8) { + case 0x1: run_100(); + case 0x2: run_200(); + case 0x3: run_300(); + } +^ ^ ^ ^ ^ ^ ^ +\end{verbatim}} + +v v v v v v v +Javac is not smart enough to see the pattern in the case values and +generates very suboptimal bytecode. Manually doing the shifts +convinces javac to emit a tableswitch statement, which is +significantly faster. This change alone increased the speed of +the compiled binary by approximately 35\%. +************* +This problem was surmounted by switching on a denser set of {\tt case} +values, which is more amenable to the {\tt TABLESWITCH} structure. +This change alone nearly doubled the speed of the compiled binary. +^ ^ ^ ^ ^ ^ ^ + +v v v v v v v +Finding the optimal method size lead to the next big performance +increase. It was determined through experimentation that the optimal +number of MIPS instructions per method is 64 or 128 (considering only +powers of two). Going above or below that lead to performance +decreases. This is most likely due to a combination of two factors. +************* +The next performance improvement came from tuning the size of the +methods invoked from the trampoline. Trial and error led to the +conclusion that HotSpot \cite{hotspot}, the most popular JVM, performs +best when 128 MIPS instructions are mapped to each method. +^ ^ ^ ^ ^ ^ ^ + +\epsfig{file=chart5,width=3in} + +\epsfig{file=chart6,width=3in} + +This phenomenon is due to two factors: \begin{itemize} -\item ability to provide the same interface to CNI code and mips2javaified code -\item security advantages (chroot the {\tt fork()}ed process) -\end{itemize} -\section{Related Work} +\item While the trampoline method's {\tt switch} statement can be + coded as a {\tt TABLESWITCH}, the {\tt switch} statement + within the individual methods is to sparse to encode this way. -\subsection{Source-to-Source translators} +\item Hybrid Interpretive-JIT compilers such as HotSpot generally + favor smaller methods since they are easier to compile and are + better candidates for compilation in ``normal'' programs (unlike + NestedVM programs). -A number of commercial products and research projects attempt to -translate C++ code to Java code, preserving the mapping of C++ classes -to Java classes. Unfortunately this is problematic since there is no -way to do pointer arithmetic except within arrays, and even in that -case, arithmetic cannot be done in terms of fractional objects. +\end{itemize} -Mention gcc backend +After tuning method sizes, our next performance boost came from +eliminating exraneous case branches. Having case statements before +each instruction prevents JIT compilers from being able to optimize +across instruction boundaries, since control flow can enter the body +of a {\tt switch} statement at any of the {\tt case}s. In order to +eliminate unnecessary case statements we needed to identify all +possible jump targets. Jump targets can come from three sources: -c2java +Putting more than 256 instructions in each method lead to a severe +performance penalty. Apparently Hotspot does not handle very large methods +well. In some tests the simple moving from 256 to 512 instructions per +method decreased performance by a factor of 10. -Many of these products advise the user to tweak the code which results -from the translation. Unfortunately, hand-modifying machine-generated -code is generally a bad idea, since this modification cannot be -automated. This means that every time the origin code changes, the -code generator must be re-run, and the hand modifications must be -performed yet again. This is an error-prone process. +Put chart here -Furthermore, Mips2Java does not attempt to read C code directly. This -frees it from the complex task of faithfully implementing the ANSI C -standard (or, in the case of non ANSI-C compliant code, some other -interface). This also saves the user the chore of altering their -build process to accomodate Mips2Java. +The next big optimization was eliminating unnecessary case +statements. Having case statements before each instruction prevents +JIT compilers from being able to optimize across instruction +boundaries. In order to eliminate unnecessary case statements every +possible address that could be jumped to directly needed to be +identified. The sources for possible jump targets come from 3 places. +\begin{itemize} -\section{Performance} +v v v v v v v +\item The .text segment - Every instruction in the text segment is + scanned for jump targets. Every branch instruction (BEQ, JAL, + etc) has its destination added to the list of possible branch + targets. In addition, functions that set the link register have + theirpc+8 added to the list (the address that would have been put + to the link register). Finally, combinations of LUI (Load Upper + Immediate) of ADDIU (Add Immediate Unsigned) are scanned for + possible addresses in the text segment. This combination of +************* +\item {\bf The {\tt .text} segment} + + Every instruction in the text segment is scanned, and every + branch instruction's destination is added to the list of + possible branch targets. In addition, any function that sets + the link register is added to the list \footnote{actually {\tt addr+8}}. + Finally, combinations of {\tt LUI} (Load Upper Immediate) and + {\tt ADDIU} (Add Immediate Unsigned) are scanned for possible + addresses in the {\tt .text} segment since this combination of +^ ^ ^ ^ ^ ^ ^ + instructions is often used to load a 32-bit word into a + register. + +v v v v v v v +\item The .data segment - When GCC generates switch() statements it + often uses a jump table stored in the .data + segment. Unfortunately gcc does not identify these jump tables in + any way. Therefore, the entire .data segment is conservatively + scanned for possible addresses in the .text segment. +************* +\item {\bf The {\tt .data} segment} + + When compiling {\tt switch} statements, compilers often use a + jump table stored in the {\tt .data} segment. Unfortunately + they typically do not identify these jump tables in any way. + Therefore, the entire {\tt .data} segment is conservatively + scanned for possible addresses in the {\tt .text} segment. +^ ^ ^ ^ ^ ^ ^ + +v v v v v v v +\item The symbol table - This is mainly used as a backup. Scanning the + .text and .data segments should identify any possible jump + targets but adding every function in the symbol table in the ELF + binary does not hurt. This will also catch functions that are + never called directly from the MIPS binary (for example, + functions called with the call() method in the runtime). +************* +\item {\bf The symbol table} +^ ^ ^ ^ ^ ^ ^ + + This is mainly used as a backup. Scanning the {\tt .text} and + {\tt .data} segments should identify any possible jump targets; + however, adding all function symbols in the ELF symbol table + also catches functions that are never called directly from the + MIPS binary, such as those invoked only via the NestedVM + runtime's {\tt call()} method. -\subsection{Charts} +\end{itemize} -(Note that none of these libraries have pure-Java equivalents.) +Eliminating unnecessary {\tt case} statements provided a 10-25\% speed +increase. + +Despite all the above optimizations, one insurmountable obstacle +remained: the Java {\tt .class} file format limits the constant pool +to 65535 entries. Every integer literal greater than {\tt 32767} +requires an entry in this pool, and each branch instruction generates +one of these. + +One suboptimal solution was to express constants as offsets from a few +central values; for example ``{\tt pc~=~N\_0x00010000~+~0x10}'' (where +{\tt N\_0x000100000} is a non-final field to prevent {\tt javac} from +inlining it). This was sufficient to get reasonably large binaries to +compile, and caused only a small (approximately 5\%) performance +degredation and a similarly small increase in the size of the {\tt +.class} file. However, as we will see in the next section, compiling +directly to {\tt .class} files (without the intermediate {\tt .java} +file) eliminates this problem entirely. + + +\subsection{Binary-to-Binary} + +After implementing the binary-to-source compiler, a binary-to-binary +translation mode was added. + +\begin{pdfpic} +\newlength{\MyLength} +\settowidth{\MyLength}{xmachine codex} +\newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}} +\psmatrix[colsep=2,rowsep=0,nrot=:U] + & \\[0pt] + [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source} \\[0pt] + & \\[0pt] + & \\[0pt] + & \\[0pt] + & \\[0pt] + & \\[0pt] + [name=b0]\MyBox{machine code} & [name=b1]\MyBox{safe bytecode} \\[0pt] + & \\[0pt] + \psset{nodesep=5pt,arrows=->} + \ncline{s0}{b0}\bput{:U}{\tt gcc} + \ncline{b0}{b1}\naput{\tt NestedVM} +\endpsmatrix +\end{pdfpic} + +This mode has several advantages: \begin{itemize} -\item libjpeg -\item libfreetype -\item libmspack + +v v v v v v v +\item There are little tricks that can be done in JVM bytecode that + cannot be done in Java source code. +************* +\item There are quite a few interesting bytecode sequences that cannot + be generated as a result of compiling Java source code. +^ ^ ^ ^ ^ ^ ^ + +\item Directly generating {\tt .class} files Eliminates the + time-consuming {\tt javac} step. + +\item Direct compilation to {\tt .class} files opens up the + interesting possibility of dynamically translating MIPS binaries + and loading them via {\tt ClassLoader.fromBytes()} {\it at + deployment time}, eliminating the need to compile binaries ahead + of time. + \end{itemize} +Most of the performance improvemen where made where in the handling of +branch instructions and in taking advantage of the JVM stack to +eliminate unnecessary {\tt LOAD}s and {\tt STORE}s to local variables. + +v v v v v v v +The first obvious optimization that generating bytecode allows for is the +use of GOTO. Despite the fact that Java does not have a GOTO keyword a GOTO +bytecode does exist and is used heavily in the code generates by javac. +Unfortunately the java language does not provide any way to take advantage of +this. As result of this, jumps within a method were implemented in the +binary-to-source compiler by setting the PC field to the new address and +making a trip back to the initial switch statement. In the classfile +compiler these jumps are implemented as GOTOs directly to the target +instruction. This saves a costly trip back through the LOOKUPSWITCH +statement and is a huge win for small loops within a method. +************* +\epsfig{file=chart7,width=3in} +^ ^ ^ ^ ^ ^ ^ + +The first optimization gained by direct bytecode generation came from +the use of the JVM {\tt GOTO} instruction. Despite the fact that the +Java {\it language} does not have a {\tt goto} keyword, the VM does in +fact have a corresponding instruction which is used quite heavily by +{\tt javac}. NestedVM's binary-to-binary mode exploits this +instruction to avoid emitting inefficient {\tt switch..case} +structures. + +Related to the {\tt GOTO} instruction is branch statement +optimization. When emitting source code, NestedVM translates branches +into Java source code like this: + +{\footnotesize\begin{verbatim} + if (condition) { + pc = TARGET; + continue; + } +\end{verbatim}} + +v v v v v v v +This requires a branch in the JVM regardless of whether the MIPS +branch is actually taken. If condition is false the JVM has to jump +over the code to set the PC and go back to the switch block. If +condition is true the JVM has to jump to the switch block. By +generating bytecode directly we can make the target of the JVM branch +statement the actual bytecode of the final destination. In the case +where the branch is not taken the JVM does not need to branch at all. +************* +This requires a branch in the JVM {\it regardless} of whether the MIPS +branch is actually taken. If {\tt condition} is false the JVM has to +jump over the code to set {\tt pc} and go back to the {\tt switch} +statemenmt; if {\tt condition} is true the JVM has to jump to the {\tt +switch} block. By generating bytecode directly, NestedVM is able to +emit a JVM bytecode branching directly to the address corresponding to +the target of the MIPS branch. In the case where the branch is not +taken the JVM doesn't branch at all. +^ ^ ^ ^ ^ ^ ^ + +v v v v v v v +A side affect of the above two optimizations is a solution to the +excess constant pool entries problem. When jumps are implemented as +GOTOs and direct branches to the target the PC field does not need to +be set. This eliminates many of the constant pool entries the java +source compiler requires. The limit is still there however, and given +a large enough binary it will still be reached. +************* +A side effect of the previous two optimizations is a solution to the +excess constant pool entries problem. When jumps are implemented as +{\tt GOTO}s and branches are taken directly, the {\tt pc} field does +not need to be set. This eliminates a huge number of constant pool +entries. The {\tt .class} file constant pool size limit is still +present, but it is less likely to be encountered. +^ ^ ^ ^ ^ ^ ^ + +Implementation of the MIPS delay slot offers another opportunity for +bytecode-level optimization. In order to take advantage of +instructions already in the pipeline, the MIPS ISA specifies that the +instruction after a jump or branch is always executed, even if the +jump/branch is taken. This instruction is referred to as the ``delay +slot\footnote{Newer MIPS CPUs have pipelines that are much larger than +early MIPS CPUs, so they have to discard instructions anyways}.'' The +instruction in the delay slot is actually executed {\it before} the +branch is taken. To further complicate matters, values from the +register file are loaded {\it before} the delay slot is executed. + +Fortunately there is a very elegent solution to this problem which can +be expressed in JVM bytecode. When a branch instruction is +encountered, the registers needed for the comparison are pushed onto +the stack to prepare for the JVM branch instruction. Then, {\it +after} the values are on the stack the delay slot instruction is +emitted, followed by the actual JVM branch instruction. Because the +values were pushed to the stack before the delay slot was executed, any +changes the delay slot made to the registers are not visible to the +branch bytecode. + +One final advantage that generating bytecode directly allows is a +reduction in the size of the ultimate {\tt .class} file. All the +optimizations above lead to more compact bytecode as a beneficial side +effect; in addition, NestedVM performs a few additional optimizations. + +When encountering the following {\tt switch} block, both {\tt javac} +and {\tt jikes} generate redundant bytecode. + +{\footnotesize\begin{verbatim} + switch(pc>>>8) { + case 0x1: run_1(); break; + case 0x2: run_2(); break + ... + case 0x100: run_100(); break; + } +\end{verbatim}} + +The first bytecode in each case arm in the switch statement is {\tt +ALOAD\_0} to prepare for a {\tt INVOKESPECIAL} call. By simply +lifting this bytecode outside of the {\tt switch} statement, each {\tt +case} arm shrinks by one instruction. + +\subsubsection{Compiler Flags} + +Although NestedVM perfectly emulates a MIPS R2000 CPU, its performance +profile is nothing like that of actual silicon. In particular, {\tt +gcc} makes several optimizations that increase performance on an +actually MIPS CPU but actually decrease the performance of +NestedVM-generated bytecode. We found the following compiler options +could be used to improve performance: -\subsection{Optimizations} +\begin{itemize} -Brian, can you write something to go here? Just mention which -optimizations helped and which ones hurt. +\item {\tt -falign-functions} + + Normally a function's location in memory has no effect on its + execution speed. However, in the NestedVM binary translator, + the {\tt .text} segment is split on power-of-two boundaries. If + a function starts near the end of one of these boundaries, a + performance critical part of the function winds up spanning two + Java methods. Telling {\tt gcc} to align all functions along + these boundaries decreases the chance of this sort of splitting. + +\item {\tt -fno-rename-registers} + + On an actual silicon chip, using additional registers carries no + performance penalty (as long as none are spilled to the stack). + However, when generating bytecode, using {\it fewer} + ``registers'' helps the JVM optimize the machine code it + generates by simplifying the constraints it needs to deal with. + Disabling register renaming has this effect. + +\item {\tt -fno-schedule-insns} + + Results of MIPS load operations are not available until {\it + two} instructions after the load. Without the {\tt + -fno-schedule-insns} instruction, {\tt gcc} will attempt to + reorder instructions to do other useful work during this period + of unavailability. NestedVM is under no such constraint, so + removing this reordering typically generates simpler machine + code. + +\item {\tt -mmemcpy} + + Enabling this instruction causes {\tt gcc} to use the system + {\tt memcpy()} routine instead of generating loads and stores. + As explained in the next section, the NestedVM runtime + implements {\tt memcpy()} using {\tt System.arraycopy()}, which + is substantially more efficient. + +NestedVM has two primary ways of executing code, the interpreter, and the +binary translators. Both the interpreter and the output from the binary +translators sit on top of a Runtime class. This class provides the public +interface to both the interpreter and the translated binaries. + +The Runtime class does the work that the operating system usually does. +Conceptually the Runtime class can be thought of as the operating system and +its subclasses (translated binaries and the interpreter) the CPU. The +Runtime fulfills 5 primary goals: + +v v v v v v v +The Runtime class does the work that the operating system usually does. +Conceptually the Runtime class can be thought of as the operating system and +its subclasses (translated binaries and the interpreter) the CPU. The +Runtime fulfills 5 primary goals: +************* +\item {\tt -fno-rename-registers} Some processors can better schedule + code when registers aren't reused for two different purposes. By + default GCC will try to use as many registers as possibly when + it can. This excess use of registers just confuses JIT's trying + to compile the output from the binary translator. All the JIT + compilers we tested do much better with a few frequently used + registers. +^ ^ ^ ^ ^ ^ ^ + +\item {\tt -fno-delayed-branch} The MIPS CPU has a delay slot (see + above). Earlier versions of NestedVM didn't efficiently emulate + delay slots. This option causes GCC to avoid using delay slots + for anything (a NOP is simply placed in the delay slot). This + had a small performance benefit. However, recent versions of + NestedVM emulate delay slots with no performance overhead so + this options has little effect. Nonetheless, these delay slots + provide no benefit under NestedVM either so they are avoided + with this option. + +v v v v v v v +\end{itemize} +************* +\item Provides a consistent external interface - The method of actually +executing the code (currently only translated binaries and the interpreter) +can be changed without any code changes to the caller because only Runtime +exposes a public interface. +^ ^ ^ ^ ^ ^ ^ + +v v v v v v v +The effects of the various optimizations presented in this chapter are +summarized in the table below. +************* +\item Provide an easy to use interface - The interpreter and the output from +the binary translators only know how to execute code. The Runtime class +provides an easy to use interface to the code. It contains methods to pass +arguments to the main() function, read and write from memory, and call +individual functions in the binary. +^ ^ ^ ^ ^ ^ ^ + +v v v v v v v +\epsfig{file=chart4,width=3in} +************* +\item Manage the process's memory - The Runtime class contains large int[] +arrays that represent the process`s entire memory space. Subclasses read +and write to these arrays as required by the instructions they are +executing. Subclasses can expend their memory space using the sbrk +syscall. +^ ^ ^ ^ ^ ^ ^ + +v v v v v v v +\epsfig{file=chart3,width=3in} +************* +\item Provide access to the file system and streams - Subclasses access the +file system through standard UNIX syscalls (read, write, open, etc). The +Runtime manages the file descriptor table that maps UNIX file descriptors +to Java RandomAccessFiles, InputStreams, OutputStreams, and sockets. +^ ^ ^ ^ ^ ^ ^ + +v v v v v v v +\item Miscellaneous other syscalls - In additions to those mentioned above +the Runtime class implements a variety of other syscalls (sleep, +gettimeofday, getpagesize, sysconf, fcntl, etc). +************* +\section{The NestedVM Runtime} +^ ^ ^ ^ ^ ^ ^ + +In addition to binary-to-source and binary-to-binary translation, +NestedVM also includes a MIPS binary interpreter. All three +translation approaches expose the same API to both the translated +binary and the surrounding VM (including peer Java code). + +\subsection{The Runtime Class} + +The runtime fulfills four roles: \begin{itemize} -\item trampoline -\item optimal method size -\item -msingle-float -\item -mmemcpy -\item fastmem -\item local vars for registers (useless) -\item -fno-rename-registers -\item -ffast-math -\item -fno-trapping-math -\item -fsingle-precision-constant -\item -mfused-madd -\item -freg-struct-return -\item -freduce-all-givs -\item -fno-peephole -\item -fno-peephole2 -\item -fmove-all-movables -\item -fno-sched-spec-load -\item -fno-sched-spec -\item -fno-schedule-insns -\item -fno-schedule-insns2 -\item -fno-delayed-branch -\item -fno-function-cse -\item -ffunction-sections -\item -fdata-sections -\item array bounds checking -\item -falign-functions=n -\item -falign-labels=n -\item -falign-loops=n -\item -falign-jumps=n -\item -fno-function-cse + +\item It provides a simple, consistent external interface. The method + of actually executing the code (currently only translated + binaries and the interpreter) can be changed without any code + changes to the caller because only runtime exposes a public + interface. This includes methods to pass arguments to the + binary's {\tt main()} function, read and write from memory, and + call individual functions in the binary. + +\item It manages the process's memory. The runtime class contains + large {\tt int[]} arrays that represent the process`s entire + memory space. Subclasses read and write to these arrays as + required by the instructions they are executing, and can expand + their memory space using the {\tt sbrk} system call. + +\item The runtime provides access to the host file system and standard + I/O streams. Subclasses of {\tt runtime} can access the file + system through standard UNIX syscalls ({\tt read()}, {\tt + write()}, {\tt open()}, etc). The runtime manages the file + descriptor table that maps UNIX file descriptors to Java {\tt + RandomAccessFile}s, {\tt InputStream}s, {\tt OutputStream}s, and + {\tt Socket}s. + +\item It provides general OS services, including {\tt sleep()}, {\tt + gettimeofday()}, {\tt getpagesize()}, {\tt sysconf()}, {\tt + fcntl()}, and so on. + \end{itemize} \section{Future Directions} -World domination. +v v v v v v v +Although we have only implemented it for the Java Virtual Machine, our +technique generalizes to other safe bytecode architectures. In +particular we would like to demonstrate this generality by retargeting +the translator to the Microsoft Intermediate Language \cite{msil}. +************* +Java source code can create a copy of the translated binary by instantiating +the class generated by the binary translator or instantiating the +interpreter. It can then interact with the process through the many +facilities provided by the Runtime interface. Invoking the run() method of +the Runtime interface will load the given arguments into the process's +memory as invoke the binaries entry point (typically \_start() in crt0.o). +This will pass control on to the main() function which will have the +arguments passed to run() loaded into argv and argc. +^ ^ ^ ^ ^ ^ ^ + +v v v v v v v +Additionally, we would like to explore other uses for dynamic loading +of translated MIPS binaries by combining NestedVM (which itself is +written in Java) and the {\tt ClassLoader.fromBytes()} mechanism. +************* +As the binary executes it often passes control back to the Runtime class +through the MIPS {\tt SYSCALL} instruction. The interpreter and translated +binaries invoke the {\tt syscall()} method of the Runtime class when the +{\tt SYSCALL} instruction is executed. The Runtime class then can manipulate +the process's environment (read and write to memory, modify the file +descriptor table, etc) and interact with the rest of the JVM on behalf of +the process (read and write to a file or stream, etc). There is even a +syscall to pause the VM and temporarily return control to the caller. +^ ^ ^ ^ ^ ^ ^ + +v v v v v v v +************* +In addition to the interfaces provided by NestedVM, users can create their +own interfaces between the MIPS and Java world. The Runtime provides a +method called call() that will call a function by name in the MIPS binary. +The call() method looks up the function name in the binary's ELF symbol +table and manipulating the stack and registers accordingly to execute the +given function. This allows Java code to seamlessly invoke functions in the +binary. +^ ^ ^ ^ ^ ^ ^ \section{Conclusion} -We rock the hizzouse. +v v v v v v v +The MIPS binaries can also invoke a special method of Runtime called +callJava().When the MIPS binary invokes the {\tt CALL\_JAVA} syscall +(usually done through the {\tt \_call\_java()} function provided by the +NestedVM support library) the callJava() method in Runtime is invoked with +the arguments passes to the syscall. +************* +We have presented a novel technique for using libraries written in +unsafe languages within a safe virtual machine without resorting to +native interfaces. We have implemented this technique in NestedVM, +which is currently used by the Ibex project\footnote{{\tt +http://www.ibex.org}} to perform font rasterization (via {\tt +libfreetype}), JPEG decoding (via {\tt libjpeg}), and CAB archive +extraction (via {\tt libmspack}), three libraries for which no +equivalent Java classes exist. +^ ^ ^ ^ ^ ^ ^ + +NestedVM is available under an open source license, and can be +obtained from +\begin{verbatim} + http://nestedvm.ibex.org +\end{verbatim} -\section{References} -Yer mom. +\section{Appendix: Testing Methodology} -\section{stuff} -\begin{verbatim} +The MIPS binaries can also invoke a special method of Runtime called +callJava().When the MIPS binary invokes the {\tt CALL\_JAVA} syscall +(usually done through the {\tt \_call\_java()} function provided by +the NestedVM support library) the callJava() method in Runtime is +invoked with the arguments passes to the syscall. -libjpeg (render thebride_1280.jpg) -Native - 0.235s -JavaSource - 1.86s -ClassFile - 1.37s - -freetype (rendering characters 32-127 of Comic.TTF at sizes from 8 to -48 incrementing by 4) -Native - 0.201s -JavaSource - 2.02s -ClassFile - 1.46s - -Section 3.2 - Interpreter -The Interpreter was the first part of Mips2Java to be written. This was the most -straightforward and simple way to run MIPS binaries inside the JVM. The simplicity of the -interpreter also made it very simple to debug. Debugging machine-generated code is a pain. -Most importantly, the interpreter provided a great reference to use when developing the -compiler. With known working implementations of each MIPS instruction in Java writing a -compiler became a matter of simple doing the same thing ahead of time. -With the completion of the compiler the interpreter in Mips2Java has become less useful. -However, it may still have some life left in it. One possible use is remote debugging with -GDB. Although debugging the compiler generated JVM code is theoretically possible, it -would be far easier to do in the interpreter. The interpreter may also be useful in cases -where size is far more important than speed. The interpreter is very small. The interpreter -and MIPS binary combined are smaller than the compiled classfiles. -Section 3.3 - Compiling to Java Source -The next major step in Mips2JavaÕs development was the Java source compiler. This -generated Java source code (compliable with javac or Jikes) from a MIPS binary. -Generating Java source code was preferred initially over JVM bytecode for two reasons. -The authors werenÕt very familiar with JVM bytecode and therefore generating Java source -code was simpler. Generating source code also eliminated the need to do trivial -optimizations in the Mips2java compiler that javac and Jikes already do. This mainly -includes 2+2=4 stuff. For example, the MIPS register r0 is immutable and always 0. This -register is represented by a static final int in the Java source compiler. Javac and Jikes -automatically handle optimizing this away when possible. In the JVM bytecode compiler -these optimizations needs to be done in Mips2Java. -Early versions of the Mips2Java compiler were very simple. All 32 MIPS GPRs and a -special PC register were fields in the generated java class. There was a run() method -containing all the instructions in the .text segment converted to Java source code. A switch -statement was used to allow jumps from instruction to instruction. The generated code -looked something like this. -private final static int r0 = 0; -private int r1, r2, r3,...,r30; -private int r31 = 0xdeadbeef; -private int pc = ENTRY_POINT; +{\footnotesize\begin{verbatim} -public void run() { - for(;;) { - switch(pc) { - case 0x10000: - r29 = r29 Ð 32; - case 0x10004: - r1 = r4 + r5; - case 0x10008: - if(r1 == r6) { - /* delay slot */ - r1 = r1 + 1; - pc = 0x10018: - continue; - } - case 0x1000C: - r1 = r1 + 1; - case 0x10010: - r31 = 0x10018; - pc = 0x10210; - continue; - case 0x10014: - /* nop */ - case 0x10018: - pc = r31; - continue; - ... - case 0xdeadbeef: - System.err.println("Exited from ENTRY_POINT"); - System.err.println("R2: " + r2); - System.exit(1); - } - } -} +// Java +private Runtime rt = new MyBinary() { + pubilc int callJava(int a, int b, int c, int d) { + System.err.println("Got " + a + " " + b); + } +}; +public void foo() { rt.run(); } -This worked fine for small binaries but as soon as anything substantial was fed to it the 64k -JVM method size limit was soon hit. The solution to this was to break up the code into -many smaller methods. This required a trampoline to dispatch jumps to the appropriate -method. With the addition of the trampoline the generated code looked something like this: -public void run_0x10000() { - for(;;) { - switch(pc) { - case 0x10000: - ... - case 0x10004: - ... - ... - case 0x10010: - r31 = 0x10018; - pc = 0x10210; - return; - ... - } - } +/* C */ +void main(int argc, char **argv) { + _call_java(1,2); } -pubic void run_0x10200() { - for(;;) { - switch(pc) { - case 0x10200: - ... - case 0x10204: - ... - } - } -} +\end{verbatim}} -public void trampoline() { - for(;;) { - switch(pc&0xfffffe00) { - case 0x10000: run_0x10000(); break; - case 0x10200: run_0x10200(); break; - case 0xdeadbe00: - ... - } - } -} -With this trampoline in place somewhat large binaries could be handled without much -difficulty. There is no limit on the size of a classfile as a whole, just individual methods. -This method should scale well. However, there are other classfile limitations that will limit -the size of compiled binaries. -Another interesting problem that was discovered while creating the trampoline method was -javac and JikesÕ incredible stupidity when dealing with switch statements. The follow code -fragment gets compiled into a lookupswich by javac: -Switch(pc&0xffffff00) { - Case 0x00000100: run_100(); break; - Case 0x00000200: run_200(); break; - Case 0x00000300: run_300(); break; -} -while this nearly identical code fragment gets compiled to a tableswitch -switch(pc>>>8) { - case 0x1: run_100(); break - case 0x2: run_200(); break; - case 0x3: run_300(); break; -} -Javac isnÕt smart enough to see the patter in the case values and generates very suboptimal -bytecode. Manually doing the shifts convinces javac to emit a tableswitch statement, which -is significantly faster. This change alone nearly doubled the speed of the compiled binary. -Finding the optimal method size lead to the next big performance increase. It was -determined with experimentation that the optimal number of MIPS instructions per method -is 128 (considering only power of two options). Going above or below that lead to -performance decreases. This is most likely due to a combination of two factors. -_ The two levels of switch statements jumps have to pass though Ð The first switch -statement jumps go through is the trampoline switch statement. This is implemented -as a TABLESWITCH in JVM bytecode so it is very fast. The second level switch -statement in the individual run_ methods is implemented as a LOOKUPSWITCH, -which is much slower. Using smaller methods puts more burden on the faster -TABLESWITCH and less on the slower LOOKUPSWITCH. -_ JIT compilers probably favor smaller methods smaller methods are easier to compile -and are probably better candidates for JIT compilation than larger methods. -FIXME: Put a chart here -The next big optimization was eliminating unnecessary case statements. Having case -statements before each instruction prevents JIT compilers from being able to optimize -across instruction boundaries. In order to eliminate unnecessary case statements every -possible address that could be jumped to directly needed to be identified. The sources for -possible jump targets come from 3 places. -_ The .text segment Ð Every instruction in the text segment in scanned for jump -targets. Every branch instruction (BEQ, JAL, etc) has its destination added to the list -of possible branch targets. In addition, functions that set the link register have -theirpc+8 added to the list (the address that wouldÕve been put to the link register). -Finally, combinations of LUI (Load Upper Immediate) of ADDIU (Add Immediate -Unsigned) are scanned for possible addresses in the text segment. This combination -of instructions is often used to load a 32-bit word into a register. -_ The .data segment Ð When GCC generates switch() statements it often uses a jump -table stored in the .data segment. Unfortunately gcc doesnÕt identify these jump -tables in any way. Therefore, the entire .data segment is conservatively scanned for -possible addresses in the .text segment. -_ The symbol table Ð This is mainly used as a backup. Scanning the .text and .data -segments should identify any possible jump targets but adding every function in the -symbol table in the ELF binary doesnÕt hurt. This will also catch functions that are -never called directly from the MIPS binary (for example, functions called with the -call() method in the runtime). -Eliminating unnecessary case statements provided a 10-25% speed increase . -Despite all the above optimizations and workaround an impossible to workaround hard -classfile limit was eventually hit, the constant pool. The constant pool in classfiles is limited -to 65536 entries. Every Integer with a magnitude greater than 32767 requires an entry in the -constant pool. Every time the compiler emits a jump/branch instruction the PC field is set to -the branch target. This means nearly every branch instruction requires an entry in the -constant pool. Large binaries hit this limit fairly quickly. One workaround that was -employed in the Java source compiler was to express constants as offsets from a few central -values. For example: "pc = N_0x00010000 + 0x10" where N_0x000100000 is a non- -final field to prevent javac from inlining it. This was sufficient to get reasonable large -binaries to compile. It has a small (approximately 5%) performance impact on the generated -code. It also makes the generated classfile somewhat larger. Fortunately, the classfile -compiler eliminates this problem. -3.4 Ð Bytecode compiler -The next step in the evolution of Mips2Java was to compile directly to JVM bytecode -eliminating the intermediate javac step. This had several advantages -_ There are little tricks that can be done in JVM bytecode that canÕt be done in Java -source code. -_ Eliminates the time-consuming javac step Ð Javac takes a long time to parse and -compile the output from the java source compiler. -_ Allows for MIPS binaries to be compiled and loaded into a running VM using a -class loader. This eliminates the need to compile the binaries ahead of time. -By generating code at the bytecode level there are many areas where the compiler can be -smarter than javac. Most of the areas where improvements where made where in the -handling of branch instructions and in taking advantage of the JVM stack to eliminate -unnecessary LOADs and STOREs to local variables. -The first obvious optimization that generating bytecode allows for is the use of GOTO. -Despite the fact that java doesnÕt have a GOTO keyword a GOTO bytecode does exist and -is used heavily in the code generates by javac. Unfortunately the java language doesnÕt -provide any way to take advantage of this. As result of this jumps within a method were -implemented by setting the PC field to the new address and making a trip back to the initial -switch statement. In the classfile compiler these jumps are implemented as GOTOs directly -to the target instruction. This saves a costly trip back through the LOOKUPSWITCH -statement and is a huge win for small loops within a method. -Somewhat related to using GOTO is the ability to optimize branch statements. In the Java -source compiler branch statements are implemented as follows (delay slots are ignored for -the purpose of this example) -If(condition) { pc = TARGET; continue; } -This requires a branch in the JVM regardless of whether the MIPS branch is actually taken. -If condition is false the JVM has to jump over the code to set the PC and go back to the -switch block. If condition is true the JVM as to jump to the switch block. By generating -bytecode directly we can make the target of the JVM branch statement the actual bytecode -of the final destination. In the case where the branch isnÕt taken the JVM doesnÕt need to -branch at all. -A side affect of the above two optimizations is a solution to the excess constant pool entries -problem. When jumps are implemented as GOTOs and direct branches to the target the PC -field doesnÕt need to be set. This eliminates many of the constant pool entries the java -source compiler requires. The limit is still there however, and given a large enough binary it -will still be reached. -Delay slots are another area where things are done somewhat inefficiently in the Java source -compiler. In order to take advantage of instructions already in the pipeline MIPS cpu have a -"delay slot". That is, an instruction after a branch or jump instruction that is executed -regardless of whether the branch is taken. This is done because by the time the branch or -jump instruction is finished being processes the next instruction is already ready to be -executed and it is wasteful to discard it. (However, newer MIPS CPUs have pipelines that -are much larger than early MIPS CPUs so they have to discard many instructions anyway.) -As a result of this the instruction in the delay slot is actually executed BEFORE the branch -is taken. To make things even more difficult, values from the register file are loaded -BEFORE the delay slot is executed. Here is a small piece of MIPS assembly: -ADDIU r2,r0,-1 -BLTZ r2, target -ADDIU r2,r2,10 -... -:target -This piece of code is executed as follows -1. r2 is set to Ð1 -2. r2 is loaded from the register file by the BLTEZ instruction -3. 10 is added to r2 by the ADDIU instruction -4. The branch is taken because at the time the BLTZ instruction was executed r2 was -Ð1, but r2 is now 9 (-1 + 10) -There is a very element solution to this problem when using JVM bytecode. When a branch -instruction is encountered the registers needed for the comparison are pushed onto the stack -to prepare for the JVM branch instruction. Then, AFTER the values are on the stack the -delay slot is emitted, and then finally the actual JVM branch instruction. Because the values -were pushed to the stack before the delay slot was executed any changes the delay slot made -to the registers are not visible to the branch bytecode. This allows delay slots to be used -with no performance penalty or size penalty. -One final advantage that generating bytecode directly allows is smaller more compact -bytecode. All the optimization above lead to smaller bytecode as a side effect. There are also -a few other areas where the generated bytecode can be optimized for size with more -knowledge of the program as a whole. -When encountering the following switch block both javac and Jikes generate redundant -bytecode. -Switch(pc>>>8) { - Case 0x1: run_1(); break; - Case 0x2: run_2(); break - ... - case 0x100: run_100(); break; -} -The first bytecode in each case arm in the switch statement is ALOAD_0 to prepare for a -invoke special call. By simple moving this outside the switch statement each case arm was -reduced in size by one instruction. Similar optimizations were also done in other parts of the -compiler. - -Section 3 -- Adam - The method is run(), not execute. Execute() is only used when you need to -resume from a pause syscall. - -Section 3.1 -- Adam - Even the R1000 supports LB/SB/LH/SH/LWL/LWR Ð saying it only supports -32-bit aligned loads is incorrect. -- Other similarities -o All the branching instructions in MIPS directly map to single JVM instructions. -o Most of the ALU instructions map to single JVM instructions. - -(I can write up some stuff for each of these next several sections if you want) -Section 3.2 - Interpreter -- Originally written mainly to understand the MIPS instruction set -- Far easier to debug than an ahead of time compiler (no waiting, can throw in quick -hacks like if(pc >= 0xbadc0de && pc <= 0xbadcfff) debugOutput() ), donÕt need to -understand most of the ELF format) -- Still could be useful -o for GDB remote debugging -o cases where size is more important than speed (size of interpreter + size of mips -binary < size of compiled binary or size of compiler + mips binary) -o code which dynamically generates code (JIT compilers, etc). The ahead of time -compiler canÕt possibly handle this - -Section 3.3 Ð Compiling to Java Source -- First version of an ahead of time compiler -- Huge performance boost -- Java source output preferred for the 2+2=4 type optimizations java compilers do -- Worked well for small binaries Ð large MIPS binaries required ridiculous amounts of -memory to compile and often created invalid classfiles -- Too many constants Ð every jump operation required an entry in the constant pool (pc = -0xabcd1234; continue; ) - -Section 3.4 Ð Compiling directly to JVM Bytecode -- Another jump in performance -- More efficient code can be created at the bytecode level -o Information can be stored on the JVM stack rather than in local variables -_ Javac/jikes often unnecessarily use local variables -long tmp = a*b; -lo = (int)tmp; -hi = (int)(tmp>>>32) -does a store and two loads when a simple DUP would suffice -o GOTO can be used to branch directly to branch destinations in the same method -rather than going through the switch block again. -o BEQ, BGTZ, BLE, etc can jump directly to destination rather than doing -if(condition) { pc=0xabcd1234; continue; } -o Eliminates excess constant pool entries (only jump outside the current method -require a constant pool entry) -o Delay slots implemented more efficiently. -_ Java source compiler does: -if(condition) { /* delay slot /; pc = 0xabcd1234; continue; } -/* delay slot again */ -_ This is required because the delay slot can change the registers used in -condition. The registers need to be read BEFORE the delay slot in executed. -_ In the bytecode compiler the registers used in the condition are pushed to the -stack, then the delay slot is executed, and finally the comparison is done. -This eliminates the needs to output the delay slot twice. -- Smaller bytecode -o Everything mentioned above makes it smaller and faster -o Javac/jikes add redundant code -_ Switch(a) { - Case 1: Run_1000(); break; - Case 2: run_2000(); break; +v v v v v v v +These two methods can even be combined. MIPS can call Java through the +CALL\_JAVA syscall, which can in turn invoke a MIPS function in the +binary with the call() method. Users preferring a simpler +communication mechanism can also use Java StreamÕs and file +descriptors. Runtime provides a simple interface for mapping a Java +Input or OutputStream to a File Descriptor. +************* +These two methods can even be combined. MIPS can call Java through the +CALL\_JAVA syscall, which can in turn invoke a MIPS function in the binary +with the call() method. +^ ^ ^ ^ ^ ^ ^ + +Users preferring a simpler communication mechanism can also use Java +Stream's and file descriptors. Runtime provides a simple interface for +mapping a Java Input or OutputStream to a File Descriptor. + +%Java source code can create a copy of the translated binary by +%instantiating the corresponding class, which extends {\tt Runtime}. +%Invoking the {\tt main()} method on this class is equivalent to +%calling the {\tt main()} function within the binary; the {\tt String} +%arguments to this function are copied into the binary's memory space +%and made available as {\tt **argv} and {\tt argc}. + +%The translated binary communicates with the rest of the VM by +%executing MIPS {\tt SYSCALL} instructions, which are translated into +%invocations of the {\tt syscall()} method. This calls back to the +%native Java world, which can manipulate the binary's environment by +%reading and writing to its memory space, checking its exit status, +%pausing the VM, and restarting the VM. + + +%\subsection{Virtualization} + +%The {\tt Runtime} class implements the majority of the standard {\tt +%libc} syscalls, providing a complete interface to the filesystem, +%network socket library, time of day, (Brian: what else goes here?). + +%\begin{itemize} + +%\item ability to provide the same interface to CNI code and + % NestedVMified code + +%\item security advantages (chroot the {\tt fork()}ed process) + % +%\end{itemize} + + +\section{Future Directions} + +\section{Conclusion} + +\section{Appendix A: Testing Environment} + +The MIPS binaries can also invoke a special method of Runtime called +callJava().When the MIPS binary invokes the {\tt CALL\_JAVA} syscall +(usually done through the {\tt \_call\_java()} function provided by +the NestedVM support library) the callJava() method in Runtime is +invoked with the arguments passes to the syscall. + +v v v v v v v +Although NestedVM perfectly emulates a MIPS R2000 CPU its performance +characteristics are not anything like an actual MIPS R2000 CPU. GCC makes +several optimizations that increase performance on an actually MIPS CPU but +actually decrease performance when run through the NestedVM binary +translator. Fortunately, GCC provides many options to customize its code +generations and eliminate these optimizations. GCC also has optimization +options that are not helpful on a real MIPS CPU but are very helpful under +NestedVM +************* +{\footnotesize\begin{verbatim} +^ ^ ^ ^ ^ ^ ^ + +// Java +private Runtime rt = new MyBinary() { + pubilc int callJava(int a, int b, int c, int d) { + System.err.println("Got " + a + " " + b); + } +}; +public void foo() { rt.run(); } + +/* C */ +void main(int argc, char **argv) { + _call_java(1,2); } -Gets compiled into -1 ÐTableswitch É. -2 Ð ALOAD_0 -3 Ð invokespecial run_1000 -4 Ð goto end -5 Ð ALOAD_0 -6 Ð invokespecial run_2000 -10 Ð goto end -ALOAD_0 is unnecessarily put in each switch arm - -3.5 Interfacing with Java Code -- _call_java ()/Runtime.call() -o _call_java () - Call a java method from mips -o Runime.call() Ð call a mips method from java -o Easily allocate memory within the binaryÕs memory space by calling libcÕs malloc() -o Can go back and forth between mips and java (java calls mips which calls java which -calls back into mips) -- Java Strings can easily be converted to and from null terminated strings in the processÕ -memory space -- Java InputStreams, OutputStreams, and Sockets can easily be turned in to standard -UNIX file descriptors (and ANSI FILE*s) -- Can easily create custom filedescriptors and have full control over all operations on -them (read, write, seek, close, fstat, etc) -- Int User_info[] Ð optional chunk of memory can very easily be accessed from java -(Runtime.getUserInfo/setUserInfo) - -3.6 Virtualization -- Adam Ð we actually donÕt support sockets directly yet Ð you should probably take that -out. (But you can create a socket in java and expose it to mips as a filedescriptor) -- Runtime services -o Provides a easy to use interface to subclasses (Interpreter or compiles binaries) -_ Subclasses only know how to execute instructions -_ Runtime handles setting up registers/stack for execution to begin and -extracting return values and the exit status from the process -o Memory management -_ Sets up stack and guard pages -_ Allocates pages with the sbrk syscall -_ Provide easy an memcpy like interface for accessing the processes memory -from java -o Runtime.call() support Ð sets up registers,etc to prepare the process for a call into it -from java -o Filesystem Ð open/close/read/write/seek/fcntl s syscalls -o Time related functions Ð sleep, gettimeofday, times syscall -o UnixRuntime provides a more complete unix-like environment (Runtime smaller -though) -_ Supports fork() and waitpid() -_ Pipe() for IPC -_ More advocated filesystem interface -_ All filesystem operations abstracted away into a FileSystem class -o FileSystem class can be written that exposes a zip file, -directory on an http server, etc as a filesystem -_ Chdir/getcwd -_ /dev filesystem -_ stat() -_ directory listing support -5.1 Charts -IÕll put together some charts tonight -5.2 Optimizations -And finish this part - -Let me know if this was what you were looking for - - libjpeg libmspack libfreetype -Interpreted MIPS Binary 22.2 12.9 21.4 -Compled MIPS Binary (fastest options) 3.39 2.23 4.31 -Native -O3 0.235 0.084 0.201 - -Compled - with all case statements 3.50 2.30 4.99 -Compiled - with pruned case statement 3.39 2.23 4.31 - -Compiled - 512 instructions/method 62.7 27.7 56.9 -Compiled - 256 instructions/method 3.54 2.55 4.43 -Compiled - 128 instructions/method 3.39 2.23 4.31 -Compiled - 64 instructions/method 3.56 2.31 4.40 -Compiled - 32 instruction/method 3.71 2.46 4.64 - -Compild MIPS Binary (Server VM) 3.21 2.00 4.54 -Compiled MIPS Binary (Client VM) 3.39 2.23 4.31 - -All times are measured in seconds. These were all run on a dual 1ghz G4 -running OS X 10.3.1 with Apple's latest VM (JDK 1.4.1_01-27). Each test -was run 8 times within a single VM. The highest and lowest times were -removed and the remaining 6 were averaged. In each case only the first -run differed significantly from the rest. + +v v v v v v v +\end{verbatim}} +************* +\item {\tt -falign-functions} +Normally a function's location in memory has no effect on its execution +speed. However, in the NestedVM binary translator, the .text segment is +split up on power of two boundaries. If a function is unlucky enough to +start near the end of one of these boundaries a performance critical part of +the function could end up spanning two methods. There is a significant +amount of overhead in switching between two methods so this must be avoided +at all costs. By telling GCC to align all functions to the boundary that the +.text segment is split on the chances of a critical part of a function +spanning two methods is significantly reduced. +^ ^ ^ ^ ^ ^ ^ + +v v v v v v v +These two methods can even be combined. MIPS can call Java through the +CALL\_JAVA syscall, which can in turn invoke a MIPS function in the binary +with the call() method. +************* +\item {\tt -fno-rename-registers} +Some processors can better schedule code when registers are not reused for +two different purposes. By default GCC will try to use as many registers as +possibly when it can. This excess use of registers just confuses JIT's +trying to compile the output from the binary translator. All the JIT +compilers we tested do much better with a few frequently used registers. +^ ^ ^ ^ ^ ^ ^ + +v v v v v v v +Users preferring a simpler communication mechanism can also use Java +Stream's and file descriptors. Runtime provides a simple interface for +mapping a Java Input or OutputStream to a File Descriptor. +************* +\item {\tt -fno-delayed-branch} +The MIPS CPU has a delay slot (see above). Earlier versions of NestedVM did +not efficiently emulate delay slots. This option causes GCC to avoid using +delay slots for anything (a NOP is simply placed in the delay slot). This +had a small performance benefit. However, recent versions of NestedVM +emulate delay slots with no performance overhead so this options has little +effect. Nonetheless, these delay slots provide no benefit under NestedVM +either so they are avoided with this option. +^ ^ ^ ^ ^ ^ ^ + +v v v v v v v +%Java source code can create a copy of the translated binary by +%instantiating the corresponding class, which extends {\tt Runtime}. +%Invoking the {\tt main()} method on this class is equivalent to +%calling the {\tt main()} function within the binary; the {\tt String} +%arguments to this function are copied into the binary's memory space +%and made available as {\tt **argv} and {\tt argc}. +************* +\item {\tt -fno-schedule-insns} +Load operations in the MIPS ISA also have a delay slot. The results of a +load operation are not available for use until one instruction later. +Several other instructions also have similar delay slots. GCC tries to do +useful work wile waiting for the results of one of these operations by +default. However, this, like register renaming, tends to confuse JIT +compilers. This option prevents GCC from going out of its way to take +advantage of these delay slots and makes the code generated by NestedVM +easier for JIT compilers to handle. +^ ^ ^ ^ ^ ^ ^ + +v v v v v v v +%The translated binary communicates with the rest of the VM by +%executing MIPS {\tt SYSCALL} instructions, which are translated into +%invocations of the {\tt syscall()} method. This calls back to the +%native Java world, which can manipulate the binary's environment by +%reading and writing to its memory space, checking its exit status, +%pausing the VM, and restarting the VM. +************* +\item {\tt -mmemcpy} +GCC sometimes has to copy somewhat large areas of memory. The most common +example of this is assigning one struct to another. Memory copying can be +done far more efficiently in Java than under NestedVM. Calls to the memcpy +libc function are treated specially by the binary translator. They are +turned into calls to a memcpy method in Runtime. The {\tt -mmemcpy} option +causes GCC to invoke libc's memcpy() function when it needs to copy a region +of memory rather than generating its own memcpy code. This call in then +turned into a call to this Java memcpy function which is significantly +faster than the MIPS implementation. +^ ^ ^ ^ ^ ^ ^ + +v v v v v v v +************* +\item {\tt -ffunction-sections -fdata-sections} +These two options are used in conjunction with the {\tt --gc-section} linker +option. These three options cause the linker to aggressively discard unused +functions and data sections. In some cases this leads to significantly +smaller binaries. +^ ^ ^ ^ ^ ^ ^ + +%\subsection{Virtualization} + +%The {\tt Runtime} class implements the majority of the standard {\tt +%libc} syscalls, providing a complete interface to the filesystem, +%network socket library, time of day, (Brian: what else goes here?). + +v v v v v v v +%\begin{itemize} +************* +\begin{itemize} +^ ^ ^ ^ ^ ^ ^ + +\item Better use of local variables in binary-to-binary compiler -- need to +do data flow analysis to find how how and when registers are used and avoid +the costly load/restore when it isn't necessary. + +\item More advanced Runtime support -- support more syscalls. This will +allow running large applications such as GCC under NestedVM. + +\item World domination + +\end{itemize} + +%\item ability to provide the same interface to CNI code and + % NestedVMified code + +%\item security advantages (chroot the {\tt fork()}ed process) + % +%\end{itemize} + + +\section{Future Directions} + +\section{Conclusion} + +\section{Appendix A: Testing Environment} + +All times are measured in seconds. These were all run on a dual 1Ghz +Macintosh G4 running Apple's latest JVM (Sun HotSpot JDK 1.4.1). Each +test was run 8 times within a single VM. The highest and lowest times +were removed and the remaining 6 were averaged. In each case only the +first run differed significantly from the rest. The libjpeg test consisted of decoding a 1280x1024 jpeg -(thebride_1280.jpg) and writing a tga. The mspack test consisted of +(thebride\_1280.jpg) and writing a tga. The mspack test consisted of extracting all members from arial32.exe, comic32.exe, times32.exe, and verdan32.exe. The freetype test consisted of rendering characters 32-127 of Comic.TTF at sizes from 8 to 48 incrementing by 4. (That is about 950 individual glyphs). -I can provide you with the source for any of these test if you'd like. - --Brian -\end{verbatim} +\section{References} \end{document}