1 \documentclass{acmconf}
4 \usepackage{amssymb,amsmath,epsfig,alltt}
15 \bibliographystyle{amsplain}
17 \title{\textbf{\textsf{
18 Complete Translation of Unsafe Native Code to Safe Bytecode
21 \author{\begin{tabular}{@{}c@{}}
22 {\em {Brian Alliet}} \\
23 {Rochester Institute of Technology}\\
24 {\tt bja8464@cs.rit.edu}
25 \end{tabular}\hskip 1in\begin{tabular}{@{}c@{}}
26 {\em {Adam Megacz}} \\
27 {University of California, Berkeley} \\
28 {\tt megacz@cs.berkeley.edu}
36 Most existing techniques for using code written in an unsafe language
37 within a safe virtual machine involve transformations from one source
38 code language (such as C) to another (such as Java) and then to
39 virtual machine bytecodes. We present an alternative approach which
40 uses a standard compiler to turn unsafe source code into unsafe MIPS
41 binaries, which are then translated into safe virtual machine
42 bytecodes. This approach offers four key advantages over existing
46 \item Total coverage of all language features
47 \item No post-translation human intervention
48 \item No build process modifications
49 \item Bug-for-bug compiler compatability
52 We have implemented this technique in NestedVM, a binary-to-source and
53 binary-to-binary translator targeting the Java Virtual Machine.
54 NestedVM-translated versions of the {\tt libfreetype}, {\tt libjpeg},
55 and {\tt libmspack} libraries are currently in production use.
56 Performance measurements indicate a best case performance within 3x of
57 native code and worst case typically within 10x, making it an
58 attractive solution for code which is not performance-critical.
62 \section{Introduction}
64 Unsafe languages such as C \cite{KR} and C++ \cite{soustroup} have
65 been in use much longer than any of today's widely accepted safe
66 languages such as Java \cite{java} and C\# \cite{csharp}. Consequently, there is
67 a huge library of software written in these languages. Although safe
68 languages offer substantial benefits, their comparatively young age
69 often puts them at a disadvantage when breadth of existing support
70 code is an important criterion.
72 The typical solution to this dilemma is to use a native interface such
73 as JNI \cite{jni} or CNI \cite{cni} to invoke unsafe code from within a
74 virtual machine or otherwise safe environment. Unfortunately, there
75 are a number of situations in which this is not an acceptable
76 solution. These situations can be broadly classified into two
77 categories: {\it security concerns} and {\it portability concerns}.
79 Using Java as an example, JNI and CNI are prohibited in a number of
80 contexts, including applets environments and servlet containers with a
81 {\tt SecurityManager}. Additionally, even in the context of trusted
82 code, {\tt native} methods invoked via JNI are susceptible to buffer
83 overflow and heap corruption attacks which are not a concern for
86 The second class of disadvantages revolves around portability
87 concerns; native interfaces require the native library to be compiled
88 ahead of time, for every architecture on which they will be
89 deployed. This is unworkable for situations in which the full set of
90 target architectures is not known at deployment time. Additionally,
91 some JVM platform variants such as J2ME \cite{j2me} simply do not offer
92 support for native code.
94 The technique we present here uses typical compiler to compile unsafe
95 code into a MIPS binary, which is then translated on an
96 instruction-by-instruction basis into Java bytecode. The technique
97 presented here is general; we anticipate that it can be applied to
98 other secure virtual machines such as Microsoft's .NET \cite{msil}, Perl
99 Parrot \cite{parrot}, or Python bytecode \cite{python}.
101 \section{Approaches to Translation}
103 The four program representations of interest in this context, along
104 with their specific types in the C-to-JVM instantiation of the
105 problem are shown in the following diagram:
108 \newlength{\MyLength}
109 \settowidth{\MyLength}{machine code}
110 \newcommand{\MyBox}[1]{\makebox[\MyLength][c]{#1}}
111 \begin{psmatrix}[colsep=2,rowsep=0]
113 [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source} \\[0pt]
114 [name=s00]\MyBox{\tt (.c)} & [name=s11]\MyBox{\tt (.java)} \\[0pt]
118 [name=b0]\MyBox{machine code} & [name=b1]\MyBox{safe bytecode} \\[0pt]
119 [name=b00]\MyBox{\tt (.o)} & [name=b11]\MyBox{\tt (.class)} \\
121 \psset{nodesep=5pt,arrows=->}
125 To illustrate the context of this diagram, the following arcs show the
126 translations performed by a few familiar tools:
129 \newlength{\MyLength}
130 \settowidth{\MyLength}{xmachine codex}
131 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
132 \psmatrix[colsep=2,rowsep=0,nrot=:D]
134 [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source} \\[0pt]
140 [name=b0]\MyBox{machine code} & [name=b1]\MyBox{safe bytecode} \\[0pt]
142 \psset{nodesep=5pt,arrows=->}
143 \ncline{s0}{b0}\bput{:U}{\tt gcc}
144 \ncline{s1}{b0}\bput{:D}{\tt gcj}
145 \ncline{s1}{b1}\aput{:U}{\tt javac}
146 \ncline{b1}{b0}\aput{:D}{\tt gcj}\bput{:D}{JITs}
150 Techniques for translating unsafe code into VM bytecode generally fall
151 into four categories, which we expand upon in the next two sections:
154 \item source-to-source translation
155 \item source-to-binary translation
156 \item binary-to-source translation
157 \item binary-to-binary translation
160 \section{Existing Work}
162 \subsection{Source-to-Source Translation}
164 The most common technique employed is partial translation from unsafe
165 source code to safe source code:
168 \newlength{\MyLength}
169 \settowidth{\MyLength}{xmachine codex}
170 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
171 \psmatrix[colsep=2,rowsep=0,nrot=:U]
174 [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source} \\[0pt]
180 [name=b0]\MyBox{machine code} & [name=b1]\MyBox{safe bytecode} \\[0pt]
182 \psset{nodesep=5pt,arrows=->}
183 \ncline{s0}{s1}\aput{:U}{source-to}\bput{:U}{source}
184 \ncline{s1}{b1}\aput{:U}{\tt javac}
188 A number of existing systems employ this technique; they can
189 be divided into two categories: those which perform a partial
190 translation which is completed by a human, and those which perform a
191 total translation but fail (yield an error) on a large class of input
195 \subsubsection{Incomplete Translation}
197 Jazillian \cite{jazillian} is a commercial solution which produces
198 extremely readable Java source code from C source code, but ony
199 translates a small portion of the C language. Jazillian is unique in
200 that in addition to {\it language migration}, it also performs {\it
201 API migration}; for example, Jazillian is intelligent enough
202 to translate {\tt char*~s1~=~strcpy(s2)} into {\tt String~s1~=~s2}.
204 Unfortunately such deep analysis is intractible for most of the C
205 language and standard library; Jazillian's documentation notes that
206 {\it ``This is not your father's language translator. It's not
207 generating ugly code that's guaranteed to work out of the
208 box... Jazillian does not always produce code that works correctly.''}
210 MoHCA-Java \cite{mohca} is the other major tool in this category, and steps
211 beyond Jazillian by providing tools for analysis of the source C++
212 abstract syntax tree. Additionally, MoHCA-Java's analysis engine is
213 extensible, making it a platform for constructing application-specific
214 translators rather than a single translation tool. However,
215 MoHCA-Java does not always generate complete Java code for all of the C++
216 programs which it accepts.
219 \subsubsection{Partial Domain Translation}
221 The c2j \cite{c2j}, c2j++ \cite{c2jpp}, Cappucinno \cite{capp},
222 and Ephedra \cite{ephedra} systems each provide support for complete
223 translation of a {\it subset} of the source language (C or C++). Each
224 of the four tools supports a progressively greater subset than the one
225 preceding it; however none covers the entire input language.
227 Ephedra, the most advanced of the four, supports most of the C++
228 language, and claims to produce ``human readable'' Java code as
229 output. Notable omissions from the input domain include support for
230 fully general pointer arithmetic, casting between unrelated types, and
231 reading from a {\tt union} via a different member than the one most
234 Unfortunately, when the program being translated is large and complex,
235 it is quite likely that it will use an unsupported feature in at least
236 one place. In the absence of a programmer who understands the source
237 program, a single anomoly is often enough to render the entire
238 translation process useless. As a result, these tools are mainly
239 useful as an {\it aid} to programmers who could normally perform the
240 conversion themselves, but want to save time by automating most of the
244 \subsection{Source-to-Binary Translation}
246 Source-to-binary translation involves a compiler for the unsafe
247 language which has been modified to emit safe bytecode.
250 \newlength{\MyLength}
251 \settowidth{\MyLength}{xmachine codex}
252 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
253 \psmatrix[colsep=2,rowsep=0,nrot=:U]
255 [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source} \\[0pt]
261 [name=b0]\MyBox{machine code} & [name=b1]\MyBox{safe bytecode} \\[0pt]
263 \psset{nodesep=5pt,arrows=->}
264 \ncline{s0}{b1}\bput{:U}{source-to-binary}
268 The primary occupant of this category is {\tt egcs-jvm}
269 \cite{egcsjvm}, an experimental ``JVM backend'' for the GNU Compiler
270 Collection ( {\tt gcc} ) \cite{gcc}. Since {\tt gcc} employs a highlym
271 odular architecture, it {\it is} possible to add RTL code generators
272 for nonstandard processors. However, {\tt gcc}'s parsing, RTL
273 generation, and optimization layers make fundamental assumptions (such
274 as the availability of pointer math) which cannot be directly
275 supported; thus the compiler still fails for a substantial class of
282 The principal difference between NestedVM and other approaches is that
283 NestedVM {\it does not} attempt to deal with source code as an input.
284 This leads immediately to three advantages:
287 \item {\bf Total coverage of all language features}
289 Because NestedVM does not attempt to implement the parsing and
290 code generation steps of compilation, it is freed from the
291 extremely complex task of faithfully implementing languages
292 which are often not fully or formally specified (such as C and
295 \item {\bf No build process modifications}
297 NestedVM does not modify existing build processes, which can be
298 extremely complex and dependent on strange preprocessor usage as
299 well as the complex interplay between compiler switches and
300 header file locations.
302 \item {\bf Bug-for-bug compiler compatability}
304 Since NestedVM uses the compiler's {\it output} as its own {\it
305 input}, it ensures that programs which are inadvertently
306 dependent on the vagaries of a particular compiler can still be
311 NestedVM's approach carries a fourth benefit as well, arising from its
315 \item {\bf No post-translation human intervention}
317 NestedVM offers total support for all non-privileged
318 instructions, registers, and resources found on a MIPS {\tt
319 R2000} CPU, including the add/multiply unit and floating point
320 coprocessor. As such, it constitutes a total function mapping
321 from the entire domain of (non-kernel-mode) programs onto (a
322 subset of) the semantics of the Java Virtual Machine. This
323 ensures that the translation process is fully automated and
324 always succeeds for valid input binaries.
327 This is a much more important factor than is obvious at first glance.
328 If post-translation human intervention is required, then the {\it
329 human becomes part of the build process}. This means that if a third
330 party library used in the project needs to be upgraded, {\it a human
331 must intervene} in the rebuild process. In addition to slowing the
332 process and introducing opportunities for error, this task often
333 requires specialized knowledge which becomes tied to the particular
334 individual performing this task, rather than being encoded in build
335 scripts which persist throughout the lifetime of the project.
337 \subsection{Why MIPS?}
339 We chose MIPS as a source format for three reasons: the availability
340 of tools to compile legacy code into MIPS binaries, the close
341 similarity between the MIPS ISA and the Java Virtual Machine, and the
342 relatively high degree of program structure that can be inferred from
343 ABI-adherent binaries.
345 The MIPS architecture has been around for quite some time, and is well
346 supported by the GNU Compiler Collection, which is capable of
347 compiling C, C++, Java, Fortran, Pascal, and Objective C
350 The MIPS R2000 ISA bears a striking similarity to the Java Virtual
355 \item Most of the instructions in the original MIPS ISA operate only
356 on 32-bit aligned memory locations. This allows NestedVM to
357 represent memory as a Java {\tt int[]} array without introducing
358 additional overhead. The remaining non-aligned memory load
359 instructions are only rarely emitted by most compilers since
360 they carry a performance penalty on physical MIPS
363 \item Unlike its predecessor, the R2000 supports 32-bit by 32-bit
364 multiply and divide instructions as well as a single and double
365 precision floating point unit. These capabilities map nicely
366 onto Java's arithmetic instructions.
370 Finally, the MIPS ISA and ABI convey quite a bit of information about
371 program structure. This information can be used for optimization
376 \item The structure of MIPS branching and jump instructions make it
377 easy to infer the set of likely target instructions.
379 \item The MIPS ABI specifies particular registers as caller-save and
380 callee-save, as well as designating a register for the return
381 address after a function call. This allows NestedVM to optimize
382 many operations for the common case of ABI-adherent binaries.
384 \item All MIPS instructions are exactly 32 bits long.
390 \subsection{Binary-to-Source}
392 The simplest operational mode for NestedVM is binary-to-source
393 translation. In this mode, NestedVM translates MIPS binaries into
394 Java source code, which is then fed to a Java compiler in order to
395 produce bytecode files:
398 \newlength{\MyLength}
399 \settowidth{\MyLength}{xmachine codex}
400 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
401 \psmatrix[colsep=2,rowsep=0,nrot=:U]
404 [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source} \\[0pt]
410 [name=b0]\MyBox{machine code} & [name=b1]\MyBox{safe bytecode} \\[0pt]
411 \psset{nodesep=5pt,arrows=->}
412 \ncline{s0}{b0}\bput{:U}{\tt gcc}
413 \ncline{s1}{b1}\aput{:U}{\tt javac}
414 \ncline{b0}{s1}\naput{\tt NestedVM}
419 \begin{minipage}[c]{7in}%
421 {\footnotesize\begin{verbatim}
422 private final static int r0 = 0;
423 private int r1, r2, r3,...,r30;
424 private int r31 = 0xdeadbeef;
425 private int pc = ENTRY_POINT;
454 System.err.println(``Exited.'');
461 {\footnotesize\begin{verbatim}
462 public void run_0x10000() {
479 pubic void run_0x10200() {
490 public void trampoline() {
492 switch(pc&0xfffffe00) {
493 case 0x10000: run_0x10000(); break;
494 case 0x10200: run_0x10200(); break;
503 \caption{\label{code1} Trampoline transformation necessitated by Java's 64kb method size limit}
506 Translating unsafe code for use within a JVM proceeds as follows:
510 \item Compile the source code to a statically linked binary, targeting
511 the MIPS R2000 ISA. Typically this will involve linking against
512 {\tt libc}, which translates system requests (such as {\tt
513 open()}, {\tt read()}, or {\tt write()}) into appropriate
514 invocations of the MIPS {\tt SYSCALL} instruction.
516 \item Invoke {\tt NestedVM} on the statically linked binary.
518 \item Compile the resulting {\tt .java} code using {\tt jikes}
519 \cite{jikes} or {\tt javac}.
521 \item From java code, invoke the {\tt run()} method on the generated
522 class. This is equivalent to the {\tt main()} entry point.
526 \subsubsection{Optimizations}
528 Generating Java source code instead of bytecode frees NestedVM from
529 having to perform simple constant propagation optimizations, as most
530 Java compilers already do this. A recurring example is the treatment
531 of the {\tt r0} register, which is fixed as {\tt 0} in the MIPS ISA.
533 Lacking the ability to generate specially optimized bytecode
534 sequences, a straightforward mapping of the general purpose hardware
535 registers to 32 {\tt int} fields turned out to be optimal.
538 \epsfig{file=chart1,width=3in}
540 Unfortunately, Java imposes a 64kb limit on the size of the bytecode
541 for a single method. This presents a problem for NestedVM, and
542 necessitates a {\it trampoline transformation}, as shown in
543 Figure~\ref{code1}. With this trampoline in place, large binaries can
544 be handled without much difficulty -- fortunately, there is no
545 corresponding limit on the size of a classfile as a whole.
547 One difficulty that arose as a result of using the trampoline
548 transformation was the fact that {\tt javac} and {\tt jikes} are
549 unable to properly optimize its switch statements. For example, the
550 following code is compiled into a comparatively inefficient {\tt
555 switch(pc&0xffffff00) {
556 case 0x00000100: run_100(); break;
557 case 0x00000200: run_200(); break;
558 case 0x00000300: run_300(); break;
571 This problem was surmounted by switching on a denser set of {\tt case}
572 values, which is more amenable to the {\tt TABLESWITCH} structure.
573 This change alone nearly doubled the speed of the compiled binary.
575 The next performance improvement came from tuning the size of the
576 methods invoked from the trampoline. Trial and error led to the
577 conclusion that HotSpot \cite{hotspot}, the most popular JVM, performs
578 best when 128 MIPS instructions are mapped to each method.
580 \epsfig{file=chart5,width=3in}
582 \epsfig{file=chart6,width=3in}
584 This phenomenon is due to two factors:
588 \item While the trampoline method's {\tt switch} statement can be
589 coded as a {\tt TABLESWITCH}, the {\tt switch} statement
590 within the individual methods is to sparse to encode this way.
592 \item Hybrid Interpretive-JIT compilers such as HotSpot generally
593 favor smaller methods since they are easier to compile and are
594 better candidates for compilation in ``normal'' programs (unlike
599 After tuning method sizes, our next performance boost came from
600 eliminating exraneous case branches. Having case statements before
601 each instruction prevents JIT compilers from being able to optimize
602 across instruction boundaries, since control flow can enter the body
603 of a {\tt switch} statement at any of the {\tt case}s. In order to
604 eliminate unnecessary case statements we needed to identify all
605 possible jump targets. Jump targets can come from three sources:
607 Putting more than 256 instructions in each method lead to a severe
608 performance penalty. Apparently Hotspot does not handle very large methods
609 well. In some tests the simple moving from 256 to 512 instructions per
610 method decreased performance by a factor of 10.
614 The next big optimization was eliminating unnecessary case
615 statements. Having case statements before each instruction prevents
616 JIT compilers from being able to optimize across instruction
617 boundaries. In order to eliminate unnecessary case statements every
618 possible address that could be jumped to directly needed to be
619 identified. The sources for possible jump targets come from 3 places.
623 \item {\bf The {\tt .text} segment}
625 Every instruction in the text segment is scanned, and every
626 branch instruction's destination is added to the list of
627 possible branch targets. In addition, any function that sets
628 the link register is added to the list \footnote{actually {\tt addr+8}}.
629 Finally, combinations of {\tt LUI} (Load Upper Immediate) and
630 {\tt ADDIU} (Add Immediate Unsigned) are scanned for possible
631 addresses in the {\tt .text} segment since this combination of
632 instructions is often used to load a 32-bit word into a
635 \item {\bf The {\tt .data} segment}
637 When compiling {\tt switch} statements, compilers often use a
638 jump table stored in the {\tt .data} segment. Unfortunately
639 they typically do not identify these jump tables in any way.
640 Therefore, the entire {\tt .data} segment is conservatively
641 scanned for possible addresses in the {\tt .text} segment.
643 \item {\bf The symbol table}
645 This is mainly used as a backup. Scanning the {\tt .text} and
646 {\tt .data} segments should identify any possible jump targets;
647 however, adding all function symbols in the ELF symbol table
648 also catches functions that are never called directly from the
649 MIPS binary, such as those invoked only via the NestedVM
650 runtime's {\tt call()} method.
654 Eliminating unnecessary {\tt case} statements provided a 10-25\% speed
657 Despite all the above optimizations, one insurmountable obstacle
658 remained: the Java {\tt .class} file format limits the constant pool
659 to 65535 entries. Every integer literal greater than {\tt 32767}
660 requires an entry in this pool, and each branch instruction generates
663 One suboptimal solution was to express constants as offsets from a few
664 central values; for example ``{\tt pc~=~N\_0x00010000~+~0x10}'' (where
665 {\tt N\_0x000100000} is a non-final field to prevent {\tt javac} from
666 inlining it). This was sufficient to get reasonably large binaries to
667 compile, and caused only a small (approximately 5\%) performance
668 degredation and a similarly small increase in the size of the {\tt
669 .class} file. However, as we will see in the next section, compiling
670 directly to {\tt .class} files (without the intermediate {\tt .java}
671 file) eliminates this problem entirely.
674 \subsection{Binary-to-Binary}
676 After implementing the binary-to-source compiler, a binary-to-binary
677 translation mode was added.
680 \newlength{\MyLength}
681 \settowidth{\MyLength}{xmachine codex}
682 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
683 \psmatrix[colsep=2,rowsep=0,nrot=:U]
685 [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source} \\[0pt]
691 [name=b0]\MyBox{machine code} & [name=b1]\MyBox{safe bytecode} \\[0pt]
693 \psset{nodesep=5pt,arrows=->}
694 \ncline{s0}{b0}\bput{:U}{\tt gcc}
695 \ncline{b0}{b1}\naput{\tt NestedVM}
699 This mode has several advantages:
703 \item There are quite a few interesting bytecode sequences that cannot
704 be generated as a result of compiling Java source code.
706 \item Directly generating {\tt .class} files Eliminates the
707 time-consuming {\tt javac} step.
709 \item Direct compilation to {\tt .class} files opens up the
710 interesting possibility of dynamically translating MIPS binaries
711 and loading them via {\tt ClassLoader.fromBytes()} {\it at
712 deployment time}, eliminating the need to compile binaries ahead
717 Most of the performance improvemen where made where in the handling of
718 branch instructions and in taking advantage of the JVM stack to
719 eliminate unnecessary {\tt LOAD}s and {\tt STORE}s to local variables.
721 \epsfig{file=chart7,width=3in}
723 The first optimization gained by direct bytecode generation came from
724 the use of the JVM {\tt GOTO} instruction. Despite the fact that the
725 Java {\it language} does not have a {\tt goto} keyword, the VM does in
726 fact have a corresponding instruction which is used quite heavily by
727 {\tt javac}. NestedVM's binary-to-binary mode exploits this
728 instruction to avoid emitting inefficient {\tt switch..case}
731 Related to the {\tt GOTO} instruction is branch statement
732 optimization. When emitting source code, NestedVM translates branches
733 into Java source code like this:
735 {\footnotesize\begin{verbatim}
742 This requires a branch in the JVM {\it regardless} of whether the MIPS
743 branch is actually taken. If {\tt condition} is false the JVM has to
744 jump over the code to set {\tt pc} and go back to the {\tt switch}
745 statemenmt; if {\tt condition} is true the JVM has to jump to the {\tt
746 switch} block. By generating bytecode directly, NestedVM is able to
747 emit a JVM bytecode branching directly to the address corresponding to
748 the target of the MIPS branch. In the case where the branch is not
749 taken the JVM doesn't branch at all.
751 A side effect of the previous two optimizations is a solution to the
752 excess constant pool entries problem. When jumps are implemented as
753 {\tt GOTO}s and branches are taken directly, the {\tt pc} field does
754 not need to be set. This eliminates a huge number of constant pool
755 entries. The {\tt .class} file constant pool size limit is still
756 present, but it is less likely to be encountered.
758 Implementation of the MIPS delay slot offers another opportunity for
759 bytecode-level optimization. In order to take advantage of
760 instructions already in the pipeline, the MIPS ISA specifies that the
761 instruction after a jump or branch is always executed, even if the
762 jump/branch is taken. This instruction is referred to as the ``delay
763 slot\footnote{Newer MIPS CPUs have pipelines that are much larger than
764 early MIPS CPUs, so they have to discard instructions anyways}.'' The
765 instruction in the delay slot is actually executed {\it before} the
766 branch is taken. To further complicate matters, values from the
767 register file are loaded {\it before} the delay slot is executed.
769 Fortunately there is a very elegent solution to this problem which can
770 be expressed in JVM bytecode. When a branch instruction is
771 encountered, the registers needed for the comparison are pushed onto
772 the stack to prepare for the JVM branch instruction. Then, {\it
773 after} the values are on the stack the delay slot instruction is
774 emitted, followed by the actual JVM branch instruction. Because the
775 values were pushed to the stack before the delay slot was executed, any
776 changes the delay slot made to the registers are not visible to the
779 One final advantage that generating bytecode directly allows is a
780 reduction in the size of the ultimate {\tt .class} file. All the
781 optimizations above lead to more compact bytecode as a beneficial side
782 effect; in addition, NestedVM performs a few additional optimizations.
784 When encountering the following {\tt switch} block, both {\tt javac}
785 and {\tt jikes} generate redundant bytecode.
787 {\footnotesize\begin{verbatim}
789 case 0x1: run_1(); break;
790 case 0x2: run_2(); break
792 case 0x100: run_100(); break;
796 The first bytecode in each case arm in the switch statement is {\tt
797 ALOAD\_0} to prepare for a {\tt INVOKESPECIAL} call. By simply
798 lifting this bytecode outside of the {\tt switch} statement, each {\tt
799 case} arm shrinks by one instruction.
801 \subsubsection{Compiler Flags}
803 Although NestedVM perfectly emulates a MIPS R2000 CPU, its performance
804 profile is nothing like that of actual silicon. In particular, {\tt
805 gcc} makes several optimizations that increase performance on an
806 actually MIPS CPU but actually decrease the performance of
807 NestedVM-generated bytecode. We found the following compiler options
808 could be used to improve performance:
812 \item {\tt -falign-functions}
814 Normally a function's location in memory has no effect on its
815 execution speed. However, in the NestedVM binary translator,
816 the {\tt .text} segment is split on power-of-two boundaries. If
817 a function starts near the end of one of these boundaries, a
818 performance critical part of the function winds up spanning two
819 Java methods. Telling {\tt gcc} to align all functions along
820 these boundaries decreases the chance of this sort of splitting.
822 \item {\tt -fno-rename-registers}
824 On an actual silicon chip, using additional registers carries no
825 performance penalty (as long as none are spilled to the stack).
826 However, when generating bytecode, using {\it fewer}
827 ``registers'' helps the JVM optimize the machine code it
828 generates by simplifying the constraints it needs to deal with.
829 Disabling register renaming has this effect.
831 \item {\tt -fno-schedule-insns}
833 Results of MIPS load operations are not available until {\it
834 two} instructions after the load. Without the {\tt
835 -fno-schedule-insns} instruction, {\tt gcc} will attempt to
836 reorder instructions to do other useful work during this period
837 of unavailability. NestedVM is under no such constraint, so
838 removing this reordering typically generates simpler machine
843 Enabling this instruction causes {\tt gcc} to use the system
844 {\tt memcpy()} routine instead of generating loads and stores.
845 As explained in the next section, the NestedVM runtime
846 implements {\tt memcpy()} using {\tt System.arraycopy()}, which
847 is substantially more efficient.
849 NestedVM has two primary ways of executing code, the interpreter, and the
850 binary translators. Both the interpreter and the output from the binary
851 translators sit on top of a Runtime class. This class provides the public
852 interface to both the interpreter and the translated binaries.
854 The Runtime class does the work that the operating system usually does.
855 Conceptually the Runtime class can be thought of as the operating system and
856 its subclasses (translated binaries and the interpreter) the CPU. The
857 Runtime fulfills 5 primary goals:
859 \item {\tt -fno-rename-registers} Some processors can better schedule
860 code when registers aren't reused for two different purposes. By
861 default GCC will try to use as many registers as possibly when
862 it can. This excess use of registers just confuses JIT's trying
863 to compile the output from the binary translator. All the JIT
864 compilers we tested do much better with a few frequently used
867 \item {\tt -fno-delayed-branch} The MIPS CPU has a delay slot (see
868 above). Earlier versions of NestedVM didn't efficiently emulate
869 delay slots. This option causes GCC to avoid using delay slots
870 for anything (a NOP is simply placed in the delay slot). This
871 had a small performance benefit. However, recent versions of
872 NestedVM emulate delay slots with no performance overhead so
873 this options has little effect. Nonetheless, these delay slots
874 provide no benefit under NestedVM either so they are avoided
879 The effects of the various optimizations presented in this chapter are
880 summarized in the table below.
882 \epsfig{file=chart4,width=3in}
884 \epsfig{file=chart3,width=3in}
886 \section{The NestedVM Runtime}
888 In addition to binary-to-source and binary-to-binary translation,
889 NestedVM also includes a MIPS binary interpreter. All three
890 translation approaches expose the same API to both the translated
891 binary and the surrounding VM (including peer Java code).
893 \subsection{The Runtime Class}
895 The runtime fulfills four roles:
899 \item It provides a simple, consistent external interface. The method
900 of actually executing the code (currently only translated
901 binaries and the interpreter) can be changed without any code
902 changes to the caller because only runtime exposes a public
903 interface. This includes methods to pass arguments to the
904 binary's {\tt main()} function, read and write from memory, and
905 call individual functions in the binary.
907 \item It manages the process's memory. The runtime class contains
908 large {\tt int[]} arrays that represent the process`s entire
909 memory space. Subclasses read and write to these arrays as
910 required by the instructions they are executing, and can expand
911 their memory space using the {\tt sbrk} system call.
913 \item The runtime provides access to the host file system and standard
914 I/O streams. Subclasses of {\tt runtime} can access the file
915 system through standard UNIX syscalls ({\tt read()}, {\tt
916 write()}, {\tt open()}, etc). The runtime manages the file
917 descriptor table that maps UNIX file descriptors to Java {\tt
918 RandomAccessFile}s, {\tt InputStream}s, {\tt OutputStream}s, and
921 \item It provides general OS services, including {\tt sleep()}, {\tt
922 gettimeofday()}, {\tt getpagesize()}, {\tt sysconf()}, {\tt
927 \section{Future Directions}
929 Although we have only implemented it for the Java Virtual Machine, our
930 technique generalizes to other safe bytecode architectures. In
931 particular we would like to demonstrate this generality by retargeting
932 the translator to the Microsoft Intermediate Language \cite{msil}.
934 Additionally, we would like to explore other uses for dynamic loading
935 of translated MIPS binaries by combining NestedVM (which itself is
936 written in Java) and the {\tt ClassLoader.fromBytes()} mechanism.
941 We have presented a novel technique for using libraries written in
942 unsafe languages within a safe virtual machine without resorting to
943 native interfaces. We have implemented this technique in NestedVM,
944 which is currently used by the Ibex project\footnote{{\tt
945 http://www.ibex.org}} to perform font rasterization (via {\tt
946 libfreetype}), JPEG decoding (via {\tt libjpeg}), and CAB archive
947 extraction (via {\tt libmspack}), three libraries for which no
948 equivalent Java classes exist.
950 NestedVM is available under an open source license, and can be
953 http://nestedvm.ibex.org
957 \section{Appendix: Testing Methodology}
959 The MIPS binaries can also invoke a special method of Runtime called
960 callJava().When the MIPS binary invokes the {\tt CALL\_JAVA} syscall
961 (usually done through the {\tt \_call\_java()} function provided by
962 the NestedVM support library) the callJava() method in Runtime is
963 invoked with the arguments passes to the syscall.
965 {\footnotesize\begin{verbatim}
968 private Runtime rt = new MyBinary() {
969 pubilc int callJava(int a, int b, int c, int d) {
970 System.err.println("Got " + a + " " + b);
973 public void foo() { rt.run(); }
976 void main(int argc, char **argv) {
982 These two methods can even be combined. MIPS can call Java through the
983 CALL\_JAVA syscall, which can in turn invoke a MIPS function in the
984 binary with the call() method.
\r\r Users preferring a simpler
985 communication mechanism can also use Java StreamÕs and file
986 descriptors. Runtime provides a simple interface for mapping a Java
987 Input or OutputStream to a File Descriptor.
989 Users preferring a simpler communication mechanism can also use Java
990 Stream's and file descriptors. Runtime provides a simple interface for
991 mapping a Java Input or OutputStream to a File Descriptor.
993 %Java source code can create a copy of the translated binary by
994 %instantiating the corresponding class, which extends {\tt Runtime}.
995 %Invoking the {\tt main()} method on this class is equivalent to
996 %calling the {\tt main()} function within the binary; the {\tt String}
997 %arguments to this function are copied into the binary's memory space
998 %and made available as {\tt **argv} and {\tt argc}.
1000 %The translated binary communicates with the rest of the VM by
1001 %executing MIPS {\tt SYSCALL} instructions, which are translated into
1002 %invocations of the {\tt syscall()} method. This calls back to the
1003 %native Java world, which can manipulate the binary's environment by
1004 %reading and writing to its memory space, checking its exit status,
1005 %pausing the VM, and restarting the VM.
1008 %\subsection{Virtualization}
1010 %The {\tt Runtime} class implements the majority of the standard {\tt
1011 %libc} syscalls, providing a complete interface to the filesystem,
1012 %network socket library, time of day, (Brian: what else goes here?).
1016 %\item ability to provide the same interface to CNI code and
1017 % NestedVMified code
1019 %\item security advantages (chroot the {\tt fork()}ed process)
1024 \section{Future Directions}
1026 \section{Conclusion}
1028 \section{Appendix A: Testing Environment}
1030 All times are measured in seconds. These were all run on a dual 1ghz
1031 G4 running OS X 10.3.1 with Apple's latest VM (JDK 1.4.1\_01-27). Each
1034 void do_work(int n) {
1037 for(i=0;i<n;i++) ret += i;
1043 The MIPS binaries can also invoke a special method of Runtime called
1044 callJava().When the MIPS binary invokes the {\tt CALL\_JAVA} syscall
1045 (usually done through the {\tt \_call\_java()} function provided by
1046 the NestedVM support library) the callJava() method in Runtime is
1047 invoked with the arguments passes to the syscall.
1049 {\footnotesize\begin{verbatim}
1052 private Runtime rt = new MyBinary() {
1053 pubilc int callJava(int a, int b, int c, int d) {
1054 System.err.println("Got " + a + " " + b);
1057 public void foo() { rt.run(); }
1060 void main(int argc, char **argv) {
1066 These two methods can even be combined. MIPS can call Java through the
1067 CALL\_JAVA syscall, which can in turn invoke a MIPS function in the binary
1068 with the call() method.
1070 Users preferring a simpler communication mechanism can also use Java
1071 Stream's and file descriptors. Runtime provides a simple interface for
1072 mapping a Java Input or OutputStream to a File Descriptor.
1074 %Java source code can create a copy of the translated binary by
1075 %instantiating the corresponding class, which extends {\tt Runtime}.
1076 %Invoking the {\tt main()} method on this class is equivalent to
1077 %calling the {\tt main()} function within the binary; the {\tt String}
1078 %arguments to this function are copied into the binary's memory space
1079 %and made available as {\tt **argv} and {\tt argc}.
1081 %The translated binary communicates with the rest of the VM by
1082 %executing MIPS {\tt SYSCALL} instructions, which are translated into
1083 %invocations of the {\tt syscall()} method. This calls back to the
1084 %native Java world, which can manipulate the binary's environment by
1085 %reading and writing to its memory space, checking its exit status,
1086 %pausing the VM, and restarting the VM.
1089 %\subsection{Virtualization}
1091 %The {\tt Runtime} class implements the majority of the standard {\tt
1092 %libc} syscalls, providing a complete interface to the filesystem,
1093 %network socket library, time of day, (Brian: what else goes here?).
1097 %\item ability to provide the same interface to CNI code and
1098 % NestedVMified code
1100 %\item security advantages (chroot the {\tt fork()}ed process)
1105 \section{Future Directions}
1107 \section{Conclusion}
1109 \section{Appendix A: Testing Environment}
1111 All times are measured in seconds. These were all run on a dual 1ghz
1112 G4 running OS X 10.3.1 with Apple's latest VM (JDK 1.4.1\_01-27). Each
1114 All times are measured in seconds. These were all run on a dual 1Ghz
1115 Macintosh G4 running Apple's latest JVM (Sun HotSpot JDK 1.4.1). Each
1117 test was run 8 times within a single VM. The highest and lowest times
1118 were removed and the remaining 6 were averaged. In each case only the
1119 first run differed significantly from the rest.
1121 The libjpeg test consisted of decoding a 1280x1024 jpeg
1122 (thebride\_1280.jpg) and writing a tga. The mspack test consisted of
1123 extracting all members from arial32.exe, comic32.exe, times32.exe, and
1124 verdan32.exe. The freetype test consisted of rendering characters
1125 32-127 of Comic.TTF at sizes from 8 to 48 incrementing by 4. (That is
1126 about 950 individual glyphs).
1128 \section{References}