1 \documentclass{acmconf}
4 \usepackage{amssymb,amsmath,epsfig,alltt}
15 \bibliographystyle{alpha}
17 \title{\textbf{\textsf{
18 NestedVM: Total Translation of Native Code into Safe Bytecode
21 \author{\begin{tabular}{@{}c@{}}
22 {\em {Brian Alliet}} \\
23 {Rochester Institute of Technology}\\
25 \end{tabular}\hskip 1in\begin{tabular}{@{}c@{}}
26 {\em {Adam Megacz}} \\
27 {UC Berkeley Statistical Computing Facility} \\
36 We present a new approach to utilizing unsafe legacy code
37 within safe virtual machines by compiling to MIPS machine code as an
38 intermediate language. This approach carries N key benefits over
42 \item total coverage of all language features, unlike source translation
43 \item no build process modifications
44 \item no post-translation human intervention
45 \item efficient bytecode
48 We also present NestedVM, a complete system in production use which
49 implements this technique. We conclude with quantitative performance
50 measurements and suggestions for VM acceleration of the resulting
56 \section{Introduction}
58 The C programming language \cite{KR} has been in use for over 30
59 years. Consequently, there is a huge library of software written in
60 this language. Although Java offers substantial benefits \cite{} over
61 C (and C++), its comparatively young age means that it often lacks
62 equivalents of many C/C++ libraries.
64 The typical solution to this dilemma is to use JNI \cite{} or CNI
65 \cite{} to invoke C code from within a Java VM. Unfortunately, there
66 are a number of situations in which this is not an acceptable
67 solution due to security concerns:
71 \item Java Applets are not permitted to invoke {\tt
72 Runtime.loadLibrary()}
74 \item Java Servlet containers with a {\tt SecurityManager} will not
75 permit loading new JNI libraries. This configuration is popular
76 with {\it shared hosting} providers and corporate intranets
77 where a number of different parties contribute individual web
78 applications which are run together in a single container.
80 \item Unlike Java Bytecode, JNI code is susceptible to buffer overflow
81 and heap corruption attacks. This can be a major security
86 In addition to security concerns, JNI and CNI carry other
91 \item JNI requires the native library to be compiled ahead of time,
92 separately, for every architecture on which it will be deployed.
93 This is unworkable for situations in which the full set of
94 target architectures is not known at deployment time.
96 \item The increasingly popular J2ME \cite{} platform does not support
99 \item JNI often introduces undesirable added complexity to an
104 The technique we present here is based on using a typical ANSI C
105 compiler to compile C/C++ code into a MIPS binary, and then using a
106 tool to translate that binary on an instruction-by-instruction basis
109 The technique presented here is general; we anticipate that it can be
110 applied to other secure virtual machines such as Microsoft's .NET
111 \cite{}, Perl Parrot \cite{}, or Python bytecode \cite{}.
113 \section{Approaches to Translation}
115 Techniques for translating unsafe code into VM bytecode generally fall
116 into four categories:
119 \item source-to-source translation
120 \item source-to-binary translation
121 \item binary-to-source translation
122 \item binary-to-binary translation
127 \newlength{\MyLength}
128 \settowidth{\MyLength}{machine code}
129 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
130 \begin{psmatrix}[colsep=3,rowsep=3]
131 [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source} \\[0pt]
132 [name=b0]\MyBox{machine code} & [name=b1]\MyBox{safe bytecode} \\
133 \psset{nodesep=5pt,arrows=->}
134 \ncline{s0}{b0}<{\it gcc}
135 \ncline{s0}{s1}\aput{:U}{\it c2java}
136 \ncline{s0}{b1}\aput{:U}{\it gcc bytecode backend}
137 \ncline{s1}{b1}>{\it javac}
140 \caption{\label{lattice} Conversion Lattice with examples of tools specific to a C/JVM scenario}
145 \newlength{\MyLength}
146 \settowidth{\MyLength}{machine code}
147 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
148 \begin{psmatrix}[colsep=3,rowsep=3,nrot=:U]
149 [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source} \\[0pt]
150 [name=b0]\MyBox{machine code} & [name=b1]\MyBox{safe bytecode} \\
151 \psset{nodesep=5pt,arrows=->}
152 \ncline{s0}{b0}<{\it gcc}
153 \ncline{s1}{b1}>{\it javac}
154 \ncline{b0}{s1}\naput{\it NestedVM}
155 \ncline{b0}{s1}\nbput{\it binary-to-source}
158 \caption{\label{lattice2} Conversion Lattice including NestedVM in {\it source-output} mode}
163 \newlength{\MyLength}
164 \settowidth{\MyLength}{machine code}
165 \newcommand{\MyBox}[1]{\makebox[\MyLength]{#1}}
166 \begin{psmatrix}[colsep=3,rowsep=3,nrot=:U]
167 [name=s0]\MyBox{unsafe source} & [name=s1]\MyBox{safe source} \\[0pt]
168 [name=b0]\MyBox{machine code} & [name=b1]\MyBox{safe bytecode} \\
169 \psset{nodesep=5pt,arrows=->}
170 \ncline{s0}{b0}<{\it gcc}
171 \ncline{s1}{b1}>{\it javac}
172 \ncline{b0}{b1}\naput{\it NestedVM}
173 \ncline{b0}{b1}\nbput{\it binary-to-binary}
176 \caption{\label{lattice3} Conversion Lattice including NestedVM in {\it bytecode-output} mode}
179 A diagram showing these four translation approaches in the context of
180 running C/C++ code within a Java VM is shown in Figure~\ref{lattice}.
182 \subsection{Existing Work}
183 \subsubsection{Source-to-Source Translation}
187 \item commercial products
190 A number of commercial products and research projects attempt to
191 translate C++ code to Java code, preserving the mapping of C++ classes
192 to Java classes. Unfortunately this is problematic since there is no
193 way to do pointer arithmetic except within arrays, and even in that
194 case, arithmetic cannot be done in terms of fractional objects.
198 Many of these products advise the user to tweak the code which results
199 from the translation. Unfortunately, hand-modifying machine-generated
200 code is generally a bad idea, since this modification cannot be
201 automated. This means that every time the origin code changes, the
202 code generator must be re-run, and the hand modifications must be
203 performed yet again. This is an error-prone process.
205 Furthermore, NestedVM does not attempt to read C code directly. This
206 frees it from the complex task of faithfully implementing the ANSI C
207 standard (or, in the case of non ANSI-C compliant code, some other
208 interface). This also saves the user the chore of altering their
209 build process to accomodate NestedVM.
213 NestedVM takes a novel approach; it uses compiled machine code as a
214 starting point for the translation process. NestedVM has gone through
218 \item binary-to-source compilation (Figure~\ref{lattice2})
219 \item binary-to-binary compilation (Figure~\ref{lattice3})
222 \subsection{Translation Process}
224 Translating a legacy library for use within a JVM proceeds as follows:
228 \item Compile the source code to a statically linked binary, targeting
231 \item Invoke {\tt NestedVM} on the statically linked binary.
232 Typically this will involve linking against {\tt libc}, which
233 translates system requests (such as {\tt open()}, {\tt read()},
234 or {\tt write()}) into appropriate invocations of the MIPS
235 {\tt SYSCALL} instruction.
237 \item (If using binary-to-source translation) compile the resulting
238 {\tt .java} code using {\tt jikes} or {\tt javac}.
240 \item (Optional) compile the resulting bytecode into a {\it safe}
241 native binary using {\tt gcj}.
243 \item From java code, invoke the {\tt run()} method on the generated
244 class. This is equivalent to the {\tt main()} entry point.
249 \subsection{Why MIPS?}
251 We chose MIPS as a source format for two primary reasons: the
252 availability of tools to translate legacy code into MIPS binaries, and
253 the close similarity between the MIPS ISA and the Java Virtual Machine.
255 The MIPS architecture has been around for quite some time, and is well
256 supported by the GNU Compiler Collection, which is capable of
257 compiling C, C++, Java, Fortran, Pascal (with p2c), and Objective C
260 The MIPS R2000 ISA bears a striking similarity to the Java Virtual
265 %\item The original MIPS ISA supports only 32-bit aligned memory loads
266 % and stores. This allows NestedVM to represent memory as a Java
267 % {\tt int[]} without introducing additional overhead.
268 \item Most of the instructions in the original MIPS ISA operate only on
269 32-bit aligned memory locations. This allows NestedVM to represent
270 memory as a Java {\tt int[]} array without introducing additional
273 \item Unlike its predecessor, the R2000 supports 32-bit by 32-bit
274 multiply and divide instructions as well as a single and double
275 precision floating point unit. These capabilities map nicely
276 onto Java's arithmetic instructions.
281 \subsection{Binary-to-Source Compilation}
283 The first incarnation of NestedVM was a binary-to-source compiler.
284 This version reads in a MIPS binary and emits Java source code, which
285 can be compiled with {\tt javac}, {\tt jikes}, or {\tt gcj}.
287 This implementation was primarily a first step towards the
288 binary-to-binary compiler. Conveniently, generating Java source code
289 frees NestedVM from having to perform simple constant propagation
290 optimizations, since most Java compilers already do this. A recurring
291 example is the treatment of the {\tt r0} register, which is fixed as
292 {\tt 0} in the MIPS ISA.
294 Lacking the ability to generate specially optimized bytecode
295 sequences, a straightforward mapping of the general purpose hardware
296 registers to 32 {\tt int} fields was optimal.
299 \begin{minipage}[c]{7in}%
301 {\footnotesize\begin{verbatim}
302 private final static int r0 = 0;
303 private int r1, r2, r3,...,r30;
304 private int r31 = 0xdeadbeef;
305 private int pc = ENTRY_POINT;
334 System.err.println(``Exited.'');
341 {\footnotesize\begin{verbatim}
342 public void run_0x10000() {
359 pubic void run_0x10200() {
370 public void trampoline() {
372 switch(pc&0xfffffe00) {
373 case 0x10000: run_0x10000(); break;
374 case 0x10200: run_0x10200(); break;
383 \caption{\label{code1} Trampoline transformation necessitated by Java's 64kb method size limit}
386 Unfortunately Java imposes a 64kb limit on the size of the bytecode
387 for a single method. This presents a problem for NestedVM, and
388 necessitates a {\it trampoline transformation}, as shown in
389 Figure~\ref{code1}. With this trampoline in place somewhat large
390 binaries can be handled without much difficulty -- fortunately there
391 is no corresponding limit on the size of a classfile as a whole.
393 Another interesting problem that was discovered while creating the
394 trampoline method was javac and Jikes' inability to properly optimize
395 switch statements. The code in Figure~\ref{lookupswitch} is compiled
396 into a comparatively inefficient {\tt LOOKUPSWITCH}, while the code in
397 Figure~\ref{tableswitch} is optimized into a {\tt TABLESWITCH}.
400 {\footnotesize\begin{verbatim}
401 switch(pc&0xffffff00) {
402 case 0x00000100: run_100(); break;
403 case 0x00000200: run_200(); break;
404 case 0x00000300: run_300(); break;
407 \caption{\label{lookupswitch} Code which {\it is not} optimized into a tableswitch}
411 {\footnotesize\begin{verbatim}
418 \caption{\label{tableswitch} Code which {\it is} optimized into a tableswitch}
421 Javac isn't smart enough to see the pattern in the case values and
422 generates very suboptimal bytecode. Manually doing the shifts
423 convinces javac to emit a tableswitch statement, which is
424 significantly faster. This change alone nearly doubled the speed of
427 Finding the optimal method size lead to the next big performance
428 increase. It was determined through experimentation that the optimal
429 number of MIPS instructions per method is 128 (considering only power
430 of two options). Going above or below that lead to performance
431 decreases. This is most likely due to a combination of two factors.
435 \item The two levels of switch statements jumps have to pass though -
436 The first switch statement jumps go through is the trampoline
437 switch statement. This is implemented as a {\tt TABLESWITCH} in JVM
438 bytecode so it is very fast. The second level switch statement
439 in the individual run\_ methods is implemented as a
440 {\tt LOOKUPSWITCH}, which is much slower. Using smaller methods puts
441 more burden on the faster {\tt TABLESWITCH} and less on the slower
444 \item JIT compilers probably favor smaller methods smaller methods are
445 easier to compile and are probably better candidates for JIT
446 compilation than larger methods.
452 The next big optimization was eliminating unnecessary case
453 statements. Having case statements before each instruction prevents
454 JIT compilers from being able to optimize across instruction
455 boundaries. In order to eliminate unnecessary case statements every
456 possible address that could be jumped to directly needed to be
457 identified. The sources for possible jump targets come from 3 places.
461 \item The .text segment - Every instruction in the text segment is
462 scanned for jump targets. Every branch instruction (BEQ, JAL,
463 etc) has its destination added to the list of possible branch
464 targets. In addition, functions that set the link register have
465 theirpc+8 added to the list (the address that would've been put
466 to the link register). Finally, combinations of LUI (Load Upper
467 Immediate) of ADDIU (Add Immediate Unsigned) are scanned for
468 possible addresses in the text segment. This combination of
469 instructions is often used to load a 32-bit word into a
472 \item The .data segment - When GCC generates switch() statements it
473 often uses a jump table stored in the .data
474 segment. Unfortunately gcc doesn't identify these jump tables in
475 any way. Therefore, the entire .data segment is conservatively
476 scanned for possible addresses in the .text segment.
478 \item The symbol table - This is mainly used as a backup. Scanning the
479 .text and .data segments should identify any possible jump
480 targets but adding every function in the symbol table in the ELF
481 binary doesn't hurt. This will also catch functions that are
482 never called directly from the MIPS binary (for example,
483 functions called with the call() method in the runtime).
487 Eliminating unnecessary case statements provided a 10-25\% speed
490 Despite all the above optimizations and workaround an impossible to
491 workaround hard classfile limit was eventually hit, the constant
492 pool. The constant pool in classfiles is limited to 65536
493 entries. Every Integer with a magnitude greater than 32767 requires an
494 entry in the constant pool. Every time the compiler emits a
495 jump or branch instruction the PC field is set to the branch target. This
496 means nearly every branch instruction requires an entry in the
497 constant pool. Large binaries hit this limit fairly quickly. One
498 workaround that was employed in the Java source compiler was to
499 express constants as offsets from a few central values. For example:
500 ``pc = N\_0x00010000 + 0x10'' where N\_0x000100000 is a non-final
501 field to prevent javac from inlining it. This was sufficient to get
502 reasonable large binaries to compile. It has a small (approximately
503 5\%) performance impact on the generated code. It also makes the
504 generated classfile somewhat larger. Fortunately, the classfile
505 compiler eliminates this problem.
508 \subsection{Binary-to-Binary Translation}
510 The next step in the evolution of NestedVM was to compile directly to
511 JVM bytecode eliminating the intermediate javac step. This had several
516 \item There are little tricks that can be done in JVM bytecode that
517 can't be done in Java source code.
519 \item Eliminates the time-consuming javac step - Javac takes a long
520 time to parse and compile the output from the java source
523 \item Allows for MIPS binaries to be compiled and loaded into a
524 running VM using a class loader. This eliminates the need to
525 compile the binaries ahead of time.
529 By generating code at the bytecode level there are many areas where
530 the compiler can be smarter than javac. Most of the areas where
531 improvements where made where in the handling of branch instructions
532 and in taking advantage of the JVM stack to eliminate unnecessary
533 LOADs and STOREs to local variables.
535 The first obvious optimization that generating bytecode allows for is the
536 use of GOTO. Despite the fact that java doesn't have a GOTO keyword a GOTO
537 bytecode does exist and is used heavily in the code generates by javac.
538 Unfortunately the java language doesn't provide any way to take advantage of
539 this. As result of this, jumps within a method were implemented in the
540 binary-to-source compiler by setting the PC field to the new address and
541 making a trip back to the initial switch statement. In the classfile
542 compiler these jumps are implemented as GOTOs directly to the target
543 instruction. This saves a costly trip back through the LOOKUPSWITCH
544 statement and is a huge win for small loops within a method.
546 Somewhat related to using GOTO is the ability to optimize branch
547 statements. In the Java source compiler branch statements are
548 implemented as follows (delay slots are ignored for the purpose of
551 {\footnotesize\begin{verbatim}
552 if(condition) { pc = TARGET; continue; }
555 This requires a branch in the JVM regardless of whether the MIPS
556 branch is actually taken. If condition is false the JVM has to jump
557 over the code to set the PC and go back to the switch block. If
558 condition is true the JVM as to jump to the switch block. By
559 generating bytecode directly we can make the target of the JVM branch
560 statement the actual bytecode of the final destination. In the case
561 where the branch isn't taken the JVM doesn't need to branch at all.
563 A side affect of the above two optimizations is a solution to the
564 excess constant pool entries problem. When jumps are implemented as
565 GOTOs and direct branches to the target the PC field doesn't need to
566 be set. This eliminates many of the constant pool entries the java
567 source compiler requires. The limit is still there however, and given
568 a large enough binary it will still be reached.
570 Delay slots are another area where things are done somewhat
571 inefficiently in the Java source compiler. In order to take advantage
572 of instructions already in the pipeline MIPS cpu have a ``delay
573 slot''. That is, an instruction after a branch or jump instruction that
574 is executed regardless of whether the branch is taken. This is done
575 because by the time the branch or jump instruction is finished being
576 processes the next instruction is already ready to be executed and it
577 is wasteful to discard it. (However, newer MIPS CPUs have pipelines
578 that are much larger than early MIPS CPUs so they have to discard many
579 instructions anyway.) As a result of this the instruction in the delay
580 slot is actually executed BEFORE the branch is taken. To make things
581 even more difficult, values from the register file are loaded BEFORE
582 the delay slot is executed. Here is a small piece of MIPS assembly:
584 {\footnotesize\begin{verbatim}
592 This piece of code is executed as follows
596 \item r2 is set to -1
598 \item r2 is loaded from the register file by the BLTEZ instruction
600 \item 10 is added to r2 by the ADDIU instruction
602 \item The branch is taken because at the time the BLTZ instruction was
603 executed r2 was -1, but r2 is now 9 (-1 + 10)
607 There is a very elegent solution to this problem when using JVM
608 bytecode. When a branch instruction is encountered the registers
609 needed for the comparison are pushed onto the stack to prepare for the
610 JVM branch instruction. Then, AFTER the values are on the stack the
611 delay slot is emitted, and then finally the actual JVM branch
612 instruction. Because the values were pushed to the stack before the
613 delay slot was executed any changes the delay slot made to the
614 registers are not visible to the branch bytecode. This allows delay
615 slots to be used with no performance penalty or size penalty.
617 One final advantage that generating bytecode directly allows is
618 smaller more compact bytecode. All the optimization above lead to
619 smaller bytecode as a side effect. There are also a few other areas
620 where the generated bytecode can be optimized for size with more
621 knowledge of the program as a whole.
623 When encountering the following switch block both javac and Jikes
624 generate redundant bytecode.
626 {\footnotesize\begin{verbatim}
628 case 0x1: run_1(); break;
629 case 0x2: run_2(); break
631 case 0x100: run_100(); break;
635 The first bytecode in each case arm in the switch statement is ALOAD\_0 to
636 prepare for a invoke special call. By simple moving this outside the switch
637 statement each case arm was reduced in size by one instruction. Similar
638 optimizations were also done in other parts of the compiler.
641 \section{Interfacing with Java Code}
643 Java source code can create a copy of the translated binary by
644 instantiating the corresponding class, which extends {\tt Runtime}.
645 Invoking the {\tt main()} method on this class is equivalent to
646 calling the {\tt main()} function within the binary; the {\tt String}
647 arguments to this function are copied into the binary's memory space
648 and made available as {\tt **argv} and {\tt argc}.
650 The translated binary communicates with the rest of the VM by
651 executing MIPS {\tt SYSCALL} instructions, which are translated into
652 invocations of the {\tt syscall()} method. This calls back to the
653 native Java world, which can manipulate the binary's environment by
654 reading and writing to its memory space, checking its exit status,
655 pausing the VM, and restarting the VM.
658 \subsection{Virtualization}
660 The {\tt Runtime} class implements the majority of the standard {\tt
661 libc} syscalls, providing a complete interface to the filesystem,
662 network socket library, time of day, (Brian: what else goes here?).
666 \item ability to provide the same interface to CNI code and
669 \item security advantages (chroot the {\tt fork()}ed process)
674 \section{Quantitative Performance}
678 (Note that none of these libraries have pure-Java equivalents.)
687 \subsection{Optimizations}
689 Brian, can you write something to go here? Just mention which
690 optimizations helped and which ones hurt.
693 \item {\tt trampoline}
694 \item {\tt optimal method size}
695 \item {\tt -msingle-float}
698 \item {\tt local vars for registers (useless)}
699 \item {\tt -fno-rename-registers}
700 \item {\tt -ffast-math}
701 \item {\tt -fno-trapping-math}
702 \item {\tt -fsingle-precision-constant}
703 \item {\tt -mfused-madd}
704 \item {\tt -freg-struct-return}
705 \item {\tt -freduce-all-givs}
706 \item {\tt -fno-peephole}
707 \item {\tt -fno-peephole2}
708 \item {\tt -fmove-all-movables}
709 \item {\tt -fno-sched-spec-load}
710 \item {\tt -fno-sched-spec}
711 \item {\tt -fno-schedule-insns}
712 \item {\tt -fno-schedule-insns2}
713 \item {\tt -fno-delayed-branch}
714 \item {\tt -fno-function-cse}
715 \item {\tt -ffunction-sections}
716 \item {\tt -fdata-sections}
717 \item {\tt array bounds checking}
718 \item {\tt -falign-functions=n}
719 \item {\tt -falign-labels=n}
720 \item {\tt -falign-loops=n}
721 \item {\tt -falign-jumps=n}
722 \item {\tt -fno-function-cse}
725 \section{Future Directions}
731 We rock the hizzouse.
739 {\footnotesize\begin{verbatim}
741 libjpeg (render thebride_1280.jpg)
746 freetype (rendering characters 32-127 of Comic.TTF at sizes from 8 to
747 48 incrementing by 4)
752 libjpeg libmspack libfreetype
753 Interpreted MIPS Binary 22.2 12.9 21.4
754 Compled MIPS Binary (fastest options) 3.39 2.23 4.31
755 Native -O3 0.235 0.084 0.201
757 Compled - with all case statements 3.50 2.30 4.99
758 Compiled - with pruned case statement 3.39 2.23 4.31
760 Compiled - 512 instructions/method 62.7 27.7 56.9
761 Compiled - 256 instructions/method 3.54 2.55 4.43
762 Compiled - 128 instructions/method 3.39 2.23 4.31
763 Compiled - 64 instructions/method 3.56 2.31 4.40
764 Compiled - 32 instruction/method 3.71 2.46 4.64
766 Compild MIPS Binary (Server VM) 3.21 2.00 4.54
767 Compiled MIPS Binary (Client VM) 3.39 2.23 4.31
769 All times are measured in seconds. These were all run on a dual 1ghz G4
770 running OS X 10.3.1 with Apple's latest VM (JDK 1.4.1_01-27). Each test
771 was run 8 times within a single VM. The highest and lowest times were
772 removed and the remaining 6 were averaged. In each case only the first
773 run differed significantly from the rest.
775 The libjpeg test consisted of decoding a 1280x1024 jpeg
776 (thebride_1280.jpg) and writing a tga. The mspack test consisted of
777 extracting all members from arial32.exe, comic32.exe, times32.exe, and
778 verdan32.exe. The freetype test consisted of rendering characters
779 32-127 of Comic.TTF at sizes from 8 to 48 incrementing by 4. (That is
780 about 950 individual glyphs).
782 I can provide you with the source for any of these test if you'd like.