Optimisation of C-code (runtime and compiled)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Placing of STG registers in machine registers
- Optimisation of the interpreter loop (tail calls)

The -OC flag to ghc causes optimisation.

For functions with no args we declare them as

	foo(STG_NO_ARGS)

rather than foo(), because you can tell ANSI C a little more
by making STG_NO_ARGS expand to void. Maybe something to do with
We are concentrating most optimisation on gcc, which allows
suitable global register declarations.

See StgOpt.h for a description of register usage.

Note that all modules that reference the STG registers must be
compiled the same way, so that they look at the registers and not the

Separate modules are required for the tail call optimisations.
This requires us to partition the runtime system code.

Modules with tail call routines (most runtime and all compiled code)
are labelled .hc (literate = .lhc).
These are compiled to assembler with tail call optimisation on
and then post-processed with sed (Yuk!)

All routines which return a continuation (STGJUMP) must be
(The only exception to this is continuations which exit/abort, which

The other modules are not compiled with the tail call optimisation
and don't have sed processing.
Sed processing would destroy the code!

All routines which are not continuations (called and returning
conventionally) must be compiled this way.
This includes various parts of the runtime system.

See also ~sansom/work/gcstats/OBSERVATIONS
Info received from Eliot Miranda:
Received: from dcs.glasgow.ac.uk (tutuila) by vanuata.dcs.glasgow.ac.uk; Thu, 4 Jul 91 09:40:34 BST
Message-Id: <15456.9107040841@dcs.glasgow.ac.uk>
X-Comment1: #############################################################
X-Comment2: #    uk.ac.glasgow.cs has changed to uk.ac.glasgow.dcs      #
X-Comment3: #    If this address does not work please ask your mail     #
X-Comment4: #    administrator to update your NRS & mailer tables.      #
X-Comment5: #############################################################
Subject: Miranda's original msg
Date: Thu, 04 Jul 91 09:41:19 +0100
From: Simon L Peyton Jones <simonpj>

>From eliot@.cs.qmw.ac.uk Mon Apr 1 11:16:06 1991
From: eliot@.cs.qmw.ac.uk (Eliot Miranda)
Newsgroups: comp.compilers
Subject: Portable Fast Direct Threaded Code
Keywords: interpreter, design
Date: 28 Mar 91 12:20:06 GMT
Reply-To: Eliot Miranda <eliot@.cs.qmw.ac.uk>
Organization: Computer Science Dept, QMW, University of London, UK.
Various people have asked me for details on how I do threaded code
in my dynamic translation Smalltalk-80 VM. So here are the gory details,
as well as the first published benchmarks for the system.

How to do "Machine-Independent" Fast Direct Threaded Code:

First off, use C (although other flexible machine-oriented imperative
languages would probably work too).

If you can use GCC >v1.33 you can use global register variables to
hold the threaded code machine's registers. If you have one of the various
forms of stupid C compiler then you can get global register variables by
declaring your globals as register variables in every function, and later
editing the assembler generated by the C compiler to remove the global
register saves & restores (details in [Miranda]).
Threaded code instructions (TCIs) are written as C procedures.
They are compiled to assembler by the C compiler. Subsequently a simple
editor script is run on the assembler to remove the prologues and epilogues
from the threaded code procedures, and possibly to insert direct threaded

How to Identify Threaded Code Instructions:

The method I prefer is to segregate TCIs from other procedures &
functions in the machine by placing related groups of TCIs in separate
source files. I give my threaded code source files the .tc extension and
they have a rule in the makefile that runs the editor script on the
assembler. An alternative is to identify each threaded code procedure with
a special prefix which is spotted by the editor script. This is probably
more error-prone & doesn't fit well with leaf-routine optimization (see
How to Write Threaded Code Instructions:

Each TCI is written as a void function of no arguments. It is wise to start
and end each TCI with two special macros to replace '{' and '}'. So, using
GCC on the SPARC, given some declarations:

	typedef void (*TCODE)();	/* a TCODE is a pointer to a function */
	typedef ???? ObjectPointer;

	register TCODE *tcip asm("%g7");	/* threaded code instruction pointer */
	register ObjectPointer *stackPointer asm("%g5");
e.g. popStack would be written:

With stupid C compilers TBEGIN can be defined to be the list of global
register variables. Further, if you want to build a debugger for your
threaded code machine you could compile the system with

	#define TBEGIN { int frig = checkForBreakPoint();

and ignore lots of warnings about the variable frig being unused :-).
TEND has to do a direct threaded code jump. In my system I want an indirect
post-increment jump on tcip; i.e. jump to *tcip++. On the SPARC, with tcip
in %g7, the jump is

	ld [%g7],%o0	! get *tcip
	jmpl %o0,%g0	! jump to it
	add %g7,4,%g7	! increment tcip in the jump's delay slot

On the 68k with tcip in a5 the jump is

With GCC this is implemented by the JUMPNEXT macro. On the SPARC:

	#define JUMPNEXT do{ \
		asm("ld [%g7],%o0; jmpl %o0,%g0; add %g7,4,%g7");\
		return;}while(0)

Note the return: it tells the compiler that control does not pass this point.
/* SBD = Silent But Deadly = Stack Bug Dummy. gcc has a bug with
   no-defer-pop: it always defers the pop of the last function call in
   a routine. SBD is a dummy call to ensure no other previous call gets
   its pop deferred. */

extern void SBD P((void));

On the 68k:

	#define JUMPNEXT do{ \
		asm("movl a5@+,a0; jmp a0@");\
		SBD();return;}while(0)

SBD is then removed by the editor script.
So TEND is defined to be

	#define TEND JUMPNEXT; }

On the SPARC popStack is expanded to

	do{asm("ld [%g7],%o0; jmpl %o0,%g0; add %g7,4,%g7");return;}while(0);

	ld [%g7],%o0; jmpl %o0,%g0; add %g7,4,%g7

The editor script then reduces this to:

	ld [%g7],%o0; jmpl %o0,%g0; add %g7,4,%g7

On the 68k you end up with:

	movl a5@+,a0; jmp a0@
Global Register Variables and Stupid C Compilers:

Some C compilers are stupid enough to provide straightforward global
registers. A C compiler on a nat-semi based system I used just allocated
registers in the order they were declared. The assembler syntax was very
simple to edit, so global register variables could be obtained easily.

Some C compilers are stupid but think they're clever. Sun's SUN3
compiler is a case in point. It also allocates registers in the order
declared. However, it tries to be clever by removing 'dead assignments'
(assignments to subsequently unused register variables). These compilers
are easy to fool. Simply arrange that the register variables are always
used before leaving a function. I do this by always writing RETURN or
RETURNV(e) instead of return and return e. On systems with such stupid C
compilers RETURNV(e) is defined:

	#define RETURNV(e) do{DummyUseRegs(GR1,GR2,GR3); return e;}while(0)

The call on DummyUseRegs fools the compiler into thinking the registers
are live & hence preserves the assignments to them. The editor scripts can
then remove the calls on DummyUseRegs.

Of course on systems with marginally clever C compilers (SUN4
you're stuffed. However, with clever C compilers like GCC and Acorn's C
compiler you can declare global registers & everything is clean & wholesome :-).
Conditional TCODE Jumps:

Say we wanted a conditional tcode jump. This might be written:

	if (*stackPointer-- == TrueObject) {
With the above scheme, each threaded code procedure runs in the same C
stack frame, and jumps directly to the next procedure, eliminating an
unnecessary <epilogue, return>, <call, prologue> pair. Once we establish a
stack frame and call the first function, away we go. Assuming that you've
produced your first threaded code method (a sequence of pointers to
threaded code procedures), and that tcip points at the start, then
StartTCMachine might be defined as follows:

	{	char	*enoughSpaceForAllTCIStackFrames;

		enoughSpaceForAllTCIStackFrames = alloca(1024);

		while(1) { (*tcip++)(); }

The alloca allocates space on the stack. After the first (*tcip++)()
control goes off into threaded code land and never returns.
Leaf Routine Optimization:

The threaded code machine will make calls on support routines, e.g.
graphics, garbage collector etc. Any group of routines that don't access
the global registers and don't directly or indirectly call routines that
need to access the global registers can be optimized. These routines
should be compiled without declaring the global registers. These routines
will then use as many registers as the compiler will give them and will
save & restore any registers they use, preserving the values of the global
Details of my Smalltalk Threaded Code Machine:

I use a pair of words for each TCI: a pointer to the procedure followed
by an optional operand. This avoids going out of line to access arguments.
e.g. the body of pushLit is

	*++stackPointer = (OOP)*tcip++;

where OOP is an ordinary object pointer. So on entry to pushLit we have:

	tcip->	<object pointer>
		<pointer to next TCI>

and popStack must therefore be written
I dynamically compile Smalltalk-80 bytecodes to threaded code. I use 128k
bytes of memory to hold all threaded code. This 'tspace' is periodically
scavenged to reclaim space. The architecture is similar to
[DeutschSchiffman]. Using an eighth of the space used by the
Deutsch-Schiffman machine, I get around 75% of the performance on the
non-graphics benchmarks. Here are the Smalltalk macro benchmarks for
BrouHaHa Smalltalk-80 v2.3.2t running on a monochrome SUN3/60 (20MHz 68020):

	ClassOrganizer		 80.6667
	PrintDefinition		 59.0278
	PrintHierachy		142.857
	AllImplementors		130.0
	KeyboardLookAhead	212.5
	KeyboardSingle		302.778
	TextFormatting		273.81

	Performance Rating	134.198
and on the same machine under the same conditions, the timings for
ParcPlace Smalltalk-80 V2.3 are:

	ClassOrganizer		155.128
	PrintDefinition		137.097
	PrintHierachy		192.308
	AllImplementors		108.333
	KeyboardLookAhead	303.571
	KeyboardSingle		473.913
	TextFormatting		442.308

	Performance Rating	189.504

134.198/189.504 = 0.708154

WARNING!! These systems ARE different; these benchmarks are included only
to give a feeling for ball-park performance.
	closures			blue-book BlockContexts
	characters, smallints,		immediate smallintegers
	  fixedpoints
	5844 compiled methods		5136 compiled methods
	(5026 ordinary methods)		(4798 ordinary methods)
	(818 quick methods)		(338 quick methods)
More Portable File Organization:

To keep the code as clean-looking as possible, all machine dependencies are
isolated in separate files. e.g. tcode.h gives machine-independent
definitions for TCODE. It includes machine dependencies from another file:

	/* for debugging purposes; single step breakpoint at start of

	#define DEBUG_FETCH_BREAK int frig = fetchBrk();

	# include "fasttcode.h"

	TCODE *tcip;	/* the tcode ip points at TCODEs */

	# define JUMPNEXT return

	# define TBEGIN { DEBUG_FETCH_BREAK
GCC/SPARC/fasttcode.h:

	register TCODE *tcip asm("%g7");

	#define JUMPNEXT do{asm("ld [%g7],%o0; jmpl %o0,%g0; add %g7,4,%g7");return;}while(0)

	# define TBEGIN { DEBUG_FETCH_BREAK
	#define TEND JUMPNEXT; }
I also don't want to include the necessary definitions for the global
registers in every file. So for those non-leaf routines that must avoid
using the global registers there's a fastglobal.h file that gives dummy
definitions for these registers. e.g. GCC/SPARC/fastglobal.h:

	/* machine specific FAST defines.
	   Gnu C 1.33 systems can use nice compiler provided global registers. */

	#define RETURN(e) return e
	#define RETURNV return

	register char *GlobRegDummy1 asm("a5");
	register char *GlobRegDummy2 asm("a4");
	register char *GlobRegDummy3 asm("a3");
	register char *GlobRegDummy4 asm("d6");

	register char *GlobRegDummy5 asm("d7");
I use symbolic links to set up the machine-dependent include files. This has
the advantage that if you add a new machine you don't have to remake all

The only tedious bit is writing the sed scripts. For the SPARC this took 1 day.
Here are the sed scripts I use for SUN3, MAC2AUX (using GAS) and SUN4,
all using GCC (v1.33 upwards). There's a problem on the SPARC in that the ABI
does not seem to define the status of the global registers. Some math and
library routines stomp on the global registers (beware getwd!), so I've
included GCC/SUN4/sed.globreg.bugfix as an example of how to spot the
offending math
SUN3/GCC/lib/sed.tcode.opt:

# script to strip prolog & epilog from threaded code under gcc.
# WARNING the script may strip a push of a register argument if a call is the
# first statement of a function!!

s/	link a6,#[^\n]*\n//
/	fmovem #[^\n]*,sp@-/{
s/	fmovem #[^\n]*,sp@-\n//
s/	moveml .*,sp@-\n//
s/	movel [ad][0-7],sp@-\n//
/	moveml a6@(-[1-9][0-9]*),#/{N
s/	moveml a6@(-[1-9][0-9]*),#[^\n]*\n	unlk a6//
/	movel a6@(-[1-9][0-9]*),[ad][0-7]/{N
s/	movel a6@(-[1-9][0-9]*),[ad][0-7]\n	unlk a6//
MAC2AUX/GCC/lib.gas/sed.tcode.opt:

script to strip prolog & epilog from threaded code under gcc. WARNING \
the script may strip a push of a register argument if a call is the\
first statement of a function!!

s/	link %a6,#[x0-9-]*\n//
/	fmovem #[^\n]*,%sp@-/{
s/	fmovem #[^\n]*,%sp@-\n//
s/	moveml #[x0-9a-f]*,%sp@-\n//
s/	movel %[ad][0-7],%sp@-\n//
/	moveml -[1-9][0-9]*%a6@,#/{ N
s/	moveml -[1-9][0-9]*%a6@,#[x0-9a-f-]*\n	unlk %a6//
/	movel -[1-9][0-9]*%a6@,%[ad][0-7]/{ N
s/	movel -[1-9][0-9]*%a6@,%[ad][0-7]\n	unlk %a6//
s/	movel %d0,%a0\n	unlk %a6//
s/	movel %d0,%a0\n	movem*l %a6.*\n	unlk %a6//
s/	movel %d0,%a0\n	movem*l %a6.*\n	fmovem %a6.*\n	unlk %a6//
SUN4/GCC/lib/sed.tcode.opt:

# script to strip prolog & epilog from threaded code under gcc.

s/	!#PROLOGUE# 0\n	save %sp,[-0-9]*,%sp\n	!#PROLOGUE# 1/	! [gotcher]/
SUN4/GCC/lib/sed.globreg.bugfix:

# Some of the libc builtin routines (.rem .urem .div & .udiv so far known)
# stamp on %g3 which is the maskReg (it contains 0x7FFFFF).
# This script reassigns the value of maskReg after each of these routines

/call[ 	]\.div,[0-9]/{n
	sethi %hi(0x7FFFFF),%g3 ![globregfix]\
	or %lo(0x7FFFFF),%g3,%g3
/call[ 	]\.udiv,[0-9]/{n
	sethi %hi(0x7FFFFF),%g3 ![globregfix]\
	or %lo(0x7FFFFF),%g3,%g3
/call[ 	]\.rem,[0-9]/{n
	sethi %hi(0x7FFFFF),%g3 ![globregfix]\
	or %lo(0x7FFFFF),%g3,%g3
/call[ 	]\.urem,[0-9]/{n
	sethi %hi(0x7FFFFF),%g3 ![globregfix]\
	or %lo(0x7FFFFF),%g3,%g3
You can now see why I put "Machine-Independent" in quotes. Here's the count
of machine-dependent code for the SPARC:

	 25	  99	 786	fastcache.h
	 68	 262	1882	fastglobal.h
	 31	 112	 906	fasttcode.h
	 28	  80	 595	../tcsrc/SUN4/GCC/lib/sed.globreg.bugfix
	  5	  34	 222	../tcsrc/SUN4/GCC/lib/sed.peep.opt
	  9	  30	 173	../tcsrc/SUN4/GCC/lib/sed.tcode.opt

Of these 166 lines, 51 are banner headers. 100-odd lines are
machine dependent. A whole VM is around the 20k-line mark, so
machine dependencies are down in the 0.5% range.
Use this stuff as part of whatever you like. If you try & assert ownership
I'll fight & batter you over the head with the GPL ('bout time we had some
serious steel in that thar GPL).

P.S. The BrouHaHa machine is available to educational institutions with a
valid ParcPlace Smalltalk-80 licence, subject to a strict non-disclosure
agreement. Email me if you want it. I am slow to answer requests!

Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.
>From crowl@cs.rochester.edu Tue Apr 2 10:34:53 1991
From: crowl@cs.rochester.edu (Lawrence Crowl)
Newsgroups: comp.compilers
Subject: Re: Portable Fast Direct Threaded Code
Keywords: interpreter, design
Date: 31 Mar 91 18:06:35 GMT
Reply-To: crowl@cs.rochester.edu (Lawrence Crowl)
Organization: Computer Science Department, University of Rochester

In article <3035@redstar.cs.qmw.ac.uk>
Eliot Miranda <eliot@.cs.qmw.ac.uk> writes:
>The threaded code machine will make calls on support routines, e.g. graphics,
>garbage collector etc. Any group of routines that don't access the global
>registers and don't directly or indirectly call routines that need to access
>the global registers can be optimized. These routines should be compiled
>without declaring the global registers. These routines will then use as many
>registers as the compiler will give them and will save & restore any
>registers they use, preserving the values of the global register variables.
This scheme assumes that procedure calls use a "callee saves" register
convention, and does not work if you allocate the global register
variables out of the "caller saves" set of registers. The problem is that
the caller does not save the register (because it is global) and the
callee writes over the register (because it assumes the caller saved it).
In this situation, the programmer must insert explicit saves and restores
of the global register variables.

The obvious solution to this problem is to allocate all global register
variables out of the "callee saves" set of registers. However, the
Alliant has _no_ callee saves registers. Library routines on the Alliant
will trash every register you have. In addition, implicit library calls
to routines like bcopy will also trash your registers. (I learned this

The lesson is that calling library routines with global register variables
in caller-saves registers requires special handling. It is not automagic.
Lawrence Crowl		716-275-9499	University of Rochester
crowl@cs.rochester.edu			Computer Science Department
...!rutgers!rochester!crowl		Rochester, New York, 14627-0226
>From Tom.Lane@G.GP.CS.CMU.EDU Wed Apr 3 10:38:09 1991
From: Tom.Lane@G.GP.CS.CMU.EDU
Newsgroups: comp.compilers
Subject: Re: Portable Fast Direct Threaded Code
Keywords: interpreter, design
Date: 1 Apr 91 15:21:14 GMT
Reply-To: Tom.Lane@G.GP.CS.CMU.EDU
Organization: Compilers Central
Lawrence Crowl points out one important problem with Eliot Miranda's
scheme for global register use in C. There's an even more fundamental
problem, though: there is *no guarantee whatever* that the compiler will
assign the same registers to the global variables in every routine.

Compilers that do intelligent allocation of variables to registers may
refuse to honor the "register" declarations at all if the global variables
are not heavily used in a given routine, and in any case the globals need
not be assigned to the same registers every time. Miranda's scheme thus
relies heavily on the assumption of a dumb register allocator (or else a
compiler that supports global register variable declarations).

This scheme may be "fast" direct threaded code, but it's hardly "portable".

Internet: tgl@cs.cmu.edu	BITNET: tgl%cs.cmu.edu@cmuccvma

[GCC lets you specify what register to use for global register variables, but
that is of course both machine and compiler specific. -John]
>From pardo@cs.washington.edu Thu Apr 4 17:34:39 1991
From: pardo@cs.washington.edu (David Keppel)
Newsgroups: comp.compilers
Subject: Re: Portable Fast Direct Threaded Code
Keywords: interpreter, design, bibliography
Date: 2 Apr 91 19:21:25 GMT
Reply-To: pardo@cs.washington.edu (David Keppel)
Organization: Computer Science & Engineering, U. of Washington, Seattle
metzger@watson.ibm.com (Perry E. Metzger) writes:
>[I'd like a reference on threaded code interpreters.]

%J Communications of the ACM (CACM)
%X Describes the basic idea of threaded code.
   Compares to hard code (subroutine calls) and interpreters.

%A Richard H. Eckhouse Jr.
%T Minicomputer Systems: Organization, Programming, and Applications
%I Prentice-Hall, Inc.
%X Describes threaded code and ``knotted code''. I (pardo) think that
   this is a very clear introduction to threaded code.

%T An Architectural Trail to Threaded Code Systems
%X Describes how to build a threaded code interpreter/compiler from
   * Direct threaded/indirect threaded.
   * Less than 2:1 performance hit of threading compared to full
   * Note about bad compilers contributing to relative OK-ness of
   * Ease of rewriting stuff.

My favorite of the three is Eckhouse & Morris; however, I don't know
where to get it. The pages that I have are photocopies graciously
sent to me by a friend. As the title implies, this book is several
years old and undoubtedly out of print.

;-D on ( Following this thread of the discussion... ) Pardo
>From simonpj Fri Apr 5 09:52:33 1991
Received: from tutuila.dcs.glasgow.ac.uk by vanuata.cs.glasgow.ac.uk;
	Fri, 5 Apr 91 09:52:27 BST
Message-Id: <2763.9104050851@tutuila.dcs.glasgow.ac.uk>
From: Simon L Peyton Jones <simonpj>
To: eliot@cs.qmw.ac.uk
Subject: Threaded code
Date: Fri, 05 Apr 91 09:51:48 +0100
I saw your article about threaded code. Like you and others, we are
using C as an assembler, only for a pure functional language, Haskell.
I have some brief questions.

1. By telling GCC not to use a frame pointer, one can eliminate
the prologue entirely, can't one? So why edit it out?

I guess the answer is going to be local variables, allocated once for
all by the StartTCMachine routine. Still, it seems quite a pain. I guess
one could sacrifice some (perhaps slight) speed by using a bunch of

2. You edit out the epilogue for tidiness only, I take it. It doesn't
cause any problems if you leave it in, does it?

3. Why do you use no-defer-pop (with the associated bug)?

4. Why does JUMPNEXT have a loop? Surely the jump leaves the loop right
away. Presumably you are tricking the compiler somehow.
============================= Address change =======================
My email address is now officially: simonpj@dcs.glasgow.ac.uk
This may fail if your local site has out-of-date mail tables.
The old address (simonpj@cs.glasgow.ac.uk) will work for quite a long while,
so stay with the old one if the new one fails.
====================================================================
>From eliot@cs.qmw.ac.uk Fri Apr 5 12:18:18 1991
Via: uk.ac.qmw.cs; Fri, 5 Apr 91 12:18:06 BST
Received: from aux47 by redstar.cs.qmw.ac.uk id aa26828; 5 Apr 91 12:17 BST
Reply-To: eliot@cs.qmw.ac.uk
In-Reply-To: Simon L Peyton Jones's mail message
	<2763.9104050851@tutuila.dcs.glasgow.ac.uk>
Message-Id: <9104051217.aa26828@uk.ac.qmw.cs.redstar>
From: Eliot Miranda <eliot@cs.qmw.ac.uk>
Subject: re: Threaded code
Date: Fri, 5 Apr 91 10:54:25 BST
>I saw your article about threaded code. Like you and others, we are
>using C as an assembler, only for a pure functional language, Haskell.
>I have some brief questions.

>1. By telling GCC not to use a frame pointer, one can eliminate
>the prolog entirely, can't one? So why edit it out?

No: registers local to the procedure will still be saved & stack space
allocated for automatic variables. This IS a problem, since the
threaded-code jump at the end of the procedure will miss the register
restores before the epilogue. Consequently the stack will grow unless
these register saves & stack-space allocations are removed.
>I guess the answer is going to be local variables, allocated once for
>all by the StartTCMachine routine. Still, it seems quite a pain. I guess
>one could sacrifice some (perhaps slight) speed by using a bunch of

For certain routines, not using register variables will be expensive
(e.g. a simple integer arithmetic primitive).

>2. You edit out the epilogue for tidiness only, I take it. It doesn't
>cause any problems if you leave it in, does it?

No, but given that the prologue has to be removed & removing the epilogue
is as easy (& given finite memory :-), one might as well remove it.
>3. Why do you use no-defer-pop (with the associated bug)?

OK. This is again to avoid stack growth. On conventional stack architectures
gcc will try & combine the argument-popping code of a sequence of calls such as

	foo(a); bar(b); baz(c);

With -O -fno-defer-pop one might expect gcc to generate

but because gcc knows that the unlk instruction will roll back the stack,
in fact gcc generates:

With -O -fdefer-pop gcc optimizes out the pops completely & generates:

with -O -fomit-frame-pointer -fdefer-pop gcc generates:

& with -O -fomit-frame-pointer -fno-defer-pop gcc generates:

All the above cases are as one would wish. The elimination of all
deferred pops in the unlk instruction is especially clever.
However, in the presence of the threaded-code jump the waters darken!
Consider what gcc generates for:

	register void (**tcip)() asm("%a5");

	#define JUMPNEXT do{asm("movl %a5@+,%a0; jmp %a0@");return;}while(0)

	foo(a); bar(b); JUMPNEXT;

with -O -fdefer-pop gcc generates:

	movl %a5@+,%a0; jmp %a0@

This is clearly no good because the arguments to both foo & bar
will never be popped. Every time doit() is executed the stack will grow
by 8 bytes. Soon your program will dump core with a very large stack.

with -O -fno-defer-pop gcc generates:

	movl %a5@+,%a0; jmp %a0@

Again useless, because bar's pop has been folded into the unlk,
which won't be executed.

with -O -fdefer-pop -fomit-frame-pointer gcc generates:

	movl %a5@+,%a0; jmp %a0@
This is great. However, not all functions are susceptible to
the omit-frame-pointer optimization (i.e. functions
with local variables). E.g. the code generated for:

	register void (**tcip)() asm("%a5");

	#define JUMPNEXT do{asm("movl %a5@+,%a0; jmp %a0@");return;}while(0)

	foo(a,large); bar(b); JUMPNEXT;

with -O -fomit-frame-pointer -fdefer-pop is:

	movl %a5@+,%a0; jmp %a0@

so in general one can't rely on -fomit-frame-pointer.
For the above example both
	-O -fomit-frame-pointer -fno-defer-pop

	movl %a5@+,%a0; jmp %a0@

This is also useless, because bar's argument pop has been folded away.
The problem is that gcc will always fold away the last call's argument
pop if the function has a frame pointer, and -fomit-frame-pointer
can't always get rid of the frame pointer. In fact, in the presence
of variable-sized automatic variables or calls on alloca it would be
very hard (impossible for recursive functions?) to do.
The neatest solution I've come up with is to use -fno-defer-pop
and a dummy function call between the threaded-code jump and
the return:

	register void (**tcip)() asm("%a5");

	#define JUMPNEXT do{asm("movl %a5@+,%a0; jmp %a0@");SBD();return;}while(0)

	foo(a); bar(b); JUMPNEXT;

with -O -fno-defer-pop gcc generates:

	movl %a5@+,%a0; jmp %a0@

Now bar's argument pop is not folded away, because bar is no longer the
last call in the routine; SBD is. The SBD call:
a) prevents gcc's 'fold the last call's argument pop into the unlk'
   optimization, which prevents uncontrolled stack growth;
b) doesn't get executed, because of the jump;
c) is trivial to remove from the assembler with a sed script.
>4. Why does JUMPNEXT have a loop? Surely the jump leaves the loop right
>away. Presumably you are tricking the compiler somehow.

This is old C lore. The problem is:

	'How do you write a macro that is a sequence of statements
	 that can be used wherever a single statement can?'

Take the following definition of JUMPNEXT:

	#define JUMPNEXT asm("movl %a5@+,%a0; jmp %a0@");return;

Then

	if (its_time_to_jump)
		JUMPNEXT;
	do_something_else();

expands to:

	if (its_time_to_jump)
		asm("movl %a5@+,%a0; jmp %a0@");
	return;
	do_something_else();

Not at all what's intended!
There are two tricks I know of (the first I saw in Berkeley Smalltalk,
the second in Richard Stallman's gcc manual; I expect they're both

The first is to surround your statements with

	if (TRUE){statements}else

i.e.

	#define JUMPNEXT if(1){asm("movl %a5@+,%a0; jmp %a0@");return;}else

so that

	if (its_time_to_jump)
		JUMPNEXT;
	do_something_else();

expands to

	if (its_time_to_jump)
		if (1){asm("movl %a5@+,%a0; jmp %a0@");return;}else;
	do_something_else();

which works because C binds elses innermost first. However, some
compilers will whine about dangling elses. The second scheme is

Surround your statements with

	do{statements}while(FALSE);

which will execute the statements precisely once (it's NOT a loop). e.g.

	#define JUMPNEXT do{asm("movl %a5@+,%a0; jmp %a0@");SBD();return;}while(0)

so that

	if (its_time_to_jump)
		JUMPNEXT;
	do_something_else();

expands to

	if (its_time_to_jump)
		do{asm("movl %a5@+,%a0; jmp %a0@");SBD();return;}while(0);
	do_something_else();

which does what's wanted and doesn't incur compiler whines.
>Simon L Peyton Jones

Eliot Miranda			email:	eliot@cs.qmw.ac.uk
Dept of Computer Science	Tel:	071 975 5229 (+44 71 975 5229)
Queen Mary Westfield College	ARPA:	eliot%cs.qmw.ac.uk@nsf.ac.uk
Mile End Road			UUCP:	eliot@qmw-cs.uucp
>From vestal@SRC.Honeywell.COM Fri Apr 5 12:26:11 1991
From: vestal@SRC.Honeywell.COM (Steve Vestal)
Newsgroups: comp.compilers
Subject: Re: Portable Fast Direct Threaded Code
Keywords: interpreter, performance, design
Date: 3 Apr 91 18:23:34 GMT
Reply-To: vestal@SRC.Honeywell.COM (Steve Vestal)
Organization: Honeywell Systems & Research Center
In-Reply-To: pardo@cs.washington.edu's message of 2 Apr 91 19:21:25 GMT
In article <1991Apr2.192125.7464@beaver.cs.washington.edu>
pardo@cs.washington.edu (David Keppel) writes:
[references about threaded code, much stuff deleted]

David> %X Describes how to build a threaded code interpreter/compiler from
David> * Less than 2:1 performance hit of threading compared to full

I have a question about this. Numbers like this are often cited for
threaded-type code, but in Bell's paper this was for the PDP-11 (whose
addressing modes made it a natural for threaded code). Paul Klint's
"Interpretation Techniques" paper (Software P&E, v11, 1981) cites a
significant difference in interpreter fetch/decode times on different
architectures. He cited numbers around 2:1 for the PDP-11, but something
more like 9:1 for a Cyber. I did a Q&D evaluation of this for a RISC, and
the ratio I guesstimated was closer to the one Klint gave for the Cyber
than for the PDP-11 (not unexpectedly).

How architecturally dependent is the performance of these techniques
(relative to compiling to native code)?
1173 Mail: Honeywell S&RC MN65-2100, 3660 Technology Drive, Minneapolis MN 55418
1174 Phone: (612) 782-7049 Internet: vestal@src.honeywell.com
1176 Send compilers articles to compilers@iecc.cambridge.ma.us or
1177 {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
1179 >From E.Ireland@massey.ac.nz Fri Apr 5 12:29:20 1991
1180 From: E.Ireland@massey.ac.nz (Evan Ireland)
1181 Newsgroups: comp.lang.functional
1182 Subject: Three address code
1183 Date: 4 Apr 91 21:49:21 GMT
1184 Reply-To: E.Ireland@massey.ac.nz
1185 Organization: Information Sciences, Massey University, New Zealand
1187 I've had no luck with mail, so this is for csdw at Rhodes University.
1190 >In an attempt to optimize a functional language, I would like to
1191 >turn the stack based intermediate code into three address code.
1193 >Has anyone done similar conversions? Any references would be
1194 >greatly appreciated.
1196 I do not have any references, but I thought that one aspect of my FAM
1197 implementation might be of interest.
1199 A number of interpreters and compilers that I have seen implement a stack
1200 pointer in a register or global variable. Then to implement various stack
1201 operations, they use auto-increment or auto-decrement operations on the stack
1202 pointer register. Since I generate portable C, and thus cannot assume I have
1204 DATA *f (register DATA *fp)
1209 Thus I pass to each function the current pointer to top of stack, from which it
1210 can index downwards to find its arguments. Within the function, I use indexing
1211 operations on fp, e.g. fp[3] = fp[1], to manipulate values on the stack, so I
1212 am not continually manipulating the stack pointer. If "f" calls another
1213 function, it will pass the address of the current top of stack, e.g. g (&fp[5]).
1215 The advantage to me is that I have a register for a stack pointer even though I
1216 am generating portable C code.
1218 Now the relationship to three-address code. If you adopt such a scheme, and
1219 your three address instructions allow some indexing, you can sometimes generate
1221 ADD fp[3],fp[4],fp[3]
1224 _______________________________________________________________________________
1226 E.Ireland@massey.ac.nz Evan Ireland, School of Information Sciences,
1227 +64 63 69099 x8541 Massey University, Palmerston North, New Zealand.
1229 >From pardo@cs.washington.edu Sat Apr 6 14:32:24 1991
1230 From: pardo@cs.washington.edu (David Keppel)
1231 Newsgroups: comp.compilers
1232 Subject: Re: Portable Fast Direct Threaded Code
1233 Keywords: interpreter, performance, design
1234 Date: 4 Apr 91 17:10:55 GMT
1235 Reply-To: pardo@cs.washington.edu (David Keppel)
1236 Organization: Computer Science & Engineering, U. of Washington, Seattle
1238 >>>[Threaded code vs. compilation]
1240 >pardo@cs.washington.edu (David Keppel) writes:
1241 >>[Less than 2:1 performance hit of threading vs. full compilation.]
1243 Note also that the reference that claimed 2:1 (Peter M. Kogge, IEEE
1244 Computer pp 22-33 March 1982) also attributed part of that ratio to the
1245 poor state of compiler optimization.
1248 vestal@SRC.Honeywell.COM (Steve Vestal) writes:
1249 >[Klint says 2:1 for PDP-11 v. 9:1 for Cyber.
1250 > How architecturally dependent are these techniques?]
1252 Suppose that the statically-compiled code fragments that are threaded
1253 together are called `primitives'.
1256 When the execution time of a primitive is large, then the overhead for the
1257 interpreter can be large and still have a small effect on performance.
1258 The performance of the interpreted code is dominated by the time in a
1259 primitive vs. the overhead of moving between primitives.
1261 When the execution time of the primitives is small, then the overhead for
1262 moving between primitives can be a large fraction of the total execution
1263 time. Overhead comes from at least two sources:
1265 * Control flow: the address of the next primitive is loaded
1266 from data memory and the processor executes an indirect jump.
1268 * Register allocation: a primitive is essentially a function with
1269 a fast calling convention (no stack adjustment). Thus, all the
1270 traditional problems with interprocedural register allocation.
1272 Examples of `large primitives' are ``draw circle'' and ``interpolate
1273 spline''. Examples of small primitives are ``push'', ``add'', etc.
1276 * Architectural dependency of control flow
1280 Direct jumps in full compilation:
1284 br next // direct jump
1286 Indirect jumps for threading for a CISC:
1290 br *(r0)+ // load, increment, jump
1292 Indirect jumps for threading for a RISC:
1294 ld *r0, r1 // scheduled load
1298 add r1, #4, r1 // delay slot increment
1300 Direct jumps in full compilation can frequently use one instruction (a
1301 ``near branch'') both to find the address of the next code fragment and
1302 perform the control transfer. On a CISC, branches are typically two or
1303 three bytes. On a RISC, branches are typically four bytes. The threaded
1304 indirect (load, increment, jump) is typically three bytes on a CISC and
1305 twelve bytes (three instructions) on a RISC.
1307 Direct jumps in full compilation take typically one instruction time.
1308 Indirect jumps take at least the following operations: load, increment,
1309 jump indirect. On a CISC, the three operations can typically be `folded'
1310 into one instruction. There may be a load penalty of one instruction
1311 time but the increment is overlapped, so the total time is three machine
1312 units (one `unit' is about one register->register operation). On a RISC,
1313 the total penalty is three machine units.
1315 Direct jumps take one (I-cache) cycle to fetch both the branch instruction
1316 and the address of the branch target. Indirect jumps take a D-cache cycle
1317 to fetch the address of the branch target and an I-cache cycle to fetch
1318 the branch instruction.
1320 Direct jumps can take advantage of instruction prefetching since the
1321 address of the next instruction is known at the time that the instruction
1322 prefetch reads the direct jump. Threaded indirects require an indirect
1323 branch off of a register. Current RISC and CISC machines are about
1324 equivalent in that they do little prefetching. Some machines being
1325 designed do more prefetching so the threading overhead for them will be
1329 * Architectural dependency of register allocation
1331 In a machine with a small number of registers, many of the registers are
1332 in-use in each primitive and the best possible register allocation will
1333 contain many loads and stores. In a machine with a large number of
1334 registers, the full-compilation implementation can make much better use of
1335 registers than the threaded primitives implementation (again, for small
1336 primitives). The savings from full compilation are exactly analogous to
1337 the improvements in register allocation from doing inlining of small
1341 * Other points to ponder
1343 In some cases the threaded code implementation is substantially smaller
1344 than the full-compilation implementation. For large functions or a
1345 machine with small caches, the loss of performance from threading might be
1346 overwhelmed by the gain in cache performance.
1348 On RISC machines, procedure call/return is about twice the cost of other
1349 control flow, except for the overhead of register management. Thus,
1350 call-dense RISC code from full compilation may behave about the same as
1356 >From airs!ian@uunet.UU.NET Sat Apr 6 14:32:56 1991
1357 From: airs!ian@uunet.UU.NET (Ian Lance Taylor)
1358 Newsgroups: comp.compilers
1359 Subject: Threaded code
1360 Keywords: books, interpreter, design
1361 Date: 4 Apr 91 07:19:41 GMT
1362 Reply-To: airs!ian@uunet.UU.NET (Ian Lance Taylor)
1363 Organization: Compilers Central
1365 The book ``Threaded Interpretive Languages'' has a quite complete
1366 implementation of a threaded version of Forth in Z80 assembler. It's
1367 a very complete description of why threaded implementations exist and
1368 how to implement them easily. It's by R. G. Loeliger and was
1369 published by Byte Books (ISBN 0-07-038360-X). It was published in
1370 1981, though, and I'm sure it's long out of print.
1372 Ian Taylor airs!ian@uunet.uu.net uunet!airs!ian
1377 >From firth@sei.cmu.edu Sun Apr 7 14:33:13 1991
1378 From: firth@sei.cmu.edu (Robert Firth)
1379 Newsgroups: comp.compilers
1380 Subject: Re: Portable Fast Direct Threaded Code
1381 Keywords: interpreter, performance, design
1382 Date: 4 Apr 91 13:27:21 GMT
1383 Reply-To: firth@sei.cmu.edu (Robert Firth)
1384 Organization: Software Engineering Institute, Pittsburgh, PA
1386 In article <1991Apr3.182334.16164@src.honeywell.com>
1387 vestal@SRC.Honeywell.COM (Steve Vestal) writes:
1389 >How architecturally dependent is the performance of these techniques
1390 >(relative to compiling to native code)?
1392 [cost of threaded code on PDP-11, RISC &c]
1394 We might have a misunderstanding here, because what I think of as threaded
1395 code doesn't have a decoding and interpretation step. But I'll talk of
1398 A program in threaded code is just an array of addresses, possibly
1399 interspersed with operands. So the fragment
1403 becomes something like
1413 This implies a very simple virtual stack machine - you can get more clever
1414 by implementing a virtual register machine.
1416 The basic execution thread then does this. We point a global register at
1417 the table of addresses, and each primitive has the form
1419 treg := treg + address'size
1423 As you can see, this is great on the PDP-11, since that reduces to one
1426 MOV (treg)+,PC ; NOTE TO MAINTAINER: FASTER THAN JMP - DON'T TOUCH!
1428 On a typical RISC machine, it's two cycles, since you can put almost
1429 anything in the delay slot(s) after the jump.
1431 Now, the load instruction, for instance, would say
1433 load: treg := treg + address'size
1434 load (treg) into tempreg
1435 treg := treg + address'size
1436 push (tempreg) onto simulated stack
1439 On the PDP-11, that's
1444 On a RISC, it's much more like
1446 L R0, 4(treg) ; get operand address
1447 L R0, 0(R0) ; dereference to get operand
1448 SUBI SP, #4 ; decrement simulated SP
1449 S R0, 0(SP) ; push operand on stack
1450 ADDI treg, #8 ; step over two addresses (mine & operands)
1451 JR (treg) ; over to you, Moriarty!
1453 [to fill delay slots, shuffle the above to 132564]
1455 Well, if you have one load delay slot and one branch delay slot, you can
1456 fill all three of them, so that's 6 cycles. Given that a typical load is
1457 only 1.1 cycles in direct code (90% of the delays filled), this is
1458 certainly a lot more than a 2:1 overhead! When you add the cost of a
1459 simulated stack (lots of needless loads and stores), I can well believe
1460 10:1 time expansion for simple code.
1462 In fact, it was more than that on the PDP-11, if you compared threaded
1463 code with direct code from a decent compiler. The big win in the Fortran
1464 compiler came from (a) very compact threaded code, and (b) the floating
1465 point operations were implemented in software, so the overhead of threaded
1466 code was swamped by the cost of floating addition, subtraction &c.
1468 Here's the full code of the above example, so you can see for yourself
1478 * MOV @(treg)+, -(SP)
1485 Note that, if you implement a one-address add, you save two instructions,
1486 since the *** bit reduces to
1490 But even then, it's coming out at over 4:1.
1492 What architectural features make threaded code more efficient? The
1493 fundamental one is main memory that is fast (or not too slow) relative to
1494 registers, since you're doing a lot more fetching. Another is a set of
1495 address modes with double indirection, since you're accessing most
1496 operands one level of indirection further back. And good old
1497 autoincrement helps a little, too. Alas, none of that says 'risc', and
1498 much of it says '1960s'.
1500 Incidentally, if I were to do this again today, I'd definitely simulate a
1501 general-register machine and use a subset of the real machine's registers.
1502 If you take 8 of them, you then have 8 loads and stores, one for each
1503 register, but if you make an absolute rule that nobody even thinks about
1504 touching one of those 8 that doesn't belong to him, then all the good
1505 tricks about register allocation, slaving &c will still work. If you then
1506 implement the operations as one-address general-register, you have again 8
1507 versions (add into R0, add into R1, ...) and lo! you're programming a very
1508 familiar old friend.
1510 "But that was in another country, and besides, the machine is dead..."
1517 >From bpendlet@bambam.es.com Tue Apr 9 20:35:22 1991
1518 From: bpendlet@bambam.es.com (Bob Pendleton)
1519 Newsgroups: comp.compilers
1520 Subject: Re: Portable Fast Direct Threaded Code
1521 Keywords: interpreter, design
1522 Date: 8 Apr 91 19:48:00 GMT
1523 Reply-To: bpendlet@bambam.es.com (Bob Pendleton)
1524 Organization: Compilers Central
1526 In article <23613@as0c.sei.cmu.edu> you write:
1528 > A program in threaded code is just an array of addresses, possibly
1529 > interspersed with operands. So the fragment
1533 > becomes something like
1540 > address of 'store'
1543 > This implies a very simple virtual stack machine - you can get more clever
1544 > by implementing a virtual register machine.
1546 About 10 years ago I was working on a lisp compiler that compiled to
1547 threaded code. I was trying to get small code and still have some
1548 performance. (Since I wanted to run the code on a Z80 or 8080, small was
1549 important. My how things change :-)
1551 I found that the 3 most common operations in threaded code were load,
1552 store, and execute. So I put those operations with the operands. This
1553 made the operands look rather like classes with load, store, and
1554 execute as virtual functions. If you let the evaluate operation
1555 subsume the load and execute operations the threaded code for
1561 address of 'a.evaluate()'
1562 address of 'b.evaluate()'
1564 address of 'c.store()'
1572 address of 'x.evaluate()'
1573 address of 'y.evaluate()'
1574 address of 'F.evaluate()'
1575 address of 'g.store()'
1578 Which is much smaller than the original version of threaded code.
1580 Later, while working on a homebrew version of FORTH I gave up on
1581 threaded code completely. I found, like most who have explored it,
1582 that symbolic execution of RPN code is a nice way to generate machine
1583 code: machine code that runs much faster than threaded code and that,
1584 even on an 8080, was only about 25% bigger than
1588 bpendlet@dsd.es.com or decwrl!esunix!bpendlet or utah-cs!esunix!bpendlet
1589 [The DEC PDP-11 Fortran compiler did something similar, writing load routines
1590 for commonly used variables. -John]
1592 Send compilers articles to compilers@iecc.cambridge.ma.us or
1593 {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
1595 >From pardo@june.cs.washington.edu Wed Apr 24 09:26:32 1991
1596 From: pardo@june.cs.washington.edu (David Keppel)
1597 Newsgroups: comp.compilers
1598 Subject: Re: Fast Interpreted Code
1599 Keywords: interpreter, threaded code
1600 Date: 23 Apr 91 02:06:21 GMT
1601 Reply-To: pardo@june.cs.washington.edu (David Keppel)
1602 Organization: Computer Science & Engineering, U. of Washington, Seattle
1604 ssinghani@viewlogic.com (Sunder Singhani) writes:
1605 >[Our threaded code isn't fast enough. What's faster?]
1607 As far as I know, threaded code gives the fastest primitives/second
1608 dispatch rate on a variety of architectures. The general techniques for
1609 making things faster (that I know of!) are to change things so that the
1610 dispatch rate can go down without changing the work that gets done (or use
1611 hardware, but we'll ignore that for the moment.)
1613 * Use a different v-machine instruction set
1615 The overhead of interpreting is almost nothing in generic PostScript
1616 imaging code because all the time is spent in non-interpreted
1617 primitives. If you can characterize your key operations (perhaps
1618 info in [Davidson & Fraser ??, Fraser, Myers & Wendt 84] can help
1619 you analyze for common operations instead of the more normal time in
1620 routines) then you can re-code your virtual instruction set to have
1621 as primitives operations that are performed frequently.
1623 * Dynamic compilation to native machine code
1625 This is what is done in ParcPlace System's Smalltalk-80
1626 implementation, [Deutsch & Schiffman 84] also Insignia Solution's
1629 Dynamic compilation suffers from the need to do compilation at
1630 runtime: a compiler that produces better code will take longer to
1631 run and the compile time contributes to the overall program runtime.
1632 Also, program text isn't shared, even if multiple instances are
1633 running simultaneously.
1635 * Native-coding key routines
1637 If you believe that your program spends 80% of its time in a few key
1638 routines, then compiling just these routines -- statically, adding
1639 them to the primitive set, statically adding them as library
1640 routines, or dynamically -- can improve performance substantially
1647 %T Some Efficient Architecture Simulation Techniques
1648 %J Winter '90 USENIX Conference
1651 %W Pardo has a copy.
1652 %X Describes a simulator that uses threaded-code techniques to emulate
1653 a Motorola 88000. Each 88k instruction is executed in about 20 host
1654 (68020) instructions. Discusses techniques used to get the simulation
1655 down from several thousand host instructions in many other
1660 %T Eliminating Redundant Object Code
1665 %A Alan M. Schiffman
1666 %T Efficient Implementation of the Smalltalk-80 System
1667 %J 11th Annual Symposium on Principles of Programming Languages
1671 %X Dynamic translation of p-code to n-code (native code).
1672 Reasons for not using straight p-code or straight n-code:
1673 * p-code is smaller than n-code (<= 5X).
1674 * The debugger can debug p-code, improving portability.
1675 * Native code is faster (<= 2X). Reasons include
1676 special fetch/decode/dispatch hardware;
1677 p-machine and n-machine may be very different, e.g.,
1678 stack machine vs. register-oriented.
1679 * Threaded code does reduce the cost of p-code fetch/decode.
1680 Does not help with operand decoding.
1681 Does not allow optimizations to span more than one instruction.
1682 [pardo: that's not technically true, but each optimized
1683 instruction must exist in addition to the unoptimized version.
1684 That leads to exponential blowup of the p-code. Example: delayed
1685 branch and non-delayed branch versions of Robert Bedichek's 88k
1687 The system characteristics:
1688 * The time to translate to n-code via macro expansion is about the
1689 same as the execute time to interpret the p-code.
1690 * (pg 300:) Self-modifying code (SMC) is deprecated but used in a
1691 ``well-confined'' way. Could indirect at more cost. Use SMC on the
1692 machines where it works, indirection where SMC doesn't work.
1694 * Performance is compared to a ``straightforward'' interpreter.
1697 %A Christopher W. Fraser
1700 %T Analyzing and Compressing Assembly Code
1705 %T Two-Level Hybrid Interpreter/Native Code Execution for Combined
1706 Space-Time Program Efficiency
1710 %X Talks about native code execution vs. various kinds of interpreting
1711 and encoding of key routines in assembly.
1716 ;-D on ( This is all open to interpretation ) Pardo
1721 >From eliot@cs.qmw.ac.uk Tue Apr 30 15:55:17 1991
1722 From: eliot@cs.qmw.ac.uk (Eliot Miranda)
1723 Newsgroups: comp.compilers,gnu.gcc.bug,alt.sources
1724 Subject: re: Threaded Code
1725 Keywords: design, interpreter
1726 Date: 5 Apr 91 11:43:51 GMT
1727 Reply-To: Eliot Miranda <eliot@cs.qmw.ac.uk>
1728 Followup-To: comp.compilers
1729 Organization: Computer Science Dept, QMW, University of London, UK.
1731 I recently posted articles to comp.compilers & alt.sources on how
1732 to write threaded code machines in C. I received the following questions
1733 from Simon Peyton Jones at Glasgow. They are specific to GCC.
1734 Since they have non-obvious answers & since the answers suggest
1735 augmentation of the GCC compiler I'm posting my response to Simon.
1737 >From: Simon L Peyton Jones <simonpj@cs.gla.ac.uk>
1739 >I saw your article about threaded code. Like you and others, we are
1740 >using C as an assembler, only for a pure functional language, Haskell.
1741 >I have some brief questions.
1743 >1. By telling GCC not to use a frame pointer, one can eliminate
1744 >the prolog entirely, can't one? So why edit it out?
1746 No, registers local to the procedure will still be saved & stack space
1747 allocated for automatic variables. This IS a problem since the
1748 threaded-code jump at the end of the procedure will miss the register
1749 restores before the epilogue. Consequently the stack will grow unless
1750 these register saves & stack-space allocations are removed. Also
1751 GCC cannot always eliminate the frame pointer.
1753 >I guess the answer is going to be local variables, allocated once for
1754 >all by the StartTCMachine routine. Still, it seems quite a pain. I guess
1755 >one could sacrifice some (perhaps slight) speed by using a bunch of
1757 For certain routines, not using register variables will be expensive
1758 (e.g. a simple integer arithmetic primitive).
1760 >2. You edit out the epilogue for tidiness only, I take it. It doesn't
1761 >cause any problems if you leave it in, does it?
1762 No, but given that the prologue has to be removed & removing the epilogue
1763 is as easy (& given finite memory :-) one might as well remove it.
1765 >3. Why do you use no-defer-pop (with the associated bug)?
1766 OK. This is again to avoid stack growth. On conventional stack architectures
1767 gcc will try & combine the argument popping code of a sequence of
1770 extern long a, b, c;
1772 foo(a); bar(b); baz(c);
1775 with -O -no-defer-pop one might expect gcc to generate
1790 but because gcc knows that the unlk instruction will roll back the stack
1791 in fact gcc generates:
1805 With -O -fdefer-pop gcc optimizes out the pops completely & generates:
1817 with -O -fomit-frame-pointer -fdefer-pop gcc generates:
1828 & with -O -fomit-frame-pointer -fno-defer-pop gcc generates:
1841 All the above cases are as one would wish. The elimination of all
1842 deferred pops in the unlk instruction is especially clever.
1844 However, in the presence of the threaded-code jump the waters darken!
1845 Consider what gcc generates for:
1847 register void (**tcip)() asm("%a5");
1849 #define JUMPNEXT do{asm("movl %a5@+,%a0; jmp %a0@");return;}while(0)
1853 foo(a); bar(b); JUMPNEXT;
1855 with -O -fdefer-pop gcc generates
1864 movl %a5@+,%a0; jmp %a0@
1869 This is clearly no good because the arguments to both foo & bar
1870 will never be popped. Every time doit() is executed the stack will grow
1871 by 8 bytes. Soon your program will dump core with a very large stack
1874 with -O -fno-defer-pop gcc generates:
1883 movl %a5@+,%a0; jmp %a0@
1888 Again useless because bar's pop has been folded into the unlk
1889 which won't be executed.
1891 with -O -fdefer-pop -fomit-frame-pointer gcc generates
1899 movl %a5@+,%a0; jmp %a0@
1903 This is great. However, not all functions are susceptible to
1904 the omit-frame-pointer optimization (i.e. functions
1905 with local variables). E.g. the code generated for:
1907 register void (**tcip)() asm("%a5");
1909 #define JUMPNEXT do{asm("movl %a5@+,%a0; jmp %a0@");return;}while(0)
1914 foo(a,large); bar(b); JUMPNEXT;
1917 with -O -fomit-frame-pointer -fdefer-pop is:
1926 movl %a5@+,%a0; jmp %a0@
1931 so in general one can't rely on -fomit-frame-pointer.
1932 For the above example both
1933 -O -fomit-frame-pointer -fno-defer-pop
1947 movl %a5@+,%a0; jmp %a0@
1952 This is also useless because bar's argument pop has been folded away. The
1953 problem is that gcc will always fold away the last call's argument pop if
1954 the function has a frame pointer, and -fomit-frame-pointer can't always
1955 get rid of the frame pointer. In fact, in the presence of variable sized
1956 automatic variables or calls on alloca it would be very hard (impossible
1957 for recursive functions?) to do.
1959 The neatest solution I've come up with is to use -fno-defer-pop and a
1960 dummy function call between the threaded-code jump and the return:
1962 register void (**tcip)() asm("%a5");
1964 #define JUMPNEXT do{asm("movl %a5@+,%a0; jmp %a0@");SBD();return;}while(0)
1969 foo(a); bar(b); JUMPNEXT;
1971 with -O -fno-defer-pop gcc generates:
1981 movl %a5@+,%a0; jmp %a0@
1987 Now bar's argument pop is not folded because it's no longer the last
1988 call in the routine, SBD is.
1990 a) prevents gcc's 'last call argument pop fold into unlk' optimization
1991 which prevents uncontrolled stack growth.
1992 b) doesn't get executed because of the jump
1993 c) is trivial to remove from the assembler with a sed-script
1996 One can try to use -fcaller-saves, but this surrounds calls with unnecessary
1997 register saves & restores that for the code to be optimal have to be
2000 >4. Why does JUMPNEXT have a loop? Surely the jump leaves the loop right
2001 >away. Presumably you are tricking the compiler somehow.
2003 This is old C lore. The problem is
2004 'How do you write a macro that is a sequence of statements
2005 that can be used wherever a single statement can?'
2007 take the following definition of JUMPNEXT:
2008 #define JUMPNEXT asm("movl %a5@+,%a0; jmp %a0@");return;
2011 if (its_time_to_jump)
2013 do_something_else();
2016 if (its_time_to_jump)
2017 asm("movl %a5@+,%a0; jmp %a0@");
2019 do_something_else();
2021 Not at all what's intended!
2023 There are two tricks I know of (the first I saw in Berkeley Smalltalk,
2024 the second in Richard Stallman's gcc manual. I expect they're both
2026 The first is to surround your statements with
2027 if (TRUE){statements}else
2029 #define JUMPNEXT if(1){asm("movl %a5@+,%a0; jmp %a0@");return;}else
2031 if (its_time_to_jump)
2033 asm("movl %a5@+,%a0; jmp %a0@");
2036 do_something_else();
2038 which works because C binds elses innermost first. However, some
2039 compilers will whine about dangling elses. The second scheme is
2042 Surround your statements with
2043 do{statements}while(FALSE);
2044 which will execute statements precisely once (it's NOT a loop).
2046 #define JUMPNEXT do{asm("movl %a5@+,%a0; jmp %a0@");SBD();return;}while(0)
2049 if (its_time_to_jump)
2051 asm("movl %a5@+,%a0; jmp %a0@");
2054 do_something_else();
2056 which does what's wanted and doesn't incur compiler whines.
2061 >Simon L Peyton Jones, Glasgow University
2063 More and more people are taking the 'use C as an assembler' route, and
2064 more and more people are using GCC to do it (because its code quality is
2065 good, it has global register variables, and it has an excellent asm
2066 facility). The threaded-code in C idea is also becoming more popular.
2067 But as the code above demonstrates, one does have to side-step
2068 optimizations and develop system-specific assembler editing scripts.
2070 I'd like to ask Richard Stallman & the GCC development team for
2071 -fno-prolog -fno-epilog
2072 flags that instruct gcc to generate
2073 a) no register saves or restores
2074 b) no automatic variable allocation
2075 c) no procedure linkage/frame creation
2077 Then the optimal 'Threaded-Code Machine in GCC C' can be compiled without
2078 any assembler editing scripts at all.
2080 Also nice would be a way of telling GCC that an asm statement
2081 changed the flow of control so GCC could
2082 a) warn about not-reached code
2083 b) eliminate unnecessary code (do more code folding)
2085 Eliot Miranda email: eliot@cs.qmw.ac.uk
2086 Dept of Computer Science Tel: 071 975 5229 (+44 71 975 5229)
2087 Queen Mary Westfield College ARPA: eliot%cs.qmw.ac.uk@nsf.ac.uk
2088 Mile End Road UUCP: eliot@qmw-cs.uucp
2094 >From brennan@bcsaic.boeing.com Sat May 4 11:28:41 1991
2095 From: brennan@bcsaic.boeing.com (Michael D Brennan)
2096 Newsgroups: comp.compilers
2097 Subject: re: Threaded code
2098 Keywords: interpreter, design
2099 Date: 2 May 91 19:50:23 GMT
2100 Reply-To: brennan@bcsaic.boeing.com (Michael D Brennan)
2101 Organization: Boeing Aerospace & Electronics, Seattle WA
2103 Another method for obtaining threaded byte code for an interpreter
2104 is to edit the assembler output of a big switch
2105 rather than editing the prologue and epilogue off function calls.
2107 You don't need gcc or global vars in registers; it works with smart and
2108 dumb compilers, and all optimization can be turned on.
2112 This C routine executes (unthreaded) byte code for an interpreter
2113 that can add, subtract and print.
static int stack[32] ;

void execute(code_ptr)
register int *code_ptr ;
{
    register int *stack_ptr = stack - 1 ;

    while ( 1 )
        switch( *code_ptr++ )
        {
            case HALT : return ;
            case PUSH :
                * ++ stack_ptr = *code_ptr++ ;
                break ;
            case ADD :
                -- stack_ptr ;
                *stack_ptr += stack_ptr[1] ;
                break ;
            case SUB :
                -- stack_ptr ;
                *stack_ptr -= stack_ptr[1] ;
                break ;
            case PRINT :
                printf("%d\n", *stack_ptr--);
                break ;
        }
}
2155 -------------------------------------------------------
2157 to interpret 2 + (3 - 4)
2159 the front end "compiles" in int code[]
2161 PUSH, 2, PUSH, 3, PUSH, 4, SUB, ADD, PRINT, HALT
2163 and calls execute(code).
2165 ------------------------------------------------------
2167 The difference between this and the threaded code discussed over
2168 the last few weeks is the switch gets compiled as
2170 jmp TABLE[ *code_ptr++ ]
2172 where TABLE is the jump table generated by the compiler which holds
2173 the addresses of the case labels.
2175 With threading, the transitions between functions become
2180 but this is easy to get by editing the assembler output to
2181 export the case label and recode the switch.
2183 --------------------------------------------------
2185 For example on a SPARC:
2202 ! the switch, doesn't change structure
2203 ! as you add new op codes
2212 sethi %hi(L2000000),%o1
2213 or %o1,%lo(L2000000),%o1 ! [internal]
2217 L2000000: ! the jump TABLE
2218 .word L77003 ! HALT etc
2225 -------------------------------------------
2226 modify by adding global labels and edit the switch
2251 -------------------------------------------
2253 For another example on an Intel 8088
2260 ; switch( *code_ptr++ )
2266 mov bx,word ptr [bx]
2270 jmp word ptr cs:@1@C738[bx]
2276 ; * ++ stack_ptr = *code_ptr++ ;
2280 mov ax,word ptr [di]
2281 mov word ptr [si],ax
2292 @1@C738 label word ; jump TABLE
2294 dw @1@122 ; PUSH etc
2299 ------------------------------------------------
2301 Once edited, the jump can be computed inline
2305 ; switch( *code_ptr++ )
2307 @1@50: ; switch code is replaced by code only executed once
2319 ; * ++ stack_ptr = *code_ptr++ ;
2323 mov ax,word ptr [di]
2324 mov word ptr [si],ax
2330 inc di ; jmp to *code_ptr++ inline
2336 ----------------------------------------------
2338 the "front end" has defines
2340 typedef void (*TCODE)() ;
2342 extern void halt(), push(), add(), sub(), print() ;
2344 TCODE code[CODESIZE] ;
2346 in the array code[], the front end compiles
2349 push, 2, push, 3, push, 4, sub, add, print, halt
2351 and calls execute(code).
2356 brennan@bcsaic.boeing.com