Optimisation of C-code (runtime and compiled)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Placing of STG registers in machine registers
- Optimisation of the interpreter loop (tail calls)

/* TODO: flags */
The -OC flag to ghc turns this optimisation on.

ANSI C
~~~~~~

For functions with no args we declare them as foo( STG_NO_ARGS ) rather
than foo(), because you can tell ANSI C a little more by making
STG_NO_ARGS expand to void: a forward declaration such as
extern void foo(STG_NO_ARGS); is then a full prototype, so the compiler
can check that no call accidentally passes arguments, whereas foo()
leaves the parameter list unspecified.

Optimisation with GCC
~~~~~~~~~~~~~~~~~~~~~

We are concentrating most of the optimisation effort on gcc, which
allows suitable global register declarations.

REGISTERS:

See StgOpt.h for a description of register usage.

Note that all modules that reference the STG registers must be compiled
the same way, so that they look at the registers and not at the global
variables.

TAIL CALLS:

Separate modules are required for the tail-call optimisation; the
runtime system code must be partitioned accordingly.

.hc: Modules with tail-call routines (most of the runtime and all
     compiled code) are labelled .hc (literate = .lhc).  These are
     compiled to assembler with the tail-call optimisation on and then
     post-processed with sed (Yuk!).  All routines which return a
     continuation (STGJUMP) must be compiled this way.  (The only
     exception to this is continuations which exit/abort, which may
     live in .c files.)

.c:  These modules are not compiled with the tail-call optimisation
     and get no sed processing -- sed processing would destroy the
     code!  All routines which are not continuations (i.e. which are
     called and return conventionally) must be compiled this way.
     This includes various parts of the runtime system.

See also ~sansom/work/gcstats/OBSERVATIONS

Info received from Eliot Miranda:

Received: from dcs.glasgow.ac.uk (tutuila) by vanuata.dcs.glasgow.ac.uk;
        Thu, 4 Jul 91 09:40:34 BST
Message-Id: <15456.9107040841@dcs.glasgow.ac.uk>
X-Comment1: #############################################################
X-Comment2: #  uk.ac.glasgow.cs has changed to uk.ac.glasgow.dcs        #
X-Comment3: #  If this address does not work please ask your mail       #
X-Comment4: #  administrator to update your NRS & mailer tables.        #
X-Comment5: #############################################################
To: simonpj
Cc: sansom
Subject: Miranda's original msg
Date: Thu, 04 Jul 91 09:41:19 +0100
From: Simon L Peyton Jones

>From eliot@.cs.qmw.ac.uk Mon Apr 1 11:16:06 1991
From: eliot@.cs.qmw.ac.uk (Eliot Miranda)
Newsgroups: comp.compilers
Subject: Portable Fast Direct Threaded Code
Keywords: interpreter, design
Date: 28 Mar 91 12:20:06 GMT
Reply-To: Eliot Miranda
Organization: Computer Science Dept, QMW, University of London, UK.

Various people have asked me for details on how I do threaded code in
my dynamic translation Smalltalk-80 VM.  So here are the gory details,
as well as the first published benchmarks for the system.

How to do "Machine-Independent" Fast Direct Threaded Code:

First off, use C (although other flexible machine-oriented imperative
languages would probably work too).

Global registers:
If you can use GCC (v1.33 or later) you can use global register
variables to hold the threaded code machine's registers.  If you have
one of the various forms of stupid C compiler then you can get global
register variables by declaring your globals as register variables in
every function, and later editing the assembler generated by the C
compiler to remove the global register saves & restores (details in
[Miranda]).

Threaded Code:
Threaded code instructions (TCIs) are written as C procedures.  They
are compiled to assembler by the C compiler.
Subsequently a simple editor script is run on the assembler to remove
the prologues and epilogues from the threaded code procedures, and
possibly to insert direct threaded code jumps.

How to Identify Threaded Code Instructions:
The method I prefer is to segregate TCIs from other procedures &
functions in the machine by placing related groups of TCIs in separate
source files.  I give my threaded code source files the .tc extension,
and they have a rule in the makefile that will run the editor script
on the assembler.  An alternative is to identify each threaded code
procedure with a special prefix which is spotted by the editor script.
This is probably more error prone & doesn't fit well with leaf-routine
optimization (see below).

How to Write Threaded Code Instructions:
Each TCI is written as a void function of no arguments.  It is wise to
start and end each TCI with two special macros to replace '{' and '}'.
So, using GCC on the SPARC, given some declarations:

	typedef void (*TCODE)();  /* a TCODE is a pointer to a function */
	typedef ???? ObjectPointer;
	. . .
	register TCODE *tcip asm("%g7"); /* threaded code instruction pointer */
	register ObjectPointer *stackPointer asm("%g5");

e.g. popStack would be written:

	void popStack()
	TBEGIN
		stackPointer--;
	TEND

With GCC, TBEGIN is

	#define TBEGIN {

With stupid C compilers it can be defined to be the list of global
register variables.  Further, if you want to build a debugger for your
threaded code machine you could compile the system with

	#define TBEGIN { int frig = checkForBreakPoint();

and ignore lots of warnings about variable frig being unused :-).

TEND has to do a direct threaded code jump.  In my system I want an
indirect post-increment jump on tcip; i.e. jump to *tcip++.  On the
SPARC with tcip in %g7 the jump is

	ld [%g7],%o0	! get *tcip
	jmpl %o0,%g0	! jump to it
	add %g7,4,%g7	! increment tcip in the jump's delay slot

On the 68k with tcip in a5 the jump is

	movl a5@+,a0
	jmp a0@

With GCC this is implemented by the JUMPNEXT macro.  On the SPARC:

	#define JUMPNEXT do{ \
		asm("ld [%g7],%o0; jmpl %o0,%g0; add %g7,4,%g7");\
		return;}while(0)

Note the return; it tells the compiler that control does not pass this
point.  On the 68k:

	/* SBD = Silent But Deadly = Stack Bug Dummy.  gcc has a bug
	   with no-defer-pop: it always defers the pop of the last
	   function call in a routine.  SBD is a dummy call to ensure
	   no other previous call gets its pop deferred. */
	extern void SBD P((void));

	#define JUMPNEXT do{ \
		asm("movl a5@+,a0; jmp a0@");\
		SBD();return;}while(0)

SBD is then removed by the editor script.

So TEND is defined to be

	#define TEND JUMPNEXT; }

On the SPARC popStack is expanded to

	void popStack()
	{
		stackPointer--;
		do{asm("ld [%g7],%o0; jmpl %o0,%g0; add %g7,4,%g7");return;}while(0);
	}

It is compiled to:

	_popStack:
		!#PROLOGUE# 0
		save %sp,-80,%sp
		!#PROLOGUE# 1
		add %g5,-4,%g5
		ld [%g7],%o0; jmpl %o0,%g0; add %g7,4,%g7
		ret
		restore

The editor script then reduces this to:

	_popStack:
		! [gotcher]
		add %g5,-4,%g5
		ld [%g7],%o0; jmpl %o0,%g0; add %g7,4,%g7

On the 68k you end up with:

	.globl _popStack
	_popStack:
		subqw #4,a3
		movl a5@+,a0; jmp a0@

Global Register Variables and Stupid C Compilers:
Some C compilers are stupid enough to provide straightforward global
registers.  A C compiler on a nat-semi based system I used just
allocated registers in the order they were declared.  The assembler
syntax was very simple to edit, so global register variables could be
obtained easily.

Some C compilers are stupid but think they're clever.  Sun's SUN3
compiler is a case in point.
It also allocates registers in the order declared.  However, it tries
to be clever by removing 'dead assignments' (assignments to
subsequently unused register variables).  These compilers are easy to
fool.  Simply arrange that the register variables are always used
before leaving a function.  I do this by always writing RETURN or
RETURNV(e) instead of return and return e.  On systems with such
stupid C compilers, RETURNV(e) is defined thus:

	#define RETURNV(e) do{DummyUseRegs(GR1,GR2,GR3); return e;}while(0)

The call on DummyUseRegs fools the compiler into thinking the
registers are live & hence keeps the assignments to them.  The editor
scripts can then remove the calls on DummyUseRegs.

Of course on systems with marginally clever C compilers (SUN4, HP-UX
etc) you're stuffed.  However, in clever C compilers like GCC and
Acorn's C compiler you can declare global registers & everything is
clean & wholesome :-).

Conditional TCODE Jumps:
Say we wanted a conditional tcode jump.  This might be written:

	void skipIfTrue()
	TBEGIN
		if (*stackPointer-- == TrueObject) {
			tcip += 1;
			JUMPNEXT;
		}
	TEND

How this All Works:
With the above scheme, each threaded code procedure runs in the same C
stack frame, and jumps directly to the next procedure, eliminating an
unnecessary call/return pair.  Once we establish a stack frame and
call the first function, away we go.  Assuming that you've produced
your first threaded code method (a sequence of pointers to threaded
code procedures), and that tcip points at the start, then
StartTCMachine might be defined as follows:

	volatile void StartTCMachine()
	{
		char *enoughSpaceForAllTCIStackFrames;

		enoughSpaceForAllTCIStackFrames = alloca(1024);

		while(1) { (*tcip++)(); }
	}

The alloca allocates space on the stack.  After the first (*tcip++)()
control goes off into threaded code land and never returns.

Leaf Routine Optimization:
The threaded code machine will make calls on support routines, e.g.
graphics, garbage collector etc.  Any group of routines that don't
access the global registers and don't directly or indirectly call
routines that need to access the global registers can be optimized.
These routines should be compiled without declaring the global
registers.  These routines will then use as many registers as the
compiler will give them and will save & restore any registers they
use, preserving the values of the global register variables.

Details of my Smalltalk Threaded Code Machine:
I use a pair of words for each TCI, a pointer to the procedure
followed by an optional operand.  This avoids going out of line to
access arguments.  e.g. pushLiteral is:

	void pushLit()
	TBEGIN
		*++stackPointer = (OOP)*tcip++;
	TEND

where OOP is an ordinary object pointer.  So on entry to pushLit, tcip
points at the inline literal operand, and popStack must therefore be
written

	void popStack()
	TBEGIN
		stackPointer--;
		tcip++;
	TEND

(the tcip++ steps over popStack's unused operand word).

I dynamically compile Smalltalk-80 bytecodes to threaded code.  I use
128k bytes of memory to hold all threaded code.  This 'tspace' is
periodically scavenged to reclaim space.  The architecture is similar
to [DeutschSchiffman].  Using an eighth of the space used by the
Deutsch-Schiffman machine, I get around 75% of the performance on the
non-graphics benchmarks.
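(A minimal, self-contained illustration -- not from the original
posting -- of how the pieces above fit together.  OOP is simplified to
long, main and the method array are made-up stand-ins, and the slow
portable dispatch -- an outer loop, with each TCI simply returning --
stands in for the assembler-edited jump.)

	#include <stdio.h>

	typedef void (*TCODE)();  /* a TCODE is a pointer to a function */
	typedef long OOP;         /* stand-in for an ordinary object pointer */

	static OOP    stack[32];
	static OOP   *stackPointer = stack;  /* slot 0 unused; push pre-increments */
	static TCODE *tcip;                  /* threaded code instruction pointer */

	#define TBEGIN {
	#define TEND   }

	/* Each TCI is a pair of words: procedure, then (possibly unused)
	   operand.  The dispatch loop leaves tcip pointing at the operand,
	   so every TCI must step tcip past it. */
	void pushLit()  TBEGIN *++stackPointer = (OOP)*tcip++; TEND
	void popStack() TBEGIN stackPointer--; tcip++; TEND
	void printTop() TBEGIN printf("%ld\n", (long)*stackPointer); tcip++; TEND
	void halt()     TBEGIN tcip = 0; TEND

	int main()
	{
		/* a threaded "method"; storing the integer operand through
		   a function-pointer cast is the usual (implementation-
		   defined) trick for keeping operands inline in the code
		   stream */
		TCODE method[] = { pushLit, (TCODE)42, printTop, (TCODE)0,
				   popStack, (TCODE)0, halt, (TCODE)0 };

		tcip = method;
		while (tcip)          /* StartTCMachine's dispatch loop */
			(*tcip++)();
		return 0;
	}

With the real thing, TEND expands to the JUMPNEXT assembler jump
instead, the outer loop runs only once, and control never returns.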
Here are the Smalltalk macro benchmarks for BrouHaHa Smalltalk-80
v2.3.2t running on a monochrome SUN3/60 (20MHz 68020):

	BitBLT			 76.7308
	TextScanning		222.857
	ClassOrganizer		 80.6667
	PrintDefinition		 59.0278
	PrintHierachy		142.857
	AllCallsOn		112.5
	AllImplementors		130.0
	Inspect			116.667
	Compiler		 86.4341
	Decompiler		101.639
	KeyboardLookAhead	212.5
	KeyboardSingle		302.778
	TextDisplay		148.837
	TextFormatting		273.81
	TextEditing		180.342

	Performance Rating	134.198

and on the same machine under the same conditions are the timings for
ParcPlace Smalltalk-80 V2.3:

	BitBLT			 99.75
	TextScanning		390.0
	ClassOrganizer		155.128
	PrintDefinition		137.097
	PrintHierachy		192.308
	AllCallsOn		120.0
	AllImplementors		108.333
	Inspect			146.774
	Compiler		118.617
	Decompiler		129.167
	KeyboardLookAhead	303.571
	KeyboardSingle		473.913
	TextDisplay		172.973
	TextFormatting		442.308
	TextEditing		285.135

	Performance Rating	189.504

	134.198/189.504 = 0.708154

WARNING!!  These systems ARE different; these benchmarks are included
only to give a feeling for ball-park performance.  Example differences:

	BrouHaHa			ParcPlace
	closures			blue-book BlockContexts
	immediates: characters,		immediate smallintegers
	  smallints, fixedpoints
	5844 compiled methods		5136 compiled methods
	(5026 ordinary methods)		(4798 ordinary methods)
	(818 quick methods)		(338 quick methods)

More Portable File Organization:
To keep the code as clean looking as possible, all machine
dependencies are isolated in separate files.  e.g. tcode.h gives
machine-independent definitions for TCODE.  It includes machine
dependencies from another file:

	/* for debugging purposes; single step breakpoint at start of
	   each tcode */
	#define DEBUG_FETCH_BREAK int frig = fetchBrk();

	#ifdef FAST
	#	include "fasttcode.h"
	#else

	TCODE	*tcip;		/* the tcode ip points at TCODEs */

	#	define JUMPNEXT return

	#	ifdef BIDebug
	#		define TBEGIN { DEBUG_FETCH_BREAK
	#	else
	#		define TBEGIN {
	#	endif
	#	define TEND }
	#endif

GCC/SPARC/fasttcode.h:

	/* tcodeip in g7 */
	register TCODE *tcip asm("%g7");

	#define JUMPNEXT do{asm("ld [%g7],%o0; jmpl %o0,%g0; add %g7,4,%g7");return;}while(0)

	#ifdef BIDebug
	#	define TBEGIN { DEBUG_FETCH_BREAK
	#else
	#	define TBEGIN {
	#endif
	#define TEND JUMPNEXT; }

I also don't want to include the necessary definitions for the global
registers in every file.  So for those non-leaf routines that must
avoid using the global registers there's a fastglobal.h file that
gives dummy definitions for these registers.  e.g.
GCC/SPARC/fastglobal.h:

	/* machine specific FAST defines.
	   Gnu C 1.33 systems can use nice compiler-provided global
	   registers. */

	#define BEGIN {
	#define END }
	#define RETURN(e) return e
	#define RETURNV return

	register char *GlobRegDummy1 asm("a5");
	register char *GlobRegDummy2 asm("a4");
	register char *GlobRegDummy3 asm("a3");
	register char *GlobRegDummy4 asm("d6");

	#ifdef MASKREGISTER
	register char *GlobRegDummy5 asm("d7");
	#endif

I use symbolic links to set up the machine-dependent include files.
This has the advantage that if you add a new machine you don't have to
remake all the others.

The Tedious Bit:
The only tedious bit is writing the sed scripts.  For the SPARC this
took 1 day.  Here are the sed scripts I use for SUN3, MAC2AUX (using
GAS) and SUN4, all using GCC (v1.33 upwards).  There's a problem on
the SPARC in that the ABI does not seem to define the status of the
global registers.
Some math and library routines stomp on the global registers (beware
getwd!), so I've included GCC/SUN4/sed.globreg.bugfix as an example of
how to spot the offending math routines:

SUN3/GCC/lib/sed.tcode.opt:

# script to strip prolog & epilog from threaded code under gcc.
# WARNING the script may strip a push of a register argument if a call is the
# first statement of a function!!
#
/^_.*:$/{n
N
N
s/ link a6,#[^\n]*\n//
/ fmovem #[^\n]*,sp@-/{
N
s/ fmovem #[^\n]*,sp@-\n//
}
s/ moveml .*,sp@-\n//
s/ movel [ad][0-7],sp@-\n//
}
/ moveml a6@(-[1-9][0-9]*),#/{N
s/ moveml a6@(-[1-9][0-9]*),#[^\n]*\n unlk a6//
}
/ movel a6@(-[1-9][0-9]*),[ad][0-7]/{N
s/ movel a6@(-[1-9][0-9]*),[ad][0-7]\n unlk a6//
}
/ movel sp@+,/d
/ moveml sp@+,#/d
/ unlk a6/d
/ rts/d
/ jbsr _SBD/d

MAC2AUX/GCC/lib.gas/sed.tcode.opt:

/COMMENT/{
i\
script to strip prolog & epilog from threaded code under gcc. WARNING \
the script may strip a push of a register argument if a call is the\
first statement of a function!!
}
/^gcc_compiled:/d
/^.[^%].*:$/{
n
/ link %a6/{
N
N
s/ link %a6,#[x0-9-]*\n//
/ fmovem #[^\n]*,%sp@-/{
N
s/ fmovem #[^\n]*,%sp@-\n//
}
s/ moveml #[x0-9a-f]*,%sp@-\n//
s/ movel %[ad][0-7],%sp@-\n//
n
}
}
/ moveml -[1-9][0-9]*%a6@,#/{
N
s/ moveml -[1-9][0-9]*%a6@,#[x0-9a-f-]*\n unlk %a6//
}
/ movel -[1-9][0-9]*%a6@,%[ad][0-7]/{
N
s/ movel -[1-9][0-9]*%a6@,%[ad][0-7]\n unlk %a6//
}
/ movel %sp@+,%/d
/ moveml %sp@+,#/d
/ movel %d0,%a0/{
N
s/ movel %d0,%a0\n unlk %a6//
/ movem*l %a6/{
N
s/ movel %d0,%a0\n movem*l %a6.*\n unlk %a6//
/ fmovem %a6/{
N
s/ movel %d0,%a0\n movem*l %a6.*\n fmovem %a6.*\n unlk %a6//
}
}
}
/ unlk %a6/d
/ rts/d
/ jbsr SBD/d

SUN4/GCC/lib/sed.tcode.opt:

# script to strip prolog & epilog from threaded code under gcc.
#
/^_.*:$/{n
N
N
s/ !#PROLOGUE# 0\n save %sp,[-0-9]*,%sp\n !#PROLOGUE# 1/ ! [gotcher]/
}
/ ret/d
/ restore/d

SUN4/GCC/lib/sed.globreg.bugfix:

# Some of the libc builtin routines (.rem .urem .div & .udiv so far known)
# stamp on %g3 which is the maskReg (it contains 0x7FFFFF).
# This script reassigns the value of maskReg after each of these routines
# has been called.
/call[ ]\.div,[0-9]/{n
n
i\
	sethi %hi(0x7FFFFF),%g3 ![globregfix]\
	or %lo(0x7FFFFF),%g3,%g3
}
/call[ ]\.udiv,[0-9]/{n
n
i\
	sethi %hi(0x7FFFFF),%g3 ![globregfix]\
	or %lo(0x7FFFFF),%g3,%g3
}
/call[ ]\.rem,[0-9]/{n
n
i\
	sethi %hi(0x7FFFFF),%g3 ![globregfix]\
	or %lo(0x7FFFFF),%g3,%g3
}
/call[ ]\.urem,[0-9]/{n
n
i\
	sethi %hi(0x7FFFFF),%g3 ![globregfix]\
	or %lo(0x7FFFFF),%g3,%g3
}

You can now see why I put "Machine-Independent" in quotes.  Here's the
count of machine-dependent code for the SPARC:

	  25	  99	 786	fastcache.h
	  68	 262	1882	fastglobal.h
	  31	 112	 906	fasttcode.h
	  28	  80	 595	../tcsrc/SUN4/GCC/lib/sed.globreg.bugfix
	   5	  34	 222	../tcsrc/SUN4/GCC/lib/sed.peep.opt
	   9	  30	 173	../tcsrc/SUN4/GCC/lib/sed.tcode.opt
	 166	 617	4564	total

Of these 166 lines, 51 are banner headers; 100-odd lines are genuinely
machine dependent.  A whole VM is around the 20k-line mark, so machine
dependencies are down in the 0.5% range.

Use this stuff as part of whatever you like.  If you try & assert
ownership I'll fight & batter you over the head with the GPL ('bout
time we had some serious steel in that thar GPL).  Share And Enjoy!

P.S. The BrouHaHa machine is available to educational institutions
with a valid ParcPlace Smalltalk-80 licence, subject to a strict
non-disclosure agreement.  email me if you want it.  I am slow to
answer requests!
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.
Meta-mail to compilers-request.

>From crowl@cs.rochester.edu Tue Apr 2 10:34:53 1991
From: crowl@cs.rochester.edu (Lawrence Crowl)
Newsgroups: comp.compilers
Subject: Re: Portable Fast Direct Threaded Code
Keywords: interpreter, design
Date: 31 Mar 91 18:06:35 GMT
Reply-To: crowl@cs.rochester.edu (Lawrence Crowl)
Organization: Computer Science Department University of Rochester

In article <3035@redstar.cs.qmw.ac.uk> Eliot Miranda writes:

>The threaded code machine will make calls on support routines, e.g. graphics,
>garbage collector etc.  Any group of routines that don't access the global
>registers and don't directly or indirectly call routines that need to access
>the global registers can be optimized.  These routines should be compiled
>without declaring the global registers.  These routines will then use as many
>registers as the compiler will give them and will save & restore any
>registers they use, preserving the values of the global register variables.

This scheme assumes that procedure calls use a "callee saves" register
convention, and does not work if you allocate the global register
variables out of the "caller saves" set of registers.  The problem is
that the caller does not save the register (because it is global) and
the callee writes over the register (because it assumes the caller
saved it).  In this situation, the programmer must insert explicit
saves and restores of the global register variables.

The obvious solution to this problem is to allocate all global
register variables out of the "callee saves" set of registers.
However, the Alliant has _no_ callee saves registers.  Library
routines on the Alliant will trash every register you have.  In
addition, implicit library calls to routines like bcopy will also
trash your registers.  (I learned this the hard way.)  The lesson is
that calling library routines with global register variables in caller
saves registers requires special handling.  It is not automagic.
--
Lawrence Crowl		716-275-9499	University of Rochester
crowl@cs.rochester.edu			Computer Science Department
...!rutgers!rochester!crowl		Rochester, New York, 14627-0226
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

>From Tom.Lane@G.GP.CS.CMU.EDU Wed Apr 3 10:38:09 1991
From: Tom.Lane@G.GP.CS.CMU.EDU
Newsgroups: comp.compilers
Subject: Re: Portable Fast Direct Threaded Code
Keywords: interpreter, design
Date: 1 Apr 91 15:21:14 GMT
Reply-To: Tom.Lane@G.GP.CS.CMU.EDU
Organization: Compilers Central

Lawrence Crowl points out one important problem with Eliot Miranda's
scheme for global register use in C.  There's an even more fundamental
problem, though: there is *no guarantee whatever* that the compiler
will assign the same registers to the global variables in every
routine.  Compilers that do intelligent allocation of variables to
registers may refuse to honor the "register" declarations at all if
the global variables are not heavily used in a given routine, and in
any case the globals need not be assigned to the same registers every
time.  Miranda's scheme thus relies heavily on the assumption of a
dumb register allocator (or else a compiler that supports global
register variable declarations).  This scheme may be "fast" direct
threaded code, but it's hardly "portable".
--
				tom lane
Internet: tgl@cs.cmu.edu	BITNET: tgl%cs.cmu.edu@cmuccvma

[GCC lets you specify what register to use for global register
variables, but that is of course both machine and compiler specific.
-John]
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

>From pardo@cs.washington.edu Thu Apr 4 17:34:39 1991
From: pardo@cs.washington.edu (David Keppel)
Newsgroups: comp.compilers
Subject: Re: Portable Fast Direct Threaded Code
Keywords: interpreter, design, bibliography
Date: 2 Apr 91 19:21:25 GMT
Reply-To: pardo@cs.washington.edu (David Keppel)
Organization: Computer Science & Engineering, U. of Washington, Seattle

metzger@watson.ibm.com (Perry E. Metzger) writes:
>[I'd like a reference on threaded code interpreters.]

3 citations follow:

%A James R. Bell
%T Threaded Code
%J Communications of the ACM (CACM)
%V 16
%N 6
%D June 1973
%P 370-372
%X Describes the basic idea of threaded code.  Compares to hard code
   (subroutine calls) and interpreters.

%A Richard H. Eckhouse Jr.
%A L. Robert Morris
%T Minicomputer Systems Organization, Programming, and Applications
   (PDP-11).  2nd Ed.
%I Prentice-Hall, Inc.
%P 290-298
%X Describes threaded code and ``knotted code''.  I (pardo) think that
   this is a very clear introduction to threaded code.

%A Peter M. Kogge
%T An Architectural Trail to Threaded Code Systems
%J IEEE Computer
%P 22-33
%D March 1982
%W rrh (original)
%W Pardo (copy)
%X Describes how to build a threaded code interpreter/compiler from
   scratch.
   * Direct threaded/indirect threaded.
   * Less than 2:1 performance hit of threading compared to full
     compilation.
   * Note about bad compilers contributing to relative OK-ness of
     threaded code.
   * Ease of rewriting stuff.
   * Ease of tuning.

My favorite of the three is Eckhouse & Morris; however I don't know
where to get it.  The pages that I have are photocopies graciously
sent to me by a friend.  As the title implies, this book is several
years old and undoubtedly out-of-print.

	;-D on  ( Following this thread of the discussion... )  Pardo
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

>From simonpj Fri Apr 5 09:52:33 1991
Received: from tutuila.dcs.glasgow.ac.uk by vanuata.cs.glasgow.ac.uk;
        Fri, 5 Apr 91 09:52:27 BST
Message-Id: <2763.9104050851@tutuila.dcs.glasgow.ac.uk>
X-Comment1: #############################################################
X-Comment2: #  uk.ac.glasgow.cs has changed to uk.ac.glasgow.dcs        #
X-Comment3: #  If this address does not work please ask your mail       #
X-Comment4: #  administrator to update your NRS & mailer tables.        #
X-Comment5: #############################################################
From: Simon L Peyton Jones
To: eliot@cs.qmw.ac.uk
Cc: simonpj, partain
Subject: Threaded code
Date: Fri, 05 Apr 91 09:51:48 +0100

Eliot

I saw your article about threaded code.  Like you and others, we are
using C as an assembler, only for a pure functional language, Haskell.
I have some brief questions.

1. By telling GCC not to use a frame pointer, one can eliminate the
prolog entirely, can't one?  So why edit it out?

I guess the answer is going to be local variables, allocated once for
all by the StartTCMachine routine.  Still, it seems quite a pain.  I
guess one could sacrifice some (perhaps slight) speed by using a bunch
of globals instead.

2. You edit out the epilogue for tidiness only, I take it.  It doesn't
cause any problems if you leave it in, does it?

3. Why do you use no-defer-pop (with the associated bug)?

4. Why does JUMPNEXT have a loop?  Surely the jump leaves the loop
right away.  Presumably you are tricking the compiler somehow.
Thanks

Simon L Peyton Jones
Glasgow University

Simon

============================= Address change =======================
My email address is now officially: simonpj@dcs.glasgow.ac.uk
This may fail if your local site has out-of-date mail tables.
The old address (simonpj@cs.glasgow.ac.uk) will work for quite a long
while, so stay with the old one if the new one fails.
====================================================================

>From eliot@cs.qmw.ac.uk Fri Apr 5 12:18:18 1991
Via: uk.ac.qmw.cs; Fri, 5 Apr 91 12:18:06 BST
Received: from aux47 by redstar.cs.qmw.ac.uk id aa26828; 5 Apr 91 12:17 BST
Reply-To: eliot@cs.qmw.ac.uk
In-Reply-To: Simon L Peyton Jones's mail message
        <2763.9104050851@tutuila.dcs.glasgow.ac.uk>
Message-Id: <9104051217.aa26828@uk.ac.qmw.cs.redstar>
From: Eliot Miranda
To: simonpj
Cc: partain
Subject: re: Threaded code
Date: Fri, 5 Apr 91 10:54:25 BST

>
>Eliot
>
>I saw your article about threaded code.  Like you and others, we are
>using C as an assembler, only for a pure functional language, Haskell.
>I have some brief questions.
>
>1. By telling GCC not to use a frame pointer, one can eliminate
>the prolog entirely, can't one?  So why edit it out?

No, registers local to the procedure will still be saved & stack space
allocated for automatic variables.  This IS a problem since the
threaded-code jump at the end of the procedure will miss the register
restores before the epilogue.  Consequently the stack will grow unless
these register saves & stack-space allocations are removed.

>
>I guess the answer is going to be local variables, allocated once for
>all by the StartTCMachine routine.  Still, it seems quite a pain.  I guess
>one could sacrifice some (perhaps slight) speed by using a bunch of
>globals instead.

For certain routines, not using register variables will be expensive
(e.g. a simple integer arithmetic primitive).

>
>2. You edit out the epilogue for tidiness only, I take it.  It doesn't
>cause any problems if you leave it in, does it?

No, but given that the prologue has to be removed & removing the
epilogue is as easy (& given finite memory :-) one might as well
remove it.

>
>3. Why do you use no-defer-pop (with the associated bug)?

OK.  This is again to avoid stack growth.  On conventional stack
architectures gcc will try & combine the argument-popping code of a
sequence of procedure calls.  e.g.

	extern long a, b, c;
	void doit()
	{
		foo(a);
		bar(b);
		baz(c);
	}

With -O -fno-defer-pop one might expect gcc to generate

	link %a6,#0
	movel a,%sp@-
	jbsr foo
	addqw #4,%sp
	movel b,%sp@-
	jbsr bar
	addqw #4,%sp
	movel c,%sp@-
	jbsr baz
	addqw #4,%sp
	unlk %a6
	rts

but because gcc knows that the unlk instruction will roll back the
stack, in fact gcc generates:

	link %a6,#0
	movel a,%sp@-
	jbsr foo
	addqw #4,%sp
	movel b,%sp@-
	jbsr bar
	addqw #4,%sp
	movel c,%sp@-
	jbsr baz
	unlk %a6
	rts

With -O -fdefer-pop gcc optimizes out the pops completely & generates:

	link %a6,#0
	movel a,%sp@-
	jbsr foo
	movel b,%sp@-
	jbsr bar
	movel c,%sp@-
	jbsr baz
	unlk %a6
	rts

With -O -fomit-frame-pointer -fdefer-pop gcc generates:

	movel a,%sp@-
	jbsr foo
	movel b,%sp@-
	jbsr bar
	movel c,%sp@-
	jbsr baz
	addw #12,%sp
	rts

& with -O -fomit-frame-pointer -fno-defer-pop gcc generates:

	movel a,%sp@-
	jbsr foo
	addqw #4,%sp
	movel b,%sp@-
	jbsr bar
	addqw #4,%sp
	movel c,%sp@-
	jbsr baz
	addqw #4,%sp
	rts

All the above cases are as one would wish.  The elimination of all
deferred pops in the unlk instruction is especially clever.  However,
in the presence of the threaded-code jump the waters darken!
Consider what gcc generates for:

	register void (**tcip)() asm("%a5");

	#define JUMPNEXT do{asm("movl %a5@+,%a0; jmp %a0@");return;}while(0)

	extern long a, b;
	void doit()
	{
		foo(a);
		bar(b);
		JUMPNEXT;
	}

With -O -fdefer-pop gcc generates

	doit:
		link %a6,#0
		movel a,%sp@-
		jbsr foo
		movel b,%sp@-
		jbsr bar
		#APP
		movl %a5@+,%a0; jmp %a0@
		#NO_APP
		unlk %a6
		rts

This is clearly no good because the arguments to both foo & bar will
never be popped.  Every time doit() is executed the stack will grow by
8 bytes.  Soon your program will dump core with a very large stack
segment!

With -O -fno-defer-pop gcc generates:

	link %a6,#0
	movel a,%sp@-
	jbsr foo
	addqw #4,%sp
	movel b,%sp@-
	jbsr bar
	#APP
	movl %a5@+,%a0; jmp %a0@
	#NO_APP
	unlk %a6
	rts

Again useless, because bar's pop has been folded into the unlk, which
won't be executed.

With -O -fdefer-pop -fomit-frame-pointer gcc generates

	movel a,%sp@-
	jbsr foo
	movel b,%sp@-
	jbsr bar
	addqw #8,%sp
	#APP
	movl %a5@+,%a0; jmp %a0@
	#NO_APP
	rts

This is great.  However, not all functions are susceptible to the
omit-frame-pointer optimization (i.e. functions with local variables).
E.g. the code generated for:

	register void (**tcip)() asm("%a5");

	#define JUMPNEXT do{asm("movl %a5@+,%a0; jmp %a0@");return;}while(0)

	extern long a, b;
	void doit()
	{
		char large[1024];
		foo(a,large);
		bar(b);
		JUMPNEXT;
	}

with -O -fomit-frame-pointer -fdefer-pop is:

	link %a6,#-1024
	pea %a6@(-1024)
	movel a,%sp@-
	jbsr foo
	movel b,%sp@-
	jbsr bar
	#APP
	movl %a5@+,%a0; jmp %a0@
	#NO_APP
	unlk %a6
	rts

so in general one can't rely on -fomit-frame-pointer.  For the above
example both -O -fomit-frame-pointer -fno-defer-pop and
-O -fno-defer-pop generate:

	doit:
		link %a6,#-1024
		pea %a6@(-1024)
		movel a,%sp@-
		jbsr foo
		addqw #8,%sp
		movel b,%sp@-
		jbsr bar
		#APP
		movl %a5@+,%a0; jmp %a0@
		#NO_APP
		unlk %a6
		rts

This is also useless, because bar's argument pop has been folded away.
The problem is that gcc will always fold away the last call's argument
pop if the function has a frame pointer, and -fomit-frame-pointer
can't always get rid of the frame pointer.  In fact, in the presence
of variable-sized automatic variables or calls on alloca it would be
very hard (impossible for recursive functions?) to do.

The neatest solution I've come up with is to use -fno-defer-pop and a
dummy function call between the threaded-code jump and the return:

	register void (**tcip)() asm("%a5");

	#define JUMPNEXT do{asm("movl %a5@+,%a0; jmp %a0@");SBD();return;}while(0)

	extern long a, b;
	void doit()
	{
		foo(a);
		bar(b);
		JUMPNEXT;
	}

With -O -fno-defer-pop gcc generates:

	link %a6,#0
	movel a,%sp@-
	jbsr foo
	addqw #4,%sp
	movel b,%sp@-
	jbsr bar
	addqw #4,%sp
	#APP
	movl %a5@+,%a0; jmp %a0@
	#NO_APP
	jbsr SBD
	unlk %a6
	rts

Now bar's argument pop is not folded, because it's no longer the last
call in the routine; SBD is.  So the call to SBD

	a) prevents gcc's 'fold the last call's argument pop into the
	   unlk' optimization, and so prevents uncontrolled stack growth;
	b) doesn't get executed, because of the jump;
	c) is trivial to remove from the assembler with a sed script.

>4. Why does JUMPNEXT have a loop?  Surely the jump leaves the loop right
>away.  Presumably you are tricking the compiler somehow.
>

This is old C lore.  The problem is 'How do you write a macro that is
a sequence of statements that can be used wherever a single statement
can?'
Take the following definition of JUMPNEXT:

	#define JUMPNEXT asm("movl %a5@+,%a0; jmp %a0@");return;

Now invoke it here:

	if (its_time_to_jump)
		JUMPNEXT;
	do_something_else();

This expands to:

	if (its_time_to_jump)
		asm("movl %a5@+,%a0; jmp %a0@");
	return;
	do_something_else();

Not at all what's intended!  There are two tricks I know of (the first
I saw in Berkeley Smalltalk, the second in Richard Stallman's gcc
manual.  I expect they're both quite old).

The first is to surround your statements with

	if (TRUE){statements}else

i.e.

	#define JUMPNEXT if(1){asm("movl %a5@+,%a0; jmp %a0@");return;}else

So now we get:

	if (its_time_to_jump)
		if (1){
			asm("movl %a5@+,%a0; jmp %a0@");
			return;
		}else;
	do_something_else();

which works because C binds elses innermost first.  However, some
compilers will whine about dangling elses.

The second scheme is more elegant (-:  Surround your statements with

	do{statements}while(FALSE);

which will execute statements precisely once (it's NOT a loop).  i.e.

	#define JUMPNEXT do{asm("movl %a5@+,%a0; jmp %a0@");SBD();return;}while(0)

expands to

	if (its_time_to_jump)
		do {
			asm("movl %a5@+,%a0; jmp %a0@");
			SBD();
			return;
		}while(0);
	do_something_else();

which does what's wanted and doesn't incur compiler whines.

>Thanks
>
>Simon L Peyton Jones
>Glasgow University

Eliot Miranda			email:	eliot@cs.qmw.ac.uk
Dept of Computer Science	Tel:	071 975 5229 (+44 71 975 5229)
Queen Mary Westfield College	ARPA:	eliot%cs.qmw.ac.uk@nsf.ac.uk
Mile End Road			UUCP:	eliot@qmw-cs.uucp
LONDON E1 4NS

>From vestal@SRC.Honeywell.COM Fri Apr 5 12:26:11 1991
From: vestal@SRC.Honeywell.COM (Steve Vestal)
Newsgroups: comp.compilers
Subject: Re: Portable Fast Direct Threaded Code
Keywords: interpreter, performance, design
Date: 3 Apr 91 18:23:34 GMT
Reply-To: vestal@SRC.Honeywell.COM (Steve Vestal)
Organization: Honeywell Systems & Research Center
In-Reply-To: pardo@cs.washington.edu's message of 2 Apr 91 19:21:25 GMT

In article <1991Apr2.192125.7464@beaver.cs.washington.edu>
pardo@cs.washington.edu (David Keppel) writes:

[references about threaded code, much stuff deleted]

David> %X Describes how to build a threaded code interpreter/compiler
David> from scratch.
David> * Less than 2:1 performance hit of threading compared to full
David> compilation.

I have a question about this.  Numbers like this are often cited for
threaded-type code, but in Bell's paper this was for the PDP-11 (whose
addressing modes made it a natural for threaded code).  Paul Klint's
"Interpretation Techniques" paper (Software P&E, v11, 1981) cites a
significant difference for interpreter fetch/decode times on different
architectures.  He cited numbers around 2:1 for the PDP-11, but
something more like 9:1 for a Cyber.  I did a Q&D evaluation of this
for a RISC, and the ratio I guesstimated was closer to that Klint gave
for the Cyber than for the PDP-11 (not unexpectedly).

How architecturally dependent is the performance of these techniques
(relative to compiling to native code)?

Steve Vestal
Mail: Honeywell S&RC MN65-2100, 3660 Technology Drive, Minneapolis MN 55418
Phone: (612) 782-7049
Internet: vestal@src.honeywell.com
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.
>From E.Ireland@massey.ac.nz Fri Apr 5 12:29:20 1991
From: E.Ireland@massey.ac.nz (Evan Ireland)
Newsgroups: comp.lang.functional
Subject: Three address code
Date: 4 Apr 91 21:49:21 GMT
Reply-To: E.Ireland@massey.ac.nz
Organization: Information Sciences, Massey University, New Zealand

I've had no luck with mail, so this is for csdw at Rhodes University.

>
>In an attempt to optimize a functional language, I would like to
>turn the stack based intermediate code into three address code.
>
>Has anyone done similar conversions?  Any references would be
>greatly appreciated.

I do not have any references, but I thought that one aspect of my FAM
implementation might be of interest.

A number of interpreters and compilers that I have seen implement a
stack pointer in a register or global variable.  Then to implement
various stack operations, they use auto-increment or auto-decrement
operations on the stack pointer register.  Since I generate portable
C, and thus cannot assume I have a global register variable to hold
the stack pointer, my functions look like:

	DATA *f (register DATA *fp)
	{
		....
	}

Thus I pass to each function the current pointer to top of stack, from
which it can index downwards to find its arguments.  Within the
function, I use indexing operations on fp, e.g. fp[3] = fp[1], to
manipulate values on the stack, so I am not continually manipulating
the stack pointer.  If "f" calls another function, it will pass the
address of the current top of stack, e.g. g(&fp[5]).

The advantage to me is that I have a register for a stack pointer even
though I am generating portable C code.

Now the relationship to three-address code.  If you adopt such a
scheme, and your three-address instructions allow some indexing, you
can sometimes generate

	ADD	fp[3],fp[4],fp[3]

I hope this helps.
_______________________________________________________________________________

E.Ireland@massey.ac.nz		Evan Ireland, School of Information Sciences,
+64 63 69099 x8541		Massey University, Palmerston North, New Zealand.

>From pardo@cs.washington.edu Sat Apr 6 14:32:24 1991
From: pardo@cs.washington.edu (David Keppel)
Newsgroups: comp.compilers
Subject: Re: Portable Fast Direct Threaded Code
Keywords: interpreter, performance, design
Date: 4 Apr 91 17:10:55 GMT
Reply-To: pardo@cs.washington.edu (David Keppel)
Organization: Computer Science & Engineering, U. of Washington, Seattle

>>>[Threaded code vs. compilation]
>pardo@cs.washington.edu (David Keppel) writes:
>>[Less than 2:1 performance hit of threading vs. full compilation.]

Note also that the reference that claimed 2:1 (Peter M. Kogge, IEEE
Computer pp 22-33 March 1982) also attributed part of that ratio to
the poor state of compiler optimization.

vestal@SRC.Honeywell.COM (Steve Vestal) writes:
>[Klint says 2:1 for PDP-11 v. 9:1 for Cyber.
> How architecturally dependent are these techniques?]

Suppose that the statically-compiled code fragments that are threaded
together are called `primitives'.

When the execution time of a primitive is large, the overhead of the
interpreter can be large and still have a small effect on performance.
The performance of the interpreted code is dominated by the time in a
primitive vs. the overhead of moving between primitives.

When the execution time of the primitives is small, the overhead of
moving between primitives can be a large fraction of the total
execution time.  Overhead comes from at least two sources:

* Control flow: the address of the next primitive is loaded from data
  memory and the processor executes an indirect jump.
* Register allocation: a primitive is essentially a function with a
  fast calling convention (no stack adjustment).  Thus, all the
  traditional problems with interprocedural register allocation.

Examples of `large primitives' are ``draw circle'' and ``interpolate
spline''.  Examples of small primitives are ``push'', ``add'', etc.

* Architectural dependency of control flow

Examples:

  Direct jumps in full compilation:

	op1
	op2
	br next		// direct jump

  Indirect jumps for threading on a CISC:

	op1
	op2
	br *(r0)+	// load, increment, jump

  Indirect jumps for threading on a RISC:

	ld *r0, r1	// scheduled load
	op1
	op2
	br *r1		// jump
	add r1, #4, r1	// delay-slot increment

Direct jumps in full compilation can frequently use one instruction (a
``near branch'') both to find the address of the next code fragment
and to perform the control transfer.  On a CISC, branches are
typically two or three bytes.  On a RISC, branches are typically four
bytes.  The threaded indirect (load, increment, jump) is typically
three bytes on a CISC and twelve bytes (three instructions) on a RISC.

Direct jumps in full compilation typically take one instruction time.
Indirect jumps take at least the following operations: load,
increment, jump indirect.  On a CISC, the three operations can
typically be `folded' into one instruction.  There may be a load
penalty of one instruction time but the increment is overlapped, so
the total time is three machine units (one `unit' is about one
register->register operation).  On a RISC, the total penalty is also
three machine units.

Direct jumps take one (I-cache) cycle to fetch both the branch
instruction and the address of the branch target.  Indirect jumps take
a D-cache cycle to fetch the address of the branch target and an
I-cache cycle to fetch the branch instruction.

Direct jumps can take advantage of instruction prefetching, since the
address of the next instruction is known at the time that the
instruction prefetch reads the direct jump.  Threaded indirects
require an indirect branch off of a register.  Current RISC and CISC
machines are about equivalent in that they do little prefetching.
Some machines being designed do more prefetching, so the threading
overhead for them will be greater.

* Architectural dependency of register allocation

In a machine with a small number of registers, many of the registers
are in use in each primitive, and the best possible register
allocation will contain many loads and stores.  In a machine with a
large number of registers, the full-compilation implementation can
make much better use of registers than the threaded-primitives
implementation (again, for small primitives).  The savings from full
compilation are exactly analogous to the improvements in register
allocation from doing inlining of small procedures.

* Other points to ponder

In some cases the threaded code implementation is substantially
smaller than the full-compilation implementation.  For large functions
or a machine with small caches, the loss of performance from threading
might be overwhelmed by the gain in cache performance.

On RISC machines, procedure call/return is about twice the cost of
other control flow, except for the overhead of register management.
Thus, call-dense RISC code from full compilation may behave about the
same as threaded code.
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.
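(An illustrative calculation, not from the original posting, to make
the dispatch-overhead point above concrete: with inter-primitive
dispatch costing about 3 machine units, a small primitive such as
``add'' doing perhaps 2 units of useful work runs at (2+3)/2 = 2.5
times the direct-code time, while a 1000-unit ``draw circle'' pays
(1000+3)/1000, i.e. well under 1% overhead.  The primitive mix can
thus matter as much as the architecture.)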
>From airs!ian@uunet.UU.NET Sat Apr 6 14:32:56 1991
From: airs!ian@uunet.UU.NET (Ian Lance Taylor)
Newsgroups: comp.compilers
Subject: Threaded code
Keywords: books, interpreter, design
Date: 4 Apr 91 07:19:41 GMT
Reply-To: airs!ian@uunet.UU.NET (Ian Lance Taylor)
Organization: Compilers Central

The book ``Threaded Interpretive Languages'' has a quite complete
implementation of a threaded version of Forth in Z80 assembler.  It's
a very complete description of why threaded implementations exist and
how to implement them easily.  It's by R. G. Loeliger and was
published by Byte Books (ISBN 0-07-038360-X).  It was published in
1981, though, and I'm sure it's long out of print.

Ian Taylor	airs!ian@uunet.uu.net	uunet!airs!ian
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

>From firth@sei.cmu.edu Sun Apr 7 14:33:13 1991
From: firth@sei.cmu.edu (Robert Firth)
Newsgroups: comp.compilers
Subject: Re: Portable Fast Direct Threaded Code
Keywords: interpreter, performance, design
Date: 4 Apr 91 13:27:21 GMT
Reply-To: firth@sei.cmu.edu (Robert Firth)
Organization: Software Engineering Institute, Pittsburgh, PA

In article <1991Apr3.182334.16164@src.honeywell.com>
vestal@SRC.Honeywell.COM (Steve Vestal) writes:

>How architecturally dependent is the performance of these techniques
>(relative to compiling to native code)?

[cost of threaded code on PDP-11, RISC &c]

We might have a misunderstanding here, because what I think of as
threaded code doesn't have a decoding and interpretation step.  But
I'll talk of what I know.

A program in threaded code is just an array of addresses, possibly
interspersed with operands.  So the fragment

	c := a + b

becomes something like

	address of 'load'
	address of 'a'
	address of 'load'
	address of 'b'
	address of '+'
	address of 'store'
	address of 'c'

This implies a very simple virtual stack machine - you can get more
clever by implementing a virtual register machine.

The basic execution thread then does this.  We point a global register
at the table of addresses, and each primitive has the form

	treg := treg + address'size
	...
	jump (treg)

As you can see, this is great on the PDP-11, since that reduces to one
instruction

	MOV (treg)+,PC	; NOTE TO MAINTAINER: FASTER THAN JMP - DON'T TOUCH!

On a typical RISC machine, it's two cycles, since you can put almost
anything in the delay slot(s) after the jump.

Now, the load instruction, for instance, would say

	load:	treg := treg + address'size
		load (treg) into tempreg
		treg := treg + address'size
		push (tempreg) onto simulated stack
		jump (treg)

On the PDP-11, that's

	MOV @(treg)+, -(SP)
	MOV (treg)+, PC

On a RISC, it's much more like

	L	R0, 4(treg)	; get operand address
	L	R0, 0(R0)	; dereference to get operand
	SUBI	SP, #4		; decrement simulated SP
	S	R0, 0(SP)	; push operand on stack
	ADDI	treg, #8	; step over two addresses (mine & operands)
	JR	(treg)		; over to you, Moriarty!

	[to fill delay slots, shuffle the above to 132564]

Well, if you have one load delay slot and one branch delay slot, you
can fill all three of them, so that's 6 cycles.  Given that a typical
load is only 1.1 cycles in direct code (90% of the delays filled),
this is certainly a lot more than a 2:1 overhead!  When you add the
cost of a simulated stack (lots of needless loads and stores), I can
well believe 10:1 time expansion for simple code.  In fact, it was
more than that on the PDP-11, if you compared threaded code with
direct code from a decent compiler.
The big win in the Fortran compiler came from (a) very compact
threaded code, and (b) the floating point operations were implemented
in software, so the overhead of threaded code was swamped by the cost
of floating addition, subtraction &c.

Here's the full code of the above example, so you can see for
yourself:

    Direct:
	MOV a, R0
	ADD b, R0
	MOV R0, c

    Threaded:
	MOV @(treg)+, -(SP)
	MOV (treg)+, PC
    *	MOV @(treg)+, -(SP)
    *	MOV (treg)+, PC
    *	ADD (SP)+,(SP)
	MOV (treg)+, PC
	MOV (SP)+, @(treg)+
	MOV (treg)+, PC

Note that, if you implement a one-address add, you save two
instructions, since the starred bit reduces to

	ADD @(treg)+, (SP)

But even then, it's coming out at over 4:1.

What architectural features make threaded code more efficient?  The
fundamental one is main memory that is fast (or not too slow) relative
to registers, since you're doing a lot more fetching.  Another is a
set of address modes with double indirection, since you're accessing
most operands one level of indirection further back.  And good old
autoincrement helps a little, too.  Alas, none of that says 'risc',
and much of it says '1960s'.

Incidentally, if I were to do this again today, I'd definitely
simulate a general-register machine and use a subset of the real
machine's registers.  If you take 8 of them, you then have 8 loads and
stores, one for each register, but if you make an absolute rule that
nobody even thinks about touching one of those 8 that doesn't belong
to him, then all the good tricks about register allocation, slaving &c
will still work.  If you then implement the operations as one-address
general-register, you have again 8 versions (add into R0, add into
R1, ...) and lo! you're programming a very familiar old friend.

"But that was in another country, and besides, the machine is dead..."

Robert Firth
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

>From bpendlet@bambam.es.com Tue Apr 9 20:35:22 1991
From: bpendlet@bambam.es.com (Bob Pendleton)
Newsgroups: comp.compilers
Subject: Re: Portable Fast Direct Threaded Code
Keywords: interpreter, design
Date: 8 Apr 91 19:48:00 GMT
Reply-To: bpendlet@bambam.es.com (Bob Pendleton)
Organization: Compilers Central

In article <23613@as0c.sei.cmu.edu> you write:

> A program in threaded code is just an array of addresses, possibly
> interspersed with operands.  So the fragment
>
>	c := a + b
>
> becomes something like
>
>	address of 'load'
>	address of 'a'
>	address of 'load'
>	address of 'b'
>	address of '+'
>	address of 'store'
>	address of 'c'
>
> This implies a very simple virtual stack machine - you can get more
> clever by implementing a virtual register machine.

About 10 years ago I was working on a lisp compiler that compiled to
threaded code.  I was trying to get small code and still have some
performance.  (Since I wanted to run the code on a Z80 or 8080, small
was important.  My how things change :-)

I found that the 3 most common operations in threaded code were load,
store, and execute.  So I put those operations with the operands.
This made the operands look rather like classes with load, store, and
execute as virtual functions.
If you let the evaluate operation subsume the load and execute
operations, the threaded code for

	c := a + b;

becomes

	address of 'a.evaluate()'
	address of 'b.evaluate()'
	address of '+'
	address of 'c.store()'

and

	g := F(x, y);

becomes

	address of 'x.evaluate()'
	address of 'y.evaluate()'
	address of 'F.evaluate()'
	address of 'g.store()'

which is much smaller than the original version of the threaded code.

Later, while working on a homebrew version of FORTH, I gave up on
threaded code completely.  I found, like most who have explored it,
that symbolic execution of RPN code is a nice way to generate machine
code.  The machine code ran much faster than threaded code and, even
on an 8080, was only about 25% bigger than the threaded code.
--
	      Bob Pendleton
  bpendlet@dsd.es.com or decwrl!esunix!bpendlet or utah-cs!esunix!bpendlet

[The DEC PDP-11 Fortran compiler did something similar, writing load
routines for commonly used variables. -John]
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

>From pardo@june.cs.washington.edu Wed Apr 24 09:26:32 1991
From: pardo@june.cs.washington.edu (David Keppel)
Newsgroups: comp.compilers
Subject: Re: Fast Interpreted Code
Keywords: interpreter, threaded code
Date: 23 Apr 91 02:06:21 GMT
Reply-To: pardo@june.cs.washington.edu (David Keppel)
Organization: Computer Science & Engineering, U. of Washington, Seattle

ssinghani@viewlogic.com (Sunder Singhani) writes:
>[Our threaded code isn't fast enough.  What's faster?]

As far as I know, threaded code gives the fastest primitives/second
dispatch rate on a variety of architectures.  The general techniques
for making things faster (that I know of!) are to change things so
that the dispatch rate can go down without changing the work that gets
done (or use hardware, but we'll ignore that for the moment).

* Use a different v-machine instruction set

The overhead of interpreting is almost nothing in generic PostScript
imaging code because all the time is spent in non-interpreted
primitives.  If you can characterize your key operations (perhaps info
in [Davidson & Fraser ??, Fraser, Myers & Wendt 84] can help you
analyze for common operations instead of the more normal time in
routines) then you can re-code your virtual instruction set to have as
primitives operations that are performed frequently.

* Dynamic compilation to native machine code

This is what is done in ParcPlace Systems' Smalltalk-80 implementation
[Deutsch & Schiffman 84], also Insignia Solutions' 8086 interpreter.
Dynamic compilation suffers from the need to do compilation at
runtime: a compiler that produces better code will take longer to run,
and the compile time contributes to the overall program runtime.
Also, program text isn't shared, even if multiple instances are
running simultaneously.

* Native-coding key routines

If you believe that your program spends 80% of its time in a few key
routines, then compiling just these routines -- statically, adding
them to the primitive set, statically adding them as library routines,
or dynamically -- can improve performance substantially [Pittman 87].

5 citations follow:

%A Robert Bedichek
%T Some Efficient Architecture Simulation Techniques
%J Winter '90 USENIX Conference
%D 26 October, 1989
%W Robert Bedichek.
%W Pardo has a copy.
%X Describes a simulator that uses threaded-code techniques to emulate
   a Motorola 88000.  Each 88k instruction is executed in about 20
   host (68020) instructions.
   Discusses techniques used to get the simulation down from several
   thousand host instructions in many other simulators.

%A Jack W. Davidson
%A Chris W. Fraser
%T Eliminating Redundant Object Code
%J POPL9
%P 128-132

%A Peter Deutsch
%A Alan M. Schiffman
%T Efficient Implementation of the Smalltalk-80 System
%J 11th Annual Symposium on Principles of Programming Languages
   (POPL 11)
%D January 1984
%P 297-302
%X Dynamic translation of p-code to n-code (native code).
   Reasons for not using straight p-code or straight n-code:
   * p-code is smaller than n-code (<= 5X).
   * The debugger can debug p-code, improving portability.
   * Native code is faster (<= 2X).  Reasons include special
     fetch/decode/dispatch hardware; p-machine and n-machine may be
     very different, e.g., stack machine vs. register-oriented.
   * Threaded code does reduce the cost of p-code fetch/decode.
     Does not help with operand decoding.  Does not allow
     optimizations to span more than one instruction.
     [pardo: that's not technically true, but each optimized
     instruction must exist in addition to the unoptimized version.
     That leads to exponential blowup of the p-code.  Example: delayed
     branch and non-delayed branch versions of Robert Bedichek's 88k
     simulator.]
   The system characteristics:
   * The time to translate to n-code via macro expansion is about the
     same as the execution time to interpret the p-code.
   * (pg 300:) Self-modifying code (SMC) is deprecated but used in a
     ``well-confined'' way.  Could indirect at more cost.  Use SMC on
     the machines where it works, indirection where SMC doesn't.
   * Performance is compared to a ``straightforward'' interpreter.
     What's that?

%A Christopher W. Fraser
%A Eugene W. Myers
%A Alan L. Wendt
%T Analyzing and Compressing Assembly Code
%J CCC84
%P 117-121

%A Thomas Pittman
%T Two-Level Hybrid Interpreter/Native Code Execution for Combined
   Space-Time Program Efficiency
%D 1987
%J ACM SIGPLAN
%P 150-152
%X Talks about native code execution vs. various kinds of interpreting
   and encoding of key routines in assembly.

Hope this helps!

	;-D on  ( This is all open to interpretation )  Pardo
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

>From eliot@cs.qmw.ac.uk Tue Apr 30 15:55:17 1991
From: eliot@cs.qmw.ac.uk (Eliot Miranda)
Newsgroups: comp.compilers,gnu.gcc.bug,alt.sources
Subject: re: Threaded Code
Keywords: design, interpreter
Date: 5 Apr 91 11:43:51 GMT
Reply-To: Eliot Miranda
Followup-To: comp.compilers
Organization: Computer Science Dept, QMW, University of London, UK.

I recently posted articles to comp.compilers & alt.sources on how to
write threaded code machines in C.  I received the following questions
from Simon Peyton Jones at Glasgow.  They are specific to GCC.  Since
they have non-obvious answers & since the answers suggest augmentation
of the GCC compiler, I'm posting my response to Simon.

>From: Simon L Peyton Jones
>
>I saw your article about threaded code.  Like you and others, we are
>using C as an assembler, only for a pure functional language, Haskell.
>I have some brief questions.
>
>1. By telling GCC not to use a frame pointer, one can eliminate
>the prolog entirely, can't one?  So why edit it out?

No, registers local to the procedure will still be saved & stack space
allocated for automatic variables.  This IS a problem since the
threaded-code jump at the end of the procedure will miss the register
restores before the epilogue.
Consequently the stack will grow unless these register saves &
stack-space allocations are removed.  Also GCC cannot always eliminate
the frame pointer.

>I guess the answer is going to be local variables, allocated once for
>all by the StartTCMachine routine.  Still, it seems quite a pain.  I guess
>one could sacrifice some (perhaps slight) speed by using a bunch of
>globals instead.

For certain routines, not using register variables will be expensive
(e.g. a simple integer arithmetic primitive).

>2. You edit out the epilogue for tidiness only, I take it.  It doesn't
>cause any problems if you leave it in, does it?

No, but given that the prologue has to be removed & removing the
epilogue is as easy (& given finite memory :-) one might as well
remove it.

>3. Why do you use no-defer-pop (with the associated bug)?

OK.  This is again to avoid stack growth.  On conventional stack
architectures gcc will try & combine the argument-popping code of a
sequence of procedure calls.  e.g.

	extern long a, b, c;
	void doit()
	{
		foo(a);
		bar(b);
		baz(c);
	}

With -O -fno-defer-pop one might expect gcc to generate

	link %a6,#0
	movel a,%sp@-
	jbsr foo
	addqw #4,%sp
	movel b,%sp@-
	jbsr bar
	addqw #4,%sp
	movel c,%sp@-
	jbsr baz
	addqw #4,%sp
	unlk %a6
	rts

but because gcc knows that the unlk instruction will roll back the
stack, in fact gcc generates:

	link %a6,#0
	movel a,%sp@-
	jbsr foo
	addqw #4,%sp
	movel b,%sp@-
	jbsr bar
	addqw #4,%sp
	movel c,%sp@-
	jbsr baz
	unlk %a6
	rts

With -O -fdefer-pop gcc optimizes out the pops completely & generates:

	link %a6,#0
	movel a,%sp@-
	jbsr foo
	movel b,%sp@-
	jbsr bar
	movel c,%sp@-
	jbsr baz
	unlk %a6
	rts

With -O -fomit-frame-pointer -fdefer-pop gcc generates:

	movel a,%sp@-
	jbsr foo
	movel b,%sp@-
	jbsr bar
	movel c,%sp@-
	jbsr baz
	addw #12,%sp
	rts

& with -O -fomit-frame-pointer -fno-defer-pop gcc generates:

	movel a,%sp@-
	jbsr foo
	addqw #4,%sp
	movel b,%sp@-
	jbsr bar
	addqw #4,%sp
	movel c,%sp@-
	jbsr baz
	addqw #4,%sp
	rts

All the above cases are as one would wish.  The elimination of all
deferred pops in the unlk instruction is especially clever.  However,
in the presence of the threaded-code jump the waters darken!
Consider what gcc generates for:

	register void (**tcip)() asm("%a5");

	#define JUMPNEXT do{asm("movl %a5@+,%a0; jmp %a0@");return;}while(0)

	extern long a, b;
	void doit()
	{
		foo(a);
		bar(b);
		JUMPNEXT;
	}

With -O -fdefer-pop gcc generates

	doit:
		link %a6,#0
		movel a,%sp@-
		jbsr foo
		movel b,%sp@-
		jbsr bar
		#APP
		movl %a5@+,%a0; jmp %a0@
		#NO_APP
		unlk %a6
		rts

This is clearly no good because the arguments to both foo & bar will
never be popped.  Every time doit() is executed the stack will grow by
8 bytes.  Soon your program will dump core with a very large stack
segment!

With -O -fno-defer-pop gcc generates:

	link %a6,#0
	movel a,%sp@-
	jbsr foo
	addqw #4,%sp
	movel b,%sp@-
	jbsr bar
	#APP
	movl %a5@+,%a0; jmp %a0@
	#NO_APP
	unlk %a6
	rts

Again useless, because bar's pop has been folded into the unlk, which
won't be executed.

With -O -fdefer-pop -fomit-frame-pointer gcc generates

	movel a,%sp@-
	jbsr foo
	movel b,%sp@-
	jbsr bar
	addqw #8,%sp
	#APP
	movl %a5@+,%a0; jmp %a0@
	#NO_APP
	rts

This is great.  However, not all functions are susceptible to the
omit-frame-pointer optimization (i.e. functions with local variables).
E.g.
>4. Why does JUMPNEXT have a loop? Surely the jump leaves the loop
>right away. Presumably you are tricking the compiler somehow.

This is old C lore. The problem is 'How do you write a macro that is a
sequence of statements and that can be used wherever a single statement
can?' Take the following definition of JUMPNEXT:

    #define JUMPNEXT asm("movl %a5@+,%a0; jmp %a0@");return;

Now invoke it here:

    if (its_time_to_jump)
        JUMPNEXT;
    do_something_else();

This expands to:

    if (its_time_to_jump)
        asm("movl %a5@+,%a0; jmp %a0@");
    return;
    do_something_else();

Not at all what's intended! There are two tricks I know of (the first I
saw in Berkeley Smalltalk, the second in Richard Stallman's gcc manual.
I expect they're both quite old).

The first is to surround your statements with if (TRUE){statements}else,
i.e.

    #define JUMPNEXT if(1){asm("movl %a5@+,%a0; jmp %a0@");return;}else

So now we get:

    if (its_time_to_jump)
        if (1){
            asm("movl %a5@+,%a0; jmp %a0@");
            return;
        }else;
    do_something_else();

which works because C binds an else to the innermost if. However, some
compilers will whine about dangling elses.

The second scheme is more elegant (-: Surround your statements with
do{statements}while(FALSE);, which will execute the statements
precisely once (it's NOT a loop). I.e.

    #define JUMPNEXT do{asm("movl %a5@+,%a0; jmp %a0@");SBD();return;}while(0)

expands to:

    if (its_time_to_jump)
        do {
            asm("movl %a5@+,%a0; jmp %a0@");
            SBD();
            return;
        } while(0);
    do_something_else();

which does what's wanted and doesn't incur compiler whines.

>Thanks
>
>Simon L Peyton Jones, Glasgow University

More and more people are taking the 'use C as an assembler' route, and
more and more people are using GCC to do it (because its code quality
is good, it has global register variables, and it has an excellent asm
facility). The threaded-code-in-C idea is also becoming more popular.
But as the code above demonstrates, one does have to side-step
optimizations and develop system-specific assembler-editing scripts.

I'd like to ask Richard Stallman & the GCC development team for
-fno-prolog and -fno-epilog flags that instruct gcc to generate
 a) no register saves or restores
 b) no automatic variable allocation
 c) no procedure linkage/frame creation

Then the optimal 'Threaded-Code Machine in GCC C' could be compiled
without any assembler-editing scripts at all.

Also nice would be a way of telling GCC that an asm statement changed
the flow of control, so GCC could
 a) warn about not-reached code
 b) eliminate unnecessary code (do more code folding)
--
Eliot Miranda                   email: eliot@cs.qmw.ac.uk
Dept of Computer Science        Tel:   071 975 5229 (+44 71 975 5229)
Queen Mary Westfield College    ARPA:  eliot%cs.qmw.ac.uk@nsf.ac.uk
Mile End Road                   UUCP:  eliot@qmw-cs.uucp
LONDON E1 4NS
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
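[Editorial note: the do{...}while(0) trick is useful well beyond
threaded code. A minimal self-contained illustration, not from either
post, of why a bare brace-block macro breaks under if/else while the
do-while form survives:]

    #include <stdio.h>

    /* SWAP behaves as exactly one statement, so the trailing ';' at
       the call site closes the while(0) and the else stays attached.
       A bare {...} block followed by ';' would be a syntax error. */
    #define SWAP(a, b) do { int tmp = (a); (a) = (b); (b) = tmp; } while (0)

    int main()
    {
        int x = 1, y = 2;
        if (x < y)
            SWAP(x, y);
        else
            puts("already ordered");
        printf("%d %d\n", x, y);    /* prints: 2 1 */
        return 0;
    }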
>From brennan@bcsaic.boeing.com Sat May  4 11:28:41 1991
From: brennan@bcsaic.boeing.com (Michael D Brennan)
Newsgroups: comp.compilers
Subject: re: Threaded code
Keywords: interpreter, design
Date: 2 May 91 19:50:23 GMT
Reply-To: brennan@bcsaic.boeing.com (Michael D Brennan)
Organization: Boeing Aerospace & Electronics, Seattle WA

Another method for obtaining threaded byte code for an interpreter is
to edit the assembler output of a big switch, rather than editing the
prologues and epilogues off function calls.

You don't need gcc or global vars in registers; it works with smart and
dumb compilers, and all optimization can be turned on.

For example: this C routine executes (unthreaded) byte code for an
interpreter that can add, subtract and print.

    #define HALT   0
    #define PUSH   1
    #define ADD    2
    #define SUB    3
    #define PRINT  4

    static int stack[32];

    void execute(code_ptr)
    register int *code_ptr;
    {
        register int *stack_ptr = stack - 1;

        while (1) {
            switch (*code_ptr++) {
            case HALT:
                return;
            case PUSH:
                *++stack_ptr = *code_ptr++;
                break;
            case ADD:
                stack_ptr--;
                *stack_ptr += stack_ptr[1];
                break;
            case SUB:
                stack_ptr--;
                *stack_ptr -= stack_ptr[1];
                break;
            case PRINT:
                printf("%d\n", *stack_ptr--);
                break;
            }
        }
    }

-------------------------------------------------------

To interpret 2 + (3 - 4), the front end "compiles", into int code[],

    PUSH, 2, PUSH, 3, PUSH, 4, SUB, ADD, PRINT, HALT

and calls execute(code).

------------------------------------------------------

The difference between this and the threaded code discussed over the
last few weeks is that the switch gets compiled as

    jmp TABLE[ *code_ptr++ ]

where TABLE is the jump table generated by the compiler, which holds
the addresses of the case labels. With threading, the transitions
between functions become

    jmp *code_ptr++

but this is easy to get by editing the assembler output to export the
case labels and recode the switch.
--------------------------------------------------

For example, on a SPARC (code_ptr is %i0, stack_ptr is %i5):

        .....
    ! case PUSH
    L77004:
        ld [%i0],%o1
        inc 4,%i5
        inc 4,%i0
        b L77008
        st %o1,[%i5]
        .....
    ! the switch; it doesn't change structure as you add new op codes
    L77008:
        mov %i0,%i4
        ld [%i4],%i4
        inc 4,%i0
        cmp %i4,4
        bgu L77008
        sll %i4,2,%i4
        sethi %hi(L2000000),%o1
        or %o1,%lo(L2000000),%o1        ! [internal]
        ld [%i4+%o1],%o0
        jmp %o0
        nop
    L2000000:                           ! the jump TABLE
        .word L77003                    ! HALT
        .word L77004                    ! PUSH etc
        .word L77005
        .word L77006
        .word L77007

-------------------------------------------

Modify by adding global labels and editing the switch:

        .....
    ! case PUSH
    _push:
    L77004:
        ld [%i0],%o1
        inc 4,%i5
        inc 4,%i0
        b L77008
        st %o1,[%i5]
        .....
    ! the edited switch
    L77008:
        mov %i0,%i4
        ld [%i4],%i4
        inc 4,%i0
        jmp %i4
        nop
    ! remove TABLE

-------------------------------------------

For another example, on an Intel 8088 (stack_ptr is si, code_ptr is di):

    ; while ( 1 )
    ; {
    ;   switch( *code_ptr++ )
    ;
    @1@50:
        mov bx,di
        inc di
        inc di
        mov bx,word ptr [bx]
        cmp bx,3
        ja short @1@50
        shl bx,1
        jmp word ptr cs:@1@C738[bx]
    @1@122:
    ;
    ;   case PUSH :
    ;     * ++ stack_ptr = *code_ptr++ ;
    ;
        inc si
        inc si
        mov ax,word ptr [di]
        mov word ptr [si],ax
        inc di
        inc di
    ;
    ;   break ;
    ;
        jmp short @1@50
    ; ....
    @1@C738 label word                  ; jump TABLE
        dw @1@194                       ; HALT
        dw @1@122                       ; PUSH etc
        dw @1@146
    ....

------------------------------------------------

After editing, the jump can be computed inline:

    ; while ( 1 )
    ; {
    ;   switch( *code_ptr++ )
    ;
    @1@50:
    ; the switch code is replaced by code only executed once
        inc di
        inc di
        jmp [di-2]
        .....
    _push:
    @1@122:
    ;
    ;   case PUSH :
    ;     * ++ stack_ptr = *code_ptr++ ;
    ;
        inc si
        inc si
        mov ax,word ptr [di]
        mov word ptr [si],ax
        inc di
        inc di
    ;
    ;   break ;
    ;
        inc di                          ; jmp to *code_ptr++ inline
        inc di
        jmp [di-2]
    ; ....

----------------------------------------------

The "front end" now has the definitions

    typedef void (*TCODE)();
    extern void halt(), push(), add(), sub(), print();
    TCODE code[CODESIZE];

In the array code[], the front end compiles

    push, 2, push, 3, push, 4, sub, add, print, halt

and calls execute(code).
--
Mike Brennan
brennan@bcsaic.boeing.com
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
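[Editorial note: on compilers with GNU C's labels-as-values extension
(&&label and goto *p), Brennan's interpreter can be direct-threaded
with no assembler editing at all. A sketch, not from the post; the
call-with-null-program protocol is one common way to export the label
addresses to the front end:]

    #include <stdio.h>

    enum { HALT, PUSH, ADD, SUB, PRINT, NOPS };

    static int stack[32];

    /* Called with a null program, returns the label table so the
       front end can translate opcodes into label addresses; called
       with a real program, runs it. Dispatch is  goto **pc++  -- the
       direct jump Brennan obtains by editing the switch. */
    static void **execute(void **pc)
    {
        static void *labels[NOPS];
        int *sp = stack - 1;

        if (pc == 0) {
            labels[HALT]  = &&halt;  labels[PUSH] = &&push;
            labels[ADD]   = &&add;   labels[SUB]  = &&sub;
            labels[PRINT] = &&print;
            return labels;
        }
        goto **pc++;                    /* first dispatch */
    halt:
        return 0;
    push:
        *++sp = (int)(long)*pc++;       /* inline operand */
        goto **pc++;
    add:
        sp--; *sp += sp[1]; goto **pc++;
    sub:
        sp--; *sp -= sp[1]; goto **pc++;
    print:
        printf("%d\n", *sp--); goto **pc++;
    }

    int main()
    {
        void **op = execute(0);
        /* 2 + (3 - 4), as in the article; prints 1 */
        void *code[] = { op[PUSH], (void *)2L, op[PUSH], (void *)3L,
                         op[PUSH], (void *)4L, op[SUB],  op[ADD],
                         op[PRINT], op[HALT] };
        execute(code);
        return 0;
    }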