\begin{onlystandalone} \documentstyle[11pt,literate,a4wide]{article} \begin{document} \title{C as the assembly language for the Glasgow Haskell compiler} \author{Patrick Sansom, with help from his enemies} \date{June 1992} \maketitle \begin{rawlatex} \tableofcontents \end{rawlatex} \end{onlystandalone} \begin{onlypartofdoc} \section[C-as-assembler]{Using C for our assembler} \downsection \end{onlypartofdoc} % this file describes the issues; % it then \inputs all the actual source bits that do the job %************************************************************************ %* * \section[C-as-asm-intro-portable]{C as assembler: introduction to ``portable C''} %* * %************************************************************************ The Glasgow Haskell compiler expresses its output in C for the usual reasons: (1)~We hope it will be more readily portable to new machines; (2)~We'd rather not waste our time writing a register-allocator, instruction-selection mumbo-jumbo, etc.---in short, the dreary, tedious bits of a code-generator. The execution model that underlies Glasgow Haskell is the Spineless-Tagless G-machine (STG-machine). The constituent pieces of the STG-machine model that must be fashioned out of good old-fashioned C code are, roughly: \begin{description} \item[Dynamic heap:] The usual. \item[Stack space:] With two stacks (A and B) growing from each end of a space, towards each other. To pass (some) arguments, return addresses, ``update frames'', ... \item[Various ``STG registers'':] Stack pointers, heap pointers, return registers---all abstract machine-model entities... \item[Tables of static read-only info:] Mostly so-called ``info tables''... \item[Code:] Lots of code fragments that do nothing but tail-call each other. \end{description} OK, as mentioned, goal Numero Uno in using C as the compiler's target is to produce code that's relatively easy to port. To this end, we need very-plain-vanilla bits of C to do all of the above things. Roughly speaking, here's what comes out: \begin{description} \item[Dynamic heap:] @malloc(3)@ a chunk of memory and use that. \item[Stack space:] Ditto. \item[Various ``STG registers'':] These are just global variables (or words in a ``global register table''), each just a 32|64-bit word (i.e., not a funny structure, no funny tag bits, etc.---just very ordinary global variables). \item[Tables of static read-only info:] Initialised arrays of 32|64-bit words; they tend to come out looking something like (behind a wall of CPP macros) [very straightforward]: \begin{verbatim} const StgWord foo_info[] = { (StgWord) foo_entry, 0, 0 } \end{verbatim} \item[Code:] This is the tricky part: it's hard to convince any C compiler to do tail-calls, and certainly not all C compilers all of the time. Fortunately, {\em all} code fragments in the STG world can be coerced into C ``functions'' of the form: \begin{itemize} \item {\em No} arguments (all the STG-world function arguments, etc., go via the STG-world stacks and registers, not the C-world stacks/registers); \item They {\em always} know the next C fragment to be called (it's all tail-recursive, after all). \end{itemize} So: we dictate that every one of these code fragments is an argument-less C function of its own, which {\em returns} the address of the next code fragment to be executed (i.e., its continuation). Here is an example of one such fragment (simplified from actual code): \begin{verbatim} stg_1848_entry (/* no args */) { STK_CHK(0,-5); PUSH_STD_UPD_FRAME(Node,StdUpdRetVecReg,0,0); *(SpB+5)=(StgWord)(vtbl_1930); Node=(StgPtr)(*(Node+SPEC_HS)); SpB=SpB+5; return(*Node); /* *Node points at next code to enter */ } \end{verbatim} Now all we need is a so-called {\em mini-interpreter} to dispatch these returned continuations; this is the basic idea: \begin{verbatim} StgFunPtr continuation = (StgFunPtr) start_cont; while ( 1 ) { continuation = (StgFunPtr) (continuation)(); } \end{verbatim} \end{description} If you are utterly baffled at this point, you probably need to read the STG-machine paper. All of the above is dead simple, if not particularly whizzy; in the rest of this document, when we talk about ``portable~C'' (or, ``the portable version''), \index{portable C output} we mean this stuff. %************************************************************************ %* * \section[C-as-asm-intro-fast]{C as assembler: going faster, introduction to ``optimised C''} %* * %************************************************************************ %************************************************************************ %* * \subsection[C-as-asm-portably-slow]{Why ``portable C'' is slow} %* * %************************************************************************ Compiling Haskell programs via simple, utterly portable C, as outlined above, comes at a significant cost in code size and speed. \begin{description} \item[Dynamic heap:] No problems associated with this. \item[Stack space:] Or this. [Well, not quite. The C compiler won't take advantage of the {\em fact} (which it doesn't know) that the heap and stacks {\em cannot possibly overlap}. In a perfect world, it could use this information to select better instruction sequences.) \item[Various ``STG registers'':] These are all in global C variables, about which C compilers are notoriously pessimistic. A load or store to one of these ``registers'' is almost certainly two or more instructions---not to mention stalls for going to memory. And their global-you-don't-know-anything-about-me nature (in the C compiler's view) is very optimisation-unfriendly. And, because they're the ``registers'' of the STG execution model, they are {\em FREQUENTLY} used! You {\em really pay} for their global-variableness... (estimates: 2--4x of code speed, 2x of code size) \item[Tables of static read-only info:] There aren't any big costs associated with these, except that you can't micro-manage their placement in memory, esp.~w.r.t.~entry-code fragments. (See \sectionref{C-as-asm-native}.) \item[Code:] Rather than really tail-jumping, we make many, many trips through the ``mini-interpreter.'' Besides all those instructions, there is probably plenty of pushing/popping of frames on/off the C stack, because each dispatch by the mini-interpreter is a C function call... Also, we don't ``fall through'' from argument-satisfaction-checking code into the real code for a function: we make an extra, fairly gratuitous trip through the mini-interpreter... We {\em estimate} that real tail-jumping should make programs go \pl{~25%} faster. \end{description} %************************************************************************ %* * \subsection[C-as-asm-fast]{Solution: ``Optimising C,'' with our friend, Richard Stallman} %* * %************************************************************************ The freely-available GNU~C compiler, GCC, (version 2.x), written under the lead of Richard Stallman at the Free Software Foundation, is a good C~compiler that has some non-standard extensions and machine-code hooks that make it possible to win back practically all the slow-nesses listed above for ``portable C''. First, there is a non-standard extension to C that makes it possible to put a ``global variable'' in a machine register. For example, something like \begin{verbatim} register StgPtr SpB __asm__("%i4"); \end{verbatim} says ``global variable'' \tr{SpB} should live in machine-register \tr{%i4}. Second, GCC offers an ``extended \tr{__asm__}'' directive, which lets you inject raw assembly-language stuff into the C-compiler output. Inserting \tr{jmps} in this way is is one component of the do-real-tailjumps machinery... The other main component of doing-real-tailjumps is shutting down the C world's function-calling machinery, i.e., pushing/popping of stack frames (and/or register windows, in the SPARC case). This involves ``sedding'' the C-compiler's assembly-language output, stripping off function prologues and epilogues, and other such really horrible stuff... {\em Caveat~1:} The above is machine- and compiler-specific. {\em Response:} We don't care. GCC is freely and widely available and has been ported to bazillions of machines. {\em Caveat~2:} The above constitutes {\em serious} mangling of what ends up in a \tr{.o} object file. Mixing such \tr{.o} files with others that were incompatibly compiled (e.g., w/ different magical-global-register declarations) is {\em guaranteed} to result in gruesome program death. {\em Response:} We treat optimised-C \tr{.o} files quite separately. First, this document carefully records how C files from different ``worlds'' must be compiled (next section); also, our compilation-system driver checks that executables it links are built from compatible \tr{.o} files (see \sectionref{driver-consistency-chking}). The User's Guide describes how the beleaguered Haskell programmer interacts with all this C optimisation stuff... (executive summary: it's invisible.) [For a discussion of how well you can do on compiling via straight ANSI-standard C, see the paper from CMU about compiling ML...] %************************************************************************ %* * \subsection[C-as-asm-worlds]{``Worlds'' that must interoperate in optimised~C} %* * %************************************************************************ A Glasgow-Haskell-compiled executable is made of bits that come from various places, or ``worlds.'' These are: \begin{itemize} \item The major ``world'' is the {\em Haskell Threaded World} of direct-jumping, machine-register-using compiled-to-optimised-C Haskell code. \item The Haskell Threaded World may call into the {\em Arbitrary~C World}; for example, to do I/O. This is C code that is written by someone else, compiled by someone else, over which the Glasgow Haskell system has no control whatsoever. However, a most pleasant property of the Arbitrary~C World is that it doesn't care about any of the internals of the Haskell threaded world. In particular, it does not know about the heap, stack etc, and hence does not modify the magic ``registers'' @Hp@, @SpA@, etc. \item There may also be calls out to an {\em STG~C World}, of which the storage manager is the major example. (The use of the GNU multi-precision arithmetic package to do ``Integers'' is nearly another; except we do all the STG magic in-line, then do ``ordinary'' C calls to the library after we know it's safe to do so.) An STG~C World is a blob of C that needs to inspect/save/restore some part of the Haskell-Threaded-World state, notably some of its registers. For example, the storage manager {\em really} needs to know where the heap pointer \tr{Hp}---in a machine register---is pointing to, before embarking on a garbage collection... :-) These STG~C Worlds are the tricky devils... {\em NB:} The storage manager is a direct-jumping threaded world of its own, with its own mapping-to-machine-registers, etc. \item The {\em C~Startup World}: The C runtime-system gets things going by invoking a user-supplied function, \tr{main} (or \tr{mainIO}, if Glasgow I/O is in force). We must have such a function (it is provided by the GHC RTS), and it must get us from there into the Haskell Threaded World. Similarly, when we finally emerge from the Haskell Threaded World, we must exit as a well-behave C program would, i.e., set the exit status, etc., etc. The C~Startup World is not a big problem. \end{itemize} %************************************************************************ %* * \section[C-as-asm-issues]{Things to be {\em very careful} about with ``optimised C''} %* * %************************************************************************ (Most of) These issues can be organised according to the affected World-to-World transition... %************************************************************************ %* * \subsection[C-as-asm-starting-stopping]{Starting and stopping a Haskell Threaded World} %* * %************************************************************************ \begin{itemize} \item As part of real-tailjumping support, we completely shut down the pushing/popping of C stack frames (by sedding off prologues/epilogues). This is fine---except the C compiler doesn't know we do this and may use the stack anyway (for temporaries, automatic variables, ...)! The initialisation of the Haskell Threaded World must therefore ensure that there is some/enough C-stack space sitting around for this purpose... \item The ``mini-interpreter'' must be re-entrant! The code for @error@---and bits of the RTS---actually exploit this. \item (C Startup to Haskell) Beginning of mini-interpreter: The STG registers (in real registers) must be initialised from the @smInfo@ structure. Ending: @smInfo@ updated from the STG registers. \end{itemize} %************************************************************************ %* * \subsection[C-as-asm-haskell-to-arbitrary-C]{Haskell Threaded to/from Arbitrary~C} %* * %************************************************************************ Remember, ``Arbitrary C'' cannot modify any of the STG registers, so all that is required is to keep them safe across the call. Hence we can just call the arbitrary~C routine. But, {\em don't} map any STG registers onto any call-clobbered registers; the arbitrary~C may stomp on them. (Just use callee-save registers.) In the SPARC case, this means all \tr{%o} and several \tr{%g} registers are unavailable. GCC~2.x warns if you try to snaffle call-clobbered registers. %************************************************************************ %* * \subsection[C-as-asm-haskell-to-STG-C]{Haskell Threaded to/from special ``STG~C''} %* * %************************************************************************ This is the tricky business. [ToDo: re-make this section in its particulars (it is out-of-date); principles still valid.] \begin{itemize} \item The compiler ``knows'' about things to which it makes ``STG calls''... It emits macro calls of the form \tr{STGCALLn(f, arg1, arg2..., arg_n)}; calls to C-function @f@, with @n@ arguments, returning (at most) one 32|64-bit value. A use of this might look like... \begin{verbatim} x = STGCALL2( arby_prec_add, y, z ); \end{verbatim} \item In portable~C, the above would just become an ordinary C function-call: \begin{verbatim} x = arby_prec_add(y,z); \end{verbatim} Also, in portable~C, the @xxx_SAVE@ backup global (see below) is \tr{#defined} to be the same as \tr{xxx}! \item In optimised~C, the @STGCALLn@ macros turn into (highly-machine dependent) uses of the ultra-magic @callWrapper@ function. At least in the SPARC case, using @STGCALL1@ as an example: \begin{verbatim} STGCALL1( f, x ) ---> %o5 := f; callWrapper(x) \end{verbatim} The @callWrapper@ function is very special indeed. It must preserve all the callee-saves registers (SPARC e.g.: all \tr{%i} and \tr{%l} registers). It is {\em NOT} tail-jumped to like the Haskell-Threaded-World routines. So all it does is: \begin{enumerate} \item Save the return address (SPARC: \tr{[%o7]+8}) into @continuation_SAVE@. \item Save all relevant STG register in their corresponding @_SAVE@ backup globals. (We might have @callWrapper@ variants to save different sets.) \item Call the function (SPARC: \tr{call %o5}); note that the function arguments are all in the right place already. (NOTE: It would be hard to think of something more machine- and compiler-assumptive! But, remember, we are calling code with which we are on friendly terms...) \item Restore the @_SAVE@ backup globals into the registers; the restore mustn't affect the single returned 32|64-bit value (SPARC case: in \tr{%o0}). \item @STGJUMP@ (tail-jump) to @continuation_SAVE@. \end{enumerate} N.B.: @callWrapper@ only works up to a limited number of arguments (SPARC: 5 words, \tr{%o0-%o4}), because we are using \tr{%o5} (SPARC) for the function to call. If we run into this limit, we should pass the function in a global instead of \tr{%o5} (or whatever). \end{itemize} %************************************************************************ %* * \subsubsection[C-as-asm-haskell-to-GC]{...To the Storage Manager in particular...} %* * %************************************************************************ The interface to the GC uses the same regime; having to save and restore all STG and ptr regs is no big deal, because it only happens once per GC. Other routines should only use SpA/B, Hp, HeapLimit, SuA/B (for GC). %************************************************************************ %* * \section[C-as-asm-native]{Things that could be better with a native-code generator} %* * %************************************************************************ Even with all the fancy GNU~C tricks and whatnot, the resulting code speed and size isn't as good as you could do if you wrote a full-blown native-code generator. We have no interest in doing this---the payoff isn't big enough---but here we list, for the record, Things That Could Be Better: \begin{enumerate} \item We could park info-tables and entry code in judicious proximity to each other, so @ENTER_CLOSURE@ would be a single-indirection-with-offset, rather than a double-indirection. \end{enumerate} %************************************************************************ %* * \section[C-as-asm-registers]{IMPLEMENTATION: Mapping STG registers onto machine registers} %* * %************************************************************************ We have several mappings from STG~registers onto real machine registers, for different segments of the runtime system. Each mapping is dependent on the target architecture as well. \downsection \input{StgRegs.lh} % decls of STG registers \upsection %************************************************************************ %* * \section[C-as-asm-tailjumps]{IMPLEMENTATION: tail-jumping} %* * %************************************************************************ \downsection \input{COptJumps.lh} % per-platform tailjumps (optimised C) \subsection[driver-sedding]{ToDo: driver sedding} THIS SHOULD BE THE SOURCE FOR THE PART OF THE DRIVER THAT MANGLES OPTIMISED C ASSEMBLER FILES. \input{../runtime/c-as-asm/StgMiniInt.lc} \upsection %************************************************************************ %* * \section[C-as-asm-wrappers]{IMPLEMENTATION: ``wrappers'' to call out from the threaded world} %* * %************************************************************************ \downsection \input{COptWraps.lh} \input{../runtime/c-as-asm/HpOverflow.lc} \upsection %************************************************************************ %* * \section[driver-consistency-chking]{ToDo: driver consistency checking} %* * %************************************************************************ THIS SHOULD BE THE SOURCE FOR THE PART OF THE DRIVER THAT CHECKS CONSISTENCY OF EXECUTABLES. \begin{onlystandalone} \printindex \end{document} \end{onlystandalone}