2 \documentstyle[11pt,literate,a4wide]{article}
4 \title{C as the assembly language for the Glasgow Haskell compiler}
5 \author{Patrick Sansom, with help from his enemies}
14 \section[C-as-assembler]{Using C for our assembler}
18 % this file describes the issues;
19 % it then \inputs all the actual source bits that do the job
22 %************************************************************************
24 \section[C-as-asm-intro-portable]{C as assembler: introduction to ``portable C''}
26 %************************************************************************
28 The Glasgow Haskell compiler expresses its output in C for the usual
29 reasons: (1)~We hope it will be more readily portable to new machines;
30 (2)~We'd rather not waste our time writing a register-allocator,
31 instruction-selection mumbo-jumbo, etc.---in short, the dreary,
32 tedious bits of a code-generator.
34 The execution model that underlies Glasgow Haskell is the
35 Spineless-Tagless G-machine (STG-machine). The constituent pieces of
36 the STG-machine model that must be fashioned out of good old-fashioned
39 \item[Dynamic heap:] The usual.
41 \item[Stack space:] With two stacks (A and B) growing from each end of a space,
42 towards each other. To pass (some) arguments, return addresses, ``update
45 \item[Various ``STG registers'':] Stack pointers, heap pointers,
46 return registers---all abstract machine-model entities...
48 \item[Tables of static read-only info:] Mostly so-called ``info
51 \item[Code:] Lots of code fragments that do nothing but tail-call each
55 OK, as mentioned, goal Numero Uno in using C as the compiler's target
56 is to produce code that's relatively easy to port. To this end, we
57 need very-plain-vanilla bits of C to do all of the above things.
58 Roughly speaking, here's what comes out:
61 \item[Dynamic heap:] @malloc(3)@ a chunk of memory and use that.
63 \item[Stack space:] Ditto.
65 \item[Various ``STG registers'':] These are just global variables (or
66 words in a ``global register table''), each just a 32|64-bit word (i.e.,
67 not a funny structure, no funny tag bits, etc.---just very ordinary
70 \item[Tables of static read-only info:] Initialised arrays of 32|64-bit
71 words; they tend to come out looking something like (behind a wall of
72 CPP macros) [very straightforward]:
74 const StgWord foo_info[] = { (StgWord) foo_entry, 0, 0 }
77 \item[Code:] This is the tricky part: it's hard to convince any C compiler to do
78 tail-calls, and certainly not all C compilers all of the time.
80 Fortunately, {\em all} code fragments in the STG world can be coerced
81 into C ``functions'' of the form:
84 {\em No} arguments (all the STG-world function arguments, etc., go via
85 the STG-world stacks and registers, not the C-world stacks/registers);
87 They {\em always} know the next C fragment to be called (it's all
88 tail-recursive, after all).
91 So: we dictate that every one of these code fragments is an
92 argument-less C function of its own, which {\em returns} the address
93 of the next code fragment to be executed (i.e., its continuation). Here
94 is an example of one such fragment (simplified from actual code):
97 stg_1848_entry (/* no args */) {
99 PUSH_STD_UPD_FRAME(Node,StdUpdRetVecReg,0,0);
100 *(SpB+5)=(StgWord)(vtbl_1930);
101 Node=(StgPtr)(*(Node+SPEC_HS));
103 return(*Node); /* *Node points at next code to enter */
107 Now all we need is a so-called {\em mini-interpreter} to dispatch
108 these returned continuations; this is the basic idea:
111 StgFunPtr continuation = (StgFunPtr) start_cont;
113 while ( 1 ) { continuation = (StgFunPtr) (continuation)(); }
117 If you are utterly baffled at this point, you probably need to read
118 the STG-machine paper.
120 All of the above is dead simple, if not particularly whizzy; in the
121 rest of this document, when we talk about ``portable~C'' (or, ``the
123 \index{portable C output}
126 %************************************************************************
128 \section[C-as-asm-intro-fast]{C as assembler: going faster, introduction to ``optimised C''}
130 %************************************************************************
132 %************************************************************************
134 \subsection[C-as-asm-portably-slow]{Why ``portable C'' is slow}
136 %************************************************************************
138 Compiling Haskell programs via simple, utterly portable C, as outlined
139 above, comes at a significant cost in code size and speed.
142 \item[Dynamic heap:] No problems associated with this.
144 \item[Stack space:] Or this.
146 [Well, not quite. The C compiler won't take advantage of the {\em
147 fact} (which it doesn't know) that the heap and stacks {\em cannot
148 possibly overlap}. In a perfect world, it could use this information
149 to select better instruction sequences.)
151 \item[Various ``STG registers'':] These are all in global C variables,
152 about which C compilers are notoriously pessimistic.
154 A load or store to one of these ``registers'' is almost certainly two
155 or more instructions---not to mention stalls for going to memory. And
156 their global-you-don't-know-anything-about-me nature (in the C
157 compiler's view) is very optimisation-unfriendly.
159 And, because they're the ``registers'' of the STG execution model,
160 they are {\em FREQUENTLY} used! You {\em really pay} for their
161 global-variableness... (estimates: 2--4x of code speed, 2x of code
164 \item[Tables of static read-only info:] There aren't any big costs
165 associated with these, except that you can't micro-manage their
166 placement in memory, esp.~w.r.t.~entry-code fragments.
167 (See \sectionref{C-as-asm-native}.)
170 \item[Code:] Rather than really tail-jumping, we make many, many trips
171 through the ``mini-interpreter.'' Besides all those instructions,
172 there is probably plenty of pushing/popping of frames on/off the C
173 stack, because each dispatch by the mini-interpreter is a C function
176 Also, we don't ``fall through'' from argument-satisfaction-checking
177 code into the real code for a function: we make an extra, fairly
178 gratuitous trip through the mini-interpreter...
180 We {\em estimate} that real tail-jumping should make programs go
184 %************************************************************************
186 \subsection[C-as-asm-fast]{Solution: ``Optimising C,'' with our friend, Richard Stallman}
188 %************************************************************************
190 The freely-available GNU~C compiler, GCC, (version 2.x), written under the
191 lead of Richard Stallman at the Free Software Foundation, is a good
192 C~compiler that has some non-standard extensions and machine-code
193 hooks that make it possible to win back practically all the slow-nesses
194 listed above for ``portable C''.
196 First, there is a non-standard extension to C that makes it possible
197 to put a ``global variable'' in a machine register. For example,
201 register StgPtr SpB __asm__("%i4");
204 says ``global variable'' \tr{SpB} should live in machine-register \tr{%i4}.
206 Second, GCC offers an ``extended \tr{__asm__}'' directive, which lets
207 you inject raw assembly-language stuff into the C-compiler output.
208 Inserting \tr{jmps} in this way is is one component of the
209 do-real-tailjumps machinery...
211 The other main component of doing-real-tailjumps is shutting down the
212 C world's function-calling machinery, i.e., pushing/popping of stack
213 frames (and/or register windows, in the SPARC case). This involves
214 ``sedding'' the C-compiler's assembly-language output, stripping off
215 function prologues and epilogues, and other such really horrible
218 {\em Caveat~1:} The above is machine- and compiler-specific. {\em
219 Response:} We don't care. GCC is freely and widely available and has
220 been ported to bazillions of machines.
222 {\em Caveat~2:} The above constitutes {\em serious} mangling of what
223 ends up in a \tr{.o} object file. Mixing such \tr{.o} files with
224 others that were incompatibly compiled (e.g., w/ different
225 magical-global-register declarations) is {\em guaranteed} to result in
226 gruesome program death.
228 {\em Response:} We treat optimised-C \tr{.o} files quite separately.
229 First, this document carefully records how C files from different
230 ``worlds'' must be compiled (next section); also, our
231 compilation-system driver checks that executables it links are built
232 from compatible \tr{.o} files (see \sectionref{driver-consistency-chking}).
234 The User's Guide describes how the beleaguered Haskell programmer
235 interacts with all this C optimisation stuff... (executive summary:
238 [For a discussion of how well you can do on compiling via straight
239 ANSI-standard C, see the paper from CMU about compiling ML...]
241 %************************************************************************
243 \subsection[C-as-asm-worlds]{``Worlds'' that must interoperate in optimised~C}
245 %************************************************************************
247 A Glasgow-Haskell-compiled executable is made of bits that come from
248 various places, or ``worlds.'' These are:
251 The major ``world'' is the {\em Haskell Threaded World} of
252 direct-jumping, machine-register-using compiled-to-optimised-C Haskell
256 The Haskell Threaded World may call into the {\em Arbitrary~C
257 World}; for example, to do I/O. This is C code that is written by
258 someone else, compiled by someone else, over which the Glasgow Haskell
259 system has no control whatsoever. However, a most pleasant property
260 of the Arbitrary~C World is that it doesn't care about any of the
261 internals of the Haskell threaded world. In particular, it does
262 not know about the heap, stack etc, and hence does not modify
263 the magic ``registers'' @Hp@, @SpA@, etc.
266 There may also be calls out to an {\em STG~C World}, of which the
267 storage manager is the major example. (The use of the GNU
268 multi-precision arithmetic package to do ``Integers'' is nearly
269 another; except we do all the STG magic in-line, then do ``ordinary''
270 C calls to the library after we know it's safe to do so.) An STG~C
271 World is a blob of C that needs to inspect/save/restore some part of
272 the Haskell-Threaded-World state, notably some of its registers. For
273 example, the storage manager {\em really} needs to know where the heap
274 pointer \tr{Hp}---in a machine register---is pointing to, before
275 embarking on a garbage collection... :-)
277 These STG~C Worlds are the tricky devils...
279 {\em NB:} The storage manager is a direct-jumping threaded world of
280 its own, with its own mapping-to-machine-registers, etc.
283 The {\em C~Startup World}: The C runtime-system gets things going by
284 invoking a user-supplied function, \tr{main} (or \tr{mainIO}, if
285 Glasgow I/O is in force). We must have such a function (it is
286 provided by the GHC RTS), and it must get us from there into the
287 Haskell Threaded World.
289 Similarly, when we finally emerge from the Haskell Threaded World, we
290 must exit as a well-behave C program would, i.e., set the exit status,
293 The C~Startup World is not a big problem.
296 %************************************************************************
298 \section[C-as-asm-issues]{Things to be {\em very careful} about with ``optimised C''}
300 %************************************************************************
302 (Most of) These issues can be organised according to the affected
303 World-to-World transition...
305 %************************************************************************
307 \subsection[C-as-asm-starting-stopping]{Starting and stopping a Haskell Threaded World}
309 %************************************************************************
313 As part of real-tailjumping support, we completely shut down the
314 pushing/popping of C stack frames (by sedding off
315 prologues/epilogues). This is fine---except the C compiler doesn't
316 know we do this and may use the stack anyway (for temporaries,
317 automatic variables, ...)! The initialisation of the Haskell Threaded
318 World must therefore ensure that there is some/enough C-stack space
319 sitting around for this purpose...
322 The ``mini-interpreter'' must be re-entrant! The code for
323 @error@---and bits of the RTS---actually exploit this.
326 (C Startup to Haskell) Beginning of mini-interpreter: The STG
327 registers (in real registers) must be initialised from the @smInfo@
328 structure. Ending: @smInfo@ updated from the STG registers.
331 %************************************************************************
333 \subsection[C-as-asm-haskell-to-arbitrary-C]{Haskell Threaded to/from Arbitrary~C}
335 %************************************************************************
337 Remember, ``Arbitrary C'' cannot modify any of the STG registers, so
338 all that is required is to keep them safe across the call.
340 Hence we can just call the arbitrary~C routine. But, {\em don't} map
341 any STG registers onto any call-clobbered registers; the arbitrary~C
342 may stomp on them. (Just use callee-save registers.) In the SPARC
343 case, this means all \tr{%o} and several \tr{%g} registers are
344 unavailable. GCC~2.x warns if you try to snaffle call-clobbered
347 %************************************************************************
349 \subsection[C-as-asm-haskell-to-STG-C]{Haskell Threaded to/from special ``STG~C''}
351 %************************************************************************
353 This is the tricky business.
355 [ToDo: re-make this section in its particulars (it is out-of-date);
356 principles still valid.]
360 The compiler ``knows'' about things to which it makes ``STG calls''...
362 It emits macro calls of the form \tr{STGCALLn(f, arg1, arg2..., arg_n)};
363 calls to C-function @f@, with @n@ arguments, returning (at most) one
364 32|64-bit value. A use of this might look like...
367 x = STGCALL2( arby_prec_add, y, z );
371 In portable~C, the above would just become an ordinary C
375 x = arby_prec_add(y,z);
378 Also, in portable~C, the @xxx_SAVE@ backup global (see below) is
379 \tr{#defined} to be the same as \tr{xxx}!
382 In optimised~C, the @STGCALLn@ macros turn into (highly-machine
383 dependent) uses of the ultra-magic @callWrapper@ function.
385 At least in the SPARC case, using @STGCALL1@ as an example:
388 STGCALL1( f, x ) ---> %o5 := f; callWrapper(x)
391 The @callWrapper@ function is very special indeed. It must preserve
392 all the callee-saves registers (SPARC e.g.: all \tr{%i} and \tr{%l}
393 registers). It is {\em NOT} tail-jumped to like the
394 Haskell-Threaded-World routines. So all it does is:
398 Save the return address (SPARC: \tr{[%o7]+8}) into @continuation_SAVE@.
401 Save all relevant STG register in their corresponding @_SAVE@ backup globals.
402 (We might have @callWrapper@ variants to save different sets.)
405 Call the function (SPARC: \tr{call %o5}); note that the function
406 arguments are all in the right place already. (NOTE: It would be hard
407 to think of something more machine- and compiler-assumptive! But,
408 remember, we are calling code with which we are on friendly terms...)
411 Restore the @_SAVE@ backup globals into the registers; the restore
412 mustn't affect the single returned 32|64-bit value (SPARC case: in \tr{%o0}).
415 @STGJUMP@ (tail-jump) to @continuation_SAVE@.
418 N.B.: @callWrapper@ only works up to a limited number of arguments
419 (SPARC: 5 words, \tr{%o0-%o4}), because we are using \tr{%o5} (SPARC)
420 for the function to call. If we run into this limit, we should pass
421 the function in a global instead of \tr{%o5} (or whatever).
424 %************************************************************************
426 \subsubsection[C-as-asm-haskell-to-GC]{...To the Storage Manager in particular...}
428 %************************************************************************
430 The interface to the GC uses the same regime; having to save and restore
431 all STG and ptr regs is no big deal, because it only happens once per GC.
432 Other routines should only use SpA/B, Hp, HeapLimit, SuA/B (for GC).
434 %************************************************************************
436 \section[C-as-asm-native]{Things that could be better with a native-code generator}
438 %************************************************************************
440 Even with all the fancy GNU~C tricks and whatnot, the resulting code
441 speed and size isn't as good as you could do if you wrote a full-blown
442 native-code generator. We have no interest in doing this---the payoff
443 isn't big enough---but here we list, for the record, Things That Could
448 We could park info-tables and entry code in judicious proximity to
449 each other, so @ENTER_CLOSURE@ would be a
450 single-indirection-with-offset, rather than a double-indirection.
453 %************************************************************************
455 \section[C-as-asm-registers]{IMPLEMENTATION: Mapping STG registers onto machine registers}
457 %************************************************************************
459 We have several mappings from STG~registers onto real machine registers,
460 for different segments of the runtime system. Each mapping is
461 dependent on the target architecture as well.
464 \input{StgRegs.lh} % decls of STG registers
467 %************************************************************************
469 \section[C-as-asm-tailjumps]{IMPLEMENTATION: tail-jumping}
471 %************************************************************************
474 \input{COptJumps.lh} % per-platform tailjumps (optimised C)
476 \subsection[driver-sedding]{ToDo: driver sedding}
478 THIS SHOULD BE THE SOURCE FOR THE PART OF THE DRIVER THAT MANGLES
479 OPTIMISED C ASSEMBLER FILES.
481 \input{../runtime/c-as-asm/StgMiniInt.lc}
484 %************************************************************************
486 \section[C-as-asm-wrappers]{IMPLEMENTATION: ``wrappers'' to call out from the threaded world}
488 %************************************************************************
493 \input{../runtime/c-as-asm/HpOverflow.lc}
496 %************************************************************************
498 \section[driver-consistency-chking]{ToDo: driver consistency checking}
500 %************************************************************************
502 THIS SHOULD BE THE SOURCE FOR THE PART OF THE DRIVER THAT CHECKS
503 CONSISTENCY OF EXECUTABLES.
506 \begin{onlystandalone}