1 <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
4 <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
5 <title>The GHC Commentary - The Native Code Generator</title>
8 <body BGCOLOR="FFFFFF">
9 <h1>The GHC Commentary - The Native Code Generator</h1>
11 On some platforms (currently x86 and PowerPC, with bitrotted
12 support for Sparc and Alpha), GHC can generate assembly code
13 directly, without having to go via C. This can sometimes almost
14 halve compilation time, and avoids the fragility and
15 horribleness of the <a href="mangler.html">mangler</a>. The NCG
16 is enabled by default for
17 non-optimising compilation on supported platforms. For most programs
18 it generates code which runs only 1-3% slower
19 (depending on platform and type of code) than that
20 created by gcc on x86s, so it is well worth using even with
21 optimised compilation. FP-intensive x86 programs see a bigger
22 slowdown, and all Sparc code runs about 5% slower due to
23 us not filling branch delay slots.
25 The NCG has always been something of a second-class citizen
26 inside GHC, an unloved child, rather. This means that its
27 integration into the compiler as a whole is rather clumsy, which
28 brings some problems described below. That apart, the NCG
29 proper is fairly cleanly designed, as target-independent as it
30 reasonably can be, and so should not be difficult to retarget.
32 <b>NOTE!</b> The native code generator was largely rewritten as part
33 of the C-- backend changes, around May 2004. Unfortunately the
34 rest of this document still refers to the old version, and was written
35 with relation to the CVS head as of end-Jan 2002. Some of it is relevant,
39 The top-level code generator fn is
41 <code>absCtoNat :: AbstractC -> UniqSM (SDoc, Pretty.Doc)</code>
43 The returned <code>SDoc</code> is for debugging, so is empty unless
44 you specify <code>-ddump-stix</code>. The <code>Pretty.Doc</code>
45 bit is the final assembly code. Translation involves three main
46 phases, the first and third of which are target-independent.
48 <li><b>Translation into the <code>Stix</code> representation.</b> Stix
49 is a simple tree-like RTL-style language, in which you can
53 <li>An infinite number of temporary, virtual registers.
54 <li>The STG "magic" registers (<code>MagicId</code>), such as
55 the heap and stack pointers.
56 <li>Literals and low-level machine ops (<code>MachOp</code>).
57 <li>Simple address computations.
58 <li>Reads and writes of: memory, virtual regs, and various STG
60 <li>Labels and <code>if ... goto ...</code> style control-flow.
63 Stix has two main associated types:
66 <li><code>StixStmt</code> -- trees executed for their side
67 effects: assignments, control transfers, and auxiliary junk
68 such as segment changes and literal data.
69 <li><code>StixExpr</code> -- trees which denote a value.
72 Translation into Stix is almost completely
73 target-independent. Needed dependencies are knowledge of
74 word size and endianness, used when generating code to do
75 deal with half-word fields in info tables. This could be
76 abstracted out easily enough. Also, the Stix translation
77 needs to know which <code>MagicId</code>s map to registers
78 on the given target, and which are stored in offsets from
81 After initial Stix generation, the trees are cleaned up with
82 constant-folding and a little copy-propagation ("Stix
83 inlining", as the code misleadingly calls it). We take
84 the opportunity to translate <code>MagicId</code>s which are
85 stored in memory on the given target, into suitable memory
86 references. Those which are stored in registers are left
87 alone. There is also a half-hearted attempt to lift literal
88 strings to the top level in cases where nested strings have
89 been observed to give incorrect code in the past.
91 Primitive machine-level operations will already be phrased in
92 terms of <code>MachOp</code>s in the presented Abstract C, and
93 these are passed through unchanged. We comment only that the
94 <code>MachOp</code>s have been chosen so as to be easy to
95 implement on all targets, and their meaning is intended to be
96 unambiguous, and the same on all targets, regardless of word
99 <b>A note on <code>MagicId</code>s.</b>
100 Those which are assigned to
101 registers on the current target are left unmodified. Those
102 which are not are stored in memory as offsets from
103 <code>BaseReg</code> (which is assumed to permanently have the
104 value <code>(&MainCapability.r)</code>), so the constant folder
105 calculates the offsets and inserts suitable loads/stores. One
106 complication is that not all archs have <code>BaseReg</code>
107 itself in a register, so for those (sparc), we instead
108 generate the address as an offset from the static symbol
109 <code>MainCapability</code>, since the register table lives in
112 Finally, <code>BaseReg</code> does occasionally itself get
113 mentioned in Stix expression trees, and in this case what is
114 denoted is precisely <code>(&MainCapability.r)</code>, not, as
115 in all other cases, the value of memory at some offset from
116 the start of the register table. Since what it denotes is an
117 r-value and not an l-value, assigning <code>BaseReg</code> is
118 meaningless, so the machinery checks to ensure this never
119 happens. All these details are taken into account by the
122 <li><b>Instruction selection.</b> This is the only majorly
123 target-specific phase. It turns Stix statements and
124 expressions into sequences of <code>Instr</code>, a data
125 type which is different for each architecture.
126 <code>Instr</code>, unsurprisingly, has various supporting
127 types, such as <code>Reg</code>, <code>Operand</code>,
128 <code>Imm</code>, etc. The generated instructions may refer
129 to specific machine registers, or to arbitrary virtual
130 registers, either those created within the instruction
131 selector, or those mentioned in the Stix passed to it.
133 The instruction selectors live in <code>MachCode.lhs</code>.
134 The core functions, for each target, are:
137 getAmode :: StixExpr -> NatM Amode
138 <br>getRegister :: StixExpr -> NatM Register
139 <br>assignMem_IntCode :: PrimRep -> StixExpr -> StixExpr -> NatM InstrBlock
140 <br>assignReg_IntCode :: PrimRep -> StixReg -> StixExpr -> NatM InstrBlock
143 The insn selectors use the "maximal munch" algorithm. The
144 bizarrely-misnamed <code>getRegister</code> translates
145 expressions. A simplified version of its type is:
147 <code>getRegister :: StixExpr -> NatM (OrdList Instr, Reg)</code>
149 That is: it (monadically) turns a <code>StixExpr</code> into a
150 sequence of instructions, and a register, with the meaning
151 that after executing the (possibly empty) sequence of
152 instructions, the (possibly virtual) register will
153 hold the resulting value. The real situation is complicated
154 by the presence of fixed registers, and is detailed below.
156 Maximal munch is a greedy algorithm and is known not to give
157 globally optimal code sequences, but it is good enough, and
158 fast and simple. Early incarnations of the NCG used something
159 more sophisticated, but that is long gone now.
161 Similarly, <code>getAmode</code> translates a value, intended
162 to denote an address, into a sequence of insns leading up to
163 a (processor-specific) addressing mode. This stuff could be
164 done using the general <code>getRegister</code> selector, but
165 would necessarily generate poorer code, because the calculated
166 address would be forced into a register, which might be
167 unnecessary if it could partially or wholly be calculated
168 using an addressing mode.
170 Finally, <code>assignMem_IntCode</code> and
171 <code>assignReg_IntCode</code> create instruction sequences to
172 calculate a value and store it in the given register, or at
173 the given address. Because these guys translate a statement,
174 not a value, they just return a sequence of insns and no
175 associated register. Floating-point and 64-bit integer
176 assignments have analogous selectors.
178 Apart from the complexities of fixed vs floating registers,
179 discussed below, the instruction selector is as simple
180 as it can be. It looks long and scary but detailed
181 examination reveals it to be fairly straightforward.
183 <li><b>Register allocation.</b> The register allocator,
184 <code>AsmRegAlloc.lhs</code> takes sequences of
185 <code>Instr</code>s which mention a mixture of real and
186 virtual registers, and returns a modified sequence referring
187 only to real ones. It is gloriously and entirely
188 target-independent. Well, not exactly true. Instead it
189 regards <code>Instr</code> (instructions) and <code>Reg</code>
190 (virtual and real registers) as abstract types, to which it has
191 the following interface:
194 insnFuture :: Instr -> InsnFuture
195 <br>regUsage :: Instr -> RegUsage
196 <br>patchRegs :: Instr -> (Reg -> Reg) -> Instr
199 <code>insnFuture</code> is used to (re)construct the graph of
200 all possible control transfers between the insns to be
201 allocated. <code>regUsage</code> returns the sets of registers
202 read and written by an instruction. And
203 <code>patchRegs</code> is used to apply the allocator's final
204 decision on virtual-to-real reg mapping to an instruction.
206 Clearly these 3 fns have to be written anew for each
207 architecture. They are defined in
208 <code>RegAllocInfo.lhs</code>. Think twice, no, thrice,
209 before modifying them: making false claims about insn
210 behaviour will lead to hard-to-find register allocation
213 <code>AsmRegAlloc.lhs</code> contains detailed comments about
214 how the allocator works. Here is a summary. The head honcho
216 <code>allocUsingTheseRegs :: [Instr] -> [Reg] -> (Bool, [Instr])</code>
218 takes a list of instructions and a list of real registers
219 available for allocation, and maps as many of the virtual regs
220 in the input into real ones as it can. The returned
221 <code>Bool</code> indicates whether or not it was
222 successful. If so, that's the end of it. If not, the caller
223 of <code>allocUsingTheseRegs</code> will attempt spilling.
224 More of that later. What <code>allocUsingTheseRegs</code>
228 <li>Implicitly number each instruction by its position in the
231 <li>Using <code>insnFuture</code>, create the set of all flow
232 edges -- possible control transfers -- within this set of
235 <li>Using <code>regUsage</code> and iterating around the flow
236 graph from the previous step, calculate, for each virtual
237 register, the set of flow edges on which it is live.
239 <li>Make a real-register committment map, which gives the set
240 of edges for which each real register is committed (in
241 use). These sets are initially empty. For each virtual
242 register, attempt to find a real register whose current
243 committment does not intersect that of the virtual
244 register -- ie, is uncommitted on all edges that the
245 virtual reg is live. If successful, this means the vreg
246 can be assigned to the realreg, so add the vreg's set to
247 the realreg's committment.
249 <li>If all the vregs were assigned to a realreg, use
250 <code>patchInstr</code> to apply the mapping to the insns themselves.
255 If <code>allocUsingTheseRegs</code> fails, a baroque
256 mechanism comes into play. We now know that much simpler
257 schemes are available to do the same thing and give better
261 The logic above <code>allocUsingTheseRegs</code>, in
262 <code>doGeneralAlloc</code> and <code>runRegAllocate</code>,
263 observe that allocation has failed with some set R of real
264 registers. So they apply <code>runRegAllocate</code> a second
265 time to the code, but remove (typically) two registers from R
266 before doing so. This naturally fails too, but returns a
267 partially-allocated sequence. <code>doGeneralAlloc</code>
268 then inserts spill code into the sequence, and finally re-runs
269 <code>allocUsingTheseRegs</code>, but supplying the original,
270 unadulterated R. This is guaranteed to succeed since the two
271 registers previously removed from R are sufficient to allocate
272 all the spill/restore instructions added.
274 Because x86 is very short of registers, and in the worst case
275 needs three removed from R, a softly-softly approach is used.
276 <code>doGeneralAlloc</code> first tries with zero regs removed
277 from R, then if that fails one, then two, etc. This means
278 <code>allocUsingTheseRegs</code> may get run several times
279 before a successful arrangement is arrived at.
280 <code>findReservedRegs</code> cooks up the sets of spill
281 registers to try with.
283 The resulting machinery is complicated and the generated spill
284 code is appalling. The saving grace is that spills are very
285 rare so it doesn't matter much. I did not invent this -- I inherited it.
287 <b>Dealing with common cases fast</b>
289 The entire reg-alloc mechanism described so far is general and
290 correct, but expensive overkill for many simple code blocks.
291 So to begin with we use
292 <code>doSimpleAlloc</code>, which attempts to do something
293 simple. It exploits the observation that if the total number
294 of virtual registers does not exceed the number of real ones
295 available, we can simply dole out a new realreg each time we
296 see mention of a new vreg, with no regard for control flow.
297 <code>doSimpleAlloc</code> therefore attempts this in a
298 single pass over the code. It gives up if it runs out of real
299 regs or sees any condition which renders the above observation
300 invalid (fixed reg uses, for example).
302 This clever hack handles the majority of code blocks quickly.
303 It was copied from the previous reg-allocator (the
304 Mattson/Partain/Marlow/Gill one).
308 <h2>Complications, observations, and possible improvements</h2>
310 <h3>Real vs virtual registers in the instruction selectors</h3>
312 The instruction selectors for expression trees, namely
313 <code>getRegister</code>, are complicated by the fact that some
314 expressions can only be computed into a specific register, whereas
315 the majority can be computed into any register. We take x86 as an
316 example, but the problem applies to all archs.
318 Terminology: <em>rreg</em> means real register, a real machine
319 register. <em>vreg</em> means one of an infinite set of virtual
320 registers. The type <code>Reg</code> is the sum of <em>rreg</em> and
321 <em>vreg</em>. The instruction selector generates sequences with
322 unconstrained use of vregs, leaving the register allocator to map them
325 Now, where was I ? Oh yes. We return to the type of
326 <code>getRegister</code>, which despite its name, selects instructions
327 to compute the value of an expression tree.
329 getRegister :: StixExpr -> NatM Register
332 = Fixed PrimRep Reg InstrBlock
333 | Any PrimRep (Reg -> InstrBlock)
335 type InstrBlock -- sequence of instructions
337 At first this looks eminently reasonable (apart from the stupid
338 name). <code>getRegister</code>, and nobody else, knows whether or
339 not a given expression has to be computed into a fixed rreg or can be
340 computed into any rreg or vreg. In the first case, it returns
341 <code>Fixed</code> and indicates which rreg the result is in. In the
342 second case it defers committing to any specific target register by
343 returning a function from <code>Reg</code> to <code>InstrBlock</code>,
344 and the caller can specify the target reg as it sees fit.
346 Unfortunately, that forces <code>getRegister</code>'s callers (usually
347 itself) to use a clumsy and confusing idiom in the common case where
348 they do not care what register the result winds up in. The reason is
349 that although a value might be computed into a fixed rreg, we are
350 forbidden (on pain of segmentation fault :) from subsequently
351 modifying the fixed reg. This and other rules are record in "Rules of
352 the game" inside <code>MachCode.lhs</code>.
354 Why can't fixed registers be modified post-hoc? Consider a simple
355 expression like <code>Hp+1</code>. Since the heap pointer
356 <code>Hp</code> is definitely in a fixed register, call it R,
357 <code>getRegister</code> on subterm <code>Hp</code> will simply return
358 <code>Fixed</code> with an empty sequence and R. But we can't just
359 emit an increment instruction for R, because that trashes
360 <code>Hp</code>; instead we first have to copy it into a fresh vreg
363 With all that in mind, consider now writing a <code>getRegister</code>
364 clause for terms of the form <code>(1 + E)</code>. Contrived, yes,
365 but illustrates the matter. First we do
366 <code>getRegister</code> on E. Now we are forced to examine what
369 getRegister (OnePlus e)
370 = getRegister e `thenNat` \ e_result ->
373 -> returnNat (Any IntRep (\dst -> e_code ++ [MOV e_fixed dst, INC dst]))
375 -> Any (\dst -> e_any dst ++ [INC dst])
377 This seems unreasonably cumbersome, yet the instruction selector is
378 full of such idioms. A good example of the complexities induced by
379 this scheme is shown by <code>trivialCode</code> for x86 in
380 <code>MachCode.lhs</code>. This deals with general integer dyadic
381 operations on x86 and has numerous cases. It was difficult to get
384 An alternative suggestion is to simplify the type of
385 <code>getRegister</code> to this:
387 getRegister :: StixExpr -> NatM (InstrBloc, VReg)
388 type VReg = .... a vreg ...
390 and then we could safely write
392 getRegister (OnePlus e)
393 = getRegister e `thenNat` \ (e_code, e_vreg) ->
394 returnNat (e_code ++ [INC e_vreg], e_vreg)
396 which is about as straightforward as you could hope for.
397 Unfortunately, it requires <code>getRegister</code> to insert moves of
398 values which naturally compute into an rreg, into a vreg. Consider:
402 On x86 the ccall result is returned in rreg <code>%eax</code>. The
403 resulting sequence, prior to register allocation, would be:
407 # move %esp to nuke args
411 If, as is likely, <code>%eax</code> is not held live beyond this point
412 for any other purpose, the move into a fresh register is pointless;
413 we'd have been better off leaving the value in <code>%eax</code> as
416 The simplified <code>getRegister</code> story is attractive. It would
417 clean up the instruction selectors significantly and make it simpler
418 to write new ones. The only drawback is that it generates redundant
419 register moves. I suggest that eliminating these should be the job
420 of the register allocator. Indeed:
422 <li>There has been some work on this already ("Iterated register
423 coalescing" ?), so this isn't a new idea.
425 <li>You could argue that the existing scheme inappropriately blurs the
426 boundary between the instruction selector and the register
427 allocator. The instruction selector should .. well .. just
428 select instructions, without having to futz around worrying about
429 what kind of registers subtrees get generated into. Register
430 allocation should be <em>entirely</em> the domain of the register
431 allocator, with the proviso that it should endeavour to allocate
432 registers so as to minimise the number of non-redundant reg-reg
433 moves in the final output.
437 <h3>Selecting insns for 64-bit values/loads/stores on 32-bit platforms</h3>
439 Note that this stuff doesn't apply on 64-bit archs, since the
440 <code>getRegister</code> mechanism applies there.
442 The relevant functions are:
444 assignMem_I64Code :: StixExpr -> StixExpr -> NatM InstrBlock
445 assignReg_I64Code :: StixReg -> StixExpr -> NatM InstrBlock
446 iselExpr64 :: StixExpr -> NatM ChildCode64
448 data ChildCode64 -- a.k.a "Register64"
451 VRegUnique -- unique for the lower 32-bit temporary
453 <code>iselExpr64</code> is the 64-bit, plausibly-named analogue of
454 <code>getRegister</code>, and <code>ChildCode64</code> is the analogue
455 of <code>Register</code>. The aim here was to generate working 64
456 bit code as simply as possible. To this end, I used the
457 simplified <code>getRegister</code> scheme described above, in which
458 <code>iselExpr64</code>generates its results into two vregs which
459 can always safely be modified afterwards.
461 Virtual registers are, unsurprisingly, distinguished by their
462 <code>Unique</code>s. There is a small difficulty in how to
463 know what the vreg for the upper 32 bits of a value is, given the vreg
464 for the lower 32 bits. The simple solution adopted is to say that
465 any low-32 vreg may also have a hi-32 counterpart which shares the
466 same unique, but is otherwise regarded as a separate entity.
467 <code>getHiVRegFromLo</code> gets one from the other.
470 = VRegUniqueLo Unique -- lower part of a split quantity
471 | VRegUniqueHi Unique -- upper part thereof
473 Apart from that, 64-bit code generation is really simple. The sparc
474 and x86 versions are almost copy-n-pastes of each other, with minor
475 adjustments for endianness. The generated code isn't wonderful but
476 is certainly acceptable, and it works.
480 <h3>Shortcomings and inefficiencies in the register allocator</h3>
482 <h4>Redundant reconstruction of the control flow graph</h4>
484 The allocator goes to considerable computational expense to construct
485 all the flow edges in the group of instructions it's allocating for,
486 by using the <code>insnFuture</code> function in the
487 <code>Instr</code> pseudo-abstract type.
489 This is really silly, because all that information is present at the
490 abstract C stage, but is thrown away in the translation to Stix.
491 So a good thing to do is to modify that translation to
492 produce a directed graph of Stix straight-line code blocks,
493 and to preserve that structure through the insn selector, so the
494 allocator can see it.
496 This would eliminate the fragile, hacky, arch-specific
497 <code>insnFuture</code> mechanism, and probably make the whole
498 compiler run measurably faster. Register allocation is a fair chunk
499 of the time of non-optimising compilation (10% or more), and
500 reconstructing the flow graph is an expensive part of reg-alloc.
501 It would probably accelerate the vreg liveness computation too.
503 <h4>Really ridiculous method for doing spilling</h4>
505 This is a more ambitious suggestion, but ... reg-alloc should be
506 reimplemented, using the scheme described in "Quality and speed in
507 linear-scan register allocation." (Traub?) For straight-line code
508 blocks, this gives an elegant one-pass algorithm for assigning
509 registers and creating the minimal necessary spill code, without the
510 need for reserving spill registers ahead of time.
512 I tried it in Rigr, replacing the previous spiller which used the
513 current GHC scheme described above, and it cut the number of spill
514 loads and stores by a factor of eight. Not to mention being simpler,
515 easier to understand and very fast.
517 The Traub paper also describes how to extend their method to multiple
518 basic blocks, which will be needed for GHC. It comes down to
519 reconciling multiple vreg-to-rreg mappings at points where control
522 <h4>Redundant-move support for revised instruction selector suggestion</h4>
524 As mentioned above, simplifying the instruction selector will require
525 the register allocator to try and allocate source and destination
526 vregs to the same rreg in reg-reg moves, so as to make as many as
527 possible go away. Without that, the revised insn selector would
528 generate worse code than at present. I know this stuff has been done
529 but know nothing about it. The Linear-scan reg-alloc paper mentioned
530 above does indeed mention a bit about it in the context of single
531 basic blocks, but I don't know if that's sufficient.
535 <h3>x86 arcana that you should know about</h3>
537 The main difficulty with x86 is that many instructions have fixed
538 register constraints, which can occasionally make reg-alloc fail
539 completely. And the FPU doesn't have the flat register model which
540 the reg-alloc abstraction (implicitly) assumes.
542 Our strategy is: do a good job for the common small subset, that is
543 integer loads, stores, address calculations, basic ALU ops (+, -,
544 and, or, xor), and jumps. That covers the vast majority of
545 executed insns. And indeed we do do a good job, with a loss of
546 less than 2% compared with gcc.
548 Initially we tried to handle integer instructions with awkward
549 register constraints (mul, div, shifts by non-constant amounts) via
550 various jigglings of the spiller et al. This never worked robustly,
551 and putting platform-specific tweaks in the generic infrastructure is
552 a big No-No. (Not quite true; shifts by a non-constant amount are
553 still done by a giant kludge, and should be moved into this new
556 Fortunately, all such insns are rare. So the current scheme is to
557 pretend that they don't have any such constraints. This fiction is
558 carried all the way through the register allocator. When the insn
559 finally comes to be printed, we emit a sequence which copies the
560 operands through memory (<code>%esp</code>-relative), satisfying the
561 constraints of the real instruction. This localises the gruesomeness
562 to just one place. Here, for example, is the code generated for
563 integer divison of <code>%esi</code> by <code>%ecx</code>:
565 # BEGIN IQUOT %ecx, %esi
578 # END IQUOT %ecx, %esi
580 This is not quite as appalling as it seems, if you consider that the
581 division itself typically takes 16+ cycles, whereas the rest of the
582 insns probably go through in about 1 cycle each.
584 This trick is taken to extremes for FP operations.
586 All notions of the x86 FP stack and its insns have been removed.
587 Instead, we pretend, to the instruction selector and register
588 allocator, that x86 has six floating point registers,
589 <code>%fake0</code> .. <code>%fake5</code>, which can be used in the
590 usual flat manner. We further claim that x86 has floating point
591 instructions very similar to SPARC and Alpha, that is, a simple
592 3-operand register-register arrangement. Code generation and register
593 allocation proceed on this basis.
595 When we come to print out the final assembly, our convenient fiction
596 is converted to dismal reality. Each fake instruction is
597 independently converted to a series of real x86 instructions.
598 <code>%fake0</code> .. <code>%fake5</code> are mapped to
599 <code>%st(0)</code> .. <code>%st(5)</code>. To do reg-reg arithmetic
600 operations, the two operands are pushed onto the top of the FP stack,
601 the operation done, and the result copied back into the relevant
602 register. When one of the operands is also the destination, we emit a
603 slightly less scummy translation. There are only six
604 <code>%fake</code> registers because 2 are needed for the translation,
605 and x86 has 8 in total.
607 The translation is inefficient but is simple and it works. A cleverer
608 translation would handle a sequence of insns, simulating the FP stack
609 contents, would not impose a fixed mapping from <code>%fake</code> to
610 <code>%st</code> regs, and hopefully could avoid most of the redundant
611 reg-reg moves of the current translation.
613 There are, however, two unforeseen bad side effects:
615 <li>This doesn't work properly, because it doesn't observe the normal
616 conventions for x86 FP code generation. It turns out that each of
617 the 8 elements in the x86 FP register stack has a tag bit which
618 indicates whether or not that register is notionally in use or
619 not. If you do a FPU operation which happens to read a
620 tagged-as-empty register, you get an x87 FPU (stack invalid)
621 exception, which is normally handled by the FPU without passing it
622 to the OS: the program keeps going, but the resulting FP values
623 are garbage. The OS can ask for the FPU to pass it FP
624 stack-invalid exceptions, but it usually doesn't.
626 Anyways: inside NCG created x86 FP code this all works fine.
627 However, the NCG's fiction of a flat register set does not operate
628 the x87 register stack in the required stack-like way. When
629 control returns to a gcc-generated world, the stack tag bits soon
630 cause stack exceptions, and thus garbage results.
632 The only fix I could think of -- and it is horrible -- is to clear
633 all the tag bits just before the next STG-level entry, in chunks
634 of code which use FP insns. <code>i386_insert_ffrees</code>
635 inserts the relevant <code>ffree</code> insns into such code
636 blocks. It depends critically on <code>is_G_instr</code> to
639 <li>It's very difficult to read the generated assembly and
640 reason about it when debugging, because there's so much clutter.
641 We print the fake insns as comments in the output, and that helps
647 <h3>Generating code for ccalls</h3>
649 For reasons I don't really understand, the instruction selectors for
650 generating calls to C (<code>genCCall</code>) have proven surprisingly
651 difficult to get right, and soaked up a lot of debugging time. As a
652 result, I have once again opted for schemes which are simple and not
653 too difficult to argue as correct, even if they don't generate
656 The sparc ccall generator in particular forces all arguments into
657 temporary virtual registers before moving them to the final
658 out-registers (<code>%o0</code> .. <code>%o5</code>). This creates
659 some unnecessary reg-reg moves. The reason is explained in a
663 <h3>Duplicate implementation for many STG macros</h3>
665 This has been discussed at length already. It has caused a couple of
666 nasty bugs due to subtle untracked divergence in the macro
667 translations. The macro-expander really should be pushed up into the
668 Abstract C phase, so the problem can't happen.
670 Doing so would have the added benefit that the NCG could be used to
671 compile more "ways" -- well, at least the 'p' profiling way.
674 <h3>How to debug the NCG without losing your sanity/hair/cool</h3>
676 Last, but definitely not least ...
678 The usual syndrome is that some program, when compiled via C, works,
679 but not when compiled via the NCG. Usually the problem is fairly
680 simple to fix, once you find the specific code block which has been
681 mistranslated. But the latter can be nearly impossible, since most
682 modules generate at least hundreds and often thousands of them.
686 Because the via-C and native routes diverge only late in the day,
687 it is not difficult to construct a 1-1 correspondence between basic
688 blocks on the two routes. So, if the program works via C but not on
689 the NCG, do the following:
691 <li>Recompile <code>AsmCodeGen.lhs</code> in the afflicted compiler
692 with <code>-DDEBUG_NCG</code>, so that it inserts
693 <code>___ncg_debug_marker</code>s
694 into the assembly it emits.
696 <li>Using a binary search on modules, find the module which is causing
699 <li>Compile that module to assembly code, with identical flags, twice,
700 once via C and once via NCG.
701 Call the outputs <code>ModuleName.s-gcc</code> and
702 <code>ModuleName.s-nat</code>. Check that the latter does indeed have
703 <code>___ncg_debug_marker</code>s in it; otherwise the next steps fail.
705 <li>Build (with a working compiler) the program
706 <code>fptools/ghc/utils/debugNCG/diff_gcc_nat</code>.
708 <li>Run: <code>diff_gcc_nat ModuleName.s</code>. This will
709 construct the 1-1 correspondence, and emits on stdout
710 a cppable assembly output. Place this in a file -- I always
711 call it <code>synth.S</code>. Note, the capital S is important;
712 otherwise it won't get cpp'd. You can feed this file directly to
713 ghc and it will automatically get cpp'd; you don't have to do so
716 <li>By messing with the <code>#define</code>s at the top of
717 <code>synth.S</code>, do a binary search to find the incorrect
718 block. Keep a careful record of where you are in the search; it
719 is easy to get confused. Remember also that multiple blocks may
720 be wrong, which also confuses matters. Finally, I usually start
721 off by re-checking that I can build the executable with all the
722 <code>#define</code>s set to 0 and then all to 1. This ensures
723 you won't get halfway through the search and then get stuck due to
724 some snafu with gcc-specific literals. Usually I set
725 <code>UNMATCHED_GCC</code> to 1 all the time, and this bit should
726 contain only literal data.
727 <code>UNMATCHED_NAT</code> should be empty.
730 <code>diff_gcc_nat</code> was known to work correctly last time I used
731 it, in December 01, for both x86 and sparc. If it doesn't work, due
732 to changes in assembly syntax, or whatever, make it work. The
733 investment is well worth it. Searching for the incorrect block(s) any
734 other way is a total time waster.
745 Last modified: Fri Feb 1 16:14:11 GMT 2002