1 <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
4 <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
5 <title>The GHC Commentary - The Native Code Generator</title>
8 <body BGCOLOR="FFFFFF">
9 <h1>The GHC Commentary - The Native Code Generator</h1>
11 On x86 and sparc platforms, GHC can generate assembly code
12 directly, without having to go via C. This can sometimes almost
13 halve compilation time, and avoids the fragility and
14 horribleness of the mangler. The NCG is enabled by default for
15 non-optimising compilation on x86 and sparc. For most programs
16 it generates code which runs only about 1% slower than that
17 created by gcc on x86s, so it is well worth using even with
18 optimised compilation. FP-intensive x86 programs see a bigger
19 slowdown, and all sparc code runs about 5% slower due to
20 us not filling branch delay slots.
23 the NCG could also generate Alpha code, and that machinery
24 is still there, but will need extensive refurbishment to
25 get it going again, due to underlying infrastructural changes.
26 Budding hackers thinking of doing a PowerPC port would do well
27 to use the sparc bits as a starting point.
29 The NCG has always been something of a second-class citizen
30 inside GHC, an unloved child, rather. This means that its
31 integration into the compiler as a whole is rather clumsy, which
32 brings some problems described below. That apart, the NCG
33 proper is fairly cleanly designed, as target-independent as it
34 reasonably can be, and so should not be difficult to retarget.
36 The following details are correct as per the CVS head of end-Jan
40 The top-level code generator fn is
42 <code>absCtoNat :: AbstractC -> UniqSM (SDoc, Pretty.Doc)</code>
44 The returned <code>SDoc</code> is for debugging, so is empty unless
45 you specify <code>-ddump-stix</code>. The <code>Pretty.Doc</code>
46 bit is the final assembly code. Translation involves three main
47 phases, the first and third of which are target-independent.
49 <li><b>Translation into the <code>Stix</code> representation.</b> Stix
50 is a simple tree-like RTL-style language, in which you can
54 <li>An infinite number of temporary, virtual registers.
55 <li>The STG "magic" registers (<code>MagicId</code>), such as
56 the heap and stack pointers.
57 <li>Literals and low-level machine ops (<code>MachOp</code>).
58 <li>Simple address computations.
59 <li>Reads and writes of: memory, virtual regs, and various STG
61 <li>Labels and <code>if ... goto ...</code> style control-flow.
64 Stix has two main associated types:
67 <li><code>StixStmt</code> -- trees executed for their side
68 effects: assignments, control transfers, and auxiliary junk
69 such as segment changes and literal data.
70 <li><code>StixExpr</code> -- trees which denote a value.
73 Translation into Stix is almost completely
74 target-independent. Needed dependencies are knowledge of
75 word size and endianness, used when generating code to do
76 deal with half-word fields in info tables. This could be
77 abstracted out easily enough. Also, the Stix translation
78 needs to know which <code>MagicId</code>s map to registers
79 on the given target, and which are stored in offsets from
82 After initial Stix generation, the trees are cleaned up with
83 constant-folding and a little copy-propagation ("Stix
84 inlining", as the code misleadingly calls it). We take
85 the opportunity to translate <code>MagicId</code>s which are
86 stored in memory on the given target, into suitable memory
87 references. Those which are stored in registers are left
88 alone. There is also a half-hearted attempt to lift literal
89 strings to the top level in cases where nested strings have
90 been observed to give incorrect code in the past.
92 Primitive machine-level operations will already be phrased in
93 terms of <code>MachOp</code>s in the presented Abstract C, and
94 these are passed through unchanged. We comment only that the
95 <code>MachOp</code>s have been chosen so as to be easy to
96 implement on all targets, and their meaning is intended to be
97 unambiguous, and the same on all targets, regardless of word
100 <li><b>Instruction selection.</b> This is the only majorly
101 target-specific phase. It turns Stix statements and
102 expressions into sequences of <code>Instr</code>, a data
103 type which is different for each architecture.
104 <code>Instr</code>, unsurprisingly, has various supporting
105 types, such as <code>Reg</code>, <code>Operand</code>,
106 <code>Imm</code>, etc. The generated instructions may refer
107 to specific machine registers, or to arbitrary virtual
108 registers, either those created within the instruction
109 selector, or those mentioned in the Stix passed to it.
111 The instruction selectors live in <code>MachCode.lhs</code>.
112 The core functions, for each target, are:
115 getAmode :: StixExpr -> NatM Amode
116 <br>getRegister :: StixExpr -> NatM Register
117 <br>assignMem_IntCode :: PrimRep -> StixExpr -> StixExpr -> NatM InstrBlock
118 <br>assignReg_IntCode :: PrimRep -> StixReg -> StixExpr -> NatM InstrBlock
121 The insn selectors use the "maximal munch" algorithm. The
122 bizarrely-misnamed <code>getRegister</code> translates
123 expressions. A simplified version of its type is:
125 <code>getRegister :: StixExpr -> NatM (OrdList Instr, Reg)</code>
127 That is: it (monadically) turns a <code>StixExpr</code> into a
128 sequence of instructions, and a register, with the meaning
129 that after executing the (possibly empty) sequence of
130 instructions, the (possibly virtual) register will
131 hold the resulting value. The real situation is complicated
132 by the presence of fixed registers, and is detailed below.
134 Maximal munch is a greedy algorithm and is known not to give
135 globally optimal code sequences, but it is good enough, and
136 fast and simple. Early incarnations of the NCG used something
137 more sophisticated, but that is long gone now.
139 Similarly, <code>getAmode</code> translates a value, intended
140 to denote an address, into a sequence of insns leading up to
141 a (processor-specific) addressing mode. This stuff could be
142 done using the general <code>getRegister</code> selector, but
143 would necessarily generate poorer code, because the calculated
144 address would be forced into a register, which might be
145 unnecessary if it could partially or wholly be calculated
146 using an addressing mode.
148 Finally, <code>assignMem_IntCode</code> and
149 <code>assignReg_IntCode</code> create instruction sequences to
150 calculate a value and store it in the given register, or at
151 the given address. Because these guys translate a statement,
152 not a value, they just return a sequence of insns and no
153 associated register. Floating-point and 64-bit integer
154 assignments have analogous selectors.
156 Apart from the complexities of fixed vs floating registers,
157 discussed below, the instruction selector is as simple
158 as it can be. It looks long and scary but detailed
159 examination reveals it to be fairly straightforward.
161 <li><b>Register allocation.</b> The register allocator,
162 <code>AsmRegAlloc.lhs</code> takes sequences of
163 <code>Instr</code>s which mention a mixture of real and
164 virtual registers, and returns a modified sequence referring
165 only to real ones. It is gloriously and entirely
166 target-independent. Well, not exactly true. Instead it
167 regards <code>Instr</code> (instructions) and <code>Reg</code>
168 (virtual and real registers) as abstract types, to which it has
169 the following interface:
172 insnFuture :: Instr -> InsnFuture
173 <br>regUsage :: Instr -> RegUsage
174 <br>patchRegs :: Instr -> (Reg -> Reg) -> Instr
177 <code>insnFuture</code> is used to (re)construct the graph of
178 all possible control transfers between the insns to be
179 allocated. <code>regUsage</code> returns the sets of registers
180 read and written by an instruction. And
181 <code>patchRegs</code> is used to apply the allocator's final
182 decision on virtual-to-real reg mapping to an instruction.
184 Clearly these 3 fns have to be written anew for each
185 architecture. They are defined in
186 <code>RegAllocInfo.lhs</code>. Think twice, no, thrice,
187 before modifying them: making false claims about insn
188 behaviour will lead to hard-to-find register allocation
191 <code>AsmRegAlloc.lhs</code> contains detailed comments about
192 how the allocator works. Here is a summary. The head honcho
194 <code>allocUsingTheseRegs :: [Instr] -> [Reg] -> (Bool, [Instr])</code>
196 takes a list of instructions and a list of real registers
197 available for allocation, and maps as many of the virtual regs
198 in the input into real ones as it can. The returned
199 <code>Bool</code> indicates whether or not it was
200 successful. If so, that's the end of it. If not, the caller
201 of <code>allocUsingTheseRegs</code> will attempt spilling.
202 More of that later. What <code>allocUsingTheseRegs</code>
206 <li>Implicitly number each instruction by its position in the
209 <li>Using <code>insnFuture</code>, create the set of all flow
210 edges -- possible control transfers -- within this set of
213 <li>Using <code>regUsage</code> and iterating around the flow
214 graph from the previous step, calculate, for each virtual
215 register, the set of flow edges on which it is live.
217 <li>Make a real-register committment map, which gives the set
218 of edges for which each real register is committed (in
219 use). These sets are initially empty. For each virtual
220 register, attempt to find a real register whose current
221 committment does not intersect that of the virtual
222 register -- ie, is uncommitted on all edges that the
223 virtual reg is live. If successful, this means the vreg
224 can be assigned to the realreg, so add the vreg's set to
225 the realreg's committment.
227 <li>If all the vregs were assigned to a realreg, use
228 <code>patchInstr</code> to apply the mapping to the insns themselves.
233 If <code>allocUsingTheseRegs</code> fails, a baroque
234 mechanism comes into play. We now know that much simpler
235 schemes are available to do the same thing and give better
239 The logic above <code>allocUsingTheseRegs</code>, in
240 <code>doGeneralAlloc</code> and <code>runRegAllocate</code>,
241 observe that allocation has failed with some set R of real
242 registers. So they apply <code>runRegAllocate</code> a second
243 time to the code, but remove (typically) two registers from R
244 before doing so. This naturally fails too, but returns a
245 partially-allocated sequence. <code>doGeneralAlloc</code>
246 then inserts spill code into the sequence, and finally re-runs
247 <code>allocUsingTheseRegs</code>, but supplying the original,
248 unadulterated R. This is guaranteed to succeed since the two
249 registers previously removed from R are sufficient to allocate
250 all the spill/restore instructions added.
252 Because x86 is very short of registers, and in the worst case
253 needs three removed from R, a softly-softly approach is used.
254 <code>doGeneralAlloc</code> first tries with zero regs removed
255 from R, then if that fails one, then two, etc. This means
256 <code>allocUsingTheseRegs</code> may get run several times
257 before a successful arrangement is arrived at.
258 <code>findReservedRegs</code> cooks up the sets of spill
259 registers to try with.
261 The resulting machinery is complicated and the generated spill
262 code is appalling. The saving grace is that spills are very
263 rare so it doesn't matter much. I did not invent this -- I inherited it.
265 <b>Dealing with common cases fast</b>
267 The entire reg-alloc mechanism described so far is general and
268 correct, but expensive overkill for many simple code blocks.
269 So to begin with we use
270 <code>doSimpleAlloc</code>, which attempts to do something
271 simple. It exploits the observation that if the total number
272 of virtual registers does not exceed the number of real ones
273 available, we can simply dole out a new realreg each time we
274 see mention of a new vreg, with no regard for control flow.
275 <code>doSimpleAlloc</code> therefore attempts this in a
276 single pass over the code. It gives up if it runs out of real
277 regs or sees any condition which renders the above observation
278 invalid (fixed reg uses, for example).
280 This clever hack handles the majority of code blocks quickly.
281 It was copied from the previous reg-allocator (the
282 Mattson/Partain/Marlow/Gill one).
286 <h2>Complications, observations, and possible improvements</h2>
288 <h3>Real vs virtual registers in the instruction selectors</h3>
290 The instruction selectors for expression trees, namely
291 <code>getRegister</code>, are complicated by the fact that some
292 expressions can only be computed into a specific register, whereas
293 the majority can be computed into any register. We take x86 as an
294 example, but the problem applies to all archs.
296 Terminology: <em>rreg</em> means real register, a real machine
297 register. <em>vreg</em> means one of an infinite set of virtual
298 registers. The type <code>Reg</code> is the sum of <em>rreg</em> and
299 <em>vreg</em>. The instruction selector generates sequences with
300 unconstrained use of vregs, leaving the register allocator to map them
303 Now, where was I ? Oh yes. We return to the type of
304 <code>getRegister</code>, which despite its name, selects instructions
305 to compute the value of an expression tree.
307 getRegister :: StixExpr -> NatM Register
310 = Fixed PrimRep Reg InstrBlock
311 | Any PrimRep (Reg -> InstrBlock)
313 type InstrBlock -- sequence of instructions
315 At first this looks eminently reasonable (apart from the stupid
316 name). <code>getRegister</code>, and nobody else, knows whether or
317 not a given expression has to be computed into a fixed rreg or can be
318 computed into any rreg or vreg. In the first case, it returns
319 <code>Fixed</code> and indicates which rreg the result is in. In the
320 second case it defers committing to any specific target register by
321 returning a function from <code>Reg</code> to <code>InstrBlock</code>,
322 and the caller can specify the target reg as it sees fit.
324 Unfortunately, that forces <code>getRegister</code>'s callers (usually
325 itself) to use a clumsy and confusing idiom in the common case where
326 they do not care what register the result winds up in. The reason is
327 that although a value might be computed into a fixed rreg, we are
328 forbidden (on pain of segmentation fault :) from subsequently
329 modifying the fixed reg. This and other rules are record in "Rules of
330 the game" inside <code>MachCode.lhs</code>.
332 Why can't fixed registers be modified post-hoc? Consider a simple
333 expression like <code>Hp+1</code>. Since the heap pointer
334 <code>Hp</code> is definitely in a fixed register, call it R,
335 <code>getRegister</code> on subterm <code>Hp</code> will simply return
336 <code>Fixed</code> with an empty sequence and R. But we can't just
337 emit an increment instruction for R, because that trashes
338 <code>Hp</code>; instead we first have to copy it into a fresh vreg
341 With all that in mind, consider now writing a <code>getRegister</code>
342 clause for terms of the form <code>(1 + E)</code>. Contrived, yes,
343 but illustrates the matter. First we do
344 <code>getRegister</code> on E. Now we are forced to examine what
347 getRegister (OnePlus e)
348 = getRegister e `thenNat` \ e_result ->
351 -> returnNat (Any IntRep (\dst -> e_code ++ [MOV e_fixed dst, INC dst]))
353 -> Any (\dst -> e_any dst ++ [INC dst])
355 This seems unreasonably cumbersome, yet the instruction selector is
356 full of such idioms. A good example of the complexities induced by
357 this scheme is shown by <code>trivialCode</code> for x86 in
358 <code>MachCode.lhs</code>. This deals with general integer dyadic
359 operations on x86 and has numerous cases. It was difficult to get
362 An alternative suggestion is to simplify the type of
363 <code>getRegister</code> to this:
365 getRegister :: StixExpr -> NatM (InstrBloc, VReg)
366 type VReg = .... a vreg ...
368 and then we could safely write
370 getRegister (OnePlus e)
371 = getRegister e `thenNat` \ (e_code, e_vreg) ->
372 returnNat (e_code ++ [INC e_vreg], e_vreg)
374 which is about as straightforward as you could hope for.
375 Unfortunately, it requires <code>getRegister</code> to insert moves of
376 values which naturally compute into an rreg, into a vreg. Consider:
380 On x86 the ccall result is returned in rreg <code>%eax</code>. The
381 resulting sequence, prior to register allocation, would be:
385 # move %esp to nuke args
389 If, as is likely, <code>%eax</code> is not held live beyond this point
390 for any other purpose, the move into a fresh register is pointless;
391 we'd have been better off leaving the value in <code>%eax</code> as
394 The simplified <code>getRegister</code> story is attractive. It would
395 clean up the instruction selectors significantly and make it simpler
396 to write new ones. The only drawback is that it generates redundant
397 register moves. I suggest that eliminating these should be the job
398 of the register allocator. Indeed:
400 <li>There has been some work on this already ("Iterated register
401 coalescing" ?), so this isn't a new idea.
403 <li>You could argue that the existing scheme inappropriately blurs the
404 boundary between the instruction selector and the register
405 allocator. The instruction selector should .. well .. just
406 select instructions, without having to futz around worrying about
407 what kind of registers subtrees get generated into. Register
408 allocation should be <em>entirely</em> the domain of the register
409 allocator, with the proviso that it should endeavour to allocate
410 registers so as to minimise the number of non-redundant reg-reg
411 moves in the final output.
415 <h3>Selecting insns for 64-bit values/loads/stores on 32-bit platforms</h3>
417 Note that this stuff doesn't apply on 64-bit archs, since the
418 <code>getRegister</code> mechanism applies there.
420 The relevant functions are:
422 assignMem_I64Code :: StixExpr -> StixExpr -> NatM InstrBlock
423 assignReg_I64Code :: StixReg -> StixExpr -> NatM InstrBlock
424 iselExpr64 :: StixExpr -> NatM ChildCode64
426 data ChildCode64 -- a.k.a "Register64"
429 VRegUnique -- unique for the lower 32-bit temporary
431 <code>iselExpr64</code> is the 64-bit, plausibly-named analogue of
432 <code>getRegister</code>, and <code>ChildCode64</code> is the analogue
433 of <code>Register</code>. The aim here was to generate working 64
434 bit code as simply as possible. To this end, I used the
435 simplified <code>getRegister</code> scheme described above, in which
436 <code>iselExpr64</code>generates its results into two vregs which
437 can always safely be modified afterwards.
439 Virtual registers are, unsurprisingly, distinguished by their
440 <code>Unique</code>s. There is a small difficulty in how to
441 know what the vreg for the upper 32 bits of a value is, given the vreg
442 for the lower 32 bits. The simple solution adopted is to say that
443 any low-32 vreg may also have a hi-32 counterpart which shares the
444 same unique, but is otherwise regarded as a separate entity.
445 <code>getHiVRegFromLo</code> gets one from the other.
448 = VRegUniqueLo Unique -- lower part of a split quantity
449 | VRegUniqueHi Unique -- upper part thereof
451 Apart from that, 64-bit code generation is really simple. The sparc
452 and x86 versions are almost copy-n-pastes of each other, with minor
453 adjustments for endianness. The generated code isn't wonderful but
454 is certainly acceptable, and it works.
458 <h3>Shortcomings and inefficiencies in the register allocator</h3>
460 <h4>Redundant reconstruction of the control flow graph</h4>
462 The allocator goes to considerable computational expense to construct
463 all the flow edges in the group of instructions it's allocating for,
464 by using the <code>insnFuture</code> function in the
465 <code>Instr</code> pseudo-abstract type.
467 This is really silly, because all that information is present at the
468 abstract C stage, but is thrown away in the translation to Stix.
469 So a good thing to do is to modify that translation to
470 produce a directed graph of Stix straight-line code blocks,
471 and to preserve that structure through the insn selector, so the
472 allocator can see it.
474 This would eliminate the fragile, hacky, arch-specific
475 <code>insnFuture</code> mechanism, and probably make the whole
476 compiler run measurably faster. Register allocation is a fair chunk
477 of the time of non-optimising compilation (10% or more), and
478 reconstructing the flow graph is an expensive part of reg-alloc.
479 It would probably accelerate the vreg liveness computation too.
481 <h4>Really ridiculous method for doing spilling</h4>
483 This is a more ambitious suggestion, but ... reg-alloc should be
484 reimplemented, using the scheme described in "Quality and speed in
485 linear-scan register allocation." (Traub?) For straight-line code
486 blocks, this gives an elegant one-pass algorithm for assigning
487 registers and creating the minimal necessary spill code, without the
488 need for reserving spill registers ahead of time.
490 I tried it in Rigr, replacing the previous spiller which used the
491 current GHC scheme described above, and it cut the number of spill
492 loads and stores by a factor of eight. Not to mention being simpler,
493 easier to understand and very fast.
495 The Traub paper also describes how to extend their method to multiple
496 basic blocks, which will be needed for GHC. It comes down to
497 reconciling multiple vreg-to-rreg mappings at points where control
500 <h4>Redundant-move support for revised instruction selector suggestion</h4>
502 As mentioned above, simplifying the instruction selector will require
503 the register allocator to try and allocate source and destination
504 vregs to the same rreg in reg-reg moves, so as to make as many as
505 possible go away. Without that, the revised insn selector would
506 generate worse code than at present. I know this stuff has been done
507 but know nothing about it. The Linear-scan reg-alloc paper mentioned
508 above does indeed mention a bit about it in the context of single
509 basic blocks, but I don't know if that's sufficient.
513 <h3>x86 arcana that you should know about</h3>
515 The main difficulty with x86 is that many instructions have fixed
516 register constraints, which can occasionally make reg-alloc fail
517 completely. And the FPU doesn't have the flat register model which
518 the reg-alloc abstraction (implicitly) assumes.
520 Our strategy is: do a good job for the common small subset, that is
521 integer loads, stores, address calculations, basic ALU ops (+, -,
522 and, or, xor), and jumps. That covers the vast majority of
523 executed insns. And indeed we do do a good job, with a loss of
524 less than 2% compared with gcc.
526 Initially we tried to handle integer instructions with awkward
527 register constraints (mul, div, shifts by non-constant amounts) via
528 various jigglings of the spiller et al. This never worked robustly,
529 and putting platform-specific tweaks in the generic infrastructure is
530 a big No-No. (Not quite true; shifts by a non-constant amount are
531 still done by a giant kludge, and should be moved into this new
534 Fortunately, all such insns are rare. So the current scheme is to
535 pretend that they don't have any such constraints. This fiction is
536 carried all the way through the register allocator. When the insn
537 finally comes to be printed, we emit a sequence which copies the
538 operands through memory (<code>%esp</code>-relative), satisfying the
539 constraints of the real instruction. This localises the gruesomeness
540 to just one place. Here, for example, is the code generated for
541 integer divison of <code>%esi</code> by <code>%ecx</code>:
543 # BEGIN IQUOT %ecx, %esi
556 # END IQUOT %ecx, %esi
558 This is not quite as appalling as it seems, if you consider that the
559 division itself typically takes 16+ cycles, whereas the rest of the
560 insns probably go through in about 1 cycle each.
562 This trick is taken to extremes for FP operations.
564 All notions of the x86 FP stack and its insns have been removed.
565 Instead, we pretend, to the instruction selector and register
566 allocator, that x86 has six floating point registers,
567 <code>%fake0</code> .. <code>%fake5</code>, which can be used in the
568 usual flat manner. We further claim that x86 has floating point
569 instructions very similar to SPARC and Alpha, that is, a simple
570 3-operand register-register arrangement. Code generation and register
571 allocation proceed on this basis.
573 When we come to print out the final assembly, our convenient fiction
574 is converted to dismal reality. Each fake instruction is
575 independently converted to a series of real x86 instructions.
576 <code>%fake0</code> .. <code>%fake5</code> are mapped to
577 <code>%st(0)</code> .. <code>%st(5)</code>. To do reg-reg arithmetic
578 operations, the two operands are pushed onto the top of the FP stack,
579 the operation done, and the result copied back into the relevant
580 register. When one of the operands is also the destination, we emit a
581 slightly less scummy translation. There are only six
582 <code>%fake</code> registers because 2 are needed for the translation,
583 and x86 has 8 in total.
585 The translation is inefficient but is simple and it works. A cleverer
586 translation would handle a sequence of insns, simulating the FP stack
587 contents, would not impose a fixed mapping from <code>%fake</code> to
588 <code>%st</code> regs, and hopefully could avoid most of the redundant
589 reg-reg moves of the current translation.
591 There are, however, two unforeseen bad side effects:
593 <li>This doesn't work properly, because it doesn't observe the normal
594 conventions for x86 FP code generation. It turns out that each of
595 the 8 elements in the x86 FP register stack has a tag bit which
596 indicates whether or not that slot contains a valid value. If you
597 do a FPU operation which happens to read such a value, you get a
598 x87 FPU exception, which is normally handled by the FPU without
599 passing it to the OS: the program keeps going, but the resulting
600 FP values are garbage. (The OS can ask for the FPU to pass it FP
601 stack-invalid exceptions, but it usually doesn't).
603 Anyways: inside NCG created x86 FP code this all works fine, but
604 when control returns to a gcc-generated world, the stack tag bits
605 soon cause stack exceptions, and thus garbage results.
607 The only fix I could think of -- and it is horrible -- is to clear
608 all the tag bits just before the next STG-level entry, in chunks
609 of code which use FP insns. <code>i386_insert_ffrees</code>
610 inserts the relevant <code>ffree</code> insns into such code
611 blocks. It depends critically on <code>is_G_instr</code> to
614 <li>It's very difficult to read the generated assembly and
615 reason about it when debugging, because there's so much clutter.
616 We print the fake insns as comments in the output, and that helps
622 <h3>Generating code for ccalls</h3>
624 For reasons I don't really understand, the instruction selectors for
625 generating calls to C (<code>genCCall</code>) have proven surprisingly
626 difficult to get right, and soaked up a lot of debugging time. As a
627 result, I have once again opted for schemes which are simple and not
628 too difficult to argue as correct, even if they don't generate
631 The sparc ccall generator in particular forces all arguments into
632 temporary virtual registers before moving them to the final
633 out-registers (<code>%o0</code> .. <code>%o5</code>). This creates
634 some unnecessary reg-reg moves. The reason is explained in a
638 <h3>Duplicate implementation for many STG macros</h3>
640 This has been discussed at length already. It has caused a couple of
641 nasty bugs due to subtle untracked divergence in the macro
642 translations. The macro-expander really should be pushed up into the
643 Abstract C phase, so the problem can't happen.
645 Doing so would have the added benefit that the NCG could be used to
646 compile more "ways" -- well, at least the 'p' profiling way.
649 <h3>How to debug the NCG without losing your sanity/hair/cool</h3>
651 Last, but definitely not least ...
653 The usual syndrome is that some program, when compiled via C, works,
654 but not when compiled via the NCG. Usually the problem is fairly
655 simple to fix, once you find the specific code block which has been
656 mistranslated. But the latter can be nearly impossible, since most
657 modules generate at least hundreds and often thousands of them.
661 Because the via-C and native routes diverge only late in the day,
662 it is not difficult to construct a 1-1 correspondence between basic
663 blocks on the two routes. So, if the program works via C but not on
664 the NCG, do the following:
666 <li>Recompile <code>AsmCodeGen.lhs</code> in the afflicted compiler
667 with <code>-DDEBUG_NCG</code>, so that it inserts
668 <code>___ncg_debug_marker</code>s
669 into the assembly it emits.
671 <li>Using a binary search on modules, find the module which is causing
674 <li>Compile that module to assembly code, with identical flags, twice,
675 once via C and once via NCG.
676 Call the outputs <code>ModuleName.s-gcc</code> and
677 <code>ModuleName.s-nat</code>. Check that the latter does indeed have
678 <code>___ncg_debug_marker</code>s in it; otherwise the next steps fail.
680 <li>Build (with a working compiler) the program
681 <code>fptools/ghc/utils/debugNCG/diff_gcc_nat</code>.
683 <li>Run: <code>diff_gcc_nat ModuleName.s</code>. This will
684 construct the 1-1 correspondence, and emits on stdout
685 a cppable assembly output. Place this in a file -- I always
686 call it <code>synth.S</code>. Note, the capital S is important;
687 otherwise it won't get cpp'd. You can feed this file directly to
688 ghc and it will automatically get cpp'd; you don't have to do so
691 <li>By messing with the <code>#define</code>s at the top of
692 <code>synth.S</code>, do a binary search to find the incorrect
693 block. Keep a careful record of where you are in the search; it
694 is easy to get confused. Remember also that multiple blocks may
695 be wrong, which also confuses matters. Finally, I usually start
696 off by re-checking that I can build the executable with all the
697 <code>#define</code>s set to 0 and then all to 1. This ensures
698 you won't get halfway through the search and then get stuck due to
699 some snafu with gcc-specific literals. Usually I set
700 <code>UNMATCHED_GCC</code> to 1 all the time, and this bit should
701 contain only literal data.
702 <code>UNMATCHED_NAT</code> should be empty.
705 <code>diff_gcc_nat</code> was known to work correctly last time I used
706 it, in December 01, for both x86 and sparc. If it doesn't work, due
707 to changes in assembly syntax, or whatever, make it work. The
708 investment is well worth it. Searching for the incorrect block(s) any
709 other way is a total time waster.
720 Last modified: Fri Feb 1 16:14:11 GMT 2002