ghc/docs/comm/the-beast/ncg.html

   1 <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
   2 <html>
   3   <head>
   4     <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
   5     <title>The GHC Commentary - The Native Code Generator</title>
   6   </head>
   7
   8   <body BGCOLOR="FFFFFF">
   9     <h1>The GHC Commentary - The Native Code Generator</h1>
  10     <p>
  11       On x86 and sparc platforms, GHC can generate assembly code
  12       directly, without having to go via C.  This can sometimes almost
  13       halve compilation time, and avoids the fragility and
  14       horribleness of the mangler.  The NCG is enabled by default for
  15       non-optimising compilation on x86 and sparc.  For most programs
  16       it generates code which runs only about 1% slower than that
  17       created by gcc on x86s, so it is well worth using even with
  18       optimised compilation.  FP-intensive x86 programs see a bigger
  19       slowdown, and all sparc code runs about 5% slower due to
  20       us not filling branch delay slots.
  21     <p>
  22       In the distant past
  23       the NCG could also generate Alpha code, and that machinery
  24       is still there, but will need extensive refurbishment to
  25       get it going again, due to underlying infrastructural changes.
  26       Budding hackers thinking of doing a PowerPC port would do well
  27       to use the sparc bits as a starting point.
  28     <p>
  29       The NCG has always been something of a second-class citizen
  30       inside GHC, an unloved child, rather.  This means that its
  31       integration into the compiler as a whole is rather clumsy, which
  32       brings some problems described below.  That apart, the NCG
  33       proper is fairly cleanly designed, as target-independent as it
  34       reasonably can be, and so should not be difficult to retarget.
  35     <p>
  36       The following details are correct as per the CVS head of end-Jan
  37       2002.
  38
  39     <h2>Overview</h2>
  40       The top-level code generator fn is
  41     <p>
  42     <code>absCtoNat :: AbstractC -> UniqSM (SDoc, Pretty.Doc)</code>
  43     <p>
  44     The returned <code>SDoc</code> is for debugging, so is empty unless
  45     you specify <code>-ddump-stix</code>.  The <code>Pretty.Doc</code>
  46     bit is the final assembly code.  Translation involves three main
  47     phases, the first and third of which are target-independent.
  48     <ul>
  49     <li><b>Translation into the <code>Stix</code> representation.</b>  Stix
  50         is a simple tree-like RTL-style language, in which you can
  51         mention:
  52         <p>
  53         <ul>
  54         <li>An infinite number of temporary, virtual registers.
  55         <li>The STG "magic" registers (<code>MagicId</code>), such as
  56             the heap and stack pointers.
  57         <li>Literals and low-level machine ops (<code>MachOp</code>).
  58         <li>Simple address computations.
  59         <li>Reads and writes of: memory, virtual regs, and various STG
  60             regs.
  61         <li>Labels and <code>if ... goto ...</code> style control-flow.
  62         </ul>
  63         <p>
  64         Stix has two main associated types:
  65         <p>
  66         <ul>
  67         <li><code>StixStmt</code> -- trees executed for their side
  68         effects: assignments, control transfers, and auxiliary junk
  69         such as segment changes and literal data.
  70         <li><code>StixExpr</code> -- trees which denote a value.
  71         </ul>
  72         <p>
  73         Translation into Stix is almost completely
  74         target-independent.  Needed dependencies are knowledge of
  75         word size and endianness, used when generating code to do
  76         deal with half-word fields in info tables.  This could be
  77         abstracted out easily enough.  Also, the Stix translation
  78         needs to know which <code>MagicId</code>s map to registers
  79         on the given target, and which are stored in offsets from
  80         <code>BaseReg</code>.
  81         <p>
  82         After initial Stix generation, the trees are cleaned up with
  83         constant-folding and a little copy-propagation ("Stix
  84         inlining", as the code misleadingly calls it).  We take
  85         the opportunity to translate <code>MagicId</code>s which are
  86         stored in memory on the given target, into suitable memory
  87         references.  Those which are stored in registers are left
  88         alone.  There is also a half-hearted attempt to lift literal
  89         strings to the top level in cases where nested strings have
  90         been observed to give incorrect code in the past.
  91         <p>
  92         Primitive machine-level operations will already be phrased in
  93         terms of <code>MachOp</code>s in the presented Abstract C, and
  94         these are passed through unchanged.  We comment only that the
  95         <code>MachOp</code>s have been chosen so as to be easy to
  96         implement on all targets, and their meaning is intended to be
  97         unambiguous, and the same on all targets, regardless of word
  98         size or endianness.
  99         <p>
 100         <b>A note on <code>MagicId</code>s.</b>
 101         Those which are assigned to
 102         registers on the current target are left unmodified.  Those
 103         which are not are stored in memory as offsets from
 104         <code>BaseReg</code> (which is assumed to permanently have the
 105         value <code>(&MainCapability.r)</code>), so the constant folder
 106         calculates the offsets and inserts suitable loads/stores.  One
 107         complication is that not all archs have <code>BaseReg</code>
 108         itself in a register, so for those (sparc), we instead
 109         generate the address as an offset from the static symbol
 110         <code>MainCapability</code>, since the register table lives in
 111         there.
 112         <p>
 113         Finally, <code>BaseReg</code> does occasionally itself get
 114         mentioned in Stix expression trees, and in this case what is
 115         denoted is precisely <code>(&MainCapability.r)</code>, not, as
 116         in all other cases, the value of memory at some offset from
 117         the start of the register table.  Since what it denotes is an
 118         r-value and not an l-value, assigning <code>BaseReg</code> is
 119         meaningless, so the machinery checks to ensure this never
 120         happens.  All these details are taken into account by the
 121         constant folder.
 122     <p>
 123     <li><b>Instruction selection.</b>  This is the only majorly
 124         target-specific phase.  It turns Stix statements and
 125         expressions into sequences of <code>Instr</code>, a data
 126         type which is different for each architecture.
 127         <code>Instr</code>, unsurprisingly, has various supporting
 128         types, such as <code>Reg</code>, <code>Operand</code>,
 129         <code>Imm</code>, etc.  The generated instructions may refer
 130         to specific machine registers, or to arbitrary virtual
 131         registers, either those created within the instruction
 132         selector, or those mentioned in the Stix passed to it.
 133         <p>
 134         The instruction selectors live in <code>MachCode.lhs</code>.
 135         The core functions, for each target, are:
 136         <p>
 137         <code>
 138         getAmode :: StixExpr -> NatM Amode
 139         <br>getRegister :: StixExpr -> NatM Register
 140         <br>assignMem_IntCode :: PrimRep -> StixExpr -> StixExpr -> NatM InstrBlock
 141         <br>assignReg_IntCode :: PrimRep -> StixReg  -> StixExpr -> NatM InstrBlock
 142         </code>
 143         <p>
 144         The insn selectors use the "maximal munch" algorithm.  The
 145         bizarrely-misnamed <code>getRegister</code> translates
 146         expressions.  A simplified version of its type is:
 147         <p>
 148         <code>getRegister :: StixExpr -> NatM (OrdList Instr, Reg)</code>
 149         <p>
 150         That is: it (monadically) turns a <code>StixExpr</code> into a
 151         sequence of instructions, and a register, with the meaning
 152         that after executing the (possibly empty) sequence of
 153         instructions, the (possibly virtual) register will
 154         hold the resulting value.  The real situation is complicated
 155         by the presence of fixed registers, and is detailed below.
 156         <p>
 157         Maximal munch is a greedy algorithm and is known not to give
 158         globally optimal code sequences, but it is good enough, and
 159         fast and simple.  Early incarnations of the NCG used something
 160         more sophisticated, but that is long gone now.
 161         <p>
 162         Similarly, <code>getAmode</code> translates a value, intended
 163         to denote an address, into a sequence of insns leading up to
 164         a (processor-specific) addressing mode.  This stuff could be
 165         done using the general <code>getRegister</code> selector, but
 166         would necessarily generate poorer code, because the calculated
 167         address would be forced into a register, which might be
 168         unnecessary if it could partially or wholly be calculated
 169         using an addressing mode.
 170         <p>
 171         Finally, <code>assignMem_IntCode</code> and
 172         <code>assignReg_IntCode</code> create instruction sequences to
 173         calculate a value and store it in the given register, or at
 174         the given address.  Because these guys translate a statement,
 175         not a value, they just return a sequence of insns and no
 176         associated register.  Floating-point and 64-bit integer
 177         assignments have analogous selectors.
 178         <p>
 179         Apart from the complexities of fixed vs floating registers,
 180         discussed below, the instruction selector is as simple
 181         as it can be.  It looks long and scary but detailed
 182         examination reveals it to be fairly straightforward.
 183     <p>
 184     <li><b>Register allocation.</b> The register allocator,
 185         <code>AsmRegAlloc.lhs</code> takes sequences of
 186         <code>Instr</code>s which mention a mixture of real and
 187         virtual registers, and returns a modified sequence referring
 188         only to real ones.  It is gloriously and entirely
 189         target-independent.  Well, not exactly true.  Instead it
 190         regards <code>Instr</code> (instructions) and <code>Reg</code>
 191         (virtual and real registers) as abstract types, to which it has
 192         the following interface:
 193         <p>
 194         <code>
 195         insnFuture :: Instr -> InsnFuture
 196         <br>regUsage :: Instr -> RegUsage
 197         <br>patchRegs :: Instr -> (Reg -> Reg) -> Instr
 198         </code>
 199         <p>
 200         <code>insnFuture</code> is used to (re)construct the graph of
 201         all possible control transfers between the insns to be
 202         allocated.  <code>regUsage</code> returns the sets of registers
 203         read and written by an instruction.  And
 204         <code>patchRegs</code> is used to apply the allocator's final
 205         decision on virtual-to-real reg mapping to an instruction.
 206         <p>
 207         Clearly these 3 fns have to be written anew for each
 208         architecture.  They are defined in
 209         <code>RegAllocInfo.lhs</code>.  Think twice, no, thrice,
 210         before modifying them: making false claims about insn
 211         behaviour will lead to hard-to-find register allocation
 212         errors.
 213         <p>
 214         <code>AsmRegAlloc.lhs</code> contains detailed comments about
 215         how the allocator works.  Here is a summary.  The head honcho
 216         <p>
 217         <code>allocUsingTheseRegs :: [Instr] -> [Reg] -> (Bool, [Instr])</code>
 218         <p>
 219         takes a list of instructions and a list of real registers
 220         available for allocation, and maps as many of the virtual regs
 221         in the input into real ones as it can.  The returned
 222         <code>Bool</code> indicates whether or not it was
 223         successful.  If so, that's the end of it.  If not, the caller
 224         of <code>allocUsingTheseRegs</code> will attempt spilling.
 225         More of that later.  What <code>allocUsingTheseRegs</code>
 226         does is:
 227         <p>
 228         <ul>
 229         <li>Implicitly number each instruction by its position in the
 230             input list.
 231         <p>
 232         <li>Using <code>insnFuture</code>, create the set of all flow
 233             edges -- possible control transfers -- within this set of
 234             insns.
 235         <p>
 236         <li>Using <code>regUsage</code> and iterating around the flow
 237             graph from the previous step, calculate, for each virtual
 238             register, the set of flow edges on which it is live.
 239         <p>
 240         <li>Make a real-register committment map, which gives the set
 241             of edges for which each real register is committed (in
 242             use).  These sets are initially empty.  For each virtual
 243             register, attempt to find a real register whose current
 244             committment does not intersect that of the virtual
 245             register -- ie, is uncommitted on all edges that the
 246             virtual reg is live.  If successful, this means the vreg
 247             can be assigned to the realreg, so add the vreg's set to
 248             the realreg's committment.
 249         <p>
 250         <li>If all the vregs were assigned to a realreg, use
 251             <code>patchInstr</code> to apply the mapping to the insns themselves.
 252         </ul>
 253         <p>
 254         <b>Spilling</b>
 255         <p>
 256         If <code>allocUsingTheseRegs</code> fails, a baroque
 257         mechanism comes into play.  We now know that much simpler
 258         schemes are available to do the same thing and give better
 259         results.
 260         Anyways:
 261         <p>
 262         The logic above <code>allocUsingTheseRegs</code>, in
 263         <code>doGeneralAlloc</code> and <code>runRegAllocate</code>,
 264         observe that allocation has failed with some set R of real
 265         registers.  So they apply <code>runRegAllocate</code> a second
 266         time to the code, but remove (typically) two registers from R
 267         before doing so.  This naturally fails too, but returns a
 268         partially-allocated sequence.  <code>doGeneralAlloc</code>
 269         then inserts spill code into the sequence, and finally re-runs
 270         <code>allocUsingTheseRegs</code>, but supplying the original,
 271         unadulterated R.  This is guaranteed to succeed since the two
 272         registers previously removed from R are sufficient to allocate
 273         all the spill/restore instructions added.
 274         <p>
 275         Because x86 is very short of registers, and in the worst case
 276         needs three removed from R, a softly-softly approach is used.
 277         <code>doGeneralAlloc</code> first tries with zero regs removed
 278         from R, then if that fails one, then two, etc.  This means
 279         <code>allocUsingTheseRegs</code> may get run several times
 280         before a successful arrangement is arrived at.
 281         <code>findReservedRegs</code> cooks up the sets of spill
 282         registers to try with.
 283         <p>
 284         The resulting machinery is complicated and the generated spill
 285         code is appalling.  The saving grace is that spills are very
 286         rare so it doesn't matter much.  I did not invent this -- I inherited it.
 287         <p>
 288         <b>Dealing with common cases fast</b>
 289         <p>
 290         The entire reg-alloc mechanism described so far is general and
 291         correct, but expensive overkill for many simple code blocks.
 292         So to begin with we use
 293         <code>doSimpleAlloc</code>, which attempts to do something
 294         simple.  It exploits the observation that if the total number
 295         of virtual registers does not exceed the number of real ones
 296         available, we can simply dole out a new realreg each time we
 297         see mention of a new vreg, with no regard for control flow.
 298         <code>doSimpleAlloc</code> therefore attempts this in a
 299         single pass over the code.  It gives up if it runs out of real
 300         regs or sees any condition which renders the above observation
 301         invalid (fixed reg uses, for example).
 302         <p>
 303         This clever hack handles the majority of code blocks quickly.
 304         It was copied from the previous reg-allocator (the
 305         Mattson/Partain/Marlow/Gill one).
 306     </ul>
 307
 308 <p>
 309 <h2>Complications, observations, and possible improvements</h2>
 310
 311 <h3>Real vs virtual registers in the instruction selectors</h3>
 312
 313 The instruction selectors for expression trees, namely
 314 <code>getRegister</code>, are complicated by the fact that some
 315 expressions can only be computed into a specific register, whereas
 316 the majority can be computed into any register.  We take x86 as an
 317 example, but the problem applies to all archs.
 318 <p>
 319 Terminology: <em>rreg</em> means real register, a real machine
 320 register.  <em>vreg</em> means one of an infinite set of virtual
 321 registers.  The type <code>Reg</code> is the sum of <em>rreg</em> and
 322 <em>vreg</em>.  The instruction selector generates sequences with
 323 unconstrained use of vregs, leaving the register allocator to map them
 324 all into rregs.
 325 <p>
 326 Now, where was I ?  Oh yes.  We return to the type of
 327 <code>getRegister</code>, which despite its name, selects instructions
 328 to compute the value of an expression tree.
 329 <pre>
 330    getRegister :: StixExpr -> NatM Register
 331
 332    data Register
 333      = Fixed   PrimRep Reg InstrBlock
 334      | Any     PrimRep (Reg -> InstrBlock)
 335
 336    type InstrBlock -- sequence of instructions
 337 </pre>
 338 At first this looks eminently reasonable (apart from the stupid
 339 name).  <code>getRegister</code>, and nobody else, knows whether or
 340 not a given expression has to be computed into a fixed rreg or can be
 341 computed into any rreg or vreg.  In the first case, it returns
 342 <code>Fixed</code> and indicates which rreg the result is in.  In the
 343 second case it defers committing to any specific target register by
 344 returning a function from <code>Reg</code> to <code>InstrBlock</code>,
 345 and the caller can specify the target reg as it sees fit.
 346 <p>
 347 Unfortunately, that forces <code>getRegister</code>'s callers (usually
 348 itself) to use a clumsy and confusing idiom in the common case where
 349 they do not care what register the result winds up in.  The reason is
 350 that although a value might be computed into a fixed rreg, we are
 351 forbidden (on pain of segmentation fault :) from subsequently
 352 modifying the fixed reg.  This and other rules are record in "Rules of
 353 the game" inside <code>MachCode.lhs</code>.
 354 <p>
 355 Why can't fixed registers be modified post-hoc?  Consider a simple
 356 expression like <code>Hp+1</code>.  Since the heap pointer
 357 <code>Hp</code> is definitely in a fixed register, call it R,
 358 <code>getRegister</code> on subterm <code>Hp</code> will simply return
 359 <code>Fixed</code> with an empty sequence and R.  But we can't just
 360 emit an increment instruction for R, because that trashes
 361 <code>Hp</code>; instead we first have to copy it into a fresh vreg
 362 and increment that.
 363 <p>
 364 With all that in mind, consider now writing a <code>getRegister</code>
 365 clause for terms of the form <code>(1 + E)</code>.  Contrived, yes,
 366 but illustrates the matter.  First we do
 367 <code>getRegister</code> on E.  Now we are forced to examine what
 368 comes back.
 369 <pre>
 370    getRegister (OnePlus e)
 371       = getRegister e           `thenNat`   \ e_result ->
 372         case e_result of
 373            Fixed e_code e_fixed
 374               -> returnNat (Any IntRep (\dst -> e_code ++ [MOV e_fixed dst, INC dst]))
 375            Any e_any
 376               -> Any (\dst -> e_any dst ++ [INC dst])
 377 </pre>
 378 This seems unreasonably cumbersome, yet the instruction selector is
 379 full of such idioms.  A good example of the complexities induced by
 380 this scheme is shown by <code>trivialCode</code> for x86 in
 381 <code>MachCode.lhs</code>.  This deals with general integer dyadic
 382 operations on x86 and has numerous cases.  It was difficult to get
 383 right.
 384 <p>
 385 An alternative suggestion is to simplify the type of
 386 <code>getRegister</code> to this:
 387 <pre>
 388    getRegister :: StixExpr -> NatM (InstrBloc, VReg)
 389    type VReg = .... a vreg ...
 390 </pre>
 391 and then we could safely write
 392 <pre>
 393    getRegister (OnePlus e)
 394       = getRegister e        `thenNat`  \ (e_code, e_vreg) ->
 395         returnNat (e_code ++ [INC e_vreg], e_vreg)
 396 </pre>
 397 which is about as straightforward as you could hope for.
 398 Unfortunately, it requires <code>getRegister</code> to insert moves of
 399 values which naturally compute into an rreg, into a vreg.  Consider:
 400 <pre>
 401    1 + ccall some-C-fn
 402 </pre>
 403 On x86 the ccall result is returned in rreg <code>%eax</code>.  The
 404 resulting sequence, prior to register allocation, would be:
 405 <pre>
 406    # push args
 407    call some-C-fn
 408    # move %esp to nuke args
 409    movl   %eax, %vreg
 410    incl   %vreg
 411 </pre>
 412 If, as is likely, <code>%eax</code> is not held live beyond this point
 413 for any other purpose, the move into a fresh register is pointless;
 414 we'd have been better off leaving the value in <code>%eax</code> as
 415 long as possible.
 416 <p>
 417 The simplified <code>getRegister</code> story is attractive.  It would
 418 clean up the instruction selectors significantly and make it simpler
 419 to write new ones.  The only drawback is that it generates redundant
 420 register moves.  I suggest that eliminating these should be the job
 421 of the register allocator.  Indeed:
 422 <ul>
 423 <li>There has been some work on this already ("Iterated register
 424     coalescing" ?), so this isn't a new idea.
 425 <p>
 426 <li>You could argue that the existing scheme inappropriately blurs the
 427     boundary between the instruction selector and the register
 428     allocator.  The instruction selector should .. well .. just
 429     select instructions, without having to futz around worrying about
 430     what kind of registers subtrees get generated into.  Register
 431     allocation should be <em>entirely</em> the domain of the register
 432     allocator, with the proviso that it should endeavour to allocate
 433     registers so as to minimise the number of non-redundant reg-reg
 434     moves in the final output.
 435 </ul>
 436
 437
 438 <h3>Selecting insns for 64-bit values/loads/stores on 32-bit platforms</h3>
 439
 440 Note that this stuff doesn't apply on 64-bit archs, since the
 441 <code>getRegister</code> mechanism applies there.
 442
 443 The relevant functions are:
 444 <pre>
 445    assignMem_I64Code :: StixExpr -> StixExpr -> NatM InstrBlock
 446    assignReg_I64Code :: StixReg  -> StixExpr -> NatM InstrBlock
 447    iselExpr64        :: StixExpr -> NatM ChildCode64
 448
 449    data ChildCode64     -- a.k.a "Register64"
 450       = ChildCode64
 451            InstrBlock   -- code
 452            VRegUnique   -- unique for the lower 32-bit temporary
 453 </pre>
 454 <code>iselExpr64</code> is the 64-bit, plausibly-named analogue of
 455 <code>getRegister</code>, and <code>ChildCode64</code> is the analogue
 456 of <code>Register</code>.  The aim here was to generate working 64
 457 bit code as simply as possible.  To this end, I used the
 458 simplified <code>getRegister</code> scheme described above, in which
 459 <code>iselExpr64</code>generates its results into two vregs which
 460 can always safely be modified afterwards.
 461 <p>
 462 Virtual registers are, unsurprisingly, distinguished by their
 463 <code>Unique</code>s.  There is a small difficulty in how to
 464 know what the vreg for the upper 32 bits of a value is, given the vreg
 465 for the lower 32 bits.  The simple solution adopted is to say that
 466 any low-32 vreg may also have a hi-32 counterpart which shares the
 467 same unique, but is otherwise regarded as a separate entity.
 468 <code>getHiVRegFromLo</code> gets one from the other.
 469 <pre>
 470    data VRegUnique
 471       = VRegUniqueLo Unique          -- lower part of a split quantity
 472       | VRegUniqueHi Unique          -- upper part thereof
 473 </pre>
 474 Apart from that, 64-bit code generation is really simple.  The sparc
 475 and x86 versions are almost copy-n-pastes of each other, with minor
 476 adjustments for endianness.  The generated code isn't wonderful but
 477 is certainly acceptable, and it works.
 478
 479
 480
 481 <h3>Shortcomings and inefficiencies in the register allocator</h3>
 482
 483 <h4>Redundant reconstruction of the control flow graph</h4>
 484
 485 The allocator goes to considerable computational expense to construct
 486 all the flow edges in the group of instructions it's allocating for,
 487 by using the <code>insnFuture</code> function in the
 488 <code>Instr</code> pseudo-abstract type.
 489 <p>
 490 This is really silly, because all that information is present at the
 491 abstract C stage, but is thrown away in the translation to Stix.
 492 So a good thing to do is to modify that translation to
 493 produce a directed graph of Stix straight-line code blocks,
 494 and to preserve that structure through the insn selector, so the
 495 allocator can see it.
 496 <p>
 497 This would eliminate the fragile, hacky, arch-specific
 498 <code>insnFuture</code> mechanism, and probably make the whole
 499 compiler run measurably faster.  Register allocation is a fair chunk
 500 of the time of non-optimising compilation (10% or more), and
 501 reconstructing the flow graph is an expensive part of reg-alloc.
 502 It would probably accelerate the vreg liveness computation too.
 503
 504 <h4>Really ridiculous method for doing spilling</h4>
 505
 506 This is a more ambitious suggestion, but ... reg-alloc should be
 507 reimplemented, using the scheme described in "Quality and speed in
 508 linear-scan register allocation."  (Traub?)  For straight-line code
 509 blocks, this gives an elegant one-pass algorithm for assigning
 510 registers and creating the minimal necessary spill code, without the
 511 need for reserving spill registers ahead of time.
 512 <p>
 513 I tried it in Rigr, replacing the previous spiller which used the
 514 current GHC scheme described above, and it cut the number of spill
 515 loads and stores by a factor of eight.  Not to mention being simpler,
 516 easier to understand and very fast.
 517 <p>
 518 The Traub paper also describes how to extend their method to multiple
 519 basic blocks, which will be needed for GHC.  It comes down to
 520 reconciling multiple vreg-to-rreg mappings at points where control
 521 flow merges.
 522
 523 <h4>Redundant-move support for revised instruction selector suggestion</h4>
 524
 525 As mentioned above, simplifying the instruction selector will require
 526 the register allocator to try and allocate source and destination
 527 vregs to the same rreg in reg-reg moves, so as to make as many as
 528 possible go away.  Without that, the revised insn selector would
 529 generate worse code than at present.  I know this stuff has been done
 530 but know nothing about it.  The Linear-scan reg-alloc paper mentioned
 531 above does indeed mention a bit about it in the context of single
 532 basic blocks, but I don't know if that's sufficient.
 533
 534
 535
 536 <h3>x86 arcana that you should know about</h3>
 537
 538 The main difficulty with x86 is that many instructions have fixed
 539 register constraints, which can occasionally make reg-alloc fail
 540 completely.  And the FPU doesn't have the flat register model which
 541 the reg-alloc abstraction (implicitly) assumes.
 542 <p>
 543 Our strategy is: do a good job for the common small subset, that is
 544 integer loads, stores, address calculations, basic ALU ops (+, -,
 545 and, or, xor), and jumps.  That covers the vast majority of
 546 executed insns.  And indeed we do do a good job, with a loss of
 547 less than 2% compared with gcc.
 548 <p>
 549 Initially we tried to handle integer instructions with awkward
 550 register constraints (mul, div, shifts by non-constant amounts) via
 551 various jigglings of the spiller et al.  This never worked robustly,
 552 and putting platform-specific tweaks in the generic infrastructure is
 553 a big No-No.  (Not quite true; shifts by a non-constant amount are
 554 still done by a giant kludge, and should be moved into this new
 555 framework.)
 556 <p>
 557 Fortunately, all such insns are rare.  So the current scheme is to
 558 pretend that they don't have any such constraints.  This fiction is
 559 carried all the way through the register allocator.  When the insn
 560 finally comes to be printed, we emit a sequence which copies the
 561 operands through memory (<code>%esp</code>-relative), satisfying the
 562 constraints of the real instruction.  This localises the gruesomeness
 563 to just one place.  Here, for example, is the code generated for
 564 integer divison of <code>%esi</code> by <code>%ecx</code>:
 565 <pre>
 566    # BEGIN IQUOT %ecx, %esi
 567    pushl $0
 568    pushl %eax
 569    pushl %edx
 570    pushl %ecx
 571    movl  %esi,% eax
 572    cltd
 573    idivl 0(%esp)
 574    movl %eax, 12(%esp)
 575    popl %edx
 576    popl %edx
 577    popl %eax
 578    popl %esi
 579    # END   IQUOT %ecx, %esi
 580 </pre>
 581 This is not quite as appalling as it seems, if you consider that the
 582 division itself typically takes 16+ cycles, whereas the rest of the
 583 insns probably go through in about 1 cycle each.
 584 <p>
 585 This trick is taken to extremes for FP operations.
 586 <p>
 587 All notions of the x86 FP stack and its insns have been removed.
 588 Instead, we pretend, to the instruction selector and register
 589 allocator, that x86 has six floating point registers,
 590 <code>%fake0</code> .. <code>%fake5</code>, which can be used in the
 591 usual flat manner.  We further claim that x86 has floating point
 592 instructions very similar to SPARC and Alpha, that is, a simple
 593 3-operand register-register arrangement.  Code generation and register
 594 allocation proceed on this basis.
 595 <p>
 596 When we come to print out the final assembly, our convenient fiction
 597 is converted to dismal reality.  Each fake instruction is
 598 independently converted to a series of real x86 instructions.
 599 <code>%fake0</code> .. <code>%fake5</code> are mapped to
 600 <code>%st(0)</code> .. <code>%st(5)</code>.  To do reg-reg arithmetic
 601 operations, the two operands are pushed onto the top of the FP stack,
 602 the operation done, and the result copied back into the relevant
 603 register.  When one of the operands is also the destination, we emit a
 604 slightly less scummy translation.  There are only six
 605 <code>%fake</code> registers because 2 are needed for the translation,
 606 and x86 has 8 in total.
 607 <p>
 608 The translation is inefficient but is simple and it works.  A cleverer
 609 translation would handle a sequence of insns, simulating the FP stack
 610 contents, would not impose a fixed mapping from <code>%fake</code> to
 611 <code>%st</code> regs, and hopefully could avoid most of the redundant
 612 reg-reg moves of the current translation.
 613 <p>
 614 There are, however, two unforeseen bad side effects:
 615 <ul>
 616 <li>This doesn't work properly, because it doesn't observe the normal
 617     conventions for x86 FP code generation.  It turns out that each of
 618     the 8 elements in the x86 FP register stack has a tag bit which
 619     indicates whether or not that register is notionally in use or
 620     not.  If you do a FPU operation which happens to read a
 621     tagged-as-empty register, you get an x87 FPU (stack invalid)
 622     exception, which is normally handled by the FPU without passing it
 623     to the OS: the program keeps going, but the resulting FP values
 624     are garbage.  The OS can ask for the FPU to pass it FP
 625     stack-invalid exceptions, but it usually doesn't.
 626     <p>
 627     Anyways: inside NCG created x86 FP code this all works fine.
 628     However, the NCG's fiction of a flat register set does not operate
 629     the x87 register stack in the required stack-like way.  When
 630     control returns to a gcc-generated world, the stack tag bits soon
 631     cause stack exceptions, and thus garbage results.
 632     <p>
 633     The only fix I could think of -- and it is horrible -- is to clear
 634     all the tag bits just before the next STG-level entry, in chunks
 635     of code which use FP insns.  <code>i386_insert_ffrees</code>
 636     inserts the relevant <code>ffree</code> insns into such code
 637     blocks.  It depends critically on <code>is_G_instr</code> to
 638     detect such blocks.
 639 <p>
 640 <li>It's very difficult to read the generated assembly and
 641     reason about it when debugging, because there's so much clutter.
 642     We print the fake insns as comments in the output, and that helps
 643     a bit.
 644 </ul>
 645
 646
 647
 648 <h3>Generating code for ccalls</h3>
 649
 650 For reasons I don't really understand, the instruction selectors for
 651 generating calls to C (<code>genCCall</code>) have proven surprisingly
 652 difficult to get right, and soaked up a lot of debugging time.  As a
 653 result, I have once again opted for schemes which are simple and not
 654 too difficult to argue as correct, even if they don't generate
 655 excellent code.
 656 <p>
 657 The sparc ccall generator in particular forces all arguments into
 658 temporary virtual registers before moving them to the final
 659 out-registers (<code>%o0</code> .. <code>%o5</code>).  This creates
 660 some unnecessary reg-reg moves.  The reason is explained in a
 661 comment in the code.
 662
 663
 664 <h3>Duplicate implementation for many STG macros</h3>
 665
 666 This has been discussed at length already.  It has caused a couple of
 667 nasty bugs due to subtle untracked divergence in the macro
 668 translations.  The macro-expander really should be pushed up into the
 669 Abstract C phase, so the problem can't happen.
 670 <p>
 671 Doing so would have the added benefit that the NCG could be used to
 672 compile more "ways" -- well, at least the 'p' profiling way.
 673
 674
 675 <h3>How to debug the NCG without losing your sanity/hair/cool</h3>
 676
 677 Last, but definitely not least ...
 678 <p>
 679 The usual syndrome is that some program, when compiled via C, works,
 680 but not when compiled via the NCG.  Usually the problem is fairly
 681 simple to fix, once you find the specific code block which has been
 682 mistranslated.  But the latter can be nearly impossible, since most
 683 modules generate at least hundreds and often thousands of them.
 684 <p>
 685 My solution: cheat.
 686 <p>
 687 Because the via-C and native routes diverge only late in the day,
 688 it is not difficult to construct a 1-1 correspondence between basic
 689 blocks on the two routes.  So, if the program works via C but not on
 690 the NCG, do the following:
 691 <ul>
 692 <li>Recompile <code>AsmCodeGen.lhs</code> in the afflicted compiler
 693     with <code>-DDEBUG_NCG</code>, so that it inserts
 694     <code>___ncg_debug_marker</code>s
 695     into the assembly it emits.
 696 <p>
 697 <li>Using a binary search on modules, find the module which is causing
 698     the problem.
 699 <p>
 700 <li>Compile that module to assembly code, with identical flags, twice,
 701     once via C and once via NCG.
 702     Call the outputs <code>ModuleName.s-gcc</code> and
 703     <code>ModuleName.s-nat</code>.  Check that the latter does indeed have
 704     <code>___ncg_debug_marker</code>s in it; otherwise the next steps fail.
 705 <p>
 706 <li>Build (with a working compiler) the program
 707     <code>fptools/ghc/utils/debugNCG/diff_gcc_nat</code>.
 708 <p>
 709 <li>Run: <code>diff_gcc_nat ModuleName.s</code>.  This will
 710     construct the 1-1 correspondence, and emits on stdout
 711     a cppable assembly output.  Place this in a file -- I always
 712     call it <code>synth.S</code>.  Note, the capital S is important;
 713     otherwise it won't get cpp'd.  You can feed this file directly to
 714     ghc and it will automatically get cpp'd; you don't have to do so
 715     yourself.
 716 <p>
 717 <li>By messing with the <code>#define</code>s at the top of
 718     <code>synth.S</code>, do a binary search to find the incorrect
 719     block.  Keep a careful record of where you are in the search; it
 720     is easy to get confused.  Remember also that multiple blocks may
 721     be wrong, which also confuses matters.  Finally, I usually start
 722     off by re-checking that I can build the executable with all the
 723     <code>#define</code>s set to 0 and then all to 1.  This ensures
 724     you won't get halfway through the search and then get stuck due to
 725     some snafu with gcc-specific literals.  Usually I set
 726     <code>UNMATCHED_GCC</code> to 1 all the time, and this bit should
 727     contain only literal data.
 728     <code>UNMATCHED_NAT</code> should be empty.
 729 </ul>
 730 <p>
 731 <code>diff_gcc_nat</code> was known to work correctly last time I used
 732 it, in December 01, for both x86 and sparc.  If it doesn't work, due
 733 to changes in assembly syntax, or whatever, make it work.  The
 734 investment is well worth it.  Searching for the incorrect block(s) any
 735 other way is a total time waster.
 736
 737
 738
 739 </ul>
 740
 741
 742
 743
 744     <p><small>
 745 <!-- hhmts start -->
 746 Last modified: Fri Feb  1 16:14:11 GMT 2002
 747 <!-- hhmts end -->
 748     </small>
 749   </body>
 750 </html>