docs/comm/the-beast/ncg.html

   1 <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
   2 <html>
   3   <head>
   4     <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
   5     <title>The GHC Commentary - The Native Code Generator</title>
   6   </head>
   7
   8   <body BGCOLOR="FFFFFF">
   9     <h1>The GHC Commentary - The Native Code Generator</h1>
  10     <p>
  11       On some platforms (currently x86 and PowerPC, with bitrotted
  12       support for Sparc and Alpha), GHC can generate assembly code
  13       directly, without having to go via C.  This can sometimes almost
  14       halve compilation time, and avoids the fragility and
  15       horribleness of the <a href="mangler.html">mangler</a>.  The NCG
  16       is enabled by default for
  17       non-optimising compilation on supported platforms.  For most programs
  18       it generates code which runs only 1-3% slower
  19       (depending on platform and type of code) than that
  20       created by gcc on x86s, so it is well worth using even with
  21       optimised compilation.  FP-intensive x86 programs see a bigger
  22       slowdown, and all Sparc code runs about 5% slower due to
  23       us not filling branch delay slots.
  24     <p>
  25       The NCG has always been something of a second-class citizen
  26       inside GHC, an unloved child, rather.  This means that its
  27       integration into the compiler as a whole is rather clumsy, which
  28       brings some problems described below.  That apart, the NCG
  29       proper is fairly cleanly designed, as target-independent as it
  30       reasonably can be, and so should not be difficult to retarget.
  31     <p>
  32       <b>NOTE!</b> The native code generator was largely rewritten as part
  33       of the C-- backend changes, around May 2004.  Unfortunately the
  34       rest of this document still refers to the old version, and was written
  35       with relation to the CVS head as of end-Jan 2002.  Some of it is relevant,
  36       some of it isn't.
  37
  38     <h2>Overview</h2>
  39       The top-level code generator fn is
  40     <p>
  41     <code>absCtoNat :: AbstractC -> UniqSM (SDoc, Pretty.Doc)</code>
  42     <p>
  43     The returned <code>SDoc</code> is for debugging, so is empty unless
  44     you specify <code>-ddump-stix</code>.  The <code>Pretty.Doc</code>
  45     bit is the final assembly code.  Translation involves three main
  46     phases, the first and third of which are target-independent.
  47     <ul>
  48     <li><b>Translation into the <code>Stix</code> representation.</b>  Stix
  49         is a simple tree-like RTL-style language, in which you can
  50         mention:
  51         <p>
  52         <ul>
  53         <li>An infinite number of temporary, virtual registers.
  54         <li>The STG "magic" registers (<code>MagicId</code>), such as
  55             the heap and stack pointers.
  56         <li>Literals and low-level machine ops (<code>MachOp</code>).
  57         <li>Simple address computations.
  58         <li>Reads and writes of: memory, virtual regs, and various STG
  59             regs.
  60         <li>Labels and <code>if ... goto ...</code> style control-flow.
  61         </ul>
  62         <p>
  63         Stix has two main associated types:
  64         <p>
  65         <ul>
  66         <li><code>StixStmt</code> -- trees executed for their side
  67         effects: assignments, control transfers, and auxiliary junk
  68         such as segment changes and literal data.
  69         <li><code>StixExpr</code> -- trees which denote a value.
  70         </ul>
  71         <p>
  72         Translation into Stix is almost completely
  73         target-independent.  Needed dependencies are knowledge of
  74         word size and endianness, used when generating code to do
  75         deal with half-word fields in info tables.  This could be
  76         abstracted out easily enough.  Also, the Stix translation
  77         needs to know which <code>MagicId</code>s map to registers
  78         on the given target, and which are stored in offsets from
  79         <code>BaseReg</code>.
  80         <p>
  81         After initial Stix generation, the trees are cleaned up with
  82         constant-folding and a little copy-propagation ("Stix
  83         inlining", as the code misleadingly calls it).  We take
  84         the opportunity to translate <code>MagicId</code>s which are
  85         stored in memory on the given target, into suitable memory
  86         references.  Those which are stored in registers are left
  87         alone.  There is also a half-hearted attempt to lift literal
  88         strings to the top level in cases where nested strings have
  89         been observed to give incorrect code in the past.
  90         <p>
  91         Primitive machine-level operations will already be phrased in
  92         terms of <code>MachOp</code>s in the presented Abstract C, and
  93         these are passed through unchanged.  We comment only that the
  94         <code>MachOp</code>s have been chosen so as to be easy to
  95         implement on all targets, and their meaning is intended to be
  96         unambiguous, and the same on all targets, regardless of word
  97         size or endianness.
  98         <p>
  99         <b>A note on <code>MagicId</code>s.</b>
 100         Those which are assigned to
 101         registers on the current target are left unmodified.  Those
 102         which are not are stored in memory as offsets from
 103         <code>BaseReg</code> (which is assumed to permanently have the
 104         value <code>(&MainCapability.r)</code>), so the constant folder
 105         calculates the offsets and inserts suitable loads/stores.  One
 106         complication is that not all archs have <code>BaseReg</code>
 107         itself in a register, so for those (sparc), we instead
 108         generate the address as an offset from the static symbol
 109         <code>MainCapability</code>, since the register table lives in
 110         there.
 111         <p>
 112         Finally, <code>BaseReg</code> does occasionally itself get
 113         mentioned in Stix expression trees, and in this case what is
 114         denoted is precisely <code>(&MainCapability.r)</code>, not, as
 115         in all other cases, the value of memory at some offset from
 116         the start of the register table.  Since what it denotes is an
 117         r-value and not an l-value, assigning <code>BaseReg</code> is
 118         meaningless, so the machinery checks to ensure this never
 119         happens.  All these details are taken into account by the
 120         constant folder.
 121     <p>
 122     <li><b>Instruction selection.</b>  This is the only majorly
 123         target-specific phase.  It turns Stix statements and
 124         expressions into sequences of <code>Instr</code>, a data
 125         type which is different for each architecture.
 126         <code>Instr</code>, unsurprisingly, has various supporting
 127         types, such as <code>Reg</code>, <code>Operand</code>,
 128         <code>Imm</code>, etc.  The generated instructions may refer
 129         to specific machine registers, or to arbitrary virtual
 130         registers, either those created within the instruction
 131         selector, or those mentioned in the Stix passed to it.
 132         <p>
 133         The instruction selectors live in <code>MachCode.lhs</code>.
 134         The core functions, for each target, are:
 135         <p>
 136         <code>
 137         getAmode :: StixExpr -> NatM Amode
 138         <br>getRegister :: StixExpr -> NatM Register
 139         <br>assignMem_IntCode :: PrimRep -> StixExpr -> StixExpr -> NatM InstrBlock
 140         <br>assignReg_IntCode :: PrimRep -> StixReg  -> StixExpr -> NatM InstrBlock
 141         </code>
 142         <p>
 143         The insn selectors use the "maximal munch" algorithm.  The
 144         bizarrely-misnamed <code>getRegister</code> translates
 145         expressions.  A simplified version of its type is:
 146         <p>
 147         <code>getRegister :: StixExpr -> NatM (OrdList Instr, Reg)</code>
 148         <p>
 149         That is: it (monadically) turns a <code>StixExpr</code> into a
 150         sequence of instructions, and a register, with the meaning
 151         that after executing the (possibly empty) sequence of
 152         instructions, the (possibly virtual) register will
 153         hold the resulting value.  The real situation is complicated
 154         by the presence of fixed registers, and is detailed below.
 155         <p>
 156         Maximal munch is a greedy algorithm and is known not to give
 157         globally optimal code sequences, but it is good enough, and
 158         fast and simple.  Early incarnations of the NCG used something
 159         more sophisticated, but that is long gone now.
 160         <p>
 161         Similarly, <code>getAmode</code> translates a value, intended
 162         to denote an address, into a sequence of insns leading up to
 163         a (processor-specific) addressing mode.  This stuff could be
 164         done using the general <code>getRegister</code> selector, but
 165         would necessarily generate poorer code, because the calculated
 166         address would be forced into a register, which might be
 167         unnecessary if it could partially or wholly be calculated
 168         using an addressing mode.
 169         <p>
 170         Finally, <code>assignMem_IntCode</code> and
 171         <code>assignReg_IntCode</code> create instruction sequences to
 172         calculate a value and store it in the given register, or at
 173         the given address.  Because these guys translate a statement,
 174         not a value, they just return a sequence of insns and no
 175         associated register.  Floating-point and 64-bit integer
 176         assignments have analogous selectors.
 177         <p>
 178         Apart from the complexities of fixed vs floating registers,
 179         discussed below, the instruction selector is as simple
 180         as it can be.  It looks long and scary but detailed
 181         examination reveals it to be fairly straightforward.
 182     <p>
 183     <li><b>Register allocation.</b> The register allocator,
 184         <code>AsmRegAlloc.lhs</code> takes sequences of
 185         <code>Instr</code>s which mention a mixture of real and
 186         virtual registers, and returns a modified sequence referring
 187         only to real ones.  It is gloriously and entirely
 188         target-independent.  Well, not exactly true.  Instead it
 189         regards <code>Instr</code> (instructions) and <code>Reg</code>
 190         (virtual and real registers) as abstract types, to which it has
 191         the following interface:
 192         <p>
 193         <code>
 194         insnFuture :: Instr -> InsnFuture
 195         <br>regUsage :: Instr -> RegUsage
 196         <br>patchRegs :: Instr -> (Reg -> Reg) -> Instr
 197         </code>
 198         <p>
 199         <code>insnFuture</code> is used to (re)construct the graph of
 200         all possible control transfers between the insns to be
 201         allocated.  <code>regUsage</code> returns the sets of registers
 202         read and written by an instruction.  And
 203         <code>patchRegs</code> is used to apply the allocator's final
 204         decision on virtual-to-real reg mapping to an instruction.
 205         <p>
 206         Clearly these 3 fns have to be written anew for each
 207         architecture.  They are defined in
 208         <code>RegAllocInfo.lhs</code>.  Think twice, no, thrice,
 209         before modifying them: making false claims about insn
 210         behaviour will lead to hard-to-find register allocation
 211         errors.
 212         <p>
 213         <code>AsmRegAlloc.lhs</code> contains detailed comments about
 214         how the allocator works.  Here is a summary.  The head honcho
 215         <p>
 216         <code>allocUsingTheseRegs :: [Instr] -> [Reg] -> (Bool, [Instr])</code>
 217         <p>
 218         takes a list of instructions and a list of real registers
 219         available for allocation, and maps as many of the virtual regs
 220         in the input into real ones as it can.  The returned
 221         <code>Bool</code> indicates whether or not it was
 222         successful.  If so, that's the end of it.  If not, the caller
 223         of <code>allocUsingTheseRegs</code> will attempt spilling.
 224         More of that later.  What <code>allocUsingTheseRegs</code>
 225         does is:
 226         <p>
 227         <ul>
 228         <li>Implicitly number each instruction by its position in the
 229             input list.
 230         <p>
 231         <li>Using <code>insnFuture</code>, create the set of all flow
 232             edges -- possible control transfers -- within this set of
 233             insns.
 234         <p>
 235         <li>Using <code>regUsage</code> and iterating around the flow
 236             graph from the previous step, calculate, for each virtual
 237             register, the set of flow edges on which it is live.
 238         <p>
 239         <li>Make a real-register committment map, which gives the set
 240             of edges for which each real register is committed (in
 241             use).  These sets are initially empty.  For each virtual
 242             register, attempt to find a real register whose current
 243             committment does not intersect that of the virtual
 244             register -- ie, is uncommitted on all edges that the
 245             virtual reg is live.  If successful, this means the vreg
 246             can be assigned to the realreg, so add the vreg's set to
 247             the realreg's committment.
 248         <p>
 249         <li>If all the vregs were assigned to a realreg, use
 250             <code>patchInstr</code> to apply the mapping to the insns themselves.
 251         </ul>
 252         <p>
 253         <b>Spilling</b>
 254         <p>
 255         If <code>allocUsingTheseRegs</code> fails, a baroque
 256         mechanism comes into play.  We now know that much simpler
 257         schemes are available to do the same thing and give better
 258         results.
 259         Anyways:
 260         <p>
 261         The logic above <code>allocUsingTheseRegs</code>, in
 262         <code>doGeneralAlloc</code> and <code>runRegAllocate</code>,
 263         observe that allocation has failed with some set R of real
 264         registers.  So they apply <code>runRegAllocate</code> a second
 265         time to the code, but remove (typically) two registers from R
 266         before doing so.  This naturally fails too, but returns a
 267         partially-allocated sequence.  <code>doGeneralAlloc</code>
 268         then inserts spill code into the sequence, and finally re-runs
 269         <code>allocUsingTheseRegs</code>, but supplying the original,
 270         unadulterated R.  This is guaranteed to succeed since the two
 271         registers previously removed from R are sufficient to allocate
 272         all the spill/restore instructions added.
 273         <p>
 274         Because x86 is very short of registers, and in the worst case
 275         needs three removed from R, a softly-softly approach is used.
 276         <code>doGeneralAlloc</code> first tries with zero regs removed
 277         from R, then if that fails one, then two, etc.  This means
 278         <code>allocUsingTheseRegs</code> may get run several times
 279         before a successful arrangement is arrived at.
 280         <code>findReservedRegs</code> cooks up the sets of spill
 281         registers to try with.
 282         <p>
 283         The resulting machinery is complicated and the generated spill
 284         code is appalling.  The saving grace is that spills are very
 285         rare so it doesn't matter much.  I did not invent this -- I inherited it.
 286         <p>
 287         <b>Dealing with common cases fast</b>
 288         <p>
 289         The entire reg-alloc mechanism described so far is general and
 290         correct, but expensive overkill for many simple code blocks.
 291         So to begin with we use
 292         <code>doSimpleAlloc</code>, which attempts to do something
 293         simple.  It exploits the observation that if the total number
 294         of virtual registers does not exceed the number of real ones
 295         available, we can simply dole out a new realreg each time we
 296         see mention of a new vreg, with no regard for control flow.
 297         <code>doSimpleAlloc</code> therefore attempts this in a
 298         single pass over the code.  It gives up if it runs out of real
 299         regs or sees any condition which renders the above observation
 300         invalid (fixed reg uses, for example).
 301         <p>
 302         This clever hack handles the majority of code blocks quickly.
 303         It was copied from the previous reg-allocator (the
 304         Mattson/Partain/Marlow/Gill one).
 305     </ul>
 306
 307 <p>
 308 <h2>Complications, observations, and possible improvements</h2>
 309
 310 <h3>Real vs virtual registers in the instruction selectors</h3>
 311
 312 The instruction selectors for expression trees, namely
 313 <code>getRegister</code>, are complicated by the fact that some
 314 expressions can only be computed into a specific register, whereas
 315 the majority can be computed into any register.  We take x86 as an
 316 example, but the problem applies to all archs.
 317 <p>
 318 Terminology: <em>rreg</em> means real register, a real machine
 319 register.  <em>vreg</em> means one of an infinite set of virtual
 320 registers.  The type <code>Reg</code> is the sum of <em>rreg</em> and
 321 <em>vreg</em>.  The instruction selector generates sequences with
 322 unconstrained use of vregs, leaving the register allocator to map them
 323 all into rregs.
 324 <p>
 325 Now, where was I ?  Oh yes.  We return to the type of
 326 <code>getRegister</code>, which despite its name, selects instructions
 327 to compute the value of an expression tree.
 328 <pre>
 329    getRegister :: StixExpr -> NatM Register
 330
 331    data Register
 332      = Fixed   PrimRep Reg InstrBlock
 333      | Any     PrimRep (Reg -> InstrBlock)
 334
 335    type InstrBlock -- sequence of instructions
 336 </pre>
 337 At first this looks eminently reasonable (apart from the stupid
 338 name).  <code>getRegister</code>, and nobody else, knows whether or
 339 not a given expression has to be computed into a fixed rreg or can be
 340 computed into any rreg or vreg.  In the first case, it returns
 341 <code>Fixed</code> and indicates which rreg the result is in.  In the
 342 second case it defers committing to any specific target register by
 343 returning a function from <code>Reg</code> to <code>InstrBlock</code>,
 344 and the caller can specify the target reg as it sees fit.
 345 <p>
 346 Unfortunately, that forces <code>getRegister</code>'s callers (usually
 347 itself) to use a clumsy and confusing idiom in the common case where
 348 they do not care what register the result winds up in.  The reason is
 349 that although a value might be computed into a fixed rreg, we are
 350 forbidden (on pain of segmentation fault :) from subsequently
 351 modifying the fixed reg.  This and other rules are record in "Rules of
 352 the game" inside <code>MachCode.lhs</code>.
 353 <p>
 354 Why can't fixed registers be modified post-hoc?  Consider a simple
 355 expression like <code>Hp+1</code>.  Since the heap pointer
 356 <code>Hp</code> is definitely in a fixed register, call it R,
 357 <code>getRegister</code> on subterm <code>Hp</code> will simply return
 358 <code>Fixed</code> with an empty sequence and R.  But we can't just
 359 emit an increment instruction for R, because that trashes
 360 <code>Hp</code>; instead we first have to copy it into a fresh vreg
 361 and increment that.
 362 <p>
 363 With all that in mind, consider now writing a <code>getRegister</code>
 364 clause for terms of the form <code>(1 + E)</code>.  Contrived, yes,
 365 but illustrates the matter.  First we do
 366 <code>getRegister</code> on E.  Now we are forced to examine what
 367 comes back.
 368 <pre>
 369    getRegister (OnePlus e)
 370       = getRegister e           `thenNat`   \ e_result ->
 371         case e_result of
 372            Fixed e_code e_fixed
 373               -> returnNat (Any IntRep (\dst -> e_code ++ [MOV e_fixed dst, INC dst]))
 374            Any e_any
 375               -> Any (\dst -> e_any dst ++ [INC dst])
 376 </pre>
 377 This seems unreasonably cumbersome, yet the instruction selector is
 378 full of such idioms.  A good example of the complexities induced by
 379 this scheme is shown by <code>trivialCode</code> for x86 in
 380 <code>MachCode.lhs</code>.  This deals with general integer dyadic
 381 operations on x86 and has numerous cases.  It was difficult to get
 382 right.
 383 <p>
 384 An alternative suggestion is to simplify the type of
 385 <code>getRegister</code> to this:
 386 <pre>
 387    getRegister :: StixExpr -> NatM (InstrBloc, VReg)
 388    type VReg = .... a vreg ...
 389 </pre>
 390 and then we could safely write
 391 <pre>
 392    getRegister (OnePlus e)
 393       = getRegister e        `thenNat`  \ (e_code, e_vreg) ->
 394         returnNat (e_code ++ [INC e_vreg], e_vreg)
 395 </pre>
 396 which is about as straightforward as you could hope for.
 397 Unfortunately, it requires <code>getRegister</code> to insert moves of
 398 values which naturally compute into an rreg, into a vreg.  Consider:
 399 <pre>
 400    1 + ccall some-C-fn
 401 </pre>
 402 On x86 the ccall result is returned in rreg <code>%eax</code>.  The
 403 resulting sequence, prior to register allocation, would be:
 404 <pre>
 405    # push args
 406    call some-C-fn
 407    # move %esp to nuke args
 408    movl   %eax, %vreg
 409    incl   %vreg
 410 </pre>
 411 If, as is likely, <code>%eax</code> is not held live beyond this point
 412 for any other purpose, the move into a fresh register is pointless;
 413 we'd have been better off leaving the value in <code>%eax</code> as
 414 long as possible.
 415 <p>
 416 The simplified <code>getRegister</code> story is attractive.  It would
 417 clean up the instruction selectors significantly and make it simpler
 418 to write new ones.  The only drawback is that it generates redundant
 419 register moves.  I suggest that eliminating these should be the job
 420 of the register allocator.  Indeed:
 421 <ul>
 422 <li>There has been some work on this already ("Iterated register
 423     coalescing" ?), so this isn't a new idea.
 424 <p>
 425 <li>You could argue that the existing scheme inappropriately blurs the
 426     boundary between the instruction selector and the register
 427     allocator.  The instruction selector should .. well .. just
 428     select instructions, without having to futz around worrying about
 429     what kind of registers subtrees get generated into.  Register
 430     allocation should be <em>entirely</em> the domain of the register
 431     allocator, with the proviso that it should endeavour to allocate
 432     registers so as to minimise the number of non-redundant reg-reg
 433     moves in the final output.
 434 </ul>
 435
 436
 437 <h3>Selecting insns for 64-bit values/loads/stores on 32-bit platforms</h3>
 438
 439 Note that this stuff doesn't apply on 64-bit archs, since the
 440 <code>getRegister</code> mechanism applies there.
 441
 442 The relevant functions are:
 443 <pre>
 444    assignMem_I64Code :: StixExpr -> StixExpr -> NatM InstrBlock
 445    assignReg_I64Code :: StixReg  -> StixExpr -> NatM InstrBlock
 446    iselExpr64        :: StixExpr -> NatM ChildCode64
 447
 448    data ChildCode64     -- a.k.a "Register64"
 449       = ChildCode64
 450            InstrBlock   -- code
 451            VRegUnique   -- unique for the lower 32-bit temporary
 452 </pre>
 453 <code>iselExpr64</code> is the 64-bit, plausibly-named analogue of
 454 <code>getRegister</code>, and <code>ChildCode64</code> is the analogue
 455 of <code>Register</code>.  The aim here was to generate working 64
 456 bit code as simply as possible.  To this end, I used the
 457 simplified <code>getRegister</code> scheme described above, in which
 458 <code>iselExpr64</code>generates its results into two vregs which
 459 can always safely be modified afterwards.
 460 <p>
 461 Virtual registers are, unsurprisingly, distinguished by their
 462 <code>Unique</code>s.  There is a small difficulty in how to
 463 know what the vreg for the upper 32 bits of a value is, given the vreg
 464 for the lower 32 bits.  The simple solution adopted is to say that
 465 any low-32 vreg may also have a hi-32 counterpart which shares the
 466 same unique, but is otherwise regarded as a separate entity.
 467 <code>getHiVRegFromLo</code> gets one from the other.
 468 <pre>
 469    data VRegUnique
 470       = VRegUniqueLo Unique          -- lower part of a split quantity
 471       | VRegUniqueHi Unique          -- upper part thereof
 472 </pre>
 473 Apart from that, 64-bit code generation is really simple.  The sparc
 474 and x86 versions are almost copy-n-pastes of each other, with minor
 475 adjustments for endianness.  The generated code isn't wonderful but
 476 is certainly acceptable, and it works.
 477
 478
 479
 480 <h3>Shortcomings and inefficiencies in the register allocator</h3>
 481
 482 <h4>Redundant reconstruction of the control flow graph</h4>
 483
 484 The allocator goes to considerable computational expense to construct
 485 all the flow edges in the group of instructions it's allocating for,
 486 by using the <code>insnFuture</code> function in the
 487 <code>Instr</code> pseudo-abstract type.
 488 <p>
 489 This is really silly, because all that information is present at the
 490 abstract C stage, but is thrown away in the translation to Stix.
 491 So a good thing to do is to modify that translation to
 492 produce a directed graph of Stix straight-line code blocks,
 493 and to preserve that structure through the insn selector, so the
 494 allocator can see it.
 495 <p>
 496 This would eliminate the fragile, hacky, arch-specific
 497 <code>insnFuture</code> mechanism, and probably make the whole
 498 compiler run measurably faster.  Register allocation is a fair chunk
 499 of the time of non-optimising compilation (10% or more), and
 500 reconstructing the flow graph is an expensive part of reg-alloc.
 501 It would probably accelerate the vreg liveness computation too.
 502
 503 <h4>Really ridiculous method for doing spilling</h4>
 504
 505 This is a more ambitious suggestion, but ... reg-alloc should be
 506 reimplemented, using the scheme described in "Quality and speed in
 507 linear-scan register allocation."  (Traub?)  For straight-line code
 508 blocks, this gives an elegant one-pass algorithm for assigning
 509 registers and creating the minimal necessary spill code, without the
 510 need for reserving spill registers ahead of time.
 511 <p>
 512 I tried it in Rigr, replacing the previous spiller which used the
 513 current GHC scheme described above, and it cut the number of spill
 514 loads and stores by a factor of eight.  Not to mention being simpler,
 515 easier to understand and very fast.
 516 <p>
 517 The Traub paper also describes how to extend their method to multiple
 518 basic blocks, which will be needed for GHC.  It comes down to
 519 reconciling multiple vreg-to-rreg mappings at points where control
 520 flow merges.
 521
 522 <h4>Redundant-move support for revised instruction selector suggestion</h4>
 523
 524 As mentioned above, simplifying the instruction selector will require
 525 the register allocator to try and allocate source and destination
 526 vregs to the same rreg in reg-reg moves, so as to make as many as
 527 possible go away.  Without that, the revised insn selector would
 528 generate worse code than at present.  I know this stuff has been done
 529 but know nothing about it.  The Linear-scan reg-alloc paper mentioned
 530 above does indeed mention a bit about it in the context of single
 531 basic blocks, but I don't know if that's sufficient.
 532
 533
 534
 535 <h3>x86 arcana that you should know about</h3>
 536
 537 The main difficulty with x86 is that many instructions have fixed
 538 register constraints, which can occasionally make reg-alloc fail
 539 completely.  And the FPU doesn't have the flat register model which
 540 the reg-alloc abstraction (implicitly) assumes.
 541 <p>
 542 Our strategy is: do a good job for the common small subset, that is
 543 integer loads, stores, address calculations, basic ALU ops (+, -,
 544 and, or, xor), and jumps.  That covers the vast majority of
 545 executed insns.  And indeed we do do a good job, with a loss of
 546 less than 2% compared with gcc.
 547 <p>
 548 Initially we tried to handle integer instructions with awkward
 549 register constraints (mul, div, shifts by non-constant amounts) via
 550 various jigglings of the spiller et al.  This never worked robustly,
 551 and putting platform-specific tweaks in the generic infrastructure is
 552 a big No-No.  (Not quite true; shifts by a non-constant amount are
 553 still done by a giant kludge, and should be moved into this new
 554 framework.)
 555 <p>
 556 Fortunately, all such insns are rare.  So the current scheme is to
 557 pretend that they don't have any such constraints.  This fiction is
 558 carried all the way through the register allocator.  When the insn
 559 finally comes to be printed, we emit a sequence which copies the
 560 operands through memory (<code>%esp</code>-relative), satisfying the
 561 constraints of the real instruction.  This localises the gruesomeness
 562 to just one place.  Here, for example, is the code generated for
 563 integer divison of <code>%esi</code> by <code>%ecx</code>:
 564 <pre>
 565    # BEGIN IQUOT %ecx, %esi
 566    pushl $0
 567    pushl %eax
 568    pushl %edx
 569    pushl %ecx
 570    movl  %esi,% eax
 571    cltd
 572    idivl 0(%esp)
 573    movl %eax, 12(%esp)
 574    popl %edx
 575    popl %edx
 576    popl %eax
 577    popl %esi
 578    # END   IQUOT %ecx, %esi
 579 </pre>
 580 This is not quite as appalling as it seems, if you consider that the
 581 division itself typically takes 16+ cycles, whereas the rest of the
 582 insns probably go through in about 1 cycle each.
 583 <p>
 584 This trick is taken to extremes for FP operations.
 585 <p>
 586 All notions of the x86 FP stack and its insns have been removed.
 587 Instead, we pretend, to the instruction selector and register
 588 allocator, that x86 has six floating point registers,
 589 <code>%fake0</code> .. <code>%fake5</code>, which can be used in the
 590 usual flat manner.  We further claim that x86 has floating point
 591 instructions very similar to SPARC and Alpha, that is, a simple
 592 3-operand register-register arrangement.  Code generation and register
 593 allocation proceed on this basis.
 594 <p>
 595 When we come to print out the final assembly, our convenient fiction
 596 is converted to dismal reality.  Each fake instruction is
 597 independently converted to a series of real x86 instructions.
 598 <code>%fake0</code> .. <code>%fake5</code> are mapped to
 599 <code>%st(0)</code> .. <code>%st(5)</code>.  To do reg-reg arithmetic
 600 operations, the two operands are pushed onto the top of the FP stack,
 601 the operation done, and the result copied back into the relevant
 602 register.  When one of the operands is also the destination, we emit a
 603 slightly less scummy translation.  There are only six
 604 <code>%fake</code> registers because 2 are needed for the translation,
 605 and x86 has 8 in total.
 606 <p>
 607 The translation is inefficient but is simple and it works.  A cleverer
 608 translation would handle a sequence of insns, simulating the FP stack
 609 contents, would not impose a fixed mapping from <code>%fake</code> to
 610 <code>%st</code> regs, and hopefully could avoid most of the redundant
 611 reg-reg moves of the current translation.
 612 <p>
 613 There are, however, two unforeseen bad side effects:
 614 <ul>
 615 <li>This doesn't work properly, because it doesn't observe the normal
 616     conventions for x86 FP code generation.  It turns out that each of
 617     the 8 elements in the x86 FP register stack has a tag bit which
 618     indicates whether or not that register is notionally in use or
 619     not.  If you do a FPU operation which happens to read a
 620     tagged-as-empty register, you get an x87 FPU (stack invalid)
 621     exception, which is normally handled by the FPU without passing it
 622     to the OS: the program keeps going, but the resulting FP values
 623     are garbage.  The OS can ask for the FPU to pass it FP
 624     stack-invalid exceptions, but it usually doesn't.
 625     <p>
 626     Anyways: inside NCG created x86 FP code this all works fine.
 627     However, the NCG's fiction of a flat register set does not operate
 628     the x87 register stack in the required stack-like way.  When
 629     control returns to a gcc-generated world, the stack tag bits soon
 630     cause stack exceptions, and thus garbage results.
 631     <p>
 632     The only fix I could think of -- and it is horrible -- is to clear
 633     all the tag bits just before the next STG-level entry, in chunks
 634     of code which use FP insns.  <code>i386_insert_ffrees</code>
 635     inserts the relevant <code>ffree</code> insns into such code
 636     blocks.  It depends critically on <code>is_G_instr</code> to
 637     detect such blocks.
 638 <p>
 639 <li>It's very difficult to read the generated assembly and
 640     reason about it when debugging, because there's so much clutter.
 641     We print the fake insns as comments in the output, and that helps
 642     a bit.
 643 </ul>
 644
 645
 646
 647 <h3>Generating code for ccalls</h3>
 648
 649 For reasons I don't really understand, the instruction selectors for
 650 generating calls to C (<code>genCCall</code>) have proven surprisingly
 651 difficult to get right, and soaked up a lot of debugging time.  As a
 652 result, I have once again opted for schemes which are simple and not
 653 too difficult to argue as correct, even if they don't generate
 654 excellent code.
 655 <p>
 656 The sparc ccall generator in particular forces all arguments into
 657 temporary virtual registers before moving them to the final
 658 out-registers (<code>%o0</code> .. <code>%o5</code>).  This creates
 659 some unnecessary reg-reg moves.  The reason is explained in a
 660 comment in the code.
 661
 662
 663 <h3>Duplicate implementation for many STG macros</h3>
 664
 665 This has been discussed at length already.  It has caused a couple of
 666 nasty bugs due to subtle untracked divergence in the macro
 667 translations.  The macro-expander really should be pushed up into the
 668 Abstract C phase, so the problem can't happen.
 669 <p>
 670 Doing so would have the added benefit that the NCG could be used to
 671 compile more "ways" -- well, at least the 'p' profiling way.
 672
 673
 674 <h3>How to debug the NCG without losing your sanity/hair/cool</h3>
 675
 676 Last, but definitely not least ...
 677 <p>
 678 The usual syndrome is that some program, when compiled via C, works,
 679 but not when compiled via the NCG.  Usually the problem is fairly
 680 simple to fix, once you find the specific code block which has been
 681 mistranslated.  But the latter can be nearly impossible, since most
 682 modules generate at least hundreds and often thousands of them.
 683 <p>
 684 My solution: cheat.
 685 <p>
 686 Because the via-C and native routes diverge only late in the day,
 687 it is not difficult to construct a 1-1 correspondence between basic
 688 blocks on the two routes.  So, if the program works via C but not on
 689 the NCG, do the following:
 690 <ul>
 691 <li>Recompile <code>AsmCodeGen.lhs</code> in the afflicted compiler
 692     with <code>-DDEBUG_NCG</code>, so that it inserts
 693     <code>___ncg_debug_marker</code>s
 694     into the assembly it emits.
 695 <p>
 696 <li>Using a binary search on modules, find the module which is causing
 697     the problem.
 698 <p>
 699 <li>Compile that module to assembly code, with identical flags, twice,
 700     once via C and once via NCG.
 701     Call the outputs <code>ModuleName.s-gcc</code> and
 702     <code>ModuleName.s-nat</code>.  Check that the latter does indeed have
 703     <code>___ncg_debug_marker</code>s in it; otherwise the next steps fail.
 704 <p>
 705 <li>Build (with a working compiler) the program
 706     <code>fptools/ghc/utils/debugNCG/diff_gcc_nat</code>.
 707 <p>
 708 <li>Run: <code>diff_gcc_nat ModuleName.s</code>.  This will
 709     construct the 1-1 correspondence, and emits on stdout
 710     a cppable assembly output.  Place this in a file -- I always
 711     call it <code>synth.S</code>.  Note, the capital S is important;
 712     otherwise it won't get cpp'd.  You can feed this file directly to
 713     ghc and it will automatically get cpp'd; you don't have to do so
 714     yourself.
 715 <p>
 716 <li>By messing with the <code>#define</code>s at the top of
 717     <code>synth.S</code>, do a binary search to find the incorrect
 718     block.  Keep a careful record of where you are in the search; it
 719     is easy to get confused.  Remember also that multiple blocks may
 720     be wrong, which also confuses matters.  Finally, I usually start
 721     off by re-checking that I can build the executable with all the
 722     <code>#define</code>s set to 0 and then all to 1.  This ensures
 723     you won't get halfway through the search and then get stuck due to
 724     some snafu with gcc-specific literals.  Usually I set
 725     <code>UNMATCHED_GCC</code> to 1 all the time, and this bit should
 726     contain only literal data.
 727     <code>UNMATCHED_NAT</code> should be empty.
 728 </ul>
 729 <p>
 730 <code>diff_gcc_nat</code> was known to work correctly last time I used
 731 it, in December 01, for both x86 and sparc.  If it doesn't work, due
 732 to changes in assembly syntax, or whatever, make it work.  The
 733 investment is well worth it.  Searching for the incorrect block(s) any
 734 other way is a total time waster.
 735
 736
 737
 738 </ul>
 739
 740
 741
 742
 743     <p><small>
 744 <!-- hhmts start -->
 745 Last modified: Fri Feb  1 16:14:11 GMT 2002
 746 <!-- hhmts end -->
 747     </small>
 748   </body>
 749 </html>