X-Git-Url: http://git.megacz.com/?p=ghc-hetmet.git;a=blobdiff_plain;f=docs%2Fcomm%2Fthe-beast%2Fncg.html;fp=docs%2Fcomm%2Fthe-beast%2Fncg.html;h=5810a352126232bfe592597efad525e5a261f55e;hp=0000000000000000000000000000000000000000;hb=0065d5ab628975892cea1ec7303f968c3338cbe1;hpb=28a464a75e14cece5db40f2765a29348273ff2d2 diff --git a/docs/comm/the-beast/ncg.html b/docs/comm/the-beast/ncg.html new file mode 100644 index 0000000..5810a35 --- /dev/null +++ b/docs/comm/the-beast/ncg.html @@ -0,0 +1,749 @@ + + +
+ ++ On some platforms (currently x86 and PowerPC, with bitrotted + support for Sparc and Alpha), GHC can generate assembly code + directly, without having to go via C. This can sometimes almost + halve compilation time, and avoids the fragility and + horribleness of the mangler. The NCG + is enabled by default for + non-optimising compilation on supported platforms. For most programs + it generates code which runs only 1-3% slower + (depending on platform and type of code) than that + created by gcc on x86s, so it is well worth using even with + optimised compilation. FP-intensive x86 programs see a bigger + slowdown, and all Sparc code runs about 5% slower due to + us not filling branch delay slots. +
+ The NCG has always been something of a second-class citizen + inside GHC, an unloved child, rather. This means that its + integration into the compiler as a whole is rather clumsy, which + brings some problems described below. That apart, the NCG + proper is fairly cleanly designed, as target-independent as it + reasonably can be, and so should not be difficult to retarget. +
+ NOTE! The native code generator was largely rewritten as part + of the C-- backend changes, around May 2004. Unfortunately the + rest of this document still refers to the old version, and was written + with relation to the CVS head as of end-Jan 2002. Some of it is relevant, + some of it isn't. + +
+ absCtoNat :: AbstractC -> UniqSM (SDoc, Pretty.Doc)
+
+ The returned SDoc
is for debugging, so is empty unless
+ you specify -ddump-stix
. The Pretty.Doc
+ bit is the final assembly code. Translation involves three main
+ phases, the first and third of which are target-independent.
+
Stix
representation. Stix
+ is a simple tree-like RTL-style language, in which you can
+ mention:
+ +
MagicId
), such as
+ the heap and stack pointers.
+ MachOp
).
+ if ... goto ...
style control-flow.
+ + Stix has two main associated types: +
+
StixStmt
-- trees executed for their side
+ effects: assignments, control transfers, and auxiliary junk
+ such as segment changes and literal data.
+ StixExpr
-- trees which denote a value.
+
+ Translation into Stix is almost completely
+ target-independent. Needed dependencies are knowledge of
+ word size and endianness, used when generating code to do
+ deal with half-word fields in info tables. This could be
+ abstracted out easily enough. Also, the Stix translation
+ needs to know which MagicId
s map to registers
+ on the given target, and which are stored in offsets from
+ BaseReg
.
+
+ After initial Stix generation, the trees are cleaned up with
+ constant-folding and a little copy-propagation ("Stix
+ inlining", as the code misleadingly calls it). We take
+ the opportunity to translate MagicId
s which are
+ stored in memory on the given target, into suitable memory
+ references. Those which are stored in registers are left
+ alone. There is also a half-hearted attempt to lift literal
+ strings to the top level in cases where nested strings have
+ been observed to give incorrect code in the past.
+
+ Primitive machine-level operations will already be phrased in
+ terms of MachOp
s in the presented Abstract C, and
+ these are passed through unchanged. We comment only that the
+ MachOp
s have been chosen so as to be easy to
+ implement on all targets, and their meaning is intended to be
+ unambiguous, and the same on all targets, regardless of word
+ size or endianness.
+
+ A note on MagicId
s.
+ Those which are assigned to
+ registers on the current target are left unmodified. Those
+ which are not are stored in memory as offsets from
+ BaseReg
(which is assumed to permanently have the
+ value (&MainCapability.r)
), so the constant folder
+ calculates the offsets and inserts suitable loads/stores. One
+ complication is that not all archs have BaseReg
+ itself in a register, so for those (sparc), we instead
+ generate the address as an offset from the static symbol
+ MainCapability
, since the register table lives in
+ there.
+
+ Finally, BaseReg
does occasionally itself get
+ mentioned in Stix expression trees, and in this case what is
+ denoted is precisely (&MainCapability.r)
, not, as
+ in all other cases, the value of memory at some offset from
+ the start of the register table. Since what it denotes is an
+ r-value and not an l-value, assigning BaseReg
is
+ meaningless, so the machinery checks to ensure this never
+ happens. All these details are taken into account by the
+ constant folder.
+
+
Instr
, a data
+ type which is different for each architecture.
+ Instr
, unsurprisingly, has various supporting
+ types, such as Reg
, Operand
,
+ Imm
, etc. The generated instructions may refer
+ to specific machine registers, or to arbitrary virtual
+ registers, either those created within the instruction
+ selector, or those mentioned in the Stix passed to it.
+
+ The instruction selectors live in MachCode.lhs
.
+ The core functions, for each target, are:
+
+
+ getAmode :: StixExpr -> NatM Amode
+
+
getRegister :: StixExpr -> NatM Register
+
assignMem_IntCode :: PrimRep -> StixExpr -> StixExpr -> NatM InstrBlock
+
assignReg_IntCode :: PrimRep -> StixReg -> StixExpr -> NatM InstrBlock
+
+ The insn selectors use the "maximal munch" algorithm. The
+ bizarrely-misnamed getRegister
translates
+ expressions. A simplified version of its type is:
+
+ getRegister :: StixExpr -> NatM (OrdList Instr, Reg)
+
+ That is: it (monadically) turns a StixExpr
into a
+ sequence of instructions, and a register, with the meaning
+ that after executing the (possibly empty) sequence of
+ instructions, the (possibly virtual) register will
+ hold the resulting value. The real situation is complicated
+ by the presence of fixed registers, and is detailed below.
+
+ Maximal munch is a greedy algorithm and is known not to give + globally optimal code sequences, but it is good enough, and + fast and simple. Early incarnations of the NCG used something + more sophisticated, but that is long gone now. +
+ Similarly, getAmode
translates a value, intended
+ to denote an address, into a sequence of insns leading up to
+ a (processor-specific) addressing mode. This stuff could be
+ done using the general getRegister
selector, but
+ would necessarily generate poorer code, because the calculated
+ address would be forced into a register, which might be
+ unnecessary if it could partially or wholly be calculated
+ using an addressing mode.
+
+ Finally, assignMem_IntCode
and
+ assignReg_IntCode
create instruction sequences to
+ calculate a value and store it in the given register, or at
+ the given address. Because these guys translate a statement,
+ not a value, they just return a sequence of insns and no
+ associated register. Floating-point and 64-bit integer
+ assignments have analogous selectors.
+
+ Apart from the complexities of fixed vs floating registers, + discussed below, the instruction selector is as simple + as it can be. It looks long and scary but detailed + examination reveals it to be fairly straightforward. +
+
AsmRegAlloc.lhs
takes sequences of
+ Instr
s which mention a mixture of real and
+ virtual registers, and returns a modified sequence referring
+ only to real ones. It is gloriously and entirely
+ target-independent. Well, not exactly true. Instead it
+ regards Instr
(instructions) and Reg
+ (virtual and real registers) as abstract types, to which it has
+ the following interface:
+
+
+ insnFuture :: Instr -> InsnFuture
+
+
regUsage :: Instr -> RegUsage
+
patchRegs :: Instr -> (Reg -> Reg) -> Instr
+
+ insnFuture
is used to (re)construct the graph of
+ all possible control transfers between the insns to be
+ allocated. regUsage
returns the sets of registers
+ read and written by an instruction. And
+ patchRegs
is used to apply the allocator's final
+ decision on virtual-to-real reg mapping to an instruction.
+
+ Clearly these 3 fns have to be written anew for each
+ architecture. They are defined in
+ RegAllocInfo.lhs
. Think twice, no, thrice,
+ before modifying them: making false claims about insn
+ behaviour will lead to hard-to-find register allocation
+ errors.
+
+ AsmRegAlloc.lhs
contains detailed comments about
+ how the allocator works. Here is a summary. The head honcho
+
+ allocUsingTheseRegs :: [Instr] -> [Reg] -> (Bool, [Instr])
+
+ takes a list of instructions and a list of real registers
+ available for allocation, and maps as many of the virtual regs
+ in the input into real ones as it can. The returned
+ Bool
indicates whether or not it was
+ successful. If so, that's the end of it. If not, the caller
+ of allocUsingTheseRegs
will attempt spilling.
+ More of that later. What allocUsingTheseRegs
+ does is:
+
+
+
insnFuture
, create the set of all flow
+ edges -- possible control transfers -- within this set of
+ insns.
+ +
regUsage
and iterating around the flow
+ graph from the previous step, calculate, for each virtual
+ register, the set of flow edges on which it is live.
+ +
+
patchInstr
to apply the mapping to the insns themselves.
+ + Spilling +
+ If allocUsingTheseRegs
fails, a baroque
+ mechanism comes into play. We now know that much simpler
+ schemes are available to do the same thing and give better
+ results.
+ Anyways:
+
+ The logic above allocUsingTheseRegs
, in
+ doGeneralAlloc
and runRegAllocate
,
+ observe that allocation has failed with some set R of real
+ registers. So they apply runRegAllocate
a second
+ time to the code, but remove (typically) two registers from R
+ before doing so. This naturally fails too, but returns a
+ partially-allocated sequence. doGeneralAlloc
+ then inserts spill code into the sequence, and finally re-runs
+ allocUsingTheseRegs
, but supplying the original,
+ unadulterated R. This is guaranteed to succeed since the two
+ registers previously removed from R are sufficient to allocate
+ all the spill/restore instructions added.
+
+ Because x86 is very short of registers, and in the worst case
+ needs three removed from R, a softly-softly approach is used.
+ doGeneralAlloc
first tries with zero regs removed
+ from R, then if that fails one, then two, etc. This means
+ allocUsingTheseRegs
may get run several times
+ before a successful arrangement is arrived at.
+ findReservedRegs
cooks up the sets of spill
+ registers to try with.
+
+ The resulting machinery is complicated and the generated spill + code is appalling. The saving grace is that spills are very + rare so it doesn't matter much. I did not invent this -- I inherited it. +
+ Dealing with common cases fast +
+ The entire reg-alloc mechanism described so far is general and
+ correct, but expensive overkill for many simple code blocks.
+ So to begin with we use
+ doSimpleAlloc
, which attempts to do something
+ simple. It exploits the observation that if the total number
+ of virtual registers does not exceed the number of real ones
+ available, we can simply dole out a new realreg each time we
+ see mention of a new vreg, with no regard for control flow.
+ doSimpleAlloc
therefore attempts this in a
+ single pass over the code. It gives up if it runs out of real
+ regs or sees any condition which renders the above observation
+ invalid (fixed reg uses, for example).
+
+ This clever hack handles the majority of code blocks quickly. + It was copied from the previous reg-allocator (the + Mattson/Partain/Marlow/Gill one). +
+
getRegister
, are complicated by the fact that some
+expressions can only be computed into a specific register, whereas
+the majority can be computed into any register. We take x86 as an
+example, but the problem applies to all archs.
+
+Terminology: rreg means real register, a real machine
+register. vreg means one of an infinite set of virtual
+registers. The type Reg
is the sum of rreg and
+vreg. The instruction selector generates sequences with
+unconstrained use of vregs, leaving the register allocator to map them
+all into rregs.
+
+Now, where was I ? Oh yes. We return to the type of
+getRegister
, which despite its name, selects instructions
+to compute the value of an expression tree.
+
+ getRegister :: StixExpr -> NatM Register + + data Register + = Fixed PrimRep Reg InstrBlock + | Any PrimRep (Reg -> InstrBlock) + + type InstrBlock -- sequence of instructions ++At first this looks eminently reasonable (apart from the stupid +name).
getRegister
, and nobody else, knows whether or
+not a given expression has to be computed into a fixed rreg or can be
+computed into any rreg or vreg. In the first case, it returns
+Fixed
and indicates which rreg the result is in. In the
+second case it defers committing to any specific target register by
+returning a function from Reg
to InstrBlock
,
+and the caller can specify the target reg as it sees fit.
+
+Unfortunately, that forces getRegister
's callers (usually
+itself) to use a clumsy and confusing idiom in the common case where
+they do not care what register the result winds up in. The reason is
+that although a value might be computed into a fixed rreg, we are
+forbidden (on pain of segmentation fault :) from subsequently
+modifying the fixed reg. This and other rules are record in "Rules of
+the game" inside MachCode.lhs
.
+
+Why can't fixed registers be modified post-hoc? Consider a simple
+expression like Hp+1
. Since the heap pointer
+Hp
is definitely in a fixed register, call it R,
+getRegister
on subterm Hp
will simply return
+Fixed
with an empty sequence and R. But we can't just
+emit an increment instruction for R, because that trashes
+Hp
; instead we first have to copy it into a fresh vreg
+and increment that.
+
+With all that in mind, consider now writing a getRegister
+clause for terms of the form (1 + E)
. Contrived, yes,
+but illustrates the matter. First we do
+getRegister
on E. Now we are forced to examine what
+comes back.
+
+ getRegister (OnePlus e) + = getRegister e `thenNat` \ e_result -> + case e_result of + Fixed e_code e_fixed + -> returnNat (Any IntRep (\dst -> e_code ++ [MOV e_fixed dst, INC dst])) + Any e_any + -> Any (\dst -> e_any dst ++ [INC dst]) ++This seems unreasonably cumbersome, yet the instruction selector is +full of such idioms. A good example of the complexities induced by +this scheme is shown by
trivialCode
for x86 in
+MachCode.lhs
. This deals with general integer dyadic
+operations on x86 and has numerous cases. It was difficult to get
+right.
+
+An alternative suggestion is to simplify the type of
+getRegister
to this:
+
+ getRegister :: StixExpr -> NatM (InstrBloc, VReg) + type VReg = .... a vreg ... ++and then we could safely write +
+ getRegister (OnePlus e) + = getRegister e `thenNat` \ (e_code, e_vreg) -> + returnNat (e_code ++ [INC e_vreg], e_vreg) ++which is about as straightforward as you could hope for. +Unfortunately, it requires
getRegister
to insert moves of
+values which naturally compute into an rreg, into a vreg. Consider:
++ 1 + ccall some-C-fn ++On x86 the ccall result is returned in rreg
%eax
. The
+resulting sequence, prior to register allocation, would be:
++ # push args + call some-C-fn + # move %esp to nuke args + movl %eax, %vreg + incl %vreg ++If, as is likely,
%eax
is not held live beyond this point
+for any other purpose, the move into a fresh register is pointless;
+we'd have been better off leaving the value in %eax
as
+long as possible.
+
+The simplified getRegister
story is attractive. It would
+clean up the instruction selectors significantly and make it simpler
+to write new ones. The only drawback is that it generates redundant
+register moves. I suggest that eliminating these should be the job
+of the register allocator. Indeed:
+
+
getRegister
mechanism applies there.
+
+The relevant functions are:
++ assignMem_I64Code :: StixExpr -> StixExpr -> NatM InstrBlock + assignReg_I64Code :: StixReg -> StixExpr -> NatM InstrBlock + iselExpr64 :: StixExpr -> NatM ChildCode64 + + data ChildCode64 -- a.k.a "Register64" + = ChildCode64 + InstrBlock -- code + VRegUnique -- unique for the lower 32-bit temporary ++
iselExpr64
is the 64-bit, plausibly-named analogue of
+getRegister
, and ChildCode64
is the analogue
+of Register
. The aim here was to generate working 64
+bit code as simply as possible. To this end, I used the
+simplified getRegister
scheme described above, in which
+iselExpr64
generates its results into two vregs which
+can always safely be modified afterwards.
+
+Virtual registers are, unsurprisingly, distinguished by their
+Unique
s. There is a small difficulty in how to
+know what the vreg for the upper 32 bits of a value is, given the vreg
+for the lower 32 bits. The simple solution adopted is to say that
+any low-32 vreg may also have a hi-32 counterpart which shares the
+same unique, but is otherwise regarded as a separate entity.
+getHiVRegFromLo
gets one from the other.
+
+ data VRegUnique + = VRegUniqueLo Unique -- lower part of a split quantity + | VRegUniqueHi Unique -- upper part thereof ++Apart from that, 64-bit code generation is really simple. The sparc +and x86 versions are almost copy-n-pastes of each other, with minor +adjustments for endianness. The generated code isn't wonderful but +is certainly acceptable, and it works. + + + +
insnFuture
function in the
+Instr
pseudo-abstract type.
++This is really silly, because all that information is present at the +abstract C stage, but is thrown away in the translation to Stix. +So a good thing to do is to modify that translation to +produce a directed graph of Stix straight-line code blocks, +and to preserve that structure through the insn selector, so the +allocator can see it. +
+This would eliminate the fragile, hacky, arch-specific
+insnFuture
mechanism, and probably make the whole
+compiler run measurably faster. Register allocation is a fair chunk
+of the time of non-optimising compilation (10% or more), and
+reconstructing the flow graph is an expensive part of reg-alloc.
+It would probably accelerate the vreg liveness computation too.
+
+
+I tried it in Rigr, replacing the previous spiller which used the +current GHC scheme described above, and it cut the number of spill +loads and stores by a factor of eight. Not to mention being simpler, +easier to understand and very fast. +
+The Traub paper also describes how to extend their method to multiple +basic blocks, which will be needed for GHC. It comes down to +reconciling multiple vreg-to-rreg mappings at points where control +flow merges. + +
+Our strategy is: do a good job for the common small subset, that is +integer loads, stores, address calculations, basic ALU ops (+, -, +and, or, xor), and jumps. That covers the vast majority of +executed insns. And indeed we do do a good job, with a loss of +less than 2% compared with gcc. +
+Initially we tried to handle integer instructions with awkward +register constraints (mul, div, shifts by non-constant amounts) via +various jigglings of the spiller et al. This never worked robustly, +and putting platform-specific tweaks in the generic infrastructure is +a big No-No. (Not quite true; shifts by a non-constant amount are +still done by a giant kludge, and should be moved into this new +framework.) +
+Fortunately, all such insns are rare. So the current scheme is to
+pretend that they don't have any such constraints. This fiction is
+carried all the way through the register allocator. When the insn
+finally comes to be printed, we emit a sequence which copies the
+operands through memory (%esp
-relative), satisfying the
+constraints of the real instruction. This localises the gruesomeness
+to just one place. Here, for example, is the code generated for
+integer divison of %esi
by %ecx
:
+
+ # BEGIN IQUOT %ecx, %esi + pushl $0 + pushl %eax + pushl %edx + pushl %ecx + movl %esi,% eax + cltd + idivl 0(%esp) + movl %eax, 12(%esp) + popl %edx + popl %edx + popl %eax + popl %esi + # END IQUOT %ecx, %esi ++This is not quite as appalling as it seems, if you consider that the +division itself typically takes 16+ cycles, whereas the rest of the +insns probably go through in about 1 cycle each. +
+This trick is taken to extremes for FP operations. +
+All notions of the x86 FP stack and its insns have been removed.
+Instead, we pretend, to the instruction selector and register
+allocator, that x86 has six floating point registers,
+%fake0
.. %fake5
, which can be used in the
+usual flat manner. We further claim that x86 has floating point
+instructions very similar to SPARC and Alpha, that is, a simple
+3-operand register-register arrangement. Code generation and register
+allocation proceed on this basis.
+
+When we come to print out the final assembly, our convenient fiction
+is converted to dismal reality. Each fake instruction is
+independently converted to a series of real x86 instructions.
+%fake0
.. %fake5
are mapped to
+%st(0)
.. %st(5)
. To do reg-reg arithmetic
+operations, the two operands are pushed onto the top of the FP stack,
+the operation done, and the result copied back into the relevant
+register. When one of the operands is also the destination, we emit a
+slightly less scummy translation. There are only six
+%fake
registers because 2 are needed for the translation,
+and x86 has 8 in total.
+
+The translation is inefficient but is simple and it works. A cleverer
+translation would handle a sequence of insns, simulating the FP stack
+contents, would not impose a fixed mapping from %fake
to
+%st
regs, and hopefully could avoid most of the redundant
+reg-reg moves of the current translation.
+
+There are, however, two unforeseen bad side effects: +
+ Anyways: inside NCG created x86 FP code this all works fine. + However, the NCG's fiction of a flat register set does not operate + the x87 register stack in the required stack-like way. When + control returns to a gcc-generated world, the stack tag bits soon + cause stack exceptions, and thus garbage results. +
+ The only fix I could think of -- and it is horrible -- is to clear
+ all the tag bits just before the next STG-level entry, in chunks
+ of code which use FP insns. i386_insert_ffrees
+ inserts the relevant ffree
insns into such code
+ blocks. It depends critically on is_G_instr
to
+ detect such blocks.
+
+
genCCall
) have proven surprisingly
+difficult to get right, and soaked up a lot of debugging time. As a
+result, I have once again opted for schemes which are simple and not
+too difficult to argue as correct, even if they don't generate
+excellent code.
+
+The sparc ccall generator in particular forces all arguments into
+temporary virtual registers before moving them to the final
+out-registers (%o0
.. %o5
). This creates
+some unnecessary reg-reg moves. The reason is explained in a
+comment in the code.
+
+
+
+Doing so would have the added benefit that the NCG could be used to +compile more "ways" -- well, at least the 'p' profiling way. + + +
+The usual syndrome is that some program, when compiled via C, works, +but not when compiled via the NCG. Usually the problem is fairly +simple to fix, once you find the specific code block which has been +mistranslated. But the latter can be nearly impossible, since most +modules generate at least hundreds and often thousands of them. +
+My solution: cheat. +
+Because the via-C and native routes diverge only late in the day, +it is not difficult to construct a 1-1 correspondence between basic +blocks on the two routes. So, if the program works via C but not on +the NCG, do the following: +
AsmCodeGen.lhs
in the afflicted compiler
+ with -DDEBUG_NCG
, so that it inserts
+ ___ncg_debug_marker
s
+ into the assembly it emits.
++
+
ModuleName.s-gcc
and
+ ModuleName.s-nat
. Check that the latter does indeed have
+ ___ncg_debug_marker
s in it; otherwise the next steps fail.
++
fptools/ghc/utils/debugNCG/diff_gcc_nat
.
++
diff_gcc_nat ModuleName.s
. This will
+ construct the 1-1 correspondence, and emits on stdout
+ a cppable assembly output. Place this in a file -- I always
+ call it synth.S
. Note, the capital S is important;
+ otherwise it won't get cpp'd. You can feed this file directly to
+ ghc and it will automatically get cpp'd; you don't have to do so
+ yourself.
++
#define
s at the top of
+ synth.S
, do a binary search to find the incorrect
+ block. Keep a careful record of where you are in the search; it
+ is easy to get confused. Remember also that multiple blocks may
+ be wrong, which also confuses matters. Finally, I usually start
+ off by re-checking that I can build the executable with all the
+ #define
s set to 0 and then all to 1. This ensures
+ you won't get halfway through the search and then get stuck due to
+ some snafu with gcc-specific literals. Usually I set
+ UNMATCHED_GCC
to 1 all the time, and this bit should
+ contain only literal data.
+ UNMATCHED_NAT
should be empty.
+
+diff_gcc_nat
was known to work correctly last time I used
+it, in December 01, for both x86 and sparc. If it doesn't work, due
+to changes in assembly syntax, or whatever, make it work. The
+investment is well worth it. Searching for the incorrect block(s) any
+other way is a total time waster.
+
+
+
+
+
+
+
+
+
+ +Last modified: Fri Feb 1 16:14:11 GMT 2002 + + + +