On some platforms (currently x86 and PowerPC, with bitrotted support for Sparc and Alpha), GHC can generate assembly code directly, without having to go via C. This can sometimes almost halve compilation time, and avoids the fragility and horribleness of the mangler. The NCG is enabled by default for non-optimising compilation on supported platforms. For most programs it generates code which runs only 1-3% slower (depending on platform and type of code) than that created by gcc on x86s, so it is well worth using even with optimised compilation. FP-intensive x86 programs see a bigger slowdown, and all Sparc code runs about 5% slower due to us not filling branch delay slots.
The NCG has always been something of a second-class citizen inside GHC, an unloved child, rather. This means that its integration into the compiler as a whole is rather clumsy, which brings some problems described below. That apart, the NCG proper is fairly cleanly designed, as target-independent as it reasonably can be, and so should not be difficult to retarget.
NOTE! The native code generator was largely rewritten as part of the C-- backend changes, around May 2004. Unfortunately the rest of this document still refers to the old version, and was written in relation to the CVS head as of end-Jan 2002. Some of it is relevant, some of it isn't.
absCtoNat :: AbstractC -> UniqSM (SDoc, Pretty.Doc)
The returned SDoc is for debugging, so is empty unless you specify -ddump-stix. The Pretty.Doc bit is the final assembly code. Translation involves three main phases, the first and third of which are target-independent.
The first phase is translation into the Stix representation. Stix is a simple tree-like RTL-style language, in which you can mention:

- the special registers of the abstract machine (MagicId), such as the heap and stack pointers;
- primitive machine-level operations (MachOp);
- if ... goto ... style control-flow.
Stix has two main associated types:

- StixStmt -- trees executed for their side effects: assignments, control transfers, and auxiliary junk such as segment changes and literal data.
- StixExpr -- trees which denote a value.
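To make the flavour of this representation concrete, here is a minimal, purely illustrative Haskell sketch of what such tree-like RTL types could look like. All names here (ToyReg, ToyExpr, ToyStmt and their constructors) are invented for this document and are far simpler than the real Stix types in the GHC sources.

-- Illustrative only: a toy RTL-ish tree language in the spirit of Stix.
-- The real types carry size/representation information and many more cases.
data ToyReg
  = MagicReg String                -- an abstract machine register, e.g. "Hp" or "Sp"
  | TempReg  Int                   -- a compiler-generated virtual register
  deriving (Eq, Show)

data ToyExpr
  = Lit    Integer                 -- literal data
  | Reg    ToyReg                  -- the value held in a register
  | Load   ToyExpr                 -- the value of memory at an address
  | MachOp String [ToyExpr]        -- a primitive machine operation, e.g. "add"
  deriving (Eq, Show)

data ToyStmt
  = Assign   ToyReg  ToyExpr       -- register := expression
  | Store    ToyExpr ToyExpr       -- memory[address] := expression
  | CondJump ToyExpr String        -- if expression /= 0 then goto label
  | Label    String
  deriving (Eq, Show)

-- Example: memory[Hp] := Sp + 8, written in this toy notation.
example :: ToyStmt
example = Store (Reg (MagicReg "Hp"))
                (MachOp "add" [Reg (MagicReg "Sp"), Lit 8])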
Translation into Stix is almost completely target-independent. The needed dependencies are knowledge of word size and endianness, used when generating code to deal with half-word fields in info tables. This could be abstracted out easily enough. Also, the Stix translation needs to know which MagicIds map to registers on the given target, and which are stored at offsets from BaseReg.
After initial Stix generation, the trees are cleaned up with constant-folding and a little copy-propagation ("Stix inlining", as the code misleadingly calls it). We take the opportunity to translate MagicIds which are stored in memory on the given target into suitable memory references. Those which are stored in registers are left alone. There is also a half-hearted attempt to lift literal strings to the top level in cases where nested strings have been observed to give incorrect code in the past.
Primitive machine-level operations will already be phrased in terms of MachOps in the presented Abstract C, and these are passed through unchanged. We comment only that the MachOps have been chosen so as to be easy to implement on all targets, and their meaning is intended to be unambiguous, and the same on all targets, regardless of word size or endianness.
A note on MagicIds. Those which are assigned to registers on the current target are left unmodified. Those which are not are stored in memory as offsets from BaseReg (which is assumed to permanently have the value (&MainCapability.r)), so the constant folder calculates the offsets and inserts suitable loads/stores. One complication is that not all archs have BaseReg itself in a register, so for those (sparc), we instead generate the address as an offset from the static symbol MainCapability, since the register table lives in there.
Finally, BaseReg does occasionally itself get mentioned in Stix expression trees, and in this case what is denoted is precisely (&MainCapability.r), not, as in all other cases, the value of memory at some offset from the start of the register table. Since what it denotes is an r-value and not an l-value, assigning BaseReg is meaningless, so the machinery checks to ensure this never happens. All these details are taken into account by the constant folder.
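As a hedged illustration of the rewriting described above, the sketch below (reusing the toy types from the earlier sketch; the location table and offsets are invented) shows how a use of a magic register could be left alone when it is pinned to a machine register, or turned into a load at an offset from BaseReg when it lives in the register table.

-- Illustrative sketch only; the real logic lives in the Stix constant folder.
data Loc
  = InRealReg String               -- pinned to a machine register on this target
  | InMemory  Integer              -- byte offset from BaseReg into the register table

-- Hypothetical per-target description of where each magic register lives.
whereIs :: String -> Loc
whereIs "Hp" = InRealReg "reg3"    -- invented assignments
whereIs "Sp" = InRealReg "reg4"
whereIs _    = InMemory 36         -- invented offset

-- Rewrite a use of a magic register into either a plain register reference
-- or a load from BaseReg plus an offset.
useOfMagic :: String -> ToyExpr
useOfMagic name =
  case whereIs name of
    InRealReg r   -> Reg (MagicReg r)
    InMemory  off -> Load (MachOp "add" [Reg (MagicReg "BaseReg"), Lit off])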
The second phase is instruction selection: the Stix trees are translated into sequences of Instrs, a data type which is different for each architecture. Instr, unsurprisingly, has various supporting types, such as Reg, Operand, Imm, etc. The generated instructions may refer to specific machine registers, or to arbitrary virtual registers, either those created within the instruction selector, or those mentioned in the Stix passed to it. The instruction selectors live in MachCode.lhs.
The core functions, for each target, are:
getAmode :: StixExpr -> NatM Amode
getRegister :: StixExpr -> NatM Register
assignMem_IntCode :: PrimRep -> StixExpr -> StixExpr -> NatM InstrBlock
assignReg_IntCode :: PrimRep -> StixReg -> StixExpr -> NatM InstrBlock
The insn selectors use the "maximal munch" algorithm. The bizarrely-misnamed getRegister translates expressions. A simplified version of its type is:
getRegister :: StixExpr -> NatM (OrdList Instr, Reg)
That is: it (monadically) turns a StixExpr into a sequence of instructions and a register, with the meaning that after executing the (possibly empty) sequence of instructions, the (possibly virtual) register will hold the resulting value. The real situation is complicated by the presence of fixed registers, and is detailed below.
Maximal munch is a greedy algorithm and is known not to give globally optimal code sequences, but it is good enough, and fast and simple. Early incarnations of the NCG used something more sophisticated, but that is long gone now.
Similarly, getAmode translates a value, intended to denote an address, into a sequence of insns leading up to a (processor-specific) addressing mode. This stuff could be done using the general getRegister selector, but would necessarily generate poorer code, because the calculated address would be forced into a register, which might be unnecessary if it could partially or wholly be calculated using an addressing mode.
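To see the benefit concretely, here is a hedged sketch (invented Amode-style constructors, reusing the toy expression type above) of recognising a register-plus-constant address directly as an addressing mode. The real getAmode runs in the NatM monad and emits set-up instructions when the tree does not fit; the Maybe here merely marks the fallback case.

-- Illustrative sketch only.
data ToyAmode
  = AddrRegImm ToyReg Integer      -- register plus constant displacement
  | AddrReg    ToyReg              -- plain register

toyGetAmode :: ToyExpr -> Maybe ToyAmode
toyGetAmode (MachOp "add" [Reg r, Lit n]) = Just (AddrRegImm r n)
toyGetAmode (MachOp "add" [Lit n, Reg r]) = Just (AddrRegImm r n)
toyGetAmode (Reg r)                       = Just (AddrReg r)
toyGetAmode _                             = Nothing   -- fall back to computing into a register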
Finally, assignMem_IntCode and assignReg_IntCode create instruction sequences to calculate a value and store it in the given register, or at the given address. Because these guys translate a statement, not a value, they just return a sequence of insns and no associated register. Floating-point and 64-bit integer assignments have analogous selectors.
Apart from the complexities of fixed vs floating registers, discussed below, the instruction selector is as simple as it can be. It looks long and scary but detailed examination reveals it to be fairly straightforward.
The third phase is register allocation. AsmRegAlloc.lhs takes sequences of Instrs which mention a mixture of real and virtual registers, and returns a modified sequence referring only to real ones. It is gloriously and entirely target-independent. Well, that is not exactly true: instead, it regards Instr (instructions) and Reg (virtual and real registers) as abstract types, to which it has the following interface:
insnFuture :: Instr -> InsnFuture
regUsage :: Instr -> RegUsage
patchRegs :: Instr -> (Reg -> Reg) -> Instr
insnFuture is used to (re)construct the graph of all possible control transfers between the insns to be allocated. regUsage returns the sets of registers read and written by an instruction. And patchRegs is used to apply the allocator's final decision on virtual-to-real reg mapping to an instruction. Clearly these 3 fns have to be written anew for each architecture. They are defined in RegAllocInfo.lhs. Think twice, no, thrice, before modifying them: making false claims about insn behaviour will lead to hard-to-find register allocation errors.
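For a feel of what writing these per-architecture functions involves, here is a hedged sketch for a made-up two-instruction architecture; the toy names (ToyInstr, toyRegUsage, toyPatchRegs) are invented stand-ins for the real Instr, regUsage and patchRegs.

-- Illustrative sketch only, reusing the toy register type from above.
data ToyInstr
  = TMOV ToyReg ToyReg             -- dst := src
  | TADD ToyReg ToyReg ToyReg      -- dst := src1 + src2
  deriving Show

data ToyRegUsage = ToyRegUsage { readsFrom :: [ToyReg], writesTo :: [ToyReg] }

toyRegUsage :: ToyInstr -> ToyRegUsage
toyRegUsage (TMOV dst src)       = ToyRegUsage [src]        [dst]
toyRegUsage (TADD dst src1 src2) = ToyRegUsage [src1, src2] [dst]

toyPatchRegs :: ToyInstr -> (ToyReg -> ToyReg) -> ToyInstr
toyPatchRegs (TMOV dst src)       env = TMOV (env dst) (env src)
toyPatchRegs (TADD dst src1 src2) env = TADD (env dst) (env src1) (env src2)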
AsmRegAlloc.lhs contains detailed comments about how the allocator works. Here is a summary. The head honcho
allocUsingTheseRegs :: [Instr] -> [Reg] -> (Bool, [Instr])
takes a list of instructions and a list of real registers available for allocation, and maps as many of the virtual regs in the input into real ones as it can. The returned Bool indicates whether or not it was successful. If so, that's the end of it. If not, the caller of allocUsingTheseRegs will attempt spilling. More of that later. What allocUsingTheseRegs does is:
- Using insnFuture, create the set of all flow edges -- possible control transfers -- within this set of insns.
- Using regUsage and iterating around the flow graph from the previous step, calculate, for each virtual register, the set of flow edges on which it is live (a simplified sketch of this step follows the list).
- Using that liveness information, choose a real register for each virtual one, and use patchInstr to apply the resulting mapping to the insns themselves.
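The middle step is the interesting one. As a hedged, much-simplified sketch (reusing the toy types above, and treating the code as one straight-line block rather than a graph of flow edges), the liveness information can be obtained by a single backwards scan:

import Data.List (union, (\\))

-- Illustrative sketch only: live-in set before each instruction of a single
-- straight-line block.  The real allocator computes liveness per flow edge
-- by iterating around the reconstructed control-flow graph.
livenessBackwards :: [ToyInstr] -> [[ToyReg]]
livenessBackwards insns = snd (foldr step ([], []) insns)
  where
    step insn (liveAfter, acc) =
      let ToyRegUsage rds wrs = toyRegUsage insn
          liveBefore = rds `union` (liveAfter \\ wrs)
      in (liveBefore, liveBefore : acc)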
Spilling
If allocUsingTheseRegs fails, a baroque mechanism comes into play. We now know that much simpler schemes are available to do the same thing and give better results. Anyways:
The logic above allocUsingTheseRegs, in doGeneralAlloc and runRegAllocate, observes that allocation has failed with some set R of real registers. So it applies runRegAllocate a second time to the code, but removes (typically) two registers from R before doing so. This naturally fails too, but returns a partially-allocated sequence. doGeneralAlloc then inserts spill code into the sequence, and finally re-runs allocUsingTheseRegs, but supplying the original, unadulterated R. This is guaranteed to succeed since the two registers previously removed from R are sufficient to allocate all the spill/restore instructions added.
Because x86 is very short of registers, and in the worst case needs three removed from R, a softly-softly approach is used. doGeneralAlloc first tries with zero regs removed from R, then if that fails one, then two, etc. This means allocUsingTheseRegs may get run several times before a successful arrangement is arrived at. findReservedRegs cooks up the sets of spill registers to try with.
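A hedged sketch of this retry loop follows. The names toyGeneralAlloc and insertSpills are invented; alloc stands in for allocUsingTheseRegs, and the reserve sets play the role of what findReservedRegs produces (smallest first). The real doGeneralAlloc also actually constructs the spill code, which is elided here.

-- Illustrative sketch only.
toyGeneralAlloc
  :: ([ToyInstr] -> [ToyReg] -> (Bool, [ToyInstr]))   -- allocUsingTheseRegs
  -> [[ToyReg]]                                       -- reserve sets, smallest first
  -> [ToyReg]                                         -- the full set R
  -> [ToyInstr]
  -> Maybe [ToyInstr]
toyGeneralAlloc alloc reserveSets allRegs insns = go reserveSets
  where
    go []                = Nothing                    -- even the largest reservation failed
    go (reserved : rest) =
      case alloc insns [ r | r <- allRegs, r `notElem` reserved ] of
        (True,  done)    -> Just done
        (False, partial) ->
          -- insert spill loads/stores into 'partial' (elided), then re-run
          -- with the original, unreduced register set R
          case alloc (insertSpills partial) allRegs of
            (True,  done) -> Just done
            (False, _)    -> go rest

    insertSpills = id    -- placeholder: the real code adds spill/restore insns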
The resulting machinery is complicated and the generated spill code is appalling. The saving grace is that spills are very rare so it doesn't matter much. I did not invent this -- I inherited it.
Dealing with common cases fast
The entire reg-alloc mechanism described so far is general and correct, but expensive overkill for many simple code blocks. So to begin with we use doSimpleAlloc, which attempts to do something simple. It exploits the observation that if the total number of virtual registers does not exceed the number of real ones available, we can simply dole out a new realreg each time we see mention of a new vreg, with no regard for control flow. doSimpleAlloc therefore attempts this in a single pass over the code. It gives up if it runs out of real regs or sees any condition which renders the above observation invalid (fixed reg uses, for example).
This clever hack handles the majority of code blocks quickly. It was copied from the previous reg-allocator (the Mattson/Partain/Marlow/Gill one).
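A hedged sketch of the underlying observation, again with the toy types (and as a two-pass version for clarity, where the real doSimpleAlloc works in a single pass and also bails out on fixed-register uses):

import Data.List (nub)

-- Illustrative sketch only: if every virtual register can be given its own
-- real register, pair them up wholesale and patch the code; no liveness needed.
toySimpleAlloc :: [ToyReg] -> [ToyInstr] -> Maybe [ToyInstr]
toySimpleAlloc realRegs insns
  | length vregs > length realRegs = Nothing          -- give up; use the general allocator
  | otherwise                      = Just [ toyPatchRegs i patch | i <- insns ]
  where
    isVirtual (TempReg _) = True
    isVirtual _           = False

    vregs = nub [ r | i <- insns
                    , let ToyRegUsage rds wrs = toyRegUsage i
                    , r <- rds ++ wrs
                    , isVirtual r ]

    mapping = zip vregs realRegs

    patch r = case lookup r mapping of
                Just rr -> rr
                Nothing -> r                           -- real registers are left alone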
The instruction selectors, particularly getRegister, are complicated by the fact that some expressions can only be computed into a specific register, whereas the majority can be computed into any register. We take x86 as an example, but the problem applies to all archs.
Terminology: rreg means a real machine register; vreg means one of an infinite set of virtual registers. The type Reg is the sum of rreg and vreg. The instruction selector generates sequences with unconstrained use of vregs, leaving the register allocator to map them all into rregs.
Now, where was I? Oh yes. We return to the type of getRegister, which, despite its name, selects instructions to compute the value of an expression tree.
getRegister :: StixExpr -> NatM Register

data Register
  = Fixed PrimRep Reg InstrBlock
  | Any   PrimRep (Reg -> InstrBlock)

type InstrBlock -- sequence of instructions

At first this looks eminently reasonable (apart from the stupid name).
getRegister, and nobody else, knows whether or not a given expression has to be computed into a fixed rreg or can be computed into any rreg or vreg. In the first case, it returns Fixed and indicates which rreg the result is in. In the second case it defers committing to any specific target register by returning a function from Reg to InstrBlock, and the caller can specify the target reg as it sees fit.
Unfortunately, that forces getRegister's callers (usually itself) to use a clumsy and confusing idiom in the common case where they do not care what register the result winds up in. The reason is that although a value might be computed into a fixed rreg, we are forbidden (on pain of segmentation fault :) from subsequently modifying the fixed reg. This and other rules are recorded in "Rules of the game" inside MachCode.lhs.
Why can't fixed registers be modified post-hoc? Consider a simple expression like Hp+1. Since the heap pointer Hp is definitely in a fixed register, call it R, getRegister on subterm Hp will simply return Fixed with an empty sequence and R. But we can't just emit an increment instruction for R, because that trashes Hp; instead we first have to copy it into a fresh vreg and increment that.
With all that in mind, consider now writing a getRegister clause for terms of the form (1 + E). Contrived, yes, but it illustrates the matter. First we do getRegister on E. Now we are forced to examine what comes back.
getRegister (OnePlus e)
   = getRegister e                `thenNat` \ e_result ->
     case e_result of
        Fixed e_rep e_fixed e_code
           -> returnNat (Any IntRep (\dst -> e_code ++ [MOV e_fixed dst, INC dst]))
        Any e_rep e_any
           -> returnNat (Any IntRep (\dst -> e_any dst ++ [INC dst]))

This seems unreasonably cumbersome, yet the instruction selector is full of such idioms. A good example of the complexities induced by this scheme is shown by
trivialCode for x86 in MachCode.lhs. This deals with general integer dyadic operations on x86 and has numerous cases. It was difficult to get right.
An alternative suggestion is to simplify the type of getRegister to this:
getRegister :: StixExpr -> NatM (InstrBlock, VReg)
type VReg = .... a vreg ...

and then we could safely write
getRegister (OnePlus e)
   = getRegister e                `thenNat` \ (e_code, e_vreg) ->
     returnNat (e_code ++ [INC e_vreg], e_vreg)

which is about as straightforward as you could hope for. Unfortunately, it requires getRegister to insert moves of values which naturally compute into an rreg into a vreg. Consider:
1 + ccall some-C-fn

On x86 the ccall result is returned in rreg %eax. The resulting sequence, prior to register allocation, would be:
# push args
call some-C-fn
# move %esp to nuke args
movl %eax, %vreg
incl %vreg

If, as is likely, %eax is not held live beyond this point for any other purpose, the move into a fresh register is pointless; we'd have been better off leaving the value in %eax as long as possible.
The simplified getRegister story is attractive. It would clean up the instruction selectors significantly and make it simpler to write new ones. The only drawback is that it generates redundant register moves. I suggest that eliminating these should be the job of the register allocator. Indeed, a version of the simplified getRegister mechanism already applies in the 64-bit code generation, described next.
The relevant functions are:
assignMem_I64Code :: StixExpr -> StixExpr -> NatM InstrBlock
assignReg_I64Code :: StixReg  -> StixExpr -> NatM InstrBlock
iselExpr64        :: StixExpr -> NatM ChildCode64

data ChildCode64       -- a.k.a "Register64"
   = ChildCode64
        InstrBlock     -- code
        VRegUnique     -- unique for the lower 32-bit temporary
iselExpr64 is the 64-bit, plausibly-named analogue of getRegister, and ChildCode64 is the analogue of Register. The aim here was to generate working 64-bit code as simply as possible. To this end, I used the simplified getRegister scheme described above, in which iselExpr64 generates its results into two vregs which can always safely be modified afterwards.
Virtual registers are, unsurprisingly, distinguished by their Uniques. There is a small difficulty in how to know what the vreg for the upper 32 bits of a value is, given the vreg for the lower 32 bits. The simple solution adopted is to say that any low-32 vreg may also have a hi-32 counterpart which shares the same unique, but is otherwise regarded as a separate entity. getHiVRegFromLo gets one from the other.
data VRegUnique
   = VRegUniqueLo Unique     -- lower part of a split quantity
   | VRegUniqueHi Unique     -- upper part thereof

Apart from that, 64-bit code generation is really simple. The sparc and x86 versions are almost copy-n-pastes of each other, with minor adjustments for endianness. The generated code isn't wonderful but is certainly acceptable, and it works.
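For a flavour of how the lo/hi pairing gets used, here is a hedged sketch of lowering a 64-bit addition on a 32-bit target: add the low words, then add the high words with carry. The instruction names and the ToyVRegU/hiFromLo helpers are invented analogues of VRegUnique and getHiVRegFromLo, not the real MachCode definitions.

-- Illustrative sketch only.
data ToyVRegU
  = VLo Int                        -- lower 32-bit half of a 64-bit temporary
  | VHi Int                        -- upper half, sharing the same unique
  deriving (Eq, Show)

hiFromLo :: ToyVRegU -> ToyVRegU   -- analogous to getHiVRegFromLo
hiFromLo (VLo u) = VHi u
hiFromLo v       = v

data Toy64Instr
  = MOV32 ToyVRegU ToyVRegU        -- dst := src
  | ADD32 ToyVRegU ToyVRegU        -- dst := dst + src        (sets carry)
  | ADC32 ToyVRegU ToyVRegU        -- dst := dst + src + carry
  deriving Show

-- dst := a + b, where each 64-bit value lives in a (lo, hi) pair of vregs.
add64 :: ToyVRegU -> ToyVRegU -> ToyVRegU -> [Toy64Instr]
add64 dstLo aLo bLo =
  [ MOV32 dstLo            aLo
  , MOV32 (hiFromLo dstLo) (hiFromLo aLo)
  , ADD32 dstLo            bLo
  , ADC32 (hiFromLo dstLo) (hiFromLo bLo)
  ]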
At the moment the register allocator has to reconstruct the control-flow graph of each code block from scratch, using the insnFuture function in the Instr pseudo-abstract type.
This is really silly, because all that information is present at the abstract C stage, but is thrown away in the translation to Stix. So a good thing to do is to modify that translation to produce a directed graph of Stix straight-line code blocks, and to preserve that structure through the insn selector, so the allocator can see it.
This would eliminate the fragile, hacky, arch-specific insnFuture mechanism, and probably make the whole compiler run measurably faster. Register allocation is a fair chunk of the time of non-optimising compilation (10% or more), and reconstructing the flow graph is an expensive part of reg-alloc. It would probably accelerate the vreg liveness computation too.
I tried it in Rigr, replacing the previous spiller which used the current GHC scheme described above, and it cut the number of spill loads and stores by a factor of eight. Not to mention being simpler, easier to understand and very fast.
The Traub paper also describes how to extend their method to multiple basic blocks, which will be needed for GHC. It comes down to reconciling multiple vreg-to-rreg mappings at points where control flow merges.
Our strategy is: do a good job for the common small subset, that is integer loads, stores, address calculations, basic ALU ops (+, -, and, or, xor), and jumps. That covers the vast majority of executed insns. And indeed we do do a good job, with a loss of less than 2% compared with gcc.
Initially we tried to handle integer instructions with awkward register constraints (mul, div, shifts by non-constant amounts) via various jigglings of the spiller et al. This never worked robustly, and putting platform-specific tweaks in the generic infrastructure is a big No-No. (Not quite true; shifts by a non-constant amount are still done by a giant kludge, and should be moved into this new framework.)
Fortunately, all such insns are rare. So the current scheme is to pretend that they don't have any such constraints. This fiction is carried all the way through the register allocator. When the insn finally comes to be printed, we emit a sequence which copies the operands through memory (%esp-relative), satisfying the constraints of the real instruction. This localises the gruesomeness to just one place. Here, for example, is the code generated for integer division of %esi by %ecx:
# BEGIN IQUOT %ecx, %esi
pushl $0
pushl %eax
pushl %edx
pushl %ecx
movl %esi, %eax
cltd
idivl 0(%esp)
movl %eax, 12(%esp)
popl %edx
popl %edx
popl %eax
popl %esi
# END IQUOT %ecx, %esi

This is not quite as appalling as it seems, if you consider that the division itself typically takes 16+ cycles, whereas the rest of the insns probably go through in about 1 cycle each.
This trick is taken to extremes for FP operations. All notions of the x86 FP stack and its insns have been removed. Instead, we pretend, to the instruction selector and register allocator, that x86 has six floating point registers, %fake0 .. %fake5, which can be used in the usual flat manner. We further claim that x86 has floating point instructions very similar to SPARC and Alpha, that is, a simple 3-operand register-register arrangement. Code generation and register allocation proceed on this basis.
When we come to print out the final assembly, our convenient fiction is converted to dismal reality. Each fake instruction is independently converted to a series of real x86 instructions. %fake0 .. %fake5 are mapped to %st(0) .. %st(5). To do reg-reg arithmetic operations, the two operands are pushed onto the top of the FP stack, the operation done, and the result copied back into the relevant register. When one of the operands is also the destination, we emit a slightly less scummy translation. There are only six %fake registers because 2 are needed for the translation, and x86 has 8 in total.
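As a hedged sketch of what the printing step might expand a 3-operand fake FP add into, the helper below emits x87 instructions under the fixed %fakeN-to-%st(N) mapping: push copies of both operands, add, and store the result back. The names are invented and the real translation handles more cases, including the less scummy form used when an operand is also the destination.

-- Illustrative sketch only: expand "fakeDst := fakeA + fakeB" into x87 code.
fakeToSt :: Int -> String
fakeToSt n = "%st(" ++ show n ++ ")"

expandFakeFAdd :: Int -> Int -> Int -> [String]
expandFakeFAdd dst a b =
  [ "fld   " ++ fakeToSt a         -- push a copy of operand A
  , "fld   " ++ fakeToSt (b + 1)   -- push a copy of B (indices shifted by the push)
  , "faddp %st, %st(1)"            -- add and pop; result ends up in %st(0)
  , "fstp  " ++ fakeToSt (dst + 1) -- copy result into dst's slot and pop
  ]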
The translation is inefficient but is simple and it works. A cleverer translation would handle a sequence of insns, simulating the FP stack contents, would not impose a fixed mapping from %fake to %st regs, and hopefully could avoid most of the redundant reg-reg moves of the current translation.
There are, however, two unforeseen bad side effects:
Anyways: inside NCG-created x86 FP code this all works fine. However, the NCG's fiction of a flat register set does not operate the x87 register stack in the required stack-like way. When control returns to a gcc-generated world, the stack tag bits soon cause stack exceptions, and thus garbage results.
The only fix I could think of -- and it is horrible -- is to clear all the tag bits just before the next STG-level entry, in chunks of code which use FP insns. i386_insert_ffrees inserts the relevant ffree insns into such code blocks. It depends critically on is_G_instr to detect such blocks.
The ccall insn selectors (genCCall) have proven surprisingly difficult to get right, and soaked up a lot of debugging time. As a result, I have once again opted for schemes which are simple and not too difficult to argue as correct, even if they don't generate excellent code.
The sparc ccall generator in particular forces all arguments into temporary virtual registers before moving them to the final out-registers (%o0 .. %o5). This creates some unnecessary reg-reg moves. The reason is explained in a comment in the code.
Doing so would have the added benefit that the NCG could be used to compile more "ways" -- well, at least the 'p' profiling way.
The usual syndrome is that some program, when compiled via C, works, but not when compiled via the NCG. Usually the problem is fairly simple to fix, once you find the specific code block which has been mistranslated. But the latter can be nearly impossible, since most modules generate at least hundreds and often thousands of them.
My solution: cheat.
Because the via-C and native routes diverge only late in the day, it is not difficult to construct a 1-1 correspondence between basic blocks on the two routes. So, if the program works via C but not on the NCG, do the following:
1. Recompile AsmCodeGen.lhs in the afflicted compiler with -DDEBUG_NCG, so that it inserts ___ncg_debug_markers into the assembly it emits.
2. Compile the offending module via gcc and via the NCG, saving the two assembly outputs as ModuleName.s-gcc and ModuleName.s-nat respectively. Check that the latter does indeed have ___ncg_debug_markers in it; otherwise the next steps fail.
3. Build the utility fptools/ghc/utils/debugNCG/diff_gcc_nat.
4. Run diff_gcc_nat ModuleName.s. This will construct the 1-1 correspondence, and emits on stdout a cppable assembly output. Place this in a file -- I always call it synth.S. Note, the capital S is important; otherwise it won't get cpp'd. You can feed this file directly to ghc and it will automatically get cpp'd; you don't have to do so yourself.
5. By tweaking the #defines at the top of synth.S, do a binary search to find the incorrect block. Keep a careful record of where you are in the search; it is easy to get confused. Remember also that multiple blocks may be wrong, which also confuses matters. Finally, I usually start off by re-checking that I can build the executable with all the #defines set to 0 and then all to 1. This ensures you won't get halfway through the search and then get stuck due to some snafu with gcc-specific literals. Usually I set UNMATCHED_GCC to 1 all the time, and this bit should contain only literal data. UNMATCHED_NAT should be empty.
diff_gcc_nat was known to work correctly last time I used it, in December 01, for both x86 and sparc. If it doesn't work, due to changes in assembly syntax, or whatever, make it work. The investment is well worth it. Searching for the incorrect block(s) any other way is a total time waster.
Last modified: Fri Feb 1 16:14:11 GMT 2002