X-Git-Url: http://git.megacz.com/?p=ghc-hetmet.git;a=blobdiff_plain;f=docs%2Fcomm%2Fthe-beast%2Fncg.html;fp=docs%2Fcomm%2Fthe-beast%2Fncg.html;h=5810a352126232bfe592597efad525e5a261f55e;hp=0000000000000000000000000000000000000000;hb=0065d5ab628975892cea1ec7303f968c3338cbe1;hpb=28a464a75e14cece5db40f2765a29348273ff2d2 diff --git a/docs/comm/the-beast/ncg.html b/docs/comm/the-beast/ncg.html new file mode 100644 index 0000000..5810a35 --- /dev/null +++ b/docs/comm/the-beast/ncg.html @@ -0,0 +1,749 @@ + + + + + The GHC Commentary - The Native Code Generator + + + +

The GHC Commentary - The Native Code Generator

+

+ On some platforms (currently x86 and PowerPC, with bitrotted + support for Sparc and Alpha), GHC can generate assembly code + directly, without having to go via C. This can sometimes almost + halve compilation time, and avoids the fragility and + horribleness of the mangler. The NCG + is enabled by default for + non-optimising compilation on supported platforms. For most programs + it generates code which runs only 1-3% slower + (depending on platform and type of code) than that + created by gcc on x86s, so it is well worth using even with + optimised compilation. FP-intensive x86 programs see a bigger + slowdown, and all Sparc code runs about 5% slower due to + us not filling branch delay slots. +

+ The NCG has always been something of a second-class citizen + inside GHC, an unloved child, rather. This means that its + integration into the compiler as a whole is rather clumsy, which + brings some problems described below. That apart, the NCG + proper is fairly cleanly designed, as target-independent as it + reasonably can be, and so should not be difficult to retarget. +

+ NOTE! The native code generator was largely rewritten as part + of the C-- backend changes, around May 2004. Unfortunately the + rest of this document still refers to the old version, and was written + with relation to the CVS head as of end-Jan 2002. Some of it is relevant, + some of it isn't. + +

Overview

+ The top-level code generator fn is +

+ absCtoNat :: AbstractC -> UniqSM (SDoc, Pretty.Doc) +

+ The returned SDoc is for debugging, so is empty unless + you specify -ddump-stix. The Pretty.Doc + bit is the final assembly code. Translation involves three main + phases, the first and third of which are target-independent. +

+ +

+

Complications, observations, and possible improvements

+ +

Real vs virtual registers in the instruction selectors

+ +The instruction selectors for expression trees, namely +getRegister, are complicated by the fact that some +expressions can only be computed into a specific register, whereas +the majority can be computed into any register. We take x86 as an +example, but the problem applies to all archs. +

+Terminology: rreg means real register, a real machine +register. vreg means one of an infinite set of virtual +registers. The type Reg is the sum of rreg and +vreg. The instruction selector generates sequences with +unconstrained use of vregs, leaving the register allocator to map them +all into rregs. +

+Now, where was I ? Oh yes. We return to the type of +getRegister, which despite its name, selects instructions +to compute the value of an expression tree. +

+   getRegister :: StixExpr -> NatM Register
+
+   data Register
+     = Fixed   PrimRep Reg InstrBlock
+     | Any     PrimRep (Reg -> InstrBlock)
+
+   type InstrBlock -- sequence of instructions
+
+At first this looks eminently reasonable (apart from the stupid +name). getRegister, and nobody else, knows whether or +not a given expression has to be computed into a fixed rreg or can be +computed into any rreg or vreg. In the first case, it returns +Fixed and indicates which rreg the result is in. In the +second case it defers committing to any specific target register by +returning a function from Reg to InstrBlock, +and the caller can specify the target reg as it sees fit. +

+Unfortunately, that forces getRegister's callers (usually +itself) to use a clumsy and confusing idiom in the common case where +they do not care what register the result winds up in. The reason is +that although a value might be computed into a fixed rreg, we are +forbidden (on pain of segmentation fault :) from subsequently +modifying the fixed reg. This and other rules are record in "Rules of +the game" inside MachCode.lhs. +

+Why can't fixed registers be modified post-hoc? Consider a simple +expression like Hp+1. Since the heap pointer +Hp is definitely in a fixed register, call it R, +getRegister on subterm Hp will simply return +Fixed with an empty sequence and R. But we can't just +emit an increment instruction for R, because that trashes +Hp; instead we first have to copy it into a fresh vreg +and increment that. +

+With all that in mind, consider now writing a getRegister +clause for terms of the form (1 + E). Contrived, yes, +but illustrates the matter. First we do +getRegister on E. Now we are forced to examine what +comes back. +

+   getRegister (OnePlus e)
+      = getRegister e           `thenNat`   \ e_result ->
+        case e_result of
+           Fixed e_code e_fixed 
+              -> returnNat (Any IntRep (\dst -> e_code ++ [MOV e_fixed dst, INC dst]))
+           Any e_any 
+              -> Any (\dst -> e_any dst ++ [INC dst])
+
+This seems unreasonably cumbersome, yet the instruction selector is +full of such idioms. A good example of the complexities induced by +this scheme is shown by trivialCode for x86 in +MachCode.lhs. This deals with general integer dyadic +operations on x86 and has numerous cases. It was difficult to get +right. +

+An alternative suggestion is to simplify the type of +getRegister to this: +

+   getRegister :: StixExpr -> NatM (InstrBloc, VReg)
+   type VReg = .... a vreg ...
+
+and then we could safely write +
+   getRegister (OnePlus e)
+      = getRegister e        `thenNat`  \ (e_code, e_vreg) ->
+        returnNat (e_code ++ [INC e_vreg], e_vreg)
+
+which is about as straightforward as you could hope for. +Unfortunately, it requires getRegister to insert moves of +values which naturally compute into an rreg, into a vreg. Consider: +
+   1 + ccall some-C-fn
+
+On x86 the ccall result is returned in rreg %eax. The +resulting sequence, prior to register allocation, would be: +
+   # push args
+   call some-C-fn
+   # move %esp to nuke args
+   movl   %eax, %vreg
+   incl   %vreg
+
+If, as is likely, %eax is not held live beyond this point +for any other purpose, the move into a fresh register is pointless; +we'd have been better off leaving the value in %eax as +long as possible. +

+The simplified getRegister story is attractive. It would +clean up the instruction selectors significantly and make it simpler +to write new ones. The only drawback is that it generates redundant +register moves. I suggest that eliminating these should be the job +of the register allocator. Indeed: +

+ + +

Selecting insns for 64-bit values/loads/stores on 32-bit platforms

+ +Note that this stuff doesn't apply on 64-bit archs, since the +getRegister mechanism applies there. + +The relevant functions are: +
+   assignMem_I64Code :: StixExpr -> StixExpr -> NatM InstrBlock
+   assignReg_I64Code :: StixReg  -> StixExpr -> NatM InstrBlock
+   iselExpr64        :: StixExpr -> NatM ChildCode64
+
+   data ChildCode64     -- a.k.a "Register64"
+      = ChildCode64 
+           InstrBlock   -- code
+           VRegUnique   -- unique for the lower 32-bit temporary
+
+iselExpr64 is the 64-bit, plausibly-named analogue of +getRegister, and ChildCode64 is the analogue +of Register. The aim here was to generate working 64 +bit code as simply as possible. To this end, I used the +simplified getRegister scheme described above, in which +iselExpr64generates its results into two vregs which +can always safely be modified afterwards. +

+Virtual registers are, unsurprisingly, distinguished by their +Uniques. There is a small difficulty in how to +know what the vreg for the upper 32 bits of a value is, given the vreg +for the lower 32 bits. The simple solution adopted is to say that +any low-32 vreg may also have a hi-32 counterpart which shares the +same unique, but is otherwise regarded as a separate entity. +getHiVRegFromLo gets one from the other. +

+   data VRegUnique
+      = VRegUniqueLo Unique          -- lower part of a split quantity
+      | VRegUniqueHi Unique          -- upper part thereof
+
+Apart from that, 64-bit code generation is really simple. The sparc +and x86 versions are almost copy-n-pastes of each other, with minor +adjustments for endianness. The generated code isn't wonderful but +is certainly acceptable, and it works. + + + +

Shortcomings and inefficiencies in the register allocator

+ +

Redundant reconstruction of the control flow graph

+ +The allocator goes to considerable computational expense to construct +all the flow edges in the group of instructions it's allocating for, +by using the insnFuture function in the +Instr pseudo-abstract type. +

+This is really silly, because all that information is present at the +abstract C stage, but is thrown away in the translation to Stix. +So a good thing to do is to modify that translation to +produce a directed graph of Stix straight-line code blocks, +and to preserve that structure through the insn selector, so the +allocator can see it. +

+This would eliminate the fragile, hacky, arch-specific +insnFuture mechanism, and probably make the whole +compiler run measurably faster. Register allocation is a fair chunk +of the time of non-optimising compilation (10% or more), and +reconstructing the flow graph is an expensive part of reg-alloc. +It would probably accelerate the vreg liveness computation too. + +

Really ridiculous method for doing spilling

+ +This is a more ambitious suggestion, but ... reg-alloc should be +reimplemented, using the scheme described in "Quality and speed in +linear-scan register allocation." (Traub?) For straight-line code +blocks, this gives an elegant one-pass algorithm for assigning +registers and creating the minimal necessary spill code, without the +need for reserving spill registers ahead of time. +

+I tried it in Rigr, replacing the previous spiller which used the +current GHC scheme described above, and it cut the number of spill +loads and stores by a factor of eight. Not to mention being simpler, +easier to understand and very fast. +

+The Traub paper also describes how to extend their method to multiple +basic blocks, which will be needed for GHC. It comes down to +reconciling multiple vreg-to-rreg mappings at points where control +flow merges. + +

Redundant-move support for revised instruction selector suggestion

+ +As mentioned above, simplifying the instruction selector will require +the register allocator to try and allocate source and destination +vregs to the same rreg in reg-reg moves, so as to make as many as +possible go away. Without that, the revised insn selector would +generate worse code than at present. I know this stuff has been done +but know nothing about it. The Linear-scan reg-alloc paper mentioned +above does indeed mention a bit about it in the context of single +basic blocks, but I don't know if that's sufficient. + + + +

x86 arcana that you should know about

+ +The main difficulty with x86 is that many instructions have fixed +register constraints, which can occasionally make reg-alloc fail +completely. And the FPU doesn't have the flat register model which +the reg-alloc abstraction (implicitly) assumes. +

+Our strategy is: do a good job for the common small subset, that is +integer loads, stores, address calculations, basic ALU ops (+, -, +and, or, xor), and jumps. That covers the vast majority of +executed insns. And indeed we do do a good job, with a loss of +less than 2% compared with gcc. +

+Initially we tried to handle integer instructions with awkward +register constraints (mul, div, shifts by non-constant amounts) via +various jigglings of the spiller et al. This never worked robustly, +and putting platform-specific tweaks in the generic infrastructure is +a big No-No. (Not quite true; shifts by a non-constant amount are +still done by a giant kludge, and should be moved into this new +framework.) +

+Fortunately, all such insns are rare. So the current scheme is to +pretend that they don't have any such constraints. This fiction is +carried all the way through the register allocator. When the insn +finally comes to be printed, we emit a sequence which copies the +operands through memory (%esp-relative), satisfying the +constraints of the real instruction. This localises the gruesomeness +to just one place. Here, for example, is the code generated for +integer divison of %esi by %ecx: +

+   # BEGIN IQUOT %ecx, %esi
+   pushl $0
+   pushl %eax  
+   pushl %edx
+   pushl %ecx
+   movl  %esi,% eax
+   cltd
+   idivl 0(%esp)
+   movl %eax, 12(%esp)
+   popl %edx  
+   popl %edx
+   popl %eax
+   popl %esi
+   # END   IQUOT %ecx, %esi
+
+This is not quite as appalling as it seems, if you consider that the +division itself typically takes 16+ cycles, whereas the rest of the +insns probably go through in about 1 cycle each. +

+This trick is taken to extremes for FP operations. +

+All notions of the x86 FP stack and its insns have been removed. +Instead, we pretend, to the instruction selector and register +allocator, that x86 has six floating point registers, +%fake0 .. %fake5, which can be used in the +usual flat manner. We further claim that x86 has floating point +instructions very similar to SPARC and Alpha, that is, a simple +3-operand register-register arrangement. Code generation and register +allocation proceed on this basis. +

+When we come to print out the final assembly, our convenient fiction +is converted to dismal reality. Each fake instruction is +independently converted to a series of real x86 instructions. +%fake0 .. %fake5 are mapped to +%st(0) .. %st(5). To do reg-reg arithmetic +operations, the two operands are pushed onto the top of the FP stack, +the operation done, and the result copied back into the relevant +register. When one of the operands is also the destination, we emit a +slightly less scummy translation. There are only six +%fake registers because 2 are needed for the translation, +and x86 has 8 in total. +

+The translation is inefficient but is simple and it works. A cleverer +translation would handle a sequence of insns, simulating the FP stack +contents, would not impose a fixed mapping from %fake to +%st regs, and hopefully could avoid most of the redundant +reg-reg moves of the current translation. +

+There are, however, two unforeseen bad side effects: +

+ + + +

Generating code for ccalls

+ +For reasons I don't really understand, the instruction selectors for +generating calls to C (genCCall) have proven surprisingly +difficult to get right, and soaked up a lot of debugging time. As a +result, I have once again opted for schemes which are simple and not +too difficult to argue as correct, even if they don't generate +excellent code. +

+The sparc ccall generator in particular forces all arguments into +temporary virtual registers before moving them to the final +out-registers (%o0 .. %o5). This creates +some unnecessary reg-reg moves. The reason is explained in a +comment in the code. + + +

Duplicate implementation for many STG macros

+ +This has been discussed at length already. It has caused a couple of +nasty bugs due to subtle untracked divergence in the macro +translations. The macro-expander really should be pushed up into the +Abstract C phase, so the problem can't happen. +

+Doing so would have the added benefit that the NCG could be used to +compile more "ways" -- well, at least the 'p' profiling way. + + +

How to debug the NCG without losing your sanity/hair/cool

+ +Last, but definitely not least ... +

+The usual syndrome is that some program, when compiled via C, works, +but not when compiled via the NCG. Usually the problem is fairly +simple to fix, once you find the specific code block which has been +mistranslated. But the latter can be nearly impossible, since most +modules generate at least hundreds and often thousands of them. +

+My solution: cheat. +

+Because the via-C and native routes diverge only late in the day, +it is not difficult to construct a 1-1 correspondence between basic +blocks on the two routes. So, if the program works via C but not on +the NCG, do the following: +

+

+diff_gcc_nat was known to work correctly last time I used +it, in December 01, for both x86 and sparc. If it doesn't work, due +to changes in assembly syntax, or whatever, make it work. The +investment is well worth it. Searching for the incorrect block(s) any +other way is a total time waster. + + + + + + + + +

+ +Last modified: Fri Feb 1 16:14:11 GMT 2002 + + + +