From: sewardj Date: Fri, 1 Feb 2002 16:18:33 +0000 (+0000) Subject: [project @ 2002-02-01 16:18:33 by sewardj] X-Git-Tag: Approximately_9120_patches~189 X-Git-Url: http://git.megacz.com/?a=commitdiff_plain;h=1100e42470e5baa05aef6833772805901398b48b;p=ghc-hetmet.git [project @ 2002-02-01 16:18:33 by sewardj] Finish off the NCG documentation. --- diff --git a/ghc/docs/comm/the-beast/ncg.html b/ghc/docs/comm/the-beast/ncg.html index 2f0d799..0dea3e2 100644 --- a/ghc/docs/comm/the-beast/ncg.html +++ b/ghc/docs/comm/the-beast/ncg.html @@ -118,7 +118,7 @@
assignReg_IntCode :: PrimRep -> StixReg -> StixExpr -> NatM InstrBlock

- The insn selectors use the "maximal munch" property. The + The insn selectors use the "maximal munch" algorithm. The bizarrely-misnamed getRegister translates expressions. A simplified version of its type is:

@@ -232,7 +232,9 @@

If allocUsingTheseRegs fails, a baroque mechanism comes into play. We now know that much simpler - schemes are available to do the same thing. Anyways: + schemes are available to do the same thing and give better + results. + Anyways:

The logic above allocUsingTheseRegs, in doGeneralAlloc and runRegAllocate, @@ -264,7 +266,7 @@

The entire reg-alloc mechanism described so far is general and correct, but expensive overkill for many simple code blocks. - So to begin with we use start off with + So to begin with we use doSimpleAlloc, which attempts to do something simple. It exploits the observation that if the total number of virtual registers does not exceed the number of real ones @@ -283,16 +285,429 @@

Complications, observations, and possible improvements

-

How to debug the NCG without losing your sanity

+

Real vs virtual registers in the instruction selectors

+ +The instruction selectors for expression trees, namely +getRegister, are complicated by the fact that some +expressions can only be computed into a specific register, whereas +the majority can be computed into any register. We take x86 as an +example, but the problem applies to all archs. +

+Terminology: rreg means real register, a real machine +register. vreg means one of an infinite set of virtual +registers. The type Reg is the sum of rreg and +vreg. The instruction selector generates sequences with +unconstrained use of vregs, leaving the register allocator to map them +all into rregs. +

+Now, where was I ? Oh yes. We return to the type of +getRegister, which despite its name, selects instructions +to compute the value of an expression tree. +

+   getRegister :: StixExpr -> NatM Register
+
+   data Register
+     = Fixed   PrimRep Reg InstrBlock
+     | Any     PrimRep (Reg -> InstrBlock)
+
+   type InstrBlock -- sequence of instructions
+
+At first this looks eminently reasonable (apart from the stupid +name). getRegister, and nobody else, knows whether or +not a given expression has to be computed into a fixed rreg or can be +computed into any rreg or vreg. In the first case, it returns +Fixed and indicates which rreg the result is in. In the +second case it defers committing to any specific target register by +returning a function from Reg to InstrBlock, +and the caller can specify the target reg as it sees fit. +

+Unfortunately, that forces getRegister's callers (usually +itself) to use a clumsy and confusing idiom in the common case where +they do not care what register the result winds up in. The reason is +that although a value might be computed into a fixed rreg, we are +forbidden (on pain of segmentation fault :) from subsequently +modifying the fixed reg. This and other rules are record in "Rules of +the game" inside MachCode.lhs. +

+Why can't fixed registers be modified post-hoc? Consider a simple +expression like Hp+1. Since the heap pointer +Hp is definitely in a fixed register, call it R, +getRegister on subterm Hp will simply return +Fixed with an empty sequence and R. But we can't just +emit an increment instruction for R, because that trashes +Hp; instead we first have to copy it into a fresh vreg +and increment that. +

+With all that in mind, consider now writing a getRegister +clause for terms of the form (1 + E). Contrived, yes, +but illustrates the matter. First we do +getRegister on E. Now we are forced to examine what +comes back. +

+   getRegister (OnePlus e)
+      = getRegister e           `thenNat`   \ e_result ->
+        case e_result of
+           Fixed e_code e_fixed 
+              -> returnNat (Any IntRep (\dst -> e_code ++ [MOV e_fixed dst, INC dst]))
+           Any e_any 
+              -> Any (\dst -> e_any dst ++ [INC dst])
+
+This seems unreasonably cumbersome, yet the instruction selector is +full of such idioms. A good example of the complexities induced by +this scheme is shown by trivialCode for x86 in +MachCode.lhs. This deals with general integer dyadic +operations on x86 and has numerous cases. It was difficult to get +right. +

+An alternative suggestion is to simplify the type of +getRegister to this: +

+   getRegister :: StixExpr -> NatM (InstrBloc, VReg)
+   type VReg = .... a vreg ...
+
+and then we could safely write +
+   getRegister (OnePlus e)
+      = getRegister e        `thenNat`  \ (e_code, e_vreg) ->
+        returnNat (e_code ++ [INC e_vreg], e_vreg)
+
+which is about as straightforward as you could hope for. +Unfortunately, it requires getRegister to insert moves of +values which naturally compute into an rreg, into a vreg. Consider: +
+   1 + ccall some-C-fn
+
+On x86 the ccall result is returned in rreg %eax. The +resulting sequence, prior to register allocation, would be: +
+   # push args
+   call some-C-fn
+   # move %esp to nuke args
+   movl   %eax, %vreg
+   incl   %vreg
+
+If, as is likely, %eax is not held live beyond this point +for any other purpose, the move into a fresh register is pointless; +we'd have been better off leaving the value in %eax as +long as possible. +

+The simplified getRegister story is attractive. It would +clean up the instruction selectors significantly and make it simpler +to write new ones. The only drawback is that it generates redundant +register moves. I suggest that eliminating these should be the job +of the register allocator. Indeed: +

+ + +

Selecting insns for 64-bit values/loads/stores on 32-bit platforms

+ +Note that this stuff doesn't apply on 64-bit archs, since the +getRegister mechanism applies there. + +The relevant functions are: +
+   assignMem_I64Code :: StixExpr -> StixExpr -> NatM InstrBlock
+   assignReg_I64Code :: StixReg  -> StixExpr -> NatM InstrBlock
+   iselExpr64        :: StixExpr -> NatM ChildCode64
+
+   data ChildCode64     -- a.k.a "Register64"
+      = ChildCode64 
+           InstrBlock   -- code
+           VRegUnique   -- unique for the lower 32-bit temporary
+
+iselExpr64 is the 64-bit, plausibly-named analogue of +getRegister, and ChildCode64 is the analogue +of Register. The aim here was to generate working 64 +bit code as simply as possible. To this end, I used the +simplified getRegister scheme described above, in which +iselExpr64generates its results into two vregs which +can always safely be modified afterwards. + +Virtual registers are, unsurprisingly, distinguished by their +Uniques. There is a small difficulty in how to +know what the vreg for the upper 32 bits of a value is, given the vreg +for the lower 32 bits. The simple solution adopted is to say that +any low-32 vreg may also have a hi-32 counterpart which shares the +same unique, but is otherwise regarded as a separate entity. +getHiVRegFromLo gets one from the other. +
+   data VRegUnique
+      = VRegUniqueLo Unique          -- lower part of a split quantity
+      | VRegUniqueHi Unique          -- upper part thereof
+
+Apart from that, 64-bit code generation is really simple. The sparc +and x86 versions are almost copy-n-pastes of each other, with minor +adjustments for endianness. The generated code isn't wonderful but +is certainly acceptable, and it works. + + + +

Shortcomings and inefficiencies in the register allocator

+ +

Redundant reconstruction of the control flow graph

+ +The allocator goes to considerable computational expense to construct +all the flow edges in the group of instructions it's allocating for, +by using the insnFuture function in the +Instr pseudo-abstract type. +

+This is really silly, because all that information is present at the +abstract C stage, but is thrown away in the translation to Stix. +So a good thing to do is to modify that translation to +produce a directed graph of Stix straight-line code blocks, +and to preserve that structure through the insn selector, so the +allocator can see it. +

+This would eliminate the fragile, hacky, arch-specific +insnFuture mechanism, and probably make the whole +compiler run measurably faster. Register allocation is a fair chunk +of the time of non-optimising compilation (10% or more), and +reconstructing the flow graph is an expensive part of reg-alloc. +It would probably accelerate the vreg liveness computation too. + +

Really ridiculous method for doing spilling

+ +This is a more ambitious suggestion, but ... reg-alloc should be +reimplemented, using the scheme described in "Quality and speed in +linear-scan register allocation." (Traub?) For straight-line code +blocks, this gives an elegant one-pass algorithm for assigning +registers and creating the minimal necessary spill code, without the +need for reserving spill registers ahead of time. +

+I tried it in Rigr, replacing the previous spiller which used the +current GHC scheme described above, and it cut the number of spill +loads and stores by a factor of eight. Not to mention being simpler, +easier to understand and very fast. +

+The Traub paper also describes how to extend their method to multiple +basic blocks, which will be needed for GHC. It comes down to +reconciling multiple vreg-to-rreg mappings at points where control +flow merges. + +

Redundant-move support for revised instruction selector suggestion

+ +As mentioned above, simplifying the instruction selector will require +the register allocator to try and allocate source and destination +vregs to the same rreg in reg-reg moves, so as to make as many as +possible go away. Without that, the revised insn selector would +generate worse code than at present. I know this stuff has been done +but know nothing about it. The Linear-scan reg-alloc paper mentioned +above does indeed mention a bit about it in the context of single +basic blocks, but I don't know if that's sufficient. + + -

How to debug the NCG without losing your sanity

+

x86 arcana that you should know about

+ +The main difficulty with x86 is that many instructions have fixed +register constraints, which can occasionally make reg-alloc fail +completely. And the FPU doesn't have the flat register model which +the reg-alloc abstraction (implicitly) assumes. +

+Our strategy is: do a good job for the common small subset, that is +integer loads, stores, address calculations, basic ALU ops (+, -, +and, or, xor), and jumps. That covers the vast majority of +executed insns. And indeed we do do a good job, with a loss of +less than 2% compared with gcc. +

+Initially we tried to handle integer instructions with awkward +register constraints (mul, div, shifts by non-constant amounts) via +various jigglings of the spiller et al. This never worked robustly, +and putting platform-specific tweaks in the generic infrastructure is +a big No-No. (Not quite true; shifts by a non-constant amount are +still done by a giant kludge, and should be moved into this new +framework.) +

+Fortunately, all such insns are rare. So the current scheme is to +pretend that they don't have any such constraints. This fiction is +carried all the way through the register allocator. When the insn +finally comes to be printed, we emit a sequence which copies the +operands through memory (%esp-relative), satisfying the +constraints of the real instruction. This localises the gruesomeness +to just one place. Here, for example, is the code generated for +integer divison of %esi by %ecx: +

+   # BEGIN IQUOT %ecx, %esi
+   pushl $0
+   pushl %eax  
+   pushl %edx
+   pushl %ecx
+   movl  %esi,% eax
+   cltd
+   idivl 0(%esp)
+   movl %eax, 12(%esp)
+   popl %edx  
+   popl %edx
+   popl %eax
+   popl %esi
+   # END   IQUOT %ecx, %esi
+
+This is not quite as appalling as it seems, if you consider that the +division itself typically takes 16+ cycles, whereas the rest of the +insns probably go through in about 1 cycle each. +

+This trick is taken to extremes for FP operations. +

+All notions of the x86 FP stack and its insns have been removed. +Instead, we pretend, to the instruction selector and register +allocator, that x86 has six floating point registers, +%fake0 .. %fake5, which can be used in the +usual flat manner. We further claim that x86 has floating point +instructions very similar to SPARC and Alpha, that is, a simple +3-operand register-register arrangement. Code generation and register +allocation proceed on this basis. +

+When we come to print out the final assembly, our convenient fiction +is converted to dismal reality. Each fake instruction is +independently converted to a series of real x86 instructions. +%fake0 .. %fake5 are mapped to +%st(0) .. %st(5). To do reg-reg arithmetic +operations, the two operands are pushed onto the top of the FP stack, +the operation done, and the result copied back into the relevant +register. When one of the operands is also the destination, we emit a +slightly less scummy translation. There are only six +%fake registers because 2 are needed for the translation, +and x86 has 8 in total. +

+The translation is inefficient but is simple and it works. A cleverer +translation would handle a sequence of insns, simulating the FP stack +contents, would not impose a fixed mapping from %fake to +%st regs, and hopefully could avoid most of the redundant +reg-reg moves of the current translation. +

+There are, however, two unforeseen bad side effects: +

+ + + +

Generating code for ccalls

+ +For reasons I don't really understand, the instruction selectors for +generating calls to C (genCCall) have proven surprisingly +difficult to get right, and soaked up a lot of debugging time. As a +result, I have once again opted for schemes which are simple and not +too difficult to argue as correct, even if they don't generate +excellent code. +

+The sparc ccall generator in particular forces all arguments into +temporary virtual registers before moving them to the final +out-registers (%o0 .. %o5). This creates +some unnecessary reg-reg moves. The reason is explained in a +comment in the code. + + + +

How to debug the NCG without losing your sanity/hair/cool

+ +Last, but definitely not least ... +

+The usual syndrome is that some program, when compiled via C, works, +but not when compiled via the NCG. Usually the problem is fairly +simple to fix, once you find the specific code block which has been +mistranslated. But the latter can be nearly impossible, since most +modules generate at least hundreds and often thousands of them. +

+My solution: cheat. +

+Because the via-C and native routes diverge only late in the day, +it is not difficult to construct a 1-1 correspondence between basic +blocks on the two routes. So, if the program works via C but not on +the NCG, do the following: +

+

+diff_gcc_nat was known to work correctly last time I used +it, in December 01, for both x86 and sparc. If it doesn't work, due +to changes in assembly syntax, or whatever, make it work. The +investment is well worth it. Searching for the incorrect block(s) any +other way is a total time waster. + + + +

-Last modified: Fri Aug 10 11:47:41 EST 2001 +Last modified: Fri Feb 1 16:14:11 GMT 2002