+<h3>Real vs virtual registers in the instruction selectors</h3>
+
+The instruction selectors for expression trees, namely
+<code>getRegister</code>, are complicated by the fact that some
+expressions can only be computed into a specific register, whereas
+the majority can be computed into any register. We take x86 as an
+example, but the problem applies to all archs.
+<p>
+Terminology: <em>rreg</em> means real register, a real machine
+register. <em>vreg</em> means one of an infinite set of virtual
+registers. The type <code>Reg</code> is the sum of <em>rreg</em> and
+<em>vreg</em>. The instruction selector generates sequences with
+unconstrained use of vregs, leaving the register allocator to map them
+all into rregs.
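+<p>
+As a rough picture of the terminology, here is a minimal sketch of
+<code>Reg</code> as a sum type. The constructor names, the use of
+<code>Int</code> to name registers, and the helper <code>assign</code>
+are all assumptions made for illustration; they are not the
+definitions used in the compiler.

```haskell
-- Hypothetical, simplified model; GHC's real Reg type differs in detail.
data Reg
  = RealReg Int      -- rreg: index of an actual machine register
  | VirtualReg Int   -- vreg: one of an unbounded supply
  deriving (Eq, Show)

-- Stand-in for the allocator's final step: rewrite a register through a
-- vreg-to-rreg mapping. Real registers pass through untouched; a vreg
-- with no entry in the mapping yields Nothing.
assign :: [(Int, Int)] -> Reg -> Maybe Reg
assign _ r@(RealReg _)  = Just r
assign m (VirtualReg v) = RealReg <$> lookup v m
```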
+<p>
+Now, where was I? Oh yes. We return to the type of
+<code>getRegister</code>, which despite its name, selects instructions
+to compute the value of an expression tree.
+<pre>
+ getRegister :: StixExpr -> NatM Register
+
+ data Register
+ = Fixed PrimRep Reg InstrBlock
+ | Any PrimRep (Reg -> InstrBlock)
+
+ type InstrBlock -- sequence of instructions
+</pre>
+At first this looks eminently reasonable (apart from the stupid
+name). <code>getRegister</code>, and nobody else, knows whether or
+not a given expression has to be computed into a fixed rreg or can be
+computed into any rreg or vreg. In the first case, it returns
+<code>Fixed</code> and indicates which rreg the result is in. In the
+second case it defers committing to any specific target register by
+returning a function from <code>Reg</code> to <code>InstrBlock</code>,
+and the caller can specify the target reg as it sees fit.
+<p>
+Unfortunately, that forces <code>getRegister</code>'s callers (usually
+itself) to use a clumsy and confusing idiom in the common case where
+they do not care what register the result winds up in. The reason is
+that although a value might be computed into a fixed rreg, we are
+forbidden (on pain of segmentation fault :) from subsequently
+modifying the fixed reg. This and other rules are recorded in "Rules of
+the game" inside <code>MachCode.lhs</code>.
+<p>
+Why can't fixed registers be modified post-hoc? Consider a simple
+expression like <code>Hp+1</code>. Since the heap pointer
+<code>Hp</code> is definitely in a fixed register, call it R,
+<code>getRegister</code> on subterm <code>Hp</code> will simply return
+<code>Fixed</code> with an empty sequence and R. But we can't just
+emit an increment instruction for R, because that trashes
+<code>Hp</code>; instead we first have to copy it into a fresh vreg
+and increment that.
+<p>
+With all that in mind, consider now writing a <code>getRegister</code>
+clause for terms of the form <code>(1 + E)</code>. Contrived, yes,
+but it illustrates the matter. First we do
+<code>getRegister</code> on E. Now we are forced to examine what
+comes back.
+<pre>
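+ getRegister (OnePlus e)
+    = getRegister e `thenNat` \ e_result ->
+      case e_result of
+         -- result sits in a fixed rreg: copy it out before modifying
+         Fixed _ e_fixed e_code
+            -> returnNat (Any IntRep (\dst -> e_code ++ [MOV e_fixed dst, INC dst]))
+         -- result can go anywhere: compute it straight into dst
+         Any _ e_any
+            -> returnNat (Any IntRep (\dst -> e_any dst ++ [INC dst]))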
+</pre>
+This seems unreasonably cumbersome, yet the instruction selector is
+full of such idioms. A good example of the complexities induced by
+this scheme is shown by <code>trivialCode</code> for x86 in
+<code>MachCode.lhs</code>. This deals with general integer dyadic
+operations on x86 and has numerous cases. It was difficult to get
+right.
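+<p>
+To make the idiom concrete, here is a self-contained toy version that
+can be loaded and run. It is only a sketch: <code>PrimRep</code> and
+the <code>NatM</code> monad are dropped, instruction blocks are plain
+lists, and the names <code>Expr</code>, <code>LOADIMM</code>,
+<code>codeInto</code> and the register name <code>"R"</code> are
+invented for the example.

```haskell
data Reg = RealReg String | VReg Int
  deriving (Eq, Show)

data Instr
  = MOV Reg Reg        -- MOV src dst
  | INC Reg
  | LOADIMM Int Reg    -- hypothetical "load constant" instruction
  deriving (Eq, Show)

type InstrBlock = [Instr]

-- Simplified Register: no PrimRep field, unlike the real type.
data Register
  = Fixed Reg InstrBlock        -- value ends up in this specific reg
  | Any (Reg -> InstrBlock)     -- caller chooses the destination

data Expr = Hp | Lit Int | OnePlus Expr

getRegister :: Expr -> Register
getRegister Hp      = Fixed (RealReg "R") []      -- Hp lives in fixed reg R
getRegister (Lit n) = Any (\dst -> [LOADIMM n dst])
getRegister (OnePlus e) =
  case getRegister e of
    -- Must not modify the fixed reg: copy to dst, then increment dst.
    Fixed fixed code -> Any (\dst -> code ++ [MOV fixed dst, INC dst])
    Any code         -> Any (\dst -> code dst ++ [INC dst])

-- What a caller does when it just wants the value in some register.
codeInto :: Reg -> Register -> InstrBlock
codeInto dst (Fixed r code) = code ++ [MOV r dst]
codeInto dst (Any code)     = code dst
```

+Every clause that consumes a subterm needs the same two-way case
+split, which is where the clutter comes from.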
+<p>
+An alternative suggestion is to simplify the type of
+<code>getRegister</code> to this:
+<pre>
+ getRegister :: StixExpr -> NatM (InstrBlock, VReg)
+ type VReg = .... a vreg ...
+</pre>
+and then we could safely write
+<pre>
+ getRegister (OnePlus e)
+ = getRegister e `thenNat` \ (e_code, e_vreg) ->
+ returnNat (e_code ++ [INC e_vreg], e_vreg)
+</pre>
+which is about as straightforward as you could hope for.
+Unfortunately, it requires <code>getRegister</code> to insert a move
+into a vreg for every value which naturally computes into an rreg.
+Consider:
+<pre>
+ 1 + ccall some-C-fn
+</pre>
+On x86 the ccall result is returned in rreg <code>%eax</code>. The
+resulting sequence, prior to register allocation, would be:
+<pre>
+ # push args
+ call some-C-fn
+ # move %esp to nuke args
+ movl %eax, %vreg
+ incl %vreg
+</pre>
+If, as is likely, <code>%eax</code> is not held live beyond this point
+for any other purpose, the move into a fresh register is pointless;
+we'd have been better off leaving the value in <code>%eax</code> as
+long as possible.
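+<p>
+The simplified scheme, including the extra move after a ccall, can be
+sketched as a small runnable model. Everything here is an assumption
+made for illustration: the <code>Expr</code> and <code>Instr</code>
+types, the hand-threaded <code>Supply</code> counter standing in for
+<code>NatM</code>, and the omission of argument pushing and stack
+cleanup around the call.

```haskell
data Reg = RealReg String | VReg Int
  deriving (Eq, Show)

data Instr
  = MOV Reg Reg        -- MOV src dst
  | INC Reg
  | LOADIMM Int Reg    -- hypothetical "load constant" instruction
  | CALL String        -- result arrives in the real register "eax"
  deriving (Eq, Show)

data Expr
  = Lit Int
  | CCall String
  | OnePlus Expr

-- Next unused vreg number, threaded by hand instead of using a monad.
type Supply = Int

getRegister :: Expr -> Supply -> ([Instr], Reg, Supply)
getRegister (Lit n) s =
  let v = VReg s in ([LOADIMM n v], v, s + 1)
getRegister (CCall f) s =
  -- The result naturally lands in an rreg, so it is copied to a vreg.
  let v = VReg s in ([CALL f, MOV (RealReg "eax") v], v, s + 1)
getRegister (OnePlus e) s =
  -- The result is always a vreg, so modifying it in place is safe.
  let (code, v, s') = getRegister e s
  in  (code ++ [INC v], v, s')
```

+Running <code>getRegister (OnePlus (CCall "f")) 0</code> in this model
+produces the call, the move out of <code>eax</code>, and the increment
+of the fresh vreg, matching the sequence shown above.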
+<p>
+The simplified <code>getRegister</code> story is attractive. It would
+clean up the instruction selectors significantly and make it simpler
+to write new ones. The only drawback is that it generates redundant
+register moves. I suggest that eliminating these should be the job
+of the register allocator. Indeed:
+<ul>
+<li>There has been some work on this already (George and Appel's
+ "Iterated register coalescing"), so this isn't a new idea.
+<p>
+<li>You could argue that the existing scheme inappropriately blurs the
+ boundary between the instruction selector and the register
+ allocator. The instruction selector should .. well .. just
+ select instructions, without having to futz around worrying about
+ what kind of registers subtrees get generated into. Register
+ allocation should be <em>entirely</em> the domain of the register
+ allocator, with the proviso that it should endeavour to allocate
+ registers so as to minimise the number of reg-reg moves that
+ survive into the final output.
+</ul>
+
+
+<h3>Selecting insns for 64-bit values/loads/stores on 32-bit platforms</h3>
+
+Note that this stuff doesn't apply on 64-bit archs, since 64-bit
+values fit in a single register there and the ordinary
+<code>getRegister</code> mechanism handles them.
+<p>
+The relevant functions are:
+<pre>
+ assignMem_I64Code :: StixExpr -> StixExpr -> NatM InstrBlock
+ assignReg_I64Code :: StixReg -> StixExpr -> NatM InstrBlock
+ iselExpr64 :: StixExpr -> NatM ChildCode64
+
+ data ChildCode64 -- a.k.a "Register64"
+ = ChildCode64
+ InstrBlock -- code
+ VRegUnique -- unique for the lower 32-bit temporary
+</pre>
+<code>iselExpr64</code> is the 64-bit, plausibly-named analogue of
+<code>getRegister</code>, and <code>ChildCode64</code> is the analogue
+of <code>Register</code>. The aim here was to generate working
+64-bit code as simply as possible. To this end, I used the
+simplified <code>getRegister</code> scheme described above, in which
+<code>iselExpr64</code> generates its results into two vregs which
+can always safely be modified afterwards.
+<p>
+Virtual registers are, unsurprisingly, distinguished by their
+<code>Unique</code>s. There is a small difficulty in how to
+know what the vreg for the upper 32 bits of a value is, given the vreg
+for the lower 32 bits. The simple solution adopted is to say that
+any low-32 vreg may also have a hi-32 counterpart which shares the
+same unique, but is otherwise regarded as a separate entity.
+<code>getHiVRegFromLo</code> gets one from the other.
+<pre>
+ data VRegUnique
+ = VRegUniqueLo Unique -- lower part of a split quantity
+ | VRegUniqueHi Unique -- upper part thereof
+</pre>
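+<p>
+A minimal sketch of how the lookup could work on the type above. Two
+things are assumed here: <code>Unique</code> is modelled as a plain
+<code>Int</code>, and the function works directly on
+<code>VRegUnique</code> rather than on whatever register type wraps
+it in the real code.

```haskell
-- Unique is modelled as a plain Int for this sketch only.
type Unique = Int

data VRegUnique
  = VRegUniqueLo Unique   -- lower part of a split quantity
  | VRegUniqueHi Unique   -- upper part thereof
  deriving (Eq, Show)

-- Same unique, other constructor. Asking for the high half of
-- something that is already a high half is treated as a caller
-- error in this sketch.
getHiVRegFromLo :: VRegUnique -> VRegUnique
getHiVRegFromLo (VRegUniqueLo u) = VRegUniqueHi u
getHiVRegFromLo (VRegUniqueHi _) =
  error "getHiVRegFromLo: argument is not a low-half vreg"
```

+Because the two halves differ in constructor, they compare as
+different registers even though they share a unique.
+<p>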
+Apart from that, 64-bit code generation is really simple. The sparc
+and x86 versions are almost copy-n-pastes of each other, with minor
+adjustments for endianness. The generated code isn't wonderful but
+is certainly acceptable, and it works.
+
+
+
+<h3>Shortcomings and inefficiencies in the register allocator</h3>
+
+<h4>Redundant reconstruction of the control flow graph</h4>
+
+The allocator goes to considerable computational expense to construct
+all the flow edges in the group of instructions it's allocating for,
+by using the <code>insnFuture</code> function in the
+<code>Instr</code> pseudo-abstract type.
+<p>
+This is really silly, because all that information is present at the
+abstract C stage, but is thrown away in the translation to Stix.
+So a good thing to do is to modify that translation to
+produce a directed graph of Stix straight-line code blocks,
+and to preserve that structure through the insn selector, so the
+allocator can see it.
+<p>
+This would eliminate the fragile, hacky, arch-specific
+<code>insnFuture</code> mechanism, and probably make the whole
+compiler run measurably faster. Register allocation is a fair chunk
+of the time of non-optimising compilation (10% or more), and
+reconstructing the flow graph is an expensive part of reg-alloc.
+It would probably accelerate the vreg liveness computation too.
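+<p>
+One possible shape for such a graph, as a sketch only. The names
+<code>Block</code>, <code>FlowGraph</code>, <code>successors</code> and
+<code>predecessors</code> are invented here, and a list of blocks is
+used purely to keep the example free of library dependencies.

```haskell
type BlockId = Int

-- A straight-line block that carries its outgoing flow edges explicitly,
-- so nothing has to be rediscovered from the instructions later.
data Block instr = Block
  { blockId    :: BlockId
  , blockBody  :: [instr]
  , blockSuccs :: [BlockId]
  }

type FlowGraph instr = [Block instr]

-- Successor edges are simply read off the stored field.
successors :: FlowGraph instr -> BlockId -> [BlockId]
successors g b = concat [ blockSuccs blk | blk <- g, blockId blk == b ]

-- Predecessor edges are found by scanning for blocks that point at b.
predecessors :: FlowGraph instr -> BlockId -> [BlockId]
predecessors g b = [ blockId blk | blk <- g, b `elem` blockSuccs blk ]
```

+The type is parameterised over the instruction type so that the same
+structure could hold Stix code before instruction selection and
+machine instructions after it.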
+
+<h4>Really ridiculous method for doing spilling</h4>
+
+This is a more ambitious suggestion, but ... reg-alloc should be
+reimplemented, using the scheme described in "Quality and speed in
+linear-scan register allocation" (Traub, Holloway and Smith). For straight-line code
+blocks, this gives an elegant one-pass algorithm for assigning
+registers and creating the minimal necessary spill code, without the
+need for reserving spill registers ahead of time.
+<p>
+I tried it in Rigr, replacing the previous spiller which used the
+current GHC scheme described above, and it cut the number of spill
+loads and stores by a factor of eight. Not to mention being simpler,
+easier to understand and very fast.
+<p>
+The Traub paper also describes how to extend their method to multiple
+basic blocks, which will be needed for GHC. It comes down to
+reconciling multiple vreg-to-rreg mappings at points where control
+flow merges.
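+<p>
+The reconciliation step can be illustrated with a deliberately naive
+sketch. It is not the algorithm from the paper: it only works out
+which reg-reg moves an edge needs, given the mapping at the end of a
+predecessor block and the mapping the merge block expects. Spilled
+vregs and the ordering of the resulting moves (which form a parallel
+copy and may contain cycles) are ignored. All names are invented for
+the example.

```haskell
type VReg    = Int
type RReg    = Int
type Mapping = [(VReg, RReg)]

-- For every vreg the merge block expects in some rreg, emit a
-- (source, destination) move if the predecessor left it elsewhere.
-- Vregs unknown to the predecessor are skipped.
reconcile :: Mapping -> Mapping -> [(RReg, RReg)]
reconcile fromM toM =
  [ (src, dst)
  | (v, dst) <- toM
  , Just src <- [lookup v fromM]
  , src /= dst
  ]
```

+If both mappings agree on every vreg, no moves are produced, which is
+the case the allocator would want to arrange as often as possible.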
+
+<h4>Redundant-move support for revised instruction selector suggestion</h4>
+
+As mentioned above, simplifying the instruction selector will require
+the register allocator to try and allocate source and destination
+vregs to the same rreg in reg-reg moves, so as to make as many as
+possible go away. Without that, the revised insn selector would
+generate worse code than at present. I know this has been done
+before, but know nothing about the details. The linear-scan
+reg-alloc paper cited above does say a little about it in the context
+of single basic blocks, but I don't know if that's sufficient.
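+<p>
+The payoff of a good assignment can be shown with a small sketch. It
+does not choose the assignment, which is the hard part; it only
+applies a given vreg-to-rreg mapping and deletes any move whose source
+and destination have become the same register. The types and the
+names <code>Assignment</code> and <code>allocateAndPrune</code> are
+assumptions for illustration.

```haskell
data Reg = RealReg String | VReg Int
  deriving (Eq, Show)

data Instr
  = CALL String
  | MOV Reg Reg   -- MOV src dst
  | INC Reg
  deriving (Eq, Show)

-- vreg number -> name of the rreg it was allocated to
type Assignment = [(Int, String)]

-- Rewrite one register; vregs missing from the assignment are left as is.
applyAssignment :: Assignment -> Reg -> Reg
applyAssignment m (VReg v) = maybe (VReg v) RealReg (lookup v m)
applyAssignment _ r        = r

mapRegs :: (Reg -> Reg) -> Instr -> Instr
mapRegs _ (CALL f)  = CALL f
mapRegs f (MOV s d) = MOV (f s) (f d)
mapRegs f (INC r)   = INC (f r)

isSelfMove :: Instr -> Bool
isSelfMove (MOV s d) = s == d
isSelfMove _         = False

-- Apply the assignment, then drop moves that now do nothing.
allocateAndPrune :: Assignment -> [Instr] -> [Instr]
allocateAndPrune m =
  filter (not . isSelfMove) . map (mapRegs (applyAssignment m))
```

+With the ccall example from earlier, assigning the fresh vreg to
+<code>eax</code> makes the move vanish, leaving just the call and an
+increment of <code>eax</code>.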
+
+