From df108ac9cc43c0b01265c637f96aaa6ae2f893f4 Mon Sep 17 00:00:00 2001 From: sewardj Date: Thu, 7 Feb 2002 15:31:20 +0000 Subject: [PATCH 1/1] [project @ 2002-02-07 15:31:20 by sewardj] Add many details about bytecode generation, the interpreter, and compiled/interpreted code interop. --- ghc/docs/comm/the-beast/ghci.html | 249 ++++++++++++++++++++++++++++++++++++- 1 file changed, 245 insertions(+), 4 deletions(-) diff --git a/ghc/docs/comm/the-beast/ghci.html b/ghc/docs/comm/the-beast/ghci.html index f05ceae..5ee2f21 100644 --- a/ghc/docs/comm/the-beast/ghci.html +++ b/ghc/docs/comm/the-beast/ghci.html @@ -20,7 +20,7 @@ has proven to be extremely difficult. In retrospect it may be argued a design flaw that GHC's implementation of the STG execution mechanism provides only the weakest of support for - automated internal consistency checks. This renders it hard to + automated internal consistency checks. This makes it hard to debug.

Execution failures in the interactive system can be due to @@ -123,12 +123,253 @@ -

Entering and returning between interpreted and compiled code

+

Useful stuff to know about the interpreter

+ The code generation scheme is straightforward (naive, in fact). + -ddump-bcos prints each BCO along with the Core it + was generated from, which is very handy. + +

+ ARGCHECK magic +

+ You may find ARGCHECK instructions at the start of BCOs which
+ don't appear to need them; case continuations in particular.
+ These play an important role: they force objects which should be
+ evaluated to BCOs to actually be BCOs.
+

+ Typically, there may be an application node somewhere in the heap.
+ This is a thunk which, when leant on, turns into a BCO for a return
+ continuation. The thunk may get entered with an update frame on
+ top of the stack. This is legitimate since, from one viewpoint,
+ this is an AP which simply reduces to a data object, so does not
+ have functional type. However, once the AP turns itself into a
+ BCO (so to speak) we cannot simply enter the BCO, because that
+ expects to see args on top of the stack, not an update frame.
+ Therefore any BCO which expects something on the stack above an
+ update frame, even a non-function BCO, starts with an ARGCHECK. In
+ this case it fails, the update is done, the update frame is
+ removed, and the BCO re-entered. Subsequent entries of the BCO of
+ course go unhindered.
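The ARGCHECK rule above can be modelled in miniature. This is an illustrative sketch only; the names (StackWord, argcheck, and so on) are invented here and are not the RTS's. The idea is simply: the check fails if an update frame (or the stack end) is hit before the expected words are found.

```c
#include <assert.h>

/* Toy model of the ARGCHECK rule described above.  In the real
   interpreter the stack holds closure pointers and frames; here we
   only record which kind of word each slot is.  sp[0] is the top. */

typedef enum { ARG_WORD, UPDATE_FRAME } FrameKind;

typedef struct {
    FrameKind kind;
} StackWord;

/* Returns 1 if the BCO may run (enough words above any update frame),
   0 if the ARGCHECK fails and the update must be performed first. */
int argcheck(const StackWord *sp, int depth, int words_needed) {
    for (int i = 0; i < words_needed; i++) {
        if (i >= depth || sp[i].kind == UPDATE_FRAME)
            return 0;   /* hit an update frame (or ran out) too early */
    }
    return 1;
}
```

A case-continuation BCO entered with an update frame on top is exactly the failing case: the check returns 0, the interpreter performs the update, pops the frame, and re-enters; the second entry then succeeds.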

+ The optimised interpreter (#undef REFERENCE_INTERPRETER)
+ handles this case specially, so that a trip through the scheduler
+ is avoided. When reading traces from +RTS -D2 -RTS, you
+ may see BCOs which appear to execute their initial ARGCHECK insn
+ twice. The first time it fails; the interpreter does the update
+ immediately and re-enters with no further comment.

+ This is all a bit ugly, and, as SimonM correctly points out, it + would have been cleaner to make BCOs unpointed (unthunkable) + objects, so that a pointer to something :: BCO# + really points directly at a BCO. +

+ Stack management +

+ There isn't any attempt to stub the stack, minimise its growth, or
+ generally remove unused pointers ahead of time. This is really
+ due to laziness on my part, although it does have the minor
+ advantage that doing something cleverer would almost certainly
+ increase the number of bytecodes that would have to be executed.
+ Of course we SLIDE out redundant stuff, to get the stack back to
+ the sequel depth, before returning a HNF, but that's all. As
+ usual this is probably a cause of major space leaks.
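The SLIDE operation mentioned above can be sketched as follows. This is a toy model, not the interpreter's code: the stack is an upward-growing array here (the real stack grows downwards), and the function names are invented.

```c
#include <assert.h>

/* Toy model of SLIDE: keep the top `keep` words and squeeze out the
   `drop` words underneath them, restoring the sequel depth before a
   HNF is returned.  The array index `depth` is one past the top. */
int slide(long *stack, int depth, int keep, int drop) {
    for (int i = 0; i < keep; i++)
        stack[depth - keep - drop + i] = stack[depth - keep + i];
    return depth - drop;   /* new (smaller) stack depth */
}
```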

+ Building constructors +

+ Constructors are built on the stack and then dumped into the heap
+ with a single PACK instruction, which simply copies the top N
+ words of the stack verbatim into the heap, adds an info table, and
+ zaps N words from the stack. The constructor args are pushed onto
+ the stack one at a time. One upshot of this is that unboxed values
+ get pushed untagged onto the stack (via PUSH_UBX), because that's
+ how they will be in the heap. That in turn means that the stack is
+ not always walkable at arbitrary points in BCO execution, although
+ naturally it is whenever GC might occur.
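A minimal sketch of what PACK does, under invented names and a simplified closure layout (again with an upward-growing array standing in for the stack):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of PACK: copy the top n stack words verbatim into a
   fresh heap closure, prepend an info pointer, zap the n words.
   The Closure layout here is illustrative, not the RTS's. */
typedef struct {
    const void *info;    /* constructor info table */
    long payload[];      /* fields, copied verbatim from the stack */
} Closure;

Closure *pack(const void *info, const long *stack, int *depth, int n) {
    Closure *c = malloc(sizeof(Closure) + n * sizeof(long));
    c->info = info;
    memcpy(c->payload, &stack[*depth - n], n * sizeof(long));
    *depth -= n;         /* zap the n words from the stack */
    return c;
}
```

The verbatim copy is why PUSH_UBX must push unboxed fields untagged: the stack words become heap words unchanged.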

+ Function closures created by the interpreter use the AP-node + (tagged) format, so although their fields are similarly + constructed on the stack, there is never a stack walkability + problem. +

+ Unpacking constructors +

+ At the start of a case continuation, the returned constructor is + unpacked onto the stack, which means that unboxed fields have to + be tagged. Rather than burdening all such continuations with a + complex, general mechanism, I split it into two. The + allegedly-common all-pointers case uses a single UNPACK insn + to fish out all fields with no further ado. The slow case uses a + sequence of more complex UPK_TAG insns, one for each field (I + think). This seemed like a good compromise to me. +
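The two unpacking paths can be contrasted with a small sketch (invented names, upward-growing toy stack): UNPACK copies all-pointer fields straight back onto the stack, while UPK_TAG additionally pushes a tag word for an unboxed field so the stack stays walkable.

```c
#include <assert.h>

/* Toy model of the fast all-pointers path: one insn, no tags. */
int unpack_all_ptrs(const long *payload, int n, long *stack, int depth) {
    for (int i = 0; i < n; i++)
        stack[depth++] = payload[i];
    return depth;   /* new depth */
}

/* Toy model of the slow path, one UPK_TAG per unboxed field:
   push the raw word, then its tag. */
int upk_tag(long field, long tag, long *stack, int depth) {
    stack[depth++] = field;
    stack[depth++] = tag;
    return depth;
}
```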

+ Perspective +

+ I designed the bytecode mechanism with the experience of both STG
+ hugs and Classic Hugs in mind. The latter has a small
+ set of bytecodes, a small interpreter loop, and runs amazingly
+ fast considering the cruddy code it has to interpret. The former
+ had a large interpretative loop with many different opcodes,
+ including multiple minor variants of the same thing, which
+ made it difficult to optimise and maintain, yet it performed more
+ or less comparably with Classic Hugs.

+ My design aims were therefore to minimise the interpreter's
+ complexity whilst maximising performance. This means reducing the
+ number of opcodes implemented, whilst reducing the number of insns
+ despatched. In particular there are only two opcodes, PUSH_UBX
+ and UPK_TAG, which deal with tags. STG Hugs had dozens of opcodes
+ for dealing with tagged data. In cases where the common
+ all-pointers case is significantly simpler (UNPACK) I deal with it
+ specially. Finally, the number of insns executed is reduced a
+ little by merging multiple pushes, giving PUSH_LL and PUSH_LLL.
+ These opcode pairings were determined by using the opcode-pair
+ frequency profiling stuff which is ifdef-d out in
+ Interpreter.c. These significantly improve
+ performance without having much effect on the ugliness or
+ complexity of the interpreter.
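The opcode-pair frequency idea is simple enough to sketch. The following is an illustrative reconstruction, not the ifdef-d code in Interpreter.c: count how often each (insn, next-insn) pair is despatched, and the hottest pairs are the candidates for merged opcodes such as PUSH_LL.

```c
#include <assert.h>

/* Toy opcode-pair profiler.  NUM_OPS and the opcode numbering are
   invented; in practice the counters would be bumped in the
   interpreter's dispatch loop. */
#define NUM_OPS 8

static long pair_freq[NUM_OPS][NUM_OPS];

void count_pairs(const int *insns, int n) {
    for (int i = 0; i + 1 < n; i++)
        pair_freq[insns[i]][insns[i + 1]]++;
}
```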

+ Overall, the interpreter design is something which turned out + well, and I was pleased with it. Unfortunately I cannot say the + same of the bytecode generator. + +

case returns between interpreted and compiled code

+ + Variants of the following scheme have been drifting around in GHC + RTS documentation for several years. Since what follows is + actually what is implemented, I guess it supersedes all other + documentation. Beware; the following may make your brain melt. + In all the pictures below, the stack grows downwards. +

+ Returning to interpreted code. +

+ Interpreted returns employ a set of polymorphic return infotables. + Each element in the set corresponds to one of the possible return + registers (R1, D1, F1) that compiled code will place the returned + value in. In fact this is a bit misleading, since R1 can be used + to return either a pointer or an int, and we need to distinguish + these cases. So, supposing the set of return registers is {R1p, + R1n, D1, F1}, there would be four corresponding infotables, + stg_ctoi_ret_R1p_info, etc. In the pictures below we + call them stg_ctoi_ret_REP_info. +

+ These return itbls are polymorphic, meaning that all 8 vectored + return codes and the direct return code are identical. +

+ Before the scrutinee is entered, the stack is arranged like this: +

+   |        |
+   +--------+
+   |  BCO   | -------> the return continuation BCO
+   +--------+
+   | itbl * | -------> stg_ctoi_ret_REP_info, with all 9 codes as follows:
+   +--------+
+                          BCO* bco = Sp[1];
+                          push R1/F1/D1 depending on REP
+                          push bco
+                          yield to sched
+    
+ On entry, the interpreted continuation BCO expects the stack to look
+ like this:
+   |        |
+   +--------+
+   |  BCO   | -------> the return continuation BCO
+   +--------+
+   | itbl * | -------> stg_ctoi_ret_REP_info, with all 9 codes as follows:
+   +--------+
+   : VALUE  :  (the returned value, shown with : since it may occupy
+   +--------+   multiple stack words)
+    
+ A machine code return will park the returned value in R1/F1/D1, + and enter the itbl on the top of the stack. Since it's our magic + itbl, this pushes the returned value onto the stack, which is + where the interpreter expects to find it. It then pushes the BCO + (again) and yields. The scheduler removes the BCO from the top, + and enters it, so that the continuation is interpreted with the + stack as shown above. +

+ An interpreted return will create the value to return at the top
+ of the stack. It then examines the return itbl, which must be
+ immediately underneath the return value, to see if it is one of
+ the magic stg_ctoi_ret_REP_info set. Since this is so,
+ it knows it is returning to an interpreted continuation. It
+ therefore simply enters the BCO which it assumes is immediately
+ underneath the itbl on the stack.
+

+ Returning to compiled code. +

+ Before the scrutinee is entered, the stack is arranged like this: +

+                        ptr to vec code 8 ------> return vector code 8
+   |        |           ....
+   +--------+           ptr to vec code 1 ------> return vector code 1
+   | itbl * | --        Itbl end
+   +--------+   \       ....   
+                 \      Itbl start
+                  ----> direct return code
+    
+ The scrutinee value is then entered. + The case continuation(s) expect the stack to look the same, with + the returned HNF in a suitable return register, R1, D1, F1 etc. +

+ A machine code return knows whether it is doing a vectored or + direct return, and, if the former, which vector element it is. + So, for a direct return we jump to Sp[0], and for a + vectored return, jump to ((CodePtr*)(Sp[0]))[ - ITBL_LENGTH + - vector number ]. This is (of course) the scheme that + compiled code has been using all along. +
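The negative indexing in that formula follows from the layout in the picture: the itbl pointer on the stack points at the direct-return code, which sits just after the itbl, while the return vector sits just before it. A toy model (invented names, dummy code addresses) shows the arithmetic:

```c
#include <assert.h>
#include <stddef.h>

#define ITBL_LENGTH 4   /* illustrative itbl size in words */

/* Vectored return: index back over the itbl into the return vector,
   exactly as in ((CodePtr*)(Sp[0]))[-ITBL_LENGTH - vector number].
   vec is the 1-based vector number. */
void *vec_return_addr(void *const *itbl_ptr, int vec) {
    return itbl_ptr[-ITBL_LENGTH - vec];
}

/* Toy memory layout matching the picture, low addresses first:
   [vec code 2][vec code 1][ itbl: 4 words ][direct return code] */
static char vec_code_1, vec_code_2, direct_code;
static void *layout[7] = {
    &vec_code_2, &vec_code_1, NULL, NULL, NULL, NULL, &direct_code
};
```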

+ An interpreted return will, as described just above, have examined
+ the itbl immediately beneath the return value it has just pushed,
+ and found it not to be one of the stg_ctoi_ret_REP_info set,
+ so it knows this must be a return to machine code. It needs to
+ pop the return value, currently on the stack, into R1/F1/D1, and
+ jump through the info table. Unfortunately the first part cannot
+ be accomplished directly since we are not in Haskellised-C world.

+ We therefore employ a second family of magic infotables, indexed, + like the first, on the return representation, and therefore with + names of the form stg_itoc_ret_REP_info. This is + pushed onto the stack (note, tagged values have their tag zapped), + giving: +

+   |        |
+   +--------+
+   | itbl * | -------> arbitrary machine code return itbl
+   +--------+
+   : VALUE  :  (the returned value, possibly multiple words)
+   +--------+
+   | itbl * | -------> stg_itoc_ret_REP_info, with code:
+   +--------+
+                          pop myself (stg_itoc_ret_REP_info) off the stack
+                          pop return value into R1/D1/F1
+                          do standard machine code return to itbl at t.o.s.
+    
+ We then return to the scheduler, asking it to enter the itbl at + t.o.s. When entered, stg_itoc_ret_REP_info removes + itself from the stack, pops the return value into the relevant + return register, and returns to the itbl to which we were trying + to return in the first place. +
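The three steps stg_itoc_ret_REP_info performs can be modelled in a few lines. As before this is a sketch with invented names and an upward-growing toy stack, with a single long standing in for R1/D1/F1:

```c
#include <assert.h>

/* Toy model of stg_itoc_ret_REP_info's job: pop itself off the
   stack, pop the return value into the return register, and hand
   back the itbl now at t.o.s. to return through. */
typedef struct { long R1; } Regs;   /* stand-in for R1/D1/F1 */

/* stack[*depth - 1] is the top word.  Returns the target itbl. */
void *itoc_ret(const long *stack, int *depth, Regs *regs) {
    (*depth)--;                        /* pop the itoc itbl itself */
    regs->R1 = stack[--*depth];        /* pop value into the register */
    return (void *)stack[*depth - 1];  /* original return itbl */
}
```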

+ Amazingly enough, this stuff all actually works! -

+

+ -Last modified: Fri Feb 1 16:14:11 GMT 2002 +Last modified: Thursday February 7 15:33:49 GMT 2002 -- 1.7.10.4