docs/comm/the-beast/ghci.html

   1 <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
   2 <html>
   3   <head>
   4     <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
   5     <title>The GHC Commentary - GHCi</title>
   6   </head>
   7
   8   <body BGCOLOR="FFFFFF">
   9     <h1>The GHC Commentary - GHCi</h1>
  10
  11     This isn't a coherent description of how GHCi works, sorry.  What
  12     it is (currently) is a dumping ground for various bits of info
  13     pertaining to GHCi, which ought to be recorded somewhere.
  14
  15     <h2>Debugging the interpreter</h2>
  16
  17     The usual symptom is that some expression / program crashes when
  18     running on the interpreter (commonly), or gets wierd results
  19     (rarely).  Unfortunately, finding out what the problem really is
  20     has proven to be extremely difficult.  In retrospect it may be
  21     argued a design flaw that GHC's implementation of the STG
  22     execution mechanism provides only the weakest of support for
  23     automated internal consistency checks.  This makes it hard to
  24     debug.
  25     <p>
  26     Execution failures in the interactive system can be due to
  27     problems with the bytecode interpreter, problems with the bytecode
  28     generator, or problems elsewhere.  From the bugs seen so far,
  29     the bytecode generator is often the culprit, with the interpreter
  30     usually being correct.
  31     <p>
  32     Here are some tips for tracking down interactive nonsense:
  33     <ul>
  34       <li>Find the smallest source fragment which causes the problem.
  35       <p>
  36       <li>Using an RTS compiled with <code>-DDEBUG</code> (nb, that
  37           means the RTS from the previous stage!), run with <code>+RTS
  38           -D2</code> to get a listing in great detail from the
  39           interpreter.  Note that the listing is so voluminous that
  40           this is impractical unless you have been diligent in
  41           the previous step.
  42       <p>
  43       <li>At least in principle, using the trace and a bit of GDB
  44           poking around at the time of death, you can figure out what
  45           the problem is.  In practice you quickly get depressed at
  46           the hopelessness of ever making sense of the mass of
  47           details.  Well, I do, anyway.
  48       <p>
  49       <li><code>+RTS -D2</code> tries hard to print useful
  50           descriptions of what's on the stack, and often succeeds.
  51           However, it has no way to map addresses to names in
  52           code/data loaded by our runtime linker.  So the C function
  53           <code>ghci_enquire</code> is provided.  Given an address, it
  54           searches the loaded symbol tables for symbols close to that
  55           address.  You can run it from inside GDB:
  56       <pre>
  57       (gdb) p ghci_enquire ( 0x50a406f0 )
  58       0x50a406f0 + -48  ==  `PrelBase_Czh_con_info'
  59       0x50a406f0 + -12  ==  `PrelBase_Izh_static_info'
  60       0x50a406f0 + -48  ==  `PrelBase_Czh_con_entry'
  61       0x50a406f0 + -24  ==  `PrelBase_Izh_con_info'
  62       0x50a406f0 +  16  ==  `PrelBase_ZC_con_entry'
  63       0x50a406f0 +   0  ==  `PrelBase_ZMZN_static_entry'
  64       0x50a406f0 + -36  ==  `PrelBase_Czh_static_entry'
  65       0x50a406f0 + -24  ==  `PrelBase_Izh_con_entry'
  66       0x50a406f0 +  64  ==  `PrelBase_EQ_static_info'
  67       0x50a406f0 +   0  ==  `PrelBase_ZMZN_static_info'
  68       0x50a406f0 +  48  ==  `PrelBase_LT_static_entry'
  69       $1 = void
  70       </pre>
  71          In this case the enquired-about address is
  72          <code>PrelBase_ZMZN_static_entry</code>.  If no symbols are
  73          close to the given addr, nothing is printed.  Not a great
  74          mechanism, but better than nothing.
  75       <p>
  76       <li>We have had various problems in the past due to the bytecode
  77           generator (<code>compiler/ghci/ByteCodeGen.lhs</code>) being
  78           confused about the true set of free variables of an
  79           expression.  The compilation scheme for <code>let</code>s
  80           applies the BCO for the RHS of the let to its free
  81           variables, so if the free-var annotation is wrong or
  82           misleading, you end up with code which has wrong stack
  83           offsets, which is usually fatal.
  84       <p>
  85       <li>The baseline behaviour of the interpreter is to interpret
  86           BCOs, and hand all other closures back to the scheduler for
  87           evaluation.  However, this causes a huge number of expensive
  88           context switches, so the interpreter knows how to enter the
  89           most common non-BCO closure types by itself.
  90           <p>
  91           These optimisations complicate the interpreter.
  92           If you think you have an interpreter problem, re-enable the
  93           define <code>REFERENCE_INTERPRETER</code> in
  94           <code>ghc/rts/Interpreter.c</code>.  All optimisations are
  95           thereby disabled, giving the baseline
  96           I-only-know-how-to-enter-BCOs behaviour.
  97       <p>
  98       <li>Following the traces is often problematic because execution
  99           hops back and forth between the interpreter, which is
 100           traced, and compiled code, which you can't see.
 101           Particularly annoying is when the stack looks OK in the
 102           interpreter, then compiled code runs for a while, and later
 103           we arrive back in the interpreter, with the stack corrupted,
 104           and usually in a completely different place from where we
 105           left off.
 106           <p>
 107           If this is biting you baaaad, it may be worth copying
 108           sources for the compiled functions causing the problem, into
 109           your interpreted module, in the hope that you stay in the
 110           interpreter more of the time.  Of course this doesn't work
 111           very well if you've defined
 112           <code>REFERENCE_INTERPRETER</code> in
 113           <code>ghc/rts/Interpreter.c</code>.
 114       <p>
 115       <li>There are various commented-out pieces of code in
 116           <code>Interpreter.c</code> which can be used to get the
 117           stack sanity-checked after every entry, and even after after
 118           every bytecode instruction executed.  Note that some
 119           bytecodes (<code>PUSH_UBX</code>) leave the stack in
 120           an unwalkable state, so the <code>do_print_stack</code>
 121           local variable is used to suppress the stack walk after
 122           them.
 123     </ul>
 124
 125
 126     <h2>Useful stuff to know about the interpreter</h2>
 127
 128     The code generation scheme is straightforward (naive, in fact).
 129     <code>-ddump-bcos</code> prints each BCO along with the Core it
 130     was generated from, which is very handy.
 131     <ul>
 132     <li>Simple lets are compiled in-line.  For the general case, let
 133         v = E in ..., E is compiled into a new BCO which takes as
 134         args its free variables, and v is bound to AP(the new BCO,
 135         free vars of E).
 136     <p>
 137     <li><code>case</code>s as usual, become: push the return
 138         continuation, enter the scrutinee.  There is some magic to
 139         make all combinations of compiled/interpreted calls and
 140         returns work, described below.  In the interpreted case, all
 141         case alts are compiled into a single big return BCO, which
 142         commences with instructions implementing a switch tree.
 143     </ul>
 144     <p>
 145     <b>ARGCHECK magic</b>
 146     <p>
 147     You may find ARGCHECK instructions at the start of BCOs which
 148     don't appear to need them; case continuations in particular.
 149     These play an important role: they force objects which should
 150     evaluated to BCOs to actually be BCOs.
 151     <p>
 152     Typically, there may be an application node somewhere in the heap.
 153     This is a thunk which when leant on turns into a BCO for a return
 154     continuation.  The thunk may get entered with an update frame on
 155     top of the stack.  This is legitimate since from one viewpoint
 156     this is an AP which simply reduces to a data object, so does not
 157     have functional type.  However, once the AP turns itself into a
 158     BCO (so to speak) we cannot simply enter the BCO, because that
 159     expects to see args on top of the stack, not an update frame.
 160     Therefore any BCO which expects something on the stack above an
 161     update frame, even non-function BCOs, start with an ARGCHECK.  In
 162     this case it fails, the update is done, the update frame is
 163     removed, and the BCO re-entered.  Subsequent entries of the BCO of
 164     course go unhindered.
 165     <p>
 166     The optimised (<code>#undef REFERENCE_INTERPRETER</code>) handles
 167     this case specially, so that a trip through the scheduler is
 168     avoided.  When reading traces from <code>+RTS -D2 -RTS</code>, you
 169     may see BCOs which appear to execute their initial ARGCHECK insn
 170     twice.  The first time it fails; the interpreter does the update
 171     immediately and re-enters with no further comment.
 172     <p>
 173     This is all a bit ugly, and, as SimonM correctly points out, it
 174     would have been cleaner to make BCOs unpointed (unthunkable)
 175     objects, so that a pointer to something <code>:: BCO#</code>
 176     really points directly at a BCO.
 177     <p>
 178     <b>Stack management</b>
 179     <p>
 180     There isn't any attempt to stub the stack, minimise its growth, or
 181     generally remove unused pointers ahead of time.  This is really
 182     due to lazyness on my part, although it does have the minor
 183     advantage that doing something cleverer would almost certainly
 184     increase the number of bytecodes that would have to be executed.
 185     Of course we SLIDE out redundant stuff, to get the stack back to
 186     the sequel depth, before returning a HNF, but that's all.  As
 187     usual this is probably a cause of major space leaks.
 188     <p>
 189     <b>Building constructors</b>
 190     <p>
 191     Constructors are built on the stack and then dumped into the heap
 192     with a single PACK instruction, which simply copies the top N
 193     words of the stack verbatim into the heap, adds an info table, and zaps N
 194     words from the stack.  The constructor args are pushed onto the
 195     stack one at a time.  One upshot of this is that unboxed values
 196     get pushed untaggedly onto the stack (via PUSH_UBX), because that's how they
 197     will be in the heap.  That in turn means that the stack is not
 198     always walkable at arbitrary points in BCO execution, although
 199     naturally it is whenever GC might occur.
 200     <p>
 201     Function closures created by the interpreter use the AP-node
 202     (tagged) format, so although their fields are similarly
 203     constructed on the stack, there is never a stack walkability
 204     problem.
 205     <p>
 206     <b>Unpacking constructors</b>
 207     <p>
 208     At the start of a case continuation, the returned constructor is
 209     unpacked onto the stack, which means that unboxed fields have to
 210     be tagged.  Rather than burdening all such continuations with a
 211     complex, general mechanism, I split it into two.  The
 212     allegedly-common all-pointers case uses a single UNPACK insn
 213     to fish out all fields with no further ado.  The slow case uses a
 214     sequence of more complex UPK_TAG insns, one for each field (I
 215     think).  This seemed like a good compromise to me.
 216     <p>
 217     <b>Perspective</b>
 218     <p>
 219     I designed the bytecode mechanism with the experience of both STG
 220     hugs and Classic Hugs in mind.  The latter has an small
 221     set of bytecodes, a small interpreter loop, and runs amazingly
 222     fast considering the cruddy code it has to interpret.  The former
 223     had a large interpretative loop with many different opcodes,
 224     including multiple minor variants of the same thing, which
 225     made it difficult to optimise and maintain, yet it performed more
 226     or less comparably with Classic Hugs.
 227     <p>
 228     My design aims were therefore to minimise the interpreter's
 229     complexity whilst maximising performance.  This means reducing the
 230     number of opcodes implemented, whilst reducing the number of insns
 231     despatched.  In particular there are only two opcodes, PUSH_UBX
 232     and UPK_TAG, which deal with tags.  STG Hugs had dozens of opcodes
 233     for dealing with tagged data.  In cases where the common
 234     all-pointers case is significantly simpler (UNPACK) I deal with it
 235     specially.  Finally, the number of insns executed is reduced a
 236     little by merging multiple pushes, giving PUSH_LL and PUSH_LLL.
 237     These opcode pairings were determined by using the opcode-pair
 238     frequency profiling stuff which is ifdef-d out in
 239     <code>Interpreter.c</code>.  These significantly improve
 240     performance without having much effect on the uglyness or
 241     complexity of the interpreter.
 242     <p>
 243     Overall, the interpreter design is something which turned out
 244     well, and I was pleased with it.  Unfortunately I cannot say the
 245     same of the bytecode generator.
 246
 247     <h2><code>case</code> returns between interpreted and compiled code</h2>
 248
 249     Variants of the following scheme have been drifting around in GHC
 250     RTS documentation for several years.  Since what follows is
 251     actually what is implemented, I guess it supersedes all other
 252     documentation.  Beware; the following may make your brain melt.
 253     In all the pictures below, the stack grows downwards.
 254     <p>
 255     <b>Returning to interpreted code</b>.
 256     <p>
 257     Interpreted returns employ a set of polymorphic return infotables.
 258     Each element in the set corresponds to one of the possible return
 259     registers (R1, D1, F1) that compiled code will place the returned
 260     value in.  In fact this is a bit misleading, since R1 can be used
 261     to return either a pointer or an int, and we need to distinguish
 262     these cases.  So, supposing the set of return registers is {R1p,
 263     R1n, D1, F1}, there would be four corresponding infotables,
 264     <code>stg_ctoi_ret_R1p_info</code>, etc.  In the pictures below we
 265     call them <code>stg_ctoi_ret_REP_info</code>.
 266     <p>
 267     These return itbls are polymorphic, meaning that all 8 vectored
 268     return codes and the direct return code are identical.
 269     <p>
 270     Before the scrutinee is entered, the stack is arranged like this:
 271     <pre>
 272    |        |
 273    +--------+
 274    |  BCO   | -------> the return contination BCO
 275    +--------+
 276    | itbl * | -------> stg_ctoi_ret_REP_info, with all 9 codes as follows:
 277    +--------+
 278                           BCO* bco = Sp[1];
 279                           push R1/F1/D1 depending on REP
 280                           push bco
 281                           yield to sched
 282     </pre>
 283     On entry, the interpreted contination BCO expects the stack to look
 284     like this:
 285     <pre>
 286    |        |
 287    +--------+
 288    |  BCO   | -------> the return contination BCO
 289    +--------+
 290    | itbl * | -------> ret_REP_ctoi_info, with all 9 codes as follows:
 291    +--------+
 292    : VALUE  :  (the returned value, shown with : since it may occupy
 293    +--------+   multiple stack words)
 294     </pre>
 295     A machine code return will park the returned value in R1/F1/D1,
 296     and enter the itbl on the top of the stack.  Since it's our magic
 297     itbl, this pushes the returned value onto the stack, which is
 298     where the interpreter expects to find it.  It then pushes the BCO
 299     (again) and yields.  The scheduler removes the BCO from the top,
 300     and enters it, so that the continuation is interpreted with the
 301     stack as shown above.
 302     <p>
 303     An interpreted return will create the value to return at the top
 304     of the stack.  It then examines the return itbl, which must be
 305     immediately underneath the return value, to see if it is one of
 306     the magic <code>stg_ctoi_ret_REP_info</code> set.  Since this is so,
 307     it knows it is returning to an interpreted contination.  It
 308     therefore simply enters the BCO which it assumes it immediately
 309     underneath the itbl on the stack.
 310
 311     <p>
 312     <b>Returning to compiled code</b>.
 313     <p>
 314     Before the scrutinee is entered, the stack is arranged like this:
 315     <pre>
 316                         ptr to vec code 8 ------> return vector code 8
 317    |        |           ....
 318    +--------+           ptr to vec code 1 ------> return vector code 1
 319    | itbl * | --        Itbl end
 320    +--------+   \       ....
 321                  \      Itbl start
 322                   ----> direct return code
 323     </pre>
 324     The scrutinee value is then entered.
 325     The case continuation(s) expect the stack to look the same, with
 326     the returned HNF in a suitable return register, R1, D1, F1 etc.
 327     <p>
 328     A machine code return knows whether it is doing a vectored or
 329     direct return, and, if the former, which vector element it is.
 330     So, for a direct return we jump to <code>Sp[0]</code>, and for a
 331     vectored return, jump to <code>((CodePtr*)(Sp[0]))[ - ITBL_LENGTH
 332     - vector number ]</code>.  This is (of course) the scheme that
 333     compiled code has been using all along.
 334     <p>
 335     An interpreted return will, as described just above, have examined
 336     the itbl immediately beneath the return value it has just pushed,
 337     and found it not to be one of the <code>ret_REP_ctoi_info</code> set,
 338     so it knows this must be a return to machine code.  It needs to
 339     pop the return value, currently on the stack, into R1/F1/D1, and
 340     jump through the info table.  Unfortunately the first part cannot
 341     be accomplished directly since we are not in Haskellised-C world.
 342     <p>
 343     We therefore employ a second family of magic infotables, indexed,
 344     like the first, on the return representation, and therefore with
 345     names of the form <code>stg_itoc_ret_REP_info</code>.  (Note:
 346     <code>itoc</code>; the previous bunch were <code>ctoi</code>).
 347     This is pushed onto the stack (note, tagged values have their tag
 348     zapped), giving:
 349     <pre>
 350    |        |
 351    +--------+
 352    | itbl * | -------> arbitrary machine code return itbl
 353    +--------+
 354    : VALUE  :  (the returned value, possibly multiple words)
 355    +--------+
 356    | itbl * | -------> stg_itoc_ret_REP_info, with code:
 357    +--------+
 358                           pop myself (stg_itoc_ret_REP_info) off the stack
 359                           pop return value into R1/D1/F1
 360                           do standard machine code return to itbl at t.o.s.
 361     </pre>
 362     We then return to the scheduler, asking it to enter the itbl at
 363     t.o.s.  When entered, <code>stg_itoc_ret_REP_info</code> removes
 364     itself from the stack, pops the return value into the relevant
 365     return register, and returns to the itbl to which we were trying
 366     to return in the first place.
 367     <p>
 368     Amazingly enough, this stuff all actually works!  Well, mostly ...
 369     <p>
 370     <b>Unboxed tuples: a Right Royal Spanner In The Works</b>
 371     <p>
 372     The above scheme depends crucially on having magic infotables
 373     <code>stg_{itoc,ctoi}_ret_REP_info</code> for each return
 374     representation <code>REP</code>.  It unfortunately fails miserably
 375     in the face of unboxed tuple returns, because the set of required
 376     tables would be infinite; this despite the fact that for any given
 377     unboxed tuple return type, the scheme could be made to work fine.
 378     <p>
 379     This is a serious problem, because it prevents interpreted
 380     code from doing <code>IO</code>-typed returns, since <code>IO
 381     t</code> is implemented as <code>(# t, RealWorld# #)</code> or
 382     thereabouts.  This restriction in turn rules out FFI stuff in the
 383     interpreter.  Not good.
 384     <p>
 385     Although we have no way to make general unboxed tuples work, we
 386     can at least make <code>IO</code>-types work using the following
 387     ultra-kludgey observation: <code>RealWorld#</code> doesn't really
 388     exist and so has zero size, in compiled code.  In turn this means
 389     that a type of the form <code>(# t, RealWorld# #)</code> has the
 390     same representation as plain <code>t</code> does.  So the bytecode
 391     generator, whilst rejecting code with general unboxed tuple
 392     returns, recognises and accepts this special case.  Which means
 393     that <code>IO</code>-typed stuff works in the interpreter.  Just.
 394     <p>
 395     If anyone asks, I will claim I was out of radio contact, on a
 396     6-month walking holiday to the south pole, at the time this was
 397     ... er ... dreamt up.
 398
 399
 400 <p><small>
 401
 402 <!-- hhmts start -->
 403 Last modified: Thursday February  7 15:33:49 GMT 2002
 404 <!-- hhmts end -->
 405     </small>
 406   </body>
 407 </html>