<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
<title>The GHC Commentary - GHCi</title>
<body BGCOLOR="FFFFFF">
<h1>The GHC Commentary - GHCi</h1>
This isn't a coherent description of how GHCi works, sorry. What
it is (currently) is a dumping ground for various bits of info
pertaining to GHCi, which ought to be recorded somewhere.
<h2>Debugging the interpreter</h2>
The usual symptom is that some expression / program crashes when
running on the interpreter (commonly), or gets weird results
(rarely). Unfortunately, finding out what the problem really is
has proven to be extremely difficult. In retrospect it may be
argued to be a design flaw that GHC's implementation of the STG
execution mechanism provides only the weakest of support for
automated internal consistency checks. This makes it hard to
Execution failures in the interactive system can be due to
problems with the bytecode interpreter, problems with the bytecode
generator, or problems elsewhere. From the bugs seen so far,
the bytecode generator is often the culprit, with the interpreter
usually being correct.
Here are some tips for tracking down interactive nonsense:
<li>Find the smallest source fragment which causes the problem.
<li>Using an RTS compiled with <code>-DDEBUG</code> (nb, that
means the RTS from the previous stage!), run with <code>+RTS
-D2</code> to get a listing in great detail from the
interpreter. Note that the listing is so voluminous that
this is impractical unless you have been diligent in
<li>At least in principle, using the trace and a bit of GDB
poking around at the time of death, you can figure out what
the problem is. In practice you quickly get depressed at
the hopelessness of ever making sense of the mass of
details. Well, I do, anyway.
<li><code>+RTS -D2</code> tries hard to print useful
descriptions of what's on the stack, and often succeeds.
However, it has no way to map addresses to names in
code/data loaded by our runtime linker. So the C function
<code>ghci_enquire</code> is provided. Given an address, it
searches the loaded symbol tables for symbols close to that
address. You can run it from inside GDB:
<pre>
(gdb) p ghci_enquire ( 0x50a406f0 )
0x50a406f0 + -48 == `PrelBase_Czh_con_info'
0x50a406f0 + -12 == `PrelBase_Izh_static_info'
0x50a406f0 + -48 == `PrelBase_Czh_con_entry'
0x50a406f0 + -24 == `PrelBase_Izh_con_info'
0x50a406f0 + 16 == `PrelBase_ZC_con_entry'
0x50a406f0 + 0 == `PrelBase_ZMZN_static_entry'
0x50a406f0 + -36 == `PrelBase_Czh_static_entry'
0x50a406f0 + -24 == `PrelBase_Izh_con_entry'
0x50a406f0 + 64 == `PrelBase_EQ_static_info'
0x50a406f0 + 0 == `PrelBase_ZMZN_static_info'
0x50a406f0 + 48 == `PrelBase_LT_static_entry'
</pre>
In this case the enquired-about address is
<code>PrelBase_ZMZN_static_entry</code>. If no symbols are
close to the given address, nothing is printed. Not a great
mechanism, but better than nothing.
<li>We have had various problems in the past due to the bytecode
generator (<code>compiler/ghci/ByteCodeGen.lhs</code>) being
confused about the true set of free variables of an
expression. The compilation scheme for <code>let</code>s
applies the BCO for the RHS of the let to its free
variables, so if the free-var annotation is wrong or
misleading, you end up with code which has wrong stack
offsets, which is usually fatal.
<li>The baseline behaviour of the interpreter is to interpret
BCOs, and hand all other closures back to the scheduler for
evaluation. However, this causes a huge number of expensive
context switches, so the interpreter knows how to enter the
most common non-BCO closure types by itself.
These optimisations complicate the interpreter.
If you think you have an interpreter problem, re-enable the
define <code>REFERENCE_INTERPRETER</code> in
<code>ghc/rts/Interpreter.c</code>. All optimisations are
thereby disabled, giving the baseline
I-only-know-how-to-enter-BCOs behaviour.
<li>Following the traces is often problematic because execution
hops back and forth between the interpreter, which is
traced, and compiled code, which you can't see.
Particularly annoying is when the stack looks OK in the
interpreter, then compiled code runs for a while, and later
we arrive back in the interpreter, with the stack corrupted,
and usually in a completely different place from where we
If this is biting you baaaad, it may be worth copying the
sources of the compiled functions causing the problem into
your interpreted module, in the hope that you stay in the
interpreter more of the time. Of course this doesn't work
very well if you've defined
<code>REFERENCE_INTERPRETER</code> in
<code>ghc/rts/Interpreter.c</code>.
<li>There are various commented-out pieces of code in
<code>Interpreter.c</code> which can be used to get the
stack sanity-checked after every entry, and even after
every bytecode instruction executed. Note that some
bytecodes (<code>PUSH_UBX</code>) leave the stack in
an unwalkable state, so the <code>do_print_stack</code>
local variable is used to suppress the stack walk after
<h2>Useful stuff to know about the interpreter</h2>
The code generation scheme is straightforward (naive, in fact).
<code>-ddump-bcos</code> prints each BCO along with the Core it
was generated from, which is very handy.
<li>Simple <code>let</code>s are compiled in-line. For the general
case, <code>let v = E in ...</code>, E is compiled into a new BCO
which takes its free variables as args, and v is bound to
AP(the new BCO, the free variables of E).
<li><code>case</code>s, as usual, become: push the return
continuation, enter the scrutinee. There is some magic to
make all combinations of compiled/interpreted calls and
returns work, described below. In the interpreted case, all
case alts are compiled into a single big return BCO, which
commences with instructions implementing a switch tree.
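<p>The general <code>let</code> case can be pictured with a small C model. This is a sketch under assumptions made for illustration, not the RTS's data layout: a BCO is reduced to a C function over an array of argument words, and <code>Bco</code>, <code>Ap</code>, <code>make_ap</code> and <code>enter_ap</code> are hypothetical names. Binding v builds an AP holding the BCO together with E's free variables; entering v applies the BCO to them.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: a BCO is a C function over an array of argument
   words.  This is not GHC's real data layout. */
typedef long Word;
typedef Word (*BcoCode)(const Word *args, size_t nargs);

typedef struct {
    BcoCode code;   /* compiled body of the let's right-hand side */
    size_t arity;   /* number of free variables it takes as args */
} Bco;

#define MAX_FVS 8

/* AP(bco, fv_1, ..., fv_n): the heap object that v is bound to. */
typedef struct {
    const Bco *bco;
    size_t nfvs;
    Word fvs[MAX_FVS];
} Ap;

/* let v = E in ...: capture E's free variables alongside its BCO. */
static Ap make_ap(const Bco *bco, const Word *fvs, size_t nfvs) {
    assert(nfvs == bco->arity && nfvs <= MAX_FVS);
    Ap ap = { bco, nfvs, { 0 } };
    for (size_t i = 0; i < nfvs; i++)
        ap.fvs[i] = fvs[i];
    return ap;
}

/* Entering v applies the BCO to the captured free variables. */
static Word enter_ap(const Ap *ap) {
    return ap->bco->code(ap->fvs, ap->nfvs);
}

/* Example right-hand side E = x + y, with free variables x and y. */
static Word rhs_x_plus_y(const Word *args, size_t nargs) {
    assert(nargs == 2);
    return args[0] + args[1];
}
```

<p>The arity assertion in <code>make_ap</code> stands in for the consistency checking the real system lacks: if the free-variable set handed to the AP disagreed with what the BCO expects, this model would stop there, whereas the interpreter would run on with wrong stack offsets.</p>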
<b>ARGCHECK magic</b>
You may find ARGCHECK instructions at the start of BCOs which
don't appear to need them; case continuations in particular.
These play an important role: they force objects which should
evaluate to BCOs to actually be BCOs.
Typically, there may be an application node somewhere in the heap.
This is a thunk which, when leant on, turns into a BCO for a return
continuation. The thunk may get entered with an update frame on
top of the stack. This is legitimate since from one viewpoint
this is an AP which simply reduces to a data object, so does not
have functional type. However, once the AP turns itself into a
BCO (so to speak) we cannot simply enter the BCO, because that
expects to see args on top of the stack, not an update frame.
Therefore any BCO which expects something on the stack above an
update frame, even a non-function BCO, starts with an ARGCHECK. In
this case it fails, the update is done, the update frame is
removed, and the BCO re-entered. Subsequent entries of the BCO of
course go unhindered.
The optimised interpreter (<code>#undef REFERENCE_INTERPRETER</code>) handles
this case specially, so that a trip through the scheduler is
avoided. When reading traces from <code>+RTS -D2 -RTS</code>, you
may see BCOs which appear to execute their initial ARGCHECK insn
twice. The first time it fails; the interpreter does the update
immediately and re-enters with no further comment.
This is all a bit ugly, and, as SimonM correctly points out, it
would have been cleaner to make BCOs unpointed (unthunkable)
objects, so that a pointer to something <code>:: BCO#</code>
really points directly at a BCO.
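<p>The ARGCHECK retry described above can be modelled in C as follows. This is a simplified sketch, assuming a stack of tagged <code>Slot</code>s (argument words or update frames) and a <code>Closure</code> whose <code>indirectee</code> field records the update; these names are illustrative and do not come from the RTS. Only the case discussed here, with the update frame directly on top, is handled: the assertion fires if some but not enough arguments are present, a case this model does not attempt.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model, not the RTS layout.  The stack grows downwards:
   slots[sp] is the top slot, slots[STACK_WORDS - 1] the oldest. */
#define STACK_WORDS 64

typedef struct Closure {
    const char *what;             /* "THUNK" or "BCO", for readability */
    struct Closure *indirectee;   /* set when a thunk is updated */
    int arity;                    /* BCOs: words expected above any update frame */
} Closure;

typedef enum { SLOT_ARG, SLOT_UPDATE_FRAME } SlotKind;

typedef struct {
    SlotKind kind;
    long value;                   /* payload of a SLOT_ARG */
    Closure *updatee;             /* thunk named by a SLOT_UPDATE_FRAME */
} Slot;

typedef struct {
    Slot slots[STACK_WORDS];
    int sp;                       /* index of the top slot; STACK_WORDS when empty */
} Stack;

static void push(Stack *s, Slot slot) {
    assert(s->sp > 0);
    s->slots[--s->sp] = slot;
}

/* Number of argument words between the top of stack and the nearest
   update frame (or the bottom of the stack). */
static int args_available(const Stack *s) {
    int n = 0;
    for (int i = s->sp; i < STACK_WORDS && s->slots[i].kind == SLOT_ARG; i++)
        n++;
    return n;
}

/* Enter a BCO whose first instruction is ARGCHECK.  Each failure does
   the pending update, removes the frame and re-enters the BCO.
   Returns the number of ARGCHECK failures. */
static int enter_bco(Stack *s, Closure *bco) {
    int failures = 0;
    while (args_available(s) < bco->arity) {
        /* only the "update frame directly on top" case is modelled */
        assert(s->sp < STACK_WORDS && s->slots[s->sp].kind == SLOT_UPDATE_FRAME);
        s->slots[s->sp].updatee->indirectee = bco;  /* do the update */
        s->sp++;                                     /* remove the frame */
        failures++;                                  /* then re-enter */
    }
    return failures;
}
```

<p>Entering a one-argument BCO with an argument beneath a single update frame fails once, updates the thunk to point at the BCO and pops the frame. Calling <code>enter_bco</code> again on the resulting stack returns 0, matching the observation that subsequent entries go unhindered.</p>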
<b>Stack management</b>
There isn't any attempt to stub the stack, minimise its growth, or
generally remove unused pointers ahead of time. This is really
due to laziness on my part, although it does have the minor
advantage that doing something cleverer would almost certainly
increase the number of bytecodes that would have to be executed.
Of course we SLIDE out redundant stuff, to get the stack back to
the sequel depth, before returning a HNF, but that's all. As
usual this is probably a cause of major space leaks.
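<p>For concreteness, here is a minimal sketch of the SLIDE step, assuming the instruction takes two operands: the number of words to keep on top of the stack and the number to discard from beneath them (the operand names <code>n</code> and <code>by</code> are illustrative). The stack grows downwards, matching the convention of the stack pictures in these notes, so <code>sp[0]</code> is the top word.</p>

```c
#include <assert.h>
#include <string.h>

typedef long Word;

/* SLIDE n by: keep the top n words and discard the `by` words beneath
   them by moving the kept words down.  sp[0] is the top of a
   downward-growing stack; returns the new stack pointer. */
static Word *slide(Word *sp, int n, int by) {
    assert(n >= 0 && by >= 0);
    /* source and destination may overlap, hence memmove */
    memmove(sp + by, sp, (size_t)n * sizeof(Word));
    return sp + by;
}
```

<p>With <code>sp[0..4] = {50, 40, 30, 20, 10}</code>, <code>slide(sp, 2, 2)</code> returns <code>sp + 2</code> and leaves <code>50, 40, 10</code> on the stack: the two live words survive and the two dead ones beneath them are gone.</p>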
<b>Building constructors</b>
Constructors are built on the stack and then dumped into the heap
with a single PACK instruction, which simply copies the top N
words of the stack verbatim into the heap, adds an info table, and zaps N
words from the stack. The constructor args are pushed onto the
stack one at a time. One upshot of this is that unboxed values
get pushed onto the stack untagged (via PUSH_UBX), because that's how they
will be laid out in the heap. That in turn means that the stack is not
always walkable at arbitrary points in BCO execution, although
naturally it is whenever GC might occur.
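<p>PACK amounts to little more than a <code>memcpy</code>. In the sketch below, which makes its own assumptions about layout, a heap object is an info pointer, a size and the payload words; the stack grows downwards; and fields are pushed last-first so that field 0 ends up on top and hence first in the payload. <code>demo_con_info</code> is a hypothetical stand-in for a constructor's info table.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned long Word;

/* Hypothetical heap object: an info pointer followed by n payload words. */
typedef struct {
    const void *info;
    size_t n;
    Word payload[];               /* C99 flexible array member */
} HeapObj;

/* Stand-in for a constructor's info table; only its address matters. */
static const int demo_con_info = 0;

/* PACK n: copy the top n stack words verbatim into a fresh heap object,
   attach the info table and drop the n words.  *spp is a
   downward-growing stack pointer, so (*spp)[0] is the top word. */
static HeapObj *pack(Word **spp, const void *info, size_t n) {
    HeapObj *obj = malloc(sizeof *obj + n * sizeof(Word));
    assert(obj != NULL);
    obj->info = info;
    obj->n = n;
    memcpy(obj->payload, *spp, n * sizeof(Word));  /* verbatim copy */
    *spp += n;                                     /* zap n words */
    return obj;
}
```

<p>Pushing 20 and then 10 and calling <code>pack(&amp;sp, &amp;demo_con_info, 2)</code> yields a payload of <code>{10, 20}</code> and restores <code>sp</code> to where it started. Because the copy is verbatim, any unboxed word pushed with PUSH_UBX lands in the heap exactly as it sat on the stack, untagged.</p>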
Function closures created by the interpreter use the AP-node
(tagged) format, so although their fields are similarly
constructed on the stack, there is never a stack walkability
problem.
<b>Unpacking constructors</b>
At the start of a case continuation, the returned constructor is
unpacked onto the stack, which means that unboxed fields have to
be tagged. Rather than burdening all such continuations with a
complex, general mechanism, I split it into two. The
allegedly-common all-pointers case uses a single UNPACK insn
to fish out all fields with no further ado. The slow case uses a
sequence of more complex UPK_TAG insns, one for each field (I
think). This seemed like a good compromise to me.
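<p>The split between the two unpacking paths can be sketched as follows. The tag convention here (a tag word holding the count of raw words it covers, pushed on top of them) is an assumption made for illustration, not the RTS's encoding, and so is the plain push used for pointer fields on the slow path. <code>unpack</code>, <code>upk_tag</code> and <code>unpack_con</code> are hypothetical names.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t Word;

/* Assumed tag convention for this sketch only: a tag word equal to the
   number of raw words it covers is pushed on top of those words. */

/* UNPACK: all-pointers case, one instruction for the whole object.
   Fields go on last-first, so afterwards sp[i] == payload[i]. */
static Word *unpack(Word *sp, const Word *payload, size_t n) {
    for (size_t i = n; i > 0; i--)
        *--sp = payload[i - 1];
    return sp;
}

/* UPK_TAG: push one non-pointer field, then its tag. */
static Word *upk_tag(Word *sp, const Word *payload, size_t field) {
    *--sp = payload[field];
    *--sp = 1;                    /* tag: covers the one raw word under it */
    return sp;
}

/* Unpack a returned constructor onto a downward-growing stack, taking
   the fast path when every field is a pointer and one instruction per
   field otherwise.  *insns receives the instruction count. */
static Word *unpack_con(Word *sp, const Word *payload, const int *is_ptr,
                        size_t n, int *insns) {
    int all_ptrs = 1;
    for (size_t i = 0; i < n; i++)
        if (!is_ptr[i])
            all_ptrs = 0;
    if (all_ptrs) {
        *insns = 1;
        return unpack(sp, payload, n);
    }
    *insns = 0;
    for (size_t i = n; i > 0; i--) {
        if (is_ptr[i - 1])
            *--sp = payload[i - 1];           /* assumed: plain pointer push */
        else
            sp = upk_tag(sp, payload, i - 1); /* tagged unboxed field */
        (*insns)++;
    }
    return sp;
}
```

<p>An all-pointers object costs one instruction; an object with one pointer and one unboxed field costs two, and leaves the pointer on top with the tag and raw word beneath it.</p>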
I designed the bytecode mechanism with the experience of both STG
Hugs and Classic Hugs in mind. The latter has a small
set of bytecodes, a small interpreter loop, and runs amazingly
fast considering the cruddy code it has to interpret. The former
had a large interpretative loop with many different opcodes,
including multiple minor variants of the same thing, which
made it difficult to optimise and maintain, yet it performed more
or less comparably with Classic Hugs.
My design aims were therefore to minimise the interpreter's
complexity whilst maximising performance. This means reducing the
number of opcodes implemented, whilst reducing the number of insns
despatched. In particular there are only two opcodes, PUSH_UBX
and UPK_TAG, which deal with tags. STG Hugs had dozens of opcodes
for dealing with tagged data. In cases where the common
all-pointers case is significantly simpler (UNPACK) I deal with it
specially. Finally, the number of insns executed is reduced a
little by merging multiple pushes, giving PUSH_LL and PUSH_LLL.
These opcode pairings were determined by using the opcode-pair
frequency profiling stuff which is ifdef-d out in
<code>Interpreter.c</code>. These significantly improve
performance without having much effect on the ugliness or
complexity of the interpreter.
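<p>The opcode-pair profiling mentioned above boils down to a two-dimensional counter updated in the dispatch loop. A minimal sketch, assuming an illustrative opcode enumeration (the names echo bytecodes, but the numbering is invented) and omitting the work each instruction does:</p>

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative opcode set; the numbering is invented for this sketch. */
enum Opcode { OP_PUSH_L, OP_PUSH_LL, OP_PUSH_LLL, OP_SLIDE, OP_ENTER, N_OPCODES };

/* pair_count[a][b]: how often opcode b was dispatched right after a. */
static unsigned long pair_count[N_OPCODES][N_OPCODES];

/* Skeleton of a dispatch loop with the profiling hook.  Only the
   counting is shown; executing each instruction is omitted. */
static void run(const enum Opcode *code, size_t len) {
    int prev = -1;                    /* nothing dispatched yet */
    for (size_t pc = 0; pc < len; pc++) {
        enum Opcode op = code[pc];
        assert(op < N_OPCODES);
        if (prev >= 0)
            pair_count[prev][op]++;
        prev = (int)op;
        /* switch (op) { ... execute the instruction ... } */
    }
}
```

<p>Running the sequence PUSH_L, PUSH_L, PUSH_L, ENTER records two PUSH_L/PUSH_L pairs and one PUSH_L/ENTER pair. Counts like the former are the kind of evidence that justifies fused opcodes such as PUSH_LL and PUSH_LLL.</p>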
Overall, the interpreter design is something which turned out
well, and I was pleased with it. Unfortunately I cannot say the
same of the bytecode generator.
<h2><code>case</code> returns between interpreted and compiled code</h2>
Variants of the following scheme have been drifting around in GHC
RTS documentation for several years. Since what follows is
actually what is implemented, I guess it supersedes all other
documentation. Beware; the following may make your brain melt.
In all the pictures below, the stack grows downwards.
<b>Returning to interpreted code</b>.
Interpreted returns employ a set of polymorphic return infotables.
Each element in the set corresponds to one of the possible return
registers (R1, D1, F1) that compiled code will place the returned
value in. In fact this is a bit misleading, since R1 can be used
to return either a pointer or an int, and we need to distinguish
these cases. So, supposing the set of return registers is {R1p,
R1n, D1, F1}, there would be four corresponding infotables,
<code>stg_ctoi_ret_R1p_info</code>, etc. In the pictures below we
call them <code>stg_ctoi_ret_REP_info</code>.
These return itbls are polymorphic, meaning that all 8 vectored
return codes and the direct return code are identical.
Before the scrutinee is entered, the stack is arranged like this:
<pre>
   |        |
   +--------+
   |  BCO   | -------> the return continuation BCO
   +--------+
   | itbl * | -------> stg_ctoi_ret_REP_info, with all 9 codes as follows:
   +--------+
                          push R1/F1/D1 depending on REP
                          push the BCO (again)
                          yield to the scheduler
</pre>
On entry, the interpreted continuation BCO expects the stack to look
like this:
<pre>
   |        |
   +--------+
   |  BCO   | -------> the return continuation BCO
   +--------+
   | itbl * | -------> stg_ctoi_ret_REP_info
   +--------+
   : VALUE  :  (the returned value, shown with : since it may occupy
   +--------+   multiple stack words)
</pre>
A machine code return will park the returned value in R1/F1/D1,
and enter the itbl on the top of the stack. Since it's our magic
itbl, this pushes the returned value onto the stack, which is
where the interpreter expects to find it. It then pushes the BCO
(again) and yields. The scheduler removes the BCO from the top,
and enters it, so that the continuation is interpreted with the
stack as shown above.
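<p>The compiled-to-interpreted return path can be traced in a small C model. Everything here is a simplification made for illustration: a single return register stands in for R1/F1/D1, an info table is reduced to one code pointer (harmless, since all nine codes of a ctoi table are identical), the BCO is represented by an arbitrary word, and <code>ctoi_ret_R1n</code> and <code>scheduler_take_bco</code> are hypothetical names.</p>

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t Word;

/* Simplified machine state: a downward-growing stack (sp[0] is the
   top) and one register standing in for R1; D1 and F1 are omitted. */
typedef struct Machine {
    Word stack[32];
    Word *sp;
    Word R1;
} Machine;

/* An info table reduced to its return code.  The nine entry points of
   a ctoi table are identical, so one function pointer suffices here. */
typedef struct { void (*ret)(Machine *); } Itbl;

static int yielded;               /* set when code yields to the scheduler */

static Word itbl_word(const Itbl *t) { return (Word)(const void *)t; }
static const Itbl *word_itbl(Word w) { return (const Itbl *)(const void *)w; }

/* Models stg_ctoi_ret_R1n_info: an unboxed int returned in R1. */
static void ctoi_ret_R1n(Machine *m) {
    Word bco = m->sp[1];          /* the continuation BCO, just under the itbl */
    *--m->sp = m->R1;             /* push the returned value */
    *--m->sp = bco;               /* push the BCO (again) */
    yielded = 1;                  /* and yield */
}
static const Itbl ctoi_ret_R1n_itbl = { ctoi_ret_R1n };

/* Compiled code returns by parking the value in R1 and entering the
   itbl on top of the stack. */
static void compiled_return(Machine *m, Word value) {
    m->R1 = value;
    word_itbl(m->sp[0])->ret(m);
}

/* The scheduler removes the BCO from the top; the interpreter would
   then run it against the remaining stack. */
static Word scheduler_take_bco(Machine *m) {
    assert(yielded);
    yielded = 0;
    return *m->sp++;
}
```

<p>Starting from a stack holding the BCO and <code>&amp;ctoi_ret_R1n_itbl</code>, <code>compiled_return(&amp;m, 42)</code> followed by <code>scheduler_take_bco(&amp;m)</code> leaves the value 42 on top, the itbl beneath it and the BCO beneath that, which is the layout the continuation BCO expects on entry.</p>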
An interpreted return will create the value to return at the top
of the stack. It then examines the return itbl, which must be
immediately underneath the return value, to see if it is one of
the magic <code>stg_ctoi_ret_REP_info</code> set. If it is,
it knows it is returning to an interpreted continuation. It
therefore simply enters the BCO which it assumes is immediately
underneath the itbl on the stack.
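<p>The check the interpreter makes can be sketched as a membership test on the address of the itbl beneath the return value. The <code>Itbl</code> stand-ins, the <code>ReturnAction</code> result and the table of four ctoi itbls below are assumptions for illustration; the four entries simply follow the R1p/R1n/D1/F1 set described earlier.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for return info tables; only their addresses are compared. */
typedef struct { const char *name; } Itbl;

static const Itbl ctoi_R1p = { "stg_ctoi_ret_R1p_info" };
static const Itbl ctoi_R1n = { "stg_ctoi_ret_R1n_info" };
static const Itbl ctoi_D1  = { "stg_ctoi_ret_D1_info" };
static const Itbl ctoi_F1  = { "stg_ctoi_ret_F1_info" };

/* A hypothetical compiled case continuation's return itbl. */
static const Itbl compiled_alt_ret = { "some compiled return itbl" };

static const Itbl *const ctoi_set[] = { &ctoi_R1p, &ctoi_R1n, &ctoi_D1, &ctoi_F1 };

typedef enum { ENTER_BCO_BENEATH, RETURN_VIA_ITOC } ReturnAction;

/* Called by the interpreter after pushing the return value, with the
   itbl found immediately beneath that value. */
static ReturnAction interpreted_return(const Itbl *beneath) {
    for (size_t i = 0; i < sizeof ctoi_set / sizeof ctoi_set[0]; i++)
        if (beneath == ctoi_set[i])
            return ENTER_BCO_BENEATH;   /* interpreted continuation */
    return RETURN_VIA_ITOC;             /* must be machine code */
}
```

<p>Passing any of the four ctoi stand-ins yields <code>ENTER_BCO_BENEATH</code>; passing <code>&amp;compiled_alt_ret</code> yields <code>RETURN_VIA_ITOC</code>, the case handled by the machinery described next.</p>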
<b>Returning to compiled code</b>.
Before the scrutinee is entered, the stack is arranged like this:
<pre>
                        ptr to vec code 8 ------> return vector code 8
   |        |           ...
   +--------+           ptr to vec code 1 ------> return vector code 1
   | itbl * | --        Itbl end
   +--------+   \
                 ----> direct return code
</pre>
The scrutinee value is then entered.
The case continuation(s) expect the stack to look the same, with
the returned HNF in a suitable return register, R1, D1, F1 etc.
A machine code return knows whether it is doing a vectored or
direct return, and, if the former, which vector element it is.
So, for a direct return we jump to <code>Sp[0]</code>, and for a
vectored return, jump to <code>((CodePtr*)(Sp[0]))[ - ITBL_LENGTH
- vector number ]</code>. This is (of course) the scheme that
compiled code has been using all along.
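<p>The address arithmetic of a vectored return can be checked with a toy layout. In this sketch, array indices stand in for addresses, <code>ITBL_LENGTH</code> is given an arbitrary value of 2, and vector numbers are assumed to run from 1 to 8 to match the labels in the picture above; none of this reflects a real info table's size.</p>

```c
#include <assert.h>

/* Hypothetical layout of one return info table, lowest address first.
   Indices stand in for addresses; each entry names the code found
   there: 1..8 for the vectored alternatives, 0 for the direct return
   code, -1 for the info table's own words. */
#define ITBL_LENGTH 2
#define N_VECTORS 8
#define DIRECT_CODE (N_VECTORS + ITBL_LENGTH)   /* where Sp[0] points */

static const int layout[DIRECT_CODE + 1] = {
    8, 7, 6, 5, 4, 3, 2, 1,      /* ptrs to vec code 8 .. vec code 1 */
    -1, -1,                      /* the info table itself */
    0                            /* direct return code */
};

/* Direct return: jump to Sp[0]. */
static int direct_return(int sp0) {
    return layout[sp0];
}

/* Vectored return: ((CodePtr*)(Sp[0]))[ - ITBL_LENGTH - vector number ],
   assuming vector numbers run from 1 to 8. */
static int vectored_return(int sp0, int vector) {
    assert(vector >= 1 && vector <= N_VECTORS);
    return layout[sp0 - ITBL_LENGTH - vector];
}
```

<p>With <code>Sp[0]</code> pointing at the direct return code, <code>vectored_return(DIRECT_CODE, v)</code> returns <code>v</code> for every <code>v</code> from 1 to 8, and <code>direct_return(DIRECT_CODE)</code> returns 0.</p>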
An interpreted return will, as described just above, have examined
the itbl immediately beneath the return value it has just pushed,
and found it not to be one of the <code>stg_ctoi_ret_REP_info</code> set,
so it knows this must be a return to machine code. It needs to
pop the return value, currently on the stack, into R1/F1/D1, and
jump through the info table. Unfortunately the first part cannot
be accomplished directly since we are not in Haskellised-C world.
We therefore employ a second family of magic infotables, indexed,
like the first, on the return representation, and therefore with
names of the form <code>stg_itoc_ret_REP_info</code>. (Note:
<code>itoc</code>; the previous bunch were <code>ctoi</code>).
This is pushed onto the stack (note, tagged values have their tag
<pre>
   |        |
   +--------+
   | itbl * | -------> arbitrary machine code return itbl
   +--------+
   : VALUE  :  (the returned value, possibly multiple words)
   +--------+
   | itbl * | -------> stg_itoc_ret_REP_info, with code:
   +--------+
                          pop myself (stg_itoc_ret_REP_info) off the stack
                          pop return value into R1/D1/F1
                          do standard machine code return to itbl at t.o.s.
</pre>
We then return to the scheduler, asking it to enter the itbl at
t.o.s. When entered, <code>stg_itoc_ret_REP_info</code> removes
itself from the stack, pops the return value into the relevant
return register, and returns to the itbl to which we were trying
to return in the first place.
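<p>The itoc trampoline can be simulated in the same style. This is a sketch under simplifying assumptions: one return register, info tables reduced to a code pointer, the scheduler's "enter the itbl at t.o.s." modelled as a direct call, and hypothetical names such as <code>itoc_ret_R1n</code> and <code>compiled_case_ret</code>.</p>

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t Word;

typedef struct Machine Machine;
typedef struct { void (*code)(Machine *); } Itbl;

/* Downward-growing stack (sp[0] is the top) plus R1, the only return
   register modelled; `received` records what the continuation got. */
struct Machine {
    Word stack[32];
    Word *sp;
    Word R1;
    Word received;
};

static Word itbl_word(const Itbl *t) { return (Word)(const void *)t; }
static const Itbl *itbl_at_top(const Machine *m) {
    return (const Itbl *)(const void *)m->sp[0];
}

/* An arbitrary compiled case continuation: it expects its value in R1. */
static void compiled_case_ret(Machine *m) {
    m->received = m->R1;
    m->sp++;                          /* pop its own return itbl */
}
static const Itbl compiled_case_itbl = { compiled_case_ret };

/* Models stg_itoc_ret_R1n_info. */
static void itoc_ret_R1n(Machine *m) {
    m->sp++;                          /* pop myself off the stack */
    m->R1 = *m->sp++;                 /* pop return value into R1 */
    itbl_at_top(m)->code(m);          /* machine code return to itbl at t.o.s. */
}
static const Itbl itoc_ret_R1n_itbl = { itoc_ret_R1n };

/* Interpreter side of a return to compiled code: push the value and
   the itoc itbl, then have the "scheduler" enter the itbl at t.o.s. */
static void interp_return_to_compiled(Machine *m, Word value) {
    *--m->sp = value;
    *--m->sp = itbl_word(&itoc_ret_R1n_itbl);
    itbl_at_top(m)->code(m);
}
```

<p>With <code>&amp;compiled_case_itbl</code> on top of the stack, <code>interp_return_to_compiled(&amp;m, 99)</code> ends with <code>m.received == 99</code> and an empty stack: the itoc itbl removed itself and the value, and the model continuation popped its own frame after receiving 99 in R1.</p>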
Amazingly enough, this stuff all actually works! Well, mostly ...
<b>Unboxed tuples: a Right Royal Spanner In The Works</b>
The above scheme depends crucially on having magic infotables
<code>stg_{itoc,ctoi}_ret_REP_info</code> for each return
representation <code>REP</code>. It unfortunately fails miserably
in the face of unboxed tuple returns, because the set of required
tables would be infinite; this despite the fact that for any given
unboxed tuple return type, the scheme could be made to work fine.
This is a serious problem, because it prevents interpreted
code from doing <code>IO</code>-typed returns, since <code>IO
t</code> is implemented as <code>(# t, RealWorld# #)</code> or
thereabouts. This restriction in turn rules out FFI stuff in the
interpreter. Not good.
Although we have no way to make general unboxed tuples work, we
can at least make <code>IO</code>-types work using the following
ultra-kludgey observation: <code>RealWorld#</code> doesn't really
exist and so has zero size, in compiled code. In turn this means
that a type of the form <code>(# t, RealWorld# #)</code> has the
same representation as plain <code>t</code> does. So the bytecode
generator, whilst rejecting code with general unboxed tuple
returns, recognises and accepts this special case. Which means
that <code>IO</code>-typed stuff works in the interpreter. Just.
If anyone asks, I will claim I was out of radio contact, on a
6-month walking holiday to the south pole, at the time this was
... er ... dreamt up.
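<p>The special case can be expressed as a small check over the components' representations. The sketch below generalises it slightly, accepting any number of zero-width components alongside exactly one real one; that generalisation, the <code>Rep</code> enumeration and <code>unboxed_tuple_rep</code> are assumptions made for illustration rather than a description of what the bytecode generator actually tests.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return representations named after the R1p/R1n/D1/F1 set, plus VOID
   for zero-width types such as RealWorld#.  Illustrative only. */
typedef enum { REP_VOID, REP_R1p, REP_R1n, REP_D1, REP_F1 } Rep;

/* Decide whether an unboxed tuple return can be handled.  Zero-width
   components are dropped; if exactly one component is left, the tuple
   is returned just like that component, so the existing
   stg_{itoc,ctoi}_ret_REP_info tables apply.  Anything else is
   rejected, as general unboxed tuple returns are. */
static bool unboxed_tuple_rep(const Rep *comps, size_t n, Rep *out) {
    size_t live = 0;
    Rep rep = REP_VOID;
    for (size_t i = 0; i < n; i++) {
        if (comps[i] != REP_VOID) {
            live++;
            rep = comps[i];
        }
    }
    if (live != 1)
        return false;
    *out = rep;
    return true;
}
```

<p>A tuple like <code>(# t, RealWorld# #)</code> with an unboxed-int <code>t</code> is accepted with representation R1n, while a genuine pair, such as one holding a pointer and a double, is rejected.</p>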
Last modified: Thursday February 7 15:33:49 GMT 2002