ghc/docs/storage-mgt/sm.tex

   1 \documentclass{article}
   2 \usepackage{code}
   3
   4 \setlength{\parskip}{0.25cm}
   5 \setlength{\parsep}{0.25cm}
   6 \setlength{\topsep}{0cm}
   7 \setlength{\parindent}{0cm}
   8 \renewcommand{\textfraction}{0.2}
   9 \renewcommand{\floatpagefraction}{0.7}
  10
  11
  12 % Terminology
  13 \newcommand{\block}{block}
  14 \newcommand{\Block}{Block}
  15 \newcommand{\segment}{segment}
  16 \newcommand{\Segment}{Segment}
  17 \newcommand{\step}{step}
  18 \newcommand{\Step}{Step}
  19
  20 \newcommand{\note}[1]{{\em $\spadesuit$ #1}}
  21
  22 \begin{document}
  23 \title{The GHC storage manager}
  24 \author{Simon Peyton Jones}
  25
  26 \makeatactive
  27 \maketitle
  28
  29 \section{Introduction}
  30
  31 This is a draft design document, not a completed thing.
  32
  33 \section{Goals}
  34
  35 Storage management goals:
  36
  37 \begin{itemize}
  38 \item Generational collection, supporting multiple generations.
  39
  40 \item The ability to pin the allocation
  41 area into a few pages that we hope will fit entirely in the cache.
  42
  43 \item Allows objects to age within a generation before getting promoted.
  44
  45 \item Heap can grow as needed, rather than having to be pre-sized
  46   by the programmer.
  47
  48 \item We support mark/sweep/compact collection for older generations.
  49 This is a Good Thing when the live memory approaches the available
  50 physical memory, because it reduces paging.
  51
  52 \item Little OS support needed.  No mmap etc. All that we require is
  53   the ability to @malloc@ (or whatever) a new chunk of memory.
  54   There can be intervening ``sandbars'' allocated by other programs
  55   (e.g. DLLs or other @malloc@'d structures) between chunks of heap.
  56
  57 \item Possible feature: ability to support mostly-copying conservative
  58 collection.[.bartlett mostly 1989, morrisset mostly.]
  59 Not top priority.
  60 \end{itemize}
  61
  62 Language-support goals:
  63 \begin{itemize}
  64 \item The garbage collector ``shorts out'' indirection objects introduced
  65 by the mutator (notably when overwriting a thunk with an indirection).
  66
  67 \item The garbage collector executes selector thunks.
  68 For example, a thunk for
  69 @(fst x)@ where @x@ is a pointer to a pair @(a,b)@ would be
  70 evaluated by the garbage collector to just @a@.  This is an important
  71 strategy for plugging space leaks.
  72
  73 \item The garbage collector traversese the code tree, as well as
  74 the heap data structures, to find which CAFs are live.  This is a Royal Pain.
  75
  76 \item The garbage collector finalises some objects (typically a tiny minority).
  77 At the moment ``finalisation'' means ``call a C routine when this thing
  78 dies'' but it would be more general to schedule a call to a Haskell
  79 procedure.
  80 \end{itemize}
  81
  82 Instrumentation goals:
  83
  84 \begin{itemize}
  85 \item The garbage collector can gather heap-census information for profiling.
  86 To this end we can force GC to happen more often than it otherwise would,
  87 and the collector can gather information about the type and cost-centre
  88 associated with each heap object.
  89 \end{itemize}
  90
  91
  92 \section{Assumptions}
  93
  94 Every garbage collector uses assupmtions about the sort of
  95 objects it manipulates, and how the mutator behaves.  Here we
  96 collect our assumptions:
  97 \begin{itemize}
  98 \item Each heap object starts with a one-word header that encodes, or points
  99 to, information about the layout of the object.  The details of this
 100 encoding aren't fundamental to the collector, but are, of course, rather
 101 important.
 102
 103 \item Every heap pointer points to the beginning of
 104 an object; no pointers into the middle of objects.
 105
 106 \item There are two spare bits in the LSB of a pointer, because
 107 every heap object is at least word aligned.
 108
 109 \item
 110 Heap objects are packed contiguously, with no ``slop'' between them.
 111 This assumption makes it possible to scan a region of heap linearly.
 112 It's easy to arrange that objects are packed contiguously in the
 113 to-space of a copying collector, but the no-slop assumption also
 114 applies during mutation.  This is a bit of a pain, because we often
 115 update a thunk with an indirection to the thunk's value; and the
 116 indirection isn't necessarily the same size as the thunk.  This can be
 117 solved, but it takes a few more instructions.
 118
 119 We know of two reasons to uphold the no-slop assumption:
 120 \begin{itemize}
 121 \item A {\em mostly-copying conservative (MCC) collector} has to deal with
 122 possible pointers from conservative regions.[.morrisset mostly.]
 123 Given a possible pointer the MCC collector must find the head of the object into
 124 which the putative pointer points.  One easy way to do this is
 125 to scan forward from a known point; but only if there is no
 126 slop.
 127 \item An {\em incremental collector}
 128 will scan linearly during mutation.[.field while.]
 129 \end{itemize}
 130 \end{itemize}
 131
 132
 133 \section{\Block{}s}
 134
 135 The basic memory structure is similar to other BIBOP-based
 136 collectors.[.stop the bibop, reppy garbage collector, moss garbage toolkit.]
 137
 138 A {\em \block} is a contiguous chunk of $2^K$ bytes, starting on a
 139 $2^K$-byte boundary.  We expect a \block{} to be between 4k and 64k bytes.
 140
 141 A \block{} is the unit of allocation for the storage manager.  That is,
 142 the {\em block allocator} hands over store to the mutator in multiples of
 143 one \block.  (The mutator may then allocate multiple heap objects within
 144 a \block, of course.)
 145
 146 \Block{}s are often allocated individually, but for large
 147 objects, or just to reduce inter-block linkage costs, a
 148 contiguous group of \block{}s can be allocated; we call
 149 this a {\em \block{} group} or sometimes just {\em group}.
 150 The first \block{} of a group is called the {\em group head}.
 151
 152 \subsection{\Block{} descriptors}
 153
 154 Each block has an associated {\em block{} descriptor}, a
 155 record with the following structure:
 156 \begin{code}
 157   typedef struct {
 158         int Gen;
 159         int Step;
 160         void *Start;
 161         void *Free;
 162         bdescr *Link
 163     } bdescr;
 164 \end{code}
 165 The function @BDescr(a)@ takes an address @a@ and returns
 166 the address of its block descriptor.  It is expected that
 167 @BDescr@ only takes a few instructions (see below).
 168 The field of a \block{} descriptor have the following
 169 purposes:
 170 \begin{center}
 171 \begin{tabular}{lp{4.5in}}
 172 @Gen@ & The generation to which this block belongs.  The
 173 youngest generation is 0. Valid for all blocks whether or not
 174 it is a group head.
 175 \\ \\
 176 @Step@ & The number of GCs that this block has survived
 177 while within its current generation. Only valid for group heads.
 178 \\ \\
 179 @Start@ & Points to the first byte of the block itself.
 180 Valid for all blocks.
 181 \\ \\
 182 @Free@ & For a group head, @Free@ points to the first free (un-allocated)
 183 byte in the group. For non-group heads, @Free@ is set to zero;
 184 this identifies the \block{} as a non-head.
 185 \\ \\
 186 @Link@ & Used to chain \block{}s together.  For non-group-heads,
 187 @Link@ points to the (\block{} descriptor of the) group head.
 188 \end{tabular}
 189 \end{center}
 190
 191 \subsection{\Step{}s}
 192
 193 A {\em \step} is a linked list of \block{} groups, {\em not} necessarily contiguous,
 194 all belonging to the same generation,
 195 and all of which have survived the same number of garbage collections.
 196
 197 We think of a \step{} as {\em logically} contiguous (via their @Link@ fields) even though
 198 it is not physically contiguous.
 199
 200 The \step{} number of a \step{} indicates its age (in GCs) within its generation.
 201 (\Step{} zero is the youngest \step{} in a generation.)  During GC of generation G,
 202 live objects
 203 from \step{} N are moved to \step{} N+1, and live objects from \step{} @MaxStep@(G)
 204 are promoted to generation G+1. In this way, objects are given a decent chance
 205 to die before being promoted.
 206
 207 In effect, we record the age for a set of objects grouped by address (the \step{}),
 208 rather than recording the age for each object individually.  This is quite
 209 a conventional idea.
 210
 211 \subsection{Generations}
 212
 213 A {\em generation} is a set of \step{}s
 214 (not necessarily contiguous).  Most allocation takes
 215 place in generation zero, the youngest generation.
 216
 217 The big difference between generations and \step{}s is that:
 218 \begin{itemize}
 219 \item Garbage collection applies to all of generation G and all younger
 220   generations (for some G).  You cannot do garbage collection on just
 221   part of a generation.
 222
 223 \item Consequently we have to track inter-generational pointers from
 224 older to younger generations; we do not
 225   track inter-\step{} pointers within a generation.
 226 \end{itemize}
 227
 228 \subsection{The \block{} allocator}
 229
 230 The {\em \block{} allocator} is responsible for
 231 allocating blocks.  It mediates between the higher levels of the
 232 storage manager and the operating system.
 233
 234 It supports the following API:
 235 \begin{description}
 236 \item{@InitialiseBlockAllocator()@} initialises the \block{} allocator.
 237 \item{@bdescr *AllocBlocks( int n )@} allocates a {\em contiguous}
 238 group of @n@ \block{}s, returning a pointer to the \block{}
 239 descriptor of the first.  Returns @NULL@ if the request can't be
 240 satisfied.  The block descriptors of the group are initialised by the block allocator.
 241 \item{@FreeBlocks( bdescr *p )@} returns a block group to the block allocator.
 242 \item{@BDescr( void *p )@} takes a pointer @p@ to any
 243 byte within a block{} and returns a pointer to its \block{}
 244 descriptor.  It does no error checking, and fails horribly if the @p@
 245 does not point into a \block{} allocated by the \block{} allocator.
 246 It is probably implemented as an @inline@ procedure.
 247 \end{description}
 248
 249
 250 \subsection{Implementation}
 251
 252 Our current implementation plan for the \block{} allocator is as
 253 follows.  It allocates memory from the OS in {\em megablocks}.  A
 254 megablock is a contiguous chunk of memory of $2^M$ bytes or less,
 255 starting on a $2^M$ byte boundary.  All the \block{} descriptors are
 256 laid out in an array at the start of the megablock.  Each is less than 32 bytes
 257 long.  So to get from an address @p@ to its block descriptor, @BDESCR@
 258 does this:
 259 \begin{code}
 260   bdescr *BDescr( void *p ) {
 261         return (((p>>(K-5)) & Bmask) | (p & Mmask)));
 262   }
 263 \end{code}
 264 Depending on how many \block{}s fit in the megablock, the first
 265 one or more descriptors will be unused.
 266
 267 (An alternative design has the \block{} descriptors at the start of
 268 each \block{}.  This makes @BDescr@ shorter, but is pessimal for
 269 cache locality when fiddling with \block{} descriptors.
 270 It also means that only the first block in a contiguous chunk of
 271 blocks can have a descriptor.  That in turn makes life difficult
 272 for a mostly-copying conservative collector.  Given a possible pointer
 273 from a conservative region, a MCC collector needs
 274 to find the start of the object into which the putative pointer
 275 points.  One way to do this is to scan from the start of the
 276 contiguous chunk of \block{}s into which the pointer points; and to
 277 find that it helps to have a descriptor for each \block{}.)
 278
 279 The \block{} allocator maintains a pool of free \block{}s.
 280 Because it may be asked for a contiguous chunk of \block{}s it
 281 needs to keep the free-list ordered (or something similar).
 282
 283
 284 \section{The Storage Manager API}
 285
 286 The storage manager supports the following API.
 287 \begin{description}
 288 \item[@void *Allocate( int n )@] allocates @n@ bytes and returns
 289 a pointer to the first byte of the block.  Returns @NULL@ if
 290 the request could not be satisfied --- @Allocate@ does not call
 291 the garbage collector itself.  It is the responsibility of the caller
 292 to completely fill in the allocated chunk with objects packed end to
 293 end. @OpenNursery( void **p_hp, void **p_hplim )@
 294
 295 \item[@OpenNursery( void **p\_hp, void **p\_hplim )@]
 296 returns pointers in @*p_hp@ and @*p_hplim@, to the beginning and end
 297 of a free chunk of store.
 298 More precisely, @*p_hp@ points to the first byte of the region
 299 and @*p_hplim@ to the first byte beyond the end of the region.
 300 @*p_hp@ is set to @NULL@ if there is no chunk available.
 301
 302 @OpenNursery@ is used by the main mutuator to get a chunk of
 303 contiguous store to allocate in.  It allocates until @hp@ gets
 304 too close to @hplim@ and then calls @ExtendNursery@ to try
 305 to get a new region of nursery.
 306
 307 \item[@ExtendNursery( void **p\_hp, void **p\_hplim )@]
 308 returns to the storage manager the region handed out by
 309 the previous @OpenNursery@ or @ExtendNursery@.  Suppose that
 310 this previous call returned with @*p_hp@ set
 311 to $hp$ and @p_hplim@ set to
 312 $hplim$.
 313 Then the @*p_hplim@ given to @ExtendNursery@ should be the same
 314 as $hplim$, and @*p_hp@ should be greater than or equal to $hp$ and
 315 less than or equal to $hplim$.
 316
 317 @ExtendNursery@ tries to find another suitable region.  If it
 318 succeeds, it sets @*hp@ and @*hplim@ appropriately; if not, it sets
 319 @*hp@ to @NULL@.  The new region is not necessarily contiguous
 320 with the old one.
 321
 322 \note{Maybe @ExtendNursery@ should be a macro, so that we can pass a
 323 globally allocated register to it?  (We can't take the address of a register.)
 324 If so then presumably so should @OpenNursery@.}
 325
 326 \item[@ZapNursery()@] arranges that the next call to @ExtendNursery@
 327 will fail.  @ZapNursery@ is called by asynchronous interrupts to arrange that
 328 execution is brought to an orderly halt.
 329
 330 \item[@GarbageCollect( void *Roots )@] performs garbage collection.
 331 The parameter @Roots@ is a procedure which is called by the garbage
 332 collector when it wishes to find all the roots.  This procedure
 333 should in turn call @MarkRoot@ on each such root.
 334
 335 \item[@void *MarkRoot( void *p )@] informs the garbage collector that
 336 @p@ is a root.  It returns the new location of the object.
 337
 338 \item[@RecordMutable( void *p )@] informs the storage manager that
 339 @p@ points to an object that may contain pointers to objects in
 340 a younger generation.
 341
 342 It is not necessary to call @RecordMutable@ on objects known to
 343 be in the nursery.
 344
 345 It is only necessary to call @RecordMutable@ once.  For objects that
 346 are genuinely mutable (like mutable arrays), the object is permanently
 347 recorded as mutable.  On the other hand, thunks that are updated only
 348 once can be dropped from the mutables pool once their pointer has been
 349 dealt with (promoted).
 350
 351 \item{@boolean InNursery( void *p )@} tells whether @p@ points into
 352 the nursery.  This should be a rather fast test; it is used to guard
 353 calls to @RecordMutable@.
 354 \end{description}
 355
 356
 357 \section{Allocation}
 358
 359 We want to experiment with the idea of keeping the youngest generation
 360 entirely in the cache, so that an object might be allocated, used,
 361 and garbage collected without ever hitting main memory.
 362
 363 To do this, generation zero consists of a fixed number of one-\block{}
 364 \segment{}s, carefully allocated in a contiguous address range so that
 365 they fit snugly next to each other in the cache.
 366 (Large objects are allocated directly into generation 1,
 367 so that there is no need to find contiguous \block{}s in generation 0 --- Section~\ref{sect:large}.)
 368
 369 After a GC, the storage manager decides how many generation-0 \segment{}s
 370 (= \block{}s) can be
 371 allocated before the next minor GC; it chains these \segment{}s together
 372 through their @SegLink@ fields.  The storage manager then makes the allocation pointer
 373 (@Hp@) point to the start of the first \segment{}, and the limit pointer
 374 (@HpLim@) point to the end of that \segment{}.
 375 Objects are allocated by incrementing the @Hp@ until
 376 it hits @HpLim@.  If the @SegLink@ field of the current \segment{} is NIL,
 377 then it is time for GC; otherwise @Hp@ is set to point to the first data
 378 word of the next \segment{},
 379 and @HpLim@ to the end of the \segment{}, and allocation
 380 continues.  So between GCs allocation is only interrupted briefly
 381 (i.e. a handful of instructions)
 382 to move @Hp@/@HpLim@ on to the next \segment{}.  Actually, this regular
 383 interruption can be useful, because it makes a good moment to pre-empt
 384 a thread to schedule another, or to respond to an interrupt.
 385 (Pre-emption cannot occur at arbitrary times because we are assuming
 386 an accurate garbage collector.)
 387
 388 We expect there to be two \step{}s in generation zero, \step{} 0 and \step{} 1.
 389 At a minor GC we promote live objects in \step{} 1, and copy \step{} 0 to
 390 become a new \step{} 1.  The new \step{} 1 should fit also fit into the cache.
 391 Suppose there are 16 generation-zero \segment{}s.  A reasonable budget
 392 might be this:
 393 \begin{center}
 394 \begin{tabular}{l}
 395         10 \segment{}s to allocate into (\step{} 0) \\
 396         3 \segment{}s for \step{} 1 \\
 397         3 \segment{}s free for the new \step{} 1
 398 \end{tabular}
 399 \end{center}
 400
 401 This assumes a 30\% survival rate from \step{} 0 to \step{} 1.  We can do in-flight
 402 measurements to tune this.  And (MOST IMPORTANT) if we get it wrong, and
 403 get a blip in which 9 \segment{}s survive minor GC, all is not lost... we simply
 404 have to use some new \block{}s and generation zero gets bigger for a while
 405 and perhaps less cache resident.  (This contrasts with systems which rely
 406 on contiguous partitions, where an overflow is catastrophic, and which must
 407 therefore over-reserve.  We get better address-space usage by not having
 408 to be so conservative.  Again, this isn't a new observation.)
 409
 410
 411 \section{Doing garbage collection}
 412
 413 When GC happens, we first decide which generation to collect.  (How?)
 414 When we collect generation G we collect all younger generations too.
 415
 416 To begin with we assume a copying style garbage collection
 417 algorithm.  The structure of a full GC is as follows:
 418
 419 \begin{enumerate}
 420 \item Initialise
 421 \item For each root @r@ do @r:=Evacuate(r)@.
 422 \item Scavenge
 423 \item Tidy up
 424 \end{enumerate}
 425
 426 A {\em root} is either a pointer held by the RTS (notably the queue
 427 of runnable threads), or a pointer from a generation > G that might
 428 point into a generation <= G.
 429
 430 The @Evacuate@ operation does this:
 431
 432 \begin{code}
 433 Evacuate(p) {
 434   /* Ignore pointers into non-collected generations */
 435   if generation(p) > G then return(p);
 436
 437   /* Copy the object and return its new location */
 438   p' := copy the object pointed to by p into to-space(g);
 439
 440   return(p');
 441 }
 442 \end{code}
 443 The @Scavenge@ operation operation does this:
 444 \begin{code}
 445 Scavenge() {
 446   While there is an un-scavenged object O in to-space do
 447         for each pointer q in O do
 448                 q := Mark(q)
 449                 remember that O has been scavenged
 450 }
 451 \end{code}
 452 We say ``copy into to-space'' and ``find an un-scavenged object in to-space'',
 453 but actually each \step{} of each generation has a separate to-space,
 454 complete with allocation pointer and sweep pointer. (The sweep pointer
 455 finds the un-scavenged objects.)  When copying
 456 into a to-space a check must be made for to-space
 457 overflow and a new \segment{} chained on if necessary.
 458 This is much as when the
 459 mutator is doing allocation.  The difference is that we are allocating
 460 in multiple \segment{}s in an interleaved fashion, so we can't hold
 461 the allocation pointers in registers.
 462
 463 Only when all the sweep pointers
 464 have reached the corresponding allocation pointer is GC done.
 465 This entails multiple passes through the \step{}s until a pass
 466 finds no un-scavenged objects.  [NB: if we treated inter-generational pointers
 467 between generations <= G as roots, then we could avoid some of these multiple
 468 passes.  But we'd collect less garbage, and hence copy more
 469 objects, so it would probably be slower.]
 470
 471 Notice that there is no issue about partitioning the address space
 472 among generations, as there might be if each lived in a contiguous address
 473 space.
 474
 475 \section{Large objects}
 476 \label{sect:large}
 477
 478 Large objects are allocated directly in generation 1, each alone in its
 479 own \segment{}.  This means that they never, ever, need to be copied!  Instead,
 480 we simply change their generation or \step{} number.  We also need to chain the
 481 object onto a list of large objects to be swept for pointers.  This act also
 482 serves to mark the object, so we don't chain it on twice, or increment its
 483 \step{} number twice.
 484
 485 Stack objects probably count as large objects.  This is one reason for not
 486 wanting \block{}s to be too big.
 487
 488
 489 \section{Mark sweep}
 490
 491 So far we have assumed copying collection.  Actually, for
 492 older generations we may want to use
 493 mark/sweep/compact because it is a bit less greedy on address
 494 space, and may help paging behaviour.
 495
 496 What I have in mind is a pointer-reversing mark phase, so that we don't have
 497 to allocate space for a mark stack, although one could argue that a mark
 498 stack will almost invariably be small, and if it gets big then you can always
 499 malloc some more space.  Mumble mumble.
 500
 501 \bibliographystyle{plain}
 502 \bibliography{bibl,simon}
 503
 504 \end{document}