Reorganisation of the source tree

[ghc-hetmet.git] / docs / storage-mgt / ldv.tex
diff --git a/docs/storage-mgt/ldv.tex b/docs/storage-mgt/ldv.tex

new file mode 100644 (file)

index 0000000..936407c
--- /dev/null
+++ b/docs/storage-mgt/ldv.tex
@@ -0,0 +1,695 @@
+\documentclass{article}
+\usepackage{code,a4wide}
+
+\usepackage{graphics,epsfig,epic,eepic,epsfig}
+
+\setlength{\parskip}{0.25cm}
+\setlength{\parsep}{0.25cm}
+\setlength{\topsep}{0cm}
+\setlength{\parindent}{0cm}
+\renewcommand{\textfraction}{0.2}
+\renewcommand{\floatpagefraction}{0.7}
+
+
+% Terminology
+\newcommand{\block}{block}
+\newcommand{\Block}{Block}
+\newcommand{\segment}{segment}
+\newcommand{\Segment}{Segment}
+\newcommand{\step}{step}
+\newcommand{\Step}{Step}
+
+\newcommand{\note}[1]{{\em $\spadesuit$ #1}}
+
+\begin{document}
+\title{Implementation of Lag/Drag/Void/Use Profiling}
+\author{Sungwoo Park \\ Simon Marlow}
+
+\makeatactive
+\maketitle
+
+\section{Lag/Drag/Void/Use Profiling}
+
+\emph{Lag/Drag/Void/Use} (LDVU) profiling~\cite{RR} is a profiling technique 
+which yields a summary of the biography of all the dynamic closures created
+during program execution.
+In this profiling scheme,
+the biography of a closure is determined by four important events associated
+with the closure: \emph{creation}, \emph{first use}, 
+\emph{last use}, and \emph{destruction} (see Figure~\ref{fig-ldv}).
+The intervals between these successive events correspond to three phases
+for the closure: \emph{lag} (between creation and first use), 
+\emph{use} (between first use and last use), and 
+\emph{drag} (between last use and destruction).
+If the closure is never used, it is considered to remain in the \emph{void}
+phase all its lifetime.
+
+\begin{figure}[ht]
+\begin{center}
+\input{ldv.eepic}
+\caption{The biography of a closure}
+\label{fig-ldv}
+\end{center}
+\end{figure}
+
+The LDVU profiler regularly performs heap censuses during program execution.
+Each time a heap census is performed, the LDVU profiler increments a global
+time, which is used for timing all the events (such as creation and destruction
+of a closure) occurring during program execution.
+Hence, for instance, all closures creating between two successive heap censuses
+have the same creation time and belong to the same \emph{generation}.\footnote{In
+this document, a generation is related with heap censuses, not garbage collections
+as in other profiling schemes.}
+After the program terminates, it yields a post-mortem report on how much 
+of the \emph{live} graph is in one of the four phases at the moment of each 
+heap census.
+
+It must be emphasized that the LDVU profiler considers only live closures;
+it should not take into consideration dead closures which do not constitute
+the graph. Therefore, the result of LDVU profiling does not depend on the
+frequency of garbage collections.
+
+This document describes the implementation of LDVU profiling on the Glasgow
+Haskell Compiler runtime system.\footnote{Unless otherwise noted, all identifiers
+are defined in @LdvProfile.c@}.
+
+\section{An Overview of the Implementation}
+
+Every closure is augmented with an additional word in its profiling header
+to accommodate three additional pieces of information: 
+1) state flag indicating whether the closure has been used at least once or not.
+2) creation time; 3) time of most recent use if any so far.
+We refer to such a word as an LDV word.
+
+The LDVU profiler maintains a global time, stored in @ldvTime@.
+It is incremented each time a heap census is performed.
+During a heap census, the profiler scans all live closures and computes the 
+following: 
+1) the total size of all closures which have never been used; 
+2) the total size of all closures which have been used at least once 
+in the past.\footnote{There is another category of closures, namely, 
+\emph{inherently used} closures. We will explain 
+in Section~\ref{sec-heap-censuses}.}
+It is not until the whole program execution finishes that the profiler 
+can actually decide the total size corresponding to each of the four phases for
+a particular heap census. It is only when a closure is destroyed that the profiler
+can determine how long the closure has been in a specific phase. 
+Therefore, it is not sufficient to perform heap censuses periodically in order to
+compute the profiling statistics: the runtime system needs to intercept
+all events associated with any closures and update necessary information.
+
+All events associated with closures are handled by one of the three 
+macros defined
+in @includes/StgLdv.h@: @LDV_recordCreate()@, @LDV_recordUse()@, and 
+@LDV_recordDead()@.
+
+\begin{itemize}
+\item{@LDV_recordCreate()@} is called when a closure is created and updates its 
+creation time field.
+
+\item{@LDV_recordUse()@} is called when a closure is used and updates its most recent
+use time field.
+
+\item{@LDV_recordDead()@} is called when a closure @c@ is removed from the graph.
+It does not update its LDV word (because @c@ is about to be destroyed).
+Instead, it updates the statistics on LDVU profiling according to the following
+observation:
+if @c@ has never been used (which is indicated by the state flag in its LDV 
+word), 
+@c@ contributes to the void phase from its creation time to the last census
+time; if @c@ was used at least once (which is also indicated by the state flag),
+@c@ contributes to the @drag@ phase after its last use time. 
+\end{itemize}
+
+At the end of the program execution, the profiler performs a last census during
+which all closures in the heap are declared to be dead and @LDV_recordDead()@
+is invoked on each of them. 
+Then, the profiler computes the final statistics. 
+
+\section{LDV Words}
+
+We choose to share the LDV word for both retainer profiling and LDVU 
+profiling, which cannot take place simultaneously. 
+This is the reason why there is a
+union structure inside the @StgProHeader@ structure.
+The field @hp.ldvw@ in the @StgProfHeader@ structure corresponds to the LDV 
+word:
+\begin{code}
+typedef struct {
+  ...
+  union {
+    retainerSet *rs;          // Retainer Set
+    StgWord ldvw;             // Lag/Drag/Void Word
+  } hp;
+} StgProfHeader;
+\end{code}
+For instance, the LDV word of a closure @c@ can now be accessed with 
+@c->header.prof.hp.ldvw@ (or by @LDVW(c)@ where @LDVW()@ is a macro in 
+@includes/StgLdvProf.h@).
+
+An LDV word is divided into three fields, whose position is specified
+by three constants in @includes/StgLdvProf.h@:
+\begin{itemize}
+\item{@LDV_STATE_MASK@} corresponds to the state flag. 
+\item{@LDV_CREATE_MASK@} corresponds to the creation time.
+\item{@LDV_LAST_MASK@} corresponds to the most recent use time.
+\end{itemize}
+The constant @LDV_SHIFT@ specifies how many bits are allocated for 
+creation time or most recent use time.
+For instance, the creation time of a closure @c@ can be obtained by
+@(LDVW(c) & LDV_CREATE_MASK) >> LDV_SHIFT@.
+
+The creation time field and the most recent use time field can be set only by the 
+macros @LDV_recordCreate()@ and @LDV_recordUse()@. 
+@LDV_recordCreate()@ must be called whenever a new dynamic closure is created,
+and this is handily accomplished by rewriting the macro @SET_PROF_HDR()@ 
+(in @includes/ClosureMacros.h@) (we do not need to change @SET_STATIC_PROF_HDR()@
+because static closures are not involved in LDVU profiling at all):
+
+\begin{code}
+#define SET_PROF_HDR(c,ccs_)            \
+        ((c)->header.prof.ccs = ccs_,   \
+        LDV_recordCreate((c)))
+\end{code}
+
+There are a few cases in which the info table of a closure changes through
+an explicit invocation of @SET_INFO()@ or a direct assignment to its @header.info@
+field: 1) an indirection closure is replaced by an old-generation 
+indirection closure; 2) a thunk is replaced by a blackhole; 3) a thunk is replaced 
+by an indirection closure when its evaluation result becomes available.
+
+\emph{We regard such a situation as
+the destruction of an old closure followed by the creation of a new closure
+at the same memory address.}\footnote{This would be unnecessary if the two closures
+are of the same size, but it is not always the case. We choose to distinguish
+the two closures for the sake of consistency.}
+For instance, when an @IND_PERM@ closure is replaced by an @IND_OLDGEN_PERM@
+closures (during scavenging in @GC.c@), we wrap the invocation of @SET_INFO()@ with 
+the invocations of @LDV_recordDead()@ and @LDV_recordCreate()@ as follows 
+(@LDV_recordDead()@ requires the actual size of the closures being destroyed):
+
+\begin{code}
+  LDV_recordDead((StgClosure *)p, sizeofW(StgInd) - sizeofW(StgProfHeader));
+  SET_INFO(((StgClosure *)p), &stg_IND_OLDGEN_PERM_info);
+  LDV_recordCreate((StgClosure *)p);
+\end{code}
+
+\textbf{To do:}
+A direct assignment to the @header.info@ field implies that its cost centre 
+field is not initialized. This is no problem in the case of @EVACUATED@ closures 
+because they will 
+not be used again after a garbage collection. However, I am not sure if this is safe
+for @BLACKHOLE_BQ@ closures (in @StgMiscClosures.hc@) when retainer profiling, 
+which employs cost centre stacks, is going on. 
+If it is safe, please leave a comment there.
+
+@LDV_recordUse()@ is called on a closure whenever it is used, or \emph{entered}.
+Its state flag changes if necessary to indicate that it has been used, and
+the current global time is stored in its last use time field. 
+
+\section{Global Time \texttt{ldvTime} and Retainer Profiling}
+
+The global time, stored in @ldvTime@, records the current time period.
+It is initialized to $1$ and incremented after each time a heap census
+is completed through an invocation of @LdvCensus()@.  Note that each
+value of @ldvTime@ represents a time \emph{period}, not a point in
+time.
+
+All closures created between two successive invocations of
+@LdvCensus()@ have the same creation time.  If a closure is used at
+least once between two successive heap censuses, we consider the
+closure to be in the use phase during the corresponding time period
+(because we just set its last use time field to the current value of
+@ldvTime@ whenever it is used).  Notice that a closure with a creation
+time $t_c$ may be destroyed before the actual heap census for time
+$t_c$ and thus may \emph{not} be observed during the heap census for
+time $t_c$.  Such a closure does not show up in the profile at all.
+
+In addition, the value of @ldvTime@ indicates which of LDVU profiling
+and retainer profiling is currently active: during LDVU profiling, it
+is initialized to $1$ in @initLdvProfiling()@ and then increments as
+LDVU profiling proceeds; during retainer profiling, however, it is
+always fixed to $0$.  Thus, wherever a piece of code shared by both
+retainer profiling and LDVU profiling comes to play, we usually need
+to first examine the value of @ldvTime@ if necessary. For instance,
+consider the macro @LDV_recordUse()@:
+
+\begin{code}
+#define LDV_recordUse(c)                              \
+  if (ldvTime > 0)                                    \
+    LDVW((c)) = (LDVW((c)) & LDV_CREATE_MASK) | ldvTime | LDV_STATE_USE; 
+\end{code}
+
+If retainer profiling is being performed, @ldvTime@ is equal to $0$,
+and @LDV_recordUse()@ causes no side effect.\footnote{Due to this
+interference with LDVU profiling, retainer profiling slows down a bit;
+for instance, checking @ldvTime@ against $0$ in the above example
+would always evaluate to @rtsFalse@ during retainer profiling.
+However, this is the price to be paid for our decision not to employ a
+separate field for LDVU profiling.}
+
+As another example, consider @LDV_recordCreate()@:
+
+\begin{code}
+#define LDV_recordCreate(c)   \
+  LDVW((c)) = (ldvTime << LDV_SHIFT) | LDV_STATE_CREATE
+\end{code}
+
+The above definition of @LDV_recordCreate()@ works without any problem
+even for retainer profiling: during retainer profiling, 
+a retainer set field (@hp.ldvw@) must be initialized to a null pointer.
+Since @ldvTime@ is fixed to $0$, @LDV_recordCreate()@ initializes 
+retainer set fields correctly.
+
+\section{Heap Censuses}
+\label{sec-heap-censuses}
+
+The LDVU profiler performs heap censuses periodically by invoking the
+function @LdvCensus()@.  Because we need to know exactly which
+closures in the heap are live at census time, we always precede the
+census with a major garbage collection.
+
+During a census, we examine each closure one by one and compute the
+following three quantities:
+
+\begin{enumerate}
+\item the total size of all \emph{inherently used} closures.
+\item the total size of all closures which have not been used (yet).
+\item the total size of all closures which have been used at least once. 
+\end{enumerate}
+
+For most closures, a \emph{use} consists of entering the closure.  For
+unlifted objects which are never entered (eg. @ARR_WORDS@), it would
+be difficult to determine their points of use because such points are
+scattered around the implementation in various primitive operations.
+For this reason we consider all unlifted objects as ``inherently
+used''.  The following types of closures are considered to be
+inherently used: @TSO@, @MVAR@, @MUT_ARR_PTRS@, @MUT_ARR_PTRS_FROZEN@,
+@ARR_WORDS@, @WEAK@, @MUT_VAR@, @MUT_CONS@, @FOREIGN@, @BCO@, and
+@STABLE_NAME@.
+
+The three quantities are stored in an @LdvGenInfo@ array @gi[]@.
+@gi[]@ is indexed by time period.  For instance, @gi[ldvTime]@ stores
+the three quantaties for the current global time period.  The
+structure @LdvGenInfo@ is defined as follows:
+
+\begin{code}
+typedef struct {
+  ...
+  int inherentlyUsed;   // total size of 'inherently used' closures
+  int notUsed;          // total size of 'not used yet' closures
+  int used;             // total size of 'used at least once' closures
+  ...
+} LdvGenInfo;
+\end{code} 
+
+The above three quantities account for mutually exclusive sets of closures.
+In other words, if a closure is not inherently used, it belongs to 
+either the second or the third.
+
+\subsection{Taking a Census of the Live Heap}
+
+During a heap census, we need to visit every live closure once, so we
+perform a linear scan of the live heap after a major GC.  We can take
+advantage of the following facts to implement a linear scan for heap
+censuses:
+
+\begin{itemize}
+\item The nursery is empty. The small object pool and the large object pool, 
+      however, may \emph{not} be empty. This is because the garbage collector
+      invokes @scheduleFinalizer()@ after removing dead closures, and
+      @scheduleFinalizer()@ may create new closures through @allocate()@.
+\item @IND@, @IND_OLDGEN@, and @EVACUATED@ closures do not appear in
+the live heap.
+\end{itemize}
+
+There is one small complication when traversing the live heap: the
+garbage collector may have replaced @WEAK@ objects with @DEAD_WEAK@
+objects, which have a smaller size and hence leave some space before
+the next object.  To avoid this problem we change the size of
+@DEAD_WEAK@ objects to match that of @WEAK@ objects when profiling is
+enabled (see @StgMiscClosures.hc@).
+
+\section{Destruction of Closures}
+
+In order to compute the total size of closures for each of the four
+phases, we must report the destruction of every closure (except
+inherently used closures) to the LDVU profiler by invoking
+@LDV_recordDead()@.  @LDV_recordDead()@ must not be called on any
+inherently used closure because any invocation of @LDV_recordDead()@
+affects the statistics regarding void and drag phases, which no
+inherently used closure can be in.
+
+@LDV_recordDead()@ updates two fields @voidNew@ and @dragNew@ in the
+@LdvGenInfo@ array @gi[]@:
+
+\begin{code}
+typedef struct {
+  ...
+  int voidNew; 
+  int dragnew;
+  ...
+} LdvGenInfo;
+\end{code} 
+
+@gi[ldvTime].voidNew@ accumulates the size of all closures satisfying
+the following two conditions: 1) observed during the heap census at
+time @ldvTime@; 2) now known to have been in the void phase at time
+@ldvTime@.  It is updated when a closure which has never been used is
+destroyed.  Suppose that a closure @c@ which has never been used is
+about to be destroyed.  If its creation time is $t_c$, we judge that
+@c@ has been in the void phase all its lifetime, namely, from time
+$t_c$ to @ldvTime@.  Since @c@ will not be observed during the next
+heap census, which corresponds to time @ldvTime@, @c@ contributes to
+the void phase of times $t_c$ through @ldvTime@ - 1.  Therefore, we
+increase the @voidNew@ field of @gi[@$t_c$@]@ through @gi[ldvTime - 1]@
+ by the size of @c@.\footnote{In the actual implementation, we
+update @gi[$t_c$]@ and @gi[ldvTime]@ (not @gi[ldvTime@$ - $1@]@) only:
+@gi[$t_c$]@ and @gi[ldvTime]@ are increased and decreased by the size
+of @c@, respectively.  After finishing the program execution, we can
+correctly adjust all the fields as follows: @gi[$t_c$]@ is computed as
+$\sum_{i=0}^{t_c}$@gi[$i$]@.  }
+
+@gi[ldvTime].dragNew@ accumulates the size of all closures satisfying the following
+two conditions: 1) observed during the heap census at time @ldvTime@;
+2) now known to have been in the drag phase at time @ldvTime@.
+It is updated when a closure which has been used at least once is destroyed.
+Suppose that a closure @c@ which has been used last at time $t_l$ is about to
+be destroyed.
+We judge that @c@ has been in the drag phase from time $t_l + 1$ to 
+time @ldvTime@$ - 1$ (if $t_l + 1 > $@ldvTime@$ - 1$, nothing happens).
+Therefore, we increase the @dragNew@ field of @gi[@$t_l + 1$@]@ through 
+@gi[ldvTime@$ - 1$@]@
+by the size of @c@.\footnote{As in the case of @voidNew@, we update
+@gi[@$t_l + 1$@]@ and @gi[ldvTime]@ only.}
+
+Now we need to find out all the cases of closure destruction.
+There are four cases in which a closure is destroyed:
+
+\begin{enumerate}
+\item A closure is overwritten with a blackhole: 
+  @UPD_BH_UPDATABLE()@ and @UPD_BH_SINGLE_ENTRY()@ in @includes/StgMacros.h@,
+  @threadLazyBlackHole()@ and @threadSqueezeStack()@ in @GC.c@,
+  the entry code for @BLACKHOLE@ closures in @StgMiscClosures.hc@ (a 
+  @BLACKHOLE@ closure is changed into a @BLACKHOLE_BQ@ closure).
+  We call either @LDV_recordDead()@ or @LDV_recordDead_FILL_SLOP_DYNAMIC()@.
+
+\item A weak pointer is overwritten with a dead weak pointer:
+  @finalizzeWeakzh_fast()@ in @PrimOps.hc@, 
+  @finalizeWeakPointersNow()@ and @scheduleFinalizers()@ in @Weak.c@.
+  Since a weak pointer is inherently used, we do not call @LDV_recordDead()@.
+
+\item A closure is overwritten with an indirection closure:
+  @updateWithIndirection()@ and @updateWithPermIndirection()@ in @Storage.h@,
+  @scavenge()@ in @GC.c@, in which an @IND_PERM@ closure is explicitly replaced
+  with an @IND_OLDGEN_PERM@ closure during scavenging.
+  We call either @LDV_recordDead()@ or @LDV_recordDead_FILL_SLOP_DYNAMIC()@.
+  
+\item Closures are removed permanently from the graph during garbage
+collections.  We locate and dispose of all dead closures by linearly
+scanning the from-space right before tidying up.  This is feasible
+because any closures which is about to be removed from the graph still
+remains in the from-space until tidying up is completed.  The next
+subsection explains how to implement this idea.
+\end{enumerate}
+
+\subsection{Linear scan of the from-space during garbage collections}
+
+In order to implement linear scan of the from-space during a garbage collection 
+(before tidying up),
+we need to take into consideration the following facts:
+
+\begin{itemize}
+\item The pointer @free@ of a block in the nursery may incorrectly point to
+a byte past its actual boundary.
+This happens because 
+the Haskell mutator first increases @hpLim@ without comparing it with the
+actual boundary when allocating fresh memory for a new closure.
+@hpLim@ is later assigned to the pointer @free@ of the corresponding memory
+block, which means that during a heap census, the pointer @hpLim@ may not
+be trusted. 
+Notice that @hpLim@ is not available during LDVU profiling; it is valid
+only during the Haskell mutator time.
+
+\item The from-space may well contain a good number of @EVACUATED@ closures,
+and they must be skipped over.
+
+\item The from-space includes the nursery. 
+Furthermore, a closure in the nursery may not necessarily be adjacent to the next 
+closure because slop words may lie between the two closures;
+the Haskell mutator may allocate more space than actually needed in the
+nursery when creating a closure, potentially leaving slop words. 
+\end{itemize}
+
+The first problem is easily solved by limiting the scan up to the
+actual block boundary for each nursery block (see
+@processNurseryForDead()@).  In other words, for a nursery block
+descriptor @bd@, whichever of @bd->start@$ + $@BLOCK_SIZE_W@ and
+@bd->free@ is smaller is used as the actual boundary.
+
+We solve the second problem by exploiting LDV words of @EVACUATED@
+closures: we store the size of an evacuated closure, which now resides
+in the to-space, in the LDV word of the new @EVACUATED@ closure
+occupying its memory.  This is easily implemented by inserting a call
+to the macro @SET_EVACUAEE_FOR_LDV()@ in @copy()@ and @copyPart()@ (in
+@GC.c@).  Thus, when we encounter an @EVACUATED@ closure while
+linearly scanning the nursery, we can skip a correct number of words
+by referring to its LDV word.
+
+The third problem could be partially solved by always monitoring @Hp@
+during the Haskell mutator time: whenever @Hp@ is increased, we fill
+with zeroes as many words as the change of @HP@. Then, we could skip
+any trailing zero words when linearly scanning the nursery.
+Alternatively we could initialize the entire nursery with zeroes after
+each garbage collection and not worry about any change made to @Hp@
+during the Haskell mutator time.  The number of zero words to be
+written to the nursery could be reduced in the first approach, for we
+do not have to fill the header for a new closure.  Nevertheless we
+choose to employ the second approach because it simplifies the
+implementation code significantly (see @resetNurseries()@ in
+@Storage.c@).  Moreover, the second approach compensates for its
+redundant initialization cost by providing faster execution (due to a
+single memory write loop in contrast to frequent memory write loops in
+the first approach).  Also, we attribute the initialization cost to
+the runtime system and thus the Haskell mutator behavior is little
+affected.
+
+There is further complication though: occasionally a closure is
+overwritten with a closure of a smaller size, leaving some slop
+between itself and the next closure in the heap.  There are two cases:
+
+\begin{enumerate}
+\item A closure is overwritten with a blackhole. 
+\item A closure is overwritten with an indirection closure.
+\end{enumerate}
+
+In either case, an existing closure is destroyed after being replaced
+with a new closure.  If the two closures are of the same size, no slop
+words are introduced and we only need to invoke @LDV_recordDead()@ on
+the existing closure, which cannot be an inherently used closure.  If
+not, that is, the new closure is smaller than the existing closure
+(the opposite cannot happen), we need to fill one or more slop words
+with zeroes as well as invoke @LDV_recordDead()@ on the existing
+closure.  The macro @LDV_recordDead_FILL_SLOP_DYNAMIC()@ accomplishes
+these two tasks: it determines the size of the existing closure,
+invokes @LDV_recordDead()@, and fills the slop words with zeroes.
+After excluding all cases in which the two closures are of the same
+size, we invoke @LDV_recordDead_FILL_SLOP_DYNAMIC()@ only from:
+
+\begin{enumerate}
+\item @threadLazyBlackHole()@ and @threadSqueezeStack()@ in @GC.c@
+(for lazy blackholing),
+\item @UPD_BH_UPDATABLE()@ and @UPD_BH_SINGLE_ENTRY()@ in
+@includes/StgMacros.h@ (for eager blackholing, which isn't the
+default),
+\item @updateWithIndirection()@ and @updateWithPermIndirection()@ 
+in @Storage.h@.\footnote{Actually slop words created in 
+@updateWithIndirection()@ cannot survive major garbage collections.
+Still we invoke @LDV\_recordDead\_FILL\_SLOP\_DYNAMIC()@ to support linear
+scan of the heap during a garbage collection, which is discussed in the next
+section.}
+\end{enumerate}
+
+The linear scan of the from-space is initiated by the garbage
+collector.  From the function @LdvCensusForDead()@, every dead closure
+in the from-space is visited through an invocation of
+@processHeapClosureForDead()@.
+
+\subsection{Final scan of the heap}
+
+Since a closure surviving the final garbage collection is implicitly destroyed
+when the runtime system shuts down, we must invoke @processHeapClosureForDead@
+on \emph{every} closure in the heap once more after the final garbage collection.
+The function @LdvCensusKillAll()@, which is invoked from @shutdownHaskell()@
+in @RtsStartup.c@, traverses the entire heap and visits each closure.
+It also stops LDVU profiling by resetting @ldvTime@ to $0$. 
+
+It may be that after LDVU profiling stops, new closures may be created
+and even garbage collections may be performed.
+We choose to ignore these closures because they are all concerned about
+finalizing weak pointers (in @finalizeWeakPointersNow()@).
+It can be catastrophic to invoke @LdvCensusKillAll()@ after finishing
+@finalizeWeakPointersNow()@: @finalizeWeakPointersNow()@ calls
+@rts_evalIO()@, which is essentially initiating a new program execution,
+and no assumptions made upon LDVU profiling hold any longer. 
+
+\section{Time of Use}
+
+In order to yield correct LDVU profiling results, we must make sure
+that @LDV_recordUse()@ be called on a closure whenever it is used;
+otherwise, most of closures would be reported to be in the void phase.
+@includes/StgLdvProf.h@ provides an entry macro @LDV_ENTER@ which
+expands to @LDV_recordUse()@.  The compiler arranges to invoke
+@LDV_ENTER@ in the entry code for every dynamic closure it generates
+code for (constructors, thunks and functions).  We also have to add
+@LDV_ENTER@ calls to the closures statically compiled into the RTS:
+@PAP@s, @AP_UPD@s, standard thunk forms (in @StgStdThunks.hc@, and
+several others in @StgMiscClosures.hc@.
+
+\section{Computing Final Statistics}
+
+After the final scan of the heap, we can accurately determine the total
+size of closures in one of the four phases at the moment of each heap census.
+The structure @LdvGenInfo@ is augmented with two additional fields 
+@voidTotal@ and @dragTotal@:
+
+\begin{code}
+typedef struct {
+  ...
+  int voidTotal;
+  int dragTotal;
+  ...
+} LdvGenInfo;
+\end{code} 
+
+@gi[@$i$@].voidTotal@ and @gi[@$i$@].dragTotal@ are computed
+from @gi[@$i$@].voidNew@ and @gi[@$i$@].dragNew@, respectively.\footnote{Due
+to a slight optimization described before, @gi[@$i$@].voidTotal@ is actually
+computed as $\sum_{1 \leq j \leq i}$@gi[@$j$@].voidNew@. 
+@gi[@$i$@].dragTotal@ is computed in a similar way.}
+Then, the total size of closures in the lag phase @gi[@$i$@].lagTotal@ is computed
+as @gi[@$i$@].notUsed@$-$@gi[@$i$@].voidTotal@ (because any unused closure
+is either in the void phase or in the lag phase).
+Similarly, 
+the total size of closures in the use phase @gi[@$i$@].useTotal@ is computed
+as @gi[@$i$@].used@$-$@gi[@$i$@].dragTotal@ (because any used closure
+is either in the use phase or in the drag phase).
+@endLdvProfiling()@, called from @endHeapProfiling@ in @ProfHeap.c@, computes these 
+final statistics.
+
+\section{Usage}
+
+The runtime system option @-hL@ tells the executable program to
+perform LDVU profiling and produce a @.hp@ file:
+
+\begin{code}
+$ Foo.out +RTS -hL
+\end{code}
+
+The option @-i@ can be used to 
+specify a desired interval at which LDVU profiling is performed.
+The default and minimum value is half a second:
+
+\begin{code}
+$ Foo.out +RTS -hL -i2.5 -RTS
+\end{code}
+
+The @.hp@ file can be supplied to the @hp2ps@ program to create a postscript
+file showing the progress of LDVU profiling in a graph:
+
+\begin{code}
+$ hp2ps Foo.hs
+$ gv Foo.ps
+\end{code}
+
+The horizontal axis of the graph is in the Haskell mutator time, which excludes
+the runtime system time such as garbage collection time and LDVU profiling
+time. 
+The Haskell mutator runs a bit slower than it would without performing
+LDVU profiling, but the difference is minute.
+Also, the timer employed to periodically perform retainer profiling
+is not perfectly accurate. Therefore, the result may slightly vary for each
+execution of retainer profiling.
+
+\textbf{To do:} Currently the LDVU profiling is not supported with @-G1@ option.
+
+\textbf{To do:} When we perform LDVU profiling, the Haskell mutator time seems to
+be affected by @-S@ or @-s@ runtime option. For instance, the following 
+two options should result in nearly same profiling outputs, but
+the second run (without @-Sstderr@ option) spends almost twice as
+long in the Haskell mutator as the first run:
+1) @+RTS -Sstderr -hL -RTS@; 2) @+RTS -hL -RTS@.
+This is quite a subtle bug because this wierd phenomenon is not 
+observed in retainer profiling, yet the implementation of 
+@mut_user_time_during_LDV()@ is completely analogous to that of 
+@mut_user_time_during_RP()@. The overall shapes of the resultant graphs 
+are almost the same, though.
+
+\section{Files}
+
+This section gives a summary of changes made to the GHC in 
+implementing LDVU profiling.
+Only three files (@includes/StgLdvProf.h@, @LdvProfile.c@, and 
+@LdvProfile.h@) are new, and all others exist in the GHC.
+
+@\includes@ directory:
+
+\begin{description}
+\item[StgLdvProf.h] defines type @LdvGenInfo@, constants, and macros related
+with LDVU profiling.
+\item[ClosureMacros.h] changes macro @SET_PROF_HDR()@.
+\item[Stg.h] includes th header file @StgLdvProf.h@.
+\item[StgMacros.h] changes macros @UPD_BH_UPDATABLE()@ and @UPD_BH_SINGLE_ENTRY()@.
+\end{description}
+
+@\rts@ directory:
+
+\begin{description}
+\item[GC.c] invokes @LdvCensusForDead()@ before tidying up, sets @hasBeenAnyGC@ to 
+  @rtsTrue@, and changes @copy()@ and @copyPart()@.
+  Invokes @LDV_recordDead()@ and @LDV_recordDead_FILL_SLOP_DYNAMIC()@.
+\item[Itimer.c] changes @handle_tick()@.
+\item[LdvProfile.c] implements the LDVU profiling engine. 
+\item[LdvProfile.h] is the header for @LdvProfile.c@.
+\item[PrimOps.hc] changes @finalizzeWeakzh_fast()@.
+\item[ProfHeap.c] changes @initHeapProfiling()@ and @endHeapProfiling()@.
+\item[Profiling.c] changes @initProfilingLogFile@ and @report_ccs_profiling()@.
+\item[Proftimer.c] declares @ticks_to_retainer_ldv_profiling@,
+  @performRetainerLdvProfiling@, and @doContextSwitch@.
+\item[Proftimer.h] is the header for @Proftimer.c@. Defines @PROFILING_MIN_PERIOD@,
+  which specifies the minimum profiling period and the default profiling period.
+%\item[RtsAPI.c] implements @setProfileHeader()@.
+\item[RtsFlags.c]
+  sets @RtsFlags.ProfFlags.doHeapProfile@,
+  adds a string to @usage_text[]@ in @setupRtsFlags()@.
+\item[RtsFlags.h] defines constants @HEAP_BY_LDV@ and @LDVchar@.
+\item[RtsStartup.c] changes @shutDownHaskell()@.
+\item[Schedule.c] changes @schedule()@.
+\item[Stats.c] 
+  declares @LDV_start_time@, @LDV_tot_time@, @LDVe_start_time@, 
+  @LDVe_tot_time@.
+  Changes @mut_user_time_during_GC()@, @mut_user_time()@,
+  @stat_startExit()@,
+  @stat_endExit()@, and
+  @stat_exit()@.
+  Defines
+  @mut_user_time_during_LDV()@,
+  @stat_startLDV()@, and
+  @stat_endLDV()@.
+\item[Stats.h] is hte header for @Stats.c@.
+\item[StgMiscClosures.hc] inserts entry macros in 
+  @stg_IND_entry()@, @stg_IND_PERM_entry()@, @stg_IND_OLDGEN_entry()@,
+  @stg_IND_OLDGEN_PERM_entry()@, @stg_BLACKHOLE_entry()@, @stg_BLACKHOLE_BQ_entry()@,
+  and @stg_CAF_BLACKHOLE_entry()@.
+  Invokes @LDV_recordDead()@ in @stg_BLACKHOLE_entry@.
+  Redefines @stg_DEAD_WEAK_info@.
+\item[Storage.c] changes @initStorage()@, @resetNurseries()@, and @allocNursery()@.
+\item[Storage.h] changes @updateWithIndirection()@ and @updateWithPermIndirection()@.
+\item[Updates.hc] inserts entry macros in @stg_PAP_entry()@ and @stg_AP_UPD_entry()@.
+\item[Weak.c] changes @scheduleFinalizers()@.
+\end{description}
+
+\bibliographystyle{plain}
+\bibliography{reference}
+
+\end{document}