%%%\tableofcontents
%%%\newpage
-
%%-----------------------------------------------------------------%%
\section{Details}
-\subsection{Data structures and lifetimes}
+\subsection{Outline of the design}
+\label{sec:details-intro}
-About lifetimes:
+The design falls into three major parts:
\begin{itemize}
-\item {\bf Session} lifetime covers a complete run of GHCI,
- encompassing multiple recompilation runs.
-\item {\bf Rebuild} lifetime covers the actions needed to bring
- the target module up to date -- a downsweep from the
- target to reestablish the module graph, and an upsweep to
- bring the translations (compiled code) and global symbol
- table back up to date.
-\item {\bf Module} lifetime: that of data needed to translate
- a single module, but then discarded, for example Core,
- AbstractC, Stix trees.
+\item The compilation manager (CM), which coordinates the
+ system and supplies a HEP-like interface to clients.
+\item The module compiler (@compile@), which translates individual
+ modules to interpretable or machine code.
+\item The linker (@link@),
+ which maintains the executable image in interpreted mode.
\end{itemize}
-Structures with module lifetime are well documented and understood.
-Here we're really interested in structures with session and rebuild
-lifetimes. Most of these structures are ``owned'' by CM, since that's
+There are also three auxiliary parts: the finder, which locates
+source, object and interface files; the summariser, which quickly
+finds dependency information for modules; and the static info
+(compiler flags and package details), which is unchanged over the
+course of a session.
+
+This section continues with an overview of the session-lifetime data
+structures. Then follows the finder (section~\ref{sec:finder}),
+summariser (section~\ref{sec:summariser}),
+static info (section~\ref{sec:staticinfo}),
+and finally the three big sections
+(\ref{sec:manager},~\ref{sec:compiler},~\ref{sec:linker})
+on the compilation manager, compiler and linker respectively.
+
+\subsubsection*{Some terminology}
+
+Lifetimes: the phrase {\bf session lifetime} covers a complete run of
+GHCI, encompassing multiple recompilation runs. {\bf Module lifetime}
+is much shorter: it is that of data which is needed to translate a
+single module and is then discarded, for example Core, AbstractC and
+Stix trees.
+
+Data structures with module lifetime are well documented and understood.
+This document is mostly concerned with session-lifetime data.
+Most of these structures are ``owned'' by CM, since that's
the only major component of GHCI which deals with session-lifetime
issues.
-Terminology: ``home'' refers to modules in this package, precisely
-the ones tracked and updated by CM. ``Package'' refers to all other
-packages, which are assumed static.
+Modules and packages: {\bf home} refers to modules in this package,
+precisely the ones tracked and updated by the compilation manager.
+{\bf Package} refers to all other packages, which are assumed static.
+
+\subsubsection*{A summary of all session-lifetime data structures}
-New data structures are:
+These structures have session lifetime but not necessarily global
+visibility. Subsequent sections elaborate on who can see what.
\begin{itemize}
+\item {\bf Home Symbol Table (HST)} (owner: CM) holds the post-renaming
+ environments created by compiling each home module.
+\item {\bf Home Interface Table (HIT)} (owner: CM) holds in-memory
+ representations of the interface file created by compiling
+ each home module.
+\item {\bf Unlinked Images (UI)} (owner: CM) are executable but as-yet
+ unlinked translations of home modules only.
+\item {\bf Module Graph (MG)} (owner: CM) is the current module graph.
+\item {\bf Static Info (SI)} (owner: CM) is the package configuration
+ information (PCI) and compiler flags (FLAGS).
+\item {\bf Persistent Compiler State (PCS)} (owner: @compile@)
+ is @compile@'s private cache of information about package
+ modules.
+\item {\bf Persistent Linker State (PLS)} (owner: @link@) is
+ @link@'s private information concerning the current
+ state of the (in-memory) executable image.
+\end{itemize}
+
+
+%%-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --%%
+\subsection{The finder (\mbox{\tt type Finder})}
+\label{sec:finder}
+
+@Path@ could be an indication of a location in a filesystem, or it
+could be some more generic kind of resource identifier, a URL for
+example.
+\begin{verbatim}
+ data Path = ...
+\end{verbatim}
+And some names. @Module@s are now used as primary keys for various
+maps, so they are given a @Unique@.
+\begin{verbatim}
+ type ModName = String -- a module name
+ type PkgName = String -- a package name
+ type Module = -- contains ModName and a Unique, at least
+\end{verbatim}
+
+A @ModLocation@ says where a module is, what it's called and in what
+form it is.
+\begin{verbatim}
+ data ModLocation = SourceOnly Module Path       -- .hs
+                  | ObjectCode Module Path Path  -- .o, .hi
+                  | InPackage  Module PkgName
+                      -- examine PCI to determine package Path
+\end{verbatim}
+
+The module finder generates @ModLocation@s from @ModName@s. We expect
+it will assume packages to be static, but we want to be able to track
+changes in home modules during the session. Specifically, we want to
+be able to notice that a module's object and interface have been
+updated, presumably by a compile run outside of the GHCI session.
+Hence the two-stage type:
+\begin{verbatim}
+ type Finder = ModName -> IO ModLocation
+ newFinder :: PCI -> IO Finder
+\end{verbatim}
+@newFinder@ examines the package information right at the start, but
+returns an @IO@-typed function which can inspect home module changes
+later in the session.
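+
+A minimal sketch of this two-stage shape follows. It is an illustration
+under simplifying assumptions, not the intended implementation: @Path@
+is taken to be a @FilePath@, @PkgInfo@ and @ModLocation@ are cut down
+(no @Module@ field), home modules are assumed to live in the current
+directory, and object code is arbitrarily preferred over source when
+both @.o@ and @.hi@ exist.
+\begin{verbatim}
+ import qualified Data.Map as Map
+ import System.Directory (doesFileExist)
+
+ type ModName = String
+ type PkgName = String
+ type Path    = FilePath                   -- assumption
+
+ data PkgInfo = PkgInfo PkgName [ModName]  -- cut-down, hypothetical
+ type PCI     = [PkgInfo]
+
+ data ModLocation = SourceOnly Path        -- .hs
+                  | ObjectCode Path Path   -- .o, .hi
+                  | InPackage  PkgName
+                  deriving (Eq, Show)
+
+ type Finder = ModName -> IO ModLocation
+
+ -- Stage 1: index the (static) packages once.
+ -- Stage 2: the returned function re-examines home files on each call.
+ newFinder :: PCI -> IO Finder
+ newFinder pci = return locate
+   where
+     pkgMods = Map.fromList [ (m, p) | PkgInfo p ms <- pci, m <- ms ]
+     locate m = case Map.lookup m pkgMods of
+       Just p  -> return (InPackage p)
+       Nothing -> do
+         haveO  <- doesFileExist (m ++ ".o")
+         haveHi <- doesFileExist (m ++ ".hi")
+         if haveO && haveHi
+           then return (ObjectCode (m ++ ".o") (m ++ ".hi"))
+           else return (SourceOnly (m ++ ".hs"))
+\end{verbatim}
+The package index is built once, when @newFinder@ runs, whereas the
+file-existence checks are repeated on every lookup. That is what lets
+the finder notice home modules compiled outside the session.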
+
+
+%%-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --%%
+\subsection{The summariser (\mbox{\tt summarise})}
+\label{sec:summariser}
+
+A @ModSummary@ records the minimum information needed to establish the
+module graph and determine whose source has changed. @ModSummary@s
+can be created quickly.
+\begin{verbatim}
+ data ModSummary = ModSummary
+      ModLocation                    -- location and kind
+      (Maybe (String, Fingerprint))  -- source and fingerprint if .hs
+      (Maybe [ModName])              -- imports if .hs or .hi
+
+ type Fingerprint = ... -- file timestamp, or source checksum?
+
+ summarise :: ModLocation -> IO ModSummary
+\end{verbatim}
+
+The summary contains the location and source text, and the location
+contains the name. We would like to remove the assumption that
+sources live on disk, but I'm not sure this is good enough yet.
+
+\ToDo{Should @ModSummary@ contain source text for interface files too?}
+\ToDo{Also say that @ModIFace@ contains its module's @ModSummary@ (why?).}
+
+
+%%-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --%%
+\subsection{Static information (SI)}
+\label{sec:staticinfo}
+
+PCI, the package configuration information, is a list of @PkgInfo@,
+each containing at least the following:
+\begin{verbatim}
+ data PkgInfo
+    = PkgInfo PkgName    -- my name
+              Path       -- path to my base location
+              [PkgName]  -- who I depend on
+              [ModName]  -- modules I supply
+              [Unlinked] -- paths to my object files
+
+ type PCI = [PkgInfo]
+\end{verbatim}
+The @Path@s in it, including those in the @Unlinked@s, are set up
+when GHCI starts.
+
+FLAGS is a bunch of compiler options. We haven't figured out yet how
+to partition them into those for the whole session vs those for
+specific source files, so currently the best we can do is:
+\begin{verbatim}
+ data FLAGS = ...
+\end{verbatim}
+
+The static information (SI) comprises both of these:
+\begin{verbatim}
+ data SI = SI PCI
+              FLAGS
+\end{verbatim}
+
+
+
+%%-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --%%
+\subsection{The Compilation Manager (CM)}
+\label{sec:manager}
+
+\subsubsection{Data structures owned by CM}
+
+CM maintains two maps (HST, HIT) and a set (UI). It's important to
+realise that CM only knows about the map/set-ness, and has no idea
+what a @ModDetails@, @ModIFace@ or @Linkable@ is. Only @compile@ and
+@link@ know that, and CM passes these types around without
+inspecting them.
+
+\begin{itemize}
\item
{\bf Home Symbol Table (HST)} @:: FiniteMap Module ModDetails@
- The @ModDetails@ contain tycons, classes, instances,
- etc, collectively known as ``entities''. Referrals from other
- modules to these entities is direct, with no intervening
+ The @ModDetails@ (a couple of layers down) contain tycons, classes,
+ instances, etc, collectively known as ``entities''. References from
+ other modules to these entities are direct, with no intervening
indirections of any kind; conversely, these entities refer directly
- to other entities, regardless of module boundaries. HST only
- holds information for home modules; the corresponding wired-up
- details for package (non-home) modules are created lazily in
- the package rename cache (PRC).
+ to other entities, regardless of module boundaries. HST only holds
+ information for home modules; the corresponding wired-up details
+ for package (non-home) modules are created on demand in the package
+ symbol table (PST) inside the persistent compiler state (PCS).
CM maintains the HST, which is passed to, but not modified by,
@compile@. If compilation of a module is successful, @compile@
(Completely private to CM; nobody else sees this).
Compilation of a module always creates a @ModIFace@, which contains
- the unlinked symbol table entries. CM maintains a @FiniteMap@
+ the unlinked symbol table entries. CM maintains this @FiniteMap@
@ModName@ @ModIFace@, with session lifetime. CM never throws away
@ModIFace@s, but it does update them, by passing old ones to
@compile@ if they exist, and getting new ones back.
CM acquires @ModuleIFace@s from @compile@, which it only applies
to modules in the home package. As a result, HIT only contains
@ModuleIFace@s for modules in the home package. Those from other
- packages reside in ...
-
-\item
- {\bf Persistent Compiler State (PCS)} @:: known-only-to-compile@
-
- This contains info about foreign packages only, acting as a cache,
- which is private to @compile@. The cache never becomes out of
- date. There are at least two parts to it:
-
- \begin{itemize}
- \item
- {\bf Package Interface Table (PIT)} @:: FiniteMap Module ModIFace@
-
- @compile@ reads interfaces from modules in foreign packages, and
- caches them in the PIT. Subsequent imports of the same module get
- them directly out of the PIT, avoiding slow lexing/parsing phases.
- Because foreign packages are assumed never to become out of date,
- all contents of PIT remain valid forever.
-
- Successful runs of @compile@ can add arbitrary numbers of new
- interfaces to the PIT. Failed runs could also contribute any new
- interfaces read, but this could create inconsistencies between the
- PIT and the unlinked images (UI). Specifically, we don't want the
- PIT to acquire interfaces for which UI hasn't got a corresponding
- @Linkable@, and we don't want @Linkable@s from failed compilation
- runs to enter UI, because we can't be sure that they are actually
- necessary for a successful link. So it seems simplest, albeit at a
- small compilation speed loss, for @compile@ not to update PCS at
- all following a failed compile. We may revisit this
- decision later.
-
- \item
- {\bf Package Rename Cache (PRC)} @:: FiniteMap Module ModDetails@
-
- Adding an package interface to PIT doesn't make it directly usable
- to @compile@, because it first needs to be wired (renamed +
- typechecked) into the sphagetti of the HST. On the other hand,
- most modules only use a few entities from any imported interface,
- so wiring-in the interface at PIT-entry time might be a big time
- waster. Also, wiring in an interface could mean reading other
- interfaces, and we don't want to do that unnecessarily.
-
- The PRC avoids these problems by allowing incremental wiring-in to
- happen. Pieces of foreign interfaces are renamed and placed in the
- PRC, but only as @compile@ discovers it needs them. In the process
- of incremental renaming, @compile@ may need to read more package
- interfaces, which are returned to CM to add to the PIT.
-
- CM passes the PRC to @compile@ and is returned an updated version
- on success. On failure, @compile@ doesn't return an updated
- version even though it might have created some updates on the way
- to failure. This seems necessary to retain the (thus far unstated)
- invariant that PRC only contains renamed fragments of interfaces in
- PIT.
- \end{itemize}
-
- PCS is opaque to CM; only @compile@ knows what's in it, and how to
- update it. Because packages are assumed static, PCS never becomes
- out of date. So CM only needs to be able to create an empty PCS,
- with @emptyPCS@, and thence just passes it through @compile@ with
- no further ado.
-
- In return, @compile@ must promise not to store in PCS any
- information pertaining to the home modules. If it did so, CM would
- need to have a way to remove this information prior to commencing a
- rebuild, which conflicts with PCS's opaqueness to CM.
+ packages reside in the package interface table (PIT) which is a
+ component of PCS.
\item
{\bf Unlinked Images (UI)} @:: Set Linkable@
object, archive or DLL file. In interactive mode, it may also be
the STG trees derived from translating a module. So @compile@
returns a @Linkable@ from each successful run, namely that of
- translating the module at hand. At link-time, CM supplies these
- @Linkable@s to @link@. It also examines the @ModSummary@s for all
- home modules, and by examining their imports and the PCI (package
- configuration info) it can determine the @Linkable@s from all
- required imported packages too.
+ translating the module at hand.
+
+ At link-time, CM supplies @link@ with the @Linkable@s for the upward
+ closure of all modules which have changed. It also examines the
+ @ModSummary@s for all home modules, and by examining their imports
+ and the SI.PCI (package configuration info) it can determine the
+ @Linkable@s from all required imported packages too.
@Linkable@s and @ModIFace@s have a close relationship. Each
translated module has a corresponding @Linkable@ somewhere.
single @Linkable@ -- as is the case for any module from a
multi-module package. For these reasons it seems appropriate to
keep the two concepts distinct. @Linkable@s also provide
- information about how to link package components together, and that
- insn't the business of any specific module to know.
+ information about the sequence in which individual package
+ components should be linked, and that isn't the business of any
+ specific module to know.
CM passes @compile@ a module's old @ModIFace@, if it has one, in
the hope that the module won't need recompiling. If so, @compile@
- can just return the @ModIFace@ along with a new @ModDetails@
- created from it. Similarly, CM passes in a module's old
- @Linkable@, if it has one, and that's returned unchanged if the
- module isn't recompiled.
+ can just return the new @ModDetails@ created from it, and CM will
+ re-use the old @ModIFace@. If the module {\em is} recompiled (or
+ scheduled to be loaded from disk), @compile@ returns both the
+ new @ModIFace@ and new @Linkable@.
-\item
- {\bf Object Symbol Table (OST)} @:: FiniteMap String Addr+HValue@
-
- OST keeps track of symbol entry points in the linked image. In
- some sense it {\em is} the linked image. The mapping supplies
- @Addr@s for low level symbol names (eg, @Foo_bar_fast3@) which are
- in machine code modules in memory. For symbols of the form
- @Foo_bar_closure@ pertaining to an interpreted module, OST supplies
- an @HValue@, which is the application of the interpreter function to
- the STG tree for @Foo.bar@.
-
- When @link@ loads object code from disk, symbols from the object
- are entered as @Addr@s into OST. When preparing to link an
- unlinked bunch of STG trees, @HValue@s are added. Resolving of
- object-level references can then be done purely by consulting OST,
- with no need to look in HST, PRC, or anywhere else.
-
- Following the downsweep (re-establishment of the state and
- up-to-dateness of the module graph), CM may determine that certain
- parts of the linked image are out of date. It then will instruct
- @unlink@ to throw groups of @Unlinked@s out of OST, working down
- the module graph, so that at no time does OST hold entries for
- modules/packages which refer to modules/packages which have already
- been removed from OST. In other words, the transitive completeness
- of OST is maintained even during unlinking operations. Because of
- mutually recursive module groups, CM asks @unlink@ to delete sets
- of @Unlinked@s in one go, rather than singly.
-
- \ToDo{Need a way to refer to @Unlinked@s. Some kind of keys?}
-
- For batch mode compilation, OST doesn't exist. CM doesn't know
- anything aboyt OST's representation, and the only modifiers of it
- are @link@ and @unlink@. So for batch compilation, OST can just
- be a unit value ignored by all parties.
+\item
+ {\bf Module Graph (MG)} @:: known-only-to-CM@
-\item
- {\bf Linked Image (LI)} @:: no-explicit-representation@
+ Records, for CM's purposes, the current module graph,
+ up-to-dateness and summaries. More details when I get to them.
+ Only contains home modules.
+\end{itemize}
+Probably all this stuff is rolled together into the Persistent CM
+State (PCMS):
+\begin{verbatim}
+ data PCMS = PCMS HST HIT UI MG
+ emptyPCMS :: IO PCMS
+\end{verbatim}
- LI isn't explicitly represented in the system, but we record it
- here for completeness anyway. LI is the current set of
- linked-together module, package and other library fragments
- constituting the current executable mass. LI comprises:
- \begin{itemize}
- \item Machine code (@.o@, @.a@, @.DLL@ file images) in memory.
- These are loaded from disk when needed, and stored in
- @malloc@ville. To simplify storage management, they are
- never freed or reused, since this creates serious
- complications for storage management. When no longer needed,
- they are simply abandoned. New linkings of the same object
- code produces new copies in memory. We hope this not to be
- too much of a space leak.
- \item STG trees, which live in the GHCI heap and are managed by the
- storage manager in the usual way. They are held alive (are
- reachable) via the @HValue@s in the OST. Such @HValue@s are
- applications of the interpreter function to the trees
- themselves. Linking a tree comprises travelling over the
- tree, replacing all the @Id@s with pointers directly to the
- relevant @_closure@ labels, as determined by searching the
- OST. Once the leaves are linked, trees are wrapped with the
- interpreter function. The resulting @HValue@s then behave
- indistinguishably from compiled versions of the same code.
- \end{itemize}
- Because object code is outside the heap and never deallocated,
- whilst interpreted code is held alive by the OST, there's no need
- to have a data structure which ``is'' the linked image.
+\subsubsection{What CM implements}
+It pretty much implements the HEP interface. First, though, define a
+containing structure for the state of the entire CM system and its
+subsystems @compile@ and @link@:
+\begin{verbatim}
+ data CmState
+    = CmState PCMS    -- CM's stuff
+              PCS     -- compile's stuff
+              PLS     -- link's stuff
+              SI      -- the static info, never changes
+              Finder  -- the finder
+\end{verbatim}
- For batch compilation, LI doesn't exist because OST doesn't exist,
- and because @link@ doesn't load code into memory, instead just
- invokes the system linker.
+The @CmState@ is threaded through the HEP interface. In reality
+this might be done using @IORef@s, but for clarity:
+\begin{verbatim}
+ type ModHandle = ... (opaque to CM/HEP clients) ...
+ type HValue = ... (opaque to CM/HEP clients) ...
- \ToDo{Do we need to say anything about CAFs and SRTs? Probably ...}
-\end{itemize}
+ cmInit :: FLAGS
+        -> [PkgInfo]
+        -> IO CmState
-There are also a few auxiliary structures, of somehow lesser importance:
+ cmLoadModule :: CmState
+              -> ModName
+              -> IO (CmState, Either [SDoc] ModHandle)
-\begin{itemize}
-\item
- {\bf Module Graph (MG)} @:: known-only-to-CM@
+ cmGetExpr :: ModHandle
+           -> CmState
+           -> String -> IO (CmState, Either [SDoc] HValue)
- Records, for CM's purposes, the current module graph,
- up-to-dateness and summaries. More details when I get to them.
+ cmRunExpr :: HValue -> IO () -- don't need CmState here
+\end{verbatim}
+Almost all the huff and puff in this document pertains to @cmLoadModule@.
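+
+The @IORef@ alternative mentioned above could look roughly like the
+following sketch. The helper @withSession@ is hypothetical and is
+written generically over the state type, so it assumes nothing about
+what a @CmState@ contains.
+\begin{verbatim}
+ import Data.IORef
+
+ -- Run one state-threading operation against a mutable session
+ -- cell: read the state, run the operation, store the new state.
+ withSession :: IORef s -> (s -> IO (s, a)) -> IO a
+ withSession ref op = do
+   s       <- readIORef ref
+   (s', r) <- op s
+   writeIORef ref s'
+   return r
+\end{verbatim}
+With a cell holding a @CmState@, a client would then write something
+like @withSession ref (\st -> cmLoadModule st "Main")@ rather than
+passing the state along by hand.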
-\item
- {\bf Package Config Info (PCI)} @:: [PkgInfo]@
-
- A value static over the entire session, giving, for each package,
- its name, dependencies, linkable components and constitutent module
- names.
-\item
- {\bf Flags/options (FLAGS)} @:: dunno@
-
- Another session-static value, containing flags/options. Burble.
-\end{itemize}
+\subsubsection{Implementing \mbox{\tt cmInit}}
+@cmInit@ creates an empty @CmState@ using @emptyPCMS@, @emptyPCS@ and
+@emptyPLS@, makes SI from the supplied flags and package info, and
+creates the finder by supplying the package info to @newFinder@.
+\subsubsection{Implementing \mbox{\tt cmLoadModule}}
-\subsection{Important datatypes}
+\begin{enumerate}
+\item {\bf Downsweep:} using @finder@ and @summarise@, chase from
+ the given module to
+ establish the new home module graph (MG). Do not chase into
+ package modules.
+\item Remove from HIT, HST, UI any modules in the old MG which are
+ not in the new one. The old MG is then replaced by the new one.
+\item Topologically sort MG to generate a bottom-to-top traversal
+ order, giving a worklist.
+\item {\bf Upsweep:} call @compile@ on each module in the worklist in
+ turn, passing it
+ the ``correct'' HST, PCS, the old @ModIFace@ if
+ available, and the summary. ``Correct'' HST in the sense that
+ HST contains only the modules in this module's downward
+ closure, so that @compile@ can construct the correct instance
+ and rule environments simply as the union of those in
+ the module's downward closure.
+
+ If @compile@ doesn't return a new interface/linkable pair,
+ compilation wasn't necessary. Either way, update HST with
+ the new @ModDetails@; also update UI and HIT if a
+ compilation {\em did} occur.
+
+ Keep going until the root module is successfully done, or
+ compilation fails.
+
+\item If the previous step terminated because compilation failed,
+ define the successful set as those modules in successfully
+ completed SCCs, i.e. all @Linkable@s returned by @compile@ excluding
+ those from modules in any cycle which includes the module which failed.
+ Remove from HST, HIT, UI and MG all modules mentioned in MG which
+ are not in the successful set. Call @link@ with the successful
+ set,
+ which should succeed. The net effect is to back off to a point
+ in which those modules which are still aboard are correctly
+ compiled and linked.
+
+ If the previous step terminated successfully,
+ call @link@ passing it the @Linkable@s in the upward closure of
+ all those modules for which @compile@ produced a new @Linkable@.
+\end{enumerate}
+As a small optimisation, do this:
+\begin{enumerate}
+\item[3a.] Remove from the worklist any module M where M's source
+ hasn't changed and neither has the source of any module in M's
+ downward closure. This has the effect of not starting the upsweep
+ right at the bottom of the graph when that's not needed.
+ Source-change checking can be done quickly by CM by comparing
+ summaries of modules in MG against corresponding
+ summaries from the old MG.
+\end{enumerate}
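+
+Steps 3 and 3a above can be sketched as follows. This is an
+illustration only: the module graph is reduced to a map from module
+name to a fingerprint and a list of imports, @Fingerprint@ is assumed
+to be an @Int@, and @Data.Graph@ from the containers library stands in
+for whatever CM ends up using. The names @worklist@, @downClosure@ and
+@prune@ are hypothetical.
+\begin{verbatim}
+ import Data.Graph (SCC, flattenSCC, stronglyConnComp)
+ import qualified Data.Map as Map
+ import qualified Data.Set as Set
+
+ type ModName     = String
+ type Fingerprint = Int                        -- assumption
+ -- Hypothetical stand-in for MG: fingerprint and imports.
+ type MG = Map.Map ModName (Fingerprint, [ModName])
+
+ -- Step 3: dependencies first.  Each SCC is a group of mutually
+ -- recursive modules; imports not in MG (packages) are ignored.
+ sccs :: MG -> [SCC ModName]
+ sccs mg = stronglyConnComp
+   [ (m, m, imps) | (m, (_, imps)) <- Map.toList mg ]
+
+ worklist :: MG -> [ModName]
+ worklist = concatMap flattenSCC . sccs
+
+ -- Downward closure of a home module, including the module itself.
+ downClosure :: MG -> ModName -> Set.Set ModName
+ downClosure mg root = go Set.empty [root]
+   where
+     go seen []     = seen
+     go seen (m:ms)
+       | m `Set.member` seen = go seen ms
+       | otherwise = go (Set.insert m seen) (homeImports m ++ ms)
+     homeImports m = [ i | i <- maybe [] snd (Map.lookup m mg)
+                         , i `Map.member` mg ]
+
+ -- Step 3a: keep a module only if some summary in its downward
+ -- closure differs between the old and the new graph.
+ prune :: MG -> MG -> [ModName] -> [ModName]
+ prune old new =
+   filter (any changed . Set.toList . downClosure new)
+   where
+     changed m = Map.lookup m old /= Map.lookup m new
+\end{verbatim}
+Keeping the SCCs around (rather than only the flattened worklist) is
+also what step 5 needs when it backs off to the successfully completed
+groups after a failure.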
-\subsubsection*{Names, location and summarisation}
-The summary should contain the location, and the location contain the
-name. Also it is hoped to remove the assumption that sources live on
-disk, but I'm not sure this is good enough yet. @Module@s are now
-used as primary keys in various maps, so they are given a @Unique@.
-\begin{verbatim}
- type ModName = String -- a module name
- type PkgName = String -- a package name
- type Module = -- contains ModName and a Unique, at least
-\end{verbatim}
-@Path@ could be an indication of a location in a filesystem, or it
-could be some more generic kind of resource identifier, a URL for
-example.
-\begin{verbatim}
- data Path = ...
-\end{verbatim}
-A @ModLocation@ says where a module is, what it's called and in what
-form it it.
-\begin{verbatim}
- data ModLocation = SourceOnly Module Path -- .hs
- | ObjectCode Module Path Path -- .o, .hi
- | InPackage Module PkgName
- -- examine PCI to determine package Path
-\end{verbatim}
-A @ModSummary@ records the minimum information needed to establish the
-module graph and determine whose source has changed. @ModSummary@s
-can be created quickly.
-\begin{verbatim}
- data ModSummary = ModSummary
- ModLocation -- location and kind
- Maybe (String, Fingerprint)
- -- source and fingerprint if .hs
- [ModName] -- imports
+%%-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --%%
+\subsection{The compiler (\mbox{\tt compile})}
+\label{sec:compiler}
- type Fingerprint = ... -- file timestamp, or source checksum?
-\end{verbatim}
-\ToDo{Should @ModSummary@ contain source text for interface files
- too?}
-\ToDo{Also say that @ModIFace@ contains its module's @ModSummary@.}
+\subsubsection{Data structures owned by \mbox{\tt compile}}
+{\bf Persistent Compiler State (PCS)} @:: known-only-to-compile@
-\subsubsection*{To do with linking}
-Two important types: @Unlinked@ and @Linkable@. The latter is a
-higher-level representation involving multiple of the former.
-An @Unlinked@ is a reference to unlinked executable code, something
-a linker could take as input:
-\begin{verbatim}
- data Unlinked = DotO Path
- | DotA Path
- | DotDLL Path
- | Trees [StgTree RdrName]
-\end{verbatim}
-The first three describe the location of a file (presumably)
-containing the code to link. @Trees@, which only exists in
-interactive mode, gives a list of @StgTrees@, in which the
-unresolved references are @RdrNames@ -- hence it's non-linkedness.
-Once linked, those @RdrNames@ are replaced with pointers to the
-machine code implementing them.
+This contains info about foreign packages only, acting as a cache,
+which is private to @compile@. The cache never becomes out of
+date. There are three parts to it:
-A @Linkable@ gathers together several @Unlinked@s and associates them
-with either a module or package:
-\begin{verbatim}
- data Linkable = LM Module [Unlinked] -- a module
- | LP PkgName [Unlinked] -- a package
-\end{verbatim}
-The order of the @Unlinked@s in the list is important, particularly
-for package contents -- we'll have to decide on a left-to-right or
-right-to-left dependency ordering.
+ \begin{itemize}
+ \item
+ {\bf Package Interface Table (PIT)} @:: FiniteMap Module ModIFace@
-@compile@ is supplied with, and checks PIT (inside PCS) before
-reading package interfaces, so it doesn't read and add duplicate
-@ModIFace@s to PIT.
+ @compile@ reads interfaces from modules in foreign packages, and
+ caches them in the PIT. Subsequent imports of the same module get
+ them directly out of the PIT, avoiding slow lexing/parsing phases.
+ Because foreign packages are assumed never to become out of date,
+ all contents of PIT remain valid forever. @compile@ of course
+ tries to find package interfaces in PIT in preference to reading
+ them from files.
-PCI, the package configuration information, is a list of @PkgInfo@,
-each containing at least the following:
-\begin{verbatim}
- data PkgInfo
- = PkgInfo PkgName -- my name
- Path -- path to my base location
- [PkgName] -- who I depend on
- [ModName] -- modules I supply
- [Unlinked] -- paths to my object files
-\end{verbatim}
-The @Path@s in it, including those in the @Unlinked@s, are set up
-when GHCI starts.
+ Both successful and failed runs of @compile@ can add arbitrary
+ numbers of new interfaces to the PIT. The failed runs don't matter
+ because we assume that packages are static, so the data cached even
+ by a failed run is valid forever (i.e. for the rest of the session).
-\subsection{Signatures}
+ \item
+ {\bf Package Symbol Table (PST)} @:: FiniteMap Module ModDetails@
-\subsubsection*{The finder}
-The module finder generates @ModLocation@s from @ModName@s. We
-expect that it will assume packages to be static, but we want to
-be able to track changes in home modules during the session.
-Specifically, we want to be able to notice that a module's object and
-interface have been updated, presumably by a compile run outside of
-the GHCI session. Hence the two-stage type:
-\begin{verbatim}
- type Finder = ModName -> IO ModLocation
- newFinder :: [PkgConfig] -> IO Finder
-\end{verbatim}
-@newFinder@ examines the package information right at the start, but
-returns an @IO@-typed function which can inspect home module changes
-later in the session.
+ Adding a package interface to PIT doesn't make it directly usable
+ to @compile@, because it first needs to be wired (renamed +
+ typechecked) into the spaghetti of the HST. On the other hand,
+ most modules only use a few entities from any imported interface,
+ so wiring-in the interface at PIT-entry time might be a big time
+ waster. Also, wiring in an interface could mean reading other
+ interfaces, and we don't want to do that unnecessarily.
+
+ The PST avoids these problems by allowing incremental wiring-in to
+ happen. Pieces of foreign interfaces are copied out of the holding
+ pen (HP), renamed, typechecked, and placed in the PST, but only as
+ @compile@ discovers it needs them. In the process of incremental
+ renaming/typechecking, @compile@ may need to read more package
+ interfaces, which are added to the PIT and hence to
+ HP.~\ToDo{How? When?}
+
+ CM passes the PST to @compile@ and gets an updated version back
+ on both success and failure.
+
+ \item
+ {\bf Holding Pen (HP)} @:: HoldingPen@
+
+ HP holds parsed but not-yet renamed-or-typechecked fragments of
+ package interfaces. As typechecking of other modules progresses,
+ fragments are removed (``slurped'') from HP, renamed and
+ typechecked, and placed in PCS.PST (see above). Slurping a
+ fragment may require new interfaces to be read into HP. The hope
+ is, though, that many fragments will never get slurped, reducing
+ the total number of interfaces read (as compared to eager slurping).
+
+ \end{itemize}
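+
+ The on-demand movement of fragments from HP to PST might be sketched
+ as below. Everything here is a simplifying assumption made for
+ illustration: both tables are keyed by entity name rather than by
+ @Module@, fragments are plain strings, @wire@ is a placeholder for
+ renaming plus typechecking, and the reading of further interfaces
+ (which needs @IO@) is left out.
+\begin{verbatim}
+ import qualified Data.Map as Map
+
+ type Name = String
+ newtype Fragment = Fragment String deriving (Eq, Show) -- parsed only
+ newtype Wired    = Wired String    deriving (Eq, Show) -- wired in
+
+ type HoldingPen = Map.Map Name Fragment
+ type PST        = Map.Map Name Wired
+
+ -- Placeholder for renaming and typechecking one fragment.
+ wire :: Fragment -> Wired
+ wire (Fragment s) = Wired s
+
+ -- Look an entity up, slurping it out of HP on first use.
+ -- Nothing means the entity is in neither table.
+ slurp :: Name -> (HoldingPen, PST)
+       -> Maybe (Wired, (HoldingPen, PST))
+ slurp n (hp, pst) =
+   case Map.lookup n pst of
+     Just w  -> Just (w, (hp, pst))
+     Nothing -> do
+       frag <- Map.lookup n hp
+       let w = wire frag
+       Just (w, (Map.delete n hp, Map.insert n w pst))
+\end{verbatim}
+ A fragment that is never asked for simply stays in HP and is never
+ wired in, which is the saving hoped for above.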
+
+ PCS is opaque to CM; only @compile@ knows what's in it, and how to
+ update it. Because packages are assumed static, PCS never becomes
+ out of date. So CM only needs to be able to create an empty PCS,
+ with @emptyPCS@, and thence just passes it through @compile@ with
+ no further ado.
+
+ In return, @compile@ must promise not to store in PCS any
+ information pertaining to the home modules. If it did so, CM would
+ need to have a way to remove this information prior to commencing a
+ rebuild, which conflicts with PCS's opaqueness to CM.
-\subsubsection*{Starting up}
-Some of the session-lifetime data structures are opaque to CM, so
-it doesn't know how to create an initial one. Hence it relies on its
-client to supply the following:
-\begin{verbatim}
- emptyPCS :: PCS
- emptyOST :: OST
-\end{verbatim}
-The PCS is maintained solely by @compile@, and OST solely by
-@link@/@unlink@. CM cannot know the representation of the latter
-since it depends on whether we're operating in interactive or batch
-mode.
-\subsubsection*{What {\tt compile} does}
+
+\subsubsection{What {\tt compile} does}
@compile@ is necessarily somewhat complex. We've decided to do away
-with private global variables -- they make the design harder to
-understand and may interfere with CM's need to roll the system back
-to a consistent state following compilation failure for modules in
-a cycle. Without further ado:
+with private global variables -- they make the design specification
+less clear, although the implementation might use them. Without
+further ado:
\begin{verbatim}
- compile :: FLAGS -- obvious
+ compile :: SI -- obvious
-> Finder -- to find modules
-> ModSummary -- summary, including source
- -> Maybe (ModIFace, Linkable)
- -- former summary and code, if avail
+ -> Maybe ModIFace
+ -- former summary, if avail
-> HST -- for home module ModDetails
-> PCS -- IN: the persistent compiler state
- -> CompResult
+ -> IO CompResult
data CompResult
= CompOK ModDetails -- new details (== HST additions)
- (ModIFace, Linkable)
- -- summary and code; same as went in if
- -- compilation was not needed
+ (Maybe (ModIFace, Linkable))
+ -- summary and code; Nothing => compilation
+ -- not needed (old summary and code are still valid)
PCS -- updated PCS
[SDoc] -- warnings
data PCS
= MkPCS PIT -- package interfaces
- PRC -- rename cache/global symtab contents
+ PST -- post slurping global symtab contribs
+ HoldingPen -- pre slurping interface bits and pieces
+
+ emptyPCS :: IO PCS -- since CM has no other way to make one
\end{verbatim}
Although @compile@ is passed three of the global structures (FLAGS,
HST and PCS), it only modifies PCS. The rest are modified by CM as it
\item
If recompilation is not needed, create a new @ModDetails@ from the
- old @ModIFace@, looking up information in HST and PCS.PRC as necessary.
- Return the new details, the old @ModIFace@ and @Linkable@, the PCS
- \ToDo{I don't think the PCS should be updated, but who knows?}, and
- an empty warning list.
+ old @ModIFace@, looking up information in HST and PCS.PST as
+ necessary. Return the new details, a @Nothing@ denoting
+ compilation was not needed, the PCS \ToDo{I don't think the PCS
+ should be updated, but who knows?}, and an empty warning list.
\item
Otherwise, compilation is needed.
If the module is only available in object+interface form, read the
interface, make up details, create a linkable pointing at the
- object code. Does this involve reading any more interfaces? Does
- it involve updating PRC?
+ object code. \ToDo{Does this involve reading any more interfaces? Does
+ it involve updating PST?}
Otherwise, translate from source, then create and return: an
- details, interface, linkable, updated PRC, and warnings.
+ details, interface, linkable, updated PST, and warnings.
When looking for a new interface, search HST, then PCS.PIT, and only
then read from disk. In which case add the new interface(s) to
boot interface against the inferred interface.}
\end{itemize}
-\subsection{What {\tt link} and {\tt unlink} do}
+
+\subsubsection{Contents of \mbox{\tt ModDetails},
+ \mbox{\tt ModIFace} and \mbox{\tt HoldingPen}}
+Only @compile@ can see inside these three types -- they are opaque to
+everyone else. @ModDetails@ holds the post-renaming,
+post-typechecking environment created by compiling a module.
+
\begin{verbatim}
- link :: [[Unlinked]] -> OST -> IO LinkResult
+ data ModDetails
+ = ModDetails {
+        moduleExports :: Avails,
+        moduleEnv     :: GlobalRdrEnv,     -- == FM RdrName [Name]
+        typeEnv       :: FM Name TyThing,  -- TyThing is in TcEnv.lhs
+        instEnv       :: InstEnv,
+        fixityEnv     :: FM Name Fixity,
+        ruleEnv       :: FM Id [Rule]
+ }
+\end{verbatim}
- unlink :: [Unlinked] -> OST -> IO OST
+@ModIFace@ is nearly the same as @ParsedIface@ from @RnMonad.lhs@:
+\begin{verbatim}
+ type ModIFace = ParsedIface -- not really, but ...
+ data ParsedIface
+ = ParsedIface {
+ pi_mod :: Module, -- Complete with package info
+ pi_vers :: Version, -- Module version number
+ pi_orphan :: WhetherHasOrphans, -- Whether this module has orphans
+ pi_usages :: [ImportVersion OccName], -- Usages
+ pi_exports :: [ExportItem], -- Exports
+ pi_insts :: [RdrNameInstDecl], -- Local instance declarations
+ pi_decls :: [(Version, RdrNameHsDecl)], -- Local definitions
+ pi_fixity :: (Version, [RdrNameFixitySig]), -- Local fixity declarations,
+ -- with their version
+ pi_rules :: (Version, [RdrNameRuleDecl]), -- Rules, with their version
+ pi_deprecs :: [RdrNameDeprecation] -- Deprecations
+ }
+\end{verbatim}
- data LinkResult = LinkOK OST
- | LinkErrs [SDoc] OST
+@HoldingPen@ is a cleaned-up version of that found in @RnMonad.lhs@,
+retaining just the 3 pieces actually comprising the holding pen:
+\begin{verbatim}
+ data HoldingPen
+ = HoldingPen {
+ iDecls :: DeclsMap, -- A single, global map of Names to decls
+
+ iInsts :: IfaceInsts,
+ -- The as-yet un-slurped instance decls; this bag is depleted when we
+ -- slurp an instance decl so that we don't slurp the same one twice.
+ -- Each is 'gated' by the names that must be available before
+ -- this instance decl is needed.
+
+ iRules :: IfaceRules
+ -- Similar to instance decls, only for rules
+ }
\end{verbatim}
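+
+To illustrate the gating idea only, here is a minimal sketch in which
+an un-slurped declaration becomes ready once all of its gate names
+are available. The types @Name@ and @Gated@ and the function
+@slurpReady@ are hypothetical simplifications, not the
+representation used in @RnMonad.lhs@.
+
+\begin{verbatim}
+   import qualified Data.Set as Set
+   import Data.List (partition)
+
+   type Name = String                     -- assumption: names as strings
+
+   -- A declaration together with the names that gate it.
+   data Gated d = Gated (Set.Set Name) d
+
+   -- Split the pen into the declarations whose gates are all
+   -- available (to be slurped now) and the depleted pen.
+   slurpReady :: Set.Set Name -> [Gated d] -> ([d], [Gated d])
+   slurpReady avail pen = (map payload ready, rest)
+     where
+       (ready, rest)           = partition isReady pen
+       isReady (Gated gates _) = gates `Set.isSubsetOf` avail
+       payload (Gated _ d)     = d
+\end{verbatim}
+
+Returning the depleted pen alongside the slurped declarations is what
+prevents the same declaration from being slurped twice.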
-Given a list of list of @Unlinked@s, @link@ places the symbols they
-export in the OST, then resolves symbol references in the new code.
-
-The list-of-lists scheme reflects the fact that CM has to handle
-recursive module groups. Each list is a minimal strongly connected
-group. CM guarantees that @link@ can process the outer list left to
-right, so that after each group (inner list) is linked, the linked
-image as a whole is consistent -- there are no unresolved references
-in it. If linking in of a group should fail for some reason, it is
-@link@'s responsibility to not modify OST at all. In other words,
-linking each group is atomic; it either succeeds or fails.
-
-A successful link returns the final OST. Failed links return some
-error message and the OST updated up to but not including the group
-that failed. In either case, the intention is (1) that the linked
-image does not contain any dangling references, and (2) that CM can
-determine by inspecting the resulting OST how much linking succeeded.
-
-CM specifies not only the @Unlinked@s for the home modules, but also
-those for all needed packages. It can examine the module graph (MG)
-which presumably contains @ModSummary@s to determine all package
-modules needed, then look in PCI to discover which packages those
-modules correspond to. The needed @Unlinked@s are those for all
-needed packages {\em plus all indirectly dependent packages}.
-Packages dependencies are also recorded in PCI.
-
-\ToDo{What happens in batch linking, where there isn't a real OST for
- CM to examine?}
-
-@unlink@ is used by CM to remove out-of-date code from the LI prior
-to an upsweep. CM calls @unlink@ in a top-down fashion, specifying
-groups of @Unlinked@s to delete, again in such a manner that LI has
-no dangling references between invokations.
-
-CM may call @unlink@ repeatedly in order to reduce the LI to what it
-wants. By contrast, CM promises to call @link@ only when it has
-successfully compiled the root module. This is so that @link@ doesn't
-have to do incremental linking, which is important when working with
-system linkers in batch mode. In batch mode, @unlink@ does nothing,
-and @link@ just invokes the system linker. Presumably CM must
-insert package @Unlinked@s in the list-of-lists in such a way as to
-ensure that they can be correctly processed in a single left-to-right
-pass idiomatic of Unix linkers.
-
-\ToDo{Be more specific about how OST is organised -- how does @unlink@
- know which entries came from which @Linkable@s ?}
-
-
-\subsection{What CM does}
-Pretty much as before.
-
-Plus: detect module cycles during the downsweep. During the upsweep,
-ensure that compilation failures for modules in cycles do not leave
-any of the global structures in an inconsistent state.
+
+%%-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --%%
+\subsection{The linker (\mbox{\tt link})}
+\label{sec:linker}
+
+\subsubsection{Data structures owned by the linker}
+
+In the same way that @compile@ has a persistent compiler state (PCS),
+the linker has a persistent (session-lifetime) state, PLS, the
+Linker's Persistent State. In batch mode PLS is entirely irrelevant,
+because there is only a single link step, and can be a unit value
+ignored by everybody. In interactive mode PLS is composed of the
+following three parts:
+
\begin{itemize}
\item
- For PCS, that's never a problem because PCS doesn't hold any
- information pertaining to home modules.
-\item
- HST and HIT: CM knows that these are mappings from @Module@ to
- whatever, and can throw away entries from failed cycles, or,
- equivalently, not commit updates to them until cycles succeed,
- remembering of course to synthesise appropriate HSTs during
- compilation of a cycle.
-\item
- UI -- a collection of @Linkable@s, between which there are no
- direct refererences, so CM can remove additions from failed cycles
- with no difficulty.
-\item
- OST -- linking is not carried out until the upsweep has
- succeeded, so there's no problem here.
+\textbf{The Source Symbol Table (SST)}@ :: FiniteMap RdrName HValue@
+ The source symbol table is used when linking interpreted code.
+ Unlinked interpreted code consists of an STG tree where
+ the leaves are @RdrName@s. The linker's job is to resolve these to
+ actual addresses (the alternative is to resolve these lazily when
+ the code is run, but this requires passing the full symbol table
+ through the interpreter and the repeated lookups will probably be
+ expensive).
+
+ The source symbol table therefore maps @RdrName@s to @HValue@s, for
+ every @RdrName@ that currently \emph{has} an @HValue@, including all
+ exported functions from object code modules that are currently
+ linked in. Linking therefore turns a @StgTree RdrName@ into an
+ @StgTree HValue@.
+
+ It is important that we can prune this symbol table by throwing away
+ the mappings for an entire module, whenever we recompile/relink a
+ given module. The representation is therefore probably a two-level
+ mapping, from module names, to function/constructor names, to
+ @HValue@s.
+
+\item \textbf{The Object Symbol Table (OST)}@ :: FiniteMap String Addr@
+ This is a lower level symbol table, mapping symbol names in object
+ modules to their addresses in memory. It is used only when
+ resolving the external references in an object module, and contains
+ only entries that are defined in object modules.
+
+ Why have two symbol tables? Well, there is a clear distinction
+ between the two: the source symbol table maps Haskell symbols to
+ Haskell values, and the object symbol table maps object symbols to
+ addresses. There is some overlap, in that Haskell symbols certainly
+ have addresses, and we could look up a Haskell symbol's address by
+ manufacturing the right object symbol and looking that up in the
+ object symbol table, but this is likely to be slow and would force
+ us to extend the object symbol table with all the symbols
+ ``exported'' by interpreted code. Doing it this way enables us to
+ decouple the object management subsystem from the rest of the linker
+ with a minimal interface; something like
+
+ \begin{verbatim}
+ loadObject :: Unlinked -> IO Object
+ unloadModule :: Unlinked -> IO ()
+ lookupSymbol :: String -> IO Addr
+ \end{verbatim}
+
+ Rather unfortunately we need @lookupSymbol@ in order to populate the
+ source symbol table when linking in a new compiled module. Our
+ object management subsystem is currently written in C, so decoupling
+ this interface as much as possible is highly desirable.
+
+\item
+ {\bf Linked Image (LI)} @:: no-explicit-representation@
+
+ LI isn't explicitly represented in the system, but we record it
+ here for completeness anyway. LI is the current set of
+ linked-together module, package and other library fragments
+ constituting the current executable mass. LI comprises:
+ \begin{itemize}
+ \item Machine code (@.o@, @.a@, @.DLL@ file images) in memory.
+ These are loaded from disk when needed, and stored in
+ @malloc@ville. To simplify storage management, they are
+ never freed or reused. When no longer needed,
+ they are simply abandoned. New linkings of the same object
+ code produce new copies in memory. We hope this is not
+ too much of a space leak.
+ \item STG trees, which live in the GHCI heap and are managed by the
+ storage manager in the usual way. They are held alive (are
+ reachable) via the @HValue@s in the SST. Such @HValue@s are
+ applications of the interpreter function to the trees
+ themselves. Linking a tree comprises travelling over the
+ tree, replacing all the @Id@s with pointers directly to the
+ relevant @_closure@ labels, as determined by searching the
+ OST. Once the leaves are linked, trees are wrapped with the
+ interpreter function. The resulting @HValue@s then behave
+ indistinguishably from compiled versions of the same code.
+ \end{itemize}
+ Because object code is outside the heap and never deallocated,
+ whilst interpreted code is held alive via the SST, there's no need
+ to have a data structure which ``is'' the linked image.
+
+ For batch compilation, LI doesn't exist because OST doesn't exist,
+ and because @link@ doesn't load code into memory, instead just
+ invokes the system linker.
+
+ \ToDo{Do we need to say anything about CAFs and SRTs? Probably ...}
+\end{itemize}
+As with PCS, CM has no way to create an initial PLS, so we supply
+@emptyPLS@ for that purpose.
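+
+The two-level representation suggested for the source symbol table
+can be sketched as follows. This is an illustration under stated
+assumptions, not a committed design: it uses @Data.Map@ rather than
+@FiniteMap@, keys both levels by plain strings, and is polymorphic in
+the value type so that no real @HValue@ is needed. The names
+@addModule@, @pruneModule@ and @lookupSST@ are hypothetical.
+
+\begin{verbatim}
+   import qualified Data.Map as Map
+
+   type ModName = String   -- assumption: module name
+   type Ident   = String   -- assumption: function/constructor name
+
+   -- Module name -> (function/constructor name -> value)
+   type SST v = Map.Map ModName (Map.Map Ident v)
+
+   emptySST :: SST v
+   emptySST = Map.empty
+
+   -- Add (or replace) all the bindings contributed by one module.
+   addModule :: ModName -> [(Ident, v)] -> SST v -> SST v
+   addModule m binds = Map.insert m (Map.fromList binds)
+
+   -- Throw away every mapping belonging to one module.
+   pruneModule :: ModName -> SST v -> SST v
+   pruneModule = Map.delete
+
+   -- Resolve a name within a given module.
+   lookupSST :: ModName -> Ident -> SST v -> Maybe v
+   lookupSST m x sst = Map.lookup m sst >>= Map.lookup x
+\end{verbatim}
+
+With this shape, pruning a recompiled module is a single deletion at
+the outer level, at the cost of two lookups per resolved name.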
+
+\subsubsection{The linker's interface}
+
+In practice, the PLS might be hidden in the I/O monad rather
+than passed around explicitly. (The same might be true for PCS).
+Anyway:
+
+\begin{verbatim}
+ data PLS -- as described above; opaque to everybody except the linker
+
+ link :: PCI -> ??? -> [[Linkable]] -> PLS -> IO LinkResult
+
+ data LinkResult = LinkOK PLS
+ | LinkErrs PLS [SDoc]
+
+ emptyPLS :: IO PLS -- since CM has no other way to make one
+\end{verbatim}
+
+CM uses @link@ as follows:
+
+After repeatedly using @compile@ to compile all modules which are
+out-of-date, @link@ is invoked. The @[[Linkable]]@ argument to
+@link@ represents the list of (recursive groups of) home modules which
+have been newly compiled, along with @Linkable@s for each of
+the packages in use (the compilation manager knows which external
+packages are referenced by the home package). The order of the list
+is important: it is sorted in such a way that linking any prefix of
+the list will result in an image with no unresolved references. Note
+that for batch linking there may be further restrictions; for example
+it may not be possible to link recursive groups containing libraries.
+
+@link@ does the following:
+
+\begin{itemize}
+ \item
+ In batch mode, do nothing. In interactive mode,
+ examine the supplied @[[Linkable]]@ to determine which home
+ module @Unlinked@s are new. Remove precisely these @Linkable@s
+ from PLS. (In fact we really need to remove their upwards
+ transitive closure, but I think it is an invariant that CM will
+ supply an upwards transitive closure of new modules).
+ See below for descriptions of @Linkable@ and @Unlinked@.
+
+ \item
+ Batch system: invoke the external linker to link everything in one go.
+ Interactive: bind the @Unlinked@s for the newly compiled modules,
+ plus those for any newly required packages, into PLS.
+
+ Note that it is the linker's responsibility to remember which
+ objects and packages have already been linked. By comparing this
+ with the @Linkable@s supplied to @link@, it can determine which
+ of the linkables in LI are out of date.
\end{itemize}
-Plus: clear out the global data structures after the downsweep but
-before the upsweep.
+If linking of a group should fail for some reason, @link@ should
+not modify its PLS at all. In other words, linking each group
+is atomic; it either succeeds or fails.
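+
+The per-group atomicity can be sketched as below. This is a pure
+model under assumptions: the real @link@ runs in the I/O monad, and
+the step function @linkOne@ and the type parameters are hypothetical.
+The sketch also assumes that, on failure, the state handed back is
+the one in force before the failing group was attempted.
+
+\begin{verbatim}
+   -- Generalised over the state s and the error type e.
+   data LinkResult s e = LinkOK s
+                       | LinkErrs s [e]
+
+   -- Link groups left to right.  A failing group leaves the state
+   -- exactly as it was before that group was attempted.
+   linkGroups :: (g -> s -> Either [e] s) -> [g] -> s -> LinkResult s e
+   linkGroups linkOne = go
+     where
+       go []     s = LinkOK s
+       go (g:gs) s = case linkOne g s of
+                       Left errs -> LinkErrs s errs
+                       Right s'  -> go gs s'
+\end{verbatim}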
+
+\subsubsection*{\mbox{\tt Unlinked} and \mbox{\tt Linkable}}
+
+There are two important types, @Unlinked@ and @Linkable@. The latter is a
+higher-level representation comprising several of the former.
+An @Unlinked@ is a reference to unlinked executable code, something
+a linker could take as input:
+
+\begin{verbatim}
+ data Unlinked = DotO Path
+ | DotA Path
+ | DotDLL Path
+ | Trees [StgTree RdrName]
+\end{verbatim}
+
+The first three describe the location of a file (presumably)
+containing the code to link. @Trees@, which only exists in
+interactive mode, gives a list of @StgTree@s, in which the unresolved
+references are @RdrName@s -- hence its non-linkedness. Once linked,
+those @RdrName@s are replaced with pointers to the machine code
+implementing them.
+
+A @Linkable@ gathers together several @Unlinked@s and associates them
+with either a module or package:
+
+\begin{verbatim}
+ data Linkable = LM Module [Unlinked] -- a module
+ | LP PkgName -- a package
+\end{verbatim}
+
+The order of the @Unlinked@s in the list is important, as
+they are linked in left-to-right order. The @Unlinked@ objects for a
+particular package can be obtained from the package configuration (see
+section~\ref{sec:staticinfo}).
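+
+As a sketch of how the ordering constraints compose, the following
+flattens a @[[Linkable]]@ into the left-to-right sequence of
+@Unlinked@s a linker would process. It is illustrative only: the
+@Trees@ constructor is omitted, @Path@, @Module@ and @PkgName@ are
+modelled as strings, and the package lookup function @pkgObjs@ is a
+hypothetical stand-in for consulting the package configuration.
+
+\begin{verbatim}
+   type Path    = FilePath
+   type Module  = String     -- assumption
+   type PkgName = String     -- assumption
+
+   data Unlinked = DotO Path | DotA Path | DotDLL Path
+     deriving Show
+
+   data Linkable = LM Module [Unlinked]   -- a module
+                 | LP PkgName             -- a package
+
+   -- The Unlinkeds of one Linkable, in link order.
+   unlinkedsOf :: (PkgName -> [Unlinked]) -> Linkable -> [Unlinked]
+   unlinkedsOf _       (LM _ us) = us
+   unlinkedsOf pkgObjs (LP p)    = pkgObjs p
+
+   -- Preserve the order of groups, of Linkables within a group,
+   -- and of Unlinkeds within a Linkable.
+   linkOrder :: (PkgName -> [Unlinked]) -> [[Linkable]] -> [Unlinked]
+   linkOrder pkgObjs = concatMap (concatMap (unlinkedsOf pkgObjs))
+\end{verbatim}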
-\ToDo{CM needs to supply a way for @compile@ to know which modules in
- HST are in its downwards closure, and which not, so it can
- correctly construct its instance environment.}
+\ToDo{When adding @Addr@s from an object module to SST, we need to
+ somehow find out the @RdrName@s of the symbols exported by that
+ module.
+ So we'd need to pass in the @ModDetails@ or @ModIFace@ or some such?}