docs/ghci/ghci.tex

   1 %
   2 % (c) The OBFUSCATION-THROUGH-GRATUITOUS-PREPROCESSOR-ABUSE Project,
   3 %     Glasgow University, 1990-2000
   4 %
   5
   6 % \documentstyle[preprint]{acmconf}
   7 \documentclass[11pt]{article}
   8 \oddsidemargin 0.1 in       %   Note that \oddsidemargin = \evensidemargin
   9 \evensidemargin 0.1 in
  10 \marginparwidth 0.85in    %   Narrow margins require narrower marginal notes
  11 \marginparsep 0 in
  12 \sloppy
  13
  14 %\usepackage{epsfig}
  15 \usepackage{shortvrb}
  16 \MakeShortVerb{\@}
  17
  18 %\newcommand{\note}[1]{{\em Note: #1}}
  19 \newcommand{\note}[1]{{{\bf Note:}\sl #1}}
  20 \newcommand{\ToDo}[1]{{{\bf ToDo:}\sl #1}}
  21 \newcommand{\Arg}[1]{\mbox{${\tt arg}_{#1}$}}
  22 \newcommand{\bottom}{\perp}
  23
  24 \newcommand{\secref}[1]{Section~\ref{sec:#1}}
  25 \newcommand{\figref}[1]{Figure~\ref{fig:#1}}
  26 \newcommand{\Section}[2]{\section{#1}\label{sec:#2}}
  27 \newcommand{\Subsection}[2]{\subsection{#1}\label{sec:#2}}
  28 \newcommand{\Subsubsection}[2]{\subsubsection{#1}\label{sec:#2}}
  29
  30 % DIMENSION OF TEXT:
  31 \textheight 8.5 in
  32 \textwidth 6.25 in
  33
  34 \topmargin 0 in
  35 \headheight 0 in
  36 \headsep .25 in
  37
  38
  39 \setlength{\parskip}{0.15cm}
  40 \setlength{\parsep}{0.15cm}
  41 \setlength{\topsep}{0cm}        % Reduces space before and after verbatim,
  42                                 % which is implemented using trivlist
  43 \setlength{\parindent}{0cm}
  44
  45 \renewcommand{\textfraction}{0.2}
  46 \renewcommand{\floatpagefraction}{0.7}
  47
  48 \begin{document}
  49
  50 \title{The GHCi Draft Design, round 2}
  51 \author{MSR Cambridge Haskell Crew \\
  52         Microsoft Research Ltd., Cambridge}
  53
  54 \maketitle
  55
  56 %%%\tableofcontents
  57 %%%\newpage
  58
  59 %%-----------------------------------------------------------------%%
  60 \section{Details}
  61
  62 \subsection{Outline of the design}
  63 \label{sec:details-intro}
  64
  65 The design falls into three major parts:
  66 \begin{itemize}
  67 \item The compilation manager (CM), which coordinates the
  68       system and supplies a HEP-like interface to clients.
  69 \item The module compiler (@compile@), which translates individual
  70       modules to interpretable or machine code.
  71 \item The linker (@link@),
  72       which maintains the executable image in interpreted mode.
  73 \end{itemize}
  74
  75 There are also three auxiliary parts: the finder, which locates
  76 source, object and interface files, the summariser, which quickly
  77 finds dependency information for modules, and the static info
  78 (compiler flags and package details), which is unchanged over the
  79 course of a session.
  80
  81 This section continues with an overview of the session-lifetime data
  82 structures.  Then follows the finder (section~\ref{sec:finder}),
  83 summariser (section~\ref{sec:summariser}),
  84 static info (section~\ref{sec:staticinfo}),
  85 and finally the three big sections
  86 (\ref{sec:manager},~\ref{sec:compiler},~\ref{sec:linker})
  87 on the compilation manager, compiler and linker respectively.
  88
  89 \subsubsection*{Some terminology}
  90
  91 Lifetimes: the phrase {\bf session lifetime} covers a complete run of
  92 GHCI, encompassing multiple recompilation runs.  {\bf Module lifetime}
  93 is a lot shorter, being that of data needed to translate a single
  94 module, but then discarded, for example Core, AbstractC, Stix trees.
  95
  96 Data structures with module lifetime are well documented and understood.
  97 This document is mostly concerned with session-lifetime data.
  98 Most of these structures are ``owned'' by CM, since that's
  99 the only major component of GHCI which deals with session-lifetime
 100 issues.
 101
 102 Modules and packages: {\bf home} refers to modules in this package,
 103 precisely the ones tracked and updated by the compilation manager.
 104 {\bf Package} refers to all other packages, which are assumed static.
 105
 106 \subsubsection*{A summary of all session-lifetime data structures}
 107
 108 These structures have session lifetime but not necessarily global
 109 visibility.  Subsequent sections elaborate who can see what.
 110 \begin{itemize}
 111 \item {\bf Home Symbol Table (HST)} (owner: CM) holds the post-renaming
 112       environments created by compiling each home module.
 113 \item {\bf Home Interface Table (HIT)} (owner: CM) holds in-memory
 114       representations of the interface file created by compiling
 115       each home module.
 116 \item {\bf Unlinked Images (UI)} (owner: CM) are executable but as-yet
 117       unlinked translations of home modules only.
 118 \item {\bf Module Graph (MG)} (owner: CM) is the current module graph.
 119 \item {\bf Static Info (SI)} (owner: CM) is the package configuration
 120       information (PCI) and compiler flags (FLAGS).
 121 \item {\bf Persistent Compiler State (PCS)} (owner: @compile@)
 122       is @compile@'s private cache of information about package
 123       modules.
 124 \item {\bf Persistent Linker State (PLS)} (owner: @link@) is
 125       @link@'s private information concerning the the current
 126       state of the (in-memory) executable image.
 127 \end{itemize}
 128
 129
 130 %%-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --%%
 131 \subsection{The finder (\mbox{\tt type Finder})}
 132 \label{sec:finder}
 133
 134 @Path@ could be an indication of a location in a filesystem, or it
 135 could be some more generic kind of resource identifier, a URL for
 136 example.
 137 \begin{verbatim}
 138    data Path = ...
 139 \end{verbatim}
 140
 141 And some names.  @Module@s are now used as primary keys for various
 142 maps, so they are given a @Unique@.
 143 \begin{verbatim}
 144    type ModName = String      -- a module name
 145    type PkgName = String      -- a package name
 146    type Module  = -- contains ModName and a Unique, at least
 147 \end{verbatim}
 148
 149 A @ModLocation@ says where a module is, what it's called and in what
 150 form it is.
 151 \begin{verbatim}
 152    data ModLocation = SourceOnly Module Path         -- .hs
 153                     | ObjectCode Module Path Path    -- .o, .hi
 154                     | InPackage  Module PkgName
 155                           -- examine PCI to determine package Path
 156 \end{verbatim}
 157
 158 The module finder generates @ModLocation@s from @ModName@s.  We expect
 159 it will assume packages to be static, but we want to be able to track
 160 changes in home modules during the session.  Specifically, we want to
 161 be able to notice that a module's object and interface have been
 162 updated, presumably by a compile run outside of the GHCI session.
 163 Hence the two-stage type:
 164 \begin{verbatim}
 165    type Finder = ModName -> IO ModLocation
 166    newFinder :: PCI -> IO Finder
 167 \end{verbatim}
 168 @newFinder@ examines the package information right at the start, but
 169 returns an @IO@-typed function which can inspect home module changes
 170 later in the session.
 171
 172
 173 %%-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --%%
 174 \subsection{The summariser (\mbox{\tt summarise})}
 175 \label{sec:summariser}
 176
 177 A @ModSummary@ records the minimum information needed to establish the
 178 module graph and determine whose source has changed.  @ModSummary@s
 179 can be created quickly.
 180 \begin{verbatim}
 181    data ModSummary = ModSummary
 182                         ModLocation   -- location and kind
 183                         (Maybe (String, Fingerprint))
 184                                       -- source and fingerprint if .hs
 185                         (Maybe [ModName])     -- imports if .hs or .hi
 186
 187    type Fingerprint = ...  -- file timestamp, or source checksum?
 188
 189    summarise :: ModLocation -> IO ModSummary
 190 \end{verbatim}
 191
 192 The summary contains the location and source text, and the location
 193 contains the name.  We would like to remove the assumption that
 194 sources live on disk, but I'm not sure this is good enough yet.
 195
 196 \ToDo{Should @ModSummary@ contain source text for interface files too?}
 197 \ToDo{Also say that @ModIFace@ contains its module's @ModSummary@  (why?).}
 198
 199
 200 %%-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --%%
 201 \subsection{Static information (SI)}
 202 \label{sec:staticinfo}
 203
 204 PCI, the package configuration information, is a list of @PkgInfo@,
 205 each containing at least the following:
 206 \begin{verbatim}
 207    data PkgInfo
 208       = PkgInfo PkgName    -- my name
 209                 Path       -- path to my base location
 210                 [PkgName]  -- who I depend on
 211                 [ModName]  -- modules I supply
 212                 [Unlinked] -- paths to my object files
 213
 214    type PCI = [PkgInfo]
 215 \end{verbatim}
 216 The @Path@s in it, including those in the @Unlinked@s, are set up
 217 when GHCI starts.
 218
 219 FLAGS is a bunch of compiler options.  We haven't figured out yet how
 220 to partition them into those for the whole session vs those for
 221 specific source files, so currently the best we can do is:
 222 \begin{verbatim}
 223    data FLAGS = ...
 224 \end{verbatim}
 225
 226 The static information (SI) is the both of these:
 227 \begin{verbatim}
 228    data SI = SI PCI
 229                 FLAGS
 230 \end{verbatim}
 231
 232
 233
 234 %%-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --%%
 235 \subsection{The Compilation Manager (CM)}
 236 \label{sec:manager}
 237
 238 \subsubsection{Data structures owned by CM}
 239
 240 CM maintains two maps (HST, HIT) and a set (UI).  It's important to
 241 realise that CM only knows about the map/set-ness, and has no idea
 242 what a @ModDetails@, @ModIFace@ or @Linkable@ is.  Only @compile@ and
 243 @link@ know that, and CM passes these types around without
 244 inspecting them.
 245
 246 \begin{itemize}
 247 \item
 248    {\bf Home Symbol Table (HST)} @:: FiniteMap Module ModDetails@
 249
 250    The @ModDetails@ (a couple of layers down) contain tycons, classes,
 251    instances, etc, collectively known as ``entities''.  Referrals from
 252    other modules to these entities is direct, with no intervening
 253    indirections of any kind; conversely, these entities refer directly
 254    to other entities, regardless of module boundaries.  HST only holds
 255    information for home modules; the corresponding wired-up details
 256    for package (non-home) modules are created on demand in the package
 257    symbol table (PST) inside the persistent compiler's state (PCS).
 258
 259    CM maintains the HST, which is passed to, but not modified by,
 260    @compile@.  If compilation of a module is successful, @compile@
 261    returns the resulting @ModDetails@ (inside the @CompResult@) which
 262    CM then adds to HST.
 263
 264    CM throws away arbitrarily large parts of HST at the start of a
 265    rebuild, and uses @compile@ to incrementally reconstruct it.
 266
 267 \item
 268    {\bf Home Interface Table (HIT)} @:: FiniteMap Module ModIFace@
 269
 270    (Completely private to CM; nobody else sees this).
 271
 272    Compilation of a module always creates a @ModIFace@, which contains
 273    the unlinked symbol table entries.  CM maintains this @FiniteMap@
 274    @ModName@ @ModIFace@, with session lifetime.  CM never throws away
 275    @ModIFace@s, but it does update them, by passing old ones to
 276    @compile@ if they exist, and getting new ones back.
 277
 278    CM acquires @ModuleIFace@s from @compile@, which it only applies
 279    to modules in the home package.  As a result, HIT only contains
 280    @ModuleIFace@s for modules in the home package.  Those from other
 281    packages reside in the package interface table (PIT) which is a
 282    component of PCS.
 283
 284 \item
 285    {\bf Unlinked Images (UI)} @:: Set Linkable@
 286
 287    The @Linkable@s in UI represent executable but as-yet unlinked
 288    module translations.  A @Linkable@ can contain the name of an
 289    object, archive or DLL file.  In interactive mode, it may also be
 290    the STG trees derived from translating a module.  So @compile@
 291    returns a @Linkable@ from each successful run, namely that of
 292    translating the module at hand.
 293
 294    At link-time, CM supplies @Linkable@s for the upwards closure of
 295    all packages which have changed, to @link@.  It also examines the
 296    @ModSummary@s for all home modules, and by examining their imports
 297    and the SI.PCI (package configuration info) it can determine the
 298    @Linkable@s from all required imported packages too.
 299
 300    @Linkable@s and @ModIFace@s have a close relationship.  Each
 301    translated module has a corresponding @Linkable@ somewhere.
 302    However, there may be @Linkable@s with no corresponding modules
 303    (the RTS, for example).  Conversely, multiple modules may share a
 304    single @Linkable@ -- as is the case for any module from a
 305    multi-module package.  For these reasons it seems appropriate to
 306    keep the two concepts distinct.  @Linkable@s also provide
 307    information about the sequence in which individual package
 308    components should be linked, and that isn't the business of any
 309    specific module to know.
 310
 311    CM passes @compile@ a module's old @ModIFace@, if it has one, in
 312    the hope that the module won't need recompiling.  If so, @compile@
 313    can just return the new @ModDetails@ created from it, and CM will
 314    re-use the old @ModIFace@.  If the module {\em is} recompiled (or
 315    scheduled to be loaded from disk), @compile@ returns both the
 316    new @ModIFace@ and new @Linkable@.
 317
 318 \item
 319    {\bf Module Graph (MG)} @:: known-only-to-CM@
 320
 321    Records, for CM's purposes, the current module graph,
 322    up-to-dateness and summaries.  More details when I get to them.
 323    Only contains home modules.
 324 \end{itemize}
 325 Probably all this stuff is rolled together into the Persistent CM
 326 State (PCMS):
 327 \begin{verbatim}
 328   data PCMS = PCMS HST HIT UI MG
 329   emptyPCMS :: IO PCMS
 330 \end{verbatim}
 331
 332 \subsubsection{What CM implements}
 333 It pretty much implements the HEP interface.  First, though, define a
 334 containing structure for the state of the entire CM system and its
 335 subsystems @compile@ and @link@:
 336 \begin{verbatim}
 337    data CmState
 338       = CmState PCMS      -- CM's stuff
 339                 PCS       -- compile's stuff
 340                 PLS       -- link's stuff
 341                 SI        -- the static info, never changes
 342                 Finder    -- the finder
 343 \end{verbatim}
 344
 345 The @CmState@ is threaded through the HEP interface.  In reality
 346 this might be done using @IORef@s, but for clarity:
 347 \begin{verbatim}
 348   type ModHandle = ... (opaque to CM/HEP clients) ...
 349   type HValue    = ... (opaque to CM/HEP clients) ...
 350
 351   cmInit       :: FLAGS
 352                -> [PkgInfo]
 353                -> IO CmState
 354
 355   cmLoadModule :: CmState
 356                -> ModName
 357                -> IO (CmState, Either [SDoc] ModHandle)
 358
 359   cmGetExpr    :: ModHandle
 360                -> CmState
 361                -> String -> IO (CmState, Either [SDoc] HValue)
 362
 363   cmRunExpr    :: HValue -> IO ()   -- don't need CmState here
 364 \end{verbatim}
 365 Almost all the huff and puff in this document pertains to @cmLoadModule@.
 366
 367
 368 \subsubsection{Implementing \mbox{\tt cmInit}}
 369 @cmInit@ creates an empty @CmState@ using @emptyPCMS@, @emptyPCS@,
 370 @emptyPLS@, making SI from the supplied flags and package info, and
 371 by supplying the package info the @newFinder@.
 372
 373
 374 \subsubsection{Implementing \mbox{\tt cmLoadModule}}
 375
 376 \begin{enumerate}
 377 \item {\bf Downsweep:} using @finder@ and @summarise@, chase from
 378       the given module to
 379       establish the new home module graph (MG).  Do not chase into
 380       package modules.
 381 \item Remove from HIT, HST, UI any modules in the old MG which are
 382       not in the new one.  The old MG is then replaced by the new one.
 383 \item Topologically sort MG to generate a bottom-to-top traversal
 384       order, giving a worklist.
 385 \item {\bf Upsweep:} call @compile@ on each module in the worklist in
 386       turn, passing it
 387       the ``correct'' HST, PCS, the old @ModIFace@ if
 388       available, and the summary.  ``Correct'' HST in the sense that
 389       HST contains only the modules in the this module's downward
 390       closure, so that @compile@ can construct the correct instance
 391       and rule environments simply as the union of those in
 392       the module's downward closure.
 393
 394       If @compile@ doesn't return a new interface/linkable pair,
 395       compilation wasn't necessary.  Either way, update HST with
 396       the new @ModDetails@, and UI and HIT respectively if a
 397       compilation {\em did} occur.
 398
 399       Keep going until the root module is successfully done, or
 400       compilation fails.
 401
 402 \item If the previous step terminated because compilation failed,
 403       define the successful set as those modules in successfully
 404       completed SCCs, i.e. all @Linkable@s returned by @compile@ excluding
 405       those from modules in any cycle which includes the module which failed.
 406       Remove from HST, HIT, UI and MG all modules mentioned in MG which
 407       are not in the successful set.  Call @link@ with the successful
 408       set,
 409       which should succeed.  The net effect is to back off to a point
 410       in which those modules which are still aboard are correctly
 411       compiled and linked.
 412
 413       If the previous step terminated successfully,
 414       call @link@ passing it the @Linkable@s in the upward closure of
 415       all those modules for which @compile@ produced a new @Linkable@.
 416 \end{enumerate}
 417 As a small optimisation, do this:
 418 \begin{enumerate}
 419 \item[3a.] Remove from the worklist any module M where M's source
 420      hasn't changed and neither has the source of any module in M's
 421      downward closure.  This has the effect of not starting the upsweep
 422      right at the bottom of the graph when that's not needed.
 423      Source-change checking can be done quickly by CM by comparing
 424      summaries of modules in MG against corresponding
 425      summaries from the old MG.
 426 \end{enumerate}
 427
 428
 429 %%-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --%%
 430 \subsection{The compiler (\mbox{\tt compile})}
 431 \label{sec:compiler}
 432
 433 \subsubsection{Data structures owned by \mbox{\tt compile}}
 434
 435 {\bf Persistent Compiler State (PCS)} @:: known-only-to-compile@
 436
 437 This contains info about foreign packages only, acting as a cache,
 438 which is private to @compile@.  The cache never becomes out of
 439 date.  There are three parts to it:
 440
 441    \begin{itemize}
 442    \item
 443       {\bf Package Interface Table (PIT)} @:: FiniteMap Module ModIFace@
 444
 445    @compile@ reads interfaces from modules in foreign packages, and
 446    caches them in the PIT.  Subsequent imports of the same module get
 447    them directly out of the PIT, avoiding slow lexing/parsing phases.
 448    Because foreign packages are assumed never to become out of date,
 449    all contents of PIT remain valid forever.  @compile@ of course
 450    tries to find package interfaces in PIT in preference to reading
 451    them from files.
 452
 453    Both successful and failed runs of @compile@ can add arbitrary
 454    numbers of new interfaces to the PIT.  The failed runs don't matter
 455    because we assume that packages are static, so the data cached even
 456    by a failed run is valid forever (ie for the rest of the session).
 457
 458    \item
 459       {\bf Package Symbol Table (PST)} @:: FiniteMap Module ModDetails@
 460
 461    Adding an package interface to PIT doesn't make it directly usable
 462    to @compile@, because it first needs to be wired (renamed +
 463    typechecked) into the sphagetti of the HST.  On the other hand,
 464    most modules only use a few entities from any imported interface,
 465    so wiring-in the interface at PIT-entry time might be a big time
 466    waster.  Also, wiring in an interface could mean reading other
 467    interfaces, and we don't want to do that unnecessarily.
 468
 469    The PST avoids these problems by allowing incremental wiring-in to
 470    happen.  Pieces of foreign interfaces are copied out of the holding
 471    pen (HP), renamed, typechecked, and placed in the PST, but only as
 472    @compile@ discovers it needs them.  In the process of incremental
 473    renaming/typechecking, @compile@ may need to read more package
 474    interfaces, which are added to the PIT and hence to
 475    HP.~\ToDo{How? When?}
 476
 477    CM passes the PST to @compile@ and is returned an updated version
 478    on both success and failure.
 479
 480    \item
 481       {\bf Holding Pen (HP)} @:: HoldingPen@
 482
 483    HP holds parsed but not-yet renamed-or-typechecked fragments of
 484    package interfaces.  As typechecking of other modules progresses,
 485    fragments are removed (``slurped'') from HP, renamed and
 486    typechecked, and placed in PCS.PST (see above).  Slurping a
 487    fragment may require new interfaces to be read into HP.  The hope
 488    is, though, that many fragments will never get slurped, reducing
 489    the total number of interfaces read (as compared to eager slurping).
 490
 491    \end{itemize}
 492
 493    PCS is opaque to CM; only @compile@ knows what's in it, and how to
 494    update it.  Because packages are assumed static, PCS never becomes
 495    out of date.  So CM only needs to be able to create an empty PCS,
 496    with @emptyPCS@, and thence just passes it through @compile@ with
 497    no further ado.
 498
 499    In return, @compile@ must promise not to store in PCS any
 500    information pertaining to the home modules.  If it did so, CM would
 501    need to have a way to remove this information prior to commencing a
 502    rebuild, which conflicts with PCS's opaqueness to CM.
 503
 504
 505
 506
 507 \subsubsection{What {\tt compile} does}
 508 @compile@ is necessarily somewhat complex.  We've decided to do away
 509 with private global variables -- they make the design specification
 510 less clear, although the implementation might use them.  Without
 511 further ado:
 512 \begin{verbatim}
 513    compile :: SI          -- obvious
 514            -> Finder      -- to find modules
 515            -> ModSummary  -- summary, including source
 516            -> Maybe ModIFace
 517                           -- former summary, if avail
 518            -> HST         -- for home module ModDetails
 519            -> PCS         -- IN: the persistent compiler state
 520
 521            -> IO CompResult
 522
 523    data CompResult
 524       = CompOK  ModDetails   -- new details (== HST additions)
 525                 (Maybe (ModIFace, Linkable))
 526                              -- summary and code; Nothing => compilation
 527                              -- not needed (old summary and code are still valid)
 528                 PCS          -- updated PCS
 529                 [SDoc]       -- warnings
 530
 531       | CompErrs PCS         -- updated PCS
 532                  [SDoc]      -- warnings and errors
 533
 534    data PCS
 535       = MkPCS PIT         -- package interfaces
 536               PST         -- post slurping global symtab contribs
 537               HoldingPen  -- pre slurping interface bits and pieces
 538
 539    emptyPCS :: IO PCS     -- since CM has no other way to make one
 540 \end{verbatim}
 541 Although @compile@ is passed three of the global structures (FLAGS,
 542 HST and PCS), it only modifies PCS.  The rest are modified by CM as it
 543 sees fit, from the stuff returned in the @CompResult@.
 544
 545 @compile@ is allowed to return an updated PCS even if compilation
 546 errors occur, since the information in it pertains only to foreign
 547 packages and is assumed to be always-correct.
 548
 549 What @compile@ does: \ToDo{A bit vague ... needs refining.  How does
 550                            @finder@ come into the game?}
 551 \begin{itemize}
 552 \item Figure out if this module needs recompilation.
 553    \begin{itemize}
 554    \item If there's no old @ModIFace@, it does.  Else:
 555    \item Compare the @ModSummary@ supplied with that in the
 556          old @ModIFace@.  If the source has changed, recompilation
 557          is needed.  Else:
 558    \item Compare the usage version numbers in the old @ModIFace@ with
 559          those in the imported @ModIFace@s.  All needed interfaces
 560          for this should be in either HIT or PIT.  If any version
 561          numbers differ, recompilation is needed.
 562    \item Otherwise it isn't needed.
 563    \end{itemize}
 564
 565 \item
 566    If recompilation is not needed, create a new @ModDetails@ from the
 567    old @ModIFace@, looking up information in HST and PCS.PST as
 568    necessary.  Return the new details, a @Nothing@ denoting
 569    compilation was not needed, the PCS \ToDo{I don't think the PCS
 570    should be updated, but who knows?}, and an empty warning list.
 571
 572 \item
 573    Otherwise, compilation is needed.
 574
 575    If the module is only available in object+interface form, read the
 576    interface, make up details, create a linkable pointing at the
 577    object code.  \ToDo{Does this involve reading any more interfaces?  Does
 578    it involve updating PST?}
 579
 580    Otherwise, translate from source, then create and return: an
 581    details, interface, linkable, updated PST, and warnings.
 582
 583    When looking for a new interface, search HST, then PCS.PIT, and only
 584    then read from disk.  In which case add the new interface(s) to
 585    PCS.PIT.
 586
 587    \ToDo{If compiling a module with a boot-interface file, check the
 588    boot interface against the inferred interface.}
 589 \end{itemize}
 590
 591
 592 \subsubsection{Contents of \mbox{\tt ModDetails},
 593                \mbox{\tt ModIFace} and \mbox{\tt HoldingPen}}
 594 Only @compile@ can see inside these three types -- they are opaque to
 595 everyone else.  @ModDetails@ holds the post-renaming,
 596 post-typechecking environment created by compiling a module.
 597
 598 \begin{verbatim}
 599    data ModDetails
 600       = ModDetails {
 601            moduleExports :: Avails
 602            moduleEnv     :: GlobalRdrEnv    -- == FM RdrName [Name]
 603            typeEnv       :: FM Name TyThing -- TyThing is in TcEnv.lhs
 604            instEnv       :: InstEnv
 605            fixityEnv     :: FM Name Fixity
 606            ruleEnv       :: FM Id [Rule]
 607         }
 608 \end{verbatim}
 609
 610 @ModIFace@ is nearly the same as @ParsedIFace@ from @RnMonad.lhs@:
 611 \begin{verbatim}
 612    type ModIFace = ParsedIFace    -- not really, but ...
 613    data ParsedIface
 614       = ParsedIface {
 615            pi_mod       :: Module,                   -- Complete with package info
 616            pi_vers      :: Version,                  -- Module version number
 617            pi_orphan    :: WhetherHasOrphans,        -- Whether this module has orphans
 618            pi_usages    :: [ImportVersion OccName],  -- Usages
 619            pi_exports   :: [ExportItem],             -- Exports
 620            pi_insts     :: [RdrNameInstDecl],        -- Local instance declarations
 621            pi_decls     :: [(Version, RdrNameHsDecl)],    -- Local definitions
 622            pi_fixity    :: (Version, [RdrNameFixitySig]), -- Local fixity declarations,
 623                                                           -- with their version
 624            pi_rules     :: (Version, [RdrNameRuleDecl]),  -- Rules, with their version
 625            pi_deprecs   :: [RdrNameDeprecation]           -- Deprecations
 626        }
 627 \end{verbatim}
 628
 629 @HoldingPen@ is a cleaned-up version of that found in @RnMonad.lhs@,
 630 retaining just the 3 pieces actually comprising the holding pen:
 631 \begin{verbatim}
 632    data HoldingPen
 633       = HoldingPen {
 634            iDecls :: DeclsMap,     -- A single, global map of Names to decls
 635
 636            iInsts :: IfaceInsts,
 637            -- The as-yet un-slurped instance decls; this bag is depleted when we
 638            -- slurp an instance decl so that we don't slurp the same one twice.
 639            -- Each is 'gated' by the names that must be available before
 640            -- this instance decl is needed.
 641
 642            iRules :: IfaceRules
 643            -- Similar to instance decls, only for rules
 644         }
 645 \end{verbatim}
 646
 647 %%-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --%%
 648 \subsection{The linker (\mbox{\tt link})}
 649 \label{sec:linker}
 650
 651 \subsubsection{Data structures owned by the linker}
 652
 653 In the same way that @compile@ has a persistent compiler state (PCS),
 654 the linker has a persistent (session-lifetime) state, PLS, the
 655 Linker's Persistent State.  In batch mode PLS is entirely irrelevant,
 656 because there is only a single link step, and can be a unit value
 657 ignored by everybody.  In interactive mode PLS is composed of the
 658 following three parts:
 659
 660 \begin{itemize}
 661 \item
 662 \textbf{The Source Symbol Table (SST)}@ :: FiniteMap RdrName HValue@
 663   The source symbol table is used when linking interpreted code.
 664   Unlinked interpreted code consists of an STG  tree where
 665   the leaves are @RdrNames@.  The linker's job is to resolve these to
 666   actual addresses (the alternative is to resolve these lazily when
 667   the code is run, but this requires passing the full symbol table
 668   through the interpreter and the repeated lookups will probably be
 669   expensive).
 670
 671   The source symbol table therefore maps @RdrName@s to @HValue@s, for
 672   every @RdrName@ that currently \emph{has} an @HValue@, including all
 673   exported functions from object code modules that are currently
 674   linked in.  Linking therefore turns a @StgTree RdrName@ into an
 675   @StgTree HValue@.
 676
 677   It is important that we can prune this symbol table by throwing away
 678   the mappings for an entire module, whenever we recompile/relink a
 679   given module.  The representation is therefore probably a two-level
 680   mapping, from module names, to function/constructor names, to
 681   @HValue@s.
 682
 683 \item \textbf{The Object Symbol Table (OST)}@ :: FiniteMap String Addr@
 684   This is a lower level symbol table, mapping symbol names in object
 685   modules to their addresses in memory.  It is used only when
 686   resolving the external references in an object module, and contains
 687   only entries that are defined in object modules.
 688
 689   Why have two symbol tables?  Well, there is a clear distinction
 690   between the two: the source symbol table maps Haskell symbols to
 691   Haskell values, and the object symbol table maps object symbols to
 692   addresses.  There is some overlap, in that Haskell symbols certainly
 693   have addresses, and we could look up a Haskell symbol's address by
 694   manufacturing the right object symbol and looking that up in the
 695   object symbol table, but this is likely to be slow and would force
 696   us to extend the object symbol table with all the symbols
 697   ``exported'' by interpreted code.  Doing it this way enables us to
 698   decouple the object management subsystem from the rest of the linker
 699   with a minimal interface; something like
 700
 701   \begin{verbatim}
 702   loadObject   :: Unlinked -> IO Object
 703   unloadModule :: Unlinked -> IO ()
 704   lookupSymbol :: String   -> IO Addr
 705   \end{verbatim}
 706
 707   Rather unfortunately we need @lookupSymbol@ in order to populate the
 708   source symbol table when linking in a new compiled module.  Our
 709   object management subsystem is currently written in C, so decoupling
 710   this interface as much as possible is highly desirable.
 711
 712 \item
 713    {\bf Linked Image (LI)} @:: no-explicit-representation@
 714
 715    LI isn't explicitly represented in the system, but we record it
 716    here for completeness anyway.  LI is the current set of
 717    linked-together module, package and other library fragments
 718    constituting the current executable mass.  LI comprises:
 719    \begin{itemize}
 720    \item Machine code (@.o@, @.a@, @.DLL@ file images) in memory.
 721          These are loaded from disk when needed, and stored in
 722          @malloc@ville.  To simplify storage management, they are
 723          never freed or reused, since this creates serious
 724          complications for storage management.  When no longer needed,
 725          they are simply abandoned.  New linkings of the same object
 726          code produces new copies in memory.  We hope this not to be
 727          too much of a space leak.
 728    \item STG trees, which live in the GHCI heap and are managed by the
 729          storage manager in the usual way.  They are held alive (are
 730          reachable) via the @HValue@s in the OST.  Such @HValue@s are
 731          applications of the interpreter function to the trees
 732          themselves.  Linking a tree comprises travelling over the
 733          tree, replacing all the @Id@s with pointers directly to the
 734          relevant @_closure@ labels, as determined by searching the
 735          OST.  Once the leaves are linked, trees are wrapped with the
 736          interpreter function.  The resulting @HValue@s then behave
 737          indistinguishably from compiled versions of the same code.
 738    \end{itemize}
 739    Because object code is outside the heap and never deallocated,
 740    whilst interpreted code is held alive via the HST, there's no need
 741    to have a data structure which ``is'' the linked image.
 742
 743    For batch compilation, LI doesn't exist because OST doesn't exist,
 744    and because @link@ doesn't load code into memory, instead just
 745    invokes the system linker.
 746
 747    \ToDo{Do we need to say anything about CAFs and SRTs?  Probably ...}
 748 \end{itemize}
 749 As with PCS, CM has no way to create an initial PLS, so we supply
 750 @emptyPLS@ for that purpose.
 751
 752 \subsubsection{The linker's interface}
 753
 754 In practice, the PLS might be hidden in the I/O monad rather
 755 than passed around explicitly.  (The same might be true for PCS).
 756 Anyway:
 757
 758 \begin{verbatim}
 759    data PLS -- as described above; opaque to everybody except the linker
 760
 761    link :: PCI -> ??? -> [[Linkable]] -> PLS -> IO LinkResult
 762
 763    data LinkResult = LinkOK   PLS
 764                    | LinkErrs PLS [SDoc]
 765
 766    emptyPLS :: IO PLS     -- since CM has no other way to make one
 767 \end{verbatim}
 768
 769 CM uses @link@ as follows:
 770
 771 After repeatedly using @compile@ to compile all modules which are
 772 out-of-date, the @link@ is invoked.  The @[[Linkable]]@ argument to
 773 @link@ represents the list of (recursive groups of) home modules which
 774 have been newly compiled, along with @Linkable@s for each of
 775 the packages in use (the compilation manager knows which external
 776 packages are referenced by the home package).  The order of the list
 777 is important: it is sorted in such a way that linking any prefix of
 778 the list will result in an image with no unresolved references.  Note
 779 that for batch linking there may be further restrictions; for example
 780 it may not be possible to link recursive groups containing libraries.
 781
 782 @link@ does the following:
 783
 784 \begin{itemize}
 785   \item
 786   In batch mode, do nothing.  In interactive mode,
 787   examine the supplied @[[Linkable]]@ to determine which home
 788   module @Unlinked@s are new.  Remove precisely these @Linkable@s
 789   from PLS.  (In fact we really need to remove their upwards
 790   transitive closure, but I think it is an invariant that CM will
 791   supply an upwards transitive closure of new modules).
 792   See below for descriptions of @Linkable@ and @Unlinked@.
 793
 794   \item
 795   Batch system: invoke the external linker to link everything in one go.
 796   Interactive: bind the @Unlinked@s for the newly compiled modules,
 797   plus those for any newly required packages, into PLS.
 798
 799   Note that it is the linker's responsibility to remember which
 800   objects and packages have already been linked.  By comparing this
 801   with the @Linkable@s supplied to @link@, it can determine which
 802   of the linkables in LI are out of date
 803 \end{itemize}
 804
 805 If linking in of a group should fail for some reason, @link@ should
 806 not modify its PLS at all.  In other words, linking each group
 807 is atomic; it either succeeds or fails.
 808
 809 \subsubsection*{\mbox{\tt Unlinked} and \mbox{\tt Linkable}}
 810
 811 Two important types: @Unlinked@ and @Linkable@.  The latter is a
 812 higher-level representation involving multiple of the former.
 813 An @Unlinked@ is a reference to unlinked executable code, something
 814 a linker could take as input:
 815
 816 \begin{verbatim}
 817    data Unlinked = DotO   Path
 818                  | DotA   Path
 819                  | DotDLL Path
 820                  | Trees  [StgTree RdrName]
 821 \end{verbatim}
 822
 823 The first three describe the location of a file (presumably)
 824 containing the code to link.  @Trees@, which only exists in
 825 interactive mode, gives a list of @StgTrees@, in which the unresolved
 826 references are @RdrNames@ -- hence it's non-linkedness.  Once linked,
 827 those @RdrNames@ are replaced with pointers to the machine code
 828 implementing them.
 829
 830 A @Linkable@ gathers together several @Unlinked@s and associates them
 831 with either a module or package:
 832
 833 \begin{verbatim}
 834    data Linkable = LM Module  [Unlinked]   -- a module
 835                  | LP PkgName              -- a package
 836 \end{verbatim}
 837
 838 The order of the @Unlinked@s in the list is important, as
 839 they are linked in left-to-right order.  The @Unlinked@ objects for a
 840 particular package can be obtained from the package configuration (see
 841 Section \ref{sec:staticinfo}).
 842
 843 \ToDo{When adding @Addr@s from an object module to SST, we need to
 844       somehow find out the @RdrName@s of the symbols exported by that
 845       module.
 846       So we'd need to pass in the @ModDetails@ or @ModIFace@ or some such?}
 847
 848
 849
 850 %%-----------------------------------------------------------------%%
 851 \section{Background ideas}
 852 \subsubsection*{Out of date, but correct in spirit}
 853
 854 \subsection{Restructuring the system}
 855
 856 At the moment @hsc@ compiles one source module into C or assembly.
 857 This functionality is pushed inside a function called @compile@,
 858 introduced shortly.  The main new chunk of code is CM, the compilation manager,
 859 which supervises multiple runs of @compile@ so as to create up-to-date
 860 translations of a whole bunch of modules, as quickly as possible.
 861 CM also employs some minor helper functions, @finder@, @summarise@ and
 862 @link@, to do its work.
 863
 864 Our intent is to allow CM to be used as the basis either of a
 865 multi-module, batch mode compilation system, or to supply an
 866 interactive environment similar to that of Hugs.
 867 Only minor modifications to the behaviour of @compile@ and @link@
 868 are needed to give these different behaviours.
 869
 870 CM and @compile@, and, for interactive use, an interpreter, are the
 871 main code components.  The most important data structure is the global
 872 symbol table; much design effort has been expended thereupon.
 873
 874
 875 \subsection{How the global symbol table is implemented}
 876
 877 The top level symbol table is a @FiniteMap@ @ModuleName@
 878 @ModuleDetails@.  @ModuleDetails@ contains essentially the environment
 879 created by compiling a module.  CM manages this finite map, adding and
 880 deleting module entries as required.
 881
 882 The @ModuleDetails@ for a module @M@ contains descriptions of all
 883 tycons, classes, instances, values, unfoldings, etc (henceforth
 884 referred to as ``entities''), available from @M@.  These are just
 885 trees in the GHCI heap.  References from other modules to these
 886 entities is direct -- when you have a @TyCon@ in your hand, you really
 887 have a pointer directly to the @TyCon@ structure in the defining module,
 888 rather than some kind of index into a global symbol table.  So there
 889 is a global symbol table, but it has a distributed (sphagetti-like?)
 890 nature.
 891
 892 This gives fast and convenient access to tycon, class, instance,
 893 etc, information.  But because there are no levels of indirection,
 894 there's a problem when we replace @M@ with an updated version of @M@.
 895 We then need to find all references to entities in the old @M@'s
 896 sphagetti, and replace them with pointers to the new @M@'s sphagetti.
 897 This problem motivates a large part of the design.
 898
 899
 900
 901 \subsection{Implementing incremental recompilation -- simple version}
 902 Given the following module graph
 903 \begin{verbatim}
 904          D
 905        /   \
 906       /     \
 907      B       C
 908       \     /
 909        \   /
 910          A
 911 \end{verbatim}
 912 (@D@ imports @B@ and @C@, @B@ imports @A@, @C@ imports @A@) the aim is to do the
 913 least possible amount of compilation to bring @D@ back up to date.  The
 914 simplest scheme we can think of is:
 915 \begin{itemize}
 916 \item {\bf Downsweep}:
 917   starting with @D@, re-establish what the current module graph is
 918   (it might have changed since last time).  This means getting a
 919   @ModuleSummary@ of @D@.  The summary can be quickly generated,
 920   contains @D@'s import lists, and gives some way of knowing whether
 921   @D@'s source has changed since the last time it was summarised.
 922
 923   Transitively follow summaries from @D@, thereby establishing the
 924   module graph.
 925 \item
 926   Remove from the global symbol table (the @FiniteMap@ @ModuleName@
 927   @ModuleDetails@) the upwards closure of all modules in this package
 928   which are out-of-date with respect to their previous versions.  Also
 929   remove all modules no longer reachable from @D@.
 930 \item {\bf Upsweep}:
 931   Starting at the lowest point in the still-in-date module graph,
 932   start compiling upwards, towards @D@.  At each module, call
 933   @compile@, passing it a @FiniteMap@ @ModuleName@ @ModuleDetails@,
 934   and getting a new @ModuleDetails@ for the module, which is added to
 935   the map.
 936
 937   When compiling a module, the compiler must be able to know which
 938   entries in the map are for modules in its strict downwards closure,
 939   and which aren't, so that it can manufacture the instance
 940   environment correctly (as union of instances in its downwards
 941   closure).
 942 \item
 943   Once @D@ has been compiled, invoke some kind of linking phase
 944   if batch compilation.  For interactive use, can either do it all
 945   at the end, or as you go along.
 946 \end{itemize}
 947 In this simple world, recompilation visits the upwards closure of
 948 all changed modules.  That means when a module @M@ is recompiled,
 949 we can be sure no-one has any references to entities in the old @M@,
 950 because modules importing @M@ will have already been removed from the
 951 top-level finite map in the second step above.
 952
 953 The upshot is that we don't need to worry about updating links to @M@ in
 954 the global symbol table -- there shouldn't be any to update.
 955 \ToDo{What about mutually recursive modules?}
 956
 957 CM will happily chase through module interfaces in other packages in
 958 the downsweep.  But it will only process modules in this package
 959 during the upsweep.  So it assumes that modules in other packages
 960 never become out of date.  This is a design decision -- we could have
 961 decided otherwise.
 962
 963 In fact we go further, and require other packages to be compiled,
 964 i.e. to consist of a collection of interface files, and one or more
 965 source files.  CM will never apply @compile@ to a foreign package
 966 module, so there's no way a package can be built on the fly from source.
 967
 968 We require @compile@ to cache foreign package interfaces it reads, so
 969 that subsequent uses don't have to re-read them.  The cache never
 970 becomes out of date, since we've assumed that the source of foreign
 971 packages doesn't change during the course of a session (run of GHCI).
 972 As well as caching interfaces, @compile@ must cache, in some sense,
 973 the linkable code for modules.  In batch compilation this might simply
 974 mean remembering the names of object files to link, whereas in
 975 interactive mode @compile@ probably needs to load object code into
 976 memory in preparation for in-memory linking.
 977
 978 Important signatures for this simple scheme are:
 979 \begin{verbatim}
 980    finder :: ModuleName -> ModLocation
 981
 982    summarise :: ModLocation -> IO ModSummary
 983
 984    compile :: ModSummary
 985               -> FM ModName ModDetails
 986               -> IO CompileResult
 987
 988    data CompileResult = CompOK  ModDetails
 989                       | CompErr [ErrMsg]
 990
 991    link :: [ModLocation] -> [PackageLocation] -> IO Bool  -- linked ok?
 992 \end{verbatim}
 993
 994
 995 \subsection{Implementing incremental recompilation -- clever version}
 996
 997 So far, our upsweep, which is the computationally expensive bit,
 998 recompiles a module if either its source is out of date, or it
 999 imports a module which has been recompiled.  Sometimes we know
1000 we can do better than this:
1001 \begin{verbatim}
1002    module B where                module A
1003    import A ( f )                {-# NOINLINE f #-}
1004    ... f ...                     f x = x + 42
1005 \end{verbatim}
1006 If the definition of @f@ is changed to @f x = x + 43@, the simple
1007 upsweep would recompile @B@ unnecessarily.  We would like to detect
1008 this situation and avoid propagating recompilation all the way to the
1009 top.  There are two parts to this: detecting when a module doesn't
1010 need recompilation, and managing inter-module references in the
1011 global symbol table.
1012
1013 \subsubsection*{Detecting when a module doesn't need recompilation}
1014
1015 To do this, we introduce a new concept: the @ModuleIFace@.  This is
1016 effectively an in-memory interface file.  References to entities in
1017 other modules are done via strings, rather than being pointers
1018 directly to those entities.  Recall that, by comparison,
1019 @ModuleDetails@ do contain pointers directly to the entities they
1020 refer to.  So a @ModuleIFace@ is not part of the global symbol table.
1021
1022 As before, compiling a module produces a @ModuleDetails@ (inside the
1023 @CompileResult@), but it also produces a @ModuleIFace@.  The latter
1024 records, amongst things, the version numbers of all imported entities
1025 needed for the compilation of that module.  @compile@ optionally also
1026 takes the old @ModuleIFace@ as input during compilation:
1027 \begin{verbatim}
1028    data CompileResult = CompOK  ModDetails ModIFace
1029                       | CompErr [ErrMsg]
1030
1031    compile :: ModSummary
1032               -> FM ModName ModDetails
1033               -> Maybe ModuleIFace
1034               -> IO CompileResult
1035 \end{verbatim}
1036 Now, if the @ModuleSummary@ indicates this module's source hasn't
1037 changed, we only need to recompile it if something it depends on has
1038 changed.  @compile@ can detect this by inspecting the imported entity
1039 version numbers in the module's old @ModuleIFace@, and comparing them
1040 with the version numbers from the entities in the modules being
1041 imported.  If they are all the same, nothing it depends on has
1042 changed, so there's no point in recompiling.
1043
1044 \subsubsection*{Managing inter-module references in the global symbol table}
1045
1046 In the above example with @A@, @B@ and @f@, the specified change to @f@ would
1047 require @A@ but not @B@ to be recompiled.  That generates a new
1048 @ModuleDetails@ for @A@.  Problem is, if we leave @B@'s @ModuleDetails@
1049 unchanged, they continue to refer (directly) to the @f@ in @A@'s old
1050 @ModuleDetails@.  This is not good, especially if equality between
1051 entities is implemented using pointer equality.
1052
1053 One solution is to throw away @B@'s @ModuleDetails@ and recompile @B@.
1054 But this is precisely what we're trying to avoid, as it's expensive.
1055 Instead, a cheaper mechanism achieves the same thing: recreate @B@'s
1056 details directly from the old @ModuleIFace@.  The @ModuleIFace@ will
1057 (textually) mention @f@; @compile@ can then find a pointer to the
1058 up-to-date global symbol table entry for @f@, and place that pointer
1059 in @B@'s @ModuleDetails@.  The @ModuleDetails@ are, therefore,
1060 regenerated just by a quick lookup pass over the module's former
1061 @ModuleIFace@.  All this applies, of course, only when @compile@ has
1062 concluded it doesn't need to recompile @B@.
1063
1064 Now @compile@'s signature becomes a little clearer.  @compile@ has to
1065 recompile the module, generating a fresh @ModuleDetails@ and
1066 @ModuleIFace@, if any of the following hold:
1067 \begin{itemize}
1068 \item
1069   The old @ModuleIFace@ wasn't supplied, for some reason (perhaps
1070   we've never compiled this module before?)
1071 \item
1072   The module's source has changed.
1073 \item
1074   The module's source hasn't changed, but inspection of @ModuleIFaces@
1075   for this and its imports indicates that an imported entity has
1076   changed.
1077 \end{itemize}
1078 If none of those are true, we're in luck: quickly knock up a new
1079 @ModuleDetails@ from the old @ModuleIFace@, and return them both.
1080
1081 As a result, the upsweep still visits all modules in the upwards
1082 closure of those whose sources have changed.  However, at some point
1083 we hopefully make a transition from generating new @ModuleDetails@ the
1084 expensive way (recompilation) to a cheap way (recycling old
1085 @ModuleIFaces@).  Either way, all modules still get new
1086 @ModuleDetails@, so the global symbol table is correctly
1087 reconstructed.
1088
1089
1090 \subsection{How linking works, roughly}
1091
1092 When @compile@ translates a module, it produces a @ModuleDetails@,
1093 @ModuleIFace@ and a @Linkable@.  The @Linkable@ contains the
1094 translated but un-linked code for the module.  And when @compile@
1095 ventures into an interface in package it hasn't seen so far, it
1096 copies the package's object code into memory, producing one or more
1097 @Linkable@s.  CM keeps track of these linkables.
1098
1099 Once all modules have been @compile@d, CM invokes @link@, supplying
1100 the all the @Linkable@s it knows about.  If @compile@ had also been
1101 linking incrementally as it went along, @link@ doesn't have to do
1102 anything.  On the other hand, @compile@ could choose not to be
1103 incremental, and leave @link@ to do all the work.
1104
1105 @Linkable@s are opaque to CM.  For batch compilation, a @Linkable@
1106 can record just the name of an object file, DLL, archive, or whatever,
1107 in which case the CM's call to @link@ supplies exactly the set of
1108 file names to be linked.  @link@ can pass these verbatim to the
1109 standard system linker.
1110
1111
1112
1113
1114 %%-----------------------------------------------------------------%%
1115 \section{Ancient stuff}
1116 \subsubsection*{Should be selectively merged into ``Background ideas''}
1117
1118 \subsection{Overall}
1119 Top level structure is:
1120 \begin{itemize}
1121 \item The Compilation Manager (CM) calculates and maintains module
1122       dependencies, and knows how create up-to-date object or bytecode
1123       for a given module.  In doing so it may need to recompile
1124       arbitrary other modules, based on its knowledge of the module
1125       dependencies.
1126 \item On top of the CM are the ``user-level'' services.  We envisage
1127       both a HEP-like interface, for interactive use, and an
1128       @hmake@ style batch compiler facility.
1129 \item The CM only deals with inter-module issues.  It knows nothing
1130       about how to recompile an individual module, nor where the compiled
1131       result for a module lives, nor how to tell if
1132       a module is up to date, nor how to find the dependencies of a module.
1133       Instead, these services are supplied abstractly to CM via a
1134       @Compiler@ record.  To a first approximation, a @Compiler@
1135       contains
1136       the same functionality as @hsc@ has had until now -- the ability to
1137       translate a single Haskell module to C/assembly/object/bytecode.
1138
1139       Different clients of CM (HEP vs @hmake@) may supply different
1140       @Compiler@s, since they need slightly different behaviours.
1141       Specifically, HEP needs a @Compiler@ which creates bytecode
1142       in memory, and knows how to link it, whereas @hmake@ wants
1143       the traditional behaviour of emitting assembly code to disk,
1144       and making no attempt at linkage.
1145 \end{itemize}
1146
1147 \subsection{Open questions}
1148 \begin{itemize}
1149 \item
1150   Error reporting from @open@ and @compile@.
1151 \item
1152   Instance environment management
1153 \item
1154   We probably need to make interface files say what
1155   packages they depend on (so that we can figure out
1156   which packages to load/link).
1157 \item
1158   CM is parameterised both by the client uses and the @Compiler@
1159   supplied.  But it doesn't make sense to have a HEP-style client
1160   attached to a @hmake@-style @Compiler@.  So, really, the
1161   parameterising entity should contain both aspects, not just the
1162   current @Compiler@ contents.
1163 \end{itemize}
1164
1165 \subsection{Assumptions}
1166
1167 \begin{itemize}
1168 \item Packages other than the "current" one are assumed to be
1169   already compiled.
1170 \item
1171   The "current" package is usually "MAIN",
1172   but we can set it with a command-line flag.
1173   One invocation of ghci has only one "current" package.
1174 \item
1175   Packages are not mutually recursive
1176 \item
1177   All the object code for a package P is in libP.a or libP.dll
1178 \end{itemize}
1179
1180 \subsection{Stuff we need to be able to do}
1181 \begin{itemize}
1182 \item Create the environment in which a module has been translated,
1183       so that interactive queries can be satisfied as if ``in'' that
1184       module.
1185 \end{itemize}
1186
1187 %%-----------------------------------------------------------------%%
1188 \section{The Compilation Manager}
1189
1190 CM (@compilationManager@) is a functor, thus:
1191 \begin{verbatim}
1192 compilationManager :: Compiler -> IO HEP  -- IO so that it can create
1193                                           -- global vars (IORefs)
1194
1195 data HEP = HEP {
1196         load          :: ModuleName -> IO (),
1197         compileString :: ModuleName -> String -> IO HValue,
1198         ....
1199    }
1200
1201 newCompiler :: IO Compiler   -- ??? this is a peer of compilationManager?
1202
1203 run :: HValue -> IO ()       -- Run an HValue of type IO ()
1204                              -- In HEP?
1205 \end{verbatim}
1206
1207 @load@ is the central action of CM: its job is to bring a module and
1208 all its descendents into an executable state, by doing the following:
1209 \begin{enumerate}
1210 \item
1211    Use @summarise@ to descend the module hierarchy, starting from the
1212    nominated root, creating @ModuleSummary@s, and
1213    building a map @ModuleName@ @->@ @ModuleSummary@.  @summarise@
1214    expects to be passed absolute paths to files.  Use @finder@ to
1215    convert module names to file paths.
1216 \item
1217    Topologically sort the map,
1218    using dependency info in the @ModuleSummary@s.
1219 \item
1220    Clean up the symbol table by deleting the upward closure of
1221    changed modules.
1222 \item
1223    Working bottom to top, call @compile@ on the upward closure of
1224    all modules whose source has changed.  A module's source has
1225    changed when @sourceHasChanged@ indicates there is a difference
1226    between old and new summaries for the module.  Update the running
1227    @FiniteMap@ @ModuleName@ @ModuleDetails@ with the new details
1228    for this module.  Ditto for the running
1229    @FiniteMap@ @ModuleName@ @ModuleIFace@.
1230 \item
1231    Call @compileDone@ to signify that we've reached the top, so
1232    that the batch system can now link.
1233 \end{enumerate}
1234
1235
1236 %%-----------------------------------------------------------------%%
1237 \section{A compiler}
1238
1239 Most of the system's complexity is hidden inside the functions
1240 supplied in the @Compiler@ record:
1241 \begin{verbatim}
1242 data Compiler = Compiler {
1243
1244         finder :: PackageConf -> [Path] -> IO (ModuleName -> ModuleLocation)
1245
1246         summarise :: ModuleLocation -> IO ModuleSummary
1247
1248         compile :: ModuleSummary
1249                 -> Maybe ModuleIFace
1250                 -> FiniteMap ModuleName ModuleDetails
1251                 -> IO CompileResult
1252
1253         compileDone     :: IO ()
1254         compileStarting :: IO ()   -- still needed?  I don't think so.
1255     }
1256
1257 type ModuleName = String (or some such)
1258 type Path = String  -- an absolute file name
1259 \end{verbatim}
1260
1261 \subsection{The module \mbox{\tt finder}}
1262 The @finder@, given a package configuration file and a list of
1263 directories to look in, will map module names to @ModuleLocation@s,
1264 in which the @Path@s are filenames, probably with an absolute path
1265 to them.
1266 \begin{verbatim}
1267 data ModuleLocation = SourceOnly Path        -- .hs
1268                     | ObjectCode Path Path   -- .o & .hi
1269                     | InPackage  Path        -- .hi
1270 \end{verbatim}
1271 @SourceOnly@ and @ObjectCode@ are unremarkable.  For sanity,
1272 we require that a module's object and interface be in the same
1273 directory.  @InPackage@ indicates that the module is in a
1274 different package.
1275
1276 @Module@ values -- perhaps all @Name@ish things -- contain the name of
1277 their package.  That's so that
1278 \begin{itemize}
1279 \item Correct code can be generated for in-DLL vs out-of-DLL refs.
1280 \item We don't have version number dependencies for symbols
1281       imported from different packages.
1282 \end{itemize}
1283
1284 Somehow or other, it will be possible to know all the packages
1285 required, so that the for the linker can load them.
1286 We could detect package dependencies by recording them in the
1287 @compile@r's @ModuleIFace@ cache, and with that and the
1288 package config info, figure out the complete set of packages
1289 to link.  Or look at the command line args on startup.
1290
1291 \ToDo{Need some way to tell incremental linkers about packages,
1292       since in general we'll need to load and link them before
1293       linking any modules in the current package.}
1294
1295
1296 \subsection{The module \mbox{\tt summarise}r}
1297 Given a filename of a module (\ToDo{presumably source or iface}),
1298 create a summary of it.  A @ModuleSummary@ should contain only enough
1299 information for CM to construct an up-to-date picture of the
1300 dependency graph.  Rather than expose CM to details of timestamps,
1301 etc, @summarise@ merely provides an up-to-date summary of any module.
1302 CM can extract the list of dependencies from a @ModuleSummary@, but
1303 other than that has no idea what's inside it.
1304 \begin{verbatim}
1305 data ModuleSummary = ... (abstract) ...
1306
1307 depsFromSummary :: ModuleSummary -> [ModuleName]   -- module names imported
1308 sourceHasChanged :: ModuleSummary -> ModuleSummary -> Bool
1309 \end{verbatim}
1310 @summarise@ is intended to be fast -- a @stat@ of the source or
1311 interface to see if it has changed, and, if so, a quick semi-parse to
1312 determine the new imports.
1313
1314 \subsection{The module \mbox{\tt compile}r}
1315 @compile@ traffics in @ModuleIFace@s and @ModuleDetails@.
1316
1317 A @ModuleIFace@ is an in-memory representation of the contents of an
1318 interface file, including version numbers, unfoldings and pragmas, and
1319 the linkable code for the module.  @ModuleIFace@s are un-renamed,
1320 using @HsSym@/@RdrNames@ rather than (globally distinct) @Names@.
1321
1322 @ModuleDetails@, by contrast, is an in-memory representation of the
1323 static environment created by compiling a module.  It is phrased in
1324 terms of post-renaming @Names@, @TyCon@s, etc, so it's basically a
1325 renamed-to-global-uniqueness rendition of a @ModuleIFace@.
1326
1327 In an interactive session, we'll want to be able to evaluate
1328 expressions as if they had been compiled in the scope of some
1329 specified module.  This means that the @ModuleDetails@ must contain
1330 the type of everything defined in the module, rather than just the
1331 types of exported stuff.  As a consequence, @ModuleIFace@ must also
1332 contain the type of everything, because it should always be possible
1333 to generate a module's @ModuleDetails@ from its @ModuleIFace@.
1334
1335 CM maintains two mappings, one from @ModuleName@s to @ModuleIFace@s,
1336 the other from @ModuleName@s to @ModuleDetail@s.  It passes the former
1337 to each call of @compile@.  This is used to supply information about
1338 modules compiled prior to this one (lower down in the graph).  The
1339 returned @CompileResult@ supplies a new @ModuleDetails@ for the module
1340 if compilation succeeded, and CM adds this to the mapping.  The
1341 @CompileResult@ also supplies a new @ModuleIFace@, which is either the
1342 same as that supplied to @compile@, if @compile@ decided not to
1343 retranslate the module, or is the result of a fresh translation (from
1344 source).  So these mappings are an explicitly-passed-around part of
1345 the global system state.
1346
1347 @compile@ may also {\em optionally} also accumulate @ModuleIFace@s for
1348 modules in different packages -- that is, interfaces which we read,
1349 but never attempt to recompile source for.  Such interfaces, being
1350 from foreign packages, never change, so @compile@ can accumulate them
1351 in perpetuity in a private global variable.  Indeed, a major motivator
1352 of this design is to facilitate this caching of interface files,
1353 reading of which is a serious bottleneck for the current compiler.
1354
1355 When CM restarts compilation down at the bottom of the module graph,
1356 it first needs to throw away all \ToDo{all?} @ModuleDetails@ in the
1357 upward closure of the out-of-date modules.  So @ModuleDetails@ don't
1358 persist across recompilations.  But @ModuleIFace@s do, since they
1359 are conceptually equivalent to interface files.
1360
1361
1362 \subsubsection*{What @compile@ returns}
1363 @compile@ returns a @CompileResult@ to CM.
1364 Note that the @compile@'s foreign-package interface cache can
1365 become augmented even as a result of reading interfaces for a
1366 compilation attempt which ultimately fails, although it will not be
1367 augmented with a new @ModuleIFace@ for the failed module.
1368 \begin{verbatim}
1369 -- CompileResult is not abstract to the Compilation Manager
1370 data CompileResult
1371    = CompOK   ModuleIFace
1372               ModuleDetails    -- compiled ok, here are new details
1373                                -- and new iface
1374
1375    | CompErr  [SDoc]           -- compilation gave errors
1376
1377    | NoChange                  -- no change required, meaning:
1378                                -- exports, unfoldings, strictness, etc,
1379                                -- unchanged, and executable code unchanged
1380 \end{verbatim}
1381
1382
1383
1384 \subsubsection*{Re-establishing local-to-global name mappings}
1385 Consider
1386 \begin{verbatim}
1387 module Upper where                         module Lower ( f ) where
1388 import Lower ( f )                         f = ...
1389 g = ... f ...
1390 \end{verbatim}
1391 When @Lower@ is first compiled, @f@ is allocated a @Unique@
1392 (presumably inside an @Id@ or @Name@?).  When @Upper@ is then
1393 compiled, its reference to @f@ is attached directly to the
1394 @Id@ created when compiling @Lower@.
1395
1396 If the definition of @f@ is now changed, but not the type,
1397 unfolding, strictness, or any other thing which affects the way
1398 it should be called, we will have to recompile @Lower@, but not
1399 @Upper@.  This creates a problem -- @g@ will then refer to the
1400 the old @Id@ for @f@, not the new one.  This may or may not
1401 matter, but it seems safer to ensure that all @Unique@-based
1402 references into child modules are always up to date.
1403
1404 So @compile@ recreates the @ModuleDetails@ for @Upper@ from
1405 the @ModuleIFace@ of @Upper@ and the @ModuleDetails@ of @Lower@.
1406
1407 The rule is: if a module is up to date with respect to its
1408 source, but a child @C@ has changed, then either:
1409 \begin{itemize}
1410 \item On examination of the version numbers in @C@'s
1411       interface/@ModuleIFace@ that we used last time, we discover that
1412       an @Id@/@TyCon@/class/instance we depend on has changed.  So
1413       we need to retranslate the module from its source, generating
1414       a new @ModuleIFace@ and @ModuleDetails@.
1415 \item Or: there's nothing in @C@'s interface that we depend on.
1416       So we quickly recreate a new @ModuleDetails@ from the existing
1417       @ModuleIFace@, creating fresh links to the new @Unique@-world
1418       entities in @C@'s new @ModuleDetails@.
1419 \end{itemize}
1420
1421 Upshot: we need to redo @compile@ on all modules all the way up,
1422 rather than just the ones that need retranslation.  However, we hope
1423 that most modules won't need retranslation -- just regeneration of the
1424 @ModuleDetails@ from the @ModuleIFace@.  In effect, the @ModuleIFace@
1425 is a quickly-compilable representation of the module's contents, just
1426 enough to create the @ModuleDetails@.
1427
1428 \ToDo{Is there anything in @ModuleDetails@ which can't be
1429       recreated from @ModuleIFace@ ?}
1430
1431 So the @ModuleIFace@s persist across calls to @HEP.load@, whereas
1432 @ModuleDetails@ are reconstructed on every compilation pass.  This
1433 means that @ModuleIFace@s have the same lifetime as the byte/object
1434 code, and so should somehow contain their code.
1435
1436 The behind-the-scenes @ModuleIFace@ cache has some kind of holding-pen
1437 arrangement, to lazify the copying-out of stuff from it, and thus to
1438 minimise redundant interface reading.  \ToDo{Burble burble.  More
1439 details.}.
1440
1441 When CM starts working back up the module graph with @compile@, it
1442 needs to remove from the travelling @FiniteMap@ @ModuleName@
1443 @ModuleDetails@ the details for all modules in the upward closure of
1444 the compilation start points.  However, since we're going to visit
1445 precisely those modules and no others on the way back up, we might as
1446 well just zap them the old @ModuleDetails@ incrementally.  This does
1447 mean that the @FiniteMap@ @ModuleName@ @ModuleDetails@ will be
1448 inconsistent until we reach the top.
1449
1450 In interactive mode, each @compile@ call on a module for which no
1451 object code is available, or for which it is out of date wrt source,
1452 emit bytecode into memory, update the resulting @ModuleIFace@ with the
1453 address of the bytecode image, and link the image.
1454
1455 In batch mode, emit assembly or object code onto disk.  Record
1456 somewhere \ToDo{where?} that this object file needs to go into the
1457 final link.
1458
1459 When we reach the top, @compileDone@ is called, to signify that batch
1460 linking can now proceed, if need be.
1461
1462 Modules in other packages never get a @ModuleIFace@ or @ModuleDetails@
1463 entry in CM's maps -- those maps are only for modules in this package.
1464 As previously mentioned, @compile@ may optionally cache @ModuleIFace@s
1465 for foreign package modules.  When reading such an interface, we don't
1466 need to read the version info for individual symbols, since foreign
1467 packages are assumed static.
1468
1469 \subsubsection*{What's in a \mbox{\tt ModuleIFace}?}
1470
1471 Current interface file contents?
1472
1473
1474 \subsubsection*{What's in a \mbox{\tt ModuleDetails}?}
1475
1476 There is no global symbol table @:: Name -> ???@.  To look up a
1477 @Name@, first extract the @ModuleName@ from it, look that up in
1478 the passed-in @FiniteMap@ @ModuleName@ @ModuleDetails@,
1479 and finally look in the relevant @Env@.
1480
1481 \ToDo{Do we still have the @HoldingPen@, or is it now composed from
1482 per-module bits too?}
1483 \begin{verbatim}
1484 data ModuleDetails = ModuleDetails {
1485
1486         moduleExports :: what it exports (Names)
1487                          -- roughly a subset of the .hi file contents
1488
1489         moduleEnv     :: RdrName -> Name
1490                          -- maps top-level entities in this module to
1491                          -- globally distinct (Uniq-ified) Names
1492
1493         moduleDefs    :: Bag Name -- All the things in the global symbol table
1494                                   -- defined by this module
1495
1496         package       :: Package -- what package am I in?
1497
1498         lastCompile   :: Date -- of last compilation
1499
1500         instEnv       :: InstEnv                 -- local inst env
1501         typeEnv       :: Name -> TyThing         -- local tycon env?
1502    }
1503
1504 -- A (globally unique) symbol table entry. Note that Ids contain
1505 -- unfoldings.
1506 data TyThing = AClass Class
1507              | ATyCon TyCon
1508              | AnId Id
1509 \end{verbatim}
1510 What's the stuff in @ModuleDetails@ used for?
1511 \begin{itemize}
1512 \item @moduleExports@ so that the stuff which is visible from outside
1513       the module can be calculated.
1514 \item @moduleEnv@: \ToDo{umm err}
1515 \item @moduleDefs@: one reason we want this is so that we can nuke the
1516       global symbol table contribs from this module when it leaves the
1517       system.  \ToDo{except ... we don't have a global symbol table any
1518       more.}
1519 \item @package@: we will need to chase arbitrarily deep into the
1520       interfaces of other packages.  Of course we don't want to
1521       recompile those, but as we've read their interfaces, we may
1522       as well cache that info.  So @package@ indicates whether this
1523       module is in the default package, or, if not, which it is in.
1524
1525       Also, when we come to linking, we'll need to know which
1526       packages are demanded, so we know to load their objects.
1527
1528 \item @lastCompile@: When the module was last compiled.  If the
1529       source is older than that, then a recompilation can only be
1530       required if children have changed.
1531 \item @typeEnv@: obvious??
1532 \item @instEnv@: the instances contributed by this module only.  The
1533       Report allegedly says that when a module is translated, the
1534       available
1535       instance env is all the instances in the downward closure of
1536       itself in the module graph.
1537
1538       We choose to use this simple representation -- each module
1539       holds just its own instances -- and do the naive thing when
1540       creating an inst env for compilation with.  If this turns out
1541       to be a performance problem we'll revisit the design.
1542 \end{itemize}
1543
1544
1545
1546 %%-----------------------------------------------------------------%%
1547 \section{Misc text looking for a home}
1548
1549 \subsection*{Linking}
1550
1551 \ToDo{All this linking stuff is now bogus.}
1552
1553 There's an abstract @LinkState@, which is threaded through the linkery
1554 bits.  CM can call @addpkgs@ to notify the linker of packages
1555 required, and it can call @addmods@ to announce modules which need to
1556 be linked.  Finally, CM calls @endlink@, after which an executable
1557 image should be ready.  The linker may link incrementally, during each
1558 call of @addpkgs@ and @addmods@, or it can just store up names and do
1559 all the linking when @endlink@ is called.
1560
1561 In order that incremental linking is possible, CM should specify
1562 packages and module groups in dependency order, ie, from the bottom up.
1563
1564 \subsection*{In-memory linking of bytecode}
1565 When being HEP-like, @compile@ will translate sources to bytecodes
1566 in memory, with all the bytecode for a module as a contiguous lump
1567 outside the heap.  It needs to communicate the addresses of these
1568 lumps to the linker.  The linker also needs to know whether a
1569 given module is available as in-memory bytecode, or whether it
1570 needs to load machine code from a file.
1571
1572 I guess @LinkState@ needs to map module names to base addresses
1573 of their loaded images, + the nature of the image, + whether or not
1574 the image has been linked.
1575
1576 \subsection*{On disk linking of object code, to give an executable}
1577 The @LinkState@ in this case is just a list of module and package
1578 names, which @addpkgs@ and @addmods@ add to.  The final @endlink@
1579 call can invoke the system linker.
1580
1581 \subsection{Finding out about packages, dependencies, and auxiliary
1582             objects}
1583
1584 Ask the @packages.conf@ file that lives with the driver at the mo.
1585
1586 \ToDo{policy about upward closure?}
1587
1588
1589
1590 \ToDo{record story about how in memory linking is done.}
1591
1592 \ToDo{linker start/stop/initialisation/persistence.  Need to
1593       say more about @LinkState@.}
1594
1595
1596 \end{document}
1597
1598