ghc/docs/add_to_compiler/front-end.verb

   1 %************************************************************************
   2 %*                                                                      *
   3 \subsection{The front end of the compiler}
   4 \label{sec:front-end}
   5 %*                                                                      *
   6 %************************************************************************
   7
   8 The previous section covered the main points about the front end of
   9 the compiler: it is dominated by a ``renamer'' and a typechecker
  10 working directly at the Haskell source level.  In this section, we
  11 will look at some basic datatypes used or introduced in the front
  12 end---ones that you may see as input to your optimisation pass.
  13
  14 \subsubsection{``Abstract syntax'', a source-level datatype}
  15 \label{sec:AbsSyntax}
  16
  17 As Figure~\ref{fig:overview} shows, the typechecker both reads and
  18 writes a collection of datatypes we call ``Abstract syntax.''
  19 This is misleading in that what
  20 goes into the typechecker is quite different from what comes out.
  21
  22 Let's first consider this fragment of the abstract-syntax
  23 definition\srcloc{abstractSyn/HsExpr.lhs}, for Haskell explicit-list
  24 expressions (Haskell report, section~3.5
  25 \cite{hudak91a}):\nopagebreak[4]
  26 \begin{mytightcode}
  27 data Expr var pat =
  28   ...
  29   | ExplicitList        [Expr var pat]
  30   | ExplicitListOut     UniType [Expr var pat]
  31   ...
  32
  33 type ProtoNameExpr      = Expr ProtoName ProtoNamePat
  34 type RenamedExpr        = Expr Name RenamedPat
  35 type TypecheckedExpr    = Expr Id TypecheckedPat\end{mytightcode}
  36 an @ExplicitList@ appears only in typechecker input; an @ExplicitListOut@
  37 is the corresponding construct that appears
  38 only in the output, with the inferred type information attached.
  39
  40 The fragment above also shows how abstract syntax is {\em parameterised}
  41 with respect to variables and patterns.  The idea is the same for
  42 both; let's just consider variables.
  43
  44 The renamer converts @ProtoNameExprs@ (expressions with
  45 @ProtoNames@\srcloc{basicTypes/ProtoName.lhs} as variables---little
  46 more than just strings) into @RenamedExprs@, which have all naming sorted out
  47 (using @Names@\srcloc{abstractSyn/Name.lhs}).  A @Name@ is known to be
  48 properly bound, isn't duplicated, etc.; it's known if it's bound to a
  49 built-in standard-prelude entity.
  50
  51 (The renamer also does dependency analysis, which is required to
  52 maximise polymorphism in a Hindley-Milner type system.)
  53
  54 The typechecker reads the @RenamedExprs@, sorts out the types, and
  55 spits out @TypecheckedExprs@, with variables represented by
  56 @Ids@\srcloc{basicTypes/Id.lhs}.  You can find out just about anything
  57 you want to know about a variable from its @Id@.
  58
  59 To see what GHC makes of some Haskell, in a file @Foo.hs@, say:
  60 try @ghc -noC -ddump-rn4 Foo.hs@, to see what comes out of the renamer [pass~4];
  61 try @ghc -noC -ddump-tc Foo.hs@, to see what comes out of the typechecker.
  62
  63 \subsubsection{Basic datatypes in the compiler}
  64
  65 None of the internal datatypes in the example just given are
  66 particularly interesting except @Ids@\srcloc{basicTypes/Id.lhs}. A
  67 program variable, which enters the typechecker as a string, emerges as
  68 an @Id@.
  69
  70 The important thing about @Id@---and the datatypes representing other
  71 basic program entities (type variables, type constructors, classes,
  72 etc.)---is that they often include {\em memoised} information that can
  73 be used throughout the rest of the compiler.
  74
  75 Let us take a cursory look at @Ids@, as a representative example of
  76 these basic data types.  (Don't be too scared---@Ids@ are the hairiest
  77 entities in the whole compiler!)
  78 Here we go:
  79 \begin{mytightcode}
  80 \codeallowbreaks{}data Id
  81   = Id Unique      {\dcd\rm-- key for fast comparison}
  82        UniType     {\dcd\rm-- Id's type; used all the time;}
  83        IdInfo      {\dcd\rm-- non-essential info about this Id;}
  84        PragmaInfo  {\dcd\rm-- user-specified pragmas about this Id;}
  85        IdDetails   {\dcd\rm-- stuff about individual kinds of Ids.}\end{mytightcode}
  86
  87 So, every @Id@ comes with:
  88 \begin{enumerate}
  89 \item
  90 A @Unique@\srcloc{basicTypes/Unique.lhs}, essentially a unique
  91 @Int@, for fast comparison;
  92 \item
  93 A @UniType@ (more on them below... section~\ref{sec:UniType}) giving the variable's
  94 type---this is the most useful memoised information.
  95 \item
  96 A @PragmaInfo@, which is pragmatic stuff the user specified for
  97 this @Id@; e.g., @INLINE@ it; GHC does not promise to honour these
  98 pragma requests, but this is where it keeps track of them.
  99 \item
 100 An @IdInfo@ (more on {\em them} below... section~\ref{sec:IdInfo}),
 101 which tells you all the extra things
 102 that the compiler doesn't {\em have} to know about an @Id@, but it's jolly nice...
 103 This corresponds pretty closely to the @GHC_PRAGMA@ cross-module info that you will
 104 see in any interface produced using @ghc -O@.
 105 An example of some @IdInfo@
 106 would be: that @Id@'s unfolding; or its arity.
 107 \end{enumerate}
 108
 109 Then the fun begins with @IdDetails@...
 110 \begin{mytightcode}
 111 \codeallowbreaks{}data IdDetails
 112
 113   {\dcd\rm---------------- Local values}
 114
 115   = LocalId     ShortName       {\dcd\rm-- mentioned by the user}
 116
 117   | SysLocalId  ShortName       {\dcd\rm-- made up by the compiler}
 118
 119   {\dcd\rm---------------- Global values}
 120
 121   | ImportedId  FullName        {\dcd\rm-- Id imported from an interface}
 122
 123   | PreludeId   FullName        {\dcd\rm-- Prelude things the compiler ``knows'' about}
 124
 125   | TopLevId    FullName        {\dcd\rm-- Top-level in the orig source pgm}
 126                                 {\dcd\rm-- (not moved there by transformations).}
 127
 128   {\dcd\rm---------------- Data constructors}
 129
 130   | DataConId FullName
 131              ConTag
 132              [TyVarTemplate] ThetaType [UniType] TyCon
 133              {\dcd\rm-- split-up type: the constructor's type is:}
 134              {\dcd\rm-- $\forall~tyvars . theta\_ty \Rightarrow$}
 135              {\dcd\rm--    $unitype_1 \rightarrow~ ... \rightarrow~ unitype_n \rightarrow~ tycon tyvars$}
 136
 137   | TupleCon Int                         {\dcd\rm-- its arity}
 138
 139   {\dcd\rm-- There are quite a few more flavours of {\tt IdDetails}...}\end{mytightcode}
 140
 141 % A @ShortName@,\srcloc{basicTypes/NameTypes.lhs} which includes a name string
 142 % and a source-line location for the name's binding occurrence;
 143
 144 In short: everything that later parts of the compiler might want to
 145 know about a local @Id@ is readily at hand.  The same principle holds
 146 true for imported-variable and data-constructor @Ids@ (tuples are an
 147 important enough case to merit special pleading), as well as for other
 148 basic program entities.  Here are a few further notes about the @Id@
 149 fragments above:
 150 \begin{itemize}
 151 \item
 152 A @FullName@\srcloc{basicTypes/NameTypes.lhs} is one that may be
 153 globally visible, with a module-name as well; it may have been
 154 renamed.
 155 \item
 156 @DataConKey@\srcloc{prelude/PrelUniqs.lhs} is a specialised
 157 fast-comparison key for data constructors; there are several of these
 158 kinds of things.
 159 \item
 160 The @UniType@ for @DataConIds@ is given in {\em two} ways: once, just as
 161 a plain type; secondly, split up into its component parts.  This is to
 162 save frequently re-splitting such types.
 163 \item
 164 Similarly, a @TupleCon@ has a type attached, even though we could
 165 construct it just from the arity.
 166 \end{itemize}
 167
 168 \subsubsection{@UniTypes@, representing types in the compiler}
 169 \label{sec:UniType}
 170
 171 Let us look further at @UniTypes@\srcloc{uniType/}. Their definition
 172 is:
 173 \begin{mytightcode}
 174 \codeallowbreaks{}data UniType
 175   = UniTyVar    TyVar
 176
 177   | UniFun      UniType         {\dcd\rm-- function type}
 178                 UniType
 179
 180   | UniData     TyCon           {\dcd\rm-- non-synonym datatype}
 181                 [UniType]
 182
 183   | UniSyn      TyCon           {\dcd\rm-- type synonym}
 184                 [UniType]       {\dcd\rm--\ \ unexpanded form}
 185                 UniType         {\dcd\rm--\ \ expanded form}
 186
 187   | UniDict     Class           {\dcd\rm-- for types with overloading}
 188                 UniType
 189
 190   {\dcd\rm-- The next two are to do with universal quantification.}
 191   | UniTyVarTemplate TyVarTemplate
 192
 193   | UniForall   TyVarTemplate
 194                 UniType\end{mytightcode}
 195 When the typechecker processes a source module, it adds @UniType@
 196 information to all the basic entities (e.g., @Ids@), among other
 197 places (see Section~\ref{sec:second-order} for more details).  These
 198 types are used throughout the compiler.
 199
 200 The following example shows several things about @UniTypes@.
 201 If a programmer wrote @(Eq a) => a -> [a]@, it would be represented
 202 as:\footnote{The internal structures of @Ids@,
 203 @Classes@, @TyVars@, and @TyCons@ are glossed over here...}
 204 \begin{mytightcode}
 205 \codeallowbreaks{}UniForall {\dcd$\alpha$}
 206       (UniFun (UniDict {\dcd\em Eq} (UniTyVar {\dcd$\alpha$}))
 207               (UniFun (UniTyVarTemplate {\dcd$\alpha$})
 208                       (UniData {\dcd\em listTyCon}
 209                                [UniTyVarTemplate {\dcd$\alpha$}])))\end{mytightcode}
 210 From this example we see:
 211 \begin{itemize}
 212 \item
 213 The universal quantification of the type variable $\alpha$ is made explicit
 214 (with a @UniForall@).
 215 \item
 216 The class assertion @(Eq a)@ is represented with a @UniDict@ whose
 217 second component is a @UniType@---this
 218 reflects the fact that we expect @UniType@ to be used in a stylized
 219 way, avoiding nonsensical constructs (e.g.,
 220 @(UniDict f (UniDict g (UniDict h ...)))@).
 221 \item
 222 The ``double arrow'' (@=>@) of the Haskell source, indicating an
 223 overloaded type, is represented by the usual
 224 @UniFun@ ``single arrow'' (@->@), again in a stylized way.
 225 This change reflects the fact that each class assertion in a
 226 function's type is implemented by adding an extra dictionary argument.
 227 \item
 228 In keeping with the memoising tradition we saw with @Ids@, type
 229 synonyms (@UniSyns@) keep both the unexpanded and expanded forms handy.
 230 \end{itemize}
 231
 232 \subsubsection{@IdInfo@: extra pragmatic info about an @Id@}
 233 \label{sec:IdInfo}
 234
 235 [New in 0.16.]  All the nice-to-have-but-not-essential information
 236 about @Ids@ is now hidden in the
 237 @IdInfo@\srcloc{basicTypes/IdInfo.lhs} datatype.  It looks something
 238 like:
 239 \begin{mytightcode}
 240 \codeallowbreaks{}data IdInfo
 241   = NoIdInfo             {\dcd\rm-- OK, we know nothing...}
 242
 243   | MkIdInfo
 244       ArityInfo          {\dcd\rm-- its arity}
 245       DemandInfo         {\dcd\rm-- whether or not it is definitely demanded}
 246       InliningInfo       {\dcd\rm-- whether or not we should inline it}
 247       SpecialisationInfo {\dcd\rm-- specialisations of this overloaded}
 248                          {\dcd\rm-- function which exist}
 249       StrictnessInfo     {\dcd\rm-- strictness properties, notably}
 250                          {\dcd\rm-- how to conjure up ``worker'' functions}
 251       WrapperInfo        {\dcd\rm-- ({\em informational only}) if an Id is}
 252                          {\dcd\rm-- a ``worker,'' this says what Id it's}
 253                          {\dcd\rm-- a worker for, i.e., ``who is my wrapper''}
 254                          {\dcd\rm-- (used to match up workers/wrappers)}
 255       UnfoldingInfo      {\dcd\rm-- its unfolding}
 256       UpdateInfo         {\dcd\rm-- which args should be updated}
 257       SrcLocation        {\dcd\rm-- source location of definition}\end{mytightcode}
 258 As you can see, we may accumulate a lot of information about an Id!
 259 (The types for all the sub-bits are given in the same place...)
 260
 261 \subsubsection{Introducing dictionaries for overloading}
 262
 263 The major novel feature of the Haskell language is its systematic
 264 overloading using {\em type classes}; Wadler and Blott's paper is the
 265 standard reference \cite{wadler89a}.
 266
 267 To implement type classes, the typechecker not only checks the Haskell
 268 source program, it also {\em translates} it---by inserting code to
 269 pass around in {\em dictionaries}.  These dictionaries
 270 are essentially tuples of functions, from which the correct code may
 271 be plucked at run-time to give the desired effect.  Kevin Hammond
 272 wrote and described the first working implementation of type
 273 classes \cite{hammond89a}, and the ever-growing static-semantics paper
 274 by Peyton Jones and Wadler is replete with the glories of dictionary
 275 handling \cite{peyton-jones90a}.  (By the way, the typechecker's
 276 structure closely follows the static semantics paper; inquirers into
 277 the former will become devoted students of the latter.)
 278
 279 Much of the abstract-syntax datatypes are given
 280 over to output-only translation machinery.  Here are a few more
 281 fragments of the @Expr@ type, all of which appear only in typechecker
 282 output:
 283 \begin{mytightcode}
 284 data Expr var pat =
 285   ...
 286   | DictLam    [DictVar]       (Expr var pat)
 287   | DictApp    (Expr var pat)  [DictVar]
 288   | Dictionary [DictVar]       [Id]
 289   | SingleDict DictVar
 290   ...\end{mytightcode}
 291 You needn't worry about this stuff:
 292 After the desugarer gets through with such constructs, there's nothing
 293 left but @Ids@, tuples, tupling functions, etc.,---that is, ``plain
 294 simple stuff'' that should make the potential optimiser's heart throb.
 295 Optimisation passes don't deal with dictionaries explicitly but, in
 296 some cases, quite a bit of the code passed through to them will be for
 297 dictionary-handling.