ghc/docs/add_to_compiler/front-end.verb

   1 %************************************************************************
   2 %*                                                                      *
   3 \subsection{The front end of the compiler}
   4 \label{sec:front-end}
   5 %*                                                                      *
   6 %************************************************************************
   7
   8 The previous section covered the main points about the front end of
   9 the compiler: it is dominated by a ``renamer'' and a typechecker
  10 working directly at the Haskell source level.  In this section, we
  11 will look at some basic datatypes used or introduced in the front
  12 end---ones that you may see as input to your optimisation pass.
  13
  14 \subsubsection{``Abstract syntax'', a source-level datatype}
  15 \label{sec:AbsSyntax}
  16
  17 As Figure~\ref{fig:overview} shows, the typechecker both reads and
  18 writes a collection of datatypes we call ``Abstract syntax.''
  19 This is misleading in that what
  20 goes into the typechecker is quite different from what comes out.
  21
  22 Let's first consider this fragment of the abstract-syntax
  23 definition,\srcloc{abstractSyn/HsExpr.lhs} for Haskell explicit-list
  24 expressions (Haskell report, section~3.5
  25 \cite{hudak91a}):\nopagebreak[4]
  26 \begin{tightcode}
  27 data Expr var pat =
  28   ...
  29   | ExplicitList        [Expr var pat]
  30   | ExplicitListOut     UniType [Expr var pat]
  31   ...
  32
  33 type ProtoNameExpr      = Expr ProtoName ProtoNamePat
  34 type RenamedExpr        = Expr Name RenamedPat
  35 type TypecheckedExpr    = Expr Id TypecheckedPat
  36 \end{tightcode}
  37 an @ExplicitList@ appears only in typechecker input; an @ExplicitListOut@
  38 is the corresponding construct that appears
  39 only in the output, with the inferred type information attached.
  40
  41 The fragment above also shows how abstract syntax is {\em parameterised}
  42 with respect to variables and patterns.  The idea is the same for
  43 both; let's just consider variables.
  44
  45 The renamer converts @ProtoNameExprs@ (expressions with
  46 @ProtoNames@\srcloc{basicTypes/ProtoName.lhs} as variables---little
  47 more than just strings) into @RenamedExprs@, which have all naming sorted out
  48 (using @Names@\srcloc{abstractSyn/Name.lhs}).  A @Name@ is known to be
  49 properly bound, isn't duplicated, etc.; it's known if it's bound to a
  50 built-in standard-prelude entity.
  51
  52 (The renamer also does dependency analysis, which is required to
  53 maximise polymorphism in a Hindley-Milner type system.)
  54
  55 The typechecker reads the @RenamedExprs@, sorts out the types, and
  56 spits out @TypecheckedExprs@, with variables represented by
  57 @Ids@\srcloc{basicTypes/Id.lhs}.  You can find out just about anything
  58 you want to know about a variable from its @Id@.
  59
  60 To see what GHC makes of some Haskell, in a file @Foo.hs@, say:
  61 try @ghc -noC -ddump-rn4 Foo.hs@, to see what comes out of the renamer [pass~4];
  62 try @ghc -noC -ddump-tc Foo.hs@, to see what comes out of the typechecker.
  63
  64 \subsubsection{Basic datatypes in the compiler}
  65
  66 None of the internal datatypes in the example just given are
  67 particularly interesting except @Ids@.\srcloc{basicTypes/Id.lhs} A
  68 program variable, which enters the typechecker as a string, emerges as
  69 an @Id@.
  70
  71 The important thing about @Id@---and the datatypes representing other
  72 basic program entities (type variables, type constructors, classes,
  73 etc.)---is that they often include {\em memoised} information that can
  74 be used throughout the rest of the compiler.
  75
  76 Let us take a cursory look at @Ids@, as a representative example of
  77 these basic data types.  (Don't be too scared---@Ids@ are the hairiest
  78 entities in the whole compiler!)
  79 Here we go:
  80 \begin{tightcode}\codeallowbreaks{}
  81 data Id
  82   = Id Unique      {\dcd\rm-- key for fast comparison}
  83        UniType     {\dcd\rm-- Id's type; used all the time;}
  84        IdInfo      {\dcd\rm-- non-essential info about this Id;}
  85        PragmaInfo  {\dcd\rm-- user-specified pragmas about this Id;}
  86        IdDetails   {\dcd\rm-- stuff about individual kinds of Ids.}
  87 \end{tightcode}
  88
  89 So, every @Id@ comes with:
  90 \begin{enumerate}
  91 \item
  92 A @Unique@,\srcloc{basicTypes/Unique.lhs} essentially a unique
  93 @Int@, for fast comparison;
  94 \item
  95 A @UniType@ (more on them below... section~\ref{sec:UniType}) giving the variable's
  96 type---this is the most useful memoised information.
  97 \item
  98 A @PragmaInfo@, which is pragmatic stuff the user specified for
  99 this @Id@; e.g., @INLINE@ it; GHC does not promise to honour these
 100 pragma requests, but this is where it keeps track of them.
 101 \item
 102 An @IdInfo@ (more on {\em them} below... section~\ref{sec:IdInfo}),
 103 which tells you all the extra things
 104 that the compiler doesn't {\em have} to know about an @Id@, but it's jolly nice...
 105 This corresponds pretty closely to the @GHC_PRAGMA@ cross-module info that you will
 106 see in any interface produced using @ghc -O@.
 107 An example of some @IdInfo@
 108 would be: that @Id@'s unfolding; or its arity.
 109 \end{enumerate}
 110
 111 Then the fun begins with @IdDetails@...
 112 \begin{tightcode}\codeallowbreaks{}
 113 data IdDetails
 114
 115   {\dcd\rm---------------- Local values}
 116
 117   = LocalId     ShortName       {\dcd\rm-- mentioned by the user}
 118
 119   | SysLocalId  ShortName       {\dcd\rm-- made up by the compiler}
 120
 121   {\dcd\rm---------------- Global values}
 122
 123   | ImportedId  FullName        {\dcd\rm-- Id imported from an interface}
 124
 125   | PreludeId   FullName        {\dcd\rm-- Prelude things the compiler ``knows'' about}
 126
 127   | TopLevId    FullName        {\dcd\rm-- Top-level in the orig source pgm}
 128                                 {\dcd\rm-- (not moved there by transformations).}
 129
 130   {\dcd\rm---------------- Data constructors}
 131
 132   | DataConId FullName
 133              ConTag
 134              [TyVarTemplate] ThetaType [UniType] TyCon
 135              {\dcd\rm-- split-up type: the constructor's type is:}
 136              {\dcd\rm-- $\forall~tyvars . theta\_ty \Rightarrow$}
 137              {\dcd\rm--    $unitype_1 \rightarrow~ ... \rightarrow~ unitype_n \rightarrow~ tycon tyvars$}
 138
 139   | TupleCon Int                         {\dcd\rm-- its arity}
 140
 141   {\dcd\rm-- There are quite a few more flavours of {\tt IdDetails}...}
 142 \end{tightcode}
 143
 144 % A @ShortName@,\srcloc{basicTypes/NameTypes.lhs} which includes a name string
 145 % and a source-line location for the name's binding occurrence;
 146
 147 In short: everything that later parts of the compiler might want to
 148 know about a local @Id@ is readily at hand.  The same principle holds
 149 true for imported-variable and data-constructor @Ids@ (tuples are an
 150 important enough case to merit special pleading), as well as for other
 151 basic program entities.  Here are a few further notes about the @Id@
 152 fragments above:
 153 \begin{itemize}
 154 \item
 155 A @FullName@\srcloc{basicTypes/NameTypes.lhs} is one that may be
 156 globally visible, with a module-name as well; it may have been
 157 renamed.
 158 \item
 159 @DataConKey@\srcloc{prelude/PrelUniqs.lhs} is a specialised
 160 fast-comparison key for data constructors; there are several of these
 161 kinds of things.
 162 \item
 163 The @UniType@ for @DataConIds@ is given in {\em two} ways: once, just as
 164 a plain type; secondly, split up into its component parts.  This is to
 165 save frequently re-splitting such types.
 166 \item
 167 Similarly, a @TupleCon@ has a type attached, even though we could
 168 construct it just from the arity.
 169 \end{itemize}
 170
 171 \subsubsection{@UniTypes@, representing types in the compiler}
 172 \label{sec:UniType}
 173
 174 Let us look further at @UniTypes@.\srcloc{uniType/} Their definition
 175 is:
 176 \begin{tightcode}\codeallowbreaks{}
 177 data UniType
 178   = UniTyVar    TyVar
 179
 180   | UniFun      UniType         {\dcd\rm-- function type}
 181                 UniType
 182
 183   | UniData     TyCon           {\dcd\rm-- non-synonym datatype}
 184                 [UniType]
 185
 186   | UniSyn      TyCon           {\dcd\rm-- type synonym}
 187                 [UniType]       {\dcd\rm--\ \ unexpanded form}
 188                 UniType         {\dcd\rm--\ \ expanded form}
 189
 190   | UniDict     Class           {\dcd\rm-- for types with overloading}
 191                 UniType
 192
 193   {\dcd\rm-- The next two are to do with universal quantification.}
 194   | UniTyVarTemplate TyVarTemplate
 195
 196   | UniForall   TyVarTemplate
 197                 UniType
 198 \end{tightcode}
 199 When the typechecker processes a source module, it adds @UniType@
 200 information to all the basic entities (e.g., @Ids@), among other
 201 places (see Section~\ref{sec:second-order} for more details).  These
 202 types are used throughout the compiler.
 203
 204 The following example shows several things about @UniTypes@.
 205 If a programmer wrote @(Eq a) => a -> [a]@, it would be represented
 206 as:\footnote{The internal structures of @Ids@,
 207 @Classes@, @TyVars@, and @TyCons@ are glossed over here...}
 208 \begin{tightcode}\codeallowbreaks{}
 209 UniForall {\dcd$\alpha$}
 210       (UniFun (UniDict {\dcd\em Eq} (UniTyVar {\dcd$\alpha$}))
 211               (UniFun (UniTyVarTemplate {\dcd$\alpha$})
 212                       (UniData {\dcd\em listTyCon}
 213                                [UniTyVarTemplate {\dcd$\alpha$}])))
 214 \end{tightcode}
 215 From this example we see:
 216 \begin{itemize}
 217 \item
 218 The universal quantification of the type variable $\alpha$ is made explicit
 219 (with a @UniForall@).
 220 \item
 221 The class assertion @(Eq a)@ is represented with a @UniDict@ whose
 222 second component is a @UniType@---this
 223 reflects the fact that we expect @UniType@ to be used in a stylized
 224 way, avoiding nonsensical constructs (e.g.,
 225 @(UniDict f (UniDict g (UniDict h ...)))@).
 226 \item
 227 The ``double arrow'' (@=>@) of the Haskell source, indicating an
 228 overloaded type, is represented by the usual
 229 @UniFun@ ``single arrow'' (@->@), again in a stylized way.
 230 This change reflects the fact that each class assertion in a
 231 function's type is implemented by adding an extra dictionary argument.
 232 \item
 233 In keeping with the memoising tradition we saw with @Ids@, type
 234 synonyms (@UniSyns@) keep both the unexpanded and expanded forms handy.
 235 \end{itemize}
 236
 237 \subsubsection{@IdInfo@: extra pragmatic info about an @Id@}
 238 \label{sec:IdInfo}
 239
 240 [New in 0.16.]  All the nice-to-have-but-not-essential information
 241 about @Ids@ is now hidden in the
 242 @IdInfo@\srcloc{basicTypes/IdInfo.lhs} datatype.  It looks something
 243 like:
 244 \begin{tightcode}\codeallowbreaks{}
 245 data IdInfo
 246   = NoIdInfo             {\dcd\rm-- OK, we know nothing...}
 247
 248   | MkIdInfo
 249       ArityInfo          {\dcd\rm-- its arity}
 250       DemandInfo         {\dcd\rm-- whether or not it is definitely demanded}
 251       InliningInfo       {\dcd\rm-- whether or not we should inline it}
 252       SpecialisationInfo {\dcd\rm-- specialisations of this overloaded}
 253                          {\dcd\rm-- function which exist}
 254       StrictnessInfo     {\dcd\rm-- strictness properties, notably}
 255                          {\dcd\rm-- how to conjure up ``worker'' functions}
 256       WrapperInfo        {\dcd\rm-- ({\em informational only}) if an Id is}
 257                          {\dcd\rm-- a ``worker,'' this says what Id it's}
 258                          {\dcd\rm-- a worker for, i.e., ``who is my wrapper''}
 259                          {\dcd\rm-- (used to match up workers/wrappers)}
 260       UnfoldingInfo      {\dcd\rm-- its unfolding}
 261       UpdateInfo         {\dcd\rm-- which args should be updated}
 262       SrcLocation        {\dcd\rm-- source location of definition}
 263 \end{tightcode}
 264 As you can see, we may accumulate a lot of information about an Id!
 265 (The types for all the sub-bits are given in the same place...)
 266
 267 \subsubsection{Introducing dictionaries for overloading}
 268
 269 The major novel feature of the Haskell language is its systematic
 270 overloading using {\em type classes}; Wadler and Blott's paper is the
 271 standard reference \cite{wadler89a}.
 272
 273 To implement type classes, the typechecker not only checks the Haskell
 274 source program, it also {\em translates} it---by inserting code to
 275 pass around in {\em dictionaries}.  These dictionaries
 276 are essentially tuples of functions, from which the correct code may
 277 be plucked at run-time to give the desired effect.  Kevin Hammond
 278 wrote and described the first working implementation of type
 279 classes \cite{hammond89a}, and the ever-growing static-semantics paper
 280 by Peyton Jones and Wadler is replete with the glories of dictionary
 281 handling \cite{peyton-jones90a}.  (By the way, the typechecker's
 282 structure closely follows the static semantics paper; inquirers into
 283 the former will become devoted students of the latter.)
 284
 285 Much of the abstract-syntax datatypes are given
 286 over to output-only translation machinery.  Here are a few more
 287 fragments of the @Expr@ type, all of which appear only in typechecker
 288 output:
 289 \begin{tightcode}
 290 data Expr var pat =
 291   ...
 292   | DictLam    [DictVar]       (Expr var pat)
 293   | DictApp    (Expr var pat)  [DictVar]
 294   | Dictionary [DictVar]       [Id]
 295   | SingleDict DictVar
 296   ...
 297 \end{tightcode}
 298 You needn't worry about this stuff:
 299 After the desugarer gets through with such constructs, there's nothing
 300 left but @Ids@, tuples, tupling functions, etc.,---that is, ``plain
 301 simple stuff'' that should make the potential optimiser's heart throb.
 302 Optimisation passes don't deal with dictionaries explicitly but, in
 303 some cases, quite a bit of the code passed through to them will be for
 304 dictionary-handling.