%************************************************************************ %* * \subsection{The front end of the compiler} \label{sec:front-end} %* * %************************************************************************ The previous section covered the main points about the front end of the compiler: it is dominated by a ``renamer'' and a typechecker working directly at the Haskell source level. In this section, we will look at some basic datatypes used or introduced in the front end---ones that you may see as input to your optimisation pass. \subsubsection{``Abstract syntax'', a source-level datatype} \label{sec:AbsSyntax} As Figure~\ref{fig:overview} shows, the typechecker both reads and writes a collection of datatypes we call ``Abstract syntax.'' This is misleading in that what goes into the typechecker is quite different from what comes out. Let's first consider this fragment of the abstract-syntax definition\srcloc{abstractSyn/HsExpr.lhs}, for Haskell explicit-list expressions (Haskell report, section~3.5 \cite{hudak91a}):\nopagebreak[4] \begin{mytightcode} data Expr var pat = ... | ExplicitList [Expr var pat] | ExplicitListOut UniType [Expr var pat] ... type ProtoNameExpr = Expr ProtoName ProtoNamePat type RenamedExpr = Expr Name RenamedPat type TypecheckedExpr = Expr Id TypecheckedPat\end{mytightcode} an @ExplicitList@ appears only in typechecker input; an @ExplicitListOut@ is the corresponding construct that appears only in the output, with the inferred type information attached. The fragment above also shows how abstract syntax is {\em parameterised} with respect to variables and patterns. The idea is the same for both; let's just consider variables. The renamer converts @ProtoNameExprs@ (expressions with @ProtoNames@\srcloc{basicTypes/ProtoName.lhs} as variables---little more than just strings) into @RenamedExprs@, which have all naming sorted out (using @Names@\srcloc{abstractSyn/Name.lhs}). A @Name@ is known to be properly bound, isn't duplicated, etc.; it's known if it's bound to a built-in standard-prelude entity. (The renamer also does dependency analysis, which is required to maximise polymorphism in a Hindley-Milner type system.) The typechecker reads the @RenamedExprs@, sorts out the types, and spits out @TypecheckedExprs@, with variables represented by @Ids@\srcloc{basicTypes/Id.lhs}. You can find out just about anything you want to know about a variable from its @Id@. To see what GHC makes of some Haskell, in a file @Foo.hs@, say: try @ghc -noC -ddump-rn4 Foo.hs@, to see what comes out of the renamer [pass~4]; try @ghc -noC -ddump-tc Foo.hs@, to see what comes out of the typechecker. \subsubsection{Basic datatypes in the compiler} None of the internal datatypes in the example just given are particularly interesting except @Ids@\srcloc{basicTypes/Id.lhs}. A program variable, which enters the typechecker as a string, emerges as an @Id@. The important thing about @Id@---and the datatypes representing other basic program entities (type variables, type constructors, classes, etc.)---is that they often include {\em memoised} information that can be used throughout the rest of the compiler. Let us take a cursory look at @Ids@, as a representative example of these basic data types. (Don't be too scared---@Ids@ are the hairiest entities in the whole compiler!) Here we go: \begin{mytightcode} \codeallowbreaks{}data Id = Id Unique {\dcd\rm-- key for fast comparison} UniType {\dcd\rm-- Id's type; used all the time;} IdInfo {\dcd\rm-- non-essential info about this Id;} PragmaInfo {\dcd\rm-- user-specified pragmas about this Id;} IdDetails {\dcd\rm-- stuff about individual kinds of Ids.}\end{mytightcode} So, every @Id@ comes with: \begin{enumerate} \item A @Unique@\srcloc{basicTypes/Unique.lhs}, essentially a unique @Int@, for fast comparison; \item A @UniType@ (more on them below... section~\ref{sec:UniType}) giving the variable's type---this is the most useful memoised information. \item A @PragmaInfo@, which is pragmatic stuff the user specified for this @Id@; e.g., @INLINE@ it; GHC does not promise to honour these pragma requests, but this is where it keeps track of them. \item An @IdInfo@ (more on {\em them} below... section~\ref{sec:IdInfo}), which tells you all the extra things that the compiler doesn't {\em have} to know about an @Id@, but it's jolly nice... This corresponds pretty closely to the @GHC_PRAGMA@ cross-module info that you will see in any interface produced using @ghc -O@. An example of some @IdInfo@ would be: that @Id@'s unfolding; or its arity. \end{enumerate} Then the fun begins with @IdDetails@... \begin{mytightcode} \codeallowbreaks{}data IdDetails {\dcd\rm---------------- Local values} = LocalId ShortName {\dcd\rm-- mentioned by the user} | SysLocalId ShortName {\dcd\rm-- made up by the compiler} {\dcd\rm---------------- Global values} | ImportedId FullName {\dcd\rm-- Id imported from an interface} | PreludeId FullName {\dcd\rm-- Prelude things the compiler ``knows'' about} | TopLevId FullName {\dcd\rm-- Top-level in the orig source pgm} {\dcd\rm-- (not moved there by transformations).} {\dcd\rm---------------- Data constructors} | DataConId FullName ConTag [TyVarTemplate] ThetaType [UniType] TyCon {\dcd\rm-- split-up type: the constructor's type is:} {\dcd\rm-- $\forall~tyvars . theta\_ty \Rightarrow$} {\dcd\rm-- $unitype_1 \rightarrow~ ... \rightarrow~ unitype_n \rightarrow~ tycon tyvars$} | TupleCon Int {\dcd\rm-- its arity} {\dcd\rm-- There are quite a few more flavours of {\tt IdDetails}...}\end{mytightcode} % A @ShortName@,\srcloc{basicTypes/NameTypes.lhs} which includes a name string % and a source-line location for the name's binding occurrence; In short: everything that later parts of the compiler might want to know about a local @Id@ is readily at hand. The same principle holds true for imported-variable and data-constructor @Ids@ (tuples are an important enough case to merit special pleading), as well as for other basic program entities. Here are a few further notes about the @Id@ fragments above: \begin{itemize} \item A @FullName@\srcloc{basicTypes/NameTypes.lhs} is one that may be globally visible, with a module-name as well; it may have been renamed. \item @DataConKey@\srcloc{prelude/PrelUniqs.lhs} is a specialised fast-comparison key for data constructors; there are several of these kinds of things. \item The @UniType@ for @DataConIds@ is given in {\em two} ways: once, just as a plain type; secondly, split up into its component parts. This is to save frequently re-splitting such types. \item Similarly, a @TupleCon@ has a type attached, even though we could construct it just from the arity. \end{itemize} \subsubsection{@UniTypes@, representing types in the compiler} \label{sec:UniType} Let us look further at @UniTypes@\srcloc{uniType/}. Their definition is: \begin{mytightcode} \codeallowbreaks{}data UniType = UniTyVar TyVar | UniFun UniType {\dcd\rm-- function type} UniType | UniData TyCon {\dcd\rm-- non-synonym datatype} [UniType] | UniSyn TyCon {\dcd\rm-- type synonym} [UniType] {\dcd\rm--\ \ unexpanded form} UniType {\dcd\rm--\ \ expanded form} | UniDict Class {\dcd\rm-- for types with overloading} UniType {\dcd\rm-- The next two are to do with universal quantification.} | UniTyVarTemplate TyVarTemplate | UniForall TyVarTemplate UniType\end{mytightcode} When the typechecker processes a source module, it adds @UniType@ information to all the basic entities (e.g., @Ids@), among other places (see Section~\ref{sec:second-order} for more details). These types are used throughout the compiler. The following example shows several things about @UniTypes@. If a programmer wrote @(Eq a) => a -> [a]@, it would be represented as:\footnote{The internal structures of @Ids@, @Classes@, @TyVars@, and @TyCons@ are glossed over here...} \begin{mytightcode} \codeallowbreaks{}UniForall {\dcd$\alpha$} (UniFun (UniDict {\dcd\em Eq} (UniTyVar {\dcd$\alpha$})) (UniFun (UniTyVarTemplate {\dcd$\alpha$}) (UniData {\dcd\em listTyCon} [UniTyVarTemplate {\dcd$\alpha$}])))\end{mytightcode} From this example we see: \begin{itemize} \item The universal quantification of the type variable $\alpha$ is made explicit (with a @UniForall@). \item The class assertion @(Eq a)@ is represented with a @UniDict@ whose second component is a @UniType@---this reflects the fact that we expect @UniType@ to be used in a stylized way, avoiding nonsensical constructs (e.g., @(UniDict f (UniDict g (UniDict h ...)))@). \item The ``double arrow'' (@=>@) of the Haskell source, indicating an overloaded type, is represented by the usual @UniFun@ ``single arrow'' (@->@), again in a stylized way. This change reflects the fact that each class assertion in a function's type is implemented by adding an extra dictionary argument. \item In keeping with the memoising tradition we saw with @Ids@, type synonyms (@UniSyns@) keep both the unexpanded and expanded forms handy. \end{itemize} \subsubsection{@IdInfo@: extra pragmatic info about an @Id@} \label{sec:IdInfo} [New in 0.16.] All the nice-to-have-but-not-essential information about @Ids@ is now hidden in the @IdInfo@\srcloc{basicTypes/IdInfo.lhs} datatype. It looks something like: \begin{mytightcode} \codeallowbreaks{}data IdInfo = NoIdInfo {\dcd\rm-- OK, we know nothing...} | MkIdInfo ArityInfo {\dcd\rm-- its arity} DemandInfo {\dcd\rm-- whether or not it is definitely demanded} InliningInfo {\dcd\rm-- whether or not we should inline it} SpecialisationInfo {\dcd\rm-- specialisations of this overloaded} {\dcd\rm-- function which exist} StrictnessInfo {\dcd\rm-- strictness properties, notably} {\dcd\rm-- how to conjure up ``worker'' functions} WrapperInfo {\dcd\rm-- ({\em informational only}) if an Id is} {\dcd\rm-- a ``worker,'' this says what Id it's} {\dcd\rm-- a worker for, i.e., ``who is my wrapper''} {\dcd\rm-- (used to match up workers/wrappers)} UnfoldingInfo {\dcd\rm-- its unfolding} UpdateInfo {\dcd\rm-- which args should be updated} SrcLocation {\dcd\rm-- source location of definition}\end{mytightcode} As you can see, we may accumulate a lot of information about an Id! (The types for all the sub-bits are given in the same place...) \subsubsection{Introducing dictionaries for overloading} The major novel feature of the Haskell language is its systematic overloading using {\em type classes}; Wadler and Blott's paper is the standard reference \cite{wadler89a}. To implement type classes, the typechecker not only checks the Haskell source program, it also {\em translates} it---by inserting code to pass around in {\em dictionaries}. These dictionaries are essentially tuples of functions, from which the correct code may be plucked at run-time to give the desired effect. Kevin Hammond wrote and described the first working implementation of type classes \cite{hammond89a}, and the ever-growing static-semantics paper by Peyton Jones and Wadler is replete with the glories of dictionary handling \cite{peyton-jones90a}. (By the way, the typechecker's structure closely follows the static semantics paper; inquirers into the former will become devoted students of the latter.) Much of the abstract-syntax datatypes are given over to output-only translation machinery. Here are a few more fragments of the @Expr@ type, all of which appear only in typechecker output: \begin{mytightcode} data Expr var pat = ... | DictLam [DictVar] (Expr var pat) | DictApp (Expr var pat) [DictVar] | Dictionary [DictVar] [Id] | SingleDict DictVar ...\end{mytightcode} You needn't worry about this stuff: After the desugarer gets through with such constructs, there's nothing left but @Ids@, tuples, tupling functions, etc.,---that is, ``plain simple stuff'' that should make the potential optimiser's heart throb. Optimisation passes don't deal with dictionaries explicitly but, in some cases, quite a bit of the code passed through to them will be for dictionary-handling.