docs/ext-core/core.tex

   1 \documentclass[10pt]{article}
   2 \usepackage{a4wide}
   3 \usepackage{code}
   4
   5
   6 \sloppy
   7 \setlength{\parskip}{0.5\baselineskip plus 0.2\baselineskip minus 0.1\baselineskip}
   8 \setlength{\parsep}{\parskip}
   9 \setlength{\topsep}{0cm}
  10 \setlength{\parindent}{0cm}
  11 %\oddsidemargin -0.5 in
  12 %\evensidemargin -0.5 in
  13 %\textwidth 7.375 in
  14
  15 \newcommand{\derives}{\mbox{$\rightarrow$}}
  16 \newcommand{\orderives}{\mbox{$\mid$}}
  17 \newcommand{\many}[1]{\{ {#1} \}}
  18 \newcommand{\oneormore}[1]{\{ {#1} \}$^{+}$}
  19 \newcommand{\optional}[1]{[ {#1} ]}
  20
  21 \newcommand{\at}{\texttt{@}}
  22 \newcommand{\att}{@}
  23 \newcommand{\lam}{\texttt{\char`\\}}
  24
  25 \newcommand{\workingnote}[1]%
  26         {\begin{quote}
  27           \framebox{\parbox{.8 \linewidth}
  28                            {\textbf{\textsl{Working note:}} \textsl{#1}}}
  29           \end{quote}}
  30
  31 \begin{document}
  32
  33 \title{An External Representation for the GHC Core Language (DRAFT for GHC5.02)}
  34 \author{Andrew Tolmach ({\tt apt@cs.pdx.edu})\\and The GHC Team}
  35
  36 \maketitle
  37 \makeatactive
  38
  39 \abstract{
  40 This document provides a precise definition for the GHC Core language,
  41 so that it can be used to communicate between GHC and new stand-alone
  42 compilation tools such as back-ends or optimizers.
  43 The definition includes a formal grammar and an informal semantics.
  44 An executable typechecker and interpreter (in Haskell),
  45 which formally embody the static and dynamic semantics,
  46 are available separately.
  47
  48 Note: This is a draft document, which attempts to describe GHC's current
  49 behavior as precisely as possible.  Working notes scattered throughout indicate
  50 areas where further work is needed.  Constructive comments are very welcome,
  51 both on  the presentation, and on ways in which GHC could be improved in order
  52 to simplify the Core story.
  53 }
  54
  55 \section{Introduction}
  56
  57 The Glasgow Haskell Compiler (GHC) uses an intermediate language, called
  58 ``Core,''  as its internal program representation during
  59 several key stages of compiling.
  60 Core resembles a subset of Haskell, but with explicit type annotations
  61 in the style of the polymorphic lambda calculus (F$_\omega$).
  62 GHC's front end translates full Haskell 98 (plus some extensions) into
  63 well-typed Core, which is then repeatedly rewritten by the GHC optimizer.
  64 Ultimately, GHC translates Core into STG-machine code and then into
  65 C or native code.  The rationale for the design of Core and its use are discussed
  66 in existing papers~\cite{ghc-inliner,comp-by-trans-scp}, although the (two different)
  67 idealized versions of Core described therein differ in significant ways
  68 from the actual Core language in current GHC.
  69
  70 Researchers interested in writing just {\it part} of a Haskell compiler,
  71 such as a new back-end or a new optimizer pass, might like to make
  72 use of GHC to provide the other parts of the compiler.  For example, they
  73 might like to use GHC's front end to parse, desugar, and type-check source Haskell,
  74 then feeding the resulting code to their own back-end tool.
  75 Currently, they can only do this by linking their code into the
  76 GHC executable, which is an arduous process (and essentially requires
  77 the new code to be written in Haskell).   It would be much easier for
  78 external developers if GHC could be made to produce Core files in
  79 an agreed-upon external format.  To allow the widest range of interoperability,
  80 the external format should be text-based; pragmatically, it should
  81 also be human-readable. (It may ultimately be desirable to use a
  82 standard interchange base format such as ASDL or XML.)
  83
  84 In the past, Core has had no rigorously defined external representation, although
  85 by setting certain compiler flags, one can get a (rather ad-hoc) textual
  86 representation to be printed at various points in the compilation process;
  87 this is usually done to help debug the compiler.  To make Core fully useable
  88 a bi-directional communication format, it will be necssary to
  89
  90 \begin{enumerate}
  91 \item define precisely the external format of Core;
  92
  93 \item modify GHC to produce external Core files, if so requested, at one or more
  94 useful points in the compilation sequence -- e.g., just before optimization,
  95 or just after;
  96
  97 \item modify GHC to accept external Core files in place of Haskell
  98 source files, again at one or more useful points.
  99
 100 \end{enumerate}
 101
 102 The first two facilities will let one couple GHC's front-end (parser,
 103 type-checker, etc.), and optionally its optimizer, with new back-end tools.
 104 Adding the last facility will let one implement new Core-to-Core
 105 transformations in an external tool and integrate them into GHC. It will also
 106 allow new front-ends to generate Core that can be fed into GHC's optimizer or
 107 back end; however, because there are many (undocumented)
 108 idiosynracies in the way GHC produces Core from source Haskell, it will be hard
 109 for an external tool to produce Core that can be integrated with GHC-produced core
 110 (e.g., for the Prelude), and we don't aim to support this.
 111
 112 This document addresses the first requirement, a formal Core definition,
 113 by proposing  a formal grammar for an external representation of Core
 114 (Section~\ref{sec:external}, and
 115 an informal semantics (Section~\ref{sec:informal}.
 116
 117 Beginning in GHC5.02, external Core (post-optimization) adhering to this definition
 118 can be generated using the compiler flag @-fext-core@.
 119
 120 Formal static and dynamic semantics in the form of an executable typechecker and interpreter
 121 are available separately in the GHC source tree under @fptools/ghc/utils/ext-core@.
 122
 123 \section{External Grammar of Core}
 124 \label{sec:external}
 125
 126 In designing the external grammar, we have tried to strike a balance among
 127 a number of competing goals, including easy parseability by machines,
 128 easy readability by humans, and adequate structural simplicity to
 129 allow straightforward presentations of the semantics. This has inevitably
 130 led to certain compromise. In particular:
 131
 132 \begin{itemize}
 133 \item In order to avoid explosion of parentheses, various standard precedences
 134 and short-cuts are supported for expressions, types, and kinds; this led to the introduction
 135 of multiple non-terminals for each of these syntactic categories, which
 136 makes the concrete grammar longer and more complex than the underlying abstract syntax.
 137
 138 \item On the other hand, the grammar has been kept simpler by avoiding special syntax for
 139 tuple types and terms; tuples (both boxed and unboxed) are treated
 140 as ordinary constructors.
 141
 142 \item All type abstractions and applications are given in full, even though
 143 some of them (e.g., for tuples) could be reconstructed; this permits Core to
 144 be parsed without the necessity of performing any type reconstruction.
 145
 146 \item The syntax of identifiers is heavily restricted (essentially to just
 147 alphanumerics); this again makes Core easier to parse but harder to read.
 148 \end{itemize}
 149
 150 \workingnote{These choices are certainly debatable.  In particular, keeping
 151 type applications on tuples and case arms considerably increases the size of core files and
 152 makes them less human-readable, though it allows a Core parser to be simpler.}
 153
 154 We use the following notational conventions for syntax:
 155
 156 \begin{tabular}{ll}
 157 {\it [ pat ]} & optional \\
 158 {\it \{ pat \}} & zero or more repetitions \\
 159 {\it \{ pat \}$^{+}$} & one or more repetitions \\
 160 {\it pat$_1$ \orderives\ pat$_2$} & choice \\
 161 @fibonacci@ & terminal syntax in typewriter font \\
 162 \end{tabular}
 163
 164 {\it
 165 \begin{tabular}{lrclr}
 166 {\rm Module} &   module &        \derives &
 167         \multicolumn{2}{l}{@\%module@ mident \many{tdef @;@} \many{\optional{@\%local@} vdefg @;@}} \\
 168 \\
 169 {\rm Type defn.} &       tdef &  \derives &     @%data@ qtycon \many{tbind} @=@ @{@ cdef \many{@;@ cdef} @}@ & {\rm algebraic type}\\
 170                  &       &       \orderives &    @%newtype@ qtycon \many{tbind} \optional{@=@ ty} &      {\rm newtype} \\
 171 \\
 172 {\rm Constr. defn.} &   cdef &   \derives &      qdcon \many{@\at@ tbind} \many{aty} \\
 173 \\
 174 {\rm Value defn.}  &    vdefg &  \derives &      @%rec@ @{@ vdef \many{@;@ vdef} @}@ &                   {\rm recursive} \\
 175                    &    &        \orderives &    vdef &                                                  {\rm non-recursive} \\
 176                    &    vdef  &  \derives & qvar @::@ ty @=@ exp & \\
 177 \\
 178 {\rm Atomic expr.} &     aexp &  \derives &      qvar &                                                 {\rm variable} \\
 179                  &      &        \orderives &    qdcon &                                                {\rm data constructor}\\
 180                  &      &        \orderives &    lit &                                                  {\rm literal} \\
 181                  &      &        \orderives &    @(@ exp @)@ &                                          {\rm nested expr.}\\
 182 \\
 183 {\rm Expression} &      exp   &  \derives    &   aexp &                                                 {\rm atomic expresion}\\
 184                  &      &       \orderives  &    aexp \oneormore{arg} &                                 {\rm application}\\
 185                  &      &        \orderives &    @\@ \oneormore{binder} @->@ exp &                      {\rm abstraction}\\
 186                  &      &        \orderives &    @%let@ vdefg @%in@ exp &                               {\rm local definition}\\
 187                  &      &        \orderives &    @%case@ exp @%of@ vbind @{@ alt \many{@;@ alt} @}@ &   {\rm case expression}\\
 188                  &      &        \orderives &    @%coerce@ aty exp &                                    {\rm type coercion}\\
 189                  &      &        \orderives &    @%note@  @"@ \many{char} @"@ exp &                     {\rm expression note}\\
 190                  &      &        \orderives &    @%external@ @"@ \many{char} @"@ aty &                  {\rm external reference}\\
 191 \\
 192 {\rm Argument}   &      arg &   \derives &       \at\ aty &                                             {\rm type argument}\\
 193                  &      &        \orderives &    aexp &                                                 {\rm value argument} \\
 194 \\
 195 {\rm Case alt.} &       alt &    \derives &     qdcon  \many {@\at@ tbind} \many{vbind} @->@ exp &{\rm constructor alternative}\\
 196                 &       &        \orderives &    lit @->@ exp &                         {\rm literal alternative} \\
 197                 &       &        \orderives &    @%_@ @->@ exp &                                {\rm default alternative} \\
 198 \\
 199 {\rm Binder}     &      binder & \derives & \at\ tbind  &                                       {\rm type binder}\\
 200                  &              & \orderives &  vbind   &                                               {\rm value binder}\\
 201 \\
 202 {\rm Type binder} &     tbind & \derives   & tyvar & {\rm implicitly of  kind @*@} \\
 203                   &           & \orderives & @(@ tyvar @::@ kind @)@ & {\rm explicitly kinded} \\
 204 \\
 205 {\rm Value binder} &    vbind & \derives &   @(@ var @::@ ty @)@ \\
 206 \\
 207 {\rm Literal} &  lit &   \derives &      @(@ [@-@] \oneormore{digit} @::@ ty @)@ & {\rm integer} \\
 208             &   &        \orderives &    @(@ [@-@] \oneormore{digit} @.@ \oneormore{digit} @::@ ty @)@ & {\rm rational} \\
 209             &   &        \orderives &    @(@ @'@ char @'@ @::@ ty @)@ & {\rm character} \\
 210             &   &        \orderives &    @(@ @"@ \many{char} @"@ @::@ ty @)@ & {\rm string} \\
 211 \\
 212 {\rm Character}  & char & \derives & \multicolumn{2}{l}{any ASCII character in range 0x20-0x7E except 0x22,0x27,0x5c}\\
 213                 &       & \orderives & @\x@ hex hex  & {\rm ASCII code escape sequence} \\
 214                 &  hex   & \derives & @0@ \orderives  \ldots \orderives  @9@ \orderives  @a@ \orderives  \ldots \orderives  @f@ \\
 215 \end{tabular}
 216
 217 \begin{tabular}{lrclr}
 218 {\rm Atomic type} & aty & \derives &     tyvar &                                        {\rm type variable} \\
 219           &     &        \orderives &    qtycon &                                       {\rm type constructor}\\
 220           &     &        \orderives &    @(@ ty @)@ &                                   {\rm nested type}\\
 221 \\
 222 {\rm Basic type} & bty  & \derives &    aty  &                                          {\rm atomic type}\\
 223                   &      & \orderives & bty aty &                                       {\rm type application}\\
 224 \\
 225 {\rm Type} &     ty &    \derives   &   bty &                                           {\rm basic type}\\
 226           &     &        \orderives &   @%forall@ \oneormore{tbind} @.@ ty  &           {\rm type abstraction}\\
 227           &     &        \orderives &   bty @->@ ty  &                                  {\rm arrow type construction} \\
 228 \\
 229 {\rm Atomic kind} & akind & \derives &   @*@ &                          {\rm lifted kind}\\
 230                 &       & \orderives &   @#@ &                          {\rm unlifted kind}\\
 231                 &       & \orderives &   @?@ &                          {\rm open kind}\\
 232                 &       & \orderives &   @(@ kind @)@&                  {\rm nested kind}\\
 233 \\
 234 {\rm Kind} &    kind &   \derives &      akind &                        {\rm atomic kind}\\
 235            &    &        \orderives &    akind @->@ kind         &      {\rm arrow kind} \\
 236 \\
 237 {\rm Identifier}        &       mident & \derives &uname &      {\rm module} \\
 238         &       tycon &  \derives &      uname &                {\rm type constr.}  \\
 239         &       qtycon & \derives &      mident @.@  tycon &    {\rm qualified type constr.} \\
 240         &       tyvar &  \derives &      lname &                {\rm type variable} \\
 241         &       dcon &   \derives &      uname &                {\rm data constr.} \\
 242         &       qdcon &  \derives &      mident @.@  dcon &     {\rm qualified data constr.} \\
 243         &       var &    \derives &      lname &                {\rm variable} \\
 244         &       qvar &   \derives &      [ mident @.@ ] var &   {\rm optionally qualified variable} \\
 245 \\
 246 {\rm Name}      &       lname  &  \derives &     lower \many{namechar} \\
 247         &       uname &  \derives &      upper \many{namechar} & \\
 248         &       namechar & \derives &    lower \orderives\  upper \orderives\  digit \orderives\  @'@ \\
 249         &       lower &  \derives &      @a@ \orderives\  @b@ \orderives\  \ldots \orderives\  @z@ \orderives\  @_@ \\
 250         &       upper &  \derives &      @A@ \orderives\  @B@ \orderives\  \ldots \orderives\  @Z@ \\
 251         &       digit &  \derives &      @0@ \orderives\  @1@ \orderives\  \ldots \orderives\  @9@ \\
 252 \\
 253 \end{tabular}
 254 }
 255
 256 \workingnote{Should add some provision for comments.}
 257
 258 \section{Informal Semantics}
 259 \label{sec:informal}
 260
 261 Core resembles a explicitly-typed polymorphic lambda calculus (F$_\omega$), with the addition
 262 of local @let@ bindings, algebraic type definitions, constructors, and @case@ expressions,
 263 and primitive types, literals and operators.
 264 It is hoped that this makes it easy to obtain an informal understanding of Core programs
 265 without elaborate description.   This section therefore concentrates on the less obvious points.
 266
 267 \subsection{Program Organization and Modules}
 268
 269 Core programs are organized into {\em modules}, corresponding directly to source-level Haskell modules.
 270 Each module has a identifying name {\it mident}.
 271
 272 Each module may contain the following kinds of top-level declarations:
 273 \begin{itemize}
 274 \item Algebraic data type declarations, each defining a type constructor and one or more data constructors;
 275 \item Newtype declarations,  corresponding to Haskell @newtype@ declarations, each defining a type constructor; and
 276 \item Value declarations, defining the types and values of top-level variables.
 277 \end{itemize}
 278 No type constructor, data constructor, or top-level value may be declared more than once within a given module.
 279 All the type declarations are (potentially) mutually recursive. Value declarations must be
 280 in dependency order, with explicit grouping of mutually recursive declarations.
 281
 282 Identifiers defined in top-level declarations may be {\it external} or {\it internal}.
 283 External identifiers can be referenced from any other module in
 284 the program, using conventional dot notation (e.g., @PrelBase.Bool@, @PrelBase.True@).
 285 Internal identifiers are visible only within the defining module.
 286 All type and data constructors are external, and are always defined and referenced using
 287 fully qualified names (with dots).  A top-level value is external if it is defined and referenced
 288 using a fully qualified name with a dot (e.g., @MyModule.foo = ...@); otherwise, it is internal
 289 (e.g., @bar = ...@).
 290 Note that the notion of external identifier does not necessarily coincide with that of ``exported''
 291 identifier in a Haskell source module: all constructors are external, even if not exported, and
 292 non-exported values may be external if they are referenced from potentially in-lineable exported values.
 293 Core modules have no explicit import or export lists.
 294 Modules may be mutually recursive.
 295
 296 \workingnote{But in the presence of inter-module recursion, is there much point in
 297 keeping track of recursive groups within modules? Options: (1) don't worry about it;
 298 (2) put all declarations in module (indeed whole program) into one huge recursive pot;
 299 (3) abandon general module recursion, and introduce some kind of import declaration to define the
 300 types (only) of things from external modules that currently introduce module recursion.}
 301
 302 There is also an implicitly-defined module @PrelGHC@, which exports the ``built-in'' types and values
 303 that must be provided by any implementation of Core (including GHC).   Details of this
 304 module are in Section~\ref{sec:prims}.
 305
 306 A Core {\em program} is a collection of distinctly-named modules that includes a module
 307 called @Main@ having an exported value called @main@ of type @PrelIOBase.IO a@ (for some type @a@).
 308
 309 Many modules of interest derive from library modules, such as @PrelBase@, which implement parts of
 310 the Haskell basis library.  In principle, these modules have no special status.  In practice, the
 311 requirement on the type of @Main.main@ implies that every program will contain a large subset of
 312 the Prelude library modules.
 313
 314 \subsection{Namespaces}
 315
 316 There are five distinct name spaces:
 317 \begin{enumerate}
 318 \item module identifiers (@mident@),
 319 \item type constructors (@tycon@),
 320 \item type variables (@tyvar@),
 321 \item data constructors (@dcon@),
 322 \item term variables (@var@).
 323 \end{enumerate}
 324 Spaces (1), (2+3), and (4+5) can be distinguished from each other by context.
 325 To distinguish (2) from (3) and (4) from (5), we require that
 326 (both sorts of) constructors begin with an upper-case character
 327 and that (both sorts of) variables begin with a lower-case character (or @_@).
 328 Primitive types and operators are not syntactically distinguished.
 329
 330 A given variable (type or term) may have multiple (local) definitions within a module.
 331 However, definitions never ``shadow'' one another; that is, the scope of the definition
 332 of a given variable never contains a redefinition of the same variable.  The only exception
 333 to this is that (necessarily closed) types labelling @%external@ expressions may contain
 334 @tyvar@ bindings that shadow outer bindings.
 335
 336 Core generated by GHC makes heavy use of encoded names, in which the characters @Z@ and @z@ are
 337 used to introduce escape sequences for non-alphabetic characters such as dollar sign @$@ (@zd@),
 338 hash @#@ (@zh@), plus @+@ (@zp@), etc.  This is the same encoding used in @.hi@ files and in the
 339 back-end of GHC itself, except that we sometimes change an initial @z@ to @Z@, or vice-versa,
 340 in order to maintain case distinctions.
 341
 342 \subsection{Types and Kinds}
 343
 344 In Core, all type abstractions and applications are explicit.  This make it easy to
 345 typecheck any (closed) fragment.  An full executable typechecker is available separately.
 346
 347 Types are described by type expressions, which
 348 are built from named type constructors and type variables
 349 using type application and universal quantification.
 350 Each type constructor has a fixed arity $\geq 0$.
 351 Because it is so widely used, there is
 352 special infix syntax for the fully-applied function type constructor (@->@).
 353 (The prefix identifier for this constructor is @PrelGHC.ZLzmzgZR@; this should
 354 only appear in unapplied or partially applied form.)
 355 There are also a number of other primitive type constructors (e.g., @Intzh@) that
 356 are predefined in the @PrelGHC@ module, but have no special syntax.
 357 Additional type constructors are
 358 introduced by @%data@ and @%newtype@ declarations, as described below.
 359 Type constructors are distinguished solely by name.
 360
 361 As described in the Haskell definition, it is necessary to distinguish
 362 well-formed type-expressions by classifying them into different {\it kinds}.
 363 In particular, Core explicitly records the kind of every bound type variable.
 364 Base kinds (@*@,@#@, and @?@) represent actual types, i.e., those that can be assigned
 365 to term variables; all the nullary type constructors have one of these kinds.
 366 Non-nullary type constructors have higher kinds of the form $k_1 @->@ k_2$,
 367 where $k_1$ and $k_2$ are kinds.   For example, the function type constructor
 368 @->@ has kind @* -> (* -> *)@.  Since Haskell allows abstracting over type
 369 constructors, it is possible for type variables to have higher kinds; however,
 370 it is much more common for them to have kind @*@, so this is the default if
 371 the kind is omitted in a type binder.
 372
 373 The three base kinds distinguish the {\it liftedness} of the types they classify:
 374 @*@ represents lifted types; @#@ represents unlifted types; and @?@ represents
 375 ``open'' types, which may be either lifted or unlifted.  Of these, only @*@ ever
 376 appears in Core code generated from user code; the other two are needed to describe
 377 certain types in primitive (or otherwise specially-generated) code.
 378 Semantically, a type is lifted if and only if it has bottom as an element.
 379 Operationally, lifted types may be represented by closures; hence, any unboxed
 380 value is necessarily unlifted.
 381 In particular, no top-level identifier (except in @PrelGHC@) has a type of kind @#@ or @?@.
 382 Currently, all the primitive types are unlifted
 383 (including a few boxed primitive types such as @ByteArrayzh@).
 384 The ideas behind the use of unboxed and unlifted types are described in ~\cite{pj:unboxed}.
 385
 386 There is no mechanism for defining type synonyms (corresponding to
 387 Haskell @type@ declarations).
 388 Type equivalence is just syntactic equivalence on type expressions
 389 (of base kinds) modulo:
 390
 391 \begin{itemize}
 392 \item alpha-renaming of variables bound in @%forall@ types;
 393 \item the identity $a$ @->@ $b$ $\equiv$ @PrelGHC.ZLzmzgZR@ $a$ $b$
 394 \item the substitution of representation types for {\it fully applied} instances of newtypes
 395 (see Section~\ref{sec:newtypes}).
 396 \end{itemize}
 397
 398 \subsection{Algebraic data types}
 399
 400 Each @data@ declaration introduces a new type constructor and a set of one or
 401 more data constructors, normally corresponding directly to a source Haskell @data@ declaration.
 402 For example, the source declaration
 403 \begin{code}
 404 data Bintree a =
 405    Fork (Bintree a) (Bintree a)
 406 |  Leaf a
 407 \end{code}
 408 might induce the following Core declaration
 409 \begin{code}
 410 %data Bintree a = {
 411    Fork (Bintree a) (Bintree a);
 412    Leaf a)}
 413 \end{code}
 414 which introduces the unary type constructor @Bintree@ of kind @*->*@  and two data constructors with types
 415 \begin{code}
 416 Fork :: %forall a . Bintree a -> Bintree a -> Bintree a
 417 Leaf :: %forall a . a -> Bintree a
 418 \end{code}
 419 We define the {\it arity} of each data constructor to be the number of value arguments it takes;
 420 e.g. @Fork@ has arity 2 and @Leaf@ has arity 1.
 421
 422 For a less conventional example illustrating the possibility of higher-order kinds, the Haskell source declaration
 423 \begin{code}
 424 data A f a = MkA (f a)
 425 \end{code}
 426 might induce the core declaration
 427 \begin{code}
 428 %data A (f::*->*) (a::*) = { MkA (f a) }
 429 \end{code}
 430 which introduces the constructor
 431 \begin{code}
 432 MkA :: %forall (f::*->*) (a::*) . (f a) -> (A f) a
 433 \end{code}
 434
 435
 436 GHC (like some other Haskell implementations) supports an extension to Haskell98
 437 for existential types such as
 438 \begin{code}
 439 data T = forall a . MkT a (a -> Bool)
 440 \end{code}
 441 This is represented by the Core declaration
 442 \begin{code}
 443 %data T = {MkT @a a (a -> Bool)}
 444 \end{code}
 445 which introduces the nullary type constructor @T@ and the data constructor
 446 \begin{code}
 447 MkT :: %forall a . a -> (a -> Bool) -> T
 448 \end{code}
 449 In general, existentially quantified variables appear as extra univerally
 450 quantified variables in the data contructor types.
 451 An example of how to construct and deconstruct values of type @T@ is shown in
 452 Section~\ref{sec:exprs}.
 453
 454 \subsection{Newtypes}
 455 \label{sec:newtypes}
 456
 457
 458 Each Core @%newtype@ declaration introduces a new type constructor and (usually) an associated
 459 representation type, corresponding to a source Haskell @newtype@
 460 declaration.  However, unlike in source Haskell, no data constructors are introduced.
 461 In fact, newtypes seldom appear in value types
 462 in Core programs, because GHC usually replaces them with their representation type.
 463 For example, the Haskell fragment
 464 \begin{code}
 465 newtype U = MkU Bool
 466 u = MkU True
 467 v = case u of
 468   MkU b -> not b
 469 \end{code}
 470 might induce the Core fragment
 471 \begin{code}
 472 %newtype U = Bool;
 473 u :: Bool = True;
 474 v :: Bool =
 475    %let b :: Bool = u
 476    %in not b;
 477 \end{code}
 478 The main purpose of including @%newtype@ declarations in Core is to permit checking of
 479 type expressions in which partially-applied newtype constructors are used to instantiate higher-kinded
 480 type variables.  For example:
 481 \begin{code}
 482 newtype W a = MkW (Bool -> a)
 483 data S k = MkS (k Bool)
 484 a :: S W = MkS (MkW(\x -> not x))
 485 \end{code}
 486 might generate this Core:
 487 \begin{code}
 488 %newtype W a = Bool -> a;
 489 %data S (k::(*->*)) = MkS (k Bool);
 490 a :: S W = MkS @ W (\(x::Bool) -> not x)
 491 \end{code}
 492 The type application @(S W)@ cannot be checked without a definition for @W@.
 493
 494 Very rarely, source @newtype@ declarations may be (directly or indirectly) recursive. In such
 495 cases, it is not possible to subsitute the representation type for the new type;
 496 in fact, the representation type is omitted from the corresponding Core @%newtype@ declaration.
 497 Elements of the new
 498 type can only be created or examined by first explicitly coercing them from/to
 499 the representation type, using a @%coerce@ expression.  For example, the silly
 500 Haskell fragment
 501 \begin{code}
 502 newtype U = MkU (U -> Bool)
 503 u = MkU (\x -> True)
 504 v = case u of
 505   MkU f -> f u
 506 \end{code}
 507 might induce the Core fragment
 508 \begin{code}
 509 %newtype U;
 510 u :: U = %coerce U (\ (x::U) -> True);
 511 v :: Bool =
 512    %let f :: U -> Bool = %coerce (U -> Bool) u
 513    %in f u;
 514 \end{code}
 515
 516 \workingnote{The treatment of newtypes is still very unattractive: acres of explanation for
 517 very rare phenomena.}
 518
 519 \subsection{Expression Forms}
 520 \label{sec:exprs}
 521
 522 Variables and data constructors are straightforward.
 523
 524 Literal ({\it lit}) expressions consist of a literal value, in one of four different formats,
 525 and a (primitive) type annotation.   Only certain combinations of format and type
 526 are permitted; see Section~\ref{sec:prims}.  The character and string formats can describe only
 527 8-bit ASCII characters.  Moreover, because strings are interpreted as C-style null-terminated
 528 strings, they should not contain embedded nulls.
 529
 530 Both value applications and type applications are made explicit, and similarly
 531 for value and type abstractions.  To tell them apart, type arguments in applications
 532 and formal type arguments in abstractions are preceded by an \at\ symbol. (In abstractions,
 533 the \at\ plays essentially the same role as the more usual $\Lambda$ symbol.)
 534 For example, the Haskell source declaration
 535 \begin{code}
 536 f x = Leaf (Leaf x)
 537 \end{code}
 538 might induce the Core declaration
 539 \begin{code}
 540 f :: %forall a . a -> BinTree (BinTree a) =
 541   \ @a (x::a) -> Leaf @(Bintree a) (Leaf @a x)
 542 \end{code}
 543
 544 Value applications may be of user-defined functions, data constructors, or primitives.
 545 None of these sorts of applications are necessarily saturated (although previously published variants
 546 of Core did require the latter two sorts to be).
 547
 548 Note that the arguments of type applications are not always of kind @*@.  For example,
 549 given our previous definition of type @A@:
 550 \begin{code}
 551 data A f a = MkA (f a)
 552 \end{code}
 553 the source code
 554 \begin{code}
 555 MkA (Leaf True)
 556 \end{code}
 557 becomes
 558 \begin{code}
 559 (MkA @Bintree @Bool) (Leaf @Bool True)
 560 \end{code}
 561
 562 Local bindings, of a single variable or of a set of mutually recursive variables,
 563 are represented by @%let@ expressions in the usual way.
 564
 565 By far the most complicated expression form is @%case@.
 566 @%case@ expressions are permitted over values of any type, although they will normally
 567 be algebraic or primitive types (with literal values).
 568 Evaluating a @%case@ forces the evaluation of the expression being
 569 tested (the ``scrutinee''). The value of the scrutinee is bound to the variable
 570 following the @%of@ keyword, which is in scope in all alternatives;
 571 this is useful when the scrutinee is a non-atomic
 572 expression (see next example).
 573
 574 In an algebraic @%case@, all the case alternatives must be
 575 labeled with distinct data constructors from the algebraic type, followed by
 576 any existential type variable bindings (see below), and
 577 typed term variable bindings corresponding to the data constructor's
 578 arguments. The number of variables must match the data constructor's arity.
 579
 580 For example, the following Haskell source expression
 581 \begin{code}
 582 case g x of
 583   Fork l r -> Fork r l
 584   t@(Leaf v) -> Fork t t
 585 \end{code}
 586 might induce the Core expression
 587 \begin{code}
 588 %case g x %of (t::Bintree a)
 589    Fork (l::Bintree a) (r::Bintree a) ->
 590       Fork @a r l
 591    Leaf (v::a) ->
 592       Fork @a t t
 593 \end{code}
 594
 595 When performing a @%case@ over a value of an existentially-quantified algebraic
 596 type, the alternative must include extra local type bindings
 597 for the existentially-quantified variables.  For example, given
 598 \begin{code}
 599 data T = forall a . MkT a (a -> Bool)
 600 \end{code}
 601 the source
 602 \begin{code}
 603 case x of
 604   MkT w g -> g w
 605 \end{code}
 606 becomes
 607 \begin{code}
 608 %case x %of (x'::T)
 609   MkT @b (w::b) (g::b->Bool) -> g w
 610 \end{code}
 611
 612 In a @%case@ over literal alternatives,
 613 all the case alternatives must be distinct literals of the same primitive type.
 614
 615 The list of alternatives may begin with a
 616 default alternative labeled with an underscore (@%_@), which will be chosen if
 617 none of the other alternative match.  The default is optional except for a case
 618 over a primitive type, or when there are no other alternatives.
 619 If the case is over neither an
 620 algebraic type nor a primitive type, the default alternative is the {\it only}
 621 one that can appear.
 622 For algebraic cases, the set of alternatives
 623 need not be exhaustive, even if no default is given; if alternatives are missing,
 624 this implies that GHC has deduced that they cannot occur.
 625
 626 The @%coerce@ expression is primarily used in conjunction with manipulation of
 627 newtypes, as described in Section~\ref{sec:newtypes}.
 628 However, @%coerce@ is sometimes used for
 629 other purposes, e.g. to coerce the return type of a function (such as @error@)
 630 that is guaranteed never to return.  By their natures, uses of @%coerce@ cannot
 631 be independently justified, and must be taken on faith by a type-checker for Core.
 632
 633 A @%note@ expression is used to carry arbitrary internal information of interest to
 634 GHC.  The information must be encoded as a string.  Expression notes currently generated by GHC
 635 include the inlining pragma (@InlineMe@) and cost-center labels for profiling.
 636
 637 A @%external@ expression denotes an external identifier, which has
 638 the indicated type (always expressed in terms of Haskell primitive types).
 639 \workingnote{The present syntax is sufficient for describing C functions and labels.
 640 Interfacing to other languages may require additional information or a different interpretation
 641 of the name string.}
 642
 643
 644 \subsection{Expression Evaluation}
 645
 646 The dynamic semantics of Core are defined on the type-erasure of the program;
 647 ie. we ignore all type abstractions and applications.  The denotational semantics
 648 the resulting type-free program are just the conventional ones for a call-by-name
 649 language, in which expressions are only evaluated on demand.
 650 But Core is intended to be a call-by-{\it{need}} language, in which
 651 expressions are only evaluated {\it once}.  To express the sharing behavior
 652 of call-by-need, we give an operational model in the style of Launchbury.
 653 This section describes the model informally; a more formal semantics is
 654 separately available in the form of an executable interpreter.
 655
 656 To simplify the semantics, we consider only ``well-behaved'' Core programs in which
 657 constructor and primitive applications are fully saturated, and in which
 658 non-trivial expresssions of unlifted kind (@#@) appear only as scrutinees
 659 in @%case@ expressions.  Any program can easily be put into this form;
 660 a separately available executable preprocessor illustrates how.
 661 In the remainder of this section, we use ``Core'' to mean ``well-behaved'' Core.
 662
 663 Evaluating a Core expression means reducing it to {\it weak-head normal form (WHNF)},
 664 i.e., a primitive value, lambda abstraction, or fully-applied data constructor.
 665 Evaluation of a program is evaluation of the expression @Main.main@.
 666
 667 To make sure that expression evaluation is shared, we
 668 make use of a {\it heap}, which can contain
 669 \begin{itemize}
 670 \item {\em Thunks} representing suspended (i.e., as yet unevaluated) expressions.
 671
 672 \item {\em WHNF}s representing the result of evaluating such thunks. Computations over
 673 primitive types are never suspended, so these results are always closures (representing
 674 lambda abstractions) or data constructions.
 675 \end{itemize}
 676 Thunks are allocated when it
 677 is necessary to suspend a computation whose result may be shared.
 678 This occurs when evaluating three different kinds of expressions:
 679 \begin{itemize}
 680 \item Value definitions at top-level or within a local @let@ expression.
 681 Here, the defining expressions are suspended and the defined names
 682 are bound to heap pointers to the suspensions.
 683
 684 \item User function applications.  Here, the actual argument expression is
 685 suspended and the formal argument is bound to a heap pointer to the suspension.
 686
 687 \item Constructor applications. Here, the actual argument expression is
 688 suspended and a heap pointer to the suspension is embedded in the constructed value.
 689 \end{itemize}
 690
 691 As computation proceeds, copies of the heap pointer propagate.
 692 When the computation is eventually forced, the heap entry is overwritten with the resulting
 693 WHNF, so all copies of the pointer now point to this WHNF.   Forcing occurs
 694 only in the context of
 695 \begin{itemize}
 696 \item evaluating the operator expression of an application;
 697
 698 \item evaluating the ``scrutinee'' of a @case@ expression; or
 699
 700 \item evaluating an argument to a primitive or external function application
 701 \end{itemize}
 702
 703 Ultimately, if there are no remaining pointers to the heap entry (whether suspended or evaluated),
 704 the entry can be garbage-collected; this is assumed to happen implicitly.
 705
 706 With the exception of functions, arrays, and mutable variables, the intention is that values of all primitive types
 707 should be held {\it unboxed}, i.e., not heap-allocated.  This causes no problems for laziness because all
 708 primitive types are {\it unlifted}.  Unboxed tuple types are not heap-allocated either.
 709
 710 Certain primitives and @%external@ functions cause side-effects to state threads or to the real world.
 711 Where the ordering of these side-effects matters, Core already forces this order
 712 by means of data dependencies on the psuedo-values representing the threads.
 713
 714 The @raisezh@ and @handlezh@ primitives requires special support in an implementation, such as a handler stack;
 715 again, real-world threading guarantees that they will execute in the correct order.
 716
 717 \section{Primitive Module}
 718 \label{sec:prims}
 719
 720 This section describes the contents and informal semantics of the primitive module @PrimGHC@.
 721 Nearly all the primitives are required in order to cover GHC's implementation of the Haskell98
 722 standard prelude; the only operators that can be completely omitted are those supporting the byte-code interpreter,
 723 parallelism, and foreign objects.  Some of the concurrency primitives are needed, but can be
 724 given degenerate implementations if it desired to target a purely sequential backend; see Section~\ref{sec:sequential}.
 725
 726 In addition to these primitives, a large number of C library functions are required to implement
 727 the full standard Prelude, particularly to handle I/O and arithmetic on less usual types.
 728 % We list these separately in section~\ref{sec:ccalls}.
 729
 730 \subsection{Types}
 731
 732 \begin{tabular}{|l|l|l|}
 733 \hline
 734 Type & Kind & Description \\
 735 \hline
 736 @ZLzmzgZR@ & @* -> * -> *@ & functions (@->@) \\
 737 @Z1H@ & @? -> #@    & unboxed 1-tuple \\
 738 @Z2H@      & @? -> ? -> #@ & unboxed 2-tuple \\
 739 \ldots   & \ldots & \ldots \\
 740 @Z100H@    & @? -> ? -> ? -> ... -> ? -> #@ & unboxed 100-tuple \\
 741 @Addrzh@    & @#@          & machine address (pointer) \\
 742 @Charzh@           & @#@           & unicode character (31 bits) \\
 743 @Doublezh@  & @#@          & double-precision float \\
 744 @Floatzh@   & @#@          & float \\
 745 @Intzh@    & @#@           & int (30+ bits) \\
 746 @Int32zh@   & @#@          & int (32 bits) \\
 747 @Int64zh@   & @#@          & int (64 bits) \\
 748 @Wordzh@           & @#@           & unsigned word (30+ bits) \\
 749 @Word32zh@   & @#@         & unsigned word (32 bits) \\
 750 @Word64zh@   & @#@         & unsigned word (64 bits) \\
 751 @RealWorld@ & @*@          & pseudo-type for real world state \\
 752 @Statezh@    & @* -> #@     & mutable state \\
 753 @Arrayzh@   & @* -> #@     & immutable arrays \\
 754 @ByteArrayzh@   & @#@     & immutable byte arrays \\
 755 @MutableArrayzh@   & @* -> * -> #@     & mutable arrays \\
 756 @MutableByteArrayzh@   & @* -> #@     & mutable byte arrays \\
 757 @MutVarzh@ & @* -> * -> #@  & mutable variables \\
 758 @MVarzh@  & @* -> * -> #@  & synchronized mutable variables \\
 759 @Weakzh@  & @* -> #@      & weak pointers \\
 760 @StablePtrzh@ & @* -> #@  & stable pointers \\
 761 @ForeignObjzh@ & @#@    & foreign object \\
 762 @ThreadIdzh@ & @#@       & thread id \\
 763 @ZCTCCallable@ & @? -> *@ & dictionaries for CCallable pseudo-class \\
 764 @ZCTCReturnable@ & @? -> *@ & dictionaries for CReturnable pseudo-class \\
 765 \hline
 766 \end{tabular}
 767
 768 In addition, the types @PrelBase.Bool@ and @PrelBase.Unit@, which are non-primitive
 769 and are defined as ordinary algebraic types in module @PrelBase@, are used in
 770 the types of some operators in @PrelGHC@.
 771
 772 The unboxed tuple types are quite special: they hold sets of values in an unlifted
 773 context, i.e., to be manipulated directly rather than being stored in the heap.  They can only
 774 appear in limited contexts in programs; in particular, they cannot be bound by a
 775 lambda abstraction or case alternative pattern. Note that they can hold either lifted
 776 or unlifted values.  The limitation to 100-tuples is an arbitrary one set by GHC.
 777
 778 The type of arbitrary precision integers (@Integer@) is not primitive; it is made
 779 up of an ordinary primitive integer (@Intzh@) and a byte array (@ByteArrzh@).
 780 The components of an @Integer@ are passed to primitive operators as two separate
 781 arguments and returned as an unboxed pair.
 782
 783 The @Statezh@ type constructor takes a dummy type argument that is used only
 784 to distinguish different state {\it threads}~\cite{Launchbury94}.
 785 The @RealWorld@ type is used only as an argument to @Statezh@, and represents
 786 the thread of real-world state; it contains just the single value @realWorldzh@.
 787 The mutable data types @MutableArrayzh@,@MutableByteArrayzh@,@MutVarzh@
 788 take an initial type argument of the form @(Statezh@ $t$@)@ for some thread $t$.
 789 The synchronized mutable variable type constructor @MVarzh@ always takes an argument of type
 790 @Statezh RealWorld@.
 791
 792 @Weakzh@ is the type of weak pointers.
 793
 794 @StablePtrzh@ is the type of stable pointers, which are guaranteed not to move
 795 during garbage collections; these are useful in connection with foreign functions.
 796
 797 @ForeignPtrzh@ is the type of foreign pointers.
 798
 799 The dictionary types @ZCTCCallable@ and @ZCTCReturnable@ are just placeholders
 800 which can be represented by a void type;
 801 any code they appear in should be unreachable.
 802
 803 \subsubsection{Non-concurrent Back End}
 804 \label{sec:sequential}
 805
 806 The Haskell98 standard prelude doesn't include any concurrency support, but GHC's
 807 implementation of it relies on the existence of some concurrency primitives.  However,
 808 it never actually forks multiple threads.  Hence, the concurrency primitives can
 809 be given degenerate implementations that will work in a non-concurrent setting,
 810 as follows:
 811 \begin{itemize}
 812 \item  @ThreadIdzh@ can be represented
 813 by a singleton type, whose (unique) value is returned by @myThreadIdzh@.
 814
 815 \item @forkzh@ can just die with an ``unimplemented'' message.
 816
 817 \item @killThreadzh@ and @yieldzh@ can also just die ``unimplemented'' since
 818 in a one-thread world, the only thread a thread can kill is itself, and
 819 if a thread yields the program hangs.
 820
 821 \item @MVarzh a@ can be represented by @MutVarzh (Maybe a)@;
 822 where a concurrent implementation would block, the sequential implementation can
 823 just die with a suitable message (since no other thread exists to unblock it).
 824
 825 \item @waitReadzh@ and @waitWritezh@ can be implemented using a @select@ with no timeout.
 826 \end{itemize}
 827
 828 \subsection{Literals}
 829
 830 Only the following combination of literal forms and types are permitted:
 831
 832 \begin{tabular}{|l|l|l|}
 833 \hline
 834 Literal form & Type & Description \\
 835 \hline
 836 integer &  @Intzh@ & Int \\
 837 %       &  @Int32zh@ & Int32 \\
 838 %       &  @Int64zh@ & Int64 \\
 839         &  @Wordzh@ & Word \\
 840 %       &  @Word32zh@ & Word32 \\
 841 %       &  @Word64zh@ & Word64 \\
 842         &  @Addrzh@ & Address \\
 843         &  @Charzh@ & Unicode character code \\
 844 rational & @Floatzh@ & Float \\
 845          & @Doublezh@ & Double \\
 846 character & @Charzh@ & Unicode character specified by ASCII character\\
 847 string &  @Addrzh@  & Address of specified C-format string \\
 848 \hline
 849 \end{tabular}
 850
 851 \subsection{Data Constructors}
 852
 853 The only primitive data constructors are for unboxed tuples:
 854
 855 \begin{tabular}{|l|l|l|}
 856 \hline
 857 Constructor & Type & Description \\
 858 \hline
 859 @ZdwZ1H@ & @%forall (a::?).a -> Z1H a@ & unboxed 1-tuple \\
 860 @ZdwZ2H@ & @%forall (a1::?) (a2::?).a1 -> a2 -> Z2H a1 a2@ & unboxed 2-tuple \\
 861 \ldots & \ldots & \ldots \\
 862 @ZdwZ100H@ & @%forall (a1::?) (a2::?)... (a100::?) .@ & \\
 863 & \ \ \ @a1 -> a2 -> ... -> a100 -> Z100H a1 a2 ... a100@ & unboxed 100-tuple \\
 864 \hline
 865 \end{tabular}
 866
 867 \subsection{Values}
 868
 869 Operators are (roughly) divided into collections according to the primary
 870 type on which they operate.
 871
 872 \workingnote{How do primitives fail, e.g., on division by zero or
 873 attempting an invalid narrowing coercion?}
 874
 875 \workingnote{The following primop descriptions are automatically generated.
 876 The exact set of primops and their types presented here
 877 depends on the underlying word size at the time of generation; these
 878 were done for 32 bit words.  This is a bit stupid.
 879 More importantly, the word size has a big impact on just what gets produced
 880 in a Core file, but this isn't documented anywhere in the file itself.
 881 Perhaps there should be a global flag in the file?}
 882
 883 \newcommand{\primoptions}[7]{{#1} {#2} {#3} {#4} {#5}}
 884
 885 \newcommand{\primopsection}[2]{\subsubsection{#1}{#2}\vspace*{0.1in}}
 886 \newcommand{\primopdefaults}[1]{Unless otherwise noted, each primop has the following default characteristics: {#1}}
 887
 888 \newcommand{\primopdesc}[8]{
 889 \par\noindent{\texttt{{{#3} :: {#6}}}}
 890 \\{#7} {#8}\\}
 891
 892 \input{prims.tex}
 893
 894 \subsubsection{RealWorld}
 895
 896 There is just one value of type @RealWorld@, namely @realWorldzh@. It is used
 897 only for dependency threading of side-effecting operations.
 898
 899 \begin{thebibliography}{}
 900
 901 \bibitem[Launchbury and {Peyton~Jones}, 1994]{Launchbury94}
 902 Launchbury, J. and {Peyton~Jones}, S. (1994).
 903 \newblock Lazy functional state threads.
 904 \newblock Technical report FP-94-05, Department of Computing Science,
 905   University of Glasgow.
 906
 907 \bibitem[{Peyton~Jones} and Launchbury, 1991]{pj:unboxed}
 908 {Peyton~Jones}, S. and Launchbury, J. (1991).
 909 \newblock Unboxed values as first class citizens.
 910 \newblock In {\em ACM Conference on Functional Programming and Computer
 911   Architecture (FPCA'91)}, pages 636--666, Boston. ACM.
 912
 913 \bibitem[{Peyton~Jones} and Marlow, 1999]{ghc-inliner}
 914 {Peyton~Jones}, S. and Marlow, S. (1999).
 915 \newblock Secrets of the {Glasgow Haskell Compiler} inliner.
 916 \newblock In {\em Workshop on Implementing Declarative Languages}, Paris,
 917   France.
 918
 919 \bibitem[Peyton~Jones and Santos, 1998]{comp-by-trans-scp}
 920 Peyton~Jones, S. and Santos, A. (1998).
 921 \newblock A transformation-based optimiser for {Haskell}.
 922 \newblock {\em Science of Computer Programming}, 32(1-3):3--47.
 923
 924 \end{thebibliography}
 925
 926 \end{document}