docs/ext-core/core.tex

   1 \documentclass[10pt]{article}
   2 \usepackage{a4wide}
   3 \usepackage{code}
   4 \usepackage{natbib}
   5
   6 \sloppy
   7 \setlength{\parskip}{0.5\baselineskip plus 0.2\baselineskip minus 0.1\baselineskip}
   8 \setlength{\parsep}{\parskip}
   9 \setlength{\topsep}{0cm}
  10 \setlength{\parindent}{0cm}
  11 %\oddsidemargin -0.5 in
  12 %\evensidemargin -0.5 in
  13 %\textwidth 7.375 in
  14
  15 \newcommand{\derives}{\mbox{$\rightarrow$}}
  16 \newcommand{\orderives}{\mbox{$\mid$}}
  17 \newcommand{\many}[1]{\{ {#1} \}}
  18 \newcommand{\oneormore}[1]{\{ {#1} \}$^{+}$}
  19 \newcommand{\optional}[1]{[ {#1} ]}
  20
  21 \newcommand{\at}{\texttt{@}}
  22 \newcommand{\att}{@}
  23 \newcommand{\lam}{\texttt{\char`\\}}
  24
  25 \newcommand{\workingnote}[1]%
  26         {\begin{quote}
  27           \framebox{\parbox{.8 \linewidth}
  28                            {\textbf{\textsl{Working note:}} \textsl{#1}}}
  29           \end{quote}}
  30
  31 %% Can't have more than one paragraph in one of these boxes? WTF
  32 \newcommand{\tjc}[1]%
  33         {\begin{quote}
  34           \framebox{\parbox{.8 \linewidth}
  35                            {\textbf{\textsl{tjc:}} \textsl{#1}}}
  36           \end{quote}}
  37
  38 \begin{document}
  39
  40 \title{An External Representation for the GHC Core Language\\ (For GHC 6.10)}
  41 \author{Andrew Tolmach, Tim Chevalier ({\tt \{apt,tjc\}@cs.pdx.edu})\\and The GHC Team}
  42
  43 \maketitle
  44 \makeatactive
  45
  46 \abstract{
  47 This document provides a precise definition for the GHC Core language,
  48 so that it can be used to communicate between GHC and new stand-alone
  49 compilation tools such as back-ends or optimizers.\footnote{This is a draft document, which attempts to describe GHC's current
  50 behavior as precisely as possible.  Working notes scattered throughout indicate
  51 areas where further work is needed.  Constructive comments are very welcome,
  52 both on  the presentation, and on ways in which GHC could be improved in order
  53 to simplify the Core story.} The definition includes a formal grammar and an informal semantics.
  54 An executable typechecker and interpreter (in Haskell),
  55 which formally embody the static and dynamic semantics,
  56 are available separately.
  57 }
  58
  59 \section{Introduction}
  60
  61 The Glasgow Haskell Compiler (GHC) uses an intermediate language, called
  62 ``Core,''  as its internal program representation within the compiler's simplification phase.
  63 Core resembles a subset of Haskell, but with explicit type annotations
  64 in the style of the polymorphic lambda calculus (F$_\omega$).
  65
  66 GHC's front-end translates full Haskell 98 (plus some extensions) into Core. The GHC optimizer then repeatedly transforms Core programs while preserving their meaning. A ``Core Lint'' pass in GHC typechecks Core in between transformation passes (at least when the user enables linting by setting a compiler flag), verifying that transformations preserve type-correctness.\footnote{At least for source programs that do not use unsafe language extensions.} Finally, GHC's back-end translates Core into STG-machine code~\citep{stg-machine} and then into
  67 C or native code.
  68
  69 Two existing papers discuss the original rationale for the design and use of Core~\citep{ghc-inliner,comp-by-trans-scp}, although the (two different)
  70 idealized versions of Core described therein differ in significant ways from the actual Core language in current GHC. In particular, with the advent of GHC support for generalized algebraic datatypes (GADTs)~\citep{gadts} Core was extended beyond its previous F$_\omega$-style incarnation to support type equality constraints and safe coercions, and is now based on a system known as F$_C$~\citep{system-fc}.
  71
  72
  73 Researchers interested in writing just {\it part} of a Haskell compiler,
  74 such as a new back-end or a new optimizer pass, might like to make
  75 use of GHC to provide the other parts of the compiler.  For example, they
  76 might like to use GHC's front-end to parse, desugar, and type-check source Haskell,
  77 then feeding the resulting code to their own back-end tool. As another example, they might like to use Core as the target language for a front-end compiler of their own design, feeding externally synthesized Core into GHC in order to take advantage of GHC's optimizer, code generator, and run-time system. Without external Core, there are two ways to do this: link their code into the
  78 GHC executable, which is an arduous process (and essentially requires
  79 the new code to be written in Haskell), or use the GHC API~\citep{ghc-api}, which is a cleaner way of doing the same thing (but also requires new code to be written in Haskell).
  80
  81 Toward the goal of making it easier for external developers to use GHC in a modular way, we present a precisely specified external format for Core files. The external format is text-based and human-readable, to promote interoperability and ease of use.
  82
  83 In the past, Core has had no rigorously defined external representation, though GHC will print out an ad-hoc textual representation of Core if you set certain compiler flags; this representation is intended to be read by people who are debugging the compiler, not by other programs. Making Core into a machine-readable, bi-directional communication format requires:
  84
  85 \begin{enumerate}
  86 \item precisely specifying the external format of Core;
  87
  88 \item modifying GHC to generate external Core files (post-simplification; as always, users can control the exact transformations GHC does with command-line flags);
  89
  90 \item modifying GHC to accept external Core files in place of Haskell
  91 source files (users will also be able to control what GHC does to those files with command-line flags).
  92
  93 \end{enumerate}
  94
  95 The first two facilities will let developers couple GHC's front-end (parser,
  96 type-checker, desugarer), and optionally its optimizer, with new back-end tools.
  97 The last facility will let developers write new Core-to-Core
  98 transformations as an external tool and integrate them into GHC. It will also
  99 allow new front-ends to generate Core that can be fed into GHC's optimizer or
 100 back-end.
 101
 102 However, because there are many (undocumented)
 103 idiosyncracies in the way GHC produces Core from source Haskell, it will be hard
 104 for an external tool to produce Core that can be integrated with GHC-produced Core
 105 (e.g., for the Prelude), and we don't aim to support this. Indeed, for the time being, we aim to support only the first two facilities and not the third: we define and implement Core as an external format that GHC can use to communicate with external back-end tools, and defer the larger task of extending GHC to support reading this external format back in.
 106
 107 This document addresses the first requirement, a formal Core definition,
 108 by proposing  a formal grammar for an external representation of Core
 109 (Section~\ref{sec:external}), and
 110 an informal semantics (Section~\ref{sec:informal}).
 111
 112 Support for generating external Core (post-optimization) was introduced in
 113 GHC 5.02.  The definition of external Core in this document reflects the version of
 114 external Core generated by the HEAD (unstable) branch of GHC as of April 21, 2008 (version 6.9), using the compiler flag
 115 @-fext-core@. We expect that GHC 6.10 will be consistent with this definition.
 116
 117 GHC supports many type system extensions; the External Core printer built into GHC only supports some of them. However, External Core should be capable of representing any Haskell 98 program, and may be able to represent programs that require certain type system extensions as well.
 118
 119 Formal static and dynamic semantics in the form of an executable typechecker and interpreter
 120 are available separately in the GHC source tree\footnote{\url{http://darcs.haskell.org/ghc}} under @utils/ext-core@.
 121
 122 \section{External Grammar of Core}
 123 \label{sec:external}
 124
 125 In designing the external grammar, we have tried to strike a balance among
 126 a number of competing goals, including easy parseability by machines,
 127 easy readability by humans, and adequate structural simplicity to
 128 allow straightforward presentations of the semantics. Thus, we had to make some compromises. Specifically:
 129
 130 \begin{itemize}
 131 \item In order to avoid explosion of parentheses, we support standard precedences
 132 and short-cuts for expressions, types, and kinds. Thus we had to introduce
 133 multiple non-terminals for each of these syntactic categories, and as a result,
 134 the concrete grammar is longer and more complex than the underlying abstract syntax.
 135
 136 \item On the other hand, we have kept the grammar simpler by avoiding special syntax for
 137 tuple types and terms. Tuples (both boxed and unboxed) are treated
 138 as ordinary constructors.
 139
 140 \item All type abstractions and applications are given in full, even though
 141 some of them (e.g., for tuples) could be reconstructed; this means a parser for Core does not have to
 142 reconstruct types.\footnote{These choices are certainly debatable.  In particular, keeping
 143 type applications on tuples and case arms considerably increases the size of Core files and
 144 makes them less human-readable, though it allows a Core parser to be simpler.}
 145
 146 \item The syntax of identifiers is heavily restricted (to just
 147 alphanumerics and underscores); this again makes Core easier to parse but harder to read.
 148
 149 \end{itemize}
 150
 151 We use the following notational conventions for syntax:
 152
 153 \begin{tabular}{ll}
 154 {\it [ pat ]} & optional \\
 155 {\it \{ pat \}} & zero or more repetitions \\
 156 {\it \{ pat \}$^{+}$} & one or more repetitions \\
 157 {\it pat$_1$ \orderives\ pat$_2$} & choice \\
 158 @fibonacci@ & terminal syntax in typewriter font \\
 159 \end{tabular}
 160
 161 \newpage
 162
 163 {\it
 164 \begin{tabular}{lrclr}
 165 {\rm Module} &   module &        \derives &
 166         \multicolumn{2}{l}{@\%module@ mident \many{tdef @;@} \many{vdefg @;@}} \\
 167 \\
 168 {\rm Type defn.} &       tdef &  \derives &     @%data@ qtycon \many{tbind} @=@ @{@ \optional{cdef \many{@;@ cdef}} @}@ & {\rm algebraic type}\\
 169                  &       &       \orderives &    @%newtype@ qtycon qtycon \many{tbind} \optional{@=@ ty} &       {\rm newtype} \\
 170 \\
 171 {\rm Constr. defn.} &   cdef &   \derives &      qdcon \many{@\at@ tbind} \oneormore{aty} \\
 172 \\
 173 {\rm Value defn.}  &    vdefg &  \derives &      @%rec@ @{@ vdef \many{@;@ vdef} @}@ &                   {\rm recursive} \\
 174                    &    &        \orderives &    vdef &                                                  {\rm non-recursive} \\
 175                    &    vdef  &  \derives & qvar @::@ ty @=@ exp & \\
 176 \\
 177 {\rm Atomic expr.} &     aexp &  \derives &      qvar &                                                 {\rm variable} \\
 178                  &      &        \orderives &    qdcon &                                                {\rm data constructor}\\
 179                  &      &        \orderives &    lit &                                                  {\rm literal} \\
 180                  &      &        \orderives &    @(@ exp @)@ &                                          {\rm nested expr.}\\
 181 \\
 182 {\rm Expression} &      exp   &  \derives    &   aexp &                                                 {\rm atomic expresion}\\
 183                  &      &       \orderives  &    aexp \oneormore{arg} &                                 {\rm application}\\
 184                  &      &        \orderives &    @\@ \oneormore{binder} @->@ exp &                      {\rm abstraction}\\
 185                  &      &        \orderives &    @%let@ vdefg @%in@ exp &                               {\rm local definition}\\
 186                  &      &        \orderives &    @%case@ @(@aty@)@ exp @%of@ vbind @{@ alt \many{@;@ alt} @}@ & {\rm case expression}\\
 187                  &      &        \orderives &    @%cast@ exp aty &                                      {\rm type coercion}\\
 188                  &      &        \orderives &    @%note@  @"@ \many{char} @"@ exp &                     {\rm expression note}\\
 189                  &      &        \orderives &    @%external ccall@ @"@ \many{char} @"@ aty &                    {\rm external reference}\\
 190                  &      &        \orderives &    @%dynexternal ccall@ aty &                     {\rm external reference (dynamic)}\\
 191                  &      &        \orderives &    @%label@ @"@ \many{char} @"@       &               {\rm external label}
 192 \\
 193 \\
 194 {\rm Argument}   &      arg &   \derives &       \at\ aty &                                             {\rm type argument}\\
 195                  &      &        \orderives &    aexp &                                                 {\rm value argument} \\
 196 \\
 197 {\rm Case alt.} &       alt &    \derives &     qdcon \many {@\at@ tbind} \many{vbind} @->@ exp &{\rm constructor alternative}\\
 198                 &       &        \orderives &    lit @->@ exp &                         {\rm literal alternative} \\
 199                 &       &        \orderives &    @%_@ @->@ exp &                                {\rm default alternative} \\
 200 \\
 201 {\rm Binder}     &      binder & \derives & \at\ tbind  &                                       {\rm type binder}\\
 202                  &              & \orderives &  vbind   &                                               {\rm value binder}\\
 203 \\
 204 {\rm Type binder} &     tbind & \derives   & tyvar & {\rm implicitly of  kind @*@} \\
 205                   &           & \orderives & @(@ tyvar @::@ kind @)@ & {\rm explicitly kinded} \\
 206 \\
 207 {\rm Value binder} &    vbind & \derives &   @(@ var @::@ ty @)@ \\
 208 \\
 209 {\rm Literal} &  lit &   \derives &      @(@ [@-@] \oneormore{digit} @::@ ty @)@ & {\rm integer} \\
 210             &   &        \orderives &    @(@ [@-@] \oneormore{digit} @%@ \oneormore{digit} @::@ ty @)@ & {\rm rational} \\
 211             &   &        \orderives &    @(@ $'$ char $'$ @::@ ty @)@ & {\rm character} \\
 212             &   &        \orderives &    @(@ @"@ \many{char} @"@ @::@ ty @)@ & {\rm string} \\
 213 \\
 214 {\rm Character}  & char & \derives & \multicolumn{2}{l}{any ASCII character in range 0x20-0x7E except 0x22,0x27,0x5c}\\
 215                 &       & \orderives & @\x@ hex hex  & {\rm ASCII code escape sequence} \\
 216                 &  hex   & \derives & @0@ \orderives  \ldots \orderives  @9@ \orderives  @a@ \orderives  \ldots \orderives  @f@ \\
 217 \end{tabular}
 218
 219 \begin{tabular}{lrclr}
 220 {\rm Atomic type} & aty & \derives &     tyvar &                                        {\rm type variable} \\
 221           &     &        \orderives &    qtycon &                                       {\rm type constructor}\\
 222           &     &        \orderives &    @(@ ty @)@ &                                   {\rm nested type}\\
 223 \\
 224 {\rm Basic type} & bty  & \derives &    aty  &                                          {\rm atomic type}\\
 225                   &      & \orderives & bty aty &                                       {\rm type application}\\
 226         &     &        \orderives &   @ghczmprim:GHCziPrim.trans@ aty aty &           {\rm transitive coercion} \\
 227         &     &        \orderives &   @ghczmprim:GHCziPrim.sym@ aty &                   {\rm symmetric coercion} \\
 228         &     &        \orderives &   @ghczmprim:GHCziPrim.CoUnsafe@ aty aty &          {\rm unsafe coercion} \\
 229         &     &        \orderives &   @ghczmprim:GHCziPrim.left@ aty &                  {\rm left coercion} \\
 230         &     &        \orderives &   @ghczmprim:GHCziPrim.right@ aty &                 {\rm right coercion} \\
 231         &     &        \orderives &   @ghczmprim:GHCziPrim.inst@ aty aty &             {\rm instantiation coercion} \\
 232 \\
 233 {\rm Type} &     ty &    \derives   &   bty &                                           {\rm basic type}\\
 234           &     &        \orderives &   @%forall@ \oneormore{tbind} @.@ ty  &           {\rm type abstraction}\\
 235           &     &        \orderives &   bty @->@ ty  &                                  {\rm arrow type construction} \\
 236   \\
 237
 238 {\rm Atomic kind} & akind & \derives &   @*@ &                          {\rm lifted kind}\\
 239                 &       & \orderives &   @#@ &                          {\rm unlifted kind}\\
 240                 &       & \orderives &   @?@ &                          {\rm open kind}\\
 241                 &    &        \orderives &    bty @:=:@ bty      &      {\rm equality kind} \\
 242                 &       & \orderives &   @(@ kind @)@&                  {\rm nested kind}\\
 243 \\
 244 {\rm Kind} &    kind &   \derives &      akind &                        {\rm atomic kind}\\
 245            &    &        \orderives &    akind @->@ kind         &      {\rm arrow kind} \\
 246 \\
 247 {\rm Identifier}        &       mident & \derives & pname @:@ uname &   {\rm module} \\
 248         &       tycon &  \derives &      uname &                {\rm type constr.}  \\
 249         &       qtycon & \derives &      mident @.@  tycon &    {\rm qualified type constr.} \\
 250         &       tyvar &  \derives &      lname &                {\rm type variable} \\
 251         &       dcon &   \derives &      uname &                {\rm data constr.} \\
 252         &       qdcon &  \derives &      mident @.@  dcon &     {\rm qualified data constr.} \\
 253         &       var &    \derives &      lname &                {\rm variable} \\
 254         &       qvar &   \derives &      [ mident @.@ ] var &   {\rm optionally qualified variable} \\
 255 \\
 256 {\rm Name}      &       lname  &  \derives &     lower \many{namechar} \\
 257         &       uname &  \derives &      upper \many{namechar} & \\
 258         &       pname &  \derives &      \oneormore{namechar} & \\
 259         &       namechar & \derives &    lower \orderives\  upper \orderives\  digit \orderives\  @'@ \\
 260         &       lower &  \derives &      @a@ \orderives\  @b@ \orderives\  \ldots \orderives\  @z@ \orderives\  @_@ \\
 261         &       upper &  \derives &      @A@ \orderives\  @B@ \orderives\  \ldots \orderives\  @Z@ \\
 262         &       digit &  \derives &      @0@ \orderives\  @1@ \orderives\  \ldots \orderives\  @9@ \\
 263 \\
 264 \end{tabular}
 265 }
 266 \section{Informal Semantics}
 267 \label{sec:informal}
 268
 269 At the term level, Core resembles a explicitly-typed polymorphic lambda calculus (F$_\omega$), with the addition
 270 of local @let@ bindings, algebraic type definitions, constructors, and @case@ expressions,
 271 and primitive types, literals and operators. Its type system is richer than that of System F, supporting explicit type equality coercions and type functions.~\citep{system-fc}
 272
 273 We hope no further elaboration is necessary to informally understand Core programs. Therefore, in this section we concentrate on the less obvious points.
 274
 275 \subsection{Program Organization and Modules}
 276 \label{sec:modules}
 277
 278 Core programs are organized into {\em modules}, corresponding directly to source-level Haskell modules.
 279 Each module has a identifying name {\it mident}. A module identifier consists of a {\it package name} followed by a module name, which may be hierarchical: for example, @base:GHC.Base@ is the module identifier for GHC's Base module. Its name is @Base@, and it lives in the @GHC@ hierarchy within the @base@ package. Section 5.8 of the GHC users' guide explains package names~\citep{ghc-user-guide}. In particular, note that a Core program may contain multiple modules with the same (possibly hierarchical) module name that differ in their package names. In some of the code examples that follow, we will omit package names and possibly full hierarchical module names from identifiers for brevity, but be aware that they are always required.
 280
 281 A possible improvement to the Core syntax would be to add explicit import lists to Core modules, which could be used to specify abbrevations for long qualified names. This would make the code more human-readable.
 282
 283 Each module may contain the following kinds of top-level declarations:
 284 \begin{itemize}
 285 \item Algebraic data type declarations, each defining a type constructor and one or more data constructors;
 286 \item Newtype declarations,  corresponding to Haskell @newtype@ declarations, each defining a type constructor and a coercion name; and
 287 \item Value declarations, defining the types and values of top-level variables.
 288 \end{itemize}
 289
 290 No type constructor, data constructor, or top-level value may be declared more than once within a given module.
 291 All the type declarations are (potentially) mutually recursive. Value declarations must be
 292 in dependency order, with explicit grouping of mutually recursive declarations.
 293
 294
 295 Identifiers defined in top-level declarations may be {\it external} or {\it internal}.
 296 External identifiers can be referenced from any other module in
 297 the program, using conventional dot notation (e.g., @base:GHC.Base.Bool@, @base:GHC.Base.True@).
 298 Internal identifiers are visible only within the defining module.
 299 All type and data constructors are external, and are always defined and referenced using
 300 fully qualified names (with dots).
 301
 302 A top-level value is external if it is defined and referenced
 303 using a fully qualified name with a dot (e.g., @main:MyModule.foo = ...@); otherwise, it is internal
 304 (e.g., @bar = ...@).
 305 Note that Core's notion of an external identifier does not necessarily coincide with that of ``exported''
 306 identifier in a Haskell source module. An identifier can be an external identifier in Core, but not be exported by the original Haskell source module.\footnote{Two examples of such identifiers are: data constructors, and values that potentially appear in an unfolding. For an example of the latter, consider @Main.foo = ... Main.bar ...@, where @Main.foo@ is inlineable. Since @bar@ appears in @foo@'s unfolding, it is defined and referenced with an external name, even if @bar@ was not exported by the original source module.} However, if an identifier was exported by the Haskell source module, it will appear as an external name in Core.
 307
 308 Core modules have no explicit import or export lists.
 309 Modules may be mutually recursive. Note that because of the latter fact, GHC currently prints out the top-level bindings for every module as a single recursive group, in order to avoid keeping track of dependencies between top-level values within a module. An external Core tool could reconstruct dependencies later, of course.
 310
 311 There is also an implicitly-defined module @ghc-prim:GHC.Prim@, which exports the ``built-in'' types and values
 312 that must be provided by any implementation of Core (including GHC).   Details of this
 313 module are in Section~\ref{sec:prims}.
 314
 315 A Core {\em program} is a collection of distinctly-named modules that includes a module
 316 called @main:Main@ having an exported value called @main:ZCMain.main@ of type @base:GHC.IOBase.IO a@ (for some type @a@). (Note that the strangely named wrapper for @main@ is the one exception to the rule that qualified names defined within a module @m@ must have module name @m@.)
 317
 318 Many Core programs will contain library modules, such as @base:GHC.Base@, which implement parts of the Haskell standard library.  In principle, these modules are ordinary Haskell modules, with no special status.  In practice, the requirement on the type of @main:Main.main@ implies that every program will contain a large subset of
 319 the standard library modules.
 320
 321 \subsection{Namespaces}
 322 \label{sec:namespaces}
 323
 324 There are five distinct namespaces:
 325 \begin{enumerate}
 326 \item module identifiers (@mident@),
 327 \item type constructors (@tycon@),
 328 \item type variables (@tyvar@),
 329 \item data constructors (@dcon@),
 330 \item term variables (@var@).
 331 \end{enumerate}
 332
 333 Spaces (1), (2+3), and (4+5) can be distinguished from each other by context.
 334 To distinguish (2) from (3) and (4) from (5), we require that data and type constructors begin with an upper-case character, and that term and type variables begin with a lower-case character.
 335
 336 Primitive types and operators are not syntactically distinguished.
 337
 338 Primitive {\it coercion} operators, of which there are six, {\it are} syntactically distinguished in the grammar. This is because these coercions must be fully applied, and because distinguishing their applications in the syntax makes typechecking easier.
 339
 340 A given variable (type or term) may have multiple definitions within a module.
 341 However, definitions of term variables never ``shadow'' one another: the scope of the definition
 342 of a given variable never contains a redefinition of the same variable. Type variables may be shadowed. Thus, if a term variable has multiple definitions within a module, all those definitions must be local (let-bound). The only exception
 343 to this rule is that (necessarily closed) types labelling @%external@ expressions may contain
 344 @tyvar@ bindings that shadow outer bindings.
 345
 346 Core generated by GHC makes heavy use of encoded names, in which the characters @Z@ and @z@ are
 347 used to introduce escape sequences for non-alphabetic characters such as dollar sign @$@ (@zd@),
 348 hash @#@ (@zh@), plus @+@ (@zp@), etc.  This is the same encoding used in @.hi@ files and in the
 349 back-end of GHC itself, except that we sometimes change an initial @z@ to @Z@, or vice-versa,
 350 in order to maintain case distinctions.
 351
 352 Finally, note that hierarchical module names are z-encoded in Core: for example, @base:GHC.Base.foo@ is rendered as @base:GHCziBase.foo@. A parser may reconstruct the module hierarchy, or regard @GHCziBase@ as a flat name.
 353 \subsection{Types and Kinds}
 354 \label{sec:typesandkinds}
 355
 356 In Core, all type abstractions and applications are explicit.  This make it easy to
 357 typecheck any (closed) fragment of Core code.  An full executable typechecker is available separately.
 358
 359 \subsubsection{Types}
 360 Types are described by type expressions, which
 361 are built from named type constructors and type variables
 362 using type application and universal quantification.
 363 Each type constructor has a fixed arity $\geq 0$.
 364 Because it is so widely used, there is
 365 special infix syntax for the fully-applied function type constructor (@->@).
 366 (The prefix identifier for this constructor is @ghc-prim:GHC.Prim.ZLzmzgZR@; this should
 367 only appear in unapplied or partially applied form.)
 368
 369 There are also a number of other primitive type constructors (e.g., @Intzh@) that
 370 are predefined in the @GHC.Prim@ module, but have no special syntax.
 371 @%data@ and @%newtype@ declarations introduce additional type constructors, as described below.
 372 Type constructors are distinguished solely by name.
 373
 374 \subsubsection{Coercions}
 375
 376 A type may also be built using one of the primitive coercion operators, as described in Section~\ref{sec:namespaces}. For details on the meanings of these operators, see the System FC paper~\citep{system-fc}.
 377
 378 \subsubsection{Kinds}
 379 As described in the Haskell definition, it is necessary to distinguish
 380 well-formed type-expressions by classifying them into different {\it kinds}~\citep[41]{haskell98}.
 381 In particular, Core explicitly records the kind of every bound type variable.
 382
 383 In addition, Core's kind system includes equality kinds, as in System FC~\citep{system-fc}. An application of a built-in coercion, or of a user-defined coercion as introduced by a newtype declaration, has an equality kind.
 384 \subsubsection{Lifted and Unlifted Types}
 385 Semantically, a type is {\it lifted} if and only if it has bottom as an element. We need to distinguish them because operationally, terms with lifted types may be represented by closures; terms with unlifted types must not be represented by closures, which implies that any unboxed value is necessarily unlifted. We distinguish between lifted and unlifted types by ascribing them different kinds.
 386
 387 Currently, all the primitive types are unlifted
 388 (including a few boxed primitive types such as @ByteArrayzh@).
 389 Peyton Jones and Launchbury~[\citeyear{pj:unboxed}] described the ideas behind unboxed and unlifted types.
 390
 391 \subsubsection{Type Constructors; Base Kinds and Higher Kinds}
 392 Every type constructor has a kind, depending on its arity and whether it or its arguments are lifted.
 393
 394 Term variables can only be assigned types that have base kinds: the base kinds are @*@,@#@, and @?@. The three base kinds distinguish the liftedness of the types they classify:
 395 @*@ represents lifted types; @#@ represents unlifted types; and @?@ is the ``open'' kind, representing a type that may be either lifted or unlifted. Of these, only @*@ ever
 396 appears in Core type declarations generated from user code; the other two are needed to describe
 397 certain types in primitive (or otherwise specially-generated) code (which, after optimization, could potentially appear anywhere).
 398
 399 In particular, no top-level identifier (except in @ghc-prim:GHC.Prim@) has a type of kind @#@ or @?@.
 400
 401 Nullary type constructors have base kinds: for example, the type @Int@ has kind @*@, and @Int#@ has kind @#@.
 402
 403 Non-nullary type constructors have higher kinds: kinds that have the form $k_1 @->@ k_2$,
 404 where $k_1$ and $k_2$ are kinds.   For example, the function type constructor
 405 @->@ has kind @* -> (* -> *)@.  Since Haskell allows abstracting over type
 406 constructors, type variables may have higher kinds; however, much more commonly they have kind @*@, so that is the default if a type binder omits a kind.
 407
 408 \subsubsection{Type Synonyms and Type Equivalence}
 409 There is no mechanism for defining type synonyms (corresponding to
 410 Haskell @type@ declarations).
 411
 412 Type equivalence is just syntactic equivalence on type expressions
 413 (of base kinds) modulo:
 414
 415 \begin{itemize}
 416 \item alpha-renaming of variables bound in @%forall@ types;
 417 \item the identity $a$ @->@ $b$ $\equiv$ @ghc-prim:GHC.Prim.ZLzmzgZR@ $a$ $b$
 418 \item the substitution of representation types for {\it fully applied} instances of newtypes
 419 (see Section~\ref{sec:newtypes}).
 420 \end{itemize}
 421
 422 \subsection{Algebraic data types}
 423
 424 Each @data@ declaration introduces a new type constructor and a set of one or
 425 more data constructors, normally corresponding directly to a source Haskell @data@ declaration.
 426 For example, the source declaration
 427 \begin{code}
 428 data Bintree a =
 429    Fork (Bintree a) (Bintree a)
 430 |  Leaf a
 431 \end{code}
 432 might induce the following Core declaration
 433 \begin{code}
 434 %data Bintree a = {
 435    Fork (Bintree a) (Bintree a);
 436    Leaf a)}
 437 \end{code}
 438 which introduces the unary type constructor @Bintree@ of kind @*->*@  and two data constructors with types
 439 \begin{code}
 440 Fork :: %forall a . Bintree a -> Bintree a -> Bintree a
 441 Leaf :: %forall a . a -> Bintree a
 442 \end{code}
 443 We define the {\it arity} of each data constructor to be the number of value arguments it takes;
 444 e.g. @Fork@ has arity 2 and @Leaf@ has arity 1.
 445
 446 For a less conventional example illustrating the possibility of higher-order kinds, the Haskell source declaration
 447 \begin{code}
 448 data A f a = MkA (f a)
 449 \end{code}
 450 might induce the Core declaration
 451 \begin{code}
 452 %data A (f::*->*) a = { MkA (f a) }
 453 \end{code}
 454 which introduces the constructor
 455 \begin{code}
 456 MkA :: %forall (f::*->*) a . (f a) -> (A f) a
 457 \end{code}
 458
 459 GHC (like some other Haskell implementations) supports an extension to Haskell98
 460 for existential types such as
 461 \begin{code}
 462 data T = forall a . MkT a (a -> Bool)
 463 \end{code}
 464 This is represented by the Core declaration
 465 \begin{code}
 466 %data T = {MkT @a a (a -> Bool)}
 467 \end{code}
 468 which introduces the nullary type constructor @T@ and the data constructor
 469 \begin{code}
 470 MkT :: %forall a . a -> (a -> Bool) -> T
 471 \end{code}
 472 In general, existentially quantified variables appear as extra univerally
 473 quantified variables in the data contructor types.
 474 An example of how to construct and deconstruct values of type @T@ is shown in
 475 Section~\ref{sec:exprs}.
 476
 477 \subsection{Newtypes}
 478 \label{sec:newtypes}
 479
 480 Each Core @%newtype@ declaration introduces a new type constructor and (usually) an associated
 481 representation type, corresponding to a source Haskell @newtype@
 482 declaration. However, unlike in source Haskell, a @%newtype@ declaration does not introduce any data constructors.
 483
 484 Each @%newtype@ declaration also introduces a new coercion (syntactically, just another type constructor) that implies an axiom equating the type constructor, applied to any type variables bound by the @%newtype@, to the representation type.
 485
 486 For example, the Haskell fragment
 487 \begin{code}
 488 newtype U = MkU Bool
 489 u = MkU True
 490 v = case u of
 491   MkU b -> not b
 492 \end{code}
 493 might induce the Core fragment
 494 \begin{code}
 495 %newtype U ZCCoU = Bool;
 496 u :: U = %cast (True)
 497            ((sym ZCCoU));
 498 v :: Bool = not (%cast (u) ZCCoU);
 499 \end{code}
 500
 501 The newtype declaration implies that the types {\tt U} and {\tt Bool} have equivalent representations, and the coercion axiom {\tt ZCCoU} provides evidence. Notice that in the body of {\tt u}, the boolean value {\tt True} is cast to type {\tt U} using the primitive symmetry rule applied to {\tt ZCCoU}: that is, using a coercion of kind {\tt Bool :=: U}. And in the body of {\tt v}, {\tt u} is cast back to type {\tt Bool} using the axiom {\tt ZCCoU}.
 502
 503 Very rarely, source @newtype@ declarations may be (directly or indirectly) recursive. In such
 504 cases, it is not possible to substitute the representation type for the new type;
 505 in fact, the representation type is omitted from the corresponding Core @%newtype@ declaration. Otherwise, recursive newtypes are handled in the same way as non-recursive newtypes.
 506
 507 \subsection{Expression Forms}
 508 \label{sec:exprs}
 509
 510 Variables and data constructors are straightforward.
 511
 512 Literal ({\it lit}) expressions consist of a literal value, in one of four different formats,
 513 and a (primitive) type annotation.   Only certain combinations of format and type
 514 are permitted; see Section~\ref{sec:prims}.  The character and string formats can describe only
 515 8-bit ASCII characters.
 516
 517 Moreover, because the operational semantics for Core interprets strings as C-style null-terminated
 518 strings, strings should not contain embedded nulls.
 519
 520 In Core, value applications, type applications, value abstractions, and type abstractions are all explicit. To tell them apart, type arguments in applications
 521 and formal type arguments in abstractions are preceded by an \at\ symbol. (In abstractions,
 522 the \at\ plays essentially the same role as the more usual $\Lambda$ symbol.)
 523 For example, the Haskell source declaration
 524 \begin{code}
 525 f x = Leaf (Leaf x)
 526 \end{code}
 527 might induce the Core declaration
 528 \begin{code}
 529 f :: %forall a . a -> BinTree (BinTree a) =
 530   \ @a (x::a) -> Leaf @(Bintree a) (Leaf @a x)
 531 \end{code}
 532
 533 Value applications may be of user-defined functions, data constructors, or primitives.
 534 None of these sorts of applications are necessarily saturated (although previously published variants
 535 of Core did require the latter two sorts to be).
 536
 537 Note that the arguments of type applications are not always of kind @*@.  For example,
 538 given our previous definition of type @A@:
 539 \begin{code}
 540 data A f a = MkA (f a)
 541 \end{code}
 542 the source code
 543 \begin{code}
 544 MkA (Leaf True)
 545 \end{code}
 546 becomes
 547 \begin{code}
 548 (MkA @Bintree @Bool) (Leaf @Bool True)
 549 \end{code}
 550
 551 Local bindings, of a single variable or of a set of mutually recursive variables,
 552 are represented by @%let@ expressions in the usual way. In Core, @%let@ is also significant it maps directly on to the STG language's @let@ construct: in STG, evaluating a @%let@ allocates closures for the variable(s) being bound~\citep[29]{stg-machine}.
 553
 554 By far the most complicated expression form is @%case@.
 555 @%case@ expressions are permitted over values of any type, although they will normally
 556 be algebraic or primitive types (with literal values).
 557 Evaluating a @%case@ forces the evaluation of the expression being
 558 tested (the ``scrutinee''). The value of the scrutinee is bound to the variable
 559 following the @%of@ keyword, which is in scope in all alternatives;
 560 this is useful when the scrutinee is a non-atomic
 561 expression (see next example). The scrutinee is preceded by the type of the entire @%case@ expression: that is, the result type that all of the @%case@ alternatives have (this is intended to make type reconstruction in the presence of type equality coercions easier).
 562
 563 In an algebraic @%case@, all the case alternatives must be
 564 labeled with distinct data constructors from the algebraic type, followed by
 565 any existential type variable bindings (see below), and
 566 typed term variable bindings corresponding to the data constructor's
 567 arguments. The number of variables must match the data constructor's arity.
 568
 569 For example, the following Haskell source expression
 570 \begin{code}
 571 case g x of
 572   Fork l r -> Fork r l
 573   t@(Leaf v) -> Fork t t
 574 \end{code}
 575 might induce the Core expression
 576 \begin{code}
 577 %case ((Bintree a)) g x %of (t::Bintree a)
 578    Fork (l::Bintree a) (r::Bintree a) ->
 579       Fork @a r l
 580    Leaf (v::a) ->
 581       Fork @a t t
 582 \end{code}
 583
 584 When performing a @%case@ over a value of an existentially-quantified algebraic
 585 type, the alternative must include extra local type bindings
 586 for the existentially-quantified variables.  For example, given
 587 \begin{code}
 588 data T = forall a . MkT a (a -> Bool)
 589 \end{code}
 590 the source
 591 \begin{code}
 592 case x of
 593   MkT w g -> g w
 594 \end{code}
 595 becomes
 596 \begin{code}
 597 %case x %of (x'::T)
 598   MkT @b (w::b) (g::b->Bool) -> g w
 599 \end{code}
 600
 601 In a @%case@ over literal alternatives,
 602 all the case alternatives must be distinct literals of the same primitive type.
 603
 604 The list of alternatives may begin with a
 605 default alternative labeled with an underscore (@%_@), whose right-hand side will be evaluated if
 606 none of the other alternatives match.  The default is optional except for in a case
 607 over a primitive type, or when there are no other alternatives.
 608 If the case is over neither an
 609 algebraic type nor a primitive type, then the list of alternative must contain a default alternative and nothing else.
 610 For algebraic cases, the set of alternatives
 611 need not be exhaustive, even if no default is given; if alternatives are missing,
 612 this implies that GHC has deduced that they cannot occur.
 613
 614 The @%cast@ expression takes an expression and a coercion:
 615 syntactically, the coercion is an arbitrary type, but it must have an
 616 equality kind. In an expression @(cast e co)@, if @e :: T@ and @co@
 617 has kind @T :=: U@, then the overall expression has type
 618 @U@~\citep{ghc-fc-commentary}. If @co@ has a non-coercion kind or a
 619 coercion whose left-hand side is not @T@, that is a type error.
 620
 621 Note
 622 that unlike the @%coerce@ expression that existed in previous versions
 623 of Core, this means that @%cast@ is (almost) type-safe: the coercion
 624 argument provides evidence that can be verified by a
 625 typechecker. There are still unsafe @%cast@s, corresponding to the
 626 unsafe @%coerce@ construct that existed in old versions of Core,
 627 because there is a primitive @ghc-prim:GHC.Prim.CoUnsafe@ coercion type that
 628 can be used to cast arbitrary types to each other. GHC uses this for
 629 such purposes as coercing the return type of a function (such as
 630 error) which is guaranteed to never return:
 631 \begin{code}
 632 case (error "") of
 633   True -> 1
 634   False ->  2
 635 \end{code}
 636 becomes:
 637 \begin{code}
 638     %cast (error @ Bool (ZMZN @ Char))
 639     (GHC.Prim.CoUnsafe Bool Integer);
 640 \end{code}
 641 @%cast@ has no operational meaning and is only used in typechecking.
 642
 643 @%cast@ is also used to manipulate newtypes, as described in Section~\ref{sec:newtypes}.
 644
 645 A @%note@ expression carries arbitrary internal information that GHC finds interesting. The information is encoded as a string.  Expression notes currently generated by GHC
 646 include the inlining pragma (@InlineMe@) and cost-center labels for profiling.
 647
 648 A @%external@ expression denotes an external identifier, which has
 649 the indicated type (always expressed in terms of Haskell primitive types).\footnote{External Core supports two kinds of external calls: \%external and \%dynexternal. Only the former is supported by the current set of stand-alone Core tools. In addition, there is a \%label construct which GHC may generate but which the Core tools do not support.}
 650
 651 The present syntax for externals is sufficient for describing C functions and labels.
 652 Interfacing to other languages may require additional information or a different interpretation
 653 of the name string.
 654
 655
 656 \subsection{Expression Evaluation}
 657 \label{sec:evaluation}
 658
 659 The dynamic semantics of Core are defined on the type-erasure of the program: for example, we ignore all type abstractions and applications.  The denotational semantics of
 660 the resulting type-free program are just the conventional ones for a call-by-name
 661 language, in which expressions are only evaluated on demand.
 662 But Core is intended to be a call-by-{\it{need}} language, in which
 663 expressions are only evaluated {\it once}.  To express the sharing behavior
 664 of call-by-need, we give an operational model in the style of Launchbury~\citep{launchbury93natural}.
 665
 666 This section describes the model informally; a more formal semantics is
 667 separately available as an executable interpreter.
 668
 669 To simplify the semantics, we consider only ``well-behaved'' Core programs in which
 670 constructor and primitive applications are fully saturated, and in which
 671 non-trivial expresssions of unlifted kind (@#@) appear only as scrutinees
 672 in @%case@ expressions.  Any program can easily be put into this form;
 673 a separately available preprocessor illustrates how.
 674 In the remainder of this section, we use ``Core'' to mean ``well-behaved'' Core.
 675
 676 Evaluating a Core expression means reducing it to {\it weak-head normal form (WHNF)},
 677 i.e., a primitive value, lambda abstraction, or fully-applied data constructor. To evaluate a program means evaluating the expression @Main.main@.
 678
 679 To make sure that expression evaluation is shared, we
 680 make use of a {\it heap}, which contains {\it heap entries}. A heap entry can be:
 681 \begin{itemize}
 682 \item A {\em thunk}, representing an unevaluated expression, also known as a {\em suspension}.
 683
 684 \item A {\em WHNF}, representing an evaluated expression. The result of evaluating a thunk is a WHNF. A WHNF is always a closure (corresponding to a lambda abstraction in the source program) or a data constructor application: computations over primitive types are never suspended.
 685 \end{itemize}
 686
 687 {\it Heap pointers} point to heap entries: the same heap pointer can point to either a thunk or a WHNF, because the run-time system overwrites thunks with WHNFs as computation proceeds.
 688
 689 The suspended computation that a thunk represents might represent evaluating one of three different kinds of expression. The run-time system allocates a different kind of thunk depending on what kind of expression it is:
 690 \begin{itemize}
 691 \item A thunk for a value definition has a group of suspended defining expressions, along with a list of bindings between defined names and heap pointers to those suspensions. (A value definition may be a recursive group of definitions or a single non-recursive definition, and it may be top-level (global) or @let@-bound (local)).
 692
 693 \item A thunk for a function application (where the function is user-defined) has a suspended actual argument expression, and a binding between the formal argument and a heap pointer to that suspension.
 694
 695 \item A thunk for a constructor application has a suspended actual argument expression; the entire constructed value has a heap pointer to that suspension embedded in it.
 696 \end{itemize}
 697
 698 As computation proceeds, copies of the heap pointer for a given thunk propagate through the executing program.
 699 When another computation demands the result of that thunk, the run-time system computes the thunk's result, yielding a WHNF, and overwrites the heap entry for the thunk with the WHNF. Now, all copies of the heap pointer point to the new heap entry: a WHNF. Forcing occurs
 700 only in the context of
 701 \begin{itemize}
 702 \item evaluating the operator expression of an application;
 703
 704 \item evaluating the scrutinee of a @case@ expression; or
 705
 706 \item evaluating an argument to a primitive or external function application
 707 \end{itemize}
 708
 709 When no pointers to a heap entry (whether it is a thunk or WHNF) remain, the garbage collector can reclaim the space it uses. We assume this happens implicitly.
 710
 711 With the exception of functions, arrays, and mutable variables, we intend that values of all primitive types
 712 should be held {\it unboxed}: they should not be heap-allocated. This does not violate call-by-need semantics: all
 713 primitive types are {\it unlifted}, which means that values of those types must be evaluated strictly.  Unboxed tuple types are not heap-allocated either.
 714
 715 Certain primitives and @%external@ functions cause side-effects to state threads or to the real world.
 716 Where the ordering of these side-effects matters, Core already forces this order with data dependencies on the pseudo-values representing the threads.
 717
 718 An implementation must specially support the @raisezh@ and @handlezh@ primitives: for example, by using a handler stack.
 719 Again, real-world threading guarantees that they will execute in the correct order.
 720
 721 \section{Primitive Module}
 722 \label{sec:prims}
 723
 724 The semantics of External Core rely on the contents and informal semantics of the primitive module @ghc-prim:GHC.Prim@.
 725 Nearly all the primitives are required in order to cover GHC's implementation of the Haskell98
 726 standard prelude; the only operators that can be completely omitted are those supporting the byte-code interpreter,
 727 parallelism, and foreign objects.  Some of the concurrency primitives are needed, but can be
 728 given degenerate implementations if it desired to target a purely sequential backend (see Section~\ref{sec:sequential}).
 729
 730 In addition to these primitives, a large number of C library functions are required to implement
 731 the full standard Prelude, particularly to handle I/O and arithmetic on less usual types.
 732
 733 For a full listing of the names and types of the primitive operators, see the GHC library documentation~\citep{ghcprim}.
 734
 735
 736 \subsection{Non-concurrent Back End}
 737 \label{sec:sequential}
 738
 739 The Haskell98 standard prelude doesn't include any concurrency support, but GHC's
 740 implementation of it relies on the existence of some concurrency primitives.  However,
 741 it never actually forks multiple threads.  Hence, the concurrency primitives can
 742 be given degenerate implementations that will work in a non-concurrent setting,
 743 as follows:
 744 \begin{itemize}
 745 \item  @ThreadIdzh@ can be represented
 746 by a singleton type, whose (unique) value is returned by @myThreadIdzh@.
 747
 748 \item @forkzh@ can just die with an ``unimplemented'' message.
 749
 750 \item @killThreadzh@ and @yieldzh@ can also just die ``unimplemented'' since
 751 in a one-thread world, the only thread a thread can kill is itself, and
 752 if a thread yields the program hangs.
 753
 754 \item @MVarzh a@ can be represented by @MutVarzh (Maybe a)@;
 755 where a concurrent implementation would block, the sequential implementation can
 756 just die with a suitable message (since no other thread exists to unblock it).
 757
 758 \item @waitReadzh@ and @waitWritezh@ can be implemented using a @select@ with no timeout.
 759 \end{itemize}
 760
 761 \subsection{Literals}
 762
 763 Only the following combination of literal forms and types are permitted:
 764
 765 \begin{tabular}{|l|l|l|}
 766 \hline
 767 Literal form & Type & Description \\
 768 \hline
 769 integer &  @Intzh@ & Int \\
 770 %       &  @Int32zh@ & Int32 \\
 771 %       &  @Int64zh@ & Int64 \\
 772         &  @Wordzh@ & Word \\
 773 %       &  @Word32zh@ & Word32 \\
 774 %       &  @Word64zh@ & Word64 \\
 775         &  @Addrzh@ & Address \\
 776         &  @Charzh@ & Unicode character code \\
 777 rational & @Floatzh@ & Float \\
 778          & @Doublezh@ & Double \\
 779 character & @Charzh@ & Unicode character specified by ASCII character\\
 780 string &  @Addrzh@  & Address of specified C-format string \\
 781 \hline
 782 \end{tabular}
 783
 784 \bibliography{core}
 785 \bibliographystyle{abbrvnat}
 786
 787 \end{document}