[project @ 2006-01-06 16:30:17 by simonmar]
author     simonmar <unknown>
           Fri, 6 Jan 2006 16:30:19 +0000 (16:30 +0000)
committer  simonmar <unknown>
           Fri, 6 Jan 2006 16:30:19 +0000 (16:30 +0000)
commit     9d7da331989abcd1844e9d03b8d1e4163796fa85
tree       8efa2e6fdcf8bfee777ae6477a686d0594c5ff76
parent     2a2efb720c0fdc06fe749f96f284b00b30f8f3f7
Add support for UTF-8 source files

GHC finally has support for full Unicode in source files.  Source
files are now assumed to be UTF-8 encoded, and the full range of
Unicode characters can be used, with classifications recognised using
the implementation from Data.Char.  This incidentally means that only
the stage2 compiler will recognise Unicode in source files, because I
was too lazy to port the Unicode classifier code into libcompat.
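
To illustrate what classification through Data.Char provides, here is
a minimal sketch.  The CharClass type and the classify function are
hypothetical names made up for this note and are not taken from GHC's
lexer; only the Data.Char predicates are real library functions.

```haskell
import Data.Char (isDigit, isLower, isPunctuation, isSpace, isSymbol, isUpper)

-- Hypothetical coarse character classes, for illustration only.
data CharClass = Upper | Lower | Digit | Space | Symbol | Other
  deriving (Eq, Show)

-- Classify a character using the Unicode-aware predicates in Data.Char.
-- Note that isDigit only accepts the ASCII digits '0'..'9'.
classify :: Char -> CharClass
classify c
  | isUpper c                     = Upper
  | isLower c                     = Lower
  | isDigit c                     = Digit
  | isSpace c                     = Space
  | isSymbol c || isPunctuation c = Symbol
  | otherwise                     = Other
```

With this sketch, classify '\x3BB' (Greek small lambda) is Lower and
classify '\x2192' (right arrow) is Symbol, so a lexer built on these
predicates can treat non-ASCII letters and operators like their ASCII
counterparts.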

Additionally, the following synonyms for keywords are now recognised:

  forall symbol                (U+2200)  forall
  right arrow                  (U+2192)  ->
  left arrow                   (U+2190)  <-
  midline horizontal ellipsis  (U+22EF)  ..

There are probably more things we could add here.
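
One way to picture the synonyms is as a lookup from a Unicode lexeme
to its ASCII spelling.  The sketch below is not the code in Lexer.x;
unicodeSynonyms and asciiSpelling are hypothetical names, and the
table holds only the four code points listed in this message.

```haskell
-- Hypothetical table: Unicode lexeme -> ASCII keyword it stands for.
unicodeSynonyms :: [(String, String)]
unicodeSynonyms =
  [ ("\x2200", "forall")  -- U+2200
  , ("\x2192", "->")      -- U+2192
  , ("\x2190", "<-")      -- U+2190
  , ("\x22EF", "..")      -- U+22EF
  ]

-- Map one lexeme to its ASCII spelling; other lexemes pass through.
asciiSpelling :: String -> String
asciiSpelling tok = maybe tok id (lookup tok unicodeSynonyms)
```

For example, mapping asciiSpelling over the whitespace-separated
lexemes of "\x2200 a . a \x2192 a" and joining them with unwords gives
"forall a . a -> a".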

This will break source files that use non-ASCII Latin-1 characters.
In most cases this should result in a UTF-8 decoding error.  Later on
if we want to support more encodings (perhaps with a pragma to specify
the encoding), I plan to do it by recoding into UTF-8 before parsing.
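
To show why Latin-1 input usually fails to decode, and what "recoding
into UTF-8 before parsing" could look like, here is a small sketch over
lists of bytes.  It is not GHC's decoder: decodeUtf8 and latin1ToUtf8
are hypothetical helpers written for this note, and a real
implementation would work on buffers rather than lists.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Strict UTF-8 decoding.  On failure, Left carries the byte offset of
-- the first malformed sequence.
decodeUtf8 :: [Word8] -> Either Int String
decodeUtf8 = go 0
  where
    go _ [] = Right []
    go off (b : bs)
      | b < 0x80           = (chr (fromIntegral b) :) <$> go (off + 1) bs
      | b .&. 0xE0 == 0xC0 = multi 1 0x80    (fromIntegral (b .&. 0x1F))
      | b .&. 0xF0 == 0xE0 = multi 2 0x800   (fromIntegral (b .&. 0x0F))
      | b .&. 0xF8 == 0xF0 = multi 3 0x10000 (fromIntegral (b .&. 0x07))
      | otherwise          = Left off
      where
        -- n continuation bytes are expected; minCode is the smallest
        -- code point that legitimately needs this sequence length.
        multi n minCode lead =
          let (conts, rest) = splitAt n bs
              code :: Int
              code = foldl (\acc c -> (acc `shiftL` 6) .|. fromIntegral (c .&. 0x3F))
                           lead conts
              valid = length conts == n
                   && all (\c -> c .&. 0xC0 == 0x80) conts
                   && code >= minCode                         -- no overlong forms
                   && code <= 0x10FFFF                        -- inside Unicode range
                   && not (code >= 0xD800 && code <= 0xDFFF)  -- no surrogates
          in if valid then (chr code :) <$> go (off + n + 1) rest else Left off

-- Recode Latin-1 bytes as UTF-8: every byte is its own code point, so
-- bytes >= 0x80 become a two-byte sequence.
latin1ToUtf8 :: [Word8] -> [Word8]
latin1ToUtf8 = concatMap enc
  where
    enc b
      | b < 0x80  = [b]
      | otherwise = [0xC0 .|. (b `shiftR` 6), 0x80 .|. (b .&. 0x3F)]
```

The Latin-1 bytes for "café", [0x63, 0x61, 0x66, 0xE9], fail with
Left 3, because 0xE9 looks like the start of a three-byte sequence
with nothing after it.  After latin1ToUtf8 the same input decodes
successfully.  The overlong pair [0xC0, 0x80] and the encoded surrogate
[0xED, 0xA0, 0x80] are also rejected.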

Internally, there were some pretty big changes:

  - FastStrings are now stored in UTF-8

  - Z-encoding has been moved right to the back end.  Previously we
    used to Z-encode every identifier on the way in for simplicity,
    and only decode when we needed to show something to the user.
    Instead, we now keep every string in its UTF-8 encoding, and
    Z-encode right before printing it out.  To avoid Z-encoding the
    same string multiple times, the Z-encoding is cached inside the
    FastString the first time it is requested.

    This speeds up the compiler - I've measured some definite
    improvement in parsing at least, and I expect compilations overall
    to be faster too.  It also cleans up a lot of cruft from the
    OccName interface.  Z-encoding is nicely hidden inside the
    Outputable instance for Names & OccNames now.

  - StringBuffers are UTF-8 too, and are now represented as
    ForeignPtrs.

  - I've put together some test cases, not by any means exhaustive,
    but there are some interesting UTF-8 decoding error cases that
    aren't obvious.  Also, take a look at unicode001.hs for a demo.
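
The "encode late, cache the result" idea from the Z-encoding item above
can be sketched as follows.  This is an assumption-laden illustration,
not the FastString implementation: the FS type, mkFS and zEncode are
hypothetical names, the cache here is simply a lazy field, and the
escape table is a small subset written from memory of GHC's Z-encoding
scheme.

```haskell
import Data.Char (isAsciiLower, isAsciiUpper, isDigit, ord)
import Numeric (showHex)

-- Hypothetical stand-in for a FastString: the original text plus its
-- Z-encoding.  The second field is lazy, so the encoding is computed
-- at most once, the first time it is demanded.
data FS = FS
  { fsString   :: String
  , fsZEncoded :: String
  }

mkFS :: String -> FS
mkFS s = FS s (zEncode s)

-- Z-encode a string so it is safe to use as an assembler/C symbol.
zEncode :: String -> String
zEncode = concatMap encodeCh

encodeCh :: Char -> String
encodeCh c
  | unencoded c = [c]
  | otherwise   = case lookup c table of
      Just s  -> s
      Nothing -> 'z' : padded   -- generic escape: z<hex code point>U
  where
    unencoded x = x /= 'z' && x /= 'Z'
               && (isAsciiLower x || isAsciiUpper x || isDigit x)
    -- Partial table (assumed subset of the full scheme).
    table = [ ('z', "zz"), ('Z', "ZZ"), ('.', "zi"), ('_', "zu")
            , (':', "ZC"), ('$', "zd"), ('=', "ze"), ('>', "zg")
            , ('<', "zl"), ('-', "zm"), ('+', "zp") ]
    hex    = showHex (ord c) "U"
    -- Prefix a 0 when the hex string would start with a letter.
    padded = if isDigit (head hex) then hex else '0' : hex
```

Under these assumptions, fsZEncoded (mkFS "Data.Map") is "DataziMap"
and zEncode ">>=" is "zgzgze".  A name that is never printed never
pays for its encoding, which is the point of moving Z-encoding to the
back end.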
71 files changed:
ghc/compiler/HsVersions.h
ghc/compiler/Makefile
ghc/compiler/basicTypes/Id.lhs
ghc/compiler/basicTypes/Literal.lhs
ghc/compiler/basicTypes/MkId.lhs
ghc/compiler/basicTypes/Module.lhs
ghc/compiler/basicTypes/Name.lhs
ghc/compiler/basicTypes/OccName.lhs
ghc/compiler/basicTypes/RdrName.lhs
ghc/compiler/cmm/CLabel.hs
ghc/compiler/cmm/Cmm.hs
ghc/compiler/cmm/CmmLex.x
ghc/compiler/cmm/CmmParse.y
ghc/compiler/cmm/PprC.hs
ghc/compiler/cmm/PprCmm.hs
ghc/compiler/codeGen/CgProf.hs
ghc/compiler/codeGen/CgUtils.hs
ghc/compiler/codeGen/ClosureInfo.lhs
ghc/compiler/deSugar/Check.lhs
ghc/compiler/deSugar/DsForeign.lhs
ghc/compiler/deSugar/DsMeta.hs
ghc/compiler/deSugar/DsUtils.lhs
ghc/compiler/ghci/ByteCodeGen.lhs
ghc/compiler/ghci/ByteCodeLink.lhs
ghc/compiler/ghci/InteractiveUI.hs
ghc/compiler/hsSyn/Convert.lhs
ghc/compiler/hsSyn/HsDecls.lhs
ghc/compiler/hsSyn/HsUtils.lhs
ghc/compiler/iface/LoadIface.lhs
ghc/compiler/iface/MkIface.lhs
ghc/compiler/main/DriverMkDepend.hs
ghc/compiler/main/DriverPipeline.hs
ghc/compiler/main/Finder.lhs
ghc/compiler/main/GHC.hs
ghc/compiler/main/HscTypes.lhs
ghc/compiler/nativeGen/PprMach.hs
ghc/compiler/ndpFlatten/FlattenMonad.hs
ghc/compiler/parser/Ctype.lhs
ghc/compiler/parser/Lexer.x
ghc/compiler/parser/Parser.y.pp
ghc/compiler/parser/ParserCore.y
ghc/compiler/parser/RdrHsSyn.lhs
ghc/compiler/prelude/PrelNames.lhs
ghc/compiler/prelude/PrelRules.lhs
ghc/compiler/prelude/PrimOp.lhs
ghc/compiler/prelude/TysPrim.lhs
ghc/compiler/prelude/TysWiredIn.lhs
ghc/compiler/profiling/CostCentre.lhs
ghc/compiler/rename/RnEnv.lhs
ghc/compiler/rename/RnExpr.lhs
ghc/compiler/rename/RnNames.lhs
ghc/compiler/simplCore/SetLevels.lhs
ghc/compiler/simplCore/SimplMonad.lhs
ghc/compiler/simplCore/Simplify.lhs
ghc/compiler/stgSyn/CoreToStg.lhs
ghc/compiler/typecheck/Inst.lhs
ghc/compiler/typecheck/TcClassDcl.lhs
ghc/compiler/typecheck/TcGenDeriv.lhs
ghc/compiler/typecheck/TcInstDcls.lhs
ghc/compiler/typecheck/TcRnDriver.lhs
ghc/compiler/typecheck/TcSplice.lhs
ghc/compiler/types/TypeRep.lhs
ghc/compiler/utils/Binary.hs
ghc/compiler/utils/BufWrite.hs
ghc/compiler/utils/Encoding.hs [new file with mode: 0644]
ghc/compiler/utils/FastString.lhs
ghc/compiler/utils/FastTypes.lhs
ghc/compiler/utils/Pretty.lhs
ghc/compiler/utils/PrimPacked.lhs [deleted file]
ghc/compiler/utils/StringBuffer.lhs
ghc/compiler/utils/UnicodeUtil.lhs [deleted file]