Profiling

Glasgow Haskell comes with a time and space profiling
system. Its purpose is to help you improve your understanding of
your program's execution behaviour, so you can improve it. Any comments, suggestions and/or improvements you have are
welcome. Recommended “profiling tricks” would be
especially cool!

Profiling a program is a three-step process:

- Re-compile your program for profiling with the
-prof option, and probably one of the
-auto or -auto-all
options. These options are described in more detail in
“Compiler options for profiling” below.

- Run your program with one of the profiling options
-p or -h. This generates
a file of profiling information.

- Examine the generated profiling information, using one of
GHC's profiling tools. The tool to use will depend on the kind
of profiling information generated.

Cost centres and cost-centre stacks

GHC's profiling system assigns costs
to cost centres. A cost is simply the time
or space required to evaluate an expression. Cost centres are
program annotations around expressions; all costs incurred by the
annotated expression are assigned to the enclosing cost centre.
Furthermore, GHC will remember the stack of enclosing cost centres
for any given expression at run-time and generate a call-graph of
cost attributions.

Let's take a look at an example:
main = print (nfib 25)
nfib n = if n < 2 then 1 else nfib (n-1) + nfib (n-2)
Compile and run this program as follows:
$ ghc -prof -auto-all -o Main Main.hs
$ ./Main +RTS -p
121393
$
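As a quick sanity check of the transcript, note that nfib n is just the (n+1)-th Fibonacci number in the 1, 1, 2, 3, 5, ... convention; a standalone sketch of the example (type signatures added here for clarity) prints the 121393 shown above:

```haskell
-- The example program, with explicit type signatures added.
-- nfib n is the (n+1)-th Fibonacci number (1, 1, 2, 3, 5, ...),
-- so nfib 25 evaluates to 121393, matching the transcript above.
module Main where

nfib :: Int -> Int
nfib n = if n < 2 then 1 else nfib (n - 1) + nfib (n - 2)

main :: IO ()
main = print (nfib 25)
```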
When a GHC-compiled program is run with the
-p RTS option, it generates a file called
<prog>.prof. In this case, the file
will contain something like this:
Fri May 12 14:06 2000 Time and Allocation Profiling Report (Final)
Main +RTS -p -RTS
total time = 0.14 secs (7 ticks @ 20 ms)
total alloc = 8,741,204 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
nfib Main 100.0 100.0
individual inherited
COST CENTRE MODULE scc %time %alloc %time %alloc
MAIN MAIN 0 0.0 0.0 100.0 100.0
main Main 0 0.0 0.0 0.0 0.0
CAF PrelHandle 3 0.0 0.0 0.0 0.0
CAF PrelAddr 1 0.0 0.0 0.0 0.0
CAF Main 6 0.0 0.0 100.0 100.0
main Main 1 0.0 0.0 100.0 100.0
nfib Main 242785 100.0 100.0 100.0 100.0
The first part of the file gives the program name and
options, and the total time and total memory allocation measured
during the run of the program (note that the total memory
allocation figure isn't the same as the amount of
live memory needed by the program at any one
time; the latter can be determined using heap profiling, which we
will describe shortly).

The second part of the file is a break-down by cost centre
of the most costly functions in the program. In this case, there
was only one significant function in the program, namely
nfib, and it was responsible for 100%
of both the time and allocation costs of the program.

The third and final section of the file gives a profile
break-down by cost-centre stack. This is roughly a call-graph
profile of the program. In the example above, it is clear that
the costly call to nfib came from
main.

The time and allocation incurred by a given part of the
program are displayed in two ways: “individual”, which
are the costs incurred by the code covered by this cost centre
stack alone, and “inherited”, which includes the costs
incurred by all the children of this node.

The usefulness of cost-centre stacks is better demonstrated
by modifying the example slightly:
main = print (f 25 + g 25)
f n = nfib n
g n = nfib (n `div` 2)
nfib n = if n < 2 then 1 else nfib (n-1) + nfib (n-2)
Compile and run this program as before, and take a look at
the new profiling results:
COST CENTRE MODULE scc %time %alloc %time %alloc
MAIN MAIN 0 0.0 0.0 100.0 100.0
main Main 0 0.0 0.0 0.0 0.0
CAF PrelHandle 3 0.0 0.0 0.0 0.0
CAF PrelAddr 1 0.0 0.0 0.0 0.0
CAF Main 9 0.0 0.0 100.0 100.0
main Main 1 0.0 0.0 100.0 100.0
g Main 1 0.0 0.0 0.0 0.2
nfib Main 465 0.0 0.2 0.0 0.2
f Main 1 0.0 0.0 100.0 99.8
nfib Main 242785 100.0 99.8 100.0 99.8
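The scc entry counts above can be checked by hand: every call of nfib n makes one entry plus those of its two recursive calls, so the total is 2 * nfib n - 1. The helper entries below is our own illustrative function, not something GHC provides:

```haskell
-- nfib as in the example, plus an illustrative helper that counts
-- how many times nfib is entered when evaluating nfib n.
-- By induction, entries n == 2 * nfib n - 1.
nfib :: Int -> Int
nfib n = if n < 2 then 1 else nfib (n - 1) + nfib (n - 2)

entries :: Int -> Int
entries n = if n < 2 then 1 else 1 + entries (n - 1) + entries (n - 2)

main :: IO ()
main = print (entries 25, entries (25 `div` 2))
```

entries 25 is 242785, the count recorded under f, and entries 12 is 465, the count under g; since each call allocates about the same amount, the 99.8%/0.2% allocation split is roughly the ratio of these two counts.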
Now although we had two calls to nfib
in the program, it is immediately clear that it was the call from
f which took all the time.

The actual meaning of the various columns in the output is:

entries: The number of times this particular point in the call
graph was entered.

individual %time: The percentage of the total run time of the program
spent at this point in the call graph.

individual %alloc: The percentage of the total memory allocations
(excluding profiling overheads) of the program made by this
call.

inherited %time: The percentage of the total run time of the program
spent below this point in the call graph.

inherited %alloc: The percentage of the total memory allocations
(excluding profiling overheads) of the program made by this
call and all of its sub-calls.

In addition you can use the -P RTS option
to
get the following additional information:

ticks: The raw number of time “ticks” which were
attributed to this cost-centre; from this, we get the
%time figure mentioned
above.

bytes: Number of bytes allocated in the heap while in this
cost-centre; again, this is the raw number from which we get
the %alloc figure mentioned
above.

What about recursive functions, and mutually recursive
groups of functions? Where are the costs attributed? Well,
although GHC does keep information about which groups of functions
called each other recursively, this information isn't displayed in
the basic time and allocation profile; instead, the call-graph is
flattened into a tree. The XML profiling tool (described in
“Graphical time/allocation profile” below) will be able to display real loops in
the call-graph.

Inserting cost centres by hand

Cost centres are just program annotations. When you say
-auto-all to the compiler, it automatically
inserts a cost centre annotation around every top-level function
in your program, but you are entirely free to add the cost
centre annotations yourself.

The syntax of a cost centre annotation is
{-# SCC "name" #-} <expression>
where "name" is an arbitrary string
that will become the name of your cost centre as it appears
in the profiling output, and
<expression> is any Haskell
expression. An SCC annotation extends as
far to the right as possible when parsing.

Rules for attributing costs

The cost of evaluating any expression in your program is
attributed to a cost-centre stack using the following rules:

- If the expression is part of the
one-off costs of evaluating the
enclosing top-level definition, then costs are attributed to
the stack of lexically enclosing SCC
annotations on top of the special CAF
cost-centre.

- Otherwise, costs are attributed to the stack of
lexically-enclosing SCC annotations,
appended to the cost-centre stack in effect at the
call site of the current top-level
definition. (The call-site is just the place
in the source code which mentions the particular function or
variable.) Notice that this is a recursive
definition.

What do we mean by one-off costs? Well, Haskell is a lazy
language, and certain expressions are only ever evaluated once.
For example, if we write:
x = nfib 25
then x will only be evaluated once (if
at all), and subsequent demands for x will
immediately get to see the cached result. The definition
x is called a CAF (Constant Applicative
Form), because it has no arguments.

For the purposes of profiling, we say that the expression
nfib 25 belongs to the one-off costs of
evaluating x.

Since one-off costs aren't strictly speaking part of the
call-graph of the program, they are attributed to a special
top-level cost centre, CAF. There may be one
CAF cost centre for each module (the
default), or one for each top-level definition with any one-off
costs (this behaviour can be selected by giving GHC the
-caf-all flag).

If you think you have a weird profile, or the call-graph
doesn't look like you expect it to, feel free to send it (and
your program) to us at
glasgow-haskell-bugs@haskell.org.

Profiling memory usage

In addition to profiling the time and allocation behaviour
of your program, you can also generate a graph of its memory usage
over time. This is useful for detecting the causes of
space leaks, when your program holds on to
more memory at run-time than it needs to. Space leaks lead to
longer run-times due to heavy garbage collector activity, and may
even cause the program to run out of memory altogether.

To generate a heap profile from your program, compile it as
before, but this time run it with the -h runtime
option. This generates a
<prog>.hp file, which you then process
with hp2ps to produce a PostScript file
<prog>.ps. The PostScript file can be
viewed with something like ghostview, or
printed out on a PostScript-compatible printer.

For the RTS options that control the kind of heap profile
generated, see “Runtime options for profiling” below. Details on the
usage of the hp2ps program are given in
“hp2ps -- heap profile to PostScript” below.

Graphical time/allocation profile

You can view the time and allocation profiling graph of your
program graphically, using ghcprof. This is a
new tool with GHC 4.08, and will eventually be the de-facto
standard way of viewing GHC profiles.

To run ghcprof, you need
daVinci installed, which can be
obtained from The Graph
Visualisation Tool daVinci. Install one of
the binary
distributions (daVinci is
sadly not open-source :-(), and set your
DAVINCIHOME environment variable to point to the
installation directory.

ghcprof uses an XML-based profiling log
format, and you therefore need to run your program with a
different option: -px. The file generated is
still called <prog>.prof. To see the
profile, run ghcprof like this:
$ ghcprof <prog>.prof
which should pop up a window showing the call-graph of your
program in glorious detail. More information on using
ghcprof can be found at The
Cost-Centre Stack Profiling Tool for
GHC.

Compiler options for profiling

To make use of the cost centre profiling system
all modules must be compiled and linked with
the -prof option. Any
_scc_ constructs you've put in
your source will spring to life.

Without a -prof option, your
_scc_s are ignored; so you can
compile _scc_-laden code
without changing it.

There are a few other profiling-related compilation options.
Use them in addition to
-prof. These do not have to be used consistently
for all modules in a program:

-auto: GHC will automatically add
_scc_ constructs for all
top-level, exported functions.

-auto-all: All top-level functions,
exported or not, will be automatically
_scc_'d.

-caf-all: The costs of all CAFs in a module are usually
attributed to one “big” CAF cost-centre. With
this option, all CAFs get their own cost-centre. An
“if all else fails” option…

-ignore-scc: Ignore any _scc_
constructs, so a module which already has
_scc_s can be compiled
for profiling with the annotations ignored.

Runtime options for profiling

It isn't enough to compile your program for profiling with
-prof! When you run your profiled program, you
must tell the runtime system (RTS) what you want to profile (e.g.,
time and/or space), and how you wish the collected data to be
reported. You also may wish to set the sampling interval used in
time profiling.

Executive summary: ./a.out +RTS -pT
produces a time profile in a.out.prof;
./a.out +RTS -hC produces space-profiling info
which can be mangled by hp2ps and viewed with
ghostview (or equivalent).

Profiling runtime flags are passed to your program between
the usual +RTS and -RTS
options.

-p or -P: time profile. The -p option produces a standard
time profile report. It is written
into the file
<program>.prof. The -P option produces a more
detailed report containing the actual time and allocation
data as well. (Not used much.)

-px: The -px option generates profiling
information in the XML format understood by our new
profiling tool, see “Graphical time/allocation profile” above.

-i<secs>: Set the profiling (sampling) interval to
<secs> seconds (the default is
1 second). Fractions are allowed: for example
-i0.2 will get 5 samples per second. This
only affects heap profiling; time profiles are always
sampled on a 1/50 second frequency.

-h<break-down>: Produce a detailed heap profile
of the heap occupied by live closures. The profile is
written to the file <program>.hp
from which a PostScript graph can be produced using
hp2ps (see “hp2ps -- heap profile to PostScript” below).

The heap space profile may be broken down by different
criteria:

-hC: cost centre which produced the closure (the
default).

-hM: cost centre module which produced the
closure.

-hD: closure description, a string describing
the closure.

-hY: closure type, a string describing the
closure's type.

It's often useful to select just some subset of the
heap when profiling. To do this, the following filters are
available. You may use multiple filters, in which case a
closure has to satisfy all filters to appear in the final
profile. Filtering criteria are independent of what it is
you ask to see. So, for example, you can specify a profile
by closure description (-hD) but ask to
filter closures by producer module (-hm{...}).
Available filters are:

-hc{...}: Restrict to one of the specified cost centres.
Since GHC deals in cost-centre stacks, the specified
cost centres pertain to the top stack element. For
example, -hc{Wurble,Burble} selects
all cost-centre stacks whose top element is
Wurble or
Burble.
-hm{...}: Restrict to closures produced by functions in
one of the specified modules.
-hd{...}: Restrict to closures whose description-string is
one of the specified descriptions. Description
strings are pretty arcane. An easy way to find
plausible strings to specify is to first do a
-hD profile and then inspect the
description-strings which appear in the resulting profile.
-hy{...}: Restrict to closures having one of the specified
types.
-hx: The -hx option generates heap
profiling information in the XML format understood by our
new profiling tool (NOTE: heap profiling with the new tool
is not yet working! Use hp2ps-style heap
profiling for the time being).

-xc: This option makes use of the extra information
maintained by the cost-centre-stack profiler to provide
useful information about the location of runtime errors.
hp2ps -- heap profile to PostScript

Usage:
hp2ps [flags] [<file>[.hp]]
The program
hp2ps
converts a heap profile as produced
by the -h runtime option into a
PostScript graph of the heap profile. By convention, the file to
be processed by hp2ps has a
.hp extension. The PostScript output is
written to <file>.ps. If
<file> is omitted entirely, then the
program behaves as a filter.

hp2ps is distributed in
ghc/utils/hp2ps in a GHC source
distribution. It was originally developed by Dave Wakeling as part
of the HBC/LML heap profiler.

The flags are:

-d: In order to make graphs more readable,
hp2ps sorts the shaded bands for each
identifier. The default sort ordering is for the bands with
the largest area to be stacked on top of the smaller ones.
The -d option causes rougher bands (those
representing series of values with the largest standard
deviations) to be stacked on top of smoother ones.

-b: Normally, hp2ps puts the title of
the graph in a small box at the top of the page. However, if
the JOB string is too long to fit in a small box (more than
35 characters), then hp2ps will choose to
use a big box instead. The -b option
forces hp2ps to use a big box.

-e<float>: Generate encapsulated PostScript suitable for
inclusion in LaTeX documents. Usually, the PostScript graph
is drawn in landscape mode in an area 9 inches wide by 6
inches high, and hp2ps arranges for this
area to be approximately centred on a sheet of A4 paper.
This format is convenient for studying the graph in detail,
but it is unsuitable for inclusion in LaTeX documents. The
-e option causes the graph to be drawn in
portrait mode, with float specifying the width in inches,
millimetres or points (the default). The resulting
PostScript file conforms to the Encapsulated PostScript
(EPS) convention, and it can be included in a LaTeX document
using Rokicki's dvi-to-PostScript converter
dvips.

-g: Create output suitable for the gs
PostScript previewer (or similar). In this case the graph is
printed in portrait mode without scaling. The output is
unsuitable for a laser printer.

-l: Normally a profile is limited to 20 bands with
additional identifiers being grouped into an
OTHER band. The -l flag
removes this 20-band limit, producing as many bands as
necessary. No key is produced as it won't fit! It is useful
for creation-time profiles with many bands.

-m<int>: Normally a profile is limited to 20 bands with
additional identifiers being grouped into an
OTHER band. The -m flag
specifies an alternative band limit (the maximum is
20). -m0 requests the band limit to be
removed. As many bands as necessary are produced. However no
key is produced as it won't fit! It is useful for displaying
creation-time profiles with many bands.

-p: Use previous parameters. By default, the PostScript
graph is automatically scaled both horizontally and
vertically so that it fills the page. However, when
preparing a series of graphs for use in a presentation, it
is often useful to draw a new graph using the same scale,
shading and ordering as a previous one. The -p
flag causes the graph to be drawn using
the parameters determined by a previous run of
hp2ps on <file>. These
are extracted from <file>.aux.

-s: Use a small box for the title.

-t<float>: Normally trace elements which sum to a total of less
than 1% of the profile are removed from the
profile. The -t option allows this
percentage to be modified (maximum 5%). -t0
requests no trace elements to be
removed from the profile, ensuring that all the data will be
displayed.

-c: Generate colour output.

-y: Ignore marks.

-?: Print out usage information.

Using “ticky-ticky” profiling (for implementors)

(ToDo: document properly.)

It is possible to compile Glasgow Haskell programs so that
they will count lots and lots of interesting things, e.g., number
of updates, number of data constructors entered, etc., etc. We
call this “ticky-ticky”
profiling, because that's the sound a Sun4
makes when it is running up all those counters
(slowly).

Ticky-ticky profiling is mainly intended for implementors;
it is quite separate from the main “cost-centre”
profiling system, intended for all users everywhere.

To be able to use ticky-ticky profiling, you will need to
have built appropriate libraries and things when you made the
system. See “Customising what libraries to build,” in
the installation guide.

To get your compiled program to spit out the ticky-ticky
numbers, use a -r RTS
option.

Compiling your program with the -ticky
switch yields an executable that performs these counts. Here is a
sample ticky-ticky statistics file, generated by the invocation
foo +RTS -rfoo.ticky.
foo +RTS -rfoo.ticky
ALLOCATIONS: 3964631 (11330900 words total: 3999476 admin, 6098829 goods, 1232595 slop)
total words: 2 3 4 5 6+
69647 ( 1.8%) function values 50.0 50.0 0.0 0.0 0.0
2382937 ( 60.1%) thunks 0.0 83.9 16.1 0.0 0.0
1477218 ( 37.3%) data values 66.8 33.2 0.0 0.0 0.0
0 ( 0.0%) big tuples
2 ( 0.0%) black holes 0.0 100.0 0.0 0.0 0.0
0 ( 0.0%) prim things
34825 ( 0.9%) partial applications 0.0 0.0 0.0 100.0 0.0
2 ( 0.0%) thread state objects 0.0 0.0 0.0 0.0 100.0
Total storage-manager allocations: 3647137 (11882004 words)
[551104 words lost to speculative heap-checks]
STACK USAGE:
ENTERS: 9400092 of which 2005772 (21.3%) direct to the entry code
[the rest indirected via Node's info ptr]
1860318 ( 19.8%) thunks
3733184 ( 39.7%) data values
3149544 ( 33.5%) function values
[of which 1999880 (63.5%) bypassed arg-satisfaction chk]
348140 ( 3.7%) partial applications
308906 ( 3.3%) normal indirections
0 ( 0.0%) permanent indirections
RETURNS: 5870443
2137257 ( 36.4%) from entering a new constructor
[the rest from entering an existing constructor]
2349219 ( 40.0%) vectored [the rest unvectored]
RET_NEW: 2137257: 32.5% 46.2% 21.3% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
RET_OLD: 3733184: 2.8% 67.9% 29.3% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
RET_UNBOXED_TUP: 2: 0.0% 0.0%100.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
RET_VEC_RETURN : 2349219: 0.0% 0.0%100.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
UPDATE FRAMES: 2241725 (0 omitted from thunks)
SEQ FRAMES: 1
CATCH FRAMES: 1
UPDATES: 2241725
0 ( 0.0%) data values
34827 ( 1.6%) partial applications
[2 in place, 34825 allocated new space]
2206898 ( 98.4%) updates to existing heap objects (46 by squeezing)
UPD_CON_IN_NEW: 0: 0 0 0 0 0 0 0 0 0
UPD_PAP_IN_NEW: 34825: 0 0 0 34825 0 0 0 0 0
NEW GEN UPDATES: 2274700 ( 99.9%)
OLD GEN UPDATES: 1852 ( 0.1%)
Total bytes copied during GC: 190096
**************************************************
3647137 ALLOC_HEAP_ctr
11882004 ALLOC_HEAP_tot
69647 ALLOC_FUN_ctr
69647 ALLOC_FUN_adm
69644 ALLOC_FUN_gds
34819 ALLOC_FUN_slp
34831 ALLOC_FUN_hst_0
34816 ALLOC_FUN_hst_1
0 ALLOC_FUN_hst_2
0 ALLOC_FUN_hst_3
0 ALLOC_FUN_hst_4
2382937 ALLOC_UP_THK_ctr
0 ALLOC_SE_THK_ctr
308906 ENT_IND_ctr
0 E!NT_PERM_IND_ctr requires +RTS -Z
[... lots more info omitted ...]
0 GC_SEL_ABANDONED_ctr
0 GC_SEL_MINOR_ctr
0 GC_SEL_MAJOR_ctr
0 GC_FAILED_PROMOTION_ctr
47524 GC_WORDS_COPIED_ctr
The formatting of the information above the row of asterisks
is subject to change, but hopefully provides a useful
human-readable summary. Below the asterisks all
counters maintained by the ticky-ticky system are
dumped, in a format intended to be machine-readable: zero or more
spaces, an integer, a space, the counter name, and a newline.

In fact, not all counters are
necessarily dumped; compile- or run-time flags can render certain
counters invalid. In this case, either the counter will simply
not appear, or it will appear with a modified counter name,
possibly along with an explanation for the omission (notice
ENT_PERM_IND_ctr appears
with an inserted ! above). Software analysing
this output should always check that it has the counters it
expects. Also, beware: some of the counters can have
large values!
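Since the dump below the asterisks is designed to be machine-readable, checking for expected counters is straightforward. The following parseCounter is our own illustrative sketch, not a tool shipped with GHC; it accepts only lines of the plain form and returns Nothing for anything else, including counters annotated with a ! and an explanation:

```haskell
import Data.Char (isDigit)

-- Illustrative sketch (not part of GHC): parse one counter line of
-- the form "zero or more spaces, an integer, a space, the counter
-- name". Anything else, such as the asterisk separator or an
-- annotated line like "0 E!NT_PERM_IND_ctr requires +RTS -Z",
-- yields Nothing.
parseCounter :: String -> Maybe (Integer, String)
parseCounter line =
  case words line of
    [num, name] | not (null num) && all isDigit num -> Just (read num, name)
    _ -> Nothing

main :: IO ()
main = mapM_ (print . parseCounter)
  [ "  3647137 ALLOC_HEAP_ctr"
  , "0 E!NT_PERM_IND_ctr requires +RTS -Z"
  ]
```

A caller would typically map parseCounter over the lines after the asterisk row and then verify that every counter name it expects is present.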