Profiling
Glasgow Haskell comes with a time and space profiling system. Its
purpose is to help you improve your understanding of your program's
execution behaviour, so you can improve it.
Any comments, suggestions and/or improvements you have are welcome.
Recommended ``profiling tricks'' would be especially cool!
How to profile a Haskell program
The GHC approach to profiling is very simple: annotate the expressions
you consider ``interesting'' with cost centre labels (strings);
so, for example, you might have:
f x y
  = let
      output1 = _scc_ "Pass1" ( pass1 x )
      output2 = _scc_ "Pass2" ( pass2 output1 y )
      output3 = _scc_ "Pass3" ( pass3 (output2 `zip` [1 .. ]) )
    in
      concat output3
The costs of evaluating the expressions bound to output1,
output2 and output3 will be attributed to the ``cost
centres'' Pass1, Pass2 and Pass3, respectively.
The costs of evaluating other expressions, e.g., concat output3,
will be inherited by the scope which references the function f.
You can put in cost-centres via _scc_ constructs by hand, as in the
example above. Perfectly cool. That's probably what you
would do if your program divided into obvious ``passes'' or
``phases'', or whatever.
If your program is large or you have no clue what might be gobbling
all the time, you can get GHC to mark all top-level functions with _scc_
constructs, automagically, by adding the -auto-all compilation flag to
your usual options.
Once you start homing in on the Guilty Suspects, you may well switch
from automagically-inserted cost-centres to a few well-chosen ones of
your own.
To use profiling, you must compile and run with special
options. (We usually forget the ``run'' magic!—Do as we say, not as
we do…) Details follow.
If you're serious about this profiling game, you should probably read
one or more of the Sansom/Peyton Jones papers about the GHC profiling
system. Just visit the Glasgow FP group web page…
Compiling programs for profiling
To make use of the cost centre profiling system, all modules must
be compiled and linked with the -prof option.
Any _scc_ constructs you've put in your source will spring to life.
Without the -prof option, your _scc_s are ignored; so you can
compile _scc_-laden code without changing it.
There are a few other profiling-related compilation options. Use them
in addition to -prof. These do not have to be used
consistently for all modules in a program.
-auto:
GHC will automatically add _scc_ constructs for
all top-level, exported functions.
-auto-all:
All top-level functions, exported or not, will be automatically
_scc_'d.
-caf-all:
The costs of all CAFs in a module are usually attributed to one
``big'' CAF cost-centre. With this option, all CAFs get their own cost-centre.
An ``if all else fails'' option…
-ignore-scc:
Ignore any _scc_ constructs,
so a module which already has _scc_s can be
compiled for profiling with the annotations ignored.
-G<group>:
Specifies the <group> to be attached to all the cost-centres
declared in the module. If no group is specified it defaults to the
module name.
In addition to the -prof option, your system might be set up to enable
you to compile and link with the -prof-details
option instead. This enables additional detailed counts
to be reported with the -P RTS option.
How to control your profiled program at runtime
It isn't enough to compile your program for profiling with -prof!
When you run your profiled program, you must tell the runtime
system (RTS) what you want to profile (e.g., time and/or space), and
how you wish the collected data to be reported. You also may wish to
set the sampling interval used in time profiling.
Executive summary: ./a.out +RTS -pT produces a time profile in
a.out.prof; ./a.out +RTS -hC produces space-profiling
info which can be mangled by hp2ps and viewed with ghostview
(or equivalent).
Profiling runtime flags are passed to your program between the usual
+RTS and -RTS options.
-p<sort> or -P<sort>:
The -p option produces a standard time profile report.
It is written into the file <program>.prof.
The -P option produces a more detailed report containing the
actual time and allocation data as well. (Not used much.)
The <sort> indicates how the cost centres are to be sorted in the
report. Valid <sort> options are:
T:
by time, largest first (the default);
A:
by bytes allocated, largest first;
C:
alphabetically by group, module and cost centre.
-i<secs>:
Set the profiling (sampling) interval to <secs>
seconds (the default is 1 second). Fractions are allowed: for example
-i0.2 will get 5 samples per second.
-h<break-down>:
Produce a detailed space profile of the heap occupied by live
closures. The profile is written to the file <program>.hp, from
which a PostScript graph can be produced using hp2ps (see the hp2ps
section below).
The heap space profile may be broken down by different criteria:
C:
cost centre which produced the closure (the default).
M:
cost centre module which produced the closure.
G:
cost centre group which produced the closure.
D:
closure description—a string describing the closure.
Y:
closure type—a string describing the closure's type.
By default all live closures in the heap are profiled, but particular
closures of interest can be selected (see below).
Heap (space) profiling uses hash tables. If these tables fill up, the
run will abort. The -z<tbl><size> option is used to
increase the size of the relevant hash table, where <tbl> is one of
C, M, G, D or Y (defined as for <break-down> above). The
actual size used is the next largest power of 2.
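To illustrate the rounding rule (this is a sketch, not GHC's actual implementation), ``the next largest power of 2'' means the requested size is rounded up like so:

```haskell
module Main where

-- Illustrative only: round a requested hash-table size up to the next
-- power of two, as described for the -z<tbl><size> option.
nextPow2 :: Int -> Int
nextPow2 n = head (dropWhile (< n) (iterate (* 2) 1))

main :: IO ()
main = print (map nextPow2 [1000, 5000])  -- prints [1024,8192]
```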
The heap profile can be restricted to particular closures of interest.
The closures of interest can be selected by the attached cost centre
(module:label, module and group) or closure category (description, type,
and kind) using the following options:
-c{<lab>}:
Selects individual cost centre(s).
-m{<mod>}:
Selects all cost centres from the module(s) specified.
-g{<grp>}:
Selects all cost centres from the group(s) specified.
-d{<des>}:
Selects closures which have one of the specified descriptions.
-y{<typ>}:
Selects closures which have one of the specified type descriptions.
-k{<knd>}:
Selects closures which are of one of the specified closure kinds.
Valid closure kinds are CON (constructor), FN (manifest
function), PAP (partial application), BH (black hole) and
THK (thunk).
The space occupied by a closure will be reported in the heap profile
if the closure satisfies the following logical expression:
([-c] or [-m] or [-g]) and ([-d] or [-y] or [-k])
where a particular option is true if the closure (or its attached cost
centre) is selected by the option (or the option is not specified).
What's in a profiling report?
When you run your profiled program with the -p RTS option, you get
the following information about your ``cost centres'':
COST CENTRE:
The cost-centre's name.
MODULE:
The module associated with the cost-centre;
important mostly if you have identically-named cost-centres in
different modules.
scc:
How many times this cost-centre was entered; think
of it as ``I got to the _scc_ construct this many times…''
%time:
What part of the time was spent in this cost-centre (see also ``ticks,''
below).
%alloc:
What part of the memory allocation was done in this cost-centre
(see also ``bytes,'' below).
inner:
How many times this cost-centre ``passed control'' to an inner
cost-centre; for example, scc=4 plus inner=8 means
``This _scc_ was entered four times, but went out to
other _scc_s eight times.''
cafs:
How many CAFs this cost centre evaluated.
dicts:
How many dictionaries this cost centre evaluated.
In addition you can use the -P RTS option to get the following additional information:
ticks:
The raw number of time ``ticks'' which were
attributed to this cost-centre; from this, we get the %time
figure mentioned above.
bytes:
Number of bytes allocated in the heap while in
this cost-centre; again, this is the raw number from which we
get the %alloc figure mentioned above.
Finally, if you built your program with -prof-details,
the -P RTS option will also
produce the following information:
closures:
How many heap objects were allocated; these objects may be of varying
size. If you divide the number of bytes (mentioned below) by this
number of ``closures'', then you will get the average object size.
(Not too interesting, but still…)
thunks:
How many times we entered (evaluated) a thunk—an unevaluated
object in the heap—while we were in this cost-centre.
funcs:
How many times we entered (evaluated) a function while we were in this
cost-centre. (In Haskell, functions are first-class values and may be
passed as arguments, returned as results, evaluated, and generally
manipulated just like data values.)
PAPs:
How many times we entered (evaluated) a partial application (PAP), i.e.,
a function applied to fewer arguments than it needs. For example, Int
addition applied to one argument would be a PAP. A PAP is really
just a particular form for a function.
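The PAPs entry above corresponds to an everyday Haskell idiom. As a sketch (the name addOne is made up; no real profiled program is implied), partially applying a two-argument function yields exactly such a value:

```haskell
module Main where

-- A partial application: (+) needs two arguments; giving it only one
-- produces a PAP, which is itself a perfectly good function value.
addOne :: Int -> Int
addOne = (+) 1

main :: IO ()
main = print (map addOne [1, 2, 3])  -- prints [2,3,4]
```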
Producing graphical heap profiles
Utility programs which produce graphical profiles.
hp2ps--heap profile to PostScript
Usage:
hp2ps [flags] [<file>[.hp]]
The hp2ps program converts a heap profile,
as produced by the -h<break-down>
runtime option, into a PostScript graph of the heap
profile. By convention, the file to be processed by hp2ps has a
.hp extension. The PostScript output is written to <file>.ps. If
<file> is omitted entirely, then the program behaves as a filter.
hp2ps is distributed in ghc/utils/hp2ps in a GHC source
distribution. It was originally developed by Dave Wakeling as part of
the HBC/LML heap profiler.
The flags are:
-d:
In order to make graphs more readable, hp2ps sorts the shaded
bands for each identifier. The default sort ordering is for the bands
with the largest area to be stacked on top of the smaller ones. The
-d option causes rougher bands (those representing series of
values with the largest standard deviations) to be stacked on top of
smoother ones.
-b:
Normally, hp2ps puts the title of the graph in a small box at the
top of the page. However, if the JOB string is too long to fit in a
small box (more than 35 characters), then
hp2ps will choose to use a big box instead. The -b
option forces hp2ps to use a big box.
-e<float>:
Generate encapsulated PostScript suitable for inclusion in LaTeX
documents. Usually, the PostScript graph is drawn in landscape mode
in an area 9 inches wide by 6 inches high, and hp2ps arranges
for this area to be approximately centred on a sheet of A4 paper.
This format is convenient for studying the graph in detail, but it is
unsuitable for inclusion in LaTeX documents. The -e option
causes the graph to be drawn in portrait mode, with <float> specifying
the width in inches, millimetres or points (the default). The
resulting PostScript file conforms to the Encapsulated PostScript
(EPS) convention, and it can be included in a LaTeX document using
Rokicki's dvi-to-PostScript converter dvips.
-g:
Create output suitable for the gs PostScript previewer (or
similar). In this case the graph is printed in portrait mode without
scaling. The output is unsuitable for a laser printer.
-m<int>:
Normally a profile is limited to 20 bands with additional identifiers
being grouped into an OTHER band. The -m flag specifies an
alternative band limit (the maximum is 20).
-m0 requests the band limit to be removed. As many bands as
necessary are produced; however, no key is produced as it won't fit! This
is useful for displaying creation time profiles with many bands.
-p:
Use previous parameters. By default, the PostScript graph is
automatically scaled both horizontally and vertically so that it fills
the page. However, when preparing a series of graphs for use in a
presentation, it is often useful to draw a new graph using the same
scale, shading and ordering as a previous one. The -p flag causes
the graph to be drawn using the parameters determined by a previous
run of hp2ps on <file>. These are extracted from
<file>.aux.
-s:
Use a small box for the title.
-t<float>:
Normally trace elements which sum to a total of less than 1% of the
profile are removed from the profile. The -t option allows this
percentage to be modified (maximum 5%).
-t0 requests no trace elements to be removed from the profile,
ensuring that all the data will be displayed.
-?:
Print out usage information.
stat2resid—residency info from GC stats
Usage:
stat2resid [<file>[.stat] [<outfile>]]
The stat2resid program converts a detailed
garbage collection statistics file, produced by the
-S runtime option, into a PostScript heap
residency graph. The garbage collection statistics file can be
produced without compiling your program for profiling.
By convention, the file to be processed by stat2resid has a
.stat extension. If the <outfile> is not specified the
PostScript will be written to <file>.resid.ps. If
<file> is omitted entirely, then the program behaves as a filter.
The plot cannot be produced from the statistics file for a
generational collector, though a suitable stats file can be produced
using the -F2s runtime option when the
program has been compiled for generational garbage collection (the
default).
stat2resid is distributed in ghc/utils/stat2resid in a GHC source
distribution.
Using ``ticky-ticky'' profiling (for implementors)
(ToDo: document properly.)
It is possible to compile Glasgow Haskell programs so that they will
count lots and lots of interesting things, e.g., number of updates,
number of data constructors entered, etc., etc. We call this
``ticky-ticky'' profiling, because that's the sound a Sun4 makes
when it is running up all those counters (slowly).
Ticky-ticky profiling is mainly intended for implementors; it is quite
separate from the main ``cost-centre'' profiling system, intended for
all users everywhere.
To be able to use ticky-ticky profiling, you will need to have built
appropriate libraries and things when you made the system. See
``Customising what libraries to build,'' in the installation guide.
To get your compiled program to spit out the ticky-ticky numbers, use
the -r RTS option. Compiling your program with the -ticky switch
yields an executable that performs these counts. Here is a sample
ticky-ticky statistics file, generated by the invocation
foo +RTS -rfoo.ticky:
ALLOCATIONS: 3964631 (11330900 words total: 3999476 admin, 6098829 goods, 1232595 slop)
total words: 2 3 4 5 6+
69647 ( 1.8%) function values 50.0 50.0 0.0 0.0 0.0
2382937 ( 60.1%) thunks 0.0 83.9 16.1 0.0 0.0
1477218 ( 37.3%) data values 66.8 33.2 0.0 0.0 0.0
0 ( 0.0%) big tuples
2 ( 0.0%) black holes 0.0 100.0 0.0 0.0 0.0
0 ( 0.0%) prim things
34825 ( 0.9%) partial applications 0.0 0.0 0.0 100.0 0.0
2 ( 0.0%) thread state objects 0.0 0.0 0.0 0.0 100.0
Total storage-manager allocations: 3647137 (11882004 words)
[551104 words lost to speculative heap-checks]
STACK USAGE:
ENTERS: 9400092 of which 2005772 (21.3%) direct to the entry code
[the rest indirected via Node's info ptr]
1860318 ( 19.8%) thunks
3733184 ( 39.7%) data values
3149544 ( 33.5%) function values
[of which 1999880 (63.5%) bypassed arg-satisfaction chk]
348140 ( 3.7%) partial applications
308906 ( 3.3%) normal indirections
0 ( 0.0%) permanent indirections
RETURNS: 5870443
2137257 ( 36.4%) from entering a new constructor
[the rest from entering an existing constructor]
2349219 ( 40.0%) vectored [the rest unvectored]
RET_NEW: 2137257: 32.5% 46.2% 21.3% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
RET_OLD: 3733184: 2.8% 67.9% 29.3% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
RET_UNBOXED_TUP: 2: 0.0% 0.0%100.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
RET_VEC_RETURN : 2349219: 0.0% 0.0%100.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
UPDATE FRAMES: 2241725 (0 omitted from thunks)
SEQ FRAMES: 1
CATCH FRAMES: 1
UPDATES: 2241725
0 ( 0.0%) data values
34827 ( 1.6%) partial applications
[2 in place, 34825 allocated new space]
2206898 ( 98.4%) updates to existing heap objects (46 by squeezing)
UPD_CON_IN_NEW: 0: 0 0 0 0 0 0 0 0 0
UPD_PAP_IN_NEW: 34825: 0 0 0 34825 0 0 0 0 0
NEW GEN UPDATES: 2274700 ( 99.9%)
OLD GEN UPDATES: 1852 ( 0.1%)
Total bytes copied during GC: 190096
**************************************************
3647137 ALLOC_HEAP_ctr
11882004 ALLOC_HEAP_tot
69647 ALLOC_FUN_ctr
69647 ALLOC_FUN_adm
69644 ALLOC_FUN_gds
34819 ALLOC_FUN_slp
34831 ALLOC_FUN_hst_0
34816 ALLOC_FUN_hst_1
0 ALLOC_FUN_hst_2
0 ALLOC_FUN_hst_3
0 ALLOC_FUN_hst_4
2382937 ALLOC_UP_THK_ctr
0 ALLOC_SE_THK_ctr
308906 ENT_IND_ctr
0 E!NT_PERM_IND_ctr requires +RTS -Z
[... lots more info omitted ...]
0 GC_SEL_ABANDONED_ctr
0 GC_SEL_MINOR_ctr
0 GC_SEL_MAJOR_ctr
0 GC_FAILED_PROMOTION_ctr
47524 GC_WORDS_COPIED_ctr
The formatting of the information above the row of asterisks is
subject to change, but hopefully provides a useful human-readable
summary. Below the asterisks all counters maintained by the
ticky-ticky system are dumped, in a format intended to be
machine-readable: zero or more spaces, an integer, a space, the
counter name, and a newline.
In fact, not all counters are necessarily dumped; compile- or
run-time flags can render certain counters invalid. In this case,
either the counter will simply not appear, or it will appear with a
modified counter name, possibly along with an explanation for the
omission (notice ENT_PERM_IND_ctr appears with an inserted !
above). Software analysing this output should always check that it
has the counters it expects. Also, beware: some of the counters can
have large values!