Profiling

Glasgow Haskell comes with a time and space profiling system. Its purpose is to help you improve your understanding of your program's execution behaviour, so you can improve it. Any comments, suggestions and/or improvements you have are welcome. Recommended "profiling tricks" would be especially cool!

Profiling a program is a three-step process:

1. Re-compile your program for profiling with the -prof option, and probably one of the -auto or -auto-all options. These options are described in more detail below, under "Compiler options for profiling".

2. Run your program with one of the profiling RTS options, such as -p or -h. This generates a file of profiling information.

3. Examine the generated profiling information, using one of GHC's profiling tools. The tool to use will depend on the kind of profiling information generated.

Cost centres and cost-centre stacks

GHC's profiling system assigns costs to cost centres. A cost is simply the time or space required to evaluate an expression. Cost centres are program annotations around expressions; all costs incurred by the annotated expression are assigned to the enclosing cost centre. Furthermore, GHC will remember the stack of enclosing cost centres for any given expression at run-time, and generate a call-graph of cost attributions.

Let's take a look at an example:

    main = print (nfib 25)
    nfib n = if n < 2 then 1 else nfib (n-1) + nfib (n-2)

Compile and run this program as follows:

    $ ghc -prof -auto-all -o Main Main.hs
    $ ./Main +RTS -p
    121393
    $

When a GHC-compiled program is run with the -p RTS option, it generates a file called <prog>.prof. In this case, the file will contain something like this:

    Tue Apr 18 12:52 2000 Time and Allocation Profiling Report  (Final)

           Main +RTS -p -RTS

        total time  =        0.14 secs   (7 ticks @ 20 ms)
        total alloc =   8,741,204 bytes  (excludes profiling overheads)

    COST CENTRE  MODULE     %time %alloc

    nfib         Main       100.0  100.0

    COST CENTRE  MODULE          scc  %time %alloc   inner   cafs
    MAIN         MAIN              0    0.0    0.0       0      1
     main        Main              0    0.0    0.0       0      1
     CAF         PrelHandle        3    0.0    0.0       0      3
     CAF         PrelAddr          1    0.0    0.0       0      0
     CAF         Main              6    0.0    0.0       1      0
      main       Main              1    0.0    0.0       1      1
       nfib      Main         242785  100.0  100.0  242784      4

The first part of the file gives the program name and options, and the total time and total memory allocation measured during the run of the program. (Note that the total memory allocation figure isn't the same as the amount of live memory needed by the program at any one time; the latter can be determined using heap profiling, which we will describe shortly.)

The second part of the file is a break-down by cost centre of the most costly functions in the program. In this case, there was only one significant function in the program, namely nfib, and it was responsible for 100% of both the time and allocation costs of the program.

The third and final section of the file gives a profile break-down by cost-centre stack. This is roughly a call-graph profile of the program. In the example above, it is clear that the costly call to nfib came from main.
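As a quick sanity check on these numbers (an observation of ours, not part of the report): each call to nfib either returns 1 immediately or makes exactly two further calls, so the call tree of nfib 25 is a binary tree whose 121393 leaves each contribute 1 to the result, and which therefore has 121392 internal nodes, for 2 * 121393 - 1 = 242785 calls in total. That is precisely the scc count reported against nfib, and the inner count of 242784 accounts for every one of those calls except the initial one made from main.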
The usefulness of cost-centre stacks is better demonstrated by modifying the example slightly:

    main = print (f 25 + g 25)
    f n = nfib n
    g n = nfib (n `div` 2)
    nfib n = if n < 2 then 1 else nfib (n-1) + nfib (n-2)

Compile and run this program as before, and take a look at the new profiling results:

    COST CENTRE  MODULE          scc  %time %alloc   inner   cafs
    MAIN         MAIN              0    0.0    0.0       0      1
     main        Main              0    0.0    0.0       0      1
     CAF         PrelHandle        3    0.0    0.0       0      3
     CAF         PrelAddr          1    0.0    0.0       0      0
     CAF         Main              9    0.0    0.0       1      1
      main       Main              1    0.0    0.0       2      2
       g         Main              1    0.0    0.0       1      3
        nfib     Main            465    0.0    0.2     464      0
       f         Main              1    0.0    0.0       1      1
        nfib     Main         242785  100.0   99.8  242784      1

Now although we had two calls to nfib in the program, it is immediately clear that it was the call from f which took all the time.

The actual meaning of the various columns in the output is:

scc: The number of times this particular point in the call graph was entered.

%time: The percentage of the total run time of the program spent at this point in the call graph.

%alloc: The percentage of the total memory allocations (excluding profiling overheads) of the program made by this call.

inner: The number of times an inner call-graph context was entered from here (including recursive calls).

cafs: The number of times a CAF context was entered from here. CAFs are described below.

In addition you can use the -P RTS option to get the following additional information:

ticks: The raw number of time "ticks" which were attributed to this cost-centre; from this, we get the %time figure mentioned above.

bytes: Number of bytes allocated in the heap while in this cost-centre; again, this is the raw number from which we get the %alloc figure mentioned above.

What about recursive functions, and mutually recursive groups of functions? Where are the costs attributed? Well, although GHC does keep information about which groups of functions called each other recursively, this information isn't displayed in the basic time and allocation profile; instead, the call-graph is flattened into a tree. The XML profiling tool (described below) will be able to display real loops in the call-graph.

Inserting cost centres by hand

Cost centres are just program annotations. When you say -auto-all to the compiler, it automatically inserts a cost centre annotation around every top-level function in your program, but you are entirely free to add cost centre annotations yourself. The syntax of a cost centre annotation is

    _scc_ "name" <expression>

where "name" is an arbitrary string that will become the name of your cost centre as it appears in the profiling output, and <expression> is any Haskell expression. An _scc_ annotation extends as far to the right as possible when parsing.

Rules for attributing costs

The cost of evaluating any expression in your program is attributed to a cost-centre stack using the following rules:

If the expression is part of the one-off costs of evaluating the enclosing top-level definition, then costs are attributed to the stack of lexically enclosing _scc_ annotations on top of the special CAF cost-centre.

Otherwise, costs are attributed to the stack of lexically enclosing _scc_ annotations, appended to the cost-centre stack in effect at the call site of the current top-level definition (the call site is just the place in the source code which mentions the particular function or variable). Notice that this is a recursive definition.
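To see these rules in action, here is a small hand-annotated example of our own (the cost-centre names are arbitrary):

    main :: IO ()
    main = print (_scc_ "main_work" (f 20))

    f :: Int -> Int
    f x = _scc_ "f_work" (x * x + x)

The costs of evaluating x * x + x get the lexically enclosing annotation "f_work", appended to the cost-centre stack in effect at f's call site in main; applying the same rule to that call site in turn gives MAIN with "main_work" on top, so the full stack is MAIN, "main_work", "f_work". That unwinding of call sites is exactly the recursion the definition above refers to.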
What do we mean by one-off costs? Well, Haskell is a lazy language, and certain expressions are only ever evaluated once. For example, if we write:

    x = nfib 25

then x will only be evaluated once (if at all), and subsequent demands for x will immediately get to see the cached result. The definition x is called a CAF (Constant Applicative Form), because it has no arguments. For the purposes of profiling, we say that the expression nfib 25 belongs to the one-off costs of evaluating x.

Since one-off costs aren't strictly speaking part of the call-graph of the program, they are attributed to a special top-level cost centre, CAF. There may be one CAF cost centre for each module (the default), or one for each top-level definition with any one-off costs (this behaviour can be selected by giving GHC the -caf-all flag).

If you think you have a weird profile, or the call-graph doesn't look like you expect it to, feel free to send it (and your program) to us at glasgow-haskell-bugs@haskell.org.

Profiling memory usage

In addition to profiling the time and allocation behaviour of your program, you can also generate a graph of its memory usage over time. This is useful for detecting the causes of space leaks, when your program holds on to more memory at run-time than it needs. Space leaks lead to longer run-times due to heavy garbage collector activity, and may even cause the program to run out of memory altogether.

To generate a heap profile from your program, compile it as before, but this time run it with the -h runtime option. This generates a file <prog>.hp, which you then process with hp2ps to produce a PostScript file <prog>.ps. The PostScript file can be viewed with something like ghostview, or printed out on a PostScript-compatible printer. The RTS options that control the kind of heap profile generated, and the usage of the hp2ps program, are described below.

Graphical time/allocation profile

You can view the time and allocation profiling graph of your program graphically, using ghcprof. This is a new tool with GHC 4.07, and will eventually be the de-facto standard way of viewing GHC profiles.

To run ghcprof, you need daVinci installed, which can be obtained from The Graph Visualisation Tool daVinci. Install one of the binary distributions (daVinci is sadly not open-source :-(), and set your DAVINCIHOME environment variable to point to the installation directory.

ghcprof uses an XML-based profiling log format, and you therefore need to run your program with a different RTS option: -px. The file generated is still called <prog>.prof. To see the profile, run ghcprof like this:

    $ ghcprof <prog>.prof

which should pop up a window showing the call-graph of your program in glorious detail. More information on using ghcprof can be found at The Cost-Centre Stack Profiling Tool for GHC.

Compiler options for profiling

To make use of the cost centre profiling system all modules must be compiled and linked with the -prof option. Any _scc_ constructs you've put in your source will spring to life. Without a -prof option, your _scc_s are ignored; so you can compile _scc_-laden code without changing it.

There are a few other profiling-related compilation options. Use them in addition to -prof. These do not have to be used consistently for all modules in a program.

-auto: GHC will automatically add _scc_ constructs for all top-level, exported functions.

-auto-all: All top-level functions, exported or not, will be automatically _scc_'d.

-caf-all: The costs of all CAFs in a module are usually attributed to one "big" CAF cost-centre. With this option, all CAFs get their own cost-centre (see the example after this list). An "if all else fails" option…

-ignore-scc: Ignore any _scc_ constructs, so a module which already has _scc_s can be compiled for profiling with the annotations ignored.
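To illustrate -caf-all, consider this sketch of our own of a module with two one-off computations (the definitions are arbitrary):

    module Main where

    x :: Integer
    x = product [1..10000]    -- a CAF: no arguments, evaluated at most once

    y :: Integer
    y = sum [1..1000000]      -- another CAF

    main :: IO ()
    main = print (x + y)

Compiled with plain -prof, the one-off costs of both x and y are lumped into the single CAF cost-centre for module Main (the "CAF Main" row in the earlier profiles); with -caf-all, x and y would each get a cost-centre of their own, making it easy to see which of the two is responsible for the work.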
Runtime options for profiling

It isn't enough to compile your program for profiling with -prof! When you run your profiled program, you must tell the runtime system (RTS) what you want to profile (e.g., time and/or space), and how you wish the collected data to be reported. You also may wish to set the sampling interval used in time profiling.

Executive summary: ./a.out +RTS -pT produces a time profile in a.out.prof; ./a.out +RTS -hC produces space-profiling info which can be mangled by hp2ps and viewed with ghostview (or equivalent).

Profiling runtime flags are passed to your program between the usual +RTS and -RTS options.

-p or -P: The -p option produces a standard time profile report. It is written into the file <program>.prof. The -P option produces a more detailed report containing the actual time and allocation data as well. (Not used much.)

-px: The -px option generates profiling information in the XML format understood by our new profiling tool; see "Graphical time/allocation profile" above.

-i<secs>: Set the profiling (sampling) interval to <secs> seconds (the default is 1 second). Fractions are allowed: for example -i0.2 will get 5 samples per second. This only affects heap profiling; time profiles are always sampled at a 1/50 second frequency.

-h<break-down>: Produce a detailed heap profile of the heap occupied by live closures. The profile is written to the file <program>.hp, from which a PostScript graph can be produced using hp2ps (see below). The heap space profile may be broken down by different criteria:

  -hC: cost centre which produced the closure (the default).
  -hM: cost centre module which produced the closure.
  -hD: closure description (a string describing the closure).
  -hY: closure type (a string describing the closure's type).

-hx: The -hx option generates heap profiling information in the XML format understood by our new profiling tool (NOTE: heap profiling with the new tool is not yet working! Use hp2ps-style heap profiling for the time being).

hp2ps: heap profile to PostScript

Usage: hp2ps [flags] [<file>[.hp]]

The program hp2ps converts a heap profile as produced by the -h<break-down> runtime option into a PostScript graph of the heap profile. By convention, the file to be processed by hp2ps has a .hp extension. The PostScript output is written to <file>.ps. If <file> is omitted entirely, then the program behaves as a filter.

hp2ps is distributed in ghc/utils/hp2ps in a GHC source distribution. It was originally developed by Dave Wakeling as part of the HBC/LML heap profiler.
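Putting the pieces together, a typical heap-profiling session might look something like this (using the Main.hs example from earlier; -hC is the default break-down, and ghostview stands in for whatever PostScript viewer you have):

    $ ghc -prof -auto-all -o Main Main.hs
    $ ./Main +RTS -hC
    $ hp2ps Main.hp
    $ ghostview Main.ps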
The flags are:

-d: In order to make graphs more readable, hp2ps sorts the shaded bands for each identifier. The default sort ordering is for the bands with the largest area to be stacked on top of the smaller ones. The -d option causes rougher bands (those representing series of values with the largest standard deviations) to be stacked on top of smoother ones.

-b: Normally, hp2ps puts the title of the graph in a small box at the top of the page. However, if the JOB string is too long to fit in a small box (more than 35 characters), then hp2ps will choose to use a big box instead. The -b option forces hp2ps to use a big box.

-e<float>: Generate encapsulated PostScript suitable for inclusion in LaTeX documents. Usually, the PostScript graph is drawn in landscape mode in an area 9 inches wide by 6 inches high, and hp2ps arranges for this area to be approximately centred on a sheet of A4 paper. This format is convenient for studying the graph in detail, but it is unsuitable for inclusion in LaTeX documents. The -e option causes the graph to be drawn in portrait mode, with float specifying the width in inches, millimetres or points (the default). The resulting PostScript file conforms to the Encapsulated PostScript (EPS) convention, and it can be included in a LaTeX document using Rokicki's dvi-to-PostScript converter dvips.

-g: Create output suitable for the gs PostScript previewer (or similar). In this case the graph is printed in portrait mode without scaling. The output is unsuitable for a laser printer.

-l: Normally a profile is limited to 20 bands, with additional identifiers being grouped into an OTHER band. The -l flag removes this 20 band limit, producing as many bands as necessary. No key is produced, as it won't fit! It is useful for displaying creation time profiles with many bands.

-m<int>: Normally a profile is limited to 20 bands, with additional identifiers being grouped into an OTHER band. The -m flag specifies an alternative band limit (the maximum is 20). -m0 requests the band limit to be removed: as many bands as necessary are produced. However, no key is produced, as it won't fit! It is useful for displaying creation time profiles with many bands.

-p: Use previous parameters. By default, the PostScript graph is automatically scaled both horizontally and vertically so that it fills the page. However, when preparing a series of graphs for use in a presentation, it is often useful to draw a new graph using the same scale, shading and ordering as a previous one. The -p flag causes the graph to be drawn using the parameters determined by a previous run of hp2ps on <file>. These are extracted from <file>.aux.

-s: Use a small box for the title.

-t<float>: Normally trace elements which sum to a total of less than 1% of the profile are removed from the profile. The -t option allows this percentage to be modified (maximum 5%). -t0 requests no trace elements to be removed from the profile, ensuring that all the data will be displayed.

-c: Generate colour output.

-y: Ignore marks.

-?: Print out usage information.

Using "ticky-ticky" profiling (for implementors)

(ToDo: document properly.)

It is possible to compile Glasgow Haskell programs so that they will count lots and lots of interesting things, e.g., number of updates, number of data constructors entered, etc., etc. We call this "ticky-ticky" profiling, because that's the sound a Sun4 makes when it is running up all those counters (slowly).

Ticky-ticky profiling is mainly intended for implementors; it is quite separate from the main "cost-centre" profiling system, intended for all users everywhere.

To be able to use ticky-ticky profiling, you will need to have built appropriate libraries and things when you made the system. See "Customising what libraries to build," in the installation guide.

To get your compiled program to spit out the ticky-ticky numbers, use the -r RTS option. Compiling your program with the -ticky switch yields an executable that performs these counts.
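Concretely, the compile step might look like this (Foo.hs standing in for whatever program you are measuring; the -ticky flag is the switch just mentioned):

    $ ghc -ticky -o foo Foo.hs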
Here is a sample ticky-ticky statistics file, generated by the invocation foo +RTS -rfoo.ticky:

    foo +RTS -rfoo.ticky

    ALLOCATIONS: 3964631 (11330900 words total: 3999476 admin, 6098829 goods, 1232595 slop)
                                    total words:      2      3      4      5     6+
      69647 (  1.8%) function values                50.0   50.0    0.0    0.0    0.0
    2382937 ( 60.1%) thunks                          0.0   83.9   16.1    0.0    0.0
    1477218 ( 37.3%) data values                    66.8   33.2    0.0    0.0    0.0
          0 (  0.0%) big tuples
          2 (  0.0%) black holes                     0.0  100.0    0.0    0.0    0.0
          0 (  0.0%) prim things
      34825 (  0.9%) partial applications            0.0    0.0    0.0  100.0    0.0
          2 (  0.0%) thread state objects            0.0    0.0    0.0    0.0  100.0

    Total storage-manager allocations: 3647137 (11882004 words)
            [551104 words lost to speculative heap-checks]

    STACK USAGE:

    ENTERS: 9400092  of which 2005772 (21.3%) direct to the entry code
                      [the rest indirected via Node's info ptr]
    1860318 ( 19.8%) thunks
    3733184 ( 39.7%) data values
    3149544 ( 33.5%) function values
                      [of which 1999880 (63.5%) bypassed arg-satisfaction chk]
     348140 (  3.7%) partial applications
     308906 (  3.3%) normal indirections
          0 (  0.0%) permanent indirections

    RETURNS: 5870443
    2137257 ( 36.4%) from entering a new constructor
                      [the rest from entering an existing constructor]
    2349219 ( 40.0%) vectored [the rest unvectored]

    RET_NEW:         2137257:  32.5%  46.2%  21.3%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%
    RET_OLD:         3733184:   2.8%  67.9%  29.3%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%
    RET_UNBOXED_TUP:       2:   0.0%   0.0% 100.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%
    RET_VEC_RETURN : 2349219:   0.0%   0.0% 100.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%

    UPDATE FRAMES: 2241725 (0 omitted from thunks)
    SEQ FRAMES:    1
    CATCH FRAMES:  1
    UPDATES: 2241725
          0 (  0.0%) data values
      34827 (  1.6%) partial applications
                      [2 in place, 34825 allocated new space]
    2206898 ( 98.4%) updates to existing heap objects (46 by squeezing)
    UPD_CON_IN_NEW:      0:      0      0      0      0      0      0      0      0      0
    UPD_PAP_IN_NEW:  34825:      0      0      0  34825      0      0      0      0      0

    NEW GEN UPDATES: 2274700 ( 99.9%)
    OLD GEN UPDATES:    1852 (  0.1%)

    Total bytes copied during GC: 190096

    **************************************************
    3647137 ALLOC_HEAP_ctr
    11882004 ALLOC_HEAP_tot
    69647 ALLOC_FUN_ctr
    69647 ALLOC_FUN_adm
    69644 ALLOC_FUN_gds
    34819 ALLOC_FUN_slp
    34831 ALLOC_FUN_hst_0
    34816 ALLOC_FUN_hst_1
    0 ALLOC_FUN_hst_2
    0 ALLOC_FUN_hst_3
    0 ALLOC_FUN_hst_4
    2382937 ALLOC_UP_THK_ctr
    0 ALLOC_SE_THK_ctr
    308906 ENT_IND_ctr
    0 E!NT_PERM_IND_ctr requires +RTS -Z
    [... lots more info omitted ...]
    0 GC_SEL_ABANDONED_ctr
    0 GC_SEL_MINOR_ctr
    0 GC_SEL_MAJOR_ctr
    0 GC_FAILED_PROMOTION_ctr
    47524 GC_WORDS_COPIED_ctr

The formatting of the information above the row of asterisks is subject to change, but hopefully provides a useful human-readable summary. Below the asterisks all counters maintained by the ticky-ticky system are dumped, in a format intended to be machine-readable: zero or more spaces, an integer, a space, the counter name, and a newline.

In fact, not all counters are necessarily dumped; compile- or run-time flags can render certain counters invalid. In this case, either the counter will simply not appear, or it will appear with a modified counter name, possibly along with an explanation for the omission (notice ENT_PERM_IND_ctr appears with an inserted ! above). Software analysing this output should always check that it has the counters it expects. Also, beware: some of the counters can have large values!
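Since the counter dump is meant to be machine-readable, such checking software is easy to write. Here is a minimal sketch in Haskell (entirely our own; the module and function names are invented for illustration). It skips the human-readable summary above the row of asterisks, and keeps one (name, value) pair per counter line, using Integer because, as just noted, some counters can have large values:

    module TickyParse (parseTicky) where

    import Data.Char (isDigit, isSpace)

    -- One counter line: zero or more spaces, an integer, a space, the name.
    type Counter = (String, Integer)

    parseTicky :: String -> [Counter]
    parseTicky input =
        [ c | Just c <- map parseLine counterLines ]
      where
        -- Everything up to and including the row of asterisks is the
        -- human-readable summary; the machine-readable dump follows it.
        counterLines  = drop 1 (dropWhile (not . isAsterisks) (lines input))
        isAsterisks l = not (null l) && all (== '*') l

    parseLine :: String -> Maybe Counter
    parseLine l =
        case span isDigit (dropWhile isSpace l) of
          (digits@(_:_), ' ' : rest) ->
              -- An invalidated counter may carry an explanation after its
              -- name (e.g. "requires +RTS -Z"); keep just the first word.
              Just (takeWhile (not . isSpace) rest, read digits)
          _ -> Nothing    -- blank or otherwise malformed line

A caller can then verify that the counters it expects are present, e.g. lookup "ALLOC_HEAP_tot" (parseTicky contents).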