From: Simon Marlow Date: Fri, 25 Aug 2006 15:12:36 +0000 (+0000) Subject: Document SMP support X-Git-Tag: Before_FC_branch_merge~118 X-Git-Url: http://git.megacz.com/?p=ghc-hetmet.git;a=commitdiff_plain;h=c5a97ea01a810333608ef1e26f5cb5422dd25928 Document SMP support --- diff --git a/docs/users_guide/flags.xml b/docs/users_guide/flags.xml index 1e09b1e..f69fc45 100644 --- a/docs/users_guide/flags.xml +++ b/docs/users_guide/flags.xml @@ -1093,45 +1093,6 @@ - Parallelism options - - - - - - - - Flag - Description - Static/Dynamic - Reverse - - - - - - Enable GRANSIM - static - - - - - - Enable Parallel Haskell - static - - - - - - Enable SMP support - static - - - - - - - - - C pre-processor options diff --git a/docs/users_guide/parallel.xml b/docs/users_guide/parallel.xml index 11c2547..1f29d2c 100644 --- a/docs/users_guide/parallel.xml +++ b/docs/users_guide/parallel.xml @@ -1,204 +1,100 @@ - -Concurrent and Parallel Haskell - - -Concurrent Haskell -Parallel Haskell -Concurrent and Parallel Haskell are Glasgow extensions to Haskell -which let you structure your program as a group of independent -`threads'. - - - -Concurrent and Parallel Haskell have very different purposes. - - - -Concurrent Haskell is for applications which have an inherent -structure of interacting, concurrent tasks (i.e. `threads'). Threads -in such programs may be required. For example, if a concurrent thread has been spawned to handle a mouse click, it isn't -optional—the user wants something done! - - - -A Concurrent Haskell program implies multiple `threads' running within -a single Unix process on a single processor. - - - -You will find at least one paper about Concurrent Haskell hanging off -of Simon Peyton -Jones's Web page. - - - -Parallel Haskell is about speed—spawning -threads onto multiple processors so that your program will run faster. -The `threads' are always advisory—if the -runtime system thinks it can get the job done more quickly by -sequential execution, then fine. 
- - - -A Parallel Haskell program implies multiple processes running on -multiple processors, under a PVM (Parallel Virtual Machine) framework. -An MPI interface is under development, but is not yet fully functional. - - - -Parallel Haskell is still relatively new; it is more about “research -fun” than about “speed.” That will change. - - - -Check the GPH Page -for more information on “GPH” (Haskell98 with extensions for -parallel execution), the latest version of “GUM” (the runtime -system to enable parallel executions) and papers on research issues. A -list of publications about GPH and about GUM is also available from Simon's -Web Page. - - - -Some details about Parallel Haskell follow. For more information -about concurrent Haskell, see the module -Control.Concurrent in the library documentation. - - - -Features specific to Parallel Haskell -<indexterm><primary>Parallel Haskell—features</primary></indexterm> - - -The <literal>Parallel</literal> interface (recommended) -<indexterm><primary>Parallel interface</primary></indexterm> - - -GHC provides two functions for controlling parallel execution, through -the Parallel interface: - - - + + Parallel Haskell + parallelism + + + There are two implementations of Parallel Haskell: SMP parallelism + SMP + which is built into GHC (see ) and + supports running Parallel Haskell programs on a single multiprocessor + machine, and + Glasgow Parallel Haskell + (GPH), which supports running Parallel Haskell + programs both on clusters of machines and on single multiprocessors. GPH is + developed and distributed + separately from GHC (see The + GPH Page). + + Ordinary single-threaded Haskell programs will not benefit from + enabling SMP parallelism alone. You must expose parallelism to the + compiler in one of the following two ways. 
+ + + Running Concurrent Haskell programs in parallel + + The first possibility is to use concurrent threads to structure your + program, and make sure + that you spread computation amongst the threads. The runtime will + schedule the running Haskell threads among the available OS + threads, running as many in parallel as you specified with the + RTS option. + + + + Annotating pure code for parallelism + + The simplest mechanism for extracting parallelism from pure code is + to use the par combinator, which is closely related to (and often used + with) seq. Both of these are available from Control.Parallel: -interface Parallel where infixr 0 `par` infixr 1 `seq` par :: a -> b -> b -seq :: a -> b -> b - - - +seq :: a -> b -> b - -The expression (x `par` y) sparks the evaluation of x -(to weak head normal form) and returns y. Sparks are queued for -execution in FIFO order, but are not executed immediately. At the -next heap allocation, the currently executing thread will yield -control to the scheduler, and the scheduler will start a new thread -(until reaching the active thread limit) for each spark which has not -already been evaluated to WHNF. - + The expression (x `par` y) + sparks the evaluation of x + (to weak head normal form) and returns y. Sparks are + queued for execution in FIFO order, but are not executed immediately. If + the runtime detects that there is an idle CPU, then it may convert a + spark into a real thread, and run the new thread on the idle CPU. In + this way the available parallelism is spread amongst the real + CPUs. - -The expression (x `seq` y) evaluates x to weak head normal -form and then returns y. The seq primitive can be used to -force evaluation of an expression beyond WHNF, or to impose a desired -execution sequence for the evaluation of an expression. 
- - - -For example, consider the following parallel version of our old -nemesis, nfib: - - - + For example, consider the following parallel version of our old + nemesis, nfib: -import Parallel +import Control.Parallel nfib :: Int -> Int nfib n | n <= 1 = 1 | otherwise = par n1 (seq n2 (n1 + n2 + 1)) where n1 = nfib (n-1) - n2 = nfib (n-2) - - - - - -For values of n greater than 1, we use par to spark a thread -to evaluate nfib (n-1), and then we use seq to force the -parent thread to evaluate nfib (n-2) before going on to add -together these two subexpressions. In this divide-and-conquer -approach, we only spark a new thread for one branch of the computation -(leaving the parent to evaluate the other branch). Also, we must use -seq to ensure that the parent will evaluate n2 before -n1 in the expression (n1 + n2 + 1). It is not sufficient to -reorder the expression as (n2 + n1 + 1), because the compiler may -not generate code to evaluate the addends from left to right. - - - - - -Underlying functions and primitives -<indexterm><primary>parallelism primitives</primary></indexterm> -<indexterm><primary>primitives for parallelism</primary></indexterm> - - -The functions par and seq are wired into GHC, and unfold -into uses of the par# and seq# primitives, respectively. If -you'd like to see this with your very own eyes, just run GHC with the - option. (Anything for a good time…) - - - - - -Scheduling policy for concurrent threads -<indexterm><primary>Scheduling—concurrent</primary></indexterm> -<indexterm><primary>Concurrent scheduling</primary></indexterm> - - -Runnable threads are scheduled in round-robin fashion. Context -switches are signalled by the generation of new sparks or by the -expiry of a virtual timer (the timer interval is configurable with the --C<num> RTS option (concurrent, -parallel) RTS option). However, a context switch doesn't -really happen until the current heap block is full. You can't get any -faster context switching than this. 
- - - -When a context switch occurs, pending sparks which have not already -been reduced to weak head normal form are turned into new threads. -However, there is a limit to the number of active threads (runnable or -blocked) which are allowed at any given time. This limit can be -adjusted with the -t <num> RTS option (concurrent, parallel) -RTS option (the default is 32). Once the -thread limit is reached, any remaining sparks are deferred until some -of the currently active threads are completed. - - - - - -Scheduling policy for parallel threads -<indexterm><primary>Scheduling—parallel</primary></indexterm> -<indexterm><primary>Parallel scheduling</primary></indexterm> - - -In GUM we use an unfair scheduler, which means that a thread continues to -perform graph reduction until it blocks on a closure under evaluation, on a -remote closure or until the thread finishes. - - - - - + n2 = nfib (n-2) + + For values of n greater than 1, we use + par to spark a thread to evaluate nfib (n-1), + and then we use seq to force the + parent thread to evaluate nfib (n-2) before going on + to add together these two subexpressions. In this divide-and-conquer + approach, we only spark a new thread for one branch of the computation + (leaving the parent to evaluate the other branch). Also, we must use + seq to ensure that the parent will evaluate + n2 before n1 + in the expression (n1 + n2 + 1). It is not sufficient + to reorder the expression as (n2 + n1 + 1), because + the compiler may not generate code to evaluate the addends from left to + right. + + When using par, the general rule of thumb is that + the sparked computation should be required at a later time, but not too + soon. Also, the sparked computation should not be too small, otherwise + the cost of forking it in parallel will be too large relative to the + amount of parallelism gained. Getting these factors right is tricky in + practice. 
+ + More sophisticated combinators for expressing parallelism are + available from the Control.Parallel.Strategies module. + This module builds functionality around par, + expressing more elaborate patterns of parallel computation, such as + parallel map. + diff --git a/docs/users_guide/phases.xml b/docs/users_guide/phases.xml index fd034a3..e5bac79 100644 --- a/docs/users_guide/phases.xml +++ b/docs/users_guide/phases.xml @@ -839,26 +839,43 @@ $ cat foo.hspp - Link the program with the "threaded" runtime system. - This version of the runtime is designed to be used in - programs that use multiple operating-system threads. It - supports calls to foreign-exported functions from multiple - OS threads. Calls to foreign functions are made using the - same OS thread that created the Haskell thread (if it was - created by a call-in), or an arbitrary OS thread otherwise - (if the Haskell thread was created by + Link the program with the "threaded" version of the + runtime system. The threaded runtime system is so called + because it manages multiple OS threads, as opposed to the + default runtime system, which is purely + single-threaded. + + Note that you do not need + in order to use concurrency; the + single-threaded runtime supports concurrency between Haskell + threads just fine. + + The threaded runtime system provides the following + benefits: + + + + Parallelism on a + multiprocessor or multicore + machine. See . + + The ability to make a foreign call that does not + block all other Haskell threads. + + The ability to invoke foreign exported Haskell + functions from multiple OS threads. + + + + With , calls to foreign + functions are made using the same OS thread that created the + Haskell thread (if it was created by a call to a foreign + exported Haskell function), or an arbitrary OS thread + otherwise (if the Haskell thread was created by forkIO). 
More details on the use of "bound threads" in the threaded runtime can be found in the Control.Concurrent module. - - The threaded RTS does not - support using multiple CPUs to speed up execution of a - multi-threaded Haskell program. The GHC runtime platform - is still single-threaded, but using the - option it can be used safely in - a multi-threaded environment. diff --git a/docs/users_guide/runtime_control.xml b/docs/users_guide/runtime_control.xml index 995e263..6a3a9e3 100644 --- a/docs/users_guide/runtime_control.xml +++ b/docs/users_guide/runtime_control.xml @@ -398,11 +398,12 @@ - RTS options for profiling and Concurrent/Parallel Haskell + RTS options for profiling and parallelism The RTS options related to profiling are described in ; and those for concurrent/parallel - stuff, in . + linkend="rts-options-heap-prof"/>, those for concurrency in + , and those for parallelism in + . diff --git a/docs/users_guide/using.xml b/docs/users_guide/using.xml index b274f62..2868876 100644 --- a/docs/users_guide/using.xml +++ b/docs/users_guide/using.xml @@ -1533,353 +1533,86 @@ f "2" = 2 - -Using parallel Haskell - - -Parallel Haskellusing -[NOTE: GHC does not support Parallel Haskell by default, you need to - obtain a special version of GHC from the GPH site. Also, -you won't be able to execute parallel Haskell programs unless PVM3 -(parallel Virtual Machine, version 3) is installed at your site.] - - - -To compile a Haskell program for parallel execution under PVM, use the - option,-parallel -option both when compiling and -linking. You will probably want to import -Control.Parallel into your Haskell modules. - - - -To run your parallel program, once PVM is going, just invoke it -“as normal”. The main extra RTS option is -, to say how many PVM -“processors” your program to run on. (For more details of -all relevant RTS options, please see .) 
- - - -In truth, running parallel Haskell programs and getting information -out of them (e.g., parallelism profiles) is a battle with the vagaries of -PVM, detailed in the following sections. - - - -Dummy's guide to using PVM - - -PVM, how to use -parallel Haskell—PVM use -Before you can run a parallel program under PVM, you must set the -required environment variables (PVM's idea, not ours); something like, -probably in your .cshrc or equivalent: - - -setenv PVM_ROOT /wherever/you/put/it -setenv PVM_ARCH `$PVM_ROOT/lib/pvmgetarch` -setenv PVM_DPATH $PVM_ROOT/lib/pvmd - - - - - -Creating and/or controlling your “parallel machine” is a purely-PVM -business; nothing specific to parallel Haskell. The following paragraphs -describe how to configure your parallel machine interactively. - - - -If you use parallel Haskell regularly on the same machine configuration it -is a good idea to maintain a file with all machine names and to make the -environment variable PVM_HOST_FILE point to this file. Then you can avoid -the interactive operations described below by just saying - - - -pvm $PVM_HOST_FILE - - - -You use the pvmpvm command command to start PVM on your -machine. You can then do various things to control/monitor your -“parallel machine;” the most useful being: - - - - - - - - - -ControlD -exit pvm, leaving it running - - - -halt -kill off this “parallel machine” & exit - - - -add <host> -add <host> as a processor - - - -delete <host> -delete <host> - - - -reset -kill what's going, but leave PVM up - - - -conf -list the current configuration - - - -ps -report processes' status - - - -pstat <pid> -status of a particular process - - - - - - - - -The PVM documentation can tell you much, much more about pvm! - - - - - -parallelism profiles - - -parallelism profiles -profiles, parallelism -visualisation tools - - - -With parallel Haskell programs, we usually don't care about the -results—only with “how parallel” it was! We want pretty pictures. 
- - - -parallelism profiles (à la hbcpp) can be generated with the --qP RTS option RTS option. The -per-processor profiling info is dumped into files named -<full-path><program>.gr. These are then munged into a PostScript picture, -which you can then display. For example, to run your program -a.out on 8 processors, then view the parallelism profile, do: - - - - - -$ ./a.out +RTS -qP -qp8 -$ grs2gr *.???.gr > temp.gr # combine the 8 .gr files into one -$ gr2ps -O temp.gr # cvt to .ps; output in temp.ps -$ ghostview -seascape temp.ps # look at it! - + + Using SMP parallelism + parallelism + + SMP + - - - -The scripts for processing the parallelism profiles are distributed -in ghc/utils/parallel/. - - - - - -Other useful info about running parallel programs - - -The “garbage-collection statistics” RTS options can be useful for -seeing what parallel programs are doing. If you do either --Sstderr RTS option or , then -you'll get mutator, garbage-collection, etc., times on standard -error. The standard error of all PE's other than the `main thread' -appears in /tmp/pvml.nnn, courtesy of PVM. - - - -Whether doing or not, a handy way to watch -what's happening overall is: tail -f /tmp/pvml.nnn. - - - - - -RTS options for Parallel Haskell - - - -RTS options, parallel -parallel Haskell—RTS options - - - -Besides the usual runtime system (RTS) options -(), there are a few options particularly -for parallel execution. - - - - - - -: - - --qp<N> RTS option -(paraLLEL ONLY) Use <N> PVM processors to run this program; -the default is 2. - - - - -: - - --C<s> RTS option Sets -the context switch interval to <s> seconds. -A context switch will occur at the next heap block allocation after -the timer expires (a heap block allocation occurs every 4k of -allocation). With or , -context switches will occur as often as possible (at every heap block -allocation). By default, context switches occur every 20ms. 
Note that GHC's internal timer ticks every 20ms, and -the context switch timer is always a multiple of this timer, so 20ms -is the maximum granularity available for timed context switches. - - - - -: - - --q RTS option -(paraLLEL ONLY) Produce a quasi-parallel profile of thread activity, -in the file <program>.qp. In the style of hbcpp, this profile -records the movement of threads between the green (runnable) and red -(blocked) queues. If you specify the verbose suboption (), the -green queue is split into green (for the currently running thread -only) and amber (for other runnable threads). We do not recommend -that you use the verbose suboption if you are planning to use the -hbcpp profiling tools or if you are context switching at every heap -check (with ). ---> - - - - -: - - --qt<num> RTS option -(paraLLEL ONLY) Limit the thread pool size, i.e. the number of -threads per processor to <num>. The default is -32. Each thread requires slightly over 1K words in -the heap for thread state and stack objects. (For 32-bit machines, this -translates to 4K bytes, and for 64-bit machines, 8K bytes.) - - - - - -: - - --qe<num> RTS option -(parallel) (paraLLEL ONLY) Limit the spark pool size -i.e. the number of pending sparks per processor to -<num>. The default is 100. A larger number may be -appropriate if your program generates large amounts of parallelism -initially. - - - - -: - - --qQ<num> RTS option (parallel) -(paraLLEL ONLY) Set the size of packets transmitted between processors -to <num>. The default is 1024 words. A larger number may be -appropriate if your machine has a high communication cost relative to -computation speed. - - - - -: - - --qh<num> RTS option (parallel) -(paraLLEL ONLY) Select a packing scheme. Set the number of non-root thunks to pack in one packet to -<num>-1 (0 means infinity). By default GUM uses full-subgraph -packing, i.e. the entire subgraph with the requested closure as root is -transmitted (provided it fits into one packet). 
Choosing a smaller value -reduces the amount of pre-fetching of work done in GUM. This can be -advantageous for improving data locality, but it can also worsen the balance -of the load in the system. - - - - -: - - --qg<num> RTS option -(parallel) (paraLLEL ONLY) Select a globalisation -scheme. This option affects the -generation of global addresses when transferring data. Global addresses are -globally unique identifiers required to maintain sharing in the distributed -graph structure. Currently this is a binary option. With <num>=0 full globalisation is used -(default). This means a global address is generated for every closure that -is transmitted. With <num>=1 a thunk-only globalisation scheme is -used, which generates global addresses only for thunks. The latter case may -lose sharing of data but has a reduced overhead in packing graph structures -and maintaining internal tables of global addresses. - - - - - - - + + Using SMP parallelism + parallelism + + SMP + - + GHC supports running Haskell programs in parallel on an SMP + (symmetric multiprocessor). + + There's a fine distinction between + concurrency and parallelism: + parallelism is all about making your program run + faster by making use of multiple processors + simultaneously. Concurrency, on the other hand, is a means of + abstraction: it is a convenient way to structure a program that must + respond to multiple asynchronous events. + + However, the two terms are certainly related. By making use of + multiple CPUs it is possible to run concurrent threads in parallel, + and this is exactly what GHC's SMP parallelism support does. But it + is also possible to obtain performance improvements with parallelism + on programs that do not use concurrency. This section describes how to + use GHC to compile and run parallel programs; in we describe the language features that affect + parallelism. + + + Options to enable SMP parallelism - + In order to make use of multiple CPUs, your program must be + linked with the option (see ).
Then, to run a program on multiple + CPUs, use the RTS option: + + + + + + RTS option + Use x simultaneous threads when + running the program. Normally x + should be chosen to match the number of CPU cores on the machine. + There is no means (currently) by which this value may vary after + the program has started. + + For example, on a dual-core machine we would probably use + +RTS -N2 -RTS. + + Whether hyperthreading cores should be counted or not is an + open question; please feel free to experiment and let us know what + results you find. + + + + + + + Hints for using SMP parallelism + + Add the -sstderr RTS option when + running the program to see timing stats, which will help to tell you + whether your program got faster by using more CPUs or not. If the user + time is greater than + the elapsed time, then the program used more than one CPU. You should + also run the program without -N for comparison. + + GHC's parallelism support is new and experimental. It may make your + program go faster, or it might slow it down - either way, we'd be + interested to hear from you. + + One significant limitation with the current implementation is that + the garbage collector is still single-threaded, and all execution must + stop when GC takes place. This can be a significant bottleneck in a + parallel program, especially if your program does a lot of GC. If this + happens to you, then try reducing the cost of GC by tweaking the GC + settings (): enlarging the heap or the + allocation area size is a good start. + + Platform-specific Flags