% both concurrent and parallel %************************************************************************ %* * \section[concurrent-and-parallel]{Concurrent and Parallel Haskell} \index{Concurrent Haskell} \index{Parallel Haskell} %* * %************************************************************************ Concurrent and Parallel Haskell are Glasgow extensions to Haskell which let you structure your program as a group of independent `threads'. Concurrent and Parallel Haskell have very different purposes. Concurrent Haskell is for applications which have an inherent structure of interacting, concurrent tasks (i.e. `threads'). Threads in such programs may be {\em required}. For example, if a concurrent thread has been spawned to handle a mouse click, it isn't optional---the user wants something done! A Concurrent Haskell program implies multiple `threads' running within a single Unix process on a single processor. You will find at least one paper about Concurrent Haskell hanging off of Simon Peyton Jones's Web page; \tr{http://www.dcs.gla.ac.uk/~simonpj/}. Parallel Haskell is about {\em speed}---spawning threads onto multiple processors so that your program will run faster. The `threads' are always {\em advisory}---if the runtime system thinks it can get the job done more quickly by sequential execution, then fine. A Parallel Haskell program implies multiple processes running on multiple processors, under a PVM (Parallel Virtual Machine) framework. Parallel Haskell is still relatively new; it is more about ``research fun'' than about ``speed.'' That will change. Again, check Simon's Web page for publications about Parallel Haskell (including ``GUM'', the key bits of the runtime system). Some details about Concurrent and Parallel Haskell follow. %************************************************************************ %* * \subsection{Concurrent and Parallel Haskell---language features} \index{Concurrent Haskell---features} \index{Parallel Haskell---features} %* * %************************************************************************ %************************************************************************ %* * \subsubsection{Features specific to Concurrent Haskell} %* * %************************************************************************ %************************************************************************ %* * \subsubsubsection{The \tr{Concurrent} interface (recommended)} \index{Concurrent interface} %* * %************************************************************************ GHC provides a \tr{Concurrent} module, a common interface to a collection of useful concurrency abstractions, including those mentioned in the ``concurrent paper''. Just put \tr{import Concurrent} into your modules, and away you go. To create a ``required thread'': \begin{verbatim} forkIO :: IO a -> IO a \end{verbatim} The \tr{Concurrent} interface also provides access to ``I-Vars'' and ``M-Vars'', which are two flavours of {\em synchronising variables}. \index{synchronising variables (Glasgow extension)} \index{concurrency -- synchronising variables} \tr{IVars}\index{IVars (Glasgow extension)} are write-once variables. They start out empty, and any threads that attempt to read them will block until they are filled. Once they are written, any blocked threads are freed, and additional reads are permitted. Attempting to write a value to a full \tr{IVar} results in a runtime error. Interface: \begin{verbatim} newIVar :: IO (IVar a) readIVar :: IVar a -> IO a writeIVar :: IVar a -> a -> IO () \end{verbatim} \tr{MVars}\index{MVars (Glasgow extension)} are rendezvous points, mostly for concurrent threads. They begin empty, and any attempt to read an empty \tr{MVar} blocks. When an \tr{MVar} is written, a single blocked thread may be freed. Reading an \tr{MVar} toggles its state from full back to empty. Therefore, any value written to an \tr{MVar} may only be read once. Multiple reads and writes are allowed, but there must be at least one read between any two writes. Interface: \begin{verbatim} newEmptyMVar :: IO (MVar a) newMVar :: a -> IO (MVar a) takeMVar :: MVar a -> IO a putMVar :: MVar a -> a -> IO () readMVar :: MVar a -> IO a swapMVar :: MVar a -> a -> IO a \end{verbatim} A {\em channel variable} (@CVar@) is a one-element channel, as described in the paper: \begin{verbatim} data CVar a newCVar :: IO (CVar a) putCVar :: CVar a -> a -> IO () getCVar :: CVar a -> IO a \end{verbatim} A @Channel@ is an unbounded channel: \begin{verbatim} data Chan a newChan :: IO (Chan a) putChan :: Chan a -> a -> IO () getChan :: Chan a -> IO a dupChan :: Chan a -> IO (Chan a) unGetChan :: Chan a -> a -> IO () getChanContents :: Chan a -> IO [a] \end{verbatim} General and quantity semaphores: \begin{verbatim} data QSem newQSem :: Int -> IO QSem waitQSem :: QSem -> IO () signalQSem :: QSem -> IO () data QSemN newQSemN :: Int -> IO QSemN signalQSemN :: QSemN -> Int -> IO () waitQSemN :: QSemN -> Int -> IO () \end{verbatim} Merging streams---binary and n-ary: \begin{verbatim} mergeIO :: [a] -> [a] -> IO [a] nmergeIO :: [[a]] -> IO [a] \end{verbatim} A {\em Sample variable} (@SampleVar@) is slightly different from a normal @MVar@: \begin{itemize} \item Reading an empty @SampleVar@ causes the reader to block (same as @takeMVar@ on empty @MVar@). \item Reading a filled @SampleVar@ empties it and returns value. (same as @takeMVar@) \item Writing to an empty @SampleVar@ fills it with a value, and potentially, wakes up a blocked reader (same as for @putMVar@ on empty @MVar@). \item Writing to a filled @SampleVar@ overwrites the current value. (different from @putMVar@ on full @MVar@.) \end{itemize} \begin{verbatim} type SampleVar a = MVar (Int, MVar a) emptySampleVar :: SampleVar a -> IO () newSampleVar :: IO (SampleVar a) readSample :: SampleVar a -> IO a writeSample :: SampleVar a -> a -> IO () \end{verbatim} Finally, there are operations to delay a concurrent thread, and to make one wait:\index{delay a concurrent thread} \index{wait for a file descriptor} \begin{verbatim} threadDelay :: Int -> IO () -- delay rescheduling for N microseconds threadWait :: Int -> IO () -- wait for input on specified file descriptor \end{verbatim} %************************************************************************ %* * \subsubsection{Features specific to Parallel Haskell} %* * %************************************************************************ %************************************************************************ %* * \subsubsubsection{The \tr{Parallel} interface (recommended)} \index{Parallel interface} %* * %************************************************************************ GHC provides two functions for controlling parallel execution, through the \tr{Parallel} interface: \begin{verbatim} interface Parallel where infixr 0 `par` infixr 1 `seq` par :: a -> b -> b seq :: a -> b -> b \end{verbatim} The expression \tr{(x `par` y)} {\em sparks} the evaluation of \tr{x} (to weak head normal form) and returns \tr{y}. Sparks are queued for execution in FIFO order, but are not executed immediately. At the next heap allocation, the currently executing thread will yield control to the scheduler, and the scheduler will start a new thread (until reaching the active thread limit) for each spark which has not already been evaluated to WHNF. The expression \tr{(x `seq` y)} evaluates \tr{x} to weak head normal form and then returns \tr{y}. The \tr{seq} primitive can be used to force evaluation of an expression beyond WHNF, or to impose a desired execution sequence for the evaluation of an expression. For example, consider the following parallel version of our old nemesis, \tr{nfib}: \begin{verbatim} import Parallel nfib :: Int -> Int nfib n | n <= 1 = 1 | otherwise = par n1 (seq n2 (n1 + n2 + 1)) where n1 = nfib (n-1) n2 = nfib (n-2) \end{verbatim} For values of \tr{n} greater than 1, we use \tr{par} to spark a thread to evaluate \tr{nfib (n-1)}, and then we use \tr{seq} to force the parent thread to evaluate \tr{nfib (n-2)} before going on to add together these two subexpressions. In this divide-and-conquer approach, we only spark a new thread for one branch of the computation (leaving the parent to evaluate the other branch). Also, we must use \tr{seq} to ensure that the parent will evaluate \tr{n2} {\em before} \tr{n1} in the expression \tr{(n1 + n2 + 1)}. It is not sufficient to reorder the expression as \tr{(n2 + n1 + 1)}, because the compiler may not generate code to evaluate the addends from left to right. %************************************************************************ %* * \subsubsubsection{Underlying functions and primitives} \index{parallelism primitives} \index{primitives for parallelism} %* * %************************************************************************ The functions \tr{par} and \tr{seq} are wired into GHC, and unfold into uses of the \tr{par#} and \tr{seq#} primitives, respectively. If you'd like to see this with your very own eyes, just run GHC with the \tr{-ddump-simpl} option. (Anything for a good time...) You can use \tr{par} and \tr{seq} in Concurrent Haskell, though I'm not sure why you would want to. %************************************************************************ %* * \subsubsection{Features common to Concurrent and Parallel Haskell} %* * %************************************************************************ Actually, you can use the \tr{`par`} and \tr{`seq`} combinators (really for Parallel Haskell) in Concurrent Haskell as well. But doing things like ``\tr{par} to \tr{forkIO} many required threads'' counts as ``jumping out the 9th-floor window, just to see what happens.'' %************************************************************************ %* * \subsubsubsection{Scheduling policy for concurrent/parallel threads} \index{Scheduling---concurrent/parallel} \index{Concurrent/parallel scheduling} %* * %************************************************************************ Runnable threads are scheduled in round-robin fashion. Context switches are signalled by the generation of new sparks or by the expiry of a virtual timer (the timer interval is configurable with the \tr{-C[]}\index{-C RTS option (concurrent, parallel)} RTS option). However, a context switch doesn't really happen until the next heap allocation. If you want extremely short time slices, the \tr{-C} RTS option can be used to force a context switch at each and every heap allocation. When a context switch occurs, pending sparks which have not already been reduced to weak head normal form are turned into new threads. However, there is a limit to the number of active threads (runnable or blocked) which are allowed at any given time. This limit can be adjusted with the \tr{-t}\index{-t RTS option (concurrent, parallel)} RTS option (the default is 32). Once the thread limit is reached, any remaining sparks are deferred until some of the currently active threads are completed. %************************************************************************ %* * \subsection{How to use Concurrent and Parallel Haskell} %* * %************************************************************************ [You won't get far unless your GHC system was configured/built with concurrency and/or parallelism enabled. (They require separate library modules.) The relevant section of the installation guide says how to do this.] %************************************************************************ %* * \subsubsection{Using Concurrent Haskell} \index{Concurrent Haskell---use} %* * %************************************************************************ To compile a program as Concurrent Haskell, use the \tr{-concurrent} option,\index{-concurrent option} both when compiling {\em and linking}. You will probably need the \tr{-fglasgow-exts} option, too. Three RTS options are provided for modifying the behaviour of the threaded runtime system. See the descriptions of \tr{-C[]}, \tr{-q}, and \tr{-t} in \Sectionref{parallel-rts-opts}. %************************************************************************ %* * \subsubsubsection[concurrent-problems]{Potential problems with Concurrent Haskell} \index{Concurrent Haskell problems} \index{problems, Concurrent Haskell} %* * %************************************************************************ The main thread in a Concurrent Haskell program is given its own private stack space, but all other threads are given stack space from the heap. Stack space for the main thread can be adjusted as usual with the \tr{-K} RTS option,\index{-K RTS option (concurrent, parallel)} but if this private stack space is exhausted, the main thread will switch to stack segments in the heap, just like any other thread. Thus, problems which would normally result in stack overflow in ``sequential Haskell'' can be expected to result in heap overflow when using threads. The concurrent runtime system uses black holes as synchronisation points for subexpressions which are shared among multiple threads. In ``sequential Haskell'', a black hole indicates a cyclic data dependency, which is a fatal error. However, in concurrent execution, a black hole may simply indicate that the desired expression is being evaluated by another thread. Therefore, when a thread encounters a black hole, it simply blocks and waits for the black hole to be updated. Cyclic data dependencies will result in deadlock, and the program will fail to terminate. Because the concurrent runtime system uses black holes as synchronisation points, it is not possible to disable black-holing with the \tr{-N} RTS option.\index{-N RTS option} Therefore, the use of signal handlers (including timeouts) with the concurrent runtime system can lead to problems if a thread attempts to enter a black hole that was created by an abandoned computation. The use of signal handlers in conjunction with threads is strongly discouraged. %************************************************************************ %* * \subsubsection{Using Parallel Haskell} \index{Parallel Haskell---use} %* * %************************************************************************ [You won't be able to execute parallel Haskell programs unless PVM3 (Parallel Virtual Machine, version 3) is installed at your site.] To compile a Haskell program for parallel execution under PVM, use the \tr{-parallel} option,\index{-parallel option} both when compiling {\em and linking}. You will probably want to \tr{import Parallel} into your Haskell modules. To run your parallel program, once PVM is going, just invoke it ``as normal''. The main extra RTS option is \tr{-N}, to say how many PVM ``processors'' your program to run on. (For more details of all relevant RTS options, please see \sectionref{parallel-rts-opts}.) In truth, running Parallel Haskell programs and getting information out of them (e.g., parallelism profiles) is a battle with the vagaries of PVM, detailed in the following sections. %************************************************************************ %* * \subsubsubsection{Dummy's guide to using PVM} \index{PVM, how to use} \index{Parallel Haskell---PVM use} %* * %************************************************************************ Before you can run a parallel program under PVM, you must set the required environment variables (PVM's idea, not ours); something like, probably in your \tr{.cshrc} or equivalent: \begin{verbatim} setenv PVM_ROOT /wherever/you/put/it setenv PVM_ARCH `$PVM_ROOT/lib/pvmgetarch` setenv PVM_DPATH $PVM_ROOT/lib/pvmd \end{verbatim} Creating and/or controlling your ``parallel machine'' is a purely-PVM business; nothing specific to Parallel Haskell. You use the \tr{pvm}\index{pvm command} command to start PVM on your machine. You can then do various things to control/monitor your ``parallel machine;'' the most useful being: \begin{tabular}{ll} \tr{Control-D} & exit \tr{pvm}, leaving it running \\ \tr{halt} & kill off this ``parallel machine'' \& exit \\ \tr{add } & add \tr{} as a processor \\ \tr{delete } & delete \tr{} \\ \tr{reset} & kill what's going, but leave PVM up \\ \tr{conf} & list the current configuration \\ \tr{ps} & report processes' status \\ \tr{pstat } & status of a particular process \\ \end{tabular} The PVM documentation can tell you much, much more about \tr{pvm}! %************************************************************************ %* * \subsubsection{Parallelism profiles} \index{parallelism profiles} \index{profiles, parallelism} \index{visualisation tools} %* * %************************************************************************ With Parallel Haskell programs, we usually don't care about the results---only with ``how parallel'' it was! We want pretty pictures. Parallelism profiles (\`a la \tr{hbcpp}) can be generated with the \tr{-q}\index{-q RTS option (concurrent, parallel)} RTS option. The per-processor profiling info is dumped into files named \tr{.gr}. These are then munged into a PostScript picture, which you can then display. For example, to run your program \tr{a.out} on 8 processors, then view the parallelism profile, do: \begin{verbatim} % ./a.out +RTS -N8 -q % grs2gr *.???.gr > temp.gr # combine the 8 .gr files into one % gr2ps -O temp.gr # cvt to .ps; output in temp.ps % ghostview -seascape temp.ps # look at it! \end{verbatim} The scripts for processing the parallelism profiles are distributed in \tr{ghc/utils/parallel/}. %$$************************************************************************ %$$* * %$$\subsubsection{Activity profiles} %$$\index{activity profiles} %$$\index{profiles, activity} %$$\index{visualisation tools} %$$%$$* * %$$%$$************************************************************************ %$$ %$$You can also use the standard GHC ``cost-centre'' profiling to see how %$$much time each PVM ``processor'' spends %$$ %$$No special compilation flags beyond \tr{-parallel} are required to get %$$this basic four-activity profile. Just use the \tr{-P} RTS option, %$$thusly: %$$\begin{verbatim} %$$./a.out +RTS -N7 -P # 7 processors %$$\end{verbatim} %$$ %$$The above will create files named \tr{.prof} and/or %$$\tr{.time} {\em in your home directory}. You can %$$process the \tr{.time} files into PostScript using \tr{hp2ps}, %$$\index{hp2ps} %$$as described elsewhere in this guide. %$$ %$$Because of the weird file names, you probably need to use %$$\tr{hp2ps} as a filter. Also, you probably want to give \tr{hp2ps} %$$a \tr{-t0} flag, so that no ``inconsequential'' data is ignored---in %$$parallel-land it's all consequential. So: %$$\begin{verbatim} %$$%$$ hp2ps -t0 < fooo.001.time > temp.ps %$$\end{verbatim} %$$ %$$ The first line of the %$$ \tr{.qp} file contains the name of the program executed, along with %$$ any program arguments and thread-specific RTS options. The second %$$ line contains the date and time of program execution. The third %$$ and subsequent lines contain information about thread state transitions. %$$ %$$ The thread state transition lines have the following format: %$$ \begin{verbatim} %$$ time transition thread-id thread-name [thread-id thread-name] %$$ \end{verbatim} %$$ %$$ The \tr{time} is the virtual time elapsed since the program started %$$ execution, in milliseconds. The \tr{transition} is a two-letter code %$$ indicating the ``from'' queue and the ``to'' queue, where each queue %$$ is one of: %$$ \begin{itemize} %$$ \item[\tr{*}] Void: Thread creation or termination. %$$ \item[\tr{G}] Green: Runnable (or actively running, with \tr{-qv}) threads. %$$ \item[\tr{A}] Amber: Runnable threads (\tr{-qv} only). %$$ \item[\tr{R}] Red: Blocked threads. %$$ \end{itemize} %$$ The \tr{thread-id} is a unique integer assigned to each thread. The %$$ \tr{thread-name} is currently the address of the thread's root closure %$$ (in hexadecimal). In the future, it will be the name of the function %$$ associated with the root of the thread. %$$ %$$ The first \tr{(thread-id, thread-name)} pair identifies the thread %$$ involved in the indicated transition. For \tr{RG} and \tr{RA} transitions %$$ only, there is a second \tr{(thread-id, thread-name)} pair which identifies %$$ the thread that released the blocked thread. %$$ %$$ Provided with the GHC distribution is a perl script, \tr{qp2pp}, which %$$ will convert \tr{.qp} files to \tr{hbcpp}'s \tr{.pp} format, so that %$$ you can use the \tr{hbcpp} profiling tools, such as \tr{pp2ps92}. The %$$ \tr{.pp} format has undergone many changes, so the conversion script %$$ is not compatible with earlier releases of \tr{hbcpp}. Note that GHC %$$ and \tr{hbcpp} use different thread scheduling policies (in %$$ particular, \tr{hbcpp} threads never move from the green queue to the %$$ amber queue). For compatibility, the \tr{qp2pp} script eliminates the %$$ GHC amber queue, so there is no point in using the verbose (\tr{-qv}) %$$ option if you are only interested in using the \tr{hbcpp} profiling %$$ tools. %************************************************************************ %* * \subsubsection{Other useful info about running parallel programs} %* * %************************************************************************ The ``garbage-collection statistics'' RTS options can be useful for seeing what parallel programs are doing. If you do either \tr{+RTS -Sstderr}\index{-Sstderr RTS option} or \tr{+RTS -sstderr}, then you'll get mutator, garbage-collection, etc., times on standard error. The standard error of all PE's other than the `main thread' appears in \tr{/tmp/pvml.nnn}, courtesy of PVM. Whether doing \tr{+RTS -Sstderr} or not, a handy way to watch what's happening overall is: \tr{tail -f /tmp/pvml.nnn}. %************************************************************************ %* * \subsubsection[parallel-rts-opts]{RTS options for Concurrent/Parallel Haskell} \index{RTS options, concurrent} \index{RTS options, parallel} \index{Concurrent Haskell---RTS options} \index{Parallel Haskell---RTS options} %* * %************************************************************************ Besides the usual runtime system (RTS) options (\sectionref{runtime-control}), there are a few options particularly for concurrent/parallel execution. \begin{description} \item[\tr{-N}:] \index{-N RTS option (parallel)} (PARALLEL ONLY) Use \tr{} PVM processors to run this program; the default is 2. \item[\tr{-C[]}:] \index{-C RTS option} Sets the context switch interval to \pl{} microseconds. A context switch will occur at the next heap allocation after the timer expires. With \tr{-C0} or \tr{-C}, context switches will occur as often as possible (at every heap allocation). By default, context switches occur every 10 milliseconds. Note that many interval timers are only capable of 10 millisecond granularity, so the default setting may be the finest granularity possible, short of a context switch at every heap allocation. \item[\tr{-q[v]}:] \index{-q RTS option} Produce a quasi-parallel profile of thread activity, in the file \tr{.qp}. In the style of \tr{hbcpp}, this profile records the movement of threads between the green (runnable) and red (blocked) queues. If you specify the verbose suboption (\tr{-qv}), the green queue is split into green (for the currently running thread only) and amber (for other runnable threads). We do not recommend that you use the verbose suboption if you are planning to use the \tr{hbcpp} profiling tools or if you are context switching at every heap check (with \tr{-C}). \item[\tr{-t}:] \index{-t RTS option} Limit the number of concurrent threads per processor to \pl{}. The default is 32. Each thread requires slightly over 1K {\em words} in the heap for thread state and stack objects. (For 32-bit machines, this translates to 4K bytes, and for 64-bit machines, 8K bytes.) \item[\tr{-d}:] \index{-d RTS option (parallel)} (PARALLEL ONLY) Turn on debugging. It pops up one xterm (or GDB, or something...) per PVM processor. We use the standard \tr{debugger} script that comes with PVM3, but we sometimes meddle with the \tr{debugger2} script. We include ours in the GHC distribution, in \tr{ghc/utils/pvm/}. \item[\tr{-e}:] \index{-e RTS option (parallel)} (PARALLEL ONLY) Limit the number of pending sparks per processor to \tr{}. The default is 100. A larger number may be appropriate if your program generates large amounts of parallelism initially. \item[\tr{-Q}:] \index{-Q RTS option (parallel)} (PARALLEL ONLY) Set the size of packets transmitted between processors to \tr{}. The default is 1024 words. A larger number may be appropriate if your machine has a high communication cost relative to computation speed. \end{description} %************************************************************************ %* * \subsubsubsection[parallel-problems]{Potential problems with Parallel Haskell} \index{Parallel Haskell---problems} \index{problems, Parallel Haskell} %* * %************************************************************************ The ``Potential problems'' for Concurrent Haskell also apply for Parallel Haskell. Please see \Sectionref{concurrent-problems}. %$$ \subsubsubsection[par-notes]{notes for 0.26} %$$ %$$ \begin{verbatim} %$$ Install PVM somewhere, as it says. We use 3.3 %$$ %$$ pvm.h : can do w/ a link from ghc/includes to its true home (???) %$$ %$$ %$$ ghc -gum ... => a.out %$$ %$$ a.out goes to $PVM_ROOT/bin/$PVM_ARCH/$PE %$$ %$$ (profiling outputs go to ~/$PE..) %$$ %$$ trinder scripts in: ~trinder/bin/any/instPHIL %$$ %$$ To run: %$$ %$$ Then: %$$ SysMan [-] N (PEs) args-to-program... %$$ %$$ - ==> debug mode %$$ mattson setup: GDB window per task %$$ /local/grasp_tmp5/mattson/pvm3/lib/debugger{,2} %$$ %$$ to set breakpoint, etc, before "run", just modify debugger2 %$$ %$$ stderr and stdout are directed to /tmp/pvml.NNN %$$ %$$ Visualisation stuff (normal _mp build): %$$ %$$ +RTS -q gransim-like profiling %$$ (should use exactly-gransim RTS options) %$$ -qb binary dumps : not tried, not recommended: hosed! %$$ %$$ ascii dump : same info as gransim, one extra line at top w/ %$$ start time; all times are ms since then %$$ %$$ dumps appear in $HOME/.nnn.gr %$$ %$$ ~mattson/grs2gr.pl == combine lots into one (fixing times) %$$ %$$ /local/grasp/hwloidl/GrAn/bin/ is where scripts are. %$$ %$$ gr2ps == activity profile (bash script) %$$ %$$ ~mattson/bin/`arch`/gr2qp must be picked up prior to hwloidl's for %$$ things to work... %$$ %$$ +RTS -[Pp] (parallel) 4-cost-centre "profiling" (gc,MAIN,msg,idle) %$$ %$$ ToDos: time-profiles from hp2ps: something about zeroth sample; %$$ \end{verbatim}