%----------------------------------------------------------------------
\section {A Brief Review of Discrete-Time Signal Processing}
%----------------------------------------------------------------------

%%------------------------------------------------------------
\subsection {Discrete-Time Signals}
%%------------------------------------------------------------
This section only provides some background terminology;
see~\cite{OppeScha75} or \cite{ProaMano92} for motivation and more
detail.
A {\em discrete-time signal\/} is simply a sequence of
numbers. That is, a discrete-time signal is a function whose domain
is the set of integers and whose range is some set of numbers, e.g.,
the real or complex numbers. The elements of the sequence are called
{\em samples}. The $n$th sample of the signal $x$ is denoted by
$x_n$. For the rest of this paper, we will refer to discrete-time
signals as ``signals'' or ``sequences'' interchangeably.
It is mathematically convenient to consider signals to be
defined for all integers, with samples taking the value zero if not
otherwise specified. A signal $x$ is {\em finite-duration\/} if there
are integers $n_1$ and $n_2$ such that $x_n = 0$ whenever $n<n_1$ or
$n>n_2$. A {\em right-sided sequence\/} $x$ is one for which there is
an integer $n_1$ such that $x_n = 0$ for all $n < n_1$; if $n_1 \geq
0$ then the signal is also said to be {\em causal}. {\em
Left-sided\/} and {\em two-sided\/} sequences are defined similarly.
All signals in this paper are assumed to be causal with $n_1 = 0$.
Since signals are sequences of numbers, they are naturally
represented in Haskell using lists, particularly since we are assuming
causal signals indexed from zero and Haskell lists are indexed from
zero.
We will often use the notation $x^n_k$ to denote the list of
samples $[x_k,x_{k+1},\ldots,x_n]$ of the signal $x$. If $x$ is
potentially infinite in duration or if knowing its duration is not
important, we will replace $n$ by $\infty$. For a causal signal, this
convention means that $x^\infty_0$ represents the entire signal.
For convenience, we introduce the following type synonym:

> type Signal a = [a]
The element type is unspecified because signals can be
sequences of integers, floating point numbers, or complex numbers.
Thus, \verb~Signal Int~ represents a signal whose samples are
finite-precision integers, \verb~Signal Float~ represents a signal
whose samples are floating point numbers, etc.
%%------------------------------------------------------------
\subsection {The $Z$ Transform}
%%------------------------------------------------------------
Given a signal $x$, the {\em Z transform\/} of $x$ is the
complex-valued function $X(z)$ of the complex variable $z$ defined by
\begin{equation}
X(z) \isDefined \sum_{k=-\infty}^\infty x_k z^{-k}
\label{eq:Z-transform}
\end{equation}
for all $z$ for which the power series converges. If the
signal $x$ has $Z$ transform $X(z)$, we will write
$x \transformPair X(z)$.
The $Z$ transform is useful because it allows us to study
discrete-time systems (Section~\ref{sc:discrete-time-systems}) using
algebraic manipulations. This is because the $Z$ transform has a
number of useful properties, only three of which are needed here.
\begin{itemize}
\item {\em scaling}

If we multiply each sample of the signal $x$ by a scalar $a$
and call the resulting sequence $y$, then
\begin{displaymath}
y \transformPair Y(z) = aX(z)
\end{displaymath}
In Haskell, $y$ is just $map\ (a*)\ x$.

\item {\em superposition}

Given two signals $x$ and $y$, if we add them
sample-by-sample to get a new signal $w$, then
\begin{displaymath}
w \transformPair W(z) = X(z) + Y(z)
\end{displaymath}

\item {\em shift}

Given a signal $x$ and an integer $k$, define the signal $y$
by $y_n = x_{n-k}$, i.e., $x$ delayed by $k$ samples; then
\begin{displaymath}
y \transformPair Y(z) = z^{-k}X(z)
\end{displaymath}
\end{itemize}
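Each of these properties has a direct signal-domain reading once signals are lists. The following sketch shows the list operation corresponding to each property; the names \verb~scale~, \verb~superpose~, and \verb~shiftBy~ are ours, introduced only for illustration:

```haskell
-- Signal-domain counterparts of the three Z-transform properties,
-- for causal signals represented as lists (illustrative sketch).
type Signal a = [a]

-- scaling: y <-> a X(z)
scale :: Num a => a -> Signal a -> Signal a
scale a = map (a *)

-- superposition: w <-> X(z) + Y(z); note that zipWith truncates to
-- the shorter list, so equal-length (or infinite) signals are
-- assumed here
superpose :: Num a => Signal a -> Signal a -> Signal a
superpose = zipWith (+)

-- shift: y <-> z^(-k) X(z); delaying a causal signal by k samples
-- prepends k zeros
shiftBy :: Num a => Int -> Signal a -> Signal a
shiftBy k x = replicate k 0 ++ x
```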
%%------------------------------------------------------------
\subsection {Discrete-Time Systems}
\label{sc:discrete-time-systems}
%%------------------------------------------------------------
A {\em discrete-time system\/} is an operator that transforms
an {\em input\/} signal into an {\em output\/} signal. These systems
are commonly called {\em filters\/} when their function is to suppress
or enhance some feature of the input signal. In Section~\ref{sc:fir}
we will define a function for performing a particular type of
filtering. Since a filter maps signals to signals, we introduce the
type synonym:

> type Filter a = Signal a -> Signal a
%%------------------------------------------------------------
\subsection {Finite Impulse Response Filtering}
\label{sc:fir}
%%------------------------------------------------------------
A (causal) {\em finite impulse response\/} (FIR) filter is
characterized by a finite-duration signal $h$ called the {\em impulse
response\/} of the filter. For an arbitrary input signal $x$, the
samples of the output signal $y$ are defined by the {\em convolution
sum\/}
\begin{equation}
y_n = \sum^{M-1}_{k=0} h_k x_{n-k}
\label{eq:convolution}
\end{equation}
where $M$ is the length of $h$. We will use FIR filtering
twice, once in the definition of the function \verb~preemph~ in
Section~\ref{sc:preemph} and again in the definition of the function
\verb~cepstra~ in Section~\ref{sc:cepstra}. Hence we need to define a
Haskell function that implements FIR filtering. In doing so, we will
demonstrate how it is often possible to derive an executable
definition from a mathematical specification.
$Z$ transforms can be used to express the filtering operation.
Suppose signals $x$, $h$, and $y$ are related by
(\ref{eq:convolution}). Denoting the $Z$ transforms of these signals
by $X(z), Y(z)$, and $H(z)$, respectively, it is easy to show that
\begin{displaymath}
Y(z) = H(z)X(z), \qquad
H(z) = \sum^{M-1}_{k=0} h_k z^{-k}.
\end{displaymath}
We now derive a Haskell function for performing FIR filtering.
First, we rewrite $Y(z)$:
\begin{eqnarray*}
Y(z) & = & H(z)X(z) \\
& = & \sum^{M-1}_{k=0} h_k z^{-k} X(z)\\
& = & \sum^{M-1}_{k=0} z^{-k} (h_kX(z)) \\
& = & \sum^{M-1}_{k=0} z^{-k} W^{(k)}(z)
\end{eqnarray*}
where we have set $W^{(k)}(z) = h_kX(z)$. Using Horner's rule,
\begin{eqnarray*}
Y(z) & = & W^{(0)}(z)\ +\ z^{-1}(W^{(1)}(z)
\ +\ \ldots \\
& & +\ z^{-1}(W^{(M-2)}(z)
\ +\ z^{-1} W^{(M-1)}(z)) \ldots\ ))
\end{eqnarray*}
Defining the transform relationships $w^{(k)} \transformPair
W^{(k)}(z)$, this equation maps back into the signal domain as
\begin{equation}
y = w^{(0)} \oplus (w^{(1)} \oplus (w^{(2)} \oplus \ldots
\oplus (w^{(M-2)} \oplus w^{(M-1)}) \ldots\ ))
\label{eq:ws}
\end{equation}
where the operator $\oplus$ is such that, for two arbitrary
signals $u$ and $v$ having $Z$ transforms $U(z)$ and $V(z)$, we have
\begin{displaymath}
u \oplus v \ \transformPair\ U(z) + z^{-1}V(z).
\end{displaymath}
From the superposition and shift properties of the $Z$
transform, it is clear that the operator $\oplus$ is implemented by
delaying the second operand by one sample and then performing
sample-by-sample addition. Let's call this operation \verb~delayAdd~:

> delayAdd :: (Num a) => Signal a -> Signal a -> Signal a
> (x:rxs) `delayAdd` y = x : combine rxs y
> [] `delayAdd` y = 0 : y
(The second equation treats the empty list as the all-zero
signal.) This definition uses a function \verb~combine~ which is
similar to the Standard Prelude function \verb~zipWith~, except that
\verb~combine~ returns a list as long as the {\em longer\/} of the two
lists, with the ``missing'' values from the shorter list assumed to be
zero:

> combine :: (Num a) => Signal a -> Signal a -> Signal a
> combine (x:rxs) (y:rys) = x+y : combine rxs rys
> combine xs [] = xs
> combine [] ys = ys
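To see that \verb~delayAdd~ really implements $U(z) + z^{-1}V(z)$, it helps to trace a small example. The definitions are repeated here (with \verb~combine~'s base cases spelled out) so the fragment stands alone:

```haskell
type Signal a = [a]

-- combine: sample-by-sample addition that keeps the tail of the
-- longer list (missing samples of the shorter list count as zero)
combine :: Num a => Signal a -> Signal a -> Signal a
combine (x:rxs) (y:rys) = x + y : combine rxs rys
combine xs      []      = xs
combine []      ys      = ys

-- delayAdd: delay the second operand one sample, then add
delayAdd :: Num a => Signal a -> Signal a -> Signal a
(x:rxs) `delayAdd` y = x : combine rxs y
[]      `delayAdd` y = 0 : y

-- delayAdd [1,2,3] [10,20,30]
--   = 1 : combine [2,3] [10,20,30]
--   = [1, 2+10, 3+20, 30]
--   = [1,12,23,30]
```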
Now we observe that (\ref{eq:ws}) can be rewritten using
the Standard Prelude function $foldr1$:
\begin{eqnarray*}
y & = & foldr1\ \ (\oplus)\ \ [\: w^{(0)},\
w^{(1)},\ \ldots,\ w^{(M-1)} \:] \\
& = & foldr1\ \ (\oplus)\ \ [\: map\ \ (h_k *)\ \ x \:|\:
k \leftarrow [0..M-1] \:]
\end{eqnarray*}
Thus, we can implement an FIR filter having impulse response
\verb~hs~ on an input signal \verb~x~ as follows:\footnote{My thanks
to Mark Jones of Yale University who first suggested this form of the
definition.}

> fir :: Signal Float -> Filter Float
> fir hs x = foldr1 delayAdd [map (h*) x | h <- hs]
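As a quick sanity check, \verb~fir~ applied to a unit impulse returns the impulse response, and on a longer input it computes the full convolution of (\ref{eq:convolution}). The supporting definitions are repeated from the text, generalized from \verb~Float~ to any \verb~Num~ instance so the example can use exact integers:

```haskell
combine :: Num a => [a] -> [a] -> [a]
combine (x:rxs) (y:rys) = x + y : combine rxs rys
combine xs      []      = xs
combine []      ys      = ys

delayAdd :: Num a => [a] -> [a] -> [a]
(x:rxs) `delayAdd` y = x : combine rxs y
[]      `delayAdd` y = 0 : y

fir :: Num a => [a] -> [a] -> [a]
fir hs x = foldr1 delayAdd [map (h *) x | h <- hs]

-- fir hs [1] recovers the impulse response hs, and
-- fir [1,2] [1,1,1] is the convolution [1,3,3,2]
```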
%----------------------------------------------------------------------
\section {Linear Predictive Analysis}
%----------------------------------------------------------------------

The processor described here is rather conventional; we just
implement most of the blocks described in~\cite[pp.\
112--117]{RabiJuan93}.
%--------------------------------------------------
\subsection {Preemphasis}
\label{sc:preemph}
%--------------------------------------------------
The speech signal is first passed through a simple high-pass
filter to reduce the degree of spectral tilt. The system
function of this filter is
\begin{displaymath}
H(z) = 1 - az^{-1}
\end{displaymath}
where $a$ is a real number less than but close to 1; a typical
value is 0.95. Hence, the preemphasis filter is an FIR filter with
the impulse response $[1,\ -a\ ]$.

> preemph :: Float -> Filter Float
> preemph a = fir [1,-a]
%--------------------------------------------------
\subsection {Framing}
%--------------------------------------------------
Because of the time-varying nature of speech, the signal is
broken into a list of equal-length {\em frames}, each offset from its
predecessor by a fixed number of samples. Typically, the frame length
is about $20$ ms and the offset is about $10$ ms. We need a function
\verb~frames~ that takes three arguments: a frame length, an offset,
and the signal. We can derive the definition of \verb~frames~ as
follows:
\begin{eqnarray*}
frames\ n\ m\ x^\infty_0
& = & map\ (take\ n)\ [\ x^\infty_0,\ x^\infty_m,\ x^\infty_{2m},\ \ldots\ ] \\
& = & map\ (take\ n)\ (iterate\ (drop\ m)\ x_0^\infty).
\end{eqnarray*}
Abstracting from the signal argument,
\begin{displaymath}
frames\ n\ m\ = map\ (take\ n)\ \circ\ iterate\ (drop\ m)
\end{displaymath}
But the signal may be finite, so we define and use a function
\verb~check_for_end~ to avoid processing an infinite stream of empty
frames:

> type Frame a = [a]

> frames :: Int -> Int -> Signal a -> [Frame a]
> frames n m = check_for_end . map (take n) . iterate (drop m)

Here is the definition of \verb~check_for_end~:

> check_for_end = takeWhile (not . null)
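For example, framing a ten-sample signal with length 4 and offset 2 yields overlapping frames, the last of which may be short; such short frames are dealt with by the windowing step of the next section. The definitions are repeated, specialized to plain lists, so the fragment is self-contained:

```haskell
check_for_end :: [[a]] -> [[a]]
check_for_end = takeWhile (not . null)

frames :: Int -> Int -> [a] -> [[a]]
frames n m = check_for_end . map (take n) . iterate (drop m)

-- frames 4 2 [1..10] yields
--   [[1,2,3,4],[3,4,5,6],[5,6,7,8],[7,8,9,10],[9,10]]
-- note the short final frame [9,10]
```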
%--------------------------------------------------
\subsection {Windowing}
%--------------------------------------------------
A {\em window\/} is used to smoothly attenuate samples at the
edges of a frame prior to analysis. A window is applied by
sample-by-sample multiplication. It is tempting to define the
windowing operation using the Standard Prelude function
\verb~zipWith~; however, we don't want to analyze a frame of data that
is too short, as would happen if the length of the entire signal
exceeded that of an integral number of frames by only a few samples.
Hence we define the function \verb~window~ using a utility function
\verb~window'~ that makes sure that the segment being windowed is at
least as long as the window itself; otherwise an empty frame is
returned. Note that \verb~window'~ is not a stream function; the
output list is accumulated in the third argument so that it can be
forgotten if the signal turns out to be shorter than the window.
> type Window = [Float]

> window :: Window -> Signal Float -> Frame Float
> window w x = window' w x []

> window' (w:rws) (x:rxs) buf = window' rws rxs (w*x : buf)
> window' (_:_) [] _ = []
> window' [] _ buf = buf
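Two small cases illustrate the behavior (definitions repeated from above). Note that, because of the accumulator, the windowed samples come back in reverse order; this does not affect the analysis below, since the Hamming window is symmetric and the autocorrelation sequence is invariant under time reversal:

```haskell
type Window = [Float]

window' :: Window -> [Float] -> [Float] -> [Float]
window' (w:rws) (x:rxs) buf = window' rws rxs (w*x : buf)
window' (_:_)  []       _   = []   -- segment ran out first: too short
window' []     _        buf = buf  -- window consumed: done (reversed)

window :: Window -> [Float] -> [Float]
window w x = window' w x []

-- window [0.5,1.0] [2,2,7]   = [2.0,1.0]  (reversed; extra input ignored)
-- window [0.5,1.0,0.5] [2,2] = []         (segment shorter than window)
```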
A popular window is the {\em Hamming window}. If the window
is to have $N$ samples, the formula for the $n$th sample is:
\begin{displaymath}
w(n) = 0.54 - 0.46 \cos\left( \frac{2\pi}{N-1} n \right)
\end{displaymath}
Here is the function for building a Hamming window of \verb~n~
points:

> hamming_window :: Int -> Window
> hamming_window npts =
>   let angle = (2*pi)/fromInt (npts-1)
>   in [0.54 - 0.46*cos (angle*fromInt n) | n <- [0..npts-1]]
One advantage of lazy evaluation is that by defining the
windowing function the way we have, we can do the windowing and the
framing simultaneously. Thus, it turns out that we don't need the
function \verb~frames~. Let the first argument be the window width
and the second argument be the offset between frames.

> windows :: Int -> Int -> Signal Float -> [Frame Float]
> windows n m =
>     let
>         hw = hamming_window n
>         apply_window = window hw
>     in
>         check_for_end . map apply_window . iterate (drop m)
%--------------------------------------------------
\subsection {The Autocorrelation Sequence}
%--------------------------------------------------
The function \verb~autocorr~ computes the autocorrelation
sequence for the signal $x$ as a prelude to computing the coefficients
of the optimal $p$th-order linear prediction filter
(Section~\ref{sc:Durbin}). The $i$th sample of the autocorrelation
sequence for an $N$-point signal is defined by the equation
\begin{displaymath}
r_i = \left\{ \begin{array}{ll}
\sum^{N-1-i}_{n=0} x_n x_{n+i} & 0 \leq i < N \\
0 & \mbox{otherwise}
\end{array} \right.
\end{displaymath}
The summation operation is just the familiar dot product of
linear algebra. The dot product can be coded in Haskell as follows:

> x `dot` y = sum (zipWith (*) x y)
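For instance, because \verb~zipWith~ stops at the end of the shorter list, the following all hold (a property exploited just below):

```haskell
dot :: Num a => [a] -> [a] -> a
x `dot` y = sum (zipWith (*) x y)

-- the standard dot product of equal-length lists:
--   [1,2,3] `dot` [4,5,6] = 4 + 10 + 18 = 32
-- with unequal lengths the longer list is truncated:
--   [1,2,3] `dot` [4,5]   = 4 + 10      = 14
```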
However, the operator \verb~`dot`~ is actually more general
than the familiar dot product because the two argument lists can have
different lengths; the longer list is essentially truncated to the
length of the shorter list. For derivation purposes, let `$\odot$'
denote this more general operation, and let `$\cdot$' denote the
standard dot product. In general,
\begin{displaymath}
x^n_0 \cdot y^n_0 = x_0^{n+k} \odot y^n_0 =
x^n_0 \odot y^{n+k}_0
\end{displaymath}
for all $k \geq 0$, whereas $x^{n+k}_0 \cdot y^n_0$ would be
undefined. We can now formally derive the Haskell function for
computing the non-zero portion of the autocorrelation sequence as
follows:
\begin{eqnarray*}
[r_0,\ r_1,\ \ldots,\ r_{N-1} ]
& = & [ \sum^{N-1-i}_{n=0} x_n x_{n+i} \ |\ i \leftarrow [0..N-1]\:]\\
& = & [ x^{N-1-i}_0 \cdot x^{N-1}_i \ |\ i \leftarrow [0..N-1]\:]\\
& = & [ x^{N-1}_0 \odot x^{N-1}_i \ |\ i \leftarrow [0..N-1]\:]\\
& = & map\ \ (x^{N-1}_0\ \odot)\ \ [ x^{N-1}_i \ |\ i \leftarrow
[0..N-1]\:]\\
& = & map\ \ (x^{N-1}_0\ \odot)\ \ (tails\ x^{N-1}_0)
\end{eqnarray*}
where the function $tails$ is such that
\begin{displaymath}
tails\ x^N_0 = [\ x^N_0,\ x^N_1,\ x^N_2,\ \ldots,\ x^N_N\ ]
\end{displaymath}
To get the complete autocorrelation sequence, we just append
an infinite list of zeros:

> autocorr x = map (x `dot`) (tails x) ++ repeat 0.0
The function \verb~tails~ is defined as follows:\footnote{This
function is also provided in the Chalmers hbc library.}

> tails xs@(_:rxs) | null rxs = [xs]
>                  | otherwise = xs : tails rxs
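Putting \verb~dot~ and \verb~tails~ together on a three-sample signal (definitions repeated; $r_i = 0$ for $i \geq 3$):

```haskell
dot :: Num a => [a] -> [a] -> a
x `dot` y = sum (zipWith (*) x y)

-- note: defined only for non-empty lists, and (unlike the standard
-- tails) the result does not include the empty list
tails :: [a] -> [[a]]
tails xs@(_:rxs) | null rxs  = [xs]
                 | otherwise = xs : tails rxs

autocorr :: [Double] -> [Double]
autocorr x = map (x `dot`) (tails x) ++ repeat 0.0

-- for x = [1,2,3]:
--   r_0 = 1*1 + 2*2 + 3*3 = 14
--   r_1 = 1*2 + 2*3       = 8
--   r_2 = 1*3             = 3
-- and zero afterwards
```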
In a conventional imperative language we would need to pass
{\em two\/} arguments to this function: the signal and a count
specifying how many autocorrelation values are to be computed. The
function would then return an array containing precisely that many
values. However, in a lazy language the autocorrelation values will
only be computed as needed, so we do not need to specify such a value
for this function.
The value $r_0$ is the {\em energy} of the signal. The
logarithm of the energy is often included as a parameter in feature
vectors used in recognizers.
%--------------------------------------------------
\subsection {The Durbin Algorithm}
\label{sc:Durbin}
%--------------------------------------------------
The next step is to solve for the optimum linear predictor
filter coefficients for a given order $p$. This can be done
efficiently using the {\em Durbin Algorithm}. The steps are listed in
Figure~\ref{fg:durbin} as they appear in~\cite[p.\ 411]{RabiScha78}
but with a few changes in notation. Basically, we successively
compute the optimum prediction filters for each order $i$ starting
with $i=1$. The optimum $i$th-order filter coefficients, $a^{(i)} =
[a^{(i)}_1,\ldots,a^{(i)}_i]$, are calculated from the optimum
$(i-1)$th-order filter coefficients, $a^{(i-1)} = [
a^{(i-1)}_1,\ldots, a^{(i-1)}_{i-1}]$, the mean squared error of that
filter, $e_{i-1}$, the autocorrelation value $r_i$, and the
autocorrelation values $r_1, r_2,\ldots, r_{i-1}$.
\begin{figure}
\begin{center}
\begin{tabular}{ll}
Initialization: & $e_0 = r_0$ \\[0.10in]
\shortstack{ Iteration: \\ $(i=1,2,\ldots,p)$} &
\begin{minipage}{2.850in}
\begin{eqnarray*}
k_i & = & \left( r_i - \sum^{i-1}_{j=1} a_j^{(i-1)} r_{i-j}
\right) \left/ \: e_{i-1} \right. \\
a_j^{(i)} & = & \left\{ \begin{array}{ll}
a_j^{(i-1)} - k_i a_{i-j}^{(i-1)}, &
j = 1, \ldots, i-1 \\
k_i, & j = i
\end{array} \right. \\
e_i & = & (1 - k_i^2) e_{i-1}
\end{eqnarray*}
\end{minipage}
\end{tabular}
\end{center}
\caption[]{The Durbin Algorithm.}
\label{fg:durbin}
\end{figure}
This description can be straightforwardly realized in a
conventional language using arrays; we will implement the algorithm in
Haskell without using arrays.

> type LPA_Filter = (Float, [Float])
%%------------------------------------------------------------
\subsubsection {Updating the Predictor Polynomial}
%%------------------------------------------------------------
First, we derive a function \verb~new_as~ that takes the
optimal $(i-1)$th-order linear prediction filter coefficients and the
$i$th reflection coefficient and returns the optimal $i$th-order
linear prediction filter coefficients. From Figure~\ref{fg:durbin},
the optimal $i$th-order linear prediction filter is given by
\begin{eqnarray*}
a^{(i)} & \isDefined &
[ a^{(i)}_1,\ a^{(i)}_2,\ \ldots,\ a^{(i)}_i ] \\
& = & [ a^{(i-1)}_1 - k_i a^{(i-1)}_{i-1},\ \
a^{(i-1)}_2 - k_i a^{(i-1)}_{i-2},\ \
\ldots, \\
& & \ \ \ a^{(i-1)}_{i-1} - k_i a^{(i-1)}_1,\ \
k_i\ ]
\end{eqnarray*}
Note that the subtraction operation involves the first and
last coefficients, the second and second-to-last coefficients, etc.
If we place two copies of $a^{(i-1)}$ side by side,
\begin{displaymath}
[ a^{(i-1)}_1,\ a^{(i-1)}_2,\ \ldots,\
a^{(i-1)}_{i-2},\ a^{(i-1)}_{i-1} ]\ \ \
[ a^{(i-1)}_1,\ a^{(i-1)}_2,\ \ldots,\
a^{(i-1)}_{i-2},\ a^{(i-1)}_{i-1} ]
\end{displaymath}
it is easy to see that what is called for is some type
of \verb~foldr~ operation that consumes the list on the right as it
moves through the list on the left. Indeed, if we define the binary
operator $\oplus$ by (assuming that $k_i$ is in scope)
\begin{equation}
a \oplus (p,\ b:bs) = ((a-k_i*b):p,\ bs)
\label{eq:oplus}
\end{equation}
then
\begin{equation}
a^{(i)} = fst\ \ (foldr\ \ (\oplus)\ \ ([k_i], a^{(i-1)})\ \
a^{(i-1)})
\label{eq:new-as}
\end{equation}
Combining (\ref{eq:oplus}) and (\ref{eq:new-as}) into a
single Haskell definition, we get

> new_as as k = fst (foldr op ([k],as) as)
>     where a `op` (p,b:bs) = ((a-k*b):p,bs)
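A second-order example confirms the reverse-and-append behavior. The values are hypothetical test data chosen to be exact in floating point; \verb~new_as~ is repeated (at type \verb~Double~) so the fragment runs on its own:

```haskell
new_as :: [Double] -> Double -> [Double]
new_as as k = fst (foldr op ([k],as) as)
    where a `op` (p,b:bs) = ((a-k*b):p,bs)

-- with as = [a1,a2] = [0.5,0.25] and k = 0.5:
--   a1' = a1 - k*a2 = 0.5  - 0.5*0.25 = 0.375
--   a2' = a2 - k*a1 = 0.25 - 0.5*0.5  = 0.0
-- and k itself becomes the new last coefficient
```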
%%------------------------------------------------------------
\subsubsection {Computing the Reflection Coefficient}
%%------------------------------------------------------------
Next we consider the calculation of the $i$th reflection
coefficient. Ignoring the division by $e_{i-1}$ for the moment,
consider just the expression
\begin{equation}
r_i - \sum^{i-1}_{j=1} a^{(i-1)}_j r_{i-j}
\label{eq:k-numerator}
\end{equation}
For the summation term we can use the same approach we used
for calculating the new prediction filter coefficients: a
\verb~foldr~ operation that consumes one list while moving through
another. Defining the binary operator $\otimes$ by
\begin{equation}
a \otimes (s,\ b:bs) = (s + a*b,\ bs)
\label{eq:otimes}
\end{equation}
the summation is the first component of the pair returned by
\begin{equation}
foldr\ \ (\otimes)\ \ (0,rs)\ \ as
\label{eq:k-summation}
\end{equation}
where it is assumed that the length of $rs$ is at least as
great as that of $as$ (it is in practice). The second component of
the pair returned by (\ref{eq:k-summation}) is a list having $r_i$ as
its head, which, as we see from (\ref{eq:k-numerator}), is what we are
subtracting the summation from. Combining (\ref{eq:otimes}) and
(\ref{eq:k-summation}) with the division by $e_{i-1}$ yields the
following definition.

> new_k rs (e,as) =
>     let (summation,r:_) = foldr op (0,rs) as
>                           where a `op` (s,b:bs) = (s+a*b,bs)
>     in (r - summation)/e
%%------------------------------------------------------------
\subsubsection {One Step in the Durbin Algorithm}
%%------------------------------------------------------------
We can now define the function that produces the optimal
$i$th-order linear prediction filter from the autocorrelation sequence
and the optimal $(i-1)$th-order filter.

> durbin_step rs (e,as) = let k = new_k rs (e,as)
>                         in ((1-k*k)*e, new_as as k)
%%------------------------------------------------------------
\subsubsection {The Durbin Algorithm}
%%------------------------------------------------------------
To get the optimal $p$th-order linear predictor given the
autocorrelation sequence, we just iteratively apply the function
\verb~durbin_step~ to the appropriate initial condition and select the
$p$th element of the resulting sequence. If $(r : rrs)$ is the
autocorrelation sequence for a frame, then the expression
\begin{displaymath}
iterate\ \ (durbin\_step\ rrs)\ \ (r,[\ ])
\end{displaymath}
produces the sequence
\begin{displaymath}
[\: (e_0,[\ ]),\ (e_1,a^{(1)}),\ (e_2,a^{(2)}),\ \ldots \:]
\end{displaymath}
from which we want the element at position $p$.

> durbin :: Int -> Signal Float -> LPA_Filter
> durbin p (r:rrs) = (iterate (durbin_step rrs) (r,[]))!!p
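A worked first- and second-order example, with all supporting definitions repeated (at type \verb~Double~, and with \verb~new_k~ restructured slightly into a \verb~where~ form) so the fragment runs on its own. The autocorrelation values $1, 0.5, 0.25$ are hypothetical test data, chosen to be exact in floating point; they are those of a first-order process, so the second predictor coefficient comes out as zero:

```haskell
type LPA_Filter = (Double, [Double])

new_as :: [Double] -> Double -> [Double]
new_as as k = fst (foldr op ([k],as) as)
    where a `op` (p,b:bs) = ((a-k*b):p,bs)

new_k :: [Double] -> LPA_Filter -> Double
new_k rs (e,as) = (r - summation)/e
    where (summation,r:_) = foldr op (0,rs) as
          a `op` (s,b:bs) = (s+a*b,bs)

durbin_step :: [Double] -> LPA_Filter -> LPA_Filter
durbin_step rs (e,as) = let k = new_k rs (e,as)
                        in ((1-k*k)*e, new_as as k)

durbin :: Int -> [Double] -> LPA_Filter
durbin p (r:rrs) = (iterate (durbin_step rrs) (r,[]))!!p

-- durbin 1 [1.0,0.5,0.25]:  k_1 = 0.5/1.0, e_1 = (1-0.25)*1.0
--   = (0.75, [0.5])
-- durbin 2 [1.0,0.5,0.25]:  k_2 = (0.25 - 0.5*0.5)/0.75 = 0
--   = (0.75, [0.5,0.0])
```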
%%------------------------------------------------------------
\subsection {Conversion to Cepstral Coefficients}
\label{sc:cepstra}
%%------------------------------------------------------------
Given a discrete-time linear system with the system function
\begin{displaymath}
\frac{G}{1 - \sum^p_{i=1} a_i z^{-i}}
\end{displaymath}
the sequence of {\em cepstral coefficients\/} $c^\infty_0$ for
this system is defined by
\begin{equation}
c_n = \left\{ \begin{array}{ll}
\ln G & n=0 \\[0.10in]
a_n + \frac{1}{n} \sum^{n-1}_{k=1} a_k \cdot (n-k) \cdot c_{n-k} &
1 \leq n \leq p \\[0.10in]
\frac{1}{n} \sum^p_{k=1} a_k \cdot (n-k) \cdot c_{n-k} & n > p
\end{array} \right.
\label{eq:cepstra}
\end{equation}
Note that the summation terms of (\ref{eq:cepstra}) are just
convolution sums for the FIR filter with $a^{(p)}$ as its impulse
response and the scaled cepstral coefficient signal $[c_1, 2c_2, 3c_3,
4c_4, \ldots]$. Also, the second and third formulas are the same
except for the addition of the $a_i$'s to the scaled and filtered
cepstral coefficients. Hence, we can use the function \verb~delayAdd~
that we defined when developing the FIR filtering function \verb~fir~
(Section~\ref{sc:fir}). Also, the gain term is usually ignored in
forming the feature vector, so we just compute the sequence
$[c_1,\ c_2,\ c_3,\ \ldots]$:

> cepstra :: LPA_Filter -> Signal Float
> cepstra (_,as) = cs
>     where
>     cs = as `delayAdd` xs
>     xs = zipWith (/) (fir as (zipWith (*) [1..] cs)) [2..]
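For a first-order predictor the recursion has the closed form $c_n = a_1^n/n$, the known cepstrum of a one-pole model, which makes a convenient check. Supporting definitions are repeated (at type \verb~Double~); the gain is ignored, so the first component of the pair is arbitrary:

```haskell
combine :: Num a => [a] -> [a] -> [a]
combine (x:rxs) (y:rys) = x + y : combine rxs rys
combine xs      []      = xs
combine []      ys      = ys

delayAdd :: Num a => [a] -> [a] -> [a]
(x:rxs) `delayAdd` y = x : combine rxs y
[]      `delayAdd` y = 0 : y

fir :: Num a => [a] -> [a] -> [a]
fir hs x = foldr1 delayAdd [map (h *) x | h <- hs]

-- the recursive knot: cs appears in xs, so laziness is essential
cepstra :: (Double, [Double]) -> [Double]
cepstra (_,as) = cs
    where cs = as `delayAdd` xs
          xs = zipWith (/) (fir as (zipWith (*) [1..] cs)) [2..]

-- for as = [0.5]: c_1 = 0.5, c_2 = 0.5^2/2 = 0.125, c_3 = 0.5^3/3, ...
```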
Because there is no terminating condition and because of the
recursion---the sequence \verb~cs~ appears in the definition of the
sequence \verb~xs~---this definition relies on lazy evaluation in an
essential way.
%--------------------------------------------------
\subsection {Putting It All Together}
%--------------------------------------------------
The function \verb~analyze~ transforms a windowed frame of
samples into a pair of values: a log gain term and a list of cepstral
coefficients. The first argument is the order of the linear prediction
analysis. The second argument is the number of cepstral coefficients.

> analyze :: Int -> Int -> Frame Float -> (Float, [Float])
> analyze p q wxs = let
>                     rs = autocorr wxs
>                     log_energy = log10 (head rs)
>                     cep = take q (cepstra (durbin p rs))
>                   in (log_energy, cep)
> log10 :: Float -> Float
> log10 x = log x / log 10
%--------------------------------------------------
\subsection {The Main Program}
%--------------------------------------------------

Figure~\ref{fg:complete} shows the main Haskell program,
including the speech/feature I/O functions.
\begin{figure}
% (main program listing omitted)
\caption[]{The linear predictive speech analysis main program}
\label{fg:complete}
\end{figure}