>#ifndef __GLASGOW_HASKELL__
> import Int( Num(fromInt) )
>#endif
%----------------------------------------------------------------------
\section {A Brief Review of Discrete-Time Signal Processing}
%----------------------------------------------------------------------

%%------------------------------------------------------------
\subsection {Discrete-Time Signals}
%%------------------------------------------------------------
This section only provides some background terminology;
see~\cite{OppeScha75} or \cite{ProaMano92} for motivation and more
details.

A {\em discrete-time signal\/} is simply a sequence of
numbers. That is, a discrete-time signal is a function whose domain
is the set of integers and whose range is some set of numbers, e.g.,
the real or complex numbers. The elements of the sequence are called
{\em samples}. The $n$th sample of the signal $x$ is denoted by
$x_n$. For the rest of this paper, we will refer to discrete-time
signals as ``signals'' or ``sequences'' interchangeably.

It is mathematically convenient to consider signals to be
defined for all integers, with samples taking the value zero if not
otherwise specified. A signal $x$ is {\em finite-duration\/} if there
are integers $n_1$ and $n_2$ such that $x_n = 0$ whenever $n<n_1$ or
$n>n_2$. A {\em right-sided sequence\/} $x$ is one for which there is
an integer $n_1$ such that $x_n = 0$ for all $n < n_1$; if $n_1 \geq
0$ then the signal is also said to be {\em causal}. {\em
Left-sided\/} and {\em two-sided\/} sequences are defined similarly.
All signals in this paper are assumed to be causal with $n_1 = 0$.

Since signals are sequences of numbers, they are naturally
represented in Haskell using lists, particularly since we are assuming
causal signals indexed from zero and Haskell lists are indexed from
zero.

We will often use the notation $x^n_k$ to denote the list of
samples $[x_k,x_{k+1},\ldots,x_n]$ of the signal $x$. If $x$ is
potentially infinite in duration or if knowing its duration is not
important, we will replace $n$ by $\infty$. For a causal signal, this
convention means that $x^\infty_0$ represents the entire signal.

For convenience, we introduce the following type synonym:

> type Signal a = [a]

The element type is unspecified because signals can be
sequences of integers, floating point numbers, or complex numbers.
Thus, \verb~Signal Int~ represents a signal whose samples are
finite-precision integers, \verb~Signal Float~ represents a signal
whose samples are floating point numbers, etc.

%%------------------------------------------------------------
\subsection {The $Z$ Transform}
%%------------------------------------------------------------

Given a signal $x$, the {\em Z transform\/} of $x$ is the
complex-valued function $X(z)$ of the complex variable $z$ defined by
\begin{equation}
X(z) \isDefined \sum_{k=-\infty}^\infty x_k z^{-k}
\label{eq:Z-transform}
\end{equation}
for all $z$ for which the power series converges. If the
signal $x$ has $Z$ transform $X(z)$, we will write
$x \transformPair X(z)$.

The $Z$ transform is useful because it allows us to study
discrete-time systems (Section~\ref{sc:discrete-time-systems}) using
algebraic manipulations. This is because the $Z$ transform has a
number of useful properties, only three of which are needed here.
\begin{itemize}
\item {\em scaling}

If we multiply each sample of the signal $x$ by a scalar $a$
and call the resulting sequence $y$, then
\begin{displaymath}
y \transformPair Y(z) = aX(z), \qquad y = map\ (a*)\ x.
\end{displaymath}

\item {\em superposition}

Given two signals $x$ and $y$, if we add them
sample-by-sample to get a new signal $w$, then
\begin{displaymath}
w \transformPair W(z) = X(z) + Y(z).
\end{displaymath}

\item {\em shift}

Given a signal $x$ and an integer $k$, define the signal $y$
by $y_n = x_{n-k}$, i.e., $y$ is $x$ delayed by $k$ samples; then
\begin{displaymath}
y \transformPair Y(z) = z^{-k}X(z).
\end{displaymath}
\end{itemize}

%%------------------------------------------------------------
\subsection {Discrete-Time Systems}
\label{sc:discrete-time-systems}
%%------------------------------------------------------------

A {\em discrete-time system\/} is an operator that transforms
an {\em input\/} signal into an {\em output\/} signal. These systems
are commonly called {\em filters\/} when their function is to suppress
or enhance some feature of the input signal. In Section~\ref{sc:fir}
we will define a function for performing a particular type of
filtering. A filter is naturally represented as a function from
signals to signals:

> type Filter a = Signal a -> Signal a

%%------------------------------------------------------------
\subsection {Finite Impulse Response Filtering}
\label{sc:fir}
%%------------------------------------------------------------

A (causal) {\em finite impulse response\/} (FIR) filter is
characterized by a finite-duration signal $h$ called the {\em impulse
response\/} of the filter. For an arbitrary input signal $x$, the
samples of the output signal $y$ are defined by the {\em convolution
sum\/}
\begin{equation}
y_n = \sum^{M-1}_{k=0} h_k x_{n-k}
\label{eq:convolution}
\end{equation}
where $M$ is the length of $h$. We will use FIR filtering
twice, once in the definition of the function \verb~preemph~ in
Section~\ref{sc:preemph} and again in the definition of the function
\verb~cepstra~ in Section~\ref{sc:cepstra}. Hence we need to define a
Haskell function that implements FIR filtering. In doing so, we will
demonstrate how it is often possible to derive an executable
definition from a mathematical specification.

$Z$ transforms can be used to express the filtering operation.
Suppose signals $x$, $h$, and $y$ are related by
(\ref{eq:convolution}). Denoting the $Z$ transforms of these signals
by $X(z), Y(z)$, and $H(z)$, respectively, it is easy to show that
$Y(z) = H(z)X(z)$.

We now derive a Haskell function for performing FIR filtering:
\begin{eqnarray*}
Y(z) & = & H(z)X(z) \\
& = & \sum^{M-1}_{k=0} h_k z^{-k} X(z)\\
& = & \sum^{M-1}_{k=0} z^{-k} (h_kX(z)) \\
& = & \sum^{M-1}_{k=0} z^{-k} W^{(k)}(z)
\end{eqnarray*}
where we have set $W^{(k)}(z) = h_kX(z)$. Using Horner's rule,
\begin{eqnarray*}
Y(z) & = & W^{(0)}(z)\ +\ z^{-1}(W^{(1)}(z)
\ +\ \ldots \\
& & +\ z^{-1}(W^{(M-2)}(z)
\ +\ z^{-1} W^{(M-1)}(z)) \ldots\ ))
\end{eqnarray*}

Defining the transform relationships $w^{(k)} \transformPair
W^{(k)}(z)$, we can write this in the signal domain as
\begin{equation}
y = w^{(0)} \oplus (w^{(1)} \oplus (w^{(2)} \oplus \ldots
\oplus (w^{(M-2)} \oplus w^{(M-1)}) \ldots\ ))
\label{eq:ws}
\end{equation}
where the operator $\oplus$ is such that, for two arbitrary
signals $u$ and $v$ having $Z$ transforms $U(z)$ and $V(z)$, we have
\begin{displaymath}
u \oplus v \ \transformPair\ U(z) + z^{-1}V(z).
\end{displaymath}

From the superposition and shift properties of the $Z$
transform, it is clear that the operator $\oplus$ is implemented by
delaying the second operand by one sample and then performing
sample-by-sample addition. Let's call this operation \verb~delayAdd~:

> delayAdd :: (Num a) => Signal a -> Signal a -> Signal a
> (x:rxs) `delayAdd` y = x : combine rxs y
> [] `delayAdd` y = 0 : y

(The second equation treats the empty list as the all-zero
signal.) This definition uses a function \verb~combine~ which is
similar to the Standard Prelude function \verb~zipWith~, except that
\verb~combine~ returns a list as long as the {\em longer\/} of the two
lists, with the ``missing'' values from the shorter list assumed to be
zero:

> combine :: (Num a) => Signal a -> Signal a -> Signal a
> combine (x:rxs) (y:rys) = x+y : combine rxs rys
> combine xs [] = xs
> combine [] ys = ys
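
As a quick sanity check, here is a standalone sketch of these two
functions over plain lists (with the base cases of \verb~combine~ for
unequal lengths spelled out, as described above):

```haskell
-- Restatement of delayAdd and combine from the text, specialized to
-- plain lists; combine's base cases treat missing samples as zero.
delayAdd :: Num a => [a] -> [a] -> [a]
(x:rxs) `delayAdd` y = x : combine rxs y
[]      `delayAdd` y = 0 : y

combine :: Num a => [a] -> [a] -> [a]
combine (x:rxs) (y:rys) = x + y : combine rxs rys
combine xs      []      = xs
combine []      ys      = ys

-- delayAdd delays the second operand by one sample and adds:
-- delayAdd [1,2,3] [10,20,30] = [1, 2+10, 3+20, 30] = [1,12,23,30]
```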

Now we observe that (\ref{eq:ws}) can be rewritten using
$foldr1$:
\begin{eqnarray*}
y & = & foldr1\ \ (\oplus)\ \ [\: w^{(0)},\
w^{(1)},\ \ldots,\ w^{(M-1)} \:] \\
& = & foldr1\ \ (\oplus)\ \ [\: map\ \ (h_k *)\ \ x \:|\:
k \leftarrow [0..M-1] \:]
\end{eqnarray*}

Thus, we can implement an FIR filter having impulse response
\verb~hs~ on an input signal \verb~x~ as follows:\footnote{My thanks
to Mark Jones of Yale University who first suggested this form of the
definition.}

> fir :: Signal Float -> Filter Float
> fir hs x = foldr1 delayAdd [map (h*) x | h <- hs]
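
For instance, filtering $[1,2,3]$ with the impulse response
$[1,-0.5]$ should give the convolution $[1,\ 1.5,\ 2,\ -1.5]$. A
standalone sketch (helpers repeated so the fragment runs on its own,
and using \verb~Double~ rather than \verb~Float~):

```haskell
-- fir as in the text, over plain Num lists.
fir :: Num a => [a] -> [a] -> [a]
fir hs x = foldr1 delayAdd [map (h*) x | h <- hs]

delayAdd :: Num a => [a] -> [a] -> [a]
(x:rxs) `delayAdd` y = x : combine rxs y
[]      `delayAdd` y = 0 : y

combine :: Num a => [a] -> [a] -> [a]
combine (x:rxs) (y:rys) = x + y : combine rxs rys
combine xs      []      = xs
combine []      ys      = ys

-- fir [1,-0.5] [1,2,3] = [1.0, 1.5, 2.0, -1.5]
```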

%----------------------------------------------------------------------
\section {Linear Predictive Analysis}
%----------------------------------------------------------------------

The processor described here is rather conventional; we just
implement most of the blocks described in~\cite[pp.\
112--117]{RabiJuan93}.

%--------------------------------------------------
\subsection {Preemphasis}
\label{sc:preemph}
%--------------------------------------------------

The speech signal is first passed through a simple high-pass
filter to reduce the degree of spectral tilt. The system function
typically used is
\begin{displaymath}
H(z) = 1 - az^{-1}
\end{displaymath}
where $a$ is a real number less than but close to 1; a typical
value is 0.95. Hence, the preemphasis filter is an FIR filter with
the impulse response $[1,\ -a\ ]$.

> preemph :: Float -> Filter Float
> preemph a = fir [1,-a]
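
A quick numerical check, as a standalone sketch over \verb~Double~
(helpers repeated from above): for a constant input, every interior
output sample settles to $1-a$, the DC gain of $1 - az^{-1}$, which
shows the high-pass effect.

```haskell
preemph :: Double -> [Double] -> [Double]
preemph a = fir [1, -a]

fir :: Num a => [a] -> [a] -> [a]
fir hs x = foldr1 delayAdd [map (h*) x | h <- hs]

delayAdd :: Num a => [a] -> [a] -> [a]
(x:rxs) `delayAdd` y = x : combine rxs y
[]      `delayAdd` y = 0 : y

combine :: Num a => [a] -> [a] -> [a]
combine (x:rxs) (y:rys) = x + y : combine rxs rys
combine xs      []      = xs
combine []      ys      = ys

-- preemph 0.95 [1,1,1] ~= [1.0, 0.05, 0.05, -0.95]
```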

%--------------------------------------------------
\subsection {Framing}
%--------------------------------------------------

Because of the time-varying nature of speech, the signal is
broken into a list of equal-length {\em frames}, each offset from its
predecessor by a fixed number of samples. Typically, the frame length
is about $20$ ms and the offset is about $10$ ms. We need a function
\verb~frames~ that takes three arguments: a frame length, an offset,
and the signal. We can derive the definition of \verb~frames~ as
follows.
\begin{eqnarray*}
frames\ n\ m\ x^\infty_0
& = & [\: x^{n-1}_0,\ x^{m+n-1}_m,\ x^{2m+n-1}_{2m},\ \ldots\:] \\
& = & map\ (take\ n)\
      [\: x^\infty_0,\ x^\infty_m,\ x^\infty_{2m},\ \ldots\:] \\
& = & map\ (take\ n)\ (iterate\ (drop\ m)\ x_0^\infty).
\end{eqnarray*}
Thus, for an infinite signal,
\begin{displaymath}
frames\ n\ m\ = map\ (take\ n)\ \circ\ iterate\ (drop\ m).
\end{displaymath}

But the signal may be finite, so we define and use a function
\verb~check_for_end~ to avoid processing an infinite stream of empty
frames:

> type Frame a = [a]

> frames :: Int -> Int -> Signal a -> [Frame a]
> frames n m = check_for_end . map (take n) . iterate (drop m)

Here is the definition of \verb~check_for_end~:

> check_for_end = takeWhile (not . null)
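
For example, with a frame length of 3 and an offset of 2 (toy
numbers), a seven-sample signal yields four frames, the last one
short; a standalone sketch:

```haskell
-- frames and check_for_end as in the text, over plain lists.
frames :: Int -> Int -> [a] -> [[a]]
frames n m = check_for_end . map (take n) . iterate (drop m)

check_for_end :: [[a]] -> [[a]]
check_for_end = takeWhile (not . null)

-- frames 3 2 [1..7] = [[1,2,3],[3,4,5],[5,6,7],[7]]
```

The trailing short frame is why the windowing step must guard against
segments shorter than the window.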

%--------------------------------------------------
\subsection {Windowing}
%--------------------------------------------------

A {\em window\/} is used to smoothly attenuate samples at the
edges of a frame prior to analysis. A window is applied by
sample-by-sample multiplication. It is tempting to define the
windowing operation using the Standard Prelude function
\verb~zipWith~; however, we don't want to analyze a frame of data that
is too short, as would happen if the length of the entire signal
exceeded that of an integral number of frames by only a few samples.
Hence we define the function \verb~window~ using a utility function
\verb~window'~ that makes sure that the segment being windowed is at
least as long as the window itself; otherwise an empty frame is
returned. Note that \verb~window'~ is not a stream function; the
output list is accumulated in the third argument so that it can be
forgotten if the signal turns out to be shorter than the window.

> type Window = [Float]

> window :: Window -> Signal Float -> Frame Float
> window w x = window' w x []

> window' (w:rws) (x:rxs) buf = window' rws rxs (w*x : buf)
> window' (_:_) [] _ = []
> window' [] _ buf = buf
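
Tracing \verb~window'~ on toy data (a standalone sketch over
\verb~Double~) shows both behaviors: a segment shorter than the window
yields an empty frame, and the samples of a full frame emerge in
reverse order. The reversal is harmless here because the
autocorrelation computed from the frame in the next section is
unchanged by reversing the frame.

```haskell
window :: [Double] -> [Double] -> [Double]
window w x = window' w x []

window' :: [Double] -> [Double] -> [Double] -> [Double]
window' (w:rws) (x:rxs) buf = window' rws rxs (w*x : buf)
window' (_:_)   []      _   = []
window' []      _       buf = buf

-- window [1,1,1] [2,3,4,5] = [4.0,3.0,2.0]  (one frame, reversed)
-- window [1,1,1] [2,3]     = []             (segment too short)
```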

A popular window is the {\em Hamming window}. If the window
is to have $N$ samples, the formula for the $n$th sample is:
\begin{displaymath}
w(n) = 0.54 - 0.46 \cos \left( \frac{2\pi}{N-1} n \right)
\end{displaymath}

Here is the function for building a Hamming window of \verb~npts~
points:

> hamming_window :: Int -> Window
> hamming_window npts =
>     let angle = (2*pi)/fromInt (npts-1)
>     in [0.54 - 0.46*cos (angle*fromInt n) | n <- [0..npts-1]]
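
As a check, a five-point Hamming window should be symmetric, with
$0.54-0.46=0.08$ at the edges and $0.54+0.46=1.0$ at the center. The
sketch below uses the standard \verb~fromIntegral~ where the text uses
the hbc-era \verb~fromInt~:

```haskell
hamming_window :: Int -> [Double]
hamming_window npts =
    let angle = (2*pi) / fromIntegral (npts-1)
    in [0.54 - 0.46 * cos (angle * fromIntegral n) | n <- [0..npts-1]]

-- hamming_window 5 ~= [0.08, 0.54, 1.0, 0.54, 0.08]
```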

One advantage of lazy evaluation is that by defining the
windowing function the way we have, we can do the windowing and the
framing simultaneously. Thus, it turns out that we don't need the
function \verb~frames~. Let the first argument be the window width
and the second argument be the offset between frames.

> windows :: Int -> Int -> Signal Float -> [Frame Float]
> windows n m =
>     let hw = hamming_window n
>         apply_window = window hw
>     in
>     check_for_end . map apply_window . iterate (drop m)

%--------------------------------------------------
\subsection {The Autocorrelation Sequence}
%--------------------------------------------------

The function \verb~autocorr~ computes the autocorrelation
sequence for the signal $x$ as a prelude to computing the coefficients
of the optimal $p$th-order linear prediction filter
(Section~\ref{sc:Durbin}). The $i$th sample of the autocorrelation
sequence for an $N$-point signal is defined by the equation
\begin{displaymath}
r_i = \left\{ \begin{array}{ll}
\sum^{N-1-i}_{n=0} x_n x_{n+i} & 0 \leq i < N \\
0 & i \geq N
\end{array} \right.
\end{displaymath}

The summation operation is just the familiar dot product of
linear algebra. The dot product can be coded in Haskell as follows:

> x `dot` y = sum (zipWith (*) x y)

However, the operator \verb~`dot`~ is actually more general
than the familiar dot product because the two argument lists can have
different lengths; the longer list is essentially truncated to the
length of the shorter list. For derivation purposes, let `$\odot$'
denote this more general operation, and let `$\cdot$' denote the
standard dot product. In general,
\begin{displaymath}
x^n_0 \cdot y^n_0 = x_0^{n+k} \odot y^n_0 =
x^n_0 \odot y^{n+k}_0
\end{displaymath}
for all $k \geq 0$, whereas $x^{n+k}_0 \cdot y^n_0$ would be
undefined. We can now formally derive the Haskell function for
computing the non-zero portion of the autocorrelation sequence as
follows.
\begin{eqnarray*}
[r_0,\ r_1,\ \ldots,\ r_{N-1} ]
& = & [ \sum^{N-1-i}_{n=0} x_n x_{n+i} \ |\ i \leftarrow [0..N-1]\:]\\
& = & [ x^{N-1-i}_0 \cdot x^{N-1}_i \ |\ i \leftarrow [0..N-1]\:]\\
& = & [ x^{N-1}_0 \odot x^{N-1}_i \ |\ i \leftarrow [0..N-1]\:]\\
& = & map\ \ (x^{N-1}_0\ \odot)\ \ [ x^{N-1}_i \ |\ i \leftarrow
      [0..N-1]\:] \\
& = & map\ \ (x^{N-1}_0\ \odot)\ \ (tails\ x^{N-1}_0)
\end{eqnarray*}
where the function $tails$ is such that
\begin{displaymath}
tails\ x^N_0 = [\ x^N_0,\ x^N_1,\ x^N_2,\ \ldots,\ x^N_N\ ].
\end{displaymath}

To get the complete autocorrelation sequence, we just append
an infinite list of zeros:

> autocorr x = map (x `dot`) (tails x) ++ repeat 0.0

The function \verb~tails~ is defined as follows:\footnote{A
similar function is provided in a module of the Chalmers hbc
library.}

> tails xs@(_:rxs) | null rxs = [xs]
>                  | otherwise = xs : tails rxs
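
For a toy three-point signal $x = [1,2,3]$ we expect
$r_0 = 1+4+9 = 14$, $r_1 = 1\cdot2 + 2\cdot3 = 8$, $r_2 = 1\cdot3 = 3$,
and zeros thereafter. A standalone sketch over \verb~Double~:

```haskell
dot :: Num a => [a] -> [a] -> a
x `dot` y = sum (zipWith (*) x y)

-- tails as in the text (defined for non-empty lists).
tails :: [a] -> [[a]]
tails xs@(_:rxs) | null rxs  = [xs]
                 | otherwise = xs : tails rxs

autocorr :: [Double] -> [Double]
autocorr x = map (x `dot`) (tails x) ++ repeat 0.0

-- take 5 (autocorr [1,2,3]) = [14.0, 8.0, 3.0, 0.0, 0.0]
```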

In a conventional imperative language we would need to pass
{\em two\/} arguments to this function, the second specifying how many
autocorrelation values are to be computed. The function would then
return an array containing precisely that many values. However, in a
lazy language the autocorrelation values will only be computed as
needed, so we do not need to specify such a value for this function.

The value $r_0$ is the {\em energy} of the signal. The
logarithm of the energy is often included as a parameter in feature
vectors used in recognizers.

%--------------------------------------------------
\subsection {The Durbin Algorithm}
\label{sc:Durbin}
%--------------------------------------------------

The next step is to solve for the optimum linear predictor
filter coefficients for a given order $p$. This can be done
efficiently using the {\em Durbin Algorithm}. The steps are listed in
Figure~\ref{fg:durbin} as they appear in~\cite[p.\ 411]{RabiScha78}
but with a few changes in notation. Basically, we successively
compute the optimum prediction filters for each order $i$ starting
with $i=1$. The optimum $i$th-order filter coefficients, $a^{(i)} =
[a^{(i)}_1,\ldots,a^{(i)}_i]$, are calculated from the optimum
$(i-1)$th-order filter coefficients, $a^{(i-1)} = [
a^{(i-1)}_1,\ldots, a^{(i-1)}_{i-1}]$, the mean squared error of that
filter, $e_{i-1}$, the autocorrelation value $r_i$, and the
autocorrelation values $r_1, r_2,\ldots, r_{i-1}$.

\begin{figure}
\begin{center}
\begin{tabular}{ll}
Initialization: & $e_0 = r_0$ \\[0.10in]
\shortstack{ Iteration: \\ $(i=1,2,\ldots,p)$} &
\begin{minipage}{2.850in}
\begin{eqnarray*}
k_i & = & \left( r_i - \sum^{i-1}_{j=1} a_j^{(i-1)} r_{i-j}
          \right) \left/ \: e_{i-1} \right. \\
a_j^{(i)} & = & \left\{ \begin{array}{ll}
                a_j^{(i-1)} - k_i a_{i-j}^{(i-1)}, &
                    j = 1, \ldots, i-1 \\
                k_i, & j = i
                \end{array} \right. \\
e_i & = & (1 - k_i^2) e_{i-1}
\end{eqnarray*}
\end{minipage}
\end{tabular}
\end{center}
\caption[]{The Durbin Algorithm.}
\label{fg:durbin}
\end{figure}

This description can be straightforwardly realized in a
conventional language using arrays; we will implement the algorithm in
Haskell without using arrays. An optimal predictor is represented by
the pair of its mean squared error and its coefficient list:

> type LPA_Filter = (Float, [Float])

%%------------------------------------------------------------
\subsubsection {Updating the Predictor Polynomial}
%%------------------------------------------------------------

First, we derive a function \verb~new_as~ that takes the
optimal $(i-1)$th-order linear prediction filter coefficients and the
$i$th reflection coefficient and returns the optimal $i$th-order
linear prediction filter coefficients. From Figure~\ref{fg:durbin},
the optimal $i$th-order linear prediction filter is given by
\begin{eqnarray*}
a^{(i)} & \isDefined &
[ a^{(i)}_1,\ a^{(i)}_2,\ \ldots,\ a^{(i)}_i ] \\
& = & [ a^{(i-1)}_1 - k_i a^{(i-1)}_{i-1},\ \
a^{(i-1)}_2 - k_i a^{(i-1)}_{i-2},\ \
\ldots, \\
& & \ \ a^{(i-1)}_{i-1} - k_i a^{(i-1)}_1,\ \
k_i\ ].
\end{eqnarray*}

Note that the subtraction operation involves the first and
last coefficients, the second and second-to-last coefficients, etc.
If we place two copies of $a^{(i-1)}$ side by side,
\begin{displaymath}
[ a^{(i-1)}_1,\ a^{(i-1)}_2,\ \ldots,\
a^{(i-1)}_{i-2},\ a^{(i-1)}_{i-1} ]\ \ \
[ a^{(i-1)}_1,\ a^{(i-1)}_2,\ \ldots,\
a^{(i-1)}_{i-2},\ a^{(i-1)}_{i-1} ]
\end{displaymath}
it is easy to see that what is called for is some type
of \verb~foldr~ operation that consumes the list on the right as it
moves through the list on the left. Indeed, if we define the binary
operator $\oplus$ by (assuming that $k_i$ is in scope)
\begin{equation}
a \oplus (p,\ b:bs) = ((a-k_i*b):p,\ bs)
\label{eq:oplus}
\end{equation}
then
\begin{equation}
a^{(i)} = fst\ \ (foldr\ \ (\oplus)\ \ ([k_i], a^{(i-1)})\ \
a^{(i-1)}).
\label{eq:new-as}
\end{equation}

Combining (\ref{eq:oplus}) and (\ref{eq:new-as}) into a
single Haskell definition yields:

> new_as as k = fst (foldr op ([k],as) as)
>     where a `op` (p,b:bs) = ((a-k*b):p,bs)
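
A small check against the formula: for $i=2$ with $a^{(1)} = [0.5]$
and $k_2 = 0.25$, we expect
$a^{(2)} = [a^{(1)}_1 - k_2 a^{(1)}_1,\ k_2] = [0.375,\ 0.25]$. A
standalone sketch over \verb~Double~:

```haskell
new_as :: [Double] -> Double -> [Double]
new_as as k = fst (foldr op ([k], as) as)
  where a `op` (p, b:bs) = ((a - k*b):p, bs)

-- new_as [0.5] 0.25 = [0.375, 0.25]
-- new_as [] k = [k]  (the first-order filter is just [k_1])
```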

%%------------------------------------------------------------
\subsubsection {Computing the Reflection Coefficient}
%%------------------------------------------------------------

Next we consider the calculation of the $i$th reflection
coefficient. Ignoring the division by $e_{i-1}$ for the moment,
consider just the expression
\begin{equation}
r_i - \sum^{i-1}_{j=1} a^{(i-1)}_j r_{i-j}.
\label{eq:k-numerator}
\end{equation}

For the summation term we can use the same approach we used
for calculating the new prediction filter coefficients: a
\verb~foldr~ operation that consumes one list while moving through
another. Defining the binary operator $\otimes$ by
\begin{equation}
a \otimes (s,\ b:bs) = (s + a*b,\ bs)
\label{eq:otimes}
\end{equation}
the summation is the first component of the pair returned by
\begin{equation}
foldr\ \ (\otimes)\ \ (0,rs)\ \ as
\label{eq:k-summation}
\end{equation}
where it is assumed that the length of $rs$ is at least as
great as that of $as$ (it is in practice). The second component of
the pair returned by (\ref{eq:k-summation}) is a list having $r_i$ as
its head, which, as we see from (\ref{eq:k-numerator}), is what we are
subtracting the summation from. Combining (\ref{eq:otimes}) and
(\ref{eq:k-summation}) with the division by $e_{i-1}$ yields the
following definition.

> new_k rs (e,as) = (r - summation)/e
>     where (summation,r:_) = foldr op (0,rs) as
>           a `op` (s,b:bs) = (s+a*b,bs)
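
Checking against the formula for $i=2$: with $a^{(1)} = [0.5]$,
$rs = [r_1, r_2] = [0.8, 0.6]$, and $e_1 = 0.5$ (toy numbers), we
expect $k_2 = (r_2 - a_1 r_1)/e_1 = (0.6 - 0.4)/0.5 = 0.4$. A
standalone sketch over \verb~Double~:

```haskell
new_k :: [Double] -> (Double, [Double]) -> Double
new_k rs (e, as) = (r - summation) / e
  where (summation, r:_) = foldr op (0, rs) as
        a `op` (s, b:bs) = (s + a*b, bs)
```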

%%------------------------------------------------------------
\subsubsection {One Step in the Durbin Algorithm}
%%------------------------------------------------------------

We can now define the function that produces the optimal
$i$th-order linear prediction filter from the autocorrelation sequence
and the optimal $(i-1)$th-order filter.

> durbin_step rs (e,as) = let k = new_k rs (e,as)
>                         in ((1-k*k)*e, new_as as k)

%%------------------------------------------------------------
\subsubsection {The Durbin Algorithm}
%%------------------------------------------------------------

To get the optimal $p$th-order linear predictor given the
autocorrelation sequence, we just iteratively apply the function
\verb~durbin_step~ to the appropriate initial condition and select the
$p$th element of the resulting sequence. If $(r : rrs)$ is the
autocorrelation sequence for a frame, then the expression
\begin{displaymath}
iterate\ \ (durbin\_step\ rrs)\ \ (r,[\ ])
\end{displaymath}
produces the list
\begin{displaymath}
[\: (e_0,[\ ]),\ (e_1,a^{(1)}),\ (e_2,a^{(2)}),\ \ldots \:]
\end{displaymath}
from which we want the element at position $p$.

> durbin :: Int -> Signal Float -> LPA_Filter
> durbin p (r:rrs) = (iterate (durbin_step rrs) (r,[]))!!p
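
The whole algorithm can be exercised on a toy autocorrelation
sequence. For $r = [1,\ 0.5,\ 0.25]$ (that of a first-order process),
the first-order predictor is $a_1 = r_1/r_0 = 0.5$ with error
$e_1 = (1-0.25)\cdot 1 = 0.75$, and the second-order solution adds a
zero coefficient. A self-contained sketch gathering the pieces over
\verb~Double~:

```haskell
durbin :: Int -> [Double] -> (Double, [Double])
durbin p (r:rrs) = iterate (durbin_step rrs) (r, []) !! p

durbin_step :: [Double] -> (Double, [Double]) -> (Double, [Double])
durbin_step rs (e, as) = let k = new_k rs (e, as)
                         in ((1 - k*k)*e, new_as as k)

new_k :: [Double] -> (Double, [Double]) -> Double
new_k rs (e, as) = (r - summation) / e
  where (summation, r:_) = foldr op (0, rs) as
        a `op` (s, b:bs) = (s + a*b, bs)

new_as :: [Double] -> Double -> [Double]
new_as as k = fst (foldr op ([k], as) as)
  where a `op` (p, b:bs) = ((a - k*b):p, bs)

-- durbin 1 [1, 0.5, 0.25] = (0.75, [0.5])
-- durbin 2 [1, 0.5, 0.25] = (0.75, [0.5, 0.0])
```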

%%------------------------------------------------------------
\subsection {Conversion to Cepstral Coefficients}
%%------------------------------------------------------------

Given a discrete-time linear system with the system function
\begin{displaymath}
\frac{G}{1 - \sum^p_{i=1} a_i z^{-i}}
\end{displaymath}
the sequence of {\em cepstral coefficients\/} $c^\infty_0$ for
this system is defined by
\begin{equation}
c_n = \left\{ \begin{array}{ll}
\ln G & n=0 \\[0.10in]
a_n + \frac{1}{n} \sum^{n-1}_{k=1} a_k \cdot (n-k) \cdot c_{n-k} &
1 \leq n \leq p \\[0.10in]
\frac{1}{n} \sum^p_{k=1} a_k \cdot (n-k) \cdot c_{n-k} & n > p
\end{array} \right.
\label{eq:cepstra}
\end{equation}

Note that the summation terms of (\ref{eq:cepstra}) are just
convolution sums for the FIR filter with $a^{(p)}$ as its impulse
response and the scaled cepstral coefficient signal $[c_1, 2c_2, 3c_3,
4c_4, \ldots]$. Also, the second and third formulas are the same
except for the addition of the $a_i$'s to the scaled and filtered
cepstral coefficients. Hence, we can use the function \verb~delayAdd~
that we defined when developing the FIR filtering function \verb~fir~
(Section~\ref{sc:fir}). Also, the gain term is usually ignored in
forming the feature vector, so we just compute the sequence
$[c_1,\ c_2,\ c_3,\ \ldots]$:

> cepstra :: LPA_Filter -> Signal Float
> cepstra (_,as) = cs
>     where cs = as `delayAdd` xs
>           xs = zipWith (/) (fir as (zipWith (*) [1..] cs)) [2..]
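
For a single-pole model ($p=1$, $a_1 = 0.5$) the recursion gives the
closed form $c_n = a_1^n/n$, so the first three cepstral coefficients
should be $0.5$, $0.125$, and $0.125/3 \approx 0.0417$. A
self-contained sketch over \verb~Double~ (gain component ignored, as
in the text; helper functions repeated):

```haskell
cepstra :: (Double, [Double]) -> [Double]
cepstra (_, as) = cs
  where cs = as `delayAdd` xs
        xs = zipWith (/) (fir as (zipWith (*) [1..] cs)) [2..]

fir :: Num a => [a] -> [a] -> [a]
fir hs x = foldr1 delayAdd [map (h*) x | h <- hs]

delayAdd :: Num a => [a] -> [a] -> [a]
(x:rxs) `delayAdd` y = x : combine rxs y
[]      `delayAdd` y = 0 : y

combine :: Num a => [a] -> [a] -> [a]
combine (x:rxs) (y:rys) = x + y : combine rxs rys
combine xs      []      = xs
combine []      ys      = ys

-- take 3 (cepstra (0, [0.5])) ~= [0.5, 0.125, 0.0416667]
```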

Because there is no terminating condition and because of the
recursion---the sequence \verb~cs~ appears in the definition of the
sequence \verb~xs~---this definition relies on lazy evaluation in a
fundamental way.

\subsection {Putting it all together}

The function \verb~analyze~ transforms a windowed frame of
samples to a pair of values, a log gain term and a list of cepstral
coefficients. The first argument is the order of the linear prediction
analysis. The second argument is the number of cepstral coefficients.

> analyze :: Int -> Int -> Frame Float -> (Float, [Float])
> analyze p q wxs = let
>       rs = autocorr wxs
>       log_energy = log10 (head rs)
>       cep = take q (cepstra (durbin p rs))
>    in (log_energy, cep)

> log10 :: Float -> Float
> log10 x = log x / log 10

\subsection {The Main Program}

Figure~\ref{fg:complete} shows the main Haskell program,
including the speech/feature I/O functions.

\begin{figure}
\caption[]{The linear predictive speech analysis main program}
\label{fg:complete}
\end{figure}