A few notes on the SBRECOG speech recognition demo
With SBRECOG, I am presenting a speaker-dependent
speech recognizer that works on DOS machines with
Sound Blaster compatible sound cards. The recognition
can be quite good if the conditions are optimal:
-- sufficiently distinct test sets that consist of
words of two or more syllables
-- good recording conditions
Sets that work well for me are 4-6 element sets
consisting of Italian numbers or the aviation alphabet.
The program is based on the paper "Untersuchungen zur Verteilung von
Nulldurchgangsabstaenden in Sprachsignalen" (a study on the distribution
of zero crossing distances in speech signals) by Michael Kirstein, published
in IKP-Forschungsberichte II/62, Hamburg 1977.
Under the next two headings I summarize the paper, though of course only
as far as I have understood it and consider it relevant to the program.
I. Related works
A number of works presented since the late 1940s give reason to assume that
the zero crossings of a speech signal contain sufficient information to
allow the discrimination of phonemes or at least words:
-- Licklider & Pollack (1948) show that clipped speech remains
understandable. In SBRECOG the amplitudes of the individual samples are
clipped to a magnitude of 1, i.e. the signal is reduced to 1 bit per
sample (its sign)
-- Chang, Pihl & Essigmann (1951) examine how the densities of zero
crossings and extrema (rho0 and rho0') are related to the first and
second formant in voiced sounds
-- Peterson (1951) shows that the frequencies of these two formants in the
spectrum of vowels are proportional to rho0 and rho0'
-- Chang, Pihl & Wiren (1952) introduce the "intervalgram", a graphical
representation of intervals between zero crossings
-- Kirstein (1971) talks about "Kumulanten" ("cumulants"), characteristic
concentrations of intervals (horizontal lines in the intervalgram)
Kirstein also quotes the rather pessimistic Burghard & Hess (1971), who
conclude that zero crossing interval distributions do not allow the
discrimination of vowels.
II. Problem and method
Windows with common sizes such as 10 or 20 ms are too narrow to give a
stable "view" on a speech signal; the distributions found are not
significant. That is why whole word utterances are chosen as the subject
of study.
-- The signal s(t) is clipped to a square signal _s_(t)=c*sgn(s(t))
-- the zero crossing intervals are collected
-- their distribution is examined, i.e. it is counted how many intervals
have the size i, how many the size i*2 and so on
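The three steps above can be sketched in C roughly as follows. This is only an illustration, not code from SBRECOG: the names clip_sign, interval_histogram and MAX_INTERVAL are made up, the samples are assumed to be signed integers centred on zero, and intervals are counted between every sign change (Kirstein himself only watched the positive part of the signal).

```c
#include <stddef.h>

#define MAX_INTERVAL 200  /* longest interval counted, in samples (assumed) */

/* Reduce a sample to its sign: the 1-bit clipped signal. Zero counts as positive. */
int clip_sign(int sample)
{
    return sample >= 0 ? 1 : -1;
}

/*
 * Count how often each zero crossing interval length occurs.
 * hist[k] receives the number of intervals that are k samples long
 * (1 <= k <= MAX_INTERVAL); hist must have MAX_INTERVAL + 1 elements.
 * The incomplete run before the first crossing and all longer intervals
 * are ignored. Returns the number of intervals counted.
 */
unsigned long interval_histogram(const int *samples, size_t n,
                                 unsigned long *hist)
{
    unsigned long counted = 0;
    size_t i, run = 0;
    int seen_crossing = 0;  /* becomes 1 once the first crossing is found */
    int prev;

    for (i = 0; i <= MAX_INTERVAL; i++)
        hist[i] = 0;
    if (n == 0)
        return 0;

    prev = clip_sign(samples[0]);
    for (i = 1; i < n; i++) {
        int cur = clip_sign(samples[i]);
        run++;
        if (cur != prev) {  /* sign change: a zero crossing */
            if (seen_crossing && run <= MAX_INTERVAL) {
                hist[run]++;
                counted++;
            }
            seen_crossing = 1;
            run = 0;
            prev = cur;
        }
    }
    return counted;
}
```

For the samples 1, 1, -1, -1, -1, 1, 1, -1 the sketch finds one interval of 3 samples and one of 2 samples; the leading run of two samples is dropped because its start is unknown.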
Kirstein makes his PDP 15 micro examine the signal in real time; to reduce
the necessary computations he watches the positive part of the signal only.
Thus he reaches a sample frequency of 32 kHz. He admits that the speech
signal is "not at all symmetrical to the zero line", but thinks the
results are usable anyway.
The smallest interval that can be measured (at the resulting time resolution)
is 31.6 mu-s; the biggest that gets counted is 6.3 ms. Thus there are
200 possible intervals. As one interval spans half a period, they cover
a frequency range of 79..15,823 Hz.
These 200 intervals are classified into 16 classes; the idea is that
one class stretches over the bandwidth of about one formant.
Here is how Kirstein assigned intervals to the 16 classes (mu-s is the
duration of the shortest interval in a class, given for some classes only):

class       1      2      3      4      5      6      7      8
interv      1      2      3      4      5      6      7      8
mu-s       31                         158

class       9     10     11     12     13     14     15     16
interv   9-10  11-12  13-15  16-19  20-25  26-38  39-78 79-200
mu-s      284    347    410    505    632    821   1232   2496
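As a sketch, the class assignment in the table can be written as a small lookup in C. The function name interval_class and the array class_upper are hypothetical; only the class boundaries are taken from the table.

```c
/* Largest interval number belonging to each of the 16 classes (from the table). */
static const int class_upper[16] = {
    1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 15, 19, 25, 38, 78, 200
};

/*
 * Return the class (1..16) of an interval given in units of 31.6 mu-s,
 * or 0 if the interval lies outside the range 1..200.
 */
int interval_class(int interval)
{
    int c;

    if (interval < 1 || interval > 200)
        return 0;
    for (c = 0; c < 16; c++)
        if (interval <= class_upper[c])
            return c + 1;
    return 0;  /* not reached: 200 is the last upper bound */
}
```

A linear search is enough here, since there are only 16 classes and most intervals in speech are short.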
The signal durations turned out to vary significantly between speakers
and even between different utterances by one speaker. As the number of
zero crossings of course varies with the length of a signal, relative
frequencies must be calculated.
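Calculating relative frequencies simply means dividing each class count by the total number of intervals. A minimal sketch in C, with made-up names (relative_frequencies, NUM_CLASSES) and assuming the counts are already sorted into the 16 classes:

```c
#define NUM_CLASSES 16

/*
 * Convert absolute class counts to relative frequencies that sum to 1.
 * Returns 1 on success, or 0 if there were no intervals at all
 * (rel is then filled with zeros).
 */
int relative_frequencies(const unsigned long counts[NUM_CLASSES],
                         double rel[NUM_CLASSES])
{
    unsigned long total = 0;
    int c;

    for (c = 0; c < NUM_CLASSES; c++)
        total += counts[c];
    for (c = 0; c < NUM_CLASSES; c++)
        rel[c] = total ? (double)counts[c] / (double)total : 0.0;
    return total != 0;
}
```

After this step a long and a short utterance of the same word should give similar vectors, even though their absolute counts differ.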
What we have at this point is a 16-dimensional vector representing each
word. Kirstein examines a number of statistical methods that measure
the similarity of two vectors. The one yielding the best results in his
study is based on a contingency matrix. The method is similar to that employed
by information theorists to calculate the "information transmission rate"
or "Transinformation" (Meyer-Eppler, 1969).
The formula combines input entropy + output entropy - overall entropy
to calculate a measure for the transmitted information:
   T = Sum_i=1..r ( Sum_j=1..c ( p_ij * log_2( p_ij / (p_i. * p_.j) ) ) ),
where c, r are the numbers of columns and rows of the matrix (c: dimension
of the vectors, r: number of vectors to be compared); p_ij are matrix cells,
p_i. are row sums, p_.j column sums.
Kirstein decides to smooth out the vectors (by averaging each element
with its weighted nearest neighbours). This turned out to be disastrous
in my implementation, so I left out the smoothing.
III. About my implementation
My main interest was voice recognition in the telephone network, thus
I had to make do with a smaller bandwidth and a sampling rate of around
11 kHz. It is easy to see why the number of possible interval sizes is
reduced to 64 instead of Kirstein's 200 (see the related comments in
the code). Although their classification, which must eventually yield
the 16-dimensional vector, is quite crucial for the performance of this
method, I must admit I did it quite ad hoc: I printed out a couple of
matrices and decided that they looked characteristic enough...
The performance of my program of course changes considerably with
different CPU speeds, as the sampling frequency is not constant on
different machines. If you do not achieve satisfactory results, try
changing the #define value of CPUSPEED to the clock rate of your machine,
or lower. I didn't test the program on machines other than 286s,
so given the quite different CPU designs it may be possible that you
have to set CPUSPEED to a value that doesn't match that of your computer
at all... The playback rate (that you observe during the training of
words) is no clue here, as of course the recognition depends only on
the recording speed. Just fiddle around with these things a bit.
The "user interface" of the program is so primitive that you will master
it without my explaining it here. Just note that there are basically
two ways of improving the recognition of a test set:
-- You can have multiple dictionary entries for different realisations of
one word. You may want to attach different ID strings to the dictionary
entries (like "bravo_1", "bravo_2", "bravo_fast", "bravo_slow"...), so
that you can see how often each of the entries is picked by the program.
-- Or you can have the parameter vectors in the dictionary calculated as
the average of two or more realisations (the program supports two only).
This is what the program means by asking "Would you like another test set
to be averaged with the set entered".
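Averaging two realisations presumably amounts to an element-wise mean of their parameter vectors. A minimal sketch in C, with a hypothetical function name and assuming 16-dimensional vectors of relative frequencies:

```c
#define DIM 16

/*
 * Element-wise mean of two parameter vectors, e.g. two realisations
 * of the same word. avg may be the same array as a or b.
 */
void average_vectors(const double a[DIM], const double b[DIM],
                     double avg[DIM])
{
    int j;

    for (j = 0; j < DIM; j++)
        avg[j] = 0.5 * (a[j] + b[j]);
}
```

If both input vectors sum to 1, so does the average, which means it can be stored in the dictionary like any other entry.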
The sound blaster interface "direct.obj" was written by Joel Lucsy of
Vroom Diggy Diggy Software and is part of a Freeware package, "Blast".
I am including only the Blast files necessary to compile my demo. If
you want to use the package for your own programs I suggest you let archie
search your favourite ftp servers for it.
Why am I publishing this demo program? I would like to see people
starting further experiments inspired by the ideas presented here.
The material is free to use and share. I hope you may feel somewhat
obliged to make your enhancements and applications free software, too.
If you have any further questions or comments, you can contact me
by electronic mail at
kiehl@ldv01.uni-trier.de
or by conventional mail
   until 06-31-1993          from 07-01-1993
   Johannes Kiehl            Johannes Kiehl
   Postfach 2441             Postfach 2441
   D - W 5500 Trier          D - 54214 Trier