This is a spectrogram of the spoken syllables "ann-all-ack", repeated three times.
The input signal is fed into a bank of second-order bandpass filters with center frequencies ranging from 300 Hz to 3600 Hz on a logarithmic (or mel) scale. The Q-factor is chosen high enough that the peaks of their amplitude responses start to separate from each other. (I know that it’s definitely not state of the art, but then, nothing on these pages is.) The pixel values indicate, on a logarithmic scale, the energy stored in each bandpass; it is a quadratic form in the two state variables. Many artifacts in the "spectrogram", e.g. the trumpet shapes that open to the left, are due to ringing in the filters; they would not be visible in a true spectrogram. A sine wave that is suddenly switched on excites a broad range of filters, but only filters with the correct center frequency will store significant amounts of energy. They correspond to the bright horizontal lines.
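As a minimal sketch of how the center frequencies of such a bank could be laid out: the 300–3600 Hz range is taken from the description above, while the count of 64 filters is an assumption borrowed from the later text (the filter-bank output is said to live in R^{64}).

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: n center frequencies, equally spaced on a
// logarithmic scale between lo and hi. Each step multiplies the
// frequency by the same ratio (hi/lo)^(1/(n-1)).
std::vector<double> log_spaced_centers(int n = 64,
                                       double lo = 300.0,
                                       double hi = 3600.0)
{
    std::vector<double> fc(n);
    for (int k = 0; k < n; ++k)
        fc[k] = lo * pow(hi / lo, (double)k / (n - 1)); // equal ratio steps
    return fc;
}
```

Each of these frequencies would then be handed to one bandpass filter of the bank, together with a common Q-factor.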
All in all, there is an astonishing level of detail, both temporal and spectral. If we want to distinguish difficult consonants in spoken language, we will need them both. Some sounds differ only in small details, yet broad variations may be completely insignificant. Is this any different from our ability to recognize each and every A?
The bank of band-pass filters is a crude model of cochlear signal processing. The inner ear achieves its high frequency resolution with active filters. The wave that travels through the cochlea is amplified by tiny biomechanical amplifiers (outer hair cells) that compensate for energy loss. The result is a large increase in sensitivity and frequency resolution.
The output of the filter bank is a time-varying vector in R^{64}, but pattern engines need a bunch of discretization maps into small, discrete address spaces. An often overlooked map of this kind is given by cmp: R^{2} ⟶ {0,1}, cmp(x,y) = 1 iff x > y. A comparator attached between the output of one filter and the time-delayed output of another gives a map from R^{64} x [0,∞[ to {0,1} x [0,∞[. As time passes, the spectrogram scrolls to the left and the output toggles between 0 and 1 in a complicated manner. The output space {0,1} is, of course, far too small to be useful. If we use 16 to 24 comparators, we get a map h_{1} into {0,1}^{16} (or a slightly larger bitmap space). This is the space of small bitmaps again. Different delays and different filter outputs give different maps h_{1} … h_{n}. No sane engineer would consider these wildly nonlinear maps part of an audio processing system, but a pattern engine does not require much from its input functions. Each h_{i} will stay constant for short stretches of time, then change again. Because the spaces X_{i} are small, however, it is quite likely that a value of h_{i} will repeat. When that happens, we know that a certain pattern of spectral and temporal variation has occurred again. A pattern engine will combine the locally constant functions h_{i} into a new function that is constant on much larger domains.
/* See, for example, the book "Musical Applications of Microprocessors"
 * by Hal Chamberlin, published in 1985. It definitely shows its age,
 * but I've enjoyed reading it. */

class Filter
{
    double d1, d2, f, q;
public:
    Filter(void) { d1 = d2 = 0; f = q = 0; }

    // set up center frequency and Q-factor
    void init(double center_frequency, double Q, double sample_frequency)
    {
        f = 2 * sin(M_PI * center_frequency / sample_frequency);
        q = Q;
        d1 = d2 = 0;
    }

    // return a low-pass filtered sample
    double lpf(double x)
    {
        d2 += f * d1;
        d1 += f * (x - d2 - d1 / q);
        return d2;
    }

    // return a band-pass filtered sample
    double bpf(double x)
    {
        d2 += f * d1;
        d1 += f * (x - d2 - d1 / q);
        return d1;
    }

    // return the stored energy, a quadratic form in the state variables
    double hamilton(double x)
    {
        d2 += f * d1;
        d1 += f * (x - d2 - d1 / q);
        return d1 * d1 + f * d1 * d2 + d2 * d2;
    }
};
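A quick way to see the "only the matching filter stores energy" behavior from the spectrogram discussion is to drive one such filter with an on-frequency and an off-frequency tone and compare the peak stored energy. This is a hypothetical demo, not part of the original listing; the class is repeated in condensed form (only `hamilton`) so the sketch compiles on its own, and the sample rate, Q-factor, and tone frequencies are arbitrary choices.

```cpp
#include <cassert>
#include <cmath>

// Condensed copy of the Filter class above, keeping only the energy output.
class Filter {
    double d1, d2, f, q;
public:
    Filter() : d1(0), d2(0), f(0), q(0) {}
    void init(double center_frequency, double Q, double sample_frequency) {
        f = 2 * sin(M_PI * center_frequency / sample_frequency);
        q = Q;
        d1 = d2 = 0;
    }
    // stored energy, a quadratic form in the state variables
    double hamilton(double x) {
        d2 += f * d1;
        d1 += f * (x - d2 - d1 / q);
        return d1 * d1 + f * d1 * d2 + d2 * d2;
    }
};

// Feed 0.1 s of a pure tone into a filter and return the peak stored energy.
// Sample rate and Q are illustrative values.
double peak_energy(double filter_fc, double tone_hz) {
    const double fs = 48000.0;
    Filter flt;
    flt.init(filter_fc, 20.0, fs);
    double peak = 0;
    for (int n = 0; n < 4800; ++n) {
        double e = flt.hamilton(sin(2 * M_PI * tone_hz * n / fs));
        if (e > peak) peak = e;
    }
    return peak;
}
```

A filter centered at 1 kHz rings up to a large stored energy when driven at 1 kHz, while a 3 kHz tone leaves only a brief switch-on transient — exactly the bright-horizontal-line versus trumpet-shape distinction described above.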