Final Course Project for
ECE 586YM: Generalized Principal Component Analysis:
Estimation and Segmentation of Hybrid Models
Spring 2006, Prof. Yi Ma
In this report, I present two methods of applying the methods found in [1] to the application domain of speaker recognition. Both methods I present utilize mel-spaced frequency cepstral coefficients as feature vectors, and use the coding length based segmentation scheme for purposes of vector quantization. The algorithm achieves nearly perfect performance in distinguishing between the voices of eight different people with similar voices.

The speech signal originates in the vocal cords, which periodically restrict and release the flow of air from the lungs. The sound at this point contains practically every audible frequency. Sound then passes through the larynx, and through the mouth and nasal passages where it is shaped by the lips, tongue, etc. Voiced tones (mostly vowels) account for most of the energy in a recorded speech signal. Fricatives, Plosives, etc… account for a relatively small portion of the energy in the signal.
The frequency content of voiced tones depends on the volume and shape of the larynx (voice box) of the individual speaker. The uniqueness of the frequency response of the voice box from speaker to speaker is the basis of most traditional speaker identification, and will be the basis for the algorithms presented herein.
Speech signals have a hybrid nature in time. The signal is roughly stationary for intervals of approximately 30ms, with fairly abrupt changes at the beginning and end of each syllable. This provides motivation for using a hybrid model, such as the Gaussian mixture model / k-means method used for vector quantization common in speaker recognition systems.
In speaker recognition, you have several sets of labeled training data, and you must use this information to label a new set of testing data. Both the training data and the testing data consist of recordings of users speaking a passphrase into a microphone. The spectrogram shown below will give an idea of the properties of the speech signals. It clearly shows an evolution that is hybrid in time. This spectrogram corresponds to one of the recordings used for training in the following experiment.

The algorithms consist of two stages: the training stage, which can be performed offline as soon as training data for all users is availabe, and the testing stage, which is performed when a new testing data set arrives, and is typically expected to complete with a matter of seconds. The same training stage is common to both of the algorithms I'm presenting:
Training:
The reader is reffered to [1] for a discussion of the coding length function used (referred to as L(W) in the paper). This results in each data set consisting of two hundered to five hundred samples in a 20 dimensional space. If two data sets are similar, then it would be reasonable to assume that they will code together more compactly than two data sets that are dissimilar. This observation is the basis for the first method I am presenting:
Testing (Method 1):
Here are the results of the above algorithm when applied to a simple example with three training recordings and three test recordings, recorded by three different people (Note the maximal values along the diagonal):
| train1 | train2 | train3 | |
| test1 | 648 | 677 | 765 |
| test2 | 766 | 675 | 773 |
| test3 | 761 | 797 | 684 |
One thing that stands out is that it takes more bits to code the data sets together than it does to code them separately! Although the presence of more groups in the combined segmentation reduces the bits required to represent each sample, coding the membership of the samples takes considerably more for each sample in the combined data set, and this effect dominates. Although this method was found to work well as long as the feature vectors are properly tuned, the time it takes to compute the combined coding length between the testing data set and each of the training data sets is prohibitive for problems larger than the 3 person experiment presented here. An alternative, significantly faster method of performing the testing has been developed. It relies on the assumption, similar to the first method, that samples in the testing set will code compactly with samples in the training set. However, this time each sample is considered individually and compared to each group in a given training. The affect on coding length is aggregated over all of the testing samples.
Testing (Method 2):
Although the results are reasonable before removal of the column means, the columns are in obvious need of re-scaling to remove a bias.
| train1 | train2 | train3 | train4 | train5 | train6 | train7 | train8 | |
| test1 | 14411 | 15182 | 15403 | 14642 | 15169 | 15246 | 15428 | 15148 |
| test2 | 15584 | 14217 | 15779 | 15873 | 15424 | 15784 | 15705 | 15292 |
| test3 | 15005 | 14796 | 14857 | 15335 | 14991 | 14990 | 14726 | 14610 |
| test4 | 14534 | 14706 | 14882 | 13171 | 14632 | 14773 | 14872 | 14625 |
| test5 | 15074 | 15044 | 15100 | 15205 | 13584 | 15089 | 15233 | 14892 |
| test6 | 14518 | 14413 | 14468 | 14809 | 14377 | 13485 | 14153 | 13792 |
| test7 | 15512 | 15199 | 15513 | 15676 | 15430 | 15449 | 15425 | 15079 |
| test8 | 15250 | 14897 | 15350 | 15333 | 15168 | 14580 | 15262 | 14078 |
After removing this bias, the results for the algorithm are improved, as can be seen in the following tables of results:
| train1 | train2 | train3 | train4 | train5 | train6 | train7 | train8 | |
| test1 | -574.9 | 375.5 | 233.7 | -363.9 | 321.7 | 321.6 | 327.2 | 458.8 |
| test2 | 597.7 | -589.4 | 609.6 | 867.6 | 577.6 | 859.2 | 604.5 | 602.2 |
| test3 | 19 | -10.8 | -312.1 | 329.7 | 143.8 | 65 | -374.2 | -79.5 |
| test4 | -451.7 | -100.9 | -286.5 | -1834.6 | -214.8 | -151.3 | -228.5 | -64.5 |
| test5 | 87.8 | 237 | -68.6 | 199.4 | -1263 | 164.6 | 132.3 | 202.2 |
| test6 | -468.2 | -393.8 | -701.1 | -196.2 | -469.6 | -1439.3 | -947.7 | -897.8 |
| test7 | 526.3 | 392 | 343.8 | 670.3 | 583.5 | 524.5 | 324.9 | 389.7 |
| test8 | 264 | 90.4 | 181.1 | 327.7 | 320.8 | -344.2 | 161.5 | -611.1 |
It should be noted that this is a rather difficult data set, where all of the sound samples are from women with similar tones of voice all saying the same word. A similar algorithm that uses k-means for vector quantization purposes gave the exact same results, with the third sample mis-classified in exactly the same way. Listening to the samples, it is apparrent that the speaker pronounced with a descending "e" as in "bed", while in the testing phase, the speaker pronounced "zero" as "zeeero". The algorithm ended up matching the sample to a training set with an audibly similar pronunciation.
Two methods have been presented of applying entropy based segmentation methods in a supervised manner to the speaker identification problem. Experimental results show that the algorithms are sound on moderately sized problems, and motivate further investigations into applications with larger data sets. It would also be interesting to try applying an algorithm using similar concepts to an unsupervised learning problem involving voice data, namely the automatic annotation of recorded conversations. It may be possible to automatically segment recorded conversation according to who which of several individuals is talking at a given time. There is also plenty of room for theoretical progress in the algorithm. Since we are using the difference between two upper bounds, the algorithm could fail if these bounds are not sufficiently tight.
[1] Ma, Y., Derksen H., Hong, W., and Wright, J. On the coding and segmentation of multivariate mixed data. UIUC Technical Report UILU-ENG-06-2203. April 2006.
[2] Some of the audio samples and parameters for feature vectors used in this project have been generously provided on the MATLAB file exchange by Amin Koohi.
[3] The illustration of the larynx was sourced from the wikipedia project.