Speaker Recognition Using a Coding-Length Based Segmentation Method for Vector Quantization

Andrew Wagner

Final Course Project for
ECE 586YM: Generalized Principal Component Analysis:
Estimation and Segmentation of Hybrid Models
Spring 2006, Prof. Yi Ma

Introduction:

In this report, I present two methods of applying the methods found in [1] to the application domain of speaker recognition.  Both methods I present utilize mel-spaced frequency cepstral coefficients as feature vectors, and use the coding length based segmentation scheme for purposes of vector quantization.  The algorithm achieves nearly perfect performance in distinguishing between the voices of eight different people with similar voices.

Background on Speech Production:Image of Larynx

 

The speech signal originates in the vocal cords, which periodically restrict and release the flow of air from the lungs. The sound at this point contains practically every audible frequency. Sound then passes through the larynx, and through the mouth and nasal passages where it is shaped by the lips, tongue, etc. Voiced tones (mostly vowels) account for most of the energy in a recorded speech signal. Fricatives, Plosives, etc… account for a relatively small portion of the energy in the signal.

The frequency content of voiced tones depends on the volume and shape of the larynx (voice box) of the individual speaker. The uniqueness of the frequency response of the voice box from speaker to speaker is the basis of most traditional speaker identification, and will be the basis for the algorithms presented herein.

Speech signals have a hybrid nature in time. The signal is roughly stationary for intervals of approximately 30ms, with fairly abrupt changes at the beginning and end of each syllable.  This provides motivation for using a hybrid model, such as the Gaussian mixture model / k-means method used for vector quantization common in speaker recognition systems.

In speaker recognition, you have several sets of labeled training data, and you must use this information to label a new set of testing data.  Both the training data and the testing data consist of recordings of users speaking a passphrase into a microphone. The spectrogram shown below will give an idea of the properties of the speech signals. It clearly shows an evolution that is hybrid in time. This spectrogram corresponds to one of the recordings used for training in the following experiment.

spectrogram

 

Algorithm Description and Experimental Results

The algorithms consist of two stages: the training stage, which can be performed offline as soon as training data for all users is availabe, and the testing stage, which is performed when a new testing data set arrives, and is typically expected to complete with a matter of seconds. The same training stage is common to both of the algorithms I'm presenting:

Training:

  1. Window blocks of the audio signal with a hamming window.
  2. Take the fourier transforms of the windowed portions of the signal.
  3. Project frequency domain signal down into a smaller number of mel-spaced frequency bands.
  4. Take the Discrete Cosine Transforms of the log of complex magnitudes of the vectors.
  5. Randomly sub-sample the feature vectors from each recording so that they all have the same number of vectors.  In this context, there are usually still plenty of  feature vectors.
  6. Perform the same re-scaling to the rows of each training set, to prevent bands with a lot of energy from over-powering the bands with less signal strength.
  7. Use a pairwise steepest descent algorithm to segment each of the training data sets, and determine the coding length for each set.

The reader is reffered to [1] for a discussion of the coding length function used (referred to as L(W) in the paper). This results in each data set consisting of two hundered to five hundred samples in a 20 dimensional space. If two data sets are similar, then it would be reasonable to assume that they will code together more compactly than two data sets that are dissimilar.  This observation is the basis for the first method I am presenting:

Testing (Method 1):

  1. Perform the same seven steps that were applied to the training data, this time to the testing data.
  2. Additionally, for each training/testing data set pair, compute the coding length of the union of the training/testing data set pair.
  3. Compute the difference between the coding length for the data sets coded together and the sums of the coding lengths of the data sets coded separately to determine how similar the data sets are.
  4. Assign each testing set to the training set that minimizes the quantity computed in the previous step.

Here are the results of the above algorithm when applied to a simple example with three training recordings and three test recordings, recorded by three different people (Note the maximal values along the diagonal):

Difference in coding length, in units of bits.
  train1 train2 train3
test1 648 677 765
test2 766 675 773
test3 761 797 684

One thing that stands out is that it takes more bits to code the data sets together than it does to code them separately!  Although the presence of more groups in the combined segmentation reduces the bits required to represent each sample, coding the membership of the samples takes considerably more for each sample in the combined data set, and this effect dominates. Although this method was found to work well as long as the feature vectors are properly tuned, the time it takes to compute the combined coding length between the testing data set and each of the training data sets is prohibitive for problems larger than the 3 person experiment presented here. An alternative, significantly faster method of performing the testing has been developed.  It relies on the assumption, similar to the first method, that samples in the testing set will code compactly with samples in the training set.  However, this time each sample is considered individually and compared to each group in a given training.  The affect on coding length is aggregated over all of the testing samples.

Testing (Method 2):

  1. For each test set:
  2. Minimize over the training sets:
  3. the sum of the changes in coding lengh when each sample is coded with the training group it codes with most compactly.
  4. Assign the testing data set to the training set that exhibits the greatest decrease in coding length when the testing vectors are grouped into training groups.
  5. It was found to be necessary to remove an unexplained bias along each training group by subtracting off the mean of each vector of values correpsponding to a given training sample.

Although the results are reasonable before removal of the column means, the columns are in obvious need of re-scaling to remove a bias.

  train1 train2 train3 train4 train5 train6 train7 train8
test1 14411 15182 15403 14642 15169 15246 15428 15148
test2 15584 14217 15779 15873 15424 15784 15705 15292
test3 15005 14796 14857 15335 14991 14990 14726 14610
test4 14534 14706 14882 13171 14632 14773 14872 14625
test5 15074 15044 15100 15205 13584 15089 15233 14892
test6 14518 14413 14468 14809 14377 13485 14153 13792
test7 15512 15199 15513 15676 15430 15449 15425 15079
test8 15250 14897 15350 15333 15168 14580 15262 14078

After removing this bias, the results for the algorithm are improved, as can be seen in the following tables of results:

  train1 train2 train3 train4 train5 train6 train7 train8
test1 -574.9 375.5 233.7 -363.9 321.7 321.6 327.2 458.8
test2 597.7 -589.4 609.6 867.6 577.6 859.2 604.5 602.2
test3 19 -10.8 -312.1 329.7 143.8 65 -374.2 -79.5
test4 -451.7 -100.9 -286.5 -1834.6 -214.8 -151.3 -228.5 -64.5
test5 87.8 237 -68.6 199.4 -1263 164.6 132.3 202.2
test6 -468.2 -393.8 -701.1 -196.2 -469.6 -1439.3 -947.7 -897.8
test7 526.3 392 343.8 670.3 583.5 524.5 324.9 389.7
test8 264 90.4 181.1 327.7 320.8 -344.2 161.5 -611.1

It should be noted that this is a rather difficult data set, where all of the sound samples are from women with similar tones of voice all saying the same word. A similar algorithm that uses k-means for vector quantization purposes gave the exact same results, with the third sample mis-classified in exactly the same way. Listening to the samples, it is apparrent that the speaker pronounced with a descending "e" as in "bed", while in the testing phase, the speaker pronounced "zero" as "zeeero". The algorithm ended up matching the sample to a training set with an audibly similar pronunciation.

Conclusion

Two methods have been presented of applying entropy based segmentation methods in a supervised manner to the speaker identification problem. Experimental results show that the algorithms are sound on moderately sized problems, and motivate further investigations into applications with larger data sets. It would also be interesting to try applying an algorithm using  similar concepts to an unsupervised learning problem involving voice data, namely the automatic annotation of recorded conversations. It may be possible to automatically segment recorded conversation according to who which of several individuals is talking at a given time. There is also plenty of room for theoretical progress in the algorithm. Since we are using the difference between two upper bounds, the algorithm could fail if these bounds are not sufficiently tight. 

References

[1] Ma, Y., Derksen H., Hong, W., and Wright, J.  On the coding and segmentation of multivariate mixed data. UIUC Technical Report UILU-ENG-06-2203. April 2006.

[2] Some of the audio samples and parameters for feature vectors used in this project have been generously provided on the MATLAB file exchange by Amin Koohi.

[3] The illustration of the larynx was sourced from the wikipedia project.