Building a speaker recognition model

abishek subramanian | 2021-07-17 | 17 min read

The ability of a system to recognize a person by their voice offers a non-intrusive way to collect biometric information. Unlike fingerprint detection, retinal scans or face recognition, speaker recognition needs only a microphone to record a person’s voice, thereby circumventing the need for expensive hardware. Moreover, in pandemic-like situations where infectious agents may persist on surfaces, voice recognition can easily be deployed in place of other biometric systems that involve some form of contact. Existing applications of person authentication include:

credit card verification
over-the-phone secure access in call centres
suspect identification by voice
national security measures for combating terrorism by using voice to locate and track terrorists
detection of a speaker's presence at a remote location
annotation and indexing of speakers in audio data
voice-based signatures and secure building access

Overview of a speaker verification system:

Speaker verification is a subfield of the broader speaker recognition task. Speaker recognition is the process of recognizing a speaker by their voice, and it is typically required for one of the following use-cases:

Speaker Identification - Given an existing list of speakers, assign an unknown speaker’s voice to its closest match from the list. The system identifies, by some meaningful measure, which existing speaker’s model the utterance matches best. For example, a list of log probabilities is generated, and the speaker in the existing list with the highest value is classified as the one who spoke the ‘unknown’ utterance.
- A possible application of this use-case could be to identify the time stamps where an individual member of an audio conference room spoke.
- If the unknown utterance is spoken by a speaker outside the list of existing speakers, the model will nonetheless map it to some speaker from that list. Hence, the assumption is that the incoming voice has to be from a speaker present in the existing list.
Speaker Verification - Given a speaker model, the system verifies whether the incoming speech is from the same speaker the model was trained on. It determines whether the individual is who they claim to be.
- A possible application of this use-case would be to use the speaker's voice as a biometric authentication token.
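
The difference between the two use-cases can be sketched in a few lines of Python. The speaker names, scores and threshold below are made up for illustration; they are not outputs of the system described in this post.

```python
def identify(scores):
    """Closed-set identification: return the enrolled speaker with the highest score."""
    return max(scores, key=scores.get)


def verify(score, threshold):
    """Verification: accept the claimed identity only if its score clears the threshold."""
    return score >= threshold


# Hypothetical log-probability scores of one unknown utterance
# against three enrolled speaker models.
scores = {"alice": -41.2, "bob": -38.7, "carol": -45.9}

# Identification always returns someone from the list,
# even if the true speaker was never enrolled.
best = identify(scores)

# Verification answers a yes/no question about one claimed identity.
accepted = verify(scores["carol"], threshold=-40.0)

print(best, accepted)
```

Note that identification has no way to say "none of the above", which is why it assumes a closed list of speakers, whereas verification depends entirely on how the threshold is chosen.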

Important considerations:

Although an individual’s voice can be cloned, the purpose of a voice biometric is to serve as one of the ‘factors’ in a multi-factor authentication scheme.
While more advanced models for speaker verification exist, this blog lays a foundation in speech signal processing, on which deep learning methods are also built. The input to a neural network is typically a set of features extracted from the speech signal (MFCCs, LPCCs, 40-dimensional log mel filter bank energies) rather than the entire speech signal.
In this project we focus on the second use-case. Our goal therefore is to verify a person by their voice, provided they have previously enrolled in the system.

The general process of speaker verification involves three stages: development, enrollment, and verification.

Development is the process of learning speaker-independent parameters, which are needed for capturing general speech characteristics. This is where the UBM (Universal Background Model) is trained, using a large sample of speakers.
Enrollment is the process of learning the distinct characteristics of a speaker’s voice, which are used to create a speaker-specific claimed model that represents the enrolled speaker during verification. The speaker speaks for a duration (generally about 30 s) while the system models their voice and saves the model in a .npy file.
In verification, the distinct characteristics of a claimant’s voice, taken from the incoming audio, are compared with the previously saved model of the claimed speaker to determine whether the claimed identity is correct.

Speaker verification is usually conducted in one of two ways:

TISV - (Text Independent Speaker Verification) - The lexical and phonetic content of the speech is unconstrained, and the speaker can utter any phrase or sentence during enrollment and verification.
TDSV - (Text Dependent Speaker Verification) - The lexical and phonetic content of the speech is constrained, and the system can only return a match if the phrase that was used for enrollment is also used for verification.

Approach and algorithms used:

In case you are not familiar with the basics of digital signal processing, it is highly recommended that you read the Fundamentals of Signal Processing.
The i-Vector is the embedding, or vector, that is extracted from a speech utterance during enrollment and verification. The cosine similarity is evaluated on this embedding to predict a match or a mismatch between the speakers of the enrollment and verification utterances.
To understand the i-Vector approach, a dive into GMMs (Gaussian Mixture Models) is necessary. Many probability distributions of random variables occurring in nature can be modeled as Gaussians or mixtures of Gaussians. GMMs are well understood, and their parameters are estimated iteratively. As a generative approach, a GMM effectively models the speaker in a speaker verification system: it returns the probability that an incoming utterance was generated by the trained mixture. The parameters of a GMM are the Gaussian means, the covariances and the mixture weights.

A GMM is a weighted sum of M component Gaussian densities, as given by the equation

$$\begin{equation}P(x|\lambda ) = \sum_{k=1}^{M} w_{k} \times g(x|\mu _{k},\Sigma_{k} )\end{equation}$$

where $x$ is a D-dimensional feature vector (in our case, 39-dimensional MFCCs), $w_{k}, \mu_{k}, \Sigma_{k}$ for $k = 1, 2, \ldots, M$ are the mixture weights, means and covariances, and $g(x | \mu_k,\Sigma_k)$ for $k = 1, 2, \ldots, M$ are the component Gaussian densities.

Each component density is a D-variate Gaussian function of the form,

$$\begin{equation} g(\boldsymbol{x}| \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \left[ 1 \mathbin{/} \left( (2 \pi)^{D \mathbin{/} 2} |\boldsymbol{\Sigma}_k|^{1 \mathbin{/} 2} \right) \right] \text{exp} \{ -0.5 (\boldsymbol{x} - \boldsymbol{\mu}_k)^{T} \boldsymbol{\Sigma}_{k}^{-1}(\boldsymbol{x} - \boldsymbol{\mu}_k) \}\end{equation}$$

with mean vector $\boldsymbol{\mu}_k$ and covariance matrix $\boldsymbol{\Sigma}_{k}$. The mixture weights satisfy the constraint $ \sum_{k=1}^M w_k = 1$. The complete GMM is parameterized by the mean vectors, covariance matrices and mixture weights from all component densities, and these parameters are collectively represented by $\lambda = \{w_k,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_{k} \}, k = 1, 2, \ldots, M $.
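
As a rough illustration of the mixture density, the sketch below evaluates $\log P(x|\lambda)$ in plain Python. It assumes diagonal covariance matrices (so each $\boldsymbol{\Sigma}_k$ is stored as a list of variances) and uses the log-sum-exp trick for numerical stability; the toy weights, means and variances are invented for the example, and a real system would use NumPy or a toolkit such as Kaldi.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a D-variate Gaussian with diagonal covariance."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2 * math.pi) + log_det + maha)


def gmm_log_likelihood(x, weights, means, variances):
    """log P(x | lambda) = log sum_k w_k g(x | mu_k, Sigma_k)."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)  # log-sum-exp: factor out the largest term
    return top + math.log(sum(math.exp(t - top) for t in terms))


# Toy two-component mixture over 2-dimensional features (illustrative values).
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [0.5, 0.5]]

near = gmm_log_likelihood([0.1, -0.2], weights, means, variances)
far = gmm_log_likelihood([10.0, 10.0], weights, means, variances)
print(near, far)
```

A point close to one of the component means receives a much higher log-likelihood than a point far from both, which is the property the verification score relies on.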

Expectation Maximization

The Expectation Maximisation (EM) algorithm learns the GMM parameters, based on maximizing the expected log-likelihood of the training data.
The expectation of a random variable with respect to a distribution is its probability-weighted average: a sum over all values for a discrete distribution, or an integral for a continuous one.
The likelihood function measures the goodness of fit of a statistical model to a sample of data for given values of the unknown parameters, which in this case are the means, covariances and mixture weights. It is formed from the joint probability distribution of the sample.
The motivation of the EM algorithm is to estimate a new and improved model from the current model using the training utterance features, such that the probability of the training features under the new model is greater than or equal to their probability under the old model. This is an iterative technique where the new model becomes the current model for the following iteration.

$$\begin{equation} \prod_{n=1}^N P(\boldsymbol{x}_n | \lambda) \geq \prod_{n=1}^N P(\boldsymbol{x}_n | \lambda^{\text{old}}) \end{equation}$$

EM provides maximum likelihood estimates of the GMM parameters when the model has latent variables; here, the latent variable is the mixture component that generated each extracted feature vector (MFCC). The algorithm starts from random or otherwise initialized parameter values and uses them to estimate the latent variables, which in turn are used to re-estimate the parameters.

The k-means algorithm is used to initialize the GMM parameters: the MFCC feature vectors are iteratively clustered, and the resulting cluster means serve as the initial mixture means.
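
A minimal sketch of this initialization step is shown below, using plain Lloyd's k-means on a handful of invented 2-dimensional points. The deterministic choice of the first k points as starting centres is an assumption made to keep the example reproducible; toolkits normally use random or smarter seeding.

```python
def kmeans(points, k, iters=10):
    """Lloyd's algorithm; returns k cluster means for equal-length vectors."""
    # Assumption for this sketch: seed with the first k points.
    centres = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # Assignment step: attach every point to its nearest centre.
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centres]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centre to the mean of its members.
        for j, members in enumerate(clusters):
            if members:  # keep the old centre if a cluster ends up empty
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return centres


# Stand-in for MFCC vectors: two well-separated groups of points.
points = [[0.0, 0.1], [5.0, 5.1], [0.2, -0.1], [4.9, 5.0], [0.1, 0.0], [5.1, 4.9]]
initial_means = kmeans(points, k=2)
print(initial_means)
```

The returned centres would then be used as the starting values of $\boldsymbol{\mu}_k$ before the first E-step.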

E-step

$$\begin{equation} P(k|\boldsymbol{x}_n, \lambda^{\text{old}}) = w_k \, g(\boldsymbol{x}_n | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \mathbin{/} P(\boldsymbol{x}_n | \lambda^{\text{old}}) \end{equation}$$

The maximisation step then re-estimates the GMM parameters from these posterior probabilities.

M-step

$$\begin{aligned} n_k &= \sum_{n=1}^N P(k|\boldsymbol{x}_n, \lambda^{\text{old}}) \\
w_k &= n_k \mathbin{/} N \\
\boldsymbol{\mu}_k &=(1 \mathbin{/} n_k) \sum_{n=1}^N P(k|\boldsymbol{x}_n, \lambda^{\text{old}}) \boldsymbol{x}_n \\
\boldsymbol{\Sigma}_k &= (1 \mathbin{/} n_k)\sum_{n=1}^N P(k|\boldsymbol{x}_n, \lambda^{\text{old}}) \boldsymbol{x}_n \boldsymbol{x}_n^T - \boldsymbol{\mu}_k \boldsymbol{\mu}_k^T
\end{aligned}$$

Thus we have evaluated the three parameters of the GMM i.e. the weights, means and the covariances.
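
To make the E-step and M-step concrete, here is a small sketch of EM for a one-dimensional, two-component GMM in plain Python. The data, the starting parameters and the variance floor are all assumptions chosen for illustration; the update rules follow the standard GMM re-estimation formulas, with the soft count of each component obtained by summing its posteriors.

```python
import math


def gauss(x, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def log_likelihood(data, w, mu, var):
    """Sum over samples of log P(x_n | lambda)."""
    return sum(math.log(sum(wk * gauss(x, mk, vk)
                            for wk, mk, vk in zip(w, mu, var)))
               for x in data)


def em_step(data, w, mu, var, var_floor=1e-6):
    n, m = len(data), len(w)
    # E-step: posterior P(k | x_n, lambda_old) for every sample and component.
    resp = []
    for x in data:
        joint = [wk * gauss(x, mk, vk) for wk, mk, vk in zip(w, mu, var)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: soft counts, then weights, means and variances.
    nk = [sum(r[k] for r in resp) for k in range(m)]
    new_w = [nk[k] / n for k in range(m)]
    new_mu = [sum(r[k] * x for r, x in zip(resp, data)) / nk[k]
              for k in range(m)]
    new_var = [max(sum(r[k] * x * x for r, x in zip(resp, data)) / nk[k]
                   - new_mu[k] ** 2, var_floor)
               for k in range(m)]
    return new_w, new_mu, new_var


# Toy data drawn around -1 and +3 (invented values).
data = [-1.2, -0.8, -1.0, -0.9, -1.1, 2.9, 3.1, 3.0, 3.2, 2.8]
w, mu, var = [0.5, 0.5], [-0.5, 2.0], [1.0, 1.0]

history = [log_likelihood(data, w, mu, var)]
for _ in range(10):
    w, mu, var = em_step(data, w, mu, var)
    history.append(log_likelihood(data, w, mu, var))

print(mu, history[0], history[-1])
```

Tracking `history` is a convenient sanity check: the log-likelihood should never decrease from one iteration to the next, which mirrors the inequality stated earlier for the new and old models.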

UBM (Universal Background Model) Training

In typical speaker verification tasks, only a limited amount of data is available to train each speaker model, so the speaker models can’t be reliably estimated directly using Expectation Maximization. For this reason, maximum a posteriori (MAP) adaptation is often used to train the speaker models for speaker verification systems. This approach estimates the speaker model by adapting the Universal Background Model. A UBM is a high-order GMM, trained on a large quantity of speech data obtained from a wide sample of the speaker population of interest. It is designed to capture the general form of a speaker model and to represent the speaker-independent distribution of features. The UBM parameters are estimated using the EM algorithm.
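
The idea behind MAP adaptation can be sketched for a single mixture component as follows. This uses the commonly cited relevance-MAP mean update, in which the adapted mean is an interpolation between the speaker's data and the UBM mean; the function name, the relevance factor of 16 and the toy inputs are assumptions for illustration and are not taken from the pipeline in this post.

```python
def map_adapt_mean(ubm_mean, frames, resp, relevance=16.0):
    """Adapt one UBM component mean toward a speaker's enrollment data.

    ubm_mean : UBM mean vector of component k
    frames   : feature vectors from the enrollment utterance
    resp     : P(k | x_n) for each frame, computed under the UBM
    """
    n_k = sum(resp)  # soft count of frames assigned to this component
    if n_k == 0.0:
        return list(ubm_mean)  # component never observed: keep the UBM mean
    # Posterior-weighted average of the speaker's frames for this component.
    data_mean = [sum(r * f[d] for r, f in zip(resp, frames)) / n_k
                 for d in range(len(ubm_mean))]
    # More data for the component means more trust in the speaker's statistics.
    alpha = n_k / (n_k + relevance)
    return [alpha * e + (1.0 - alpha) * m for e, m in zip(data_mean, ubm_mean)]


# With 16 frames fully assigned to the component, alpha is 0.5.
many = map_adapt_mean([0.0], [[2.0]] * 16, [1.0] * 16)
# With a single frame, the adapted mean stays close to the UBM mean.
few = map_adapt_mean([0.0], [[2.0]], [1.0])
print(many, few)
```

This is why MAP adaptation copes with short enrollment recordings: components with little speaker data simply fall back to the UBM.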

MFCCs are the low-level acoustic features used as input to the GMM-UBM model. The GMM mean supervector is the concatenation of the GMM mean vectors. The i-Vector method provides an intermediate speaker representation between the high-dimensional GMM supervector and the traditional low-dimensional MFCC feature representation.
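
For reference, the relationship between the supervector and the i-Vector is usually written as a low-rank factorization. This is the standard total variability formulation from the i-Vector literature rather than something derived in this post, and the symbols $\boldsymbol{M}$, $\boldsymbol{m}$ and $\boldsymbol{T}$ are introduced here only for illustration:

$$\begin{equation} \boldsymbol{M} = \boldsymbol{m} + \boldsymbol{T} \boldsymbol{w} \end{equation}$$

where $\boldsymbol{M}$ is the utterance-dependent GMM mean supervector, $\boldsymbol{m}$ is the UBM mean supervector, $\boldsymbol{T}$ is a low-rank matrix (the total variability matrix) and $\boldsymbol{w}$ is the i-Vector. Assuming a 2048-component UBM over 39-dimensional features, the supervector has $2048 \times 39 = 79{,}872$ dimensions, whereas a 600-dimensional i-Vector is more than a hundred times smaller.
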
The enrollment procedure extracts an i-Vector from a 30-second utterance recorded during speaker registration. This i-Vector is saved in a database as the speaker model. Later, during verification, an i-Vector extracted from a test utterance (about 15 seconds long) is compared against the enrolled i-Vector via a cosine similarity score. If this score exceeds a threshold value, the pair is declared a match; otherwise, it is declared a mismatch.

Dataset

Some of the popular datasets used for speech processing are LibriSpeech, LibriVox, VoxCeleb1, VoxCeleb2, TIMIT, etc.

The dataset used for the pre-trained UBM in this demo is VoxCeleb1. The distribution of files in this dataset is as follows:

Verification split

VoxCeleb1	dev	test
# of speakers	1,211	40
# of videos	21,819	677
# of utterances	148,642	4,874

According to the paper titled A study on Universal Background Model training in Speaker Verification, the UBM can be trained on roughly 1.5 hours of data from around 60 speakers. The pre-trained model is based on such a subset of speakers.

The above dataset has far more than the required number of speakers and has been used as a benchmark dataset for assessing speaker verification performance.

Since UBM training takes a while even for this subset of speakers, training for 5 speakers is shown in this blog. The same code can be used for more speakers.

Model Implementation

Extract MFCCs and train the i-Vector extractor on a subset of VoxCeleb1

For the scope of this blog, 5 speakers from dev and 5 speakers from test are used from the VoxCeleb1 dataset. UBM training and i-Vector training are performed on the dev set. The test set is used to evaluate model performance metrics. In practice, the number of speakers used to train the UBM is much higher.
Silence detection - A .wav file containing human speech typically consists of speech and silence segments. MFCCs should be extracted only from frames where voice activity is detected, not from silence segments. If MFCCs are extracted from silence frames, features that are not characteristic of the target speaker will be modeled in the GMM. A Hidden Markov Model (HMM) can be trained to separate the silence and speech segments in a speech file (see pyAudioAnalysis). To train the HMM, the timestamps of the speech and silence segments are fed to the model.
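
To show what dropping silence frames means in practice, the sketch below uses a simple energy threshold instead of the HMM approach mentioned above. This is a deliberately cruder, swapped-in technique, and the function names, the threshold ratio and the synthetic signal are assumptions for illustration. The frame length of 200 samples and hop of 80 samples correspond to 25 ms and 10 ms at an 8 kHz sampling rate.

```python
import math


def frame_energies(samples, frame_len, hop):
    """Mean squared amplitude of each frame."""
    return [sum(s * s for s in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, hop)]


def speech_frames(samples, frame_len=200, hop=80, ratio=0.1):
    """Flag frames whose energy exceeds `ratio` times the loudest frame."""
    energies = frame_energies(samples, frame_len, hop)
    threshold = ratio * max(energies)
    return [e > threshold for e in energies]


# Synthetic signal: near-silence, a 200 Hz tone standing in for speech,
# then near-silence again (0.1 s each at 8 kHz).
silence = [0.001] * 800
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(800)]
signal = silence + tone + silence

flags = speech_frames(signal)
print(flags)
```

Only the frames flagged `True` would be passed on to MFCC extraction. A fixed energy ratio is fragile with background noise, which is one reason a trained model such as an HMM is preferable for real recordings.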
bob.kaldi

The MFCCs are extracted by the function [4]

bob.kaldi.mfcc(data, rate=8000, preemphasis_coefficient=0.97, raw_energy=True, frame_length=25, frame_shift=10, num_ceps=13, num_mel_bins=23, cepstral_lifter=22, low_freq=20, high_freq=0, dither=1.0, snip_edges=True, normalization=True)

Training the Universal Background Model - The UBM is trained on the 5-speaker dev subset of VoxCeleb1 using the functions below [4]

For the global diagonal GMM model

bob.kaldi.ubm_train(feats, ubmname, num_threads=4, num_frames=500000, min_gaussian_weight=0.0001, num_gauss=2048, num_gauss_init=0, num_gselect=30, num_iters_init=20, num_iters=4, remove_low_count_gaussians=True)

Full covariance UBM model

bob.kaldi.ubm_full_train(feats, dubm, fubmfile, num_gselect=20, num_iters=4, min_gaussian_weight=0.0001)

Train the i-Vector extractor on the dev set - The i-Vector extractor is trained using the MFCC features of the dev set and the UBM. Once trained, it produces a 600-dimensional array/embedding from a speech utterance. The function used to train the i-Vector extractor is

bob.kaldi.ivector_train(feats, fubm, ivector_extractor, num_gselect=20, ivector_dim=600, use_weights=False, num_iters=5, min_post=0.025, num_samples_for_weights=3, posterior_scale=1.0)

Extract i-Vectors - Once the i-Vector extractor is trained, i-Vectors can be extracted from the MFCC features of any speech .wav file using the function below [4]

bob.kaldi.ivector_extract(feats, fubm, ivector_extractor, num_gselect=20, min_post=0.025, posterior_scale=1.0)
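
The calls above can be chained roughly as follows. This is a sketch based on the signatures listed in this post and on my reading of the bob.kaldi documentation [4]: it assumes that each training function returns the trained model in a form that can be passed straight to the next call, and that the second or third argument is an output file name. The wrapper name `train_and_extract`, the `workdir` parameter and the file names are hypothetical, so check them against your installed version before relying on this.

```python
import os


def train_and_extract(dev_feats, test_feats, workdir="."):
    """Chain the calls: diagonal UBM -> full UBM -> i-Vector extractor -> i-Vector.

    dev_feats and test_feats are MFCC arrays as returned by bob.kaldi.mfcc.
    """
    # Imported lazily so this module can be loaded without Kaldi installed.
    import bob.kaldi

    # Global diagonal-covariance GMM (defaults as listed above).
    dubm = bob.kaldi.ubm_train(dev_feats, os.path.join(workdir, "dubm"))

    # Full-covariance UBM initialised from the diagonal one.
    fubm = bob.kaldi.ubm_full_train(dev_feats, dubm,
                                    os.path.join(workdir, "fubm"))

    # i-Vector extractor producing 600-dimensional embeddings.
    extractor = bob.kaldi.ivector_train(
        dev_feats, fubm, os.path.join(workdir, "ivector_extractor"),
        ivector_dim=600)

    # Embedding for the test utterance.
    return bob.kaldi.ivector_extract(test_feats, fubm, extractor)
```

In this layout the dev features drive all three training stages, and only the final call touches the utterance being enrolled or verified.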

Model Performance

The model performance is evaluated on the test set by calculating the True Positive Rate (TPR) and False Positive Rate (FPR). The TPR and FPR are determined in the following fashion:

Each speaker has a certain number of utterances, each with a corresponding i-Vector. The i-Vectors are compared against each other via the cosine similarity score (CSS). If the score is above a certain threshold, the pair is considered a match. A match is categorized as a positive and a mismatch as a negative. The CSS is the cosine of the angle between a target i-Vector $\hat{\boldsymbol{w}}_{\text{target}}$ and a test i-Vector $\hat{\boldsymbol{w}}_{\text{test}}$:

$$\begin{equation} \boldsymbol{S} ( \hat{\boldsymbol{w}}_{\text{target}} , \hat{\boldsymbol{w}}_{\text{test}} ) = \langle \hat{\boldsymbol{w}}_{\text{target}} , \hat{\boldsymbol{w}}_{\text{test}} \rangle \mathbin{/} \left( ||\hat{\boldsymbol{w}}_{\text{target}}|| \; || \hat{\boldsymbol{w}}_{\text{test}} || \right) \end{equation}$$
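
The score itself is straightforward to compute. The sketch below implements it in plain Python for short toy vectors; real i-Vectors would be 600-dimensional arrays, and the threshold of 0.5 is a placeholder rather than a tuned value.

```python
import math


def cosine_similarity(a, b):
    """Inner product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_match(target, test, threshold=0.5):
    """Declare a match when the score exceeds the (placeholder) threshold."""
    return cosine_similarity(target, test) > threshold


enrolled = [1.0, 2.0, 0.5]
same_direction = [2.0, 4.0, 1.0]   # scaled copy of the enrolled vector
different = [-2.0, 1.0, 0.0]       # orthogonal to the enrolled vector

print(cosine_similarity(enrolled, same_direction), is_match(enrolled, same_direction))
print(cosine_similarity(enrolled, different), is_match(enrolled, different))
```

Because the score depends only on the angle between the vectors, scaling an i-Vector does not change the decision.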

The true positive rate and false positive rate in this case are determined for each speaker and an overall average is calculated. This reflects the performance of the classifier.
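
A sketch of this evaluation is given below. Each trial is a similarity score plus a flag saying whether the two i-Vectors really came from the same speaker; the rates are computed per speaker and then averaged. The speaker names, scores and threshold are invented for illustration, and the function assumes every speaker has at least one same-speaker and one different-speaker trial.

```python
def tpr_fpr(scores, same_speaker, threshold):
    """True and false positive rates for one speaker's trials."""
    tp = fp = fn = tn = 0
    for score, same in zip(scores, same_speaker):
        predicted_match = score > threshold
        if same and predicted_match:
            tp += 1
        elif same:
            fn += 1
        elif predicted_match:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)


def average_rates(trials, threshold):
    """Average the per-speaker TPR and FPR over all speakers."""
    rates = [tpr_fpr(scores, labels, threshold)
             for scores, labels in trials.values()]
    mean_tpr = sum(r[0] for r in rates) / len(rates)
    mean_fpr = sum(r[1] for r in rates) / len(rates)
    return mean_tpr, mean_fpr


# Hypothetical trials: (scores, same-speaker labels) per speaker.
trials = {
    "spk1": ([0.9, 0.8, 0.4, 0.6, 0.2, 0.1],
             [True, True, True, False, False, False]),
    "spk2": ([0.7, 0.9, 0.3, 0.1],
             [True, True, False, False]),
}

mean_tpr, mean_fpr = average_rates(trials, threshold=0.5)
print(mean_tpr, mean_fpr)
```

Sweeping the threshold and recomputing both rates at each value is the usual way to see the trade-off between missed matches and false accepts.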

References:

[1] https://www.allaboutcircuits.com/technical-articles/an-introduction-to-digital-signal-processing/

[2] https://www.researchgate.net/publication/268277174_Speaker_Verification_using_I-vector_Features

[3] https://ieeexplore.ieee.org/document/5713236

[4] https://www.idiap.ch/software/bob/docs/bob/bob.kaldi/master/py_api.html#module-bob.kaldi

[5] https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html

[6] https://www.idiap.ch/software/bob/docs/bob/bob.kaldi/master/py_api.html#module-bob.kaldi

[7] https://www.researchgate.net/publication/268277174_Speaker_Verification_using_I-vector_Features

[8] https://engineering.purdue.edu/~ee538/DSP_Text_3rdEdition.pdf

[9] Kamil, Oday. (2018). Frame Blocking and Windowing Speech Signal. 4. 87-94.

Summary