The ability of a system to recognize a person by their voice is a non-intrusive way to collect biometric information. Unlike fingerprint detection, retinal scans, or face recognition, speaker recognition requires only a microphone to record a person's voice, circumventing the need for expensive hardware. Moreover, in pandemic-like situations where infectious agents may survive on surfaces, voice recognition can easily be deployed in place of biometric systems that involve some form of contact, and it is already used in a range of person-authentication applications.
Speaker verification is a subfield of the broader speaker recognition task: the process of recognizing a speaker by their voice. Recognizing the speaker is typically required for one of two use cases: speaker identification, which determines who is speaking from a set of enrolled speakers, and speaker verification, which accepts or rejects a claimed identity.
Speaker verification is usually conducted in one of two ways: text-dependent, where the speaker utters a fixed or prompted phrase, and text-independent, where any speech is accepted.
A GMM is a weighted sum of M component Gaussian densities, as given by the equation
$$\begin{equation}P(x|\lambda ) = \sum_{k=1}^{M} w_{k} \times g(x|\mu _{k},\Sigma_{k} )\end{equation}$$
where \(x\) is a D-dimensional feature vector (in our case, 39-dimensional MFCCs); \(w_{k}, \mu_{k}, \Sigma_{k},\ k = 1, 2, \ldots, M\), are the mixture weights, mean vectors, and covariance matrices; and \(g(x | \mu_k,\Sigma_k),\ k = 1, 2, \ldots, M\), are the component Gaussian densities.
Each component density is a D-variate Gaussian function of the form,
$$\begin{equation} g(\boldsymbol{x}| \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \left[ 1 \mathbin{/} \left( (2 \pi)^{D \mathbin{/} 2} |\boldsymbol{\Sigma}_k|^{1 \mathbin{/} 2} \right) \right] \exp \{ -0.5 (\boldsymbol{x} - \boldsymbol{\mu}_k)^{T} \boldsymbol{\Sigma}_{k}^{-1}(\boldsymbol{x} - \boldsymbol{\mu}_k) \}\end{equation}$$
with mean vector \(\boldsymbol{\mu}_k\) and covariance matrix \(\boldsymbol{\Sigma}_{k}\). The mixture weights satisfy the constraint \( \sum_{k=1}^M w_k = 1\). The complete GMM is parameterized by the mean vectors, covariance matrices, and mixture weights of all component densities; these parameters are collectively represented by \(\lambda = \{w_k,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_{k} \},\ k = 1, 2, \ldots, M \).
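As a concrete illustration of the two equations above, the sketch below evaluates a component density and the full mixture likelihood \(P(x|\lambda)\) in pure Python. It assumes diagonal covariances for simplicity, and the two-component, 1-D mixture at the end uses made-up toy numbers rather than anything from the demo:

```python
import math

def gauss_diag(x, mu, var):
    """D-variate Gaussian density with a diagonal covariance (var holds the diagonal)."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)                          # log |Sigma_k|
    maha = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, var))  # Mahalanobis term
    return math.exp(-0.5 * (d * math.log(2 * math.pi) + log_det + maha))

def gmm_likelihood(x, weights, means, variances):
    """P(x | lambda) = sum_k w_k * g(x | mu_k, Sigma_k)."""
    return sum(w * gauss_diag(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Toy two-component, 1-D mixture: components centred at 0 and 2, unit variances
p = gmm_likelihood([0.0], [0.5, 0.5], [[0.0], [2.0]], [[1.0], [1.0]])
```

With real 39-dimensional MFCCs, `x`, `mu`, and `var` would simply be length-39 lists per component.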
Maximum likelihood estimation of the GMM parameters is complicated by the fact that the component membership of each MFCC feature vector is a latent variable, so the parameters are estimated iteratively with the Expectation-Maximization (EM) algorithm. Starting from an initial model \(\lambda^{\text{old}}\), each EM iteration produces a new model \(\lambda\) whose likelihood on the training vectors \(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N\) is at least as high:

$$\begin{equation} \prod_{n=1}^N P(\boldsymbol{x}_n | \lambda) \geq \prod_{n=1}^N P(\boldsymbol{x}_n | \lambda^{\text{old}}) \end{equation}$$
The GMM parameters are typically initialized with the k-means algorithm, which clusters the MFCC feature vectors and uses the resulting cluster centroids as the initial mixture means.
In the expectation (E) step, the posterior probability of component \(k\) given a feature vector \(\boldsymbol{x}\) is computed as

$$\begin{equation} P(k|\boldsymbol{x}) = w_k g(\boldsymbol{x} | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \mathbin{/} P(\boldsymbol{x} | \lambda^{\text{old}}) \end{equation}$$
In the maximization (M) step, the GMM parameters are re-estimated as
$$\begin{aligned} n_k &= \sum_{n=1}^N P(k|\boldsymbol{x}_n, \lambda^{\text{old}}), \qquad w_k = n_k \mathbin{/} N \\
\boldsymbol{\mu}_k &=(1 \mathbin{/} n_k) \sum_{n=1}^N P(k|\boldsymbol{x}_n, \lambda^{\text{old}})\, \boldsymbol{x}_n \\
\boldsymbol{\Sigma}_k &= (1 \mathbin{/} n_k)\sum_{n=1}^N P(k|\boldsymbol{x}_n, \lambda^{\text{old}})\, \boldsymbol{x}_n \boldsymbol{x}_n^T - \boldsymbol{\mu}_k \boldsymbol{\mu}_k^T
\end{aligned}$$

where \(n_k\) is the effective number of feature vectors assigned to component \(k\).
Thus we have estimated the three sets of GMM parameters: the weights, the means, and the covariances.
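The E and M steps above can be sketched end to end. The following is a minimal 1-D illustration on synthetic data, not the production path (which uses Kaldi via bob.kaldi on multivariate MFCCs); a simple quantile-based initialization stands in for k-means:

```python
import math
import random

def em_gmm_1d(data, k=2, iters=50):
    """Fit a 1-D GMM by EM (toy sketch)."""
    n = len(data)
    srt = sorted(data)
    mu = [srt[(2 * j + 1) * n // (2 * k)] for j in range(k)]  # quantile init
    var = [1.0] * k
    w = [1.0 / k] * k
    for _ in range(iters):
        # E-step: posterior P(k | x_n, lambda_old) for every frame
        resp = []
        for x in data:
            p = [w[j] * math.exp(-0.5 * (x - mu[j]) ** 2 / var[j])
                 / math.sqrt(2 * math.pi * var[j]) for j in range(k)]
            s = sum(p)
            resp.append([pj / s for pj in p])
        # M-step: re-estimate weights, means, and variances
        for j in range(k):
            nk = sum(r[j] for r in resp)
            w[j] = nk / n
            mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nk
            var[j] = max(sum(r[j] * x * x for r, x in zip(resp, data)) / nk
                         - mu[j] ** 2, 1e-6)  # floor against degenerate components
    return w, mu, var

# Synthetic data: two well-separated "speakers" around 0 and 10
rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(200)] + [rng.gauss(10, 1) for _ in range(200)]
w, mu, var = em_gmm_1d(data)
```

After convergence the recovered means sit near 0 and 10 with roughly equal weights, mirroring how the three parameter sets emerge from the update equations above.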
In typical speaker verification tasks, only a limited amount of data is available to train each speaker model, so the models can't be reliably estimated directly with Expectation-Maximization. For this reason, maximum a posteriori (MAP) adaptation is often used to train the speaker models for speaker verification systems. This approach estimates the speaker model from a Universal Background Model (UBM). A UBM is a high-order GMM, trained on a large quantity of speech data obtained from a wide sample of the speaker population of interest. It is designed to capture the general form of a speaker model and to represent the speaker-independent distribution of features. The UBM parameters are estimated using the EM algorithm.
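For intuition, mean-only MAP adaptation can be sketched as follows. This is a simplified 1-D illustration of the classic relevance-MAP update, where the adapted mean of component \(k\) is \(\alpha_k \bar{x}_k + (1-\alpha_k)\mu_k\) with \(\alpha_k = n_k/(n_k + r)\); the relevance factor r = 16 and the toy UBM at the end are illustrative assumptions, not the demo's actual configuration:

```python
import math

def map_adapt_means(data, w, mu, var, relevance=16.0):
    """Mean-only MAP adaptation of a 1-D UBM toward a speaker's data."""
    k = len(mu)
    n_k = [0.0] * k   # effective frame counts per component
    ex_k = [0.0] * k  # first-order statistics per component
    for x in data:
        # posterior of each UBM component for this frame
        p = [w[j] * math.exp(-0.5 * (x - mu[j]) ** 2 / var[j])
             / math.sqrt(2 * math.pi * var[j]) for j in range(k)]
        s = sum(p)
        for j in range(k):
            n_k[j] += p[j] / s
            ex_k[j] += (p[j] / s) * x
    adapted = []
    for j in range(k):
        if n_k[j] > 0:
            alpha = n_k[j] / (n_k[j] + relevance)   # data-dependent mixing weight
            adapted.append(alpha * (ex_k[j] / n_k[j]) + (1 - alpha) * mu[j])
        else:
            adapted.append(mu[j])  # no data touched this component: keep UBM mean
    return adapted

# Toy UBM with components at 0 and 10; the speaker's frames all sit near 1
adapted = map_adapt_means([1.0] * 100, [0.5, 0.5], [0.0, 10.0], [1.0, 1.0])
```

Only the component the speaker's data actually occupies moves appreciably; components with little evidence stay close to the UBM, which is exactly what makes MAP adaptation robust with limited enrollment data.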
Some of the popular datasets used for speech processing are LibriSpeech, LibriVox, VoxCeleb1, VoxCeleb2, TIMIT, etc.
The dataset used for the pre-trained UBM in this demo is VoxCeleb1. Its file distribution is as follows:
| VoxCeleb1 | dev | test |
|---|---|---|
| # of speakers | 1,211 | 40 |
| # of videos | 21,819 | 677 |
| # of utterances | 148,642 | 4,874 |
Based on the paper titled A Study on Universal Background Model Training in Speaker Verification, a UBM can be trained on as little as ~1.5 hours of speech from around 60 speakers. The pre-trained model is based on such a subset of speakers.
The above dataset has far more than the required number of speakers and is widely used as a benchmark for assessing speaker verification performance.
Since UBM training takes a while even for this subset of speakers, training for 5 speakers is shown in this blog. The same code can be used for more speakers.
The bob.kaldi functions used in this demo have the following signatures:

```python
bob.kaldi.mfcc(data, rate=8000, preemphasis_coefficient=0.97, raw_energy=True,
               frame_length=25, frame_shift=10, num_ceps=13, num_mel_bins=23,
               cepstral_lifter=22, low_freq=20, high_freq=0, dither=1.0,
               snip_edges=True, normalization=True)

bob.kaldi.ubm_train(feats, ubmname, num_threads=4, num_frames=500000,
                    min_gaussian_weight=0.0001, num_gauss=2048, num_gauss_init=0,
                    num_gselect=30, num_iters_init=20, num_iters=4,
                    remove_low_count_gaussians=True)

bob.kaldi.ubm_full_train(feats, dubm, fubmfile, num_gselect=20, num_iters=4,
                         min_gaussian_weight=0.0001)

bob.kaldi.ivector_train(feats, fubm, ivector_extractor, num_gselect=20,
                        ivector_dim=600, use_weights=False, num_iters=5,
                        min_post=0.025, num_samples_for_weights=3,
                        posterior_scale=1.0)

bob.kaldi.ivector_extract(feats, fubm, ivector_extractor, num_gselect=20,
                          min_post=0.025, posterior_scale=1.0)
```

The model performance is evaluated on the test set by calculating the True Positive Rate (TPR) and False Positive Rate (FPR). The TPR and FPR are determined in the following fashion:
Each speaker has a certain number of utterances, each with a corresponding i-vector. The i-vectors are compared against each other via the cosine similarity score (CSS). If the score is above a certain threshold, the speaker is considered a match; a match is categorized as a positive and a mismatch as a negative. The CSS operates by comparing the angle between a test i-vector, \(\boldsymbol{w}_{\text{test}}\), and a target i-vector, \(\boldsymbol{w}_{\text{target}}\):
$$\begin{equation} \boldsymbol{S} ( \hat{\boldsymbol{w}}_{\text{target}} , \hat{\boldsymbol{w}}_{\text{test}} ) = \langle \hat{\boldsymbol{w}}_{\text{target}} , \hat{\boldsymbol{w}}_{\text{test}} \rangle \mathbin{/} \left( ||\hat{\boldsymbol{w}}_{\text{target}}|| \; || \hat{\boldsymbol{w}}_{\text{test}} || \right) \end{equation}$$
The true positive rate and false positive rate are determined for each speaker and an overall average is calculated, which reflects the performance of the classifier.
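The scoring and counting logic can be sketched in a few lines. This is an illustration with made-up toy i-vectors, not the demo's actual evaluation code; it computes the cosine similarity score from the equation above and thresholds it to obtain TPR and FPR:

```python
import math

def cosine_score(target, test):
    """Cosine similarity between a target and a test i-vector."""
    dot = sum(a * b for a, b in zip(target, test))
    norm = (math.sqrt(sum(a * a for a in target))
            * math.sqrt(sum(b * b for b in test)))
    return dot / norm

def tpr_fpr(scores, labels, threshold):
    """labels: True for same-speaker trials, False for different-speaker trials.
    A score at or above the threshold is predicted as a match (positive)."""
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l)
    fn = sum(1 for s, l in zip(scores, labels) if s < threshold and l)
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and not l)
    tn = sum(1 for s, l in zip(scores, labels) if s < threshold and not l)
    return tp / (tp + fn), fp / (fp + tn)

# Toy trials: one same-speaker pair, one different-speaker pair
target = [1.0, 0.0]
scores = [cosine_score(target, [0.9, 0.1]),   # same speaker, nearly parallel
          cosine_score(target, [0.0, 1.0])]   # different speaker, orthogonal
tpr, fpr = tpr_fpr(scores, labels=[True, False], threshold=0.5)
```

In the actual evaluation, the per-speaker TPR and FPR values computed this way are averaged across all test speakers.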
References:
[1] https://www.allaboutcircuits.com/technical-articles/an-introduction-to-digital-signal-processing/
[2] https://www.researchgate.net/publication/268277174_Speaker_Verification_using_I-vector_Features
[3] https://ieeexplore.ieee.org/document/5713236
[4] https://www.idiap.ch/software/bob/docs/bob/bob.kaldi/master/py_api.html#module-bob.kaldi
[5] https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html
[6] https://engineering.purdue.edu/~ee538/DSP_Text_3rdEdition.pdf
[7] Kamil, Oday. (2018). Frame Blocking and Windowing Speech Signal. 4. 87-94.