In this paper, a generative frontend based on both phonetic and prosodic features is investigated, together with two approaches based on phonetic transcription: Aggregated Phone Recognizer followed by Language Models (APRLM) and Generalized Phone Recognizer followed by Language Models (GPRLM). APRLM and GPRLM have several disadvantages: they require a phonetic transcription of the speech data and exploit fewer levels of information, whereas the generative frontend, built upon an ensemble of Gaussian densities, combines prosodic and phonetic information. Support Vector Machine (SVM)-based approaches, in contrast, require no transcription of the speech data and also performed better in our experiments; moreover, APRLM and GPRLM are more time consuming than the SVM-based approaches. We used Mel-Frequency Cepstral Coefficient (MFCC) features in APRLM and GPRLM, and Shifted Delta Cepstrum (SDC) and Pitch Contour Polynomial Approximation (PCPA) features in the SVM-based methods, with the Probabilistic Sequence Kernel (PSK) and the Generalized Linear Discriminant Sequence (GLDS) kernel in the SVM experiments. In the LID experiments conducted with PCPA features, SVMs using the GLDS and PSK kernels outperform the GMM, improving LID performance by about 2.1% and 5.9% respectively. The combination of the Probabilistic Characteristic Vector using PCPA (PCV-PCPA) and the Probabilistic Characteristic Vector using SDC (PCV-SDC) provides further improvements.
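The sketch below illustrates, in simplified form, the two feature types named above: Shifted Delta Cepstrum features and a polynomial approximation of the pitch contour. The N-d-P-k configuration (here the common 7-1-3-7 default), the polynomial order, and all function names are illustrative assumptions, not values or code taken from the paper.

```python
"""Minimal sketches of SDC and PCPA-style feature extraction.
Assumed parameters (7-1-3-7 SDC, cubic pitch fit) are common defaults,
not the settings used in the paper."""
import numpy as np


def shifted_delta_cepstrum(cepstra, N=7, d=1, P=3, k=7):
    """Standard N-d-P-k SDC: stack k shifted delta blocks per frame.

    cepstra: (T, C) array of cepstral frames; the first N coefficients
    of each frame are used. Frames too close to the edges are dropped.
    """
    T = cepstra.shape[0]
    feats = []
    for t in range(d, T - d - (k - 1) * P):
        blocks = [cepstra[t + i * P + d, :N] - cepstra[t + i * P - d, :N]
                  for i in range(k)]
        feats.append(np.concatenate(blocks))  # one k*N-dimensional vector
    return np.array(feats)


def pitch_contour_poly(f0, order=3):
    """Least-squares polynomial fit to a voiced pitch contour; the
    coefficients form a fixed-length prosodic feature vector."""
    t = np.linspace(0.0, 1.0, len(f0))  # time axis normalized to [0, 1]
    return np.polyfit(t, f0, order)


# Toy usage with random "cepstra" and a synthetic pitch contour.
rng = np.random.default_rng(0)
sdc = shifted_delta_cepstrum(rng.standard_normal((200, 13)))
pcpa = pitch_contour_poly(120 + 20 * np.sin(np.linspace(0, np.pi, 50)))
print(sdc.shape, pcpa.shape)  # (180, 49) (4,)
```

In both cases a variable-length segment is mapped to fixed-dimension vectors that can then be fed to a GMM frontend or to an SVM through a sequence kernel such as GLDS or PSK.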