Heterophonic speech recognition using composite phones

View Article: PubMed Central - PubMed

ABSTRACT

Heterophones are words that have the same spelling but different pronunciations. They pose challenges during the training of automatic speech recognition (ASR) systems because the orthographic representation of a word is ambiguous as to its pronunciation. This paper addresses the problem of heterophonic languages by developing the concept of a Composite Phoneme (CP) as a basic pronunciation unit for speech recognition. A CP is a set of alternative sequences of phonemes. CPs are developed specifically in the context of Arabic by defining phonetic units that are consonant-centric and absorb the phonemically contrastive short vowels and gemination that are not represented in Arabic Modern Orthography (MO). CPs alleviate the need to diacritize MO into Classical Orthography (CO), in order to represent short vowels and stress, before generating pronunciations in terms of Simple Phonemes (SP). We develop algorithms that generate CP pronunciations from MO and SP pronunciations from CO, mapping each word to a single pronunciation. We investigate the performance of CP, SP, UG (Undiacritized Grapheme), and DG (Diacritized Grapheme) ASRs. The experimental results suggest that UG and DG are inferior to SP and CP. For the A-SpeechDB corpus with an MO vocabulary of 8000 words, the word error rates (WERs) for a bigram model and context-dependent (CD) phones are 11.78, 12.64, and 13.59 % for CP, SP_M (SP from manually diacritized CO), and SP_A (SP from automatically diacritized MO), respectively. For a vocabulary of 24,000 MO words, the corresponding WERs are 13.69, 15.08, and 16.86 %. For a uniform statistical model, SP has a lower WER than CP. For context-independent (CI) phones, CP has a lower WER than SP.
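
To make the definition "a CP is a set of alternative sequences of phonemes" concrete, the following minimal Python sketch enumerates one possible set of alternatives for a single consonant letter. It is an illustration only, not the paper's CP inventory or algorithm: the function name `composite_phoneme`, the vowel symbols `a`, `i`, `u`, and the choice to write gemination as a doubled consonant are all assumptions made for this example.

```python
from itertools import product

# Assumed symbols for the three Arabic short vowels (illustrative only).
SHORT_VOWELS = ("a", "i", "u")


def composite_phoneme(consonant: str) -> set[tuple[str, ...]]:
    """Return a hypothetical CP for one consonant letter.

    The CP is modelled as the set of simple-phoneme sequences that the
    undiacritized letter could stand for: the consonant, optionally
    geminated (written here as a doubled consonant), optionally followed
    by one short vowel.
    """
    alternatives = set()
    for geminated, vowel in product((False, True), (None, *SHORT_VOWELS)):
        sequence = (consonant, consonant) if geminated else (consonant,)
        if vowel is not None:
            sequence += (vowel,)
        alternatives.add(sequence)
    return alternatives


cp_b = composite_phoneme("b")
for sequence in sorted(cp_b):
    print(" ".join(sequence))
```

Under these assumptions each consonant yields eight alternatives (two gemination choices times four vowel choices, counting "no vowel"), which is why a word in MO can be given a single CP pronunciation without first being diacritized.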



Fig. 1: WER of the ASR system with a uniform LM. Top plot: context independent; bottom plot: context dependent. Undiacritized Grapheme (UG), 5 emitting states (dot); Diacritized Grapheme (DG), 4 emitting states (short dash); Composite Phoneme (CP), 5 emitting states (solid); Simple Phoneme, Manual Diacritization (SP_M), 4 emitting states (medium dash); Simple Phoneme, Manual Diacritization with single-state short vowels (SP_M1), 4 emitting states (medium dash dot); Simple Phoneme, Automatic Diacritization (SP_A), 4 emitting states (long dash).
© Copyright Policy - OpenAccess

Mentions: Figures 1 and 2 graph the WER of the ASRs versus the number of mixtures for the optimal number of HMM states. The left plot is for CI pronunciation units and the right plot is for CD pronunciation units. In all cases the word error rate calculation is based on the undiacritized forms (NIST style scoring) (Saon et al. 2010).
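
As a rough illustration of scoring on undiacritized forms, the sketch below strips the Arabic diacritic marks from both the reference and the hypothesis and then computes WER with a word-level Levenshtein distance. It is not the NIST scoring tool or the authors' script: the function names `undiacritize` and `wer` are hypothetical, and the choice to treat exactly the Unicode range U+064B to U+0652 as diacritics is an assumption of this example.

```python
import re

# Assumption: diacritics are the Arabic marks U+064B..U+0652
# (tanween, fatha, damma, kasra, shadda, sukun).
DIACRITICS = re.compile("[\u064B-\u0652]")


def undiacritize(text: str) -> str:
    """Remove Arabic diacritic marks, leaving the base letters."""
    return DIACRITICS.sub("", text)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, computed on undiacritized words."""
    ref = undiacritize(reference).split()
    hyp = undiacritize(hypothesis).split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return 100.0 * previous[-1] / len(ref)


print(wer("a b c d", "a x c d"))  # one substitution in four words: 25.0
```

Because both strings are undiacritized before alignment, a hypothesis that differs from the reference only in its diacritics counts as correct, which lets diacritized and undiacritized systems be compared on the same footing.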


