Brain-inspired speech segmentation for automatic speech recognition using the speech envelope as a temporal reference

View Article: PubMed Central - PubMed

ABSTRACT

Speech segmentation is a crucial step in automatic speech recognition because additional speech analyses are performed for each framed speech segment. Conventional segmentation techniques primarily segment speech using a fixed frame size for computational simplicity. However, this approach is insufficient for capturing the quasi-regular structure of speech, which causes substantial recognition failure in noisy environments. How does the brain handle quasi-regular structured speech and maintain high recognition performance under any circumstance? Recent neurophysiological studies have suggested that the phase of neuronal oscillations in the auditory cortex contributes to accurate speech recognition by guiding speech segmentation into smaller units at different timescales. A phase-locked relationship between neuronal oscillations and the speech envelope has recently been observed, which suggests that the speech envelope provides a foundation for multi-timescale speech segmental information. In this study, we quantitatively investigated the role of the speech envelope as a potential temporal reference for segmenting speech using its instantaneous phase information. We evaluated the proposed approach in terms of the achieved information gain and recognition performance in various noisy environments. The results indicate that the proposed segmentation scheme not only extracts more information from speech but also provides greater robustness in a recognition test.

No MeSH data available.


Figure 5: Example of the NVFS segmentation scheme. (a) Waveform of a /pa/ utterance in Fig. 3. (b) Corresponding spectrogram of the signal. (c) Generated nested oscillatory reference. The waveform and its corresponding spectrogram are divided by the phase-quadrant boundaries of the nested oscillation in (c).
© Copyright Policy - open-access

Mentions: To qualitatively assess the effectiveness of the theta and low-gamma combination, we segmented the speech signal in Fig. 3 with the theta-low-gamma nested oscillatory reference. The results are presented in Fig. 5. Figure 5(a) and (b) show the waveform and the spectrogram of the speech, respectively. The speech waveform and its corresponding spectrogram are divided by the phase-quadrant boundaries of the nested oscillatory reference in Fig. 5(c). Note that the consonant and transition regions are captured with short frames, whereas the vowel regions are captured with relatively long frames. In this example, six of the nine frames are assigned to the consonant and transition regions, whereas three frames capture the vowel.
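The segmentation idea described above can be sketched in a few lines: band-limit an oscillatory reference, extract its instantaneous phase with the Hilbert transform, and place frame boundaries wherever the phase crosses into a new quadrant. The sketch below is a simplified, hypothetical illustration of that mechanism using a single theta-band reference; the paper's NVFS scheme nests theta and low-gamma oscillations derived from the speech envelope, which this toy version does not reproduce.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def phase_quadrant_boundaries(envelope, fs, band=(4.0, 8.0)):
    """Return sample indices where a band-limited oscillatory reference,
    derived from the envelope, crosses a phase-quadrant boundary.
    (Illustrative sketch: single theta band only, not the full nested
    theta/low-gamma reference of the NVFS scheme.)"""
    # Band-pass the envelope to the chosen band to obtain the reference.
    nyq = fs / 2.0
    b, a = butter(2, [band[0] / nyq, band[1] / nyq], btype="band")
    reference = filtfilt(b, a, envelope)
    # Instantaneous phase from the analytic signal (Hilbert transform).
    phase = np.angle(hilbert(reference))            # values in (-pi, pi]
    # Map each phase sample to one of four quadrants (0..3).
    quadrant = np.floor((phase + np.pi) / (np.pi / 2)).astype(int)
    # A frame boundary is placed wherever the quadrant index changes.
    boundaries = np.flatnonzero(np.diff(quadrant) != 0) + 1
    return boundaries

# Toy usage: a synthetic envelope modulated at 6 Hz, sampled at 1 kHz.
fs = 1000
t = np.arange(0, 1.0, 1.0 / fs)
env = 1.0 + 0.5 * np.sin(2 * np.pi * 6 * t)
bounds = phase_quadrant_boundaries(env, fs)
# A ~6 Hz reference passes through four quadrants per cycle, so on the
# order of 24 boundaries are expected over this one-second signal.
```

Because each oscillation cycle contributes four quadrant crossings, faster phase rotation (e.g. around consonants and transitions) yields shorter frames, while slower rotation over vowels yields longer frames, matching the behavior seen in Fig. 5.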

