Brain-inspired speech segmentation for automatic speech recognition using the speech envelope as a temporal reference

View Article: PubMed Central - PubMed

ABSTRACT

Speech segmentation is a crucial step in automatic speech recognition because subsequent speech analyses are performed on each framed speech segment. Conventional segmentation techniques primarily segment speech using a fixed frame size for computational simplicity. However, this approach is insufficient for capturing the quasi-regular structure of speech, which causes substantial recognition failure in noisy environments. How does the brain handle quasi-regularly structured speech and maintain high recognition performance under such circumstances? Recent neurophysiological studies have suggested that the phase of neuronal oscillations in the auditory cortex contributes to accurate speech recognition by guiding speech segmentation into smaller units at different timescales. A phase-locked relationship between neuronal oscillations and the speech envelope has recently been observed, which suggests that the speech envelope provides a foundation for multi-timescale speech segmental information. In this study, we quantitatively investigated the role of the speech envelope as a potential temporal reference for segmenting speech using its instantaneous phase information. We evaluated the proposed approach by the achieved information gain and the recognition performance in various noisy environments. The results indicate that the proposed segmentation scheme not only extracts more information from speech but also provides greater robustness in a recognition test.

No MeSH data available.


Figure 1. Schematic of speech segmentation in an automatic speech recognition (ASR) system and in the brain. Segmenting continuous speech into short frames is the first step in the speech recognition process. In an ASR system, the most widely used speech segmentation approach employs fixed-size external time bins as a reference ('time-partitioned'). This approach is computationally simple but is limited in its ability to reflect the quasi-regular structure of speech. Alternatively, the brain, which does not have an external timing reference, uses an intrinsic slow (neuronal) oscillatory signal as a segmentation reference. This oscillatory signal is phase-locked with the speech envelope during comprehension, which allows the quasi-regular temporal dynamics of speech to be reflected in segmentation. The phase of this oscillatory signal is separated into four phase quadrants (φi). The speech waveform and speech-induced spike trains are segmented and color-coded by the phase angle of the reference oscillatory signal ('phase-partitioned'). This segmentation approach can potentially generate unequally sized time bins depending on the temporal dynamics of speech. In this paper, we investigated whether the speech envelope can serve as a potential temporal reference for segmenting speech.
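The phase-quadrant segmentation described in the caption can be sketched in a few lines of signal processing. The envelope band (2–10 Hz) and the Hilbert-transform approach below are illustrative assumptions, not the paper's exact parameters:

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def phase_partition(speech, fs, band=(2.0, 10.0)):
    """Segment speech by the instantaneous phase of its low-frequency envelope.

    Hypothetical sketch: the band edges and quadrant boundaries are
    assumptions for illustration, not the authors' exact design.
    """
    # 1. Amplitude envelope via the analytic signal.
    envelope = np.abs(hilbert(speech))
    # 2. Band-pass the envelope to a slow (theta-like) range.
    b, a = butter(2, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    slow_env = filtfilt(b, a, envelope)
    # 3. Instantaneous phase of the slow envelope, mapped to [0, 2*pi).
    phase = np.angle(hilbert(slow_env)) % (2 * np.pi)
    # 4. Assign each sample to one of four phase quadrants (phi_1..phi_4).
    quadrant = np.floor(phase / (np.pi / 2)).astype(int)  # values 0..3
    # 5. Frame boundaries occur wherever the quadrant index changes,
    #    yielding unequally sized bins driven by the speech dynamics.
    boundaries = np.flatnonzero(np.diff(quadrant)) + 1
    return quadrant, boundaries
```

Because the quadrant index follows the envelope's own phase, fast-changing regions of speech produce short bins and slowly varying regions produce long bins, in contrast to a fixed external clock.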
© Copyright Policy - open-access


Mentions: Segmenting continuous speech into short frames is the first step in the feature extraction process of an automatic speech recognition (ASR) system. Because subsequent feature extraction steps operate on each framed speech segment, adequate segmentation is necessary to capture the unique temporal dynamics within speech. The most commonly used speech segmentation technique in state-of-the-art ASR systems is the fixed frame size and rate (FFSR) technique, which segments input speech with a fixed frame size by shifting it forward at a fixed rate (conventionally a 25 ms frame with a 10 ms shift) (Fig. 1, top) [1]. Although the FFSR provides excellent speech recognition performance with clean speech, recognition performance rapidly degrades when noise corrupts speech. This degradation is primarily attributed to the fact that the FFSR cannot adapt to the quasi-regular structure of speech. The conventional frame size of 25 ms becomes insufficient because it can smear the dynamic properties of rapidly changing spectral characteristics within a speech signal, such as the peak of a stop consonant [2] or the transition region between phonemes [3,4]. Furthermore, the conventional frame shift rate of 10 ms is too sparse to capture short-duration attributes in a sufficient number of frames. As a result, the peaks of stop consonants or transition periods are easily smeared by noise, which causes recognition failure. Conversely, for the periodic parts of speech, such as vowels, the conventional frame size and shift rate cause unnecessary overlap, adding redundant information and causing insertion errors in noisy environments [5]. To overcome these problems, various speech segmentation techniques have been proposed [6]. The variable frame rate (VFR) technique is the most widely employed substitute for the FFSR scheme [4,5,7].
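The conventional FFSR framing described above (25 ms frames shifted every 10 ms) can be sketched as follows; the function name and padding policy are illustrative choices:

```python
import numpy as np

def ffsr_frames(speech, fs, frame_ms=25.0, shift_ms=10.0):
    """Fixed frame size and rate (FFSR) segmentation sketch:
    a fixed-length window shifted forward at a fixed rate,
    here the conventional 25 ms frame with a 10 ms shift.
    Trailing samples that do not fill a frame are dropped."""
    frame_len = int(round(frame_ms * fs / 1000.0))
    shift_len = int(round(shift_ms * fs / 1000.0))
    n_frames = 1 + max(0, (len(speech) - frame_len) // shift_len)
    # Each row is one frame; consecutive frames overlap by
    # frame_len - shift_len samples (15 ms at the default settings).
    return np.stack([
        speech[i * shift_len : i * shift_len + frame_len]
        for i in range(n_frames)
    ])
```

Note that every frame has the same length regardless of the local speech dynamics, which is exactly the limitation the text attributes to the FFSR scheme.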
The VFR technique extracts speech feature vectors with the FFSR scheme and then determines which frames to retain. This technique has been shown to improve performance in both clean and noisy environments compared with the FFSR scheme. However, it must examine speech at much shorter intervals (e.g., 2.5 ms), which requires repeatedly computing predefined distance measures and making frame selections between adjacent frames, resulting in high computational complexity [4,7].
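The VFR frame-selection step can be sketched as below. The Euclidean distance measure and the threshold are illustrative assumptions; published VFR variants use various distance measures and selection rules:

```python
import numpy as np

def vfr_select(features, threshold):
    """Variable frame rate (VFR) frame-selection sketch: features are
    first extracted at a dense, fixed rate, and a frame is retained only
    when its distance from the last retained frame exceeds a threshold.
    The repeated distance computations over densely spaced frames are
    the source of the computational cost noted in the text."""
    kept = [0]  # always retain the first frame
    for i in range(1, len(features)):
        if np.linalg.norm(features[i] - features[kept[-1]]) > threshold:
            kept.append(i)
    return kept
```

With this rule, rapidly changing regions (large frame-to-frame distances) keep many frames while steady regions (e.g., sustained vowels) are thinned out, which is the behavior that makes VFR more robust than FFSR at the price of extra computation.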

