Brain-inspired speech segmentation for automatic speech recognition using the speech envelope as a temporal reference


ABSTRACT

Speech segmentation is a crucial step in automatic speech recognition because additional speech analyses are performed for each framed speech segment. Conventional segmentation techniques primarily segment speech using a fixed frame size for computational simplicity. However, this approach is insufficient for capturing the quasi-regular structure of speech, which causes substantial recognition failure in noisy environments. How does the brain handle quasi-regular structured speech and maintain high recognition performance under any circumstance? Recent neurophysiological studies have suggested that the phase of neuronal oscillations in the auditory cortex contributes to accurate speech recognition by guiding speech segmentation into smaller units at different timescales. A phase-locked relationship between neuronal oscillation and the speech envelope has recently been obtained, which suggests that the speech envelope provides a foundation for multi-timescale speech segmental information. In this study, we quantitatively investigated the role of the speech envelope as a potential temporal reference to segment speech using its instantaneous phase information. We evaluated the proposed approach by the achieved information gain and recognition performance in various noisy environments. The results indicate that the proposed segmentation scheme not only extracts more information from speech but also provides greater robustness in a recognition test.
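The idea of using the envelope's instantaneous phase as a segmentation reference can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the input is an already band-limited envelope component (for example, a theta-band component; the filtering step is not shown), builds the analytic signal with a naive DFT, and places a frame boundary wherever the phase wraps from +π to −π. The function names `analytic_signal` and `phase_wrap_boundaries`, and the phase-wrap boundary rule itself, are assumptions made for illustration.

```python
import cmath
import math


def analytic_signal(x):
    """Analytic signal of a real sequence via a naive O(N^2) DFT.

    Adequate for short illustrative inputs only.
    """
    n = len(x)
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Keep DC (and Nyquist when n is even), double the positive
    # frequencies, and zero the negative frequencies.
    gains = [0.0] * n
    gains[0] = 1.0
    if n % 2 == 0:
        gains[n // 2] = 1.0
        for k in range(1, n // 2):
            gains[k] = 2.0
    else:
        for k in range(1, (n + 1) // 2):
            gains[k] = 2.0
    return [
        sum(gains[k] * spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
            for k in range(n)) / n
        for t in range(n)
    ]


def phase_wrap_boundaries(envelope_band):
    """Sample indices where the instantaneous phase wraps from +pi to -pi.

    Assumption: one frame boundary per phase cycle of the given band.
    """
    phase = [cmath.phase(z) for z in analytic_signal(envelope_band)]
    return [t for t in range(1, len(phase))
            if phase[t] - phase[t - 1] < -math.pi]


if __name__ == "__main__":
    # Synthetic 4 Hz "envelope" sampled at 100 Hz for one second.
    band = [math.cos(2 * math.pi * 4 * t / 100) for t in range(100)]
    print(phase_wrap_boundaries(band))
```

With a slow band the boundaries are far apart and with a fast band they are close together, which is one way to picture segmentation at different timescales.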



Figure 6: Calculation of information gain by various speech segmentation schemes. (a) The CSEs of various speech segmentation schemes are compared using the speech signal in Fig. 3. The CSE of the NVFS scheme is compared with the CSE of (i) the reversed frame order, (ii) random segmentation averaged over 1000 trials, and (iii) the conventional FFSR scheme (25 ms frame and 10 ms shift). (b) CSE comparison of 1542 samples from the test set.
© Copyright Policy - open-access
getmorefigures.php?uid=PMC5120313&req=5


To verify the significance of the NVFS scheme, we compared the potential information gain obtained by different segmentation schemes. First, we quantified the effectiveness of the NVFS scheme against other segmentation schemes by calculating the CSE of the speech in Fig. 5. The CSE of the NVFS scheme was compared with the CSE of (i) the reversed frame order of the NVFS, in which the order of the segmentation reference obtained by the NVFS is reversed; (ii) random segmentation into nine frames (the same number of frames extracted by the NVFS for the speech in Fig. 5), for which the CSE was averaged over 1000 repetitions; and (iii) the FFSR scheme (i.e., 25 ms frame with a 10 ms shift), which is the conventional speech segmentation paradigm in ASR. Figure 6(a) shows the results of the analysis. The theta-low gamma nested oscillation has the highest CSE, which suggests that this type of segmentation scheme can effectively extract information from the speech signal. An additional analysis was performed with the 1542 samples from the test set, as in previous experiments. Each speech signal was segmented based on the NVFS scheme (theta-low gamma combination), the reversed frame order of the NVFS, and the FFSR scheme (25 ms frame and 10 ms shift). The CSE for each segmentation scheme was averaged over all samples (Fig. 6(b)). The results indicate that the NVFS (theta-low gamma combination) scheme provides the highest information gain.
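The two baseline segmentations in this comparison can be sketched as follows. This is a hedged illustration, not the code used in the study. `ffsr_frames` builds fixed 25 ms frames with a 10 ms shift, and `random_frames` cuts a signal into a given number of contiguous frames at random points. The CSE computation is not specified in this section, so `frame_change_score` is a hypothetical stand-in (summed change in mean frame energy) used only to show how a score would be averaged over 1000 random segmentations. All function names and the seed handling are assumptions.

```python
import random


def ffsr_frames(n_samples, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Fixed frame size and rate: (start, end) indices of all full frames."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    return [(s, s + frame_len)
            for s in range(0, n_samples - frame_len + 1, shift)]


def random_frames(n_samples, n_frames, rng):
    """Split [0, n_samples) into n_frames contiguous, non-empty frames."""
    cuts = sorted(rng.sample(range(1, n_samples), n_frames - 1))
    edges = [0] + cuts + [n_samples]
    return list(zip(edges[:-1], edges[1:]))


def frame_change_score(signal, frames):
    """Hypothetical stand-in score, NOT the CSE from the paper.

    Sums the absolute change in mean energy between successive frames.
    """
    energies = [sum(x * x for x in signal[a:b]) / (b - a) for a, b in frames]
    return sum(abs(e2 - e1) for e1, e2 in zip(energies, energies[1:]))


def mean_random_score(signal, n_frames, repetitions=1000, seed=0):
    """Average the stand-in score over repeated random segmentations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repetitions):
        frames = random_frames(len(signal), n_frames, rng)
        total += frame_change_score(signal, frames)
    return total / repetitions


if __name__ == "__main__":
    sample_rate = 16000  # assumed sampling rate for the example
    frames = ffsr_frames(sample_rate, sample_rate)  # one second of audio
    print(len(frames), frames[:2])
```

Replacing `frame_change_score` with an actual CSE implementation would leave the rest of the comparison loop unchanged.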

