Brain-inspired speech segmentation for automatic speech recognition using the speech envelope as a temporal reference

View Article: PubMed Central - PubMed

ABSTRACT

Speech segmentation is a crucial step in automatic speech recognition because additional speech analyses are performed for each framed speech segment. Conventional segmentation techniques primarily segment speech using a fixed frame size for computational simplicity. However, this approach is insufficient for capturing the quasi-regular structure of speech, which causes substantial recognition failure in noisy environments. How does the brain handle quasi-regular structured speech and maintain high recognition performance under any circumstance? Recent neurophysiological studies have suggested that the phase of neuronal oscillations in the auditory cortex contributes to accurate speech recognition by guiding speech segmentation into smaller units at different timescales. A phase-locked relationship between neuronal oscillation and the speech envelope has recently been obtained, which suggests that the speech envelope provides a foundation for multi-timescale speech segmental information. In this study, we quantitatively investigated the role of the speech envelope as a potential temporal reference to segment speech using its instantaneous phase information. We evaluated the proposed approach by the achieved information gain and recognition performance in various noisy environments. The results indicate that the proposed segmentation scheme not only extracts more information from speech but also provides greater robustness in a recognition test.



Figure 2: Flow chart of the proposed speech segmentation scheme. A flow chart that shows the computation of the speech segmentation boundaries for the NVFS scheme.
© Copyright Policy - open-access


Mentions: During speech comprehension in the brain, the presence of important events is indicated by changes in the instantaneous phase of nested neuronal oscillations (refs 9, 32). Following these observations, the nested oscillatory reference effect in the auditory system is modeled in three steps: (i) extract primary and secondary frequency band oscillations from the speech envelope as speech segmental references; (ii) partition the primary and secondary frequency band oscillations using their phase quadrant boundaries as the frame start and end points; and (iii) couple the primary and secondary frequency band oscillations such that the property of the primary frequency band oscillation shapes the appearance of the secondary frequency band oscillation. Specifically, if the energy of a framed speech segment created by the primary frequency band oscillation falls within the pre-determined threshold range, the oscillatory reference of the corresponding region is replaced with the secondary frequency band oscillation (refer to Methods for details on creating a nested oscillatory reference). A flow chart that describes the computation of the NVFS scheme is shown in Fig. 2.
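
Steps (ii) and (iii) above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the instantaneous phases are synthesized from ideal sinusoids instead of being extracted from a real speech envelope (step (i) is skipped), and the band frequencies (5 Hz and 30 Hz), the use of mean squared amplitude as the frame energy, the threshold range, and all function names are assumptions chosen only for illustration.

```python
import math

def wrapped_phase(freq_hz, n_samples, fs):
    """Phase of an ideal sinusoid wrapped to [-pi, pi).

    Hypothetical stand-in for the instantaneous phase of a band-limited
    envelope oscillation; real phases would come from the speech envelope.
    """
    two_pi = 2 * math.pi
    return [((two_pi * freq_hz * n / fs + math.pi) % two_pi) - math.pi
            for n in range(n_samples)]

def quadrant(phi):
    """Map a phase in [-pi, pi] to a quadrant index 0..3."""
    return min(3, int((phi + math.pi) // (math.pi / 2)))

def quadrant_boundaries(phase):
    """Step (ii): sample indices where the phase enters a new quadrant.

    The first sample and the end of the signal are included, so that
    consecutive entries are the start and end points of the frames.
    """
    cuts = [0]
    for n in range(1, len(phase)):
        if quadrant(phase[n]) != quadrant(phase[n - 1]):
            cuts.append(n)
    cuts.append(len(phase))
    return cuts

def nested_boundaries(signal, primary_phase, secondary_phase, e_lo, e_hi):
    """Step (iii): primary frames by default, secondary frames where the
    energy of a primary frame lies inside [e_lo, e_hi].

    Energy is taken as the mean squared amplitude (an assumption).
    """
    p_cuts = quadrant_boundaries(primary_phase)
    s_cuts = quadrant_boundaries(secondary_phase)
    out = []
    for start, end in zip(p_cuts[:-1], p_cuts[1:]):
        out.append(start)
        energy = sum(x * x for x in signal[start:end]) / (end - start)
        if e_lo <= energy <= e_hi:
            # Replace the reference in this region with the secondary one.
            out.extend(c for c in s_cuts if start < c < end)
    out.append(len(signal))
    return out

# Toy input: a 100 Hz tone that is loud for 0.5 s and quiet for 0.5 s.
fs, n = 1000, 1000
speech = [(1.0 if i < 500 else 0.1) * math.sin(2 * math.pi * 100 * i / fs)
          for i in range(n)]
primary = wrapped_phase(5, n, fs)      # assumed primary band oscillation
secondary = wrapped_phase(30, n, fs)   # assumed secondary band oscillation

cuts = nested_boundaries(speech, primary, secondary, e_lo=0.1, e_hi=1.0)
frames = list(zip(cuts[:-1], cuts[1:]))
print(len(quadrant_boundaries(primary)) - 1, "primary frames,",
      len(frames), "nested frames")
```

In this toy input only the loud half falls inside the assumed energy range, so the frames there are subdivided by the faster oscillation while the quiet half keeps the longer primary frames. Frame sizes therefore vary with the signal instead of being fixed.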


