Brain-inspired speech segmentation for automatic speech recognition using the speech envelope as a temporal reference

View Article: PubMed Central - PubMed

ABSTRACT

Speech segmentation is a crucial step in automatic speech recognition because additional speech analyses are performed for each framed speech segment. Conventional segmentation techniques primarily segment speech using a fixed frame size for computational simplicity. However, this approach is insufficient for capturing the quasi-regular structure of speech, which causes substantial recognition failure in noisy environments. How does the brain handle quasi-regular structured speech and maintain high recognition performance under any circumstance? Recent neurophysiological studies have suggested that the phase of neuronal oscillations in the auditory cortex contributes to accurate speech recognition by guiding speech segmentation into smaller units at different timescales. A phase-locked relationship between neuronal oscillation and the speech envelope has recently been obtained, which suggests that the speech envelope provides a foundation for multi-timescale speech segmental information. In this study, we quantitatively investigated the role of the speech envelope as a potential temporal reference to segment speech using its instantaneous phase information. We evaluated the proposed approach by the achieved information gain and recognition performance in various noisy environments. The results indicate that the proposed segmentation scheme not only extracts more information from speech but also provides greater robustness in a recognition test.
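As a rough illustration of using instantaneous phase as a temporal reference, the sketch below computes the analytic signal of a band-limited component and marks a boundary wherever the phase wraps from +π to -π. It is a minimal sketch under stated assumptions, not the authors' implementation: the wrap-based boundary rule, the function names `analytic_signal` and `phase_wrap_boundaries`, and the toy 5 Hz signal are all hypothetical, and the naive DFT is only suitable for short demonstration signals.

```python
import cmath
import math


def analytic_signal(x):
    """Analytic signal of a real sequence via a naive O(n^2) DFT."""
    n = len(x)
    # Forward DFT.
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Keep DC (and Nyquist for even n), double positive frequencies,
    # zero out negative frequencies.
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            gain = 1.0
        elif k < (n + 1) // 2:
            gain = 2.0
        else:
            gain = 0.0
        spectrum[k] *= gain
    # Inverse DFT.
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def phase_wrap_boundaries(x):
    """Sample indices where the instantaneous phase wraps from +pi to -pi.

    Assumption: treating each wrap as a segment boundary is an illustrative
    rule, not one stated in the text.
    """
    phase = [cmath.phase(z) for z in analytic_signal(x)]
    return [t for t in range(1, len(phase)) if phase[t] - phase[t - 1] < -math.pi]


# Toy stand-in for a theta-band envelope component: 5 Hz, 1 s at 100 Hz.
fs = 100
toy_band = [math.cos(2 * math.pi * 5 * t / fs + 0.3) for t in range(fs)]
boundaries = phase_wrap_boundaries(toy_band)
print(boundaries)  # [10, 30, 50, 70, 90]
```

With a real envelope the boundaries would no longer be evenly spaced, which is the point of a phase-based reference: the frame size follows the signal rather than a fixed clock.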



Figure 4: Optimal combination of primary and secondary frequency band oscillation. Preference distribution analysis for determining the optimal combination of the primary and secondary frequency bands that extracts as much information as possible from each speech sample. Theta (4~10 Hz) as the primary frequency band oscillation and low gamma (25~35 Hz) as the secondary frequency band oscillation were chosen as the optimal combination for the temporal reference.
© Copyright Policy - open-access

The speech envelope is composed of multiple frequency bands, which indicates that the envelope contains various timescales that deliver different speech features. Among the various temporal modulation rates of the speech envelope, only slow modulation rates (<50 Hz) are likely to be preferred by the auditory system in the brain [22, 23, 24]. In this study, six typical frequency bands of the speech envelope (delta, 0.4~4 Hz; theta, 4~10 Hz; alpha, 11~16 Hz; beta, 16~25 Hz; low gamma, 25~35 Hz; and mid gamma, 35~50 Hz) were examined as candidates for the primary and secondary frequency band oscillations that organize the nested oscillatory reference. The effectiveness of the resulting nested oscillatory reference as a temporal guide for segmenting speech was measured by calculating the cochlea-scaled spectral entropy (CSE) [33], which represents the potential information gain from segmentation (refer to Methods for details). Greater unpredictability in speech is reflected by an increase in the CSE, which can be interpreted as providing more potential information. We searched for the dominant combinations of primary and secondary frequency band oscillations that form a nested oscillatory reference yielding the highest CSE. We subsequently compared the CSE value of the proposed NVFS scheme with those of other segmentation schemes to determine which scheme extracts more information from speech. A total of 1542 samples from the test set in Table 1 were employed to identify the dominant primary and secondary frequency band oscillation combinations. We plotted the preference distribution of the primary and secondary frequency band combinations that maximize the CSE (Fig. 4).

For the majority of the samples, the theta (4~10 Hz) range and the low gamma (25~35 Hz) range of the speech envelope served as the primary and secondary frequency band oscillations, respectively, in the segmentation that yielded the highest CSE. We therefore adopted the theta and low gamma band oscillations as the optimal combination of primary and secondary frequency band oscillations and further investigated the effectiveness of the NVFS scheme.
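The preference distribution described above amounts to a per-sample argmax followed by a tally. The following is a minimal sketch of that bookkeeping only. It assumes the CSE for each band combination has already been computed elsewhere; the function names, the rule that the primary band is always the slower of the two, and the toy CSE values are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations

# Band edges in Hz, as listed in the text.
BANDS = {
    "delta": (0.4, 4),
    "theta": (4, 10),
    "alpha": (11, 16),
    "beta": (16, 25),
    "low_gamma": (25, 35),
    "mid_gamma": (35, 50),
}


def candidate_pairs(bands=BANDS):
    """All (primary, secondary) pairs.

    Assumption: the primary band is always the slower of the two.
    """
    names = sorted(bands, key=lambda name: bands[name][0])
    return list(combinations(names, 2))


def preference_distribution(cse_by_sample):
    """Count how often each band pair gives the highest CSE for a sample.

    cse_by_sample: one dict per speech sample, mapping
    (primary, secondary) -> CSE value.
    """
    winners = Counter()
    for scores in cse_by_sample:
        best_pair = max(scores, key=scores.get)
        winners[best_pair] += 1
    return winners


# Made-up CSE values for three samples, for illustration only.
toy_scores = [
    {("theta", "low_gamma"): 1.9, ("delta", "beta"): 1.2},
    {("theta", "low_gamma"): 2.1, ("theta", "mid_gamma"): 2.0},
    {("delta", "alpha"): 1.7, ("theta", "low_gamma"): 1.5},
]
distribution = preference_distribution(toy_scores)
print(distribution.most_common(1))  # [(('theta', 'low_gamma'), 2)]
```

Under the slower-primary assumption, the six bands give 15 candidate pairs, so a full analysis would score each of the 1542 test samples against all 15 before tallying the winners.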

