Modeling auditory coding: from sound to spikes.

Rudnicki M, Schoppe O, Isik M, Völk F, Hemmert W - Cell Tissue Res. (2015)

Bottom Line: On the other hand, discrepancies between model results and measurements reveal gaps in our current knowledge, which can in turn be targeted by matched experiments. Models of the auditory periphery have improved greatly during the last decades, and account for many phenomena observed in experiments. It also provides uniform evaluation and visualization scripts, which allow for direct comparisons between models.


Affiliation: Department of Electrical and Computer Engineering, Technische Universität München, München, Germany.

ABSTRACT
Models are valuable tools to assess how deeply we understand complex systems: only if we are able to replicate the output of a system based on the function of its subcomponents can we assume that we have probably grasped its principles of operation. Conversely, discrepancies between model results and measurements reveal gaps in our current knowledge, which can in turn be targeted by matched experiments. Models of the auditory periphery have improved greatly over the last decades and account for many phenomena observed in experiments. While the cochlea is only partly accessible in experiments, models can extrapolate its behavior without gaps from base to apex and for arbitrary input signals. With models we can, for example, evaluate speech coding with large speech databases, which is not possible experimentally, and models have been tuned to replicate features of the human hearing organ, for which practically no invasive electrophysiological measurements are available. Auditory models have become instrumental in evaluating models of neuronal sound processing in the auditory brainstem and even at higher levels, where they are used to provide realistic input. Finally, models can be used to illustrate how a system as complicated as the inner ear works by visualizing its responses. The big advantage here is that intermediate stages in various domains (mechanical, electrical, and chemical) are available, such that a consistent picture of how its output evolves can be drawn. However, it must be kept in mind that no model is (yet) able to replicate all physiological characteristics, and it is therefore critical to choose the most appropriate model, or models, for every research question. To facilitate this task, this paper not only reviews three recent auditory models but also introduces a framework that allows researchers to easily switch between them. The framework also provides uniform evaluation and visualization scripts, which allow for direct comparisons between models.

No MeSH data available.
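
The framework mentioned in the abstract is, in essence, a common Python interface with one run function per auditory model, so that exchanging models becomes a one-line change. The following minimal sketch illustrates the idea; the package name (cochlea), the set_dbspl helper, and the exact call signatures are assumptions based on the released framework and may differ in detail.

    import numpy as np
    import cochlea  # Python framework accompanying the paper (name assumed)

    fs = 100e3                           # Zilany et al. (2014) model runs at 100 kHz
    t = np.arange(0, 0.1, 1 / fs)
    tone = np.sin(2 * np.pi * 1000 * t)  # 100-ms, 1-kHz pure tone
    tone = cochlea.set_dbspl(tone, 60)   # scale to 60 dB SPL (helper assumed)

    # Per-fiber auditory nerve spike trains from one model:
    trains = cochlea.run_zilany2014(
        tone, fs,
        anf_num=(100, 75, 25),           # HSR, MSR, LSR fibers per channel
        cf=(125, 20000, 100),            # 100 CFs spaced from 125 Hz to 20 kHz
        species='human',
        seed=0,
    )
    # Switching models is then a one-line change, e.g. (after resampling the
    # input to the 48-kHz rate that model expects):
    # trains = cochlea.run_holmberg2007(tone_48k, 48e3, anf_num=(100, 75, 25), seed=0)

Because every model returns spike trains in the same format, the uniform evaluation and visualization scripts need no model-specific code, which is what enables direct comparisons such as the one below.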



Fig. 18: Results of an automatic speech recognition system evaluating rate-place code features of the noisy ISOLET database (which contains speech sounds from 0 dB SNR to clean) at different speech levels. The recognition score of the same system with classical Mel-frequency cepstral features was 74.8 % (dashed green line)

Mentions: Finally, we undertook a very high-level comparison among the models: we wanted to evaluate their discriminative ability to code speech sounds. For a fair comparison, we first equalized their rate thresholds; we decided to match the human hearing threshold, even though this might not be the optimal setting. Because the auditory models are computationally expensive, we could only use a small speech database, the noisy ISOLET (Holmberg et al. 2007). Acoustic features were extracted from the rate-place code by summing spikes from high-, medium-, and low-spontaneous-rate fibers (HSRs, MSRs, and LSRs) in overlapping Hanning windows (duration 25 ms, advanced by 10 ms; Holmberg 2009; Holmberg et al. 2007). The features were preprocessed by a multi-layer perceptron and then fed to a hidden Markov model (HMM) speech recognizer. The recognition system was trained and tested for each level individually, because ASR systems are known to adapt poorly to previously unseen variations in the feature space. For a detailed description of the system, see Holmberg et al. (2007) and Holmberg (2009). Recognition scores were averaged over the conditions 0, 5, 10, 15, and 20 dB SNR and clean speech (no noise added) and are plotted for different speech levels in Fig. 18. The MAP model reached the highest recognition scores, despite its relatively broad tuning and limited dynamic range. It evidently sustained a very good representation of the speech sounds in noise, which might be attributed to its efferent feedback mechanism. However, its performance was high only at low levels: already at medium levels (above 40 dB(A)), it decayed rapidly, probably because of the saturation of its rate-level functions. The Holmberg model was designed to cover a very broad dynamic range, which is also reflected in the results: the roll-off of recognition scores toward low and high levels was shallow. Still, the Zilany et al. (2014) model outperformed the Holmberg et al. (2007) model at all sound levels. This is probably due to its carefully tuned rate-level functions across the whole frequency range, but offset adaptation is also known to improve speech coding (Wang et al. 2008). In summary, the ability of auditory models to code speech is already well developed: all three outperform classical Mel-frequency cepstral coefficients (MFCCs), the "gold standard" of automatic speech recognition, which reached a recognition score of 74.8 % (level-independent) in the same setting. From these results we conclude that although auditory models are certainly not perfect yet, they are already powerful tools for providing rather realistic auditory nerve responses.
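
To make the feature extraction described above concrete, here is a minimal sketch (not the original Holmberg et al. (2007) code): it computes rate-place features by weighting binned spike counts with overlapping 25-ms Hanning windows advanced in 10-ms steps. It assumes the spike trains are already binned at rate fs and pooled across the HSR, MSR, and LSR fibers of each frequency channel; any normalization used in the original system is omitted.

    import numpy as np

    def rate_place_features(spikes, fs, win_ms=25.0, shift_ms=10.0):
        # spikes: (n_channels, n_samples) array of binned spike counts,
        # pooled over the HSR, MSR, and LSR fibers of each channel.
        win = int(round(win_ms * 1e-3 * fs))       # 25-ms window in samples
        shift = int(round(shift_ms * 1e-3 * fs))   # 10-ms frame shift in samples
        window = np.hanning(win)                   # Hanning taper, as in the paper
        n_frames = 1 + (spikes.shape[1] - win) // shift
        feats = np.empty((n_frames, spikes.shape[0]))
        for i in range(n_frames):
            seg = spikes[:, i * shift : i * shift + win]
            feats[i] = seg @ window                # weighted spike count per channel
        return feats

    # Example with hypothetical dimensions: 100 channels, 1 s of 1-ms bins.
    spikes = np.random.poisson(0.1, size=(100, 1000))
    feats = rate_place_features(spikes, fs=1000.0)  # -> shape (98, 100)

Each row of feats is then a rate-place vector that, in the setup above, would be decorrelated by the multi-layer perceptron before entering the HMM recognizer.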

