Automatic sleep spindle detection: benchmarking with fine temporal resolution using open science tools.

O'Reilly C, Nielsen T - Front Hum Neurosci (2015)

Bottom Line: The tediousness of this manual task has motivated efforts for its automation. Results highlight several findings: (1) that expert scoring and polysomnographic databases are important confounders when comparing the performance of spindle detectors tested using different databases or scorings; (2) because spindles are sparse events, specificity estimates are potentially misleading for assessing automated detector performance; (3) reporting the performance of spindle detectors exclusively with sensitivity and specificity estimates, as is often seen in the literature, is insufficient; including sensitivity, precision, and a more comprehensive statistic such as the Matthews correlation coefficient, F1-score, or Cohen's κ is necessary for adequate evaluation; (4) reporting statistics for some reasonable range of decision thresholds provides a much more complete and useful benchmarking; (5) performance differences between tested automated detectors were found to be similar to those between available expert scorings; (6) much more development is needed to effectively compare the performance of spindle detectors developed by different research teams. Finally, this work clarifies a long-standing but only seldom posed question regarding whether expert scoring truly is a reliable gold standard for sleep spindle assessment.


Affiliation: MEG Laboratory, McConnell Brain Imaging Centre, Montreal Neurological Institute, McGill University, Montreal, QC, Canada; Dream and Nightmare Laboratory, Center for Advanced Research in Sleep Medicine, Hôpital du Sacré-Coeur de Montréal, Montreal, QC, Canada; Département de Psychiatrie, Université de Montréal, Montreal, QC, Canada.

ABSTRACT
Sleep spindle properties index cognitive faculties such as memory consolidation and diseases such as major depression. For this reason, scoring sleep spindle properties in polysomnographic recordings has become an important activity in both research and clinical settings. The tediousness of this manual task has motivated efforts for its automation. Although some progress has been made, increasing the temporal accuracy of spindle scoring and improving the performance assessment methodology are two aspects needing more attention. In this paper, four open-access automated spindle detectors with fine temporal resolution are proposed and tested against expert scoring of two proprietary and two open-access databases. Results highlight several findings: (1) that expert scoring and polysomnographic databases are important confounders when comparing the performance of spindle detectors tested using different databases or scorings; (2) because spindles are sparse events, specificity estimates are potentially misleading for assessing automated detector performance; (3) reporting the performance of spindle detectors exclusively with sensitivity and specificity estimates, as is often seen in the literature, is insufficient; including sensitivity, precision, and a more comprehensive statistic such as the Matthews correlation coefficient, F1-score, or Cohen's κ is necessary for adequate evaluation; (4) reporting statistics for some reasonable range of decision thresholds provides a much more complete and useful benchmarking; (5) performance differences between tested automated detectors were found to be similar to those between available expert scorings; (6) much more development is needed to effectively compare the performance of spindle detectors developed by different research teams. Finally, this work clarifies a long-standing but only seldom posed question regarding whether expert scoring truly is a reliable gold standard for sleep spindle assessment.
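
As a concrete illustration of point (3), the following minimal Python sketch computes the statistics named in the abstract (sensitivity, precision, specificity, F1-score, Matthews correlation coefficient, and Cohen's κ) from per-sample TP/FP/TN/FN counts. It is an illustrative example only: the function names and the use of NumPy are assumptions for clarity and do not reproduce the authors' open-source tools.

import numpy as np

def confusion_counts(gold, test):
    # Per-sample confusion counts from two boolean spindle masks (True = spindle).
    tp = int(np.sum(gold & test))
    fp = int(np.sum(~gold & test))
    tn = int(np.sum(~gold & ~test))
    fn = int(np.sum(gold & ~test))
    return tp, fp, tn, fn

def benchmark_stats(tp, fp, tn, fn):
    # Sensitivity (recall), precision, specificity, F1, MCC, and Cohen's kappa.
    sens = tp / (tp + fn)
    prec = tp / (tp + fp)
    spec = tn / (tn + fp)
    f1 = 2 * prec * sens / (prec + sens)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        float(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    n = tp + fp + tn + fn
    po = (tp + tn) / n                                            # observed agreement
    pe = ((tp + fp) * (tp + fn) + (tn + fp) * (tn + fn)) / n**2   # chance agreement
    kappa = (po - pe) / (1 - pe)
    return {"sensitivity": sens, "precision": prec, "specificity": spec,
            "F1": f1, "MCC": mcc, "kappa": kappa}

Because spindles are sparse, TN dominates the counts, which is why specificity stays near 1 almost regardless of detector quality, whereas precision, F1, MCC, and κ remain informative.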

No MeSH data available.



Figure 1: The left panel shows six common situations [labeled as (a–f)] occurring when comparing the detection of a gold standard scorer (Gold) with another scorer (Test). The x-axis on these plots represents time. On the y-axis, a high (low) value indicates the presence (absence) of a spindle. For example, case (a) shows perfect agreement between the gold standard and the tested scorer. Resulting assessments (TN, TP, FP, and FN, in percent) for the proposed signal-sample-based approach and for the window-based method used in Devuyst et al. (2011) are given in the right panel. Note: The length of the scored signal is taken as being 1 s, such that only one decision is taken for the window-based method, whereas there are fs decisions for the signal-sample-based method.

Mentions: Although the assessment method proposed in Devuyst et al. (2011) might be adequate when spindles are detected for sleep stage scoring, it does not assess sleep spindles with fine temporal resolution. From that paper, we can only infer that a 1-s scoring window was used for choosing between true positive (TP), false positive (FP), true negative (TN), and false negative (FN) cases, as this was not explicitly stated in the methods. A high temporal resolution alternative to this approach is to consider spindle scoring at the signal-sampling scale (i.e., for an fs = 256 Hz sampling rate, 256 TP, FP, TN, or FN outcomes are counted per second of recorded signal). As shown in Figure 1, this signal-sample-based approach (equivalent to the “by-sample” evaluation in Warby et al., 2014) allows for finer assessment and resolves some ambiguities that occur with a window-based approach (as in Devuyst et al., 2011). For example, it is not clear whether condition (e) in Figure 1 should be counted as TP, FP, or FN because the spindles detected by the two scorers are not synchronized; the degree of allowed asynchrony is directly related to the width of the decision window.
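
To make the by-sample evaluation concrete, the sketch below assumes that each scoring is available as a list of (onset, duration) events in seconds, rasterizes it into a per-sample boolean mask, and counts TP, FP, FN, and TN sample by sample. A detection shifted relative to the expert's, as in case (e) of Figure 1, is then credited in proportion to its overlap rather than forced into a single window-level decision. The event times, the 256 Hz sampling rate, and the helper name events_to_mask are illustrative assumptions, not values or functions taken from the paper.

import numpy as np

def events_to_mask(events, n_samples, fs):
    # Rasterize (onset_s, duration_s) spindle events into a per-sample boolean mask.
    mask = np.zeros(n_samples, dtype=bool)
    for onset, duration in events:
        start = int(round(onset * fs))
        stop = int(round((onset + duration) * fs))
        mask[start:stop] = True
    return mask

# Example: 1 s of signal at fs = 256 Hz, as in the note of Figure 1.
fs, n = 256, 256
gold = events_to_mask([(0.20, 0.50)], n, fs)   # hypothetical expert-scored spindle
test = events_to_mask([(0.35, 0.50)], n, fs)   # hypothetical detection, shifted by 150 ms

tp = int(np.sum(gold & test))    # samples both scorers mark as spindle
fp = int(np.sum(~gold & test))   # samples only the detector marks
fn = int(np.sum(gold & ~test))   # samples only the expert marks
tn = int(np.sum(~gold & ~test))  # samples both mark as background

print(tp, fp, fn, tn)  # 256 per-sample decisions instead of one 1-s window decision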

