Towards an integrated food safety surveillance system: a simulation study to explore the potential of combining genomic and epidemiological metadata

View Article: PubMed Central - PubMed

ABSTRACT

Foodborne infection is a result of exposure to complex, dynamic food systems. The efficiency of foodborne infection is driven by ongoing shifts in genetic machinery. Next-generation sequencing technologies can provide high-fidelity data about the genetics of a pathogen. However, food safety surveillance systems do not currently provide similar high-fidelity epidemiological metadata to associate with genetic data. As a consequence, it is rarely possible to transform genetic data into actionable knowledge that can be used to genuinely inform risk assessment or prevent outbreaks. Big data approaches are touted as a revolution in decision support, and pose a potentially attractive method for closing the gap between the fidelity of genetic and epidemiological metadata for food safety surveillance. We therefore developed a simple food chain model to investigate the potential benefits of combining ‘big’ data sources, including both genetic and high-fidelity epidemiological metadata. Our results suggest that, as for any surveillance system, the collected data must be relevant and characterize the important dynamics of a system if we are to properly understand risk: this suggests the need to carefully consider data curation, rather than the more ambitious claims of big data proponents that unstructured and unrelated data sources can be combined to generate consistent insight. Of interest is that the biggest influencers of foodborne infection risk were contamination load and processing temperature, not genotype. This suggests that understanding food chain dynamics would probably more effectively generate insight into foodborne risk than prescribing the hazard in ever more detail in terms of genotype.


Figure 13 (RSOS160721F13). Confusion matrix for the best performing model (maximum number of splits: 7; learning rate: 0.01; number of trees: 200).

One of the key aspects of machine learning is the division of the original dataset into training and evaluation datasets. The training dataset is used to develop the boosted decision tree solution, and the evaluation dataset is used to test the predictive power of the resulting algorithm. We therefore assessed how well the optimal predictive algorithm predicts whether a case is reported, using the evaluation dataset set aside from the original dataset. The results are displayed as a confusion matrix (figure 13), which gives the numbers of true positives, false positives, true negatives and false negatives. From these figures, we can work out the sensitivity of the predictive model, Se = 1102/(1102 + 315) = 0.78, and the specificity, Sp = 1 571 036/(1 571 036 + 82 378) = 0.95. Therefore, our predictive model would be relatively good at detecting combinations of environmental and genomic characteristics that lead to cases. Hence, if such big data were available, the authorities could very well use such techniques to identify and monitor potentially high-risk scenarios. However, our results exhibit a classic example of the problem of looking for a rare event: we can detect roughly three out of four cases, but the total predicted cases are dominated by false positives owing to an over-abundance of negative results. Therefore, the predictive power of the model is poor: the positive predictive value is only 1102/(1102 + 82 378) ≈ 0.01.
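The arithmetic in this paragraph can be reproduced with a short Python sketch. The four counts are taken from the sensitivity and specificity expressions quoted in the text; their assignment to true/false positives and negatives is inferred from those expressions rather than read from the figure itself. The names `ConfusionMatrix`, `ppv_from_rates` and `figure_13` are illustrative and do not come from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts inferred from the Se and Sp expressions in the text (an assumption)."""
    tp: int  # reported cases predicted as cases
    fn: int  # reported cases predicted as non-cases
    tn: int  # non-cases predicted as non-cases
    fp: int  # non-cases predicted as cases

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    @property
    def ppv(self) -> float:
        # Positive predictive value: share of predicted cases that are real cases.
        return self.tp / (self.tp + self.fp)

    @property
    def prevalence(self) -> float:
        total = self.tp + self.fn + self.tn + self.fp
        return (self.tp + self.fn) / total


def ppv_from_rates(se: float, sp: float, prevalence: float) -> float:
    """Bayes' rule: PPV as a function of sensitivity, specificity and prevalence."""
    true_pos = se * prevalence
    false_pos = (1.0 - sp) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


figure_13 = ConfusionMatrix(tp=1102, fn=315, tn=1_571_036, fp=82_378)

print(f"Se         = {figure_13.sensitivity:.2f}")
print(f"Sp         = {figure_13.specificity:.2f}")
print(f"PPV        = {figure_13.ppv:.2f}")
print(f"prevalence = {figure_13.prevalence:.5f}")

# Same Se and Sp, but a hypothetical prevalence of 10%: PPV rises sharply,
# which illustrates why rare events are dominated by false positives.
print(f"PPV at 10% prevalence = "
      f"{ppv_from_rates(figure_13.sensitivity, figure_13.specificity, 0.10):.2f}")
```

Holding sensitivity and specificity fixed while varying the prevalence shows that the low positive predictive value follows from how rare reported cases are in the evaluation dataset (fewer than one in a thousand records under these counts), not from a weak classifier as such.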

