Genome analysis of Legionella pneumophila strains using a mixed-genome microarray.

Euser SM, Nagelkerke NJ, Schuren F, Jansen R, Den Boer JW - PLoS ONE (2012)

Bottom Line: The distribution of Legionella genotypes within clinical strains is significantly different from that found in environmental strains. The Random Forest algorithm combined with a logistic regression model was used to select predictive markers and to construct a predictive model that could discriminate between strains of different origin: clinical or environmental. The Random Forest algorithm is well suited for the development of prediction models that use mixed-genome microarray data to discriminate between Legionella strains of different origin.


Affiliation: Regional Public Health Laboratory Kennemerland, Haarlem, The Netherlands. s.euser@streeklabhaarlem.nl

ABSTRACT

Background: Legionella, the causative agent of Legionnaires' disease, is ubiquitous in both natural and man-made aquatic environments. The distribution of Legionella genotypes within clinical strains is significantly different from that found in environmental strains. Developing novel genotypic methods that can distinguish clinical from environmental strains could help control efforts focus on the more relevant (virulent) Legionella species. Mixed-genome microarray data can be used to perform a comparative-genome analysis of strain collections, and advanced statistical approaches, such as the Random Forest algorithm, are available to process these data.

Methods: Microarray analysis was performed on a collection of 222 Legionella pneumophila strains, which included patient-derived strains from notified cases in The Netherlands in the period 2002-2006 and the environmental strains that were collected during the source investigation for those patients within the Dutch National Legionella Outbreak Detection Programme. The Random Forest algorithm combined with a logistic regression model was used to select predictive markers and to construct a predictive model that could discriminate between strains of different origin: clinical or environmental.

Results: Four genetic markers were selected that correctly predicted 96% of the clinical strains and 66% of the environmental strains collected within the Dutch National Legionella Outbreak Detection Programme.

Conclusions: The Random Forest algorithm is well suited for the development of prediction models that use mixed-genome microarray data to discriminate between Legionella strains of different origin. Identifying these predictive genetic markers could make it possible to identify virulence factors within the Legionella genome, which in the future may be implemented in the daily practice of controlling Legionella in the public health environment.


Figure 1: Receiver operating characteristic (ROC) curve for the eleventh test dataset (n = 111). This figure shows the performance of the logistic regression model with the four predictive markers (7B8, 15D6, 16E4, 33F8) for the eleventh test dataset. The proportion of the predicted clinical strains out of the truly clinical strains (true positive rate or sensitivity) is plotted against the proportion of the predicted clinical strains out of the truly environmental strains (false positive rate or 1-specificity).
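
To make the construction of such a curve concrete, the following is a minimal sketch in plain Python, not the authors' code. It sweeps a cut-off over predicted probabilities, records the (false positive rate, true positive rate) pair at each step, and integrates the curve with the trapezoidal rule to obtain the AUC. The function names and the six toy scores are hypothetical, and the sketch assumes that a strain is called clinical when its score is at or above the cut-off and that both classes are present.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs.

    scores: predicted probability of being a clinical strain.
    labels: 1 = truly clinical, 0 = truly environmental.
    Assumes at least one strain of each class.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # cut-off above every score: nothing called clinical
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical toy data, not taken from the study.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
toy_labels = [1, 1, 0, 1, 0, 0]

toy_points = roc_points(toy_scores, toy_labels)
toy_auc = auc_trapezoid(toy_points)
print(toy_points)
print(round(toy_auc, 4))
```

The AUC computed this way equals the c-statistic: the probability that a randomly chosen clinical strain receives a higher score than a randomly chosen environmental strain, with ties counted as one half.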

Mentions: Among the 250 markers with the highest “importance” rank selected by the Random Forest analyses of the 10 randomly selected training datasets, 51 markers were present more than once. The 25 genetic markers that were present in the majority of the 10 training datasets were entered in the logistic regression model. These logistic regression analyses resulted in the selection of four DNA markers for the final model. To maximize the sensitivity of the model (predict as many of the clinical strains correctly) while minimizing the loss of specificity, we used an arbitrary cut-off value of 0.06 to translate the logistic regression prediction of the training dataset into a classification of the clinical and environmental strains. The 111 strains in the eleventh training dataset were predicted with a sensitivity of 96%, a specificity of 75%, a PPV of 51%, and an NPV of 98% (Table 1). The 111 strains in the eleventh test dataset, which were not used for building the logistic regression model, were additionally analyzed and were predicted with a sensitivity of 96%, a specificity of 66%, a PPV of 45%, and an NPV of 98% (Table 2). Figure 1 shows the receiver operating characteristic (ROC) curve of the logistic regression model predictions for the eleventh test dataset. The area under the curve (AUC) or c-statistic was 0.9037.
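
Two steps of this procedure can be illustrated with a short sketch in plain Python: keeping the markers that recur in a majority of the per-dataset ranked lists, and turning predicted probabilities into sensitivity, specificity, PPV and NPV at a fixed cut-off. This is an illustration under stated assumptions, not the authors' implementation. The function names, the marker labels `m1` to `m6` and all toy values are hypothetical, "majority" is read as "more than half of the lists", and a strain is assumed to be called clinical when its probability is at or above the cut-off.

```python
from collections import Counter


def stable_markers(top_lists, min_fraction=0.5):
    """Markers that occur in more than min_fraction of the ranked lists."""
    counts = Counter(m for markers in top_lists for m in set(markers))
    threshold = min_fraction * len(top_lists)
    return sorted(m for m, c in counts.items() if c > threshold)


def classification_metrics(probabilities, labels, cutoff=0.06):
    """Sensitivity, specificity, PPV and NPV at a probability cut-off.

    labels: 1 = clinical, 0 = environmental.
    """
    tp = fp = tn = fn = 0
    for p, y in zip(probabilities, labels):
        predicted = 1 if p >= cutoff else 0  # assumed decision rule
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1

    def ratio(a, b):
        return a / b if b else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical toy data: top-ranked markers from four training datasets.
toy_lists = [
    ["m1", "m2", "m3"],
    ["m1", "m4", "m2"],
    ["m1", "m5", "m3"],
    ["m6", "m2", "m1"],
]
selected = stable_markers(toy_lists)

# Hypothetical predicted probabilities and true origins for six strains.
toy_probs = [0.50, 0.08, 0.03, 0.10, 0.02, 0.01]
toy_truth = [1, 1, 1, 0, 0, 0]
metrics = classification_metrics(toy_probs, toy_truth, cutoff=0.06)
print(selected)
print(metrics)
```

A low cut-off such as 0.06 classifies most strains as clinical, which favours sensitivity and NPV at the expense of specificity and PPV. This matches the pattern of the reported values, where the PPV is much lower than the sensitivity.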

