Estimating a Logistic Discrimination Functions When One of the Training Samples Is Subject to Misclassification: A Maximum Likelihood Approach.

Nagelkerke N, Fidler V - PLoS ONE (2015)

Bottom Line: The problem of discrimination and classification is central to much of epidemiology. These parameters can be used to estimate the probability of true group membership among those, possibly erroneously, classified as controls. Two examples are analyzed and discussed.


Affiliation: Malawi-Liverpool-Wellcome Trust, Chichiri, Blantyre 3, Malawi.

ABSTRACT
The problem of discrimination and classification is central to much of epidemiology. Here we consider the estimation of a logistic regression/discrimination function from training samples when one of the training samples is subject to misclassification or mislabeling, e.g. when diseased individuals are incorrectly classified/labeled as healthy controls. We show that this leads to a zero-inflated binomial model with a defective logistic regression or discrimination function, whose parameters can be estimated using standard statistical methods such as maximum likelihood. These parameters can be used to estimate the probability of true group membership among those, possibly erroneously, classified as controls. Two examples are analyzed and discussed. A simulation study explores properties of the maximum likelihood parameter estimates and the estimates of the number of mislabeled observations.
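
The link between the labeling model and the probability of true group membership among labeled controls can be sketched as follows. This is a minimal illustration under an assumed form of the defective logistic model: a truly diseased individual (y = 1) is recorded as a case (z = 1) with probability 1 - λ regardless of the covariates x, and no truly healthy individual is recorded as a case. The function names and parameter values are hypothetical and are not taken from the article.

```python
import math


def expit(eta):
    """Numerically stable logistic function."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def p_true_case(x, beta0, beta):
    """P(y = 1 | x) under an ordinary logistic model (assumed form)."""
    eta = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return expit(eta)


def p_labeled_case(x, beta0, beta, lam):
    """P(z = 1 | x): a 'defective' logistic curve that tops out at 1 - lam."""
    return (1.0 - lam) * p_true_case(x, beta0, beta)


def p_true_case_given_control(x, beta0, beta, lam):
    """P(y = 1 | x, z = 0), obtained by Bayes' rule under the assumed model."""
    p = p_true_case(x, beta0, beta)
    return lam * p / (1.0 - (1.0 - lam) * p)


# Hypothetical parameter values, for illustration only.
beta0, beta, lam = -1.0, [2.0, -0.5], 0.15
x = [1.0, 0.0]
print(p_true_case(x, beta0, beta))
print(p_labeled_case(x, beta0, beta, lam))
print(p_true_case_given_control(x, beta0, beta, lam))
```

With λ = 0 the sketch reduces to ordinary logistic regression, and the probability that a labeled control is truly a case becomes zero.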


Fig 1 (pone.0140718.g001): Frequency distribution of predicted probabilities of being a case in Example 1.

As our first example, we consider the problem described above of identifying pathogenic Legionella strains when the environmental control sample potentially contains several pathogenic strains. In short, 49 pathogenic strains are to be compared with 173 environmental ones. We restricted ourselves to the four genetic markers previously identified as important by Euser et al. [19]. Table 1 summarizes the data, while Table 2 presents the results of fitting the DLR and logistic regression (LR) models. We reparameterized λ to μ = λ/(1-λ) because it yielded greater symmetry of the profile likelihood. The ML estimate of the expected number of environmental strains that are actually pathogenic, n1 λ/(1-λ), is 9 (95% CI: 0 to 19). There seems to be a substantial difference for the third marker, justifying further exploration of its role in pathogenesis. The likelihood ratio test (based on the difference in twice the log-likelihood between the defective logistic regression and the usual logistic regression models) yields P = 0.078, while the Wald test of the hypothesis λ = 0 gives P = 0.047. Thus the DLR model fits slightly better than the LR model. Using the DLR estimates of β0 and β, we calculated the predicted probabilities P(y = 1 | x) of being a case, and also P(y = 1 | x, z = 0). These probabilities can be used for classification by choosing a suitable cut-off. (As β0 depends on the chosen proportions of pathogenic and environmental strains, these probabilities should be interpreted cautiously.) As an example, we chose a cut-off of 0.285, for which 9 of the 173 contaminated controls were classified as cases (9 being the estimate from the fitted DLR model); that is, for these strains the DLR-estimated probability P(y = 1 | x, z = 0) exceeded 0.285. At this cut-off, the estimated P(y = 1 | x) of 41 of the 49 cases (i.e. z = 1) also exceeded 0.285. Histograms of the estimated probabilities P(y = 1 | x) are shown in Fig 1.
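
The arithmetic behind the reparameterization and the choice of cut-off can be sketched as follows. The sketch assumes that n1 denotes the number of labeled cases (49 here), which the passage does not state explicitly. The helper names and the small list of probabilities are hypothetical, and the cut-off rule (the midpoint between the k-th and (k+1)-th largest predicted probabilities) is one convenient choice, not necessarily how the cut-off of 0.285 was obtained.

```python
def mu_from_lambda(lam):
    """Reparameterization mu = lambda / (1 - lambda)."""
    return lam / (1.0 - lam)


def lambda_from_mu(mu):
    """Inverse map: lambda = mu / (1 + mu)."""
    return mu / (1.0 + mu)


def expected_mislabeled(n1, lam):
    """Expected number of mislabeled controls, n1 * lambda / (1 - lambda)."""
    return n1 * mu_from_lambda(lam)


def cutoff_for_k_flagged(probs, k):
    """Return a cut-off that exactly k of the probabilities exceed.

    Uses the midpoint between the k-th and (k+1)-th largest values and
    refuses to choose when those two values are tied.
    """
    ranked = sorted(probs, reverse=True)
    if not 0 < k < len(ranked):
        raise ValueError("k must be between 1 and len(probs) - 1")
    if ranked[k - 1] == ranked[k]:
        raise ValueError("tie at the boundary; no cut-off flags exactly k")
    return (ranked[k - 1] + ranked[k]) / 2.0


# Assumption: n1 = 49 labeled cases. An expected count of 9 then implies
# mu = 9/49 (about 0.184) and lambda = 9/58 (about 0.155).
mu_hat = 9 / 49
lam_hat = lambda_from_mu(mu_hat)
print(mu_hat, lam_hat, expected_mislabeled(49, lam_hat))

# Hypothetical predicted probabilities for five labeled controls.
control_probs = [0.9, 0.1, 0.4, 0.6, 0.2]
cutoff = cutoff_for_k_flagged(control_probs, 2)
print(cutoff, sum(p > cutoff for p in control_probs))
```

Applied to the fitted P(y = 1 | x, z = 0) of the 173 labeled controls with k = 9, a rule of this kind would return a threshold that flags as many controls as the model expects to be mislabeled.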

