Estimating a Logistic Discrimination Functions When One of the Training Samples Is Subject to Misclassification: A Maximum Likelihood Approach.

Nagelkerke N, Fidler V - PLoS ONE (2015)

Bottom Line: The problem of discrimination and classification is central to much of epidemiology. These parameters can be used to estimate the probability of true group membership among those, possibly erroneously, classified as controls. Two examples are analyzed and discussed.


Affiliation: Malawi-Liverpool-Wellcome Trust, Chichiri, Blantyre 3, Malawi.

ABSTRACT
The problem of discrimination and classification is central to much of epidemiology. Here we consider the estimation of a logistic regression/discrimination function from training samples when one of the training samples is subject to misclassification or mislabeling, e.g. diseased individuals are incorrectly classified/labeled as healthy controls. We show that this leads to a zero-inflated binomial model with a defective logistic regression or discrimination function, whose parameters can be estimated using standard statistical methods such as maximum likelihood. These parameters can be used to estimate the probability of true group membership among those, possibly erroneously, classified as controls. Two examples are analyzed and discussed. A simulation study explores properties of the maximum likelihood parameter estimates and of the estimates of the number of mislabeled observations.
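
The model named in the abstract can be written out as a short sketch. The parameterisation below is an assumption, not a quotation from the paper: y denotes the true status, z the recorded label, p(x) an ordinary logistic function, and λ the probability that a true case is recorded as a control, with controls assumed never to be recorded as cases. This reading is consistent with the expression n1 λ/(1-λ) that the article uses for the expected number of controls who are actually cases.

```latex
% Assumed notation (illustrative, not taken verbatim from the paper):
%   y = true status, z = recorded label, \lambda = P(z = 0 \mid y = 1)
\begin{align}
p(x) &= P(y = 1 \mid x)
      = \frac{\exp(\beta_0 + \beta^{\top} x)}{1 + \exp(\beta_0 + \beta^{\top} x)} \\
P(z = 1 \mid x) &= (1 - \lambda)\, p(x) \\
\ell(\beta_0, \beta, \lambda) &= \sum_{i}
   \Bigl[ z_i \log\bigl\{(1 - \lambda)\, p(x_i)\bigr\}
        + (1 - z_i) \log\bigl\{1 - (1 - \lambda)\, p(x_i)\bigr\} \Bigr]
\end{align}
```

Under this form, P(z = 1 | x) can never exceed 1 - λ, which is one way to read the word "defective", and setting λ = 0 recovers ordinary logistic regression.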


pone.0140718.g003: ROC curve for artificially mislabeled control group data of Example 2.

Mentions: For our second example, to explore the DLR model in a situation in which the level of misclassification is (artificially) known, we used data from the Ille-et-Vilaine case-control study on esophageal cancer by Tuyns et al [20], with 200 cases and 778 controls (776 with complete data) (data obtained from: http://faculty.washington.edu/norm/datasets.html, see S1 File). Four covariables, all highly significant by standard logistic regression, were of interest: age (rescaled), the square of age (age2), tobacco group (treated as a continuous variable) and daily alcohol consumption (in g/day, also rescaled). Applying DLR to these data correctly estimated λ = 0. We then intentionally misclassified 67 randomly chosen cases as controls and used DLR to estimate the fraction misclassified. Table 3 and Table 4 summarize the data and the results of fitting the DLR and LR models. The ML estimate of the expected number of controls who are actually cases, n1 λ/(1-λ), is 116 (95% CI: 15–216). The real number, 67, falls well within the CI. However, the assumed underlying logistic function may not be entirely correct, and such violations of assumptions may bias estimates of λ/(1-λ). For example, it seems unlikely that the probability of esophageal cancer can really approach 1, as all cases must have been non-cases prior to developing their disease, with the same covariable pattern (except perhaps for a slightly lower age). The P-value of the likelihood ratio test comparing LR and DLR is 0.021, suggesting that the DLR fits better than the LR. Of course, as the hypothesis λ = 0 is on the boundary of the parameter space, this P-value has to be taken with a grain of salt. Using the DLR estimates of β0 and β we calculated the predicted probabilities P(y = 1 | x) of being a case. Histograms of the estimated probabilities P(y = 1 | x) are shown in Fig 2.
As an example we chose a cut-off of 0.35, for which 116 of 776 “z = 0” controls were classified as cases (116 being the estimate from the DLR model), that is, their estimated probability P(y = 1 | z = 0, x) exceeded the cut-off. Then 33 (49%) out of 67 misclassified cases and 694 (89%) out of 776 controls were classified correctly. The ROC curve is shown in Fig 3. To further explore the behavior of the proposed method we analyzed one hundred samples, each obtained by randomly selecting 67 cases to misclassify. The median estimated number of misclassified controls was 103 (IQR: 69 to 142).
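
The arithmetic in this example can be checked with a short sketch. It assumes a parameterisation in which a true case is recorded as a control with probability λ and P(z = 1 | x) = (1-λ) p(x), so that Bayes' rule gives P(y = 1 | z = 0, x) = λ p(x) / (1 - (1-λ) p(x)). The function names are illustrative, and n1 = 133 is derived here as 200 - 67 rather than quoted from the text.

```python
def posterior_case_prob(p, lam):
    """P(y=1 | z=0, x) under the ASSUMED form P(z=1 | x) = (1 - lam) * p.

    p   : logistic probability P(y=1 | x) of being a true case
    lam : probability that a true case is recorded as a control
    """
    return lam * p / (1.0 - (1.0 - lam) * p)


def expected_mislabeled(n1, lam):
    """Expected number of recorded controls who are really cases: n1*lam/(1-lam)."""
    return n1 * lam / (1.0 - lam)


def lam_from_expected(n1, m):
    """Invert m = n1*lam/(1-lam) to recover lam = m / (n1 + m)."""
    return m / (n1 + m)


def flag_as_case(p, lam, cutoff=0.35):
    """Classify a recorded control as a case when the posterior exceeds the cut-off."""
    return posterior_case_prob(p, lam) > cutoff


# Numbers taken from the example; n1 is derived (assumption: 200 - 67 labeled cases).
n1 = 200 - 67
true_lam = 67 / 200                        # fraction of cases deliberately mislabeled
check_true = expected_mislabeled(n1, true_lam)   # should reproduce the 67 moved cases
implied_lam = lam_from_expected(n1, 116)   # lam implied by the reported estimate of 116

# Percentages reported for the 0.35 cut-off.
pct_cases_found = 100 * 33 / 67
pct_controls_kept = 100 * 694 / 776

if __name__ == "__main__":
    print(check_true, implied_lam, pct_cases_found, pct_controls_kept)
```

Under these assumptions the formula n1 λ/(1-λ) returns the 67 relabeled cases when the true λ = 67/200 is plugged in, which is a useful sanity check on the notation. The posterior is 0 when λ = 0 and rises to 1 as p(x) approaches 1, so a cut-off on P(y = 1 | z = 0, x) flags the recorded controls with the most case-like covariable patterns.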

