Reconstruction of clonal trees and tumor composition from multi-sample sequencing data.
Bottom Line: We derive a combinatorial characterization of the solutions to this problem and show that the problem is NP-complete. We derive an integer linear programming solution to the VAF factorization problem in the case of error-free data and extend this solution to real data with a probabilistic model for errors. The resulting AncesTree algorithm is better able to identify ancestral relationships between individual mutations than existing approaches, particularly in ultra-deep sequencing data when high read counts for mutations yield high-confidence VAFs.
Affiliation: Center for Computational Molecular Biology and Department of Computer Science, Brown University, Providence, RI 02912, USA.
Mentions: We created 90 synthetic tumor datasets. Each dataset contains 100 mutations grouped into 10 clones that accumulated following the infinite sites assumption. For each dataset, we simulated between four and six samples sequenced at a coverage of 50X, 100X or 1000X. Further details of the simulated data are contained in the Supplementary Appendix. We ran AncesTree, PhyloSub and CITUP on each dataset and compared the results using five measures: (i) the fraction of ancestral relationships between pairs of mutations that were correctly identified (Fig. 3A); (ii) the fraction of clustered relationships between pairs of mutations that were correctly identified (Fig. 3B); (iii) the fraction of incomparable relationships (i.e. neither ancestral nor clustered) between pairs of mutations that were correctly identified (Supplementary Fig. A2); (iv) the average error between the simulated and inferred frequency matrix F (Fig. 3C); and (v) the error between the simulated and inferred usage matrix U, using the same metric as Malikic et al. (2015) (Fig. 3D). We compute these measures only on the set of mutations that are included in the output of all methods. Since CITUP and PhyloSub include all mutations, this equates to the set of mutations output by AncesTree (a median of 69 of the 100 total mutations). We find that AncesTree is more accurate in determining ancestral, clustered and incomparable relationships: its median accuracy exceeds that of the other methods by more than 0.05, 0.03 and 0.08, respectively. Further, we find that AncesTree achieves a median error on F and U that is 0.01 and 0.03 lower, respectively, than the median error of the other methods. See Supplementary Appendix for further details on all five metrics.
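To make measures (i) to (iii) concrete, the following is a minimal sketch of one way to score pairwise relationships between mutations. It is not the authors' evaluation code: the exact definitions are in the Supplementary Appendix, and the tree representation (a mutation-to-clone map plus a child-to-parent map), the function names and the toy data are all assumptions made for illustration. Each pair of mutations is classified as clustered, ancestral (with direction) or incomparable in both the simulated and the inferred tree, and accuracy is reported per true relationship type, restricted to mutations present in both outputs.

```python
from itertools import combinations

def ancestors(clone, parent):
    """Return the proper ancestors of `clone` in a tree stored as child -> parent."""
    seen = set()
    while clone in parent:
        clone = parent[clone]
        seen.add(clone)
    return seen

def relation(a, b, clone_of, parent):
    """Classify the relationship between mutations a and b in one clonal tree."""
    ca, cb = clone_of[a], clone_of[b]
    if ca == cb:
        return "clustered"
    if ca in ancestors(cb, parent):
        return "a_ancestor_of_b"
    if cb in ancestors(ca, parent):
        return "b_ancestor_of_a"
    return "incomparable"

def pairwise_accuracy(true_tree, inferred_tree, mutations):
    """Fraction of true ancestral / clustered / incomparable pairs that are recovered.

    Each tree is a (clone_of, parent) pair; this layout is an assumption of the sketch.
    Returns None for a relationship type that has no true pairs.
    """
    true_clone, true_parent = true_tree
    inf_clone, inf_parent = inferred_tree
    totals = {"ancestral": 0, "clustered": 0, "incomparable": 0}
    correct = dict.fromkeys(totals, 0)
    for a, b in combinations(sorted(mutations), 2):
        t = relation(a, b, true_clone, true_parent)
        i = relation(a, b, inf_clone, inf_parent)
        kind = "ancestral" if "ancestor" in t else t
        totals[kind] += 1
        correct[kind] += (t == i)  # direction must also match for ancestral pairs
    return {k: (correct[k] / totals[k] if totals[k] else None) for k in totals}

# Hypothetical toy data: the simulated tree has root A with children B and C;
# the inferred tree is the chain A -> B -> C and misplaces mutation m2.
true_tree = ({"m1": "A", "m2": "A", "m3": "B", "m4": "C"}, {"B": "A", "C": "A"})
inferred_tree = ({"m1": "A", "m2": "B", "m3": "B", "m4": "C"}, {"B": "A", "C": "B"})

# Score only mutations that appear in both outputs, mirroring the restriction in the text.
shared = set(true_tree[0]) & set(inferred_tree[0])
scores = pairwise_accuracy(true_tree, inferred_tree, shared)
print(scores)  # {'ancestral': 0.75, 'clustered': 0.0, 'incomparable': 0.0}
```

In this toy case three of the four true ancestral pairs are recovered, while the single clustered pair and the single incomparable pair are both misclassified. Scoring per true relationship type keeps a method from looking accurate merely because one type dominates the pair count.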