Vy-PER: eliminating false positive detection of virus integration events in next generation sequencing data.

Forster M, Szymczak S, Ellinghaus D, Hemmrich G, Rühlemann M, Kraemer L, Mucha S, Wienbrandt L, Staa M, UFO Sequencing Consortium within I-BFM Study Group, Franke A - Sci Rep (2015)

Bottom Line: We analysed whole genome data from childhood acute lymphoblastic leukemia (ALL), which is characterised by genomic rearrangements and usually associated with radiation exposure. However, as expected, our analysis of 20 tumour and matched germline genomes from ALL patients finds no significant evidence for integrations by known viruses. Nevertheless, our method eliminates 12,800 false positives per genome (80× coverage) and only our method detects singleton human-phiX174-chimeras caused by optical errors of the Illumina HiSeq platform.


Affiliation: Institute of Clinical Molecular Biology, Christian-Albrechts-University of Kiel, Schleswig-Holstein, D-24105 Kiel, Germany.

ABSTRACT
Several pathogenic viruses such as hepatitis B and human immunodeficiency viruses may integrate into the host genome. These virus/host integrations are detectable using paired-end next generation sequencing. However, the low number of expected true virus integrations may be difficult to distinguish from the noise of many false positive candidates. Here, we propose a novel filtering approach that increases specificity without compromising sensitivity for virus/host chimera detection. Our detection pipeline termed Vy-PER (Virus integration detection bY Paired End Reads) outperforms existing similar tools in speed and accuracy. We analysed whole genome data from childhood acute lymphoblastic leukemia (ALL), which is characterised by genomic rearrangements and usually associated with radiation exposure. This analysis was motivated by the recently reported virus integrations at genomic rearrangement sites and association with chromosomal instability in liver cancer. However, as expected, our analysis of 20 tumour and matched germline genomes from ALL patients finds no significant evidence for integrations by known viruses. Nevertheless, our method eliminates 12,800 false positives per genome (80× coverage) and only our method detects singleton human-phiX174-chimeras caused by optical errors of the Illumina HiSeq platform. This high accuracy is useful for detecting low virus integration levels as well as non-integrated viruses.


Figure 2. Vy-PER ideogram summary plot. Negative example without true virus integrations: Patient genome sequenced with 40× coverage and analysed with highest sensitivity, eliminating 6400 false positives and leaving three unsupported singletons (pseudomonas phage, caviid herpes, cercopithecine herpes) which we manually eliminated as alignment artefacts. PhiX and M13 originated from the 1% Illumina sequencing library control spike-in. The plot shows integration sites down to single chimeras.

Using our Vy-PER method, we searched for virus integrations in 10 tumour samples that were sequenced to a minimal coverage of 80×, and in 10 matched normal samples from the same patients that were sequenced to a minimal coverage of 40×. Before final filtering (see Supplementary Methods online), we detected on average 1600 virus candidates per lane (12,800 per 80× genome). On manual inspection, most candidates appeared to be false positives. Table 1 shows a representative snapshot of false positive virus candidates with intriguingly plausible viruses, e.g. the common herpes viruses. However, their candidate sequences are non-specific STR-rich or homopolymer-rich sequences. It can be argued that some herpes types are known to integrate into human telomeres [49], but on the other hand, some herpes virus genomes simply have homologies to the human telomeres [50]. When we aligned a number of full-length HiSeq reads from herpes candidates to the expected human genome sequence window using an interactive DIALIGN-based tool [51], we found that these reads were in fact from the human genome. Our alignment exercise suggested that a more accurate and more sensitive alignment of the virus candidates to the host genome would be an effective filter to eliminate false positives. Therefore, we implemented the final filtering of candidates (Fig. 1). The STR-based filtering of virus candidate sequences and exact alignment of remaining sequences to the expected human genome window cut the number of false positive virus candidates to only 12 candidates per 80× genome, on average. The final exact alignment to the whole human genome eliminated most of these remaining few sequences. In the end, no true virus/human chimeras were detected (Fig. 2, Table 2) even at the lowest reporting threshold of 1 paired-end in 20 M bps.
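To illustrate the idea of screening out STR-rich or homopolymer-rich candidate sequences, the following is a minimal Python sketch. It is not the Vy-PER implementation, whose details are in the Supplementary Methods. The function names, the repeat definition (a unit of 1 to 6 bases occurring at least three times in tandem) and the 50% cut-off are all assumptions chosen for illustration.

```python
import re

# Hypothetical repeat definition: a unit of 1-6 bases followed by at least
# two further tandem copies (three or more copies in total). A unit of
# length 1 covers homopolymer runs.
TANDEM_REPEAT = re.compile(r"([ACGT]{1,6}?)\1{2,}")


def repeat_fraction(seq: str) -> float:
    """Return the fraction of bases covered by tandem repeats or homopolymers."""
    seq = seq.upper()
    if not seq:
        return 0.0
    # finditer yields non-overlapping matches, so the lengths can be summed.
    covered = sum(m.end() - m.start() for m in TANDEM_REPEAT.finditer(seq))
    return covered / len(seq)


def is_low_complexity(seq: str, max_fraction: float = 0.5) -> bool:
    """Flag a candidate sequence as non-specific (assumed 50% cut-off)."""
    return repeat_fraction(seq) > max_fraction


if __name__ == "__main__":
    telomere_like = "TTAGGG" * 10   # human telomere repeat unit
    homopolymer = "A" * 30
    mixed = "ACGTTGCAAGCTTAGCCGATCGATTACGGCTA"
    for name, s in [("telomere-like", telomere_like),
                    ("homopolymer", homopolymer),
                    ("mixed", mixed)]:
        print(name, round(repeat_fraction(s), 2), is_low_complexity(s))
```

A telomere-like read built from TTAGGG units is flagged by this sketch, which mirrors why herpes virus candidates with telomere homology can appear as false positives before filtering. A production filter would also need to handle ambiguous bases (N) and imperfect repeats, which this sketch ignores.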
The only reported candidates were sparsely distributed phiX174 chimeras, a few singletons, and a cluster of three enterobacteria phage M13 chimeras in the remission sample of patient 8 but not in the tumour. The phiX chimeras are probably optical positioning errors of the HiSeq platform when the sequencing of the second read of a paired-end run is performed. PhiX libraries are technical controls for base calling on the Illumina platforms. They are spiked into the completed sample libraries as a fraction of about 1% of the total libraries. The phiX libraries should not normally ligate with sample libraries, as both libraries are blunt-ended. The fraction of phiX chimeras in the ALL samples comprised about 0.0001% of the total paired-ends, or about 160 chimeras per lane. Enterobacteria phage M13 is used for bacterial cloning, as in the commercial production of phiX. The M13 singletons and the small M13 cluster that we detected are a trace contaminant of the Illumina phiX174 libraries. This suggests that our STR-filtering and host-genome re-alignment strategy is very promising for the elimination of human sequence reads and the preservation of virus/host chimeras, even down to singletons.
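The counts quoted in the text can be cross-checked with simple arithmetic. The short Python sketch below assumes that every lane contributes equal coverage and treats the "minimal coverage" and "about" figures as exact, so the derived values are rough estimates rather than numbers reported by the authors. The variable names are illustrative only.

```python
from fractions import Fraction

# Figures taken from the text.
CANDIDATES_PER_LANE = 1600
CANDIDATES_PER_80X_GENOME = 12_800
PHIX_CHIMERAS_PER_LANE = 160
PHIX_CHIMERA_FRACTION = Fraction(1, 1_000_000)  # 0.0001% of paired-ends

# Lanes implied for an 80x genome, and the coverage each lane would add
# under the equal-coverage assumption.
lanes_per_80x = CANDIDATES_PER_80X_GENOME // CANDIDATES_PER_LANE
coverage_per_lane = Fraction(80, lanes_per_80x)

# Expected pre-filter candidates for a 40x genome under the same assumption.
lanes_per_40x = Fraction(40) / coverage_per_lane
candidates_per_40x = lanes_per_40x * CANDIDATES_PER_LANE

# Paired-ends per lane implied by the phiX chimera count and fraction.
implied_pairs_per_lane = PHIX_CHIMERAS_PER_LANE / PHIX_CHIMERA_FRACTION

print("lanes per 80x genome:", lanes_per_80x)
print("coverage per lane:", coverage_per_lane)
print("candidates per 40x genome:", candidates_per_40x)
print("implied paired-ends per lane:", implied_pairs_per_lane)
```

Under these assumptions an 80× genome corresponds to 8 lanes of roughly 10× each, and a 40× genome would yield about 6400 pre-filter candidates, which agrees with the number of false positives given in the Figure 2 caption. The phiX figures imply on the order of 160 million paired-ends per lane.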

