Vy-PER: eliminating false positive detection of virus integration events in next generation sequencing data.

Forster M, Szymczak S, Ellinghaus D, Hemmrich G, Rühlemann M, Kraemer L, Mucha S, Wienbrandt L, Staa M, UFO Sequencing Consortium within I-BFM Study Group, Franke A - Sci Rep (2015)

Bottom Line: We analysed whole genome data from childhood acute lymphoblastic leukemia (ALL), which is characterised by genomic rearrangements and usually associated with radiation exposure. However, as expected, our analysis of 20 tumour and matched germline genomes from ALL patients finds no significant evidence for integrations by known viruses. Nevertheless, our method eliminates 12,800 false positives per genome (80× coverage) and only our method detects singleton human-phiX174-chimeras caused by optical errors of the Illumina HiSeq platform.


Affiliation: Institute of Clinical Molecular Biology, Christian-Albrechts-University of Kiel, Schleswig-Holstein, D-24105 Kiel, Germany.

ABSTRACT
Several pathogenic viruses such as hepatitis B and human immunodeficiency viruses may integrate into the host genome. These virus/host integrations are detectable using paired-end next generation sequencing. However, the low number of expected true virus integrations may be difficult to distinguish from the noise of many false positive candidates. Here, we propose a novel filtering approach that increases specificity without compromising sensitivity for virus/host chimera detection. Our detection pipeline termed Vy-PER (Virus integration detection bY Paired End Reads) outperforms existing similar tools in speed and accuracy. We analysed whole genome data from childhood acute lymphoblastic leukemia (ALL), which is characterised by genomic rearrangements and usually associated with radiation exposure. This analysis was motivated by the recently reported virus integrations at genomic rearrangement sites and association with chromosomal instability in liver cancer. However, as expected, our analysis of 20 tumour and matched germline genomes from ALL patients finds no significant evidence for integrations by known viruses. Nevertheless, our method eliminates 12,800 false positives per genome (80× coverage) and only our method detects singleton human-phiX174-chimeras caused by optical errors of the Illumina HiSeq platform. This high accuracy is useful for detecting low virus integration levels as well as non-integrated viruses.


Figure 1: Detection of virus integrations into the human genome using paired-end sequencing. After the computationally expensive classical alignment of all paired-end reads to the human genome, the pipeline splits into classical variant calling and virus integration detection.

Mentions: Virus integrations can be detected using Illumina paired-end sequencing (Fig. 1) if one read maps to the human genome and its mate to a virus genome [23, 24]. We analysed our childhood leukemia data accordingly using our own scripts, as the data were too large for publicly available pipelines even on our high-performance Linux cluster (928 cores with a hardware limit of 125 GB RAM per job). In addition, the pipelines that we tested were quite slow and did not detect virus integrations in benchmarking runs of our leukemia data. Our initial scripts identified a surprisingly large number of virus candidates. On closer inspection we noticed that many virus candidates were simply long stretches of unspecific short tandem repeats (STRs) or homopolymers (Table 1). These unspecific sequence stretches turned out to be identical in many virus genomes as well as the human genome. We therefore developed filters to eliminate false positives efficiently without compromising the sensitivity to detect true chimeric human/virus sequences. One solution to eliminate nearly all false positives would be to filter low-complexity reads with DUST [32], as implemented for example in BLAST [33] searches, or with PRINSEQ [34]. To avoid removing potentially informative reads (see Supplementary Results online), we prefer to filter each read by considering only the length of its longest STR. A second solution to eliminate false positives would be to cluster the virus/human chimera candidates within a genomic window, as implemented in the VirusSeq pipeline [26], but this method by definition loses the sensitivity to detect low-frequency events represented by singleton or low-frequency chimeras. A third method would be to obtain supporting information on potential viruses from unmapped read-pairs, as implemented in the VirusSeq and VirusFinder pipelines [35], but this is limited to detecting the co-occurrence of a virus, not its integration into the host genome.
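The idea of judging a read only by the length of its longest STR or homopolymer can be illustrated with a minimal Python sketch. This is not the Vy-PER implementation: the function names, the maximum repeat unit of 6 bp, and the 20 bp cutoff are assumptions chosen for illustration, and the source does not state the thresholds actually used.

```python
def longest_str_length(seq, max_unit=6):
    """Length in bases of the longest tandem repeat stretch in seq.

    A stretch counts once it holds at least two full copies of a repeat
    unit of 1..max_unit bases (unit size 1 is a homopolymer).
    max_unit=6 is an illustrative assumption, not a value from the paper.
    """
    seq = seq.upper()
    best = 0
    for period in range(1, max_unit + 1):
        run = 0  # consecutive positions that equal the base one period back
        for i in range(period, len(seq)):
            if seq[i] == seq[i - period] and seq[i] != "N":
                run += 1
                if run >= period:  # at least two complete copies of the unit
                    best = max(best, run + period)
            else:
                run = 0
    return best


def passes_str_filter(seq, max_str_len=20):
    """Keep a read unless its longest STR stretch exceeds max_str_len.

    The 20 bp cutoff is a hypothetical placeholder.
    """
    return longest_str_length(seq) <= max_str_len


if __name__ == "__main__":
    reads = {
        "homopolymer": "A" * 40,
        "dinucleotide": "CA" * 15 + "GTTAGC",
        "complex": "ACGTTGCAAGCTTGACCATGGATCCGTA",
    }
    for name, read in reads.items():
        print(name, longest_str_length(read), passes_str_filter(read))
```

Unlike a whole-read complexity score, this criterion keeps a read that contains a short repeat next to otherwise specific sequence, which is the motivation given above for preferring it over a general low-complexity filter.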
Our aim for high sensitivity was motivated by the knowledge that, for example, HIV and HTLV have a tropism for T-lymphocytes with CD4 receptors [36, 37], leaving most other cell types uninfected, and that viruses such as EBV or HTLV integrate at seemingly random sites and are difficult to detect unless a major clonal expansion takes place within the cell population [14, 15, 16, 37]. With respect to the efficient use of computer resources, the computationally expensive alignment of all read-pairs to the human genome should ideally be performed just once, i.e. in the context of a classical BWA-based next generation sequencing (NGS) pipeline (Fig. 1). The resulting specific and efficient analysis pipeline would allow us to routinely scan all of our future genome, transcriptome, and targeted next generation sequencing data for integrations of known viruses before further wet lab tests or computationally more expensive analyses are performed on selected samples.
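One way to reuse a single human-genome alignment is to pull out read pairs in which only one end aligned, so that only the unaligned mates need to be compared against virus genomes. The sketch below assumes the alignment is available as SAM text and relies only on the standard SAM flag bits; the function names are hypothetical, and the source does not describe how Vy-PER selects its candidates.

```python
# Standard SAM flag bits (from the SAM format specification).
PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def is_chimera_candidate(flag):
    """True for a primary record of a pair with exactly one end mapped."""
    if not flag & PAIRED:
        return False
    if flag & (SECONDARY | SUPPLEMENTARY):
        return False
    read_unmapped = bool(flag & READ_UNMAPPED)
    mate_unmapped = bool(flag & MATE_UNMAPPED)
    return read_unmapped != mate_unmapped


def candidate_read_names(sam_lines):
    """Collect names of read pairs whose mates may be of non-human origin."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # SAM records have 11 mandatory fields
            continue
        if is_chimera_candidate(int(fields[1])):
            names.add(fields[0])
    return names


if __name__ == "__main__":
    # Two toy records: a proper pair (flag 99) and a read whose mate is
    # unmapped (flag 73). Coordinates and sequences are made up.
    toy = [
        "@HD\tVN:1.6",
        "pairA\t99\tchr1\t100\t60\t8M\t=\t300\t208\tACGTACGT\tIIIIIIII",
        "pairB\t73\tchr2\t500\t60\t8M\t=\t500\t0\tACGTTGCA\tIIIIIIII",
    ]
    print(candidate_read_names(toy))
```

In this sketch the unmapped mates of the selected pairs would then be aligned to a virus reference in a separate, much smaller step, which leaves the expensive human alignment untouched.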

