Alpha-CENTAURI: assessing novel centromeric repeat sequence variation with long read sequencing.

Sevim V, Bashir A, Chin CS, Miga KH - Bioinformatics (2016)

Bottom Line: These sequences present a source of repeat structure diversity that is commonly ignored by standard genomic tools. To automate characterization of local centromeric tandem repeat sequence variation, we have designed Alpha-CENTAURI (ALPHA satellite CENTromeric AUtomated Repeat Identification), which takes advantage of Pacific Biosciences long reads from whole-genome sequencing datasets. The pipeline is designed to report local repeat organization summaries for each read, thereby monitoring rearrangements in repeat units, shifts in repeat orientation and sites of array transition into non-satellite DNA, typically defined by transposable element insertion.


Affiliation: Pacific Biosciences, Inc., Menlo Park, CA 94025, USA.


Fig. 1. Alpha-CENTAURI workflow and the HOR detection algorithm illustrated. (a) The workflow. Alpha-CENTAURI takes two input files: a FASTA file containing long reads and an HMM database built using known alpha-satellite monomers. The HMM database is used to infer monomeric sequences in each read. Then, HOR structure is predicted based on the start and end positions of each monomer on the read. The repeat structure on the read is classified into three categories: regular, irregular (including inversion), or no HOR detected. (b) An illustration of a read consisting of an array of alpha-satellite monomers, which are identified from the HMM database. Each block arrow corresponds to a monomer. Similar colors indicate similar sequences. (c) Identified monomers are clustered based on sequence similarity. Here, each cluster is labeled by a different letter. (d) HOR structure is predicted based on the start positions, end positions and the distances between monomers.
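The three-way classification in the caption can be illustrated with a small sketch. This is not the Alpha-CENTAURI implementation: the `Monomer` record, the `classify_read` function, the 5% spacing tolerance and the exact decision rules are all assumptions made for illustration. The sketch only shows how start positions, strand and cluster labels could be combined to separate regular, irregular and no-HOR reads.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass
class Monomer:
    start: int    # start position on the read
    end: int      # end position on the read
    strand: str   # "+" or "-" (hypothetical field used to spot inversions)
    cluster: str  # cluster label, e.g. "A", "B"


def classify_read(monomers, tolerance=0.05):
    """Return "regular", "irregular" or "no_hor" (illustrative rules only)."""
    monomers = sorted(monomers, key=lambda m: m.start)
    starts_by_cluster = defaultdict(list)
    for m in monomers:
        starts_by_cluster[m.cluster].append(m.start)

    # Distances between consecutive monomers that share a cluster label.
    distances = [
        b - a
        for starts in starts_by_cluster.values()
        for a, b in zip(starts, starts[1:])
    ]
    # Assumed rule: a HOR needs at least two monomer types and some recurrence.
    if len(starts_by_cluster) < 2 or not distances:
        return "no_hor"

    mixed_strands = len({m.strand for m in monomers}) > 1
    typical = median(distances)
    uneven = any(abs(d - typical) > tolerance * typical for d in distances)
    return "irregular" if (mixed_strands or uneven) else "regular"
```

Under these assumed rules, a read whose same-cluster monomers recur at a constant spacing is called regular, while a strand flip or uneven spacing makes it irregular.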

Alpha-CENTAURI’s workflow is designed to detect tandem repeats containing at least two ordered monomers (i.e. dimers), providing a minimal definition for HOR prediction (Willard and Waye, 1987b) (Fig. 1). As outlined in Figure 1a, the user first provides two inputs: (i) a file of quality-assessed long reads and (ii) a monomer training set of FASTA sequences that defines the basic repeat unit for a given satellite family. Initial monomer positions are determined with a hidden Markov model approach (HMMER; Eddy, 1998), using a satellite model provided by the input consensus sequence. The minimum monomer length defaults to 150 bases. Characterization of discrete, ordered monomer units provides an index of start positions for each read. Monomers are clustered into groups based on pairwise sequence identity using an implementation of an O(ND) alignment algorithm (Myers, 1986) within the FALCON genome assembler (https://github.com/PacificBiosciences/FALCON/). The cluster similarity threshold is determined per read by evaluating a range of identity values (98% to 88% in 1% decrements) and selecting the monomer cluster assignments with the highest percent identity that permits inference of HOR organization. A consensus sequence FASTA file and a tab-delimited description of HOR summary statistics are provided for each read.
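The per-read threshold sweep described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's code: Python's `difflib.SequenceMatcher` stands in for the O(ND) aligner from FALCON, the greedy clustering in `cluster_labels` is a simplification, and the criterion in `hor_period` (the cluster label order repeats with a period of at least two) is only one possible reading of "permits inference of HOR organization". All function names are hypothetical.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Approximate pairwise identity; a stand-in for the O(ND) aligner."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cluster_labels(monomers, threshold):
    """Greedy clustering: join the first cluster whose representative is similar enough."""
    representatives, labels = [], []
    for seq in monomers:
        for idx, rep in enumerate(representatives):
            if identity(seq, rep) >= threshold:
                labels.append(idx)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            representatives.append(seq)
            labels.append(len(representatives) - 1)
    return labels


def hor_period(labels):
    """Smallest period k >= 2 at which the label order repeats, else None."""
    n = len(labels)
    for k in range(2, n // 2 + 1):
        unit_is_distinct = len(set(labels[:k])) == k
        if unit_is_distinct and all(labels[i] == labels[i + k] for i in range(n - k)):
            return k
    return None


def pick_threshold(monomers):
    """Try 98% down to 88% in 1% steps; keep the highest identity that yields a HOR."""
    for percent in range(98, 87, -1):
        labels = cluster_labels(monomers, percent / 100)
        period = hor_period(labels)
        if period is not None:
            return percent, labels, period
    return None, [], None
```

Starting at the strictest identity and relaxing it step by step means that the first threshold at which a repeating label pattern appears is also the highest one, which matches the selection rule in the text.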

