Limits...
Refined repetitive sequence searches utilizing a fast hash function and cross species information retrievals.

Reneker J, Shyu CR - BMC Bioinformatics (2005)

Bottom Line: For example, bacteria commonly generate intra-strain diversity through phase variation which is strongly associated with virulence determinants.Our experiments show that subsequently searching the annotation data can refine and focus the results for the user.Source code is available upon request.

View Article: PubMed Central - HTML - PubMed

Affiliation: Department of Computer Science, University of Missouri, Columbia, USA. jeff@diglib1.cecs.missouri.edu

ABSTRACT

Background: Searching for small tandem/disperse repetitive DNA sequences streamlines many biomedical research processes. For instance, whole genomic array analysis in yeast has revealed 22 PHO-regulated genes. The promoter regions of all but one of them contain at least one of the two core Pho4p binding sites, CACGTG and CACGTT. In humans, microsatellites play a role in a number of rare neurodegenerative diseases such as spinocerebellar ataxia type 1 (SCA1). SCA1 is a hereditary neurodegenerative disease caused by an expanded CAG repeat in the coding sequence of the gene. In bacterial pathogens, microsatellites are proposed to regulate expression of some virulence factors. For example, bacteria commonly generate intra-strain diversity through phase variation which is strongly associated with virulence determinants. A recent analysis of the complete sequences of the Helicobacter pylori strains 26695 and J99 has identified 46 putative phase-variable genes among the two genomes through their association with homopolymeric tracts and dinucleotide repeats. Life scientists are increasingly interested in studying the function of small sequences of DNA. However, current search algorithms often generate thousands of matches -- most of which are irrelevant to the researcher.

Results: We present our hash function as well as our search algorithm to locate small sequences of DNA within multiple genomes. Our system applies information retrieval algorithms to discover knowledge of cross-species conservation of repeat sequences. We discuss our incorporation of the Gene Ontology (GO) database into these algorithms. We conduct an exhaustive time analysis of our system for various repetitive sequence lengths. For instance, a search for eight bases of sequence within 3.224 GBases on 49 different chromosomes takes 1.147 seconds on average. To illustrate the relevance of the search results, we conduct a search with and without added annotation terms for the yeast Pho4p binding sites, CACGTG and CACGTT. Also, a cross-species search is presented to illustrate how potential hidden correlations in genomic data can be quickly discerned. The findings in one species are used as a catalyst to discover something new in another species. These experiments also demonstrate that our system performs well while searching multiple genomes -- without the main memory constraints present in other systems.

Conclusion: We present a time-efficient algorithm to locate small segments of DNA and concurrently to search the annotation data accompanying the sequence. Genome-wide searches for short sequences often return hundreds of hits. Our experiments show that subsequently searching the annotation data can refine and focus the results for the user. Our algorithms are also space-efficient in terms of main memory requirements. Source code is available upon request.

Show MeSH

Related in: MedlinePlus

Database search pseudo code. The length of the query sequence Q determines which block of code will execute. Lines 3 – 18 execute for /Q/ <k (wordsize) while lines 19 – 29 execute for /Q/ = k. If k < /Q/ < 2k then Q is divided into two k length pieces for recursive calls to Search in lines 32 – 35 and the results from these calls are further tested in lines 38 – 41 to obtain the final answer in line 42. If /Q/ > = 2k, a similar block of code is executed in lines 44 – 57. However, a different comparison is made in line 54 as compared to line 40.
© Copyright Policy
Related In: Results  -  Collection


getmorefigures.php?uid=PMC1131890&req=5

Figure 2: Database search pseudo code. The length of the query sequence Q determines which block of code will execute. Lines 3 – 18 execute for /Q/ <k (wordsize) while lines 19 – 29 execute for /Q/ = k. If k < /Q/ < 2k then Q is divided into two k length pieces for recursive calls to Search in lines 32 – 35 and the results from these calls are further tested in lines 38 – 41 to obtain the final answer in line 42. If /Q/ > = 2k, a similar block of code is executed in lines 44 – 57. However, a different comparison is made in line 54 as compared to line 40.

Mentions: where key is discussed above, offset is the starting position within of the first <location, PID> pair for this key, occurrences is the number of <location, PID> pairs for this key, bytes is the number of bytes written to for this key, i.e. the hash bin size, and numbersize is the size in bytes of each location and PID. Figure 1 shows how offset plus bytes equals offset for the next key. Each line of is extended with spaces on the right up to an appropriate number (we use 30) so that all lines have the same length. In this way, when using as an index into , we can easily determine which line to read based on the key. See lines 4–10 and 20–23 of the search algorithm presented in Figure 2. Using larger k-words during processing will increase the size of but will not change the size . It will, however, decrease the size of each hash bin within .


Refined repetitive sequence searches utilizing a fast hash function and cross species information retrievals.

Reneker J, Shyu CR - BMC Bioinformatics (2005)

Database search pseudo code. The length of the query sequence Q determines which block of code will execute. Lines 3 – 18 execute for /Q/ <k (wordsize) while lines 19 – 29 execute for /Q/ = k. If k < /Q/ < 2k then Q is divided into two k length pieces for recursive calls to Search in lines 32 – 35 and the results from these calls are further tested in lines 38 – 41 to obtain the final answer in line 42. If /Q/ > = 2k, a similar block of code is executed in lines 44 – 57. However, a different comparison is made in line 54 as compared to line 40.
© Copyright Policy
Related In: Results  -  Collection

Show All Figures
getmorefigures.php?uid=PMC1131890&req=5

Figure 2: Database search pseudo code. The length of the query sequence Q determines which block of code will execute. Lines 3 – 18 execute for /Q/ <k (wordsize) while lines 19 – 29 execute for /Q/ = k. If k < /Q/ < 2k then Q is divided into two k length pieces for recursive calls to Search in lines 32 – 35 and the results from these calls are further tested in lines 38 – 41 to obtain the final answer in line 42. If /Q/ > = 2k, a similar block of code is executed in lines 44 – 57. However, a different comparison is made in line 54 as compared to line 40.
Mentions: where key is discussed above, offset is the starting position within of the first <location, PID> pair for this key, occurrences is the number of <location, PID> pairs for this key, bytes is the number of bytes written to for this key, i.e. the hash bin size, and numbersize is the size in bytes of each location and PID. Figure 1 shows how offset plus bytes equals offset for the next key. Each line of is extended with spaces on the right up to an appropriate number (we use 30) so that all lines have the same length. In this way, when using as an index into , we can easily determine which line to read based on the key. See lines 4–10 and 20–23 of the search algorithm presented in Figure 2. Using larger k-words during processing will increase the size of but will not change the size . It will, however, decrease the size of each hash bin within .

Bottom Line: For example, bacteria commonly generate intra-strain diversity through phase variation which is strongly associated with virulence determinants.Our experiments show that subsequently searching the annotation data can refine and focus the results for the user.Source code is available upon request.

View Article: PubMed Central - HTML - PubMed

Affiliation: Department of Computer Science, University of Missouri, Columbia, USA. jeff@diglib1.cecs.missouri.edu

ABSTRACT

Background: Searching for small tandem/disperse repetitive DNA sequences streamlines many biomedical research processes. For instance, whole genomic array analysis in yeast has revealed 22 PHO-regulated genes. The promoter regions of all but one of them contain at least one of the two core Pho4p binding sites, CACGTG and CACGTT. In humans, microsatellites play a role in a number of rare neurodegenerative diseases such as spinocerebellar ataxia type 1 (SCA1). SCA1 is a hereditary neurodegenerative disease caused by an expanded CAG repeat in the coding sequence of the gene. In bacterial pathogens, microsatellites are proposed to regulate expression of some virulence factors. For example, bacteria commonly generate intra-strain diversity through phase variation which is strongly associated with virulence determinants. A recent analysis of the complete sequences of the Helicobacter pylori strains 26695 and J99 has identified 46 putative phase-variable genes among the two genomes through their association with homopolymeric tracts and dinucleotide repeats. Life scientists are increasingly interested in studying the function of small sequences of DNA. However, current search algorithms often generate thousands of matches -- most of which are irrelevant to the researcher.

Results: We present our hash function as well as our search algorithm to locate small sequences of DNA within multiple genomes. Our system applies information retrieval algorithms to discover knowledge of cross-species conservation of repeat sequences. We discuss our incorporation of the Gene Ontology (GO) database into these algorithms. We conduct an exhaustive time analysis of our system for various repetitive sequence lengths. For instance, a search for eight bases of sequence within 3.224 GBases on 49 different chromosomes takes 1.147 seconds on average. To illustrate the relevance of the search results, we conduct a search with and without added annotation terms for the yeast Pho4p binding sites, CACGTG and CACGTT. Also, a cross-species search is presented to illustrate how potential hidden correlations in genomic data can be quickly discerned. The findings in one species are used as a catalyst to discover something new in another species. These experiments also demonstrate that our system performs well while searching multiple genomes -- without the main memory constraints present in other systems.

Conclusion: We present a time-efficient algorithm to locate small segments of DNA and concurrently to search the annotation data accompanying the sequence. Genome-wide searches for short sequences often return hundreds of hits. Our experiments show that subsequently searching the annotation data can refine and focus the results for the user. Our algorithms are also space-efficient in terms of main memory requirements. Source code is available upon request.

Show MeSH
Related in: MedlinePlus