Limits...
A niched Pareto genetic algorithm for finding variable length regulatory motifs in DNA sequences

View Article: PubMed Central

ABSTRACT

The transcription factor binding sites also called as motifs are short, recurring patterns in DNA sequences that are presumed to have a biological function. Identification of the motifs from the promoter region of the genes is an important and unsolved problem specifically in the eukaryotic genomes. In this paper, we present a niched Pareto genetic algorithm to identify the regulatory motifs. This approach is based on the maximization of two objectives of the problem that is the motif length and the consensus similarity score. A long motif means it is less likely to be a false motif. The similarity score represents a motifs probability of conservation in a given set of sequences. Proposed method can find multiple, variable length motifs. In this method, we represented a candidate motif as a combination of length and starting position of the motif in each sequence of the co-regulated genes. This enables the algorithm to identify multiple motifs of variable length. We applied this approach on various data sets and the results show that it can find multiple motifs of variable length in co-regulated genes.

No MeSH data available.


Representation of a member: pi is the starting position of the subsequence mi of length w, in ith sequence
© Copyright Policy
Related In: Results  -  Collection


getmorefigures.php?uid=PMC3376862&req=5

Fig2: Representation of a member: pi is the starting position of the subsequence mi of length w, in ith sequence

Mentions: To represent an individual motif, we used the position-based representation approach as used in the algorithms GALF-P and GAME (Chan et al. 2008; Wei and Jensen 2006). Here, each individual motif is represented by a vector P = {w, p1, p2,…,pN} storing the length of motif and a set of possible starting positions for the motif instances in each sequence. Vector P is used to generate the vector of subsequences for a possible consensus solution set M, where each subsequence is of length w. The consensus solution set M = {m1, m2,…,mN}, where each pi is uniquely mapped to subsequence mi of length w, is used to generate the consensus motif. Figure 2 illustrates this approach. The initial population is generated using this multiple attribute representation. The representation of an individual motif in the algorithm is having two fields (1) the length of motif and (2) the starting positions in the promoter sequences. Hence, the population has all the members of same size but having different value of attribute length. This enables the algorithm to identify the motifs of variable length. The numeric encoding is used to represent the width and the starting position of a subsequence. The size of the population is taken from the user as an input. The algorithm randomly generates the initial population of the size specified by the user. The length of motif and starting positions of motif for each subsequence are randomly generated.Fig. 2


A niched Pareto genetic algorithm for finding variable length regulatory motifs in DNA sequences
Representation of a member: pi is the starting position of the subsequence mi of length w, in ith sequence
© Copyright Policy
Related In: Results  -  Collection

Show All Figures
getmorefigures.php?uid=PMC3376862&req=5

Fig2: Representation of a member: pi is the starting position of the subsequence mi of length w, in ith sequence
Mentions: To represent an individual motif, we used the position-based representation approach as used in the algorithms GALF-P and GAME (Chan et al. 2008; Wei and Jensen 2006). Here, each individual motif is represented by a vector P = {w, p1, p2,…,pN} storing the length of motif and a set of possible starting positions for the motif instances in each sequence. Vector P is used to generate the vector of subsequences for a possible consensus solution set M, where each subsequence is of length w. The consensus solution set M = {m1, m2,…,mN}, where each pi is uniquely mapped to subsequence mi of length w, is used to generate the consensus motif. Figure 2 illustrates this approach. The initial population is generated using this multiple attribute representation. The representation of an individual motif in the algorithm is having two fields (1) the length of motif and (2) the starting positions in the promoter sequences. Hence, the population has all the members of same size but having different value of attribute length. This enables the algorithm to identify the motifs of variable length. The numeric encoding is used to represent the width and the starting position of a subsequence. The size of the population is taken from the user as an input. The algorithm randomly generates the initial population of the size specified by the user. The length of motif and starting positions of motif for each subsequence are randomly generated.Fig. 2

View Article: PubMed Central

ABSTRACT

The transcription factor binding sites also called as motifs are short, recurring patterns in DNA sequences that are presumed to have a biological function. Identification of the motifs from the promoter region of the genes is an important and unsolved problem specifically in the eukaryotic genomes. In this paper, we present a niched Pareto genetic algorithm to identify the regulatory motifs. This approach is based on the maximization of two objectives of the problem that is the motif length and the consensus similarity score. A long motif means it is less likely to be a false motif. The similarity score represents a motifs probability of conservation in a given set of sequences. Proposed method can find multiple, variable length motifs. In this method, we represented a candidate motif as a combination of length and starting position of the motif in each sequence of the co-regulated genes. This enables the algorithm to identify multiple motifs of variable length. We applied this approach on various data sets and the results show that it can find multiple motifs of variable length in co-regulated genes.

No MeSH data available.