Grammar-based compression approach to extraction of common rules among multiple trees of glycans and RNAs.
Bottom Line:
Many tree structures are found in nature and organisms.The results suggest that our method can be successfully applied to determining the minimum grammar and several common rules among glycans and RNAs.The proposed methods can provide clues for the determination of hierarchical structures contained in tree-structured biological data, beyond the extraction of frequent patterns.
View Article:
PubMed Central - PubMed
Affiliation: Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Japan. tyoyo@kuicr.kyoto-u.ac.jp.
ABSTRACT
Show MeSH
Background: Many tree structures are found in nature and organisms. Such trees are believed to be constructed on the basis of certain rules. We have previously developed grammar-based compression methods for ordered and unordered single trees, based on bisection-type tree grammars. Here, these methods find construction rules for one single tree. On the other hand, specified construction rules can be utilized to generate multiple similar trees. Results: Therefore, in this paper, we develop novel methods to discover common rules for the construction of multiple distinct trees, by improving and extending the previous methods using integer programming. We apply our proposed methods to several sets of glycans and RNA secondary structures, which play important roles in cellular systems, and can be regarded as tree structures. The results suggest that our method can be successfully applied to determining the minimum grammar and several common rules among glycans and RNAs. Conclusions: We propose integer programming-based methods MinSEOTGMul and MinSEUTGMul for the determination of the minimum grammars constructing multiple ordered and unordered trees, respectively. The proposed methods can provide clues for the determination of hierarchical structures contained in tree-structured biological data, beyond the extraction of frequent patterns. |
Related In:
Results -
Collection
License 1 - License 2 getmorefigures.php?uid=PMC4419412&req=5
Mentions: Simple elementary ordered tree grammar (SEOTG) is defined for the rooted ordered trees T(V,E), where V is a set of vertices, and E is a set of labeled edges. If T represents a glycan, a vertex corresponds to a monosaccharide, an edge corresponds to a bond between the monosaccharides, and the enzyme involved in the biosynthesis can be represented by the label of the edge, as shown in Figure 1. As well as CFGs, a grammar of SEOTG consists of 4-tuple (Σ,Γ,S,Δ), where Σ is a set of terminal symbols, Γ is a set of nonterminal symbols, S is a start symbol in Γ, and Δ is a set of production rules, that are classified into Horizontal Bisection (RHB), Vertical Bisection (RVB), and Name Change (RNC), as in Figure 2. It should be noted that production rules of SEOTG and SEUTG are different from construction rules of biomolecular trees. (RHB) includes three rules that an edge of nonterminal symbol A is replaced with a tree whose root is both roots of nonterminal symbols B and C. A is bisected at the root into B and C. We introduce a tag to represent the vertex connected with another tree. The first rule in (RHB) does not contain a tag, and the other rules contain a tag, respectively. (RVB) includes two rules that an edge of nonterminal symbol A is replaced with a tree in which the root is the root of nonterminal symbol B, and the root of nonterminal symbol C is attached to the tag of B. A is bisected at an internal vertex of A into B and C. (RNC) includes two rules that an edge of nonterminal symbol A is replaced with an edge of terminal symbol a. In addition, any nonterminal symbol does not appear in expansion of the symbol itself. Then, each nonterminal symbol corresponds to a subtree of a given tree T. Figure 3 shows an example of SEOTG with (Σ,Γ,S,Δ). Δ is the set of six production rules, R1,⋯,R6, where R1 is a vertical bisection rule, R2 and R3 are horizontal bisection rules, and R4,R5,R6 are name change rules. Figure 4 illustrates the derivation of a tree from the start symbol S by the SEOTG. The first replacement is done by R1, and U1 is replaced with the right-hand side of R2. Then, the lower endpoint of U1 connects with the root of U2, and one of leaves of the replaced tree of U1 connects with the root of U2. A tag indicates the vertex connected with another vertex. Hence, the lower endpoint labeled with a tag in A1 connects with the root of U2. By applying R3,⋯,R6, the right-most tree is generated. The trees surrounded by dotted curves are derived from nonterminal symbols U1 and U2, respectively. It is considered in the example of Figure 1 that a terminal symbol corresponds to a biosynthesis, and a nonterminal symbol corresponds to a sequence of biosyntheses.Figure 2 |
View Article: PubMed Central - PubMed
Affiliation: Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Japan. tyoyo@kuicr.kyoto-u.ac.jp.
Background: Many tree structures are found in nature and organisms. Such trees are believed to be constructed on the basis of certain rules. We have previously developed grammar-based compression methods for ordered and unordered single trees, based on bisection-type tree grammars. Here, these methods find construction rules for one single tree. On the other hand, specified construction rules can be utilized to generate multiple similar trees.
Results: Therefore, in this paper, we develop novel methods to discover common rules for the construction of multiple distinct trees, by improving and extending the previous methods using integer programming. We apply our proposed methods to several sets of glycans and RNA secondary structures, which play important roles in cellular systems, and can be regarded as tree structures. The results suggest that our method can be successfully applied to determining the minimum grammar and several common rules among glycans and RNAs.
Conclusions: We propose integer programming-based methods MinSEOTGMul and MinSEUTGMul for the determination of the minimum grammars constructing multiple ordered and unordered trees, respectively. The proposed methods can provide clues for the determination of hierarchical structures contained in tree-structured biological data, beyond the extraction of frequent patterns.