Characteristics of equipartition for RNA structure

Background: With the continuing discovery of novel RNA molecules with key cellular functions and of novel pathways and interaction networks, the need for structural information on RNA keeps increasing. To predict the structure of long RNAs and understand their natural folding mechanism, it is important to explore the characteristics of RNA structure. Methods: The real RNA secondary structures of all 480 sequences from the RNA STRAND database that were validated by nuclear magnetic resonance or X-ray are selected. For a sequence with multiple domains, the length ratios of these domains to the sequence are computed. For a sequence with one domain and multiple sub-domains, the length ratios of these sub-domains to the domain are computed. The ratios are then compared and analyzed to find the partition characteristic of domains and sub-domains. Results: For most RNAs, the length ratios of the domains to the sequence are close to equal, and those of the sub-domains to their domain are also nearly identical. Most RNAs with multiple domains have two domains, so the length ratios of the domains to the sequence are close to 0.5. For a sequence with one domain and at most one sub-domain, the centre of the domain and sub-domain is close to that of the sequence. Conclusions: Statistical analysis yields a novel finding: RNA folding accords with the characteristic of equipartition. This characteristic reflects the folding rules of RNA from a new angle, which may be closer to natural folding.


Background
RNAs are versatile molecules. To fully understand the various functions of RNAs, we first need to understand their structures [1]. Experimental determination of RNA tertiary structure is too expensive and time-consuming to meet practical needs, so computational prediction of RNA structure has become a basic method and a central problem in computational biology [2].
RNA folds while it is being transcribed from DNA. For predicting RNA structure, a case may be made that the natural folding process of RNA and the simulated folding of RNA using an evolutionary algorithm, which includes intermediate folds, have much in common [3]. Exploring the characteristics of RNA structure is therefore important for understanding its natural folding mechanism.
We compare the structures in a test set of all 480 sequences from RNA STRAND [4] that were validated by NMR or X-ray, and report a novel finding based on statistical analysis of real RNA secondary structures: RNA folding accords with the characteristic of equipartition.

Methods
Let the sequence $s = s_1 s_2 \ldots s_n$ be a single-stranded RNA molecule, where each base $s_i \in \{A, U, C, G\}$, $1 \le i \le n$. The subsequence $s_{i,j} = s_i s_{i+1} \ldots s_j$ is a segment of $s$, $1 \le i \le j \le n$.
If $s_i$ and $s_j$ are complementary bases (A&U, C&G, U&G), then $s_i$ and $s_j$ may constitute a base pair (i, j). A secondary structure S on s is a set of base pairs S = {(i, j)}, where i, j ∈ {1, 2, ..., n}, that satisfies the following conditions.
(No sharp turns.) The ends of each pair in S are separated by at least four intervening bases; that is, if (i, j) ∈ S, then i < j-4.
(Matching.) S is a matching: no base appears in more than one pair.
(The non-crossing condition.) If (i, j) and (k, l) are two pairs in S, then they are compatible; that is, they are juxtaposed (e.g. i < j < k < l) or nested (e.g. i < k < l < j).
If base pairs (i, j) and (k, l) are incompatible, they form a pseudoknot (e.g. i < k < j < l). More complex pseudoknots may occur if three or more base pairs cross each other.
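The conditions above, together with the pseudoknot test, can be checked mechanically on a list of base pairs. The following is a minimal Python sketch and not the authors' code: the function name `check_structure` and the parameter `min_loop` are illustrative assumptions, the loop threshold follows the "at least four intervening bases" wording, and base complementarity is not checked because only pair positions are supplied.

```python
from itertools import combinations


def check_structure(pairs, min_loop=4):
    """Report violations of the secondary-structure conditions.

    pairs    : iterable of (i, j) base-pair positions (1-based)
    min_loop : assumed minimum number of unpaired bases between i and j
    """
    pairs = [tuple(sorted(p)) for p in pairs]

    # No sharp turns: at least `min_loop` bases lie strictly between i and j.
    sharp_turns = [(i, j) for i, j in pairs if j - i - 1 < min_loop]

    # Matching: no base position occurs in more than one pair.
    positions = [b for p in pairs for b in p]
    is_matching = len(positions) == len(set(positions))

    # Incompatible (crossing) pairs i < k < j < l form a pseudoknot.
    pseudoknots = [
        ((i, j), (k, l))
        for (i, j), (k, l) in combinations(sorted(pairs), 2)
        if i < k < j < l
    ]
    return {
        "sharp_turns": sharp_turns,
        "is_matching": is_matching,
        "pseudoknots": pseudoknots,
    }


# Two nested pairs and one juxtaposed pair: every condition holds.
print(check_structure([(1, 10), (2, 9), (12, 20)]))
# Two crossing pairs are reported as a pseudoknot.
print(check_structure([(1, 10), (5, 15)])["pseudoknots"])
```

A structure with an empty `pseudoknots` list satisfies the non-crossing condition; a non-empty list identifies the crossing pairs.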
In the past, domains have been described as units of compact three-dimensional structure, folding, function and evolution [5]. A domain is a conserved part of a given sequence and structure that exists independently of the rest of the chain and can often be independently stable and folded. The majority of domains have fewer than 200 residues, with an average of approximately 100 residues [6].
A domain D(i, j) consists of all base pairs (i', j') that lie inside it; that is, if (i', j') ∈ D(i, j), then i < i' < j' < j. Each base pair and each helix is placed uniquely in one domain [7].
A domain is closed by a helix or pseudoknot, as shown in Figure 1. A sub-domain is an independently stable part of a domain. If the closing helix or pseudoknot of a domain is deleted, its sub-domains become domains.
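The domain and sub-domain definitions can be turned into a small procedure. The sketch below is one possible reading and not the authors' implementation. It assumes that a domain is the span covered by a maximal group of nested or crossing base pairs, and that the closing helix is a run of perfectly stacked pairs (i, j), (i+1, j-1), and so on. The function names `find_domains` and `find_sub_domains` are hypothetical, and domains closed by a pseudoknot are not handled by the sub-domain step.

```python
def find_domains(pairs):
    """Merge base-pair intervals into top-level domains (5'-end, 3'-end)."""
    domains = []
    for i, j in sorted(tuple(sorted(p)) for p in pairs):
        if domains and i < domains[-1][1]:
            # Nested or crossing pair: it belongs to the current domain.
            domains[-1][1] = max(domains[-1][1], j)
        else:
            # Juxtaposed pair: it opens a new domain.
            domains.append([i, j])
    return [tuple(d) for d in domains]


def find_sub_domains(pairs, domain):
    """Delete the closing helix of `domain`; what remains inside are sub-domains."""
    pair_set = {tuple(sorted(p)) for p in pairs}
    i, j = domain
    if (i, j) not in pair_set:
        raise ValueError("domain is not closed by a single helix")
    # Peel off the stacked pairs of the closing helix.
    while (i, j) in pair_set:
        pair_set.remove((i, j))
        i, j = i + 1, j - 1
    inner = [(a, b) for a, b in pair_set if domain[0] < a and b < domain[1]]
    return find_domains(inner)


# Illustrative pair list: two domains, the first with two sub-domains.
pairs = [(1, 20), (2, 19), (3, 8), (10, 17), (25, 40), (26, 39)]
print(find_domains(pairs))               # [(1, 20), (25, 40)]
print(find_sub_domains(pairs, (1, 20)))  # [(3, 8), (10, 17)]
```

Deleting the closing helix and re-running the domain search mirrors the statement that sub-domains become domains once the closing helix is removed.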
By convention, single strands of RNA sequences are written in the 5'-to-3' direction. RNA folds while it is being transcribed from DNA. Transcription of the subsequence $s_{i,j}$ begins at the 5'-end $s_i$ and terminates at the 3'-end $s_j$, as shown in Figure 1. The helix (i, i+m: j-m, j) is completely folded after $s_j$ has been transcribed.
To understand the natural folding mechanism and pathway of RNA, we selected the real structures of all 480 non-fragment, non-redundant sequences from RNA STRAND with secondary and pseudoknotted structures validated by NMR or X-ray, and analyzed their domains and sub-domains.
If the structure of an RNA has multiple domains, we compute R, the ratio of the 3'-end of each domain to the length of the sequence. Let L be the length of the sequence. The ratio of the 3'-end of the helix (i, i+m: j-m, j) and of the domain D(i, j) to the length of the sequence $s_{1,L}$ is the ratio of j to L, that is, R = j/L. We compute and analyze the values of R to find the partition characteristic of domains.
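As an illustration of the ratio R = j/L, the following short Python sketch computes one value per domain. The function name `three_prime_ratios` and the example coordinates are hypothetical and are not taken from the data set.

```python
def three_prime_ratios(domains, seq_len):
    """Return R = j / L for each domain D(i, j), ordered from 5' to 3'."""
    if seq_len <= 0:
        raise ValueError("sequence length must be positive")
    return [j / seq_len for _, j in sorted(domains)]


# Hypothetical sequence of length 40 with domains D(1, 20) and D(25, 40).
print(three_prime_ratios([(1, 20), (25, 40)], 40))  # [0.5, 1.0]
```

The last domain always gives a ratio at or near 1 when it ends at the 3'-end of the sequence, so the informative values are those of the earlier domains.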
If the structure of an RNA has only one domain, we compute SR, the length ratio of each of its sub-domains to the domain. If the domain D(i, j) is closed by a helix (i, i+m: j-m, j), then its internal length is j-i-2m-1, and SR = (q-p+1)/(j-i-2m-1) for the sub-domain D(i+p, i+q) with j-m-i > q > p > m. We compute and analyze the values of SR to find the partition characteristic of sub-domains.
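The formula SR = (q-p+1)/(j-i-2m-1) can be written out as follows. This is a minimal sketch under the definitions above; the function name `sub_domain_ratios` and the example coordinates are hypothetical. Sub-domains are passed as absolute positions and converted to the offsets p and q used in the text.

```python
def sub_domain_ratios(i, j, m, sub_domains):
    """Return SR for each sub-domain of D(i, j) closed by helix (i, i+m: j-m, j).

    sub_domains : list of (start, end) absolute positions, i.e. (i+p, i+q)
    """
    internal_len = j - i - 2 * m - 1  # bases strictly inside the closing helix
    if internal_len <= 0:
        raise ValueError("closing helix leaves no internal bases")
    ratios = []
    for start, end in sub_domains:
        p, q = start - i, end - i
        # The sub-domain must lie strictly inside the closing helix.
        if not (m < p < q < j - m - i):
            raise ValueError(f"sub-domain ({start}, {end}) is outside the domain interior")
        ratios.append((q - p + 1) / internal_len)
    return ratios


# Hypothetical domain D(1, 30) closed by helix (1, 3: 28, 30), so m = 2 and
# the interior spans positions 4..27 (24 bases).
print(sub_domain_ratios(1, 30, 2, [(4, 15), (16, 27)]))  # [0.5, 0.5]
```

Two sub-domains that each cover half of the interior give SR = 0.5 for both, which is the equipartition case for two sub-domains.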

Results and discussion
Characteristic of equipartition for synthetic RNA
We compare the structures of all 248 sequences of synthetic RNA. The results of the statistical analysis of these structures are shown in Figure 1, Figure 2, Table 1, Table 2 and Table 3. Table 1 shows the distribution of domains for synthetic RNA with more than two domains.
There are 12 sequences with more than two domains, as shown in Table 1 and Figure 1. Sequences PDB_00195, PDB_00262, PDB_00754 and PDB_01250 have three domains. Their domains are 0-0.33L, 0.33L-0.66L and 0.66L-L, as shown in Table 1 and Figure 1A, which completely fits the characteristic of equipartition. Sequence PDB_01060 has three domains, 0.25L-0.5L, 0.5L-0.75L and 0.75L-L, but if the sequence is divided into four parts, 0-0.25L, 0.25L-0.5L, 0.5L-0.75L and 0.75L-L, it also conforms to the characteristic of equipartition.
Only sequence PDB_01249 has six domains, 0-0.17L, 0.17L-0.33L, 0.33L-0.5L, 0.5L-0.67L, 0.67L-0.83L and 0.83L-L, which completely fits the characteristic of equipartition, as shown in Table 1 and Figure 1D.
Table 2 shows the distribution of domains for synthetic RNA with two domains. There are 48 sequences with two domains, as shown in Table 2 and Figure 2. The domains of 29 sequences are exactly 0-0.5L and 0.5L-L, and those of 13 sequences are close to 0-0.5L and 0.5L-L, which fits the characteristic of equipartition. Each domain is formed by parallel helices or pseudoknots, as shown in Figure 1E and 1F. There are some exceptions: the domains of sequence PDB_00196 are 0-0.33L and 0.33L-L, those of sequence PDB_00868 are 0-0.39L and 0.39L-L, those of sequences PDB_00709 and PDB_00710 are 0-0.4L and 0.4L-L, those of sequence PDB_00971 are 0-0.57L and 0.57L-L, and those of sequence PDB_01138 are 0-0.58L and 0.58L-L. These can be regarded as combinations of several equal domains, with boundaries close to 0.33L, 0.4L and 0.6L. For example, sequence PDB_00196 can be regarded as three domains, 0-0.33L, 0.33L-0.66L and 0.66L-L, of which 0.33L-0.66L and 0.66L-L combine into the domain 0.33L-L.
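The text does not state a numerical criterion for "close to" an equal partition. Purely as an illustration, the sketch below searches for the smallest number of equal parts N such that every observed domain boundary lies within a tolerance of some multiple of L/N. The function name `smallest_equipartition`, the tolerance of 0.05 and the upper limit of 8 parts are assumptions made for this example and are not the authors' criterion.

```python
def smallest_equipartition(boundaries, tol=0.05, max_parts=8):
    """Return the smallest N for which all boundaries fit multiples of 1/N.

    boundaries : interior domain boundaries as fractions of L (e.g. 0.33)
    tol        : assumed maximum allowed deviation, as a fraction of L
    max_parts  : assumed largest N to try
    """
    for n in range(2, max_parts + 1):
        fits = True
        for r in boundaries:
            k = round(r * n)  # index of the nearest ideal boundary k / n
            # The boundary must match an interior ideal boundary (0 < k < n).
            if not (0 < k < n) or abs(r * n - k) / n > tol:
                fits = False
                break
        if fits:
            return n
    return None


# Boundary values quoted in the text for two-domain synthetic RNAs.
print(smallest_equipartition([0.5]))   # 2
print(smallest_equipartition([0.33]))  # 3 (PDB_00196)
print(smallest_equipartition([0.58]))  # 5 (PDB_01138, near 0.6L)
```

Under these assumed settings, a boundary at 0.33L is read as one of three equal parts and a boundary at 0.58L as one of five, which corresponds to the regrouping described for the exceptional sequences.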
The remaining 188 sequences have one domain or one pseudoknot, and the centre of the domain is essentially the same as that of the sequence. Their sub-domains fall into three classes: no sub-domain, in 157 sequences (Figure 1G); one sub-domain whose centre is close to that of the sequence (Figure 1H); and multiple, nearly equal sub-domains (Figure 1I).
For sequences with one domain and no sub-domain, the centre of the domain is close to that of the sequence. Some of these sequences have one pseudoknot with two helices, and the centre of the pseudoknot is the same as that of the sequence. PDB_00759 has two pseudoknots with three helices, and the centre of helix (3,4:13,14) is essentially the same as that of the sequence.
Table legend: L is the length of the sequence, H1-H2 are the helices of the sequence, D1-D2 are the domains of the sequence, and R1-R2 are the ratios of the 3'-end of each domain to the length of the sequence. The helix data give the closing base pair of a helix or the start and end bases of a pseudoknot. Each domain is expressed by its 5'-end and 3'-end.
Table 3 shows the distribution of sub-domains for synthetic RNA with one domain and multiple sub-domains. There are 7 sequences with one domain and multiple sub-domains, as shown in Table 3. The sub-domains of six sequences also conform to the characteristic of equipartition; only one sequence is an exception.

Characteristic of equipartition for tRNA
We compare the structures of all 46 sequences of tRNA. The results of the statistical analysis of these secondary structures are shown in Table 4 and Table 5. Table 4 shows the distribution of domains for tRNA with multiple domains. There are 17 sequences with two domains. The domains of 8 sequences are exactly 0-0.5L and 0.5L-L, and those of 7 sequences are close to 0-0.5L and 0.5L-L, which fits the characteristic of equipartition. There are some exceptions: the domains of sequence PDB_00681 are 0-0.33L and 0.33L-L, and those of sequence PDB_01162 are 0-0.61L and 0.61L-L. These can also be regarded as combinations of equal domains, with boundaries close to 0.33L and 0.66L.
Three sequences have three domains. Their domains are close to 0-0.33L, 0.33L-0.66L and 0.66L-L, which fits the characteristic of equipartition.
Sequence PDB_00998 has four domains, 0-0.18L, 0.18L-0.34L, 0.34L-0.5L and 0.5L-L. They can be thought of as ...
The remaining 23 sequences have one domain or one pseudoknot, and the centre of the domain is essentially the same as that of the sequence. There are 12 sequences with one domain and multiple sub-domains, as shown in Table 5, and their sub-domains are all close to 0-0.33L, 0.33L-0.66L and 0.66L-L, which conforms to the characteristic of equipartition.

Characteristic of equipartition for other RNA
We compare the structures of all 49 sequences of Other RNA, 6 sequences of Ham Ribozyme and 9 sequences of Viral & Phage.
The results of the statistical analysis of these secondary structures are shown in Table 6. For Other RNA, there are 7 sequences with two domains, and the domains are exactly 0-0.5L and 0.5L-L, which completely fits the characteristic of equipartition. Three sequences have three domains. The domains of PDB_00626 and PDB_00739 are exactly 0-0.33L, 0.33L-0.67L and 0.67L-L, which fits the characteristic of equipartition. Sequence PDB_01261 has three domains. They can be divided into two groups, 0-0.5L and 0.5L-L, and the group 0.5L-L is then divided into 0.5L-0.75L and 0.75L-L. Sequences PDB_001261 and PDB_00985 have four domains. Their domains are close to 0-0.25L, 0.25L-0.5L, 0.5L-0.75L and 0.75L-L, which fits the characteristic of equipartition.
Sequences PDB_01061 and PDB_00370 have five domains. Their domains can be grouped into 0-0.33L, 0.33L-0.66L and 0.66L-L, which completely fits the characteristic of equipartition.
For Viral & Phage, only one sequence, PDB_00743, has two domains, 0-0.52L and 0.52L-L, which fits the characteristic of equipartition, as shown in Table 6. The remaining 8 sequences have only one domain and no sub-domain, and three of them consist of two pseudoknotted helices. These domains all fit the characteristic of equipartition.
For Ham Ribozyme, only one sequence, PDB_00157, has two domains, 0-0.5L and 0.5L-L, which completely fits the characteristic of equipartition, as shown in Table 6. The remaining 5 sequences have only one domain with two sub-domains, and their sub-domains also conform to the characteristic of equipartition.

Characteristic of equipartition for other ribozyme
We compare the structures of all 18 sequences of Other Ribozyme. The results of statistical analysis on these secondary structures are shown in Table 7.
Sequence PDB_00805 has 8 domains, which can be divided into four groups, 0-0.25L, 0.25L-0.5L, 0.5L-0.75L and 0.75L-L, and this conforms to the characteristic of equipartition. Sequence PDB_00176 has four domains, which meet the characteristic of equipartition. Sequence PDB_01187 has four domains, which can be divided into two groups, 0-0.5L and 0.5L-L, and this conforms to the characteristic of equipartition. There are 11 sequences with two domains. They fit the characteristic of equipartition, with three exceptions.
Four sequences have only one domain. PDB_00088 has one domain with no sub-domain and PDB_00142 has one domain with one sub-domain, and both fit the characteristic of equipartition. Sequence PDB_00078 has one pseudoknotted domain with four sub-domains, 0-0.36L, 0.36L-0.48L, 0.48L-0.86L and 0.86L-L. They can be divided into two groups, 0-0.48L and 0.48L-L, which nearly fit the characteristic of equipartition. PDB_01185 has one pseudoknotted domain with three sub-domains, and the sub-domains 0-0.52L and 0.52L-L nearly fit the characteristic of equipartition. The 16S rRNA and 23S rRNA sequences conform to other characteristics besides the characteristic of equipartition; this is a matter for further discussion.

Conclusions
In this paper, we report a novel finding, based on statistical analysis of the real RNA secondary structures of all 480 sequences from RNA STRAND validated by NMR or X-ray: RNA folding accords with the characteristic of equipartition. For most RNA sequences, the lengths of multiple domains are close to equal. The lengths of the sub-domains of one domain are also nearly identical. Most sequences with multiple domains have two domains, so the length ratio of the first domain to its sequence is close to 0.5. The characteristic of equipartition reflects the folding rules of RNA from a new angle, which may be closer to natural folding. By applying this characteristic, algorithms can be designed to predict long RNA structures dynamically, and the dynamic folding mechanism and the relation between function, mutation and RNA structure can be understood more deeply from a new point of view.
Table legend: L is the length of the sequence, N is the number of domains in the sequence, D1-D4 are the domains of the sequence, and R1-R4 are the ratios of the 3'-end of each domain to the length of the sequence. Each domain is expressed by its 5'-end and 3'-end.