Biological sequences as pictures – a generic two dimensional solution for iterated maps
Representing symbolic sequences graphically using iterated maps has enjoyed enduring popularity since it was first proposed in Jeffrey 1990 as the chaos game representation (CGR). The usefulness of this representation goes beyond the convenience of scale independence: it also provides a variable memory length representation of transition. This includes the representation of succession with non-integer order, which comes with the promise of generalizing Markovian formalisms. The original proposal targeted genomic sequences only, but several generalizations have been proposed since then, many specifically designed to handle protein data.
The challenge of a general solution is that of deriving a bijective transformation of symbolic sequences into bi-dimensional planes. More specifically, it requires the regular fractal nesting of polygons. A first attempt at a general solution was proposed by Fiser 1994 by using non-overlapping circles that contain the polygons. This was used as a starting point to identify a more efficient solution where the encapsulating circles can overlap without the same happening for the sequence maps which are circumscribed to fractal polygon domains.
We identified the optimal inscribed packing solution for iterated maps of any biological sequence, and indeed of any symbolic sequence. The new solution maintains the prized bijective mapping property and includes the Sierpinski triangle and the CGR square as particular solutions of the more encompassing formulation.
Keywords: Biological Sequence, Symbolic Sequence, Circle Packing, Aminoacid Sequence, Markov Chain Order
The use of iterative functions to represent nucleotide sequences was originally proposed in 1990 under the designation of Chaos Game Representation, CGR. Since then, the CGR technique has evolved from a graphic representation technique into a platform for pattern recognition [2, 3, 4, 5, 6, 7, 8, 9], for screening entropic properties [2, 10, 11, 12], and finally into a generalization of Markov transition tables [13, 14]. That conclusion was further generalized for any symbolic sequence, where a dissimilarity metric was also proposed and routes for efficient implementation were established. The emerging interest in alignment-free sequence analysis techniques, in particular for application to proteomic sequences, raises the prospect of a wider use of CGR and CGR-related techniques [18, 19] capable of exploratory sequence analysis of whole genomes [20, 21, 22]. We have subsequently examined the advantage of using non-genomic word-statistics (integer order) for biological sequence analysis with an application to the SCOP protein database. The strengths of that approach suggest that even more interesting results would be achieved by order-free analysis. However, the potential of hyper-dimensional generalizations of CGR, such as the use of Universal Sequence Maps, USM, is hard to convey and realize in the absence of a 2D projection that retains the bijective mapping property of this technique. This conclusion is particularly clear in a recent report exploring CGR application to amino acid sequences. In that work, in order to benefit from the scale independence of the Chaos Game applied to proteins, the solution of choice is to reverse-encode the amino acid sequences back into 4 unit alphabet nucleotide sequences.
Furthermore, as detailed in the derivation of sequence similarity metrics and density kernel functions [13, 14, 15], CGR is most useful as a bijective mapping of a sequence of symbols into a numeric vector of the same length, rather than as a technique to compress sequences into individual points. Accordingly, the work reported here is driven by the need to advance the identification of a 2D projection of the vector of CGR positions that is still applicable to longer, often non-genomic, alphabets.
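The bijective mapping described above can be illustrated for the original 4 unit alphabet. The following is a minimal sketch, not the authors' implementation: it assumes the common corner assignment A=(0,0), C=(0,1), G=(1,1), T=(1,0), a start at the centre of the unit square, and exact rational arithmetic so that decoding is not limited by floating point precision. The function names `cgr_encode` and `cgr_decode` are hypothetical.

```python
from fractions import Fraction

# Assumed corner assignment for the unit square (a common convention).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
HALF = Fraction(1, 2)


def cgr_encode(seq):
    """Return one CGR position per symbol, using the dividing ratio 1/2."""
    x, y = HALF, HALF  # start at the centre of the square
    points = []
    for ch in seq:
        cx, cy = CORNERS[ch]
        # Move half way from the current position towards the symbol's corner.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def cgr_decode(point):
    """Recover the whole sequence from a single CGR position."""
    by_corner = {corner: symbol for symbol, corner in CORNERS.items()}
    x, y = point
    symbols = []
    while (x, y) != (HALF, HALF):
        # The quadrant holding the point identifies the most recent symbol.
        corner = (int(x > HALF), int(y > HALF))
        symbols.append(by_corner[corner])
        # Undo the half-way step to obtain the previous position.
        x, y = 2 * x - corner[0], 2 * y - corner[1]
    return "".join(reversed(symbols))
```

Encoding a sequence of length L yields L positions, and the last position alone is enough to recover the sequence, which is the sense in which the map is bijective.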
A possible solution is to keep the quadrangular representation for each unit and arrange them in a tabular format. However, this solution sacrifices the regularity of the CGR solution where changing the position of the elements of the alphabet (ACGT) corresponds to a simple translation of the coordinates of the map. In order to preserve equivalence between units of the alphabet, the iterated dividing ratios would have to have the same value for all elements of the larger protein alphabet.
To satisfy the desirable equivalence between all elements of the alphabet (all 20 amino acids in the case of proteins) one could simply keep the 1/2 ratio of the original CGR independently of the length of the alphabet. This solution has in fact been explored by morphing the unit square into a polygon with as many vertices as units in the alphabet. However, this second option sacrifices the bijective mapping property for alphabets with more than 4 units. Without this property, the graphical projection becomes just a visualization technique from which it is no longer possible to completely recover the sequence composition.
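The loss of bijectivity at a fixed ratio of 1/2 can be checked numerically. The sketch below is illustrative only and assumes a regular n-gon inscribed in the unit circle. It builds the two sub-polygons reached through neighbouring vertices and reports whether a corner of one lies strictly inside the other, which is a sufficient (not necessary) sign of overlap. All function names are hypothetical.

```python
import math


def polygon(n):
    """Vertices of a regular n-gon on the unit circle, counter-clockwise."""
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
            for k in range(n)]


def sub_polygon(verts, i, s):
    """Image of the whole polygon after one step of ratio s towards vertex i."""
    vx, vy = verts[i]
    return [(x + s * (vx - x), y + s * (vy - y)) for x, y in verts]


def strictly_inside(p, poly, eps=1e-9):
    """True if p is in the interior of a convex counter-clockwise polygon."""
    n = len(poly)
    for k in range(n):
        ax, ay = poly[k]
        bx, by = poly[(k + 1) % n]
        # The cross product is positive when p is to the left of edge a->b.
        if (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax) <= eps:
            return False
    return True


def neighbours_overlap(n, s=0.5):
    """Sufficient overlap test for the sub-polygons of vertices 0 and 1."""
    verts = polygon(n)
    first = sub_polygon(verts, 0, s)
    second = sub_polygon(verts, 1, s)
    return any(strictly_inside(p, second) for p in first)
```

With s = 1/2 the neighbouring sub-polygons only touch for n = 3 and n = 4, whereas for n = 5 or n = 20 they share interior points. A position inside such an overlap has more than one candidate predecessor, so the sequence can no longer be recovered unambiguously.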
Finally, the third possibility, chronologically the first extension to be proposed, is to adjust the iterative step such that the prized bijective property is preserved, which enables recovery of the sequence from the map position. However, in that work, the iterative step was adjusted to inscribe the circles where the polygons are inscribed, not the polygons themselves. As a consequence, neither the CGR nor the iconic three alphabet unit map, the Sierpinski triangle, appears as a particular solution of this formulation.
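Recovery of a sequence from its map position under this third approach can be sketched as follows. The example assumes the circle packing ratio s = 1/(1 + sin(π/n)) that the article attributes to Fiser et al. (Equation 1), a regular 20-gon on the unit circle, a start at the centre, and an arbitrary alphabetical ordering of the amino acids around the polygon. These choices and the function names are illustrative assumptions. Because ordinary floating point numbers are used, only short sequences decode reliably.

```python
import cmath
import math

# Assumed ordering of the 20 amino acids around the polygon.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def fiser_ratio(n):
    """Dividing ratio for non-overlapping enclosing circles (Equation 1)."""
    return 1.0 / (1.0 + math.sin(math.pi / n))


def vertices(n):
    """Vertices of a regular n-gon on the unit circle, as complex numbers."""
    return [cmath.exp(2j * math.pi * k / n) for k in range(n)]


def encode(seq, alphabet=ALPHABET):
    """Return one map position per symbol of seq."""
    n = len(alphabet)
    s, verts = fiser_ratio(n), vertices(n)
    x = 0j  # start at the centre of the polygon
    points = []
    for ch in seq:
        v = verts[alphabet.index(ch)]
        x = x + s * (v - x)  # move a fraction s of the way towards the vertex
        points.append(x)
    return points


def decode(point, length, alphabet=ALPHABET):
    """Recover the last `length` symbols from a single map position."""
    n = len(alphabet)
    s, verts = fiser_ratio(n), vertices(n)
    x, symbols = point, []
    for _ in range(length):
        # Each symbol owns a disjoint circle centred at s * vertex, so the
        # nearest such centre identifies the most recent symbol.
        i = min(range(n), key=lambda k: abs(x - s * verts[k]))
        symbols.append(alphabet[i])
        x = (x - s * verts[i]) / (1 - s)  # undo the step
    return "".join(reversed(symbols))
```

For example, `decode(encode("MKTAYIAK")[-1], 8)` returns the original string. Each decoding step amplifies rounding error by 1/(1 - s), so exact arithmetic would be needed for long sequences.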
The study reported here takes the third of these approaches as the starting point and then seeks to identify a new rule that maximizes the packing density of the projection to the point where the polygons are inscribed directly within each other. In addition to providing a more efficient visualization of the sequence, the specific goal of this study is to obtain a solution that also generalizes the 3 alphabet unit solution given by the Sierpinski triangle and the 4 alphabet unit solution of genomic CGR.
The main goal of the work described here is achieved with the identification of the dividing ratio in Equation 2, for which maximum packing of nested, non-overlapping polygons is obtained. The identification of this equation was achieved empirically with the assistance of MathCad's symbolic processing engine. Specifically, different n-polygons were inscribed in a circle of unitary radius and, using basic trigonometry, the dividing ratio was obtained as the fraction of the distance between any two edges that would identify the perimeter of the target polygon. The observation that the dividing ratio between any two edges is a constant value for each n-polygon was itself empirical. The collection of solutions obtained was symbolically processed using MathCad's equation simplification tool to generate the solution in Equation 2.
The starting point for the attempt to find a generic formulation of CGR that is applicable to all alphabets is twofold. First and foremost, there is the original CGR formulation for nucleotide sequences, which uses a dividing ratio of 1/2 to fill a unit square domain homogeneously. Secondly, there is the proposition that there is a function of the number of edges in a regular polygon, n, that will yield a maximum dividing ratio, s, where the bijective property is still preserved. Fiser et al. proposed Equation 1 as a suitable description of that dependency.
$$ s = \left(1 + \sin(\pi/n)\right)^{-1} \qquad (1) $$
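Equation 1 is easy to evaluate and to check against its geometric meaning. The sketch below reads the trailing "-1" as an exponent, so s = 1/(1 + sin(π/n)), and assumes a polygon inscribed in a unit circle. After one step of ratio s towards a vertex, the enclosing circle has radius 1 - s and its centre lies at distance s from the origin, so neighbouring centres are 2·s·sin(π/n) apart. The helper names are hypothetical.

```python
import math


def fiser_ratio(n):
    """Equation 1: dividing ratio for an n-sided regular polygon."""
    return 1.0 / (1.0 + math.sin(math.pi / n))


def circles_tangent(n, s, tol=1e-12):
    """Check that neighbouring enclosing circles just touch for ratio s."""
    centre_gap = 2 * s * math.sin(math.pi / n)  # distance between centres
    return abs(centre_gap - 2 * (1 - s)) < tol  # equals twice the radius


for n in (3, 4, 20):
    s = fiser_ratio(n)
    print(n, round(s, 4), circles_tangent(n, s))
```

For n = 3, 4 and 20 the ratio evaluates to roughly 0.536, 0.586 and 0.865. Neither of the first two equals the 1/2 of the Sierpinski triangle and the original CGR, consistent with the observation that these maps are not particular solutions of the circle packing formulation.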
It is important to note (see Methods for a description of the computational derivation) that, similarly to the reference circle packing solution by Fiser et al. (Equation 1), the polygon packing achieved by Equation 2 is also a heuristic solution. This is not a rare situation when it comes to fractal geometry, where the numerical solutions for a novel graphical configuration are often the starting point, rather than the conclusion, of the deductive process. It nevertheless suggests that further analysis of this solution is needed to uncover simpler and more meaningful patterns.
As overviewed in the Background section, the enduring popularity of iterated maps to represent biological sequences has gradually expanded their role from a convenient graphical illustration into an efficient computational framework. Two routes now appear possible for furthering the CGR transformation. One possibility is to relinquish the requirement to project sequences in two dimensions and instead use unitary hypercube maps. The other possibility, assisted by the formulation reported here, is to revisit CGR kernel functions to enable feature detection through entropic profiling for non-genomic sequences. Interestingly, the 4 unit alphabet of the genomic code is the exact point where both routes are equivalent, an observation that has been noted, and used, by other authors analyzing non-genomic sequences.
A generic formulation of Chaos Game Representation dividing ratios for 2-dimensional display was identified. The new formulation determines ratios for the iterative function that produce the reference representations for alphabets of length 2 to 4. Specifically, it produces Sierpinski's triangle for n = 3 and the original, homogeneously covered Chaos Game representation for n = 4. Given that the density of CGR corresponds to an order-free transition matrix (each consecutively nested polygon corresponds to an additional Markov chain order), the value of consistent graphical representation techniques is potentially enormous. Furthermore, the growing interest in alignment-free sequence dissimilarity metrics suggests a new role for Chaos Game iterative functions as a scale-free approach to word-statistics. The formulation identified to optimize the dividing ratio is, as is often the case with fractal processes, an empirical result that should be the object of further analysis to uncover simpler and more meaningful patterns. Nevertheless, and regardless of whether the actual CGR computation is performed in two or more dimensions for sequences with longer alphabets, such as proteins, a generic graphical visualization technique is now at hand.
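The correspondence between CGR density and Markov chain order can be checked for the 4 unit alphabet. In the hedged sketch below, the unit square is divided into a 2^k by 2^k grid and the number of CGR positions in each cell is compared with the count of the corresponding word of length k. The corner assignment, the start at the centre of the square and all function names are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import floor

# Assumed corner assignment for the unit square (a common convention).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def cgr_points(seq):
    """CGR positions with dividing ratio 1/2, starting at the centre."""
    x = y = Fraction(1, 2)
    points = []
    for ch in seq:
        cx, cy = CORNERS[ch]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def cell(point, k):
    """Index of the cell holding `point` on a 2^k by 2^k grid."""
    side = 2 ** k
    return floor(point[0] * side), floor(point[1] * side)


def kmer_counts_from_cgr(seq, k):
    """Count words of length k by binning CGR positions into grid cells."""
    words = ("".join(w) for w in product("ACGT", repeat=k))
    # The cell reached after spelling a word identifies that word.
    word_of_cell = {cell(cgr_points(w)[-1], k): w for w in words}
    # Positions before index k - 1 have fewer than k symbols behind them.
    return Counter(word_of_cell[cell(p, k)] for p in cgr_points(seq)[k - 1:])


def kmer_counts_direct(seq, k):
    """Reference count obtained by sliding a window over the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
```

Both counters agree for every k, so refining the grid by one level is equivalent to moving to word counts one symbol longer, the quantities from which a transition table of the next Markov order would be estimated.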
The m-code (Matlab, Mathworks Inc) used to generate the figures, as well as the application depicted in Figure 1, is made available as open source at http://genechaos.org.
This work was supported by the NHLBI Proteomics Initiative through contract N01-HV-28181.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.