Sequence Bundles is a novel method for the collation, visual representation, exploration and analysis of multiple sequence alignment (MSA) data [1]. Since its development, the method has been used to visualise and expose a number of sequence motifs and data features in protein alignments. It was presented at the IEEE VIS 2013 conference in Atlanta, Georgia, where it received an ex aequo honourable mention in the BioVis 2013 data redesign contest.
Motivation
With the continuous development of ever more powerful methods for data collection and generation, we are faced with the challenge of not only making sense of this abundance of information, but also making good use of it. Modern computational methods for structuring data, finding patterns and querying databases address many of these challenges already. However, in many processes, the abilities intrinsic to human perception are still not matched by computers. Such processes include: rapidly recognising complex and non-obvious patterns; instant inferring, deducing and ad hoc hypothesis-forming; following sound and scientifically informed intuition. We aimed at capitalising on these human abilities and tried to bring sequence data analysis closer to human experience.
Our motivation in creating, developing and putting Sequence Bundles to practical use was to allow hidden sequence motifs and other data features to be discovered through direct manipulation and visual analysis of the visualisation itself. Sequence Bundles aims to aid scientific discovery by enabling direct exploration, in which the visualisation serves as a sandbox for rapid testing of hypotheses, suppositions and even speculations about MSAs.
We also aimed to design a visualisation method that would be relatively accessible to readers from outside the domain (e.g. prospective collaborators). By revealing more, and doing so more intuitively than existing MSA visualisation methods, Sequence Bundles is intended to be equally approachable and attractive to both practitioner and non-practitioner audiences.
Related work
With the current growth in the amount of biological data, its scale, variety and complexity, new strategies and tools for exploring this wealth of knowledge are required [2, 3]. Moreover, in order for this knowledge to be understandable and usable for both expert and interdisciplinary audiences, it needs to be presented in accessible, transparent and intuitive ways.
In bioinformatics, the Sequence Logo convention was developed [4] to display a range of MSA features in a single graphic: the consensus sequence, the relative frequencies of residues at every position, the amount of information present at every position measured in bits, and significant locations in the input alignment. Further developments that build on the Sequence Logo method include, inter alia: HMMLogo (giving visual representation to both emission and transition probabilities of Profile Hidden Markov Models, pHMMs) [5]; Seq2Logo (including other important information in the visual output, e.g. about a low number of observations) [6]; CodonLogo (a tool that allows for visual discrimination between patterns of codon and nucleotide conservation) [7]; and pLogo (visualising residue heights scaled relative to their statistical significance) [8]. All of these developments are in essence variations on the original Sequence Logo visualisation method by Schneider and Stephens [4]. Although they enhance the Logo visualisation with novel features, they also retain the Logo's inherent limitations.
Some kinds of information buried in MSAs cannot be easily exposed by the Sequence Logo method or any of its variations. When addressing those MSA features, designers of visualisation tools need to rely on combining other methods [9] or, as in the case of Sequence Bundles, on creating new ones.
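To make the "information in bits" quantity concrete, the following is a minimal sketch of how the per-position information content of a Logo can be computed. It assumes the uncorrected form, log2(alphabet size) minus the Shannon entropy of the column, and omits the small-sample correction term of the original method; the function name `column_information` and the gap symbol `-` are illustrative choices, not part of any published tool.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Return (information in bits, per-residue letter heights) for one column.

    Assumption: gaps are written as '-' and are ignored; no small-sample
    correction is applied.
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0, {}
    total = len(residues)
    # Relative frequency of each residue observed in the column.
    freqs = {r: c / total for r, c in Counter(residues).items()}
    # Shannon entropy of the column, in bits.
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    # Information content: maximum possible entropy minus observed entropy.
    info = math.log2(alphabet_size) - entropy
    # In a Logo, each letter's height is its frequency times the column information.
    heights = {r: f * info for r, f in freqs.items()}
    return info, heights


# Toy DNA columns (alphabet of 4 symbols).
conserved_info, conserved_heights = column_information("AAAA", alphabet_size=4)
mixed_info, mixed_heights = column_information("AACC", alphabet_size=4)
uniform_info, _ = column_information("ACGT", alphabet_size=4)
```

A fully conserved column reaches the maximum (2 bits for DNA), while a column in which all symbols are equally frequent carries no information, which is why such positions almost vanish in a Logo.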
Objectives
In a series of interviews and workshops with bioinformaticians from the United Kingdom, United States and Poland (see the 'Acknowledgements' section), we identified a number of requirements that a successful MSA visualisation should support, as well as a number of limitations and redundant features of the existing Sequence Logo method that should be addressed. This led our design efforts towards the following objectives:
1. Although Sequence Logos are very effective in exposing the general consensus sequence, as well as the amino acid distribution at each position, they also obscure patterns in the relationships between sites within the sequences. As a result, very important information about residue correlation and non-obvious sequence affinity is removed completely from the visualisation. Our general goal was, therefore, to reintroduce this relational information to the visualisation in order to facilitate and assist visual exposure of sequence motifs.
2. Our scientific interviewees saw little benefit in showing the amount of information at each position, which Sequence Logos measure against the Y-axis and express in bits. In fact, some scientists were surprised to learn of this measure during the interview, as they had never used it before. Displaying the amount of information seemed to be addressed to a far more specialised user. Therefore, our aim was to remove this data from the Y-axis and repurpose the axis for the benefit of a larger and more interdisciplinary audience.
3. Some visualisation tools are well suited for showing details, while others favour a more global inspection. Residue statistical detail and localised sequence properties can be easily identified and described by using Sequence Logos (or even by inspecting parts of an MSA itself). However, the Logo method is of limited value when applied to datasets with longer sequences, because of its site-specific focus. Thus, our objective was to favour global inspection of datasets by designing a visual encoding capable of exposing macroscopic patterns and generating findings of sequence-wide significance.
4. A Sequence Logo hides important information about the total number of analysed sequences (this information exists in the length of an MSA itself) and their relative affinity (relative distance from each other on the phylogenetic tree). Consequently, our aim was to provide an indication of the sample size (number of sequences in a visualised MSA).
5. The Sequence Logo visualisation method is equally well equipped to display either DNA or protein MSAs. In fact, the Logo visualisation principles should be easily applied to any sequential dataset which can be formatted as an MSA. Our goal was to retain this universal scope of application.
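The loss of relational information described in the first objective can be illustrated with a small sketch. The two toy alignments below are invented for illustration only (they are not taken from the competition dataset), and the helper names `column_counts` and `pair_counts` are hypothetical. Both alignments have identical residue frequencies at every position, so a frequency-based graphic such as a Logo would look the same for both, yet only one of them contains a strict coupling between the two sites.

```python
from collections import Counter


def column_counts(alignment, i):
    """Residue counts at position i (the kind of data a Logo is built from)."""
    return Counter(seq[i] for seq in alignment)


def pair_counts(alignment, i, j):
    """Counts of residue pairs co-occurring at positions i and j in the same sequence."""
    return Counter((seq[i], seq[j]) for seq in alignment)


# Hypothetical two-column alignments with the same per-column composition.
coupled = ["AK", "AK", "DR", "DR"]    # A always pairs with K, D always with R
shuffled = ["AK", "AR", "DK", "DR"]   # same residues per column, no coupling

# Per-column statistics cannot tell the two alignments apart...
same_columns = all(
    column_counts(coupled, i) == column_counts(shuffled, i) for i in range(2)
)
# ...but the within-sequence pairings differ.
coupled_pairs = pair_counts(coupled, 0, 1)
shuffled_pairs = pair_counts(shuffled, 0, 1)
```

Here `same_columns` is true while `coupled_pairs` and `shuffled_pairs` differ: the information that distinguishes the two alignments lives entirely in which residues occur together in the same sequence.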
In line with our motivation, and in order to address the Sequence Logo limitations and other visualisation challenges identified during our research, we decided to abandon the Sequence Logo convention and develop a completely new method for visualising MSA data, which we explain below. First, in the 'Methods' section, we outline the iterative design methodologies employed in the project, followed by an explanation of the Sequence Bundles visual encoding and a summary of key departures from the Sequence Logo. Later, in the 'Results' section, we describe the extent to which Sequence Bundles has been developed and list a number of interesting data features exposed in the competition dataset by using our visualisation method. Finally, we conclude with a discussion of the interactive potential of the Sequence Bundles method, which can complement existing visualisation tools to expose what could otherwise remain hidden.