Design (in computational biology) matters

Contrary to the usual attitude among bioinformaticians (weeping at the horrible state of the software in the field, the proliferation of standards, you name it), I want to write a post celebrating the amazing leaps forward in bacterial comparative genomics, and what can be learned from them.

A basic recap
For everyone not familiar with bacterial genetics, it’s useful to briefly explain the very peculiar genetics of microorganisms. To date, the best metaphor that I’m aware of was coined by Prof. Peter Young, who compares bacterial genomes to smartphones. A species would be represented by a model (e.g. an iPhone), whose OS is very similar between strains (representing the conserved genes, or core), while the number and types of apps installed differ (representing the so-called dispensable or accessory genes). The ensemble of the OS and all the different apps installed in each “phone” is termed the “pangenome”. It’s of course a little bit more complicated than that, but this metaphor is useful to explain what microbiologists think of the different gene content inside the same species: different “apps” could potentially mean adaptations to specific niches. Several theoretical and experimental works have shown that this postulate tends to hold. Pinning down which genes might confer a competitive advantage in a certain environment is a difficult task though; sometimes up to thousands of different “apps” are “installed” between different strains of the same species, usually with little or no annotation (think of an app in a foreign language with no clear user interface!).
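To make the metaphor concrete, here is a minimal Python sketch of how the "OS" and the "apps" could be told apart once each strain has been reduced to a set of gene families. The strain and gene names are made up for illustration, and the strict "present in every strain" definition of core is an assumption; real tools often relax it.

```python
from collections import Counter


def split_pangenome(strain_genes):
    """Split gene families into pangenome, core and accessory sets.

    strain_genes maps a strain name to the set of gene families it carries.
    Core is defined here (an assumption) as "present in every strain".
    """
    n_strains = len(strain_genes)
    # Count in how many strains each gene family appears.
    counts = Counter(g for genes in strain_genes.values() for g in set(genes))
    pangenome = set(counts)
    core = {g for g, c in counts.items() if c == n_strains}
    accessory = pangenome - core
    return pangenome, core, accessory


# Hypothetical "phones": same OS, different apps installed.
phones = {
    "strain_A": {"os_kernel", "os_ui", "maps_app"},
    "strain_B": {"os_kernel", "os_ui", "chat_app"},
    "strain_C": {"os_kernel", "os_ui", "maps_app", "game_app"},
}

pangenome, core, accessory = split_pangenome(phones)
print("core:", sorted(core))
print("accessory:", sorted(accessory))
```

The hard part in practice is deciding which genes in different strains belong to the same family; the sketch assumes that has already been done.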

These are actually different species if we follow the metaphor, but you get the idea!

Another, maybe easier, use of the pangenome concept concerns phylogenetics: genetic relationships between strains can be inferred by aligning conserved genes (slightly different OS sub-versions, let’s say) or by comparing which accessory genes are shared (phones with overlapping “apps” could potentially operate in a similar way).
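The second idea, comparing which "apps" are shared, can be sketched with a simple set distance. The snippet below uses the Jaccard distance between the accessory gene sets of each pair of strains; this is just one possible choice of distance, picked here for illustration, and the strain and gene names are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| (0.0 when both sets are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(accessory_by_strain):
    """Jaccard distance for every unordered pair of strains."""
    return {
        (s1, s2): jaccard_distance(accessory_by_strain[s1], accessory_by_strain[s2])
        for s1, s2 in combinations(sorted(accessory_by_strain), 2)
    }


# Hypothetical accessory genomes ("apps" installed on each phone).
apps = {
    "A": {"maps", "chat"},
    "B": {"maps", "chat", "game"},
    "C": {"music"},
}

distances = pairwise_distances(apps)
for pair, d in distances.items():
    print(pair, round(d, 3))
```

A distance matrix like this could then be fed to any clustering or tree-building method to group strains by shared accessory content.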

These two approaches represent the vast majority of the analyses that are being carried out on bacterial pangenomes, so much so that they are now becoming routine. As with any other bioinformatics task, anything that is done more than once will inevitably become a “pipeline”, with different levels of flexibility and quality. How good is the established “pangenome pipeline”, if there is one?

The good ol’ days
Back in 2011 I was a PhD student in Florence and had to make sense of a large (for the time) set of Sinorhizobium meliloti strains. What we wanted to study about this (interesting!) symbiotic species required annotating the newly sequenced genomes, defining the pangenome (which genes are part of the OS, and which genes are apps) and finally getting a phylogenetic tree of the core and accessory genome.

At the time most of the steps required to achieve these relatively simple goals were run “manually”: the genomes had to be assembled and annotated by knitting together a plethora of different tools, such as Prodigal to get gene models, RNAmmer to predict rRNA clusters, and Blast2GO, InterProScan and KAAS for functional annotations. Sure, automatic web-based pipelines were available at the time (such as RAST), but with limited availability and flexibility, and long waiting times; and if you imagine having hundreds of strains instead of tens, you immediately see how this approach cannot scale up.

A blast2go run on a single genome took a lot of time to complete.

Then, to get the pangenome, an appropriate orthology algorithm had to be chosen; accuracy is not really a problem, as strains are usually so close to each other that conserved genes can be detected very easily (so easily that you can write your own implementation). The problem is mostly computational: the algorithm has to be fast and not resource-intensive, especially when the number of genomes scales up. By way of example, using InParanoid/MultiParanoid quickly becomes problematic because of the quadratic growth in the number of pairwise InParanoid jobs and the amount of memory required by MultiParanoid. Similar limitations apply to other routinely used orthology algorithms (say OMA), which were designed for and used in inter-species analyses.
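To get a feel for how quickly all-against-all comparisons pile up, note that n genomes need n(n-1)/2 pairwise jobs. A tiny Python sketch (the function name is only illustrative):

```python
from math import comb


def pairwise_jobs(n_genomes):
    """Number of unordered genome pairs, i.e. n * (n - 1) / 2."""
    return comb(n_genomes, 2)


for n in (10, 100, 1000):
    print(f"{n} genomes -> {pairwise_jobs(n)} pairwise jobs")
# 10 genomes -> 45, 100 genomes -> 4950, 1000 genomes -> 499500
```

Going from tens to hundreds of strains therefore multiplies the number of jobs by roughly a hundred, before even considering the memory needed to merge the results.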

The last step in the pipeline required aligning each conserved gene, concatenating the alignments and then deriving a phylogenetic tree. This step was somewhat less stressful, but still required some scripting and could become complicated when scaling up.
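The "some scripting" for the concatenation step might have looked roughly like the Python sketch below. It assumes each core gene has already been aligned by an external tool and loaded as a mapping from strain name to aligned sequence; the data layout and names are hypothetical, not the scripts I actually used.

```python
def concatenate_alignments(gene_alignments, strains):
    """Concatenate per-gene alignments into one sequence per strain.

    gene_alignments maps gene name -> {strain name -> aligned sequence}.
    Genes are joined in sorted order so every strain gets the same layout.
    """
    parts = {strain: [] for strain in strains}
    for gene in sorted(gene_alignments):
        alignment = gene_alignments[gene]
        # A core gene must be present in every strain.
        missing = [s for s in strains if s not in alignment]
        if missing:
            raise ValueError(f"{gene} is missing from strains: {missing}")
        # All rows of one alignment must have the same length.
        lengths = {len(alignment[s]) for s in strains}
        if len(lengths) != 1:
            raise ValueError(f"{gene} rows have different lengths: {lengths}")
        for strain in strains:
            parts[strain].append(alignment[strain])
    return {strain: "".join(chunks) for strain, chunks in parts.items()}


# Hypothetical, already aligned core genes for two strains.
core_alignments = {
    "geneA": {"s1": "ATG-", "s2": "ATGC"},
    "geneB": {"s1": "TTA", "s2": "TTG"},
}

supermatrix = concatenate_alignments(core_alignments, ["s1", "s2"])
for strain, sequence in supermatrix.items():
    print(f">{strain}\n{sequence}")
```

The resulting concatenated alignment is what would then be handed to a tree-building program.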

To summarize, a few years ago almost all the necessary tools were present, but they required a substantial amount of time to get to the desired results, not to mention the risk of propagating a trivial mistake in any of the analysis steps down to the final conclusions. You might already imagine what is needed to overcome these problems…

The glorious present and future

Fast forward to 2015: to get the same results that we got four years ago, I can now use the following tools:

  • The SPAdes assembler, to not worry too much about choosing the optimal value of k for the de Bruijn graph
  • Prokka, to transform a list of contigs into useful (and tidy, ready-to-submit-to-GenBank) annotations in standard formats
  • Roary, to quickly and scalably get the pangenome (and its alignment!) sorted out

It is worth noting that the whole pipeline can now basically be run with three command line calls, and that it shouldn’t take longer than a couple of hours on my 2011 dataset (depending on how many cores you can lay your hands on). This leaves plenty of time to focus on the actual biology of these bugs and to design and carry out more complex and interesting computational analyses.
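For the curious, the three calls could be assembled along the lines of the Python sketch below. It only builds and prints the command lines, it does not run anything. The file names, directory layout and thread count are hypothetical, and the flags are the ones I recall from each tool's help text, so check `spades.py --help`, `prokka --help` and `roary -h` for your installed versions. Assembly and annotation are run once per strain, Roary once on all the resulting GFF files.

```python
import shlex


def per_strain_commands(strain, reads_1, reads_2, threads=4):
    """Build the SPAdes and Prokka calls for one strain (paths are hypothetical)."""
    assembly_dir = f"{strain}_spades"
    annotation_dir = f"{strain}_prokka"
    assemble = ["spades.py", "-1", reads_1, "-2", reads_2,
                "-t", str(threads), "-o", assembly_dir]
    # SPAdes writes the assembled contigs to contigs.fasta in its output directory.
    annotate = ["prokka", "--outdir", annotation_dir, "--prefix", strain,
                "--cpus", str(threads), f"{assembly_dir}/contigs.fasta"]
    return assemble, annotate


def roary_command(gff_files, threads=4, out_dir="roary_out"):
    """Build the Roary call; -e asks for a core gene alignment as well."""
    return ["roary", "-e", "-p", str(threads), "-f", out_dir, *gff_files]


assemble, annotate = per_strain_commands("S1", "S1_R1.fastq.gz", "S1_R2.fastq.gz")
pangenome = roary_command(["S1_prokka/S1.gff", "S2_prokka/S2.gff"])

for command in (assemble, annotate, pangenome):
    print(shlex.join(command))
```

Building the commands as lists keeps them easy to pass to `subprocess.run` later without worrying about shell quoting.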

The quokka, the marsupial from which Prokka takes its name

But how exactly has each of these three tools succeeded in retrospectively making my life so much easier? The secret is not that they provided any particularly strong breakthrough; instead, all three have focused on design, usability and scalability. In particular:

  • SPAdes is amazingly robust, very well documented and maintained, and runs relatively fast and with few errors, which makes it trustworthy
  • Prokka encapsulates pretty much every tool you might need to get a decent annotation, and it is again amazingly configurable and well documented/maintained
  • Roary takes a series of clever shortcuts when clustering gene families, taking advantage of the high similarity between strains and thus dramatically reducing computation time and requirements. Plus, it works best when fed with Prokka’s outputs, thus encouraging people in the field to adopt this new standard pipeline

It’s also worth noting that all three of these tools make extensive reuse of existing tools: SPAdes uses BWA for the error correction step, Prokka is basically a (very good) wrapper around all the tools needed to annotate a genome, and so is Roary. It is not so much about re-inventing the wheel then, but more about making it better. I submit that as biology shifts towards larger and larger datasets, only the implementations that focus heavily on good design, scalability and robustness will prevail.
We’ve all been warned…