Some snaps taken while attending the FEMS 2017 conference.
I recently switched to a new camera (the Fujifilm X100T); I’m still adapting to it, but so far I really like its portability and image quality. I’m also a big fan of fixed-lens cameras, as they force you to (try to) follow a consistent style.
A few months ago the IUPAC (International Union of Pure and Applied Chemistry) had to decide on the apparently simple task of naming the most recently discovered chemical elements. Even a basic understanding of chemistry and of how the periodic table is built makes it apparent that such a decision is both of little significance and, at the same time, nearly unrepeatable. Of little significance, because those elements can only be made artificially and have a short and turbulent life before decaying into lighter and more stable elements; nearly unrepeatable, because the number of new elements being “discovered” is probably reaching the limit of what is humanly possible.
Some people argued that one of those elements should have been called Levium, in honour of Primo Levi and his short-story collection “The Periodic Table” (“Il sistema periodico” in Italian), unfortunately with little success.
Primo Levi, who is mostly known for his recollection of his deportation to the Auschwitz concentration camp, possesses two only apparently antithetical qualities as a writer: he is both a distant observer and a relentless moral agent. The former quality seems somewhat wrong when applied to the hopeless struggle of the inmates of the annihilation camps, but such distance acquires sense when used to derive fundamental truths about human nature. This is most evident in the “I sommersi e i salvati” (“The drowned and the saved”) chapter of “Se questo è un uomo” (“If This Is a Man”).
A similar approach is taken in the main theme of “The Periodic Table”: the somewhat technical recollections of the author’s experiences as a chemist (mostly in the varnish industry) are again used to convey some very sharp truths about human nature, and most of all about what science is.
On many occasions throughout the book (each chapter revolving around a particular element), Levi demonstrates that he knows all too well the high of intellectual discovery, and he chooses to stay away from collective discoveries and scientific enterprises in order to focus on the solitary work of individual chemists, which is very similar in spirit to the work of the first alchemists. This is especially true, Levi claims, of the daily struggle of a chemist working in industry, whose day-to-day battle against matter itself leads only to occasional small victories, and surely to burn-out with age; “Chromium”, “Nickel” and “Silver” are very good examples of chapters where this concept is laid out.

Other chapters are instead more intimate and moving, where chemistry is put aside to give space to Levi’s moral compass, forged by a life of terrible experiences and constant observation of human nature. Some of those chapters, like “Argon” and “Tin”, are exhilarating accounts of the author’s family and friends. Others recall more dramatic moments in the author’s life, and are probably the best parts of the book. In “Iron”, a fellow chemistry student with a passion for climbing is used to show what it is like (and what it costs) to be free; the opposite concept is presented in “Gold”, where the author, captured by the fascists, fears that he will soon die. In “Cerium”, an episode from the Nazi lager is recalled to remind us that some basic (and maybe irrational) form of human will can persist even in the face of hopelessness. Finally, in “Vanadium”, the incredible and fortuitous exchange between Levi and one of the German civilians with whom he interacted in Auschwitz shows how morality is a complex matter even when reason is indisputably on one side of the argument.
I cannot recommend this book enough, especially to anyone accustomed to rational thought. Even though it would be impossible to replicate this formula, I cannot help but wonder what a similar book would look like for disciplines like physics (a chapter for each particle?) or even “messier” sciences like biology (a chapter for each species?). It could actually be an interesting exercise for a crowd-sourced book, where stories on a single “unit” of each discipline are contributed by different people. I would surely like to write a chapter about one or two bacterial species.
Everyone who does computational biology and has written at least one Python script probably knows about the BioPython library. I personally remember going “Oh!” some years ago when I gave up writing my own (horrible, terrible, clunky) GenBank file parser and discovered it. Since then it has been a central part of almost all the small scripts I have needed to write. Recent versions have become even more useful, with the inclusion of a very cool KEGG API wrapper, which has the side-effect of bringing together two well-designed pieces of bioinformatics software!
It is then with great pleasure that I’m announcing the addition of the Bio.phenotype module to BioPython, starting from version 1.67. The module allows you to parse and write the output of Phenotype Microarray experiments, as well as to run some simple analyses on the raw data. Even though I have published another piece of software in the past to run the same analyses (plus some more), I thought that a simpler library would prove useful to many, and that having it as part of BioPython would make it more easily accessible. Moreover, from a software development perspective it is worth noting that BioPython follows very strict practices, which ensure that code is properly written, tested and maintained. This is all possible thanks to the great work of the BioPython community in general and of Peter Cock in particular.
So, I really hope that this small library will prove useful to anyone with Phenotype Microarray CSV files collecting dust in their filesystem. To make it even easier I’ve posted a small tutorial here, which also includes some downstream analysis and plots that are not covered in the BioPython manual (the tutorial is also embedded at the end of this post).
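To give a flavour of the kind of analysis involved, here is a small sketch, in plain Python on made-up data, of two of the growth-curve parameters this sort of module estimates for each well (the maximum signal and the area under the curve). This is an illustration of the concept only, not the Bio.phenotype API itself; the function name and the synthetic curve are mine.

```python
# Illustrative sketch (not the Bio.phenotype API): given one well's
# time -> signal measurements, estimate two typical curve parameters:
# the maximum signal reached and the area under the curve.

def curve_parameters(signals):
    """signals: dict mapping time (hours) to the colorimetric signal."""
    times = sorted(signals)
    values = [signals[t] for t in times]
    # Maximum signal reached over the experiment
    maximum = max(values)
    # Area under the curve via the trapezoidal rule
    area = sum(
        (times[i + 1] - times[i]) * (values[i] + values[i + 1]) / 2
        for i in range(len(times) - 1)
    )
    return {"max": maximum, "area": area}

# A synthetic growth-like curve: slow start, rise, then plateau
well = {0: 10, 4: 15, 8: 60, 12: 110, 16: 120}
params = curve_parameters(well)
print(params["max"], params["area"])  # 120 1000.0
```

The real module works on whole plates parsed from the instrument’s CSV files, but the per-well logic boils down to summaries of this kind.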
Contrary to the usual attitude among bioinformaticians (weeping at the horrible state of the software in the field, at the proliferation of standards, you name it), I want to write a post celebrating the amazing leaps forward in the field of bacterial comparative genomics, and what can be learned from them.
A basic recap
For anyone not familiar with bacterial genetics, it is useful to briefly explain the very peculiar genetics of microorganisms. The best metaphor I’m aware of was coined by Prof. Peter Young, who compared bacterial genomes to smartphones. A species is represented by a model (say, an iPhone), which runs a very similar OS across strains (representing the conserved genes, or core), but differs in the number and types of apps installed (representing the so-called dispensable or accessory genes). The ensemble of the OS and all the different apps installed in each “phone” is termed the “pangenome”. It is of course a bit more complicated than that, but this metaphor is useful to explain what microbiologists think of the different gene content within the same species: different “apps” could potentially mean adaptations to specific niches. Several theoretical and experimental works have shown this postulate to be broadly true. Pinning down which genes might confer a competitive advantage in a certain environment is a difficult task though; sometimes up to thousands of different “apps” are “installed” between different strains of the same species, usually with little or no annotation (think of an app in a foreign language with no clear user interface!).
Another, perhaps easier, use of the pangenome concept concerns phylogenetics: genetic relationships between strains can be reconstructed by aligning conserved genes (slightly different OS subversions, let’s say) or by comparing which accessory genes are shared (phones with overlapping “apps” could potentially operate in a similar way).
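The smartphone metaphor translates directly into set operations. A toy sketch (strain and gene names are made up): the core genome is the intersection of each strain’s gene content, the accessory genome is everything else, and gene-content relatedness can be summarized with something like a Jaccard distance.

```python
# Toy sketch of the pangenome concept: each strain is a set of genes
# ("apps"); the core genome is what every strain shares (the "OS").
strains = {
    "strainA": {"geneA", "geneB", "geneC", "geneD"},
    "strainB": {"geneA", "geneB", "geneC", "geneE"},
    "strainC": {"geneA", "geneB", "geneF"},
}

# Core genome: genes present in every strain
core = set.intersection(*strains.values())

# Pangenome: union of all genes; accessory genome = pangenome minus core
pangenome = set.union(*strains.values())
accessory = pangenome - core

def jaccard_distance(a, b):
    """Gene-content distance between two strains (how many 'apps' overlap)."""
    return 1 - len(a & b) / len(a | b)

print(sorted(core))      # ['geneA', 'geneB']
print(len(accessory))    # 4
print(jaccard_distance(strains["strainA"], strains["strainB"]))
```

Real pipelines do the hard part first (deciding which sequences count as “the same gene” across strains); once that clustering exists, the pangenome bookkeeping really is this simple.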
These two approaches represent the vast majority of the analyses that are carried out on bacterial pangenomes; so much so that they are now becoming routine. As with any other bioinformatics task, anything that is done more than once will inevitably become a “pipeline”, with varying levels of flexibility and quality. How good is the established “pangenome pipeline”, if there is one?
The good ‘ol days
Back in 2011 I was a PhD student in Florence and had to make sense of a large (for the time) set of Sinorhizobium meliloti strains. The things we wanted to study about this (interesting!) symbiotic species required us to annotate the newly sequenced genomes, define the pangenome (which genes are part of the OS, and which are apps) and finally derive phylogenetic trees of the core and accessory genomes.
At the time, most of the steps required to achieve these relatively simple goals were run “manually”: the genomes had to be assembled and annotated by knitting together a plethora of different tools, such as Prodigal for the gene models, RNAmmer to predict rRNA clusters, and Blast2GO, InterProScan and KAAS for the functional annotations. Sure, automatic web-based pipelines were available at the time (such as RAST), but with limited availability and flexibility, and long waiting times; if you imagine having hundreds of strains instead of tens, you immediately see how this approach cannot scale up.
Then, to get the pangenome, an appropriate orthology algorithm has to be chosen; accuracy is not really a problem, as strains are usually so close to each other that conserved genes can be detected very easily (so easily that you can write your own implementation). The problem is mostly computational: the algorithm has to be fast and not resource-intensive, especially as the number of genomes scales up. As an example, using InParanoid/MultiParanoid quickly becomes problematic because of the quadratic growth in the number of pairwise InParanoid jobs and the amount of memory required by MultiParanoid. Similar limitations apply to other routinely used orthology algorithms (say, OMA), designed for and used in inter-species analyses.
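The scaling problem is easy to quantify: all-vs-all tools need one run per pair of genomes, so the job count grows quadratically, as n(n-1)/2. A quick back-of-the-envelope check:

```python
# All-vs-all orthology tools (e.g. pairwise InParanoid runs) need one
# job per pair of genomes: n * (n - 1) / 2 jobs for n genomes.
def pairwise_jobs(n_genomes):
    return n_genomes * (n_genomes - 1) // 2

for n in (10, 50, 100, 500):
    print(n, "genomes ->", pairwise_jobs(n), "pairwise jobs")
# 10 -> 45, 50 -> 1225, 100 -> 4950, 500 -> 124750
```

Going from tens to hundreds of strains turns a lunch-break computation into tens of thousands of jobs, which is exactly why intra-species shortcuts pay off.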
The last step in the pipeline required aligning each conserved gene, concatenating the alignments and then deriving a phylogenetic tree. This step was somewhat less stressful, but it still required some scripting and could become complicated when scaling up.
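The concatenation step itself is conceptually trivial, which is part of why hand-rolled scripts were tolerable here. A minimal sketch (toy alignments, made-up strain names): one aligned sequence per strain per core gene, joined into a single “supermatrix” per strain, ready to feed a tree-building program.

```python
# Sketch of the "concatenate core-gene alignments" step: given one
# alignment per conserved gene (strain -> aligned sequence), build a
# single concatenated alignment per strain for tree inference.
gene_alignments = [
    {"strainA": "ATG-CA", "strainB": "ATGACA"},  # core gene 1
    {"strainA": "GGT", "strainB": "GCT"},        # core gene 2
]

def concatenate(alignments):
    strains = sorted(alignments[0])
    # Join each strain's aligned sequences in the same gene order
    return {s: "".join(aln[s] for aln in alignments) for s in strains}

supermatrix = concatenate(gene_alignments)
print(supermatrix["strainA"])  # ATG-CAGGT
```

The scripting pain in practice came from the bookkeeping around this (missing genes, inconsistent strain names, file formats), not from the concatenation itself.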
To summarize, a few years ago almost all the necessary tools were present, but getting to the desired results required a substantial amount of time, not to mention the risk of propagating any trivial mistake in any of the analysis steps down to the final conclusions. You might already imagine what is needed to overcome these problems…
The glorious present and future
Fast forward to 2015, when, to get the same results we obtained four years ago, I can use the following tools:
- The SPAdes assembler, so I don’t have to worry too much about choosing the optimal value of k for the de Bruijn graph
- Prokka, to transform a list of contigs into useful (and tidy, ready-to-submit-to-GenBank) annotations in standard formats
- Roary, to quickly and scalably get the pangenome (and its alignment!) sorted out
It is worth noting that the whole pipeline can now basically be run with three command line calls, and that it shouldn’t take longer than a couple of hours on my 2011 dataset (depending on how many cores you can lay your hands on). This leaves plenty of time to focus on the actual biology of these bugs and to design and carry out more complex and interesting computational analyses.
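Sketched as a dry run, the three calls look roughly like this. The flags shown are typical ones and the file names are placeholders for a single strain; check each tool’s documentation (and loop over your strains) before running anything for real.

```python
# Dry-run sketch of the three-step pipeline for one strain; flags are
# indicative and file names are placeholders, not a tested recipe.
commands = [
    # 1. Assemble the paired-end reads with SPAdes
    ["spades.py", "-1", "reads_1.fastq", "-2", "reads_2.fastq",
     "-o", "assembly"],
    # 2. Annotate the contigs with Prokka (emits the GFF files Roary wants)
    ["prokka", "--outdir", "annotation", "--prefix", "strain1",
     "assembly/contigs.fasta"],
    # 3. Build the pangenome (and its core alignment) with Roary
    ["roary", "-e", "-n", "-f", "pangenome", "annotation/strain1.gff"],
]

for cmd in commands:
    print(" ".join(cmd))  # swap print for subprocess.run(cmd) to execute
```

Three tools, three calls: everything in between (format conversions, orthology bookkeeping, alignment concatenation) is handled for you.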
But how exactly has each of these three tools succeeded in retrospectively making my life so much easier? The secret is not that they provided any particularly strong breakthrough; instead, all three have focused on design, usability and scalability. In particular:
- SPAdes is amazingly robust, very well documented and maintained, and runs relatively fast and with few errors, thus making it trustworthy
- Prokka encapsulates pretty much every tool you might need to get a decent annotation, and it is again amazingly configurable and well documented/maintained
- Roary takes a series of clever shortcuts when clustering gene families, taking advantage of the high similarity between strains and thus dramatically reducing computation time and requirements. Plus, it works best when fed Prokka’s outputs, thus encouraging people in the field to adopt this new standard pipeline
It is also worth noting that all three of these tools make extensive reuse of existing software: SPAdes uses BWA for the error correction step, and Prokka is basically (a very good) wrapper around all the tools needed to annotate a genome, as is Roary. Not so much about re-inventing the wheel, then, but about making it better. I submit that, as biology shifts towards larger and larger datasets, only the implementations that heavily focus on good design, scalability and robustness will prevail.
We’re all advised…