Everyone who does computational biology and has written at least one Python script probably knows about the BioPython library. I personally remember going “Oh!” some years ago when I gave up writing my own (horrible, terrible, clunky) GenBank file parser and discovered it. Since then it has been a central part of almost all the small scripts I have needed to write. Recent versions have become even more useful with the inclusion of a very cool KEGG API wrapper, which has the side effect of bringing together two well-designed pieces of bioinformatics software!
It is therefore with great pleasure that I’m announcing the addition of the Bio.phenotype module to BioPython, starting from version 1.67. The module can parse and write the outputs of Phenotype Microarray experiments, and run some simple analyses on the raw data. Even though I have published another piece of software in the past to run the same analyses (plus some more), I thought that a simpler library would prove useful for many, and that having it as part of BioPython would make it more easily accessible. Moreover, from a software development perspective it is worth noting that BioPython follows very strict practices, which ensure that code is properly written, tested and maintained. This is all possible thanks to the great work of the BioPython community in general and of Peter Cock in particular.
So, I really hope that this small library will prove useful to anyone with Phenotype Microarray CSV files collecting dust in their filesystem. To make it even easier I’ve posted a small tutorial here, which also includes some downstream analysis and plots that are not covered in the BioPython manual (the tutorial is also embedded at the end of this post).
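As a rough illustration of what using the module looks like, here is a minimal sketch. `phenotype.parse` and the `"pm-csv"` format name belong to Bio.phenotype; the two helper functions and the file name mentioned afterwards are hypothetical and only serve the example.

```python
def max_signal_per_well(plate):
    """Return {well_id: highest signal} for one plate.

    `plate` is expected to behave like a Bio.phenotype PlateRecord:
    iterating over it yields wells that expose an `id` attribute and a
    `get_signals()` method returning the list of measured signals.
    """
    return {well.id: max(well.get_signals()) for well in plate}


def summarize_csv(path):
    """Parse a Phenotype Microarray CSV file and summarize every plate in it."""
    # Imported here so that the helper above stays usable without Biopython.
    from Bio import phenotype

    with open(path) as handle:
        return {
            plate.id: max_signal_per_well(plate)
            for plate in phenotype.parse(handle, "pm-csv")
        }
```

Calling `summarize_csv("plates.csv")` (with `plates.csv` standing in for one of your own files) would give one dictionary of wells per plate. The same well objects also have a `fit()` method for extracting growth curve parameters, which is where the real analysis starts.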
Unlike the usual attitude among bioinformaticians (weeping at the horrible state of the software in the field, the proliferation of standards, you name it), I want to write a post to celebrate the amazing leaps forward in the field of bacterial comparative genomics, and what can be learned from them.
A basic recap
For anyone not familiar with bacterial genetics, it’s useful to explain a little about the very peculiar genetics of microorganisms. To date, the best metaphor that I’m aware of was coined by Prof. Peter Young, who compared bacterial genomes to smartphones. A species would be represented by a model (e.g. an iPhone), which has a very similar OS across strains (representing the conserved genes, or core) but differs in the number and types of apps installed (representing the so-called dispensable or accessory genes). The ensemble of the OS and all the different apps installed in each “phone” is termed the “pangenome”. It’s of course a little bit more complicated than that, but this metaphor is useful to explain what microbiologists think of the different gene content inside the same species: different “apps” could potentially mean adaptations to specific niches. Several theoretical and experimental works have shown that this postulate tends to hold. Pinning down which genes might confer a competitive advantage in a certain environment is a difficult task though; sometimes up to thousands of different “apps” are “installed” in different strains of the same species, usually with little or no annotation (think of an app in a foreign language with no clear user interface!).
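To make the metaphor concrete, here is a minimal sketch of how the OS and the apps could be told apart once each strain has been reduced to a set of gene families. It assumes the strictest definition of core (present in every strain); the strain names and gene lists are made up for illustration.

```python
def split_pangenome(gene_content):
    """Split gene families into pangenome, core and accessory sets.

    gene_content maps each strain name to the set of gene families
    found in that strain. Core is assumed to mean "present in all strains".
    """
    genomes = list(gene_content.values())
    pangenome = set().union(*genomes)
    core = set.intersection(*genomes) if genomes else set()
    accessory = pangenome - core
    return pangenome, core, accessory


# Toy data: three hypothetical strains.
strains = {
    "strain_A": {"dnaA", "recA", "nodC"},
    "strain_B": {"dnaA", "recA", "nifH"},
    "strain_C": {"dnaA", "recA", "nodC", "nifH"},
}
pangenome, core, accessory = split_pangenome(strains)
print(sorted(core))       # ['dnaA', 'recA']
print(sorted(accessory))  # ['nifH', 'nodC']
```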
Another, maybe easier, use of the pangenome concept concerns phylogeny; genetic relationships between strains can be reconstructed by aligning conserved genes (slightly different OS subversions, let’s say) or by comparing which accessory genes are shared (phones with overlapping “apps” could potentially operate in a similar way).
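The second option, comparing shared accessory genes, can be sketched with a simple set-based distance. The Jaccard distance used here is my choice for the illustration, not necessarily what any given tool computes, and the function names are hypothetical.

```python
from itertools import combinations


def jaccard_distance(genes_a, genes_b):
    """Return 1 - |A & B| / |A | B|; two empty sets are treated as identical."""
    union = genes_a | genes_b
    if not union:
        return 0.0
    return 1.0 - len(genes_a & genes_b) / len(union)


def accessory_distances(accessory_content):
    """Return {(strain_1, strain_2): distance} for every pair of strains.

    accessory_content maps each strain name to its set of accessory genes.
    """
    return {
        (a, b): jaccard_distance(accessory_content[a], accessory_content[b])
        for a, b in combinations(sorted(accessory_content), 2)
    }
```

The resulting pairwise distances could then be handed to any clustering or tree-building method that accepts a distance matrix.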
These two approaches represent the vast majority of the analyses that are being carried out on bacterial pangenomes, so much so that they are now becoming routine. As with any other bioinformatics task, anything that is done more than once will inevitably become a “pipeline”, with different levels of flexibility and quality. How good is the established “pangenome pipeline”, if there is one?
The good ol’ days
Back in 2011 I was a PhD student in Florence and had to make sense of a large (for the time) set of Sinorhizobium meliloti strains. The things we wanted to study about this (interesting!) symbiotic species required annotating the newly sequenced genomes, defining the pangenome (which genes are part of the OS, and which genes are apps) and finally getting a phylogenetic tree of the core and accessory genome.
At the time most of the steps required to achieve these relatively simple goals were run “manually”: the genomes had to be assembled and annotated by knitting together a plethora of different tools, such as prodigal to get gene models, rnammer to predict rRNA clusters, and blast2go, interproscan and KAAS for functional annotations. Sure, automatic web-based pipelines were available at the time (such as RAST), but with limited availability and flexibility, and long waiting times; and if you imagine having hundreds of strains instead of tens, you immediately see how this approach cannot scale up.
Then, to get the pangenome, an appropriate orthology algorithm had to be chosen; accuracy is not really a problem, as strains are usually so close to each other that conserved genes can be very easily detected (so easily that you can write your own implementation). The problem is mostly computational: the algorithm has to be fast and not resource-intensive, especially when the number of genomes scales up. By way of example, using InParanoid/Multiparanoid quickly becomes problematic because of the quadratic growth in the number of pairwise InParanoid jobs and the amount of memory required by Multiparanoid. Similar limitations apply to other routinely used orthology algorithms (say OMA), designed for and used on inter-species analyses.
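The number of pairwise jobs can be counted directly. Assuming one job per unordered pair of genomes, n genomes need n(n-1)/2 jobs; the function name below is just for illustration.

```python
from math import comb


def pairwise_jobs(n_genomes):
    """Number of unordered genome pairs, i.e. n * (n - 1) / 2."""
    return comb(n_genomes, 2)


for n in (10, 100, 1000):
    print(n, pairwise_jobs(n))
# 10 genomes -> 45 jobs, 100 -> 4950, 1000 -> 499500
```

Going from tens to hundreds of strains therefore multiplies the work by roughly a hundred, before even considering the memory needed to merge the results.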
The last step in the pipeline required aligning each conserved gene, concatenating the alignments and then deriving a phylogenetic tree. This step was somewhat less stressful, but still required some scripting and could become complicated when scaling up.
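The concatenation part of that scripting can be sketched in a few lines. This assumes each core gene has already been aligned and is available as a dictionary of aligned sequences per strain; the data layout and function name are hypothetical.

```python
def concatenate_alignments(alignments, strains):
    """Concatenate per-gene alignments into one long sequence per strain.

    alignments maps gene name -> {strain: aligned sequence}. Within a gene,
    all aligned sequences must have the same length, and every strain must
    be present (as expected for core genes).
    """
    concatenated = {strain: [] for strain in strains}
    # Sort the genes so that every strain gets them in the same order.
    for gene in sorted(alignments):
        aligned = alignments[gene]
        if len({len(seq) for seq in aligned.values()}) != 1:
            raise ValueError(f"{gene}: aligned sequences differ in length")
        for strain in strains:
            if strain not in aligned:
                raise ValueError(f"{gene}: missing from strain {strain}")
            concatenated[strain].append(aligned[strain])
    return {strain: "".join(parts) for strain, parts in concatenated.items()}
```

The output is one long aligned sequence per strain, which is the usual input for a tree-building program.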
To summarize, a few years ago almost all the necessary tools were present, but they required a substantial amount of time to get to the desired results, not to mention the risk of propagating any trivial mistake in any of the analysis steps down to the final conclusions. You might already imagine what is needed to overcome these problems…
The glorious present and future
Fast forward to 2015, where to get the same results that we got 4 years ago I can use the following tools:
- The Spades assembler, so as not to worry too much about choosing the optimal value of k for the de Bruijn graph
- Prokka to transform a list of contigs into useful (and tidy-ready-to-submit-to-genbank) annotations in standard formats
- Roary to quickly and scalably get the pangenome (and its alignment!) sorted out
It is worth noting that the whole pipeline can now basically be run with three command line calls, and that it shouldn’t take longer than a couple of hours on my 2011 dataset (depending on how many cores you can lay your hands on). This leaves plenty of time to focus on the actual biology of these bugs and to design and carry out more complex and interesting computational analyses.
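For the curious, the three calls could look roughly like the sketch below, which only builds the command lines without running them. The directory layout, function names and thread count are my own assumptions; the flags are the basic documented ones for each tool, but check them against the versions you have installed.

```python
from pathlib import Path


def per_strain_commands(strain, reads_1, reads_2, workdir, threads=8):
    """Build the assembly and annotation commands for one strain."""
    workdir = Path(workdir)
    assembly_dir = workdir / "assembly" / strain
    annotation_dir = workdir / "annotation" / strain
    spades = [
        "spades.py", "-1", str(reads_1), "-2", str(reads_2),
        "-t", str(threads), "-o", str(assembly_dir),
    ]
    # Spades writes the assembled contigs to contigs.fasta in its output directory.
    prokka = [
        "prokka", "--outdir", str(annotation_dir), "--prefix", strain,
        "--cpus", str(threads), str(assembly_dir / "contigs.fasta"),
    ]
    return spades, prokka


def roary_command(gff_files, workdir, threads=8):
    """Build the pangenome command, run once on all the Prokka GFF files."""
    return [
        "roary", "-e", "-n", "-p", str(threads),
        "-f", str(Path(workdir) / "pangenome"),
        *map(str, gff_files),
    ]
```

Each list can be passed to `subprocess.run(command, check=True)`, the first two once per strain and the last one once for the whole dataset.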
But how exactly has each of these three tools succeeded in retrospectively making my life so much easier? The secret is not that they provided any particularly strong breakthrough, but that all three have focused on design, usability and scalability. In particular:
- Spades is amazingly robust, very well documented and maintained, and runs relatively fast and with few errors, thus making it trustworthy
- Prokka encapsulates pretty much every tool you might need to have a decent annotation and it’s again amazingly configurable and well documented/maintained
- Roary takes a series of clever shortcuts when clustering gene families, taking advantage of the high similarity between strains, thus dramatically reducing computation time and requirements. Plus it works best when fed with Prokka’s outputs, thus encouraging people in the field to adopt this new standard pipeline
It’s also worth noting that all three of these tools make extensive reuse of existing tools: Spades uses BWA for the error correction step, Prokka is basically a (very good) wrapper around all the tools needed to annotate a genome, and so is Roary. It is not so much about re-inventing the wheel then, but more about making it better. I submit that as biology shifts towards larger and larger datasets, only the implementations that heavily focus on good design, scalability and robustness will prevail.
We’re all advised…
Selfgway = Segway + selfie stick (also called, the “Wand of Narcissus”)
The most recent work of Banksy is a rich art exhibition in the form of a “bemusement park”. The park grounds and galleries are full of interesting and provocative installations (the most disturbing being a game where you can drive a series of refugee boats); the general theme of social and political critique feels very refreshing and somewhat reassuring. Another positive aspect is the ability to kindle interest in crowds that are usually not interested in art and social themes.
Despite all these positive aspects, being part of any mass crowd always brings some sort of social critique and irony. This is most probably not coincidental, given the previous works of Banksy, the most notable example being the “Exit through the gift shop” documentary, probably the greatest pun of all time at the expense of the arts industry and arts-consuming crowds. With this in mind, here are a few pictures that document those aspects of being part of any modern crowd.
A particularly exciting part of scientific writing involves broad views on earth and human history; basically coming to very insightful conclusions starting from sparse and specific experiments. This is especially true when using archaeological data, which is very sparse and incomplete by its very nature. A prominent example of such broad insight on human history is “Guns, Germs and Steel” by Prof. Jared Diamond, which represents a successful attempt at removing racial prejudices regarding the reasons why western countries “succeeded” in conquering other civilizations. The most notable example is of course the South American empires, but many other examples from African and Oceanian civilizations are presented. The bottom line of the book is relatively simple: the presence of domesticable plants and animals, together with a geographical configuration enabling exchanges, boosts technological advances. The relative western immunity to some infective agents, derived from cattle domestication, is also indicated as a decisive factor in the western expansion in America.
The book also features some pretty interesting bits, such as the accurate description of how a handful of Spanish soldiers kidnapped the Inca emperor Atahualpa (surrounded by 7000 soldiers!), and more interestingly how the Chinese empire (whose technological level has always been on par with, if not more advanced than, that of western civilizations) decided to abruptly stop naval explorations, therefore possibly changing the world’s history with a single, seemingly minor decision.
Many criticisms of this book focus on the lack of clear and specific experiments that could confirm some of the most challenging conclusions of the author: Tom Tomlinson summarizes such criticisms by saying “it is inevitable that Professor Diamond uses very broad brush-strokes to fill in his argument”. It doesn’t necessarily need to be that way though: in another book, by Walter Alvarez, every single experiment that led to the conclusion that the Cretaceous mass extinction was caused by a giant rock falling from the sky is clearly outlined and explained, including other quite convincing alternative theories. (As a side note, the book has the best name ever invented: T. rex and the Crater of Doom).
Given the use of these “broad brush-strokes”, it is inevitable that new experiments will eventually pop up and provide details that could change the interpretation given by Professor Diamond, maybe not completely, but at least by posing significant challenges, or by creating the need to update parts of the book.
By chance I stumbled upon some very interesting studies that would fit very well as appendices to the book. The first study involves tuberculosis, which is one of the illnesses that badly affected South American populations after contact with the conquistadores. The first deviation from Diamond’s theory is the fact that Mycobacterium tuberculosis (the bacterium causing the disease) was probably transmitted from humans to domesticated animals and not the other way round; the other, more challenging, discovery is the presence in Peru of human bones showing clear signs of tuberculosis some 300 years before Columbus set foot in the Americas. Luckily enough the bacterial DNA in those bones was still readable and allowed the authors of this study to conclude that tuberculosis was brought to Peru by… sea lions. This study is itself another example of a huge insight coming from sparse and incomplete data: much of the conclusion relies on 5 non-synonymous SNPs shared between the Peruvian strain and those that infect modern-day sea lions; there are of course a number of other experiments in the article, making the conclusion pretty robust. It is entirely possible that the distance from the “regular” western tuberculosis was such that native populations had less resistance, but it also shows how newly emerging experiments can threaten parts of a convincing theory. The devil is indeed in the details.
The other three studies all came up this week, as I attended a very interesting talk by Eske Willerslev on the analysis of ancient genomes (the oldest so far being that of a 700,000-year-old horse!). The first study somewhat challenges Diamond’s idea that most of the large mammals went extinct through human intervention (thus reducing the availability of domesticable animals for some civilizations): the extinctions could instead have been caused by a drastic reduction in the presence of some protein-rich plants after the end of the latest ice age. The other, quite amazing, pair of studies suggests that there have been contacts (and, you know, admixture) between South Americans and Polynesians. This conclusion is based on two mirror-image experiments: genotyping of natives from Rapa Nui and of two individuals belonging to the Brazilian Botocudos population (who actually seem to show no signs of Native American origin). According to the genetic data this encounter happened no later than about 1400. This exciting conclusion promises to revise the way human migrations in the new continent are currently taught, and it is also very likely that new surprises will pop up sooner or later.
P.S. I’m pretty sure that there are many other studies out there challenging details of Diamond’s book that I’m completely unaware of 🙂
This article is going to mix three things that apparently have little to do with each other: art, war, and microbiology.
As in the previous post, it all started with an art exhibition at the Strozzina museum in Florence: as part of the exhibition called “Unstable territories”, a room was all covered in black and featured several screens in the middle. A wild and alien landscape was being screened: it featured hills filled with trees and grass fields, all red. Some of the screens started showing soldiers marching through refugee camps, while tanks and guns were firing in the distance; all the people were looking at the camera in complete silence. A few dead bodies on the side of the roads were shown too, surrounded by fluorescent red grass. The piece is called “The enclave” by Richard Mosse, shot in Congo using an infrared film used by the army to spot disguised weapons. The objective of the piece is quite straightforward: by showing an environment full of what resembles blood, the impact of the war on the population is exposed for everyone to see. The beautiful and quiet landscapes are transformed into a nightmare.
A few months later, an article suggested that, instead of increasing awareness of the too-often forgotten war in Congo, the art piece caught the interest of many people only because of the peculiar effect obtained through the use of infrared film. I first thought that it was wrong, but I soon realized that I knew close to nothing about that subject; kindly enough, the author pointed me to a truly complete book on that war, called “Dancing in the glory of monsters”. The author does a great job of telling the long and intricate story of this war, going from the responsibility of the international community and state corruption down to the dreadful stories of individual civilians.
After reading that book, a few considerations come to mind: first of all, it is true that the work of Richard Mosse cannot be fully appreciated without a minimal knowledge of this horrible conflict. Among other things, the fact that the victims of the many mass killings have been buried in a hurry results in countless nameless graves that have been eaten by the jungle; the symbolism of the red landscape then acquires additional depth, pointing out the long trail of death that the war has brought. But something else quite unusual comes to mind after reading this story, which relates to conflicts in other parts of the world (a prominent example is the conflict in the Middle East). We tend to pick a side in many of the traditional conflicts, for cultural reasons or even just family tradition; not in this one (at least for me). Being such an unknown story, and given the horrors that have been perpetrated by all sides, the only possible side to pick is the civilian population. In fact, picking sides in the “traditional” conflicts after reading this book just feels wrong.
Now it’s the turn of microbiology: during a visit to the Typas lab at EMBL, I joined some experiments measuring biofilm formation in some bacterial strains. In this assay, a red dye that binds to biofilm components is added to the agar plate: if a bacterial colony produces them, it turns red. Surprisingly, the name of the dye is Congo red, named by the German company Bayer in the heat of the colonization of Africa (perhaps it’s time to use another name in publications?). Looking at the agar plate, I couldn’t help but think about the work of Richard Mosse, and how strange these connections are.