The Supertree Toolkit 2

Our recent paper in Biodiversity Journal describes the latest version of the Supertree Toolkit. This is a major departure from the first version, and in this post I'm going to describe the reasoning behind the differences in more detail than the paper required.

The previous version of the STK started life as a bunch of Perl scripts that Katie and I wrote to process her supertree data. After writing a few separate scripts we realised that we needed to package them up and have a set of library functions to drive them. One of the key things Katie realised back in 2003, when she started her PhD, was that meta-data (data about data) is key to automated processing. We therefore designed an XML file to sit alongside the tree file and hold this meta-data. A rudimentary GUI made creating the XML files easy, but the whole database was dependent on the directory structure and placement of files, so handing it to someone else was not straightforward.

After Katie and Matt got the funding from the Systematics Association, Katie and I sat down to redesign the STK from the ground up - learning from past mistakes. At Imperial College I worked on Fluidity, a computational fluid dynamics package, which used a system called Spud to set options. Part of Spud was a GUI called Diamond. What is special about Diamond is that it uses something called a schema (each model comes with its own schema) that contains all the options for that model, including which ones are required and how the options are nested. Katie and I realised that we could use the same structure to build entire datasets for supertree analyses.
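
To make the schema idea concrete, here is a minimal sketch in Python. It is not Spud's or the STK's actual schema format: the nested dictionary, the option names (project_name, source, tree_file and so on) and the missing_options helper are all invented for illustration. It only shows how a schema that records which options are required, and how they nest, lets a tool check a dataset automatically.

```python
# Toy schema: each option says whether it is required and may nest child options.
# All names here are illustrative assumptions, not taken from the real STK schema.
SCHEMA = {
    "project_name": {"required": True},
    "source": {
        "required": True,
        "children": {
            "bibliography": {"required": True},
            "tree_file": {"required": True},
            "notes": {"required": False},
        },
    },
}


def missing_options(schema, data, prefix=""):
    """Return the paths of required options that are absent from data."""
    missing = []
    for name, rule in schema.items():
        path = prefix + name
        if name not in data:
            # Only required options count as missing; optional ones are skipped.
            if rule.get("required", False):
                missing.append(path)
            continue
        children = rule.get("children")
        if children:
            # Recurse into nested options, extending the path as we go.
            missing.extend(missing_options(children, data[name], path + "/"))
    return missing


# A dataset that has a source with a bibliography but no tree file.
dataset = {"project_name": "birds", "source": {"bibliography": "Smith 2001"}}
print(missing_options(SCHEMA, dataset))
```

Because the rules live in the schema rather than in the code, the same checker (or the same GUI) works for any dataset type: swap the schema and the tool follows.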

We therefore forked the Diamond code (written in Python) and rewrote the whole STK library in Python. We changed the structure slightly too: all functions live in the library, and the command line interface (CLI) and GUI simply call them. This would also allow us to build other front-ends easily (e.g. a web-based version), and allows users to write their own processing pipelines in Python. The entire process from forking to publishing took about a year. In that time we had a number of undergraduates testing the code during their projects - a vital part of making software stable.
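
The library-plus-thin-front-ends structure described above can be sketched as follows. This is not the STK's real API: taxa_in_tree and main are hypothetical names, and the regular expression only handles simple Newick strings with unquoted labels. The point is the shape: the work happens in a library function, and the CLI does nothing but parse arguments and call it.

```python
import argparse
import re


def taxa_in_tree(newick):
    """Library function (hypothetical): sorted taxon names in a simple Newick string."""
    # A leaf label directly follows "(" or "," and stops at Newick punctuation.
    # Assumes unquoted labels without spaces.
    names = re.findall(r"[(,]\s*([^(),:;\s]+)", newick)
    return sorted(set(names))


def main(argv=None):
    """Thin CLI front-end: parse arguments, call the library, print the result."""
    parser = argparse.ArgumentParser(description="List the taxa in a Newick tree.")
    parser.add_argument("newick", help="tree as a Newick string")
    args = parser.parse_args(argv)
    for name in taxa_in_tree(args.newick):
        print(name)


# A GUI, a web front-end or a user's own pipeline would call taxa_in_tree() directly.
main(["((A:0.1,B:0.2),(C,D));"])
```

With this split, a new front-end needs no knowledge of how trees are processed, and a user scripting a pipeline can import the library function without touching the CLI at all.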

The new STK is now a stepping stone to integrating more features, including taxonomic knowledge, novel supertree methods and more automated processing. The more it is used, the better it will get. Our ultimate aim is to enable processing of data in an almost automated fashion (we still need a user to make sure the processing was sensible!).