
This collection of somewhat unfocused and prosaic text has been slowly evolving, and thus far it has been mainly about me, in a rather ego-massaging way. That is not what a larger audience interested in bioinformatics really needs. I intend to up the ante and start describing a little more about what I feel are pretty neat concepts within bioinformatics.
Bioinformatics is more than just a job, it is a way of life, and until the discipline is standardized around tools from Ingenuity, SAS, Rosetta, GeneData, Mnemosyne etc., it remains pretty much a free-for-all development fight. The bioinformatics community hasn’t really standardized on a single programming language (Perl is dominant, but Python, Java, R, C etc. all get a pretty good look-in), and each language has advantages and disadvantages. As scientists and bioinformaticians publish tools, as tools are adopted across the community, and as workflows and approaches to understanding are integrated, the needs of our bioIT systems become rather more convoluted and complex.
In a post last week I asked how you, the wider audience, might keep your R repositories updated and synchronized. On a laptop or workstation I imagine that the task is rather trivial: you update code as required in an ad hoc fashion. What happens, though, when you need to share your code base with other scientists in another country (timezone, city, whatever)? It is not always reasonable to ask a collaborator or customer to please install packages X, Y and Z, recompile application A, and only then be ready to go.
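To make the synchronization problem a little more concrete, here is a minimal Python sketch of the kind of check I have in mind: compare the package versions on a collaborator's machine against a reference list and report what differs. The function name `diff_manifests`, the package names X, Y, Z and W, and all the version numbers are made up purely for illustration. How you actually collect the two lists (from R, from a package manager, or by hand) is left open.

```python
def diff_manifests(reference, local):
    """Compare two {package: version} dictionaries.

    `reference` is the agreed package set, `local` is what one machine has.
    Both are hypothetical inputs; nothing here talks to R or to a package
    manager.
    """
    # Packages the reference expects but the local machine lacks.
    missing = sorted(set(reference) - set(local))
    # Packages installed locally that the reference does not mention.
    extra = sorted(set(local) - set(reference))
    # Packages present on both sides with different version strings,
    # stored as name -> (local version, reference version).
    mismatched = {
        name: (local[name], reference[name])
        for name in sorted(set(reference) & set(local))
        if local[name] != reference[name]
    }
    return {"missing": missing, "extra": extra, "mismatched": mismatched}


# Illustrative data only: the names and versions are invented.
reference = {"X": "1.2.0", "Y": "0.9.1", "Z": "2.0.0"}
collaborator = {"X": "1.1.0", "Z": "2.0.0", "W": "0.1.0"}

report = diff_manifests(reference, collaborator)
for category, items in report.items():
    print(category, items)
```

Even a check this small shows why ad hoc updating stops scaling: every extra machine adds another list to compare, and someone still has to act on the differences.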
When working at the Turku Centre for Biotechnology I was lucky enough to be able to recruit a SysAdmin called Aschwin. He had some pretty strong ideas as to how the lab infrastructure should be implemented and erred perhaps on the side of the “bleeding edge”. Our cluster was quickly restructured into something pretty workable that satisfied my user needs (I wrote my first technical needs brief for Aschwin prior to the industrialisation of my career) and got the job done. Aschwin used Ubuntu, FAI and a collection of scripts and new .deb packages to encapsulate the bulk of the required software. This worked well, but it always required Aschwin to adapt, modify and correct the environment. Not a problem whilst you work with your SysAdmin…
I had requested a cluster implementation such as Rocks or something a little more pre-digested, but at least the cluster was deployed as specified and worked! Rocks is something that I have used at home on a 5-node cluster during earlier EST analysis, but it suffered from its need for RPM files and the general pain in the &%$ that is involved in packaging some software. My struggle to identify the optimal customer roll-out platform continued (and still continues).
rPath came to my attention in 2006, and I saw a small company building on ideas from Red Hat (my first (1995) and still favourite Linux). The rPath company has a free rBuilder application that can be used to create a virtual Linux distribution. You cut and paste software packages from their servers, and when these are insufficient for your needs, you roll your own and upload them to the server for the benefit of the wider community. I have created packages for typical applications such as ClustalW, Phylip, NCBI Blast and InterPro. Keeping them up to date is a question of time and resources, and it seems that the package evolution is punctuated by bursts of frenetic energy twice a year.
The rPath approach is really cool. I can create packages, compile packages and roll my own distribution that can be made available for download as live CDs, installation media or images for running within the typical virtual machines. This is the real reason that I am interested. Let me quickly roll a Linux distribution with an up-to-date set of software, the latest (and validated) R packages and the required Mnemosyne integration software, and I am a happy bunny. I can run the software on my MacBook in VMware Fusion, and I can run the weird packages with bizarre dependency problems within the standard IT infrastructure used in the office.
Please have a look and get in contact. Together we can make a really useful Linux distribution with a feature set superior to that of, e.g., DNAlinux, BioBrew or BioLinux. In later parts of this story I will discuss the process of creating a distribution, packaging your software and rolling out your new distribution across a large computer cluster.