Archive for the ‘tutorials’ Category

tracking bugs, managing plans and coping with feature creep

Wednesday, April 8th, 2009


I am a bioinformatician. My background is as a traditional geneticist; my PhD was in molecular biology (with a little phylogenetics and domain analysis). I only entered the domain of bioinformatics during my postdoctoral years, when I worked as a genome annotator for the first green eukaryotic genome project. During this time I learned a lot of Perl, moved into Python and integrated a load of stuff to link my needs with a relational database, distributing jobs across a large cluster to run the typical InterPro / UniProt / nonred type tasks. A GUI was never considered (or attempted) and code remained ad hoc until broken.

Ten years on from these heady days, I am writing code in C, Java, R and a little Python. During my few years as an adjunct professor at a Finnish research centre I rewrote the whole software pipeline that I had imagined in Java, as a rather monolithic beast. Over the last few years, as a distraction on the daily commute to the capital city, I have reimplemented the whole stack as something a little more abstracted and perhaps more useful. The whole software environment is now several hundred thousand lines of code, we have a rich GUI delivered over HTML and through Java WebStart, and things are finally beginning to look how they should have when I first started planning the project back in 2004.

The critical question at this point is: how is a self-taught informatician supposed to handle this code? I work alone; there is no code audit and no one works with me to validate, correct or comment on my code! I maintain a single code tree, and this is at least kept within a version control system (Subversion for bioinformatics is great …), so I am hopefully not completely inept at doing my work.

My question is: how should I really manage the long list of non-specific issues, bugs and problems that I routinely encounter? A campaign to resolve an issue that has been introduced through feature creep can take hours if not days, and during this time I undoubtedly discover many more bugs and issues…

At the moment bugs are documented within a text file of problems, issues and events. I have a todo list, and this seems rather inefficient and trivial. I know that something like Bugzilla could work, but this seems rather more complex than is absolutely needed. I also work on a train, and therefore don’t have web access for much of the commute – a client-side tool that can be synced through SVN would be ideal. I also work on Linux, OS X and Windows, so ideally something that is cross-platform would be great…

This seems like a tall order, and something that there is no simple answer for. What bug tracking software do other bioinformaticians use?

A general modular framework for gene set enrichment analysis

Tuesday, February 10th, 2009

Gene Set Enrichment Analysis, or GSEA, is one of those tasty methods that has been out there in the public domain for a number of years now. I guess that when most people see GSEA they immediately think of the original Gene Set Enrichment Analysis publication written by scientists from the Broad Institute. Whilst investigating the contents of the BioinformaticsBlogLogs earlier, I found that GSEA is one of the technologies that still piques a certain amount of interest. Gene Set Enrichment is one of the two most frequently searched terms (and only slightly ahead of “bioinformatics future 2009-”). So, perhaps to kill two birds with one stone, I should state that GSEA and related techniques are one of the futures of bioinformatics. GSEA is already a stand-alone tool, and enrichment algorithms are widely used in informatics solutions from the likes of Ingenuity Systems etc.

It is therefore wonderful to find a well-written article that compares and contrasts different enrichment methods, and proposes a framework for the further benchmarking of the available methods. As an applied bioinformatician it is all too easy to deploy a method without considering whether the statistic adopted really is best of breed.

Anyhow,

BMC Bioinformatics. 2009 Feb 3;10(1):47.

Ackermann M, Strimmer K.

PMID: 19192285

This article is well worth a read. The application of GSEA technologies within the field of expression profiling is discussed, and the authors make a clear point about the problem of multiple methods achieving the same task and the need to standardise both the methods and their evaluation. The authors perform a meta-analysis of the existing GSEA methods, analyse these methods within various simulations and evaluate the results. The overall finding is perhaps that GSEA itself may be inferior to a simpler univariate procedure, and that workflows relying on enrichment analysis may be simplified.
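To make the idea of a simple univariate procedure a little more concrete, here is a minimal sketch in C. It is only an illustration under my own assumptions, not the statistic evaluated by Ackermann and Strimmer: the function name `set_mean_difference`, the `in_set` membership flags and the choice of a difference of means are all hypothetical.

```c
#include <stddef.h>

/* Illustrative set-level score (an assumption, not the method from the
 * paper): the mean of the per-gene statistics inside the gene set minus
 * the mean of those outside it. in_set[i] is non-zero when gene i
 * belongs to the set. Returns 0.0 if either group is empty. */
double set_mean_difference(const double *stat, const int *in_set, size_t n) {
    double sum_in = 0.0, sum_out = 0.0;
    size_t n_in = 0, n_out = 0;

    for (size_t i = 0; i < n; i++) {
        if (in_set[i]) {
            sum_in += stat[i];
            n_in++;
        } else {
            sum_out += stat[i];
            n_out++;
        }
    }
    if (n_in == 0 || n_out == 0) {
        return 0.0;
    }
    return sum_in / (double)n_in - sum_out / (double)n_out;
}
```

Fed with per-gene t-statistics, a large positive value would suggest that genes in the set tend to score higher than the rest; judging significance would still need a permutation or parametric test on top of this.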

R/bioconductor and methods implemented in C

Tuesday, February 3rd, 2009


This is not the easiest thing for a C novice to do, and I am discovering a lot as I go, but I feel that we have made really excellent progress over the last 24 hours and the method is looking almost feasible. We can compile code from within the R package, we can associate the library at package loading time and can call the method properly. I seem to be failing at a point that may be pretty close to the final frontier …

I am dead pleased – an academic itch has been scratched and something useful has come of it. The next challenge is to get this final problem solved and then to document the complete workflow within a tutorial. Let’s hope that we can manage some or all of this on the train tomorrow morning!

Adding numbers in R – the hard (-est possible) way

Tuesday, February 3rd, 2009


I have had a pretty good think about my NCBI taxonomy in R issue, and there is only one (creative) way to go. I have had a pretty good read around to see what exists in the way of tutorials for R / C integration. Unfortunately, at the moment the answer is not a lot, apart from some .pdfs and packages from Dirk Eddelbuettel and the R documentation on writing R extensions.

To get the ball rolling and to start a neat tutorial to inspire both myself and some of my more loyal readers, I am going to create a new function in R, through a package specifically dedicated to the task of adding two numbers. This is kind of easy in R itself; packaging a C method within R is a little more complicated, but it should help me learn enough to look at the problem in a new and more creative way.

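#include <stdio.h>

/* Return the sum of two integers. */
int adder(int a, int b) {
    int c;
    c = a + b;
    return c;
}

/* Small test driver: adds 10 and 5 and prints the result. */
int main(void)
{
    int a, b, r;
    a = 10;
    b = 5;
    r = adder(a, b);
    printf("\nHello World\n%d\n", r);
    return 0;
}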

My first C code for the project, my first C code full stop (other than earlier fixes of other people’s software) and hopefully a beautiful challenge ahead. I should probably be looking forward to the commute to the office tomorrow with added enthusiasm? I don’t think that anything here needs documenting (yet), but now I think that we need to get this code compiled within an R package, and even integrated into the package …
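As a rough sketch of where this is heading (an assumption, not necessarily how the finished package will look): R's `.C` interface passes every argument as a pointer and ignores the C return value, so a function like `adder` needs a thin wrapper before R can call it. The wrapper name `adder_dotC` is hypothetical.

```c
/* Plain C adder, as in the stand-alone program. */
static int adder(int a, int b) {
    return a + b;
}

/* Hypothetical wrapper (the name is illustrative only).
 * R's .C interface hands over pointers to copies of the R vectors and
 * expects a void function, so the sum is written back through the
 * third pointer instead of being returned. */
void adder_dotC(int *a, int *b, int *result) {
    *result = adder(*a, *b);
}
```

Built into a shared library with `R CMD SHLIB` and loaded with `dyn.load()`, something like `.C("adder_dotC", as.integer(10), as.integer(5), result = integer(1))$result` should then hand the sum back to R.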

linux distributions, custom images and respinning

Tuesday, January 27th, 2009


I have been discussing rPath over the last few days and have put a not inconsiderable amount of effort into getting a bioinformatics linux distribution off the ground. rPath is perhaps not the easiest way to go; Fedora offers its own way using the Revisor tool. I’ve been a fan of RedHat since my first excursion into linux 15 years ago, and have a Fedora workstation at home, but getting software packaged into .rpm is something that still makes me more than just a little crazy (Staden .rpm production caused much hair loss – I am certain of it!)

It is cool to see that on Slashdot this morning there is a link to a new SUSE project going into alpha for exactly this purpose as well. SUSE Studio looks like an interesting tool and something that I should invest some time in exploring.

Bioinformatics is great; I am paid (a modest fee) to do something I enjoy and I would love to see more people responsibly using a wider range of tools. It is sad that so many so-called “bioinformaticians” don’t know how to compile C code, have little understanding of the tools available and do not religiously explore the new publications in Bioinformatics, Nucleic Acids Research etc. for new applications that will push the boundaries of what they can do. I vehemently oppose the concept that meaningful bioinformatics can be achieved using a Windows workstation alone!

I guess I should ask you bioinformaticians out there – how would you roll a meaningful bioinformatics distribution for use on either a workstation or for deployment across a cluster of tens or hundreds of servers?

What should the focus of the blog be for the next four weeks?

Friday, January 23rd, 2009


The bioinformaticsblog is being read, and looking at the logs shows me what is popular and what is not. Loads of hits are coming from Google and it is pretty interesting what you all seem to be hunting for! At the moment the main search themes surround my comments on NetOffice (I still love it), iPhone (still thinking about feasibility) and public datasets (especially the GSK cell line data).

I have put a huge effort into the bioram-linux plan and have a pretty neat system running inside a virtual machine. It has the latest R with a load of Bioconductor and CRAN packages. It has software such as trace2dbest, phred and cross_match (not publicly available from rPath.org yet) for an EST project running through the lab at the moment, and I am working on certain “biomarker” packages that will demonstrate my control of destiny rather than the other way round! This is where I wish to concentrate for the time being.

My blog focus until the end of Feb 2009 will be as follows

  • BioRAM-linux, rPath, rBuilder and conary; towards a functional customized bioinformatics cluster OS
  • bioinformatics and iPhone – towards XML integration of iPhone and Mnemosyne LabManager server
  • tutorials in comparative genomics using ‘R’ and ‘bioconductor’
  • An XML framework for Agilent array analysis using ‘R/bioconductor’
  • Papers-of-the-week

This is my plan, but I really need to have your input here (mail me, or preferably leave a comment)

tutorial – building a bioinformatics centric, distributed, cluster centric linux (bioram part I)

Tuesday, January 20th, 2009


Yesterday I mentioned rPath linux. I am again in love with this application. rPath linux and the free rBuilder application stack are something that the bioinformatics community should consider jumping at. Regardless of whether you are a grad student interested in high-performance bioinformatics computing, a postdoc with a big problem or a technologist in a core facility, there is something that you could do with rPath.

I have decided to put together a tutorial series on rPath and my own (very-much-in-progress) version of bioinformatics-oriented linux. In this tutorial we are going to start out with a linux running inside a virtual machine. This will be configured with up-to-date and interesting packages that are of relevance to tasks within genomics, transcriptomics and metabolomics. The virtual machine will be iteratively improved and we will prepare our own version of “bioram-linux”. This will be made available as a standalone linux distribution and will be integrated in an environment to deploy across a computer cluster dedicated to bioinformatics. We will use some of my ideas, some of Aschwin’s ideas and hopefully you, the wider community, will start to comment and update us with your feelings, opinions and ideas.

The final version will be made available at rPath rBuilder. The packages and modules can be copied, modified and altered to suit your needs within your own projects – this is just a push to hopefully increase the amount of content and knowledge within the bioinformatics community in relation to rPath.

As a first note (self-important mumbling to self even), please acknowledge that I am no expert in linux administration (I am a comfortable super user) and will undoubtedly introduce faux-pas within the pipeline and workflow. Please let me know when I am going wrong and hopefully together we can do something rather clever? The project is called “bioram-linux”. BioRAM was the name of a company that I planned to start during 2002 / 2003; the domain was registered and ideas and business plans collected. This idea was on the back-burner and seemed like a good idea when I registered for rPath a couple of years ago. Please accept the name, I like it :-)

In today’s tutorial we are going to head over to rPath, find the bioram-linux / Mnemosyne BioSciences LabManager development image and download it. We will then install the appropriate linux within a VMware-based virtual machine running on our host computer. For the time being all images are x86, and will stay that way until much later tutorials when we will start to cross-compile packages for the x86_64 platforms.

Let’s get cracking

  • Head over to the rBuilder pages at rPath. In the search box at the top of the page, click on “product”, and search for the term “bioram”. Press “search” and the Mnemosyne BioSciences product for bioram-linux should be returned as one of the top matches, click on this link. The pages for bioram can also be found directly by using the URL http://www.rpath.org/project/bioram-linux/

rbuilder_1.png

  • On the bioram pages, use the right-hand panel to select the option to “View Releases”. A page similar to the one shown below should appear. This provides a link to the BioRAM linux download page. If, of course, you are looking at this page more than a few days after the first posting of this blog article then things may have changed! Select the link for the “Mnemosyne Biosciences – BioRAM linux for bioinformatics”, and proceed to download the VMware image. Please note both the size of the file (it is not a very small file ..) and ideally the md5sum checksum. You can then validate that what you download is the same as what was present on the server.

rbuilder_2.png

  • When the file has downloaded, ensure that the file checksums match and extract the .tar.gz archive. If you don’t already have the VMware player, VMware Fusion or some other VMware environment you should arrange this now. I can recommend the VMware player for linux; it is great. I use Fusion on my MacBook Pro and this is even better. Load the downloaded VMware image into your VM software of choice and start the VM up for the first time …

rbuilder_3.png

  • Provide the path to the downloaded BioRAM installation image

rbuilder_4.png

  • View and review system settings and make any changes that you feel would benefit your system!

rbuilder_5.png

  • Everything should now be set! Click the various boxes and boot your newly defined Virtual Machine for the first time. Good luck! There will be a minute or so of linux loading and the standard process messages will be written to the screen – there should be no need for panic!

rbuilder_6.png

  • Once the BioRAM linux has loaded, it will complain about an improperly configured X. Let the system know that you are in control and make the error message go away by clicking through the “YES, I know what I am doing and have used linux before” boxes. X will start and you will be prompted to configure X to your specifications. I would recommend something along the lines of what is shown below. Make sure that the VMware graphics adapter is configured and everything should be fine.

rbuilder_8.png

 

rbuilder_7.png

  • You are now able to log on with the “root” account – there is no password, so the very first things that you should do are to ensure that at least networking works and that you have a secure root password implemented!

rbuilder_9.png

rbuilder_10.png

You have done really well getting to the end of this first BioRAM linux tutorial. I am aware that I am not the best tutorial writer, but the only way to learn is by doing it, and by receiving comments and criticisms from the readers. I hope that you have gotten this far; in the coming few weeks we are going to do a whole lot more with this.

In the next session (Thursday) we will configure our new virtual machine for package building, maintenance and compiling. We will define a new version of the “R” package and will create version 2.8.1 of R for our virtual machine. This will be installed and we will have a little think about how we can get “Bioconductor” installed and in use.