Archive for the ‘R/bioconductor’ Category

BioconductorBuntu – A Linux Distribution that Implements a Web-Based DNA Microarray Analysis Server

Wednesday, April 1st, 2009

Paul Geeleher et al., Bioinformatics Advance Access Publication March 23rd, 2009.

Fresh in the latest version of Bioinformatics Advance Access is a rather wonderful short correspondence on BioconductorBuntu. The authors of this brief article have highlighted a rather important divide within the bioinformatics community: those who can use R and those who can’t.

To solve the issue of “hot” microarray data analysis for those fearful of scripting, the authors have implemented a whole Ubuntu distribution containing the requisite packages, software and servers for rapid deployment of a data analysis server. In addition to providing R and some Bioconductor packages, the authors have also implemented a basic framework of authentication and ownership, and some core GUIs to streamline the process of uploading, analysing and reporting the content of DNA microarray studies. In contrast to earlier efforts such as AMDA (Genopolis, Italy), the authors have provided mechanisms for the handling of Affymetrix data, single and dual colour arrays.

The workflow appears to contain all core elements of data validation, QC and differential expression analysis and also provides a little content for both GSEA and KEGG type analyses.

This is, in my humble opinion, a wonderful piece of work. Certainly this is not a complete solution (what about Illumina or Agilent data in their more native file structures?) and the reporting is lacking outside of the most basic content – but it does deliver an elegant and functional system for the dirty and unwashed masses. The wrapping of the stack onto an Ubuntu “spin” is great – if, as promised, I can download an ISO, burn a disk, boot, install and rock-and-roll, then this could stand to be a really useful tool sitting in the corner of many small labs.

I have some vague suspicions though that this approach is doomed to failure. The biologists who cannot use R and Bioconductor are the same people who will be terribly afraid of booting a Linux workstation and installing something by themselves. These are the same people who will be least well prepared to diagnose problems on the server, and who will need the most training and babysitting to get them to the stage where the software can be applied in a meaningful way! This is not a detraction from the paper: BioconductorBuntu is a very elegant solution and promises to solve some of the problems, but a bioinformatician, IT guy or statistician is still needed to get the biologist up-and-running. Thank goodness – our jobs are still safe for the time being ;-)

This is certainly a well-earned-paper-of-the-week. Congratulations, Paul et al.

EMAAS: An extensible grid-based Rich Internet Application for microarray data analysis and management

Monday, March 23rd, 2009

G Barton et al., BMC Bioinformatics 2008, 9:493. doi:10.1186/1471-2105-9-493

EMAAS is another environment for the handling and analysis of gene expression data. The authors have set about the development of a distributed e-support system for the management and analysis of microarray data, to provide access to complex methods and to apply (from a biologist’s POV) non-trivial technologies to handle large multi-variate datasets.

Whilst other solutions have missed the point and taken an easy approach to solving the problem, the EMAAS approach is rather more complicated and relies instead on integration of internet accessible tools, standard statistical packages (R/Bioconductor) and web-resources (CELSIUS, GEO). The decision to aim for a modular and flexible framework is excellent and makes this, in my opinion, a very much more interesting project. The completeness with which tools and environments have been included is breathtaking; the depth of IT and analytical platforms required is rather daunting.

In contrast to the manuscript reviewed in the last post, this resource’s source is available under a suitable GPL license, and some of the demo server also works. I have some problems with the resource (Flash for a start), but this is one smooth implementation and is packaged in such a way that I could take it for a spin if I so wished!

This manuscript is heavy to read, but a damned fine resource is described underneath the technical fluff. This is a great resource and this earns a great recommendation from the bioinformaticsblog.

RGG: A general GUI Framework for R scripts

Wednesday, March 4th, 2009

Ilhami Visne et al., BMC Bioinformatics 2009, 10:74.

Here is another great idea for all of you bioinformaticians working in a scientific support capacity. We are all aware of the issues of writing powerful methods in R for a customer who then wishes to run the analysis again-and-again with a different set of parameters! R is great, Bioconductor is amazing and it facilitates our ability to understand high-dimensional data, but many biologists are incapable of getting a grip on bioinformatics and computers, or demonstrate a reluctance to accept data analysis as a core activity. Many biologists are just muppets when it comes to interactions with bioinformaticians (and in many cases the bioinformaticians are just as muppet-like and do not help the cause!)

There are many times when it would be great to have a robust method for packaging R methods in such a way that “Excel biologists” can reuse the code and adapt workflows for their own purposes. I have certainly implemented my own solution using a mixture of Java Web Start, R, Rserve and Tomcat to provide such tools, but the proposed RGG solution is certainly very much more appealing within a formal support environment.

The R GUI Generator is based on the concept of specifying experimental attributes and context within an XML file. As within the Sweave framework, structural content is mixed with functional R code yielding formal and coherent documents. The authors describe well their concept, and the application of the generalised workflow is illustrated with some stunning screenshots of the software in action.

This manuscript is awarded a “publication of the week” since it provides an excellent solution to a real problem within support bioinformatics. While the project is undoubtedly rather academic at the moment, demonstrating only a proof-of-principle, the concept is clear and the outlook is bright. Whilst the RGG is currently bound to client side life with the JGR (Java GUI for R) and their own RGGRunner software, the authors do not preclude future integration of RGG with other frameworks such as Rserve, and explicitly state the power of integrating Sweave and RGG – a pretty good outlook from my desk at least.

While the authors certainly propose the collection and establishment of an RGG repository, this requires a significant amount of input from the community. While power-R-users are the people who can contribute methods, the users are those who will undoubtedly have the most issues in understanding what should be done, how, when and why. This creates just a little cause for concern, but it is a great idea nevertheless, and is certainly something that should be added to the list of things to evaluate!

Bioinformatics tutorials and ‘R/Bioconductor’

Thursday, February 12th, 2009

A check of the server logs shows that yesterday was a good day for traffic and users – I am hoping that some of the more trivial aspects of life are appealing, whilst perhaps more serious reviews and the objective slaughter of manuscripts that miss-the-mark are more valuable (for all of us).

The logs show that many people are stumbling across the blog by searching for “Bioinformatics R tutorials”. Welcome, but my humblest apologies, I am not sure that this is something that I am willing to provide! If you could comment as to the tutorials that you would like to see that are not already packaged within a book, then please leave a comment, and perhaps we can start a panel of tutorials for new R users?

If you are looking for inspiration, assistance and to learn a whole lot more about R in bioinformatics then I would highly recommend that you attend the BioC 2009 meeting – I have been there in 2007/2008 and hope to be there this year too.

aroma.affymetrix

Wednesday, February 11th, 2009

Another trend is becoming clear within the logs of the BioinformaticsBlog. It seems that aroma.affymetrix is again becoming a rather common term that has led people to the blog! Yesterday I mentioned, in relation to the criticism of the EMMA 2 resource, that proper examples of handling datasets should be included. I think that a deep tutorial describing how to use aroma.affymetrix (beyond the examples provided within the Google group) would be in order!

I can imagine the issues that many users are facing with getting aroma off the ground. I have already established my own R package (creatively called MnemosynePackage) that performs much of the interaction with aroma and provides a reasonable abstraction layer between the lazy user and aroma.affymetrix.

As a challenge to you, the wider community – are you interested? If this post receives any comments firmly requesting a detailed tutorial on the application of aroma.affymetrix for the analysis of large datasets then I will write and document the tutorial! Is this a deal?

A general modular framework for gene set enrichment analysis

Tuesday, February 10th, 2009

Gene Set Enrichment Analysis, or GSEA, is one of those tasty methods that has been out there in the public domain for a number of years now. I guess that when most people see GSEA they immediately think of the original Gene Set Enrichment Analysis publication that was written by scientists from the Broad Institute. Whilst investigating the contents of the BioinformaticsBlogLogs earlier, it appeared that GSEA is one of the technologies that still piques a certain amount of interest. Gene Set Enrichment is one of the two most frequently searched terms (and only slightly ahead of “bioinformatics future 2009-”). Perhaps to kill two birds with one stone, I should state that GSEA and related techniques are one of the futures of bioinformatics. GSEA is already a stand-alone tool, and enrichment algorithms are widely used in informatics solutions from the likes of Ingenuity Systems etc.

It is therefore wonderful to find a well written article that compares and contrasts different enrichment methods, and proposes a framework for the further benchmarking of the available methods. As an applied bioinformatician it is all too easy to deploy a method without considering whether the statistic adopted really is best of breed.

Anyhow,

BMC Bioinformatics. 2009 Feb 3;10(1):47.

Ackermann M, Strimmer K.

PMID: 19192285

This article is well worth a read. The application of GSEA technologies within the field of expression profiling is discussed, and the authors make a clear point about the problem of multiple methods achieving the same task and the need to standardise both the methods and their evaluation. The authors perform a meta-analysis of the existing GSEA methods, analyse these methods within various simulations and evaluate the results. The overall finding is perhaps that GSEA itself may be inferior to a simpler univariate procedure, and that workflows relying on enrichment analysis may be simplified.

EMMA2 – A MAGE-compliant system for the collaborative analysis and integration of microarray data

Tuesday, February 10th, 2009

In BMC Bioinformatics there is a pre-print of the EMMA-2 manuscript, a superficially interesting sounding database developed in Computational Genomics at Bielefeld, one of the bioinformatics places in Germany. I am greatly saddened by the experience and am left feeling whelmed, as if I have pointlessly lost a few minutes of my life.

The BMC Bioinformatics manuscript is well written, easy to read, and paints a convincing reason as to why we should evaluate their database system (and even consider licensing it?). The screenshots in the publication are attractive and glossy, but an evaluation of the resource using a standard work environment is not such a great experience!

clippynot.png

The inclusion of the mutant spawn of clippy is at best an error of judgement. I have my suspicions that non-power users will not be making MIAME submissions, and should perhaps stick with GeneSpring! Power users who have used R and can programme are easily offended and this leaves me a little livid!

Trying to access a project that contains some hopefully useful data (for evaluating competence and utility) meets with something rather unexpected … a colourful error message! The name of the dataset was BMC and this was hopefully a dataset to accompany the publication; it was accessible through the demo account so something is a little FUBAR!

emmaerror.png

Instead of pushing further I have looked at the standard example datasets available in the demo account. All I can say is that this is bullsh!%; a study containing only 1-6 arrays is of very limited utility. I guess that to convince me that this solution is worthy of inclusion within the BioRAM Linux framework, I would like to see some of the good Novartis SymAtlas scale data analysed. For a beautiful illustration of capability, utility and scalability, a database should really show that a dataset such as the GSK cell line data or expO clinical cancer samples can be processed.

emmaprojects.png

If we dive into the MTGeneChipDemo_2 (which has a whole 6 arrays included),  we can browse individual arrays for affy ids, we can try to observe a heat map (and throw another exception)…

While the manuscript is OK and even prompted me to evaluate the database, the database is just awful, horrid and rotten. Sure, I may not have the appropriate dependencies to evaluate the resource properly, but a demo with dead and useless data points towards a software development that is lacking real users and vision. I have failed to see any added value in the EMMA2 pipeline. For MIAME-MAGE compliance other free databases achieve better functionality and utility (I support BASE2 here); for data analysis and visualisation even a chimpanzee may be better off using R/Bioconductor than EMMA-2.

Dondrup et al., fine manuscript but terrible database delivery experience.

R/bioconductor and methods implemented in C

Tuesday, February 3rd, 2009

This is not the easiest thing for a C novice to do, and I am discovering a lot as I go, but I feel that we have made really excellent progress over the last 24 hours and the method is looking almost feasible. We can compile code from within the R package, we can associate the library at package loading time and can call the method properly. I seem to be failing at a point that may be pretty close to the final frontier …

I am dead pleased – an academic itch has been scratched and something useful has come of it. The next challenge is to get this final problem solved and then to document the complete workflow within a tutorial. Let’s hope that we can manage some or all of this on the train tomorrow morning!

Adding numbers in R – the hard (-est possible) way

Tuesday, February 3rd, 2009

I have had a pretty good think about my NCBI taxonomy in R issue, and there is only one (creative) way to go. I have had a pretty good read around to see what exists in the way of tutorials for R / C integration. Unfortunately, at the moment the answer is not a lot, apart from some .pdfs and packages from Dirk Eddelbuettel and the R documentation on writing R extensions.

To get the ball rolling and to start a neat tutorial to inspire both myself and some of my more loyal readers, I am going to create a new function in R, through a package specifically dedicated to the task of adding two numbers. This is kind of easy in R; packaging a C method within R is a little more complicated, but it should help me learn enough to look at the problem in a new and more creative way.

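#include <stdio.h>

/* Add two integers and return the sum. */
int adder(int a, int b) {
    int c;
    c = a + b;
    return c;
}

int main(void)
{
    int a, b, r;
    a = 10;
    b = 5;
    r = adder(a, b);
    /* Print a greeting followed by the sum. */
    printf("\nHello World\n%d\n", r);
    return 0;
}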

My first C code for the project, my first C code (other than earlier code fixes of other people’s software) and hopefully a beautiful challenge ahead.  I should probably be looking forwards to the commute to the office tomorrow with added enthusiasm? I don’t think that anything here needs documenting (yet), but now I think that we need to get this code compiled within an R package, and even integrated into the package …
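As a rough sketch of where this is heading: R's `.C` interface hands every argument to compiled code as a pointer and expects a function that returns `void`, so the result has to come back through one of the arguments rather than a return value. The wrapper name `adder_for_r` below is my own invention for illustration, and this is only one possible way to expose the function, not a finished package.

```c
/* Plain C function, as in the listing above. */
int adder(int a, int b) {
    return a + b;
}

/* Hypothetical wrapper following the .C calling convention:
   every argument arrives as a pointer, the function returns void,
   and the sum is written into *result. */
void adder_for_r(int *a, int *b, int *result) {
    *result = adder(*a, *b);
}
```

Calling `adder_for_r(&a, &b, &r)` from C with `a = 10` and `b = 5` leaves 15 in `r`. From R, once the shared library has been loaded, the matching call would look something like `.C("adder_for_r", as.integer(10), as.integer(5), result = integer(1))$result`.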

Data structure limitations in R/Bioconductor – time for Kernighan and Ritchie

Tuesday, February 3rd, 2009

R/Bioconductor is perhaps the most important tool within my existing bioinformatics software arsenal. It provides quick and easy robust methods for transforming complex data. Gene expression studies with thousands of arrays and tens of thousands of probesets pose no challenge for the software.

I have a work in progress with my PhD students – I would like them to implement a generic tool for comparative genomics in R/Bioconductor and would like a package implemented for the assignment of taxonomic context to a given sequence. I spent last night fighting with NCBI Taxonomy data parsers in R. Getting the data in is no problem at all; getting the data out is no problem, but parsing a large graph in R is a real PITA. I would imagine that a normal user would be willing to wait a few minutes for a method to run in R. Using my best abilities and only native code I am in a position where I can bring a complete restructure of the data down into about 4 hours of wall time – this is very much better than my attempts over the weekend that were running at over 15 hours.

I considered the problem a little and have checked a solution using simple Java code – the run time is about 90 seconds using a first pass code without optimisation or significant refactoring … this means that native R code sucks for certain tasks! While I have been writing R packages for several years now, I am finally at the stage where I feel that I can no longer ignore the need of embedding C in my R packages.
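To make the C option a little more concrete, here is a minimal sketch of the kind of structure that could make a taxonomy walk cheap: each node stores only the id of its parent, so following a lineage up to the root is a simple loop over an array. The tree is invented toy data rather than real NCBI entries, the function name `lineage` is hypothetical, and the convention that the root is its own parent is an assumption of this sketch.

```c
/* Toy taxonomy: parent[i] is the parent id of node i.
   Node 0 is the root and, by assumption, its own parent. */
static const int parent[] = {0, 0, 0, 1, 1, 3};
#define N_NODES ((int)(sizeof parent / sizeof parent[0]))

/* Write the path from `node` up to the root into `out` (at most `max` ids).
   Returns the number of ids written, or -1 if `node` is out of range
   or the buffer is too small. */
int lineage(int node, int *out, int max) {
    int n = 0;
    if (node < 0 || node >= N_NODES) return -1;
    for (;;) {
        if (n >= max) return -1;          /* buffer exhausted */
        out[n++] = node;
        if (parent[node] == node) break;  /* reached the root */
        node = parent[node];
    }
    return n;
}
```

With this toy tree, `lineage(5, buf, 8)` fills `buf` with 5, 3, 1, 0 and returns 4. The cost of a lookup grows with the depth of the tree rather than with the total number of nodes, which is the property a nested R list makes hard to get.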

I have never written a line of C in my life … Perl, Python, Java, R, Pascal, Logo, and others yes, but not yet C. Is this a turning point? Should I just delegate the task to my students, or should I aim to deliver something special and a little more functional than the basics? I am tempted – this is surely a rite of passage for an applied bioinformatician!

Now could also be a good time to question as to how large heterogeneous data objects in R can be properly utilised – is a list of lists the most appropriate data structure for object data … Yikes – taxonomic tree and phylogenetic distributions appear to be more of a can-of-worms than was originally expected!
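One possible alternative to a list of lists, sketched here in C rather than R, is to hold the whole tree in a few flat integer arrays (parent, first child, next sibling) indexed by node id. Everything below is an illustrative assumption: the `FlatTree` type, the function names and the fixed `MAX_NODES` limit are made up for the example, and the code assumes the input really is a tree with a single root that is its own parent.

```c
#define MAX_NODES 64

/* Flat tree: three parallel int arrays instead of nested lists. */
typedef struct {
    int n;                        /* number of nodes in use */
    int parent[MAX_NODES];        /* parent id; the root is its own parent */
    int first_child[MAX_NODES];   /* first child id, or -1 if none */
    int next_sibling[MAX_NODES];  /* next sibling id, or -1 if none */
} FlatTree;

/* Build the child/sibling links from an array of parent ids.
   Returns 0 on success, -1 if the input does not fit or an id is invalid.
   Assumes (does not check) that the parent ids form a tree without cycles. */
int tree_build(FlatTree *t, const int *parent, int n) {
    int i;
    if (n < 0 || n > MAX_NODES) return -1;
    t->n = n;
    for (i = 0; i < n; i++) {
        if (parent[i] < 0 || parent[i] >= n) return -1;
        t->parent[i] = parent[i];
        t->first_child[i] = -1;
        t->next_sibling[i] = -1;
    }
    /* Walk backwards so that children end up linked in ascending id order. */
    for (i = n - 1; i >= 0; i--) {
        int p = parent[i];
        if (p == i) continue;     /* the root has no parent link to add */
        t->next_sibling[i] = t->first_child[p];
        t->first_child[p] = i;
    }
    return 0;
}

/* Count the nodes in the subtree rooted at `node` (including `node`),
   using an explicit stack rather than recursion. In a valid tree each
   node is pushed at most once, so the stack cannot overflow. */
int subtree_size(const FlatTree *t, int node) {
    int stack[MAX_NODES];
    int top = 0, count = 0, c;
    if (node < 0 || node >= t->n) return -1;
    stack[top++] = node;
    while (top > 0) {
        int cur = stack[--top];
        count++;
        for (c = t->first_child[cur]; c != -1; c = t->next_sibling[c])
            stack[top++] = c;
    }
    return count;
}
```

Building the tree from the parent ids `{0, 0, 0, 1, 1, 3}` and asking for `subtree_size(&t, 1)` gives 4 (nodes 1, 3, 4 and 5). The same three integer vectors could equally be held in R, which hints that the data layout, and not only the language, may be part of the problem.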