Archive for the ‘data aggregation’ Category

EMAAS: An extensible grid-based Rich Internet Application for microarray data analysis and management

Monday, March 23rd, 2009

G Barton et al., BMC Bioinformatics 2008, 9:493 doi:10.1186/1471-2105-9-493

EMAAS is another environment for the handling and analysis of gene expression data. The authors have set about developing a distributed e-support system for the management and analysis of microarray data, one that provides access to complex methods and applies (from a biologist’s POV) non-trivial technologies to handle large multivariate datasets.

Whilst other solutions have missed the point and taken an easy approach to solving the problem, the EMAAS approach is rather more complicated and relies instead on the integration of internet-accessible tools, standard statistical packages (R/Bioconductor) and web resources (CELSIUS, GEO). The decision to aim for a modular and flexible framework is excellent and, in my opinion, makes this a very much more interesting project. The completeness with which tools and environments have been included is breathtaking; the depth of IT and analytical platforms required is rather daunting.

In contrast to the manuscript reviewed in the last post, this resource’s source is available under a suitable GPL license, and some of the demo server also works. I have some problems with the resource (Flash for a start), but this is one smooth implementation and is packaged in such a way that I could take it for a spin if I so wished!

This manuscript is heavy to read, but a damned fine resource is described underneath the technical fluff. This is a great resource and this earns a great recommendation from the bioinformaticsblog.

SiPaGene: A new repository for instant online retrieval, sharing and meta-analyses of GeneChip® expression data

Monday, March 23rd, 2009

Adriane Menßen et al., BMC Genomics 2009, 10:98 doi:10.1186/1471-2164-10-98

This manuscript describes a new database, data warehouse and analytical platform for the handling of Affymetrix-based gene expression data. The authors identify the need for a database that is convenient, facilitates online analysis and provides user-specific sharing options, and further qualify their understanding of an unmet database need with the statement that “… existing tools do not use the whole range of statistical power provided by the MAS5.0/GCOS algorithms”.

I agree with the authors that there is a gap within the database arena for a MIAME-compliant database that provides both data warehousing and data analysis capabilities, and the addition of user-specific access rights is great. The MAS5 and GCOS methods undoubtedly have their place, but is relying on them alone perhaps naive?

The authors fill a number of quite heavy pages with their description of a refreshingly heavyweight database infrastructure (Java, ancient Oracle) that is currently biased towards their local research environment’s interest in immunology, inflammation, regeneration and cancer. Such a lengthily described database is then populated with only 1000 arrays.

This manuscript is of interest, the approach is nice; a combined warehouse and analysis environment. I have some problems with the database though. “Non-academic commercial use is restricted” is a waste; I would never consider paying for this resource when fantastic solutions from SAS JMP Genomics / GeneData / … with full support, testing and scalability are available with a lower TCO. To see what has been done, how well it performs and to play with a resource is nice.

I suspect that this is another fail – the online demo will not even work.

So, nice try, but no cigar. The manuscript is nice, convincingly written and more professional than some solutions out there. The web presentation looks fugly, and is also broken. The politics of code availability is plainly stupid – those who can pay will not because the implementation is not sufficiently good – Charite, please make the code a little more available!

A general modular framework for gene set enrichment analysis

Tuesday, February 10th, 2009

Gene Set Enrichment Analysis or GSEA is one of those tasty methods that has been out there in the public domain for a number of years now. I guess that when most people see GSEA they immediately think of the original Gene Set Enrichment Analysis publication that was written by scientists from the Broad Institute. Whilst investigating the contents of the BioinformaticsBlogLogs earlier, I found that GSEA is one of the technologies that still piques a certain amount of interest. Gene Set Enrichment is one of the two most frequently searched terms (and only slightly ahead of “bioinformatics future 2009-”). So, perhaps to kill two birds with one stone, I should state that GSEA and related techniques are one of the futures of bioinformatics. GSEA is already a stand-alone tool, and enrichment algorithms are widely used in informatics solutions from the likes of Ingenuity Systems etc.

It is therefore wonderful to find a well written article that compares and contrasts different enrichment methods, and proposes a framework for the further benchmarking of the available methods. As an applied bioinformatician it is all too easy to deploy a method without considering whether the statistic adopted really is best of breed.

Anyhow,

BMC Bioinformatics. 2009 Feb 3;10(1):47.

Ackermann M, Strimmer K.

PMID: 19192285

This article is well worth a read. The application of GSEA technologies within the field of expression profiling is discussed, and the authors make a clear point about the proliferation of methods that achieve the same task and the need to standardise both the methods and their evaluation. The authors perform a meta-analysis of the existing GSEA methods, analyse these methods within various simulations and evaluate the results. The overall finding is perhaps that GSEA itself may be inferior to a simpler univariate procedure, and that workflows relying on enrichment analysis may be simplified.
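To make the idea of a simple univariate enrichment procedure concrete, here is a minimal Python sketch. It is not the procedure from the paper: the choice of Welch's t as the gene-level statistic, the mean absolute value as the set-level statistic, the random-gene-set null and all names (`welch_t`, `gene_set_p_value`, the toy `expression` data) are assumptions made purely for illustration.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t-statistic for two groups of expression values."""
    var_a = statistics.variance(a) / len(a)
    var_b = statistics.variance(b) / len(b)
    return (statistics.mean(a) - statistics.mean(b)) / (var_a + var_b) ** 0.5


def gene_set_p_value(gene_stats, gene_set, n_perm=2000, seed=0):
    """Permutation p-value for the mean absolute statistic of a gene set.

    The null is built by drawing random gene sets of the same size;
    this is one possible choice of null, assumed here for simplicity.
    """
    rng = random.Random(seed)
    genes = list(gene_stats)
    members = [g for g in gene_set if g in gene_stats]
    observed = statistics.mean(abs(gene_stats[g]) for g in members)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(genes, len(members))
        if statistics.mean(abs(gene_stats[g]) for g in sample) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): 20 genes, 4 arrays per group; g0..g3 are shifted.
expression = {}
for i in range(20):
    if i < 4:
        group_a, group_b = [5.0, 5.1, 4.9, 5.2], [1.0, 1.1, 0.9, 1.2]
    else:
        group_a, group_b = [1.0, 1.2, 0.9, 1.1], [1.1, 0.9, 1.2, 1.0 + 0.02 * i]
    expression[f"g{i}"] = (group_a, group_b)

gene_stats = {g: welch_t(a, b) for g, (a, b) in expression.items()}
p_shifted = gene_set_p_value(gene_stats, ["g0", "g1", "g2", "g3"])
p_control = gene_set_p_value(gene_stats, ["g10", "g11", "g12", "g13"])
print(p_shifted, p_control)
```

The point of the sketch is only that the whole workflow fits in two small functions: a per-gene statistic, a summary over the set, and a resampled null. Any of the three pieces could be swapped for another choice when benchmarking.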

EMMA2 – A MAGE-compliant system for the collaborative analysis and integration of microarray data

Tuesday, February 10th, 2009

In BMC Bioinformatics there is a pre-print of the EMMA-2 manuscript, a superficially interesting sounding database developed in Computational Genomics at Bielefeld, one of the bioinformatics places in Germany. I am greatly saddened by the experience and am left feeling whelmed, as if I have pointlessly lost a few minutes of my life.

The BMC Bioinformatics manuscript is well written, easy to read, and makes a convincing case as to why we should evaluate their database system (and even consider licensing it?). The screenshots in the publication are attractive and glossy, but an evaluation of the resource using a standard work environment is not such a great experience!

The inclusion of the mutant spawn of clippy is at best an error of judgement. I have my suspicions that non-power users will not be making MIAME submissions, and should perhaps stick with GeneSpring! Power users who have used R and can programme are easily offended and this leaves me a little livid!

Trying to access a project that contains some hopefully useful data (for evaluating competence and utility) meets with something rather unexpected … a colourful error message! The name of the dataset was BMC, and this was hopefully a dataset to accompany the publication; it was accessible through the demo account, so something is a little FUBAR!

Instead of pushing further I have looked at the standard example datasets available in the demo account. All I can say is that this is bullsh!%; a study containing only 1-6 arrays is of very limited utility. I guess that to convince me that this solution is worthy of inclusion within the BioRAM linux framework, I would like to see some of the good Novartis SymAtlas scale data analysed. For a beautiful illustration of capability, utility and scalability, a database should really show that a dataset such as the GSK cell line data or expO clinical cancer samples can be processed.

If we dive into the MTGeneChipDemo_2 (which has a whole 6 arrays included), we can browse individual arrays for affy ids, we can try to observe a heat map (and throw another exception)…

While the manuscript is OK and even prompted me to evaluate the database, the database is just awful, horrid and rotten. Sure, I may not have the appropriate dependencies to evaluate the resource properly, but a demo with dead and useless data points towards a software development that is lacking real users and vision. I have failed to see any added value in the EMMA2 pipeline. For MIAME-MAGE compliance other free databases achieve better functionality and utility (I support BASE2 here); for data analysis and visualisation even a chimpanzee may be better off using R/Bioconductor than EMMA-2.

Dondrup et al., fine manuscript but terrible database delivery experience.

Data structure limitations in R/Bioconductor – time for Kernighan and Ritchie

Tuesday, February 3rd, 2009

R/Bioconductor is perhaps the most important tool within my existing bioinformatics software arsenal. It provides quick and easy robust methods for transforming complex data. Gene expression studies with thousands of arrays and tens of thousands of probesets pose no challenge for the software.

I have a work in progress with my PhD students – I would like them to implement a generic tool for comparative genomics in R/Bioconductor and would like a package implemented for the assignment of taxonomic context to a given sequence. I spent last night fighting with NCBI Taxonomy data parsers in R. Getting the data in is no problem at all; getting the data out is no problem, but parsing a large graph in R is a real PITA. I would imagine that a normal user would be willing to wait a few minutes for a method to run in R. Using my best abilities and only native code I am in a position where I can bring a complete restructure of the data down into about 4 hours of wall time – this is very much better than my attempts over the weekend that were running at over 15 hours.

I considered the problem a little and have checked a solution using simple Java code – the run time is about 90 seconds using first-pass code without optimisation or significant refactoring … this means that native R code sucks for certain tasks! While I have been writing R packages for several years now, I am finally at the stage where I feel that I can no longer ignore the need to embed C in my R packages.
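For readers wondering why this kind of task is quick outside native R, here is a minimal Python sketch of the underlying idea: hold the taxonomy as a hash map of parent pointers and walk it. This is not the Java code mentioned above (which is not shown in the post). It assumes the pipe-delimited layout of the NCBI `nodes.dmp` file (tax id, parent id, rank, ...), and the four rows are a heavily abbreviated toy chain that skips most intermediate ranks; `load_parents` and `lineage` are hypothetical names.

```python
import io


def load_parents(handle):
    """Read nodes.dmp-style rows into {tax_id: (parent_id, rank)}."""
    parents = {}
    for line in handle:
        fields = [f.strip() for f in line.rstrip("\n").split("|")]
        if len(fields) < 3 or not fields[0]:
            continue  # skip blank or malformed rows
        parents[int(fields[0])] = (int(fields[1]), fields[2])
    return parents


def lineage(tax_id, parents):
    """Walk parent pointers from tax_id up to the root (its own parent)."""
    path = []
    while True:
        parent, rank = parents[tax_id]
        path.append((tax_id, rank))
        if parent == tax_id:
            return path
        tax_id = parent


# Toy rows only: a shortened chain, not real NCBI content.
toy_dump = io.StringIO(
    "1\t|\t1\t|\tno rank\t|\n"
    "2759\t|\t1\t|\tsuperkingdom\t|\n"
    "9605\t|\t2759\t|\tgenus\t|\n"
    "9606\t|\t9605\t|\tspecies\t|\n"
)
parents = load_parents(toy_dump)
human_lineage = lineage(9606, parents)
print(human_lineage)
```

Each lineage lookup costs one dictionary access per level of the tree, so the work grows with tree depth rather than with the total number of nodes.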

I have never written a line of C in my life … Perl, Python, Java, R, Pascal, Logo, and others yes, but not yet C. Is this a turning point? Should I just delegate the task to my students, or should I aim to deliver something special and a little more functional than the basics? I am tempted – this is surely a rite of passage for an applied bioinformatician!

Now could also be a good time to question as to how large heterogeneous data objects in R can be properly utilised – is a list of lists the most appropriate data structure for object data … Yikes – taxonomic tree and phylogenetic distributions appear to be more of a can-of-worms than was originally expected!

Target discovery from data mining approaches

Monday, February 2nd, 2009

Yongliang Yang et al., Drug Discovery Today 2009, Vol 14, p147-154.

Target discovery is a key area within drug development: you can’t develop a drug without a target (anymore), and I am sure that many bioinformaticians working within pharma and biotech spend a not inconsiderable amount of their time compiling portfolios of information relating to a development drug’s target protein.

It is therefore great to see a review article in Drug Discovery Today that highlights, describes and outlines the informatics workflows that may be used to discover a meaningful target. This is not a review article for a seasoned corporate bioinformatician, but is rather a good illustration of much of bioinformatics on the commercial side of the academic/commercial divide. It is rather obvious, however, that the authors are academic, and that the focus of the article is more towards the academic characterisation of a target than the approach that the more enterprise-oriented bioinformatician would take! The article is not weakened by this, since the authors place considerable stress on the fact that there are huge volumes of data out there, and that there is considerable benefit to be reaped by making sense of, and integrating, these data.

The article is well written and provides a meaningful review of integrative data mining. For bioinformaticians considering a career in corporate bioinformatics, this provides a robust view of what we spend much of our time doing, although the tools summarised may not be the optimal tools within a corporate setting.

This is a good manuscript, not a great one. It highlights the issues within target characterisation and understanding and is a worthy read for all.

Gene expression, experimental metadata and trawling the public domain

Monday, January 5th, 2009

There are a variety of resources out there, in the wild, that contain potentially valuable scientific observations. The Gene Expression Omnibus (GEO) developed by the NCBI is perhaps the largest resource.

There are some pretty good third-party APIs for extracting the data from GEO – I am routinely using the GEOmetadb (Bioinformatics 2008 24(23):2798-2800) and GEOquery packages. The XML prepared by the NCBI is also great for extracting the core content for specific studies or platforms (although NCBI have created rather non-standard XML through the inclusion of characters that are not valid UTF-8).
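One way to cope with stray non-UTF-8 bytes in such XML is to decode leniently before parsing. The Python sketch below only illustrates that idea: `parse_tolerant` is a hypothetical helper, the `Study` and `Title` element names are invented for the example and are not the real GEO schema, and it assumes the document declares UTF-8 (or nothing) as its encoding.

```python
import xml.etree.ElementTree as ET


def parse_tolerant(raw: bytes) -> ET.Element:
    """Parse XML bytes, replacing byte sequences that are not valid UTF-8."""
    # Invalid bytes become U+FFFD so the parser no longer rejects the input.
    text = raw.decode("utf-8", errors="replace")
    return ET.fromstring(text.encode("utf-8"))


# Hypothetical record: 0xE9 followed by a space is not valid UTF-8.
raw = b"<Study><Title>caf\xe9 samples</Title></Study>"
root = parse_tolerant(raw)
title = root.findtext("Title")
print(repr(title))
```

The trade-off is that the offending characters are lost rather than recovered. If the real encoding is known (Latin-1, say), decoding with that codec instead would preserve them.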

The Celsius project is something that I explored when it was first published in GenomeBiology. While the performance is rather poor, the rich annotation of experimental data looks rather appealing. Celsius is implemented as an R package and it easily integrates with the Mnemosyne LabManager R workflows. There is pretty good annotation for many arrays within the framework. This is a project that is of significant benefit, though from the server performance it seems that many other people are benefiting from the hard work of the authors! The server is certainly faster in the early morning in Finland.

As an illustration of what can be achieved using the R Celsius API, I’ll document the process for extracting the annotations for the GSE2109 expO dataset of human clinical cancer samples. This is powerful stuff and could perhaps be used to build a more powerful alternative to some of the existing curated resources such as Oncomine or GeneSapiens.