Archive for the ‘paper-of-the-week’ Category

BioconductorBuntu – A Linux Distribution that Implements a Web-Based DNA Microarray Analysis Server

Wednesday, April 1st, 2009

bioconductorbuntu.png

Paul Geeleher et al., Bioinformatics Advance Access Publication March 23rd, 2009.

Fresh in the latest version of Bioinformatics Advance Access is a rather wonderful short correspondence on BioconductorBuntu. The authors of this brief article have highlighted a rather important divide within the bioinformatics community: those who can use R and those who can’t.

To solve the issue of “hot” microarray data analysis for those fearful of scripting, the authors have implemented a whole Ubuntu distribution containing the requisite packages, software and servers for rapid deployment of a data analysis server. In addition to providing R and some Bioconductor packages, the authors have also implemented a basic framework of authentication and ownership, and some core GUIs to streamline the process of uploading, analysing and reporting the content of DNA microarray studies. In contrast to earlier efforts such as AMDA (Genopolis, Italy), the authors have provided mechanisms for the handling of Affymetrix data, single and dual colour arrays.

The workflow appears to contain all core elements of data validation, QC and differential expression analysis and also provides a little content for both GSEA and KEGG type analyses.

This is, in my humble opinion, a wonderful piece of work. Certainly this is not a complete solution (what about Illumina or Agilent data in their more native file structures?) and the reporting is lacking outside of the most basic content – but it does deliver an elegant and functional system for the dirty and unwashed masses. The wrapping of the stack onto an Ubuntu “spin” is great – if, as promised, I can download an iso, burn a disk, boot, install and rock-and-roll, then this could be a really useful tool sitting in the corner of many small labs.

I have some vague suspicions though that this approach is doomed to failure. The biologists who cannot use R and Bioconductor are the same people who will be terribly afraid of booting a Linux workstation and installing something by themselves. These are the same people who will be least well prepared to diagnose the problems on the server, and who will need the most training and babysitting to get them to the stage where the software can be applied in a meaningful way! This is not a detraction from the paper: BioconductorBuntu is a very elegant solution and promises to solve some of the problems, but a bioinformatician, IT guy or statistician is still needed to get the biologist up-and-running. Thank goodness – our jobs are still safe for the time being ;-)

This is certainly a well-earned paper-of-the-week. Congratulations, Paul et al.

EMAAS: An extensible grid-based Rich Internet Application for microarray data analysis and management

Monday, March 23rd, 2009

G Barton et al., BMC Bioinformatics 2008, 9:493 doi:10.1186/1471-2105-9-493

emaas.jpg

EMAAS is another environment for handling and analysis of gene expression data. The authors have set about the development of a distributed e-support system for the management and analysis of microarray data; to provide access to complex methods and to apply (from a biologist’s POV) non-trivial technologies to handle large multi-variate datasets.

Whilst other solutions have missed the point and taken an easy approach to solving the problem, the EMAAS approach is rather more complicated and relies instead on integration of internet accessible tools, standard statistical packages (R/Bioconductor) and web-resources (CELSIUS, GEO). The decision to aim for a modular and flexible framework is excellent and makes this, in my opinion, a very much more interesting project. The completeness with which tools and environments have been included is breathtaking; the depth of IT and analytical platforms required is rather daunting.

In contrast to the manuscript reviewed in the last post, this resource’s source is available under a suitable GPL license, and some of the demo server also works. I have some problems with the resource (Flash for a start), but this is one smooth implementation and is packaged in such a way that I could take it for a spin if I so wished!

This manuscript is heavy to read, but a damned fine resource is described underneath the technical fluff. This is a great resource and this earns a great recommendation from the bioinformaticsblog.

SiPaGene: A new repository for instant online retrieval, sharing and meta-analyses of GeneChip® expression data

Monday, March 23rd, 2009

Adriane Menßen et al., BMC Genomics 2009, 10:98 doi:10.1186/1471-2164-10-98

sipagene.png

This manuscript describes a new database, data warehouse and analytical platform for the handling of Affymetrix based gene expression data. The authors identify the need for a database that is convenient, facilitates online analysis and provides user-specific sharing options, and further qualify their understanding of an unmet database need with the statement that “… existing tools do not use the whole range of statistical power provided by the MAS5.0/GCOS algorithms”.

I agree with the authors that there is such a gap within the database arena for a MIAME compliant database that provides both data warehousing and data analytical capabilities, and the addition of user-specific access rights is great. The MAS5 and GCOS methods undoubtedly have their place, but their usage alone is perhaps naive?

The authors fill a number of quite heavy pages with their description of a refreshingly heavyweight database infrastructure (Java, ancient Oracle) that is currently biased towards their local research environment’s interest in immunology, inflammation, regeneration and cancer. Such a lengthily described database is then populated with only 1000 arrays.

This manuscript is of interest, the approach is nice; a combined warehouse and analysis environment. I have some problems with the database though. “Non-academic commercial use is restricted” is a waste; I would never consider paying for this resource when fantastic solutions from SAS JMP Genomics / GeneData / … with full support, testing and scalability are available with a lower TCO. To see what has been done, how well it performs and to play with a resource is nice.

I suspect that this is another fail – the online demo will not even work.

sipagene_miss.png

So, nice try, but no cigar. The manuscript is nice, convincingly written and more professional than some solutions out there. The web presentation looks fugly, and is also broken. The politics of code availability is plainly stupid – those who can pay will not because the implementation is not sufficiently good – Charité, please make the code a little more available!

Does the Biomarker Search Paradigm Need Re-Booting?

Wednesday, March 4th, 2009

plantmarkers.png (a nice logo from a deprecated database, and absolutely nothing to do with this review …)

Robert Hurst,

BMC Urology 2009, 9:1

Published in BMC Urology is a wonderful, well written and thought-provoking commentary on the development of biomarkers. The author describes the state-of-the-nation in biomarker development for the characterisation and classification of bladder cancer, and argues that enough-is-enough and now is the time for the biomarker development field to wake up and start developing useful biomarkers. While the article has absolutely nothing to do with bioinformatics (apart from a little reference towards algorithms in the final sentences), I know that many bioinformaticians are working in the biomarker development field.

Bladder cancer is currently monitored most effectively using cystoscopy – an invasive method, but one which is suggested to have a 95% sensitivity. One issue with bladder cancer is its insidious recurrence: patients treated for what are often superficial cancers go on to develop aggressive invasive disease, and this kills 50% of patients… The need for a biomarker is clear: with >95% sensitivity from a non-invasively sampled biosample, patients would likely be more compliant with post-treatment follow-ups. The issue is reiterated several times that sensitivity of prognostic markers of disease progression is key.

Stick-based protein assays have been developed for analysis of urine samples, but suffer from <70% sensitivity – the author describes “betting lives on a test with worse sensitivity than the gold standard“, and further questions the value of the tests based on the fatal consequence of false-negatives and the cost of follow-up on the false-positives.
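To put those sensitivities into numbers, here is a minimal sketch. The cohort of 1,000 patients with a genuine recurrence is my own illustrative assumption, as is pinning the stick assay at exactly 70% (the article only says it is below that); the expected number of missed recurrences is simply the cohort size multiplied by one minus the sensitivity.

```python
def missed_cases(n_recurrences: int, sensitivity: float) -> float:
    """Expected number of false negatives among patients who truly have disease."""
    if not 0.0 <= sensitivity <= 1.0:
        raise ValueError("sensitivity must lie between 0 and 1")
    return n_recurrences * (1.0 - sensitivity)


# Hypothetical cohort size, chosen only to make the arithmetic easy to read.
cohort = 1000

# 0.95 is the cystoscopy figure quoted above; 0.70 is an assumed upper bound
# for the stick-based assays ("<70%" in the article).
for name, sensitivity in [("cystoscopy", 0.95), ("stick assay", 0.70)]:
    print(f"{name}: about {missed_cases(cohort, sensitivity):.0f} recurrences missed")
```

Under these assumptions the stick assay misses roughly six times as many recurrences as cystoscopy, which is the arithmetic behind the “betting lives” quote.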

I am really happy to read the author’s dissection of microarray and proteomic-based biomarker discovery. The author acknowledges how naive it is to expect magically robust, sensitive and specific biomarkers to emerge from the results, and describes the unpredictable nature of the homeostatic ripples that move outwards from a perturbation within an interconnected network of cooperating proteins. The promise of the single biomarker is therefore dismissed with the statement that “the probability of finding a single biomarker with the requisite sensitivity and specificity is vanishingly small”. Does this mean that we can pack our bags and go home?

Fortunately not! Hurst instead argues that the combination of biomarkers from existing studies into practical panels is the way ahead, rather than yet more studies searching for the elusive individual biomarker. With the acknowledgement that all cancers are largely unique, and that thousands of samples would be required to obtain robust results, the emphasis should be placed on the selection of biomarker panels from small numbers of assays that are largely independent but reflective of the overall phenotype, and the historical approach of modelling causality within the system should be abandoned; this is the re-boot of the title! Most encouragingly the author also states that “the search for candidate biomarkers needs to be divorced from the validation in clinical populations” and advocates the development of biomarker panels in surrogate model systems with cancer patient specimens as a validative tool rather than a discovery tool.
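Why might a panel of individually mediocre but largely independent markers beat the hunt for one perfect marker? A rough sketch, and very much my own illustration rather than anything taken from the article: assume the markers are conditionally independent given disease status, and call the panel positive if any single marker is positive. The three sensitivity/specificity pairs below are invented for the example.

```python
from math import prod


def panel_any_positive(markers):
    """Combine markers with an 'any marker positive' rule.

    `markers` is a list of (sensitivity, specificity) pairs. The formulas
    only hold if the markers are independent given disease status, which
    is an assumption, not something a real panel is guaranteed to satisfy.
    """
    # The panel misses a case only if every marker misses it.
    sensitivity = 1.0 - prod(1.0 - sens for sens, _ in markers)
    # The panel is negative in a healthy patient only if every marker is negative.
    specificity = prod(spec for _, spec in markers)
    return sensitivity, specificity


# Hypothetical markers, each well short of the 95% sensitivity of cystoscopy.
markers = [(0.70, 0.90), (0.65, 0.92), (0.60, 0.95)]
sensitivity, specificity = panel_any_positive(markers)
print(f"panel sensitivity {sensitivity:.3f}, specificity {specificity:.3f}")
```

The sensitivity climbs past any single marker, but the specificity falls with every marker added, so the false-positive follow-up cost discussed earlier does not go away; a real panel would also need its independence checked rather than assumed.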

This stuff is common sense, obvious and clear to bioinformaticians, but not always to the scientists and clinicians closer to the patient. This is a well written article and should be distributed widely; the final sentence really summarises it well: “the intelligent development of biomarkers truly is a problem in systems biology.”

RGG: A general GUI Framework for R scripts

Wednesday, March 4th, 2009

rgg.png

Ilhami Visne et al., BMC Bioinformatics 2009, 10:74.

Here is another great idea for all of you bioinformaticians working in a scientific support capacity. We are all aware of the issues of writing powerful methods in R for a customer who then wishes to run the analysis again-and-again with a different set of parameters! R is great, Bioconductor is amazing and it facilitates our ability to understand high-dimensional data, but many biologists are incapable of getting a grip on bioinformatics and computers, or demonstrate a reluctance to accept data analysis as a core activity. Many biologists are just muppets when it comes to interactions with bioinformaticians (and in many cases the bioinformaticians are just as muppet-like and do not help the cause!)

manamana.jpg

There are many times when it would be great to have a robust method for packaging R methods in such a way that “Excel biologists” can reuse the code and adapt workflows for their own purposes. I have certainly implemented my own solution using a mixture of Java webstart, R, RServe and Tomcat to provide such tools, but the proposed RGG solution is certainly very much more appealing within a formal support environment.

The R GUI Generator is based on the concept of specifying experimental attributes and context within an XML file. As within the Sweave framework, structural content is mixed with functional R code yielding formal and coherent documents. The authors describe well their concept, and the application of the generalised workflow is illustrated with some stunning screenshots of the software in action.

This manuscript is awarded a “publication of the week” since it provides an excellent solution to a real problem within support bioinformatics. While the project is undoubtedly rather academic at the moment, in demonstrating only a proof-of-principle, the concept is clear and the outlook is bright. Whilst the RGG is currently bound to client side life with the JGR (Java GUI for R) and their own RGGRunner software, the authors do not preclude future integration of RGG with other frameworks such as RServe, and explicitly state the power of integrating Sweave and RGG – a pretty good outlook from my desk at least.

While the authors certainly propose the collection and establishment of an RGG repository, this requires a significant amount of input from the community. While power-R-users are the people who can contribute methods, the users are those who will undoubtedly have the most issues in understanding what should be done, how, when and why. This creates just a little cause for concern, but it is a great idea nevertheless, and is certainly something that should be added to the list of things to evaluate!

Genomic resources for a commercial flatfish, the Senegalese sole (Solea senegalensis): EST sequencing, oligo microarray design, and development of the Soleamold bioinformatic platform

Tuesday, March 3rd, 2009

torturedsole.jpg

BMC Genomics. 2008; 9: 508.

Joan Cerdà et al.,

As a former expert of EST technology and analysis, I still really enjoy reading the state-of-the-nation in EST papers, and like to see how the technological envelope is being opened further. While I suspect that rather many research groups are under-selling their EST collections, and are completely failing to fully exploit their own data, this manuscript manages to add something new to the EST genre.

The Senegalese sole is a flatfish of economic relevance within Europe and North Africa. The fish is within an aquaculture development programme, but physiological aspects of growth and development including at least disease resistance and larval growth remain uncontrolled leaving room for substantial improvements. This manuscript concentrates on the development of genomic resources for the study of gonad development in the fish, and within a systems-scale analysis the authors include sequence data from cDNA libraries, and a substantial amount of in situ data and this is wrapped into what appears to be a very attractive data presentation environment.

Ten high titre cDNA libraries were constructed from different developmental stages, tissues and organs, and 3′ sequencing was used to obtain a total of 5,200 EST sequences. The sequences were processed with a rather primitive bioinformatics analysis pipeline, but a meaningful unigene set was assembled and cohorts of meaningful tentative consensus sequences were identified. The core analyses were based solely (pun intended ;-) ) on metrics such as the number of ESTs represented within each unigene and GO mapping of unigenes on the basis of BLAST results. This certainly yields the standard but appealing eye-candy and demonstrates a grasp of the data (but value beyond the aesthetic is questionable).

solefig.jpg

The sequences were used to create an Agilent custom expression array, and this has been demonstrated to work, although lists of differentially expressed genes within meaningful comparisons are likely to follow in subsequent manuscripts.

The core value of this manuscript is however their Soleamold bioinformatics application – needs to be installed on Windows (why couldn’t they have packaged a Java Webstart application instead) – but shows a great set of screen shots of morphology and ISH data whereby the genomic, transcriptomic and ISH data are integrated into a single coherent application.

Overall, this is a great manuscript demonstrating what can be done with just a few 1000s of EST sequences, flexible technologies such as the Agilent custom arrays and a load of ISH. The bioinformatics of data analysis is poor and incomplete, but the integrative imagination and implementation looks five-star. Great read, good concept and nice implementation. This is undoubtedly worth a read-of-the-week!

A general modular framework for gene set enrichment analysis

Tuesday, February 10th, 2009

Gene Set Enrichment Analysis or GSEA is one of those tasty methods that has been out there in the public domain for a number of years now. I guess that when most people see GSEA they immediately think of the original Gene Set Enrichment Analysis publication that was written by scientists from the Broad Institute. Earlier, whilst investigating the contents of the BioinformaticsBlogLogs, I found that GSEA is one of the technologies that still piques a certain amount of interest. Gene Set Enrichment is one of the two most frequently searched terms (and only slightly ahead of “bioinformatics future 2009-”). Perhaps to kill two birds with one stone, I should state that GSEA and related techniques are one of the futures of bioinformatics. GSEA is already a stand-alone tool, and enrichment algorithms are widely used in informatics solutions from the likes of Ingenuity Systems etc.

It is therefore wonderful to find a well written article that compares and contrasts different enrichment methods, and proposes a framework for the further benchmarking of the available methods. As an applied bioinformatician it is all too easy to deploy a method without considering whether the statistic adopted really is best of breed.

Anyhow,

BMC Bioinformatics. 2009 Feb 3;10(1):47.

Ackermann M, Strimmer K.

PMID: 19192285

This article is well worth a read. The application of GSEA technologies within the field of expression profiling is discussed, and the point is clearly made that multiple methods achieve the same task and that there is a need for standardisation of methods and evaluation of the standardised methods. The authors perform a meta-analysis of the existing GSEA methods, analyse these methods within various simulations and evaluate the results. The overall finding is perhaps that GSEA itself may be an inferior method to a simpler univariate procedure, and that workflows relying on enrichment analysis may be simplified.
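For a flavour of what a “simpler univariate procedure” can look like, here is a minimal sketch. It is not the authors’ framework or any of the methods they benchmark, just one simple variant of my own choosing: average the absolute per-gene scores inside the gene set and compare that against randomly drawn gene sets of the same size. The function name, the toy scores and the choice of a gene-sampling null are all assumptions made for illustration.

```python
import random
from statistics import mean


def gene_set_p_value(scores, gene_set, n_perm=2000, seed=0):
    """Gene-sampling permutation p-value for the mean absolute score of a gene set.

    `scores` maps gene name -> per-gene statistic (for example a t-statistic).
    """
    rng = random.Random(seed)
    genes = list(scores)
    members = [g for g in genes if g in gene_set]
    if not members:
        raise ValueError("no gene in the set has a score")
    observed = mean(abs(scores[g]) for g in members)
    hits = 0
    for _ in range(n_perm):
        # Null model: a random gene set of the same size drawn from all scored genes.
        sample = rng.sample(genes, len(members))
        if mean(abs(scores[g]) for g in sample) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy data: 200 genes with noise-only scores, then shift a 10-gene "pathway".
rng = random.Random(1)
scores = {f"gene{i}": rng.gauss(0.0, 1.0) for i in range(200)}
pathway = {f"gene{i}" for i in range(10)}
for gene in pathway:
    scores[gene] += 3.0

print("shifted pathway p-value:", gene_set_p_value(scores, pathway))
```

Sampling genes rather than permuting sample labels is the quick-and-dirty option; it ignores correlation between genes, which is exactly the kind of design choice that a benchmarking framework such as the one proposed in the paper is meant to expose.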

EMMA2 – A MAGE-compliant system for the collaborative analysis and integration of microarray data

Tuesday, February 10th, 2009

In BMC Bioinformatics there is a pre-print of the EMMA-2 manuscript, a superficially interesting sounding database developed in Computational Genomics at Bielefeld, one of the bioinformatics places in Germany. I am greatly saddened by the experience and am left feeling whelmed, as if I have pointlessly lost a few minutes of my life.

The BMC Bioinformatics manuscript is well written, easy to read, and paints a convincing reason as to why we should evaluate their database system (and even consider licensing it?). The screenshots in the publication are attractive and glossy, but an evaluation of the resource using a standard work environment is not such a great experience!

clippynot.png

The inclusion of the mutant spawn of clippy is at best an error of judgement. I have my suspicions that non-power users will not be making MIAME submissions, and should perhaps stick with GeneSpring! Power users who have used R and can programme are easily offended and this leaves me a little livid!

Trying to access a project that contains some hopefully useful data (for evaluating competence and utility) meets with something rather unexpected … a colourful error message! The name of the dataset was BMC and this was hopefully a dataset to accompany the publication; it was accessible through the demo account so something is a little FUBAR!

emmaerror.png

Instead of pushing further I have looked at the standard example datasets available in the demo account. All I can say is that this is bullsh!%; a study containing only 1-6 arrays is of very limited utility. I guess that to convince me that this solution is worthy of inclusion within the BioRAM Linux framework, I would like to see some of the good Novartis SymAtlas scale data analysed. For a beautiful illustration of capability, utility and scalability, a database should really show that a dataset such as the GSK cell line data or expO clinical cancer samples can be processed.

emmaprojects.png

If we dive into the MTGeneChipDemo_2 (which has a whole 6 arrays included),  we can browse individual arrays for affy ids, we can try to observe a heat map (and throw another exception)…

While the manuscript is OK and even prompted me to evaluate the database, the database is just awful, horrid and rotten. Sure, I may not have the appropriate dependencies to evaluate the resource properly, but a demo with dead and useless data points towards a software development that is lacking real users and vision. I have failed to see any added value in the EMMA2 pipeline. For MIAME-MAGE compliance other free databases achieve better functionality and utility (I support BASE2 here); for data analysis and visualisation even a chimpanzee may be better off using R/Bioconductor than EMMA-2.

Dondrup et al., fine manuscript but terrible database delivery experience.

ArrayPlex: distributed, interactive and programmatic access to genome sequence, annotation, ontology, and analytical toolsets.

Tuesday, February 10th, 2009

arrayplex.png

Genome Biol. 2008;9(11):R159. Epub 2008

 

Killion PJ, Iyer VR.

PMID: 19014503

 
Another quick manuscript review for something that I hope that most bioinformaticians (working in or around core facilities) have already read. ArrayPlex is an orgy of my favourite bioinformatics themes: distributed data, Tomcat, PostgreSQL, expression data, OSX – you name it, it’s probably already in this paper.
 
This manuscript describes a system that aims to meet an unmet need within the field of applied bioinformatics: an integrated and centralised system for the storage and maintenance of microarray data. The resource is aimed at balancing the primitive raw data (gene expression content) with the associated annotative context (relating to gene names, gene identifiers and functional annotations). The system is designed for sensible operating systems (will not run on Windows ;-) ) and is deployed as a Tomcat service. ArrayPlex looks after itself (or so the authors suggest) and builds an operating environment using data trawled from the public domain, and appears extensible through the provision of an API.
 
While I haven’t yet installed or deployed ArrayPlex for a formal evaluation of functionality, much of the functionality it provides is already available from other resources. The authors stress that it isn’t intended as a substitute for e.g. the BASE database, but rather suggest that it may be an alternative for some would-be Bioconductor users… I am not sure what comment to make here, but I really don’t see too much competition in ArrayPlex! The screenshots provided within the manuscript are beautiful and make the system look like an extremely attractive tool – if it has data aggregation or integration capabilities as promised then this will be a must-have tool in the future, especially if there is any scope for R/Bioconductor integration.
 
My feeling – this paper is a smörgåsbord of great bioinformatics themes, and is something that really should be investigated further! It leaves me a little concerned however; the authors suggest that Bioconductor is difficult to use because of its lack of GUI and need for shell. Quite how an inexperienced user will cope with the dependencies of installing Tomcat, PostgreSQL and other applications on a UNIX or OSX box (without shell) is quite beyond me. The descriptions of the pipelines are attractive, and I am at least convinced that the system is worth a look, and should perhaps be earmarked for inclusion within the BioRAM Linux distribution.
 
I should also note that several paradigms and intents are shared between ArrayPlex and my very own Mnemosyne LabManager application. The LabManager, though, aims to provide an abstraction layer to the underlying R/Bioconductor, and provides mechanisms for an R proficient user to benefit from the server side and encompassing APIs at the same time… Time will tell?

Bladder Cancer-Associated Gene Expression Signatures Identified by Profiling of Exfoliated Urothelia

Tuesday, February 10th, 2009

heatmap1.png

I am having a push to understand how we could be more effectively selecting candidate biomarkers from both our own proprietary datasets and from the wealth of public data that has been deposited within e.g. ArrayExpress or GEO. I have identified the following manuscript as being worthy of review and hope that you might agree with some of my feelings.

Cancer Epidemiol Biomarkers Prev. 2009 Feb 3. [Epub ahead of print]

Rosser CJ, Liu L, Sun Y, Villicana P, McCullers M, Porvasnik S, Young PR, Parker AS, Goodison S.

PMID: 19190164 [PubMed - as supplied by publisher]

The authors argue that bladder cancer, when detected early, is largely treatable with a 5-year survival rate of ~94%; this is of course greatly hindered by the papillary tumours that invade the surrounding muscle tissues. Following surgery to remove the tumour, regular checkups are required to ensure that there is no tumour recurrence. These checkups are performed using cystoscopy – an invasive and rather unpleasant sounding procedure. There is thus substantial opportunity for the improvement in quality of life through the early detection of tumour recurrence using less invasive methods, with urine representing an ideal biosource type.

Urine cytology can be used for the diagnosis of new malignancy, albeit with rather low sensitivity and imperfect specificity. Existing protein based biomarkers have high false-positive rates and other development methods suffer from insufficient predictive power. In this manuscript, the authors rise to the challenge of developing a gene-expression based biomarker panel for the diagnosis of bladder cancer through the profiling of urothelial cells from bladder washes. From a panel of confirmed bladder cancer patients and apparently healthy (with respect to bladder cancer) controls profiling of amplified RNA was performed using the U133 plus 2 platform.

The resulting expression data was analysed within a framework of feature selection algorithms; the authors have previously described a machine learning approach that has been successfully applied within both breast and prostate cancers. The structure of the experimental data looks good, only one cancer patient profile clustered with the control patients within hierarchical cluster analysis. Pathway analysis software was used to map observations onto relevant biological context, and a 14-gene model was refined to build a classifier. The resulting classifier yielded 76% accuracy in cancer class prediction; apparently a reasonable feat considering the cytologic classification was only 35%.

The manuscript is certainly not presenting anything amazingly new, but is showing the application of existing technologies to demonstrate a proof-of-principle in meeting an as yet unmet medicinal need. The logical workflow within the manuscript is clear, the arguments are well presented and the bioinformatics methodology is certainly acceptable. The paper, in my opinion, is noteworthy because it is not an open-and-shut case. There is plenty of room for improvement (18/20 cancer patients identified), so the derived panel suffers from both false negatives (a low rate) and false positives. The starting panel is heterogeneous with patients of mixed ages, sexes, ethnic backgrounds and clinical characteristics. I see much more to be done here, and this has been real food-for-thought!
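How much weight can 18 out of 20 bear? As a back-of-the-envelope check, here is a minimal sketch of a Wilson score interval for the sensitivity. The choice of the Wilson interval and of the 95% level (z = 1.96) are my own assumptions; nothing of the sort is quoted from the manuscript.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1.0 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# 18 of 20 cancer patients identified, as quoted above.
low, high = wilson_interval(18, 20)
print(f"sensitivity {18 / 20:.2f}, interval {low:.2f} to {high:.2f}")
```

With only 20 cancer patients the interval runs from roughly 0.70 to 0.97, so the headline 90% is compatible with anything from stick-assay territory to cystoscopy territory; a larger and less heterogeneous validation cohort is what would narrow it.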