Archive for the ‘public data’ Category

BioconductorBuntu – A Linux Distribution that Implements a Web-Based DNA Microarray Analysis Server

Wednesday, April 1st, 2009


Paul Geeleher et al., Bioinformatics Advance Access Publication March 23rd, 2009.

Fresh in the latest version of Bioinformatics Advance Access is a rather wonderful short correspondence on BioconductorBuntu. The authors of this brief article have highlighted a rather important divide within the bioinformatics community: those who can use R and those who can’t.

To solve the issue of “hot” microarray data analysis for those fearful of scripting, the authors have implemented a whole Ubuntu distribution containing the requisite packages, software and servers for rapid deployment of a data analysis server. In addition to just providing R and some Bioconductor packages, the authors have also implemented a basic framework of authentication and ownership, and some core GUIs to streamline the process of uploading, analysing and reporting the content of DNA microarray studies. In contrast to earlier efforts such as AMDA (Genopolis, Italy), the authors have provided mechanisms for the handling of Affymetrix data, single and dual colour arrays.

The workflow appears to contain all core elements of data validation, QC and differential expression analysis and also provides a little content for both GSEA and KEGG type analyses.
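For readers wondering what the differential expression step boils down to, here is a minimal sketch in plain Python. It is emphatically not the BioconductorBuntu implementation (which sits on R/Bioconductor); it simply ranks genes by a Welch t-statistic between two groups of log2 expression values. The function names, the gene names and the numbers are all made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two groups of log2 expression values.

    Assumes each group has at least two values and that the groups
    are not both constant (otherwise the denominator is zero).
    """
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def rank_genes(expr, control_cols, treated_cols):
    """Return (gene, log2 fold change, t) tuples, largest |t| first."""
    results = []
    for gene, values in expr.items():
        control = [values[i] for i in control_cols]
        treated = [values[i] for i in treated_cols]
        log_fc = mean(treated) - mean(control)
        results.append((gene, log_fc, welch_t(treated, control)))
    return sorted(results, key=lambda r: abs(r[2]), reverse=True)


# Hypothetical toy matrix: three control arrays followed by three treated arrays.
expr = {
    "geneA": [7.0, 7.2, 6.8, 9.0, 9.2, 8.8],
    "geneB": [5.0, 5.1, 4.9, 5.0, 5.2, 4.8],
}
ranked = rank_genes(expr, control_cols=[0, 1, 2], treated_cols=[3, 4, 5])
for gene, log_fc, t in ranked:
    print(f"{gene}\tlogFC={log_fc:.2f}\tt={t:.2f}")
```

A real pipeline would add normalisation, a moderated statistic and multiple-testing correction on top of this, which is exactly the sort of thing a packaged server spares the biologist from scripting.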

This is in my humble opinion a wonderful piece of work. Certainly this is not a complete solution (what about Illumina or Agilent data in their more native file structures?) and the reporting is lacking outside of the most basic content – but it does deliver an elegant and functional system for the dirty and unwashed masses. The wrapping of the stack onto an Ubuntu “spin” is great – if as promised I can download an iso, burn a disk, boot, install and rock-and-roll then this could stand to be a really useful tool sitting in the corner of many small labs.

I have some vague suspicions though that this approach is doomed to failure. The biologists who cannot use R and Bioconductor are the same people who will be terribly afraid of booting a Linux workstation and installing something by themselves. These are the same people who will be least well prepared to diagnose the problems on the server, and who will need the most training and babysitting to get them to the stage where the software can be applied in a meaningful way! This is not a detraction from the paper: BioconductorBuntu is a very elegant solution and promises to solve some of the problems, but a bioinformatician, IT guy or statistician is still needed to get the biologist up-and-running. Thank goodness – our jobs are still safe for the time being ;-)

This is certainly a well-earned-paper-of-the-week. Congratulations, Paul et al.!

EMAAS: An extensible grid-based Rich Internet Application for microarray data analysis and management

Monday, March 23rd, 2009

G Barton et al., BMC Bioinformatics 2008, 9:493 doi:10.1186/1471-2105-9-493

EMAAS is another environment for handling and analysis of gene expression data. The authors have set about the development of a distributed e-support system for the management and analysis of microarray data, to provide access to complex methods and to apply (from a biologist’s POV) non-trivial technologies to handle large multi-variate datasets.

Whilst other solutions have missed the point and taken an easy approach to solving the problem, the EMAAS approach is rather more complicated and relies instead on integration of internet accessible tools, standard statistical packages (R/Bioconductor) and web-resources (CELSIUS, GEO). The decision to aim for a modular and flexible framework is excellent and makes this in my opinion a very much more interesting project. The completeness with which tools and environments have been included is breathtaking; the depth of IT and analytical platforms required is rather daunting.

In contrast to the manuscript reviewed in the last post, this resource’s source is available under a suitable GPL license, and some of the demo server also works. I have some problems with the resource (Flash for a start), but this is one smooth implementation and is packaged in such a way that I could take it for a spin if I so wished!

This manuscript is heavy to read, but a damned fine resource is described underneath the technical fluff. This is a great resource and this earns a great recommendation from the bioinformaticsblog.

SiPaGene: A new repository for instant online retrieval, sharing and meta-analyses of GeneChip® expression data

Monday, March 23rd, 2009

Adriane Menßen et al., BMC Genomics 2009, 10:98 doi:10.1186/1471-2164-10-98


This manuscript describes a new database, data warehouse and analytical platform for the handling of Affymetrix based gene expression data. The authors identify the need for a database that is convenient, facilitates online analysis and provides user-specific sharing options, and further qualify their understanding of an unmet database need with the statement that “… existing tools do not use the whole range of statistical power provided by the MAS5.0/GCOS algorithms”.

I agree with the authors that there is such a gap within the database arena for a MIAME compliant database that provides both data warehousing and data analytical capabilities, and the addition of user-specific access rights is great. The MAS5 and GCOS methods undoubtedly have their place, but is relying on them alone perhaps naive?

The authors fill a number of quite heavy pages with their description of a refreshingly heavyweight database infrastructure (Java, ancient Oracle) that is currently biased towards their local research environment’s interest in immunology, inflammation, regeneration and cancer. Such a lengthily described database is then populated with only 1000 arrays.

This manuscript is of interest, the approach is nice; a combined warehouse and analysis environment. I have some problems with the database though. “Non-academic commercial use is restricted” is a waste; I would never consider paying for this resource when fantastic solutions from SAS JMP Genomics / GeneData / … with full support, testing and scalability are available with a lower TCO. To see what has been done, how well it performs and to play with a resource is nice.

I suspect that this is another fail – the online demo will not even work.


So, nice try, but no cigar. The manuscript is nice, convincingly written and more professional than some solutions out there. The web presentation looks fugly, and is also broken. The politics of code availability is plainly stupid – those who can pay will not because the implementation is not sufficiently good – Charite, please make the code a little more available!

Genomic resources for a commercial flatfish, the Senegalese sole (Solea senegalensis): EST sequencing, oligo microarray design, and development of the Soleamold bioinformatic platform

Tuesday, March 3rd, 2009


BMC Genomics. 2008; 9: 508.

Joan Cerdà et al.

As a former expert of EST technology and analysis, I still really enjoy reading the state-of-the-nation in EST papers, and like to see how the technological envelope is being opened further. While I suspect that rather many research groups are under-selling their EST collections, and are completely failing to fully exploit their own data, this manuscript manages to add something new to the EST genre.

The Senegalese sole is a flatfish of economic relevance within Europe and North Africa. The fish is within an aquaculture development programme, but physiological aspects of growth and development including at least disease resistance and larval growth remain uncontrolled leaving room for substantial improvements. This manuscript concentrates on the development of genomic resources for the study of gonad development in the fish, and within a systems-scale analysis the authors include sequence data from cDNA libraries, and a substantial amount of in situ data and this is wrapped into what appears to be a very attractive data presentation environment.

10 high titre cDNA libraries were constructed from different developmental stages, tissues and organs, and 3′ sequencing was used to obtain a total of 5,200 EST sequences. The sequences were processed with a rather primitive bioinformatics analysis pipeline, but a meaningful unigene set was assembled and cohorts of meaningful tentative consensus sequences were identified. The core analyses were based solely (pun intended ;-) ) on metrics such as the number of ESTs represented within each unigene and GO mapping of unigenes on the basis of BLAST results. This certainly yields the standard but appealing eye-candy and demonstrates a grasp of the data (but value beyond the aesthetic is questionable).
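The "number of ESTs per unigene" metric is simple enough to sketch. The snippet below is only an illustration of the idea and not the authors' pipeline; the EST and unigene identifiers are invented, and the mapping from EST to unigene is assumed to come from whatever clustering or assembly step was run beforehand.

```python
from collections import Counter


def unigene_summary(est_to_unigene):
    """Summarise cluster sizes from a mapping of EST id -> unigene id."""
    sizes = Counter(est_to_unigene.values())
    singletons = sum(1 for n in sizes.values() if n == 1)
    return {
        "unigenes": len(sizes),
        "singletons": singletons,               # unigenes supported by one EST
        "contigs": len(sizes) - singletons,     # unigenes supported by several ESTs
        "largest": sizes.most_common(1)[0] if sizes else None,
    }


# Hypothetical assignments, purely for illustration.
assignments = {
    "est001": "UG1",
    "est002": "UG1",
    "est003": "UG2",
    "est004": "UG1",
    "est005": "UG3",
}
summary = unigene_summary(assignments)
print(summary)
```

Counts like these make for nice bar charts, which is rather my point above: they describe the library, not the biology.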


The sequences were used to create an Agilent custom expression array, and this has been demonstrated to work, although lists of differentially expressed genes within meaningful comparisons are likely to follow in subsequent manuscripts.

The core value of this manuscript is however their Soleamold bioinformatics application. It needs to be installed on Windows (why couldn’t they have packaged a Java Webstart application instead?), but the manuscript shows a great set of screen shots of morphology and ISH data whereby the genomic, transcriptomic and ISH data are integrated into a single coherent application.

Overall, this is a great manuscript demonstrating what can be done with just a few 1000s of EST sequences, flexible technologies such as the Agilent custom arrays and a load of ISH. The bioinformatics of data analysis is poor and incomplete, but the integrative imagination and implementation looks five-star. Great read, good concept and nice implementation. This is undoubtedly worth a read-of-the-week!

ArrayPlex: distributed, interactive and programmatic access to genome sequence, annotation, ontology, and analytical toolsets.

Tuesday, February 10th, 2009


Genome Biol. 2008;9(11):R159. Epub 2008

 

Killion PJ, Iyer VR.

PMID: 19014503

 
Another quick manuscript review for something that I hope most bioinformaticians (working in or around core facilities) have already read. ArrayPlex is an orgy of my favourite bioinformatics themes: distributed data, Tomcat, PostgreSQL, expression data, OSX – you name it, it’s probably already in this paper.
 
This manuscript describes a system that aims to meet an unmet need within the field of applied bioinformatics: an integrated and centralised system for the storage and maintenance of microarray data. The resource is aimed at balancing the primitive raw data (gene expression content) with the associated annotative context (relating to gene names, gene identifiers and functional annotations). The system is designed for sensible operating systems (will not run on Windows ;-) ) and is deployed as a Tomcat service. ArrayPlex looks after itself (or so the authors suggest) and builds an operating environment using data trawled from the public domain, and appears extensible through the provision of an API.
 
While I haven’t yet installed or deployed ArrayPlex for a formal evaluation of functionality, much of the functionality it provides is already available from other resources. The authors stress that it isn’t intended as a substitute for e.g. the BASE database, but rather suggest that it may be an alternative for some would-be Bioconductor users… I am not sure what comment to make here, but I really don’t see too much competition in ArrayPlex! The screenshots provided within the manuscript are beautiful and make the system look like an extremely attractive tool – if it has data aggregation or integration capabilities as promised then this will be a must-have tool in the future; especially if there is any scope for R/bioconductor integration.
 
My feeling – this paper is a smörgåsbord of great bioinformatics themes, and is something that really should be investigated further! It leaves me a little concerned however; the authors suggest that Bioconductor is difficult to use because of its lack of GUI and need for shell. Quite how an inexperienced user will cope with the dependencies of installing Tomcat, PostgreSQL and other applications on a UNIX or OSX box (without shell) is quite beyond me. The descriptions of the pipelines are attractive, and I am at least convinced that the system is worth a look, and should perhaps be earmarked for inclusion within the BioRAM linux distribution.
 
I should also note that several paradigms and intents are shared between ArrayPlex and my very own Mnemosyne LabManager application. The LabManager though, aims to provide an abstraction layer to the underlying R/Bioconductor, and provides mechanisms for an R proficient user to benefit from the server side and encompassing APIs at the same time… Time will tell?

Target discovery from data mining approaches

Monday, February 2nd, 2009


Yongliang Yang et al., Drug Discovery Today 2009, Vol 14, p147-154.

Target discovery is a key area within drug development: you can’t develop a drug without a target (anymore), and I am sure that many bioinformaticians working within pharma and biotech spend a not inconsiderable amount of their time compiling portfolios of information relating to a development drug’s target protein.

It is therefore great to see a review article in Drug Discovery Today that highlights, describes and outlines the informatics workflows that may be used to discover a meaningful target. This is not a review article for a seasoned corporate bioinformatician, but is rather a good illustration of much of bioinformatics on the commercial side of the academic/commercial divide. It is rather obvious however that the authors are academic, and that the focus of the article is more towards the academic characterisation of a target rather than the approach that the more enterprise oriented bioinformatician would take! The article is not weakened by this, since the authors place considerable stress on the fact that there are huge volumes of data out there, and that there is considerable benefit to be reaped by making sense of, and integrating, these data.

The article is well written and provides a meaningful review of integrative data mining. For bioinformaticians considering a career in corporate bioinformatics, this provides a robust view of what we spend much of our time doing, although the tools summarised may not be the optimal tools within a corporate setting.

This is a good manuscript, not a great one. It highlights the issues within target characterisation and understanding and is a worthy read for all.

Next generation tools for the annotation of human SNPs

Monday, February 2nd, 2009


Rachel Karchin, Briefings in Bioinformatics, 2009, Vol 10, 35-52

This is another timely review article published in Briefings in Bioinformatics, and as my own job duties become a little more translational, something that is of immediate relevance and interest.

This is a typical review article and I feel is noteworthy for the depth and exploration of human SNP based data-resources. The author collates a very comprehensive resource of web-based databases (21 in total) and evaluates their potential for both usability (from a bioinformatician’s perspective) and for utility (as in the resource might be of real use). The results from this near exhaustive analysis are provided as a meaningful set of supplementary data. This evaluation is further supplemented by meaningful and largely typical case studies that might be encountered within a typical drug-discovery or translational-medicine campaign.

The case studies, covering intronic SNPs characterised within schizophrenia, novel amyotrophic lateral sclerosis SNPs and mixed esophageal cancer SNPs, truly highlight the potential roles of the different resources in SNP characterisation.

The manuscript is certainly of benefit for researchers and bioinformaticians aiming to fast-forward their knowledge of web-based polymorphism databases and resources. There are also very welcome and relevant statements as to the non-synchronous nature of data between reference (and de facto up-to-date) resources such as dbSNP and the plethora of derived databases whose data is largely anchored to older (and in some cases ancient) releases of dbSNP.
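The synchronisation problem is easy to picture with a little bookkeeping. The sketch below assumes you have noted which dbSNP build each derived resource was generated from; the resource names and build numbers are placeholders I have invented, not values taken from the review.

```python
def stale_resources(reference_build, derived_builds, max_lag=0):
    """Return {resource: builds behind}, worst offenders first.

    reference_build: the current dbSNP build number.
    derived_builds:  mapping of resource name -> dbSNP build it is anchored to.
    max_lag:         how many builds behind is still considered acceptable.
    """
    lagging = {
        name: reference_build - build
        for name, build in derived_builds.items()
        if reference_build - build > max_lag
    }
    return dict(sorted(lagging.items(), key=lambda kv: kv[1], reverse=True))


# Placeholder names and build numbers, for illustration only.
derived = {"resource_a": 129, "resource_b": 126, "resource_c": 120}
lag = stale_resources(129, derived)
print(lag)
```

Even a table this crude, kept alongside an analysis, tells you which annotations you should be re-checking against the reference before believing them.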

The outlook and summary statements are of great relevance to the authors of the next generation of web-enabled derived data containing bioinformatics resources. The author is not naive and does comment on the limitations of the development, support and maintenance needs of academic SNP webservers. Bioinformatics in academia is driven by the need for publications and maintenance of existing resources is largely not very publishable. It should be noted here (in bold even) that Nucleic Acids Research does publish the annual Database and Webserver issue, perhaps the only journal of impact that supports the systematic republication of, in many cases, rather old resources.

Overall, much of the content within this review is old-hat; the evaluation of the resources and the comparison of added-value within typical SNP characterisation workflows is however elegant and of real value to many bioinformatics, pharmacogenetics and translational systems biology researchers. I am not sure that I would flag this manuscript as a must read, but it is certainly worth packing into your bag if you’re travelling on a train, plane or have a few minutes to spare at home tonight.

What should the focus of the blog be for the next four weeks?

Friday, January 23rd, 2009


The bioinformaticsblog is being read, and looking at the logs shows me what is popular and what is not. Loads of hits are coming from Google and it is pretty interesting what you all seem to be hunting for! At the moment the main search themes surround my comments on NetOffice (I still love it), iPhone (still thinking about feasibility) and public datasets (especially the GSK cell line data).

I have put a huge effort into the bioram-linux plan and have a pretty neat system running inside a virtual machine. It has the latest R with a load of Bioconductor and CRAN packages. It has software such as trace2dbest, phred, cross_match (not publicly available from rPath.org yet) for an EST project running through the lab at the moment and I am working on certain “biomarker” packages, that will demonstrate my control of destiny rather than the other way round! This is where I wish to concentrate for the time being.

My blog focus until the end of Feb 2009 will be as follows

  • BioRAM-linux, rPath, rBuilder and conary; towards a functional customized bioinformatics cluster OS
  • bioinformatics and iPhone – towards XML integration of iPhone and Mnemosyne LabManager server
  • tutorials in comparative genomics using ‘R’ and ‘bioconductor’
  • An XML framework for Agilent array analysis using ‘R/bioconductor’
  • Papers-of-the-week

This is my plan, but I really need to have your input here (mail me, or preferably leave a comment)

iPhone application development – room for bioinformatics?

Friday, January 23rd, 2009


A great core tutorial has been written on writing your first iPhone application in 14 days.

I am certain that there is room for iPhone bioinformatics, and I am confident that an integration with a meta-aggregation warehouse such as the Mnemosyne LabManager could provide scope for accessing bioinformatics based data on the move.

A good question would be why?

  1. Because I can.
  2. It would be useful in many situations.
  3. Imagine checking the expression profile of a target gene against your “favourite” dataset during a conference.
  4. Having access to experimental workflows on the train would promote innovation

I do not propose that we implement interpro for iPhone, but a sequence analysis method would be useful ;-)

Just a thought and perhaps a plan for the future!

biomedexperts – scientific social networking

Thursday, January 15th, 2009


Googling for yourself is always an interesting way to waste some time! I was curious to see if the bioinformatics blog would be identified by a name search (no appears to be the answer). Amongst the surprisingly many people who share my name are a few pages dedicated to just me within the sphere of social networking. Beyond the typical linkedin-type pages (and a remarkably untouched facebook page) was a resource that I hadn’t seen before – biomedexperts “your scientific match point”.

Pretty interesting stuff, a collation as to whom I have worked with, when and the scientific articles that have been published. It is very clear that the centre of my scientific network is based in Bavaria! Enough years in the wastelands of the Finnish periphery, and the network remains and is still stronger than anything that went before.

Have a look, and I guess that if you have a few articles published within a typical biomed indexed journal, your global scientist::scientist interactions will be intuitively plotted.

As a side note (or even grumble) I stumbled across the web pages for the national UK census of 1911. I was interested to check out the gossip of my great-great-grandparents who would I guess have been just a few years younger than I am now! I have found a few candidate names, registered (for free) to access the data, and am then asked to supply credit card details for more info …. Having wasted sufficient time in filling the abomination that is the census document, I feel that if in another 90 years time my great-great-grandchildren have to fork out cash to see just how unseriously I take such things, I’ll come back and haunt some tightassed civil servant!

OK another post that is full of waffle and guff, no bioinformatics (but cool networks that Ingenuity can’t yet plot ..) I’m still thinking about how and when to start making this site attractive (but waiting to get a threshold number of visitors per day first)