There are occasions when as bioinformaticians we question what we are really doing; when we need to sit back and consider whether we should even pursue a given project! There are even times when the chosen technologies are enough to turn our stomachs and to give us the Screaming Heebie-jeebies.
I am presently eager to secure a collaboration with a remarkable clinical scientist who is doing some very elegant research. He has some beautiful study designs and a fabulous patient cohort. He also has a huge stack of data – but it is all SOLiD. Nothing dampens a genome biologist’s mood quite like the mention of SOLiD data. This is strange. We all know of the challenges of 2-base encoded ColorSpace data, but those challenges must be largely hearsay, since who has actually processed any of the data recently?
When at MGRC we had a good look at making SOLiD a native part of the “One-Click” pipeline. The mapping statistics were pretty unconvincing at the time and the project died a quick and painless death – the provider of the test data did not get the rich characterisation of personal variants back and I remained on my Illumina high horse – “deep base space is all that we need for meaningful variant analysis”. I too am guilty of the propagation of the “SOLiD sucks” mindset.

So, a couple of years after we last poked at SOLiD data, I have an external disk crammed with lanes of XSQ-formatted LifeTech ColorSpace data. And there is an additional element of competition – the team from LifeTech are also trying to compete with us for some of the action in this remarkable project. This means that the LifeTech LifeScope software is off-limits as an “unvalidated” black box solution – I need to craft a solution myself.
Former colleagues at Novocraft certainly have mapping software (NovoAlignCS) that should be capable of doing something with the data – but NovoAlign expects fasta sequences and accompanying quality files. Reading posts on BioStar and SeqAnswers reveals more mystery and misinformation about ColorSpace data handling, but a gem emerges from the rough in the form of NGS_plumbing. Written by Laurent Gautier (whom I’ve met at a couple of meetings over the last few years), this software contains some rather simple methods to convert XSQ files into something more tractable for current mapping software. Although it is written in Python and pulls in a few bizarre dependencies, it is easily installed on Linux (but not on OSX). Even in a VirtualBox bioinformatics virtual machine, the NGS_plumbing method chews through the XSQ files at a prodigious rate to create files that may actually be of some scientific use.
I don’t like SOLiD data – I don’t like that the XSQ file is binary and requires HDF5-devel to even peek inside it. I really don’t like that the resulting fastq file is a load of .0123 – it’s just not what I am used to seeing. The challenge, though, is not quite as horrendous as the naysayers had led me to believe. So far so good. Of course, the proof of the pudding will be in the mapping statistics and in whether we can actually pull some meaningful genetic variants out of the collection!
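For anybody else staring at a screen full of .0123, here is a minimal sketch of what the 2-base encoding actually means. It is purely illustrative and is not how NGS_plumbing or NovoAlignCS handle the data; the function name `decode_colorspace` is my own invention, and I am assuming the usual layout in which a read starts with the known primer base followed by colour digits. Each colour is the XOR of the 2-bit codes of two neighbouring bases (A=0, C=1, G=2, T=3), so decoding is just a walk along the read.

```python
# Illustrative sketch of two-base (colour-space) decoding.
# Assumption: the read begins with the known primer base, then colour digits.
BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3


def decode_colorspace(read):
    """Translate a colour-space read such as 'T0123' into base space."""
    previous = BASES.index(read[0])  # code of the primer base
    decoded = []
    for color in read[1:]:
        if color not in "0123":
            # A '.' is a missing colour call: every base from here on
            # is undetermined, so pad the rest of the read with N.
            decoded.append("N" * (len(read) - 1 - len(decoded)))
            break
        previous ^= int(color)  # colour = XOR of neighbouring base codes
        decoded.append(BASES[previous])
    return "".join(decoded)


print(decode_colorspace("T0123"))  # TGAT
print(decode_colorspace("T01.3"))  # TGNN
```

The second example shows why naive conversion is a poor idea: one missing or miscalled colour scrambles every base downstream of it, which is, as far as I understand it, the reason colour-space aware mappers align in colour space and only translate at the end.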






