I have been in contact with Preston, an amazing guy from California, for a few years. I have never met him but he has a nasty undiagnosed disease. Preston is a pioneer in consumer genomics. He was one of the first non-domain experts to sequence his genome looking for an answer to a non-life threatening disease. In a previous career I (unsuccessfully) explored his genome looking for clues as to the nature of the disease.
Preston remains desperate to understand a health problem and we are trying to resurrect his genome project as a consumer exercise in DNA analytics – what can a man on the street actually do with a whole genome sequence? As a genome biologist with access to HPC and state-of-the-art software this seems like a trivial project. This is something that we’re working on outside of normal work-hours – no contributions from our employers. This genome is something that we’re going to explore using iOS, Android and home PCs. We’ll beg, borrow and scrounge resources to interpret all possible angles. We want to see what we can really do and how we can do it. Hopefully we can benchmark the genomics reality from the man-in-the-street perspective rather than the glossy vision pitched by the personal genomics companies.
Logistics is a first (and somewhat funny) starting point. I’m in Brisbane, Australia with a capped domestic internet connection. Preston is in ?Sacramento?, California. Preston has an encrypted USB drive with the Illumina-prepared BAM file describing the sequence reads and their correspondence to the reference genome. Preston, however, doesn’t have a computer – he only has an Android phone. His cousin, Steve, has a computer and has been able to mount the disk. Our first challenge is to somehow share the data, which is currently a single 140 gigabyte file. My first discussions with Steve reveal the true disconnect – “is it even possible to share such a large file?” We have no idea. We’re currently trying to upload the data to DNAnexus – is cloud computing the answer? It sounds good but our first experiences appear rather shaky and we now have a partial (and unusable) file in our shared space.
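One way to at least see how much of a large file survived a shaky transfer is to checksum it in fixed-size chunks on both sides and compare the lists. The sketch below is only an illustration under my own assumptions: the function names and the 64 MB chunk size are made up for the example, and it is not how DNAnexus (or any other upload tool) works internally.

```python
import hashlib


def chunk_checksums(path, chunk_size=64 * 1024 * 1024):
    """Yield (index, offset, length, sha256) for each fixed-size chunk of a file.

    The chunk size is an arbitrary choice for this sketch; only one chunk
    is held in memory at a time, so the file itself can be very large.
    """
    with open(path, "rb") as handle:
        index = 0
        offset = 0
        while True:
            block = handle.read(chunk_size)
            if not block:
                break
            yield index, offset, len(block), hashlib.sha256(block).hexdigest()
            index += 1
            offset += len(block)


def chunks_to_resend(expected, actual):
    """Return the indices of chunks that are missing or differ on the receiving side.

    Both arguments are lists of the tuples produced by chunk_checksums:
    `expected` from the original file, `actual` from the transferred copy.
    """
    received = {index: digest for index, _offset, _length, digest in actual}
    return [
        index
        for index, _offset, _length, digest in expected
        if received.get(index) != digest
    ]
```

Steve would run `chunk_checksums` over the BAM file on the USB drive, I would run it over whatever arrived, and `chunks_to_resend` would list the pieces that still need to travel. A truncated copy shows up as a short final chunk followed by missing ones.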
I suspect that we’re going to have a bit of fun and a lot of frustration over the coming weeks. We’d all love to hear your advice for consumer genome exploration without the budget or the support that we usually rely upon during our daily research and support.
There are occasions when as bioinformaticians we question what we are really doing; when we need to sit back and consider whether we should even pursue a given project! There are even times when the chosen technologies are enough to turn our stomachs and to give us the Screaming Heebie-jeebies.
I am presently eager to secure a collaboration with a remarkable clinical scientist who is doing some very elegant research. He has some beautiful study designs and a fabulous patient cohort. He also has a huge stack of data – but it is all SOLiD. Nothing dampens a genome biologist’s mood quite like the mention of SOLiD data. This is strange – we all know of the 2-base encoded challenges of ColorSpace data – but these challenges must be largely apocryphal, since who has actually processed any of the data recently?
When at MGRC we had a good look at making SOLiD a native part of the “One-Click” pipeline. The mapping statistics were pretty unconvincing at the time and the project died a quick and painless death – the provider of the test data did not get the rich characterisation of personal variants back and I remained on my Illumina high horse – “deep base space is all that we need for meaningful variant analysis”. I too am guilty of the propagation of the “SOLiD sucks” mindset.
So, a couple of years since we last poked at SOLiD data and I have an external disk crammed with lanes of XSQ formatted LifeTech ColourSpace data. As an additional element of competition – the team from LifeTech are also trying to compete with us for some of the action in this remarkable project. This means that application of the LifeTech LifeScope software is out of bounds as an “unvalidated” black box solution – I need to craft a solution myself.
Former colleagues at Novocraft certainly have mapping software (NovoAlignCS) that should be capable of doing something with the data – but NovoAlign expects fasta sequences and accompanying quality files. Reading posts on BioStar and SeqAnswers reveals more mystery and misinformation on ColorSpace data handling, but a gem emerges from the rough in the form of NGS_plumbing. Written by Laurent Gautier (whom I’ve met at a couple of meetings over the last few years), this software contains some rather simple methods to convert XSQ files into something more tractable for current mapping software. While it is written in Python and pulls in a few bizarre dependencies, it is easily installed on Linux (but not on OSX), and even in a VirtualBox bioinformatics virtual machine the NGS_plumbing method chews through the XSQ files at a prodigious rate to create files that may be of some utility within scientific progress.
I don’t like SOLiD data – I don’t like that the XSQ file is binary and requires HDF5-devel to even peek inside it. I really don’t like that the resulting fastq file is a load of .0123 – it’s just not what I am used to seeing. The challenge though is not quite as horrendous as the naysayers would have led me to believe. So far so good. Of course, the proof-of-the-pudding will depend on some of the mapping statistics and whether we can actually pull some meaningful genetic variants out of the collection!
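For anyone else who has never looked inside one of these reads, the 2-base encoding is less mysterious than it looks. With the bases numbered A=0, C=1, G=2, T=3, each colour is the XOR of two neighbouring bases, so a read can be walked back into base space starting from the known primer base. The sketch below is my own illustration (the function names are made up, and this is not what NGS_plumbing or NovoAlignCS do). It also shows why naive conversion is a bad idea: one missing or wrong colour call poisons every base after it, which is why the mappers prefer to align in colour space.

```python
BASES = "ACGT"


def decode_colourspace(read):
    """Naively translate a colour-space read such as 'T0123' into bases.

    The first character is the primer base (not part of the sequenced
    fragment); each following digit is the colour linking one base to the
    next. A '.' is a missing colour call, after which every base is 'N'
    because the chain of inference is broken.
    """
    previous = BASES.index(read[0].upper())
    decoded = []
    broken = False
    for colour in read[1:]:
        if broken or colour not in "0123":
            broken = True
            decoded.append("N")
            continue
        previous ^= int(colour)  # next base = previous base XOR colour
        decoded.append(BASES[previous])
    return "".join(decoded)


def encode_colourspace(primer, sequence):
    """Inverse of decode_colourspace for a clean (N-free) base sequence."""
    previous = BASES.index(primer.upper())
    colours = []
    for base in sequence.upper():
        current = BASES.index(base)
        colours.append(str(previous ^ current))
        previous = current
    return primer.upper() + "".join(colours)
```

So `decode_colourspace("T0123")` gives `TGAT`, while `decode_colourspace("T01.2")` gives `TGNN`.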
I have been a Mac user since my first Powerbook was purchased back in 2004. I have loved that crazy combination of Powerpoint, Terminal and UNIXness that just works! There are times when the rest of the world just doesn’t appreciate the power of Apple computers and key pieces of software are written for Windows computers only!
I am still a British citizen and although expatriate for the last 14 years I have a duty to file my tax return each year. This is an exercise in futility – I have no savings in the UK and the shares that I foolishly purchased back in the late ’90s generate sufficiently tiny amounts of income (measured in pence) that the process is not really worth it. Rules are rules and although I check each year HM’s Taxman would like my paperwork prior to the end of January or I will be obliged to pay a whopping GBP 100 fine! Crazy! Luckily there is some great software out there in the form of TaxCalc 2012 that will collate my lack of income and foreign resident status and prepare the convincing argument that I am not liable for either fine or big taxation bill. This piece of software however still runs only on Windows. A glaring oversight that thousands of Apple users must be aware of.
To deal with this challenge I have dragged around an old Windows VAIO computer that I bought back in 2002 prior to my realisation that Apple was the way to go. It used to dual boot either Windows 2000 or an old version of Redhat. A lovely computer. Last night when I started out on the tax return challenge I plugged in the computer – the battery charging light came on. But then it did not start. A slight whiff of fried electronics and no beep. No lights. No action! BUGGER!
I am in a hurry to file this tax return and it is a money losing exercise at the best of times. I don’t wish to buy a replacement Windows computer (Alexander has a PC but it is in a possibly flooded shipping container somewhere near the Brisbane river and waiting to clear customs). How can I deal with this challenge?
- Buy a PC anyhow and be done with it
- CodeWeavers CrossOver – has been used by a number of people with earlier versions of the TaxCalc software
- VirtualBox and an install of a Windows software image
These all seem like the simplest and cheapest solutions. There are evaluation versions of CrossOver available and it seems like a plausible way to head. To abbreviate a long story it all works perfectly apart from the required Network Connection – this means the upload to the HMRC cannot work = long and slow painful fail.
The next attempt was with VirtualBox – I discovered this link that provided some helpful suggestions as to how free Microsoft images could be used. It all sounded absolutely fabulous but again I could not coerce the Networking to function properly and could establish no external link beyond the virtual machine.
Finally I tried the evaluation version of Parallels and a not so completely licensed .iso image of Windows 2000. I will purchase Parallels in 12 days’ time – this is the most awesome piece of software that I have used in a while. The networking worked fine, the functionality was complete and the install only took a few minutes. IE5 does not gel with the slightly richer presentation tools from TaxCalc, so after a minor upgrade to IE6 (I cannot believe that I am actually writing this) the software was installed and my tax return (prepared last night using the network-impaired CodeWeavers CrossOver setup) was uploaded with almost 2 days to spare! = Hoorah….
I dislike filing tax returns, I dislike Windows and I dislike software that is bound to a single platform. CodeWeavers and Parallels have some amazing software that aims to reduce some of this pain and I am really impressed with both. The next step is to get the Illumina GenomeStudio software running on this OSX based laptop… <shudder>
Life at the moment is completely different to anything that has gone before. In a previous life, working for Pharma in the South of Finland, I had a crazy long commute to get to work. This is actually why I am here now; the Orion job was awesome but the travel was killing me and destroying the family! During the 90 minutes each way on the train I wrote loads of bioinformatics scripts and implemented the functional foundations of Mnemosyne BioSciences BioIT infrastructure. After the logistical simplicity of Malaysia, I again have a commute here in Brisbane. I travel 45 minutes up the river from the local ferry terminal (Mowbray Park) to the University of Queensland. I am still amazed that this is a plausible way to travel to work. Sitting in the sun on a seat without a table is however not conducive to working. I have had plenty of time for silent reflection. This neural freedom has me thinking of all of the what nexts.
How can I resurrect the Mnemosyne LabManager software? And what was the software anyhow?
How should I start collating the tutorials and workflows that I would like to blog? We need whitepapers and case-studies.
What are the new tools, frameworks and schemas that I should really learn?
The transition from MGRC to QFAB brings some new challenges and new opportunities. The MGRC way was very much to have the required software written; in Australia salaries are quite different and licensing the appropriate third party solutions is more cost-effective. The world of bioinformatics solutions has evolved much since my departure from Finland…
Once the shipping container with our possessions has cleared customs and once we have a home with sufficient space to swing the proverbial cat then I have a plan. I have a list of tutorials to write as I rediscover the contemporary in genome bioinformatics. I have some great datasets that were produced earlier in my career that I’ll work through. I’m all set! We’re just waiting on the bank and customs and excise and then we can get cracking!
The year started with an ambitious plan to be a little more rigorous in my blogging duties. This was a plan destined to fail from the outset! You, my dear readers, were deceived since I had yet another international relocation on the cards. At the end of office hours on January 11th I returned my MGRC keys and access cards and hot-footed it to the budget airline terminal in Kuala Lumpur and took an AirAsiaX night flight to Brisbane.
Our new home is Brisbane, heart of Australia’s Sunshine State. I am now working for QFAB, a bioinformatics core / services company based in the Institute for Molecular Bioscience at the University of Queensland. A very different location to the hustle and bustle of an emerging Asian city. We have now been on the ground for 10 days; we have a temporary home, pay-as-you-go internet and basic telecommunications. We have a car and ambitions for a house. I would argue that we have successfully moved. The kids have been enrolled in schools and some of the uniforms, hats and long socks have been purchased = so far so good!
I have my bioinformatics ambitions and hopes in a shipping container and our goods should clear customs in a couple of weeks. Then we can start in earnest on actually putting some meaningful bioinformatics tutorials onto the interwebs. We are sad to have left Malaysia and our friends and colleagues but are excited for the new opportunities that life in a rather more developed society will bring to the family.
I have a Raspberry Pi – it was intended as a project to enjoy with Son #1. Unfortunately the pull of Minecraft (there is a rant coming on this at a later stage!) has precluded much progress on this. Alexander’s interest in bioinformatics and even informatics appears rather limited. He is an avid consumer of Youtube and loves his music but getting into writing computer code is something that hasn’t yet tickled his fancy.
I am therefore delighted to see the release of the Raspberry Pi Educational Manual. This manual is a work in development but has some fabulous self-help sections on writing code in Python (no BioPython – but a start at least) and has a section on the Unix command line. I am not sure how well this document will go down at home. I think that the Scratch visual programming section will be better appreciated by the girls. Once I have finished assembling the Gertboard, it will allow for a little more human-machine interaction. I remember fondly my youth and coercing an expansion board on a BBC micro to run a sequence of lights my way! I think that this is an amazing step – well done Raspberry Pi!
The second of January already! The first day of salaried work in what is going to be an exciting and challenging year. As I spin up my RSS feeds after a couple of days of downtime we have just a plum genome paper from the boys at BGI and a collection of archaeal genomes published in PNAS – a quiet morning on the scientific digestion front.
Today also marked the delivery of the Christmas cards – thanks Biomax, MPOB, Felda and Eric for your thoughts during this festive season.
I had an enlightening discussion with a colleague this morning – she was voicing her concerns about the balance of career progression and personal interests and ambitions. What should one do when you wake up in the morning and work and career seem like an unpleasant hurdle? How can a passionate scientist trapped in a science-lite position acquire and mold the responsibilities that will provide the motivation and impetus to ensure satisfaction? This is a theme that I’ll try to work on over the coming weeks.
I would love to bring the Genome_S back to life immediately – I have a slight challenge in that the raw data and the server that the data resides on are currently in transit. This will remain an objective for the year and one that should be rolled into my own New Year’s Resolutions.
What about a full list of sensible New Year’s Resolutions for this genome biologist?
- Think fit = walk more and break out of sedentary habits. I am stoked by the whole Fitbit concept and eFitness is an ambition, goal and requirement.
- Family = endeavour to ensure an optimal balance between work and family-life. Let’s see how this can be aligned with Think fit? Kids on bikes and a walk in the evening?
- Bioinformatics skill development = a management position doesn’t necessarily keep key skills alive. This requires some proactivity; let’s try to write some tutorials on R and genome informatics.
- Genome_S – this is a personal genome project and really requires some personal input to making it something that has clarity, precision or even meaning. It’ll be a herculean effort, but can be combined with skill development above!
- Scientific writing – there are some manuscripts from the past that have died. The dead are perhaps only undead and with some delicious brains perhaps a couple of manuscripts can be revived? It is a long range goal but one that I must at least try!
I am completely rubbish at blogging! We started with a great fanfare in what would be the Northern hemisphere’s spring and have since crashed with very little to really show for it! As we roll into the final hours of 2012 I would like to take this time to evaluate whether this endeavour was such a spectacular failure and whether I should try to be a little more competent in 2013?
Yup – it was a failure on all fronts.
I can only talk about a little of what actually keeps me amused and busy during the corporate working day but the Prestome project shone brightly at the start of the year before stalling – personal human genome analysis is a complicated business. While the process of mapping reads to a reference genome is a computational exercise and the process of characterising CNV, LOH and genomic inversions is pretty routine it is the process of curating mutations and looking for the evidence of pathogenicity that is complicated and really without end! The Prestome case remains fascinating and it seems that a single genome sequence is not sufficient to make a meaningful call. It would be awesome to sequence an extended family, to perform RNA-Seq on muscle biopsy and to look at additional clinical chemistry parameters. This remains a challenge for the future and the man behind the Prestome remains a contact and someone who I am eager to meet in real life!
My other ambition for the blog was to openly explore my own genome – human_s. I purchased a shiny workstation, wrote some lovely code to present the data QC and then became distracted by other more pressing commercial engagements. I am sure that my own analysis would mirror what was discovered for the Prestome – a single genome can only tell me so much about myself and 2 lanes of data from 1.5 generation Illumina is probably not enough anyhow.
I am not one for dwelling on what has gone on in the past – let’s quickly move on into 2013 and have a spectacular year and try to improve on this blog’s rather lacklustre delivery. I know that there are some surprises in store for the new year – these will involve a number of changes, complications and challenges – but interesting times is where I wish to live. Watch this space and perhaps we can have a more coherent delivery of this bioinformatician’s view of what is hot and what is not?
Happy New Year to all – may your dreams and ambitions for 2013 all come true!
It is pretty tricky striking a balance between bringing some slightly rusty bioinformatics skills to life and writing about the progress and challenges. Combine this with the daily hurdles of managing a genome informatics team in an emerging country and it seems that things are forgotten. This is not the case – I am just easily distracted.
- Genome S is my very own genome. It’s only a couple of lanes of Illumina HiSeq data, but something that I wish to process myself, my way. I started pre-processing the data using NGSQC Toolkit. This works and produces most of the information that I really wanted to include. It is written in Perl and is stinky slow. The graphics are informative but fugly – I can do better myself!
- I have coerced my old Java development environment back to life – I have tried existing fastq parsers and again they work, but not with the efficiency that I wanted. I now have a streamlined fastq parser and quality clipper that works the way that I want – it doesn’t produce the figures yet – I’ll start working on ggplot2 visualisations over the weekend.
- Processing a few tens-of-millions of Illumina short reads reminded me very much of my time in academia and the OpenSputnik project that was tuned for the handling of Sanger EST sequences. One of the displays and filters that I used in sputnik was a linguistic complexity filter for removing the “simple” and linguistically trivial sequences. Naturally I want to try this on NGS data – something that I haven’t seen reported elsewhere. The EMBOSS package has a good method for calculating such linguistic complexity. I have spent a couple of rather frustrating evenings trying to convert the EMBOSS complex method into java – their method is written in C. This is a hurdle too far; I have implemented my interpretation of their code in Java and it runs, but the result isn’t the same and I am not entirely sure if I have interpreted their code correctly.
- Finally the borders between home and work have most definitely crossed and the #Prestome has spent a couple of evenings on the sofa with the family. We most definitely have traction here – the top down approach has failed to identify causal variants, the bottom-up approach has failed to identify causal variants but a more targeted approach to looking at all synonymous and non-synonymous mutations in all genes associated with muscle development, diseases of the musculature and myopathies has yielded what looks like a good candidate. This has really lifted the spirits – this “customer” really is beginning to feel more like family now!
- It is the #Prestome project that is giving me most joy and inspiration at the moment. The process of hacking through variants looking for causal information requires the bringing together of different databases – nothing that we have at MGRC or I have at home allows me to work on the data in the way that I would really like – there’s an opportunity here for a little software development. I have scraped together some of the key data collections (Hapmap, 1000 Genomes, COSMIC, dbSNP, ENSEMBL) and am building these into relational data structures in PostgreSQL … I’m easily amused; but it gives me some satisfaction even if I have to fight data into MySQL (or heaven forbid Oracle) first. This will be a comment later in the week…
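To make the linguistic complexity idea from the list above a little more concrete, here is a minimal sketch of one vocabulary-usage style of score: for each word length k, count the distinct words seen in the read and divide by the most that could have been seen, then multiply the ratios together. This is an assumption on my part about what a reasonable score looks like; it is not a port of the EMBOSS complex code, and the function name and the default `max_word` of 4 are made up for the example.

```python
def linguistic_complexity(seq, max_word=4):
    """Score a sequence between 0 and 1 by how much of its possible vocabulary it uses.

    For each word length k (1..max_word) the ratio is
        distinct k-mers observed / min(4**k, number of k-mer windows)
    and the score is the product of those ratios. Homopolymers and simple
    repeats score close to 0. Words containing N are counted like any
    other word, which is a simplification.
    """
    seq = seq.upper()
    if not seq:
        return 0.0
    score = 1.0
    for k in range(1, max_word + 1):
        windows = len(seq) - k + 1
        if windows <= 0:
            break  # sequence is shorter than the word length
        observed = {seq[i:i + k] for i in range(windows)}
        possible = min(4 ** k, windows)
        score *= len(observed) / possible
    return score
```

A poly-A read of any sensible length scores well below 0.01, whereas a short read using every possible word (such as `ACGT`) scores 1.0. Where to draw the line for filtering is something to settle by looking at the distribution of scores across real reads.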
Lots is happening – I’m perhaps distracted and not sufficiently focused on any single task. I’m having fun though. I have some lovely ideas of challenges that I’d love to take in due course; I’d love to play with Affymetrix data, see what can be done with some of the more recent Illumina chip data and explore my variations on a number of other platforms.
Finally the home genomics plan is coming together. When the shiny new Dell server was delivered, it didn’t quite reach my own threshold for perfection. I think that we are now at a level that I am happy to call perfect. I added a cheap graphics card and now have an ideal screen resolution on the monitor that I use – the embedded graphics have been disabled and will not be used again. The drive sleds from http://www.discountechnology.com arrived (on a Saturday no less = impressed) and the server is now populated with a few terabytes of usable disk space. The air-conditioner in the home office has even been refilled with refrigerant gas so the room temperature can be better maintained when computing. The home network is now gigabit and while this doesn’t quite solve the USB 3.0 issues the ability to get data on and off the box is a little more streamlined.
Server complains at ambient temperature