It is pretty tricky striking a balance between bringing some slightly rusty bioinformatics skills to life and writing about the progress and challenges. Combine this with the daily hurdles of managing a genome informatics team in an emerging country and it seems that things are forgotten. This is not the case – I am just easily distracted.
- Genome S is my very own genome. It’s only a couple of lanes of Illumina HiSeq data, but something that I wish to process myself, my way. I started pre-processing the data using NGSQC Toolkit. This works and produces most of the information that I really wanted to include. It is written in Perl and is stinky slow. The graphics are informative but fugly – I can do better myself!

- I have coerced my old Java development environment back to life – I have tried existing FASTQ parsers and again they work, but not with the efficiency that I wanted. I now have a streamlined FASTQ parser and quality clipper that works the way that I want – it doesn’t produce the figures yet – I’ll start working on ggplot2 visualisations over the weekend.
- Processing a few tens of millions of Illumina short reads reminded me very much of my time in academia and the OpenSputnik project, which was tuned for the handling of Sanger EST sequences. One of the displays and filters that I used in sputnik was a linguistic complexity filter for removing the “simple” and linguistically trivial sequences. Naturally I want to try this on NGS data – something that I haven’t seen reported elsewhere. The EMBOSS package has a good method for calculating such linguistic complexity. I have spent a couple of rather frustrating evenings trying to port the EMBOSS complex method to Java – their method is written in C. This is a hurdle too far; I have implemented my interpretation of their code in Java and it runs, but the result isn’t the same and I am not entirely sure if I have interpreted their code correctly.
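
For anyone wondering what a linguistic complexity filter looks like in practice, here is a minimal Java sketch. It is emphatically not the EMBOSS complex algorithm, nor my port of it; it assumes the simpler "vocabulary usage" definition, where for each word length the number of distinct words seen in a read is divided by the most that could have been seen, and the ratios are multiplied together. The class name, the method names, the maximum word length and the threshold are all made up for illustration.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative vocabulary-usage complexity score for a DNA read (not the EMBOSS method). */
final class LinguisticComplexity {

    /**
     * Returns a score between 0 and 1: the product, over word lengths 1..maxWord,
     * of (distinct A/C/G/T words observed) / (most distinct words possible).
     * Words containing any other character (for example N) are ignored.
     */
    static double score(String seq, int maxWord) {
        String s = seq.toUpperCase();
        if (s.isEmpty()) {
            return 0.0; // nothing to measure
        }
        double product = 1.0;
        for (int k = 1; k <= maxWord && k <= s.length(); k++) {
            int windows = s.length() - k + 1;
            Set<String> seen = new HashSet<>();
            for (int i = 0; i < windows; i++) {
                String word = s.substring(i, i + k);
                if (word.chars().allMatch(c -> "ACGT".indexOf(c) >= 0)) {
                    seen.add(word);
                }
            }
            // The ceiling is set by both the alphabet (4^k) and the read length.
            double possible = Math.min(Math.pow(4, k), windows);
            product *= seen.size() / possible;
        }
        return product;
    }

    /** Flags a read as "simple" when its score falls below an arbitrary threshold. */
    static boolean isLowComplexity(String seq, int maxWord, double threshold) {
        return score(seq, maxWord) < threshold;
    }
}
```

With this definition a homopolymer run or a dinucleotide repeat scores close to zero, while a read that uses most of its possible vocabulary scores close to one. A sensible threshold would have to be tuned against real reads.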

- Finally, the borders between home and work have most definitely been crossed and the #Prestome has spent a couple of evenings on the sofa with the family. We most definitely have traction here – the top-down approach has failed to identify causal variants, the bottom-up approach has failed to identify causal variants, but a more targeted approach to looking at all synonymous and non-synonymous mutations in all genes associated with muscle development, diseases of the musculature and myopathies has yielded what looks like a good candidate. This has really lifted the spirits – this “customer” really is beginning to feel more like family now!
- It is the #Prestome project that is giving me most joy and inspiration at the moment. The process of hacking through variants looking for causal information requires bringing together different databases – nothing that we have at MGRC or I have at home allows me to work on the data in the way that I would really like – there’s an opportunity here for a little software development. I have scraped together some of the key data collections (HapMap, 1000 Genomes, COSMIC, dbSNP, Ensembl) and am building these into relational data structures in PostgreSQL … I’m easily amused; but it gives me some satisfaction even if I have to fight data into MySQL (or heaven forbid Oracle) first. This will be a comment later in the week…
Lots is happening – I’m perhaps distracted and not sufficiently focused on any single task. I’m having fun though. I have some lovely ideas for challenges that I’d love to take on in due course; I’d love to play with Affymetrix data, see what can be done with some of the more recent Illumina chip data and explore my variations on a number of other platforms.




