Genomics @home

It is pretty tricky striking a balance between bringing some slightly rusty bioinformatics skills to life and writing about the progress and challenges. Combine this with the daily hurdles of managing a genome informatics team in an emerging country and it seems that things are forgotten. This is not the case – I am just easily distracted.

  • Genome S is my very own genome. It’s only a couple of lanes of Illumina HiSeq data, but it is something that I wish to process myself, my way. I started pre-processing the data using NGSQC Toolkit. This works and produces most of the information that I really wanted to include. It is written in Perl and is stinky slow. The graphics are informative but fugly – I can do better myself!
  • I have coerced my old Java development environment back to life – I have tried existing fastq parsers and again they work, but not with the efficiency that I wanted. I now have a streamlined fastq parser and quality clipper that works the way that I want. It doesn’t produce the figures yet – I’ll start working on ggplot2 visualisations over the weekend.

    http://maojf.blogspot.com/2011_04_01_archive.html

  • Processing a few tens-of-millions of Illumina short reads reminded me very much of my time in academia and the OpenSputnik project that was tuned for the handling of Sanger EST sequences. One of the displays and filters that I used in sputnik was a linguistic complexity filter for removing the “simple” and linguistically trivial sequences. Naturally I want to try this on NGS data – something that I haven’t seen reported elsewhere. The EMBOSS package has a good method for calculating such linguistic complexity. I have spent a couple of rather frustrating evenings trying to convert the EMBOSS complex method into Java – their method is written in C. This is a hurdle too far; I have implemented my interpretation of their code in Java and it runs, but the result isn’t the same and I am not entirely sure if I have interpreted their code correctly.
  • Finally the borders between home and work have most definitely blurred and the #Prestome has spent a couple of evenings on the sofa with the family. We most definitely have traction here – the top-down approach has failed to identify causal variants, as has the bottom-up approach, but a more targeted look at all synonymous and non-synonymous mutations in all genes associated with muscle development, diseases of the musculature and myopathies has yielded what looks like a good candidate. This has really lifted the spirits – this “customer” really is beginning to feel more like family now!
  • It is the #Prestome project that is giving me most joy and inspiration at the moment. The process of hacking through variants looking for causal information requires the bringing together of different databases – nothing that we have at MGRC or that I have at home allows me to work on the data in the way that I would really like – there’s an opportunity here for a little software development. I have scraped together some of the key data collections (HapMap, 1000 Genomes, COSMIC, dbSNP, Ensembl) and am building these into relational data structures in PostgreSQL … I’m easily amused; but it gives me some satisfaction even if I have to fight data into MySQL (or heaven forbid Oracle) first. This will be a comment later in the week…
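
Two of the ideas in the list above, the streaming fastq quality clipper and the linguistic complexity filter, can be sketched in a few lines. This is a minimal Python illustration and not the Java code described above. It assumes strict four-line fastq records and Phred+33 quality encoding (older Illumina pipelines used Phred+64). The complexity score uses one common definition, the product over word lengths 1 to max_k of distinct words observed divided by the maximum possible, and it is not claimed to reproduce the output of EMBOSS complex. The function names and the min_q and max_k defaults are made up for the example.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) one record at a time.

    Assumes strict four-line records; nothing is held in memory
    beyond the current record.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def clip_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33):
    """Trim trailing bases whose Phred score is below min_q.

    offset=33 is an assumption about the quality encoding.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def linguistic_complexity(seq: str, max_k: int = 3) -> float:
    """Product over k = 1..max_k of (distinct k-mers / maximum possible).

    The maximum possible is the smaller of 4**k and the number of
    k-mer windows in the sequence. Returns 0.0 for an empty sequence.
    """
    seq = seq.upper()
    if not seq:
        return 0.0
    score = 1.0
    for k in range(1, max_k + 1):
        windows = len(seq) - k + 1
        if windows <= 0:
            break
        observed = len({seq[i:i + k] for i in range(windows)})
        score *= observed / min(4 ** k, windows)
    return score


# Tiny in-memory demo; a real run would pass an open file handle instead.
demo = io.StringIO("@read1\nACGTACGT\n+\nIIIII###\n")
for name, seq, qual in read_fastq(demo):
    seq, qual = clip_3prime(seq, qual)
    print(name, seq, round(linguistic_complexity(seq), 3))
```

A homopolymer run such as AAAAAAAA scores close to zero while a read that uses every possible word scores 1.0, so a threshold somewhere in between would act as the filter. Where to put that threshold for short reads is an open question that this sketch does not answer.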

Lots is happening – I’m perhaps distracted and not sufficiently focused on any single task. I’m having fun though. I have some lovely ideas for challenges that I’d love to take on in due course; I’d love to play with Affymetrix data, see what can be done with some of the more recent Illumina chip data and explore my variations on a number of other platforms.

Genomics @home?

Finally the home genomics plan is coming together. When the shiny new Dell server was delivered, it didn’t quite reach my own threshold for perfection. I think that we are now at a level that I am happy to call perfect. I added a cheap graphics card and now have an ideal screen resolution on the monitor that I use – the embedded graphics have been disabled and will not be used again. The drive sleds from http://www.discountechnology.com arrived (on a Saturday no less; I’m impressed) and the server is now populated with a few terabytes of usable disk space. The air-conditioner in the home office has even been refilled with refrigerant gas so the room temperature can be better maintained when computing. The home network is now gigabit and, while this doesn’t quite solve the USB 3.0 issues, the ability to get data on and off the box is a little more streamlined.

how hot?

Server complains at ambient temperature

software, third party plugins and downtime

Something happened to bioinformaticsblog.org last night. Something in the presentation layer went bad and could not be fixed. I dropped the whole of the blogging software and reinstalled – the SQL dump was useless; the problem persisted. I recovered the content, reinstalled and voila, we’re back in business. One of the plugins was useless and broke the whole shebang. The error logs clearly point to the amung.us crap … I was suspicious before! I’m now on the hunt for something more friendly.

During the error investigation I discovered how much bandwidth WordPress uses – a huge amount. In the admin console there appears to be constant traffic – gigabytes of the stuff. Close windows please; they’re chomping through our quota!

The Mnemosyne subversion repository is back!

An image that spent four years on the daily Pendolino to Helsinki; the splash screen for the LabManager software by Mnemosyne BioSciences

I am pleased to have my subversion server back online after approximately 18 months of tropical heat, humidity and distraction! Subversion is a pretty awesome code repository for tracking complex software documents. It used to be rather beastly to set up SVN behind an https server – things went pretty well today and it was only a couple of hours of fiddling to get it just right. The last commit (svn – Revision 3519) was in July 2010 in the week before I moved to Kuala Lumpur. It is such an amazing blast-from-the-past to discover the massiveness of the code that was written – and now how foreign it seems.

Dell PowerEdge T410 – suitable for home genome informatics?

The old server has been spun-down after almost five years of service. In its place is a shiny new Dell PowerEdge T410. What are my first impressions? The box is well built, well equipped and my almost obsessive commitment to all things Apple has been broken. The system as it stands is not quite perfect – I’ll document some of the niggles here – things that I’ll be solving if possible.

  1. The server has a hot-swap backplane (not absolutely necessary I know) but it ships with a single drive sled containing the single drive. This drive is awesome; a SAS 15K drive but with just 300GB of storage. It would have been polite for Dell to include the rest of the drive sleds; I don’t agree with their expectation that I’ll buy ultra-high-performance and costly disks through their channels. For this home genomics challenge that will require terabytes of self-funded disk space I’d rather go with something a little more consumer level. A quick trip to www.discountechnology.com and hopefully I’ll receive some appropriate drive sleds next week.
  2. The server is LOUD. I know that I am not a representative user for servers – a home office in a condo in the tropics is on the border of the standard operating environment; the room is typically 32C all the time – increasing in the afternoon when the sun shines in! The fans are working hard to cool the system – an exercise in frustration; I wish the temperature thresholds were slightly more configurable.
  3. The graphics is in-built and is a Matrox G200ew; fine for a server. I would like to use the server with a monitor that supports a resolution of 1680×1050 – the graphics card in Linux (Ubuntu / Centos) is rather recalcitrant to my desires and it seems that it has been so for others as well (http://ubuntuforums.org/showthread.php?t=1505532). There are some examples where this resolution has been achieved using Windows Server – it can be done… I’ll need to fight with this challenge a little.
  4. USB 3.0 – so many external disks and dongles are now USB 3.0 – it would have been nice to use them at a proper performance … Apple doesn’t include this standard, so I’m not too concerned; it would have been nice. This could be solved with an expansion card at a later date.

Last night I migrated the final files from the old disk drives using this new server and a 2-bay Toaster – ideal for extracting meaning from a two drive LVM setup. God knows why I chose this disk layout when the old server was originally set up. I guess that it seemed aesthetically optimal at the time; it is not ideal for the final rundown! The Toaster was great for extracting the final files – I used a Centos live CD to mount the LVM and migrate content. I’m getting about 30Mb/s transfer with the Toaster which is almost acceptable – I’d prefer more and USB3.0 would be ideal for this!

I’ve settled on using the Precise Pangolin as the operating system, for the next few weeks anyhow. It has been a pleasure to use compared to the older Fedora 7. I know that this is an unfair comparison, but the ability to install Eclipse, GridEngine, Tomcat, etc with a single command is exciting and empowering compared to the state of the art five years ago.

Tomorrow I will reinstall the subversion server, get some of the core services up and running and hopefully get the first of the Genome Informatics software that I wish to test and evaluate up-and-running.

Which version of linux is best for this bioinformatician?

As I spin-down one server, I need to think about the next server. I have been a fan of Linux since 1996. During my PhD I was trying to run clustalw to align some viral RNA-dependent RNA polymerase sequences and the shitty shared UNIX computer would kill processes that had been running for more than a couple of hours. I purchased Redhat 4.0 direct from the US and had it shipped to Norwich. I installed this on the best PC I could afford at the time and had a vast amount of fun learning configuration, compiling stuff and getting to learn how networks work. This was perhaps the start of my time as a bioinformatician. I have been running linux ever since, although I do perhaps spend a little more time on OSX Darwin at the moment. The server being retired runs Fedora release 7 (Moonshine). I am therefore a little behind the most recent updates. In the past I have used RPath Linux (awesome but fussy packaging), Mandrake (for dummies), SuSE (unconvinced) and various flavours of UNIX (Tru64, IRIX, SunOS). Colleagues (Hi Grisha) once tried to convince me that Debian was the only way and Slackware has been tried, tested and discarded!

With Number One Son we had a look at some of the recent LiveCDs from the candidate distributions that I could be using. We have tried Centos 6, Ubuntu 12.04 (prerelease – waiting for the Precise Pangolin final on Thursday) and Fedora 16. While I have been using Redhat / Fedora since 1996 I am feeling rather tempted by the Pangolin. Centos is Redhat and it seems boring and too straight. Fedora seemed a little lacklustre. A sensible decision would be Centos; it was clean and worked as I expected. During my years in Pharma I was using a collection of RedhatES boxes to get data processed. Centos is familiar. I think that I’m going to ignore Fedora (I have installed it on a USB drive and will evaluate it for a few more days). I was impressed with the Pangolin almost-final. It is a very different beast though and there will be a learning curve. I haven’t installed or done very much with a linux box for a few years now – I need to see how things have really evolved!

The next question will then be what to install! I’ve been reading the information on Biolinux-cloud – far too much installed (and installing Novocraft software, even on a home computer, is a hanging offence with my current employer; hyperbole perhaps, but still not a good idea!). I’ll be covering the beginning of the installation of the server next weekend!

Retiring an old server

Decommissioning an old server - the floor fan is trying to ward off further thermal disasters.

Today has been cathartic for home bio-computing. My server has been a trusty companion for a few years. It was built from a collection of mail order parts in Finland. It had (then) a massive amount of disk space and is still completely awesome. While it provided a subversion server and provided MySQL and PostgreSQL for the Mnemosyne LabManager software it hasn’t taken kindly to life in the tropics – it’s over 40 degrees in the home office and that’s with a failing aircon trying to manage the temperature challenge. More than a couple of hours of gentle computing and it will thermal shutdown. With a floor fan and much gentle patience all of its contents have been backed up. The catharsis has been in the deletion of massive reams of GEO datasets that I’ll never look at again. I have several hundred gigabytes of EST sequences from the original OpenSputnik.org projects – again – I don’t imagine that they’ll be doing anything ever again. They have been parked in /dev/null for a rainy day! A few movies that haven’t been watched for a couple of years have been moved onto an external USB disk – an end of an era…


[root@localhost ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdc1[0] sdf1[3] sde1[2] sdd1[1]
1465151808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
[=>...................]  resync =  6.8% (33411308/488383936) finish=236.9min speed=32003K/sec

At least a poke at its RAID array status explains some of the crappy performance issues that we’ve been fighting with. I face a thermal shutdown every 3 hours or so … the poor array is trying to sort itself out and never quite gets there.
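
The arithmetic behind “never quite gets there” can be checked directly from the resync line above. The following is a rough sketch that assumes the two block counts and the speed are all in the same kilobyte units, which the match with the kernel’s own finish estimate supports. The function name and the 180 minute uptime figure (taken from “every 3 hours or so”) are only for illustration.

```python
import re

# The resync progress line copied from the /proc/mdstat output above.
MDSTAT_LINE = ("[=>...................]  resync =  6.8% "
               "(33411308/488383936) finish=236.9min speed=32003K/sec")


def resync_minutes(line: str):
    """Return (minutes remaining, minutes for a full pass from zero)."""
    match = re.search(r"\((\d+)/(\d+)\).*?speed=(\d+)K/sec", line)
    if match is None:
        raise ValueError("not a resync progress line")
    done, total, speed = (int(group) for group in match.groups())
    remaining = (total - done) / speed / 60
    full_pass = total / speed / 60
    return remaining, full_pass


remaining, full_pass = resync_minutes(MDSTAT_LINE)
uptime_minutes = 180  # assumption: one thermal shutdown roughly every 3 hours

print(f"remaining: {remaining:.1f} min")
print(f"full pass: {full_pass:.1f} min")
print("finishes within one uptime window:", full_pass <= uptime_minutes)
```

The remaining time comes out at about 236.9 minutes, which agrees with the finish value reported by the kernel, and a full pass at this speed needs roughly 254 minutes. If an unclean shutdown sends the resync back to the start (an assumption, not something verified here), a three hour window is never enough.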


[root@localhost ~]# mdadm --detail /dev/md0
/dev/md0:
Version : 00.90.03
Creation Time : Sun Oct 14 15:21:55 2007
Raid Level : raid5
Array Size : 1465151808 (1397.28 GiB 1500.32 GB)
Used Dev Size : 488383936 (465.76 GiB 500.11 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Sun Apr 22 14:17:27 2012
State : active, resyncing
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 64K

Rebuild Status : 8% complete

UUID : abcb499a:bdbc3a1a:36654873:16ae2233
Events : 0.5179
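
The sizes in this report are consistent with how RAID 5 spends one device’s worth of space on parity. A small sketch of that check, using only the numbers printed above (the function name is made up for the example):

```python
def raid5_usable_kib(devices: int, per_device_kib: int) -> int:
    """Usable space of a RAID 5 array: one device's worth goes to parity."""
    return (devices - 1) * per_device_kib


# Raid Devices and Used Dev Size from the mdadm output above.
usable = raid5_usable_kib(4, 488383936)

print(usable)                           # in 1 KiB blocks
print(round(usable / 1024 ** 2, 2))     # in GiB
print(round(usable * 1024 / 1e9, 2))    # in GB
```

The three printed values match the Array Size line: 1465151808 blocks, 1397.28 GiB and 1500.32 GB.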

It’s awesome that the array (along with the rest of the system) was built 4.5 years ago. Pretty good for a server to satisfy expectations and needs for so long! Now that we can shut down and reassign old disks and parts we can start the process of getting the new genomics big rig into action. I have a shipping date now – later this week!

Genome “S”

I had my genome sequenced to low coverage by the lab team when they were optimising some protocols. It has lived on a hard disk floating around the company for the last 9 months or so and I haven’t done anything with it. A colleague rose to the challenge of packing the data into a SAM file – I loaded it into Tablet for a quick look and had a quick play with samtools on a Mac but that was it.

We’re now into Q2 2012 and with this new reinvigorated BioinformaticsBlog I’m ready for a challenge. I wish to extract sufficient data out of my own genome that I can at least run a Promethease report and look for that rather obnoxious deletion that has been responsible for a few too many cases of cancer in the family.

So, what is the starting point? I have an 80 gig SAM file. That’s all. I have a Mac Mini (2009 model) and a little bit of patience. I have a few late nights coming up and the interest to see what I might learn. Once I have had my play with the data I’ll ask the team at work to run it through the MGRC myBench software and see how their expertise compares with my own. I’ll stick to open and downloadable tools and will try to document the process. I wonder what we’ll all learn?
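
An 80 gig SAM file will not fit in the memory of a 2009 Mac Mini, but SAM is plain tab-separated text and can be read one line at a time. As a first sanity check, a sketch along these lines would count mapped and unmapped reads per reference sequence. It relies only on the SAM conventions that header lines start with “@”, the FLAG is the second column, the reference name is the third, and FLAG bit 0x4 marks an unmapped read. The function name is made up for the example.

```python
import io
from collections import Counter
from typing import TextIO, Tuple


def tally_sam(handle: TextIO) -> Tuple[int, int, Counter]:
    """Stream a SAM file and return (mapped, unmapped, reads per reference)."""
    mapped = 0
    unmapped = 0
    per_reference = Counter()
    for line in handle:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:  # bit 0x4: the read is unmapped
            unmapped += 1
        else:
            mapped += 1
            per_reference[fields[2]] += 1
    return mapped, unmapped, per_reference


# Tiny in-memory demo; a real run would pass an open file handle instead.
demo = io.StringIO(
    "@HD\tVN:1.0\n"
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
)
print(tally_sam(demo))
```

Memory use stays flat regardless of file size because only the counters are kept, so the run time is set by how fast the disk can deliver the file.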

 

First week in …

The reinvigorated BioinformaticsBlog is now a week old. I was curious to see if anyone had actually had a look at the site. I was therefore delighted to see that a handful of visitors from North America, Europe, Asia and Australia have checked it out. I’m impressed!

One of the goals of the site is to start playing with the hottest and most powerful genomics applications that are released. I mentioned in a post from last weekend that I intended to build a genomics “big-rig” for playing with CUDA GPU software and seeing what really can be done with the latest and greatest. I have abandoned the plan to build it myself and I have configured and ordered something more sustainable from Dell. I have a PowerEdge T410 with 24GB RAM and 8 cores that will hopefully facilitate most of the activities that I am interested in! Updates on this will be posted once the order has been confirmed and the box has been delivered. Awesome!

Blogging in business?

I found this link over at Techcrunch during this morning’s reading (http://techcrunch.com/2012/04/14/ceo-bloggers-to-blog-or-not-to-blog/). I very much like the argument that the post makes – acting as a CEO or executive director should involve delegation and not too much micromanagement; a high value blogging existence is a very attractive way to proactively advertise and develop a business. I’m not a CEO and I do not have this as an (immediate) ambition. I do like writing, I love reading and I am trying to build a personal and likeable brand. The bioinformaticsblog marketing department does not exist, there is no corporate oversight (I’m not a CEO and my writing is out of company time) – I am in this for me. I don’t know where the blog is heading and similarly I do not know where I am going – let’s pitch for some honest fun, develop some ideas and hopefully generate some authoritative contributions to the bioinformatics and genomics community.