
linux distributions, custom images and respinning

Tuesday, January 27th, 2009

I have been discussing rPath over the last few days and have put a not inconsiderable amount of effort into getting a bioinformatics linux distribution off the ground. rPath is perhaps not the easiest way to go; Fedora offers its own route using the Revisor tool. I’ve been a fan of RedHat since my first excursion into linux 15 years ago, and have a Fedora workstation at home, but getting software packaged as .rpm is something that still makes me more than just a little crazy (Staden .rpm production caused much hair loss – I am certain of it!).

It is cool to see on Slashdot this morning a link to a new SUSE project, now going into alpha, that serves exactly this purpose as well. SUSE Studio looks like an interesting tool and something that I should invest some time in exploring.

Bioinformatics is great; I am paid (a modest fee) to do something I enjoy, and I would love to see more people responsibly using a wider range of tools. It is sad that so many so-called “bioinformaticians” don’t know how to compile C code, have little understanding of the tools available, and do not religiously explore the new publications in Bioinformatics, Nucleic Acids Research etc. for new applications that will push the boundaries of what they can do. I vehemently oppose the concept that meaningful bioinformatics can be achieved using a Windows workstation alone!

I guess I should ask you bioinformaticians out there – how would you roll a meaningful bioinformatics distribution for use on either a workstation or for deployment across a cluster of tens or hundreds of servers?

tutorial – building a bioinformatics centric, distributed, cluster centric linux (bioram part I)

Tuesday, January 20th, 2009

Yesterday I mentioned rPath linux. I am again in love with this application. rPath linux and the free rBuilder application stack are something that the bioinformatics community should consider jumping at. Regardless of whether you are a grad-student interested in high-performance bioinformatics computing, a postDoc with a big problem or a technologist in a core facility, there is something that you could do with rPath.

I have decided to put together a tutorial series on rPath and my own (very-much-in-progress) version of bioinformatics-oriented linux. In this tutorial we are going to start out with a linux running inside a virtual machine. This will be configured with up-to-date and interesting packages that are of relevance to tasks within genomics, transcriptomics and metabolomics. The virtual machine will be iteratively improved and we will prepare our own version of “bioram-linux”. This will be made available as a standalone linux distribution and will be integrated in an environment to deploy across a computer cluster dedicated to bioinformatics. We will use some of my ideas, some of Aschwin’s ideas and hopefully you, the wider community, will start to comment and update us with your feelings, opinions and ideas.

The final version will be made available at rPath rBuilder. The packages and modules can be copied, modified and altered to suit your needs within your own projects – this is just a push to hopefully increase the amount of content and knowledge within the bioinformatics community in relation to rPath.

As a first note (self-important mumbling to self even), please acknowledge that I am no expert in linux administration (I am a comfortable super user) and will undoubtedly introduce faux-pas within the pipeline and workflow. Please let me know when I am going wrong and hopefully together we can do something rather clever? The project is called “bioram-linux”. BioRAM was the name of a company that I planned to start during 2002 / 2003; the domain was registered and ideas and business plans collected. This idea was on the back-burner and seemed like a good idea when I registered for rPath a couple of years ago. Please accept the name, I like it :-)

In today’s tutorial we are going to head over to rPath, find the bioram-linux / Mnemosyne BioSciences LabManager development image and download it. We will then install the appropriate linux within a VMware-based virtual machine running on our host computer. For the time being all images are x86, and will stay that way until much later tutorials when we will start to cross-compile packages for the x86_64 platform.

Let’s get cracking

  • Head over to the rBuilder pages at rPath. In the search box at the top of the page, click on “product” and search for the term “bioram”. Press “search” and the Mnemosyne BioSciences product for bioram-linux should be returned as one of the top matches; click on this link. The pages for bioram can also be found directly at http://www.rpath.org/project/bioram-linux/

rbuilder_1.png

  • On the bioram pages, use the right-hand panel to select the “View Releases” option. A page similar to the one shown below should appear. This provides a link to the BioRAM linux download page. If, of course, you are looking at this page more than a few days after the first posting of this blog article then things may have changed! Select the link for “Mnemosyne Biosciences – BioRAM linux for bioinformatics” and proceed to download the VMware image. Please note both the size of the file (it is not a very small file ..) and ideally the md5sum checksum, so that you can validate that what you download is the same as what was present on the server.

rbuilder_2.png

  • When the file has downloaded, ensure that the file checksums match and extract the .tar.gz archive. If you don’t already have the VMware player, VMware Fusion or some other VMware environment, you should arrange this now. I can recommend the VMware player for linux, which is great; I use Fusion on my MacBook Pro and this is even better. Load the downloaded VMware image into your VM software of choice and start the VM up for the first time …

rbuilder_3.png

  • Provide the path to the downloaded BioRAM installation image

rbuilder_4.png

  • View and review system settings and make any changes that you feel would benefit your system!

rbuilder_5.png

  • Everything should now be set! Click the various boxes and boot your newly defined Virtual Machine for the first time. Good luck! There will be a minute or so of linux loading and the standard process messages will be written to the screen – there should be no need for panic!

rbuilder_6.png

  • Once the BioRAM linux has loaded, it will complain about an improperly configured X. Let the system know that you are in control and make the error message go away by clicking the “YES, I know what I am doing and have used linux before” boxes. X will start and you will be prompted to configure X to your specifications. I would recommend something along the lines of what is shown below. Make sure that the VMware graphics adapter is configured and everything should be fine.

rbuilder_8.png

 

rbuilder_7.png

  • You are now able to log on with the “root” account – there is no password, so the very first things that you should do are to ensure that at least networking works and that you have a secure root password implemented!

rbuilder_9.png

rbuilder_10.png
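For readers who would rather script the checksum-and-extract step from earlier in the walkthrough than type it by hand, here is a minimal Python sketch. The function names are my own, and the archive name and checksum you pass in are whatever the rBuilder release page shows you; nothing here is specific to rPath.

```python
import hashlib
import tarfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that a large VM image does not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_extract(archive_path, expected_md5, dest_dir):
    """Extract a .tar.gz archive only if its MD5 matches the published value.

    expected_md5 is the checksum copied from the download page
    (hypothetical input; the post does not list a value).
    """
    actual = md5_of_file(archive_path)
    if actual.lower() != expected_md5.strip().lower():
        raise ValueError(
            "checksum mismatch: expected %s, got %s" % (expected_md5, actual)
        )
    # Only unpack archives from a source you trust.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest_dir)
    return actual
```

Calling `verify_and_extract("downloaded-image.tar.gz", "<md5 from the release page>", "vm")` raises a `ValueError` on a bad download instead of unpacking it. MD5 is only good for catching corrupted transfers; it is not a security guarantee.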

You have done really well getting to the end of this first BioRAM linux tutorial. I am aware that I am not the best tutorial writer, but the only way to learn is by doing it, and by receiving comments and criticisms from the readers. I hope that you have gotten this far; in the coming few weeks we are going to do a whole lot more with this.

In the next session (Thursday) we will configure our new virtual machine for package building, maintenance and compiling. We will define a new version of the “R” package and will create version 2.8.1 of R for our virtual machine. This will be installed and we will have a little think about how we can get “Bioconductor” installed and in use.

Bioinformatics, academia, open-source and proxy servers

Friday, January 9th, 2009

I have a pet peeve! Proxy servers and filtering gateway servers! As an academic in an OK university on the periphery of the civilised world, proxy and filtering gateway servers are something that exist, but are largely invisible to the user. Network connections just work! Bittorrent, ssh, ftp, telnet, whatever. There may be throttling on some ports, and blocks of addresses may be banned, but the process is rather hidden. Writing bioinformatics software is clean and simple. Project X requires access to some data in XML format. Create a quick java script, dump it into Tomcat and voilà, a working XML dumper running on an obscure port of a production server!

In industry things are a little different. Companies maintain a largely healthy lockdown on the network; you are allowed to access what is identified as necessary. In my corporate environment this includes the non-blocked http pages on ports 80 and 8080. https connections on 443 work fine, and external ssh connections pose no problem. ftp is usually fine, but that is it! The reason is simple: companies need to protect IP, and the network is locked down to prevent the installation of non-approved software, and to prevent any “trojan” applications from leaking anything to the largely untrusted outside world.

Bioinformatics developers, please note that connections outside of these ports are not possible. ENSEMBL is a fantastic resource and the ENSEMBL MySQL databases are a trove of valuable data. A database may appear on ports 3306, 3316, 5306 or 5316 depending on which version of ENSEMBL is required and whether it is required within the BioMart framework. BioMart itself has a registry, which again requires 3306, and in turn describes a collection of other servers with typically blocked ports.
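If you want to know where you stand before filling in any paperwork, a quick probe of those ports from inside the firewall tells you a lot. This is a rough Python sketch using only the standard library; the function names are mine, and the host you point it at is up to you (the post does not name one).

```python
import socket

# Ports named above for the ENSEMBL MySQL / BioMart databases.
ENSEMBL_PORTS = (3306, 3316, 5306, 5316)


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    A refused connection, a timeout (typical of a silently dropping
    firewall) or a DNS failure all count as not reachable.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_ports(host, ports, timeout=3.0):
    """Map each port to True/False depending on whether it can be reached."""
    return {port: port_is_reachable(host, port, timeout) for port in ports}
```

Something like `check_ports("ensembldb.ensembl.org", ENSEMBL_PORTS)` (that hostname is my assumption of the public ENSEMBL MySQL server, so substitute the one you actually need) returns a dictionary of port to reachable flag. A `False` only says the connection failed from where you sit; it cannot tell a corporate block from a server that is simply down.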

In an industrial workplace opening a port is possible, but it requires paperwork: a justification as to who needs access to the resource, which ports on which server, the reason for the resource, and different levels of approval. The job is actioned by an outsourced service provider and things work, but it all takes time!

If you are writing a really cool java WebStart application, deploying a BioMart, Taverna, MySQL, … resource, please consider the needs of the small (but potentially fee-paying) community who would love to evaluate your resource, but may skip the evaluation if the communication port is blocked. I could provide a long list of resources that really cannot be accessed (ArrayTrack, Chipster, BioMart, MySQL, KEGG …) but this is perhaps not fair…

A more valuable comment would be a suggestion as to how this can be avoided – subdomains are always very simple to implement on the server side, and you could consider mapping key resources that use non-standard ports to a more standard port. If a resource looks good, then we also have our tricks that can work – the use of external servers that translate requests works, but is a hassle that no-one really needs!
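To make the "map a non-standard port onto a standard one" idea concrete, here is a bare-bones TCP relay in Python. It is only a sketch of the general technique, not a description of how any of the resources above are (or should be) deployed, and `start_relay` and its parameters are names I made up for the illustration.

```python
import asyncio


async def _pipe(reader, writer):
    """Copy bytes from reader to writer until EOF, then close the writer."""
    try:
        while True:
            data = await reader.read(65536)
            if not data:
                break
            writer.write(data)
            await writer.drain()
    except ConnectionError:
        pass  # the peer went away; just stop relaying
    finally:
        writer.close()


async def start_relay(listen_host, listen_port, target_host, target_port):
    """Listen on listen_host:listen_port and forward every connection
    to target_host:target_port, copying traffic in both directions."""

    async def handle(client_reader, client_writer):
        try:
            upstream_reader, upstream_writer = await asyncio.open_connection(
                target_host, target_port
            )
        except OSError:
            client_writer.close()  # target unreachable: drop the client
            return
        await asyncio.gather(
            _pipe(client_reader, upstream_writer),
            _pipe(upstream_reader, client_writer),
        )

    return await asyncio.start_server(handle, listen_host, listen_port)
```

On a machine that can see both sides, awaiting `start_relay("0.0.0.0", 8080, "db.example.org", 3306)` (hypothetical host and ports) and then the returned server's `serve_forever()` would expose the database port on 8080. It does no authentication or encryption, so treat it as a demonstration and get it approved before running anything like it on a corporate network.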

self-learning in bioinformatics …

Tuesday, January 6th, 2009

Today is the 6th of January – and the 12th Night after Christmas. Superstition argues that it is today at the latest when the decorations from Christmas should be taken down and put away to avoid bad luck. With a corporate restructuring due to complete tomorrow, I need all of the good luck that I can get!

Whilst putting away the decorations I came across a pile of former playthings: the tools on which I learned many of the bioinformatics tricks and techniques that I rely upon today! Amongst these toys I have an eclectic array of old SGI boxes (indy, indigo2 impact), Sun workstations (Sparc 10 pizza boxes and Sparc IPX lunch boxes galore) along with other slightly more bizarre hardware such as Sun Javastations and ancient RAID arrays.

Why do I have this junk? Back as a PostDoc I decided that there was more to bioinformatics and computing than the obligatory DEC alpha and the piece-of-$%&* Windows workstations available at the time. I spent a lot of time and money (ebay.de was my friend) buying end-of-life Unix hardware and running cool software (beyond NCBItools) such as openPBS to get complex genome analysis jobs running on machines that were rather fantastically expensive when they were new!

Those were pretty good times. Bioinformatics hardware is now kind of sucky. At home I have a couple of self-built linux servers and a Mac laptop. At work I have some pretty OK HP Xeon workstations, and at the University the group still runs some now rather dated Dell rack servers as a modest Beowulf. There isn’t really anything cool out there anymore, is there?

My next goal for this year (sponsorship, career and readership allowing) is to start buying some of the slightly more off-the-wall hardware and to see what the new boundaries of bioinformatics hardware could be. Are you interested?

Please post examples of great hardware that you are running bioinformatics on! Which ancient hardware do you still run?