Archive for the ‘rPath’ Category

Caos NSA and Perceus: All-in-one Cluster Software Stack

Friday, February 6th, 2009

caos-nsa.png

As part of my best working practice, and as a devoted follower of linux technologies, paradigms and server-side computing, I am a committed reader of Linux Magazine. Yesterday they had a pretty good article on a linux distribution that I had hitherto been unaware of (and should have known something about), along with a rather basic introduction to getting the software running.

I am not sure what this means – at the moment BioRAM linux is slowly evolving; it is not really planned as a complete cluster software stack, but more as a bioinformatics cluster “central”. Am I re-inventing the wheel?

I have suspicions that BioRAM linux may be dead in the water (already), although it has a current, up-to-date and functioning application stack. I guess that the rPath linux made available through rBuilder may be a little off the bleeding edge curve, and I should perhaps again focus my energies elsewhere… This is a rather negative thought, and anyone who works with me will know that negativity is bad and a positive spin must be added to everything ….

My feeling is that Caos NSA looks like a very viable software distribution for a formal evaluation and review. To review such a tool we need a set of objectives, and some deliverables that will allow for an objective assessment as to whether the software is fit for purpose. Naturally, it is also rather important that we can package and install the set of software applications that will be needed across the cluster.

I’ll get back to this thought later, but I’m excited about the opportunity to take Caos NSA for a run over the weekend!

linux distributions, custom images and respinning

Tuesday, January 27th, 2009

ssv_tux_stick_plain.jpg

I have been discussing rPath over the last few days and have put a not inconsiderable amount of effort into getting a bioinformatics linux distribution off the ground. rPath is perhaps not the easiest way to go; Fedora offers its own way using the Revisor tool. I’ve been a fan of RedHat since my first excursion into linux 15 years ago, and have a Fedora workstation at home, but getting software packaged into .rpm is something that still makes me more than just a little crazy (Staden .rpm production caused much hair loss – I am certain of it!)

It is cool to see that on Slashdot this morning there is a link for a new Suse project going into alpha for exactly this purpose as well. SUSE Studio looks like an interesting tool and something that I should invest some time in exploring.

Bioinformatics is great; I am paid (a modest fee) to do something I enjoy and I would love to see more people responsibly using a wider range of tools. It is sad that so many so-called “bioinformaticians” don’t know how to compile C code, have little understanding of the tools available and do not religiously explore the new publications in Bioinformatics, Nucleic Acids Research etc. for new applications that will push the boundaries of what they can do. I vehemently oppose the concept that meaningful bioinformatics can be achieved using a windows workstation alone!

I guess I should ask you bioinformaticians out there – how would you roll a meaningful bioinformatics distribution for use on either a workstation or for deployment across a cluster of tens or hundreds of servers?

tutorial – building a bioinformatics centric, distributed, cluster centric linux (bioram part I)

Tuesday, January 20th, 2009

sample_cd.jpg

Yesterday I mentioned rPath linux. I am again in love with this application. rPath linux and their free rBuilder application stack are something that the bioinformatics community should consider jumping at. Regardless of whether you are a grad-student interested in high-performance bioinformatics computing, a postDoc with a big problem or a technologist in a core facility, there is something that you could do with rPath.

I have decided to put together a tutorial series on rPath and my own (very-much-in-progress) version of bioinformatics-oriented linux. In this tutorial we are going to start out with a linux running inside a virtual machine. This will be configured with up-to-date and interesting packages that are of relevance to tasks within genomics, transcriptomics and metabolomics. The virtual machine will be iteratively improved and we will prepare our own version of “bioram-linux”. This will be made available as a standalone linux distribution and will be integrated in an environment to deploy across a computer cluster dedicated to bioinformatics. We will use some of my ideas, some of Aschwin’s ideas and hopefully you, the wider community, will start to comment and update us with your feelings, opinions and ideas.

The final version will be made available at rPath rBuilder. The packages and modules can be copied, modified and altered to suit your needs within your own projects – this is just a push to hopefully increase the amount of content and knowledge within the bioinformatics community in relation to rPath.

As a first note (self-important mumbling to self even), please acknowledge that I am no expert in linux administration (I am a comfortable super user) and will undoubtedly introduce faux-pas within the pipeline and workflow. Please let me know when I am going wrong and hopefully together we can do something rather clever? The project is called “bioram-linux”. BioRAM was the name of a company that I planned to start during 2002 / 2003; the domain was registered and ideas and business plans collected. This idea was on the back-burner and seemed like a good idea when I registered for rPath a couple of years ago. Please accept the name, I like it :-)

In today’s tutorial we are going to head over to rPath, find the bioram-linux / Mnemosyne BioSciences LabManager development image and download it. We will then install the appropriate linux within a VMware-based virtual machine running on our host computer. For the time being all images are x86, and will stay that way until much later tutorials when we will start to cross-compile packages for the x86_64 platforms.

Let’s get cracking

  • Head over to the rBuilder pages at rPath. In the search box at the top of the page, click on “product”, and search for the term “bioram”. Press “search” and the Mnemosyne BioSciences product for bioram-linux should be returned as one of the top matches, click on this link. The pages for bioram can also be found directly by using the URL http://www.rpath.org/project/bioram-linux/

rbuilder_1.png

  • On the bioram pages, use the right-hand panel to select the option to “View Releases”. A page similar to the one shown below should appear. This provides a link to the BioRAM linux download page. If, of course, you are looking at this page more than a few days after the first posting of this blog article then things may have changed! Select the link for the “Mnemosyne Biosciences – BioRAM linux for bioinformatics”, and proceed to download the VMware image. Please note both the size of the file (it is not a very small file ..) and ideally the md5sum checksum. You can then validate that what you download is the same as what was present on the server.

rbuilder_2.png

  • When the file has downloaded, ensure that the file checksums match and extract the .tar.gz archive. If you don’t already have the VMware player, VMware Fusion or some other VMware environment you should arrange this now. I can recommend the VMware player for linux, this is great. I use Fusion on my MacBook Pro and this is even better. Load the downloaded VMware image into your VM software of choice and start the VM up for the first time …

rbuilder_3.png

  • Provide the path to the downloaded BioRAM installation image

rbuilder_4.png

  • View and review system settings and make any changes that you feel would benefit your system!

rbuilder_5.png

  • Everything should now be set! Click the various boxes and boot your newly defined Virtual Machine for the first time. Good luck! There will be a minute or so of linux loading and the standard process messages will be written to the screen – there should be no need for panic!

rbuilder_6.png

  • Once the BioRAM linux has loaded, it will complain about an improperly configured X. Let the system know that you are in control: make the error message go away by clicking the “YES, I know what I am doing and have used linux before” boxes. X will start and you will be prompted to configure X to your specifications. I would recommend something along the lines of what is shown below. Make sure that the VMware graphics adapter is configured and everything should be fine.

rbuilder_8.png

 

rbuilder_7.png

  • You are now able to log on with the “root” account – there is no password, so the very first things that you should do are to ensure that at least networking works and that you have a secure root password implemented!

rbuilder_9.png

rbuilder_10.png
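The checksum step from the download instructions above can also be scripted. What follows is only a minimal sketch in Python, not part of the rPath tooling: the function names `md5_of_file` and `verify_and_extract` are made up for illustration, and the archive name and checksum in the usage comment are placeholders that you would replace with the values shown on the release page.

```python
import hashlib
import tarfile
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that a large VM image does not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_extract(archive, expected_md5, destination):
    """Extract a .tar.gz archive only if its MD5 matches the published value.

    Hypothetical helper: raises ValueError on a mismatch and returns
    the computed digest on success.
    """
    actual = md5_of_file(archive)
    if actual.lower() != expected_md5.strip().lower():
        raise ValueError(
            f"checksum mismatch: expected {expected_md5}, got {actual}"
        )
    Path(destination).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        # Only extract archives that come from a source you trust.
        tar.extractall(destination)
    return actual


# Example usage (placeholder file name and checksum, not real values):
# verify_and_extract("bioram-image.tar.gz", "<md5 from release page>", "bioram-vm")
```

The same check can of course be done by hand with the md5sum command; the script is just convenient if you expect to download new images regularly.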

You have done really well getting to the end of this first BioRAM linux tutorial. I am aware that I am not the best tutorial writer, but the only way to learn is by doing it, and by receiving comments and criticisms from the readers. I hope that you have gotten this far; in the coming few weeks we are going to do a whole lot more with this.

In the next session (Thursday) we will configure our new virtual machine for package building, maintenance and compiling. We will define a new version of the “R” package and will create version 2.8.1 of R for our virtual machine. This will be installed and we will have a little think about how we can get “Bioconductor” installed and in use.

rPath linux and bioinformatics – part I (why?)

Monday, January 19th, 2009

rbo.png

This collection of somewhat unfocussed and prosaic text has been slowly evolving and thus far it has been mainly about me in some ego-massaging form. Not what a larger bioinformatics-interested audience really needs. I intend to up the ante, and to start describing a little more about what I feel are pretty neat concepts within bioinformatics.

Bioinformatics is more than just a job, it is a way of life, and before the standardization of the discipline with tools from Ingenuity, SAS, Rosetta, GeneData, Mnemosyne etc. it is pretty much a free-for-all development fight. The bioinformatics community hasn’t really standardized on a single programming language (Perl is dominant, but Python, Java, R, C etc. all get a pretty good look in), and each language has advantages and disadvantages. As scientists and bioinformaticians publish tools, as tools are adopted across the community and as workflows and approaches to understanding are integrated, the needs of our bioIT systems become rather more convoluted and complex.

In a post last week I asked how you, the wider audience, might keep your R repositories updated and synchronized. On a laptop or workstation I imagine that the task is rather trivial. You update code as required in an ad hoc fashion. What happens though when you need to share your code base with other scientists in another country (timezone, city, whatever)? It is not always reasonable to ask a collaborator or customer to please install packages X, Y, Z – recompile application A and you’re ready to go.

When working at the Turku Centre for Biotechnology I was lucky enough to be able to recruit a SysAdmin called Aschwin. He had some pretty strong ideas as to how the lab infrastructure should be implemented and erred perhaps on the side of the “bleeding edge”. Our cluster was quickly restructured into something pretty workable, that satisfied my user needs (I wrote my first technical needs brief for Aschwin prior to the industrialisation of my career) and got the job done. Aschwin used Ubuntu, FAI and a collection of scripts and new .deb packages to encapsulate the bulk of the required software. This worked well, but always required Aschwin to adapt, modify and correct the environment. Not a problem whilst you work with your SysAdmin…

I had requested a cluster implementation such as Rocks or something a little more pre-digested, but at least the cluster was deployed as specified and worked! Rocks is something that I have used at home in a 5 node cluster during earlier EST analysis, but it suffered due to its need for RPM files and the general pain in the &%$ that is required to package some software. My struggle to identify the optimal customer roll-out platform continued (and still continues).

rPath came to my attention in 2006 and I saw a small company building on ideas from RedHat (my first (1995) and still favourite linux). The rPath company has a free rBuilder application that can be used to create a virtual linux distribution. You cut and paste software packages from their servers and, when these are insufficient for your needs, you roll your own and upload them to the server for the benefit of the wider community. I have created packages for typical applications such as ClustalW, Phylip, NCBI Blast and InterPro. Keeping them up-to-date is a question of time and resources and it seems that the package evolution is somehow punctuated with frenetic energy twice a year.

The rPath approach is really cool. I can create packages, compile packages and roll-my-own distribution that can be made available for download as live CDs, installation media or images for running within the typical virtual machines. This is the real reason that I am interested. Let me quickly roll a linux distribution with an up-to-date set of software, the latest (and validated) R packages and the required Mnemosyne integration software, and I am a happy bunny. I can run the software on my MacBook in VMware Fusion and can run the weird packages with bizarre dependency problems alongside the standard IT infrastructure used in the office.

Please have a look and get in contact. Together we can make a really useful linux distribution with a feature set superior to that of e.g. DNAlinux, BioBrew or BioLinux. In later parts of this story I will discuss the process of creating a distribution, packaging your software and distributing your new distribution across a large computer cluster.