Case study: A small compute cluster for omics analysis
Several years ago we advised one of our clients, a genomics research group, on the acquisition of a compute server for their omics analyses. After helping them select the most suitable machine configuration, we were also asked to take on the system administration and hardware maintenance of the machine, which we gladly accepted.
After less than two years, growing demands for computational power and storage space led to the question of whether we could help them upgrade their compute infrastructure.
In cooperation with the client and several suppliers we decided to extend the existing server with an additional disk storage module and two compute nodes. We also suggested a server for backups of important data.
The existing machine had two 6-core processors, 144 GB RAM (i.e. an average of 12 GB per core), and 16 disks of 2 TB each. An external disk enclosure with 12× 2 TB disks was connected to the machine as well. Since the machine was performing well and was still under warranty, we decided to keep using it.
To satisfy the need for more compute power, two compute nodes were added, turning what used to be a single server into a small compute cluster. Each compute node has two 8-core processors (16 cores per node) and 256 GB of RAM, i.e. 16 GB of RAM per core. We designed the cluster in such a way that all storage is centralised on the ‘old’ compute/storage server; consequently, the nodes are configured with very limited local storage.
The need for additional disk storage was met by replacing the existing 12-slot external drive enclosure with a new one offering space for 45 drives. Initially, half of these slots were populated with 4 TB disks, plus two SSDs for read and write caching. The new enclosure was attached to a RAID card of its own, so that the original 16-disk array and its controller would get more headroom and performance would remain up to par.
To make sure that the network connection between the compute/storage server and the compute nodes would not become a bottleneck, each compute node was connected to the storage server via a QDR InfiniBand interconnect. To simplify maintenance, we used the IP over InfiniBand (IPoIB) protocol. This allowed us to use a normal network configuration with NFS mount points, regular IP traffic, and the Sun Grid Engine batch queue system while maintaining excellent performance.
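As a sketch of what such a setup can look like (the host name, subnet, and paths below are hypothetical, not taken from the actual deployment), the storage server exports the shared data directory over its IPoIB interface, and each compute node mounts it via a plain NFS entry:

```
# /etc/exports on the storage server (hypothetical subnet and path)
/data  192.168.10.0/24(rw,async,no_subtree_check)

# /etc/fstab on each compute node (hypothetical IPoIB host name 'storage-ib')
storage-ib:/data  /data  nfs  rw,hard  0 0
```

Because IPoIB presents the InfiniBand fabric as an ordinary IP network, no NFS- or scheduler-specific changes are needed; the fast interconnect is used transparently.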
To safeguard important datasets, a separate backup server was required. To that end, a server with modest compute power and RAM was purchased. The machine has 12 internal slots for hard drives, and the external enclosure originally connected to the main compute/storage server was moved to this machine. As this machine only holds backups of data and isn’t used for ‘live’ computation, we chose so-called cold-storage disks, which are slower but also more cost-effective.
A backup scheme was devised that automatically rotates daily, weekly and monthly snapshots of important data on the compute/storage server. Our backup plan kept 8 daily snapshots, 4 weekly snapshots and 12 monthly ones. This allows the client to go back to a given point in time in case of, for example, accidental deletion of files. The backup plan was implemented using the open-source rsnapshot tool in combination with LVM (Logical Volume Manager), which provides the actual snapshots of the data on the compute/storage server.
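A minimal rsnapshot configuration implementing this retention policy might look as follows (the snapshot root, volume group and volume names are hypothetical; note that rsnapshot requires tab-separated fields):

```
# /etc/rsnapshot.conf (fragment; fields must be tab-separated)
config_version	1.2
snapshot_root	/backup/snapshots/

# Retention: 8 daily, 4 weekly and 12 monthly snapshots
retain	daily	8
retain	weekly	4
retain	monthly	12

# Back up from a temporary LVM snapshot of the data volume
# (hypothetical volume group 'vg0' and logical volume 'data')
linux_lvm_snapshotsize	10G
linux_lvm_mountpath	/mnt/lvm-snapshot
backup	lvm://vg0/data/	compute-server/
```

The `daily`, `weekly` and `monthly` rotations are then triggered from cron (e.g. `rsnapshot daily`), with rsnapshot creating and removing the LVM snapshot around each run so that a consistent view of the filesystem is backed up.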
Together with our client and hardware suppliers, we devised a way to turn their existing compute/storage server into a small compute cluster that satisfies their omics-analysis needs now and for several years to come.
Their original setup was expanded from a single server to a small cluster consisting of a compute/storage server, two compute nodes and a server for backups of important data. The following table summarises this project in numbers:
| | Before | After |
|---|---|---|
| RAM per core (GB) | 12 | 15 |
| Effective disk space, excluding backups (TB) | 42 | 120 |
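The ‘RAM per core’ figures can be reproduced as weighted averages over all cores in the cluster. A quick sketch in Python, assuming 16 cores and 256 GB of RAM per compute node as described above:

```python
# Before the upgrade: a single server with 12 cores and 144 GB of RAM.
cores_before = 12
ram_before = 144  # GB
print(ram_before / cores_before)  # 12.0 GB per core

# After the upgrade: the original server plus two compute nodes,
# each (assumed) with 16 cores and 256 GB of RAM.
cores_after = 12 + 2 * 16   # 44 cores
ram_after = 144 + 2 * 256   # 656 GB
print(round(ram_after / cores_after, 1))  # 14.9, i.e. roughly 15 GB per core
```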