Large-Scale Computing on Bioinfo Cluster
- Introduction
- A cluster is a group of interconnected computers that work in parallel to accelerate distributable calculations.
- Major applications in bio/chem/stat-informatics
- Sequence-, model-, structure-based database searches: number of queries ↑ + database size ↑
- Predictive calculations and quantitative data analyses: number of inputs ↑ + size of data sets ↑
- Modeling and simulations: advancement of algorithms ↑ + number of inputs ↑
- Load balancing of high-traffic databases: number of simultaneous queries ↑ + query speed ↑
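The scaling factors above are what make these workloads well suited to a cluster: a large batch of independent queries or inputs can simply be split across nodes, each running the same program on its own subset. A minimal sketch of that chunking step (the query file names are hypothetical placeholders):

```python
# Split a batch of independent queries into one chunk per cluster node.
# Each node would then run the same search program on its own chunk.
def chunk_queries(queries, n_nodes):
    """Distribute queries round-robin across n_nodes chunks."""
    chunks = [[] for _ in range(n_nodes)]
    for i, query in enumerate(queries):
        chunks[i % n_nodes].append(query)
    return chunks

# Hypothetical example: 100 query files spread over 32 nodes.
queries = [f"query_{i}.fasta" for i in range(100)]
chunks = chunk_queries(queries, 32)
# Each node ends up with 3 or 4 of the 100 queries.
```

Round-robin assignment keeps the per-node load nearly even even when the number of queries is not a multiple of the node count.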
- Specifications of Bioinfo Cluster
- Hardware
- Head node: HP Tetrad 1.9GHz Xeon, 8GB RAM, 1TB HD space
- Cluster nodes: 32 nodes, each with dual 2.8GHz AMD (Barton) CPUs, 1GB RAM, 40GB HD and dual Gigabit network cards
- High-end floor cooling system for 200 CPUs
- Operating system
- Cluster software
- Distributed inter-process communication (DIPC) software ('parallel computing software', e.g. MPI, PVM)
- Queuing systems (not enabled yet, options: DQS, OpenPBS, SGE)
- Load balancing software (OpenMosix, SGE)
- Resource monitoring systems (e.g. jmon, mosmon, freenodes)
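MPI and PVM themselves require the cluster's runtime, but the master/worker pattern they (and the queuing systems above) enable can be illustrated on a single machine with Python's standard `multiprocessing` module. This is a stand-in analogy, not the cluster's actual software: each worker process plays the role of one node executing a CPU-bound job.

```python
from multiprocessing import Pool

def score_sequence(seq):
    # Stand-in for a CPU-bound task such as computing an alignment score;
    # the real cluster job would run a full search or simulation here.
    return sum(ord(base) for base in seq)

if __name__ == "__main__":
    sequences = ["ACGT", "GGCC", "TTAA", "ACGTACGT"]
    # One worker per CPU, analogous to one job per cluster node.
    with Pool(processes=4) as pool:
        scores = pool.map(score_sequence, sequences)
    print(scores)  # → [287, 276, 298, 574]
```

As with a queuing system, the pool hands each idle worker the next pending task, so the caller only describes the work, not the scheduling.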
- Account infrastructure
- Home directory on user account disk: limit 1.5GB
- Directory /data/working/user_account on HD RAID array: user limit 10GB
- User directories/files mounted (available) on nodes
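The two disk quotas above (1.5GB home, 10GB working directory) suggest a simple per-area check; the helper below is an illustrative sketch, not the cluster's actual quota tooling:

```python
# Quota limits taken from the account infrastructure above (in GB).
QUOTAS_GB = {"home": 1.5, "working": 10.0}

def over_quota(usage_gb, area):
    """Return True if usage in the given storage area exceeds its quota."""
    return usage_gb > QUOTAS_GB[area]

print(over_quota(1.2, "home"))      # → False
print(over_quota(12.5, "working"))  # → True
```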
- User policies
- Greediness: all nodes are available to everyone when they are idle
- Honesty: schedule lengthy jobs with the administrator (≥5 nodes for 6 or more days)
- Modesty: choose the number of nodes based on current user activity
- Financial management
- Recharging per account; power users billed by CPU usage
- Grants for hardware upgrades/expansions and administrator salary
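One way to combine the two recharge criteria discussed later (flat fee vs. CPU usage) is a flat fee up to a usage threshold, with usage-based billing beyond it. The fee, rate, and threshold below are made-up placeholders for illustration; the notes do not specify actual prices:

```python
# Illustrative recharge model: regular accounts pay a flat fee,
# power users additionally pay in proportion to CPU-hours used.
# All numbers are hypothetical placeholders, not actual prices.
def recharge(cpu_hours, flat_fee=100.0, rate_per_cpu_hour=0.02,
             power_user_threshold=5000):
    """Flat fee up to the threshold; usage-based billing beyond it."""
    if cpu_hours <= power_user_threshold:
        return flat_fee
    return flat_fee + (cpu_hours - power_user_threshold) * rate_per_cpu_hour

print(recharge(1000))   # → 100.0 (regular account)
print(recharge(10000))  # → 200.0 (power user)
```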
- Discussion
- Administration
- Hardware
- Cluster size
- CPU: 32 vs 64 bit, RAM, etc.
- Software
- Cluster software
- User applications
- Recharging
- Criteria: flat fee, CPU usage
- How much?
- Workshop using the Bioinfo Cluster