Large-Scale Computing on Bioinfo Cluster
- Introduction
- A cluster is a group of interconnected computers that work in parallel to accelerate distributable calculations.
- Major applications in bio/chem/stat-informatics
- Sequence-, model-, structure-based database searches: number of queries ↑ + database size ↑
- Predictive calculations and quantitative data analyses: number of inputs ↑ + size of data sets ↑
- Modeling and simulations: advancement of algorithms ↑ + number of inputs ↑
- Load balancing of high-traffic databases: number of simultaneous queries ↑ + query speed ↑
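The scaling factors above are what make these workloads well suited to a cluster: a large batch of independent queries or inputs can simply be split across nodes, each running the same program on its own subset. A minimal sketch of that chunking step (the query file names are hypothetical placeholders):

```python
# Split a batch of independent queries into one chunk per cluster node.
# Each node would then run the same search program on its own chunk.
def chunk_queries(queries, n_nodes):
    """Distribute queries round-robin across n_nodes chunks."""
    chunks = [[] for _ in range(n_nodes)]
    for i, query in enumerate(queries):
        chunks[i % n_nodes].append(query)
    return chunks

# Hypothetical example: 100 query files spread over 32 nodes.
queries = [f"query_{i}.fasta" for i in range(100)]
chunks = chunk_queries(queries, 32)
# Each node ends up with 3 or 4 of the 100 queries.
```

Round-robin assignment keeps the per-node load nearly even even when the number of queries is not a multiple of the node count.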
- Specifications of Bioinfo Cluster
- Hardware
- Head node: HP Tetrad 1.9GHz Xeon, 8GB RAM, 1TB HD space
- Cluster nodes: 32 nodes, each with dual 2.8GHz AMD (Barton) CPUs, 1GB RAM, 40GB HD and dual Gigabit network cards
- High-end floor cooling system for 200 CPUs
- Operating system
- Cluster software
- Distributed inter-process communication (DIPC) software ('parallel computing software', e.g. MPI, PVM)
- Queuing systems (not enabled yet, options: DQS, OpenPBS, SGE)
- Load balancing software (OpenMosix, SGE)
- Resource monitoring systems (e.g. jmon, mosmon, freenodes)
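MPI and PVM themselves require the cluster's runtime, but the master/worker pattern they (and the queuing systems above) enable can be illustrated on a single machine with Python's standard `multiprocessing` module. This is a stand-in analogy, not the cluster's actual software: each worker process plays the role of one node executing a CPU-bound job.

```python
from multiprocessing import Pool

def score_sequence(seq):
    # Stand-in for a CPU-bound task such as computing an alignment score;
    # the real cluster job would run a full search or simulation here.
    return sum(ord(base) for base in seq)

if __name__ == "__main__":
    sequences = ["ACGT", "GGCC", "TTAA", "ACGTACGT"]
    # One worker per CPU, analogous to one job per cluster node.
    with Pool(processes=4) as pool:
        scores = pool.map(score_sequence, sequences)
    print(scores)  # → [287, 276, 298, 574]
```

As with a queuing system, the pool hands each idle worker the next pending task, so the caller only describes the work, not the scheduling.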
- Account infrastructure
- Home directory on user account disk: limit 1.5GB
- Directory /data/working/user_account on HD RAID array: user limit 10GB
- User directories/files mounted (available) on nodes
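The two disk quotas above (1.5GB home, 10GB working directory) suggest a simple per-area check; the helper below is an illustrative sketch, not the cluster's actual quota tooling:

```python
# Quota limits taken from the account infrastructure above (in GB).
QUOTAS_GB = {"home": 1.5, "working": 10.0}

def over_quota(usage_gb, area):
    """Return True if usage in the given storage area exceeds its quota."""
    return usage_gb > QUOTAS_GB[area]

print(over_quota(1.2, "home"))      # → False
print(over_quota(12.5, "working"))  # → True
```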
- User policies
- Greediness: all nodes are available to everyone when they are idle
- Honesty: schedule lengthy jobs with the administrator (≥5 nodes for 6 or more days)
- Modesty: choose the number of nodes based on current user activity
- Financial management
- Recharging per account; power users billed by CPU usage
- Grants for hardware upgrades/expansions and administrator salary
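One way to combine the two recharge criteria discussed later (flat fee vs. CPU usage) is a flat fee up to a usage threshold, with usage-based billing beyond it. The fee, rate, and threshold below are made-up placeholders for illustration; the notes do not specify actual prices:

```python
# Illustrative recharge model: regular accounts pay a flat fee,
# power users additionally pay in proportion to CPU-hours used.
# All numbers are hypothetical placeholders, not actual prices.
def recharge(cpu_hours, flat_fee=100.0, rate_per_cpu_hour=0.02,
             power_user_threshold=5000):
    """Flat fee up to the threshold; usage-based billing beyond it."""
    if cpu_hours <= power_user_threshold:
        return flat_fee
    return flat_fee + (cpu_hours - power_user_threshold) * rate_per_cpu_hour

print(recharge(1000))   # → 100.0 (regular account)
print(recharge(10000))  # → 200.0 (power user)
```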
- Discussion
- Administration
- Hardware
- Cluster size
- CPU: 32 vs 64 bit, RAM, etc.
- Software
- Cluster software
- User applications
- Recharging
- Criteria: flat fee, CPU usage
- How much?
- Workshop using the Bioinfo Cluster