PC Cluster Specifications



The cluster is made up of four physically separate groups of Intel-based PCs. The master node is a 1.0 GHz dual-processor PIII PC named Orpheus. The master was formerly a 1.0 GHz single-processor PIII PC, but we reassigned the master role to the former node 08, since dual-processor machines are somewhat more reliable. This computer has four network cards. The first connects the whole cluster to the Internet, and the other three connect to three of the four network switches (the fourth switch is cascaded from one of the first three). The PBS queue system is organized so that different queues run on physically distinct computers: each node is devoted to one queue, and only that queue.

It is important to note that our setup is not optimized, and even if it were, different users might have different needs. The fixed assignment of each CPU to a single queue, for example, may have drawbacks for some users (some queues can be empty while others are overloaded), but, like everything else in our setup, it works for us. The fact that two switches are connected in cascade lowers data transfer performance (vital in many parallel calculations), so some users might want better-tuned connectivity between the nodes. However, our measurements of network traffic during parallel runs indicate that the speed-limiting factor is not the communication bandwidth but the computation time on each node. Faster CPUs, or more communication-intensive calculations, might need a faster network, but for us it would be a waste of money.

The cluster has 82 nodes, 18 of which are dual-processor ones, giving a total of 100 CPUs. A detailed breakdown is shown in the following table:

Node number   CPU type            Speed         Queue    Added
01-07         Dual PIII           1.0 GHz       Medium   12-Dec-2001
09-14         PIII                1.0 GHz       Medium   12-Dec-2001
15-20         Dual PIII           600 MHz       Short    -
21-28         PIII                866 MHz       Mini     -
29-44         P4                  1.7 GHz       VShort   10-May-2002
45-60         P4                  1.7 GHz       Large    10-May-2002
61-78         P4                  2.4 GHz       VLarge   18-Dec-2002
100-104       Dual PII and PIII   450-500 MHz   Slow     10-Feb-2003
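
The totals quoted above can be checked against the table with a few lines of Python. This is only a sketch: the tuple layout and the function name tally are illustrative choices, and the node ranges are transcribed from the table.

```python
# Node groups transcribed from the table above:
# (first node, last node, CPUs per node, queue)
NODE_GROUPS = [
    (1, 7, 2, "Medium"),
    (9, 14, 1, "Medium"),
    (15, 20, 2, "Short"),
    (21, 28, 1, "Mini"),
    (29, 44, 1, "VShort"),
    (45, 60, 1, "Large"),
    (61, 78, 1, "VLarge"),
    (100, 104, 2, "Slow"),
]


def tally(groups):
    """Return (total nodes, dual-processor nodes, total CPUs, CPUs per queue)."""
    nodes = duals = cpus = 0
    per_queue = {}
    for first, last, cpus_per_node, queue in groups:
        count = last - first + 1  # node ranges are inclusive
        nodes += count
        if cpus_per_node == 2:
            duals += count
        cpus += count * cpus_per_node
        per_queue[queue] = per_queue.get(queue, 0) + count * cpus_per_node
    return nodes, duals, cpus, per_queue


nodes, duals, cpus, per_queue = tally(NODE_GROUPS)
print(nodes, duals, cpus)
print(per_queue)
```

The per-queue CPU counts produced this way should match the "Number of CPUs" column of the queue table further down.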



Queue Specifications



As mentioned before, the different queues run on different CPUs. Each queue is designed to cover some of the needs of the group. It is worth noting that a queueing system has some major advantages over "interactive" job submission, especially for groups of users and large numbers of available CPUs. With a well-configured queue, the user doesn't have to keep track of which nodes are busy or free: just submit the job to the queue, and it will start when the appropriate nodes are free. Because most queues have time limits (all but VLarge and Mini), users can expect some CPUs to be freed in a reasonable amount of time, and they are encouraged to send each job to a queue whose settings suit it (a big frequency calculation should go to VLarge, but a tiny NBO calculation could just as well run in Slow). Recently (Jun-03) we removed the time limit on Mini (formerly 24h), because more queues for really long jobs were needed.
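
As a rough illustration of matching a job to a queue, the sketch below lists the queues whose time limit can hold a job of a given estimated length, using the limits from the queue table on this page. The helper eligible_queues is hypothetical and not part of PBS, and a real choice would also weigh CPU speed and the number of CPUs per job.

```python
# Time limits in hours, taken from the queue table; None means no limit.
TIME_LIMITS = {
    "Slow": 60,
    "Mini": None,
    "VShort": 36,
    "Short": 72,
    "Medium": 96,
    "Large": 184,
    "VLarge": None,
}


def eligible_queues(estimated_hours):
    """Queues that can hold a job of this length, tightest limit first.

    Queues without a time limit are listed last.
    """
    fits = [
        (limit, name)
        for name, limit in TIME_LIMITS.items()
        if limit is None or estimated_hours <= limit
    ]
    # Sort limited queues by their limit; unlimited ones go to the end.
    fits.sort(key=lambda item: (item[0] is None, item[0] or 0))
    return [name for _, name in fits]


print(eligible_queues(50))
print(eligible_queues(200))
```

A job estimated at more than 184h only fits in the two queues without a time limit.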

We currently have 7 queues running, with the following specs:


Queue name   Number of CPUs   Speed of each CPU   Peak GHz   CPUs per job   GHz per job   Time limit   TFlop per job
Slow         10               450-500 MHz         4.78       10             4.78          60h          806.6
Mini         8                866 MHz             6.91       4              3.46          -            299.3/day
VShort       16               1.7 GHz             27.12      4              6.78          36h          878.7
Short        12               600 MHz             7.16       12             7.16          72h          1855.9
Medium       20               1.0 GHz             19.99      4              4.00          96h          1382.4
Large        16               1.7 GHz             27.12      4              6.78          184h         4491.1
VLarge       18               2.4 GHz             42.59      4              9.55          -            825.1/day
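
For most rows, the last column appears to be the "GHz per job" figure multiplied by the time limit, that is, a count of clock cycles assuming one operation per cycle. That reading is inferred from the numbers rather than stated anywhere on this page. A minimal Python sketch under that assumption follows (the function name tera_ops is illustrative). The Slow row is left out because its listed figure does not follow from the same product, and the two queues without a time limit are computed per 24h day.

```python
# (queue, GHz per job, hours, listed figure) taken from the table above.
# Mini and VLarge have no time limit, so their figures are per 24h day.
ROWS = [
    ("VShort", 6.78, 36, 878.7),
    ("Short", 7.16, 72, 1855.9),
    ("Medium", 4.00, 96, 1382.4),
    ("Large", 6.78, 184, 4491.1),
    ("Mini", 3.46, 24, 299.3),
    ("VLarge", 9.55, 24, 825.1),
]


def tera_ops(ghz_per_job, hours):
    """Tera-operations in `hours`, assuming one operation per clock cycle."""
    cycles = ghz_per_job * 1e9 * hours * 3600
    return cycles / 1e12


for name, ghz, hours, listed in ROWS:
    print(f"{name:7s} computed {tera_ops(ghz, hours):8.1f}  listed {listed:8.1f}")
```

Small differences from the listed values come from the "GHz per job" column being rounded to two decimals.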



