Consolidating Condor into Slurm


In July, we commissioned a new scheduler, Slurm, for the HPC cluster. Since then, we have been tuning the system to improve stability and performance. As of today, over 229,300 jobs have successfully run in the Slurm cluster.

Given the manageability and flexibility of the Slurm scheduler, we have decided to consolidate the Condor system into Slurm. In practice, this means that jobs you would previously have submitted to Condor should now be submitted to a Slurm partition named Condor using the sbatch command.

For details on Slurm job submission, refer to our guide.
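
For reference, a minimal submit script for the new partition might look like the sketch below; the job name, wall time, and program are placeholders, and the lowercase partition name is an assumption you should confirm with rcctool my:partitions or our guide:

    #!/bin/bash
    #SBATCH --job-name=example_job     # placeholder job name
    #SBATCH --partition=condor         # target the new Condor partition (name assumed)
    #SBATCH --ntasks=1                 # serial, single-task job
    #SBATCH --time=7-00:00:00          # wall time as days-hh:mm:ss (up to 90 days)

    # Replace with your actual workload
    srun ./my_program

Submit the script with sbatch and monitor it with squeue:

    sbatch submit.sh
    squeue -u $USER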

Consolidating schedulers will allow us to free up important but limited resources for managing systems at the RCC. In addition, users who have been using both systems will no longer need to maintain submit scripts for multiple schedulers.

New 'Condor' Partition

The new partition has the following characteristics:

  • All compute nodes that were previously in the Condor system will move into the Condor partition;
  • All RCC users have access to this partition; it remains a free, permanent resource;
  • Maximum job execution time is 90 days;
  • MPI and OpenMP are available, but only for single-node jobs, because these nodes are not connected by an InfiniBand (IB) fabric (see the example after this list);
  • Both the Lustre and Panasas file systems are mounted on compute nodes in this partition;
  • This partition uses 'fair-share' scheduling, similar to Condor: the more jobs you submit in a given time period, the lower your job priority becomes;
  • This partition is already available to all RCC users. Run rcctool my:partitions to see details. Currently, there are only five nodes in this partition, but we will continue to migrate compute nodes as existing Condor jobs finish. When fully migrated, this partition will have access to approximately 950 cores.
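
As an illustration of the single-node restriction, an OpenMP job on this partition might be requested as sketched below; the partition name, thread count, and executable are assumptions, not fixed values:

    #!/bin/bash
    #SBATCH --partition=condor         # new Condor partition (name assumed; check rcctool my:partitions)
    #SBATCH --nodes=1                  # single node only; nodes are not connected by IB
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8          # placeholder OpenMP thread count
    #SBATCH --time=30-00:00:00         # must stay within the 90-day limit

    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    srun ./my_openmp_program           # placeholder executable

You can list the nodes currently in the partition with sinfo -p condor.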

Timeline

The timeline for this project begins on Monday, November 2, 2015, and will run for the next several months:

  • Monday, November 2, 2015 - We will disallow any new job submissions to the existing Condor cluster. Any jobs queued before this date will continue to run. As jobs finish and compute nodes are freed up, we will migrate them to the new scheduler.
  • Monday, January 4, 2016 - Any running Condor jobs that have not completed will be cancelled. In the interim, we will reconfigure Condor nodes to run under the Slurm scheduler and move them into the new Condor partition as existing Condor jobs drain off of them.

If you have any questions or comments, please let us know: support@rcc.fsu.edu