HPC Status Update

We've been tuning, tweaking, and fixing the HPC since we upgraded the system in July, and we have lots of updates to report.

It has been a month since we upgraded to the Slurm scheduler on the HPC, and since then the staff here at the RCC has been tweaking the configuration, fixing problems, and tuning it to meet the needs of the researchers who use the system.

This update provides an overview of where things stand and what we are continuing to work on.

Summary

  • We have done a lot of work to ensure owner-based jobs start in a timely manner. If you submit to an owner-based partition and feel like your jobs are not starting fast enough, please let us know.
  • Job prioritization in Slurm is weighted more heavily toward larger jobs (i.e., jobs that use more cores). This means that smaller jobs will take longer to start than they did under Moab. We are working on a way to mitigate this and will keep you updated.
  • We are working on an issue where compute nodes periodically run out of memory and crash, causing all jobs running on that node to stop. If your job mysteriously stops without any notification or meaningful error message, this is probably what happened.
  • Jobs submitted to the backfill2 partition are preemptable, meaning owner-based jobs may preempt backfill2 jobs at any time. This was configured in Moab but never fully worked.
  • We have enabled a new experimental partition, quicktest, with a maximum run time of 10 minutes and a maximum of 8 cores. You can use this partition to test jobs without having to run them on the login nodes.
  • We've enabled the sinfo command, which you can use to check node status.
  • The HPC is running Red Hat Enterprise Linux 7.1, but Condor and Spear are still running version 6.5. This is causing some software inconsistencies for users, particularly for 'R'. We are planning to upgrade both systems in the near future and will have more details soon.

Job Preemption in backfill2

There are two backfill partitions: backfill and backfill2. Both of these partitions have a maximum job run time of four hours. Jobs running in backfill2 can be preempted by owner jobs. This means that, while a backfill2 job may start sooner, it may be killed if an owner-based job needs to start.

Preemption is not enabled for the backfill partition, so jobs submitted here cannot be killed by owner jobs. Likewise, jobs submitted to the genacc_q partition (which can run for up to 90 days) cannot be preempted by owner jobs.
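
If you are unsure whether one of your backfill2 jobs was preempted, Slurm's accounting tools can tell you. The commands below are only a sketch: <jobid> is a placeholder, and the exact fields available depend on how job accounting is configured.

$ sbatch -p backfill2 my_job.sh
$ sacct -j <jobid> --format=JobID,Partition,State,Elapsed

A job that was killed to make room for an owner-based job will typically show a state of PREEMPTED.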

Job Priority and Job Start Times

We have been working on tuning job priority and wait times in the Slurm scheduler. For much of July, owner-based jobs were not starting as quickly as they should have. We have improved this quite a bit in the past few weeks and are now seeing owner-based jobs start much faster. If you submit a job to an owner-based partition and feel it is not starting quickly enough, please let us know.

Also, Slurm places higher priority on jobs that use multiple nodes. This is appropriate, given that the primary use case for the HPC is to handle larger jobs. However, it has created a practical issue: smaller jobs (those that use a single node or less) wait in the queue much longer than they did under Moab.

We are testing a few ideas to decrease the wait time for these smaller jobs and will post another update soon when we have a solution.
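
In the meantime, if you are wondering why a pending job has not started, the following commands may help. This is only a sketch with a placeholder job ID, and what the scheduler reports depends on our current configuration.

$ squeue --start -j <jobid>
$ sprio -j <jobid>

The first command shows the scheduler's estimated start time for the job (when one has been computed); the second shows the weighted factors that make up the job's priority.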

The quicktest partition

We have noticed a number of users running MPI jobs on the login nodes. This is a violation of our policy, and it consumes resources on the login nodes, preventing other users from being able to use them.

To mitigate this issue and provide a better way to test your jobs, we have created a new partition called quicktest, with a maximum run time of 10 minutes and a maximum of 8 cores. If you want to test your code, you can submit it to quicktest, where it will start within minutes:

$ sbatch -p quicktest my_job.sh
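
If you want to be explicit about staying within the partition limits, you can also request a core count and a time limit when you submit. For example (a sketch; adjust the task count and time to what your test actually needs):

$ sbatch -p quicktest -n 8 -t 10:00 my_job.sh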

The sinfo command

Many users have requested that we provide a mechanism for exposing the status of compute nodes (online/offline/error-state). As of today, we have enabled the Slurm sinfo command for all users. If you run...

$ sinfo -p [partition_name]

...you will see a list of nodes and their states. Refer to the Slurm Documentation for full details about how to use sinfo.
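
If you want a node-by-node view rather than a summary, sinfo can also list each node individually. For example (using backfill2 purely as an illustration):

$ sinfo -N -l -p backfill2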

Out-of-Memory Issues

We have seen an occasional issue where certain jobs fill up the memory on a compute node, causing that node to crash. If other jobs are running on that node, they will crash, too. The Slurm scheduler may or may not send an error to users when this occurs; oftentimes, jobs will simply fail silently.

If your job failed without any meaningful error message, this is probably what happened. We are working on mitigating this issue, and will post a notice when we find a solution.
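
While we work on a fix, two things may help: check how much memory a finished job actually used, and request memory explicitly when you submit. The commands below are a sketch; <jobid> is a placeholder, and the 8G value is only an example that you should adjust for your workload.

$ sacct -j <jobid> --format=JobID,MaxRSS,ReqMem,State
$ sbatch -p genacc_q --mem=8G my_job.sh

MaxRSS is the peak memory the job used; requesting memory up front lets the scheduler place the job on a node with enough free memory.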

Condor and Spear

We upgraded the HPC to Red Hat Enterprise Linux 7.1 (RHEL7) as part of the system maintenance in July, and we also upgraded many software packages.
However, we did not upgrade our Spear and Condor clusters to RHEL7 during the maintenance.

This means that we are currently supporting multiple versions of many software programs, such as "R". Obviously, we want to standardize everything as soon as possible.

In that regard, we have already started upgrading Condor (more details to come), and will upgrade Spear later this year.


Thanks for being patient while we work through all of the issues related to the upgrade, and we appreciate the helpful feedback you have provided. Please keep sending us tickets when you see issues or have questions.