HPC Status Update
It has been a month since we upgraded to the Slurm scheduler on the HPC, and since then, the staff here at the RCC has been tweaking the configuration, fixing problems, and tuning the system to meet the needs of researchers who use it.
This update will provide an overview of where we are at, and what we are continuing to work on.
Summary
- We have done a lot of work to ensure owner-based jobs start in a timely manner. If you submit to an owner-based partition and feel like your jobs are not starting fast enough, please let us know.
- Job prioritization in Slurm is weighted more heavily toward larger jobs (i.e. jobs that use more cores). This means that smaller jobs submitted will take longer to start than they used to in MOAB. We are working on a way to mitigate this and will keep you updated.
- We are working on an issue where compute nodes periodically run out of memory and crash, causing all jobs running on that node to stop running. If your job mysteriously stops without any notification or meaningful error message, this is probably what happened.
- Jobs submitted to the backfill2 partition are preemptable, meaning owner-based jobs may preempt backfill2 jobs at any time. This was configured in Moab, but never fully worked.
- We have enabled a new experimental partition, quicktest, with a maximum run time of 10 minutes and a maximum of 8 cores. You can use this partition to test jobs without having to run them on the login nodes.
- We've enabled the sinfo command, which you can use to check node status.
- The HPC is running Red Hat v7.1, but Condor and Spear are still running Red Hat 6.5. This is causing some software inconsistencies for users, particularly for 'R'. We are planning to upgrade both systems in the near future, and will have more details soon.
Job Preemption in backfill2
There are two backfill partitions: backfill and backfill2. Both of these partitions have a maximum job run time of four hours. Jobs running in backfill2 can be preempted by owner jobs. This means that, while a backfill2 job may start sooner, it may be killed if an owner-based job needs to start.

Preemption is not enabled for the backfill partition, so jobs submitted here cannot be killed by owner jobs. Likewise, jobs submitted to the genacc_q partition (which can run for up to 90 days) cannot be preempted by owner jobs.
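If your workflow can tolerate being restarted, a sketch like the following shows one way to choose between the two partitions. The --requeue flag is a standard sbatch option that asks Slurm to return a preempted job to the queue rather than cancel it; the exact behavior of preempted jobs depends on the cluster's preemption configuration.

```shell
# Submit to backfill2: may start sooner, but can be preempted by owner jobs.
# --requeue asks Slurm to put the job back in the queue if it is preempted.
sbatch -p backfill2 --requeue --time=04:00:00 my_job.sh

# Submit to backfill: may wait longer to start, but will not be preempted.
sbatch -p backfill --time=04:00:00 my_job.sh
```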
Job Priority and Job Start Times
We have been tuning job priority and wait times in the Slurm scheduler. For much of July, owner-based jobs were not starting as fast as they should have. We have improved this considerably in the past few weeks, and are now seeing owner-based jobs start much faster. If you submit a job to an owner-based partition and feel it is not starting quickly enough, please let us know.
Also, Slurm places higher priority on jobs that use multiple nodes. This is appropriate, given that the primary use-case for the HPC is to handle larger jobs. However, it has created a practical issue where smaller jobs (those that use one node or less) wait in the queue much longer than they used to in MOAB.
We are testing a few ideas to decrease the wait time for these smaller jobs and will post another update soon when we have a solution.
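In the meantime, two standard Slurm commands can help you see how the scheduler is treating a pending job (sprio reports data from Slurm's multifactor priority plugin, so its output depends on that plugin being enabled):

```shell
# Show the priority factors behind your pending jobs:
sprio -u $USER

# Show Slurm's estimated start time for your pending jobs:
squeue --start -u $USER
```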
The quicktest Partition
We have noticed a number of users running MPI jobs on the login nodes. This is a violation of our policy, and it uses up resources on the system, preventing other users from being able to use the login nodes.
To mitigate this issue and provide you a better way to test your jobs, we have created a new partition called quicktest with a maximum job time of 10 minutes. If you want to test your code, you can submit it to quicktest, where it will start within minutes:
$ sbatch -p quicktest my_job.sh
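For example, a minimal batch script for quicktest might look like the following (my_mpi_program is a hypothetical executable; keep your request within the partition's 10-minute, 8-core limits):

```shell
#!/bin/bash
#SBATCH -p quicktest      # the 10-minute test partition
#SBATCH -n 4              # request 4 cores (partition maximum is 8)
#SBATCH -t 00:05:00       # request 5 minutes (partition maximum is 10)

srun ./my_mpi_program     # hypothetical program under test
```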
The sinfo Command
Many users have requested that we provide a mechanism for exposing the status of compute nodes (online/offline/error-state). As of today, we have enabled the Slurm sinfo command for all users. If you run...
$ sinfo -p [partition_name]
...you will see a list of nodes and their states. Refer to the Slurm documentation for full details about how to use sinfo.
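A few sinfo invocations you may find useful (all standard sinfo options):

```shell
# Summarize node states in one partition:
sinfo -p genacc_q

# One line per node, with long-format details:
sinfo -p genacc_q -N -l

# List only nodes that are down or drained:
sinfo --states=down,drain
```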
Out-of-Memory Issues
We have seen an occasional issue where certain jobs fill up the memory on a compute node, causing that node to crash. If other jobs are running on this node, they too will crash. The Slurm scheduler may or may not send an error to users when this occurs; oftentimes, jobs will just fail silently.
If your job failed without any meaningful error message, this is probably what happened. We are working on mitigating this issue, and will post a notice when we find a solution.
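Until then, one way to reduce the chance of a job exhausting a node's memory is to tell Slurm how much memory the job needs up front, using the standard sbatch memory options. Whether Slurm enforces these limits depends on the cluster's configuration, but accurate requests help the scheduler place jobs safely:

```shell
#SBATCH --mem=8G            # total memory per node for this job
# ...or request memory per core instead:
#SBATCH --mem-per-cpu=2G
```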
Condor and Spear
We upgraded the HPC to Red Hat Enterprise Linux v7.1 (RHEL7) as part of the system maintenance in July. We also upgraded many software packages.
However, we did not upgrade our Spear and Condor clusters to RHEL7 during the maintenance.
This means that we are currently supporting multiple versions of many software programs, such as "R". Obviously, we want to standardize everything as soon as possible.
In that regard, we have already started upgrading Condor (more details to come), and will upgrade Spear later this year.
Thanks for being patient while we work through all of the issues related to the upgrade, and we appreciate the helpful feedback you have provided. Please keep sending us tickets when you see issues or have questions.