Status Report on the HPC

Here are a few updates on the HPC, including the state of accounts, job preemption, and other items.

Hi RCC Partners,

We'll continue to send status reports on the HPC for another week or two while we are ironing out the residual issues from the upgrade to Slurm. Here is a summary of the issues that we know about and are working on:

  1. The Slurm controller has crashed several times this week. Each crash caused all Slurm commands (sbatch, squeue, etc.) to stop responding. Jobs that were already running were not affected. We have determined that the issue occurs when a job is submitted with the -w option (used to request specific nodes for a job). We believe a faulty configuration file was the cause, and we corrected it today. We will keep monitoring the controller to confirm that the fix holds.
  2. We've enabled the Slurm Health Check feature on the cluster, which takes nodes offline when there are issues. Currently, there are 18 nodes offline, and we will bring them up as we fix them.
  3. We are tuning the job submission parameters to ensure that jobs start in a timely manner. Our Systems Team plans to focus heavily on this task next week. Specifically:
    1. We've noticed that a small number of jobs are not starting as early as they should be in certain partitions.
    2. We have not yet turned on job preemption, but we intend to do so soon. Once it is enabled, any backfill2 job running on an owner-based node will be killed if the node owner submits a job.
  4. If you've been granted access to a partition but see an error similar to "Invalid account or account/partition combination specified", this generally means your user account doesn't have permission to submit to that partition. We are making improvements to our account synchronization script, so it may take up to a day for your partition access to propagate to the scheduler. If you believe the error is a mistake, please let us know by submitting a ticket. As soon as we are confident that the script is running smoothly, we'll propagate account settings more frequently.
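
If you'd like to check what the scheduler currently has on record for your account before opening a ticket (item 4 above), a minimal sketch along the following lines may help. It assumes that the sacctmgr command is available on the login node and that regular users are allowed to query their own associations; the function names are our own illustration, not part of Slurm.

```python
import getpass
import subprocess


def parse_associations(text):
    """Parse `sacctmgr -n -P` output with Account|Partition columns into pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        account, _, partition = line.partition("|")
        pairs.append((account, partition))
    return pairs


def my_associations(user=None):
    """Ask sacctmgr for the account/partition associations of one user.

    Assumption: sacctmgr is on PATH and permits this query for the caller.
    """
    user = user or getpass.getuser()
    result = subprocess.run(
        ["sacctmgr", "-n", "-P", "show", "associations",
         f"user={user}", "format=Account,Partition"],
        capture_output=True, text=True, check=True,
    )
    return parse_associations(result.stdout)
```

Calling my_associations() on a login node returns a list of (account, partition) pairs; an empty partition field means the association is not tied to a specific partition. If the partition you were granted does not appear, the synchronization may simply not have run yet.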

That's it for now. If you have any questions, see other issues, or need help with anything, please let us know: support@rcc.fsu.edu.