Troubleshooting HPC Jobs

This page lists common questions and issues that users experience when using the HPC cluster.

Network Limitations

Note that HPC compute nodes are unable to access data from outside of the private network.  Once your job begins, it can access only data stored on our GPFS or Archival systems.  This means that if you need to transfer data from an outside server, it needs to be done before your job is submitted to Slurm.

HPC Login nodes and the export server can access data from anywhere on the Internet.
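As a sketch, a workflow that stages data from an outside server would run the transfer on a login node before submitting the job.  The hostnames, paths, and script name below are hypothetical placeholders:

```shell
# Run these on a login node, NOT inside a job script -- compute
# nodes cannot reach hosts outside the private network.

# Stage input data from an external server (hypothetical host and path)
scp user@data.example.org:/data/input.tar.gz /gpfs/home/$USER/project/

# Unpack it onto GPFS so compute nodes can read it
tar -xzf /gpfs/home/$USER/project/input.tar.gz -C /gpfs/home/$USER/project/

# Only now submit the job that consumes the staged data
sbatch my_job.sh   # my_job.sh is a placeholder for your submit script
```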

Common reasons that jobs don't start quickly

When you submit a job to the HPC cluster, the Slurm scheduler assigns it a job priority number that determines how soon the system will attempt to start the job.  Many factors affect the job priority, including resources requested, how those resources are distributed on the cluster, and how many jobs you have submitted recently.

There are several reasons your job may not run as soon as you expect.  Typically, you can solve these issues by tuning your submission script parameters.  Common reasons include:

Asking for too many cores on a single node

This usually happens when you specify both the number of tasks (-n) and the number of nodes (-N) in your submit script.

Most nodes in the HPC cluster contain between 16 and 64 cores.  If you request more cores than any node in the partition has, the job will not fail immediately; instead, it will remain in pending status indefinitely until cancelled.

Similarly, if you request a large number of cores on a single node, even if the job is able to run, it may take a very long time to allocate resources for it.
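As an illustrative sketch, a request like the following can pend forever if no node in the partition has 96 cores (the partition name and core counts are examples, not a statement about actual cluster hardware):

```shell
#!/bin/bash
# Problematic: forces all 96 tasks onto a single node, which may
# not exist in this partition
#SBATCH --nodes=1
#SBATCH --ntasks=96
#SBATCH --partition=genacc

# Better: drop --nodes and keep only --ntasks, so Slurm is free to
# spread the tasks across several nodes

srun ./my_program   # my_program is a placeholder for your executable
```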

Asking for a single core on too many nodes

The more nodes your job requires, the harder it is for the Slurm scheduler to find resources to run it.  It is far more efficient to let your job run on an arbitrary number of cores per node, depending on where its tasks can be scheduled.
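For instance, rather than pinning one task to each of many nodes, you can let Slurm place the tasks wherever cores are free.  The values below are illustrative:

```shell
# Hard to schedule: requires 16 distinct nodes to be available
# at the same time
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=1

# Easier to schedule: 16 tasks on however many nodes Slurm chooses
#SBATCH --ntasks=16
```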

Not configuring memory parameters optimally

Typically, each processor bus in a node has RAM in multiples of 4GB.  However, a small portion of this RAM is reserved for overhead processes, so a single job can never occupy all of it without crashing the node.  The Slurm scheduler has been tuned to take this into account.

If you wish to specify how much RAM your job needs (using the --mem or --mem-per-cpu parameters), it is best to specify a multiple of 3.9GB.  This ensures that your job doesn't request more RAM than the node is able to allocate.
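A minimal sketch of such a request, assuming a hypothetical 4-task job (3900M is just under 3.9GB, since Slurm memory values default to megabytes):

```shell
# Request ~3.9GB per core so the total stays within what a node
# can actually allocate (task count and values are illustrative)
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=3900M

# Total request: 4 x 3900M = 15600M, which leaves headroom for
# overhead processes on a node with 16GB RAM banks
```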

Explicitly specifying number of nodes using the -N (--nodes) parameter

It is far more efficient to let the Slurm scheduler allocate cores for you across an arbitrary number of nodes than to wait for the specific number of nodes you request to become available.  Some jobs must fix the number of nodes due to software or algorithmic performance requirements, but for all other jobs it is best to omit this parameter from your submit scripts.

Jobs from owner partitions may be occupying a node

Each compute node in the HPC is shared between free, general access partitions (genacc, condor, backfill, etc.) and one or more owner-based partitions.  Owner-based partitions belong to research groups that have purchased HPC resources.  These groups get priority access to the nodes that they have purchased, which may delay jobs in general access partitions.

Additionally, jobs submitted to the backfill2 partition may occasionally be cancelled due to pre-emption.  This is because jobs in backfill2 run in free time-slots on owner nodes.  When an owner job cannot start because backfill2 jobs are occupying its resources, those jobs will be cancelled.  If you want to avoid pre-emption, you can submit to the backfill (not backfill2) partition.  Jobs in this partition may take slightly longer to start, but they are not subject to pre-emption.
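The trade-off above can be sketched as a choice of partition at submission time (my_job.sh is a placeholder for your submit script):

```shell
# Pre-emptable, but usually starts sooner:
sbatch --partition=backfill2 my_job.sh

# Never pre-empted, but may wait slightly longer to start:
sbatch --partition=backfill my_job.sh
```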

Running too many small jobs

Generally speaking, the HPC cluster is tuned to optimize start times for larger jobs, rather than large numbers of smaller jobs.  This varies depending on which partition you submit your job(s) to, and on a number of other factors.

In addition, we utilize a fair share algorithm for determining job priority.  The more jobs you submit in a given time period, the lower your fair share score becomes, which in turn lowers your job priority.  This ensures that a single HPC user running thousands of jobs doesn't crowd out users who submit fewer jobs.