Job failures on hpc-[d30/d31] nodes

Over the past few days, we've noticed a large number of jobs failing on the HPC when they are assigned to a certain set of nodes.

The affected nodes are all nodes in the D30 and D31 racks, and possibly the nodes in the D32 rack.  (hpc-d30..., hpc-d31..., hpc-d32...).

Errors typically look like the following:

[HPC] Your job 1234567 is in state NODE_FAIL
 
Your HPC Job 1234567 entered the following state: NODE_FAIL

You may also see errors like this:

srun: error: Node failure on hpc-d32-1-3
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** JOB 123457 ON hpc-d32-1-1 CANCELLED AT 2018-06-18T09:35:12 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***

We are working on identifying the problem and preparing a solution to this problem, and we will post further updates as soon as we have more information about what is causing the issue.