Resolved: HPC Issues

Starting last Friday, a problem with our authentication system caused a number of the cluster nodes to fail.  As a result, some jobs failed to run, and others behaved unexpectedly.  In some cases, you may have seen the message "srun: error: slurm_receive_msgs: Socket timed out on send/recv operation".  In other cases, your job may have finished without producing any output files.

If you experienced any of these issues, we apologize.  The issue is now resolved: we have fixed all of the affected nodes, and we believe everything is back to normal.

This issue also affected GPFS migrations.  We have restarted all affected migrations, and they will finish today.

If you have any further issues with Slurm jobs, please let us know: