Resolved: HPC Issues
Starting last Friday, we experienced a problem with our authentication system that caused a number of cluster nodes to fail. As a result, some jobs did not run, and others behaved unexpectedly. In some cases, you may have seen the message "srun: error: slurm_receive_msgs: Socket timed out on send/recv operation". In other cases, your job may have finished without producing any output files.
If you experienced any of these issues, we apologize. The issue is now resolved: we have fixed all of the affected nodes, and we believe everything is back to normal.
This issue also affected GPFS migrations. We have restarted all affected migrations, and they will finish today.
If you have any further issues with Slurm jobs, please let us know at support@rcc.fsu.edu.