HPC and Spear software upgrade will occur in August

This August, we will perform periodic maintenance on our HPC and Spear clusters.  Some downtime will be required.  The maintenance will allow us to upgrade the software on both clusters to newer versions and to make several infrastructure improvements.

The maintenance will begin Monday, August 13, and we expect all systems to be back online no later than Friday, August 17.  Some systems will be unavailable at scheduled points during this time (see tentative schedule below).

What we are doing

This year's software upgrade will allow us to accomplish the following:

  • Upgrade CentOS from v7.3 to v7.4 (release notes)
  • Install new versions of most software on the cluster.  We will publish a manifest listing all of the packages being upgraded, along with their new version numbers, as soon as it is available.
  • Upgrade the HPC Slurm Scheduler from v17.02 to v17.11 (release notes)
  • Implement improvements to some of the physical infrastructure (power, networking) in the datacenter
  • Deprecate native support for Lustre and Panasas in favor of our new GPFS storage system

Tentative Schedule

This schedule is based on the current project planning and may change as we get closer to the maintenance window.

  • Friday, Aug 10 - 9am: We will begin draining jobs off HPC compute nodes.
  • Sunday, Aug 12 - 5pm: We will disable HPC job submissions in Slurm, and the HPC cluster will stop accepting new jobs.  Jobs that have already been submitted will continue to run.
  • Monday, Aug 13 - 6am: We will turn off Spear nodes and older general access HPC nodes and begin rebuilding them.  Any jobs running on these nodes will be cancelled.
  • Tuesday, Aug 14 - 6am: We will turn off and rebuild HPC Login nodes and all remaining HPC nodes, including owner nodes.  Any jobs still running on the HPC cluster will be cancelled.
  • Friday, Aug 17 - 9am: We will reboot the GPFS storage system.  Spear, HPC, and export nodes will be unavailable during the reboot.  We expect the reboot to last no more than an hour, and probably much less.
  • Friday, Aug 17 - 5pm: All nodes will be back online.  We may be able to bring services online earlier than this, depending on how the maintenance progresses.
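
If you want to estimate whether a job could finish before its nodes are turned off, the arithmetic can be sketched as below.  This is only an illustration, not a tool we provide: the function names are hypothetical, the year is not stated in this announcement (2018 is assumed because August 13 falls on a Monday in that year), and the scheduler may start or hold jobs differently than this simple check suggests.  It parses the standard Slurm time-limit formats (such as "90", "12:00:00", or "1-12:00:00") and does not handle "UNLIMITED".

```python
from datetime import datetime, timedelta

# Assumption: the announcement does not state the year; 2018 is used for illustration.
SPEAR_AND_OLDER_NODES_OFF = datetime(2018, 8, 13, 6, 0)   # Monday, Aug 13 - 6am
REMAINING_HPC_NODES_OFF = datetime(2018, 8, 14, 6, 0)     # Tuesday, Aug 14 - 6am


def parse_time_limit(limit: str) -> timedelta:
    """Convert a Slurm time-limit string into a timedelta.

    Accepted forms: "minutes", "minutes:seconds", "hours:minutes:seconds",
    "days-hours", "days-hours:minutes", "days-hours:minutes:seconds".
    """
    days = 0
    if "-" in limit:
        day_part, rest = limit.split("-", 1)
        days = int(day_part)
        parts = [int(p) for p in rest.split(":")]
        # After the day count, fields are hours[:minutes[:seconds]].
        parts += [0] * (3 - len(parts))
        hours, minutes, seconds = parts
    else:
        parts = [int(p) for p in limit.split(":")]
        if len(parts) == 1:
            hours, minutes, seconds = 0, parts[0], 0
        elif len(parts) == 2:
            hours, minutes, seconds = 0, parts[0], parts[1]
        else:
            hours, minutes, seconds = parts
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def finishes_before(start: datetime, limit: str, cutoff: datetime) -> bool:
    """Return True if a job starting at `start` with time limit `limit`
    would reach its limit at or before `cutoff`."""
    return start + parse_time_limit(limit) <= cutoff


# Example: a job with a 36-hour limit that starts Saturday, Aug 11 at noon.
ok = finishes_before(datetime(2018, 8, 11, 12, 0), "1-12:00:00",
                     SPEAR_AND_OLDER_NODES_OFF)
print("Finishes before nodes are turned off:", ok)
```

The check uses the requested time limit rather than the actual run time, so it is conservative: a job that passes will end before the cutoff even if it runs for its full limit.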

Summary

We will publish updates and any changes to the schedule as we get closer to the maintenance window.  In the meantime, if you have any questions, issues, or requests, please let us know: support@rcc.fsu.edu.