System Maintenance (Residual issues: MATLAB, LAMMPS, engineering, nwchem)

    UPDATE - Tuesday, August 21 - 4:45pm - MATLAB is now working correctly on all nodes.  We are still working on the following known issues:

    1. Omnipath networking - this affects users in the engineering partitions on the HPC. Some jobs in these partitions may fail if they use cores distributed among multiple nodes. We expect to have the issue resolved no later than Wednesday.
    2. LAMMPS - Our software team is working on compilation issues with LAMMPS. Users can still compile LAMMPS from source in their own directories.
    3. nwchem libraries - Our software team is working on compilation issues with this package.
    4. Disappearing Jobs - Some users' jobs are disappearing after submission.  This is related to the software issues above, and we are working on a fix.
    5. GPU nodes - We are working on an issue related to drivers.  This should be resolved shortly.

    Thanks for your patience.  If you see any other issues, please let us know: support@rcc.fsu.edu.


          UPDATE - Monday, August 20 - 5pm - HPC, Spear, and all other services are now back online and available for general use.

          There are still four items that we are working on, and we expect to have them resolved shortly:

          1. Omnipath networking - this affects users in the engineering partitions on the HPC. Some jobs in these partitions may fail if they use cores distributed among multiple nodes. We expect to have the issue resolved no later than Wednesday.
          2. LAMMPS - Our software team is working on compilation issues with LAMMPS. Users can still compile LAMMPS from source in their own directories.
          3. nwchem libraries - Our software team is working on compilation issues with this package.
          4. MATLAB is not working yet on the login and Spear nodes. It is working on the HPC compute nodes, so MATLAB jobs will run. We expect to have this issue resolved no later than tomorrow.

            UPDATE - Monday, August 20 - 9am - We are continuing to work on bringing the storage system and the cluster back online. Our Systems Team uncovered a number of issues with the GPFS system over the weekend.  They are rebuilding the export nodes.  More updates shortly.


            UPDATE - Friday, August 17 - 4pm - We have completed most of the scheduled maintenance. However, a few issues require additional attention, so we are extending our maintenance period through 5pm on Monday, August 20.

            During this time, we will finish cleaning up residual issues. The GPFS filesystem will also undergo maintenance during the weekend.


            UPDATE - Friday, August 17 - 11am - The GPFS filesystem is offline as of 9am this morning while our systems team resets some components.  We are working on routing issues now, and the system should be back online shortly.


            UPDATE - Thursday, Aug 16 - 4:40pm - Progress is continuing on the upgrade.  Currently, we are approximately one day behind schedule, and we anticipate that maintenance will extend into the weekend.  We will post an update here and email all RCC users tomorrow with the updated schedule.

            Also, we are rebooting the GPFS filesystem tomorrow at 9am.  We expect the reboot to be brief, and we will post an update here as soon as it completes.


            UPDATE - Wednesday, Aug 15 - 4:35pm - We have successfully rebuilt a large portion of the cluster, and the nodes are now configuring themselves using our configuration management framework.  As of this afternoon, we have drained all owner nodes, and are proceeding to rebuild the remaining devices in the HPC cluster.  No services are currently fully online, but we are making progress towards completion of the maintenance period.


            UPDATE - Tuesday, Aug 14 - 9:48am - Spear and some older HPC compute nodes are offline.  We ran into some issues rebuilding Spear nodes yesterday.  Those issues have since been resolved, and we are now completing the rebuild of the Spear nodes.


              UPDATE - Monday, Aug 13 - 8:30am - We have begun maintenance as of 6am this morning. 

              • Spear nodes are offline; we are rebuilding them now.
              • We have drained all HPC compute nodes.  If you submit any Slurm jobs via the HPC Login Nodes, they will remain in 'pending' status.

              We are performing software maintenance starting Monday, August 13 through Friday, August 17.  We will post updates here as the project progresses.  The schedule is as follows:

              • Friday, Aug 10 - 9am: We will begin draining jobs off HPC compute nodes.
              • Sunday, Aug 12 - 5pm: We will disable HPC job submissions in Slurm.  The HPC cluster will stop accepting jobs at this time.  Already submitted jobs will continue to run.
              • Monday, Aug 13 - 6am: We will turn off Spear nodes and older general access HPC nodes and begin rebuilding them.  Any jobs running on these nodes will be cancelled.
              • Tuesday, Aug 14 - 6am: We will turn off and rebuild HPC Login nodes and all remaining HPC nodes.  This includes owner nodes.  Any jobs running on the HPC will be cancelled.
              • Friday, Aug 17 - 9am: We will reset the GPFS storage system.  Spear, HPC, and export nodes will be unavailable during the reboot.  We expect the reboot to last no more than an hour (but probably much less).
              • Friday, Aug 17 - 5pm: All nodes will be back online.  We will likely be able to bring services online earlier than this day and time, depending on how things go.

              For more details about what we're doing, please see our new announcement.