Hurricane Hermine Recovery - Spear Online; Lustre recovery proceeding
UPDATE - Thurs, Sept 15, 3:20pm - Lustre at 36% recovered
At this time, only three services remain affected by Hermine:
- Lustre data - We have recovered 36% of the data on Lustre that was affected by the loss of one of our OSTs. This process is moving more slowly than expected and will likely be finished sometime early next week.
- 2015/2016 HPC Nodes - This, too, will be a slow recovery process. We are working with the vendor to replace damaged hardware, and we will provide periodic updates as we fix or replace the damaged nodes.
- NoleStor / WOS - No hardware damage is evident on this system, but we have not yet fully tested and restored service.
We will send emails only periodically from this point forward, but we will continue to update this page and our Twitter account.
UPDATE - Mon, Sept 12, 3:20pm - Spear Online, Lustre recovery proceeding
We are happy to announce that our Spear system is available again, and you can now log in and use it. We appreciate all of your patience during the recovery process.
We have also made Lustre available, but not all data has been fully restored yet. We anticipate this will take another three to four days.
Some files were damaged by a failed device (see details below), and we are working to restore those from a recent backup. You can recognize broken files as follows (a small scan script is sketched after this list):
- If you try to list a file that you know exists, you get a "No such file or directory" error.
- If you try to read a file with vi, you see a "READ ERRORS" message in the vi status bar.
- If you perform a directory list ("ls -l"), you see broken file metadata listed with question marks instead of file information; e.g.:
-????????? ? ? ? ? ? .bash_history
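If you have a large number of files and want to find the damaged ones without checking them by hand, a small script along the lines of the sketch below can help. This is only an illustration, not an RCC-provided tool; the default starting directory is a placeholder for your own Lustre path.

    #!/usr/bin/env python3
    """Minimal sketch: walk a directory tree and list files whose metadata or
    contents cannot be read, which matches the symptoms described above.
    The default starting directory is a placeholder -- pass your own path."""
    import os
    import sys

    root = sys.argv[1] if len(sys.argv) > 1 else "/lustre/your_username"  # placeholder path

    damaged = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                os.stat(path)                 # broken metadata raises OSError
                with open(path, "rb") as handle:
                    handle.read(4096)         # broken contents raise OSError (I/O error)
            except OSError as err:
                damaged += 1
                print(f"{path}\t{err.strerror}")

    print(f"{damaged} possibly damaged files found", file=sys.stderr)

Run it against your own directory (for example, the script above with your Lustre path as the first argument) to get a list you can compare against the restore list mentioned below.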
The cause of all of this is that one of the 12 Object Storage Target (OST) devices in Lustre experienced a problem. This most likely started shortly after the storage system was handed over to us by Dell a month ago. However, we discovered it only when we tried to restore the Lustre service after Hermine.
When we rebooted Lustre after Hermine, the system had already internally marked the OST as "invalid". Thus, all data on that OST was considered "invalid" and a rollback occurred, erasing between 10 and 20 TB of data.
Fortunately, we still have a backup of the system from the Lustre maintenance we performed this summer. We were able to extract a list of files on that faulty OST, and we are now restoring data. This process will take another 3 to 4 days.
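For those curious about the mechanics, the restore amounts to walking that file list and copying each affected file from the backup filesystem back onto Lustre. The sketch below illustrates the idea only; the list, backup, and Lustre paths are placeholders, and the actual restore involves additional checks.

    #!/usr/bin/env python3
    """Simplified sketch of a list-driven restore: for each path named in a
    file list, copy the backup copy onto Lustre unless the Lustre copy is
    already readable. All paths are placeholders, not the real locations."""
    import os
    import shutil

    FILE_LIST = "files_on_failed_ost.txt"   # placeholder: list of files that lived on the failed OST
    BACKUP_ROOT = "/backup"                 # placeholder: root of the backup filesystem
    LUSTRE_ROOT = "/lustre"                 # placeholder: root of the live Lustre filesystem

    def is_readable(path):
        """True if the file's metadata and first block can be read without I/O errors."""
        try:
            os.stat(path)
            with open(path, "rb") as handle:
                handle.read(4096)
            return True
        except OSError:
            return False

    with open(FILE_LIST) as listing:
        for line in listing:
            rel = line.strip().lstrip("/")
            if not rel:
                continue
            dest = os.path.join(LUSTRE_ROOT, rel)
            src = os.path.join(BACKUP_ROOT, rel)
            if is_readable(dest):
                continue                    # this file survived; leave it alone
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            shutil.copy2(src, dest)         # restore data and timestamps from backup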
Please let us know if you would like a list of all files that are being restored, or if you have any questions or concerns (support@rcc.fsu.edu). We will send out another notice when the restore is complete, or if it is delayed for any reason.
UPDATE - Fri, Sept 9, 9pm
The Lustre recovery will take several more days. We will be able to post a more accurate time estimate on Monday.
Regardless, we are going to make Spear and Lustre available on Monday morning, since most of the data is intact. The remaining data is being copied from our backup.
If you use Lustre on Monday and encounter any files that produce I/O errors when attempting to read metadata or file contents, you have found a damaged file. Once the restoration fully completes, these files should work correctly. We estimate this will finish sometime next week.
Regarding HPC, we are working with the hardware vendor (Dell) to repair our failed 2015/2016 HPC nodes and will provide updates as soon as that process moves further along. In the meantime, we will reprovision as many alternative nodes as possible for affected owner-based partitions.
UPDATE - Thurs, Sept 8, 4pm
We are still working on recovering the Lustre filesystem, and we estimate at least two more days of downtime.
Our recovery efforts showed that we lost one Object Storage Target (OST) device out of 12 in the system during the storm. This OST contained about 10TB of data spread more or less evenly across the filesystem.
The good news is that we have a backup of all of this data on a separate filesystem, and we know exactly which files were affected. So, this afternoon we started copying those files back onto Lustre. Unfortunately, it will take us at least one full day to copy this data, and then some additional time to ensure the system is stable. Thank you for your patience during this process.
We are still working on the HPC nodes mentioned in yesterday's update. Two network switches were destroyed as a result of the storm damage. We received replacement switches this afternoon, and are installing them today. We are also still evaluating the HPC nodes for the partitions listed in yesterday's update.
If you have questions or concerns, please let us know: support@rcc.fsu.edu.
UPDATE - Wed, Sept 7, 4:25pm
The HPC is fully online with the exception of the 2016 compute nodes. Many of those nodes suffered hardware damage, so we are temporarily disabling them for repair. The following partitions are affected:
- amecfd_q
- amesolids_q
- beerli_q
- chemistry_q
- coaps14_q
- eoas_q
- hongli_q
- hu_q
- lin_q
- zhou_q
We are going to reconfigure the scheduler to schedule jobs for these partitions on working nodes while we perform repairs on broken nodes. We will provide updates on that process tomorrow.
The Lustre system is powered on, but there are volume errors. We are working with the vendor to evaluate and correct these errors. This may take at least several days. In a worst-case scenario, we will have to re-copy all data from our Gluster backup onto the Lustre system, which will be a time-intensive process.
Spear and other Lustre-dependent servers are unavailable. This includes the Lustre mounts on the HPC login nodes.
UPDATE - Wed Sept 7, 11:45am
We are working on several things today. Here is an overview of our current situation:
- Lustre Filesystem
- We are working on recovering this, but rebuilding the Lustre system may take more than a day.
- Currently, we see no hardware failures.
- Data is still backed up on our temporary GlusterFS (used during the software upgrade) in case there are any major issues with the system.
- 2016 Compute Nodes
- These nodes were in an area of the data center with potential water damage.
- We will begin powering these up today to see if there are any hardware failures.
- One network switch that serves these nodes was damaged during the storm. We will replace it as soon as possible.
- OpenStack VMs
- We will attempt to power this system on today and report on it before the end of the day.
- This includes MorphBanks, I2B2, and other iDigInfo systems.
- Various other systems
- We are waiting on Lustre to restore the Genomaize and Masoct VMs. The VMs were brought online this morning and are working fine, but both rely on external storage that is still down, so we cannot yet say that they are fully recovered.
- The DDN WOS (NoleStor) is online, but a switch that serves it was damaged. We will need to replace it before it is available again.
- We are testing various IMB systems today, but Lustre will need to be available before they can be fully restored.
On Thursday evening, Hurricane Hermine knocked out the FSU chilled water cooling system, which serves one of our two data centers in Dirac Science Library. This caused the temperature in the facility to soar, and triggered the sprinkler system.
On Friday evening, our staff were able to enter the facility and immediately turn off all systems that were still running. Since then, we've been working to recover our systems.
As of 4pm on September 6, the following services are online:
- HPC login nodes and most compute nodes
- Panasas file system (use HPC login nodes above to gain access)
- Virtual Machine cluster and hosted websites
- RCC Website, ticketing system, and web services
The following services remain OFFLINE, but we are working on them:
- Spear Systems
- Lustre filesystem
- New (2016) HPC compute nodes
You can get frequent updates via our Twitter account: https://twitter.com/fsurcc.