RESOLVED: Research Archival System Issue (Globus affected)
UPDATE (3pm) - All archival volumes have been brought back online and the Globus fsurcc#archival endpoint has been reactivated.
The root cause was a failed drive being replaced, which is a routine operation for a RAID configuration and usually does not impact operations. However, unusually heavy I/O on the system repeatedly interrupted the reconstruction.
We are investigating ways to prevent this combination of events from recurring. Please let us know if you continue to experience issues with the archival storage.
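For those interested, the sketch below shows one way to watch a rebuild of this kind from the command line. It is illustrative only, not a supported tool: it assumes the archival volumes are ZFS pools (as noted in the 12:15pm update below), and the pool name "archive" is a placeholder, not the real pool.

```python
#!/usr/bin/env python3
"""Minimal sketch: poll `zpool status` and report resilver progress.

Assumes the archival volumes are ZFS pools and that this runs on a host
with the zpool CLI available; the pool name below is a placeholder.
"""
import re
import subprocess
import time

POOL = "archive"  # placeholder pool name, not the real one


def resilver_scan_line(pool):
    """Return the 'scan:' line from `zpool status`, or None if absent."""
    out = subprocess.run(
        ["zpool", "status", pool],
        capture_output=True, text=True, check=True,
    ).stdout
    # zpool status reports e.g. "scan: resilver in progress since ..."
    match = re.search(r"^\s*scan:.*$", out, re.MULTILINE)
    return match.group(0).strip() if match else None


if __name__ == "__main__":
    while True:
        line = resilver_scan_line(POOL)
        print(line or "no scan activity reported")
        if not line or "resilver in progress" not in line:
            break
        time.sleep(60)  # check once a minute until the resilver finishes
```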
UPDATE (12:15pm) - We have traced the issue with our archival system to unusual I/O patterns and are working to determine their cause.
All ZFS volumes are currently unmounted, and we will bring them back online in the coming hours.
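As an illustration of how heavy pool I/O of this kind can be spotted (not the exact diagnostics we ran), the sketch below samples zpool iostat for one pool; the pool name and the threshold are placeholders.

```python
#!/usr/bin/env python3
"""Minimal sketch: sample pool I/O with `zpool iostat` to flag heavy load.

Assumes ZFS pools back the archival volumes; the pool name and the
ops-per-second threshold are placeholders, not measured values.
"""
import subprocess

POOL = "archive"        # placeholder pool name
OPS_THRESHOLD = 10_000  # arbitrary example threshold (ops/s)


def sample_iostat(pool, interval=5):
    """Return (read_ops, write_ops) per second over one sampling interval.

    Uses `zpool iostat -Hp pool interval 2`; the last line reflects
    activity during the interval rather than averages since import.
    """
    out = subprocess.run(
        ["zpool", "iostat", "-Hp", pool, str(interval), "2"],
        capture_output=True, text=True, check=True,
    ).stdout.strip().splitlines()
    # -H output columns: pool, alloc, free, read ops, write ops, read bw, write bw
    fields = out[-1].split("\t")
    return int(float(fields[3])), int(float(fields[4]))


if __name__ == "__main__":
    reads, writes = sample_iostat(POOL)
    print(f"{POOL}: {reads} read ops/s, {writes} write ops/s")
    if reads + writes > OPS_THRESHOLD:
        print("heavy I/O detected -- a resilver on this pool may be slowed")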
We are currently experiencing an issue with our Research Archival Storage System.
In order to stabilize the system, we are going to unmount the archival volumes from the export nodes and disable the archival endpoint in Globus. GPFS endpoints are not affected.
We will update this notice in a few hours, or as soon as this issue is resolved.