US vSAN Network Status

Reason for Outage (RFO) report

October 16, 2023

Rochen has released the following Reason For Outage (RFO) report.

Introduction

On October 1st, 2023, Rochen experienced a major outage due to a series of failures in our VMware vSphere environment in Dallas, TX, United States. Rochen utilizes elements of the vSphere product group to provide redundant infrastructure for our services in this location. Two components were involved in this outage: VMware ESXi and VMware vSAN. VMware ESXi is a bare metal hypervisor that hosts Virtual Machines (VMs). VMware vSAN is a software-defined storage solution that allows for local or distributed direct-attached storage.

VMs running within the vSphere environment are used to host the underlying infrastructure for various Rochen services, including our Solo, Growth, Pro and Reseller hosting plans, Managed Cloud Servers (MCS), Premium Web Hosting (PWH) plans and other services such as Managed Load Balancing. The situation did not result in a failure of all VMs running within the affected vSphere environment. Many VMs remained online. However, it resulted in a major outage for a significant portion of our customers in this location.

Outage and Recovery

Throughout this RFO, we will refer to the two ESXi hypervisors that resulted in this outage as hypervisors “A” and “B.” Please note for clarity, however, that there were more than two ESXi hypervisors as part of the affected vSphere environment. VMware vSAN can distribute small amounts of data for each VM across many ESXi hypervisors.

At approximately 0300 UTC on October 1st, 2023, our engineering team performed routine maintenance to apply software updates for our ESXi hypervisors, including security patches. To proceed with these updates, we first placed each ESXi hypervisor into “maintenance mode,” which evacuated the “compute” but not the “storage” portion of running VMs to other ESXi hypervisors. We then applied the software updates and rebooted the ESXi hypervisor to finalize installation. We performed the updates in this manner with the goal of applying security updates as soon as possible while not disrupting services and keeping a level of redundancy in place. We ensured the VMs’ compute resources were transferred to another available ESXi hypervisor without downtime, keeping storage available to the VMs from vSAN on another hypervisor.

Shortly after rebooting ESXi hypervisor “A,” we experienced vSAN Solid State Disk (SSD) hardware failures on a separate ESXi hypervisor “B.” This was not a recoverable failure, and the data on ESXi hypervisor “B” would need to be rebuilt from other data sources on other ESXi hypervisors. VMs utilizing disks or swap objects across these two hypervisors were immediately powered off due to their storage objects on vSAN entering an All Paths Down (APD) status.

ESXi hypervisor “A” came back online with storage objects having older Configuration Sequence Numbers (CSNs) than the vSAN cluster was aware of. The resulting lack of quorum created a deadlock: ESXi hypervisor “B” no longer had the data and needed to sync it from ESXi hypervisor “A,” while ESXi hypervisor “A” had data that was considered stale and needed to sync from ESXi hypervisor “B.”
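
The stale-copy situation described above can be illustrated with a small toy model. This is a simplified sketch, not vSAN's actual algorithm: the `Replica` class, the `object_state` function and the CSN values are all hypothetical, and real vSAN objects also involve witness components and votes that are omitted here.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    host: str
    csn: int        # last configuration sequence number this copy saw
    readable: bool  # False once the underlying disks have failed


def object_state(replicas, cluster_csn):
    """Classify one storage object from its replicas (toy model only)."""
    readable = [r for r in replicas if r.readable]
    if any(r.csn == cluster_csn for r in readable):
        return "accessible"
    if readable:
        return "inaccessible: only stale copies remain"
    return "inaccessible: no readable copies"


# Hypothetical values: "A" rebooted and missed an update; "B" lost its disks.
replicas = [
    Replica(host="A", csn=41, readable=True),
    Replica(host="B", csn=42, readable=False),
]
print(object_state(replicas, cluster_csn=42))
# prints: inaccessible: only stale copies remain
```

In this model neither copy can serve as the source for a resync: the only copy at the current sequence number is unreadable, and the only readable copy is out of date.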

This left all affected objects inaccessible and unrecoverable, and recovery from backups, stored separately from the VMware infrastructure, was required. Although, in some cases, only a small amount of the total data for each VM may have been affected, it was enough to cause significant corruption, and many VMs required a bare metal restore. Outcomes varied: some VMs would not boot at all, some would boot but had corrupt data, and some were not impacted and remained online throughout the incident. Due to the amount of data needing to be restored and each affected VM being brought up one by one, the restore times varied significantly for individual customers, with some of the larger PWH VMs, in particular, taking much longer.

Future Mitigation

Rochen is taking steps to ensure we can better mitigate such an event in the future.

We are changing our maintenance procedures to fully evacuate both the “compute” and “storage” portions of VMs from ESXi hypervisors when any updates are applied to ensure all copies of all data are accessible at all times. This is a much more resource- and time-intensive process, meaning updates will not be applied as quickly. From our discussions with VMware over the past two weeks, we have confirmed that this is not something that should be required and is a process that we generally reserve for the permanent removal of a vSAN cluster member. This is, however, a trade-off we feel we must make going forward.
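
To illustrate why evacuating storage as well as compute helps, here is a minimal sketch that counts how many copies of an object stay reachable when a second host fails while another is rebooting. It assumes a hypothetical object mirrored on two hosts plus one spare host; the function and host names are illustrative only and do not correspond to any VMware API.

```python
def copies_reachable(placement, maintenance_host, failed_host,
                     evacuate_storage, spare_host):
    """Count copies still reachable while one host reboots and another fails."""
    hosts = set(placement)
    if evacuate_storage and maintenance_host in hosts:
        # Full evacuation: move the copy to the spare host before the reboot.
        hosts.discard(maintenance_host)
        hosts.add(spare_host)
    return len(hosts - {maintenance_host, failed_host})


# Hypothetical object mirrored on hosts "A" and "B", with spare host "C".
print(copies_reachable({"A", "B"}, "A", "B", False, "C"))  # prints 0
print(copies_reachable({"A", "B"}, "A", "B", True, "C"))   # prints 1
```

With compute-only evacuation, the reboot of one host plus a failure on the other leaves no reachable copy; with storage evacuated first, one copy survives the same pair of events.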

While redundancy was in place, in light of this event, Rochen is planning for and bringing online additional ESXi and vSAN capacity to further increase our Failures to Tolerate (FTT) levels across the vSphere environment. The additional physical hardware required to increase the FTT levels is already on-site at the Dallas data center and is currently being deployed.
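
As a rough guide to what a higher FTT level costs in hardware, the sketch below applies the commonly documented rule for mirrored (RAID-1) vSAN storage policies: tolerating n failures needs n + 1 copies of the data and at least 2n + 1 hosts. Whether the affected environment uses mirroring is an assumption, since this report does not say, and the function name is hypothetical.

```python
def mirroring_requirements(ftt):
    """Return (data_copies, minimum_hosts) for a mirrored policy with this FTT."""
    if ftt < 0:
        raise ValueError("ftt must be zero or greater")
    return ftt + 1, 2 * ftt + 1


for ftt in (1, 2, 3):
    copies, hosts = mirroring_requirements(ftt)
    print(f"FTT={ftt}: {copies} copies, at least {hosts} hosts")
```

Under this rule, each additional tolerated failure adds one full copy of the data and two more hosts, which is consistent with the need for additional physical hardware.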

Lastly, we will be conducting a review of our backup systems and processes to see if any improvements can be made. While the backup systems did work to recover all data, the restore times were significantly longer than we would like due to the severity of this incident and the massive amounts of data needing to be recovered.
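
The relationship between data volume and restore time is simple arithmetic, sketched below. The data size and throughput figures are hypothetical and are not taken from this incident; the output format only mimics the style of the estimates in the status updates (for example “1d 18h” or “4h 58m”).

```python
def format_eta(seconds):
    """Format a duration as 'Xd Yh' (a day or more) or 'Xh Ym' (under a day)."""
    minutes = int(seconds // 60)
    days, rem = divmod(minutes, 24 * 60)
    hours, mins = divmod(rem, 60)
    if days:
        return f"{days}d {hours}h"
    return f"{hours}h {mins}m"


def restore_eta(remaining_gb, throughput_mb_s):
    """Estimate time to restore remaining_gb at a sustained throughput_mb_s."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    return format_eta(remaining_gb * 1000 / throughput_mb_s)


# Hypothetical example: 10,000 GB remaining at a sustained 100 MB/s.
print(restore_eta(10_000, 100))  # prints 1d 3h
```

An estimate of this kind is only as good as the throughput assumption, which can change when many restores run at the same time.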

Conclusions

The entire team at Rochen fully appreciates the disruption and frustration this outage caused. In our twenty years of providing web hosting services, this incident is the most severe we have experienced, and it is something we are learning from and do not want to see repeated. We sincerely thank everyone for their understanding, patience and generosity shown to us while we worked to resolve the matter and now continue to make further improvements.

Outage Status Updates October 2nd through 11th, 2023

This thread has been created for clients to be able to view and track our progress regarding emergency US vSAN maintenance work that is currently ongoing. 

  • October 11, 2023: 12:44 pm UTC / 8.44 am EDT
    • US505 restores are now fully complete.
    • The second Bare Metal Restore (BMR) attempt to restore US505 was successful and we did not have to fall back on the alternate file restore method that was running in parallel.
    • This final restore of US505 completes our immediate work. All servers and services are now online and functioning normally.
    • A full Reason For Outage (RFO) report will be issued by the end of this week.
    • Our entire team thanks all of our customers who were impacted by this unprecedented situation once again for their patience due to the amount of time it took to get some services restored and the kindness and generosity shown to us throughout.
  • October 11, 2023: 08:40 am UTC / 4.40 am EDT
    • The current estimate for the US505 Bare Metal Restore (BMR) is:
      US505: 4h 58m
    • We continue to be thankful for your continued patience. As noted in previous updates, our alternate file restore continues from a different backup server to a different target server as a fallback to the above BMR.
  • October 10, 2023: 22:34 UTC / 6.34 pm EDT
    • The current estimate for the US505 Bare Metal Restore (BMR) is:
      US505: 14h 12m
    • Thank you again everyone. We will have a further update tomorrow. Our support team remains available 24/7 through the My Rochen customer portal. Please also refer to our previous few updates for more information about the US505 restore processes.
  • October 10, 2023: 10:51 am UTC / 6.51 am EDT
    • The current estimate for the US505 Bare Metal Restore (BMR) is:
      US505: 23h 7m
    • As noted previously, we also have a second file-based restore running from a different backup system to a different target server in case we run into issues with this second US505 BMR attempt.
    • Service continues without issue to all other servers already restored. We continue to thank customers still impacted by US505 though as we work diligently to get this final server online.
  • October 9, 2023: 18:39 UTC / 2.39 pm EDT
    • US505 restore efforts continue after the first attempt ran into issues. We are now restoring US505 from two separate backup systems to two separate target servers using both a Bare Metal Restore (BMR) and alternate files restore method.
    • The current estimate is:
      US505: 1d 18h
    • This restore time is obviously longer than we would like and we hope will improve. We want to be as transparent as possible with customers who still are impacted by the US505 outage though.
    • If you are impacted by US505’s continued outage, please reach out via ticket if we can help you with a temporary hosting solution for your emails, landing pages and/or personal backups as outlined here.
    • Our My Rochen portal remains open 24/7 for technical support, and response times have improved significantly as service has been restored to most impacted customers.
    • All other services and servers except for US505 are already restored.
    • Thank you again for your patience as we work to restore service to this last server.
  • October 9, 2023: 11:14 am UTC / 07:14 am EDT
    • US504 restores are now fully complete.
    • We continue to pursue an alternate restore method for US505 but cannot provide a reliable estimated time of resolution at the moment. We are performing a backup data sync as part of this alternative method and hope to be able to provide more information in around 6 to 7 hours’ time.
    • While we are pleased to have services restored to all servers now except US505, we understand the importance of getting this server back online as soon as possible for those affected customers. Our efforts continue around the clock to get this one last server back online too.
    • We continue to be grateful to everyone for their patience and kindness shown to our team.
  • October 9, 2023: 10:30 am UTC / 06:30 am EDT
    • US503 restores are now fully complete.
    • Progress continues on the remaining server groups. The current estimates are:
      • US504: 45m
      • US505: Unknown at this time. We will provide more information when possible.
    • Thank you for your continued patience.
  • October 9, 2023: 04:08 am UTC / 00:08 am EDT
    • Progress continues on the remaining server groups. The current estimates are:
      • US503: 5h 10m
      • US504: 7h 8m
      • US505: Unknown at this time
    • US505 continues using an alternative restore method. We hope to have more information in our next updates though.
    • Thank you again for your patience while we continue to work to restore these last remaining services.
  • October 8, 2023: 18:40 UTC / 2:40 pm EDT
    • Progress continues on the remaining server groups. The current estimates are:
      • US503: 8h 48m
      • US504: 16h 9m
      • US505: Unknown at this time
    • We continue to work on all possible options for speed optimization and will keep everyone posted.
    • Please reach out via ticket if you think a temporary plan, as outlined in this article, can be helpful to you for email, landing pages, or restoring your own local backups.
    • Current support ticket response times are back to normal, and our team remains available to assist 24/7 via My Rochen.
    • We continue to be grateful for your patience.
  • October 8, 2023 – 09:27 am UTC / 5:27am EDT
    • US505 has run into some issues with the restore. We have got it mostly back online, but we are having to restore some volumes using alternative methods that are more time-consuming. We do not currently have an estimated time of resolution, but we continue to work on this diligently and as quickly as possible.
    • Here are the updated estimates for completion for the other two groups. Again, these are estimates only and could change.
      • US503: 17h 39m
      • US504: 1d 2h
    • Further updates will follow and we continue to thank everyone for their understanding.
  • October 8, 2023 – 23:55 UTC / 7:55 pm EDT:
    • We are still working on full restores for the remaining server groups, which are significantly larger than the previous groups that have been restored.
    • While we regret that we have no updated ETAs, at this time, please know that we continue to work on this 24/7 and will be sharing more information as soon as we can.
  • October 7, 2023 – 12:40 pm UTC / 8:40 am EDT:
    • Although service has been restored to almost all servers now, we know many customers in the remaining three large Premium Web Hosting (PWH) groups are still affected. Work continues around the clock to restore services to all customers.
    • Here are the updated estimates for completion. Again, these are estimates only and could change.
      • US503: 1d 12h
      • US504: 1d 20h
      • US505: 5h 7m
    • Thank you again to everyone for your continued patience.
  • October 7, 2023 – 00:49am UTC / 8:49 pm EDT:
    • US509 restores are now fully complete.
    • Here are the updated estimates for completion. Again, these are estimates only and could change.
      • US503: 1d 23h
      • US504: 2d 6h
      • US505: 14h 49m
    • The whole Rochen team continues to work to restore all services.
  • October 6, 2023 – 22:00 UTC / 6:00 pm EDT:
    • We continue to make steady progress on the remaining restores in our Premium Web Hosting (PWH) shared server groups, and this work will continue through the weekend.
    • Here are the updated estimates for completion. Again, these are estimates only and could change.
      • US503: 2d 3h
      • US504: 2d 8h
      • US505: 17h 6m
      • US509: 1h 38m
    • Please reach out via ticket if we can help you with a temporary hosting solution for your emails, landing pages and/or personal backups as outlined here.
    • Our My Rochen portal remains open 24/7 for technical support, and response times have improved significantly in the past 24 hours.
    • Thank you all for continuing to bear with us through this.
  • October 6, 2023 – 14:48 UTC / 10:48 am EDT:
    • US508 restores are now fully complete.
    • Here are some updated estimated times for the remaining Premium Web Hosting (PWH) shared server groups. These are only estimates and cannot be guaranteed.
      • Current remaining group restore times:
        • US503 2d 11h
        • US504 2d 17h
        • US505 22h 49m
        • US509 9h 44m
    • Our entire team continues to be grateful to everyone as work continues to restore the remaining services.
    • Edit added at 15:00 UTC: As we noted yesterday, temporary plans are available for affected clients to help with landing pages, emails, etc. Please see this article for more details.
  • October 6, 2023 – 12:55 pm UTC / 8:55 am EDT:
    • Our progress continues steadily.
    • Here are some updated estimated times for the remaining Premium Web Hosting (PWH) shared server groups. These are only estimates and cannot be guaranteed.
      • Current remaining group restore times:
        • US503 2d 13h
        • US504 2d 19h
        • US505 1d 0h 45m
        • US508 2h 11m
        • US509 11h 43m
    • Thank you, everyone, for your ongoing patience.
  • October 5, 2023 – 20:20 UTC / 4:20 pm EDT:
    • Managed Cloud Server (MCS) restores are fully complete.
    • Significant progress has been made on the larger capacity Premium Web Hosting (PWH) shared servers in the last 24 hours, with the vast majority now fully restored and operational.
    • The estimated restoration time for the remaining groups is as follows:
      • us503 – 2+ days
      • us504 – 2+ days
      • us505 – 1 day 20 hours
      • us508 – 20 hours
      • us509 – 1 day 4 hours
      • Please note that the above times are estimates only and are subject to change as the situation unfolds. We cannot guarantee any timelines.
      • To locate your server, please log in to My Rochen, select “Manage Hosting” and click the account.
    • For clients who continue to be affected, we would like to offer you a temporary hosting plan on an unaffected server for setting up landing pages, re-routing DNS and emails or uploading any saved backups.
      • Please see this article for more information about how to make use of the above option.
    • Our customer support efforts continue 24/7, albeit slower than we would like due to the understandable volume of requests.
    • Discussions continue with VMware and other vendors, and a full Reason For Outage (RFO) report will be made available next week.
    • Our entire team appreciates everyone’s continued patience throughout this unprecedented event.
  • October 4, 2023 – 14:50 UTC / 10:50 am EDT:
    • Progress continues. Server restores have been ongoing through the night and into this morning.
    • Many Managed Cloud Servers are now fully restored, with some larger restores remaining.
    • Premium Web Hosting (PWH) servers continue restoring as well, though, as noted earlier, they will take longer due to the sheer volume of data on those larger servers. We do not have an updated ETA we can confidently provide at this time. If that changes, we will provide a specific update.
    • Rochen staff continue to work around the clock to get everything up and running for every client.
    • Support channels remain very busy and response times, regrettably, are still below our usual standard. Every single ticket and inquiry will be addressed as time allows.
    • We thank everyone for the immense patience and grace you’ve shown our team over the past few days.
  • October 4, 2023 – 22:15 UTC / 6:15 pm EDT:
    • Throughout the day, we’ve brought more servers back to operational status. This includes a mix of Managed Cloud Servers, shared servers hosting legacy plans and even, finally, some of our Premium Web Hosting (PWH) servers.
    • We continue to work on getting everyone restored fully.
    • Support remains very busy, and we are working to get everyone’s tickets updated as needed.
    • If you want to get in touch regarding this issue, please open a technical support ticket at https://my.rochen.com if you have not already done so. Response time may be delayed, but having an active ticket will ensure your specific case is logged and tracked to be updated by support team members. Please see this article if you need login recovery for My Rochen.
    • We continue to be grateful for your patience and overall kindness toward our team. We know this has been a significant ordeal for everyone. As we noted in a prior update this week, there will be a full Reason for Outage (RFO) report published after our incident investigation concludes.
  • October 3, 2023 – 20:23 UTC / 4:33 pm EDT:
    • In addition to continuing with the restore work, we have had extensive discussions with the vSAN team at VMware Global Support as well as with our backup systems vendor, ConnectWise, to see if we can find a way to speed up the restore process, particularly for some of the larger shared servers.
    • Currently, the main issue is the restore speed with the much larger capacity Premium Web Hosting (PWH) shared servers. We are looking into all possible ways that may help to speed this process up. VMware believes we may be able to recover some data from the original vSAN storage array itself, which could speed up the process.
    • We do not have an ETA at this time. Unfortunately, based on our discussions with vendors and findings today, some of our larger shared servers may take longer into the week to fully restore.
    • We are grateful for your continued patience under these unprecedented circumstances. We understand the impact this has had on people and businesses, and we will be providing a Reason for Outage (RFO) report next week after full recovery and investigation.
  • October 3 2023 – 12:14 UTC – We continue to make progress with restores, but significantly slower than we would like due to the volume of data and the number of servers unexpectedly needing to be restored at the same time. Our engineering team is looking at ways we can increase available private bandwidth and resources to our backup appliances to try to overcome this challenge to some extent. Additionally, we are engaging with VMware Global Support to begin looking at what failed with vSAN, which should be a redundant and distributed platform. It is hard for us to provide an estimated time for full resolution. For some clients it will be happening right now, but for others we will unfortunately still be quite a bit off. Our support channels remain extremely busy, so we apologize for the delays in any responses. We appreciate this is an extremely challenging situation and probably the toughest we have faced in our decades providing hosting and cloud services. Thank you again to everyone.
  • October 3 2023 – 05:36 UTC – Our team continues to work away on the restore process. All shared servers for the current range of Solo, Growth, Pro and Reseller (purchased after February 2020) plans are now restored (where there are fewer servers overall). Several shared servers for our previous generation Premium Web Hosting (PWH) service still need to be restored. Many Managed Cloud Servers (MCS) still need to be restored. We will continue to look at other options for speeding up the restore process throughout today.
  • October 2 2023 – 23:25 UTC – Servers continue to be restored but we are making slower progress than we would like. This overall event is likely to extend well into tomorrow at this time. We are working with VMware and other vendors to try and find a way to speed up the overall restore process though.
  • October 2 2023 – 17:00 UTC – Our team is continuing to actively work away on restores. It is going to be some time, though, until we have the situation fully resolved.
  • October 2 2023 – 12:00 UTC – Many VMs are back online but we continue the restore process for other impacted VMs. We are moving as fast as we can with this process but due to the sheer volume of data and number of VMs impacted it is taking longer than we would like. Our most senior team members are focused on getting everything resolved as soon as possible though.
  • October 2 2023 – 05:00 UTC – Some VMs are fully restored and back online, but work continues to progress on the remaining affected servers.
  • October 1 2023 – 23:50 UTC – Restores are ongoing
  • October 1 2023 – 21:38 UTC – The restore process has been kicked off for affected VMs
  • October 1 2023 – 19:01 UTC – vCenter has been brought back online and we are sorting out a cluster issue currently
  • October 1 2023 – 17:44 UTC – vCenter repair is in progress and once vCenter is restored we will be able to start recovering affected VMs

We have run into a serious problem with one of our VMware vSAN distributed storage clusters at our Dallas location. We will fully investigate the matter and put together a Reason For Outage (RFO) report once everything is fully resolved but it looks like a range of freak and unexpected events happened at the same time to result in the failure. Up until this point we have been running vSAN for many years with great performance and pretty much 100% uptime.

We will keep this page updated as we have further information available. 

Your patience is very much appreciated and all hands are on deck getting this situation resolved as quickly as possible. 

Edits: 

10/03/2023: Clarified Reseller plan type in October 3 2023 – 05:36 UTC update. The distinction is Reseller plans purchased after February 2020 – WR

10/07/2023: Time stamp corrected on first update made October 7th. – WR

Updated on October 16, 2023
