Major Incident Report – Hosting


Following the major incident that occurred at OVH (Primary DC) on 9th December, continuing into 10th December, we are publishing this report to explain what happened at OVH, the timeline of events, lessons learnt, and the improvements planned or already implemented.


09-12-25 9pm Alerted to an instance offline; engineers began reviewing the issue

09-12-25 9.30pm Managed failover performed on the reported offline instances

10-12-25 12am Escalated to OVH, as the issue was confirmed to be DC related

10-12-25 1am-4am Managed failover performed on further reported offline instances

10-12-25 7am Telephone call with the OVH service desk, following up on the incident raised

10-12-25 7.30am Invoked DR to mass migrate affected instances; continued all failover efforts while liaising with OVH

10-12-25 10.30am OVH began restoring infrastructure; 50% of affected instances recovered to normal service

10-12-25 12.35pm All infrastructure services restored at OVH

All remaining instances then started to come back online. As each instance returned to normal service, failed-over instances began to fail back. By 4pm, 99% of all affected instances had failed back; the remaining <1% are fully operational but still failed over and need to be failed back. We are working with OVH to complete this recovery today (11-12-25).


We have a meeting with OVH this afternoon. At present, they are only able to share the details in their incident report, published here. They have advised that a full RCA (Root Cause Analysis) report may take over a week to produce; we will make it available as soon as we receive it.

LESSONS LEARNT AND PLANNED IMPROVEMENTS

Communication

In the interests of openness and transparency, we have to accept, in hindsight, that our systems and processes for providing service-related updates were not good enough. As a result, the following improvements will be made:

  1. We will add a service status link to our home page that will display in green, amber and red to visually draw attention to any service outages.
  2. Our service status page will be regularly updated in the event of any service outages. You will be able to subscribe to email updates.
  3. In the event of a major incident, we will immediately add a message to our phone system to advise all inbound callers that we are experiencing a major incident with directions to our service status page.

These changes will be implemented within the next few days.
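For illustration, the home-page indicator could be driven by a small check against a machine-readable status feed. This is a minimal sketch only; the feed URL and JSON fields shown are assumptions, not our actual status page API.

```python
# Minimal sketch of deriving a green/amber/red indicator from a status feed.
# The URL and JSON shape are illustrative assumptions.
import json
import urllib.request

STATUS_URL = "https://status.example.com/api/summary.json"  # hypothetical feed

def current_status_colour() -> str:
    """Return 'green', 'amber' or 'red' based on the worst component state."""
    with urllib.request.urlopen(STATUS_URL, timeout=5) as resp:
        summary = json.load(resp)
    states = {component["state"] for component in summary.get("components", [])}
    if "major_outage" in states:
        return "red"
    if states & {"partial_outage", "degraded"}:
        return "amber"
    return "green"
```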

Website

Early in the incident, our website became unresponsive due to the large number of internal API requests generated by failing over such a large volume of instances simultaneously. In light of this, the failover process has been refactored and no longer relies on the internal API.
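To illustrate the shape of such a refactor (not our actual code), the sketch below queues failover requests for a background worker instead of driving each one through the website's internal API, so a burst of simultaneous failovers no longer overwhelms the web stack. All names are hypothetical.

```python
# Hypothetical sketch: failover requests are queued and drained by a worker,
# rather than each one making blocking calls through the website's internal API.
import queue
import threading

failover_queue: "queue.Queue[str]" = queue.Queue()

def request_failover(instance_id: str) -> None:
    """Cheap enqueue; the website thread never waits on provider calls."""
    failover_queue.put(instance_id)

def _worker() -> None:
    while True:
        instance_id = failover_queue.get()
        perform_failover(instance_id)  # talks to the failover DC directly
        failover_queue.task_done()

def perform_failover(instance_id: str) -> None:
    """Provider-specific migration steps, deliberately omitted here."""

threading.Thread(target=_worker, daemon=True).start()
```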

Failover

Although our failover process worked as expected, we were unable to complete it within an acceptable timeframe.
Digital Ocean (Failover DC) throttled our API requests, meaning that we were unable to fail over all affected instances simultaneously. This was the core issue that our engineers worked throughout the day to overcome.
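For context, a standard way to cope with this kind of throttling is to catch HTTP 429 responses and back off before retrying, honouring the server's Retry-After hint where one is sent. The sketch below is generic, not the code our engineers shipped:

```python
# Generic sketch: retry an API call that gets throttled with HTTP 429,
# using the Retry-After header when present and exponential backoff otherwise.
import time
import urllib.error
import urllib.request

def call_with_backoff(req: urllib.request.Request, max_retries: int = 5):
    delay = 1.0
    for _ in range(max_retries):
        try:
            return urllib.request.urlopen(req, timeout=30)
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise  # only throttling responses are retried
            retry_after = err.headers.get("Retry-After")
            time.sleep(float(retry_after) if retry_after else delay)
            delay = min(delay * 2, 60.0)  # cap the backoff at one minute
    raise RuntimeError("API still throttling after retries")
```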

As a result, we have implemented measures to significantly reduce the number of API requests required to fail over an instance.
Over the coming weeks, we will be performing volume failover tests to simulate this type of event and ensure that, were it to occur again, we could fail over within the expected time.
Once failover has been triggered, we would expect an instance to take between 15 and 60 minutes to fail over, depending on the size of the instance.
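To make the effect of this concrete: under a fixed request quota, failover throughput scales inversely with the number of requests each failover needs. The numbers below are purely illustrative, not Digital Ocean's actual limits or our actual request counts.

```python
# Illustrative arithmetic: why cutting per-failover API requests matters.
# All numbers are made up for the example.
quota_per_hour = 5000   # requests the failover DC accepts per hour
requests_before = 40    # API requests per instance failover, before
requests_after = 4      # after the reduction work

print(quota_per_hour / requests_before)  # 125.0 instances/hour before
print(quota_per_hour / requests_after)   # 1250.0 instances/hour after
```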

We are planning to change our solution from ‘Managed Failover’ to ‘Auto Failover’. On balance, we now feel that the benefits of ‘Auto Failover’ outweigh its downsides. Historically, we took the view that automatic failover could itself incur extended downtime, because FQDN updates would be triggered even by service outages that were resolved in a short time frame.

We are now planning to fail over instances automatically once an instance has been offline for 30 minutes. You will have the ability, via the customer portal, to disable auto failover on a per-instance basis.

As is the case currently, you will receive an email confirming the instance's current IP address after a failover or failback. This information is crucial if you use IP-authenticated SIP trunks, which will need updating.

This is planned development work and should be completed by the end of January 2026.
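As a sketch of the planned behaviour (the helper names and instance fields here are assumptions, since this work is not yet built):

```python
# Sketch of the planned auto-failover rule: trigger failover once an instance
# has been offline for 30 minutes, unless the partner has opted out, then
# email the new IP (important for IP-authenticated SIP trunks).
from datetime import datetime, timedelta, timezone

OFFLINE_THRESHOLD = timedelta(minutes=30)

def check_instance(instance) -> None:
    if not instance.auto_failover_enabled:  # per-instance portal setting
        return
    if instance.offline_since is None:      # instance is up; nothing to do
        return
    if datetime.now(timezone.utc) - instance.offline_since >= OFFLINE_THRESHOLD:
        new_ip = trigger_failover(instance)
        send_ip_change_email(instance.owner_email, new_ip)

def trigger_failover(instance):
    """Migrate the instance to the failover DC and return its new public IP."""

def send_ip_change_email(address: str, new_ip: str) -> None:
    """Confirm the instance's current IP, as happens today after failover."""
```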

Non-production (Alpha/Beta/Release Candidate) 3CX Versions

Please note that instances running non-production versions of the 3CX software cannot be failed over, as our DR solution uses production 3CX images.
We will now detect any instances running non-production versions, which do not fully comply with our DR solution, and notify the affected partners via email.
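A sketch of how such detection might look; the channel keywords and instance fields are illustrative assumptions, and 3CX's actual version naming may differ.

```python
# Sketch: flag instances whose reported 3CX version is a non-production
# (Alpha/Beta/Release Candidate) build so the partner can be emailed.
NON_PRODUCTION_MARKERS = ("alpha", "beta", "rc")

def is_non_production(version_string: str) -> bool:
    v = version_string.lower()
    return any(marker in v for marker in NON_PRODUCTION_MARKERS)

def partners_to_notify(instances):
    """Yield (partner_email, instance_name) for each non-compliant instance."""
    for inst in instances:
        if is_non_production(inst.reported_version):
            yield inst.partner_email, inst.name
```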

