Service interruption DRO1

Resolved

We've now resolved the incident. Thanks for your patience. We apologize for any inconvenience caused by this.

We use Ceph as the underlying storage platform. Ceph is designed to store data three times on three different servers. In short, if a device fails, it is removed from the cluster and Ceph stores the data it contained elsewhere. Normally, this has no impact on services.
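
To illustrate the behavior described above, here is a minimal sketch of three-way replication with re-replication after a failure. It is a toy model, not Ceph's actual placement logic: the `ToyCluster` class, its method names, and the server names are hypothetical, and for simplicity each server is treated as a single device.

```python
import random

REPLICAS = 3  # data is stored three times, on three different servers


class ToyCluster:
    """Toy model of replicated storage. This is NOT how Ceph places data."""

    def __init__(self, servers):
        self.servers = set(servers)  # servers currently in the cluster
        self.placement = {}          # object name -> set of servers holding a copy

    def write(self, name):
        # Put each copy on a different server.
        self.placement[name] = set(random.sample(sorted(self.servers), REPLICAS))

    def fail(self, server):
        # Remove the failed server, then restore the replica count elsewhere.
        self.servers.discard(server)
        for holders in self.placement.values():
            if server in holders:
                holders.discard(server)
                spare = sorted(self.servers - holders)
                if spare:  # only possible if a server without a copy remains
                    holders.add(random.choice(spare))
```

With four hypothetical servers, calling `fail("s1")` leaves every object with three copies again, none of them on `s1`. The recovery step only runs once the failed component has actually been removed from the cluster.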

In this case, Ceph recovered the data quickly. However, the node that physically housed the NVMe drive became unresponsive. Normally, Ceph should remove such a node from the cluster as well, but for reasons we have not yet determined, it did not.

We put a lot of effort into service availability and designed the system to prevent issues like this one. We will investigate why Ceph did not behave as expected and continue to offer stable services.

For now, all services are back in normal operation.

Thank you for your patience.

Recovering

We've fixed the core issue and are waiting for services to recover.

Identified

We've confirmed there is a problem with the underlying storage, and we're working to resolve it.

Investigating

We are investigating service availability in Dronten. Services in Dronten may be unreachable.

Began at:

Affected components
  • Network
  • VPS
  • Webservers