High w_await on RX nodes

May 19 at 12:05am MSK

Affected services

RX-Line [AMD Ryzen 9 9950X]

Resolved
May 19 at 12:05am MSK

Our monitoring systems recorded a sharp increase in the w_await metric on some servers in the RX cluster. This metric reflects the average time NVMe drives take to complete write requests, including time spent waiting in the queue.
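
For readers who want to check this metric on their own Linux machines, here is a minimal sketch of how w_await can be derived from two snapshots of /proc/diskstats. It assumes the standard kernel field layout (writes completed in the 8th column, milliseconds spent writing in the 11th). The function names and the sample numbers are hypothetical and are not taken from our monitoring system.

```python
def parse_diskstats(text):
    """Return {device: (writes_completed, ms_spent_writing)} from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 11:
            continue  # skip blank or malformed lines
        # fields[2] = device name, fields[7] = writes completed,
        # fields[10] = total milliseconds spent writing
        stats[fields[2]] = (int(fields[7]), int(fields[10]))
    return stats


def w_await_ms(before, after):
    """Average milliseconds per completed write between two snapshots."""
    result = {}
    for dev, (writes_after, ms_after) in after.items():
        if dev not in before:
            continue
        writes_before, ms_before = before[dev]
        delta_writes = writes_after - writes_before
        delta_ms = ms_after - ms_before
        # No writes completed in the interval: report 0.0 instead of dividing by zero
        result[dev] = delta_ms / delta_writes if delta_writes > 0 else 0.0
    return result


# Illustrative snapshots (made-up counters for a device named nvme0n1)
SAMPLE_BEFORE = "259 0 nvme0n1 100 0 800 50 1000 0 8000 500 0 400 550"
SAMPLE_AFTER = "259 0 nvme0n1 100 0 800 50 1500 0 12000 5500 0 900 5550"

if __name__ == "__main__":
    print(w_await_ms(parse_diskstats(SAMPLE_BEFORE), parse_diskstats(SAMPLE_AFTER)))
```

On a live system the two snapshots would be read from /proc/diskstats a few seconds apart. The iostat utility reports a w_await column computed over its own sampling interval.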

Because we use RAID1 on our servers to make data storage more reliable, every write must complete on each disk in the array, so the array's write speed is limited by its slowest disk. If even one disk in the array starts to misbehave, the whole array is affected.
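
The "slowest disk" effect can be illustrated with a deliberately simplified model: a mirrored write is acknowledged only after every member has finished, so its latency is the maximum of the member latencies. The function name and the latency figures below are hypothetical, and the model ignores queueing and RAID-layer overhead.

```python
def mirror_write_latency_ms(member_latencies_ms):
    """Latency of one RAID1 write in a simplified model.

    The write is complete only when every mirror member has finished,
    so the slowest member determines the result.
    """
    if not member_latencies_ms:
        raise ValueError("a mirror needs at least one member")
    return max(member_latencies_ms)


# Illustrative values only: two healthy drives vs. one degraded drive
healthy_pair = [0.05, 0.06]
degraded_pair = [0.05, 12.0]

if __name__ == "__main__":
    print(mirror_write_latency_ms(healthy_pair))
    print(mirror_write_latency_ms(degraded_pair))
```

In this model a single degraded drive raises the latency of every write to the array, even though the other drive is healthy.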

During an internal investigation, we found that all the problematic drives belong to a single batch, which an engineer at the data center received and used to assemble the servers. We have ordered a new batch and expect it to be delivered next week.

No downtime is expected while the storage devices are replaced: virtual machines on these nodes will be migrated to other servers so that the affected nodes can be serviced.

We continue to monitor the situation and apologize for any inconvenience.