Our monitoring systems recorded a sharp increase in the w_await metric on some servers in the RX cluster. This metric reflects the response time of the NVMe drives for write operations.
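For readers unfamiliar with w_await, here is a minimal sketch of how such a value can be derived on Linux from two samples of `/proc/diskstats`: the increase in milliseconds spent writing divided by the increase in completed writes. The function names and the sampling approach are illustrative assumptions, not a description of our monitoring stack.

```python
def parse_diskstats(text):
    """Parse /proc/diskstats text into {device: (writes_completed, ms_writing)}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 11:
            continue  # skip malformed or empty lines
        # fields[2] = device name, fields[7] = writes completed,
        # fields[10] = milliseconds spent writing
        stats[fields[2]] = (int(fields[7]), int(fields[10]))
    return stats


def w_await(before, after, device):
    """Average write latency in ms between two samples (hypothetical helper)."""
    writes = after[device][0] - before[device][0]
    ms_writing = after[device][1] - before[device][1]
    # No writes completed in the interval: report 0 rather than divide by zero
    return ms_writing / writes if writes > 0 else 0.0
```

In practice the two samples would be read from `/proc/diskstats` a few seconds apart, for example with `open("/proc/diskstats").read()`.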
Because our servers use RAID1 to make data storage more reliable, the write speed of an array is limited by its slowest disk. If even one disk in the array starts misbehaving, the whole array is affected.
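The "slowest disk" effect can be illustrated with a simplified model, assuming a mirrored write is acknowledged only after every member disk has completed it. The latency figures below are made up for illustration and are not measurements from the RX cluster.

```python
def mirror_write_latency(member_latencies_ms):
    """Latency of one mirrored write under the simplified model:
    the write is done only when the slowest member has finished."""
    return max(member_latencies_ms)


# Hypothetical per-disk write latencies in milliseconds
healthy_pair = [0.05, 0.06]
degraded_pair = [0.05, 40.0]  # one disk responding slowly

healthy_latency = mirror_write_latency(healthy_pair)
degraded_latency = mirror_write_latency(degraded_pair)
```

Under this model the healthy disk in the degraded pair does not help: the array's write latency tracks the slow disk.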
During an internal investigation, we found that all the problematic drives...