Friday, June 10, 2011

A Non-Traditional Way Of Stress-Testing Hard Drive Arrays

Tremendous amounts of computing are now moving to the cloud. Those of us old enough to remember the days not just before the Internet, but before the personal computer, may recall that the cloud used to be called "mainframes." Like mainframes, the modern server farms that make up the cloud sometimes have unusual problems. Here, engineer Brendan Gregg demonstrates how to reproduce one failure condition that you're not likely to run into at home.

Brendan has his own writeup of the error, which shows the diagnostic screens more clearly.

It's important to remember that even though computers are deterministic, they're still complicated enough that a given error can come from wildly improbable causes. We need monitoring systems so that we can observe errors as they happen and try to prevent them, even when we can't directly look at the underlying cause. Monitoring and testing let you answer the "what's happening?" question without requiring that you answer the "why is it happening?" question.


