Monday, February 26, 2007

Google disk failure research dumbs down SMART

New Google Labs research on failure rates of consumer-grade disk drives has thrown into question the predictive accuracy of the drives’ Self-Monitoring, Analysis and Reporting Technology (SMART).

In particular, the Google study found little evidence to support the generally held industry belief that there was a strong correlation between failure rates and factors like higher temperatures and high utilization rates.

The just-published Google Labs paper finds that failure prediction models based on SMART parameters alone are “severely limited” in their prediction accuracy, given that a large fraction of Google’s failed drives showed no SMART error signals whatsoever.

The Google study is thought to be unprecedented in size and scope, given the sheer size of the Google disk drive population and the new, highly parallel health data collection and analysis infrastructure the company used in the research.

The paper, called Failure Trends in a Large Disk Drive Population, will light a fire under the storage industry as it appears to debunk some long-held claims of manufacturers.

“One of our key findings has been the lack of a consistent pattern of higher failure rates for higher temperature drives or for those drives at higher utilization levels,” the paper concludes.

“Such correlations have been repeatedly highlighted by previous studies, but we are unable to confirm them by observing our population,” it says.

“Although our data do not allow us to conclude that there is no such correlation, it provides strong evidence to suggest that other effects may be more prominent in affecting disk drive reliability in the context of a professionally managed data center deployment.”

Despite the importance of the subject – some 90 per cent of all new information in the world is stored on disk – there are few published studies on failure characteristics.

And most available information comes from the disk manufacturers themselves, usually based on extrapolation from accelerated life test data of small populations, or from returned unit databases.

“Accelerated life tests, although useful in providing insight into how some environmental factors can affect disk drive lifetime, have been known to be poor predictors of actual failure rates as seen by customers in the field,” the Google research says.

“Statistics from returned units are typically based on much larger populations, but since there is little or no visibility into the deployment characteristics, the analysis lacks valuable insight into what actually happened to the drive during operation.”

The research found that while some SMART parameters – like scan errors, reallocation counts, offline reallocation counts and probational counts – have a large impact on failure probability, SMART was not enough to reliably predict failure.

“Given the lack of occurrence of predictive SMART signals on a large fraction of failed drives, it is unlikely that an accurate predictive failure model can be built based on these signals alone,” Google said.
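The reasoning behind that conclusion can be sketched in a few lines of Python. A predictor that relies only on SMART signals can flag, at best, the failed drives that actually raised a signal, so its recall is capped by that fraction. The sketch below is illustrative only: the drive records, the field layout and the `smart_only_recall` helper are hypothetical and are not taken from the Google paper.

```python
# Hypothetical fleet records: (drive_id, failed, had_smart_signal).
# These values are invented for illustration and are NOT from the Google paper.
drives = [
    ("d1", True, True),
    ("d2", True, False),
    ("d3", True, False),
    ("d4", False, True),
    ("d5", False, False),
    ("d6", False, False),
]


def smart_only_recall(records):
    """Return the fraction of failed drives that showed any SMART signal.

    This is the best-case recall of a predictor that uses SMART signals
    alone: it cannot flag a failed drive that never raised a signal.
    """
    failed = [r for r in records if r[1]]
    if not failed:
        return 0.0
    flagged = [r for r in failed if r[2]]
    return len(flagged) / len(failed)


recall = smart_only_recall(drives)
print(f"Best-case recall from SMART signals alone: {recall:.2f}")
```

In this made-up sample only one of three failed drives showed a signal, so no SMART-only model could catch more than a third of the failures, however well it was tuned.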
