What does cause Winchester drives to fail?

One of our key findings has been the lack of a consistent pattern of higher failure rates for higher temperature drives or for those drives at higher utilization levels.  Such correlations have been repeatedly highlighted by previous studies, but we are unable to confirm them by observing our population.  Although our data do not allow us to conclude that there is no such correlation, it provides strong evidence to suggest that other effects may be more prominent in affecting disk drive reliability in the context of a professionally managed data center deployment.

Our results confirm the findings of previous smaller population studies that suggest that some of the SMART parameters are well-correlated with higher failure probabilities.  We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors.  First errors in reallocations, offline reallocations, and probational counts are also strongly correlated to higher failure probabilities.  Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever.  This result suggests that SMART models are more useful in predicting trends for large aggregate populations than for individual components.

Failure Trends in a Large Disk Drive Population; Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso, Google Inc., as published in Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST’07), February 2007.

What amazed me most was the finding on drive age.  In my experience, drives more than five years old were the ones most likely to fail, but this paper says that infant mortality is a bigger problem than old age.  The other thing that I’ve come to know during my career is that SCSI drives don’t like to be powered down… ever.  I’ve seen more drives in a storage array crap out after being power cycled than for any other reason.

Another really funny thing in this paper was where they talk about taking drives out of the research pool for things like, “some drives have reported temperatures that were hotter than the surface of the sun.”  Now that’s a hostile environment.

