Hitachi Global Storage Technologies
 

Predictive Failure Analysis

Introduction -- disk drive reliability

Shortly after the first disk drive was created, another first occurred: the first disk drive failure. Computer system failures are aggravating. Production is delayed, customers are upset, users are dismayed, and nothing can be accomplished until the system is operational and the data is restored. Fortunately, disk drive reliability has steadily improved, but failures still occur.

Solution -- maintenance

Historically, there are four ways to manage hardware maintenance (See Figure 1):

  1. You do nothing until something fails and then replace the defective part. This is cost-effective if you do not mind the unplanned downtime, lost data, and all of the other unpleasantness of a disk drive failure.
  2. Or, you can practice preventive maintenance and replace all parts that typically fail, before they fail. This is "somewhat" effective in reducing unscheduled downtime (parts do not always fail on schedule), but it carries the high cost of replacing parts that would not have failed.
  3. Or, you can use redundancy. That is, if you need one disk drive, use two: one as the primary and one as a mirrored backup. Redundant Array of Independent Disks (RAID) is another example of redundancy. Redundancy adds expense because of the extra hardware and software it requires, and it may lower the performance of your system.
  4. Or, you can choose a fourth maintenance solution, condition monitoring. Hitachi's Predictive Failure Analysis (PFA) condition monitoring, an improved method, can provide early warning of impending failure, and allow scheduled replacement of the failing device.

Figure 1

What is PFA?

PFA monitors key device performance indicators, both for changes over time and for values that exceed specified limits. The device notifies the system when an indicator surpasses a predetermined threshold.
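The two trigger conditions described above, an absolute limit and a change over time, can be illustrated with a minimal Python sketch. It is not Hitachi's implementation: the `IndicatorRule` fields, the `check_indicator` function, the `notify` callback, and all numeric values are hypothetical and exist only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class IndicatorRule:
    """Hypothetical rule for one monitored indicator (illustrative only)."""
    name: str
    upper_limit: float  # absolute limit the latest reading must not exceed
    max_change: float   # largest allowed rise from the first to the last reading


def check_indicator(rule: IndicatorRule,
                    readings: Sequence[float],
                    notify: Callable[[str], None]) -> bool:
    """Return True, and call notify(), if the readings violate the rule."""
    if not readings:
        return False  # nothing measured yet, so nothing to report
    latest = readings[-1]
    change = latest - readings[0]
    if latest > rule.upper_limit:
        notify(f"{rule.name}: reading {latest} exceeds limit {rule.upper_limit}")
        return True
    if change > rule.max_change:
        notify(f"{rule.name}: rose by {change} over the observation window")
        return True
    return False
```

In this sketch the `notify` callback stands in for whatever mechanism the drive uses to alert the host system; passing `print` or a list's `append` method is enough to try it out.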

Advantage

PFA is an attractive solution to disk drive maintenance. PFA can minimize your exposure to data loss at a much lower cost than redundancy. PFA calls for preventive replacement of a disk drive only when that drive's performance has degraded. PFA gives you a new level of data protection and allows for scheduled replacement of the drive.

You can maximize your data reliability with Hitachi's industry leading Mean Time Between Failure (MTBF) and Predictive Failure Analysis.

How does it work?

Figure 2

As with any electrical/mechanical device, there are two basic failure types (See Figure 2). First, there is the on/off type of failure. A cable breaks, a component burns out, or a solder connection fails: these are all examples of unpredictable catastrophic failures. As assembly and component processes have improved, these types of defects have been reduced but not eliminated. PFA cannot provide warning of on/off unpredictable failures.

The second type of failure is the gradual performance degradation of components. Predictive Failure Analysis has been developed to monitor performance of the disk drive, analyze data from periodic internal measurements, and recommend replacement when specific thresholds are exceeded. The thresholds have been determined by examining the history logs of disk drives that have failed in actual customer operation.

Figure 3

Predictive Failure Analysis monitors performance in two ways: through a "measurement driven" process and a "symptom driven" process. The measurement driven process is based on Hitachi's exclusive Generalized Error Measurement feature (See Figure 3).

At periodic intervals, PFA's Generalized Error Measurement (GEM) automatically performs a suite of self-diagnostic tests that measure changes in the disk drive's component characteristics.

Hitachi Global Storage Technologies leads the disk drive industry with this two-step condition monitoring approach. To accomplish this task, GEM directly measures various magnetic parameters of the head and disk, as well as figures of merit for the channel electronics. The GEM circuit monitors head fly height on all data surfaces, channel noise, signal coherence, signal amplitude, writing parameters ... and more. Unlike conventional error monitors, this feature provides for direct detection of specific mechanisms that can precede a disk drive failure.
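As a rough illustration of the measurement driven idea, the following Python sketch compares a set of current measurements against baseline values and flags any parameter that has drifted too far. The parameter names, units, tolerances, the default tolerance of 10%, and the `drifted_parameters` function are all assumptions made for this example; the source does not describe how GEM stores or compares its measurements.

```python
from typing import Dict, List


def drifted_parameters(baseline: Dict[str, float],
                       current: Dict[str, float],
                       tolerance: Dict[str, float],
                       default_tolerance: float = 0.10) -> List[str]:
    """Return the names of parameters whose relative change exceeds tolerance.

    All three dictionaries are keyed by a hypothetical parameter name.
    Tolerances are fractions of the baseline value (0.15 means 15%).
    """
    flagged = []
    for name, base in baseline.items():
        if name not in current or base == 0:
            continue  # skip missing readings and avoid division by zero
        relative_change = abs(current[name] - base) / abs(base)
        if relative_change > tolerance.get(name, default_tolerance):
            flagged.append(name)
    return sorted(flagged)


# Example with made-up values in arbitrary units.
baseline = {"fly_height_head0": 100.0, "signal_amplitude_head0": 1.0, "channel_noise": 0.20}
current = {"fly_height_head0": 80.0, "signal_amplitude_head0": 0.97, "channel_noise": 0.21}
tolerance = {"fly_height_head0": 0.15}

print(drifted_parameters(baseline, current, tolerance))
```

Here only the fly height reading has moved by more than its allowed fraction, so it is the only parameter reported. A real drive would measure each head and surface separately, as the text notes for fly height.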

The "symptom driven" portion of PFA uses the output of the data, non-data, and motor start error recovery logs. The error log information is analyzed periodically during idle periods. When PFA analysis detects that a threshold has been exceeded, the host system is notified. The design goal of PFA is to provide a minimum of 24 hours warning before a drive fails. Not all disk drive failures are predictable, at least not yet, but we're working on them.
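The symptom driven process can be sketched in the same spirit. The Python example below counts recent entries in each of the three error recovery log categories and reports any category whose count exceeds its limit, and it defers the work whenever the drive is busy. The log format, the window length, the limits, and the `analyze_error_logs` function are hypothetical; the source names the log categories but not how they are analyzed.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Each log entry is (timestamp in hours, category). Both are assumptions.
LogEntry = Tuple[float, str]


def analyze_error_logs(entries: Iterable[LogEntry],
                       now_hours: float,
                       window_hours: float,
                       limits: Dict[str, int],
                       is_idle: bool) -> List[str]:
    """Return the log categories whose recent error count exceeds its limit.

    Analysis is skipped (empty result) when the drive is not idle, mirroring
    the statement that the logs are analyzed during idle periods.
    """
    if not is_idle:
        return []
    window_start = now_hours - window_hours
    counts = Counter(category for timestamp, category in entries
                     if window_start <= timestamp <= now_hours)
    return sorted(category for category, limit in limits.items()
                  if counts[category] > limit)


# Example with made-up log entries and limits.
limits = {"data": 50, "non_data": 20, "motor_start": 3}
entries = [(1.0, "motor_start"), (90.0, "motor_start"), (95.0, "motor_start"),
           (99.0, "motor_start"), (99.5, "motor_start"), (98.0, "data")]

print(analyze_error_logs(entries, now_hours=100.0, window_hours=24.0,
                         limits=limits, is_idle=True))
```

Four motor start recoveries fall inside the window against a limit of three, so that category is the one reported. The entry at hour 1.0 is outside the window and is ignored.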


Other names are trademarks or registered trademarks of their respective owners.

References in this publication to Hitachi products, programs, or services do not imply that Hitachi intends to make them available in all countries in which Hitachi operates.

Product description data represents design objectives and is provided for comparative purposes; actual results may vary depending on a variety of factors. Product claims are true as of the date of the first printing. This product data does not constitute a warranty. Questions regarding Hitachi warranty terms or the methodology used to derive this data should be referred to a Hitachi representative. Data subject to change without notice.









© 2009 Hitachi Global Storage Technologies