Calculating MTBF Figures for Switching Systems

We are occasionally asked what the MTBF (Mean Time Between Failures) figures for our switching products are. This is a notoriously difficult area: most MTBF calculations are based on MIL 217 (currently at Revision F, Notice 2), but field experience of failures appears to be almost unrelated to the calculated values for a variety of reasons. This application note provides some background information on MTBF calculations based on MIL 217 and how they relate to our switching products. There are alternative methods of calculating MTBF based on other standards, such as Bellcore/Telcordia, which give different answers to MIL 217, typically much higher MTBF numbers because the reliability data was created for commercial applications. However, the databases for these methods are all quite old.

Requesting an MTBF for a Switching System

We are able to provide MTBF figures for switching systems to MIL 217F Notice 2 on request. To do so, please contact customer support. You will be asked to fill in a simple form and we will return the form with your MTBF estimate. We only ask for simple information and no det

MIL 217

MIL 217 provides a mechanism for calculating MTBF based on a variety of factors that include solder joint counts, IC complexity and passive component counts of various types. A thorough calculation will also take into account the stress levels on each component (e.g. percentage of rated power, ambient conditions) and is referred to as a stress based calculation. Calculations include numbers for mechanical components such as fans or connectors. There is also a simpler method of calculation based on a parts count for each component type, which generally produces much higher failure rates; this is contained in Appendix A of MIL 217F Notice 2.

It is rare for the fault patterns predicted by MIL 217 to tie up with the actual field failure experience for a variety of reasons:

  • MIL 217 is based on figures that predict the MTBF of a component either failing or going out of specification. Many products will still function despite components going out of tolerance (e.g. logic pull up resistors), but some products will fail (for example a resistor drifting in a resistor module). Field MTBFs are often much better than calculated MTBFs based on MIL 217.
  • Each new edition of MIL 217 has had to reflect the changing reliability of electronic components, particularly integrated circuits. Early versions of MIL 217, for example, would predict that modern high complexity ICs (such as microprocessors) would fail very quickly. Later versions revised the parameters used for MTBF calculation to reflect newer device technology. However, MIL 217 has not followed the vast changes in both semiconductor technology and manufacturing methods, its last update being dated 1995. Modern semiconductor geometry sizes, gate counts, packages and the massive changes in surface mount technology are not reflected in the standard. Similar issues arise for other types of components where component technology has changed.
  • Calculations on certain devices have to be made with a set of assumptions about usage which may turn out to not reflect user experience. This is particularly true for connector and switching components when stress based calculations are used, and is critically affected by the use (or not) of hot switching in the case of relays and the types of faults seen on the devices under test.
  • Some devices do not exhibit normal failure distributions. For example, switching components may almost never fail until they reach some end of life criteria (for example the number of operations on a specified load), at which point a significant proportion start to fail. MTBFs according to MIL 217 do not necessarily take these wear out mechanisms into account. Instead, the standard calculates the MTBF within the rated service life of the component and the user must manage end of life criteria separately. So a relay reaching the end of its hot switch life under a given load condition is not a failure that is accounted for by MIL 217; it is an end of life failure.
  • The calculations do not include "infant mortality" effects, which are usually caused by manufacturing defects in components or assembly processes.
  • The calculations do not take into account abuse of parts. This is particularly a factor for switching systems since they can be exposed to faulty UUTs which cause operation beyond their intended use. Experience suggests that the majority of switching system relay failures can be traced to accidental abuse.
  • Some of the factors used in MTBF calculations have bizarre consequences which are dependent on stress conditions which change in steps (See Cycling Factor below).

Consequently, MTBF calculations for products based on MIL 217 are at best only a relative guide to reliability based on the complexity of the product. The MIL 217 document also notes in its opening sections that this is frequently the case; it is not a reliable indicator of field failure patterns, however, it does give a common method for making comparisons across products.

Application to Switching Modules and Systems

Calculating MTBF for switching modules based on MIL 217 is not straightforward and generally may not align to the field experience of our switching products.

In a typical switching module based on PXI or LXI the number of electronic components is relatively small, so their contribution to the computed failure rate is low. The module design is typically quite tolerant of these components drifting out of specification. A calculation based on the electronic parts alone will give a very high MTBF (much greater than 100,000 hours). Failure of the electronics that support the switches is an unusual event; most service engineers fault finding on a switching module will automatically look for a relay failure because their experience indicates that this is by far the most likely fault.

Reed Relay and EMR Background

We use instrument grade (sometimes called professional grade) reed relays in our products. These are manufactured by our reed relay division, Pickering Electronics, and are not to be confused with the commercial or industrial grade relays often used by competitors.

Instrument grade reed relays use Ruthenium contacts to ensure reliable operation at both low currents and voltages and at higher currents and voltages. The contacts are hermetically sealed in a glass envelope, so performance is not sensitive to humidity. High temperatures can cause failures if the reed blades lose their magnetic properties, but generally it is not a sensitive environmental parameter. If high currents are switched it will result in erosion of contact materials and lower the life of the reed relay, a factor that is reflected in the data sheet specification.

The reed blades are operated by a wire coil that generates a magnetic field. Sometimes early failures can occur because of weakness in the coil or its connections, but generally the coils are very reliable in service, since manufacturing screening eliminates these early failures.

Other than the flexing of the reed blades inside the sealed glass envelope there are no moving parts to wear out. This leads to an extraordinarily long mechanical (low current) life compared to electromagnetic relays. The mechanical life of the reed relays that we use is well in excess of one billion operations. Furthermore, failures during the service interval are rare, so the number of failures per billion operations in the first billion operations is significantly lower than in subsequent billions of operations. If a matrix, for example, was being exercised with cross points being randomly closed, it is likely that most, or all, of the cross points will survive at least the billion mechanical operations. A calculation based on, for example, an assumption about how many operations per hour are performed on each switch does not lead to a sensible MTBF estimate for the matrix, as the user is applying end of life criteria to the component.

It should be noted that products using either lower grade reed relays (for example using rhodium contacts) or electromechanical relays will not necessarily have this performance. They may have failures associated with degraded contact materials even at low signal levels, since lower grade reed relays use lower grade contact materials and have lower levels of cleanliness within the glass envelope.

Electromechanical relays are not hermetically sealed and have moving parts that wear out, so they tend to have a lower mechanical life. However, their larger contacts and wiping action make them better suited to high current (and power) applications, since the stress on the part is lower. The same end of life issues apply as for reed relays.

As a consequence of these issues the manufacturers of complex ATE systems use instrument grade reed relays for low level switching of test points, because only instrument grade reed relays provide the consistent performance and lifetime that they require. These are the same components that we use in our PXI modules. Less demanding applications will use EMRs, as will applications requiring higher hot switching capacity or larger carry currents/voltages. (see: Hot switching relays)

Solid State Relays

MIL 217 shows MTTF (Mean Time to Failure) numbers which are orders of magnitude better for solid state relays than for mechanical relays.

Hot Switch Power Impact On Life

As the loaded power of the reed relay (or for that matter any relay) increases, there comes a point where the mechanical life is no longer the determining factor in switch life. As the current or voltage is switched, arcing and contact material migration will start to damage the contact area and limit the life, the degradation being current, voltage and power dependent.

For a typical reed relay with a low power (mechanical) life above 1 billion operations, the lifetime is likely to reduce to, for example, 5 million operations at its rated load (voltage, current or power). Lifetime will be higher if the test system design avoids hot switching, since arcing and inrush currents from intended or stray capacitive loads (even including cables) are minimized. The operating conditions are absolutely critical in estimating the lifetime of the relay in service.
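To put these operation counts into time, the following minimal Python sketch converts a contact life in operations into hours of service at a steady operating rate. The function name and the rate of 1000 operations per hour are illustrative assumptions; the two life figures are the examples quoted above, not data sheet values for any particular relay.

```python
def life_hours(life_operations, ops_per_hour):
    """Hours of service before a relay reaches its quoted operation life."""
    if ops_per_hour <= 0:
        raise ValueError("ops_per_hour must be positive")
    return life_operations / ops_per_hour


MECHANICAL_LIFE = 1_000_000_000  # low power life quoted above (operations)
RATED_LOAD_LIFE = 5_000_000      # example life at rated load quoted above

rate = 1000  # assumed steady operating rate, operations per hour

print(life_hours(MECHANICAL_LIFE, rate))  # 1000000.0 hours
print(life_hours(RATED_LOAD_LIFE, rate))  # 5000.0 hours
print(MECHANICAL_LIFE / RATED_LOAD_LIFE)  # 200.0, the reduction in life
```

The factor of 200 between the two cases comes entirely from the load condition, which is why the operating conditions dominate any lifetime estimate.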

There is no convincing data on how a relay with time varying loads behaves, even though MIL 217 does contain some assumptions in this area. Estimating life under these conditions is not as simple as weighting the switching operations according to load conditions.

It should also be remembered that under hot switching conditions the user's cabling may also have a significant impact on life, because the cable may add a reactive component (e.g. capacitance) to the load seen by the relay.

System Debugging - A Major Cause of Early Failures

One frequent problem occurs during the debugging phase. Programming or construction errors outside the switch matrix can result in accidental shorts being introduced which overload or damage the relays during the commissioning of the system. Our experience has been that early failures are dominated by such factors, the most common being relay contacts welded together because of hot switching of heavy currents (for example across a power supply or when providing power to a large capacitive load). The switching modules are at their most vulnerable during this phase of their use, and these failures are not included in any MTBF calculation. The worst scenario is of course where the contacts are damaged but still function.

Relay Operation Counting

Some products may use counters to indicate how many times a relay has been operated, to advise the user to consider relay replacement as a maintenance item. Given that life varies by almost three orders of magnitude according to loading factors, this is hard to recommend as good practice. The user would need to know what the load was for each operation, and then have some representative model of how that impacts service life.

Prematurely replacing relays can induce more faults through component stress during service (particularly with surface mounted devices), and repeated intervention may render a perfectly good switching system unusable. Relays should be replaced because they are failing or have already failed; there is much to recommend the practice of "if it is working, leave it alone".

For reed relay based designs, we do not support relay operation counting because it does not perform a useful function unless the user makes extraordinary efforts to estimate load conditions (perhaps through the IVI drivers) for each operation. Most users do not have the luxury of enough resources or time to do it well enough to make a difference or a useful prediction.

Diagnostic Test Tools

A much more useful approach to servicing is to have a diagnostic test tool that can identify faults in a matrix. We offer two tools for diagnosing relay failures, BIRST (Built-in Relay Self-Test) and eBIRST, where the contact is rated at 2A or less. The tools will identify the faulty cross point or isolation switch so that the user can replace it. eBIRST requires the use of external tools and BIRST is built into some of our matrix designs.

The use of leaded components in preference to surface mount relays reduces the stress on adjacent components when servicing is performed.

The larger physical size of VXI and GPIB based switching modules allows them to have diagnostics systems built into them to permit self testing of the module switches, but these systems reduce the density of the switching system and increase their complexity and cost.

MIL 217 Calculation

Even though the reliability calculations may not be useful in reflecting field experience, customers may still have to make MTBF calculations to comply with their contractual obligations. This section provides some guidance and notes on the shortcomings of these calculations; it still leaves the customer the task of stating the assumptions and interpreting the results.

MIL 217 provides a set of guidelines on the reliability of specific types of products. For each component in a system the user estimates the reliability as the number of failures per million hours of operation. For PXI switching products the most relevant section is Section 13.1 Mechanical Relays, contained in Notice 2. The MTBF for a system (or module) is obtained by calculating the failure rate for each component (labelled MTTF in the calculations in this note), adding the failure rates together and then taking the reciprocal of the total, scaled from per million hours, to give the MTBF in hours. The failure rate per million hours for each component is derived from a base value that is multiplied by a number of factors to arrive at a final figure. The factors take into account a number of possible variables.
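As a minimal sketch of that summation, the Python below adds per-component failure rates (in failures per million hours) and converts the total into an MTBF in hours. The function name and the three component rates are hypothetical values chosen only to make the arithmetic easy to follow; they are not taken from MIL 217 or from any product.

```python
def mtbf_hours(failure_rates_per_million_hours):
    """MTBF in hours from component failure rates given per million hours."""
    total = sum(failure_rates_per_million_hours)
    if total <= 0:
        raise ValueError("total failure rate must be positive")
    # Rates are per 1,000,000 hours, so scale the reciprocal accordingly.
    return 1_000_000 / total


# Hypothetical module with three components (illustrative rates only).
rates = [0.5, 1.5, 2.0]
print(mtbf_hours(rates))  # 250000.0
```

Because the rates simply add, a large number of individually reliable parts can still produce a modest calculated MTBF.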

For relays the document starts with a base failure rate, λb, of 0.0059 per million hours. This base rate is then adjusted by a series of factors as follows:

  • Base Failure
    The base failure rate is temperature dependent; 0.0059 applies at 25°C. This figure increases as temperature increases, but the rate of increase for products in a PXI chassis is not rapid. At 30°C, for example, it has increased to 0.0067. For typical PXI environments the 0.0059 (per million hours) figure is relevant since the chassis is cooled and switching modules do not dissipate much power.

  • Contact form factor 
    SPST switches are simpler and more reliable than more complex switches (especially true for reed relays) and so a factor is included to account for this. SPST switches have a factor of 1 and DPST a factor of 1.5 (other form factors are worse). These are the two switch types normally used in Pickering Interfaces high density products.  

  • Stress Factor
    For lightly loaded switches the stress factor is 1. The factor increases with load, and values are given for resistive loads, inductive loads and lamps. The standard says nothing about capacitive loads, which can be the most damaging. The most predictable environment is resistive, and this is the condition that should be used for cold switching. For this condition the stress factor increases from 1 to 4.77 at full load. This does not match experience of relay life, which varies by a factor of more than 200 according to the load level, but agreement should not be expected since that variation is a wear out mechanism and not an in service failure mechanism.
  • Cycling Factor
    This factor is applied depending upon the number of operations of the relay per hour. For commercial quality relays operated fewer than 10 times per hour the factor is 1; for rates between 10 and 1000 per hour the factor is the number of operations per hour divided by 10; for rates above 1000 per hour it is the number of operations per hour divided by 100, then squared. In high-density arrays of relays many could be operated at less than 10 per hour, many may never even be operated, and some may be operated much more frequently.
  • Quality Factor
    Depending on the screening level of the component a factor is applied which varies from 2.9 for commercial grade to 0.1 for a military R grade. For the relays that Pickering Interfaces uses we suggest a factor of 1 or 1.5.
  • Environmental Factor
    The Ground Benign factor is 1 and is the most representative value for instrumentation designed for normal laboratory style use. Ground Fixed is a more stressed environment with a factor of 2, typically being the case in a rack of equipment which requires ventilation support because of a high thermal load.
  • Application Factor
    This classifies different types of relays and lists dry reed relays as having a factor of 6. By comparison with other relay types this factor does not agree with Pickering Interfaces' experience (for example, it suggests long armature relays have a factor of 4, which is certainly not industry experience of the relative life of armature and reed relays).
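The factors listed above multiply together, which can be sketched in Python as follows. This is an illustration under stated assumptions, not a substitute for the standard: the function and parameter names are hypothetical, the default values are the ones quoted in this note (base rate 0.0059 at 25°C, dry reed application factor 6), and the cycling rule above 1000 operations per hour is assumed to be the rate divided by 100 and then squared.

```python
def cycling_factor(ops_per_hour):
    """Cycling factor for commercial quality relays as described above."""
    if ops_per_hour < 10:
        return 1.0
    if ops_per_hour <= 1000:
        return ops_per_hour / 10
    # Assumption: above 1000/h the rate is divided by 100, then squared.
    return (ops_per_hour / 100) ** 2


def relay_failure_rate(ops_per_hour, base=0.0059, stress=1.0,
                       contact_form=1.0, quality=1.0, environment=1.0,
                       application=6.0):
    """Relay failure rate in failures per million hours."""
    return (base * stress * contact_form * cycling_factor(ops_per_hour)
            * quality * environment * application)


for ops in (5, 100, 1000, 2000):
    print(ops, cycling_factor(ops))  # 1.0, 10.0, 100.0, 400.0

print(f"{relay_failure_rate(1000):.4f}")  # 3.5400
```

Doubling the rate from 1000 to 2000 operations per hour quadruples the cycling factor, which illustrates how sensitive the calculated figure is to this single assumption.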

Making a Calculation

It should be evident that a calculation according to MIL 217 requires the user to make some assumptions about the conditions the system is operating in, and that it does not include wear out mechanisms. So the section that follows uses some assumptions to arrive at sample results. Note that these results assume nothing about the relative merits of individual products; they are simply calculations based on the figures included in the standard and on the component types used.

  • Single Relay Low Stress Condition

Taking the low stress conditions first a sample calculation is:

25°C, low stress, 1000 cycles per hour, quality factor of 1, ground benign and reed construction

MTTF = 0.0059*1*1*100*1*1*6

MTTF = 3.54 per million hours 

25°C, low stress, 100 cycles per hour, quality factor of 1, ground benign and reed construction

MTTF = 0.0059*1*1*10*1*1*6

MTTF = 0.354 per million hours 

25°C, low stress, up to 10 cycles per hour, quality factor of 1, ground benign and reed construction 

MTTF = 0.0059*1*1*1*1*1*6

MTTF = 0.0354 per million hours

Compare this with real life test results, where a typical Pickering Interfaces reed relay has a life exceeding 1 billion operations, or 1 million hours if operated 1000 times per hour. The actual failure rate under these conditions is much lower than the calculated value, because the measured life is set by a wear-out mechanism and end of life criteria, which are not part of MIL 217 calculations.
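The comparison can be made concrete with a short Python sketch that converts the calculated rate at 1000 cycles per hour into an MTBF and into the number of operations performed in that time. The variable names are illustrative assumptions; the numbers are the ones used in the sample calculation above, with all factors other than cycling and application set to 1.

```python
BASE_RATE = 0.0059    # per million hours at 25°C
APPLICATION_REED = 6  # dry reed application factor

ops_per_hour = 1000
cycling = ops_per_hour / 10  # cycling factor for 10 to 1000 operations/hour
rate = BASE_RATE * cycling * APPLICATION_REED  # failures per million hours

calculated_mtbf_hours = 1_000_000 / rate
calculated_operations = calculated_mtbf_hours * ops_per_hour

measured_operations = 1_000_000_000  # life "exceeding 1 billion" operations

print(f"calculated rate: {rate:.2f} per million hours")
print(f"calculated MTBF: {calculated_mtbf_hours:,.0f} hours")
print(f"operations in that time: {calculated_operations:,.0f}")
print(calculated_operations < measured_operations)  # True
```

The calculated MTBF corresponds to roughly 280 million operations, well short of the measured life of more than a billion operations.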

  • Applying to a Matrix

    These numbers in themselves may seem to indicate a long life, but for high-density switching systems this perception can quickly change.

    A high-density BRIC4 matrix module contains up to 2200 relays in a single module. If it is assumed that 95% of the matrix is used at up to 10 operations per hour and 5% at up to 1000 per hour, an MTTF is arrived at of:

    MTTF = 2200*0.95*0.0354 + 2200*0.05*3.54

    MTTF = 463 per million hours or MTBF = 2158 hours

    The calculation in this case is dominated by the high cycle rate component (5% of the matrix), highlighting the impact that cycling factors have on the calculated result. If the 5% of the matrix was reduced to 100 operations per hour the MTTF would drop to 113 per million hours and the MTBF would increase to about 8850 hours.

    Pickering Interfaces experience is that these results should be significantly exceeded in real life conditions, and clearly making minor adjustments to the cycling factor in particular will have a big impact on the reported number in ways we do not see in field failures. It is for this reason that we do not supply users with a single MTBF number.
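The sensitivity to the cycling assumption can be explored with a small Python sketch of the matrix calculation. The function name and the way usage is expressed (a list of fraction and per-relay rate pairs) are assumptions made for illustration; the relay count and per-relay rates are the ones used in the example above.

```python
def matrix_failure_rate(relay_count, usage):
    """Total failures per million hours for a relay matrix.

    usage is a list of (fraction_of_relays, per_relay_failure_rate) pairs.
    """
    return sum(relay_count * fraction * rate for fraction, rate in usage)


# 95% of relays at up to 10 operations/hour, 5% at 1000 operations/hour.
busy = matrix_failure_rate(2200, [(0.95, 0.0354), (0.05, 3.54)])
# Same matrix with the busy 5% reduced to 100 operations/hour.
reduced = matrix_failure_rate(2200, [(0.95, 0.0354), (0.05, 0.354)])

for label, rate in (("1000/h", busy), ("100/h", reduced)):
    print(f"{label}: {rate:.1f} per million hours, "
          f"MTBF {1_000_000 / rate:.0f} hours")
```

Running this gives about 463 per million hours (MTBF near 2158 hours) for the first case and about 113 per million hours (MTBF near 8855 hours) for the second, so a change in the assumed rate for 5% of the relays moves the result by roughly a factor of four.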

    • High Stress Conditions Single Relay

    The calculations are different in high stress conditions, but the principle is the same. If the load is increased to full load resistive (implying cold switching) the stress factor increases to 4.77.

    Taking the same conditions as before for each relay: 

    25°C, full load resistive stress, 1000 cycles per hour, quality factor of 1, ground benign and reed construction

    MTTF = 0.0059*4.77*1*100*1*1*6

    MTTF = 16.89 per million hours

    25°C, full load resistive stress, 100 cycles per hour, quality factor of 1, ground benign and reed construction

    MTTF = 0.0059*4.77*1*10*1*1*6

    MTTF = 1.689 per million hours

    25°C, full load resistive stress, up to 10 cycles per hour, quality factor of 1, ground benign and reed construction

    MTTF = 0.0059*4.77*1*1*1*1*6

    MTTF = 0.1689 per million hours

    As stated previously, a reed relay at full rated load is likely to have a minimum life of 5 million operations (an electromechanical relay can be much worse, though its ratings may be higher). A relay with a life of 5 million operations at full load would reach end of life after 5000 hours at 1000 operations per hour, equivalent to about 200 failures per million hours, clearly much worse than the calculated value. Again the discrepancy arises because these wear out mechanisms are not part of the MIL 217 standard calculation; the number calculated indicates the failure rate during service at times in the service life where the wear out mechanism is not a factor.
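The gap between the calculated figure and the wear out limit can be checked with the short Python sketch below. The variable names are illustrative assumptions; the factors are those from the full load calculation above, and the wear out life is treated, as a simplification, as if every relay failed exactly at 5 million operations.

```python
# Calculated full load figure at 1000 operations/hour, factors as above.
calculated_rate = 0.0059 * 4.77 * 1 * 100 * 1 * 1 * 6  # per million hours

# Wear out limit: 5 million operations at 1000 operations/hour.
life_operations = 5_000_000
ops_per_hour = 1000
life_hours = life_operations / ops_per_hour  # 5000.0 hours
wearout_rate = 1_000_000 / life_hours        # 200.0 per million hours

print(f"calculated: {calculated_rate:.2f} per million hours")
print(f"wear out:   {wearout_rate:.1f} per million hours")
print(f"ratio:      {wearout_rate / calculated_rate:.1f}")
```

Under these assumptions the wear out limit is more than ten times worse than the calculated failure rate, which is why end of life has to be managed separately from the MIL 217 figure.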

    Summary

    It can be seen that calculating MTBF numbers for relay assemblies is not a straightforward task and the results need to be treated with caution. To make the calculation the user has to state the conditions and assumptions that are used. 

    Use of the most appropriate high quality products can improve system reliability, but the actual service life calculation can lead to misunderstandings about the life expectancy.

    Relay counting methods are particularly unreliable at anticipating failures, and premature servicing of relays can cause more failures than the replacement exercise was designed to prevent. It is much more effective to rely on good diagnostic tools being available, such as BIRST and eBIRST. These tools will reduce the risk of replacing relays unnecessarily and avoid errors in diagnosing problems.

    Finally, a reminder: if you provide us with the key information required, we can provide MTBF figures to MIL 217. Please contact customer support; you will be asked to fill in a simple form and we will return the form with your MTBF estimate.
