Fault tolerance (FT) ensures the continuous operation of computer systems or networks even when system components are malfunctioning or have even failed. Fault-tolerant systems are used where maximum availability and high availability are essential.
Such a system, whose behavior is also described with Graceful Degradation, must master various error handling methods: Fault detection, fault avoidance and the propagation of faults, recovery procedures to restore normal conditions after a fault, malfunction or defect. The system must also be able to reconfigure itself.
Fault tolerance includes several preventive measures. These lie in the redundancy of processors, components and devices that take over the function of a component in the event of its failure. This also includes the duplication of data inventories, i.e. mirroring to multiple storage units such as RAID systems, which protects against errors and failures. In addition, fault tolerance can refer to the physical network structure, which can be duplicated. In the event of a failure in the cabling system, the parallel cabling structure is automatically used for transmission.
Software fault tolerance refers to fault detection and recovery procedures.