Network fault management, a key part of today's network management architecture, covers functions such as detecting, isolating, determining the cause of, and correcting malfunctions in a network. The objectives of fault management are to increase network availability, reduce network downtime, and recover from network failures quickly.
The basic requirements for a fault management system are:
- Monitoring and collecting statistics on network devices, traffic conditions, and usage in real time to forecast and avoid potential faults
- Setting thresholds and alarms on conditions that may cause network failure, to warn the network administrator
- Setting alarms that warn of performance degradation on network devices and links
- Setting alarms for network resource usage (such as hard disk space) and resource limit problems
- Remotely controlling network devices (rebooting, shutting down, etc.)
- Providing a centralized console to perform all of the above functions
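The threshold and alarm requirements above can be illustrated with a minimal sketch. The metric names (`disk_used_pct`, `link_util_pct`), the two severity levels, and the threshold values are assumptions chosen for illustration; they are not taken from any particular fault management product.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str      # hypothetical metric name, e.g. "disk_used_pct"
    warning: float   # value at which a warning alarm is raised
    critical: float  # value at which a critical alarm is raised


def evaluate(sample, thresholds):
    """Compare one sample of collected statistics against the thresholds.

    Returns a list of (metric, severity, value) tuples, one per raised alarm.
    """
    alarms = []
    for t in thresholds:
        value = sample.get(t.metric)
        if value is None:
            continue  # metric was not collected in this sample
        if value >= t.critical:
            alarms.append((t.metric, "critical", value))
        elif value >= t.warning:
            alarms.append((t.metric, "warning", value))
    return alarms


# Assumed thresholds for a resource-usage alarm and a link-degradation alarm.
THRESHOLDS = [
    Threshold("disk_used_pct", warning=80.0, critical=90.0),
    Threshold("link_util_pct", warning=70.0, critical=95.0),
]

alarms = evaluate({"disk_used_pct": 93.0, "link_util_pct": 50.0}, THRESHOLDS)
```

In this sketch only the disk metric crosses a threshold, so a single critical alarm is raised. A real system would also deliver the alarm to the administrator, for example through the centralized console.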
A typical fault management system follows these steps:
Detection -> Analysis -> Action Taking
When an error occurs, a report is generated and sent to the fault analyzer. The fault analyzer diagnoses and records the problem. Finally, a system or a person uses the information from the fault analyzer to take appropriate action, such as isolating the error, blacklisting failing or failed components, automatically restarting or restoring services, and alerting the system administrator.
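The flow from detection through analysis to action can be sketched as follows. This is an illustrative model only: the class and function names, the component identifier `router-7`, and the policy of blacklisting a component after three reported failures are all assumptions, not part of any standard.

```python
from dataclasses import dataclass, field


@dataclass
class FaultReport:
    component: str  # hypothetical component identifier
    error: str      # short description of the detected error


@dataclass
class FaultAnalyzer:
    blacklist_after: int = 3  # assumed policy: blacklist on the 3rd failure
    records: list = field(default_factory=list)
    failure_counts: dict = field(default_factory=dict)

    def analyze(self, report):
        """Record the report and return a simple diagnosis."""
        self.records.append(report)
        count = self.failure_counts.get(report.component, 0) + 1
        self.failure_counts[report.component] = count
        return {
            "component": report.component,
            "error": report.error,
            "persistent": count >= self.blacklist_after,
        }


def take_action(diagnosis, blacklist):
    """Choose actions from a diagnosis; returns the actions as strings."""
    component = diagnosis["component"]
    actions = [f"isolate {component}"]
    if diagnosis["persistent"]:
        blacklist.add(component)
        actions.append(f"blacklist {component}")
    else:
        actions.append(f"restart {component}")
    actions.append("alert admin")
    return actions


analyzer = FaultAnalyzer()
blacklist = set()
history = []
for _ in range(3):
    # Detection: an error occurs and a report is generated.
    report = FaultReport("router-7", "link down")
    # Analysis, then action taking.
    history.append(take_action(analyzer.analyze(report), blacklist))
```

Under these assumptions, the first two failures lead to a restart attempt and the third leads to the component being blacklisted. Every case ends with an alert to the administrator.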
Related Terms: Network management, Performance management, Configuration management, Security management
