Reliability Engineering
Designing and building components to be reliable over time in a variety of real-world conditions.

Change Management
Controlling change to an environment using a managed process that requires extensive testing before changes are launched to production.

Configuration Management
Tracking the configuration and design of production to support activities such as troubleshooting and rollback to a stable state.

Capacity Management
Managing the capacity of resources such as licenses, storage and computing.

Runbook Automation
Automation of support, operations and incident response processes.

Load Balancing
Distributing workloads across multiple resources using techniques such as cloud computing.

Failure Detection
Automatically detecting failures.

Failover
Automatically moving workloads from failed resources to functioning resources.

Service Desk
Providing a single contact for users to report incidents.

Incident Management
Rapidly escalating incidents to the people in a position to fix them and restoring service as quickly as possible.

Problem Management
Following up on incidents to determine root cause and implement fixes and changes to prevent future incidents.

Overview: High Availability
Type |
Definition (1) | A service that is designed and operated to minimize downtime.
Definition (2) | A service designed and operated to achieve uptime of 99.99% or more.
Related Concepts | Reliability Engineering, Incident Management, Problem Management, Testing, Configuration Management, Service Desk
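
To make the 99.99% figure in Definition (2) more concrete, the following is a minimal sketch that converts an availability target into a downtime budget. The function name `downtime_budget_minutes`, the 365-day year and the 30-day month are assumptions chosen for illustration; they are not part of the definition itself.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_percent: float, period_days: float) -> float:
    """Return the maximum downtime, in minutes, allowed over a period.

    availability_percent: uptime target as a percentage, e.g. 99.99
    period_days: length of the measurement period in days (assumed value)
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    # Fraction of the period during which the service may be unavailable.
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_days * MINUTES_PER_DAY


if __name__ == "__main__":
    # Compare a few common targets over an assumed 365-day year and 30-day month.
    for target in (99.9, 99.99, 99.999):
        per_year = downtime_budget_minutes(target, 365)
        per_month = downtime_budget_minutes(target, 30)
        print(f"{target}%: {per_year:.1f} min/year, {per_month:.2f} min/month")
```

Under these assumptions, an uptime of 99.99% leaves roughly 52.6 minutes of downtime per year, or about 4.3 minutes per 30-day month.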