Reliability

Companies increasingly rely on cloud services to run critical areas of their businesses, and even brief downtime translates to lost productivity and revenue. For customers to rely on a cloud service, a guarantee of 99.9% service uptime has become an industry standard. That means solution providers must have less than 90 seconds of downtime in a day, less than 45 minutes in a month and less than 9 hours in a year. To maintain uptime, each critical component of an application needs to be available. To the customer, available means the end-to-end business functions are working as intended. When the major cloud providers (e.g. Amazon’s AWS, Microsoft Azure) guarantee an uptime of 99.95%, there is very little room for error left for the solution providers.
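
The downtime figures above follow from simple arithmetic, sketched here in Python. The function names are illustrative, and the second function assumes that provider and application failures are independent and that an outage in either one takes the service down, which is a simplification.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def downtime_budget_seconds(uptime_percent: float, period_days: float) -> float:
    """Maximum downtime, in seconds, allowed over a period at an uptime target."""
    unavailable_fraction = 1 - uptime_percent / 100
    return unavailable_fraction * period_days * SECONDS_PER_DAY


def required_application_uptime(target_percent: float, provider_percent: float) -> float:
    """Uptime the application layer must reach on its own to hit the target.

    Assumption: failures are independent and either layer failing causes an
    outage, so the two availabilities multiply.
    """
    return target_percent / provider_percent * 100


if __name__ == "__main__":
    for label, days in [("day", 1), ("30-day month", 30), ("365-day year", 365)]:
        minutes = downtime_budget_seconds(99.9, days) / 60
        print(f"99.9% uptime over one {label}: {minutes:.1f} minutes of downtime")
    print(f"Application uptime needed: {required_application_uptime(99.9, 99.95):.4f}%")
```

At 99.9% the budget works out to 86.4 seconds a day, 43.2 minutes over a 30-day month and 8.76 hours a year. Under the stated assumptions, a provider at 99.95% leaves the application itself only about 0.05% of unavailability to spend.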

Understanding Uptime
To maintain high availability, we need to understand the reasons that services can be unavailable.

Hardware Failure
All applications need physical hardware somewhere to compute and store data. The most recognizable form of downtime is hardware failure. If the power goes out or the hardware malfunctions, your whole system fails. For solution providers, this is the one area where the cloud providers take on most of the risk. Cloud providers continually monitor hardware performance and notify solution providers when configuration changes are required.

Security Events
Online services and customer data are valuable, so there will always be malicious users trying to compromise online applications. To maintain 99.9% uptime, a system must be secure.

Application Failures
Applications must work as intended to maintain uptime numbers. Even when functionality is tested, software failures can happen. Common software failures are related to unexpected user input and the management of system resources. Application failures are the area where solution providers spend most of their effort to maintain high availability.

Identifying Downtime
Knowing there is an issue is the first step in maintaining 99.9% uptime. When every minute counts, we cannot rely on humans to monitor services. Solution providers must implement a wide arsenal of tools to identify downtime and automatically attempt to recover from any issues that occur. Monitoring tools tell us the status of a service (OK/BAD), but the performance or health of a service is equally important.
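
As a rough illustration of tracking both status and health, the sketch below runs a probe, records OK/BAD, and also measures how long the probe took. The names CheckResult and run_check and the 500 ms latency budget are assumptions made for this example, not part of any particular monitoring product.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    status: str        # "OK" or "BAD"
    latency_ms: float  # how long the probe took
    healthy: bool      # True only when status is OK and latency is within budget


def run_check(probe: Callable[[], bool], max_latency_ms: float = 500.0) -> CheckResult:
    """Run one probe and report both its status and its response time."""
    start = time.perf_counter()
    try:
        ok = bool(probe())
    except Exception:
        # A probe that raises is treated the same as one that reports failure.
        ok = False
    latency_ms = (time.perf_counter() - start) * 1000
    return CheckResult(
        status="OK" if ok else "BAD",
        latency_ms=latency_ms,
        healthy=ok and latency_ms <= max_latency_ms,
    )
```

A service that answers correctly but slowly comes back with status "OK" and healthy set to False, which is the case a plain up/down check would miss.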

Internal Monitoring Tools
Software solutions are made up of many systems and services. In most cases, any one of these services can take down a whole application. Internally we break down each of these services and identify the components that need to be monitored. Internal monitoring allows us to customize the parameters we look at for health (e.g. memory consumption, CPU cycles, drive space).
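
A minimal sketch of such a customizable internal check is shown below. The limit values and metric names are hypothetical. Drive space is read with the standard library, while memory and CPU readings are assumed to come from whatever agent or library a team already uses.

```python
import shutil

# Hypothetical limits; each service would tune these for itself.
LIMITS = {"memory_percent": 90.0, "cpu_percent": 85.0, "disk_percent": 80.0}


def disk_percent_used(path: str = ".") -> float:
    """Percentage of the drive holding `path` that is currently in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def breached_limits(metrics: dict, limits: dict = LIMITS) -> list:
    """Return the names of the metrics that exceed their configured limit."""
    return sorted(
        name for name, value in metrics.items()
        if name in limits and value > limits[name]
    )


if __name__ == "__main__":
    # Memory and CPU values here are made-up sample readings.
    sample = {
        "memory_percent": 95.0,
        "cpu_percent": 40.0,
        "disk_percent": disk_percent_used(),
    }
    print("Limits breached:", breached_limits(sample))
```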

External Monitoring
If a system doesn’t work for the end customer, then it doesn’t work. If you don’t monitor your applications the same way your customers use them, then a disaster is waiting to happen. Monitoring tools like Pingdom and Uptime Robot provide basic third-party monitoring that can send automated alerts to notify you when there is a problem. External tools are essential for holding companies accountable when providing uptime metrics to their clients. They also prevent companies from manipulating uptime data to meet strict customer commitments.
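
To make the idea concrete, here is a simplified sketch of what an external check does: request a public URL from outside, record success or failure, and compute uptime from the recorded probes. It is not how Pingdom or Uptime Robot are implemented; the function names and the one-probe-per-minute schedule are assumptions for illustration.

```python
import urllib.request
from typing import Sequence


def http_probe(url: str, timeout_seconds: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx or 3xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return 200 <= response.status < 400
    except OSError:
        # Covers connection errors, timeouts and HTTP error responses.
        return False


def measured_uptime_percent(probe_results: Sequence[bool]) -> float:
    """Share of probes that succeeded, as a percentage."""
    if not probe_results:
        raise ValueError("no probe results recorded")
    return sum(probe_results) / len(probe_results) * 100


def meets_commitment(probe_results: Sequence[bool], target_percent: float = 99.9) -> bool:
    """True if the measured uptime reaches the committed target."""
    return measured_uptime_percent(probe_results) >= target_percent
```

With one probe per minute, a day holds 1,440 probes. One failed probe still meets a 99.9% target, while two failed probes in the same day do not.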

24/7 Support
Even with a full set of monitoring tools, you cannot guarantee 99.9% uptime without on-call support personnel in place. Ultimately, the best person to identify downtime is the customer. If the customer thinks there is a problem, then there is a problem. Having agreements in place that let customers call a support line when there are problems is essential for any high availability service.

Maintaining Uptime
Software applications are becoming increasingly complex. To maintain high availability, systems need to be continually monitored and improved after each failure. Each time an issue is resolved, a root cause analysis should determine whether enhancements need to be made.

To achieve consistent levels of uptime, there are also regular maintenance tasks that need to be performed:

Regular Security Updates – to prevent malicious users from taking down your systems.
Redundancy – to take over when failures are detected.
Data backup and recovery – to restore data to the last good condition in the event of a failure.
Risk Assessments – to identify new risks added by functionality, technology or security trends.
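
Of the tasks listed above, redundancy is the easiest to show in a few lines. The sketch below tries each redundant replica in order and returns the first successful answer. The name call_with_failover is hypothetical, and a production system would normally add health checks, timeouts and alerting around it.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Call each replica in order and return the first successful result."""
    last_error = None
    for replica in replicas:
        try:
            return replica()
        except Exception as error:
            # Remember the failure and fall through to the next replica.
            last_error = error
    raise RuntimeError("all replicas failed") from last_error


if __name__ == "__main__":
    def primary() -> str:
        raise ConnectionError("primary is down")

    def standby() -> str:
        return "served by standby"

    print(call_with_failover([primary, standby]))
```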

As a solution provider, your reputation and customer satisfaction rely on your product availability. Software as a service must work reliably. By implementing a robust automated monitoring and recovery system, solution providers can achieve 99.9% uptime and higher. As a customer, you shouldn’t expect any less.