Fault Tolerance with Linux High Availability

Derek Wiedenhoeft


IT downtime is expensive for any business.  Gartner[I] estimates that each minute of downtime costs $5,600 on average, with true costs depending on the vertical, the size of the company, and other factors.  The cost can be largely avoided, however, with systems designed for high availability and fault tolerance.

Definition: High Availability
Oracle[II] defines high availability as “computing environments configured to provide nearly full-time availability.”  A commonly held standard for high availability is “five nines,” or 99.999 percent uptime.

Not all service providers are able to meet this demanding standard, which permits just over 5 minutes of downtime per year.
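The "just over 5 minutes" figure follows directly from the percentage. The short Python sketch below works through the arithmetic for a few availability levels; the function name and the choice of a 365.25-day average year are assumptions made for illustration.

```python
# Minutes in an average year (365.25 days, an assumption that smooths over leap years).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Compare "three nines", "four nines", and "five nines".
for level in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(level)
    print(f"{level}% uptime allows {minutes:.2f} minutes of downtime per year")
```

At 99.999 percent the budget works out to about 5.26 minutes per year, and each additional "nine" cuts the allowance by a factor of ten.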

For organizations whose losses would approach that average downtime cost, achieving even higher availability than “five nines” is important to profitability, and even survival. Atlantic.Net offers an industry-leading 100 percent network uptime guarantee, in part by leveraging Linux High Availability (Linux-HA).

Introduction to High Availability

As Oracle explains, networks are configured for high availability by using redundant hardware and software and by avoiding single points of failure, so that the system keeps running in the event of a problem. A load balancer distributes workloads among parts of the network and redirects traffic away from any component that has failed or been taken offline.

The servers grouped together for unified functioning by the load balancer are known as a cluster.  A system that continues to operate properly when one of its components fails is considered fault tolerant.  The automatic movement of traffic or a workload within the cluster to avoid a failure is called a failover process, and when it is employed, an end-user can continue using an application even if the server it is on crashes.
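To make the failover idea concrete, here is a minimal Python sketch of the selection logic only. The `Server` class, the `pick_server` function, and the server names are hypothetical; a real load balancer detects failures through its own health checks rather than a flag set by hand.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """A hypothetical cluster member with a simple health flag."""
    name: str
    healthy: bool = True


def pick_server(cluster: list[Server]) -> Server:
    """Return the first healthy server, skipping any that have failed."""
    for server in cluster:
        if server.healthy:
            return server
    raise RuntimeError("no healthy servers left in the cluster")


cluster = [Server("primary"), Server("standby")]
print(pick_server(cluster).name)  # traffic goes to the primary

cluster[0].healthy = False        # simulate a crash of the primary
print(pick_server(cluster).name)  # traffic fails over to the standby
```

The end user's request is still answered after the simulated crash, which is the behavior the failover process is meant to guarantee.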

The primary benefit of high availability systems is the reduction of costs from unplanned downtime. Load balancing does more than increase reliability, however: it can also speed up recovery through automation and error detection, and it can improve application performance.

As NGINX[III] puts it, “Even if an application is poorly written, or has problems with scaling, a load balancer can improve the user experience without any other changes.”

By distributing workloads between servers, the load balancer allows a particular application to scale up as needed to keep the system running smoothly, regardless of how busy it is.
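One of the simplest ways to distribute workloads is round-robin rotation, where each new request goes to the next server in turn. The sketch below is an illustration under that assumption; the `round_robin` function and the server and request names are made up for the example, and production load balancers offer several other strategies.

```python
from itertools import cycle


def round_robin(servers, requests):
    """Pair each request with the next server in a repeating rotation."""
    rotation = cycle(servers)
    return [(request, next(rotation)) for request in requests]


assignments = round_robin(["web1", "web2"], ["req1", "req2", "req3", "req4"])
for request, server in assignments:
    print(f"{request} -> {server}")
```

Adding a third server to the list spreads the same requests across three machines with no change to the application itself, which is how this approach lets capacity grow with demand.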

The ability to update system components without taking the whole system offline also helps ensure that maintenance tasks like backups and updates are performed properly, rather than rushed in order to get back into operation. High availability systems provide further protection by enabling organizations to proactively monitor their network, and by reducing the risk of data loss with redundant storage.

High availability can also be valuable, or even necessary, for ensuring regulatory compliance, such as HIPAA compliance.  The HIPAA Security Rule[IV] requires that “information is accessible and useable upon demand,” as well as a contingency plan to ensure it remains so “during unexpected negative events,” such as unexpected demand or hardware failure.

Building Fault Tolerance into Your Network

Some commonly used products that can provide fault tolerance include Apache ZooKeeper, Pacemaker, and HAProxy. ZooKeeper[V] is an open-source coordination service for distributed systems that provides high availability when run on multiple servers. It runs on network nodes in odd-numbered “ensembles,” and coordinates them through a namespace of data registers it creates. Pacemaker[VI] is a cluster resource manager that is also open-source; it was originally part of the Linux-HA project but has since become a project of its own. It too runs on the nodes, and coordinates them via a cluster infrastructure service such as Heartbeat or OpenAIS.
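The preference for odd-numbered ensembles is usually explained by majority voting: a ZooKeeper ensemble stays available only while more than half of its servers are up. The Python sketch below works through that arithmetic. The function names are hypothetical, and the code models only the majority rule, not ZooKeeper itself.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Number of servers that can fail while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)


for size in (3, 4, 5, 6):
    print(
        f"{size} servers: quorum of {quorum_size(size)}, "
        f"tolerates {tolerated_failures(size)} failure(s)"
    )
```

Under this rule, a four-server ensemble tolerates the same single failure as a three-server one, so an even-numbered ensemble adds hardware without adding fault tolerance.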

HAProxy[VII] is a load balancer and proxy for TCP and HTTP traffic, and it is included with Atlantic.Net’s Managed Firewall appliance.

Fault tolerance is provided by HAProxy’s control of redundant network resources. If a server fails, HAProxy uses one of the several algorithms it includes to redirect traffic away from the problem and to the redundant server, which it has maintained in readiness for this purpose. The switch to the new server takes roughly a second, while it can take hours to bring a crashed server back online. When a failure does occur, the cost of that redundant server is generally recovered within minutes through the downtime it avoids.
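A rough calculation shows the scale of the difference. The sketch below applies the Gartner average of $5,600 per minute cited earlier to a one-second failover and to a recovery that takes hours. The two-hour recovery time and the function name are assumptions chosen for illustration, and actual costs vary widely by business.

```python
# Gartner's average downtime cost cited earlier in the article, in US dollars.
COST_PER_MINUTE = 5_600


def downtime_cost(seconds: float) -> float:
    """Estimate the cost of an outage lasting the given number of seconds."""
    return COST_PER_MINUTE * seconds / 60


failover_cost = downtime_cost(1)          # roughly one second to switch servers
outage_cost = downtime_cost(2 * 60 * 60)  # hypothetical two-hour manual recovery

print(f"Failover (1 second): ${failover_cost:,.2f}")
print(f"Manual recovery (2 hours): ${outage_cost:,.2f}")
```

At the average rate, a one-second switch costs under $100, while a two-hour outage costs $672,000.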

HAProxy not only keeps your site running when one server fails or needs to be brought down for maintenance, but can also be set up to load balance your web traffic when both servers are up, improving response times for your customers.

HAProxy is also open-source, and is now shipped with many popular Linux distributions. The active HAProxy community continually updates the software, and new versions can be deployed without reconfiguration. HAProxy serves billions of web pages a day, moves large amounts of money for Fortune 500 companies, and has gone 13 years without a bug in a stable (finished) version or a single known intrusion.

Interrelated Best Practices

In a scenario in which malicious network traffic causes a failure, load balancing will generally not resolve the problem on its own.  The firewall, which filters traffic, prevents the problem from simply following the workload to the new server.  Likewise, the traffic filtering of the firewall does little to reduce the network’s vulnerability to hardware failures or software bugs within it.  Utilizing both a strong firewall and a high availability system provides a dramatic improvement in overall protection.

A network with built-in redundancy, with workloads controlled by a load balancer, is tolerant of even worst-case faults, and provides maximum availability. Just as the right mix of different components ensures network reliability, organizations that would benefit from high availability will achieve it by using a load balancer like HAProxy, along with a full set of redundant network components.

With Atlantic.Net’s Managed Hosting solutions, we make sure your servers are set up for high availability upon request. Combined with our Managed Firewall appliance, this provides a reliable solution to prevent your site from going down when you need it the most. For help or more information, email us at sales@atlantic.net. Our sales team can help guide you quickly and easily through the process.

[I] http://blogs.gartner.com/andrew-lerner/2014/07/16/the-cost-of-downtime/

[II] https://docs.oracle.com/cd/A91202_01/901_doc/rac.901/a89867/pshavdtl.htm

[III] https://www.nginx.com/blog/10-tips-for-10x-application-performance/

[IV] https://www.hhs.gov/sites/default/files/ocr/privacy/hipaa/administrative/securityrule/securityrulepdf.pdf?language=es

[V] https://cwiki.apache.org/confluence/display/ZOOKEEPER/ProjectDescription

[VI] http://wiki.clusterlabs.org/wiki/Pacemaker

[VII] http://www.haproxy.org/

