Why ECC Memory is Critical for Financial and Medical Businesses

By: Kris Fieler

As businesses depend more on big data, preventing data loss has never been more important. One of the most vital areas for this loss prevention is where data is temporarily stored: RAM. ECC (Error-Correcting Code) memory protects your system from potential crashes and inadvertent changes in data by automatically correcting data errors. This is achieved by adding a ninth memory chip to the RAM module, which provides error checking and correction for the other eight chips. While marginally more expensive than non-ECC RAM, the added protection it provides is critical as applications become more dependent on large amounts of data.

[Figure: ECC vs. non-ECC memory]

Likelihood of a Memory Error

On any server with financial information or critical personal information, especially medical, any data loss or transcription error is unacceptable.  Memory errors can cause security vulnerabilities, crashes, transcription errors, lost transactions, and corrupted or lost data.

Experts estimate that memory errors occur at rates of 2,000 to 6,000 per GB per year of uninterrupted operation[i]. While desktop computers may not show noticeable memory errors very often, systems that operate for long periods of time, like data-centric servers, are at greater risk. The risk also increases with larger amounts of memory and with the age of the system. In a sensitive, high-demand work environment, every precaution must be taken to reduce the likelihood of errors. The most common type of memory error is a single-bit error.
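To put the cited rate in perspective, the arithmetic can be sketched as below. This is only an illustration: it assumes the rate scales linearly with memory size and uptime, and the 64 GB server is a hypothetical example rather than a figure from the study. The result is an average, not a prediction for any one machine.

```python
# Illustrative arithmetic only. Assumes the cited per-GB rate scales
# linearly with memory size and uptime (a simplifying assumption).
LOW_RATE = 2000   # errors per GB per year, lower bound cited in the article
HIGH_RATE = 6000  # errors per GB per year, upper bound cited in the article


def expected_errors(memory_gb, years, rate_per_gb_year):
    """Expected number of memory errors under linear scaling."""
    return memory_gb * years * rate_per_gb_year


memory_gb = 64  # hypothetical server size
low = expected_errors(memory_gb, 1, LOW_RATE)
high = expected_errors(memory_gb, 1, HIGH_RATE)
print(f"{memory_gb} GB over one year: {low:,} to {high:,} expected errors")
```

Even at the low end of the range, a server of this size would be expected to see errors routinely, which is why correction has to happen automatically in hardware.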

Single-Bit Errors

A single-bit error occurs when one bit (a binary 1 or 0) of a byte of data (8 bits) is changed to the opposite value (1 to 0, or vice versa). It is the error most likely to corrupt data, because it is so small that the computer may not automatically recognize the result as incorrect. Multiple-bit errors (more than one bit affected simultaneously) are less likely to occur, and also less likely to be accepted by the computer as valid input. Multiple-bit errors can be detected by single-bit ECC, but may not be corrected by it in all instances. Instead, the system discards the affected data and reloads it.
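A small sketch shows why a single flipped bit is so easy to miss. The function name `flip_bit` and the stored amount are hypothetical, chosen only for illustration. The corrupted value is still a perfectly valid integer, so nothing in the program itself signals that it is wrong.

```python
def flip_bit(value, position):
    """Return `value` with the bit at `position` inverted (0 = least significant)."""
    return value ^ (1 << position)


balance_cents = 100_000  # hypothetical stored amount: $1,000.00

# Flip one high-order bit, as a single-bit memory error might.
corrupted = flip_bit(balance_cents, 17)

print(f"original:  {balance_cents:>8} cents  {balance_cents:018b}")
print(f"corrupted: {corrupted:>8} cents  {corrupted:018b}")

# The two values differ in exactly one bit.
print("bits changed:", bin(balance_cents ^ corrupted).count("1"))
```

Flipping the same bit a second time restores the original value, which is what an error-correcting scheme does once it has worked out which bit changed.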

There are two types of single-bit memory errors: hard errors and soft errors. Hard errors are caused by physical factors, such as voltage stress, impact shock, temperature variation, or other hardware damage. The cause could be a manufacturing defect, mishandled hardware, or simply stress over time. Soft errors, on the other hand, are caused by data being written or read differently than originally intended. As data is moved in and out of RAM, some corruption naturally occurs. Since bits retain their programmed value in the form of an electrical charge, there are many potential causes of these errors. Theories of why this occurs range from alpha particles emitted by naturally occurring isotopes and cosmic rays to magnetic variances, fluctuations in electricity flow, and even electromagnetic interference (EMI) from the computer itself.

How ECC Operates

Whereas errors and hard faults in hard disk storage can be guarded against with redundancy solutions like mirrored RAID (where the same data is written to two separate disks), RAM is fast, short-term, volatile storage and is not mirrored. The question becomes: how do we prevent errors as we are accessing the data?

Before ECC, error detection was done through parity bits[ii]. Computer data is commonly stored in 8-bit groups called bytes. With parity, a ninth bit is added to each byte to check for errors. Even and odd parity work by setting this extra bit to 0 or 1 so that the total number of 1s is even or odd, respectively. For example, if even parity is used and the byte contains an odd number of 1s, such as 7, the parity bit is set to 1, bringing the total to an even 8. If a byte stored with even parity is later found to contain an odd number of 1s, the byte is corrupt and will be reloaded.

[Figure: even parity example]

Parity is usable in small runs of data as a safeguard, but as the blocks of information get larger, the process becomes slower. Parity also cannot automatically correct the error, except by reloading the data.
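The even-parity scheme described above can be sketched in a few lines. The function names are hypothetical and the example models parity in software purely for illustration. It also shows the limits of a single parity bit: one flipped bit is detected but cannot be located, and two flipped bits in the same byte cancel out and go unnoticed.

```python
def count_ones(byte):
    """Number of 1 bits in an 8-bit value."""
    return bin(byte & 0xFF).count("1")


def even_parity_bit(byte):
    """Parity bit that makes the total number of 1s (data + parity) even."""
    return count_ones(byte) % 2


def passes_even_parity(byte, parity_bit):
    """True if the data bits plus the parity bit contain an even number of 1s."""
    return (count_ones(byte) + parity_bit) % 2 == 0


data = 0b11111110              # seven 1 bits, as in the example above
parity = even_parity_bit(data)  # set to 1 so the total becomes 8

single_flip = data ^ 0b00000100  # one bit corrupted
double_flip = data ^ 0b00000110  # two bits corrupted

print("stored parity bit:", parity)
print("intact byte passes:", passes_even_parity(data, parity))
print("single flip passes:", passes_even_parity(single_flip, parity))
print("double flip passes:", passes_even_parity(double_flip, parity))
```

A failed check says only that something in the byte changed, not which bit, which is why parity alone has to fall back on reloading the data.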

ECC is the logical next step from parity. It uses multiple parity bits assigned to larger chunks of data to detect and correct single-bit errors. Instead of a single parity bit for each 8 bits of data, ECC generates a check code for each 64 bits of data, typically using a Hamming-based error-correcting code. A 7-bit code is enough to locate and correct any single-bit error in 64 bits of data; common 72-bit-wide ECC modules add an eighth check bit so that double-bit errors can also be detected. When the 64 bits of data are read by the system, a second check code is generated and compared to the original. If the codes match, the data is free of errors. If the codes don't match, the system can find the error and fix it by comparing the two codes.
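The generate, compare, and correct cycle can be sketched with a textbook Hamming construction that produces a 7-bit check code for 64 bits of data. This is an assumption made for illustration: the layout, the function names, and the software model are not taken from any particular memory controller, and real hardware does this in dedicated logic. Check bits notionally sit at the power-of-two positions of a 71-bit codeword, and the check code is the XOR of the positions of all data bits set to 1.

```python
# Codeword positions run from 1 to 71. Positions that are powers of two
# (1, 2, 4, ..., 64) are reserved for check bits; the other 64 hold data.
DATA_POSITIONS = [p for p in range(1, 72) if p & (p - 1) != 0]
POSITION_TO_INDEX = {pos: i for i, pos in enumerate(DATA_POSITIONS)}


def check_code(data):
    """7-bit code: XOR of the codeword positions of every data bit set to 1."""
    code = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            code ^= pos
    return code


def read_word(data, stored_code):
    """Regenerate the code, compare it to the stored one, and repair if possible."""
    syndrome = check_code(data) ^ stored_code
    if syndrome == 0:
        return data, "ok"                      # codes match: no error
    if syndrome & (syndrome - 1) == 0:
        return data, "check bit error"         # a check bit flipped; data intact
    if syndrome in POSITION_TO_INDEX:
        # The difference between the codes is the position of the bad data bit.
        return data ^ (1 << POSITION_TO_INDEX[syndrome]), "corrected"
    return data, "uncorrectable"               # not a valid single-bit pattern


word = 0x0123456789ABCDEF          # 64 bits of example data
code = check_code(word)            # generated when the word is written

damaged = word ^ (1 << 37)         # simulate a single-bit memory error
fixed, status = read_word(damaged, code)
print(status, fixed == word)       # prints "corrected True"
```

With only 7 check bits, two simultaneous bit flips can produce a difference that looks like a valid single-bit error and be miscorrected. That is the reason for an additional overall parity bit in single-error-correcting, double-error-detecting schemes.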

Due to this check process, ECC RAM is slightly slower than non-ECC RAM. Depending on the brand and model, the drop in speed averages between 1% and 2%[iii]. A 2% reduction in speed is unlikely to be noticeable to a human user in most standard applications. SQL databases may slow down by a minimal amount as their memory usage peaks; however, this reduction in speed is an acceptable price for preventing the loss of critical data.

Why This is Important for Your Server

If your business specializes in finance and the server crashes while processing a transaction due to a memory error, the transaction would be lost. Memory errors could also lead to data transcription errors, where a number is changed or a decimal is misplaced. In this scenario, you may not even know the error has occurred. It could be days or weeks before that transaction is next reviewed, and even then the reviewer may not catch it.

These kinds of errors can also happen in other environments like the medical industry, where record accuracy is critical. When your employees are transcribing a file and entering ICD diagnosis codes, you want to be sure that the information entered is what is being recorded. Without the added layer of error checking that ECC provides, your important data may be saved as a different code or may simply be corrupted, making it that much more difficult to categorize and track the patient properly. This could have serious ramifications when the record is reviewed by another specialist or the insurance company. Arguably, the error could be corrected by reading the notes on the file, but it could delay the response to a critical patient. Medical information is sensitive to both time and accuracy.

Security vulnerabilities, transcription errors, corrupted information, lost data, and downtime caused by system crashes are all technological complications that may be minimized or even eliminated by ECC memory. With critical information in the balance, ECC is advisable wherever data accuracy and system stability are priorities.

At Atlantic.Net, our entire Cloud Hosting environment runs on ECC RAM, combined with the reliability of Intel Xeon processors. ECC RAM is also available as an option when purchasing one of our Dedicated Servers. Whether it be for your HIPAA-Compliant medical platform, payment transaction database, or any information that needs to be available and accurate 24/7, Atlantic.Net has solutions that fit your needs and your budget. Please contact us via email to [email protected] with some basic information about what you need from your server. Our sales team would be happy to guide you toward a plan that is suited to your business needs.


[i] http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
[ii] https://en.wikipedia.org/wiki/Parity_bit
[iii] Puget Systems ran a comprehensive benchmark test you can read through here.