DMT Web Hosting is too concerned about the health and safety of both clients and employees. Our office is closed due to COVID-19 countrywide lock down. You can reach us on +92 300 044 4656, +92 321 112 6660 during the lock down period. We appreciate your understanding and patience.

Server failure: risks, consequences and response

Summary

  • Overview of fault scenarios
  • Consequences of a system failure
  • Server crash resolution
  • Business Continuity Management (BCM)

Overview of fault scenarios

Security experts distinguish two types of risk sources behind server failures: internal and external threats. Internal threats cover all scenarios in which the failure originates in your own IT infrastructure, such as the power system or employee errors. External threats are generally caused by malicious attacks or by unpredictable external events such as accidents or disasters.

Internal sources of danger:

  • Fire in the data center
  • Power failure in the data center
  • Hardware failure (hard drive crash, overload, overheating)
  • Software errors (database failure)
  • Network issues
  • Human error

External sources of danger:

  • Infiltration (man-in-the-middle attacks, phishing, social engineering)
  • Sabotage (attacks on SCADA systems)
  • Viruses, Trojan horses, worms
  • Distributed denial-of-service (DDoS) attacks
  • Theft of equipment
  • Force majeure (earthquake, lightning strike, flood)
  • Accidents (plane crash)
  • Attacks

For companies, internal threats are generally easier to prevent than external ones. Hackers continually adapt their attack methods to the security standards companies use, so companies must constantly cope with new intrusions. Internal threats, by contrast, can be anticipated over the long term through an uninterruptible power supply, fire protection measures, increased server availability and comprehensive security training.

Consequences of a system failure

A server failure causes financial damage, and most companies are well aware of it. One study surveyed 300 companies with 200 to 4,999 employees. About 77 percent of respondents reported critical computer system failures in the year preceding the study. Trading, production and distribution companies were the most affected. On average, each company experienced four failures during the study period, and recovering the data took an average of 3.8 hours.


The cost per hour of downtime varies with the size of the company. Companies with fewer than 500 employees reported damage of up to 21,000 USD per hour of downtime, while those with 1,000 employees paid about double, 43,000 USD per hour. Taking into account the time needed to repair the failure and recover the data, this amounts to an average annual cost of 413,088 USD for medium-sized companies.
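The relationship between these figures can be sketched in a few lines of Python. This is only an illustration: it assumes that annual cost is simply failures per year times hours per failure times cost per hour, which the study may not have used, and the function name `annual_downtime_cost` is made up for this example.

```python
def annual_downtime_cost(failures_per_year: float,
                         hours_per_failure: float,
                         cost_per_hour: float) -> float:
    """Estimated yearly outage cost, in the same currency as cost_per_hour."""
    return failures_per_year * hours_per_failure * cost_per_hour


# Figures quoted in the article: four failures per year, 3.8 hours each.
downtime_hours = 4 * 3.8

small_company = annual_downtime_cost(4, 3.8, 21_000)
large_company = annual_downtime_cost(4, 3.8, 43_000)

# Hourly rate implied by the quoted 413,088 USD annual average.
# This value is derived here, it is not stated in the article.
implied_hourly_rate = 413_088 / downtime_hours

print(f"Downtime per year:    {downtime_hours:.1f} h")
print(f"At 21,000 USD/h:      {small_company:,.0f} USD per year")
print(f"At 43,000 USD/h:      {large_company:,.0f} USD per year")
print(f"Implied average rate: {implied_hourly_rate:,.0f} USD/h")
```

Under this assumption, the quoted annual average corresponds to an hourly rate of roughly 27,000 USD, which lies between the two hourly figures given above.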

Besides the hourly cost of employees who are unable to work, the calculation of the cost of a business interruption should also take into account losses from existing orders that are not delivered, contractual penalties for delays, and so on. Added to this is the damage to the company's image, which is difficult to quantify.

Server crash resolution

Combating server failures means countering concrete risks with preventive measures. These are generally organizational measures covering the choice and design of the server environment.

Fire protection and power system

To protect servers from physical hazards such as fire, flooding, power outages or sabotage, your server room must be equipped accordingly. This starts with the choice of location: basements are not recommended because of the flood risk. Access to the room should be limited to specialist staff, and the room should be fitted with security partitions. These spaces should not be used as permanent workplaces.

A basic condition for uninterrupted server operation is a constant power supply. An interruption of more than 10 ms is already considered a fault. You can bridge such interruptions with an emergency power supply, which lets the servers keep running independently of the public grid when it fails.

Reliability and availability

Medium-sized companies very often underestimate the consequences of computer system failures on their business. One reason is the high reliability of the standard components used in business today, whose availability is generally 99.9 percent. That figure may seem high, but for resources in use 24 hours a day it still allows almost 9 hours of downtime per year. If the interruption falls at a busy time, even a relatively short outage can be expensive. High-availability systems rated at 99.99 percent are therefore used as the standard for providing sensitive data. With this type of equipment, a maximum downtime of about 52 minutes per year is guaranteed. This is why experts speak of computer systems with very high availability: "High availability (HA for short) refers to the availability of resources in a computer system in the wake of component failures in the system."
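The downtime figures follow directly from the availability percentages. The short Python sketch below reproduces the arithmetic, assuming a 365-day year of round-the-clock operation; the function name `max_downtime_hours` is chosen for this example only.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, leap years ignored


def max_downtime_hours(availability_percent: float) -> float:
    """Hours per year a system may be down at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for level in (99.9, 99.99):
    hours = max_downtime_hours(level)
    print(f"{level}% availability: {hours:.2f} h ({hours * 60:.1f} min) per year")
```

At 99.9 percent this gives 8.76 hours, and at 99.99 percent about 52.6 minutes, in line with the rounded figures quoted above.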

Data security and recovery

To recover sensitive business data quickly after a server failure, it is advisable to develop a data backup plan in line with international standards such as ISO 27001. Such a plan determines who is responsible for backups and names the people with decision-making authority when data has to be recovered.


Full data backup: A full backup copies all data every time. It therefore takes a long time and requires a lot of storage capacity, especially when several generations of data are kept in parallel. In return, recovery is quick and easy, because only the most recent backup has to be restored. Companies lose this advantage, however, when backups are performed too rarely.

Incremental data backup: An incremental backup covers only the data that has been modified since the last backup. This reduces the time required for each backup, and the storage needed to keep several generations is also significantly lower than with full backups. An incremental backup presupposes at least one full backup, so in practice the two strategies are often combined. During recovery, the full backup serves as the basis and is supplemented with the data from the incremental backup cycles. In general, several backups must be applied one after another.

Differential backup: A differential backup is also built on a full backup. It saves all data that has changed since the last full backup. Unlike incremental backups, differential backups do not form a chain: combining the last full backup with the most recent differential backup is sufficient for data recovery.
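The difference between the strategies at recovery time can be illustrated with a small Python sketch. It is a simplified model and not the behavior of any particular backup product: the `Backup` class and the `restore_chain` function are hypothetical names, the history is assumed to be sorted from oldest to newest, and schedules that mix incremental and differential backups after the same full backup are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    label: str
    kind: str  # "full", "incremental" or "differential"


def restore_chain(history: list[Backup]) -> list[Backup]:
    """Return the backups to apply, oldest first, to restore the latest state.

    Raises ValueError if the history contains no full backup or mixes
    incremental and differential backups after the last full backup.
    """
    # Recovery always starts from the most recent full backup.
    last_full = max(i for i, b in enumerate(history) if b.kind == "full")
    base, later = history[last_full], history[last_full + 1:]
    if not later:
        return [base]
    kinds = {b.kind for b in later}
    if kinds == {"incremental"}:
        # Every increment since the full backup is needed, in order.
        return [base, *later]
    if kinds == {"differential"}:
        # Only the newest differential is needed on top of the full backup.
        return [base, later[-1]]
    raise ValueError("mixed schedules are outside the scope of this sketch")


days = ["Mon", "Tue", "Wed"]
incremental_week = [Backup("Sun", "full")] + [Backup(d, "incremental") for d in days]
differential_week = [Backup("Sun", "full")] + [Backup(d, "differential") for d in days]

print([b.label for b in restore_chain(incremental_week)])
print([b.label for b in restore_chain(differential_week)])
```

With the incremental schedule, all four backups have to be applied in sequence. With the differential schedule, the full backup plus the latest differential is enough, at the price of each differential growing larger until the next full backup.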

Business Continuity Management (BCM)

To minimize the damage from server failures, companies are increasingly investing in preventive measures. The emphasis is placed on what is called Business Continuity Management (BCM). In IT, BCM strategies aim to counter server failures in critical business areas and to ensure that activity resumes immediately. A prerequisite for such emergency management is a Business Impact Analysis (BIA). This analysis helps companies identify critical business processes. A process is considered critical when its failure has significant repercussions on the business.

Risk analysis

In emergency management, the risk analysis identifies the internal and external sources of danger that could lead to a server failure and the resulting interruption of business. The aim is to make security risks and their consequences transparent in order to find suitable solutions and reduce potential risks. Risks can be assessed on the basis of the anticipated damage and the probability of occurrence.
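One common way to combine damage and probability is to multiply them into an expected loss and rank the risks by it. The Python sketch below shows that idea. The probabilities and damage amounts are invented for illustration and do not come from the study cited earlier, and the `Risk` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    annual_probability: float  # chance of occurring within a year, 0.0 to 1.0
    damage_usd: float          # anticipated damage if it occurs

    @property
    def expected_loss(self) -> float:
        """Anticipated damage weighted by its probability."""
        return self.annual_probability * self.damage_usd


# Hypothetical values, for illustration only.
risks = [
    Risk("Hard drive crash", 0.50, 15_000),
    Risk("Flood", 0.01, 500_000),
    Risk("Power failure in the data center", 0.20, 80_000),
]

# Highest expected loss first: these risks deserve attention first.
ranked = sorted(risks, key=lambda r: r.expected_loss, reverse=True)
for risk in ranked:
    print(f"{risk.name}: {risk.expected_loss:,.0f} USD expected loss per year")
```

A ranking like this is only a starting point: rare events with very high damage, such as the flood in this example, may still justify measures even when their expected loss looks small.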

Save current status

Once the sources of danger and the potential damage from specific server failures have been determined through a BIA and a risk analysis, the next step in the continuity strategy is to record the current state. The emergency precautions already in place and the current restart times are particularly important. Recording the actual state lets companies assess where action is needed in the face of specific security threats, and what it will cost.

Choice of continuity strategy

Different continuity strategies generally exist for the different external and internal sources of danger. They allow business to continue, or at least to resume quickly, despite the malfunctions encountered. The choice of strategy for a critical situation is made within the framework of Business Continuity Management. A cost-benefit analysis forms the basis of this decision: its main factors are the financial means required, the reliability of the solution and the estimated restart time.
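As a rough illustration of how these three factors can be weighed, the Python sketch below picks the cheapest option that still meets a required restart time and availability level. The strategies, costs and thresholds are hypothetical, and a real analysis would consider more criteria than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    annual_cost_usd: float
    availability_percent: float
    restart_hours: float


def choose_strategy(options: list[Strategy],
                    max_restart_hours: float,
                    min_availability_percent: float) -> Optional[Strategy]:
    """Return the cheapest option meeting both requirements, or None."""
    eligible = [
        s for s in options
        if s.restart_hours <= max_restart_hours
        and s.availability_percent >= min_availability_percent
    ]
    return min(eligible, key=lambda s: s.annual_cost_usd, default=None)


# Hypothetical options, for illustration only.
options = [
    Strategy("Restore from backup on spare hardware", 5_000, 99.0, 24.0),
    Strategy("Warm standby server", 20_000, 99.9, 2.0),
    Strategy("Active cluster in a second data center", 60_000, 99.99, 0.1),
]

choice = choose_strategy(options, max_restart_hours=4.0,
                         min_availability_percent=99.9)
print(choice.name if choice else "No option meets the requirements")
```

Treating restart time and availability as hard limits and cost as the quantity to minimize is one possible design. A weighted score across all three factors would be an equally valid alternative.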

The continuity strategies chosen are set out in the emergency plan, which contains concrete instructions for everyone involved.

About the author

DMTwebhosting.com's Editorial Team prides itself on bringing you the latest web hosting news and the best web hosting articles!

