Redundant Computer Systems

This article introduces the techniques used to keep computer systems available and accessible even when some part of the system fails.

When you have critical systems that must be available and running 24 hours a day, 365 days a year, you try to minimize the failures that can affect their normal operation. Failures are going to happen, but there are techniques and configurations that help us build redundant systems, in which certain parts can fail without affecting the operation of the system as a whole.

A modern computer system needs many components in order to work, and the more components there are, the more likely it is that one of them fails. These problems can occur in the server itself (disks, power supplies, network cards, etc.) and in the infrastructure the server needs in order to be used (network components, Internet access, the electrical system, etc.).

Below we discuss some of the techniques used to obtain redundant systems. The degree of redundancy of a system will depend on its importance and on the money we lose when the system is unavailable because of a failure. It is not worth investing in redundancy if the investment needed for a redundant system costs more than what you would lose in money, reputation and hours of work if the system failed.
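The cost comparison above can be sketched as a small calculation. This is only an illustration: the function name, its parameters and all the figures are hypothetical, and a real assessment would also have to put a number on reputation and lost working hours.

```python
def redundancy_is_worthwhile(outage_hours_per_year, cost_per_outage_hour,
                             redundancy_cost_per_year, outage_reduction=1.0):
    """Compare the yearly loss we expect to avoid with the yearly cost of redundancy.

    outage_reduction is the fraction of outage hours the redundant setup is
    assumed to prevent (1.0 = all of them). All inputs are estimates.
    """
    avoided_loss = outage_hours_per_year * cost_per_outage_hour * outage_reduction
    return avoided_loss > redundancy_cost_per_year


# Hypothetical figures: 8 hours of outage per year, redundancy costs 5000 per year.
print(redundancy_is_worthwhile(8, 2000, 5000))  # avoided loss 16000 > 5000 -> True
print(redundancy_is_worthwhile(8, 100, 5000))   # avoided loss 800 < 5000 -> False
```

With the same number of outage hours, the decision changes completely depending on what one hour of downtime costs us.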

The techniques and configurations discussed below are not unique to Linux systems. The vast majority of them can be applied to other operating systems and platforms.

Server Component Redundancy

The components most commonly made redundant on a server are disks, network cards and power supplies. Some servers with multiple CPUs can even keep working without problems when a CPU or memory module is damaged.

Disks

Hard disks are the devices where the data is stored. The most common fault on a server is the failure of a hard drive. If the server has a single disk and it fails, the server fails completely and we can no longer access the data it contained. There are techniques that help us minimize this problem, so that the server keeps working and does not lose data even when a hard disk fails. It is also usual to be able to replace failed disks without turning off the server (hot swap).

The most common technique is called RAID (redundant array of independent disks). With this technique we create a set of redundant disks that can help us both to increase the speed and performance of the storage system and to keep the system running even if a disk fails. There are software and hardware implementations and different RAID configurations, the most common being RAID1, RAID5 and RAID10.
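To see how a disk set can survive the loss of one disk, here is a minimal sketch of the XOR parity idea behind RAID5. It is a simplification under stated assumptions: real RAID5 works on stripes and rotates the parity across all the disks, and the helper `xor_blocks` and the sample data are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data "disks" holding one block each (hypothetical contents).
data_blocks = [b"disk", b"fail", b"safe"]

# The parity block is the XOR of all the data blocks.
parity = xor_blocks(data_blocks)

# Simulate losing the second disk: XOR the survivors with the parity
# block to rebuild the missing data.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)  # b'fail'
```

Because XOR is its own inverse, any single missing block can be recovered from the remaining blocks plus the parity, which is why this scheme tolerates one failed disk but not two.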

Network cards

The network card is the device that allows the server to communicate with the rest of the world. It is therefore very common for servers to have at least 2 network cards, so that communication is not cut off if one of the cards fails.

Linux also offers a technique called 'bonding', with which we can use 2 or more network cards as if they were a single device, combining their capacity and providing redundancy in case any of the cards fails.
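The failover side of bonding can be pictured with a toy model. This is not the Linux bonding driver or its configuration: the class `ActiveBackupBond` and the interface names are hypothetical, and the model only covers the case where one card carries the traffic and another takes over, not the case where the capacities are added together.

```python
class ActiveBackupBond:
    """Toy model of a bonded interface where one NIC is active at a time."""

    def __init__(self, nics):
        # Map each NIC name to its link state (True = link up).
        self.link_up = {nic: True for nic in nics}

    def set_link(self, nic, up):
        """Record that a NIC's link went up or down."""
        self.link_up[nic] = up

    def active_nic(self):
        """Return the first NIC whose link is up, or None if all have failed."""
        for nic, up in self.link_up.items():
            if up:
                return nic
        return None


bond = ActiveBackupBond(["eth0", "eth1"])
print(bond.active_nic())      # eth0
bond.set_link("eth0", False)  # simulate a failed card or cable
print(bond.active_nic())      # eth1
```

From the point of view of the applications, the bonded device stays the same; only the physical card behind it changes.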

Power supplies

The power supply is in charge of supplying electricity to the server. It is also common for servers to have 2 or more power supplies connected to different electrical systems, to guarantee the supply in case one of the power supplies or one of the electrical systems fails. Failed power supplies can usually be replaced without turning off the server (hot swap). Other system components such as routers, switches and disk cabinets usually use the same redundancy technique.

Redundancy in electrical supply

Every electrical component, and a server is no exception, needs a constant supply of electricity to operate. Failures in this supply, even for very short periods of time, can have catastrophic consequences for our system. And a constant supply is not enough: we also need to avoid sudden surges and drops that can damage electronic components.

To achieve this we can use different components depending on the degree of protection we want.

  • UPS (uninterruptible power supply): more or less advanced batteries connected between the server and the electrical supply network. They guarantee a constant and stable supply for a limited time, which depends on their capacity.
  • Electric generators: they generally run on diesel and are connected between the UPS and the electrical supply network. They only start up when the supply has been cut for longer than a certain time, and they can supply electricity indefinitely as long as there is fuel in the tank.
  • Independent supply lines: large data centers usually have at least 2 separate and independent connections to the electrical supply network.

If we want redundancy in the electrical system, it is not enough for the servers to have double connections: routers, switches and, ultimately, any component of the system that uses electricity should have redundant (and connected) power supplies. As the saying goes, your system will only be as safe, stable and redundant as its weakest component. It would not be the first time, for example, that groups of servers in a data center with redundancy at all levels were cut off because they were connected to a switch that failed because it did not have a redundant power supply.
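The weakest-component effect can be made concrete with a simple availability calculation. This is a sketch under strong assumptions: the helpers `series` and `parallel` are hypothetical, the 99% figures are invented, and failures are assumed to be independent of each other, which is rarely exactly true in practice.

```python
def series(*availabilities):
    """All components are required, so their availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities):
    """Any one component is enough, so the group fails only if all of them fail."""
    all_fail = 1.0
    for a in availabilities:
        all_fail *= (1.0 - a)
    return 1.0 - all_fail


# Hypothetical figures: every power supply is available 99% of the time.
server = parallel(0.99, 0.99)    # server with two redundant power supplies
switch = 0.99                    # switch with a single power supply
system = series(server, switch)  # the server is only useful if the switch works

print(f"{server:.4f}")  # 0.9999
print(f"{system:.4f}")  # 0.9899
```

The redundant server on its own reaches 99.99%, but the whole chain drops back to roughly the availability of the non-redundant switch.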

Redundancy in network components

It is no use having servers with duplicate, redundant components and a constant, stable electrical supply if some component of the network fails and we cannot reach the server.

The most common components in a network are:

  • Router: a device that interconnects network segments or entire networks
  • Switch: a device that interconnects two or more network segments
  • NIC or network card: an electronic device that allows a DTE (Data Terminal Equipment), such as a computer or printer, to access a network and share resources
  • Network cables: used to interconnect the different components; there are many types, the most common being twisted pair cable and optical fiber
  • Connection lines: links to a wide area network, WAN (e.g. the Internet)

Any of these components may fail and leave the system cut off. There are techniques to prevent this: the usual approach is to configure the network so that there are at least 2 different paths between any two components A and B. The following diagram shows how to configure a network with double redundancy from the server to the Internet. With this layout a router, a switch and a network card can all fail at the same time without connectivity being lost. The same scheme could be extended to give triple or quadruple redundancy of the components.
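The "at least 2 different paths" rule can be checked mechanically on a model of the network. The sketch below is an assumption-laden illustration: the topology (two network cards, two switches, two routers, with both switches linked to both routers) and all the node names are hypothetical and are not taken from the diagram.

```python
from collections import deque


def build_graph(links):
    """Build an undirected adjacency map from (a, b) link pairs."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def reachable(graph, start, goal, failed=frozenset()):
    """Breadth-first search from start to goal that ignores failed nodes."""
    if start in failed or goal in failed:
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbour in graph.get(node, ()):
            if neighbour not in seen and neighbour not in failed:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def single_points_of_failure(graph, start, goal):
    """Return the intermediate nodes whose failure alone cuts start off from goal."""
    candidates = set(graph) - {start, goal}
    return sorted(n for n in candidates if not reachable(graph, start, goal, {n}))


# Hypothetical doubly redundant layout.
redundant = build_graph([
    ("server", "nic1"), ("server", "nic2"),
    ("nic1", "switch1"), ("nic2", "switch2"),
    ("switch1", "router1"), ("switch1", "router2"),
    ("switch2", "router1"), ("switch2", "router2"),
    ("router1", "internet"), ("router2", "internet"),
])

print(single_points_of_failure(redundant, "server", "internet"))  # []
# One card, one switch and one router on the same path fail together.
print(reachable(redundant, "server", "internet",
                {"nic1", "switch1", "router1"}))  # True
```

Running the same check on a network with a single chain of devices lists every device in the chain as a single point of failure.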

Server redundancy, load balancing

What happens if the electrical supply works and the network works, but our server fails in such a way that none of its redundant components can prevent it from going down? There are different configurations with multiple servers that can help us with this problem. They are called clusters, and there are different types, but one of the most common is load balancing with fault tolerance. In this type of cluster, not only does it not matter if one or several of the servers stop working, but if we need more resources to provide a service, we can add new servers that increase the processing capacity of the cluster.

The most important components of this type of cluster are the storage system shared by all the servers that provide a service, and the load balancing device, which can be dedicated hardware or can be implemented in software on a normal server. The most important Linux project in this area is the Linux Virtual Server (LVS).
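The basic job of the load balancing device can be sketched as a round-robin scheduler that skips servers marked as failed. This is a toy illustration, not how LVS is implemented: the class `RoundRobinBalancer`, its methods and the server names are hypothetical, and a real balancer would detect failures through health checks rather than manual calls.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out servers in turn, skipping those currently marked as down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        """Stop sending requests to a failed server."""
        self.healthy.discard(server)

    def mark_up(self, server):
        """Return a repaired server to the rotation."""
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        """Return the next healthy server in round-robin order."""
        if not self.healthy:
            raise RuntimeError("no healthy servers available")
        for server in self._ring:
            if server in self.healthy:
                return server


balancer = RoundRobinBalancer(["web1", "web2", "web3"])
print([balancer.pick() for _ in range(4)])  # ['web1', 'web2', 'web3', 'web1']
balancer.mark_down("web2")                  # simulate a server failure
print([balancer.pick() for _ in range(3)])  # ['web3', 'web1', 'web3']
```

Clients keep being served after the failure; the remaining servers simply receive a larger share of the requests.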

Below are some examples of how these clusters can be organized so that the failure of one server does not stop the service. When one or more servers in the cluster fail, the processing capacity is reduced, so it is important to always keep some unused capacity so that response times do not degrade much in the event of a failure.
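How much unused capacity we need can be estimated with simple arithmetic. The function name and the figures below are hypothetical, and the sketch assumes that all servers are identical and that the load is spread evenly among the surviving ones.

```python
def utilisation_after_failures(total_load, servers, capacity_per_server, failed):
    """Fraction of the remaining capacity in use once `failed` servers are down."""
    surviving = servers - failed
    if surviving <= 0:
        raise ValueError("no servers left to carry the load")
    return total_load / (surviving * capacity_per_server)


# Hypothetical cluster: 300 requests/s of load, 100 requests/s per server.
print(utilisation_after_failures(300, 4, 100, failed=0))  # 0.75
print(utilisation_after_failures(300, 4, 100, failed=1))  # 1.0, no margin left
print(utilisation_after_failures(300, 5, 100, failed=1))  # 0.75
```

With 4 servers a single failure leaves the cluster running at its limit, while a fifth server keeps the same margin after one failure as the 4-server cluster had with none.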

An example of a load-balanced cluster connected to a disk array that stores the information. This setup is typical for file and web servers.
