If you’ve ever used a non-trivial data storage system, you may have noticed an interesting phenomenon: a single slow device (HDD, SSD, etc.) can significantly slow down an entire system of hundreds or thousands of devices! It epitomizes the clichés: a chain is only as strong as its weakest link; one bad apple spoils the bunch. You get the idea. Moreover, this also happens in some clustered scale-out systems even if the member nodes are completely independent and do not communicate with each other at all!
The question is – why does a single member have such a negative impact on a potentially massive system?
It turns out that this phenomenon isn’t exclusive to data storage; it represents a general queuing and load balancing problem that can (and does) impact many types of systems. In the following blog I will attempt to demonstrate that the root cause of this issue is a lack of adaptability in work scheduling. Though the explanation of this phenomenon is very simple, it also seems counter-intuitive, and therefore it is poorly handled in the storage industry.
In the following text I will try to explain this problem using a situation many may know – going through the lines in front of the passport control gates right after landing at almost any international airport. I will discuss what the person who directs you to one of the queues (lines) in front of the passport control booths is doing (the “queue organizer”), and why what he or she does is near optimal and reduces the waiting time of travelers handled by passport control (remember this the next time you are enduring those lines…).
Another note: the phenomenon that I describe is part of queuing theory and/or operations research. I (as do most humans on earth!) lack the theoretical knowledge to have a meaningful discussion within these areas, but I think a relatively straightforward intuitive analysis can explain many interesting and relevant queuing theory issues. If you are interested in formal mathematical analysis, this is NOT the place, even though I provide some links and references throughout the text and in the appendix.
To simplify the explanation, let’s assume we are visiting an airport with a passport control gate having a single main queue line (depicted below). The line leads to the queue organizer, who directs you to one of the queues in front of the 3 passport control booths. Each booth allows only 2 people in its queue. Now, for the sake of this example, let’s assume that the time each traveler spends at the booth is 60 seconds. This means that, under optimal conditions, 3 travelers are handled every minute, so our throughput performance is 3 travelers per minute. A plane of 100 passengers can therefore be handled in about 34 minutes. But what happens if the handling time is not consistent, or if one of the passport control officers is taking a break? The answer depends on how the queue organizer distributes the passengers among the passport control queues. To illustrate that, let’s consider two different queue organizers – Sleepy Bob and Quick Cami.
Sleepy Bob isn’t getting enough rest before work. Being so sleepy, he’s not putting in the extra effort to optimize his queues. He wearily stands at the end of the unified entrance queue and mindlessly directs each traveler to the next booth queue in order, regardless of the number of people already in the target line. In other words, Bob tells the first traveler to go to line 1, the second to line 2, the third to line 3. With the fourth he starts over with line 1 again, the fifth goes to line 2, and so on. Furthermore, as Bob is impatient and doesn’t look back at the booth lines to see which are full and which are not, he doesn’t notice the queues filling up unevenly. If a target line (queue) is full (i.e. already has 2 people waiting in it), the next person in the entrance queue will have to wait until at least one person from that particular line moves on (i.e. is handled by the officer). Even worse, any person standing behind that unlucky person will also be forced to wait, even if there are free slots in other queues. In other words, if the first person in the line needs to wait, the entire queue stalls. In computer terminology, the technique Bob is using is called round-robin scheduling. As we will soon see, though round-robin is simple to implement, it quickly leads to inefficient handling of the total system.
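Bob’s rule is trivial to state in code. Here is a minimal sketch (the function name is mine, purely illustrative): the picker needs only the previous choice and the number of lines, and deliberately ignores how full each line is.

```python
def round_robin_pick(last_choice, n_lines):
    """Bob's rule: always the next line in rotation, blind to line lengths."""
    return (last_choice + 1) % n_lines

# The first six travelers are sent to lines 0, 1, 2, 0, 1, 2 ...
choice = -1
assignments = []
for _ in range(6):
    choice = round_robin_pick(choice, 3)
    assignments.append(choice)
# assignments == [0, 1, 2, 0, 1, 2]
```

Note that nothing about the state of the booth lines appears anywhere in the decision; that blindness is exactly what will hurt us below.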
Let’s start our analysis with a simple case where the time each traveler spends at the passport control booth is uniform. For the sake of simplicity, let’s assume that only two travelers can stand in each queue: the person being served and the one behind them (it considerably shortens the flows…). This scenario is depicted below, and I think it is easy to see that the lines are evenly loaded and the total work (performance/throughput) is optimal, i.e. 3 travelers per minute. To sum this up, we can claim that round-robin is optimal if the serving time is uniform. However, in the next section we will see the major negative impact round-robin has when serving time is uneven.
Now let’s analyze the previous scenario, but this time the control officer in booth 1 is not feeling well, and therefore needs 2 minutes to handle each traveler instead of the normal 1 minute. This scenario is depicted below. Until the lines fill up, i.e. until the 6th person enters the lines, the scenario is the same as in the previous example. The trouble starts with the 7th person, who needs to wait for line 1 even though lines 2 & 3 have available slots (step 5). When the 1st officer finishes handling his first traveler, officers 2 & 3 finish handling their second (step 6). But still only 3 new travelers can enter the lines (step 7), as the 4th will need to wait for line 1 again (step 8). It’s straightforward to see that officer 1 stalls the entire main queue for 2 minutes each time. Line 1, the slowest of the 3, determines the overall main queue handling rate, reducing it from 3 travelers per minute to 3 travelers per two minutes, or 1.5 travelers per minute. We can say that round-robin scheduling may cause the main queue to stall due to a single problematic queue. Though the other queues have excess capacity, their ability to process work is wasted. This means that one slow officer reduces the handling rate of the whole system regardless of the number of other, faster officers. In this case, a plane with 100 travelers would be handled in about 67 minutes instead of 34! Note that it is expected that a slow officer will lower the aggregate handling rate, as less work is done; but in a perfect system, it shouldn’t impact the other officers’ work (throughput). In the following section we will see that a very simple method can solve this problem.
The next day, Agent Cami gets to work at the same passport control gate. Cami is alert and efficient. Each time she directs someone from the main queue to a passport control booth, she picks the queue with the shortest line, i.e. with the fewest people waiting in it. This is called shortest (least) queue scheduling (also join shortest queue – JSQ). To see how this is done, let’s analyze the previous case again, where officer 1’s handling time is 2 minutes per traveler while officers 2 & 3’s handling time is 1 minute per traveler. As is shown below, this time Quick Cami causes some travelers to “skip” the slow officer (steps 5 & 6) and directs them to the other, faster officers. This way the faster officers 2 & 3 can each handle 1 traveler per minute regardless of officer 1’s slow performance. Shortest queue scheduling lets the queue organizer direct more travelers to the faster officers, and does not let the slow officer stall the entire queue. The shortest queue method is relatively simple to implement, but still provides very effective “feedback.” In other words, it lets the queue organizer see where more work is getting done, simply and efficiently, just by looking at the queue lengths. In many realistic cases this is close to the best possible method.
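To make the Bob-vs.-Cami comparison concrete, here is a small discrete-event simulation of the airport scenario. It is a sketch under the assumptions stated above – 3 booths, lines of at most 2 people, officer 1 taking 2 minutes per traveler while the others take 1 – and all names and structure are my own, not from any real system.

```python
import heapq

def simulate(service_times, n_travelers, capacity, pick):
    """One main queue feeding bounded per-officer lines.
    pick(lines) returns the line index the next traveler joins, or None to stall."""
    n = len(service_times)
    lines = [0] * n            # travelers in each line (incl. the one in service)
    free_at = [0.0] * n        # when each officer next becomes idle
    events = []                # min-heap of (finish_time, officer)
    now = 0.0
    admitted = finished = 0
    while finished < n_travelers:
        # admit travelers from the main queue while the organizer can place them
        while admitted < n_travelers:
            i = pick(lines)
            if i is None:      # head-of-line blocking: the whole main queue stalls
                break
            lines[i] += 1
            admitted += 1
            start = max(now, free_at[i])
            free_at[i] = start + service_times[i]
            heapq.heappush(events, (free_at[i], i))
        # advance to the next completion, draining simultaneous finishes
        now, i = heapq.heappop(events)
        lines[i] -= 1
        finished += 1
        while events and events[0][0] == now:
            _, j = heapq.heappop(events)
            lines[j] -= 1
            finished += 1
    return now                 # time the last traveler clears a booth

def make_round_robin(n, capacity):
    state = {"next": 0}        # Sleepy Bob: strict rotation
    def pick(lines):
        i = state["next"]
        if lines[i] >= capacity:
            return None        # designated line is full -> everyone waits
        state["next"] = (i + 1) % n
        return i
    return pick

def make_jsq(capacity):
    def pick(lines):           # Quick Cami: join the shortest line
        i = min(range(len(lines)), key=lines.__getitem__)
        return None if lines[i] >= capacity else i
    return pick

slow_booths = [2.0, 1.0, 1.0]  # officer 1 takes 2 minutes per traveler
t_rr  = simulate(slow_booths, 100, 2, make_round_robin(3, 2))
t_jsq = simulate(slow_booths, 100, 2, make_jsq(2))
print(t_rr, t_jsq)             # -> 68.0 42.0
```

With uniform 1-minute service the two policies tie (the plane clears in 34 minutes either way), but with one slow officer, round-robin degrades to the ~1.5 travelers/minute pace derived above, while JSQ keeps both fast officers saturated at 1 traveler/minute each.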
The logic of scheduling (routing) work items/requests to several workers is very common in computerized systems, especially in parallel and distributed scale-out systems.
The “passport control lines problem” exemplifies a challenge for any system that schedules work among several workers: static work scheduling will let the slowest worker (node, device, etc.) determine the entire system’s performance.
In other words, if the work scheduling does not react to the workers’ actual service time (quality), and each client has a limited number of work items/requests it is allowed to generate, the performance of the entire system is always determined by the slowest worker.
Of course, the actual severity of this phenomenon is determined by the level of imbalance among the workers. As demonstrated above, if worker performance is relatively balanced, the described problem may not be a big issue. But if one or more of the workers is significantly imbalanced (i.e. provides a considerably different level of service quality), the impact of the system decreasing its performance to match the slowest worker’s pace may be very painful.
There are several common reasons why work scheduling cannot or does not react dynamically to worker service quality. The following sections describe the most common cases.
In the sections above, I compared round-robin scheduling to shortest queue scheduling. In fact, these two scheduling techniques are representative examples of two classes of scheduling techniques: static and dynamic (reactive) scheduling. Static scheduling is any scheduling technique that ignores the state of the system when making a scheduling decision. Common examples of such techniques are round-robin, random selection, and hashed selection. To make a round-robin decision, you only need to know the number of workers and the last selected worker. Random selection (and I mean a real random selection, not the airport security kind of “random” selection 😉 ) requires knowing only the number of workers. Finally, hashed selection requires some sort of name (key) for each request (work item) and a hashing function to calculate the destination worker. Though I didn’t demonstrate it, using random selection or hashed selection instead of round-robin would not solve the “passport control queues problem,” as both are fundamentally also non-reactive techniques.
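The three static pickers can be sketched in a few lines (all names here are illustrative). The point to notice is what each function takes as input: the count of workers, a rotation pointer, or a request key – never the state of the queues.

```python
import hashlib
import random

def round_robin(n_workers, last_choice):
    """Needs only the worker count and the last selected worker."""
    return (last_choice + 1) % n_workers

def random_pick(n_workers, rng=random):
    """Needs only the worker count."""
    return rng.randrange(n_workers)

def hashed_pick(key, n_workers):
    """Needs a key per request and a hash function; the same key
    always lands on the same worker."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_workers
```

Since none of these signatures even accepts queue lengths or service times, no amount of cleverness inside them can route work away from a slow worker.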
A dynamic (reactive) scheduling technique is any technique that uses some real-time information about the state of the system to make scheduling decisions. These techniques can range from very simple, such as the shortest queue technique, to very sophisticated, with many variables affecting the scheduling decisions. There are probably endless variants of such techniques tailored to different cases, but in the spirit of Einstein’s famous saying, “everything should be as simple as it can be but not simpler,” I believe it will be hard to beat the efficiency of the shortest queue technique. In many cases shortest queue scheduling maximizes the system throughput (see here).
Until now, all examples and cases assumed that the queue organizer – the scheduler – can select any target queue he or she wishes. We demonstrated what happens if non-optimal scheduling is used. But sometimes the scheduler has much less freedom to select the target queue, or perhaps no freedom at all. Dataset sharding is just such a case: any predefined and static division of the work items among certain workers. A simple example is the citizen and non-citizen passport control lines. If you are not a citizen, you cannot be directed to an officer that is reserved for citizens. This limits the selection of the queue organizer. Another real-world example is concert ticket lines that are divided by the first letter of last names, so people with last names starting with letters A–G go to line 1, those with names starting with letters H–M go to line 2, etc. The scheduler (if one exists) has no freedom to select optimal lines and should probably be treated as a “router” (but let’s leave it at that).
As I already mentioned above, many scale-out systems use dataset sharding to scale total system throughput. This way, the work can be distributed (and parallelized) among N workers such that the aggregate work scales (linearly, if possible). For our discussion, what really matters is that once a work request is directed (“routed”) to a worker, the selection is static (predetermined) and therefore ignores real-time factors. Since dataset sharding leads to non-dynamic work scheduling, even if item ownership is evenly distributed among the workers, the “passport control queues problem” will still cause the system to operate at the pace of the slowest worker, as long as the controller serves the line on a first-come-first-served basis.
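The concert-ticket split above makes a compact sharding example. The letter ranges below are illustrative, but the essential property is real: the key alone determines the destination, so the router cannot react to load.

```python
def shard_for(last_name):
    """Static shard routing by first letter: A-G -> 0, H-M -> 1, N-Z -> 2."""
    c = last_name[0].upper()
    if "A" <= c <= "G":
        return 0
    if "H" <= c <= "M":
        return 1
    return 2

# Every "Smith" goes to shard 2, no matter how long that line already is.
```

If shard 2’s worker slows down, requests for N–Z names pile up behind it while shards 0 and 1 sit idle – the passport control queues problem all over again.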
Another issue that leads to a lack of selection freedom during work scheduling is data gravity. Data gravity is a case of memory affecting scheduling decisions, i.e. where a decision about a certain item impacts future scheduling decisions. This means that the scheduler has “memory.” For example, consider a coat check worker who puts your coat in a specific spot and gives you a tag identifying the location. Once you come back and hand over the tag, the worker does not have the freedom to select any random coat; he must pick your coat from that specific spot. Many systems follow a similar process. The first time a new item is introduced to the system, a target destination is selected for it. Any further access to this item must be directed (routed) to the same destination. For example, in many file systems, the first write to a file allocates space for the data (“write allocate”), but any additional read or write operations are directed to the same location. The result is that, due to data gravity, the work scheduler may have no choice: it must direct the work request to the target that stores or owns the requested item. Because of this lack of freedom, these systems can also suffer from the “passport control lines problem.”
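A sketch of how data gravity constrains a scheduler (the class and its policy are hypothetical): the first touch of a key picks a worker freely – here the least-loaded one, in the spirit of write allocate – but every later touch is pinned to that same worker.

```python
class StickyRouter:
    """Illustrative 'data gravity' router: free choice on first access,
    no choice at all afterwards."""
    def __init__(self, n_workers):
        self.n = n_workers
        self.load = [0] * n_workers       # requests routed per worker (a simple load metric)
        self.placement = {}               # item key -> owning worker

    def route(self, key):
        if key not in self.placement:
            # first touch: free choice, e.g. the least-loaded worker ("write allocate")
            self.placement[key] = min(range(self.n), key=self.load.__getitem__)
        worker = self.placement[key]      # later touches: the scheduler has "memory"
        self.load[worker] += 1
        return worker
```

Once a hot item lands on a worker that later turns slow, the router must keep sending that item’s requests there; the only remedies are rebalancing (moving the data) or accepting the slowdown.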
Work scheduling (routing) in distributed systems needs to balance several conflicting short-, medium-, and long-term demands. In many cases, scheduling (distribution) “freedom” is traded in favor of a higher-priority need, such as capacity load balancing, simple metadata management, or any number of other priorities. System designers must ensure they understand the implications of not reacting to real-time load changes, or they run the risk of picking the wrong tradeoff and experiencing poor system performance with wasted capacity. I hope the phenomenon I’ve described demonstrates the possible implications of the various scheduling techniques available to distributed systems.