Why is it hard to scale a database, in layman’s terms?

Answer by Paul King:

There are four main challenges when scaling a database: search, concurrency, consistency, and speed.
 
Suppose you have a list of 10 names. To find someone, you just go down the list.

But what if there are 1 million names? Now you need a strategy for finding something. A telephone book lists the names in alphabetical order so you can skip around. This is a solution to the search problem.
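For readers who like to see it in code, here is a minimal sketch of the two search strategies. The names, numbers, and function names are made up for illustration. With a sorted list, a lookup among 1 million names takes about 20 comparisons instead of up to 1 million.

```python
import bisect

# Hypothetical phone book: (name, number) pairs kept sorted by name.
phone_book = sorted([
    ("Adams", "555-0101"),
    ("Baker", "555-0102"),
    ("Chen", "555-0103"),
    ("Diaz", "555-0104"),
    ("Evans", "555-0105"),
])
names = [name for name, _ in phone_book]

def linear_lookup(name):
    """Go down the list one entry at a time: fine for 10 names, slow for 1 million."""
    for entry_name, number in phone_book:
        if entry_name == name:
            return number
    return None

def sorted_lookup(name):
    """Binary search: each step halves the range left to check, like skipping around a phone book."""
    i = bisect.bisect_left(names, name)
    if i < len(names) and names[i] == name:
        return phone_book[i][1]
    return None
```

Both functions return the same answers; only the amount of work differs. Real databases typically use tree-shaped indexes rather than a plain sorted list, but the idea of keeping data ordered so you can skip around is the same.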

What if 1 million people are trying to use the telephone book at the same time? This is the problem of concurrency. Everyone could wait in one very long line at City Hall, or you could print 1 million copies of the book, a strategy called "replication". If you put the copies in people's homes, a strategy called "distribution", you also get faster access.

What if someone changes their phone number? Replication has created a problem: you now have to change all 1 million phone books. And when would you change them, given that they are all in use? You could change them one at a time, but this would create a data consistency problem. You could take them all away and issue new ones, but now you have an availability problem while you are doing it.
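The consistency problem can be sketched in a few lines. This is a toy model, not how any particular database works: three dictionaries stand in for three copies of the phone book, and the update is paused after the first copy so the in-between state is visible. All names here are hypothetical.

```python
# Three hypothetical copies (replicas) of the same phone book.
replicas = [{"Chen": "555-0103"} for _ in range(3)]

def read(replica_id, name):
    """Look up a name in one particular copy of the book."""
    return replicas[replica_id][name]

def update_one_at_a_time(name, new_number):
    """Update the copies in turn, pausing after each so the caller can look around."""
    for replica in replicas:
        replica[name] = new_number
        yield

updater = update_one_at_a_time("Chen", "555-0199")
next(updater)  # only the first copy has been changed so far

# Readers using different copies now get different answers.
mid_update_reads = {read(i, "Chen") for i in range(3)}

for _ in updater:  # finish changing the remaining copies
    pass
final_reads = {read(i, "Chen") for i in range(3)}
```

In the middle of the update, `mid_update_reads` contains both the old and the new number. Only after every copy has been changed does `final_reads` contain a single answer.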

And what if thousands of people are changing their phone numbers every hour? Now you have a giant traffic jam called "contention for resources" which leads to "race conditions" (unpredictable outcomes) and "deadlocks" (database gridlock).
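Here is a small sketch of a race condition and one common remedy, a lock. The first part interleaves two "clerks" by hand so the lost update is reproducible; real races happen unpredictably. The counter and function names are invented for the example.

```python
import threading

# Part 1: a lost update, interleaved by hand so the outcome is repeatable.
changes_processed = 0
clerk_a_sees = changes_processed      # clerk A reads 0
clerk_b_sees = changes_processed      # clerk B reads 0 before A has written
changes_processed = clerk_a_sees + 1  # A writes 1
changes_processed = clerk_b_sees + 1  # B also writes 1, wiping out A's update
lost_update_result = changes_processed  # two changes happened, only one was counted

# Part 2: a lock lets only one thread do the read-and-write step at a time.
tally = {"count": 0}
tally_lock = threading.Lock()

def process_changes(times):
    """Record `times` changes, holding the lock for each read-modify-write."""
    for _ in range(times):
        with tally_lock:
            tally["count"] += 1

threads = [threading.Thread(target=process_changes, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The lock makes the count come out right, but it also makes the threads wait for one another, which is the traffic jam described above. When two parties each hold a lock the other needs, neither can proceed, and that is a deadlock.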

All of these problems have solutions, but the solutions can get very complex. For example, you can issue addendums to the phone books (called "change logs") rather than reprinting all of them. But you have to make sure to check the addendums all the time. You can distribute new versions of the phone books with a cut-over date, so that everyone switches at the same time to get greater consistency, but now the phone books are always slightly out of date.
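The addendum idea can be sketched as a printed book plus a list of changes that is checked first on every lookup. This is an illustrative toy with made-up names, not the change-log format of any real database.

```python
# The printed book, which is never edited directly between reprints.
base_book = {"Chen": "555-0103", "Diaz": "555-0104"}

# The addendums (change log), oldest first.
change_log = []

def issue_addendum(name, new_number):
    """Record a change without touching the printed book."""
    change_log.append((name, new_number))

def lookup(name):
    """Check the addendums first, newest to oldest, then fall back to the book."""
    for changed_name, number in reversed(change_log):
        if changed_name == name:
            return number
    return base_book.get(name)

def reprint():
    """Fold every addendum into the book and start a fresh, empty log."""
    for name, number in change_log:
        base_book[name] = number
    change_log.clear()

issue_addendum("Chen", "555-0199")
issue_addendum("Chen", "555-0200")
```

After the two addendums, `lookup("Chen")` returns the newest number even though `base_book` still holds the old one. The cost is that every lookup has to scan the log, which grows until someone calls `reprint()`.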
 
Now scale this to billions of names in data centers distributed around the world accessed by millions of users.
 
The basic goal of a database is to maintain the illusion that there is only one copy, only one person changes it at a time, everyone always sees the most current copy, and it is instantly fast. This is impossible to achieve at scale when millions of people are accessing and updating trillions of data elements from all over the world.
 
The task of database design, therefore, is to come as close to this illusion as possible using hundreds of interlocking algorithmic tricks.
