Database Sharding

Good day software developers and engineers. Today is Wednesday and it seems it is going to be a sunny day in the Twin Cities of Minneapolis and St. Paul. Hope the weather cooperates in your neck of the woods.

I continue to read and experiment with the contents presented in the on-line course Introduction to Python Fundamentals: Lesson 02 by Paul Deitel published by Pearson Education, Inc. As a matter of fact, tomorrow I will start with lesson 03. So far I am pleased to be reviewing and learning new things during the process.

I am always interested in improving the performance of the `storage server` product at work. Earlier this week I watched the video Seattle Conference on Scalability:  YouTube Scalability which features a talk by Cuong Do who is a former employee of YouTube and Google. The video was posted August 22, 2012.

The video describes different changes that were made to YouTube in order to improve performance. I enjoyed the presentation because several of the problems described were also encountered and solved in the late 1990’s by my team.

For example, there used to be limitations in most operating systems (never generalize) with the number of files a folder / directory could hold. In case of the storage server, we started by designing locations for the bitfiles (files) as single disk drives we call a disk cache. Each disk cache contained a single folder.

Shortly after we encounter limits in the number of files that directories in Linux, UNIX, and Windows could hold.

The issue was quickly solved by creating 256 folders in the root of the disk caches. The folders are named from 00 to ff. The bitfile names are similar to GUIDs in Windows but the contents differ from GUIDs. I initially called them FID (for File ID) and later changed the name to GUID. A GUID is a 16-byte (32-character) hexadecimal value. I decided to take the last two hexadecimal values from the GUID and associate it with the name of a folder in a disk cache. By incrementing the last 4-bytes in the name of the GUID, we were able to store each GUID in the corresponding folder. For example, the bitfile named with the unique GUID 00505689D25C4C11ED076058B07B7FE7 would be stored in any disk cache in the folder named EF.

When the storage server runs, the total number of files in all folders is in a small range of values. The YouTube video did not cover in specific how the issue was addressed.

Another issue to improve performance was the use of database sharding. It appears that at the time YouTube was using MySQL. In my case we started only supporting SQL Server from Microsoft and then added support for MySQL and MariaDB.

For performance reasons, a server needs to quickly locate a specific bitfile (file) in a disk cache. Let’s assume (incorrectly) that we use the GUID. As we mentioned, a GUID is a 16-byte string. We also mentioned that the last 4 bytes (could have been reduced to 2 and till would be incorrect) in a GUID are generated in monotonically ascending order. This helps distributing bitfile evenly in a disk cache. So far all seems fine with the approach.

Let’s take a look at a single disk cache using a single disk cache. It would hold 256 folders with names in the range 00 – ff. At some point in time we would need to split the database into two halves. The first shard would holds bitfiles in the range 00-7f and the second shard would hold bitfiles in the range 80 – ff. The same would hold true for the database. One server would hold a database and a disk cache with the first half of the bitfiles, and a second server with a separate database and disk cache would hold the second half.

Now let’s see what happens when a disk cache or a database needs to be split into two shards. There are many operations (one per database record and one per bitfile) needed to split the database and the disk cache. Such operations are time consuming and cannot be performed when a limit is reached (i.e., a number in the database reaches a limit). This is an operation that should occur as maintenance (batch) when a limit is approaching. A request to store operation would take a long time and the user would notice the delay when compared to operations that do not require a database split.

A slightly different approach, which was also reached by YouTube, was to shard the databases using the userID field. Since the userID value is generated in a monotonically ascending order, when a limit is reached, a new shard that is prepared in anticipation can be used. This change would require a change in the function that routes the new userID files to a new shard and hopefully into a new disk cache in a new server. Note that this operation can be done as part of servicing a store request because no data movements between databases or reconciliations are needed.

In the case of the storage server, we use a bitfileID field that holds a unique value for each bitfile and is generated in a monotonically ascending order. No need to move data between databases or bitfiles between disk caches residing in different servers when a new shard is required. This is a much cleaner and simpler approach (KISS principle) when operating with a sharded database and disk caches.

Since it will take some time to complete the Python course, tomorrow I will continue solving problems using Java. I will switch to Python when I complete the on-line course.

If you have comments or questions regarding this, or any other post in this blog, please do not hesitate and leave me a note below. I will reply as soon as possible.

Keep on reading and experimenting. It is one of the best ways to learn, become proficient, refresh your knowledge and enhance your developer toolset.

Thanks for reading this post, feel free to connect with me John Canessa at LinkedIn.



Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.