When playing around with Cassandra, I have realized that with it's limiting data model and lack of queries, using pure Cassandra may not be the optimal solution in applications that require joining lot's of different data and use complex queries as you'd have to denormalize the data substantially and this could become a management nightmare when you need to update something that is stored in five different column families etc so I decided to try to build a hybrid of the non-relational Cassandra and the great MySQL.
Sharding
To achieve near infinite scaling and fault tolerance on relational databases, pretty much the only way to go that I know of is resolving to sharding that basically splits up the entire schema by rows so there are many databases on different machines with each database containing the entire schema but only a subset of the rows, a slice of the total knowledge.
This way, if we know which database holds which data, we can go ask it for it or write to it and thus both read and write performance is nicely distributed among the different machines. This also improves availability as even if one of the machines dies and we have a total of say 20 of them simple cheap commodity hardware boxes, only one twentieth of our service would be disrupted.
Sharding is a big topic that I don't pretend to fully cover in this blog post, but I'll try to remind the general idea through my own experience.
How does it work
The first thing to figure out in a specific problem is by which to shard your data. This is an important choice because you want to place most tightly coupled data on the same shard so you can still do joins over the different data and this is only possible if the data to join lives on the same database. In a user-centric system such as social networks, it's usually a good idea to shard by user such that everything to do with a single user such as profile info, friends, pictures etc are on the same shard - this way joins can still be performed for given user.
Shard key
This brings us to sharding key that is used to choose which shard to place the data on. A simple candidate for it could be the first letter of the user name, but this is actually a bad idea, as there are much more people names starting with S than Z and you want the keys to be distributed equally on every machine.
Since in a sharded system, the data is divided between many machines, you have to figure out on which box to write new data and which machine to fetch and update it from. This brings us to sharding scheme, which decides, which shard should get which keys.
Sharding scheme
The simplest way would be to take some user property such as email, hash it for better distribution (as hash is a rather random-looking string even though the same input always produces the same hash) and then take a modulo with the number of shards in the pool.
Is pseudocode:
shardCount = 10
shardKey = kallaspriit@gmail.com
hash = sha1(shardKey)
shardId = hash % shardCount
This divides all the user emails among the shards equally and is very simple to implement. This would work great to the point that we add or remove any shards. This would change the shard count and thus the modulo operation result for most of the keys which would mean we'd have to relocate most of the data that would be very expensive thing to do so this approach is only reasonable if one is pretty sure that the number of shards wont change and if you are aiming for big scalability, this wont be the case.
Consistent hashing
One might wonder whether there is a way to minimize the number of keys that we have to relocate in the case of adding or removing shards and in fact, there is. This method is called consistent hashing and having n shards, the number of keys that need to be relocated when adding or removing a shard is at most 1/n.
How consistent hashing works in short is by hashing both the shards and shard keys on an imaginary circle and to find which shard a key belongs to, just move along this circle until you reach a hash of one of the shards.
In the picture, A, B and C are the shards and numbers are the shard keys. They are are hashed into a range that is connected end-to-end such that if no shard is found for a key in the end of the range, the search continues from the start. In the example above, keys 1, 4 match to server A, 2 to B and 3 to C. If we were to remove server B, key 2 would also hash to server C and same logic applies for adding nodes. In practise, for better distribution, each node is hashed to the circle many times (64 worked good for me with low enough standard deviation).
Dictionary lookup
I implemented the consistent hashing algorithm described above in PHP but then I realized that this is still good enough as it minimizes but does not eliminate the need for relocating data, that would be a pain. The most powerful way to implement sharding is by keeping a lookup service that knows where to map each shard key. This eliminates relocating stemming from changing the number of shards but supports relocating very precisely should it be necessary.
The initial way I imagined this was to create a central web-service that each server node could contact to find which shard to store or fetch data from. Then I remembered that "central" is a very bad concept in the world of distributed and this would become both a bottleneck and a single-point-of-failure.
Then I decided to implement this using a database that every server node could contact for this mapping purpose. This way there isn't a single service node to become the bottleneck and talking directly with a database is generally faster than doing it through a web-service, even if a lightweight one such as REST. There would also be security issues to solve.
Implementation
I implemented this in PHP and the MySQL database and to test the concept, built a simplistic blog on it. Basically the blog allows people to register, create their blog, add entries and others can read these.
The first thing I discovered building such a system is that it is not enough to match a single shard key (say the email that I used as username) to a shard as to login, I needed to know which shard does the globally unique email address resolve to but to display anyone's blog, I need to find the shard by blog name. To overcome this, I introduced extra level of abstraction by mapping any number of identifiers (email, blog name etc) to a shard key and then the shard keys to the shards.
As in a bigger system, there may be multiple separate components requiring separate schema's, I also introduced the concept of shard groups that included the SQL needed to setup new shards. This way it would be easy to build an administration panel where by just entering the access information of new shard database and choosing the group, the database could be setup automatically.
All of this led up to four tables for my MySQL-based sharding schema:
The group_id and identifier form a primary key for alias table as in a single group, the identifier must be unique. This is also a convenient way to ensure that the email address users register with are unique as without this, one would have to query all of the shards to find out whether an email address is already taken.
The key table just matches the shard key to a shard. Shard table contains the information required to contact the shard and a reference to group so it knows how to initially setup the shard with the SQL provided. The shard also has a capacity field which shows the relative power of the sharding machine. This enables sharding over various performance boxes as the load balancer makes sure that servers with capacity index twice as large will get about twice the number of keys. The number of keys a shard is currently storing is also kept here and is used for the load-balancing. When a new shard is added, it starts to get slightly more keys than the rest until the number of stored rows balance out.
In the actual implementation, there are two levels of cache between the clients and the database. For a short period of time, very fast local machine cache (PHP APC) is used. When this times out, global network of memcache is consulted and only when this is missed, an actual query to the database is made. This is of-course true only for fetch operations, adding new entries always require talking to the database but for fetches, it should have near 100% cache hit rate so the database should not get much load.
The key table just matches the shard key to a shard. Shard table contains the information required to contact the shard and a reference to group so it knows how to initially setup the shard with the SQL provided. The shard also has a capacity field which shows the relative power of the sharding machine. This enables sharding over various performance boxes as the load balancer makes sure that servers with capacity index twice as large will get about twice the number of keys. The number of keys a shard is currently storing is also kept here and is used for the load-balancing. When a new shard is added, it starts to get slightly more keys than the rest until the number of stored rows balance out.
In the actual implementation, there are two levels of cache between the clients and the database. For a short period of time, very fast local machine cache (PHP APC) is used. When this times out, global network of memcache is consulted and only when this is missed, an actual query to the database is made. This is of-course true only for fetch operations, adding new entries always require talking to the database but for fetches, it should have near 100% cache hit rate so the database should not get much load.
What next
I have tested this setup and it works great, the only problems being that this single central lookup database does not scale, eventually becoming a bottleneck and is also a single point of failure.
This is where finally Cassandra comes in as I plan to port this system over from using this central relational database to the distributed Cassandra, that should solve the problems of scaling and availability. The schema is simple enough so it should not be too hard to implement on Cassandra so I will try it and see how it pans out, You should be able to read about it in coming posts so stay tuned.
I have tested this setup and it works great, the only problems being that this single central lookup database does not scale, eventually becoming a bottleneck and is also a single point of failure.
This is where finally Cassandra comes in as I plan to port this system over from using this central relational database to the distributed Cassandra, that should solve the problems of scaling and availability. The schema is simple enough so it should not be too hard to implement on Cassandra so I will try it and see how it pans out, You should be able to read about it in coming posts so stay tuned.