Tuesday, May 24, 2011
Lightspeed-PHP performance
Lightspeed-PHP, still under heavy development, has been designed from the ground up to be a lightweight, easily understandable and extendable, full-featured yet web-oriented high-performance PHP framework that plays well with the Hiphop PHP-to-C++ transformer-compiler and is well suited for distributed deployments.
It is now maturing and mostly ready for prime time, taking on its first projects. There is thorough documentation auto-generated from the comments, but a proper manual will be written once the framework has driven an actual project and become stable enough.
A sample application was built on it: a simple blog where people can register and add posts. It was used for the performance and scaling tests and uses the Cassandra-driven sharding architecture discussed so far.
The more code and heavy lifting there is in the application, the more Hiphop shines. Below is a graph of the example application's front-page benchmark, comparing vanilla PHP with the Hiphop-compiled version.
As can be seen, Hiphop performs nearly ten times better at higher loads. The 90% render times similarly show greatly improved performance.
As the front page is very lightweight, it is not a good indication of real-world performance. To test that, the benchmark simulated registering a user, which requires several round trips to Cassandra, a MySQL shard and Memcached. Graphs of requests per second and 90% time for this comparison follow.
Again, the results of the Hiphop-PHP-driven application are impressive, translating into great hardware savings for heavily loaded systems.
Scaling testing
To test scaling, the cluster in the Amazon cloud was doubled so that it contained two of everything - webservers, Cassandra and Memcache instances and RDS databases. Together with a testing machine and a load balancer, this meant roughly ten machines. Graphs of the results follow.
As can be seen, requests per second almost doubled while the 90% time also dropped. As neither of these alone is a sufficient indication of performance, I devised a measure I call the performance index: simply requests per second divided by the 90% time, which yields a higher index for higher req/s and lower 90% time, both valued qualities. Putting this data on a graph gives the following:
As req/s almost doubled and the 90% time also decreased, the performance increase appears to be more than a doubling - about 350%, to be precise. I'm not entirely sure what the implications of this are, but it looks great.
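As an illustrative calculation (the numbers below are made up purely to show how the index combines the two metrics, they are not from the benchmark):

performance index = requests per second / 90% time
single cluster:  100 req/s / 0.40 s  ->  index 250
doubled cluster: 190 req/s / 0.22 s  ->  index ~864, roughly 3.5 times higher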
This testing has shown the good performance of the Lightspeed-PHP framework, especially when using the Hiphop-PHP transformer. It also demonstrated that the Cassandra-driven sharding architecture works and should indeed provide high, linear scalability.
Detailed analysis
This test is more thoroughly explained with all the numbers in a PDF document that you can download here.
Monday, May 9, 2011
Sharding PHP implementation
Design
The main design objective in implementing the database sharding described in the previous post in PHP was to make it pluggable and unobtrusive, meaning the rest of the data manipulation code is only loosely coupled to it, making sharding optional and as painless to use as possible.
The implementation uses the strategy pattern, enabling different sharding backends to be plugged in (relational database and Cassandra based strategies are implemented) as well as different sharders, meaning one might use Cassandra to shard relational databases, or perhaps filesystems instead.
Following is a class diagram of the core of the sharding implementation showing the various entities and their relationships.
Shard represents one of the shards that the sharder maps keys to; it has a name, belongs to a group, and knows its capacity and the number of keys it is currently storing. As shards are not backend-specific, there are no explicit fields for a database DSN or username and password but rather a single string properties field that stores whatever information is needed to connect to the shard based on its type. The Sharder implementation knows how to parse this and access the actual shard (which in the main use case is a relational database).
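For example, the properties field could hold a small JSON blob with the connection details; the exact format is up to the Sharder implementation, so the following is only a hypothetical sketch of how a database-backed sharder might parse it:

// Hypothetical contents of the "properties" field for a MySQL-backed shard:
// {"dsn":"mysql:host=shard1.example.com;dbname=blog","username":"app","password":"secret"}

function connectToShard($properties) {
    $info = json_decode($properties, true);

    if ($info === null) {
        throw new Exception('Invalid shard properties: ' . $properties);
    }

    // open a PDO connection to the shard database described by the properties
    return new PDO($info['dsn'], $info['username'], $info['password']);
}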
A ShardGroup is, as the name suggests, a group of shards; it has a name and the properties needed to set up a new shard on a new host. Thinking about it now, the "getSetupSql()" method will likely be renamed to something more generic, as not all shards have to be databases.
The ShardSchema is the main manager of shards and shard keys: it handles managing groups and shards, creating and resolving keys, etc. It uses the strategy pattern so it can be implemented on various backends.
The Sharder is a higher-level proxy for ShardSchema whose main purpose is adding a cache layer on top of the actual schema, as operations like resolving an identifier to a key and a key to a shard can easily be cached, avoiding a hit on the sharding backend for every request and using the much faster cache instead. It actually implements three levels of cache for most operations (a sketch follows the list below):
- Consecutive requests for the same lookup during a single request are returned from an internal hashmap, so they are very cheap - not even hitting the cache.
- For a short period of time, lookups are served from a local cache (APC by default), which is very fast but not distributed.
- For a longer period of time, lookups hit the global distributed cache that is shared by all webservers (Memcache by default).
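A minimal sketch of that layering, with hypothetical function, key and host names (the real Sharder API may differ):

function lookupShardName($key) {
    static $localMap = array();          // level 1: per-request hashmap
    $cacheKey = 'shard-of-' . $key;

    if (isset($localMap[$key])) {
        return $localMap[$key];
    }

    // level 2: local machine cache (APC), short lifetime
    $shardName = apc_fetch($cacheKey);

    if ($shardName === false) {
        // level 3: distributed cache (Memcache), longer lifetime
        $memcache = new Memcache();
        $memcache->connect('memcache-host', 11211);
        $shardName = $memcache->get($cacheKey);

        if ($shardName === false) {
            // all caches missed - resolve from the actual sharding backend
            $shardName = resolveFromShardSchema($key); // hypothetical backend call
            $memcache->set($cacheKey, $shardName, 0, 600); // keep for 10 minutes
        }

        apc_store($cacheKey, $shardName, 60); // keep locally for a minute
    }

    $localMap[$key] = $shardName;

    return $shardName;
}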
Cassandra setup
The relational schema explained in the previous post was mostly converted to Cassandra's domain. As Cassandra is not well suited to the numeric auto-increment indices common in relational databases, I switched to using named identifiers instead, for both Cassandra and the earlier PDO implementation, which led to a major rewrite - good thing I had covered the code 100% with unit tests, making the refactoring easier.
Just for reference, the new relational sharder schema is now as follows:
CREATE TABLE IF NOT EXISTS `alias` (
`group` varchar(20) COLLATE utf8_unicode_ci NOT NULL,
`identifier` varchar(60) COLLATE utf8_unicode_ci NOT NULL,
`key_id` int(10) unsigned NOT NULL,
PRIMARY KEY (`group`,`identifier`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE IF NOT EXISTS `group` (
`name` varchar(20) COLLATE utf8_unicode_ci NOT NULL,
`setup_sql` text COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE IF NOT EXISTS `key` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`shard` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=876 ;
CREATE TABLE IF NOT EXISTS `shard` (
`name` varchar(20) COLLATE utf8_unicode_ci NOT NULL,
`group` varchar(20) COLLATE utf8_unicode_ci NOT NULL,
`keys` int(10) unsigned NOT NULL,
`capacity` int(10) NOT NULL,
`properties` varchar(1024) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
For Cassandra, I managed to keep the schema rather similar: the primary keys became row keys, and where the primary key consists of multiple fields, both are present in Cassandra as well (as an aggregate key).
Above is the Cassandra schema; as I don't know of a proper standard for describing it in class-diagram-like notation, I invented my own. As you can see, an alias points to a key, which in turn points to the shard where the data for the given key lives, and shards belong to groups - very simple really.
By the [IdxGroup] behind the "group" field of the "shard" column family I mean that it uses a new Cassandra feature called secondary indexes. This means I do not have to maintain a "materialized view" of shards by group myself; Cassandra does it for me and allows basic "where" queries, so I can, for example, fetch a list of shards by group name - exactly what my implementation requires. Secondary indexes make using Cassandra as a more general-purpose database much easier, so it's a feature I'm very excited about.
The additional column family "counter" comes from the fact that I need something like auto-increment for key identifiers, but Cassandra does not have it built in, so I keep my own counters that I can increment and read by name. I'm actually lying a bit, as the latest version of Cassandra does have counters, but they do not seem to be implemented in the PHP library phpcassa that I'm using. I will definitely come back to this, because using a counter table this way is quite bad: it is subject to race conditions where I may pick a counter value and, before using and incrementing it, another thread takes it too.
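Roughly, the workaround looks like the sketch below, assuming phpcassa's ConnectionPool and ColumnFamily classes are loaded and a "counter" column family keyed by counter name exists (keyspace and row names here are hypothetical). The gap between the get() and the insert() is exactly where the race condition described above lives:

// hypothetical keyspace name; ConnectionPool and ColumnFamily come from phpcassa
$pool = new ConnectionPool('ShardSchema', array('127.0.0.1:9160'));
$counters = new ColumnFamily($pool, 'counter');

function getNextKeyId(ColumnFamily $counters) {
    // read the current value by counter name (assumes the "key_id" row has been seeded)
    $row = $counters->get('key_id');
    $next = (int) $row['value'] + 1;

    // write the incremented value back - another request may slip in between the
    // get() above and this insert(), which is the race condition described in the post
    $counters->insert('key_id', array('value' => (string) $next));

    return $next;
}

$keyId = getNextKeyId($counters);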
Next steps
I actually managed to get this working with the Hiphop-PHP compiler too, which required some modifications to the underlying Thrift library, so I'm pretty much ready to start testing the performance of such a setup. I'm hoping to get some cloud resources and put it through its paces with some JMeter tests on an actual cluster of webservers, Cassandra instances, shard databases and memcache, and see whether and how well it scales.
Saturday, April 16, 2011
A sharding model
A hybrid database model
When playing around with Cassandra, I realized that with its limited data model and lack of queries, pure Cassandra may not be the optimal solution for applications that need to join lots of different data and use complex queries: you would have to denormalize the data substantially, which could become a management nightmare when you need to update something that is stored in five different column families. So I decided to try to build a hybrid of the non-relational Cassandra and the great MySQL.
Sharding
To achieve near-infinite scaling and fault tolerance on relational databases, pretty much the only way I know of is to resort to sharding, which basically splits up the data by rows: there are many databases on different machines, each containing the entire schema but only a subset of the rows - a slice of the total knowledge.
This way, if we know which database holds which data, we can read from and write to it directly, so both read and write load is nicely distributed among the machines. It also improves availability: if one machine out of, say, 20 simple, cheap commodity boxes dies, only one twentieth of the service is disrupted.
Sharding is a big topic that I don't pretend to fully cover in this blog post, but I'll try to outline the general idea through my own experience.
How does it work
The first thing to figure out for a specific problem is what to shard your data by. This is an important choice because you want to place the most tightly coupled data on the same shard so you can still do joins over it, and that is only possible if the data to join lives in the same database. In a user-centric system such as a social network, it's usually a good idea to shard by user so that everything to do with a single user - profile info, friends, pictures etc. - is on the same shard; this way joins can still be performed for a given user.
Shard key
This brings us to the shard key, which is used to choose which shard to place the data on. A simple candidate could be the first letter of the user name, but this is actually a bad idea, as there are many more names starting with S than Z, and you want the keys to be distributed evenly across the machines.
Since in a sharded system the data is divided between many machines, you have to figure out which box to write new data to and which machine to fetch and update it from. This brings us to the sharding scheme, which decides which shard should get which keys.
Sharding scheme
The simplest way would be to take some user property such as the email address, hash it for better distribution (a hash is a rather random-looking string, even though the same input always produces the same hash) and then take the modulo with the number of shards in the pool.
In PHP, this could look like:
$shardCount = 10;
$shardKey = 'kallaspriit@gmail.com';
// sha1() returns a hex string, so take part of it as an integer before the modulo
$hash = hexdec(substr(sha1($shardKey), 0, 8));
$shardId = $hash % $shardCount;
This divides the user emails among the shards evenly and is very simple to implement. It would work great right up to the point where we add or remove shards: that changes the shard count and thus the modulo result for most of the keys, which means we would have to relocate most of the data - a very expensive thing to do. So this approach is only reasonable if you are pretty sure the number of shards won't change, and if you are aiming for big scalability, that won't be the case.
Consistent hashing
One might wonder whether there is a way to minimize the number of keys that have to be relocated when adding or removing shards, and in fact there is. The method is called consistent hashing, and with n shards, the number of keys that need to be relocated when adding or removing a shard is at most 1/n of them.
In short, consistent hashing works by hashing both the shards and the shard keys onto an imaginary circle; to find which shard a key belongs to, you move along the circle until you reach the hash of one of the shards.
In the picture, A, B and C are the shards and the numbers are shard keys. They are hashed into a range that is connected end to end, so that if no shard is found for a key by the end of the range, the search continues from the start. In the example above, keys 1 and 4 map to server A, 2 to B and 3 to C. If we were to remove server B, key 2 would also hash to server C, and the same logic applies to adding nodes. In practice, for better distribution, each node is hashed onto the circle many times (64 worked well for me, with a low enough standard deviation).
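A minimal sketch of such a ring in PHP, with virtual nodes, might look like the following (class and method names are mine, not the actual implementation from the post):

class ConsistentHashRing {

    private $ring = array();      // hash position => shard name
    private $positions = array(); // sorted list of occupied positions
    private $replicas;            // virtual nodes per shard

    public function __construct($replicas = 64) {
        $this->replicas = $replicas;
    }

    public function addShard($name) {
        // place every shard on the circle several times for better distribution
        for ($i = 0; $i < $this->replicas; $i++) {
            $position = $this->hash($name . ':' . $i);
            $this->ring[$position] = $name;
        }

        $this->positions = array_keys($this->ring);
        sort($this->positions);
    }

    public function getShard($key) {
        $position = $this->hash($key);

        // walk clockwise until we pass a shard position
        foreach ($this->positions as $candidate) {
            if ($candidate >= $position) {
                return $this->ring[$candidate];
            }
        }

        // wrapped around the end of the circle, so use the first position
        return $this->ring[$this->positions[0]];
    }

    private function hash($value) {
        return hexdec(substr(sha1($value), 0, 8));
    }
}

$ring = new ConsistentHashRing(64);
$ring->addShard('A');
$ring->addShard('B');
$ring->addShard('C');
$shard = $ring->getShard('kallaspriit@gmail.com');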
Dictionary lookup
I implemented the consistent hashing algorithm described above in PHP, but then I realized that it is still not good enough, as it only minimizes but does not eliminate the need to relocate data, which would be a pain. The most powerful way to implement sharding is to keep a lookup service that knows which shard each shard key maps to. This eliminates the relocation caused by changing the number of shards, yet still supports relocating data very precisely should it become necessary.
The way I initially imagined this was to create a central web service that each server node could contact to find which shard to store or fetch data from. Then I remembered that "central" is a very bad concept in the world of distributed systems, and this would become both a bottleneck and a single point of failure.
So I decided to implement it using a database that every server node can contact for this mapping. This way there isn't a single service node to become the bottleneck, and talking directly to a database is generally faster than going through a web service, even a lightweight one such as REST. There would also have been security issues to solve.
Implementation
I implemented this in PHP on top of MySQL and, to test the concept, built a simplistic blog on it. The blog lets people register, create their blog and add entries, which others can then read.
The first thing I discovered building such a system is that it is not enough to map a single shard key (say, the email address that I used as the username) to a shard: to log in I need to know which shard the globally unique email address resolves to, but to display anyone's blog I need to find the shard by blog name. To overcome this, I introduced an extra level of abstraction by mapping any number of identifiers (email, blog name etc.) to a shard key and then the shard keys to the shards.
As a bigger system may have multiple separate components requiring separate schemas, I also introduced the concept of shard groups, which include the SQL needed to set up new shards. This would make it easy to build an administration panel where, by just entering the access information of a new shard database and choosing the group, the database could be set up automatically.
All of this led to four tables in my MySQL-based sharding schema:
The group_id and identifier form the primary key of the alias table, as within a single group the identifier must be unique. This is also a convenient way to ensure that the email addresses users register with are unique - without it, one would have to query all of the shards to find out whether an email address is already taken.
The key table simply maps a shard key to a shard. The shard table contains the information required to contact the shard and a reference to its group, so the system knows how to initially set up the shard with the provided SQL. A shard also has a capacity field that expresses the relative power of the machine. This enables sharding over boxes of varying performance, as the load balancer makes sure that a server with a capacity index twice as large gets roughly twice the number of keys. The number of keys a shard currently stores is also kept here and is used for load balancing: when a new shard is added, it starts to receive slightly more keys than the rest until the numbers of stored keys balance out.
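To make the lookup chain concrete, here is a hedged PDO sketch of resolving an identifier to the connection properties of its shard; the table and column names loosely follow the relational sharder schema from the newer post above and are illustrative rather than the exact original four-table version:

// $lookup is a PDO connection to the central lookup database
function resolveShardProperties(PDO $lookup, $group, $identifier) {
    // identifier (email, blog name, ...) -> shard key
    $statement = $lookup->prepare(
        'SELECT key_id FROM alias WHERE `group` = :group AND identifier = :identifier'
    );
    $statement->execute(array('group' => $group, 'identifier' => $identifier));
    $keyId = $statement->fetchColumn();

    if ($keyId === false) {
        return null; // unknown identifier
    }

    // shard key -> shard name -> connection properties
    $statement = $lookup->prepare(
        'SELECT s.properties FROM `key` AS k JOIN shard AS s ON s.name = k.shard WHERE k.id = :id'
    );
    $statement->execute(array('id' => $keyId));

    return $statement->fetchColumn();
}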
In the actual implementation there are two levels of cache between the clients and the database. For a short period of time, the very fast local machine cache (PHP APC) is used. When this times out, the global network of memcache is consulted, and only when that misses too is an actual query made to the database. This is of course true only for fetch operations - adding new entries always requires talking to the database - but fetches should have a near 100% cache hit rate, so the database should not get much load.
What next
I have tested this setup and it works great; the only problems are that this single central lookup database does not scale, eventually becoming a bottleneck, and it is also a single point of failure.
This is where Cassandra finally comes in: I plan to port this system from the central relational database over to the distributed Cassandra, which should solve the problems of scaling and availability. The schema is simple enough that it should not be too hard to implement on Cassandra, so I will try it and see how it pans out. You should be able to read about it in coming posts, so stay tuned.
Saturday, April 9, 2011
The data model
The way Cassandra organizes, stores and retrieves its data is what makes it special. The model is conceptually different from what one might be used to from relational databases, but this is what makes its distributed nature possible.
Cassandra's data model trades ACID-compliant data practices for important advantages in performance, availability, and operational manageability.
Columns
Starting from the bottom up, the smallest increment of data is the column, which represents a tuple (well, a triplet actually) containing a name, a value and a timestamp.
A column JSON-encoded could look like this:
{
name: "email",
value: "foo@bar.com",
timestamp: 345678765
}
All the values are supplied by the client, including the timestamp, which means that the clocks on the client and server environments should be synchronized, as the timestamps are used for conflict resolution (remember, Cassandra embraces always-write, so conflicts are resolved on reads). The timestamp can be anything you like, but microseconds since 1970 is the convention.
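In PHP, a client-supplied timestamp following that convention could be generated along these lines (a sketch; many client libraries will do this for you):

// microseconds since the Unix epoch, as an integer
$timestamp = (int) round(microtime(true) * 1000000);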
Super columns
Super columns can be used like columns, except they add another layer to the hierarchy: instead of mapping a key to a value, a super column maps a key to another set of columns. A super column is thus an associative array of columns, and its keys are also sorted.
One can thus think of columns and super columns in terms of maps: A row in a regular column family is basically a sorted map of column names to column values; a row in a super column family is a sorted map of super column names to maps of column names to column values.
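Continuing the JSON analogy from above, a row in a hypothetical super column family holding blog posts per user could look roughly like this (illustrating the structure, not an actual API response):

{
    "kallaspriit": {
        "post-1": { "title": "Hello world", "body": "First post", "created": 345678765 },
        "post-2": { "title": "Sharding", "body": "Notes on sharding", "created": 345679000 }
    }
}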
Column families
A column family is the closest thing to a table in a relational system, as it is a collection of columns. Column families have to be defined in the storage configuration file and cannot be modified without restarting the Cassandra process. A column family holds an ordered list of columns that you can retrieve by name.
The way columns are ordered can be chosen by the user, and this matters for operations that slice out a range of columns. Natively supported orderings include ASCII, UTF-8, Long and UUID (lexical or time-based).
A column family may have a virtually unlimited set of column names, which makes it common to use the column names themselves as a piece of runtime-populated data.
Rows
Each column family is stored in a separate file, sorted in row (i.e. key) major order, so related columns that you access together should be kept in the same column family. The row key determines which machine the data is stored on. Each row key can have data from multiple column families associated with it, but since these are logically distinct, the interface is oriented around accessing one column family per key at a time.
Keyspaces
Keyspaces are the first and largest dimension of the Cassandra hash system. A keyspace is a container of column families, which makes it the equivalent of a schema (a collection of tables) in a relational database. Keyspaces are the configuration and management point for column families and also the structure on which batch inserts are applied.
Visually
Below is my attempt to visualize the data model.
Sunday, April 3, 2011
Cassandra elevator pitch
This is a good elevator pitch for the Cassandra database in 50 words written by Eben Hewitt:
"Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon's Dynamo and its data model on Google's Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web".
Distributed and decentralized means that it is meant to be deployed on more than one, possibly hundreds of, common commodity servers, and that it runs on them seemingly as a unified whole - the application can talk to it like a simple single-box database. It is decentralized because every node of the database is equal: there are no master servers coordinating the rest, and thus there is no single point of failure. Every node holds part of the data and shares the total load equally.
Cassandra is elastically scalable because it is easy to add new nodes as the application becomes more popular, or remove some should the opposite happen. You can just add new nodes and Cassandra will begin sending them work without the application even knowing that anything has changed.
Replication of data among the nodes means that if a node fails, the data is still available on other nodes and can be recovered. Users can tune this replication factor to choose between overhead and peace of mind. There are in fact several replication strategies. The simple, rack-unaware strategy places the replicated data on the next node along the ring. If you run a bigger deployment across several data centers or availability zones, you can use the rack-aware strategy, in which case replica 2 is placed on the first node along the ring that belongs to another data center than the first; the remaining N-2 replicas, if any, are placed on the first nodes along the ring in the same rack as the first. This sort of replication means that Cassandra is highly available, and since every node is equal and the failure of any random one does not make the whole system unstable, it is also fault tolerant.
Consistency means that whenever you read from the database, you get the most recently written value. Imagine buying something off eBay while someone else buys the exact same thing at almost the exact same moment. Whoever gets their request processed first should be able to buy the item, while the other buyer should be notified that the product is already sold. This seems like a logical thing to expect from a database, but it comes at a price in distributed databases: all update operations have to be executed synchronously, meaning they must block, locking all replicas until the operation completes, and locking in distributed systems is very expensive.
In many cases you don't actually need this. For example, with replies in a forum, it won't hurt anyone if, right after a post is edited, someone viewing the thread still sees the unchanged entry. In Cassandra one can choose, case by case, whether absolute consistency is needed, and tune the value anywhere in between - this is why Cassandra is called "tuneably consistent". Eventual consistency means that all updates propagate through the system to all replicas in some amount of time. The optimistic approach to replication is to not block the client and to propagate changes in the background, but this means possible conflicts have to be detected and resolved. Whether to do this while writing or while reading the data is an important design choice and dictates whether the system is always readable or always writable - again, in the world of compromises, you can't have it all.
Cassandra chose to be always writable, deferring the complexity of conflict resolution to read operations. By tuneably consistent it is meant that clients can choose how many replicas to block on during update operations, or to consult during read operations. You could always set this consistency level to the number of replicas, but that would hurt performance and defeat the point of using something like Cassandra in the first place. Setting it below the replication factor means that operations can succeed even if some nodes are down, but you cannot expect to always read the most recently written value - which, as described earlier, is not always necessary. The nice thing about Cassandra is that you can still opt for consistency for the critical operations that require it.
"Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon's Dynamo and its data model on Google's Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web".
Distributed and decentralized means that it is meant to be deployed on more then one, possible hundreds of common commodity hardware servers and it can run on them seemingly as a unified whole, application can talk to it like a simple single-box database. It is decentralized as every node of the database is equal, there are no master servers coordinating the rest and thus there is no single point of failure. Every node is equal and holds part of the data and share the total load equally.
Cassandra is elastically scalable as it's is easy to add new nodes as the application becomes more popular or remove some should the opposite happen. You can just add new nodes and Cassandra will begin to send it work without the application even knowing that anything has changed.
Replication of data among the nodes is used so if a node fails, the data is still available in some other nodes so it can be recovered. Users can tune this replication factor to choose between the overhead and peace of heart. In fact there are several replication strategies. The simple, rack unaware method simply places replicated data on the next node along the ring while if you are running a bigger deployment on several data centers/availability zones, you can use the rack-aware strategy in which case replica 2 is placed in the first node along the ring that belongs in another data center than the first; the remaining N-2 replicas, if any, are placed on the first nodes along the ring in the same rack as the first. This sort of replication means that Cassandra is highly available and since every node is equals and failing of any random one does not cause the whole system to become unstable, it is also fault tolerant.
Consistency means that whenever you read from the database, you get the most recently written value. Imagine buying something off eBay and someone else buying the exact same thing at almost the exact same moment. Who-ever got their request processed first should be able to buy the item while the other buyer should be notified that the product is already sold. This again seems like a logical thing to expect from a database, but this comes at a price in distributed databases as all update operations should be executed synchronously, meaning they must block, locking all replicas until the operation completes and locking in distributed systems is very expensive.
In many cases, you don't actually need this as for example when you have replies in a forum, it won't hurt anyone if right after changing a post, someone views the thread and still sees the unchanged entry. In Cassandra, one can choose in each case whether absolute consistency is needed or not and tune the value in between. This is why Cassandra can be called "tuneably consistent". This eventual consistency means that all updates propagate through the system to all replicas in some time. The optimistic approach to replication is not blocking the client and propagate changes in the background but this means we have to detect and resolve possible conflicts. Whether to do this during writing or reading the data is an important design choice and dictates whether the system is always readable or writable - again in the world of compromises, you can't have it all.
Cassandra chose to be always writable, deferring the complexity of conflict resolution to read operations. By tuneably consistent, it is mean that clients can choose how many replicas to block for during update operations or consult during read operations. You could always set this consistency level to the number of replicas, but this would impact performance and lose the point of using something like Cassandra in the first place. Setting it to below the replication factor means that operations can succeed even if some nodes are down and you can not expect to always read the most recently written value but as described earlier, this is not always necessary. Nice thing about Cassandra is that you can still choose to opt for consistency for critical operations that require this.
Saturday, April 2, 2011
ACID, what and why?
One of the fundamental features of modern databases is transactions conforming to the ACID properties. A transaction is a transformation of state from one consistent state to another that happens virtually in a single step.
ACID is an acronym for:
- Atomic means all or nothing: when a transaction is executed, all the statements it contains must succeed or none of them can have any effect; there can be no partial failure where some parts succeed and others fail, as this would put the database in an inconsistent state. The common example is a money transfer, where it's obviously very important that if the transferred amount is subtracted on one side, it is added on the other, even if the power goes out somewhere in the middle (a PDO sketch of this follows the list).
- Consistent means that the database moves from one consistent state to another with no possibility that some other client could read the data in some intermediate state.
- Isolated means that transactions executed concurrently will not clash with each other: each executes in its own space, and if they wish to update the same data, one has to wait for the other to complete.
- Durable simply means that once a transaction succeeds, the data is not lost: the modified data is available for future queries to read and modify.
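As an illustration of atomicity in PHP, here is a sketch of the money transfer with hypothetical table and column names, assuming a PDO connection $db configured to throw exceptions on errors (PDO::ERRMODE_EXCEPTION):

try {
    $db->beginTransaction();

    // both statements must succeed together or not at all
    $db->exec("UPDATE account SET balance = balance - 100 WHERE id = 1");
    $db->exec("UPDATE account SET balance = balance + 100 WHERE id = 2");

    $db->commit();
} catch (Exception $e) {
    // something failed - undo everything so no partial transfer is stored
    $db->rollBack();

    throw $e;
}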
These properties are obviously desirable, but in distributed systems they can be hard to enforce without sacrificing performance, and in future entries we will see how Cassandra clients can choose how to balance the different merits of consistency, availability and partition tolerance.
These make up the CAP theorem, which states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
- Consistency - all nodes see the same data at all times
- Availability - node failures do not stop the rest of the system operating
- Partition tolerance - the system can be split up to run on multiple nodes and can continue to function in spite of node and network failures
Cassandra has chosen availability and partition tolerance (AP), while relational databases are consistent and available (CA), and for example MongoDB, HBase and Google Bigtable chose CP.
Transactions can be implemented across several machines using two-phase commit, but as this locks all affected resources, it makes transactions even more expensive than they already are, so it quickly stops being an optimal solution for high-performance applications.
In Cassandra, one can choose the balance between consistency and latency but more on this later.
Next we will take a closer look at Cassandra's architecture and design philosophies.
Why consider non-relational databases?
Why would you want to use a non-relational database such as Cassandra in the first place, when we have great relational databases like MySQL that have proven themselves for decades, there is lots of know-how available, and learning to be productive with them is rather easy? Relational databases give us great tools like data normalization, consistency, transactions, and complex joins and queries.
The question to "what's wrong with relational databases" is really "nothing". They work great for many applications and are simple to use. The problem with relational databases arise when we begin to need more performance out of our database than normal RDBMS deployments can handle.
Modern web applications often connect millions of people in an interactive way, with lots of communication and data stored and exchanged at any time, and unfortunately there is no simple way to scale relational databases horizontally by just adding more nodes.
When an application becomes popular, performance issues often arise. This is first mitigated by throwing beefier machines at it, which can postpone the problem for a while, until there are no more affordable machines powerful enough. Then a cache layer is often introduced into the application, if it is not already there; this can reduce the read load significantly but makes the application considerably more complex.
Read load can be further reduced by creating read-only replicas of the main database, so that all reads go to one of the replicas and the main database is only used for writing. This complicates things further and brings problems of its own. Say you update something and redirect the user to a page that shows the updated content: that page reads from a replica and perhaps stores the new value in a cache that you cleared when updating the record. Replicas lag slightly behind the main database, so what happens if you read from one before the change has had time to propagate? This can easily happen, as the redirect is fast. Now you read old data and show it to the user, who gets confused about why the update had no effect. So instead of simply clearing caches when updating or deleting data, you have to update them, which complicates everything further. Notice that so far we have only been able to improve read performance - all the writes still go to the main database.
Now you've implemented caching and read replicas and optimized all the queries, but your site is becoming ever more popular, so it's still not enough. You might now partition the database by functionality, for example by moving search, comments or any other loosely coupled part to another machine, but you can only do this so many times. I've been down this exact road myself and it's not a pretty ride.
At this point you need a solution that really scales horizontally, meaning you can easily add or remove machines handling the data as your needs change. There are a couple of ways to do this, the most popular being sharding your database, so that different rows of the same tables live on different machines, or using a non-relational database. Sharding is actually very powerful, and when designing a truly scalable architecture I would try to mix both sharding and non-relational databases to get the best out of both worlds.
The question to "what's wrong with relational databases" is really "nothing". They work great for many applications and are simple to use. The problem with relational databases arise when we begin to need more performance out of our database than normal RDBMS deployments can handle.
Modern web applications often connect millions of people in an interactive way with lots of communication and data stored and exchanged at any time and unfortunately, there is no simple way to scale the performance of relational databases horizontally by simply adding more nodes.
When an application is becoming popular, the performance issues often arise. This is first mitigated by throwing beefier machines at it, which can postpone the problem for some time until there are no more affordable powerful enough machines. Then often a cache layer is introduced into the application if not already there, this can reduce the read load significantly but will make the application much more complex.
Read load can be further reduced by creating read-only replicas of the main database, so all the reads go to any of the replicas and the main database is only used for writing. This complicates things further and brings problems like what happens if you update something and redirect user to a page that shows the updated content. On this page, you will read from the replica and perhaps store the new value to a cache that you cleared when updating the record. Replicas lag slightly behind the main and what happens if you read from it before it has had time to propagate. This can happen often as the redirect is fast. Now you read old data and show it to the user that can get confused why the update had no effect. So instead of clearing some caches on data updating, deleting, you have to update them that further complicates everything. Notice that we have so far only been able to improve read performance, all the writes still happen on the main database.
Now you've implemented caching and read replicas and optimized all the queries but your site is becoming ever more popular so it's still not enough. You might now partition the database by functionality, for example by moving the search, comments or any other loosely coupled part to some other machine but you can do this only that many times. I've been down this exact road in my own experience and it's not a pretty ride.
At this point, you need a solution that really scales horizontally, meaning you can easily add or remove some machines handling all the data as your needs change. There are a couple of ways to do this, most popular being sharding your database so different rows of the same tables live of different machines or by using a relational database. Actually sharding is very powerful and when designing a truly scalable architecture, I would try to mix both sharding and non-relational databases to get the best out of both worlds.