Why Apache Cassandra?

cassandra_logoThe following question came up a couple weeks ago: Why would I use Apache Cassandra? First of all let’s get a definition for Apache Cassandra. Edited from Wikipedia (https://en.wikipedia.org/wiki/Apache_Cassandra):

“Apache Cassandra is a free and open-source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous master less replication allowing low latency operations for all clients.

Cassandra also places a high value on performance. In 2012, University of Toronto researchers studying NoSQL systems concluded that “In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments” although “this comes at the price of high write and read latencies.”

When selecting a database it is worthwhile to take a look at the CAP Theorem which is illustrated in the following figure:


The CAP theorem according to Wikipedia (https://en.wikipedia.org/wiki/CAP_theorem):

In theoretical computer science, the CAP theorem, also named Brewer’s theorem after computer scientist Eric Brewer, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

Guarantees Definition
Consistency Every read receives the most recent write or an error.
Availability Every request receives a response, without guarantee that it contains the most recent version of the information.
Partition Tolerance The system continues to operate despite an arbitrary number of messages being dropped by the network between nodes.

In other words, the CAP Theorem states that in the presence of a network partition, one has to choose between consistency and availability. Note that consistency as defined in the CAP Theorem is quite different from the consistency guaranteed in ACID (Atomicity, Consistency, Isolation, Durability) database transactions.acid

Apache Cassandra is an open source, written in Java, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tunably consistent, column-oriented database.

Master slave architectures (e.g., MongoDB, Hadoop) depend on a master. In Cassandra all nodes behave the same. Nodes rebooting does not affect the operation of the entire system.

One could use Cassandra in many applications. Follow is a short list of requirements that suggests its use:

Big Data (billions of records rows & columns).
Very high velocity random reads & writes.
Flexible sparse / wide column requirements.
No multiple secondary indexes need.
Low latency.
Global logs.

One should not consider using Cassandra under the following circumstances:

Secondary indices are required (when a query has a where in a clause).
Relational data.
Transactional (rollback and commit).
Primary and financial records.
Stringent security & authorization needed on data (only when using anonymized data).
Dynamic queries on columns.
Searching column data.
Low latency (action reaction times very small).
Do not store binary data in Cassandra. Store the metadata. The binary should be kept in a storage server / disk.

This blog is based on a set of notes / observations that I made when selecting Cassandra for a project at work. The resulting system is hybrid. It contains a RDBMS (MySQL) and a NoSQL (Cassandra) databases.

If you have comments or questions regarding this or any other blog entry, please send me a message.



Leave a Reply

Your email address will not be published. Required fields are marked *