Friday, April 17, 2015

Cassandra basics


What is Cassandra: 


Cassandra is a highly scalable, open source NoSQL database. It is designed to handle big data workloads across multiple nodes with no single point of failure. Each node exchanges state information across the cluster every second.

Characteristics: 
  • Distributed in nature. It is based on a masterless architecture.
  • Every node is independent and shares nothing with the other nodes. Each node is responsible for a portion of the dataset. If you need more capacity, add more nodes.
  • Replicated. A client writes to a local node, and the data is then synchronised to the other nodes that hold copies of it.
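
To picture how each node can own a portion of the dataset while copies still live on other nodes, here is a minimal sketch of a hash ring. It is an illustration only, not Cassandra's implementation: the `Ring` class, the `token` helper and the node names are made up for this example, each node gets a single token, and MD5 is used simply because it ships with Python (Cassandra's default partitioner uses Murmur3 hashing).

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a position on the ring (MD5 purely for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy ring: hypothetical helper, one token per node."""

    def __init__(self, nodes, copies):
        self.copies = copies  # how many nodes should hold each row
        self.entries = sorted((token(name), name) for name in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, partition_key):
        """Return the owning node plus the next nodes clockwise on the ring."""
        # First node whose token is >= the key's token, wrapping around at the end.
        start = bisect.bisect_left(self.tokens, token(partition_key)) % len(self.entries)
        count = min(self.copies, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], copies=3)
print(ring.replicas("user:42"))  # three distinct node names
```

Adding a node to the list only changes ownership for the keys that fall just before its token, which is the intuition behind "if you need more capacity, add more nodes".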

What happens when you write to or read from Cassandra:
  • A sequentially written commit log on each node captures write activity to ensure data durability. 
  • Data is then indexed and written to an in-memory structure, called a memtable, which resembles a write-back cache. 
  • Once the memory structure is full, the data is written to disk in an SSTable data file.
  • Client read or write requests can be sent to any node in the cluster. When a client connects to a node with a request, that node serves as the coordinator for that particular client operation. 
  • The coordinator acts as a proxy between the client application and the nodes that own the data being requested. 
  • The coordinator determines which nodes in the ring should get the request based on how the cluster is configured.
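
The write steps above (commit log, then memtable, then SSTable) can be sketched on a single node as follows. This is a toy model under stated assumptions, not how Cassandra is implemented: the `Node` class and the `memtable_limit` parameter are hypothetical, "disk" is just a Python list, and the read method (check the memtable, then the newest SSTable first) is a simplification.

```python
class Node:
    """Toy single-node write path: commit log -> memtable -> SSTable."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only record of writes, kept for durability
        self.memtable = {}     # in-memory structure holding the latest value per key
        self.sstables = []     # immutable, sorted snapshots standing in for disk files
        self.memtable_limit = memtable_limit  # hypothetical "memtable is full" threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. record the write sequentially
        self.memtable[key] = value            # 2. apply it to the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. memtable full: write an SSTable

    def flush(self):
        # An SSTable is sorted by key and never modified after it is written.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # simplification: flushed writes need no replay

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # newest SSTable wins
            for k, v in sstable:
                if k == key:
                    return v
        return None


node = Node(memtable_limit=2)
node.write("a", 1)
node.write("b", 2)   # second write fills the memtable and triggers a flush
node.write("a", 3)   # newer value sits in the memtable, older one in the SSTable
print(node.read("a"), node.read("b"))
```

Because SSTables are never rewritten in place, an update simply lands in a newer structure, and the read has to prefer the most recent copy.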

Key terminologies: 
  • Node: where the data is stored.
  • Data center: a collection of nodes.
  • Cluster: contains one or more data centers and can span physical locations.
  • Commit log: every write is recorded in the commit log for durability.
  • SSTable: sorted string table. SSTables are append-only (never modified once written) and are stored on disk sequentially.

Key components: 
  • Gossip: a peer-to-peer communication protocol that nodes use to discover and share information about other nodes in the cluster. The gossip process runs every second and exchanges state messages with up to three other nodes, so every node eventually learns about all the other nodes in the cluster.
  • Partitioner: determines how data is distributed across the nodes in the cluster. Each row of data is identified by a partition key, which decides where in the cluster the row is stored.
  • Replication factor: determines how many copies of the data are stored in the cluster.
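
To get a feel for why gossiping with only a few peers is enough for every node to learn about every other node, here is a rough simulation. It is a sketch under loud assumptions: the function names are invented for this post, a node's "state" is reduced to the set of node names it has heard of, and peers are picked at random from the whole cluster. Real gossip messages carry versioned state, which is not modelled here.

```python
import random


def gossip_round(views, fanout=3, rng=random):
    """One tick: every node swaps what it knows with up to `fanout` random peers."""
    names = list(views)
    for name in names:
        peers = [p for p in names if p != name]
        for peer in rng.sample(peers, min(fanout, len(peers))):
            merged = views[name] | views[peer]  # both sides end up with the union
            views[name] = merged
            views[peer] = set(merged)
    return views


def rounds_until_converged(node_count, fanout=3, seed=0):
    """Count ticks until every node knows about every other node."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    names = [f"node-{i}" for i in range(node_count)]
    views = {name: {name} for name in names}  # each node starts knowing only itself
    everyone = set(names)
    rounds = 0
    while any(view != everyone for view in views.values()):
        gossip_round(views, fanout, rng)
        rounds += 1
    return rounds


print(rounds_until_converged(50))
```

With one tick per second, the number printed is roughly how many seconds this toy cluster needs before all 50 nodes know about each other.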

——————————————————————————————
Installing
——————————————————————————————

