Cassandra Student: Andreea Prodan


USAGES
Well suited to managing large amounts of data when high write speeds are needed and writes outnumber reads (data from sensors, connected appliances, and applications). Typically used for analytics, event logging, monitoring, and e-commerce.

ADVANTAGES
• High-speed data writes
• Decentralized: each node can present itself to any end user as a complete or partial replica of the database (peer-to-peer architecture)
• No single point of failure
• Distributed: data is spread across many nodes and even data centers, which allows it to be replicated across multiple geographic locations
• Scalable: can easily be scaled horizontally by adding more nodes
• Highly available: data remains available even if one or several nodes or data centers go down
• Fault tolerant

HIERARCHY
• Cluster
• Data center(s)
• Rack(s)
• Server(s)
• Virtual nodes
A cluster is a collection of data centers. A data center is a collection of racks. A rack is a collection of servers (machines where the Cassandra software is installed, also called physical nodes). A server contains virtual nodes.

KEYSPACES
keyspace = database
When a keyspace is created, it defines:
• a replication factor: the number of copies of the data kept in the cluster
• a replication strategy: how these replicas are distributed. There are two strategies:
  • SimpleStrategy
  • NetworkTopologyStrategy
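
As a rough sketch, the two strategies correspond to CQL `CREATE KEYSPACE` statements like the ones built below. The helper `create_keyspace_cql`, the keyspace name `shop`, and the data center names `dc1` and `dc2` are hypothetical and serve only as illustration; the code builds strings and does not talk to a cluster.

```python
def create_keyspace_cql(name, strategy, **options):
    """Build a CREATE KEYSPACE statement as text (hypothetical helper)."""
    replication = {"class": strategy, **options}
    # CQL map literals use single-quoted keys; integer values stay unquoted.
    body = ", ".join(
        f"'{key}': '{value}'" if isinstance(value, str) else f"'{key}': {value}"
        for key, value in replication.items()
    )
    return f"CREATE KEYSPACE IF NOT EXISTS {name} WITH replication = {{{body}}};"


# SimpleStrategy takes one replication factor for the whole cluster.
simple = create_keyspace_cql("shop", "SimpleStrategy", replication_factor=3)
# NetworkTopologyStrategy takes one replication factor per data center.
multi_dc = create_keyspace_cql("shop", "NetworkTopologyStrategy", dc1=3, dc2=2)

print(simple)
print(multi_dc)
```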

SIMPLE STRATEGY
Intended for a single data center. The first replica is placed on the node that owns the partition's token; the remaining replicas are placed on its successors, the nodes that immediately follow it clockwise on the ring.
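
The clockwise walk can be sketched as a toy model. It assumes one token per node and a tiny integer token space; the ring contents and the function name `simple_strategy_replicas` are made up for illustration and are not Cassandra's implementation.

```python
from bisect import bisect_left


def simple_strategy_replicas(ring, token, replication_factor):
    """ring: list of (token, node) sorted by token. Returns replica nodes in order."""
    tokens = [t for t, _ in ring]
    # The owner is the first node whose token is >= the key's token (wrapping around).
    start = bisect_left(tokens, token) % len(ring)
    replicas = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in replicas:
            replicas.append(node)
        if len(replicas) == replication_factor:
            break
    return replicas


# Hypothetical four-node ring.
ring = [(0, "A"), (25, "B"), (50, "C"), (75, "D")]
print(simple_strategy_replicas(ring, 60, 3))  # ['D', 'A', 'B']
```

Token 60 falls in the range owned by node D, so D holds the first replica and the walk wraps around to A and B.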

NETWORK TOPOLOGY STRATEGY
Allows the use of multiple data centers. The user specifies a replication factor per data center. Within each data center, replicas are chosen by walking the ring clockwise from the node that owns the partition's token and picking successor nodes that sit on distinct racks.
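
A simplified sketch of rack-aware placement follows. It assumes one token per node, and it falls back to nodes on already-used racks when a data center has fewer racks than replicas. The ring, the node names, and `nts_replicas` are hypothetical, and the real algorithm handles more cases than this toy does.

```python
from bisect import bisect_left


def nts_replicas(ring, token, rf_per_dc):
    """ring: sorted list of (token, node, dc, rack); rf_per_dc: {dc: replication factor}."""
    tokens = [entry[0] for entry in ring]
    start = bisect_left(tokens, token) % len(ring)
    # Clockwise walk over the whole ring, starting at the token's owner.
    walk = [ring[(start + i) % len(ring)] for i in range(len(ring))]
    result = {}
    for dc, rf in rf_per_dc.items():
        chosen, seen_racks, skipped = [], set(), []
        for _, node, node_dc, rack in walk:
            if node_dc != dc or len(chosen) == rf:
                continue
            if rack in seen_racks:
                skipped.append(node)  # same rack as an earlier replica
            else:
                chosen.append(node)
                seen_racks.add(rack)
        # Fewer racks than replicas: fill up with the nodes skipped earlier.
        chosen += skipped[: rf - len(chosen)]
        result[dc] = chosen
    return result


# Hypothetical ring: two data centers, two racks each.
ring = [
    (0, "n1", "dc1", "r1"),
    (10, "n2", "dc2", "r1"),
    (20, "n3", "dc1", "r1"),
    (30, "n4", "dc2", "r2"),
    (40, "n5", "dc1", "r2"),
    (50, "n6", "dc2", "r2"),
]
print(nts_replicas(ring, 15, {"dc1": 2, "dc2": 2}))
# {'dc1': ['n3', 'n5'], 'dc2': ['n4', 'n2']}
```

In `dc2` the walk passes `n6` but skips it, because `n6` shares rack `r2` with the already chosen `n4`.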

DATA STRUCTURE
Cassandra is a wide-column store (often described as column-oriented).
column family = table
query-first approach = tables are designed around a specific query, so that a read has to query only one table
partition key = determines on which node the data is stored
clustering key = determines how the data is sorted within a partition
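
The roles of the two keys can be pictured with a small in-memory model. The table `readings_by_sensor`, its columns, and the helper names are hypothetical; the model shows only grouping and ordering, not storage internals.

```python
from collections import defaultdict

# Hypothetical table "readings_by_sensor", designed for the query
# "all readings of one sensor, in time order":
#   partition key  = sensor_id     (groups rows, decides placement)
#   clustering key = reading_time  (orders rows inside a partition)
partitions = defaultdict(dict)


def write(sensor_id, reading_time, temperature):
    # Rows with the same partition key end up in the same partition.
    partitions[sensor_id][reading_time] = temperature


def read_partition(sensor_id):
    # A read touches one partition and returns rows sorted by clustering key.
    return sorted(partitions.get(sensor_id, {}).items())


write("s1", "2024-01-01T10:05", 21.5)
write("s2", "2024-01-01T10:00", 19.0)
write("s1", "2024-01-01T10:00", 21.0)

print(read_partition("s1"))
# [('2024-01-01T10:00', 21.0), ('2024-01-01T10:05', 21.5)]
```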

Cassandra uses tokens to determine which node holds what data. A token is a 64-bit integer (with the default partitioner, a hash of the partition key), and Cassandra assigns ranges of these tokens to nodes so that every possible token is owned by a node.
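
A minimal sketch of token ownership follows. It assumes three hypothetical nodes with one token each, and it uses MD5 as a stand-in hash, which is not the hash function Cassandra's default partitioner uses. The point is only that every signed 64-bit token maps to exactly one owner.

```python
import hashlib
from bisect import bisect_left

MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1


def toy_token(partition_key: str) -> int:
    """Stand-in hash (NOT Murmur3): map a key to a signed 64-bit integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def owner(ring, token):
    """Each node owns the range ending at its token; larger tokens wrap to the first node."""
    tokens = [t for t, _ in ring]
    return ring[bisect_left(tokens, token) % len(ring)][1]


# Hypothetical ring of three nodes.
ring = [(-2**62, "node1"), (0, "node2"), (2**62, "node3")]

for key in ("sensor-1", "sensor-2", "sensor-3"):
    token = toy_token(key)
    print(key, token, owner(ring, token))
```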

LANGUAGE
Cassandra exposes a dialect similar to SQL called CQL (Cassandra Query Language). Although it resembles SQL, it does not support join operations or subqueries.

INSERT
An INSERT overwrites existing data that has the same primary key. This can be prevented by adding the IF NOT EXISTS clause.

UPDATE
Unlike in SQL, UPDATE performs an implicit insert if the primary key does not exist in the dataset. This is useful for COUNTER columns.

TIME TO LIVE
Sets an expiration time for the written data, expressed in seconds, after which the data expires. It is specified with USING TTL <seconds>: at the end of an INSERT, or right after the table name in an UPDATE (UPDATE <table> USING TTL <seconds> SET …).
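
The write semantics above can be mimicked with a toy in-memory table. The class `ToyTable` and its methods are hypothetical and model only the visible behaviour (overwrite, IF NOT EXISTS, upsert, TTL expiry), not how Cassandra stores or expires data. The clock is injected so that expiry can be shown without waiting.

```python
class ToyTable:
    """Toy model of CQL write semantics; not Cassandra's storage engine."""

    def __init__(self, clock):
        self.clock = clock  # callable returning the current time in seconds
        self.rows = {}      # primary key -> (value, expires_at or None)

    def select(self, key):
        row = self.rows.get(key)
        if row is None:
            return None
        value, expires_at = row
        if expires_at is not None and self.clock() >= expires_at:
            del self.rows[key]  # TTL elapsed: the row is gone
            return None
        return value

    def insert(self, key, value, ttl=None, if_not_exists=False):
        # IF NOT EXISTS: leave a live row untouched and report "not applied".
        if if_not_exists and self.select(key) is not None:
            return False
        expires_at = None if ttl is None else self.clock() + ttl
        self.rows[key] = (value, expires_at)  # plain INSERT overwrites
        return True

    def update(self, key, value, ttl=None):
        # UPDATE creates the row when the primary key is missing (upsert).
        return self.insert(key, value, ttl=ttl)


now = [0]  # fake clock, advanced by hand
table = ToyTable(clock=lambda: now[0])

table.insert("k1", "first")
table.insert("k1", "second")                        # overwrites "first"
print(table.insert("k1", "third", if_not_exists=True))  # False
print(table.select("k1"))                           # second

table.update("k2", "created by update", ttl=60)     # implicit insert with a TTL
print(table.select("k2"))                           # created by update
```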

COUNTER
A counter is an integer column whose value can only be incremented or decremented. Its value is changed with UPDATE. Example uses are keeping track of the number of items bought by a customer or the number of views of a web page.
There are some limitations:
• counter columns can only be created in dedicated tables (every column outside the primary key must be a counter)
• a counter cannot be assigned to a column that is part of the primary key
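
A counter behaves much like the toy below: there is no explicit insert, and the first increment creates the value. The table name `page_views`, the column `views`, and the helper `update_counter` are hypothetical.

```python
from collections import defaultdict

# Primary key (url) -> counter value; a missing key starts at 0.
page_views = defaultdict(int)


def update_counter(url, delta):
    # Mirrors: UPDATE page_views SET views = views + <delta> WHERE url = <url>;
    page_views[url] += delta
    return page_views[url]


update_counter("/home", 1)
update_counter("/home", 1)
print(page_views["/home"])  # 2
```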

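BATCH
INSERT and UPDATE statements can be wrapped in a BATCH statement so that they are sent to the cluster as a single request. A batch is not a transaction in the relational sense, but it saves network round trips when several statements must be applied together, for example when the same data is written to multiple tables. The gain is largest when the statements target the same partition; very large batches that span many partitions put extra load on the coordinator node.
BEGIN BATCH
  … INSERT/UPDATE statements …
APPLY BATCH;
Counter updates use BEGIN COUNTER BATCH instead, and such a batch may contain only counter UPDATE statements.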

RESOURCES
• Seminar support
• Apache Cassandra - Tutorial: https://www.youtube.com/watch?v=Soaod2WRmlg&list=PLalrWAGybpB-L1PGANfFu2uiWHEsdscD&index=6
• Cassandra Data Replication: http://distributeddatastore.blogspot.com/2015/08/cassandra-replication.html
• Cassandra Data Partitioning: https://www.instaclustr.com/cassandra-data-partitioning/