Monday, October 26, 2020

Choose relational vs NoSQL database?

Relational or NoSQL?

You have this great software application in mind. Now, to implement it, what database should you use? Should it be relational or NoSQL?

Concepts

To make this decision, it is important to understand two software concepts.

CAP Theorem

With requirements of Consistency (reading the latest data), Availability (client with data to write must be able to hand over the data without worrying about retries & db should accept the data and handle reliably writing it) & Partition tolerance (tolerate nodes or partial internal network errors), it is possible to meet only two requirements in full with compromises required on the third.

3Vs

A concept in Big Data. What is the volume, variety & velocity of the incoming data?

Choosing Relational

When the data is structured well (lacks variety) & within reasonable limits of volume & velocity, a relational database is usually a good choice. Relational databases have a mature plethora of features & supporting products that NoSQL databases will struggle to match.

In a SQL database, reads (with the right indexes) are fast while writes are slow since they have to go to one location. A SQL database is designed to follow CA (Consistency {ACID-Atomicity, Consistency, Isolation, Durability} & Availability {with multiple servers}) at the cost of partition tolerance. Complex solutions can reduce the partition tolerance.

For a social network requirement, a Graph database is a good choice, which also follows CA (Consistency/Availability) at the cost of P (Partition Tolerance).

Choosing NoSQL

If even one of the 3Vs (volume, variety, velocity) is high, it is better to go with a NoSQL database.

Availability & Partition Tolerance

Such databases have 'eventual consistency' that is quick but not guaranteed.

Cassandra is a NoSQL database server that can have multiple nodes. It will eventually need a DevOps team to manage complexity. Consistency is quick (but not immediate & guaranteed), dirty reads are a possibility & occasional bugs were reported in the past when a node goes down with dirty data & is brought back up.

Amazon Dynamo: A commercial Storage as a service solution with DevOps managed by the provider.

CouchDB: JSON Document database.

Both Cassandra & Dynamo are column based key-value databases.

Consistency & Partition Tolerance

If consistency & partition tolerance are more important than availability, MongoDb & HBase are options.

MongoDb is a document based database with an id as a key & indexed documents. By default, it will keep 3 copies in sync & will not succeed a write unless all copies are written successfully. It will ensure the reads don't read dirty data.

HBase is a columnar database (similar to Cassandra & Dynamo) and ensures 3 copies by default (similar to MongoDb). In a columnar database, data storage is column based (rather than row based) which is useful for data-warehousing that works with columns.

Availability can be improved with complex solutions on top.

NoSql database Categories

Key-Value

Memcached, Redis, Riak

Column-oriented

Cassandra

Document oriented (JSON)

MongoDb, CouchDb

Tabular

Big Table, HBase

Graph

Neo4J, ArangoDb (supports sharding with auto rebalancing), MarkLogic

XML, Object based

eXist, BaseX, MarcLogic, ZODB

Conclusion

Depending on the requirements, there are multiple options available for someone wishing to build a software solution. The right tools can be chosen for the right job.

Co-authored with Srinivasan Varadharajan.

Friday, October 23, 2020

Scaling applications-an overview

As a business grows, its software applications will need to scale.

Computing bottleneck

Run multiple identical instances of the application on multiple servers behind a load balancer. Eg: Web farms.

Storage bottleneck

Route requests based off attributes to different servers. Eg: Data for names starting from A-D will be stored in a specific storage system. In databases, this system is called sharding.

Application complexity bottleneck

As an application increases in complexity slowing down development & maintainability, increased modularization & isolation will help. Microservices is a popular architecture for solving this problem.

Why is Go fast?

Why is Go fast? Go has become popular for microprocesses & for scaling. What are the design decisions that make Go fast? Summary: 1. Cle...