How to choose the right database for data driven applications

Sairam Krish
6 min readMar 3, 2021

--

For data driven applications, choosing the right database solves half the problem. By crystallising data persistence layer first, it helps us ask the right question well before and tends to positively influence other layers. It also reduces unnecessary complexity in the implementation.

In this article, though the right choice ‘depends’ on the problem, I have tried to call out my preferred choice in general on specific categories like RDBMS, Key-Value pair etc.,

CAP theorem

As CAP theorem states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:

  • Consistency: Every read receives the most recent write
  • Availability: Every request receives a response, without the guarantee that it contains the most recent write
  • Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped by the network between nodes

How shape of data influences choice of database

  • Data that has highly dense relationships could demand a Graph database
  • Data that has attributes that differ dynamically could demand a document database
  • Data that stay in fixed shape (attributes) with relationships to other data could demand a Relational database
  • Data that stay in fixed shape (attributes) but attributes could grow a lot and presence or absence of many attributes is dynamic in nature could demand a Columnar database

Distributed SQL over traditional RDBMS

Traditional RDBMS are good fit for data which are relational in nature. But trying to gain distributed database advantages at the later stage of development is very hard. If our application is large scale, choosing distributed SQL from the beginning gives the advantage of having relational data along with distributed resilience.

With respect to CAP Theorem, distributed SQL databases are consistent and partition-tolerant. They sacrifice availability. A failure of a primary node can make the database unavailable for writes.

Some distributed SQL database choices that I like :

TiDB

TiDB is an open-source distributed SQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. It is MySQL compatible and features horizontal scalability, strong consistency, and high availability.

Drawback :

YugaByteDB calls out the following drawback in TiDB because of it’s multi-shard implementation.

TiDB uses Google Percolator as the inspiration for its multi-shard transaction design. This choice essentially makes TiDB unfit for deployments with geo-distributed writes since the majority of transactions in a random-access OLTP workload will now experience high WAN latency when acquiring a timestamp from the global timestamp oracle running in a different region. Additionally, TiDB lacks support for critical relational data modeling constructs such as foreign key constraints and Serializable isolation level.

If the application doesn’t involve geo-distributed data storage, then choosing TiDB holds good.

YugabyteDB

YugabyteDB is an open source database for high performance applications that require ACID transactions and planet-scale data distribution. It combines NoSQL and SQL in a single database.

My generally preferred choice in RDBMS — Postgresql

  • Though some benchmarks callout Mysql as more performant than Postgresql, I don’t believe this is true for practical applications. Postgresql is equally performant to Mysql in general. If we use Mysql — InnoDB (transactions, key constraints, etc), then differences are negligible between them.
  • Postgresql adheres to SQL standards
  • Postgresql license is truly open-source
  • PostgreSQL has a rich set of native data types available to users. We can also add new types using the CREATE TYPE command.
  • It has rich data types that are so useful in applications like Arrays, JSON, Composite types, Range types to name few — here is the complete list
  • Authorization with Row-Level Security (RLS, introduced in PG9.5) through Foreign Data Wrappers (FDW)
  • Real time notifications with LISTEN/NOTIFY
  • Distributed SQL db like CockroachDB offers compatibility with Postgresql. In future, if we like to move from Postgresql to purely distributed SQL db, that makes it easier.
  • There are advanced solutions that get’s developed on top of Postgresql because of the extensibility & rich features of postgres. For example:

✓ auto generation of GraphQL API from postgres —PostGraphile

✓ auto generation of REST API from postgres — PostgREST

However, using postgres for large data volumes may not be a good idea. Here is a note from airbyte documentation:

Credits: airbyte documentation

My generally preferred choice in Key-Value store — Redis

For key-value & data structure needs, Redis adds lot of value to the application architecture. It is not just a key value store but a data structure store. Being a data structure store, we can solve key-value needs plus other powerful needs like data with the shape of set or hierarchy etc.,

Wide column stores vs column oriented stores

Wide column stores supports large numbers of dynamic columns. Cassandra is a wide column store. We can imagine wide column stores as 2 dimensional key value stores.

Column oriented store is a storage technique. Traditional RDBMS, stores table data in the form of record by record. Column oriented store tries to improve the storage and retrieval performance by storing table data in the form of column by column. Clickhouse is a column oriented store.

Apache Cassandra Overview

Apache Cassandra is a distributed NoSQL database known for its:

  • Rapid writes
  • Staged event-driven architecture (SEDA)
  • High scalability
  • Decentralized & Distributed nature
  • Fault tolerance
  • Dynamic schema / data model

Cassandra is best for OLTP use cases that are write heavy with small, highly constrained queries. Druid is best for OLAP use cases that are read heavy, and require full analytical query capacity.

One application — Multiple database

There is nothing wrong with one application having multiple databases. Sometimes I get into discussions on applications that has different needs and teams trying to find one database which satisfies all the needs. There is nothing wrong with having a data structure store like redis along side relational database along side a document store like mongoDB. Infact by having the right database based on the shape of the data reduces the implementation complexity. It makes the application more light weight mostly just orchestrating data flow.

Let’s not shy away from changing database choice

After few months/years of development, the part that teams doesn’t want to think is changing the persistence layer. Just the sound of it feels that “Oh no., We are too deep. It’s not a choice”. I have seen many teams who agree that they made a mistake in choosing the database layer. But most of them don’t want to disturb the flow for now.

Best approach to change the choice of database is to introduce new one, incrementally move requirements & relevant data from one to another, keep releasing. There are teams who try to change the database in a separate experimental branch. But it is highly unlikely such pilot projects & experimental branches will make it to production. Let’s just bite the bullet. It will pay huge dividends in long run.

Versional controlled databases

This is paradigm shift in the way we think about databases.

Dolt is a SQL database that you can fork, clone, branch, merge, push and pull just like a Git repository.

Connect to Dolt just like any MySQL database to read or modify schema and data. Version control functionality is exposed in SQL via system tables, functions, and procedures.

Or, use the Git-like command line interface to import CSV files, commit your changes, push them to a remote, or merge your teammate’s changes. All the commands you know for Git work exactly the same for Dolt.

Git versions files. Dolt versions tables. It’s like Git and MySQL had a baby.

Drawbacks of Dolt : It’s not meant for horizontally scallable data. It’s meant for single server data load that may need vertical scaling.

Summary

Best way to architect a system, is by continuous refactoring and not shying away from rewriting the whole flow if needed to do the right thing. By driving the architecture of data driven systems through data, makes the whole flow simple and elegant, easy to add/remove capabilities.

--

--