ClickHouse — sharding and replication

Sairam Krish
Sep 9, 2023

ClickHouse is a powerful OLAP database. Coming from the world of distributed architectures, we instinctively think of sharding and replication when evaluating tools and applications. With ClickHouse, we need to be extra careful when designing both.

Prefer scale-up over scale-out for ClickHouse

In general, horizontal scaling is the preferred way to architect large-scale systems.

ClickHouse's architecture, however, naturally prefers scale up (vertical scaling) over scale out (horizontal scaling). This is a big difference between ClickHouse and other big-data distributed databases. ClickHouse is built on a shared-nothing architecture, which influences the scale-up preference.

This bias towards scale up becomes evident when we try to horizontally scale ClickHouse. We need to configure many things correctly and fine-tune them to perfection; without this, horizontal scaling may not deliver the desired outcome, and even once we get it functionally right, we may still face performance issues. A good set of settings and configuration depends on multiple factors (the data stored in ClickHouse, how data is ingested, how data is read, etc.), and no such set is readily available in the public domain.

Self-hosted — community user — pain points

At the time of writing, there is no good set of server settings and configuration options that a community user can adopt to kick-start their journey. Community users need to figure out everything themselves, going through each and every server configuration option to find what suits their needs. This is a big pain point in achieving horizontal scaling in a self-hosted ClickHouse installation.

Two companies provide commercial support — the official ClickHouse company and Altinity. They have different pricing and deployment options (hosted, self-hosted, etc.).

In contrast, the scale-up approach feels straightforward and easy to install and maintain.

Replication vs sharding

Often confused with each other, they are independent concepts. Let's dive deep and observe their ClickHouse-specific characteristics below.

Replication

Replication — helps with data integrity and fault tolerance. ClickHouse is replicated by default and has 1 replica: the original data is itself considered a replica, and when we add one more replica, we are adding replica 2. This may sound like common knowledge, but it is fundamentally different from many databases, which refer to a single instance as a standalone installation.
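As a sketch of what replication looks like at the table level (the cluster name, table name and ZooKeeper path below are illustrative assumptions, not values from a real setup):

```sql
-- Hypothetical example: a replicated table on a cluster named 'shard1-repl2'.
-- The {shard} and {replica} macros must be defined on each server.
CREATE TABLE events ON CLUSTER 'shard1-repl2'
(
    event_time DateTime,
    user_id    UInt64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY (user_id, event_time);
```

Every replica of a shard registers under the same ZooKeeper path and keeps itself in sync with the others.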

Sharding

Sharding — helps with horizontal scaling or scaling out.

The destination server is determined by the sharding key, which is defined when you create the distributed table. The sharding key can be random or the output of a hash function. The deployment examples involving sharding use rand() as the sharding key; later sections provide further information on when and how to choose a different sharding key.

The shard-key concept shown in the picture above is presented at a generic level. In ClickHouse, the shard-key lifecycle has small differences, explained in the following sections.

Different ClickHouse setups

We can use sharding without replication, replication without sharding, or sharding and replication together. All combinations are valid.

Single shard — 3 replicas → a simple, straightforward ClickHouse setup.

3 shards — 3 replicas → a complex scenario that needs ClickHouse configuration expertise to get things right.

Altinity ClickHouse operator

The example scripts use the Altinity ClickHouse operator to set up ClickHouse on Kubernetes. However, at a high level, the sharding and replication concepts discussed here are generic. The Altinity Kubernetes Operator for ClickHouse creates, configures and manages ClickHouse clusters running on Kubernetes.

apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "example-1"
spec:
  configuration:
    clusters:
      - name: "shard1-repl2"
        layout:
          shardsCount: 1
          replicasCount: 2
  • This ClickHouse installation would have a single shard and 2 replicas
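By the same pattern, the more complex 3-shard, 3-replica scenario could be sketched as follows (the installation and cluster names are assumptions):

```yaml
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "example-2"
spec:
  configuration:
    clusters:
      - name: "shard3-repl3"
        layout:
          shardsCount: 3
          replicasCount: 3
```

This declares 9 ClickHouse pods in total — 3 replicas for each of the 3 shards.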

Factors that influence sharding & replication

Declaring the number of shards and the number of replicas alone won't have the desired effect. The following factors influence the actual behaviour:

  • The table engine that is used
  • The cluster coordination system — like ZooKeeper or ClickHouse Keeper

DistributedTableEngine

Sharding data across multiple servers can be used to divide the load if you exceed the capacity of a single server. Sharding depends on the Distributed table engine or distributed DDL. Without using a Distributed table, having shards at the cluster level would not have any sharding effect on the data.
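A minimal sketch of the usual two-table pattern — all names here (the cluster 'shard3-repl1', tables events_local and events) are hypothetical:

```sql
-- Local table that actually stores data; exists on every node.
CREATE TABLE events_local ON CLUSTER 'shard3-repl1'
(
    event_time DateTime,
    user_id    UInt64
)
ENGINE = MergeTree
ORDER BY (user_id, event_time);

-- Distributed table over the local tables; it stores no data itself,
-- it only routes reads and writes. rand() spreads inserts across shards.
CREATE TABLE events ON CLUSTER 'shard3-repl1' AS events_local
ENGINE = Distributed('shard3-repl1', default, events_local, rand());
```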

Replicated*MergeTree table engine

Table engines in the Replicated*MergeTree family support replicating the data based on the cluster's replica settings. However, a Replicated*MergeTree engine doesn't influence the sharding of the data; sharding is taken care of by the Distributed table engine.

If data is inserted directly into a Replicated*MergeTree table, it will not be sharded. If the insert happens on a Distributed table, the data is sharded and stored accordingly.

So it is not only the table engines, but also the operations performed and the objects they are performed on, that influence the sharding and replication of the data.
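A short illustration of that difference (the table names are assumptions):

```sql
-- Sharded: the Distributed engine picks a shard for each row
-- based on the sharding key.
INSERT INTO events_distributed VALUES (now(), 42);

-- Not sharded: the row lands only on this node's local table
-- (and its replicas, if the table uses Replicated*MergeTree).
INSERT INTO events_local VALUES (now(), 42);
```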

Replication needs a cluster coordination system

Replication needs a cluster coordination system like ZooKeeper or ClickHouse Keeper. If we don't want replication, we don't need one. If we want sharding without replication, we still don't need a coordination system.

ZooKeeper or ClickHouse Keeper provides a consensus system that guarantees all replicas are in sync with each other and that all operations are performed in the same order. This coordination system stores only metadata, not the actual ClickHouse database data.
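For reference, pointing ClickHouse at a coordination ensemble is a small server-config fragment. The hostnames below are assumptions; 9181 is the usual ClickHouse Keeper port (ZooKeeper typically uses 2181). Note that ClickHouse uses the <zookeeper> section even when the backend is ClickHouse Keeper:

```xml
<zookeeper>
    <node>
        <host>keeper-1.example.internal</host>
        <port>9181</port>
    </node>
    <node>
        <host>keeper-2.example.internal</host>
        <port>9181</port>
    </node>
    <node>
        <host>keeper-3.example.internal</host>
        <port>9181</port>
    </node>
</zookeeper>
```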

Weightage of shards

Each shard can have a <weight> defined in the config file. By default, the weight is 1. Data is distributed across shards in the amount proportional to the shard weight. All shard weights are summed up, then each shard's weight is divided by the total to determine each shard's proportion. For example, if there are two shards and the first has a weight of 1 while the second has a weight of 2, the first will be sent one third (1 / 3) of inserted rows and the second will be sent two thirds (2 / 3).

<remote_servers>
    ...
    <shard>
        <!-- Optional. Shard weight when writing data. Default: 1. -->
        <weight>1</weight>
        ...
    </shard>
    <shard>
        <weight>2</weight>
        ...
    </shard>
    ...
</remote_servers>

Should the Distributed table manage replication

This is a decision we need to make. In a default ClickHouse configuration, when an insert happens on a Distributed table, it shards the data and stores it in the appropriate shard. If the underlying table is a Replicated*MergeTree, the Distributed table would also take care of inserting into the different replicas. We can turn this off, so that we achieve better insertion performance and also avoid inconsistent-data issues.

Each shard can have the internal_replication parameter defined in the config file. If this parameter is set to true, the write operation selects the first healthy replica and writes data to it. Use this if the tables underlying the Distributed table are replicated tables (e.g. any of the Replicated*MergeTree table engines). One of the table replicas will receive the write and it will be replicated to the other replicas automatically.

If internal_replication is set to false (the default), data is written to all replicas. In this case, the Distributed table replicates data itself. This is worse than using replicated tables because the consistency of replicas is not checked and, over time, they will contain slightly different data.

<remote_servers>
    ...
    <shard>
        <!-- Optional. Whether to write data to just one of the replicas.
             Default: false (write data to all replicas). -->
        <internal_replication>true</internal_replication>
        ...
    </shard>
    <shard>
        <internal_replication>true</internal_replication>
        ...
    </shard>
    ...
</remote_servers>

Hashing functions for sharding key

The Distributed table engine takes a sharding key, which influences which shard each row is sent to. In practice, we may have different kinds of data in the underlying table and may want to shard based on an integer field, a string field, etc. But we cannot pass arbitrary data directly as a sharding key — the key must evaluate to an integer — so we may need to generate a hash from our data and pass that as the sharding key.

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
    ...
) ENGINE = Distributed(cluster, database, table[, sharding_key[, policy_name]])
[SETTINGS name=value, ...]

ClickHouse provides built-in hash functions that can help with the sharding key. Here are some frequently used ones:

  • rand() → generates a random number with no relationship to the data. As a result, the data is distributed roughly equally among the shards. For some use cases, this approach is exactly what we need.
  • intHash32(userId) → we can pass an integer field, and a hash is generated based on it.
  • murmurHash2_32(productUUID) → we can pass a string and generate a hash. Handy if we have a UUID or another string column by which we would like to shard the data.
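For example, to keep all rows of a user on the same shard (the cluster and table names here are hypothetical):

```sql
-- The same userId always hashes to the same value, so all of a
-- user's rows land on the same shard.
CREATE TABLE events AS events_local
ENGINE = Distributed('shard3-repl1', default, events_local, intHash32(userId));

-- For a string column such as a UUID, hash it to an integer first.
CREATE TABLE products AS products_local
ENGINE = Distributed('shard3-repl1', default, products_local, murmurHash2_32(productUUID));
```

Keying by a meaningful column keeps related rows co-located, which makes per-key queries shard-local instead of scatter-gather.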
