
ClickHouse "No active replica has part" issue — steps to debug

Apr 5, 2024
<Error> void DB::printExceptionWithRespectToAbort(Poco::Logger *): 
Code: 234. DB::Exception: No active replica has part all_1117_1123_1
or covering part (cannot execute queue-0000001206:
MERGE_PARTS with virtual parts [all_1117_1123_1]).
(NO_REPLICA_HAS_PART), Stack trace
(when copying this message, always include the lines below):

Problem

When this problem happens, ClickHouse stops responding to requests and becomes unusable. During the startup phase it tries to repair the replica and keeps retrying forever in a loop. If there are 3 replicas, all 3 become unusable because of this.

Solution Approach

  • Narrow down which table has the replica issue.
  • The ClickHouse error message doesn’t provide this information, so we have to investigate to find the problematic table (see the query sketch right after this list).
  • Once the table is identified, take a meaningful action based on the context.
  • One possible reason for this state: such a replica issue can occur when a big insert into the table fails abruptly.
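
A quick way to narrow it down is `system.replicas`, which flags unhealthy replicated tables on the current node. This is a minimal sketch, not from the original post; the column names are standard `system.replicas` columns, and the filter thresholds are arbitrary:

-- List replicated tables that look unhealthy on this node
SELECT
    database,
    table,
    is_readonly,      -- 1 means the replica is in read-only mode
    queue_size,       -- pending entries in the replication queue
    parts_to_check,   -- parts scheduled for consistency checks
    active_replicas,
    total_replicas
FROM system.replicas
WHERE is_readonly = 1 OR queue_size > 0
ORDER BY queue_size DESC;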

Where to get more debug information

The `system.replication_queue` table records which replication task failed, the reason for the failure, how many times it has been retried, and so on.
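
The structure shown below is simply what a DESCRIBE prints; the one-liner is not part of the original post, but it is an easy way to check which columns are available on your ClickHouse version:

-- Show the column layout of the replication queue system table
DESCRIBE TABLE system.replication_queue;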

# system.replication_queue table structure
┌─name───────────────────┬─type──────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ database               │ String        │              │                    │         │                  │                │
│ table                  │ String        │              │                    │         │                  │                │
│ replica_name           │ String        │              │                    │         │                  │                │
│ position               │ UInt32        │              │                    │         │                  │                │
│ node_name              │ String        │              │                    │         │                  │                │
│ type                   │ String        │              │                    │         │                  │                │
│ create_time            │ DateTime      │              │                    │         │                  │                │
│ required_quorum        │ UInt32        │              │                    │         │                  │                │
│ source_replica         │ String        │              │                    │         │                  │                │
│ new_part_name          │ String        │              │                    │         │                  │                │
│ parts_to_merge         │ Array(String) │              │                    │         │                  │                │
│ is_detach              │ UInt8         │              │                    │         │                  │                │
│ is_currently_executing │ UInt8         │              │                    │         │                  │                │
│ num_tries              │ UInt32        │              │                    │         │                  │                │
│ last_exception         │ String        │              │                    │         │                  │                │
│ last_exception_time    │ DateTime      │              │                    │         │                  │                │
│ last_attempt_time      │ DateTime      │              │                    │         │                  │                │
│ num_postponed          │ UInt32        │              │                    │         │                  │                │
│ postpone_reason        │ String        │              │                    │         │                  │                │
│ last_postpone_time     │ DateTime      │              │                    │         │                  │                │
│ merge_type             │ String        │              │                    │         │                  │                │
└────────────────────────┴───────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Useful columns for this problem:

  • database
  • table
  • num_tries
  • postpone_reason

Query to find the problematic table

SELECT
    database,
    table,
    postpone_reason
FROM system.replication_queue
WHERE num_tries >= 8
  • The default ClickHouse configuration retries a replication task a maximum of 8 times, so the query above returns the replications that are continuously failing.
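
To see why those entries keep failing, the same query can be extended with the retry and exception columns listed earlier; a sketch along those lines:

-- Same filter as above, but also show how often each entry failed and why
SELECT
    database,
    table,
    type,             -- e.g. MERGE_PARTS, GET_PART
    new_part_name,
    num_tries,
    last_exception,   -- error raised on the last attempt
    postpone_reason
FROM system.replication_queue
WHERE num_tries >= 8
ORDER BY num_tries DESC;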

Drop problematic table

If the problematic table can be recreated from scratch, this is currently the best way to bring ClickHouse back to a stable state.

-- distributed table name
SET param_table_name = 'problem_table';
-- replicated (local) table name
SET param_table_name_local = 'problem_table_local';

DROP TABLE IF EXISTS {table_name_local:Identifier} ON CLUSTER 'cluster_name' SYNC;
DROP TABLE IF EXISTS {table_name:Identifier} ON CLUSTER 'cluster_name' SYNC;
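
After the drop completes, a quick check against `system.replication_queue` confirms that the stuck entries are gone; this sketch simply reuses the table name from the snippet above:

-- Should return no rows once the drop has propagated to every replica
SELECT database, table, num_tries, last_exception
FROM system.replication_queue
WHERE table = 'problem_table_local';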

Once the problematic table is dropped, the ClickHouse start sequence should continue smoothly and all replicas should get back to a stable state.
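
If the original DDL is not at hand, recreating the pair looks roughly like the sketch below. The schema (id, created_at), the ORDER BY key, the default database and the ZooKeeper path are placeholders for illustration; only the table and cluster names match the earlier snippets:

-- Recreate the replicated table and the distributed table on top of it
-- (adjust columns, ORDER BY and the ZooKeeper path to your real schema)
CREATE TABLE default.problem_table_local ON CLUSTER 'cluster_name'
(
    id         UInt64,
    created_at DateTime
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/problem_table_local', '{replica}')
ORDER BY id;

CREATE TABLE default.problem_table ON CLUSTER 'cluster_name'
AS default.problem_table_local
ENGINE = Distributed('cluster_name', 'default', 'problem_table_local', rand());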
