Identifying and Resolving Hidden ClickHouse Bottlenecks: A Step-by-Step Guide
Introduction
Even when all the usual suspects look clean — I/O, memory, rows scanned, parts read — a ClickHouse query can still crawl. At Cloudflare, our billing pipeline, which processes hundreds of millions of dollars in usage revenue, suddenly slowed after a routine migration. The culprit turned out to be a hidden bottleneck buried deep inside ClickHouse internals. This guide walks you through the same diagnostic and resolution process we used, so you can detect and fix similar issues before they affect your critical pipelines.

Note: This guide assumes intermediate knowledge of ClickHouse. For absolute beginners, review the official documentation first.
What You Need
- Access to the ClickHouse system tables (especially system.query_log, system.parts, system.events, and system.metrics).
- Monitoring tools (e.g., Grafana, Prometheus) configured to track query latency, throughput, and resource usage.
- Understanding of your table schema, primary key, partitioning, and retention policies.
- Permission to run EXPLAIN statements and adjust ClickHouse settings (at least on a test instance); a quick sanity check follows this list.
- ClickHouse version information (the patches we wrote apply to versions 23.8+; adapt for earlier versions).
- A staging environment mirroring production to safely test patches.
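Before diving in, it's worth confirming that access with a quick sanity check. A minimal sketch, assuming a hypothetical table billing.usage_events (substitute one of your own):

```sql
-- Confirm read access to the system tables.
SELECT count() FROM system.query_log WHERE event_date = today();

-- Confirm you can run EXPLAIN; 'billing.usage_events' is a placeholder table.
EXPLAIN indexes = 1
SELECT count()
FROM billing.usage_events
WHERE event_date = today();
```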
Step-by-Step Process
Step 1: Recognize the Symptoms
Your pipeline has suddenly become slow, and the problem appears after a migration or configuration change. Typical signs:
- Daily aggregation jobs (e.g., billing, reporting) take much longer than usual.
- Queries that used to finish in seconds now run for minutes or hours.
- Overall system throughput drops, causing backlogs in downstream processes like invoice generation or fraud detection.
In our case, the billing pipeline timing became erratic, and invoices became increasingly difficult to reconcile.
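To quantify the slowdown rather than eyeball it, compare recent execution times against a pre-migration window in system.query_log. A sketch, with illustrative dates bracketing the migration:

```sql
-- Find query shapes that got at least 2x slower after the migration.
-- The 2024-01-15 cutoff is illustrative; use your own migration date.
SELECT
    normalized_query_hash,
    avgIf(query_duration_ms, event_date <  '2024-01-15') AS avg_ms_before,
    avgIf(query_duration_ms, event_date >= '2024-01-15') AS avg_ms_after
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_date >= '2024-01-01'
GROUP BY normalized_query_hash
HAVING avg_ms_after > 2 * avg_ms_before
ORDER BY avg_ms_after DESC
LIMIT 20;
```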
Step 2: Check the Usual Suspects
Start with the metrics that normally pinpoint a slowdown:
- I/O: Check disk read/write latency and queue depth. If these are high, you may have a storage bottleneck.
- Memory: Verify available memory and swap usage. Memory pressure can cause queries to spill to disk.
- Rows scanned: Compare the number of rows read before and after the slowdown. A sudden increase often indicates a missing index or poor partition pruning.
- Parts read: ClickHouse merges parts; if too many small parts are being read, that can degrade performance.
In our scenario, all these metrics appeared normal. I/O was low, memory was fine, rows scanned hadn't increased, and parts read were stable. This told us the bottleneck was internal — something deeper in the query execution engine.
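All of these headline metrics are recorded per query in system.query_log, so you can rule out the usual suspects without extra tooling. A sketch for a single slow query (the query_id is a placeholder):

```sql
-- Headline resource metrics for one slow query.
-- Replace '<slow-query-id>' with a query_id from your own logs.
SELECT
    query_duration_ms,
    read_rows,
    read_bytes,
    memory_usage,
    ProfileEvents['SelectedParts']  AS parts_read,
    ProfileEvents['SelectedRanges'] AS ranges_read
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_id = '<slow-query-id>';
```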
Step 3: Dig Deeper – Profile Internal Events
When normal checks fail, turn to the system.events table. Look for unusual values in low-level counters, especially those related to:
- MergeTreeReadTaskNewBytes – bytes read per task
- MergeTreeReadTaskNewMicroseconds – time spent reading each task
- ReadBufferFromFileDescriptorRead – number of reads from file descriptors
- ReadBufferFromFileDescriptorReadBytes – total bytes read from descriptors
We noticed a spike in the number of small read operations. Despite reading the same total bytes, ClickHouse was performing many more individual system calls. This pointed to a contention issue inside the ReadBuffer layer — the code responsible for reading data from disk.
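The same counters are recorded per query in system.query_log's ProfileEvents column, which makes the fragmentation easy to see: if bytes stay flat while the call count jumps, each read has gotten smaller. A sketch tracking average read size over time:

```sql
-- Daily average read size: bytes read from file descriptors divided by
-- the number of read calls. A shrinking ratio means more, smaller reads
-- for the same data volume.
SELECT
    event_date,
    sum(ProfileEvents['ReadBufferFromFileDescriptorRead'])      AS read_calls,
    sum(ProfileEvents['ReadBufferFromFileDescriptorReadBytes']) AS read_bytes,
    round(read_bytes / read_calls)                              AS avg_read_size
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_date >= today() - 14
GROUP BY event_date
ORDER BY event_date;
```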
Step 4: Identify the Root Cause
Compare the system events between fast and slow runs (or before and after a migration). Look for events where the count increased dramatically while the total bytes remained constant. In our case, we found that a change in the way ClickHouse prefetches data had introduced a global mutex lock inside the ReadBufferFromFileDescriptor class. Normally, each thread has its own read buffer; after the migration, multiple threads were contending for a single buffer, causing severe serialization.

Check your ClickHouse version’s changelog for any changes to read prefetch logic. If you suspect a similar mutex issue, you can confirm by running perf top or strace to see whether pthread_mutex_lock appears prominently during query execution.
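If you still have a query_id from a fast run (say, pre-migration) and one from a slow run, you can also diff their ProfileEvents directly and surface counters that exploded while bytes stayed constant. A sketch with placeholder query IDs:

```sql
-- Diff ProfileEvents between a known-fast and a known-slow run.
-- '<fast-query-id>' and '<slow-query-id>' are placeholders.
SELECT
    event,
    sumIf(value, query_id = '<fast-query-id>') AS fast_count,
    sumIf(value, query_id = '<slow-query-id>') AS slow_count
FROM system.query_log
ARRAY JOIN
    mapKeys(ProfileEvents)   AS event,
    mapValues(ProfileEvents) AS value
WHERE type = 'QueryFinish'
  AND query_id IN ('<fast-query-id>', '<slow-query-id>')
GROUP BY event
HAVING slow_count > 2 * fast_count
ORDER BY slow_count - fast_count DESC;
```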
Step 5: Implement the Fixes
We wrote three patches to resolve the bottleneck. Depending on your exact issue, you may need to adapt them:
- Remove the global mutex in ReadBuffer. Replace it with a per-thread buffer allocation so that prefetch threads don’t compete for the same resource.
- Adjust prefetch size. The default prefetch amount was too small, causing many tiny reads. Increase it using the max_read_buffer_size setting (e.g., to 2 MB; see the sketch below).
- Optimize async reads. Improve coordination between the main read thread and prefetch threads to reduce context switching.
Always test these changes in a staging environment first. Our patches increased query throughput by over 400% for the affected workloads.
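The prefetch-size change needs no patch at all: max_read_buffer_size is an ordinary setting you can raise per session or per query while you measure. A sketch using the 2 MB value mentioned above (the right number depends on your storage; billing.usage_events is a placeholder table):

```sql
-- Raise the read buffer to 2 MiB for the current session, then re-test.
SET max_read_buffer_size = 2097152;

-- Or scope the change to a single query while comparing timings.
SELECT count()
FROM billing.usage_events  -- placeholder table
WHERE event_date = today()
SETTINGS max_read_buffer_size = 2097152;
```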
Step 6: Validate and Monitor
After applying the changes:
- Re-run the slow queries and compare their execution time against baseline.
- Check system.events again — the number of small read operations should drop, and the average read size should increase (see the query below).
- Monitor system resources over the next few days to ensure no regressions.
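The counters from Step 3 double as the validation signal. A sketch comparing the day before and after the deployment (dates are placeholders):

```sql
-- Average read size the day before vs. the day after the fix.
-- Dates are placeholders; bracket your own deployment time.
SELECT
    if(event_date < '2024-02-01', 'before', 'after') AS period,
    sum(ProfileEvents['ReadBufferFromFileDescriptorRead'])      AS read_calls,
    round(sum(ProfileEvents['ReadBufferFromFileDescriptorReadBytes'])
          / read_calls)                                         AS avg_read_size
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_date IN ('2024-01-31', '2024-02-01')
GROUP BY period;
```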
We saw immediate improvement: daily aggregation jobs returned to normal, and the billing pipeline cleared its backlog within 24 hours.
Tips and Best Practices
- Build a performance regression test suite. Automate queries that represent your most critical workloads and compare runtimes against a rolling baseline.
- Keep monitoring system.events. Low-level metrics often catch issues before they impact user-facing performance.
- Maintain a staging environment that mirrors production. Test all patches, migrations, and configuration changes there first.
- Engage with the ClickHouse community. Our patches were eventually upstreamed; sharing findings helps everyone.
- Consider per-node prefetch settings. If your cluster has heterogeneous hardware, tune max_read_buffer_size per node (a cluster-wide check follows this list).
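To see what each node is currently running with, you can query system.settings across the cluster. A sketch, assuming a cluster named my_cluster (a placeholder):

```sql
-- Effective read buffer size on every node; 'my_cluster' is a placeholder.
SELECT
    hostName() AS node,
    value      AS max_read_buffer_size
FROM clusterAllReplicas('my_cluster', system.settings)
WHERE name = 'max_read_buffer_size';
```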
Hidden bottlenecks are rare but devastating. By systematically profiling internal ClickHouse events, you can uncover issues that traditional monitoring misses and keep your pipelines fast and reliable.