Identifying and Resolving Hidden ClickHouse Bottlenecks: A Step-by-Step Guide
Introduction
Even when all the usual suspects look clean — I/O, memory, rows scanned, parts read — a ClickHouse query can still crawl. At Cloudflare, our billing pipeline, which processes hundreds of millions of dollars in usage revenue, suddenly slowed after a routine migration. The culprit turned out to be a hidden bottleneck buried deep inside ClickHouse internals. This guide walks you through the same diagnostic and resolution process we used, so you can detect and fix similar issues before they affect your critical pipelines.

Note: This guide assumes intermediate knowledge of ClickHouse. For absolute beginners, review the official documentation first.
What You Need
- Access to the ClickHouse system tables (especially system.query_log, system.parts, system.events, and system.metrics).
- Monitoring tools (e.g., Grafana, Prometheus) configured to track query latency, throughput, and resource usage.
- Understanding of your table schema, primary key, partitioning, and retention policies.
- Permission to run EXPLAIN statements and adjust ClickHouse settings (at least on a test instance); a quick sanity check follows this list.
- ClickHouse version information (the patches we wrote apply to versions 23.8+; adapt for earlier versions).
- A staging environment mirroring production to safely test patches.
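Before diving in, it's worth confirming that access with a quick sanity check. A minimal sketch, assuming a hypothetical table billing.usage_events (substitute one of your own):

```sql
-- Confirm read access to the system tables.
SELECT count() FROM system.query_log WHERE event_date = today();

-- Confirm you can run EXPLAIN; 'billing.usage_events' is a placeholder table.
EXPLAIN indexes = 1
SELECT count()
FROM billing.usage_events
WHERE event_date = today();
```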
Step-by-Step Process
Step 1: Recognize the Symptoms
Your pipeline has suddenly become slow, and the problem appears after a migration or configuration change. Typical signs:
- Daily aggregation jobs (e.g., billing, reporting) take much longer than usual.
- Queries that used to finish in seconds now run for minutes or hours.
- Overall system throughput drops, causing backlogs in downstream processes like invoice generation or fraud detection.
In our case, the billing pipeline timing became erratic, and invoices became increasingly difficult to reconcile.
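To quantify the slowdown rather than eyeball it, compare recent execution times against a pre-migration window in system.query_log. A sketch, with illustrative dates bracketing the migration:

```sql
-- Find query shapes that got at least 2x slower after the migration.
-- The 2024-01-15 cutoff is illustrative; use your own migration date.
SELECT
    normalized_query_hash,
    avgIf(query_duration_ms, event_date <  '2024-01-15') AS avg_ms_before,
    avgIf(query_duration_ms, event_date >= '2024-01-15') AS avg_ms_after
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_date >= '2024-01-01'
GROUP BY normalized_query_hash
HAVING avg_ms_after > 2 * avg_ms_before
ORDER BY avg_ms_after DESC
LIMIT 20;
```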
Step 2: Check the Usual Suspects
Start with the metrics that normally pinpoint a slowdown:
- I/O: Check disk read/write latency and queue depth. If these are high, you may have a storage bottleneck.
- Memory: Verify available memory and swap usage. Memory pressure can cause queries to spill to disk.
- Rows scanned: Compare the number of rows read before and after the slowdown. A sudden increase often indicates a missing index or poor partition pruning.
- Parts read: ClickHouse merges parts; if too many small parts are being read, that can degrade performance.
In our scenario, all these metrics appeared normal. I/O was low, memory was fine, rows scanned hadn't increased, and parts read were stable. This told us the bottleneck was internal — something deeper in the query execution engine.
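All of these headline metrics are recorded per query in system.query_log, so you can rule out the usual suspects without extra tooling. A sketch for a single slow query (the query_id is a placeholder):

```sql
-- Headline resource metrics for one slow query.
-- Replace '<slow-query-id>' with a query_id from your own logs.
SELECT
    query_duration_ms,
    read_rows,
    read_bytes,
    memory_usage,
    ProfileEvents['SelectedParts']  AS parts_read,
    ProfileEvents['SelectedRanges'] AS ranges_read
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_id = '<slow-query-id>';
```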
Step 3: Dig Deeper – Profile Internal Events
When normal checks fail, turn to the system.events table. Look for unusual values in low-level counters, especially those related to:
- MergeTreeReadTaskNewBytes – bytes read per task
- MergeTreeReadTaskNewMicroseconds – time spent reading each task
- ReadBufferFromFileDescriptorRead – number of reads from file descriptors
- ReadBufferFromFileDescriptorReadBytes – total bytes read from descriptors
We noticed a spike in the number of small read operations. Despite reading the same total bytes, ClickHouse was performing many more individual system calls. This pointed to a contention issue inside the ReadBuffer layer — the code responsible for reading data from disk.
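The same counters are recorded per query in system.query_log's ProfileEvents column, which makes the fragmentation easy to see: if bytes stay flat while the call count jumps, each read has gotten smaller. A sketch tracking average read size over time:

```sql
-- Daily average read size: bytes read from file descriptors divided by
-- the number of read calls. A shrinking ratio means more, smaller reads
-- for the same data volume.
SELECT
    event_date,
    sum(ProfileEvents['ReadBufferFromFileDescriptorRead'])      AS read_calls,
    sum(ProfileEvents['ReadBufferFromFileDescriptorReadBytes']) AS read_bytes,
    round(read_bytes / read_calls)                              AS avg_read_size
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_date >= today() - 14
GROUP BY event_date
ORDER BY event_date;
```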
Step 4: Identify the Root Cause
Compare the system events between fast and slow runs (or before and after a migration). Look for events where the count increased dramatically while the total bytes remained constant. In our case, we found that a change in the way ClickHouse prefetches data had introduced a global mutex lock inside the ReadBufferFromFileDescriptor class. Normally, each thread has its own read buffer; after the migration, multiple threads were contending for a single buffer, causing severe serialization.

Check your ClickHouse version’s changelog for any changes to read prefetch logic. If you suspect a similar mutex issue, you can confirm by running perf top or strace to see whether pthread_mutex_lock appears prominently during query execution.
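If you still have a query_id from a fast run (say, pre-migration) and one from a slow run, you can also diff their ProfileEvents directly and surface counters that exploded while bytes stayed constant. A sketch with placeholder query IDs:

```sql
-- Diff ProfileEvents between a known-fast and a known-slow run.
-- '<fast-query-id>' and '<slow-query-id>' are placeholders.
SELECT
    event,
    sumIf(value, query_id = '<fast-query-id>') AS fast_count,
    sumIf(value, query_id = '<slow-query-id>') AS slow_count
FROM system.query_log
ARRAY JOIN
    mapKeys(ProfileEvents)   AS event,
    mapValues(ProfileEvents) AS value
WHERE type = 'QueryFinish'
  AND query_id IN ('<fast-query-id>', '<slow-query-id>')
GROUP BY event
HAVING slow_count > 2 * fast_count
ORDER BY slow_count - fast_count DESC;
```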
Step 5: Implement the Fixes
We wrote three patches to resolve the bottleneck. Depending on your exact issue, you may need to adapt them:
- Remove the global mutex in ReadBuffer. Replace it with a per-thread buffer allocation so that prefetch threads don’t compete for the same resource.
- Adjust prefetch size. The default prefetch amount was too small, causing many tiny reads. Increase it using the max_read_buffer_size setting (e.g., to 2 MB; see the sketch below).
- Optimize async reads. Improve coordination between the main read thread and prefetch threads to reduce context switching.
Always test these changes in a staging environment first. Our patches increased query throughput by over 400% for the affected workloads.
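The prefetch-size change needs no patch at all: max_read_buffer_size is an ordinary setting you can raise per session or per query while you measure. A sketch using the 2 MB value mentioned above (the right number depends on your storage; billing.usage_events is a placeholder table):

```sql
-- Raise the read buffer to 2 MiB for the current session, then re-test.
SET max_read_buffer_size = 2097152;

-- Or scope the change to a single query while comparing timings.
SELECT count()
FROM billing.usage_events  -- placeholder table
WHERE event_date = today()
SETTINGS max_read_buffer_size = 2097152;
```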
Step 6: Validate and Monitor
After applying the changes:
- Re-run the slow queries and compare their execution time against baseline.
- Check system.events again — the number of small read operations should drop, and the average read size should increase (see the query below).
- Monitor system resources over the next few days to ensure no regressions.
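The counters from Step 3 double as the validation signal. A sketch comparing the day before and after the deployment (dates are placeholders):

```sql
-- Average read size the day before vs. the day after the fix.
-- Dates are placeholders; bracket your own deployment time.
SELECT
    if(event_date < '2024-02-01', 'before', 'after') AS period,
    sum(ProfileEvents['ReadBufferFromFileDescriptorRead'])      AS read_calls,
    round(sum(ProfileEvents['ReadBufferFromFileDescriptorReadBytes'])
          / read_calls)                                         AS avg_read_size
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_date IN ('2024-01-31', '2024-02-01')
GROUP BY period;
```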
We saw immediate improvement: daily aggregation jobs returned to normal, and the billing pipeline cleared its backlog within 24 hours.
Tips and Best Practices
- Build a performance regression test suite. Automate queries that represent your most critical workloads and compare runtimes against a rolling baseline.
- Keep monitoring system.events. Low-level metrics often catch issues before they impact user-facing performance.
- Maintain a staging environment that mirrors production. Test all patches, migrations, and configuration changes there first.
- Engage with the ClickHouse community. Our patches were eventually upstreamed; sharing findings helps everyone.
- Consider per-node prefetch settings. If your cluster has heterogeneous hardware, tune max_read_buffer_size per node (a cluster-wide check follows this list).
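To see what each node is currently running with, you can query system.settings across the cluster. A sketch, assuming a cluster named my_cluster (a placeholder):

```sql
-- Effective read buffer size on every node; 'my_cluster' is a placeholder.
SELECT
    hostName() AS node,
    value      AS max_read_buffer_size
FROM clusterAllReplicas('my_cluster', system.settings)
WHERE name = 'max_read_buffer_size';
```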
Hidden bottlenecks are rare but devastating. By systematically profiling internal ClickHouse events, you can uncover issues that traditional monitoring misses and keep your pipelines fast and reliable.