# ClickBench: a Benchmark For Analytical Databases
https://benchmark.clickhouse.com/

Discussion on Hacker News: https://news.ycombinator.com/item?id=32084571

## Overview

This benchmark represents typical workloads in the following areas: clickstream and traffic analysis, web analytics, machine-generated data, structured logs, and events data. It covers the typical queries in ad-hoc analytics and real-time dashboards.

The dataset in this benchmark was obtained from the actual traffic recording of one of the world's largest web analytics platforms. It is anonymized while keeping all the essential distributions of the data. The set of queries was improvised to reflect realistic workloads (the queries are not taken directly from production).

## Goals

The main goals of this benchmark are:

### Reproducibility

The benchmark allows every test result to be reproduced quickly, in as little as 20 minutes (although some systems may take several hours), in a semi-automated way. The test setup is documented and uses inexpensive cloud VMs. The test process is documented in the form of a shell script covering the installation of every system, loading of the data, running the workload, and collecting the result numbers. The dataset is published and made available for download in multiple formats.

### Compatibility

The tables and queries use mostly standard SQL and require minimal or no adaptation for most SQL DBMS. The dataset has been filtered to avoid difficulties with parsing and loading.

### Diversity

The benchmark process is easy enough to cover a wide range of systems, including:

- modern and historical self-managed OLAP DBMS;
- traditional OLTP DBMS (as a comparison baseline);
- managed database-as-a-service offerings;
- serverless cloud-native databases;
- some NoSQL databases;
- document databases;
- and specialized time-series databases, for reference, even if they don't specialize in the ClickBench workload.

### Realism

The dataset is derived from production data. The realistic data distributions make it possible to evaluate compression, indices, codecs, custom data structures, etc., which is not possible with most random dataset generators. The workload consists of 43 queries and tests the efficiency of full scans and filtered scans, as well as index lookups and the main relational operations. It can also test various aspects of hardware: some queries require high storage throughput; some benefit from a large number of CPU cores, and some from single-core speed; some benefit from high main-memory bandwidth.

## Limitations

The limitations of this benchmark are what make it easy to reproduce and to include more systems in the comparison. The benchmark represents only a subset of all possible workloads and scenarios. While it aims to be as fair as possible, focusing on a specific subset of workloads may give an advantage to systems that specialize in those workloads. Note these limitations:

- The dataset is a single flat table. This is different from classical data warehouses, which use a normalized star or snowflake data model; classical data warehouses may therefore be at an unfair disadvantage in ClickBench.
- The table consists of exactly 99'997'497 records. This is rather small by modern standards but allows tests to be completed in a reasonable time.
- While the benchmark allows testing distributed systems, and it includes multi-node and serverless cloud-native setups, most of the results so far have been obtained on single-node setups.
- The benchmark runs its queries one after another; it tests neither workloads with concurrent queries nor system capacity. Every query is run only a few times, which allows for some variability in the results.
- Many setups and systems are different enough to make direct comparison tricky. For example, it is not possible to test the efficiency of storage for in-memory databases, or the data loading time for stateless query engines.
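The semi-automated test process described under "Goals" can be sketched as a driver script. The skeleton below is purely illustrative, not the repository's actual benchmark script: the install and load steps are stubs, and the data directory is a temporary stand-in for a real path such as `/var/lib/<dbms>`.

```shell
#!/bin/bash
# Illustrative skeleton of a per-system benchmark script.
# Steps 1-2 and 4 are stubs; step 3 shows a real size measurement
# against a fake data directory.
set -e

# 1. Install the DBMS non-interactively (so cloud-init can drive it),
#    e.g. "sudo apt-get install -y <dbms>" for a locally installable system.
echo "install (stub)"

# 2. Create the table and bulk-load the dataset,
#    e.g. "<client> < create.sql" followed by an import of the TSV dump.
echo "load (stub)"

# 3. Report the on-disk data size in bytes, indexes and logs included.
DATA_DIR=$(mktemp -d)                  # stands in for /var/lib/<dbms>
head -c 1048576 /dev/zero > "$DATA_DIR/table.bin"
du -bs "$DATA_DIR" | cut -f1           # prints the size as a single number

# 4. Run the queries and print one runtime line per try (see the run loop).
echo "run (stub)"
```

The `du -bs … | cut -f1` idiom reports apparent bytes on GNU coreutils, which matches the Ubuntu VMs the benchmark prescribes.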
The goal of the benchmark is to produce numbers; you need to interpret them on your own. Tl;dr: *All Benchmarks Are ~~Bastards~~ Liars*.

## Rules and Contribution

### How To Add a New Result

To add a new entry, copy-paste one of the existing directories and edit the files accordingly:

- The main benchmark script: runs the benchmark on a fresh VM; Ubuntu 24.04 or newer should be used by default. For databases that can be installed locally, the script should run in a fully automated manner so it can be used in the benchmark automation (cloud-init). It should output the results in the following format:
  - one or more lines with the time in seconds;
  - a line with the data size in bytes; the data size should include indexes and transaction logs if applicable;
  - 43 consecutive lines with the runtimes of every query;
  - the output may include other lines with logs, which are not used for the report.

  For managed databases, if the setup requires clicking in a UI, write a description of the steps instead.
- A notes file: contains comments and observations if needed. For managed databases, it can describe the setup procedure to be used instead of a shell script.
- The table definition: a CREATE TABLE statement. For a NoSQL system, another kind of schema or setup file can be used instead.
- The queries file: contains the 43 ClickBench queries to run.
- The run script: a loop that runs the queries; every query is run three times (see the "Caching" section below for details).
- The results directory: put the .json files with the results for every hardware configuration here. Please double-check that each file is valid JSON (e.g., no comma errors).

To introduce a new result for an existing system on a different hardware configuration, add a new file to the results directory. To introduce a new result for an existing system with a different usage scenario, either copy the whole directory under a different name or add a new file to the existing directory. The published results can be re-generated with a script in the repository; the CI (GitHub Actions) does this automatically, so this step is optional.
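The run loop described above can be sketched as follows. This is a hedged illustration rather than the repository's actual run script: the client invocation is replaced by a no-op (`:`), and the two sample queries stand in for the 43 real ones.

```shell
#!/bin/bash
# Illustrative run loop: every query is executed three times, and the
# wall-clock time of each try is printed as one line in seconds.
TRIES=3

QUERIES=$(mktemp)
cat > "$QUERIES" <<'EOF'
SELECT 1;
SELECT 2;
EOF

while read -r query; do
    for _ in $(seq 1 $TRIES); do
        start=$(date +%s%N)            # nanoseconds (GNU date)
        : # placeholder: run "$query" via the system's client, e.g.
          # "<client> --query \"$query\" > /dev/null"
        end=$(date +%s%N)
        # Print the elapsed time in seconds, one line per try.
        awk -v s="$start" -v e="$end" 'BEGIN { printf "%.3f\n", (e - s) / 1e9 }'
    done
done < "$QUERIES"
```

Running each query several times is what the "Caching" section relies on: the first try measures a cold run, the later tries measure warm runs.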
All tests were originally run on AWS c6a.4xlarge…
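Since each submitted results file must be strictly valid JSON, a quick pre-submission check can be run with Python's strict parser. The directory and file contents below are made up for illustration; real files live in the system's results directory.

```shell
#!/bin/bash
# Validate every .json results file with Python's strict JSON parser.
RESULTS_DIR=$(mktemp -d)               # stands in for a real results dir
printf '{"system": "Example", "result": [[0.1, 0.05, 0.04]]}\n' \
    > "$RESULTS_DIR/demo.json"         # illustrative field names

for f in "$RESULTS_DIR"/*.json; do
    if python3 -m json.tool "$f" > /dev/null 2>&1; then
        echo "ok: $f"
    else
        echo "invalid JSON: $f"        # e.g. a stray trailing comma lands here
    fi
done
```

`python3 -m json.tool` rejects trailing commas and comments, which are the most common reasons a hand-edited results file fails CI.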