My Cloud Data Lake (2): ClickHouse

Wubaoqi
4 min read · Apr 29, 2022
Photo by Kenny Eliason on Unsplash

In part 1, “My Cloud Data Lake (1): dbt + dremio”, I mainly introduced using dbt + Dremio to import and process data from PostgreSQL, and finally save the results as Parquet files.

After that, I can shut down my cloud VM to save cost. Since the resulting Parquet files are saved in JuiceFS (which works like a virtual POSIX file system over object storage), the storage cost is trivial. But this is only the data collection part; data only becomes valuable when you digest it. So, this part introduces how I analyze my data locally using ClickHouse.

Why I use ClickHouse on my local machine

Installing ClickHouse is very easy. Since I use a MacBook Pro (Apple Silicon), I just run the following command:

curl -O 'https://builds.clickhouse.com/master/macos-aarch64/clickhouse' && chmod a+x ./clickhouse

Details can be found at: https://clickhouse.com/docs/en/getting-started/install/

Then, I just use the following command under the same folder (all ClickHouse data will also be saved into that local folder):

./clickhouse server

That’s it. I don’t need root privileges, and I don’t need to run Docker. When I’m not querying it, ClickHouse just stands idle and uses less than 200 MB of memory.

Then, the biggest advantage of ClickHouse is: FAST, FAST, FAST. In my limited experience, ClickHouse’s query performance is similar to Dremio’s (if not a little faster), but please don’t base any decision on my limited experience; do your own performance tests on your own machine if you need accurate benchmark results. The main reason I like to use ClickHouse on my local machine, though, is that I need to query data in my local time zone.

Currently, Dremio mainly supports the UTC time zone. If I need to group by local date, I have to use Dremio’s “convert_timezone” function:

select DATE_TRUNC('DAY', convert_timezone('UTC', 'Asia/Shanghai', submit_time))
, count(1)
from combined_task_status
group by 1
order by 1 desc

But once I add the convert_timezone function, Dremio’s query time goes from 0.5 seconds to 1.6 seconds (querying a table of 13 million rows). With ClickHouse, on the other hand, the results default to my local time zone.
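
For comparison, here is a sketch of the equivalent query in ClickHouse (assuming the submit_time column is read without an explicit time zone, so ClickHouse interprets it in the server’s, i.e. my local, time zone; registering the table is described in the next section):

-- toDate() truncates the timestamp to a date in the column's / server's time zone
SELECT toDate(submit_time) AS day, count(1)
FROM combined_task_status
GROUP BY day
ORDER BY day DESC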

Use ClickHouse to query Parquet files directly

First, the JuiceFS folder is mounted on my local machine (like an SMB shared folder), and I use “sync” to copy the data folders into ClickHouse’s “user_files” folder.

Then, I can register my datasets if they haven’t been registered before. For example, the previous “combined_task_status” table is just a folder with 30 Parquet files under it. I can use the following SQL to register it once:

CREATE TABLE combined_task_status
ENGINE=File(Parquet, 'combined_task_status/*.parquet')

Then, I can just run queries against it as if the table had been created in ClickHouse natively.
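
As a side note, ClickHouse can also query Parquet files without creating a table at all, through the file() table function (a sketch; the path is relative to the user_files folder, and the column list is just the illustrative three-column schema from the Nullable section below):

SELECT count(1)
FROM file('combined_task_status/*.parquet', Parquet,
          'task_id String, submit_time DateTime64, running_time Nullable(DateTime64)')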

Tip: if the table is large (over 10 million rows), sorting it by a datetime column before generating the Parquet files will speed up queries over a limited time range.

If querying the Parquet files is still slow (which rarely happens), I will consider rewriting the table into a ClickHouse native MergeTree table:

CREATE TABLE combined_task_status_speedup
ENGINE = MergeTree
PARTITION BY toYYYYMM(submit_time)
ORDER BY submit_time
AS SELECT * FROM combined_task_status
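
After that, time-range queries like the one below benefit from the partitioning and ordering (the date range is just an example):

SELECT count(1)
FROM combined_task_status_speedup
WHERE submit_time >= '2022-04-01' AND submit_time < '2022-05-01'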

To Nullable or not

Soon, I found that if a column contains NULL values and a query references that column, ClickHouse returns an error:

Code: 349. DB::Exception: Cannot convert NULL value to non-Nullable type: while converting column `running_time` from type Nullable(DateTime64(3, 'UTC')) to type DateTime64(3, 'UTC')

This is mainly because, to make ClickHouse VERY FAST, it performs many optimizations, and whether a column can contain NULL affects performance. So in ClickHouse, we need to specify Nullable explicitly.

To make things easier, I wrote a simple Python script that reads one of the Parquet files and outputs the schema in ClickHouse syntax:

import pandas as pd

def translate_dtype(t):
    # Map a pandas dtype to the corresponding ClickHouse type name.
    if t.name == 'object':
        return 'String'
    elif t.name == 'int32':
        return 'Int32'
    elif t.name == 'int64':
        return 'Int64'
    elif t.name == 'float64':
        return 'Float64'
    elif t.name == 'bool':
        return 'Boolean'
    elif t.name == 'datetime64':
        return 'Datetime64'
    elif t.name == 'datetime64[ns]':
        return 'Datetime64'
    elif t.name == 'datetime64[ns, UTC]':
        return 'Datetime64'
    else:
        raise RuntimeError('unknown field type: ' + t.name)

def parquet_to_clickhouse_schema(table_name, full_path):
    df = pd.read_parquet(full_path)
    columns = []
    # Only wrap a column in Nullable(...) if it actually contains NULLs.
    contains_null = df.isnull().any()
    for i in range(len(df.columns)):
        fd_type = translate_dtype(df.dtypes.iloc[i])
        if contains_null.iloc[i]:
            columns.append(f'`{df.columns[i]}` Nullable({fd_type})')
        else:
            columns.append(f'`{df.columns[i]}` {fd_type}')
    columns_result = ', '.join(columns)
    final_result = (f"CREATE TABLE {table_name} ({columns_result}) "
                    f"ENGINE=File(Parquet, '{table_name}/*.parquet')")
    print(final_result)

parquet_to_clickhouse_schema(
    'combined_task_status',
    "/Users/wubaoqi/data/jfs-saas/cloudlake/combined_task_status/1_1_1.parquet")

Then, it outputs the correct CREATE TABLE SQL, like:

CREATE TABLE combined_task_status (
`task_id` String,
`submit_time` Datetime64,
`running_time` Nullable(Datetime64)
)
ENGINE=File(Parquet, 'combined_task_status/*.parquet')
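
With running_time declared as Nullable, queries that touch that column no longer fail with the Code: 349 error above; for example (illustrative):

SELECT count(1)
FROM combined_task_status
WHERE running_time IS NOT NULL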

Yes, this may make ClickHouse a little harder to use than other OLAP databases. But it makes ClickHouse faster and gives the user some choices. Anyway, we can always build a higher-level wrapper if we want an automated solution.

Additional benefit: I can easily export my data in other formats

Sometimes, I want to share part of my data with other people. Then, ClickHouse’s rich file format support comes in handy.

According to https://clickhouse.com/docs/en/interfaces/formats/, ClickHouse supports the following file formats (only part of them listed here):

  • TabSeparatedRawWithNames
  • CSVWithNamesAndTypes
  • CustomSeparatedWithNames
  • JSON
  • Protobuf
  • Avro
  • Parquet
  • Arrow
  • ArrowStream
  • ORC
  • XML
  • MsgPack

So, exporting to CSV (with column names) is very handy. Also, in my daily work, I frequently need to read Avro/Arrow formats, and ClickHouse can help with that too:

CREATE TABLE export_csv_table1 ENGINE=File(CSVWithNames)
AS
SELECT *
FROM combined_task_status
WHERE submit_time >= '2022-04-29' AND submit_time < '2022-04-30'

Then, I can find the generated “data.CSVWithNames” file under the “clickhouse/data/default/export_csv_table1” folder.
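
Alternatively, when running queries through clickhouse-client, the result can be written straight to a file with INTO OUTFILE, without creating an intermediate table (a sketch; the output file name is hypothetical):

SELECT *
FROM combined_task_status
WHERE submit_time >= '2022-04-29' AND submit_time < '2022-04-30'
INTO OUTFILE 'export_2022-04-29.csv'
FORMAT CSVWithNames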

To Be Continued

ClickHouse is very interesting, and I may write more about it later, for example: using MetaBase to connect to ClickHouse, or building a Spark SQL wrapper around ClickHouse so that I can query it with Spark SQL syntax, etc.
