My Cloud Data Lake (1): dbt + dremio

Wubaoqi
Apr 1, 2022

In my daily work, I have a lot of data that I want to explore, so I need an economical way to store and analyze it.

Storage Format: Apache Parquet

To store the data, I use a standard file format: Apache Parquet. https://github.com/apache/parquet-mr

Using a standard format makes interoperation easier between several popular software tools:

  • Spark
  • Dremio
  • ClickHouse
  • DuckDB

This way, I can use the most suitable tool for each task.
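
For example, once a table has been written out as Parquet, any of these engines can read the same files. A quick sanity check from DuckDB might look like the following (the path is just a placeholder):

-- DuckDB: read Parquet files produced by another engine (placeholder path)
select count(*)
from read_parquet('/jfs/datalake/tasks_clean/*.parquet');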

Storage Location: JuiceFS + Object Storage

I want to keep my data for a long time, and the most economical way is cloud object storage, for example AWS S3 or Azure Data Lake Storage Gen2.

But raw object storage is not easy to use directly. So I use JuiceFS: https://github.com/juicedata/juicefs , with their free hosted service.

This way, I can mount the same folder on both my cloud VM and my local desktop, while the actual data is all stored in cloud object storage.

Data Transform: dbt + dremio

Much of my data is stored in a PostgreSQL database in JSON format. So I need to fetch that data, transform it, and save it in a format better suited for analysis.

For the transform part, the best tool for me seems to be dbt: https://github.com/dbt-labs/dbt-core. With dbt, I can apply version control and software engineering practices to writing SQL.

The actual compute engine I use is Dremio: https://github.com/dremio/dremio-oss. Dremio can combine several data sources and provides a great speed-up for query execution, while the https://github.com/fabrice-etanchaud/dbt-dremio project brings Dremio and dbt together.

So the whole process at this stage is:

  1. Register an external PostgreSQL source in Dremio.
  2. Write transform SQL to do the data processing (see the sketch after this list).
  3. Use the “create table as” function to write the final results (in a shape suitable for analysis) into the JuiceFS folder.
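
As a rough sketch, a dbt model covering steps 2 and 3 could look like the code below. The model, source, and filter here are placeholders, and the exact settings that tell dbt-dremio to write the table into the JuiceFS-backed folder depend on the adapter configuration, so check the adapter documentation:

-- models/tasks_clean.sql (hypothetical model and source names)
-- materialized='table' makes the adapter issue a CREATE TABLE AS,
-- writing the result into the storage location configured in Dremio
{{ config(materialized='table') }}

select
    task_id,
    submit_time
from {{ source('pg', 'pg_tasks') }}
where submit_time is not null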

However, when I began to use this solution, I soon ran into the following Dremio limitation: the maximum size of a Dremio column value is 32K.

I worked around this 32K limitation in two steps.

First, since I cannot query the PostgreSQL table directly in Dremio (some JSON columns contain values longer than 32K), I use Dremio’s external_query() function, like this:

select *
from table(pg.external_query('
    select
        t."task_id",
        t.task_param->>''etlJobsCount'' as "etlJobsCount",
        {{ truncate_string_to_32k_pg(col = "t.task_param->>''description''", alias="description") }}
    from pg_tasks t
    where t.submit_time > ''{{results.v[0]}}''
'))

Second, I wrote a dbt macro to truncate the string content to less than 32K. (Note that in Dremio the 32K limit is on the number of bytes, not the number of characters, since strings are UTF-8 encoded.)

{% macro truncate_string_to_32k_pg(col, alias) %}
{% set truncate_to_size = 31900 %}
{#- keep a safety margin below 32K bytes; the GREATEST(..., /3) floor assumes at most 3 bytes per UTF-8 character -#}
left({{ col }}, GREATEST(length({{ col }}) - (octet_length({{ col }}) - {{ truncate_to_size }}), {{ truncate_to_size }}/3)) as {{ alias }},
octet_length({{ col }}) > {{ truncate_to_size }} as {{ alias }}_is_truncated
{% endmacro %}
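
For reference, with col set to t.task_param->>'description' and alias set to description, the macro compiles to roughly this PostgreSQL snippet, which then runs inside external_query:

left(t.task_param->>'description',
     GREATEST(length(t.task_param->>'description') - (octet_length(t.task_param->>'description') - 31900),
              31900/3)) as description,
octet_length(t.task_param->>'description') > 31900 as description_is_truncated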

Finally, it works! Note that at this stage dbt + Dremio is not the only option; dbt + Spark would also work, but for my queries Dremio is faster than Spark.

Overall workflow

Now the overall workflow looks like this:

  1. Once a week, when I want to collect data, I spin up a VM on Azure.
  2. Run “dbt run” to collect new data and perform the transformations in Dremio.
  3. The intermediate data and final tables are all saved into JuiceFS (and the object storage behind it) in Parquet format.
  4. Shut down and release the VM to save money.

That concludes part 1. When I want to analyze my data, I sync the result tables from the JuiceFS folder (mounted on my desktop) to local folders and analyze them in ClickHouse; that will be covered in part 2.
