SQL Parser vs SQL Generator: What beats jOOQ may not be another SQL Generator, but a String Template

Wubaoqi
11 min readMar 2, 2021
Photo by Joanna Kosinska on Unsplash

I worked in a small startup to build Business Intelligence product, one job was to deal with different kinds databases & SQL dialects. So, I’d like to talk about 2 SQL related concepts: SQL Parser and SQL Generator.

SQL Parser

For SQL Parser, we have support from “compiler principle” theory. The steps to parse SQL, is quite similar with what we do to parse Java/Python programs.

For example,

  • In C/C++, we can use lex & yacc to do lexical analysis & syntax analysis
  • In Java, we can use JavaCC or ANTLR

In those tools, we normally describe the syntax rules by using BNF notation. Take “yacc” file for example: https://github.com/jgarzik/sqlfun/blob/master/sql.y

select_stmt: SELECT select_opts select_expr_list
{ sqlp_select_nodata(pstate, $2, $3); } ;
| SELECT select_opts select_expr_list
FROM table_references
opt_where opt_groupby opt_having opt_orderby opt_limit
opt_into_list { sqlp_select(pstate, $2, $3, $5); } ;
;
select_expr_list: select_expr { $$ = 1; }
| select_expr_list ',' select_expr {$$ = $1 + 1; }
| '*' { sqlp_select_all(pstate); $$ = 1; }
;
select_expr: expr opt_as_alias ;opt_as_alias: AS NAME { sqlp_alias(pstate, $2); free($2); }
| NAME { sqlp_alias(pstate, $1); free($1); }
| /* nil */
;
opt_limit: /* nil */ | LIMIT expr { sqlp_limit(pstate, 0); }
| LIMIT expr ',' expr { sqlp_limit(pstate, 1); }
;
opt_where: /* nil */
| WHERE expr { sqlp_where(pstate); };
opt_groupby: /* nil */
| GROUP BY groupby_list opt_with_rollup
{ sqlp_group_by_list(pstate, $3, $4); }
;
opt_orderby: /* nil */ | ORDER BY groupby_list { sqlp_order_by(pstate, $3); }
;

It seems complicated, but as we have support from “compiler theory”, there are lots for tools to assist us.

This type of SQL Parser can be classified as “Parser Generator”. By “Parser Generator”, we will use other tools to analyze those syntax rules, and finally generate the Parser for target language.

Of course, there are some new ways to simplify this process, like “Parser Combinator”. The “Combinator” in name means it can combine existing parsers to form new parser. We can refer to the following StackOverflow answer for more details:

https://stackoverflow.com/a/5057470

One of the main differences between the tools you listed, is that ANTLR, Bison and their friends are parser generators, whereas Parsec is a parser combinator library.

A parser generator reads in a description of a grammar and spits out a parser. It is generally not possible to combine existing grammars into a new grammar, and it is certainly not possible to combine two existing generated parsers into a new parser.

A parser combinator OTOH does nothing but combine existing parsers into new parsers. Usually, a parser combinator library ships with a couple of trivial built-in parsers that can parse the empty string or a single character, and it ships with a set of combinators that take 1 or more parsers and return a new one that, for example, parses the sequence of the original parsers (e.g. you can combine a d parser and an o parser to form a do parser), the alternation of the original parsers (e.g. a 0 parser and a 1 parser to a 0|1 parser) or parses the original parse multiple times (repetetion).

What this means is that you could, for example, take an existing parser for Java and an existing parser for HTML and combine them into a parser for JSP.

Most parser generators don’t support this, or only support it in a limited way. Parser combinators OTOH only support this and nothing else.

And we can also find some example code, like the following example, with uses “Scala Parser Combinator” to parse SQL:

https://github.com/stephentu/scala-sql-parser/blob/master/src/main/scala/parser.scala

class SQLParser extends StandardTokenParsers {
def select: Parser[SelectStmt] =
"select" ~> projections ~
opt(relations) ~ opt(filter) ~
opt(groupBy) ~ opt(orderBy) ~ opt(limit) <~ opt(";") ^^ {
case p ~ r ~ f ~ g ~ o ~ l => SelectStmt(p, r, f, g, o, l)
}
def projections: Parser[Seq[SqlProj]] = repsep(projection, ",")def projection: Parser[SqlProj] =
"*" ^^ (_ => StarProj()) |
expr ~ opt("as" ~> ident) ^^ {
case expr ~ ident => ExprProj(expr, ident)
}
def groupBy: Parser[SqlGroupBy] =
"group" ~> "by" ~> rep1sep(expr, ",") ~ opt("having" ~> expr) ^^ {
case k ~ h => SqlGroupBy(k, h)
}
def orderBy: Parser[SqlOrderBy] =
"order" ~> "by" ~> rep1sep( expr ~ opt("asc" | "desc") ^^ {
case i ~ (Some("asc") | None) => (i, ASC)
case i ~ Some("desc") => (i, DESC)
}, ",") ^^ (SqlOrderBy(_))
def limit: Parser[Int] = "limit" ~> numericLit ^^ (_.toInt)
}

It is very convenient to use Scala programming language to describe rules, and we can also have Scala compiler’s static syntax check to detect error earlier.

In BI (Business Intelligence), SQL Parser’s usage is limited, mainly when:

  • To support some “custom SQL field” in dataset, we need to use SQL Parser to detect whether user’s formula is normal statement, aggregation statement or window statement
  • Some advances usage, like analyze user’s SQL to find out whether it could cause performance degradation issues.

SQL Generator

In BI software, SQL Generator is more important than SQL Parser. For example, we need to generate directly SQL for different DBMS.

And for different DBMS, the SQL syntax has several differences. Take a simple query scenario as example,

For Postgres, the SQL generated will be:

select
"member_id" as "member_id__0",
count(distinct "order_id") as "order_id__cnt_distinct__0"
from "orders"
group by "member_id__0"
limit 20000

For Microsoft SQL Server, the SQL generated will be:

select top 20000
[member_id] [member_id__0],
count(distinct [order_id]) [order_id__cnt_distinct__0]
from [orders]
group by [member_id]
order by (select 0)

For Oracle, the SQL generated will be:

select
"v0" "member_id__0",
"v1" "order_id__cnt_distinct__0"
from (
select
"x"."v0",
"x"."v1",
rownum "rn"
from (
select
"member_id" "v0",
count(distinct "order_id") "v1"
from "orders"
group by "member_id"
) "x"
where rownum <= (0 + 20000)
)
where "rn" > 0

In this example, the major cause for such difference is: the “limit” clause’s support.

Some simple solution may just use Java’s StringBuilder to combine string parts into final SQL. But this is not a good solution, it will be easy to generator invalid SQL & easy to cause SQL Injection vulnerability.

Fortunately, we also have some tools to assist SQL Generation, although the number is far less than SQL Parsers. Some projects in this category are:

As I’m more familiar with Java & jOOQ, I will use it as example, the source code to generate above SQL is like:

DSL.using(SQLDialect.POSTGRES)
.select(
field(name("member_id")).as("member_id__0"),
countDistinct(name("order_id")).as("order_id__cnt_distinct__0")
)
.from(name("orders"))
.groupBy(
"member_id__0",
"order_id__cnt_distinct__0"
)
.limit(20000)

JOOQ is very helpful to avoid SQL Injection and assist SQL generation. But jOOQ also has some problems:

Problem 1: In some scenario, jOOQ’s support is still limited, for example, for different DMBS, what can be put in Group By clause is different:

  • Some DBMS support use column index, like “Group By 1, 2, 3”
  • Some need to use column alias: “Group By Alias1, Alias2”
  • And some need to take the original calculation into GroupBy, like: “Group By Column1, to_date(Column2)”

Problem 2 is more hard to solve, it is hard to support new DBMS (which jOOQ don’t support officially yet). There was a feature request: let jOOQ’s user to support custom DMBS by writing some configuration files. But, the implementation for this feature is never released. Because: this implementation will conflict with jOOQ’s business interest. If every one can support new DMBS themselves, there will be fewer user buying their commercial editions. But as many open source DMBS are very popular nowadays (like Spark, Presto, ClickHouse), jOOQ’s absents limit its usefulness.

It’s a pity there is no good jOOQ alternatives in Java world.

Some insight from dbt

When I learned about how to use dbt (data build tool) to build data warehouse, I found dbt is very convenient.

By use dbt, we put SQLs into different folders, and in addition to write plain SQL, dbt also support to write Jinja templates to dynamically construct SQL.

Take the following Jinja template as example , (example from https://docs.getdbt.com/docs/building-a-dbt-project/jinja-macros#jinja)

{% set payment_methods = ["bank_transfer", "credit_card", "gift_card"] %}select
order_id,
{% for payment_method in payment_methods %}
sum(case when payment_method = '{{payment_method}}' then amount end) as {{payment_method}}_amount,
{% endfor %}
sum(amount) as total_amount
from app_data.payments
group by 1

will be translated to the following plain SQL

select
order_id,
sum(case when payment_method = 'bank_transfer' then amount end) as bank_transfer_amount,
sum(case when payment_method = 'credit_card' then amount end) as credit_card_amount,
sum(case when payment_method = 'gift_card' then amount end) as gift_card_amount,
sum(amount) as total_amount
from app_data.payments
group by 1

Woo, Template + SQL can have some amazing usage.

So, I begin to think: can I find out some similar projects in Java, and whether I can use them in BI software.

Then, I find a even more amazing open source project in Java — stringtmplate4 ( https://github.com/antlr/stringtemplate4)

“SQL Generator” becomes the reverse of “SQL Parser” by string template

stringtemplate4 is also a template engine.

When talking about template engine, we may think about some old projects which translate pre-defined we page template into web pages, and bind data together. For example: Apache Velocity project.

When the author of stringtemplate4 was developing jGuru.com website, he researched about what is the essential elements for page templates, which became stringtemplate4 project later.

And at the same time, the author of stringtemplate4 is the same author who wrote famous Java Parser library : ANTLR. So, under the support of “compiler theory”, and use BNF notation like rules to write templates, may lead to many interesting usages.

More details about stringtemplate4 can be found in its author’s paper http://www.cs.usfca.edu/~parrt/papers/mvc.templates.pdf. The paper is very insightful, and I recommend to read it.

OK, let’s enter the “Show Me the Code” stage, and let’s see how to use stringtemlate4 to replace jOOQ for SQL generation.

Try 1: Generate one simple SELECT statement

Let’s start with the more basic SELECT statement: select col1, col2 from table1

First, we will define a data structure in Java to stand for this SELECT statement:

@Data
@Builder
class SelectStmt {
public List<String> columnNames;
public String tableName;
}

Then, we need to write a stringtemplate4 template, and put inside java project’s resource folder:

select_stmt(stmt) ::= "SELECT <stmt.columnNames;separator=\",\"> FROM <stmt.tableName>"

(Note the content inside < >, and how to deal with single value & multiple values)

Finally, we can call stringtemplate4 to generate final SQL

public class TestST {
public static void main(String[] args) {
SelectStmt stmt = SelectStmt.builder()
.columnNames(Lists.newArrayList("col1", "col2"))
.tableName("table1")
.build();
STGroup group = new STGroupFile(Resources.getResource("postgresql/template.stg"));
ST st = group.getInstanceOf("select_stmt");
st.add("stmt", stmt);
System.out.println(st.render());
}
}

The final SQL will be:

SELECT col1,col2 FROM table1

Try 2: Add support for Column Alias

Now, let’s add “alias” support.

First, we need to extend the definition of “SelectStmt”, and add ColumnOptAlias class to store Column Name and Alias. Note: Alias is optional, it means there is no alias if “alias” variable is null.

@Data
@AllArgsConstructor
class ColumnOptAlias {
public String name;
public String alias;
}
@Data
@Builder
class SelectStmt {
public List<ColumnOptAlias> columns;
public String tableName;
}

Then, change template file accordingly

select_stmt(stmt) ::= <<
SELECT <stmt.columns:column_opt_alias();separator=",">
FROM <stmt.tableName>
>>
column_opt_alias(col) ::= <%
<col.name>
<if(col.alias)>
<\ >AS <col.alias>
<endif>
%>

Note:

  • there are 2 rules now, instead of 1
  • In “try 1”, we used single line string, so, we need to use “\” to escape the double quote inside string. But in “try 2”, we used multiple-line string to describe rule.
  • When using << >> to encode rule, the spaces/tabs inside rule will also be kept in final SQL, but when use <% %>, it will ignore those spaces/tabs.

Finally, modify the test program: (Note: column “col1” has alias “alias1”, but column “col2” doesn’t have any alias)

public class TestST {
public static void main(String[] args) {
SelectStmt stmt = SelectStmt.builder()
.columns(Lists.newArrayList(
new ColumnOptAlias("col1", "alias1"),
new ColumnOptAlias("col2", null)
))
.tableName("table1")
.build();
STGroup group = new STGroupFile(Resources.getResource("postgresql/template.stg"));
ST st = group.getInstanceOf("select_stmt");
st.add("stmt", stmt);
System.out.println(st.render());
}
}

We can get the following SQL

SELECT col1 AS alias1,col2
FROM table1

Try 3, Add support for LIMIT (Postgres dialect)

First, for SelectStmt, we need to add a new field “limit”, when “limit” is null, it means no limit.

@Data
@Builder
class SelectStmt {
public List<ColumnOptAlias> columns;
public String tableName;
public Long limit;
}

Then, modify template file accordingly

select_stmt(stmt) ::= <<
SELECT <stmt.columns:column_opt_alias();separator=",">
FROM <stmt.tableName>
<limit_stmt(stmt)>
>>
column_opt_alias(col) ::= <%
<col.name>
<if(col.alias)>
<\ >AS <col.alias>
<endif>
%>
limit_stmt(stmt) ::= <%
<if(stmt.limit)>
LIMIT <stmt.limit>
<endif>
%>

in test program:

SelectStmt stmt = SelectStmt.builder()
.columns(Lists.newArrayList(
new ColumnOptAlias("col1", "alias1"),
new ColumnOptAlias("col2", null)
))
.tableName("table1")
.limit(100l)
.build();

We will get the following SQL for Postgres

SELECT col1 AS alias1,col2
FROM table1
LIMIT 100

Try 4, Add support for LIMIT (Microsoft SQL Server dialect))

For MS SQL Server, we can create a new file /mssql/template.stg to describe its syntax rule

select_stmt(stmt) ::= <<
SELECT <limit_stmt(stmt)><stmt.columns:column_opt_alias();separator=",">
FROM <stmt.tableName>
>>
column_opt_alias(col) ::= <%
<col.name>
<if(col.alias)>
<\ >AS <col.alias>
<endif>
%>
limit_stmt(stmt) ::= <%
<if(stmt.limit)>
TOP <stmt.limit><\n><\t>
<endif>
%>

We can get the following SQL

SELECT TOP 100
col1 AS alias1,col2
FROM table1

Try 5, Add support for LIMIT (old Oracle dialect)

For old version Oracle (before 12c), Sine it doesn’t support LIMIT clause, the implementation will be a little complicated, let’s generate a similar SQL as what jOOQ generated. First, we define the template rules:

select_stmt(stmt) ::= <<
<if(stmt.limit)>
<with_limit_stmt(stmt)>
<else>
<nolimit_select_stmt(stmt)>
<endif>
>>
nolimit_select_stmt(stmt) ::= <<
SELECT <stmt.columns:column_opt_alias();separator=",">
FROM <stmt.tableName>
>>
column_opt_alias(col) ::= <%
<col.name>
<if(col.alias)>
<\ >AS <col.alias>
<endif>
%>
column_final_display(col) ::= "<if(col.alias)><col.alias><else><col.name><endif>"with_limit_stmt(stmt) ::= <<
SELECT <stmt.columns:column_final_display();separator=",">
FROM (
SELECT <stmt.columns:column_final_display();separator=",">
,rownum "rn"
FROM (
<nolimit_select_stmt(stmt)>
) "x"
WHERE rownum \<= (0 + <stmt.limit>)
)
WHERE "rn" > 0
>>

This template is more complicated than previous ones, it includes 5 rules, and we can get the SQL for Oracle as:

SELECT alias1,col2
FROM (
SELECT alias1,col2
,rownum "rn"
FROM (
SELECT col1 AS alias1,col2
FROM table1
) "x"
WHERE rownum <= (0 + 100)
)
WHERE "rn" > 0

Try 6, Support Escape to avoid SQL Injection

Through above examples, although we demo the greatness of stringtemplate4. But we still left one issue, that is: in real world, it’s hard to make sure all column name & string values are all simple ascii characters.

For example, if column name “col2” becomes col2_with”\_special, which contains special characters, then, the generated SQL will be illegal.

In stringtemplate4’s paper http://www.cs.usfca.edu/~parrt/papers/mvc.templates.pdf, it also mentions this issue. For template rule itself, it can not handle this Escape. But we can support Escape by using Java.

How then does one escape strings, a common and necessary operation?

The inescapable fact is that there must be some Java code somewhere that computes HTML escape sequences because the computation cannot be done in the template. Besides, Java already provides heavy support, such as the java.text package, for dealing with text, numbers, dates, and so on. Since neither the model nor the controller may contain this computation, the code must exist elsewhere.

The author introduced a new role “renderer” for this job, then MVC becomes MVCR (model-view-controller-renderer).

For our program, we can define the toString() method for every places where we need to handle escape, like the “column name”.

@AllArgsConstructor
class Name {
public String name;
@Override
public String toString() {
String escape = "\"";
return escape +
name.replace(escape, escape + escape)
+ escape;
}
}
@Data
@AllArgsConstructor
class ColumnOptAlias {
public Name name;
public Name alias;
}
@Data
@Builder
class SelectStmt {
public List<ColumnOptAlias> columns;
public Name tableName;
public Long limit;
}

Note: to support escape column name & table name, we change the variable’s data type from “String” to “Name” class. And the toString() in this “Name” class will handle Escape.

With this new renderer support, if we change column name “col2” to col2_with”\_special, we can get correct SQL:

SELECT "col1" AS "alias1","col2_with""\_special"
FROM "table1"
LIMIT 100

Perfect! Of course, for different DBMS, we need different Escape characters, we can put those into a configuration file. And by using similar solution, we can do escape for String, Date, Number literal values to the correct format.

Final

stringtemplate4 is really a great project, and there should be more places we can apply it.

--

--