SQL Style Guide
My personal SQL style guide focuses on making complex analytic queries easier to write and maintain as projects grow. It helps avoid common pitfalls, keeps code readable for both new developers and your future self, and makes performance optimization in Snowflake easier.
Core Principles
- Maintainability: Keep SQL easy to update and debug as projects grow.
- Readability: Write clear, structured queries that are easy to understand.
- Performance-Friendly: Structure SQL for easier optimization in Snowflake.
- AI/LLM-Compatible: Write SQL that is structured and unambiguous, making it easier for AI and large language models to parse and reason about.
Code Formatting
SQLFluff
With SQLFluff, SQL style guides matter much less. As an opinionated linter, SQLFluff simplifies formatting and improves SQL quality. This guide assumes you use SQLFluff and focuses only on what complements its rules.
I recommend ignoring these SQLFluff rules:
- layout.long_lines (LT05): Long lines can improve readability, especially when longer names are necessary.
- structure.column_order (ST06): Sometimes custom column ordering can improve readability.
- references.keywords (RF04, optional): Snowflake reserves several words, including account, organization, and values. These may be needed in some of your exposure models.
These rules can be disabled in your .sqlfluff file. Example: StarlightDataWarehouseTemplate/.sqlfluff
Naming Conventions
Use snake_case and lowercase
Use snake_case for all identifiers, as it aligns with analytical community standards.
-- Do
customer_account_id
order_status_type
transaction_date
-- Don't
CustomerAccountId -- Pascal case
customer_account_ID -- Mixed case
customeraccountid -- No word separation
CUSTOMER_ACCOUNT_ID -- Screaming snake case
CUSTOMERACCOUNTID -- I hate my coworkers case
-- Do
select
amount,
created_at
from transactions
-- Don't
SELECT
amount,
created_at
FROM transactions
Snowflake Naming Considerations
Unfortunately, Snowflake converts unquoted identifiers to uppercase by default. However, keep the identifiers lowercase and unquoted in your code, and just let Snowflake be Snowflake.
Recommendations:
- Write SQL in lowercase.
- Accept Snowflake’s uppercase conversion for internal objects.
- Use double quotes for lowercase identifiers only when required by external systems.
-- Do
select
amount,
created_at
from transactions
-- Do: Only when external systems need lowercase
select
amount as "amount",
created_at as "created_at"
from transactions
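For context on the uppercase conversion: unquoted identifiers are case-insensitive in Snowflake, so lowercase code resolves against the uppercase objects Snowflake creates. A quick sketch with a hypothetical table:
-- Stored as TRANSACTIONS, AMOUNT, CREATED_AT, but lowercase unquoted references still resolve
create table transactions (
    amount number,
    created_at timestamp_ntz
);
select
    amount,
    created_at
from transactions;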
Query Structure
Join Patterns
Use only left join and cross join, with cross join applied only in very limited cases.
If you find yourself needing other join types, it's usually due to a Level of Detail issue.
Even when other joins might produce correct results, they often:
- Mask underlying data issues
- Silently remove duplicates without addressing their root cause
-- Do: Standard left join pattern
select
orders.order_id,
customers.customer_name
from orders
left join customers
on orders.customer_id = customers.customer_id
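One advantage of sticking to left join: rows with no match show up as nulls instead of silently disappearing, which makes underlying data issues visible. A hedged sketch of such a check, using the same tables as above:
-- Orders whose customer_id has no match; an inner join would silently drop these rows
select
    orders.order_id,
    orders.customer_id
from orders
left join customers
    on orders.customer_id = customers.customer_id
where customers.customer_id is null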
cross join is useful when you need to expand a table's level of detail, for example when creating one row for each combination of city and month. Always consider whether this can be pre-generated instead (see the Generated Tables section):
-- Do: Cross join for data expansion
select
city.city_name,
month.month_name
from city
cross join month
Important: cross join operations can be extremely expensive, potentially causing queries to run for a very long time.
A cross join creates one row for each combination of rows between the joined tables. Examples:
- cross join 5 cities with 12 months → 60 rows
- cross join 1,000 cities with (5 × 365) = 1,825 days → 1,825,000 rows
- cross join that with 5,000,000 orders → 9.125 trillion rows. Even Snowflake struggles with datasets of this scale.
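A cheap way to avoid surprises is to estimate the output size before running the cross join. A minimal sketch, assuming the city and month tables from the example above:
-- Expected row count of city cross join month: multiply the row counts
select
    (select count(*) from city) * (select count(*) from month) as expected_rows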
Query Organization
CTEs Over Subqueries
Use CTEs (WITH clauses) instead of subqueries. CTEs provide:
- Better readability
- Easier debugging
- Step-by-step query logic
Building queries with CTEs makes debugging easier since you can verify the level of detail at each step.
-- Do
with monthly_sales as (
select
date_trunc('month', order_date) as month,
sum(amount) as total_sales
from orders
group by date_trunc('month', order_date)
)
select
month,
total_sales,
lag(total_sales) over (order by month) as previous_month_sales
from monthly_sales
-- Don't
select
month,
total_sales,
lag(total_sales) over (order by month) as previous_month_sales
from (
select
date_trunc('month', order_date) as month,
sum(amount) as total_sales
from orders
group by date_trunc('month', order_date)
) monthly_sales
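Because each step is a named CTE, you can also verify the level of detail while developing by pointing the final select at any CTE. A sketch reusing monthly_sales from above:
-- Should return zero rows if month really is the level of detail of monthly_sales
with monthly_sales as (
    select
        date_trunc('month', order_date) as month,
        sum(amount) as total_sales
    from orders
    group by date_trunc('month', order_date)
)
select
    month,
    count(*) as row_count
from monthly_sales
group by month
having count(*) > 1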
Exception: Simple in clauses can use subqueries:
-- Do: Simple IN clause is fine
with filtered_customers as (
select customer_id
from customers
where customer_type = 'Demo'
)
select
invoice_id,
invoice_number
from invoices
where customer_id not in (select customer_id from filtered_customers)
Logical Operators
Use NOT for Negation
Prefer NOT for its clarity:
-- Do
where not user_id = 1
where not status in ('deleted', 'archived')
-- Don't
where user_id != 1
where user_id <> 1
Comments
- Comment with purpose; never comment the obvious. Comments should explain why something is done, not what the code does. Well-written code should be self-explanatory.
- Don't duplicate logic. If your comment simply repeats what the code does, you're duplicating logic. This is unnecessary and creates maintenance overhead: every change requires updating both the code and the comment, increasing the risk of outdated documentation.
- Always document:
  - Hardcoded values (magic numbers).
  - Business logic.
  - Performance decisions.
- No need to comment on complex calculations.
  - The reader can copy-paste the logic into an LLM for explanation if needed.
  - Instead, document why a specific approach was chosen or highlight edge cases.
- Never leave commented-out code.
  - Use version control to track changes instead.
  - Keeps the codebase clean and maintainable.
  - Prevents confusion about what's active and what's obsolete.
Precise code reduces the need for excessive comments. Write clearly, then document wisely.
-- Do
select
user_id,
status
from users
where not user_id in (4, 7) -- Exclude internal test accounts
and created_at >= '2023-01-01' -- System migration date
-- Don't
select
user_id,
status
from users
where not user_id in (4, 7)
-- and created_at >= '2023-01-01'
Generated Tables
Instead of using generators inline, create dedicated tables for commonly needed data:
create table all_numbers as
select row_number() over (order by seq4()) as number
from table(generator(rowcount => 1000));

create table all_dates as
select dateadd('day', seq4(), '2022-01-01')::date as all_date
from table(generator(rowcount => 10000)) -- large rowcount to cover the full date range
where dateadd('day', seq4(), '2022-01-01')::date between '2022-01-01' and '2025-12-31';

create table all_first_of_month as
select date_trunc('month', dateadd('month', seq4(), date '2022-01-01'))::date as first_of_month
from table(generator(rowcount => 1000)) -- large rowcount to cover the full date range
where date_trunc('month', dateadd('month', seq4(), date '2022-01-01'))::date <= date '2025-12-31';
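Once these tables exist, they can be reused across models, for example to produce one row per date even when a date has no orders (hypothetical orders table):
-- One row per date in the range, with zero for days without orders
select
    all_dates.all_date,
    coalesce(sum(orders.amount), 0) as total_sales
from all_dates
left join orders
    on orders.order_date = all_dates.all_date
group by all_dates.all_date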
In the Starlight Data Framework, these are added as seeds. However, it is preferable to pre-materialize the tables elsewhere to avoid running seed table generation multiple times.
That said, seeds are still significantly better than generating data within your SQL.
Snowflake Performance
After more than six years of using Snowflake, if not daily then at least weekly, these are my top four technical recommendations for Snowflake query performance:
- Keep queries simple.
- Consider splitting large (many lines of SQL) models into separate models.
- Pre-generate commonly needed lookup tables.
- Consider materializing tables instead of using views.
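On the last point, switching a dbt model from a view to a table is a one-line config change. A minimal sketch of a model file, with hypothetical model names:
-- Build this model as a table instead of a view
{{ config(materialized='table') }}

select
    order_id,
    amount
from {{ ref('stg_orders') }}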
However, if you want a fast-running dbt build, no technical optimization beats the following:
- Only add what the business uses; delete everything else. I can say with near certainty that if you analyzed what the business actually uses today, you could delete more than 95% of your entire data warehouse infrastructure and code.
- Ensure that all models have a clear Level of Detail.
dbt Considerations
I don't think it's controversial to say that dbt is the most important tool in data management.
However, dbt only enables great data management—it doesn't ensure it. Here are a few tips for using dbt effectively and some pointers to keep you out of trouble.
dbt excels at transformation and modeling. Use it for its core strengths:
- Dependency management
- Testing
Tests
Add not_null and unique tests to nearly all models. This ensures the correct Level of Detail.
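Together, those two tests assert roughly what this hand-written check looks for (hypothetical model and key column):
-- Rows that would fail a unique or not_null test on orders.order_id
select
    order_id,
    count(*) as row_count
from orders
group by order_id
having count(*) > 1      -- duplicates violate unique
    or order_id is null  -- nulls violate not_null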
Generated Tables Over Seeds
If possible, prefer generated tables over seeds.
Avoid Macros and Elaborate Jinja
- They make debugging difficult.
- Jinja macro code is hard to read, write, understand, and debug.
- It is easier for LLMs to understand and generate pure SQL.
- Write pure SQL instead.
In my experience, I have never encountered a situation where:
- Macros or elaborate Jinja were faster to write than pure SQL.
- Macros or elaborate Jinja were easier to understand for other developers (or yourself six months later) than pure SQL.
- Macros or elaborate Jinja were easier or faster to debug than pure SQL.
- Macros or elaborate Jinja were the only way to solve an issue.
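As an illustration of writing pure SQL instead: a common use of Jinja is looping over a list of values to generate one column per value. Spelling the columns out in plain SQL is slightly longer but far easier to read and debug (hypothetical table and values):
-- Explicit pivot columns in plain SQL instead of a Jinja for-loop
select
    order_id,
    sum(case when payment_method = 'card' then amount else 0 end) as card_amount,
    sum(case when payment_method = 'invoice' then amount else 0 end) as invoice_amount,
    sum(case when payment_method = 'cash' then amount else 0 end) as cash_amount
from payments
group by order_id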
Don't use severity
The dbt severity features should not be used.
The main rule for errors is: if you don’t intend to fix them immediately, they should not be classified as errors.
With the error_if feature, where you can set a threshold for errors, you will never know if an error is genuinely critical or just one of the expected errors. Therefore, you should always handle such errors in code instead of silently accepting them.
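Handling expected errors in code usually means making the exception explicit in the model itself, so any remaining test failure is a real problem. A sketch with hypothetical table and ids:
-- Known duplicate source rows are excluded explicitly instead of raising the error_if threshold
select
    order_id,
    customer_id,
    amount
from raw_orders
where not order_id in (1001, 1002) -- known duplicate rows in the source system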
The warn_if feature serves no real purpose except to confuse people. As the main rule states, if you don't intend to fix something immediately, it shouldn't be there. If you don't establish a process for handling warnings, they will only create confusion.
Avoid incremental models
If at all possible, avoid incremental models.
Incremental models may seem like an easy way to fix performance problems by only loading changed and new rows, but they have significant downsides.
Downsides:
- Require considerable effort to develop
- Require considerable effort to debug
- Add complexity to the entire project
- The data in an incremental model depends on when each load was run, meaning Dev, Test, CI/CD, and Production environments won’t have the same data
- Not always faster or cheaper
- Extremely difficult to change
- Often, better performance optimizations are available. However, once an incremental model is in place, it becomes very difficult to change the architecture and implement better optimizations.
To read more and explore different opinions, check out this thread on the dbt Slack forum. And if you must use incremental models, it is highly recommended to check out Deena’s Coalesce 2024 talk: How Amplify Optimized Their Incremental Models with dbt on Snowflake.
No EL
Don't use dbt for extract, load, or other data engineering tasks.
Comments, questions, and likes are highly appreciated. ❤️