SQL Style Guide

Core Principles

Maintainability: Keep SQL easy to update and debug as projects grow.
Readability: Write clear, structured queries that are easy to understand.
Performance-Friendly: Structure SQL for easier optimization in Snowflake.
AI/LLM-Compatible: Write SQL that is structured and unambiguous, making it easier for AI and large language models to parse and reason about.

With SQLFluff, SQL style guides matter much less. As an opinionated linter, SQLFluff simplifies formatting and improves SQL quality. This guide assumes you use SQLFluff and focuses only on what complements its rules.

I recommend ignoring these SQLFluff rules:

layout.long_lines (LT05)
Long lines can improve readability, especially when longer names are necessary.
structure.column_order (ST06)
Sometimes custom column ordering can improve readability.
structure.column_order (RF04) OPTIONAL
Snowflake reserves several words, including account, organization, and values. These may be needed in some of your exposure models.

These rules can be disabled in your .sqlfluff file. Example: StarlightDataWarehouseTemplate/.sqlfluff

Naming Conventions

Use snake_case and lowercase

Use snake_case for all identifiers as it aligns with analytical community standards.

-- Do
customer_account_id
order_status_type
transaction_date

-- Don't
CustomerAccountId    -- Pascal case
customer_account_ID  -- Mixed case
customeraccountid    -- No word separation
CUSTOMER_ACCOUNT_ID  -- Screaming snake case
CUSTOMERACCOUNTID    -- I hate my coworkers case

-- Do
select
   amount,
   created_at
from transactions

-- Don't
SELECT
   amount,
   created_at
FROM transactions

Snowflake Naming Considerations

Unfortunately, Snowflake converts unquoted identifiers to uppercase by default. However, keep the identifiers lowercase and unquoted in your code, and just let Snowflake be Snowflake.

Recommendations:

Write SQL in lowercase.
Accept Snowflake’s uppercase conversion for internal objects.
Use double quotes for lowercase identifiers only when required by external systems.

-- Do
select
   amount,
   created_at
from transactions

-- Do: Only when external systems need lowercase
select
   amount as "amount",
   created_at as "created_at"
from transactions

Query Structure

Join Patterns

Use only left join and cross join, with cross join applied only in very limited cases.

If you find yourself needing other join types, it's usually due to a Level of Detail issue.

Even when other joins might produce correct results, they often:

Mask underlying data issues
Silently remove duplicates without addressing their root cause

-- Do: Standard left join pattern
select
   orders.order_id,
   customers.customer_name
from orders
left join customers
   on orders.customer_id = customers.customer_id

cross join is useful when you need to expand a table's level of detail. For example, when creating one row for each combination of city and month:

Always consider if this can be pregenerated #generated-tables

-- Do: Cross join for data expansion
select
   city.city_name,
   month.month_name
from city
cross join month

Important: cross join operations can be extremely expensive, potentially causing queries to run for a very long time.

A cross join creates one row for each combination of rows between the joined tables. Examples:

cross join 5 cities with 12 months → 60 rows
cross join 1,000 cities with (5 × 365) = 1,825 days → 1,825,000 rows
cross join that with 5,000,000 orders → 9.125 trillion rows. Even Snowflake struggles with datasets of this scale.

Query Organization

CTEs Over Subqueries

Use CTEs (WITH clauses) instead of subqueries. CTEs provide:

Better readability
Easier debugging
Step-by-step query logic

Building queries with CTEs makes debugging easier since you can verify the level of detail at each step.

-- Do
with monthly_sales as (
   select
       date_trunc('month', order_date) as month,
       sum(amount) as total_sales
   from orders
   group by date_trunc('month', order_date)
)
select
   month,
   total_sales,
   lag(total_sales) over (order by month) as previous_month_sales
from monthly_sales

-- Don't
select
   month,
   total_sales,
   lag(total_sales) over (order by month) as previous_month_sales
from (
   select
       date_trunc('month', order_date) as month,
       sum(amount) as total_sales
   from orders
   group by date_trunc('month', order_date)
) monthly_sales

Exception: Simple in clauses can use subqueries:

-- Do: Simple IN clause is fine
with filtered_customers as (
  select customer_id
  from customers
  where customer_type = 'Demo'
)
select
  invoice_id,
  invoice_number
from invoices
where customer_id not in (select customer_id from filtered_customers)

Logical Operators

Use NOT for Negation

Prefer NOT for its clarity:

-- Do
where not user_id = 1
where not status in ('deleted', 'archived')

-- Don't
where user_id != 1
where user_id <> 1

Comments

Comment with purpose—never comment the obvious.
Comments should explain why something is done, not what the code does. Well-written code should be self-explanatory.
- Don't duplicate logic. If your comment simply repeats what the code does, you're duplicating logic. This is unnecessary and creates maintenance overhead—every change requires updating both the code and the comment, increasing the risk of outdated documentation.
Always document:
- Hardcoded values magic number.
- Business logic.
- Performance decisions.
No need to comment on complex calculations.
- The reader can copy-paste the logic into an LLM for explanation if needed.
- Instead, document why a specific approach was chosen or highlight edge cases.
Never leave commented-out code.
- Use version control to track changes instead.
- Keeps the codebase clean and maintainable.
- Prevents confusion about what’s active and what’s obsolete.

Precise code reduces the need for excessive comments. Write clearly, then document wisely.

-- Do
select
   user_id,
   status
from users
where not user_id in (4, 7)  -- Exclude internal test accounts
   and created_at >= '2023-01-01'  -- System migration date

-- Don't
select
   user_id,
   status
from users
where not user_id in (4, 7)
   -- and created_at >= '2023-01-01'

Generated Tables

Instead of using generators inline, create dedicated tables for commonly needed data:

create table all_numbers as (
    select row_number() over (order by seq4()) as number
    from
        table(generator(rowcount => 1000))
)

create table all_dates as
select dateadd('day', seq4(), '2022-01-01')::date as all_date
from
    table(generator(rowcount => 10000)) -- using a large number to cover the date range
where
    dateadd('day', seq4(), '2022-01-01')::date between '2022-01-01' and '2025-12-31';

create table all_first_of_month as
select date_trunc('month', dateadd('month', seq4(), date '2022-01-01'))::date as first_of_month
from
    table(generator(rowcount => 1000)) -- using a large number to cover the date range
where
    date_trunc('month', dateadd('month', seq4(), date '2022-01-01'))::date <= date '2025-12-31';

In the Starlight Data Framework, these are added as seeds. However, it is preferable to pre-materialize the tables elsewhere to avoid running seed table generation multiple times.

That said, seeds are still significantly better than generating data within your SQL.

Snowflake Performance

After more than six years of using Snowflake, if not daily, then at least weekly, I have come up with my top four technical recommendations for Snowflake query performance:

Keep queries simple.
Consider splitting large (many lines of SQL) models into separate models.
Pre-generate commonly needed lookup tables.
Consider materializing tables instead of using views.

However, if you want a fast-running dbt build, no technical optimization beats the following:

Only add what the business uses, delete everything else. I can nearly say with perfect certainty that if you analyzed what the business actually uses today, you would be able to delete more than 95% of your entire data warehouse infrastructure and code.
Ensure that all models have a clear Level of Detail.

dbt Considerations

I don't think it's controversial to say that dbt is the most important tools in data management.

However, dbt only enables great data management—it doesn't ensure it. Here are a few tips for using dbt effectively and some pointers to keep you out of trouble.

dbt excels at transformation and modeling. Use it for its core strengths:

Dependency management
Testing

Tests

Add not_null and unique tests to nearly all models. This ensures the correct Level of Detail.

Generated Tables Over Seeds

If possible, prefer generated tables over seeds.

Avoid Macros and Elaborate Jinja

They make debugging difficult.
Jinja macro code is hard to read, write, understand, and debug.
It is easier for LLMs to understand and generate pure SQL.
Write pure SQL instead.

In my experience, I have never encountered a situation where:

Macros or elaborate Jinja were faster to write than pure SQL.
Macros or elaborate Jinja were easier to understand for other developers (or yourself six months later) than pure SQL.
Macros or elaborate Jinja were easier or faster to debug than pure SQL.
Macros or elaborate Jinja were the only way to solve an issue.

Don't use severity

The dbt severity features should not be used.

The main rule for errors is: if you don’t intend to fix them immediately, they should not be classified as errors.

With the error_if feature, where you can set a threshold for errors, you will never know if an error is genuinely critical or just one of the expected errors. Therefore, you should always handle such errors in code instead of silently accepting them.

The warn_if feature serves no real purpose except to confuse people. As the main rule states, if you don’t intend to fix something immediately, it shouldn’t be there. If you don’t establish a process for handling warnings, they will only create confusion.

Avoid incremental models

If at all possible, avoid incremental models.

Incremental models may seem like an easy way to fix performance problems by only loading changed and new rows, but they have significant downsides.

Downsides:

Require considerable effort to develop
Require considerable effort to debug
Add complexity to the entire project
The data in an incremental model depends on when each load was run, meaning Dev, Test, CI/CD, and Production environments won’t have the same data
Not always faster or cheaper
Extremely difficult to change
Often, better performance optimizations are available. However, once an incremental model is in place, it becomes very difficult to change the architecture and implement better optimizations.

To read more and explore different opinions, check out this thread on the dbt Slack forum. And if you must use incremental models, it is highly recommended to check out Deena’s Coalesce 2024 talk: How Amplify Optimized Their Incremental Models with dbt on Snowflake.

No EL

Don't use dbt for extract, load, or other data engineering tasks.

Comments, questions, and likes are highly appreciated. ❤️

SQL Style Guide is currently being discussed on: