Legible + efficient SQL with CTEs

Thanks to Jonny Hughes for the image! Thanks to Jonny Hughes for the base image!

Howdy! I’m Paul, and I’m one of the newer faces here at iron.io. My plan is to blog, champion local meetups, and empower developers. That last one is a whiz-bang business term for “help folks learn how to do dead-simple job processing.”

Today, I’ll blog about an old love of mine, Common Table Expressions (CTEs). It’s a tool anyone who’s even sneezed near a SQL statement ought to know about. Just a select few SQL sorcerers seem to be aware of their existence. Let’s change that!

CTEs are an easy way to beautify and speed up SQL. They beautify by aliasing and separating small query chunks. They speed up the same way a temporary table would, but without the need for cleanup.

The net result of beauty and speed is maintainability. Which, is why I love CTEs.

Let’s take a look at a basic CTE (postgres syntax):

WITH table1 AS (
    [SELECT statement here]
)
, table2 AS (
    [SELECT statement here]
)

SELECT *
FROM table1, table2
WHERE [...]

The basic components are:

    1. We tell the interpreter it’s a CTE by starting with WITH. CTEs are sometimes called WITH queries for this reason.
    1. The names table1 and table2 are used as aliases. You can name just one table, or a bunch.
  1. The bottom-most SELECT is the parent query. It joins everything together.

For a more real world example, let’s take a gander at a public dataset. I’ve chosen the crunchbase data on modeanalytics. You can follow along by cloning the following query:

WITH first_acq_date AS (
    SELECT company_permalink, min(acquired_quarter) AS min_acquired_quarter
    FROM tutorial.crunchbase_acquisitions
    GROUP BY company_permalink
)
, raised_amount AS (
    SELECT company_permalink, SUM(raised_amount_usd)
    FROM tutorial.crunchbase_investments
    WHERE raised_amount_usd is not null
    GROUP BY company_permalink
)
, url AS (
SELECT permalink, homepage_url
FROM tutorial.crunchbase_companies
)

SELECT fad.*, ra.sum, url.homepage_url
FROM first_acq_date fad, raised_amount ra, url
WHERE fad.company_permalink = ra.company_permalink
  AND fad.company_permalink = url.permalink
ORDER BY min_acquired_quarter, sum

The benefits of CTEs should be clear. If you name intelligently, aliasing each chunk of the query means CTEs are self-documenting. Martin Fowler is happy.

Follow the aliases in the above. We’re finding results for a company’s first acquisition date, the amount of funding raised, and a link to the company website. The parent query stitches the Frankenstein together, and voila! We’re done.

Thanks to Mode we get this snazzy time-series graph of capital raised, as measured in unicorns.

(click in to view Mode's fancy graph hover effects) (click the image view Mode’s fancy graph hover effects)

Adding market cap as a value would be interesting. Good news – since we’re using a WITH query that’s as simple as aliasing market cap, and updating the parent (helpful aside – here’s a helpful tutorial of what the GROUP BY statement does).

So, when should you use CTEs? They shine in queries with two or more joins or sub-selects. I also think they’re a fantastic way to write self-joins. This blurb from the postgres docs reveals an important use case:

A useful property of WITH queries is that they are evaluated only once per execution of the parent query, even if they are referred to more than once by the parent query or sibling WITH queries. Thus, expensive calculations that are needed in multiple places can be placed within a WITH query to avoid redundant work.

In other words, CTEs are a great candidate for speeding up expensive joins.

CTEs made my life better. Now that you know about them, they should do the same for you. Go forth! Write beautiful less ugly SQL. Here are docs in various SQL dialects:

Redshift
Postgres
Oracle
Microsoft

Enjoy!