Legible + efficient SQL with CTEs
Thanks to Jonny Hughes for the base image!
Howdy! I’m Paul, and I’m one of the newer faces here at iron.io. My plan is to blog, champion local meetups, and empower developers. That last one is a whiz-bang business term for “help folks learn how to do dead-simple job processing.”
Today, I’ll blog about an old love of mine, Common Table Expressions (CTEs). It’s a tool anyone who’s even sneezed near a SQL statement ought to know about. Just a select few SQL sorcerers seem to be aware of their existence. Let’s change that!
CTEs are an easy way to beautify and speed up SQL. They beautify by aliasing and separating small query chunks. They speed up the same way a temporary table would, but without the need for cleanup.
The net result of beauty and speed is maintainability. Which, is why I love CTEs.
Let’s take a look at a basic CTE (postgres syntax):
WITH table1 AS ( [SELECT statement here] ) , table2 AS ( [SELECT statement here] ) SELECT * FROM table1, table2 WHERE [...]
The basic components are:
-
- We tell the interpreter it’s a CTE by starting with
WITH
. CTEs are sometimes called WITH queries for this reason.
- We tell the interpreter it’s a CTE by starting with
-
- The names
table1
andtable2
are used as aliases. You can name just one table, or a bunch.
- The names
- The bottom-most
SELECT
is the parent query. It joins everything together.
For a more real world example, let’s take a gander at a public dataset. I’ve chosen the crunchbase data on modeanalytics. You can follow along by cloning the following query:
WITH first_acq_date AS ( SELECT company_permalink, min(acquired_quarter) AS min_acquired_quarter FROM tutorial.crunchbase_acquisitions GROUP BY company_permalink ) , raised_amount AS ( SELECT company_permalink, SUM(raised_amount_usd) FROM tutorial.crunchbase_investments WHERE raised_amount_usd is not null GROUP BY company_permalink ) , url AS ( SELECT permalink, homepage_url FROM tutorial.crunchbase_companies ) SELECT fad.*, ra.sum, url.homepage_url FROM first_acq_date fad, raised_amount ra, url WHERE fad.company_permalink = ra.company_permalink AND fad.company_permalink = url.permalink ORDER BY min_acquired_quarter, sum
The benefits of CTEs should be clear. If you name intelligently, aliasing each chunk of the query means CTEs are self-documenting. Martin Fowler is happy.
Follow the aliases in the above. We’re finding results for a company’s first acquisition date, the amount of funding raised, and a link to the company website. The parent query stitches the Frankenstein together, and voila! We’re done.
Thanks to Mode we get this snazzy time-series graph of capital raised, as measured in unicorns.
(click the image view Mode’s fancy graph hover effects)
Adding market cap as a value would be interesting. Good news – since we’re using a WITH query that’s as simple as aliasing market cap, and updating the parent (helpful aside – here’s a helpful tutorial of what the GROUP BY statement does).
So, when should you use CTEs? They shine in queries with two or more joins or sub-selects. I also think they’re a fantastic way to write self-joins. This blurb from the postgres docs reveals an important use case:
A useful property of WITH queries is that they are evaluated only once per execution of the parent query, even if they are referred to more than once by the parent query or sibling WITH queries. Thus, expensive calculations that are needed in multiple places can be placed within a WITH query to avoid redundant work.
In other words, CTEs are a great candidate for speeding up expensive joins.
CTEs made my life better. Now that you know about them, they should do the same for you. Go forth! Write beautiful less ugly SQL. Here are docs in various SQL dialects:
Redshift
Postgres
Oracle
Microsoft
Enjoy!