19 November 2020

Recursive Query with CTE – How it works

By Eric Lin

Recursive query in SQL can be as useful as recursive functions that developers use all the time. Snowflake has documentations on how to use CTE to construct recursive queries. Example can be found here: Working with CTEs. It uses a simple company employees hierarchy to demonstrate how to run recursive queries, and it also describes in a bit of detail on how it works. However, the doc does not detail in what exactly happened when query was run. So I think it might be helpful to write a bit of my own understanding and share on my blog.

To start with, below are the queries from the official Snowflake doc that were used to demonstrate this. I will copy them below for your convenience:

CREATE OR REPLACE TABLE 
EMPLOYEES (
    TITLE VARCHAR, 
    EMPLOYEE_ID INTEGER, 
    MANAGER_ID INTEGER
);

INSERT INTO 
EMPLOYEES (TITLE, EMPLOYEE_ID, MANAGER_ID) VALUES
    ('President', 1, NULL),  -- THE PRESIDENT HAS NO MANAGER.
        ('Vice President Engineering', 10, 1),
            ('Programmer', 100, 10),
            ('QA Engineer', 101, 10),
        ('Vice President HR', 20, 1),
            ('Health Insurance Analyst', 200, 20);
-- This is the WITH clause, of course.
WITH RECURSIVE MANAGERS 
      -- Column names for the "view"/CTE
    (INDENT, EMPLOYEE_ID, MANAGER_ID, EMPLOYEE_TITLE) 
    AS
      -- Common Table Expression
      (

        -- Anchor Clause
        SELECT 
          '' AS INDENT, EMPLOYEE_ID, 
          MANAGER_ID, TITLE AS EMPLOYEE_TITLE
        FROM EMPLOYEES
        WHERE TITLE = 'President'

        UNION ALL

        -- Recursive Clause
        SELECT INDENT || '--- ',
          EMPLOYEES.EMPLOYEE_ID, 
          EMPLOYEES.MANAGER_ID, 
          EMPLOYEES.TITLE
        FROM EMPLOYEES 
        JOIN MANAGERS 
          ON EMPLOYEES.MANAGER_ID = MANAGERS.EMPLOYEE_ID
      )

 -- This is the "main select".
 SELECT 
   INDENT || EMPLOYEE_TITLE AS TITLE, 
   EMPLOYEE_ID, MANAGER_ID
 FROM MANAGERS;

+----------------------------------+-------------+------------+
| TITLE                            | EMPLOYEE_ID | MANAGER_ID |
|----------------------------------+-------------+------------|
| President                        |           1 |       NULL |
| --- Vice President Engineering   |          10 |          1 |
| --- Vice President HR            |          20 |          1 |
| --- --- Programmer               |         100 |         10 |
| --- --- QA Engineer              |         101 |         10 |
| --- --- Health Insurance Analyst |         200 |         20 |
+----------------------------------+-------------+------------+

In this simple example, we can see that there are 3 levels in the hierarchy, so it is clear that there will be 3 JOINs combined with 2 unions, to concatenate the result of those 3 JOINs. Let’s see what happens here step by step.

Firstly, we need to find the top level employee as the starting point. In this case is the President. The query is obvious as we have seen from the example above:

SELECT 
    TITLE AS EMPLOYEE_TITLE, 
    EMPLOYEE_ID, 
    MANAGER_ID
FROM EMPLOYEES
WHERE TITLE = 'President';

+----------------+-------------+------------+
| EMPLOYEE_TITLE | EMPLOYEE_ID | MANAGER_ID |
|----------------+-------------+------------|
| President      |           1 |       NULL |
+----------------+-------------+------------+

Secondly, after we have found the President, we need to find the employees who are immediately below President. This is done by a simple JOIN using employees table with the President data that we have got from step 1. Query as below:

SELECT 
    '--- ' || VICE_PRESIDENT.TITLE AS EMPLOYEE_TITLE,
    VICE_PRESIDENT.EMPLOYEE_ID, 
    VICE_PRESIDENT.MANAGER_ID
FROM (
    SELECT 
        '' AS INDENT, 
        EMPLOYEE_ID, 
        MANAGER_ID, 
        TITLE AS EMPLOYEE_TITLE
    FROM EMPLOYEES
    WHERE TITLE = 'President'
) AS PRESIDENT
JOIN EMPLOYEES VICE_PRESIDENT
ON VICE_PRESIDENT.MANAGER_ID = PRESIDENT.EMPLOYEE_ID;

+--------------------------------+-------------+------------+
| EMPLOYEE_TITLE                 | EMPLOYEE_ID | MANAGER_ID |
|--------------------------------+-------------+------------|
| --- Vice President Engineering |          10 |          1 |
| --- Vice President HR          |          20 |          1 |
+--------------------------------+-------------+------------+

Once this step is done, we have two data sets ready, one for the President, and another for Vice President, so Snowflake will combine them now with UNION ALL as below:

SELECT 
    TITLE AS EMPLOYEE_TITLE, 
    EMPLOYEE_ID, 
    MANAGER_ID
FROM EMPLOYEES
WHERE TITLE = 'President'

UNION ALL

SELECT 
    '--- ' || VICE_PRESIDENT.TITLE AS EMPLOYEE_TITLE,
    VICE_PRESIDENT.EMPLOYEE_ID, 
    VICE_PRESIDENT.MANAGER_ID
FROM (
    SELECT 
        '' AS INDENT, 
        EMPLOYEE_ID, 
        MANAGER_ID, 
        TITLE AS EMPLOYEE_TITLE
    FROM EMPLOYEES
    WHERE TITLE = 'President'
) AS PRESIDENT
JOIN EMPLOYEES VICE_PRESIDENT
ON VICE_PRESIDENT.MANAGER_ID = PRESIDENT.EMPLOYEE_ID;

This combined resultset is then saved and ready to be combined again after next iteration finishes.

Now, we have found the Vice Presidents, we can then use it to find all the employees whom they manage. This, again, can be achieved by JOINing back with employees table to find out whose manager_ids are 10 and 20:

SELECT 
    '------ ' || DEVELOPERS.TITLE AS EMPLOYEE_TITLE,
    DEVELOPERS.EMPLOYEE_ID, 
    DEVELOPERS.MANAGER_ID
FROM (
    SELECT 
        '' AS INDENT, 
        EMPLOYEE_ID, 
        MANAGER_ID, 
        TITLE AS EMPLOYEE_TITLE
    FROM EMPLOYEES
    WHERE TITLE = 'President'
) AS PRESIDENT
JOIN EMPLOYEES VICE_PRESIDENT
ON VICE_PRESIDENT.MANAGER_ID = PRESIDENT.EMPLOYEE_ID
JOIN EMPLOYEES DEVELOPERS
ON DEVELOPERS.MANAGER_ID = VICE_PRESIDENT.EMPLOYEE_ID
;

+---------------------------------+-------------+------------+
| EMPLOYEE_TITLE                  | EMPLOYEE_ID | MANAGER_ID |
|---------------------------------+-------------+------------|
| ------ Programmer               |         100 |         10 |
| ------ QA Engineer              |         101 |         10 |
| ------ Health Insurance Analyst |         200 |         20 |
+---------------------------------+-------------+------------+

Again, the resultset returned at this stage will be combined with the previous resultset.

At this stage, we have got all the employees data, including President, Vice President and Developers. However, Snowflake does not know we have done yet. It will have to perform another call, so find the employees who are managed by those Programmer/QA Engineer and Analyst. So the query looks something like below:

SELECT 
    '------ ' || DEVELOPERS.TITLE AS EMPLOYEE_TITLE,
    DEVELOPERS.EMPLOYEE_ID, 
    DEVELOPERS.MANAGER_ID
FROM (
    SELECT 
        '' AS INDENT, 
        EMPLOYEE_ID, 
        MANAGER_ID, 
        TITLE AS EMPLOYEE_TITLE
    FROM EMPLOYEES
    WHERE TITLE = 'President'
) AS PRESIDENT
JOIN EMPLOYEES VICE_PRESIDENT
ON VICE_PRESIDENT.MANAGER_ID = PRESIDENT.EMPLOYEE_ID
JOIN EMPLOYEES DEVELOPERS
ON DEVELOPERS.MANAGER_ID = VICE_PRESIDENT.EMPLOYEE_ID
JOIN EMPLOYEES MORE_EMPLOYEES
ON MORE_EMPLOYEES.MANAGER_ID = DEVELOPERS.EMPLOYEE_ID
;

+----------------+-------------+------------+
| EMPLOYEE_TITLE | EMPLOYEE_ID | MANAGER_ID |
|----------------+-------------+------------|
+----------------+-------------+------------+

We can see that this time, the resultset returned is empty, now Snowflake understands that we have reached the end of the recursive call. So the combined UNION ALL resultsets will be made available to the outside SELECT query.

Of course, this is not the exact query that will be run by Snowflake behind the scene, but you get the idea of what had happened.

So to conclude, the effect of the CTE recursive query that was used in the beginning of the post can be rewritten as below UNION ALL query:

SELECT 
    TITLE AS EMPLOYEE_TITLE, 
    EMPLOYEE_ID, 
    MANAGER_ID
FROM EMPLOYEES
WHERE TITLE = 'President'

UNION ALL

SELECT 
    '--- ' || VICE_PRESIDENT.TITLE AS EMPLOYEE_TITLE,
    VICE_PRESIDENT.EMPLOYEE_ID, 
    VICE_PRESIDENT.MANAGER_ID
FROM (
    SELECT 
        '' AS INDENT, 
        EMPLOYEE_ID, 
        MANAGER_ID, 
        TITLE AS EMPLOYEE_TITLE
    FROM EMPLOYEES
    WHERE TITLE = 'President'
) AS PRESIDENT
JOIN EMPLOYEES VICE_PRESIDENT
ON VICE_PRESIDENT.MANAGER_ID = PRESIDENT.EMPLOYEE_ID

UNION ALL

SELECT 
    '--- --- ' || DEVELOPERS.TITLE AS EMPLOYEE_TITLE,
    DEVELOPERS.EMPLOYEE_ID, 
    DEVELOPERS.MANAGER_ID
FROM (
    SELECT 
        '' AS INDENT, 
        EMPLOYEE_ID, 
        MANAGER_ID, 
        TITLE AS EMPLOYEE_TITLE
    FROM EMPLOYEES
    WHERE TITLE = 'President'
) AS PRESIDENT
JOIN EMPLOYEES VICE_PRESIDENT
ON VICE_PRESIDENT.MANAGER_ID = PRESIDENT.EMPLOYEE_ID
JOIN EMPLOYEES DEVELOPERS
ON DEVELOPERS.MANAGER_ID = VICE_PRESIDENT.EMPLOYEE_ID
;

+----------------------------------+-------------+------------+
| EMPLOYEE_TITLE                   | EMPLOYEE_ID | MANAGER_ID |
|----------------------------------+-------------+------------|
| President                        |           1 |       NULL |
| --- Vice President Engineering   |          10 |          1 |
| --- Vice President HR            |          20 |          1 |
| --- --- Programmer               |         100 |         10 |
| --- --- QA Engineer              |         101 |         10 |
| --- --- Health Insurance Analyst |         200 |         20 |
+----------------------------------+-------------+------------+

You can compare the result we got using the recursive query above to confirm the output:

+----------------------------------+-------------+------------+
| TITLE                            | EMPLOYEE_ID | MANAGER_ID |
|----------------------------------+-------------+------------|
| President                        |           1 |       NULL |
| --- Vice President Engineering   |          10 |          1 |
| --- Vice President HR            |          20 |          1 |
| --- --- Programmer               |         100 |         10 |
| --- --- QA Engineer              |         101 |         10 |
| --- --- Health Insurance Analyst |         200 |         20 |
+----------------------------------+-------------+------------+

This concludes this post. In my next post, I will discuss using another type of recursive query using START WITH …. CONNECT BY to achieve the same goal.

Comments are welcome!