18 February 2021

File Pruning for External Tables

By Eric Lin

Currently Snowflake does not support pruning External Table files. To prove it, I have performed below test.

Assume that I have an external S3 stage setup correctly, which points to an external S3 location with a few CSV files, I create below external table:

CREATE OR REPLACE EXTERNAL TABLE EXT_TABLE_TEST (
    V VARCHAR AS (VALUE:C1::VARCHAR)
)
WITH LOCATION = @ERICLIN_S3_STAGE/test/
AUTO_REFRESH = TRUE
REFRESH_ON_CREATE = TRUE
FILE_FORMAT = (
    TYPE=CSV 
    COMPRESSION=NONE 
    REPLACE_INVALID_CHARACTERS = TRUE 
    FIELD_DELIMITER = '|' 
    ESCAPE_UNENCLOSED_FIELD = '\\'
 ;

The stage @ERICLIN_S3_STAGE/test contains a list of CSV files

And each file contains a few lines of data, whose content is not relevant here, so I will ignore them for now.

Now, if I run below query

SELECT 
  METADATA$FILENAME, *
FROM EXT_TABLE_TEST
WHERE METADATA$FILENAME ILIKE '%TEST5%';

You can see that I only tried to get data from CSV file test5.csv, but if you examine the query profile, you can see that all files were scanned.

In theory I would expect only one file (named partition in Snowflake) to be scanned.

This is because currently (at the time of writing, Snowflake is at version 5.3.1) the metadata$filename information is stored at Snowflake’s Compute Service layer, so while compiler performs the compilation at Cloud Services layer, it has no knowledge of the files the external table has, hence unable to prune the files/partitions.

For now, if files/partitions pruning is required for external tables, the tables need to be partitioned correctly, and then partition pruning will happen as normal, just like other tables, as the column information is stored in the Cloud Service level.

For details on how to partition Snowflake’s external tables, please refer to below official doc:

Introduction to External Tables – Partitioned External Tables