Nanosecond precision lost after Parquet file unload

When unloading TIMESTAMP data into a Parquet file, nanosecond and microsecond precision is lost; only millisecond precision is stored in the file.

To reproduce the issue, run the script below:

-- Load sample data into a table
CREATE OR REPLACE TABLE TEST (A TIMESTAMP(9));
INSERT INTO TEST VALUES ('2021-05-20 12:34:45.12434453');

-- Create a stage to store the Parquet file
CREATE OR REPLACE STAGE TEST;
COPY INTO @TEST FROM (
    SELECT $1::TIMESTAMP(9) FROM TEST
)
FILE_FORMAT = ( TYPE = PARQUET )
OVERWRITE = TRUE;

-- Create a Parquet file format so that we can read the Parquet file back
CREATE OR REPLACE FILE FORMAT MY_PARQUET_FORMAT
  TYPE = PARQUET
  COMPRESSION = SNAPPY;

-- Now read the data back from the Parquet file
SELECT $1 FROM @TEST
(FILE_FORMAT => 'MY_PARQUET_FORMAT');

-- It returns the following data:
{ "_COL_0": "2021-05-20 12:34:45.124" }

We can see that the fractional seconds .12434453 were truncated to .124.
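To confirm that the precision is lost during the unload rather than in the table itself, you can format the stored value with nanosecond precision. This is just an illustrative check against the TEST table created above:

-- The table still holds the full fractional seconds
SELECT TO_VARCHAR(A, 'YYYY-MM-DD HH24:MI:SS.FF9') FROM TEST;
-- 2021-05-20 12:34:45.124344530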

This is due to a limitation of the Parquet writer v1, which Snowflake currently uses. Refer to the following documentation:
https://docs.snowflake.com/en/user-guide/data-load-prepare.html#supported-file-formats

Specifically:

Currently, Snowflake supports the schema of Parquet files produced using the Parquet writer v1. Files produced using v2 of the writer are not supported.

Parquet writer v1 only supports storing timestamps up to millisecond precision. Nanosecond support was introduced in v2 via the following JIRA:
https://issues.apache.org/jira/browse/PARQUET-1387

So, until Snowflake supports the Parquet v2 writer, use other file formats such as CSV to unload timestamp data with nanosecond precision. One way to do this is shown in the sketch below.
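As a rough sketch of the CSV workaround (reusing the TEST table and TEST stage from the script above; the csv_unload/ path is just an example), the timestamp can be formatted explicitly with an FF9 format element so all nine fractional digits are written to the file:

-- Unload to CSV, formatting the timestamp with nanosecond precision
COPY INTO @TEST/csv_unload/ FROM (
    SELECT TO_VARCHAR(A, 'YYYY-MM-DD HH24:MI:SS.FF9') FROM TEST
)
FILE_FORMAT = ( TYPE = CSV )
OVERWRITE = TRUE;

-- Read it back; since no file format is attached to the stage, CSV defaults apply
SELECT $1 FROM @TEST/csv_unload/;
-- 2021-05-20 12:34:45.124344530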
