Recently I was dealing with an issue where a Snowflake query against a Parquet file in an internal stage failed with the error below:
Error parsing the parquet file: Not yet implemented: Unsupported encoding. File 'path/to/external/table/partition/xxx-00001.snappy.parquet' Row 0 starts at line 0, column
After researching online, I figured out that it had nothing to do with Snowflake: the user had generated the Parquet file with the AWS Cost and Usage Report (CUR) tool, which produces a version of the Parquet format that Snowflake does not currently support.
This issue can be reproduced by simply using the parquet-tools utility (available online) to parse the Parquet file; it produces the same error:
parquet-tools csv xxx-00001.snappy.parquet
Traceback (most recent call last):
  File "/usr/local/bin/parquet-tools", line 10, in <module>
    sys.exit(main())
  File "/Library/Python/3.8/site-packages/parquet_tools/cli.py", line 26, in main
    args.handler(args)
  File "/Library/Python/3.8/site-packages/parquet_tools/commands/csv.py", line 44, in _cli
    with get_datafame_from_objs(pfs, args.head) as df:
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/Library/Python/3.8/site-packages/parquet_tools/commands/utils.py", line 190, in get_datafame_from_objs
    df: Optional[pd.DataFrame] = stack.enter_context(pf.get_dataframe())
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/contextlib.py", line 425, in enter_context
    result = _cm_type.__enter__(cm)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/Library/Python/3.8/site-packages/parquet_tools/commands/utils.py", line 71, in get_dataframe
    yield pq.read_table(local_path).to_pandas()
  File "/Library/Python/3.8/site-packages/pyarrow/parquet.py", line 1550, in read_table
    return pf.read(columns=columns, use_threads=use_threads,
  File "/Library/Python/3.8/site-packages/pyarrow/parquet.py", line 1274, in read
    table = piece.read(columns=columns, use_threads=use_threads,
  File "/Library/Python/3.8/site-packages/pyarrow/parquet.py", line 721, in read
    table = reader.read(**options)
  File "/Library/Python/3.8/site-packages/pyarrow/parquet.py", line 336, in read
    return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1130, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
OSError: Not yet implemented: Unsupported encoding.
This V2 version of the Parquet format was introduced by PARQUET-458, and it was not considered production ready at the time (in 2019). Even though we are now in 2021, I do not think it has been widely adopted, and Snowflake does not support it yet.
For the time being, until support is added, if you are using the CUR tool to generate data for Snowflake, please choose CSV or JSON output instead to avoid this issue.