20 January 2021

Query Against Parquet File failed with error “Not yet implemented: Unsupported encoding”

By Eric Lin

Recently I was dealing with an issue where a Snowflake query against a Parquet file in an internal stage failed with the error below:

Error parsing the parquet file: Not yet implemented: Unsupported encoding. 
File 'path/to/external/table/partition/xxx-00001.snappy.parquet' Row 0 starts at line 0, column

After researching online, I figured out that it had nothing to do with Snowflake itself: the user had used the AWS Cost and Usage Report (CUR) tool to generate the Parquet file, and that file uses a version of the Parquet format that is currently NOT supported by Snowflake.

This issue can be reproduced by simply using parquet-tools (available online) to parse the Parquet file, which produces the same error:

parquet-tools csv xxx-00001.snappy.parquet
Traceback (most recent call last):
  File "/usr/local/bin/parquet-tools", line 10, in <module>
    sys.exit(main())
  File "/Library/Python/3.8/site-packages/parquet_tools/cli.py", line 26, in main
    args.handler(args)
  File "/Library/Python/3.8/site-packages/parquet_tools/commands/csv.py", line 44, in _cli
    with get_datafame_from_objs(pfs, args.head) as df:
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/Library/Python/3.8/site-packages/parquet_tools/commands/utils.py", line 190, in get_datafame_from_objs
    df: Optional[pd.DataFrame] = stack.enter_context(pf.get_dataframe())
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/contextlib.py", line 425, in enter_context
    result = _cm_type.__enter__(cm)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/Library/Python/3.8/site-packages/parquet_tools/commands/utils.py", line 71, in get_dataframe
    yield pq.read_table(local_path).to_pandas()
  File "/Library/Python/3.8/site-packages/pyarrow/parquet.py", line 1550, in read_table
    return pf.read(columns=columns, use_threads=use_threads,
  File "/Library/Python/3.8/site-packages/pyarrow/parquet.py", line 1274, in read
    table = piece.read(columns=columns, use_threads=use_threads,
  File "/Library/Python/3.8/site-packages/pyarrow/parquet.py", line 721, in read
    table = reader.read(**options)
  File "/Library/Python/3.8/site-packages/pyarrow/parquet.py", line 336, in read
    return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1130, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
OSError: Not yet implemented: Unsupported encoding.

This V2 version of the Parquet format was introduced by PARQUET-458, and it was not considered production ready at the time (2019). Even now in 2021, I do not think it has been widely adopted, and Snowflake is no exception.

For the time being, until support is added, if you are using the CUR tool to generate data, please use CSV or JSON output instead to avoid this issue in Snowflake.