23 August 2020

Loading Data into Snowflake

By Eric Lin

Welcome to my first Snowflake blog post. Since I only started as a Cloud Data Warehouse Engineer two weeks ago, Snowflake is still very new to me. I have spent some time playing around with the Snowflake interface, learning how to load data from the local file system into Snowflake and then query it from the Snowflake Web UI. I have noticed that although loading data is straightforward, Snowflake differs in fundamental ways from both traditional RDBMSs and Hadoop’s Hive/Impala, so I would like to start my blog with some basic information on how data loading works in Snowflake.

Since Snowflake is an enterprise SaaS, you cannot download it and try it out on your own machine. In this post, I will assume that you either have a Snowflake trial account, or are a Snowflake customer with access to a Snowflake Data Warehouse, so that you can follow along with the steps.

Before I start, there is one important Snowflake concept that I want to introduce. It is called a STAGE, and the process of loading data into those stages is called staging.

Think of a stage as an intermediate storage location that Snowflake reads from in order to load data into a table. Please check the diagram below:

There are two types of stages: internal and external. An Internal Stage is managed by Snowflake; users have no control over it, and I believe it uses S3 as the default storage. An External Stage, however, can be an S3, GCP or Azure location, and Snowflake needs to be set up with credentials (such as tokens and secret keys) to read data from those cloud providers in order to load it into the final table. I will cover this in later posts.
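Just as a preview of what an external stage looks like, the sketch below defines one backed by an S3 bucket. The stage name, bucket URL and credential values here are placeholders I made up for illustration, not real values; I will walk through external stages properly in a later post.

```sql
-- Hypothetical external stage backed by S3; the URL and keys are placeholders.
CREATE STAGE "FLIGHT_DELAY"."PUBLIC".FLIGHT_S3_STAGE
  URL = 's3://my-bucket/flight-data/'
  CREDENTIALS = (AWS_KEY_ID = '<your_key_id>' AWS_SECRET_KEY = '<your_secret_key>');
```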

Let’s start with the example below. Since my preferred way is the command line, I will use SnowSQL, the command line utility, to perform these steps. In this example, I will again use the FLIGHT table from Kaggle’s Flight Delay data as a sample (the one I used in my Impala Profile post), since it is a larger dataset.

There are a few steps we need to perform for the data loading:

  1. Create the table in Snowflake
  2. Split the data into multiple files, which allows Snowflake to perform multi-threaded loading
  3. Create a File Format object so that Snowflake knows how to interpret the files
  4. Create a Stage to hold the files that will be loaded
  5. Load the data files into the Stage
  6. Load the data from the Stage into the Snowflake table

Step 1: create a table in Snowflake. You can run the query below, which will create a FLIGHT table under the default “PUBLIC” schema in a database I created called FLIGHT_DELAY:

CREATE OR REPLACE TABLE FLIGHT (
	YEAR NUMBER(38,0),
	MONTH NUMBER(38,0),
	DAY NUMBER(38,0),
	DAY_OF_WEEK NUMBER(38,0),
	AIRLINE VARCHAR(2),
	FLIGHT_NUMBER NUMBER(38,0),
	TAIL_NUMBER VARCHAR(6),
	ORIGIN_AIRPORT VARCHAR(10),
	DESTINATION_AIRPORT VARCHAR(10),
	SCHEDULED_DEPARTURE VARCHAR(4),
	DEPARTURE_TIME VARCHAR(4),
	DEPARTURE_DELAY NUMBER(38,0),
	TAXI_OUT NUMBER(38,0),
	WHEELS_OFF VARCHAR(4),
	SCHEDULED_TIME NUMBER(38,0),
	ELAPSED_TIME NUMBER(38,0),
	AIR_TIME NUMBER(38,0),
	DISTANCE NUMBER(38,0),
	WHEELS_ON VARCHAR(4),
	TAXI_IN NUMBER(38,0),
	SCHEDULED_ARRIVAL VARCHAR(4),
	ARRIVAL_TIME VARCHAR(4),
	ARRIVAL_DELAY NUMBER(38,0),
	DIVERTED BOOLEAN,
	CANCELLED BOOLEAN,
	CANCELLATION_REASON VARCHAR(16777216),
	AIR_SYSTEM_DELAY NUMBER(38,0),
	SECURITY_DELAY NUMBER(38,0),
	AIRLINE_DELAY NUMBER(38,0),
	LATE_AIRCRAFT_DELAY NUMBER(38,0),
	WEATHER_DELAY NUMBER(38,0)
);

Step 2: decide how to split the file into multiple smaller ones, so that we can make use of multiple machines/hosts to load them in parallel. Since my flights.csv file is around 550MB, I want to split it into 6 files of around 100MB each. Be careful to split the file by lines, not by bytes: a byte-based split can cut a single line in two, which will break Snowflake’s load. I used the command below to split the file into chunks of 1M rows each:

split -l 1000000 flights.csv flights-

It generated the files below almost instantly:

-rw-r--r--  1 elin  staff  100654689 23 Aug 16:21 flights-aa
-rw-r--r--  1 elin  staff  101254821 23 Aug 16:22 flights-ab
-rw-r--r--  1 elin  staff  101416395 23 Aug 16:22 flights-ac
-rw-r--r--  1 elin  staff  101472800 23 Aug 16:22 flights-ad
-rw-r--r--  1 elin  staff  103703628 23 Aug 16:22 flights-ae
-rw-r--r--  1 elin  staff   83904258 23 Aug 16:22 flights-af
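If you want to double-check that a line-based split preserves every row, you can compare line counts before and after. The sketch below uses a small synthetic CSV rather than the real flights.csv, so it is safe to run anywhere; swap in your actual file and split size for real data.

```shell
# Sanity check: a line-based split should preserve the total row count.
# This uses a tiny synthetic file; substitute flights.csv and -l 1000000 for real data.
workdir=$(mktemp -d)
cd "$workdir"
seq 1 25 | sed 's/^/2015,11,9,/' > sample.csv   # 25-row sample file
split -l 10 sample.csv sample-                  # 10-row chunks: sample-aa, -ab, -ac
total=$(( $(wc -l < sample.csv) ))
parts=$(( $(cat sample-* | wc -l) ))
echo "original=$total parts=$parts"
[ "$total" -eq "$parts" ] && echo "row counts match"
```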

Step 3: create a File Format object in Snowflake. Simply run the query below in the SnowSQL client:

CREATE FILE FORMAT "FLIGHT_DELAY"."PUBLIC".FLIGHT_DATA_CSV 
TYPE = 'CSV' 
COMPRESSION = 'AUTO' 
FIELD_DELIMITER = ',' 
RECORD_DELIMITER = '\n' 
SKIP_HEADER = 1 
FIELD_OPTIONALLY_ENCLOSED_BY = 'NONE' 
TRIM_SPACE = FALSE 
ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE 
ESCAPE = 'NONE' 
ESCAPE_UNENCLOSED_FIELD = '\134' 
DATE_FORMAT = 'AUTO' 
TIMESTAMP_FORMAT = 'AUTO' 
NULL_IF = ('\\N');

If you open the flights.csv file, you can see that it is comma-delimited with the first line as a header, so we keep the default FIELD_DELIMITER of “,” and set SKIP_HEADER to 1 (the default is 0).

Step 4: create a Stage in Snowflake. Again, just run the query below:

CREATE STAGE "FLIGHT_DELAY"."PUBLIC".FLIGHT_STAGE_DATA;

Step 5: we are now ready to load our split files into the newly created Stage in Snowflake, using the PUT command:

PUT file:///Users/elin/Downloads/flight-data/flights/* @FLIGHT_STAGE_DATA auto_compress=true;

This might take some time, as we have a few files totalling 550MB+ to upload to Snowflake’s staging storage. Once it is done, SnowSQL will return the list of files uploaded, like my example below (notice that Snowflake compresses the data automatically):

+------------+---------------+-------------+-------------+--------------------+--------------------+----------+---------+
| source     | target        | source_size | target_size | source_compression | target_compression | status   | message |
|------------+---------------+-------------+-------------+--------------------+--------------------+----------+---------|
| flights-aa | flights-aa.gz |   100654689 |    32990377 | NONE               | GZIP               | UPLOADED |         |
| flights-ab | flights-ab.gz |   101254821 |    32795207 | NONE               | GZIP               | UPLOADED |         |
| flights-ac | flights-ac.gz |   101416395 |    33164114 | NONE               | GZIP               | UPLOADED |         |
| flights-ad | flights-ad.gz |   101472800 |    32997432 | NONE               | GZIP               | UPLOADED |         |
| flights-ae | flights-ae.gz |   103703628 |    32552157 | NONE               | GZIP               | UPLOADED |         |
| flights-af | flights-af.gz |    83904258 |    27037323 | NONE               | GZIP               | UPLOADED |         |
+------------+---------------+-------------+-------------+--------------------+--------------------+----------+---------+
6 Row(s) produced. Time Elapsed: 260.598s

We can also run the LIST command against the stage to confirm the upload. You can see that in order to reference a stage, we need the “@” prefix:

LIST @FLIGHT_STAGE_DATA;
+---------------------------------+----------+----------------------------------+-------------------------------+
| name                            |     size | md5                              | last_modified                 |
|---------------------------------+----------+----------------------------------+-------------------------------|
| flight_stage_data/flights-aa.gz | 32990384 | afcd62db42ef082a8b5005f6f2748627 | Sun, 23 Aug 2020 06:29:44 GMT |
| flight_stage_data/flights-ab.gz | 32795216 | d6a6ca7d7990d2793f36a7612f034b2c | Sun, 23 Aug 2020 06:27:36 GMT |
| flight_stage_data/flights-ac.gz | 33164128 | d6b5bcb67ce0aef0a4b649b66c00449c | Sun, 23 Aug 2020 06:26:54 GMT |
| flight_stage_data/flights-ad.gz | 32997440 | d8e9faadbf3ee1c3f151c39461e42cc9 | Sun, 23 Aug 2020 06:26:06 GMT |
| flight_stage_data/flights-ae.gz | 32552160 | a1c10853154a89cea808f070d96d02ae | Sun, 23 Aug 2020 06:28:28 GMT |
| flight_stage_data/flights-af.gz | 27037328 | 3d8959ee6461dd06af67fcb0cbca78df | Sun, 23 Aug 2020 06:29:06 GMT |
+---------------------------------+----------+----------------------------------+-------------------------------+
6 Row(s) produced. Time Elapsed: 0.616s

We can also query stage data just like a table. The only difference is that since there is no schema, there is no column information, so we reference columns by position rather than by name:

SELECT $1, $2, $3, $4, $5 FROM @FLIGHT_STAGE_DATA LIMIT 10;
+------+----+----+----+----+
| $1   | $2 | $3 | $4 | $5 |
|------+----+----+----+----|
| 2015 | 11 | 9  | 1  | AA |
| 2015 | 11 | 9  | 1  | AA |
| 2015 | 11 | 9  | 1  | AA |
| 2015 | 11 | 9  | 1  | AA |
| 2015 | 11 | 9  | 1  | AA |
| 2015 | 11 | 9  | 1  | AA |
| 2015 | 11 | 9  | 1  | AA |
| 2015 | 11 | 9  | 1  | AA |
| 2015 | 11 | 9  | 1  | AA |
| 2015 | 11 | 9  | 1  | AA |
+------+----+----+----+----+
10 Row(s) produced. Time Elapsed: 1.107s

I think you can guess how to retrieve the rest of the columns yourself.
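For example, you can cast the positional references to more useful types; the sketch below pulls a few columns, with positions following the column order of the CREATE TABLE statement above, and applies our CSV file format inline so the header row is skipped:

```sql
-- Positional stage columns with casts; $1..$18 follow the FLIGHT column order.
SELECT $1::NUMBER  AS YEAR,
       $5          AS AIRLINE,
       $6::NUMBER  AS FLIGHT_NUMBER,
       $18::NUMBER AS DISTANCE
FROM @FLIGHT_STAGE_DATA (FILE_FORMAT => 'FLIGHT_DATA_CSV')
LIMIT 10;
```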

Now that we have data in the stage area, we are ready to load it into our FLIGHT table. Run the query below:

COPY INTO FLIGHT
  FROM @FLIGHT_STAGE_DATA
  FILE_FORMAT = (format_name = FLIGHT_DATA_CSV)
  PATTERN = '.*flights-aa.*'
  ON_ERROR = 'skip_file';

It is pretty self-explanatory: we asked Snowflake to COPY data from the stage @FLIGHT_STAGE_DATA, taking files that match the pattern '.*flights-aa.*' and using the CSV file format FLIGHT_DATA_CSV that we created earlier. If Snowflake encounters any errors, it simply skips the whole file.

It will return how many rows were processed and, if there were any errors, which line and column they occurred on. Below is a successful output:

+---------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
| file                            | status | rows_parsed | rows_loaded | error_limit | errors_seen | first_error | first_error_line | first_error_character | first_error_column_name |
|---------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------|
| flight_stage_data/flights-aa.gz | LOADED |      999999 |      999999 |           1 |           0 | NULL        |             NULL |                  NULL | NULL                    |
+---------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
1 Row(s) produced. Time Elapsed: 7.926s

And if it fails, the output will look something like the example below, which tells you where the data has issues so that you can go back and check them:

+---------------------------------+-------------+-------------+-------------+-------------+-------------+-------------------------------------------------------------------------------------------+------------------+-----------------------+------------------------------------+
| file                            | status      | rows_parsed | rows_loaded | error_limit | errors_seen | first_error                                                                               | first_error_line | first_error_character | first_error_column_name            |
|---------------------------------+-------------+-------------+-------------+-------------+-------------+-------------------------------------------------------------------------------------------+------------------+-----------------------+------------------------------------|
| flight_stage_data/flights-aa.gz | LOAD_FAILED |     1041796 |           0 |           1 |           1 | End of record reached while expected to parse column '"FLIGHT"["WHEELS_ON":19]'           |          1041797 |                    69 | "FLIGHT"["WHEELS_ON":19]           |
| flight_stage_data/flights-ac.gz | LOAD_FAILED |     1033650 |           0 |           1 |           1 | End of record reached while expected to parse column '"FLIGHT"["WEATHER_DELAY":31]'       |          1033651 |                   103 | "FLIGHT"["WEATHER_DELAY":31]       |
| flight_stage_data/flights-af.gz | LOADED      |      664985 |      664985 |           1 |           0 | NULL                                                                                      |             NULL |                  NULL | NULL                               |
| flight_stage_data/flights-ad.gz | LOAD_FAILED |     1033819 |           0 |           1 |           1 | End of record reached while expected to parse column '"FLIGHT"["DESTINATION_AIRPORT":9]'  |          1033820 |                    29 | "FLIGHT"["DESTINATION_AIRPORT":9]  |
| flight_stage_data/flights-ae.gz | LOAD_FAILED |     1009429 |           0 |           1 |           1 | End of record reached while expected to parse column '"FLIGHT"["ARRIVAL_TIME":22]'        |          1009430 |                    82 | "FLIGHT"["ARRIVAL_TIME":22]        |
| flight_stage_data/flights-ab.gz | LOAD_FAILED |     1035400 |           0 |           1 |           1 | End of record reached while expected to parse column '"FLIGHT"["LATE_AIRCRAFT_DELAY":30]' |          1035401 |                    98 | "FLIGHT"["LATE_AIRCRAFT_DELAY":30] |
+---------------------------------+-------------+-------------+-------------+-------------+-------------+-------------------------------------------------------------------------------------------+------------------+-----------------------+------------------------------------+
6 Row(s) produced. Time Elapsed: 8.302s
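When loads fail like this, it can help to dry-run the COPY first. Snowflake’s COPY command supports a VALIDATION_MODE option that parses the staged files and reports errors without loading any rows, so you can inspect the problems before committing to a load. A minimal sketch using our stage and file format:

```sql
-- Dry run: parse the staged files and return errors without loading any rows.
COPY INTO FLIGHT
  FROM @FLIGHT_STAGE_DATA
  FILE_FORMAT = (FORMAT_NAME = FLIGHT_DATA_CSV)
  VALIDATION_MODE = 'RETURN_ERRORS';
```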

Snowflake also remembers which files it has already processed, so if we try again with the same file, Snowflake will ignore it to avoid double loading:

COPY INTO FLIGHT
 FROM @FLIGHT_STAGE_DATA
 FILE_FORMAT = (format_name = FLIGHT_DATA_CSV)
 PATTERN = '.*flights-aa.*'
 ON_ERROR = 'skip_file';
+---------------------------------------+
| status                                |
|---------------------------------------|
| Copy executed with 0 files processed. |
+---------------------------------------+
1 Row(s) produced. Time Elapsed: 1.152s

To load the rest of the staged files, simply remove the PATTERN so that all of them are picked up, and let Snowflake ignore the ones we have loaded before:

COPY INTO FLIGHT
FROM @FLIGHT_STAGE_DATA
FILE_FORMAT = (format_name = FLIGHT_DATA_CSV)
ON_ERROR = 'skip_file';
+---------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
| file                            | status | rows_parsed | rows_loaded | error_limit | errors_seen | first_error | first_error_line | first_error_character | first_error_column_name |
|---------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------|
| flight_stage_data/flights-af.gz | LOADED |      819079 |      819079 |           1 |           0 | NULL        |             NULL |                  NULL | NULL                    |
| flight_stage_data/flights-ab.gz | LOADED |      999999 |      999999 |           1 |           0 | NULL        |             NULL |                  NULL | NULL                    |
| flight_stage_data/flights-ac.gz | LOADED |      999999 |      999999 |           1 |           0 | NULL        |             NULL |                  NULL | NULL                    |
| flight_stage_data/flights-ad.gz | LOADED |      999999 |      999999 |           1 |           0 | NULL        |             NULL |                  NULL | NULL                    |
| flight_stage_data/flights-ae.gz | LOADED |      999999 |      999999 |           1 |           0 | NULL        |             NULL |                  NULL | NULL                    |
+---------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
5 Row(s) produced. Time Elapsed: 9.855s

This concludes my first post, about loading data into Snowflake from the local file system. In my next post, I will demonstrate how to load data into Snowflake from an external stage on AWS S3.

If you have any questions or feedback, please share them at the bottom of the page. Thanks in advance.