Saturday, 2 June 2018

Querying Parquet files in Blob storage

--Use Case

I have couple of Parquet format files in an Azure Blob Storage container which I want to use in my Azure SQL Data Warehouse. How can I query the content of that file through Azure SQL DW?

1) Credentials
We will create a credential which will be used to access the Azure Blob Storage. 

--Credential
CREATE DATABASE SCOPED CREDENTIAL BlobStorageCred
WITH
    IDENTITY = 'user',
    SECRET = '********==' -- Please copy the secret key from the storage account in Azure Portal.

;








2) External data source:
Now we can use the stored credentials to map to the Azure blob container path. This just tells where in the Storage account the data goes. The wasbs:// protocol is used to point blob storage path.







--  Example with sample Storage account,container name: mytest, storage account name: democa

CREATE EXTERNAL DATA SOURCE myexternalsource
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://mytest@democa.blob.core.windows.net',
    CREDENTIAL = BlobStorageCred

);

3) External File format :
Creates an External File Format object defining external data stored in Azure Blob Storage. Creating an external file format is a prerequisite for creating an External Table. By creating an External File Format, you specify the actual layout of the data referenced by an external table.

CREATE EXTERNAL FILE FORMAT parquetfile
WITH (  
    FORMAT_TYPE = PARQUET,  
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'  
/*Specifies the data compression method for the external data. When DATA_COMPRESSION isn't specified, the default is uncompressed data. You can choose from different data_compression options */

);  

4) External Table:
SQL Server uses external tables to access data stored in a Hadoop cluster or Azure blob storage. And Import and store data from Hadoop or Azure blob storage into your SQL Server database.

Note - The column datatype should match between parquet files stored in Blob storage and new external tables created in Azure SQL DWH.

CREATE EXTERNAL TABLE dbo.testdata (
registration_dttm DATETIME,
id INT,
first_name VARCHAR(500),
last_name VARCHAR(500),
email VARCHAR(500),
gender VARCHAR(5),
ip_address VARCHAR(500),
cc VARCHAR(500),
country VARCHAR(500),
birthdate VARCHAR(500),
salary FLOAT,
title VARCHAR(50),
comments VARCHAR(500))
WITH (
    LOCATION='/',
    DATA_SOURCE=myexternalsource, -- from step 2
    FILE_FORMAT=parquetfile1      -- from step 3

);


I have couple of files stored in Azure Blob storage as shown below. Hence used '/' in Location field , you can specify the individual file name (e,g; LOCATION='userdata1.parquet' , )


Fire SQL on your new external tables now.





















Summary:

This is my first blog related to Cloud Datawarehouse Solution where I tried to give very basic idea about querying Parquet files stored in Azure Blob Storage using Polybase technology in Azure SQL DWH.