SQL export to Parquet file

Posted by John Liu on Thursday, December 25, 2025

In SQL Server 2022 and later, we can export SQL query results directly to a Parquet file using the CETAS (CREATE EXTERNAL TABLE AS SELECT) statement, which runs on PolyBase. PolyBase has been available since SQL Server 2016, but CETAS is only supported in SQL Server 2022+ and Azure SQL Managed Instance. Here is an example of how to export SQL query results to a Parquet file stored in an S3 bucket or Azure storage:

1. Enable PolyBase and the export feature in SQL Server
EXEC sp_configure 'polybase enabled', 1;
RECONFIGURE;

EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;
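
Both options can be confirmed with a quick look at sys.configurations before moving on:

SELECT name, value_in_use
FROM sys.configurations
WHERE name IN ('polybase enabled', 'allow polybase export');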
2. Create security credentials for AWS/Azure if needed
-- Create a Master Key if you don't have one
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'YourStrongPassword123!';

-- Create the credential for S3
CREATE DATABASE SCOPED CREDENTIAL [S3_Parquet_Export]
WITH IDENTITY = 'S3 Access Key', 
SECRET = 'YOUR_ACCESS_KEY:YOUR_SECRET_KEY';

-- Create the credential for an Azure SAS token
-- (paste the SAS token without the leading '?')
CREATE DATABASE SCOPED CREDENTIAL [AzureStorageSASCredential]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 'st=2024-01-01...your_token_here...';

-- Create the credential for an Azure Service Principal
-- (the app registration's client ID and secret go into the identity/secret pair; the format matters)
CREATE DATABASE SCOPED CREDENTIAL [ADLS_ServicePrincipal_Cred]
WITH IDENTITY = 'AppID:ClientSecret',
SECRET = 'YOUR_APP_ID:YOUR_CLIENT_SECRET';
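
To double-check what was created, the database scoped credentials can be listed from the catalog:

SELECT name, credential_identity
FROM sys.database_scoped_credentials;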
3. Define external data source
-- For AWS S3 bucket:
CREATE EXTERNAL DATA SOURCE [MyS3Bucket]
WITH (
    LOCATION = 's3://<bucket_name>.s3.<region>.amazonaws.com',
    CREDENTIAL = [S3_Parquet_Export]
);

-- For Azure Blob Storage:
CREATE EXTERNAL DATA SOURCE [MyAzureBlob]
WITH (
    LOCATION = 'abs://my-container@mystorageaccount.blob.core.windows.net',
    CREDENTIAL = [AzureStorageSASCredential]
);

-- For Azure Data Lake Storage Gen2 using a SAS token:
CREATE EXTERNAL DATA SOURCE [MyAzureDataLake]
WITH (
    LOCATION = 'adls://container-name@storageaccount.dfs.core.windows.net',
    CREDENTIAL = [AzureStorageSASCredential]
);

-- For Azure Data Lake Storage Gen2 using a Service Principal
-- (named differently so it can coexist with the SAS version above):
CREATE EXTERNAL DATA SOURCE [MyAzureDataLakeSP]
WITH (
    LOCATION = 'adls://container-name@storageaccount.dfs.core.windows.net',
    CREDENTIAL = [ADLS_ServicePrincipal_Cred]
);
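
Once created, the data sources and the locations they point to can be reviewed with:

SELECT name, location
FROM sys.external_data_sources;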
4. Define Parquet file format

This tells SQL Server how to package the data. We can choose different compression types (Snappy is the standard for Parquet).

CREATE EXTERNAL FILE FORMAT [ParquetFormat]
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
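
Snappy is the default and the usual choice for Parquet, but the Gzip codec is also accepted if smaller files matter more than CPU time:

CREATE EXTERNAL FILE FORMAT [ParquetFormatGzip]
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);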
5. Run export
--export to S3
CREATE EXTERNAL TABLE [dbo].[Export_Data_S3]
WITH (
    LOCATION = '/exports/sales_2024/', -- Folder will be created in S3
    DATA_SOURCE = [MyS3Bucket],
    FILE_FORMAT = [ParquetFormat]
)
AS
SELECT * FROM [YourLocalTable]
WHERE TransactionDate >= '2024-01-01';

--export to Azure
CREATE EXTERNAL TABLE [dbo].[Export_Data_Azure]
WITH (
    LOCATION = '/exports/sales_2024/', -- Folder will be created in Azure
    DATA_SOURCE = [MyAzureBlob],
    FILE_FORMAT = [ParquetFormat]
)
AS
SELECT * FROM [YourLocalTable]
WHERE TransactionDate >= '2024-01-01';
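
CETAS also registers an external table over the files it just wrote, so the export can be spot-checked straight away:

SELECT TOP (10) * FROM [dbo].[Export_Data_Azure];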
6. Drop existing folder

SQL Server cannot “overwrite” an existing folder. If you run the CETAS command twice with the same LOCATION, it will fail. We must drop the external table and delete the S3/Azure folder, or use a new timestamped folder name (see the sketch after the cleanup script below).

-- Step 1: Remove the metadata link in SQL
IF EXISTS (SELECT * FROM sys.external_tables WHERE name = 'Export_Data_Azure')
    DROP EXTERNAL TABLE [dbo].[Export_Data_Azure];
-- Step 2: Delete the actual blobs in Azure using a PowerShell script (Az module)
# Example: clear a folder prefix in a container
$ctx = New-AzStorageContext -StorageAccountName "mystorageaccount" -UseConnectedAccount
Get-AzStorageBlob -Container "my-container" -Prefix "exports/sales_2024/" -Context $ctx | Remove-AzStorageBlob
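
Alternatively, skip the cleanup and write each run to a new timestamped folder by building the CETAS statement with dynamic SQL. A minimal sketch reusing the data source and file format defined above (the table and folder names are illustrative):

DECLARE @stamp nvarchar(20) = FORMAT(GETDATE(), 'yyyyMMdd_HHmmss');
DECLARE @sql nvarchar(max) = N'
CREATE EXTERNAL TABLE [dbo].[Export_Data_Azure_' + @stamp + N']
WITH (
    LOCATION = ''/exports/sales_2024_' + @stamp + N'/'',
    DATA_SOURCE = [MyAzureBlob],
    FILE_FORMAT = [ParquetFormat]
)
AS
SELECT * FROM [YourLocalTable]
WHERE TransactionDate >= ''2024-01-01'';';

EXEC sp_executesql @sql;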
7. Other requirements

Permissions:

For an AWS S3 bucket, the IAM user must have s3:PutObject and s3:ListBucket permissions on the target bucket.

For an Azure SAS token, Read, Write, and List permissions are required at the container level.

For Azure ADLS Gen2, Execute permission is required on all parent folders leading to the destination, plus Write permission on the destination folder itself.

TLS/SSL:

SQL Server requires an encrypted connection to S3. Ensure the SQL Server host has the necessary CA certificates to trust the AWS S3 endpoints.

ADLS Gen2 is strict. If you get a “Forbidden” error, ensure:

  1. The App has the Storage Blob Data Contributor role.
  2. If you have a firewall on your Storage Account, ensure your SQL Server’s IP is whitelisted or “Allow trusted Microsoft services” is enabled.
  3. For ADLS Gen2 specifically, the App needs Execute permission on all parent folders in the path.
8. Export to a local file

PolyBase is built primarily for “external” cloud sources, and the Parquet file name it generates is a GUID. There are several ways to generate a Parquet file locally with a specific file name.

Method 1: Use a Python script

This method uses Machine Learning Services (sp_execute_external_script), available in SQL Server 2017 and later. It treats the result set as a pandas DataFrame and writes it directly to local disk.
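
Before running it, Machine Learning Services with Python must be installed and external scripts enabled (a one-time setup):

EXEC sp_configure 'external scripts enabled', 1;
RECONFIGURE WITH OVERRIDE;
-- a restart of the SQL Server service may be required on some versions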

EXEC sp_execute_external_script
  @language = N'Python',
  @script = N'
import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa
import os

# Create directory if it doesn''t exist
path = "C:/temp/Export"
if not os.path.exists(path):
    os.makedirs(path)

# Write the result set to Parquet
df = InputDataSet
table = pa.Table.from_pandas(df)
pq.write_table(table, os.path.join(path, "my_local_data.parquet"))
',
  @input_data_1 = N'SELECT * FROM YourTable';

A more efficient way to use Python, if you need to export many query results, is to use the pyodbc and pyarrow libraries directly. Because SQL Server has to launch the Launchpad service and initialize a new Python runtime environment for every single call, we make one call and let Python loop through all the queries instead.

EXEC sp_execute_external_script
  @language = N'Python',
  @script = N'
import pandas as pd
import pyodbc

# 1. Define the connection string
# Use "Trusted_Connection=yes" to use the SQL Server service account identity
conn_str = r"DRIVER={SQL Server};SERVER=YourServerName\InstanceB;DATABASE=SalesDB;Trusted_Connection=yes"
cnxn = pyodbc.connect(conn_str)

# 2. Define your list of queries and target filenames
tasks = [
    {"sql": "SELECT * FROM Orders", "file": "C:/Exports/Orders.parquet"},
    {"sql": "SELECT * FROM Customers", "file": "C:/Exports/Customers.parquet"},
    {"sql": "SELECT * FROM Products", "file": "C:/Exports/Products.parquet"}
]

# 3. Loop through and process
for task in tasks:
    print(f"Processing {task[''file'']}...")
    df = pd.read_sql(task["sql"], cnxn)
    df.to_parquet(task["file"], engine="pyarrow")

cnxn.close()
'

Method 2: The PolyBase “S3-Compatible” Trick

PolyBase does not have a native LOCATION = ‘C:...’ provider. To use PolyBase for local files, you typically have to point it to a local S3-compatible object store (like MinIO) running on the same server.

  1. Install MinIO (a tiny, free executable) and point it at C:\temp\Export.
  2. Configure PolyBase to treat MinIO as an S3-compatible source (using LOCATION = 's3://localhost:9000'), as sketched below.
  3. Run CETAS as if you were writing to AWS.
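
A rough sketch of step 2, assuming MinIO is listening on localhost:9000 with an access key/secret key pair you created (the names below are illustrative). Note that SQL Server expects an HTTPS endpoint, so MinIO needs TLS configured with a certificate the host trusts:

CREATE DATABASE SCOPED CREDENTIAL [MinIO_Cred]
WITH IDENTITY = 'S3 Access Key',
SECRET = 'minio_access_key:minio_secret_key';

CREATE EXTERNAL DATA SOURCE [LocalMinIO]
WITH (
    LOCATION = 's3://localhost:9000',
    CREDENTIAL = [MinIO_Cred]
);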

Method 3: Visual Studio Code (with the Data Wrangler extension from Microsoft)

Run the query and save the results as a CSV file. Open the CSV file with Data Wrangler and export it as a Parquet file.