There are several ways to generate Parquet files from SQL Server.
- PolyBase can write Parquet files to S3 or Azure storage, but we can't specify the generated file names (see the CETAS sketch after this list). Refer to SQL Export to parquet file.
- An Azure Data Factory (ADF) or Fabric pipeline Copy Activity can generate a Parquet file with a specific file name.
- A Python script can generate a Parquet file with a specific name.
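As a rough illustration of the PolyBase behavior, here is a minimal CETAS (CREATE EXTERNAL TABLE AS SELECT) sketch. It assumes SQL Server 2022 with PolyBase enabled and an external data source and Parquet file format already created; MyBlobSource and ParquetFormat are placeholder names. We choose the target folder in LOCATION, but SQL Server generates the names of the .parquet files it writes there.
CREATE EXTERNAL TABLE dbo.CompletedOrders_Export
WITH (
    LOCATION = '/exports/completed_orders/',   -- folder we choose; file names inside are generated
    DATA_SOURCE = MyBlobSource,                -- assumed pre-created external data source
    FILE_FORMAT = ParquetFormat                -- assumed: CREATE EXTERNAL FILE FORMAT ... FORMAT_TYPE = PARQUET
)
AS
SELECT SalesID, OrderDate, Amount
FROM Sales.Orders
WHERE Status = 'Completed';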
The following examples use a Python script in SQL to generate the Parquet file.
Simple example: export to a local file
EXEC sp_execute_external_script
@language = N'Python',
@script = N'
import pandas as pd
# The data from @input_data_1 is automatically passed as "InputDataSet"
df = InputDataSet
# Define your specific path and filename
# Ensure the SQL Service account has WRITE permissions to this folder
target_file = "C:/Exports/My_Custom_Report.parquet"
# Save to Parquet
df.to_parquet(target_file, engine="pyarrow", compression="snappy")
',
@input_data_1 = N'SELECT SalesID, OrderDate, Amount FROM Sales.Orders WHERE Status = ''Completed''';
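If the file name needs to vary per run (for example, a date stamp), one option is to pass it into the script as a parameter instead of hardcoding it. Below is a minimal sketch of the same export, assuming the Sales.Orders source above; the @target_file parameter and its path are placeholders.
DECLARE @target_file nvarchar(400) = N'C:/Exports/Completed_Orders_' + CONVERT(char(8), GETDATE(), 112) + N'.parquet';
EXEC sp_execute_external_script
@language = N'Python',
@script = N'
# Parameters declared in @params appear in Python as plain variables (target_file)
InputDataSet.to_parquet(target_file, engine="pyarrow", compression="snappy")
',
@input_data_1 = N'SELECT SalesID, OrderDate, Amount FROM Sales.Orders WHERE Status = ''Completed''',
@params = N'@target_file nvarchar(400)',
@target_file = @target_file;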
More efficient method
A more efficient way to use Python, when many query results need to be exported, is to use the pyodbc and pyarrow libraries. Because SQL Server has to launch the Launchpad service and initialize a new Python runtime environment for every single call, we make one call and let Python loop through all of the queries instead.
EXEC sp_execute_external_script
@language = N'Python',
@script = N'
import pandas as pd
import pyodbc
# 1. Define Connection Strings
# Use "Trusted_Connection=yes" to use the SQL Service Account identity
conn_str = "DRIVER={SQL Server};SERVER=YourServerName\InstanceB;DATABASE=SalesDB;Trusted_Connection=yes"
cnxn = pyodbc.connect(conn_str)
# 2. Define your list of queries and target filenames
tasks = [
{"sql": "SELECT * FROM Orders", "file": "C:/Exports/Orders.parquet"},
{"sql": "SELECT * FROM Customers", "file": "C:/Exports/Customers.parquet"},
{"sql": "SELECT * FROM Products", "file": "C:/Exports/Products.parquet"}
]
# 3. Loop through and process
for task in tasks:
print(f"Processing {task[''file'']}...")
df = pd.read_sql(task["sql"], cnxn)
df.to_parquet(task["file"], engine="pyarrow")
cnxn.close()
'
The above examples generate the Parquet file locally. We can also write the Parquet file to an S3 bucket or Azure Blob Storage without using PolyBase.
We use Azure Key Vault to store the credentials.
Upload to Azure Blob
For Azure Storage:
--To install the required libraries, navigate to your SQL Python folder and run: .\python.exe -m pip install azure-identity azure-keyvault-secrets azure-storage-blob pyarrow pandas
--To access Azure Storage, give the App Registration the Storage Blob Data Contributor role
EXEC sp_execute_external_script
@language = N'Python310',
@script = N'
import pandas as pd
import io
from azure.identity import ClientSecretCredential
from azure.keyvault.secrets import SecretClient
from azure.storage.blob import BlobServiceClient
# 1. Identity of the script (App Registration details)
# tenant_id and client_id are identifiers, not secrets; the client secret is the only sensitive value kept here
tenant_id = "your-tenant-id"
client_id = "your-app-registration-client-id"
client_secret = "your-app-registration-secret"
vault_url = "https://your-vault-name.vault.azure.net/"
# 2. Authenticate to Key Vault
credential = ClientSecretCredential(tenant_id, client_id, client_secret)
kv_client = SecretClient(vault_url=vault_url, credential=credential)
# 3. Fetch the STORAGE SECRET from Key Vault
# This is where the actual sensitive ADLS key is stored
storage_key_secret = kv_client.get_secret("adls-storage-key")
storage_account_key = storage_key_secret.value
# 4. Initialize Storage Client using the secret fetched from Vault
account_url = "https://youraccount.blob.core.windows.net"
service_client = BlobServiceClient(account_url, credential=storage_account_key)
# 5. Process Data to Parquet
pq_buffer = io.BytesIO()
InputDataSet.to_parquet(pq_buffer, engine="pyarrow", index=False)
pq_buffer.seek(0)
# 6. Upload
blob_client = service_client.get_blob_client(container="my-container", blob="data.parquet")
blob_client.upload_blob(pq_buffer, overwrite=True)
print("Successfully fetched secret from Key Vault and uploaded Parquet to ADLS.")
',
@input_data_1 = N'SELECT * FROM YourSourceTable'
WITH RESULT SETS NONE;
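A side note on the Storage Blob Data Contributor role mentioned in the comment above: if that role has been granted, the App Registration itself can be used as the storage credential, and the account key (and the Key Vault lookup for it) can be skipped entirely. A minimal sketch, reusing the same placeholder IDs:
from azure.identity import ClientSecretCredential
from azure.storage.blob import BlobServiceClient

# Authenticate to storage directly as the App Registration (RBAC), no account key needed
credential = ClientSecretCredential("your-tenant-id", "your-app-registration-client-id", "your-app-registration-secret")
service_client = BlobServiceClient("https://youraccount.blob.core.windows.net", credential=credential)
# The rest of the upload (get_blob_client / upload_blob) is unchanged.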
Upload to AWS S3
For S3:
--To install the required libraries, navigate to your SQL Python folder and run: .\python.exe -m pip install azure-identity azure-keyvault-secrets boto3 pyarrow pandas
EXEC sp_execute_external_script
@language = N'Python310',
@script = N'
import pandas as pd
import io
import boto3
from azure.identity import ClientSecretCredential
from azure.keyvault.secrets import SecretClient
# 1. Azure Key Vault Authentication (App Registration)
tenant_id = "your-azure-tenant-id"
client_id = "your-azure-app-id"
client_secret = "your-azure-app-secret"
vault_url = "https://your-vault-name.vault.azure.net/"
az_cred = ClientSecretCredential(tenant_id, client_id, client_secret)
kv_client = SecretClient(vault_url=vault_url, credential=az_cred)
# 2. Retrieve AWS Credentials from Azure
aws_id = kv_client.get_secret("AWS-ACCESS-KEY-ID").value
aws_secret = kv_client.get_secret("AWS-SECRET-ACCESS-KEY").value
# 3. Initialize AWS S3 Client
s3_client = boto3.client(
"s3",
aws_access_key_id=aws_id,
aws_secret_access_key=aws_secret,
region_name="us-east-1"
)
# 4. Process Data to Parquet
pq_buffer = io.BytesIO()
InputDataSet.to_parquet(pq_buffer, engine="pyarrow", index=False)
pq_buffer.seek(0)
# 5. Upload to AWS S3
s3_client.put_object(
Bucket="your-s3-bucket-name",
Key="exports/data.parquet",
Body=pq_buffer.getvalue()
)
print("Successfully moved data from SQL to AWS S3 via Azure Key Vault.")
',
@input_data_1 = N'SELECT * FROM YourTable'
WITH RESULT SETS NONE;
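For very large exports, one alternative to put_object is boto3's managed transfer (upload_fileobj), which splits the upload into parts and retries them individually. A small sketch, with the same placeholder bucket and key:
# upload_fileobj performs a multipart upload automatically for large payloads
pq_buffer.seek(0)
s3_client.upload_fileobj(pq_buffer, "your-s3-bucket-name", "exports/data.parquet")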
Memory consideration
When using Python via sp_execute_external_script, be careful with large datasets passed through @input_data_1. By default, SQL Server allocates only 20% of the server's RAM to external scripts. If the dataset exceeds this limit, the Python process will crash with an "Out of Memory" (OOM) error. To work around the OOM error, use pyodbc to query the data directly inside the Python script instead of passing it through the @input_data_1 parameter; a sketch of this follows.
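The sketch below combines that pyodbc approach with chunked reads, so the full result set never has to sit in memory at once. This would be the body of the @script parameter; the server, table, and path names are placeholders.
import pandas as pd
import pyodbc
import pyarrow as pa
import pyarrow.parquet as pq

conn_str = r"DRIVER={SQL Server};SERVER=YourServerName;DATABASE=SalesDB;Trusted_Connection=yes"
cnxn = pyodbc.connect(conn_str)

writer = None
# Stream the query in chunks and append each chunk to the same Parquet file
for chunk in pd.read_sql("SELECT * FROM dbo.BigTable", cnxn, chunksize=100000):
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    if writer is None:
        writer = pq.ParquetWriter("C:/Exports/BigTable.parquet", table.schema, compression="snappy")
    writer.write_table(table)
if writer is not None:
    writer.close()
cnxn.close()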
To check or adjust the external resource pool memory setting:
-- View current limits
SELECT * FROM sys.resource_governor_external_resource_pools;
-- Increase memory limit to 50%
ALTER EXTERNAL RESOURCE POOL "default"
WITH (max_memory_percent = 50);
-- Apply the changes
ALTER RESOURCE GOVERNOR RECONFIGURE;
Certificate Method
The previous examples use a client secret. We can also use a certificate (see Generate Self-signed Certificate) for authentication.
To use a certificate, export the .pfx file and place it in a dedicated folder. For security reasons, apart from administrators, grant only the Launchpad service account and ALL APPLICATION PACKAGES READ access to this folder.
EXEC sp_execute_external_script
@language = N'Python310',
@script = N'
import io
import logging
import pandas as pd
from azure.core.exceptions import AzureError
from azure.identity import CertificateCredential
from azure.storage.blob import BlobServiceClient
# 1. Config
# Path to the .pfx file on the SQL Server disk
pfx_path = "C:/Path/To/Your/cert-name.pfx"
pfx_password = "YourSecurePassword"
tenant_id = "your-tenant-id"
client_id = "your-app-registration-id"
account_url = "https://yourstorage.blob.core.windows.net"
container = "your-container"
blobname = "exports/data.parquet"
# 2. Authenticate using the Certificate
# CertificateCredential handles the handshake with Entra ID
cert_cred = CertificateCredential(
tenant_id=tenant_id,
client_id=client_id,
certificate_path=pfx_path,
password=pfx_password
)
# 3. Upload to ADLS/Blob as Parquet
# Silence the verbose detail logging from the Azure SDK and only capture ERROR-level messages from it
logging.getLogger(''azure'').setLevel(logging.ERROR)
logging.getLogger(''azure.storage.blob'').setLevel(logging.ERROR)
logging.getLogger(''azure.core.pipeline.policies.http_logging_policy'').setLevel(logging.ERROR)
# Set up logging to track failures: INFO, ERROR, CRITICAL
logging.basicConfig(level=logging.INFO)
# logging.basicConfig(level=logging.ERROR)
# logging.basicConfig(level=logging.CRITICAL)
logger = logging.getLogger(__name__)
service_client = BlobServiceClient(account_url=account_url, credential=cert_cred)
pq_buffer = io.BytesIO()
try:
    # 1. Convert to Parquet locally (in memory)
    # If this fails, no network call is ever made.
    InputDataSet.to_parquet(pq_buffer, engine="pyarrow", index=False)
    pq_buffer.seek(0)
    blob_client = service_client.get_blob_client(container=container, blob=blobname)
    # 2. Upload with built-in retry settings
    # connection_timeout helps handle "hanging" connections
    logger.info(f"Uploading {blobname} to {container}...")
    blob_client.upload_blob(
        pq_buffer,
        overwrite=True,
        connection_timeout=14400,  # 4 hours for very large files
        max_concurrency=4          # Parallelize upload for speed/stability
    )
    logger.info("Upload completed successfully.")
    print("Authenticated with self-signed certificate and uploaded successfully.")
except Exception as e:
    logger.error(f"Critical failure during Parquet export: {e}")
    # 3. Cleanup: check whether a partial/corrupted file was left behind in Azure storage
    try:
        if "blob_client" in locals() and blob_client.exists():
            logger.warning("Partial file detected. Deleting corrupted blob to prevent errors...")
            blob_client.delete_blob()
    except AzureError as cleanup_error:
        logger.error(f"Failed to clean up corrupted blob: {cleanup_error}")
    # 4. Re-raise the exception so the orchestrator (like SQL Agent or ADF) knows it failed
    raise
finally:
    pq_buffer.close()
',
@input_data_1 = N'SELECT * FROM YourTable'
WITH RESULT SETS NONE;
Troubleshooting network issues
By default, Windows Firewall block rules are in place that prevent the Launchpad service from accessing the network.
To check whether Launchpad has network access:
EXEC sp_execute_external_script
@language = N'Python310',
@script = N'
import urllib.request
try:
    response = urllib.request.urlopen("https://login.microsoftonline.com", timeout=5)
    print("Network is OPEN")
except Exception as e:
    print(f"Network is STILL BLOCKED: {e}")
'
If the network is blocked, check Windows Firewall for block rules with names like "Block network access for AppContainer*<your instance name>" and disable them. SQL Server uses a pool of accounts (typically 20-21) to handle concurrent external script requests; each account gets its own AppContainer, and each AppContainer has one block rule created for it.
Use the following PowerShell command to disable all of those rules:
Get-NetFirewallRule -DisplayName "Block network access for AppContainer*SQLDEV_2019" | Disable-NetFirewallRule
One observation: after disabling the rules and making a network call, then re-enabling the block rules, the access was no longer blocked, even after rebooting the computer. It is not clear why, or whether this behavior is permanent.