John Liu Blog

Every drop counts

SQL Generate Parquet File

There are several ways to generate Parquet file from SQL. PolyBase can generate Parquet file to S3 or Azure but we can’t specify the file name generated. Refer to SQL Export to parquet file. Azure Data Factory (ADF) or Fabric pipe line Copy Activity can generate Parquet file with specific file name. Python script can generate Parquet file with specific name. Following are examples using Python script in SQL to generate the Parquet file.

SQL Read Parquet File

There are several ways to read Parquet file within SQL Server. 1. Using OPENROWSET (SQL2022+) Starting with SQL Server 2022, we can use OPENROWSET to query Parquet files directly from Azure Blob Storage, ADLS Gen2, or S3-compatible storage without creating a permanent table first. -- 1. Create credential CREATE DATABASE SCOPED CREDENTIAL [MyAzureCredential] WITH IDENTITY = 'SHARED ACCESS SIGNATURE', SECRET = 'sv=2022-11-02&ss=b&srt=sco&sp=rwdl&se=2025-12-31...'; -- 2. Create the Data Source pointing to your container CREATE EXTERNAL DATA SOURCE MyCloudLogs WITH ( LOCATION = 'abs://yourcontainer@yourstorageaccount.

SQL export to Parquet file

In SQL 2022+, we can export SQL query results directly to a Parquet file using the CETAS (Create External Table As Select) statement using PolyBase. PolyBase exists in SQL Server 2016+ and Azure SQL Database, but CETAS is only supported in SQL Server 2022+ and Azure SQL Database. Here is an example of how to export SQL query results to a Parquet file stored in an S3 bucket or Azure storage:

SQL Server Change Data Capture (CDC)

CDC (Change Data Capture) was introduced in SQL2008 Enterprise only, only available to Standard edtions since SQL2016 SP1. It’s not available in Express edtion as Express edtion doesn’t have SQL Agent. CDC involves three-tier configuration: Server-level (agent job), Database-level and Table-level. It works in all recovery modes. When using CDC, we need to minotr log file growth and adjust retention period, especially if not using FULL recovery mode. CDC is like a video recording of every single change, where as Change Tracking (CT) is more like a “motion sensor”.

SQL Server Change Tracking (CT)

Change Tracking (CT) introduced in SQL2008, all edtitions, includes Express. CT is more like a “motion sensor”. CT tells you which rows changed and what the latest values are, but id doesn’t keep a history of every intermediate step. CDC is like a video recording of every single change. CT is much more lightweight than CDC. CT can only be enabled on table with PK. It’s a sync operation (overhead on every transaction).