Run SQL queries with DuckDB in Jupyter over Parquet files stored in Azure ADLS storage



You can use DuckDB to query Parquet files on S3 directly, thanks to the HTTPFS extension. Unfortunately, the same is not as easy for Azure's ADLS Gen2 Storage (Azure's S3-like service). However, thanks to the recently added support for Python's fsspec-compatible filesystem libraries, it can be done, and below are instructions on how to configure Jupyter for it.
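For comparison, the S3 case needs nothing beyond the HTTPFS extension (a minimal sketch; the bucket name and region are placeholders, and real credentials would be supplied with further SET s3_... options):

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")   # load DuckDB's HTTPFS extension
con.execute("SET s3_region = 'us-east-1';")   # placeholder region
# credentials would go here, e.g. SET s3_access_key_id / s3_secret_access_key
df = con.execute(
    "SELECT * FROM parquet_scan('s3://my-bucket/dataset/*.parquet') LIMIT 5"
).df()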

First, let's install all the required libraries:

pip install duckdb jupyterlab jupysql duckdb-engine pandas adlfs ipython-autotime

Then you can start the JupyterLab server by running:

jupyter lab

Create a notebook and execute these commands to configure the environment:

%load_ext autotime

This is not strictly necessary, but it will provide a nice timer under each cell.

Note: if you need more verbose logging, you can enable it by running:

import logging

logging.basicConfig(level=logging.DEBUG)

Next, let's load the JupySQL extension, which adds the %sql magic commands:

%load_ext sql

Useful settings that make the output more readable:

import pandas as pd

# show full DataFrame output without truncating rows, columns, or cell contents
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)
pd.set_option("display.max_colwidth", None)

# configure JupySQL to return data as a Pandas DataFrame and have less verbose output
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

The adlfs library supports several ways of authenticating to Azure. One simple option is to generate a SAS token with read-only permissions on the container.
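For example, instead of a SAS token you could authenticate with the storage account key or with an Azure AD service principal (a sketch with placeholder values, using parameters from the adlfs documentation):

from adlfs import AzureBlobFileSystem

# option 1: storage account key (placeholder values)
fs = AzureBlobFileSystem(
    account_name="my_storage_account",
    account_key="xxx",
)

# option 2: Azure AD service principal (placeholder values)
fs = AzureBlobFileSystem(
    account_name="my_storage_account",
    tenant_id="xxx",
    client_id="xxx",
    client_secret="xxx",
)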

Let’s finish configuration:

from adlfs import AzureBlobFileSystem
from sqlalchemy import create_engine

# if you need to go through a corporate proxy to reach the data, set:
# import os
# os.environ["http_proxy"] = "<some_url>"
# os.environ["https_proxy"] = "<some_url>"

fs = AzureBlobFileSystem(
    account_name="my_storage_account",
    sas_token="?xxx",
)

# create an in-memory DuckDB database and register the Azure filesystem with it
engine = create_engine("duckdb:///:memory:")
engine.raw_connection().register_filesystem(fs)

# register SQLAlchemy engine with JupySQL
%sql engine
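To sanity-check that the registration worked, you can, for example, list files through DuckDB's glob table function (the container and path below are placeholders; this should return the matching paths if the filesystem is wired up correctly):

%sql SELECT * FROM glob('abfs://container@my_storage_account.dfs.core.windows.net/directory/*')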

Check the DuckDB version or current settings:

%sql select version()
%sql select current_setting('threads');

You can use either %sql for single-line queries or %%sql for multi-line queries.

**Important note:** the protocol needs to be abfs://, not abfss:// as used in Apache Spark.

You can run queries like:

%%sql
SELECT *
FROM parquet_scan('abfs://container@my_storage_account.dfs.core.windows.net/directory/dataset/*/*.parquet', hive_partitioning=1)
LIMIT 5 
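With hive_partitioning=1 the partition directories are exposed as regular columns, so you can filter and aggregate on them and DuckDB will skip the files from other partitions. A sketch, assuming a hypothetical year partition column:

%%sql
SELECT year, count(*) AS row_count
FROM parquet_scan('abfs://container@my_storage_account.dfs.core.windows.net/directory/dataset/*/*.parquet', hive_partitioning=1)
WHERE year = 2023
GROUP BY year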

You can also save the result to a DataFrame:

%%sql df <<
SELECT *
FROM parquet_scan('xxx')

Any Python code has to go in a separate cell, since %%sql treats the whole cell as SQL. For example, to list the columns of the result:

for col in df.columns:
    print(col)