SQL Cells

Query data with DuckDB, name results, and reference them across cells.

Creating SQL Cells

In percent-format .py files, mark a cell as SQL with either syntax:

# %% [sql]
SELECT * FROM 'data/orders.parquet'
# %%sql
SELECT * FROM 'data/orders.parquet'

If you're coming from JupySQL, the %%sql cell magic is also recognized:

# %%
%%sql
SELECT * FROM 'data/orders.parquet'

All three produce the same result. In the UI, you can also change a cell's type using the cell type dropdown.

Built-in DuckDB

SQL cells run against a built-in DuckDB instance by default. Everything executes locally on your machine with no external database required. You can query parquet, CSV, and JSON files directly:

SELECT * FROM 'data/sales.parquet'
WHERE region = 'US'
ORDER BY amount DESC

DuckDB extensions (such as iceberg, spatial, and httpfs) work as they would in any DuckDB session. The internal DuckDB connection is shared across all SQL cells in the notebook, so tables and views created in one cell are available in others.
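For example, because the connection is shared, a view defined in one cell can be queried from a later cell (file path and column names here are illustrative):

```
# %%sql
CREATE VIEW us_orders AS
SELECT * FROM 'data/orders.parquet' WHERE region = 'US'
# %%sql
SELECT COUNT(*) FROM us_orders
```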

Naming Results

Give a SQL cell's output a name so you can reference it later. There are two ways:

In the cell header:

# %%sql orders <<
SELECT * FROM 'data/orders.parquet'

The << operator assigns the name orders to this cell's result.

Via the UI: Each SQL cell has a name input field in the toolbar. Type a name there and it works the same way.

Referencing Results in SQL

Once a result is named, other SQL cells can query it by name as if it were a table:

# %%sql orders <<
SELECT * FROM 'data/orders.parquet'
# %%sql
SELECT region, SUM(amount) AS total
FROM orders
GROUP BY region

Behind the scenes, pynb registers each named result as a DuckDB view backed by the Arrow file on disk. This means no data is copied into memory just to reference it.

Referencing Results in Python

Named SQL results are also available as variables in Python cells. When you access one, it auto-materializes into a Polars DataFrame:

# %%
import polars as pl

orders.filter(pl.col("amount") > 100).head(10)

The variable starts as a lazy reference (a LazyDF) that only reads the Arrow file from disk when you actually use it. This keeps memory usage low for large datasets. Calling .collect() or any DataFrame method triggers materialization.
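The lazy-handle idea can be sketched in a few lines of plain Python. This is illustrative only (pynb's LazyDF wraps an Arrow file and materializes into Polars; here a JSON file and a list stand in for both):

```python
import json
import pathlib
import tempfile

class LazyResult:
    """Minimal sketch of a lazy handle: holds a path, loads on first use."""

    def __init__(self, path):
        self._path = path
        self._data = None  # nothing is read until an attribute is accessed

    def _materialize(self):
        if self._data is None:
            # A real LazyDF would read an Arrow IPC file into a Polars DataFrame
            self._data = json.loads(pathlib.Path(self._path).read_text())
        return self._data

    def __getattr__(self, name):
        # Any attribute access materializes the data, then delegates to it
        return getattr(self._materialize(), name)

# Hypothetical file standing in for a stored SQL result
path = pathlib.Path(tempfile.mkdtemp()) / "orders.json"
path.write_text(json.dumps([120, 80, 300]))

orders = LazyResult(path)
n = orders.count(120)  # first access: reads the file, then delegates
```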

Arrow IPC Storage

SQL results are stored as Arrow IPC files in a temporary directory. This has several benefits:

  • Results persist across cell re-runs without re-executing queries
  • DuckDB reads the Arrow files directly for cross-cell references (zero-copy)
  • Polars reads the same files when materializing in Python
  • Large results don't sit in memory unless actively used

Environment Variables in SQL

SQL queries support environment variable expansion using ${VAR_NAME} syntax. Variables are read from secrets configured in Settings > Secrets and from the system environment:

SELECT * FROM read_parquet('s3://my-bucket/data.parquet')
WHERE api_key = ${API_KEY}

The value is automatically single-quoted and escaped to prevent SQL injection. If the variable is not set, the query fails with a clear error message.
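The quoting rule can be sketched in stdlib Python. This is an illustration of the escaping behavior, not pynb's internal code (the function name is hypothetical, and secrets lookup is reduced to a plain mapping): embedded single quotes are doubled, the whole value is wrapped in single quotes, and an unset variable raises instead of expanding to nothing.

```python
import os
import re

def expand_sql_vars(sql, env=None):
    """Expand ${VAR_NAME} in a SQL string into a quoted literal."""
    env = os.environ if env is None else env

    def repl(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name!r} is not set")
        # Double embedded single quotes, then wrap in single quotes,
        # so the value cannot break out of the string literal
        return "'" + env[name].replace("'", "''") + "'"

    return re.sub(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}", repl, sql)
```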