Comparing Fabric Capacity Consumption – Notebook vs Warehouse SQL
I saw that a recent update to Microsoft Fabric makes it possible to use the Warehouse to copy data directly from OneLake into a Warehouse table.
This got me thinking: which approach would consume more capacity to get the data into the Warehouse table, and which one would be faster?
To test this, I am first going to run a SQL query in the Warehouse.
Next, I will use a Notebook to copy the same data from the OneLake Files section into a Warehouse table.
Sometimes it is more important to get the data in quickly, and other times it is more important to use less capacity.
Even though this is a Background process, enough Background processes can still affect your capacity.
NOTE: The tests below were done on sample data; it is recommended to test on your own environment.
Data Source
I got the data from an open source location which you can find here: fabric-samples-healthcare/analytics-bi-directlake-starschema/01-DownloadCMSDataCsvFiles.ipynb at main · isinghrana/fabric-samples-healthcare
I then downloaded this into my Lakehouse.
When I had a look, the total size of the CSV files was roughly 38 GB.
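If you want to check the size yourself, one way to do it (a rough sketch, assuming the Lakehouse holding the files is attached as the notebook's default lakehouse, the CSVs sit under Files/cms_raw, and the mssparkutils file system utilities are available) is to sum the file sizes reported for the folder:

# Rough sketch: sum the sizes of the downloaded CSV files in the Lakehouse Files area.
# Assumes the Lakehouse is the notebook's default lakehouse and the files live under
# Files/cms_raw (the same folder used later in the OPENROWSET query).
from notebookutils import mssparkutils

files = mssparkutils.fs.ls("Files/cms_raw")
total_bytes = sum(f.size for f in files if f.isFile)
print(f"Total CSV size: {total_bytes / (1024 ** 3):.1f} GB")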
SQL Query
Below is the SQL query I ran to transform the data from the CSV files into a Warehouse table.
NOTE: I specified the column names with their data types to ensure that the table creation would not fail due to data type issues.
Actual query
CREATE TABLE Fact_Sales AS
SELECT *
FROM OPENROWSET (
    BULK 'https://onelake.dfs.fabric.microsoft.com/b9443dd2-a4cb-0000-aaaa-9ceea1efbafb/3068ccb4-aa52-0000-aaaa-669e66c647d5/Files/cms_raw/*.csv',
    FORMAT = 'CSV',
    HEADER_ROW = TRUE
)
WITH (
    Prscrbr_NPI VARCHAR(1000),
    Prscrbr_Last_Org_Name VARCHAR(1000),
    Prscrbr_First_Name VARCHAR(1000),
    Prscrbr_City VARCHAR(1000),
    Prscrbr_State_Abrvtn VARCHAR(1000),
    Prscrbr_State_FIPS VARCHAR(1000),
    Prscrbr_Type VARCHAR(1000),
    Prscrbr_Type_Src VARCHAR(1000),
    Brnd_Name VARCHAR(1000),
    Gnrc_Name VARCHAR(1000),
    Tot_Clms VARCHAR(1000),
    Tot_30day_Fills VARCHAR(1000),
    Tot_Day_Suply VARCHAR(1000),
    Tot_Drug_Cst VARCHAR(1000),
    Tot_Benes VARCHAR(1000),
    GE65_Sprsn_Flag VARCHAR(1000),
    GE65_Tot_Clms VARCHAR(1000),
    GE65_Tot_30day_Fills VARCHAR(1000),
    GE65_Tot_Drug_Cst VARCHAR(1000),
    GE65_Tot_Day_Suply VARCHAR(1000),
    GE65_Bene_Sprsn_Flag VARCHAR(1000),
    GE65_Tot_Benes VARCHAR(1000)
) AS r;
I then ran the query, and in my opinion it read and wrote the data quite quickly.
As shown below, I could see from the Metrics App that it took 60 seconds to run and consumed 5,076 Capacity Units.
Time taken
PySpark Notebook
Next, I did the same activity, this time using a Spark Notebook.
Syntax found here
NOTE: To write to the Warehouse table from the notebook, the table has to be created in the Warehouse first.
If the table does not already exist in the Warehouse when you try to load it from the notebook, the write will fail.
I then wrote the dataframe to the Warehouse table; a sketch of the steps follows the reference link below.
Reference Link: Spark connector for Microsoft Fabric Data Warehouse – Microsoft Fabric | Microsoft Learn
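For reference, here is a minimal sketch of what the notebook steps could look like, based on the synapsesql connector described in the Microsoft Learn page above. The folder path, Warehouse name (WH_Demo) and table name (dbo.Fact_Sales_Notebook) are placeholders I made up; the sketch assumes the Lakehouse holding the CSV files is attached as the notebook's default lakehouse and that the target table has already been created in the Warehouse.

# Minimal sketch of the notebook approach (placeholder names, see note above).
# Runs in a Fabric PySpark notebook where `spark` is the session provided by the notebook.
import com.microsoft.spark.fabric  # registers the synapsesql() reader/writer on Spark

# Read all the CSV files from the Files section of the attached Lakehouse.
df = (
    spark.read
    .option("header", "true")      # first row holds the column names
    .csv("Files/cms_raw/*.csv")
)

# Write the dataframe into the existing Warehouse table.
# The target table is assumed to already exist in the Warehouse (see the NOTE above),
# so the write appends to it rather than creating it.
df.write.mode("append").synapsesql("WH_Demo.dbo.Fact_Sales_Notebook")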
As shown below, the notebook took longer to run at 281 seconds and consumed 4,218 Capacity Units.
Summary
When comparing running SQL vs a Notebook, the Notebook consumes fewer Capacity Units overall (4,218 vs 5,076).
If you are looking for the ingestion to be as fast as possible, SQL is faster (60 seconds vs 281 seconds), but it comes at a slightly higher capacity cost.
As always, thanks for reading. Any questions or comments are welcome.
Did you compare whether a SQL Warehouse query reading data from a table in the Lakehouse (e.g., a Delta file) is more expensive than doing the same in a PySpark Notebook?
That is a great point, let me give that a try and let you know.
This process would still require loading the CSV data to the Lakehouse table and then loading it from the Lakehouse table to the Warehouse table. It will be interesting to see how it compares.
I have two cases:
1) A set of CSVs on Azure storage, accessed through shortcuts.
2) Tables/parquet files loaded into the Lakehouse, reading data from MySQL through pipelines (yes, I should evaluate notebook alternatives here, but there are reasons to keep it as is for now).