Comparing Fabric Capacity Consumption – Notebook vs Warehouse SQL
I saw that a recent update to Microsoft Fabric makes it possible to use the Warehouse to copy data directly from OneLake into a Warehouse table.
This got me thinking: which approach would consume more capacity to get the data into the Warehouse table, and which one would be faster?
To test this, I am first going to run a SQL query in the Warehouse.
Next, I will use a Notebook to copy the same data from the OneLake Files section into a Warehouse table.
Sometimes it is more important to get the data in quickly, and other times it is more important to use less capacity.
Even though this is a Background process, enough Background processes can still affect your capacity.
NOTE: The tests below were done on sample data; it is recommended to test on your own environment.
Data Source
I got the data from an open source location which you can find here: fabric-samples-healthcare/analytics-bi-directlake-starschema/01-DownloadCMSDataCsvFiles.ipynb at main · isinghrana/fabric-samples-healthcare
I then downloaded this into my Lakehouse.
When I had a look, the total size of the CSV files was roughly 38 GB.
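If you want to check the size yourself, one way to do it (a rough sketch, assuming the Lakehouse holding the files is attached as the notebook's default lakehouse, the CSVs sit under Files/cms_raw, and the mssparkutils file system utilities are available) is to sum the file sizes reported for the folder:

# Rough sketch: sum the sizes of the downloaded CSV files in the Lakehouse Files area.
# Assumes the Lakehouse is the notebook's default lakehouse and the files live under
# Files/cms_raw (the same folder used later in the OPENROWSET query).
from notebookutils import mssparkutils

files = mssparkutils.fs.ls("Files/cms_raw")
total_bytes = sum(f.size for f in files if f.isFile)
print(f"Total CSV size: {total_bytes / (1024 ** 3):.1f} GB")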
SQL Query
Below is the SQL query I ran to transform the data from the CSV files into a Warehouse table.
NOTE: I specified the column names with their data types to ensure that the table creation would not fail due to data type issues.
Actual query
CREATE TABLE Fact_Sales AS
SELECT *
FROM OPENROWSET (
    BULK 'https://onelake.dfs.fabric.microsoft.com/b9443dd2-a4cb-0000-aaaa-9ceea1efbafb/3068ccb4-aa52-0000-aaaa-669e66c647d5/Files/cms_raw/*.csv',
    FORMAT = 'CSV',
    HEADER_ROW = TRUE
)
WITH (
    Prscrbr_NPI VARCHAR(1000),
    Prscrbr_Last_Org_Name VARCHAR(1000),
    Prscrbr_First_Name VARCHAR(1000),
    Prscrbr_City VARCHAR(1000),
    Prscrbr_State_Abrvtn VARCHAR(1000),
    Prscrbr_State_FIPS VARCHAR(1000),
    Prscrbr_Type VARCHAR(1000),
    Prscrbr_Type_Src VARCHAR(1000),
    Brnd_Name VARCHAR(1000),
    Gnrc_Name VARCHAR(1000),
    Tot_Clms VARCHAR(1000),
    Tot_30day_Fills VARCHAR(1000),
    Tot_Day_Suply VARCHAR(1000),
    Tot_Drug_Cst VARCHAR(1000),
    Tot_Benes VARCHAR(1000),
    GE65_Sprsn_Flag VARCHAR(1000),
    GE65_Tot_Clms VARCHAR(1000),
    GE65_Tot_30day_Fills VARCHAR(1000),
    GE65_Tot_Drug_Cst VARCHAR(1000),
    GE65_Tot_Day_Suply VARCHAR(1000),
    GE65_Bene_Sprsn_Flag VARCHAR(1000),
    GE65_Tot_Benes VARCHAR(1000)
) AS r;
I then ran the query, and in my opinion it read and wrote the data quite quickly.
As shown below, I could see from the Metrics App that it took 60 seconds to run and consumed 5,076 Capacity Units.
Time taken
PySpark Notebook
Next, I did the same activity, this time using a Spark Notebook.
Syntax found here
NOTE: To write to the Warehouse table from the notebook, the table has to be created in the Warehouse first.
If the table does not already exist in the Warehouse when you try to load it from the notebook, the write will fail.
I then wrote the dataframe to the Warehouse table; a sketch of the steps follows the reference link below.
Reference Link: Spark connector for Microsoft Fabric Data Warehouse – Microsoft Fabric | Microsoft Learn
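For reference, here is a minimal sketch of what the notebook steps could look like, based on the synapsesql connector described in the Microsoft Learn page above. The folder path, Warehouse name (WH_Demo) and table name (dbo.Fact_Sales_Notebook) are placeholders I made up; the sketch assumes the Lakehouse holding the CSV files is attached as the notebook's default lakehouse and that the target table has already been created in the Warehouse.

# Minimal sketch of the notebook approach (placeholder names, see note above).
# Runs in a Fabric PySpark notebook where `spark` is the session provided by the notebook.
import com.microsoft.spark.fabric  # registers the synapsesql() reader/writer on Spark

# Read all the CSV files from the Files section of the attached Lakehouse.
df = (
    spark.read
    .option("header", "true")      # first row holds the column names
    .csv("Files/cms_raw/*.csv")
)

# Write the dataframe into the existing Warehouse table.
# The target table is assumed to already exist in the Warehouse (see the NOTE above),
# so the write appends to it rather than creating it.
df.write.mode("append").synapsesql("WH_Demo.dbo.Fact_Sales_Notebook")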
As shown below, the notebook took longer to run at 281 seconds and consumed 4,218 Capacity Units.
Summary
When comparing running SQL vs a Notebook, the Notebook consumes fewer Capacity Units overall (4,218 vs 5,076).
If you are looking for the ingestion to be as fast as possible, SQL is faster (60 seconds vs 281 seconds), but it comes at a slightly higher capacity cost.
As always, thanks for reading. Any questions or comments are welcome.
Did you compare whether a SQL Warehouse query reading data from a table in the Lakehouse (e.g., a Delta file) is more expensive than doing the same in a PySpark Notebook?
That is a great point, let me give that a try and let you know.
This process would still require loading the CSV data to the Lakehouse table and then loading it from the Lakehouse table to the Warehouse table. It will be interesting to see how it compares.
I have two cases:
1) A set of CSVs on Azure storage, accessed through shortcuts.
2) Tables/parquet files loaded into the Lakehouse, reading data from MySQL through pipelines (yes, I should evaluate notebook alternatives here, but there are reasons to keep it as is for now).