I saw that there was an update to the Microsoft Fabric Warehouse which makes it possible to copy data directly from OneLake into a Warehouse table.

This got me thinking: which option would consume more capacity to get the data into the Warehouse table, and which one would be faster?

To compare them, I am first going to run a SQL query in the Warehouse.

Next, I will use a Notebook to copy the data from the OneLake Files section into a Warehouse table.

Sometimes it is more important to get the data in quickly, and other times it is more important to use less capacity.

Even though this is a Background process, enough Background processes running at once can still affect your capacity.

NOTE: The tests below were done on sample data; it is recommended that you test this in your own environment.

Data Source

I got the data from an open source location which you can find here: fabric-samples-healthcare/analytics-bi-directlake-starschema/01-DownloadCMSDataCsvFiles.ipynb at main · isinghrana/fabric-samples-healthcare

I then downloaded the data into my Lakehouse.

When I had a look, the total size of the CSV files was roughly 38 GB.

SQL Query

Below is the SQL query which I ran to transform the data from the CSV files into a Warehouse table.

NOTE: I specified the column names with their data types to ensure that the table creation would not fail due to data type issues.

Actual query

CREATE TABLE Fact_Sales AS
SELECT *
FROM OPENROWSET (
    BULK 'https://onelake.dfs.fabric.microsoft.com/b9443dd2-a4cb-0000-aaaa-9ceea1efbafb/3068ccb4-aa52-0000-aaaa-669e66c647d5/Files/cms_raw/*.csv',
    FORMAT = 'CSV',
    HEADER_ROW = TRUE
) WITH (
    Prscrbr_NPI             VARCHAR(1000),
    Prscrbr_Last_Org_Name   VARCHAR(1000),
    Prscrbr_First_Name      VARCHAR(1000),
    Prscrbr_City            VARCHAR(1000),
    Prscrbr_State_Abrvtn    VARCHAR(1000),
    Prscrbr_State_FIPS      VARCHAR(1000),
    Prscrbr_Type            VARCHAR(1000),
    Prscrbr_Type_Src        VARCHAR(1000),
    Brnd_Name               VARCHAR(1000),
    Gnrc_Name               VARCHAR(1000),
    Tot_Clms                VARCHAR(1000),
    Tot_30day_Fills         VARCHAR(1000),
    Tot_Day_Suply           VARCHAR(1000),
    Tot_Drug_Cst            VARCHAR(1000),
    Tot_Benes               VARCHAR(1000),
    GE65_Sprsn_Flag         VARCHAR(1000),
    GE65_Tot_Clms           VARCHAR(1000),
    GE65_Tot_30day_Fills    VARCHAR(1000),
    GE65_Tot_Drug_Cst       VARCHAR(1000),
    GE65_Tot_Day_Suply      VARCHAR(1000),
    GE65_Bene_Sprsn_Flag    VARCHAR(1000),
    GE65_Tot_Benes          VARCHAR(1000)
) AS r;

I then ran the query, and in my opinion it read and wrote the data pretty quickly.

As shown below, I could see from the Metrics App that it took 60 seconds to run and consumed 5,076 Capacity Units.

Time taken

PySpark Notebook

Next, I did the same activity, this time using a Spark Notebook.

Syntax found here

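Below is a minimal sketch of what the read step could look like in PySpark. It assumes the notebook is attached to the Lakehouse as its default Lakehouse, so the files can be read via the relative Files/cms_raw path used in the SQL query above.

# Read all of the CSV files from the Lakehouse Files section into a Spark dataframe.
# "Files/cms_raw/*.csv" assumes the Lakehouse is attached as the notebook's default Lakehouse.
df = (
    spark.read
    .option("header", "true")   # the CSV files contain a header row
    .csv("Files/cms_raw/*.csv")
)

display(df.limit(10))   # quick sanity check of the data before writing it out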

NOTE: To write to the Warehouse table, the table first has to be created in the Warehouse.

If the table does not already exist when you try to load it from the notebook, the write will fail.

I then wrote the dataframe to the Warehouse table.

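Here is a minimal sketch of the write step, assuming the synapsesql method from the Spark connector for Fabric Data Warehouse (see the reference link below). The warehouse, schema and table names are placeholders, and the append mode is an assumption since the table was created beforehand.

# The import below registers the Fabric Data Warehouse connector so that
# synapsesql() is available on the dataframe writer (Fabric Spark runtime only).
import com.microsoft.spark.fabric

# Append the dataframe into the Warehouse table that was created beforehand.
# "MyWarehouse.dbo.Fact_Sales" is a placeholder three-part name.
df.write.mode("append").synapsesql("MyWarehouse.dbo.Fact_Sales")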

Reference Link: Spark connector for Microsoft Fabric Data Warehouse – Microsoft Fabric | Microsoft Learn

As shown below, the notebook took longer to run at 281 seconds and consumed 4,218 Capacity Units.

Summary

When comparing running SQL vs. a Notebook, the Notebook consumes fewer Capacity Units overall.

If you are looking for the ingestion to be as fast as possible, then SQL is the faster option, but it comes at a slightly higher capacity cost.

As always, thanks for reading. Any questions or comments are welcome.