My previous blog post got a lot of comments and suggestions which is great.

You can view it here for reference: Microsoft Fabric: Why Warehouse Beats Lakehouse by 233% in Speed and 278% in Capacity Savings – FourMoo | Microsoft Fabric | Power BI

I learnt a lot and based on the feedback people asked for me to compare the Lakehouse vs the Warehouse with 1 billion rows.

What I also did this time was to optimize anything I could with regards to loading data into the Lakehouse or the Warehouse based on the feedback I received.

Below is a list of the changes I made

  • When starting the notebook, I configured it with the native execution engine and set the file size to 128MB
  • I used the same ingestion file which I got from here Release Ready to use data · sql-bi/Contoso-Data-Generator-V2-Data
    • I then loaded this CSV 5 times to get it to over 1 billion rows
  • Lakehouse changes
    • Based on the blog post from Miles Cole Mastering Spark: Should You Infer Schema in Production? | Miles Cole where I created the Schema for the table before inserting into the Lakehouse table
    • No partitioning on the data into the Lakehouse table
    • This resulted in significantly less files being created.
      • Below was from the first load which was partitioned by day
      • And this is the files below, which has got 5x more data without the partitioning and setting the binSize
  • Warehouse changes
    • Instead of using the COPY INTO from the CSV I created the table and used the INSERT INTO with OPENROWSET
    • All the Warehouse tables were as is, I did not use any Clustering

Ingestion CSV Queries

I then ran the CSV ingestion queries and as shown below they ran in similar times and this time the CUs being used was much closer.

Lakehouse Ingestion

Warehouse Ingestion

The warehouse took 17 seconds longer to run compared to the Notebook

In terms of CUs the warehouse used less CUs than the Notebook by 1,793 CUs.

In this test the warehouse consumes 18% less CUs than the Notebook

Aggregation Queries

Next was for me to re-run the aggregation queries where I am querying from the 1 billion row tables and aggregating the data.

Lakehouse Aggregation query

Warehouse Aggregation query

The warehouse ran about 8x quicker than the Lakehouse aggregation query.

When comparing the CUs the warehouse was only 1x or 110% less CUs than the Lakehouse Spark SQL Query

NOTE: I did also try to run the aggregation using Python but that took significantly longer than the Spark SQL Query.

PBI Queries

As with my other test, was to then query the aggregated data into a Power BI Semantic Model.

Lakehouse Semantic Model Query

Warehouse Semantic Model Query

Once again, the Lakehouse and Warehouse were very close with the Lakehouse being quicker than the Warehouse

In terms of CUs the Warehouse consumed only 10% less CUs than the Lakehouse.

Summary

When comparing the Lakehouse vs the Warehouse by using a larger dataset as well as using the recommendations to optimize for Spark overall the Warehouse was quicker but not as much as my first blog post.

In terms of duration there was only a handful of seconds between the Lakehouse vs the Warehouse

Comparing the CUs which to me is the most important as this can affect how your capacity runs, the Warehouse consumed only 17% less CUs than the Lakehouse

I hope that this helps give a better overview of the engines.

For me it was very interesting to see that if I apply the recommendations, it can make a significant difference!