How to use SQL Code to read from Delta Tables in a Python Notebook
I come from a TSQL background, so using SQL makes it easy for me to work with data.
There are multiple ways to use SQL in a PySpark notebook, but when I started using a Python notebook it was not as straightforward.
In this blog post I will show you how I use SQL Code.
As mentioned previously, I am by no means an expert; I typically find a way that works, is fast, and doesn't consume a lot of capacity. If that works consistently for me, then that is how I go about it.
Why use SQL
A question you might have been thinking about is: why use SQL at all?
To me, SQL is a universal language that can be used across many different database platforms. Each database uses a slightly different dialect, but once you understand how SQL works, the pattern of how to use it is fundamentally the same.
An example I often use is getting the starting date or date-time from a previous data load, which can then be used to load only the new data.
Why use DuckDB
When I was learning to code in a Python notebook, I found that there was a lot of detail available on how to use DuckDB with one.
Below are some of the advantages I have found when using DuckDB:
- It has SQL support, meaning I can use my existing SQL knowledge when writing queries.
- It understands CTEs (Common Table Expressions), which are really handy when trying to remove duplicate data.
- It is really fast.
- It comes pre-installed with the Python notebook.
- It allows reading and writing to Lakehouse tables.
- I know that there is native support for SQL queries in a Python notebook, but as far as I am aware that is only for reading data.
- It is open source and improving all the time.
NOTE: The one thing to remember is that DuckDB is CASE sensitive. When using Lakehouse tables, make sure the names match exactly.
Otherwise, you will run some code and then wonder why it is not querying or inserting.
SQL Code for reading data from Lakehouse tables
Below is some sample SQL code using DuckDB to read data from a Lakehouse table into a data frame.
In this example I am selecting the top 10 rows from my Lakehouse table called "Reports".
NOTE: Because it is DuckDB, I had to use LIMIT 10 (rather than TOP 10 as in TSQL) to get the top 10 rows.
And here is the output, which is already in a data frame.
How to use SQL code from multiple tables
Here is another SQL code example where I am querying data from multiple tables into a single data frame.
You might be wondering why the InvoiceDateKey and the Date columns have those numbers. My understanding is that it is using the Unix timestamp in milliseconds.
And this is what is shown in the Lakehouse view, where, as shown below for the InvoiceDateKey example, the data type is DateTime.
NOTE: I have found that it displays this way so that the DateTime columns can be shown not only in the Lakehouse view, but also in SQL endpoint queries.
Summary
In this blog post I have shown you some examples of how to use SQL code to read data from existing Lakehouse Delta tables.
In the next blog post I will show you how to create a data frame and write the data into a new or existing Lakehouse Delta table.
You can find a link to the notebook I used above here: Fabric/Blog – DuckDB SQL Code to read from Delta Tables.ipynb at main · GilbertQue/Fabric
Thanks for reading! If you have any comments or suggestions, please let me know.