Fabric Notebooks – Looping through data using PySpark

Continuing with my existing blog series on what I’m learning with notebooks and PySpark.

Today, I’m going to explain to you how I found a way to loop through data in a notebook.

In this example, I’m going to show you how I loop through a range of dates, which can then be used in a subsequent query to extract data by passing through each date into a DAX query.

Please note that I am using the option to Open in VS Code (Web) (PREVIEW), so that I can work using VS code in the web.

A screenshot of a computer program

Description automatically generated

The first thing I did in my notebook was to load Sempy and some configuration settings.

A screen shot of a computer program

Description automatically generated

Getting the dataset to loop through

As explained in Ma, working example, I am going to loop through a set of dates which I am getting from Lakehouse table.

The way I am getting this data is to use the PySpark SQL as shown below.

The reason for my query in this format is because my DirectLake semantic model only has dates for the periods shown below.

Also, when passing through dates to DAX it needs to be split into Year, Month and Day.

A screenshot of a computer screen

Description automatically generated

When I run the above cell, I can then see my output as shown below (This is to ensure that I am getting the expected output)

A screenshot of a black screen

Description automatically generated

Code to loop through the data

Below is the final step where I am now going to show you how I looped through the data from the dataframe df_Dates_Query

I have put comments in the code which will explain the different steps I completed.

One thing I want to call out is that when looping through all the rows in the dataset, it must have the indentation (TAB) across to reference each row in the dataset.

##### Looping through the Dates #####

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
import datetime
from pyspark.sql.types import IntegerType,BooleanType,DateType
from pyspark.sql.functions import col, year, month, quarter
from pyspark.sql.functions import *
from datetime import datetime,date,timedelta
import pandas as pd

# Workspace and Semantic Model Name
WorkspaceName = "Fabric - FourMoo"
SemanticModelname = "Near Realtime Semantic Model"

# Create a Spark session
spark = SparkSession.builder.appName("AddColumnExample").getOrCreate()

# Get the dataframe I want to loop through.
# I used the method "collect()" to use it as my looping function.
# My variable is called "data_Collect"
# I am then using my previous dataframe called "df_Dates_Query"
data_collect = df_Dates_Query.collect()

# If needed I can see the dataframe I am going to loop through for some validation.

# looping thorough each row of the dataframe
for row in data_collect:
    # NOTE: There has to be a tab in order for each row in the dataset to be used

    # This is where I am assigning a variable name for each row in my looping dataset.
    var_Date = row["Date"]
    var_Year = row["YearValue"]
    var_Month = row["MonthValue"]
    var_Day = row["DayValue"]

# Below is my DAX Query String
    querytext = '''\
    // DAX Query
            KEEPFILTERS( TREATAS( {DATE(''' + var_Year + ''', ''' + var_Month + ''', ''' + var_Day + ''')}, SQL_Mirror_Sales[InvoiceDateKey] )),
            "Sales", [Sales (Mirror)]

    # Running the DAX Query
    df_DAX_QueryResult = fabric.evaluate_dax(workspace=WorkspaceName, dataset=SemanticModelname, dax_string=querytext)

    # Get the Output of the DAX Query

Here is the output from my DAX query, where you can see I am looping through each of the dates highlighted below.

If I did not have the indentation (TAB) and I try and run the code I get the following as shown below.

A screenshot of a computer

Description automatically generated


In this blog post I have shown you how I loop through data in a PySpark notebook.

Thanks for reading, comments and suggestions are most welcome 😊