How I saved 40%+ on my Power BI Dataset size
I have been working with a customer who has a fairly large dataset. Part of my role was to see if there was any way to see how I could reduce the amount of memory being consumed, whilst still allowing for the same metrics to be represented.
I had already applied all the best practices in terms of reducing the cardinality, removing unwanted columns and making sure that only the data required is being brought into the dataset. Even at this point the dataset size was consuming 90GB of memory in Azure Analysis Services. With the steps below I got my dataset size down to a whopping 37GB of memory!
I used the awesome tools from SQLBI.COM and DAX Studio to see which columns were consuming the most space, and because my dataset had currency converted values, this meant that the cardinality was very high. (The reason that I decided to store the currency conversion values, is when trying to do it on the fly in a large dataset it is very slow)
Original dataset size
As you can see below the original size of the PBIX that I had was 5.01MB
When this dataset was up and running it was consuming about 20MB of memory (As previously blogged about on How much memory a PBIX will consume)
Reducing the dataset size by using Fixed Decimal
When I was reading through the documentation on the data types it appeared to me that if I could change the data type from Decimal Number to Fixed Decimal Number I could save on the dataset size.
I had a look and for the data type to become Fixed Decimal it can only have 4 decimal values.
I made sure that if I did change the values in my column to only 4 decimal places that I the totals would remain the same.
As you can see below the SalesAmount_EUR has got the data type set to Decimal Number which is my starting point.
Next I changed my source query to only return the values with 4 Decimal Numbers as shown below. (You could also do this in Power Query if need be)
I did double check the values and they still matched.
Now when I saved my PBIX it went from 5.01MB to 4.18MB
That is a saving of roughly 16%, which I initially thought was pretty good.
I still felt that there was something more that I could do to get the dataset smaller and with some creative thinking there is, so read on.
Making the dataset really small and efficient by using Whole Number
I had an idea, which I got after remember a conversation I had with Phil Seamark, where he mentioned that under the hoods of SQL Server Analysis Services (SSAS) tabular when you have a decimal point it converts it to a whole number and stores where the decimal point was.
This got me thinking, what if I could now split my SalesAmount_EUR into separate columns?
That would then mean I could two columns of whole numbers. Not only that but the cardinality for the decimal values would be significantly less.
I immediately got cracking and by using TSQL (You could once again use Power Query) I split the column into 2 columns. My one challenge was to ensure that when I had a negative value that my cents column also was negative!
Once I imported the data, I made sure that the data type for my columns was set to Whole Number
At this point I was really hoping that this would allow for a significant saving on the dataset & memory size.
When I saved the file, it had dropped down to a tiny 1.75MB
Of course, I just HAD to create a Power BI visual to show how much I had saved in this particular exercise.
As you can see above, I got a 65% decrease in the size of the dataset.
The memory size also decreased from 19MB of memory down to 2.1MB of memory.
Now I know you might be thinking that I now need to add in some additional logic into my DAX measure to bring these two columns together to represent the same values?
I did test this and due to the dataset size being a lot smaller the effect was then a smaller dataset size, smaller dictionary size and smaller columns hierarchy size which means that the super-fast SSAS engine can process the data faster.
For reference this is how I put the values back together in one measure
Sales Amount EUR = SUM ( Data[SalesAmount_EUR] ) + ( DIVIDE ( SUM ( 'Data'[SalesAmount_EUR_Cents] ), 10000 ) )
As I have shown there are at times very creative ways to reduce the size of your datasets.
I have read a fair amount of blog posts in my time and as I have previously read, I would suggest testing this with your own data. The mileage in terms of savings will vary and very often it will be worth the effort to get the savings. Not only that but you learn a lot on the way too.
Writing this blog post has been exciting for me and I hope that you enjoyed it too.
As always thanks for reading and if you got any comments or suggestions please leave them in the section below.