A Real implementation of Approximate Distinct Count vs DISTINCTCOUNT is 7X faster!
I am very fortunate to have known Phil Seamark for quite some time. And quite a while ago we were chatting, and Phil mentioned that he had this idea that he could get an approximate distinct count working where it could be VERY close to the actual DISTINCTCOUNT (Using DAX)
If you are unaware one of the most expensive DAX functions that you can use is DISTINCTCOUNT, because it has to find all the distinct values across your dataset. This is very time consuming and can appear to be slow. Whilst in reality it is really fast for what it does.
In my customers dataset there are over 64 million distinct users across a 2-year period. As you can imagine when doing a DISTINCTCOUNT this takes quite a while to complete the calculation.
I read Phil’s awesome blog post DAX : Approx Distinct Count, which I do recommend you read before continuing with my blog post. I am not going to go into the very detailed explanation that Phil wrote about or how to get it implemented.
I am currently running SQL Server Analysis Services (SSAS) 2019 Enterprise Edition. (This can also be applied to Power BI)
My Fact table has got roughly 950 Million rows stored in
And as mentioned previously it has got over 64 Million distinct users.
The data is queried from SQL Server into SSAS.
Implementation of the Approximate Distinct Count
As per Phil’s blog post what I had to do was to update my fact table with the 2 new helper columns when importing it into SSAS.
Next I created the Approximate Distinct Count as per Phil’s blog post
The last part was for me to process all my partitions. This took close to 14 hours to complete!
How close is the Approx Distinct Users to the DISTINCTCOUNT measure?
The first thing I needed to check was to see how far apart the Approx Distinct Users is when comparing it to the DISTINCTCOUNT.
I had to understand if it was close enough to be acceptable.
Below are the actual numbers and the % change which shows that they were never more than 1.55% and -1.10% different. THAT IS INCREDIBLE
Whilst that is super impressive, I wanted to see what this would look like on a chart where I am doing the exact same comparison in Power BI
Once again how close are those values, in terms of trending over time I would consider that to be excellent.
Comparison between DISTINCTCOUNT and Approx Distinct Users
The fun part was then to see how much faster the Approx Distinct Users would be when compared to the DISTINCTCOUNT measure.
As always, I used another great tool DAX Studio to compare and see the results.
Fortunately, I was working this over the weekend so I could clear out the cache before running the DAX queries to test the performance of the measure.
For both of the DAX queries below I did it based on the same visuals as previously shown where it is getting the Distinct Users by month for the past 2 months.
I first ran the DISTINCTCOUNT to see how long it would take, knowing that this would take the longer time to complete.
As shown below it took a total of 30 seconds to complete. The interesting thing is that it took a total of 389 seconds using almost 13 cores to get it back from the storage engine.
By this point I was really interested to see the performance of the Approx Distinct Users measure, and I was not disappointed.
It took only 4 seconds to complete! WOW. Not only that but the storage engine only took 51 seconds to complete using 13 cores.
I did run the queries multiple times and I pretty much got consistent results as shown above.
That is a performance improvement of 7.4X faster.
For users who consume reports, this will make a massive difference.
I was so excited I pinged Phil to show him how well his Approx Distinct Users measure was performing. Phil also suggested that there could be an even bigger performance improvement if I had to drop the UserKey from the Fact table (It would no longer be required). This would mean there would be less memory consumed in the data model. Which would then make processing faster as well as query execution because it would have less data to go through.
Whilst it did take me some time to get this implemented it certainly is worth all the effort. And as you can see above using a real-world implementation it is a LOT faster than the current DISTINCTCOUNT.
Once again thanks to Phil for his assistance and patience with all my questions.
If you have any questions or comments, please leave them in the section below. Thanks for reading!