June 10, 2024

how the partitioning of delta lake tables can impact the performance of your direct lake semantic models.

introduction.

First off, this blog post is not meant to tell you how you should partition your delta tables when consuming them in direct lake models. Nor does it suggest that you should partition your delta tables at all. Instead, this post is about acknowledging that partitioning can have an impact on the performance of your direct lake semantic models. To showcase this, we created the same 100 million row table four times with different partitions and tested their performance against each other in DAX Studio, utilising two distinct measures.

1. What’s the setup?

Let’s start with the tables themselves. We created the same delta table four times in our Lakehouse. The difference between those tables is how we partitioned each of them: one table has no partition at all, whereas the others are partitioned by month, week and date respectively. Each table consists of 100 million rows and three columns: IDInt (a row count column as integer), AttrString (with the three distinct values Banana, Apple and Orange) and a Date column (with a random date within the year 2024). The script we used can be found in appendix I. Below, a snippet of the table:

Here is how the tables look in OneLake’s file explorer. I used the add-on that you can download from here.

Next, a lineage overview of the whole setup, starting from the Lakehouse on the left to the models and reports on the right. In addition to our four semantic models in direct lake mode, we also have a model in import mode that we’ll take into consideration during benchmarking.

Below, you can see the applied data model:

And here are the two measures: a simple count as well as a slightly more “complicated” YTD measure:

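The measure definitions themselves didn’t survive the export; as a minimal sketch of what they could look like (the table names 'Fact' and 'DimDate' are assumptions, and TOTALYTD presumes a relationship to a proper date table):

```dax
CountID = COUNT ( 'Fact'[IDInt] )

YTDSumID = TOTALYTD ( SUM ( 'Fact'[IDInt] ), 'DimDate'[Date] )
```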

And finally, the visuals producing the DAX queries. These utilise our two measures, broken down by either month, week or day (from the date dimension) or by the string column:

2. Test and compare performance with DAX studio

First things first: a quick explanation of the methodology. We ran each of the two DAX measures for every visual exactly three times, always clearing the cache before each run. We copied the DAX query from the Performance analyzer tool in Power BI Desktop. Below is a screenshot for the DAX measure CountID grouped by week (this corresponds to the DAX query from the visual on the top right in the report). In this specific case, we tested the query against the model toms_sem_model_withpartitiondate, which is a direct lake model on top of the delta table partitioned on the date column.
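The copied query isn’t reproduced here, but as a hedged illustration, the core of such a Performance analyzer query typically boils down to a SUMMARIZECOLUMNS call (the column reference 'DimDate'[Week] is an assumption; the real query wraps this in additional TOPN/ORDER BY scaffolding):

```dax
EVALUATE
SUMMARIZECOLUMNS (
    'DimDate'[Week],
    "CountID", [CountID]
)
```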

We not only looked into the log durations, but also tracked the Server Timings. Here too, we ran each DAX query three times while taking note of the Formula Engine (FE) and Storage Engine (SE) metrics:

Below are the results of the log timings for the CountID measure. We averaged the durations of the three runs. The unit for the durations is always milliseconds (ms).

Of course, we did the same for the measure YTDSumID:

Overall, for the measure CountID, the impact of partitioning is less significant than for the more complex measure YTDSumID. We can also note that the import model outperforms the direct lake models, particularly in the case of YTDSumID. Moreover, the DAX queries (originating from the visuals) didn’t seem to benefit differently from the partitioning in place: for instance, the direct lake model on the monthly partitioned table didn’t perform better than the weekly partitioned one for visuals that grouped by month. In other words, it looks like partition pruning isn’t happening, which is in line with what Sandeep has written as well. Lastly, the direct lake model without any partitions performed worst, and in our case it appears that the more granular the partition, the better the performance.

Now, let’s have a look at the Formula Engine (FE) and Storage Engine (SE) metrics for the CountID measure:

Since the Formula Engine results show hardly any variation, we focus on the Storage Engine metrics instead. As before, the non-partitioned table was the slowest, while the import mode model was the fastest. The earlier findings seem to hold true here as well.

Next, we’ll have a look at the results for the more complex measure, YTDSumID:

These results confirm our earlier observations as well. It’s worth noting that the import mode model performed equally well for both measures. Additionally, the week- and month-partitioned models, as well as the non-partitioned model, were all roughly 10x slower for YTDSumID than for CountID. For the date-partitioned model, however, this factor was only around 2.

end.

At least for the models above, we could observe performance variations between the differently partitioned tables. Once again, the decision whether and how to partition your delta lake tables will depend on your own model and use case. Still, it could be worth keeping partitioning in the back of your mind when designing a solution based on direct lake mode.

appendix I.

Below is the Python script we used to create the differently partitioned delta tables.

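The original notebook cells were lost in this export; below is a minimal, scaled-down sketch of how such tables could be generated (the helper name build_rows, the table paths and the derived Month/Week columns are assumptions; in a Fabric notebook, the spark session is provided by the runtime):

```python
import random
from datetime import date, timedelta

def build_rows(n, seed=42):
    """Generate n rows of (IDInt, AttrString, Date) with a random date in 2024."""
    rng = random.Random(seed)
    fruits = ["Banana", "Apple", "Orange"]
    start = date(2024, 1, 1)
    # 2024 is a leap year, so offsets 0..365 cover the whole year
    return [
        (i, rng.choice(fruits), start + timedelta(days=rng.randrange(366)))
        for i in range(1, n + 1)
    ]

# In a Fabric notebook, the rows would then be written four times as delta
# tables, once per partitioning scheme, along these lines:
#
#   import pyspark.sql.functions as F
#   df = spark.createDataFrame(build_rows(100_000_000),
#                              ["IDInt", "AttrString", "Date"])
#   df = df.withColumn("Month", F.month("Date")) \
#          .withColumn("Week", F.weekofyear("Date"))
#   df.write.format("delta").save("Tables/fact_nopartition")
#   df.write.format("delta").partitionBy("Month").save("Tables/fact_month")
#   df.write.format("delta").partitionBy("Week").save("Tables/fact_week")
#   df.write.format("delta").partitionBy("Date").save("Tables/fact_date")
```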

For the date table, I want to refer to the blog post how to create a date dimension table in python and pyspark.

appendix II.

Here are some visuals showing data from DMVs run against the different models. In the first visual, we can see that each direct lake mode model has only one partition.
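The underlying queries aren’t shown in the visuals; one way to read partition counts per table, assuming DAX Studio is connected to each model, is the TMSCHEMA_PARTITIONS DMV:

```sql
SELECT [TableID], [Name], [State]
FROM $SYSTEM.TMSCHEMA_PARTITIONS
```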
