September 17, 2023|tutorial|

how to create a date dimension table in python and pyspark.

introduction.

A generic date table or dimension is crucial for most of our analytical reporting use cases. You can create it anywhere in your ETL process, although the overall recommendation is probably to create it not too far downstream. While date tables are very much generic and there is not too much of a business logic that shall be reused, I still like the idea to provide it centrally for the whole organization. You might, indeed, have specific holidays in your country or other organization specific dates that should be marked in your date table. Also, recreating such a date dimension for any of your reports, datasets, marts or other layers just wastes resources. So here is a simple script in Python that you can use, i.e. in your Lakehouse. Anyone that needs a date table can just fetch the columns and rows they need from that table and they are good to go.

prerequisites.

1. In our case we use Databricks, but you can also use the script in Azure Synapse or anywhere else that provides you with a Python runnable environment.

1. What’s the goal?

The goal is to end up with a script that returns a date table like the below:

The script itself is somewhat dynamic where you can specify the start and end date as well as the columns that you need in the date table.

2. The script

In the following the script that you can use. In line 8 and 9, we specify the start and end date. In this case the end date is dynamically calculated as the last of December three years ahead of the current timestamp. In line 15 to 39, each column is specified with its name and logic. Of course, you do not need to implement all of them if you do not need them. The rest calculates and saves the table except line 62 which provides a space where you can try out different functions and logics for new columns.

Copy to Clipboard

3. Upload the script

Obviously, you could just copy and paste each of the cells in to your notebook. However, it also works to copy the code into notepad and save it as a .py file locally. After, navigate to your folder you would like to save it, click on the three dots and press import.

4. Showtime

In the code sample above, I added some display function which are commented out by default. But if you wanna check the results on the way you can just remove the comment itself. Here the result when running it:

end.

That is it! Feel free to add new columns with a new logic or remove the ones that you do not need. Hope this helps!

Leave A Comment