January 31, 2025|tutorial|

how to scale your data cleansing in power query to improve data quality.

introduction.

Usually, you would want to outsource such cleansing tasks to the source as part of your ETL flow i.e. into or within your lakehouse or data warehouse. But we all know that sometimes you still get dirty data and you are forced to do the cleansing yourself in Power Query or you just do not have such a data infrastructure in the first place. Either way, this post provides a function where you can scale your data cleaning by specifying validation checks. This article is somewhat a continuation of the blog post how to replace several characters in a column in just one power query step. There, we leverage a mapping table that is used to simply run over a target table, replacing an old (dirty) value with a new one. Feel free to check it out!
In this article, however, we take it to the next level, where we define certain rulesets (functions). We then apply those rulesets with another function
DataQualityCheck according to an action table onto any target table and column that we want to. If the validation defined in the ruleset fails, the DataQualityCheck function runs an action upon it, i.e. replace the value or delete the entire row. All of this is done dynamically. In Excel world this is very similar to a Macro.

prerequisites.

1. Power BI Desktop

1. What’s the goal?

The goal itself becomes quite clear after illustrating the problem. Say you have the following table:

Obviously, the attributes phone number, email and age have some wrong data. To clean this mess up, we specify functions in M that validates each record according to pre-set rules. For instance, our phone number shall always follow the pattern 3 digits, dash 6 digits (“123-123456”), our email shall be in the format “somestring@someotherstring.xx” and our age shall just be a number. As an example, below the M code of the age check:

As you can see, the M validation code is embedded into a function. Do not worry if you have not used functions before. The M code itself is the same as you would use in a custom column. The rest is just a wrapper that you can copy and paste (more on this later).
In case the respective validation rule fails, we want to take action upon it, i.e. either delete or replace the value. We specify this in an action table, where we define which validation ruleset (=function) shall be applied to each column as well as what we shall do with it, if the validation fails (replace the value or delete the row entirely):

So, the table above tells us to replace any value for the column phone number or email in MyTestTable in case of a validation violation. Further it’ll delete any row that is not an age according to our validation rule. When running our DataQualityCheck function onto our target table it will apply the validations and actions according to our action table and functions. The result is a cleaned target table:

All of this, we could have easily done without defining functions or populating an action table… So why the hassle?
That’s true! However, with this approach you decouple the validation and cleaning code from the table in question. This means you can reuse any cleaning ruleset without specifying the same logic over and over again. Say, you have a certain ID that pops up in several tables. Now, all you have to do is to store the logic of your validation once and reapply it as many times as you want. Also, if the logic changes in the future, you’ll just need to adapt in one place.

2. Create the cleaning functions

For the DataQualityCheck function to work, we first need to create our validation functions. Those specify what a valid phone number, email or age looks like. The beauty of it? You only have to specify the rule once and then you can just apply it to any table and column.
For this, just add a Blank Query and open the Advanced Editor. Then paste the subsequent code snippet into the advanced editor.

Copy to Clipboard

You need to substitute YourValidationRuleHere with the M code that shall validate the specific column. You can easily debug your M code by adding a custom column to your original table. When you are ready, just replace the YourValidationRuleHere placeholder with your code. Important: Replace the column name in your logic to data when replacing YourValidationRuleHere.
Below the snippets of my phone number, email and age validation code:

Copy to Clipboard
Copy to Clipboard
Copy to Clipboard

Lastly, we need to add a record that specifies each function. The record needs to be called ValidationFunctions. This name is crucial otherwise the function would throw an error later on.

Just paste the names of your functions in the same way as per below. It doesn’t matter how you call them as long as you reference the function correctly (on the right). The name on the left can be called as you wish, but remember to specify the same name in your action table (see next bullet).

Copy to Clipboard

3. Create the action table

The action table is nothing else than a configuration table telling the RunDataQualityCheck function (see next bullet), what it shall do, if a validation (that is the functions we specified in the previous bullet) fails.

In our action table we specified to replace the phone number with “000-000000”, if the validation check concludes the value is wrong. For email, we do the same (“not@an.email”) . For the age column, we delete the entire row, If the function finds it to be invalid. Here the specifications for each column:
TableName: The name of the target table to be cleaned
ColumnName
: The name of the target column to be cleaned
ValidationFunction
: The function name that shall be used. Note, needs to be the same as specified on the left hand side of our record table (see previous bullet)
BadDataAction
: Can either be “replace” or “delete”
DataReplacement
: If you chose “replace” in the BadDataAction column, you can specify the replacement value here

Note, the columns of your action table have to have those names. You can specify your own configuration by clicking on Enter Data in the New Query tab.

4. Create the data quality check function

Almost there! Now that we specified our validation functions and our action table, all that is left is to add the heart of this approach: The RunDataQualityCheck function. This function will actually do the work and apply each validation function specified in the action table to our target columns.
For this, just add yet again another blank query and paste the subsequent code into the advanced editor:

Copy to Clipboard

5. Showtime

Now we are all set!
To invoke the function, select it in the query pane, fill in the values and click Invoke. Unfortunately, we have to specify the table twice, one time from the dropdown and another time as a string. I did not manage to cast the table from table type to a string type as part of the function.

And here the result, the cleaned table:

end.

This is it! We are now able to scale our data cleansing in Power Query, by defining reusable functions, populating an action table and running our RunDataQualityCheck function. Lastly, I wanna inform you about other Power Query tips and small scripts here on tackytech.blog, as well as how to get the next scheduled refresh into your Power BI report.

Leave A Comment