Andrée Laforge

The sexy topic of the hour: HR data quality!

I wanted to entice you to read this blog post and thought that using the word "sexy" might get you interested. But there's nothing sexy about data quality; if anything, it's nauseating!

However, it matters more than you might think: it will likely be one of the biggest issues you face when implementing HR analytics (besides getting approval and a budget from your management).

Very few organizations have an HR data warehouse. Data consolidation, data cleansing, and having a single, reliable source - these are the tasks you'll spend the most time on, not the visualization and analysis (the real "sexy" part of HR analytics). An analysis is simple once the data is in good shape.

Data is the cornerstone of an HR analytics initiative. Do you know what a cornerstone is? It’s a foundation, an ESSENTIAL base. And the quality of that cornerstone is considered the most important technical aspect for the success of your HR analytics initiatives.

Data quality: an IT problem or an HR problem?

Data quality is often perceived as an IT problem when it is not: it is up to HR to take ownership and fix it (I repeat: IT IS UP TO HR TO TAKE OWNERSHIP AND FIX IT). Often, the IT team will be able to put a band-aid on data quality issues in the extraction, transformation, and loading phase, and that band-aid is exactly why HR does not take responsibility for data quality. So the question needs to be asked: should we put a band-aid on it or attack the problem head-on?

Achieving a high level of data quality is difficult, and organizational and data ownership issues have a significant impact on this. I ask you: who owns the data from your HR/payroll system? Certainly not IT. In the short term, the easy solution is often to put band-aids on problems rather than address the source of the problem. If you don't trust your data, you're dead. HR analytics won't help you. So you need to take ownership and start fixing your data.

What to do first?

Improve data quality and then implement HR analytics, or implement HR analytics on bad data and improve the data as you go?

It seems obvious, doesn't it? No sane person would embark on an HR analytics initiative with bad data! In reality, many do, because they have little choice. First, they have no idea what the quality of their data is before they start their project, and it is very difficult to address the root causes of poor quality without investing a lot of time and resources.

Second, there is little incentive to address the source of the quality problem until it is exposed to HR through reporting. The usual culprits are multiple operational systems without a unique key (an employee number, for example), inconsistent data definitions (e.g., what is an employee?), and incorrect data entry (data is entered by humans...). Given this chicken-and-egg situation, I recommend pursuing the HR analytics project even if you have serious data quality issues, but do so with clear expectations and a limited project scope. The extent of the problem is often discovered only by really putting the spotlight on the data!

Therefore, it is important to communicate loudly about data quality issues and the risks associated with deploying HR analytics tools on bad data. It is also important to advise the various stakeholders on what can be done to solve the data quality issues – systematically and organizationally. Complaining without providing recommendations solves nothing!

Data quality depends on your data entry processes! How easy is it for people to enter good data? Are they motivated to enter quality data? Do they know why it is important? Do you REGULARLY validate the data entered?

You must change the habits and behaviors of those who enter the data! Otherwise, the investment in your HRIS will be useless!

Do you need perfect data?

All HR analytics projects require data, but they do not require data perfection. High quality should always be a goal, but the pursuit of complete and perfectly clean data should not be a barrier to progress or a reason not to undertake an HR analytics project. In many cases, data is incomplete, inconsistently defined, outdated, missing, dirty (containing errors), or stored in multiple disconnected systems. The challenges are real and numerous, but they are not insurmountable.

What to do when data quality is not good?

Without a reasonable degree of confidence in the quality of the data, HR analytics should remain in the hands of experts (e.g., the HR analytics team) and should not be extended to the rest of the HR team and certainly not to senior management or managers. The deployment of HR analytics, in this case, should be done in a limited way, so that data quality issues will be exposed, understood, and eventually resolved. Then, the deployment can be gradually expanded.

How do you know if the data quality is sufficient for the project you are undertaking?

The well-known phrase "garbage in, garbage out" is quite appropriate in the context of HR analytics. Don't try to fill every gap in your data and solve every problem to the point where you lose sight of the goals of your analysis. There will always be data problems.

To know if your data quality is good enough to undertake your project, you need to get to know the data and understand it. That's the first step! In many cases, this means learning from others, from experts in the HR field. Some things are much easier to spot than others. For example, if you have a negative age, negative seniority, or an age over 100, you know something is wrong with your data. But what if you have negative values for sales? Does this indicate an error? Perhaps, but you will need to check with the salespeople: it could be an error, or it could be a canceled order or a price renegotiation of a previous order. You need to invest time to understand the data.
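
The "easy" checks mentioned above can be sketched in a few lines of Python. This is only an illustration: the field names (`employee_id`, `age`, `seniority_years`) and the sample records are made up, and the 100-year ceiling is simply the threshold used in the example above.

```python
def find_implausible(records, max_age=100):
    """Return (employee_id, reasons) for records with impossible values."""
    flagged = []
    for rec in records:
        reasons = []
        if rec["age"] < 0:
            reasons.append("negative age")
        elif rec["age"] > max_age:
            reasons.append("age over %d" % max_age)
        if rec["seniority_years"] < 0:
            reasons.append("negative seniority")
        if reasons:
            flagged.append((rec["employee_id"], reasons))
    return flagged


# Hypothetical sample data, not taken from a real HR system.
employees = [
    {"employee_id": "E001", "age": 34, "seniority_years": 5},
    {"employee_id": "E002", "age": -2, "seniority_years": 1},
    {"employee_id": "E003", "age": 131, "seniority_years": -4},
]

for employee_id, reasons in find_implausible(employees):
    print(employee_id, "->", ", ".join(reasons))
```

A check like this only catches values that are wrong by definition. Something like negative sales cannot be decided by code alone and still needs a conversation with the people who own the data.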

The use of automated data profiling tools can also help overcome data challenges. Data profiling involves checking the allowed values, logic, and consistency of data sets. Data profiling tools analyze data for consistency with business rules and provide recommendations on areas to investigate further in a dataset. After profiling your data, how do you determine if the data is "good enough" to pursue analysis? Again, you must look to the data owner. For example, in our Kara platform, we have a profiling tool in place with a validation rule that identifies all employees under the age of 14. The goal is to highlight outliers. For one of our clients, we had a lot of employees in this age range. So, we talked to the client and they explained that in the retail industry (convenience store), they were hiring more and more people under 14 if the parent would allow the child to work (largely due to labor shortages).
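
To make the idea of a validation rule concrete, here is a minimal sketch of rule-based profiling in Python. It is not how the Kara profiling tool is implemented; the rule names, field names, and sample records are assumptions made for the illustration. The point is that a rule flags records for review, and the data owner decides whether they are errors.

```python
# Each rule maps a (hypothetical) name to a test that returns True for an outlier.
RULES = {
    "under_14": lambda rec: rec["age"] < 14,
    "missing_hire_date": lambda rec: not rec.get("hire_date"),
}


def profile(records, rules=RULES):
    """Return, for each rule, the employee ids that should be reviewed."""
    hits = {name: [] for name in rules}
    for rec in records:
        for name, is_outlier in rules.items():
            if is_outlier(rec):
                hits[name].append(rec["employee_id"])
    return hits


# Hypothetical sample data.
staff = [
    {"employee_id": "E001", "age": 35, "hire_date": "2015-06-01"},
    {"employee_id": "E002", "age": 13, "hire_date": "2023-05-15"},
    {"employee_id": "E003", "age": 41, "hire_date": ""},
    {"employee_id": "E004", "age": 12, "hire_date": "2023-07-01"},
]

for rule, ids in profile(staff).items():
    print(rule, ids)
```

In the convenience store example, the `under_14` rule would list real employees, not errors, which is why the output is a list to discuss with the client and not a list of rows to delete.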

What are the common data problems and what are the solutions?

What if you determine that the data is not good enough to proceed with the analysis? The first step is to understand the difficulties. Sometimes the data element you want to analyze has missing values. Sometimes the data has not been updated and, therefore, does not reflect the most recent values. In some cases, the data you want to analyze does not even exist. Each of these scenarios can seem frustrating, even discouraging, but there is almost always a solution. Do what you can with what you have. You can always move forward.

The first thing to do, especially in the beginning, is to make sure you have the right definitions. It's very important to get high-level agreement on the metrics, which ones are most relevant to the organization, and what their definitions are. For example, let's talk about the number of employees: it sounds easy, but I've seen an organization that debated for a whole day about the definition of an employee. Should temporary workers be included? What about inactive employees (people on parental leave or disability)? It's important to get your HR data and metrics right and to put that definition on paper (virtually speaking) in your metrics dictionary.
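
A small sketch shows why the definition matters: the same data gives different headcounts depending on who counts as an employee. The field names (`contract`, `status`) and the sample records are hypothetical, and the two switches are just one possible way to write the definition down.

```python
def headcount(records, include_temporary, include_inactive):
    """Count employees under an explicit, documented definition."""
    count = 0
    for rec in records:
        if rec["contract"] == "temporary" and not include_temporary:
            continue
        if rec["status"] != "active" and not include_inactive:
            continue
        count += 1
    return count


# Hypothetical sample data.
workforce = [
    {"employee_id": "E001", "contract": "permanent", "status": "active"},
    {"employee_id": "E002", "contract": "temporary", "status": "active"},
    {"employee_id": "E003", "contract": "permanent", "status": "parental_leave"},
    {"employee_id": "E004", "contract": "temporary", "status": "disability"},
]

print(headcount(workforce, include_temporary=False, include_inactive=False))
print(headcount(workforce, include_temporary=True, include_inactive=True))
```

With these four records, the strictest definition counts one employee and the broadest counts four. Whatever you choose, the choice belongs in the metrics dictionary so that everyone computes the number the same way.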

Here are some tips for checking the quality of your data

a) Make sure you have all the dataset files you were expecting and that they contain all the information you need to continue your project. Check that the files cover the period you agreed upon. Don't just check the endings and beginnings of the files - check for data everywhere.

b) Verify that the amount of data corresponds to what you know about the organization (number of records at least equal to the number of employees?). The number of rows should be in line with your expectations. Check that the last line of data is complete, as a bad file transfer can cut off the end of the file.

c) Check to see if the list of columns is complete. Identify data columns that were not requested, but are included, to see if they can be useful to your project.

d) Examine the lists of values for the coded fields. Are they clear and consistent with expectations? For example, for the gender column, if you specify 1 for female and 2 for male, are there any values other than 1 and 2 in your dataset?

e) Check the range of values in each column. Are there any columns with values that do not appear to be appropriately distributed or that have extreme values?

f) Do your files contain many missing values? Your decision to use this data or not will depend on what is missing. For example, if you want to analyze the impact of diversity on promotions and your diversity data is incomplete, it will be impossible to draw good conclusions. When you are working with hundreds or thousands of employees, a few missing pieces of information will not impact the overall analysis. However, when large amounts of data are missing, say 50% or more, you should consider that data suspect in your analysis.

g) Check for duplicates. You may need to confirm with the data owner whether they are true duplicates or whether an excluded data column would distinguish them.

h) Pay attention to dates; HR data is known to have a lot of them. Dates can be a major problem, as different systems have different conventions (separating numbers with a slash or a dash, writing years with two or four digits, etc.).
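
Several of the tips above (d, f, g, and h) can be automated with very little code. The following Python sketch is one possible way to do it, under stated assumptions: the rows are dictionaries of strings as you would get from a CSV file, the column names and sample values are invented, and the three date formats are only candidates to adapt to your own systems.

```python
from collections import Counter
from datetime import datetime


def unexpected_codes(rows, column, allowed):
    """Tip d: count non-empty values of a coded field outside the agreed list."""
    return Counter(r[column] for r in rows if r[column] and r[column] not in allowed)


def missing_rate(rows, column):
    """Tip f: share of rows where the column is empty."""
    return sum(1 for r in rows if not r[column]) / len(rows)


def duplicate_keys(rows, key):
    """Tip g: key values that appear on more than one row."""
    counts = Counter(r[key] for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def matching_formats(text, formats=("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y")):
    """Tip h: list every candidate format that can parse the date string."""
    matches = []
    for fmt in formats:
        try:
            datetime.strptime(text, fmt)
        except ValueError:
            continue
        matches.append(fmt)
    return matches


# Hypothetical sample data.
rows = [
    {"employee_id": "E001", "gender": "1", "hire_date": "2021-03-04"},
    {"employee_id": "E002", "gender": "2", "hire_date": "25/12/2020"},
    {"employee_id": "E002", "gender": "F", "hire_date": "03/04/2021"},
    {"employee_id": "E003", "gender": "", "hire_date": "12-31-99"},
]

print(unexpected_codes(rows, "gender", {"1", "2"}))
print(missing_rate(rows, "gender"))
print(duplicate_keys(rows, "employee_id"))
for r in rows:
    print(r["hire_date"], matching_formats(r["hire_date"]))
```

Note what the date check reveals: "03/04/2021" parses both as day-first and as month-first, so the code cannot tell you which one is right, and "12-31-99" matches none of the candidate formats. Both cases are questions for the data owner, as are the duplicated employee number and the unexpected gender code.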

That's it for now. I hope you now understand how important data quality is to your HR analytics project! In an upcoming post, I'll talk about the biggest threats to your data quality, what metrics to put in place to measure it, and finally, the best practices for improving it. Until then, be well and stop burying your head in the sand: you are responsible for HR data quality. It's time to act!

If you have any questions or comments, don't hesitate to reach out!


