Ensuring overall data quality is paramount while preparing data for analysis and managing it on an ongoing basis. Here we detail 7 key data quality issues to avoid, and the best practices to follow.
There are 7 key causes of data quality issues. In preparing your data for analysis and managing your data on an ongoing basis, you should continually monitor your data to ensure these issues do not occur.
Duplicate data is when the same data is entered multiple times, but in slightly different ways. It is often created when data is extracted from multiple siloed systems and merged in a data warehouse, creating ‘copies’ of the same record. When duplicates go undetected, they can produce skewed or incorrect insights.
For example: A customer name spelt slightly differently, or address data captured with different syntax and abbreviations.
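As a minimal sketch (the field names and normalization rules are illustrative, not part of any particular tool), duplicate candidates can be surfaced by reducing each record to a normalized match key and looking for collisions:

```python
import re

def normalize(record: dict) -> tuple:
    """Build a fuzzy match key: lowercase, strip punctuation and
    collapse whitespace so near-identical records compare equal."""
    def clean(value: str) -> str:
        return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", value)).strip().lower()
    return tuple(clean(v) for v in record.values())

def find_duplicates(records: list[dict]) -> list[tuple[int, int]]:
    """Return index pairs of records that collide on the match key."""
    seen: dict[tuple, int] = {}
    pairs = []
    for i, rec in enumerate(records):
        key = normalize(rec)
        if key in seen:
            pairs.append((seen[key], i))
        else:
            seen[key] = i
    return pairs

# Hypothetical customer records from two siloed systems
customers = [
    {"name": "Jane O'Brien", "address": "12 Main St."},
    {"name": "jane obrien",  "address": "12 Main St"},
    {"name": "John Smith",   "address": "4 Oak Ave"},
]
print(find_duplicates(customers))  # -> [(0, 1)]
```

Real-world matching usually goes further (phonetic keys, edit distance), but even a simple key like this catches the casing and punctuation variants described above.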
Storing the same type of data in inconsistent formats is a common quality issue.
For example: Storing dates in mixed format, such as (US Date) MMM DD YYYY, (European Date) DD MMM YYYY and (Japan Date) YYYY MMM DD – all are valid, but if these formats are used in the same field, it will make effective analysis near impossible.
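A common remedy is to normalize all incoming dates to one canonical format on the way in. A minimal sketch, assuming the three formats above are the only ones present (purely numeric dates such as 03/05/2024 are ambiguous between US and European order and need an agreed convention):

```python
from datetime import datetime

# Candidate formats seen in the field (illustrative): US, European, Japanese.
FORMATS = ["%b %d %Y", "%d %b %Y", "%Y %b %d"]

def to_iso(raw: str) -> str:
    """Try each known format and normalize to ISO 8601 (YYYY-MM-DD)."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")

print(to_iso("Mar 05 2024"))  # -> 2024-03-05
print(to_iso("05 Mar 2024"))  # -> 2024-03-05
print(to_iso("2024 Mar 05"))  # -> 2024-03-05
```

Storing one unambiguous format (ISO 8601 here) is what makes date arithmetic and time-series analysis possible on the field.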
This data quality issue occurs when crucial pieces of information are missing, either as a result of failure to input it at the source system, or as a result of ETL processes.
For example: Address records where ZIP/post code data is sparse. Without a ZIP code, the ability to conduct location analysis is significantly compromised.
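Sparse fields can be quantified before analysis begins. A small illustrative check of a field's completeness rate (the record layout is hypothetical):

```python
def completeness(records: list[dict], field: str) -> float:
    """Fraction of records with a non-empty value for `field`."""
    filled = sum(1 for r in records if str(r.get(field) or "").strip())
    return filled / len(records) if records else 0.0

# Hypothetical address records with patchy ZIP data
addresses = [
    {"street": "12 Main St", "zip": "90210"},
    {"street": "4 Oak Ave",  "zip": ""},
    {"street": "9 Elm Rd",   "zip": None},
    {"street": "1 Pine Ct",  "zip": "10001"},
]
print(completeness(addresses, "zip"))  # -> 0.5
```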
Data inconsistency results from storing values in the same field in different languages or in different units.
For example: Storing volume in metric units (litres) as well as Imperial units (gallons) – any aggregate analysis across the field would be incorrect.
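The standard fix is to convert every value to one canonical unit before aggregating. A sketch assuming US gallons (an Imperial gallon would use a different factor, 4.54609 litres):

```python
# Conversion factors to the canonical unit (litres).
# "gal" is assumed to mean US gallons here; adjust for Imperial data.
TO_LITRES = {"L": 1.0, "gal": 3.785411784}

def normalize_volume(value: float, unit: str) -> float:
    """Convert a (value, unit) pair to litres before aggregating."""
    return value * TO_LITRES[unit]

# Mixed-unit readings from a hypothetical field
readings = [(10.0, "L"), (2.0, "gal")]
total = sum(normalize_volume(v, u) for v, u in readings)
print(round(total, 2))  # -> 17.57
```

Summing the raw numbers (10 + 2 = 12) would silently understate the true volume, which is exactly the aggregate error the paragraph above warns about.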
This is one of the most difficult data quality issues to spot. It occurs when the format is correct and every value is complete, but misspellings exist or the data is simply inaccurate.
For example: If sales opportunity data entered into your CRM system by your sales people is incorrect, then your ability to conduct forecast analysis is severely compromised.
Data invalidity is when your data can't possibly be correct based on simple rules or logic.
For example: Negative values for inventory level.
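Because invalidity is defined by simple rules, it can be caught mechanically. An illustrative sketch (the two rules shown are examples, not a complete rule set):

```python
def validity_errors(record: dict) -> list[str]:
    """Apply simple logical rules; return a list of rule violations."""
    errors = []
    # Inventory can never be negative
    if record.get("inventory_level", 0) < 0:
        errors.append("inventory_level must be >= 0")
    # A percentage discount must sit between 0 and 100
    if not (0 <= record.get("discount_pct", 0) <= 100):
        errors.append("discount_pct must be between 0 and 100")
    return errors

print(validity_errors({"inventory_level": -5, "discount_pct": 10}))
# -> ['inventory_level must be >= 0']
```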
Data imprecision, or lack of precision, occurs when data has been stored at a summarized level – often as a result of an ETL process – that does not give users the level of detail they need for analysis.
There are three areas in which you can address data quality issues.
The best place to address data quality issues is at the originating data source. This means addressing the systems and processes involved in data capture. The challenge with addressing issues at this layer is the high level of intervention needed at the business process layer – or the lack of any control at all if the data is provided by a third party. It is often difficult to get buy-in across the business to implement a data quality program at the source, even though it is the best place to address the issue.
If you cannot fix data quality at its source, then you can attempt to fix the issues via your ETL processes. This is often the pragmatic approach taken by many businesses. Using defined rules and smart algorithms, it is possible to fix many issues in this way – ensuring you have a clean data set to report from.
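Such rules can be expressed as an ordered list of row-level transform functions applied during the ETL transform step. A sketch with illustrative rules (the field names and defaults are examples, not a specific product's behaviour):

```python
def trim_whitespace(row: dict) -> dict:
    """Strip stray whitespace from every string field."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}

def uppercase_country(row: dict) -> dict:
    """Force country codes to one consistent case."""
    if isinstance(row.get("country"), str):
        row["country"] = row["country"].upper()
    return row

def default_missing_zip(row: dict) -> dict:
    """Flag missing ZIP codes explicitly rather than leaving blanks."""
    if not row.get("zip"):
        row["zip"] = "UNKNOWN"
    return row

CLEANING_RULES = [trim_whitespace, uppercase_country, default_missing_zip]

def clean(row: dict) -> dict:
    """Run each rule in order during the transform step."""
    for rule in CLEANING_RULES:
        row = rule(row)
    return row

print(clean({"country": " us ", "zip": ""}))
# -> {'country': 'US', 'zip': 'UNKNOWN'}
```

Because the rules are plain functions in a list, new fixes can be added without touching the load logic.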
Lastly, if you do not have control of the ETL processes and need to analyze a data set ‘as is’, then you can use rules and logic within a metadata layer to fix (or mask) your data quality issues. In this way, you can apply some of the rules and logic you would have applied in an ETL process, but in this case, the underlying data is not updated. Instead, the rules are applied at run time to the query, and fixes are applied on the fly.
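This run-time approach can be sketched as a view that applies the same kind of rules while reading, leaving the stored rows untouched (the country-alias rule below is illustrative):

```python
# Stored data with inconsistent country codes; it is never modified.
RAW_ROWS = [{"country": "USA"}, {"country": "U.S."}, {"country": "DE"}]

def canonical_country(row: dict) -> dict:
    """Map known aliases onto one canonical code (illustrative mapping)."""
    aliases = {"USA": "US", "U.S.": "US"}
    row["country"] = aliases.get(row["country"], row["country"])
    return row

def fixed_view(rows, rules):
    """Yield corrected copies of each row at query time.

    Each row is copied before the rules run, so the underlying
    data set stays exactly as it was stored."""
    for row in rows:
        patched = dict(row)
        for rule in rules:
            patched = rule(patched)
        yield patched

print([r["country"] for r in fixed_view(RAW_ROWS, [canonical_country])])
# -> ['US', 'US', 'DE']
print(RAW_ROWS[0]["country"])  # -> USA  (source unchanged)
```

This mirrors what a metadata layer does: the fix exists only in the query path, so the mask disappears if the rules are removed – which is why fixing at source or in ETL remains preferable when possible.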
A range of best practice measures exist to help you to manage the quality of your reporting data.
Create best practice metadata layers, specifically focusing on the business definitions of your data, to ensure common data definitions are used across all subject domains within your organization. These common definitions, formats and rules are the backbone of any data quality initiative.
Profile your data
Data profiling allows you to gain a deep understanding of your data. Data profiling is a method used to determine the frequency and distribution of data values in your data set. It should be used to directly measure data integrity and can be used as input to set up the measurement of other data quality dimensions. Using data profiling, you can easily determine how data conforms to specific business rules and defined data standards.
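A minimal illustration of profiling a single field's value distribution (the records are hypothetical; note how the frequency counts also surface conformance problems such as mixed casing):

```python
from collections import Counter

def profile(records: list[dict], field: str) -> dict:
    """Frequency distribution of a field's values, plus a null count."""
    values = [r.get(field) for r in records]
    freq = Counter(v for v in values if v is not None)
    return {
        "distinct": len(freq),
        "nulls": values.count(None),
        "frequencies": dict(freq),
    }

rows = [{"status": "OPEN"}, {"status": "open"},
        {"status": "CLOSED"}, {"status": None}]
print(profile(rows, "status"))
# -> {'distinct': 3, 'nulls': 1, 'frequencies': {'OPEN': 1, 'open': 1, 'CLOSED': 1}}
```

Here the profile immediately shows that "OPEN" and "open" are being stored as distinct values – a rule violation a simple eyeball of the table might miss.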
Create data quality reports & dashboards
Building on your data profiling work, measure data quality KPIs based on the data quality dimensions relevant to your data set and needs. Build a data quality dashboard to continuously monitor the quality trend of the data entering your reporting and analytics environment.
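As a sketch, completeness KPIs per field could be computed like this and fed to such a dashboard (the field names and batch are illustrative):

```python
def quality_kpis(records: list[dict], required: list[str]) -> dict:
    """Percentage of records with a non-empty value for each required field."""
    n = len(records)  # assumes a non-empty batch
    return {
        field: round(
            100 * sum(1 for r in records if r.get(field) not in (None, "")) / n, 1
        )
        for field in required
    }

batch = [
    {"id": 1, "zip": "90210", "email": "a@x.com"},
    {"id": 2, "zip": "",      "email": "b@x.com"},
    {"id": 3, "zip": None,    "email": ""},
]
print(quality_kpis(batch, ["zip", "email"]))  # -> {'zip': 33.3, 'email': 66.7}
```

Tracking these percentages per load makes a declining trend visible long before users notice broken reports.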
Set quality alerts
Once your reports have been created, set up threshold-based alerts to ensure you are alerted to new incoming quality issues as soon as they arise. Thresholds can be tailored based on the urgency and priority of the issues identified.
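An illustrative threshold check over such KPIs (the threshold values are arbitrary examples, tuned in practice to the urgency of each issue):

```python
# Example thresholds: completeness must stay high, duplication low.
THRESHOLDS = {"zip_completeness": 95.0, "duplicate_rate": 1.0}

def check_alerts(kpis: dict) -> list[str]:
    """Compare each KPI to its threshold and report breaches."""
    alerts = []
    if kpis.get("zip_completeness", 100.0) < THRESHOLDS["zip_completeness"]:
        alerts.append("zip_completeness below threshold")
    if kpis.get("duplicate_rate", 0.0) > THRESHOLDS["duplicate_rate"]:
        alerts.append("duplicate_rate above threshold")
    return alerts

print(check_alerts({"zip_completeness": 88.0, "duplicate_rate": 0.4}))
# -> ['zip_completeness below threshold']
```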
Create an issues log
Create, maintain and report the outcomes of a data quality issues log. This enables you to document issues, create preventative measures and confirm the effectiveness of data cleansing processes.
Enact data governance & stewardship processes
A data governance framework must lay out the data policies, data standards and data quality KPIs that are needed, including the business rules that must be adhered to, underpinned by data quality measures. Furthermore, the framework must include organizational structures such as a data governance committee, and roles such as data owners, data stewards or data custodians, to ensure ongoing management of your data quality.