Best Practice Guide

The ultimate step-by-step guide for analytic content creators

Data Enrichment

  • Data enrichment is one of the key processes by which you can add more value to your data.  It refines, improves and enhances your data set through the addition of new attributes.  For example, using an address ZIP/Post Code field you can take simple address data and enrich it by adding socio-economic demographic data such as average income, household size and population attributes.  By enriching data in this way you can gain a better understanding of your customer base and potential target customers.

  • Enrichment Techniques

    There are six common tasks involved in data enrichment:

    Appending Data
    Segmentation
    Derived Attributes
    Imputation
    Entity Extraction
    Categorization

    Appending Data

    By appending data to your data set you bring multiple data sources together to create a more holistic, accurate and consistent data set than that produced by any one data source.  For example, extracting customer data from your CRM, financial and marketing systems and bringing it together will give you a better overall picture of your customer than any single system can.

    Appending data as an enrichment technique also includes sourcing third-party data, such as demographic or geometry data by ZIP/Postcode, and merging it into your data set.  Enriching location data is one of the most common techniques as this data is readily available for most countries.

    See Yellowfin geography packs.

    Other useful examples could include:
    Exchange Rates
    Weather Data
    Date / Time Hierarchies
    Traffic Data
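    The appending technique can be sketched in a few lines of code.  In this illustrative Python example all field names and figures are invented for demonstration; a real job would source the demographic table from a third-party provider:

```python
# Sketch: append third-party demographic data to customer records by ZIP code.
# All field names and figures here are illustrative, not from a real data set.

customers = [
    {"customer_id": 1, "name": "A. Smith", "zip": "90210"},
    {"customer_id": 2, "name": "B. Jones", "zip": "10001"},
]

# Hypothetical third-party demographics keyed by ZIP code.
demographics = {
    "90210": {"avg_income": 95000, "avg_household_size": 2.4},
    "10001": {"avg_income": 72000, "avg_household_size": 1.9},
}

def enrich(customer):
    """Return the customer record merged with demographics for its ZIP.

    Unknown ZIP codes pass through unchanged, so the process stays complete.
    """
    extra = demographics.get(customer["zip"], {})
    return {**customer, **extra}

enriched = [enrich(c) for c in customers]
```

    In an ETL tool the same operation is typically expressed as a left join keyed on the ZIP/Postcode field.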

    Data Segmentation

    Data segmentation is a process by which you divide a data object (such as a customer, product or location) into groups based on a common set of pre-defined variables (such as age, gender or income for customers).  This segmentation is then used as a way to better categorise and describe the entity.

    For example, common segmentations for customers include:

    Demographic Segmentation – based on gender, age, occupation, marital status, income, etc.
    Geographic Segmentation – based on country, state, or city of residence. Local businesses may even segment by specific towns or counties.
    Technographic Segmentation – based on preferred technologies, software, and mobile devices.
    Psychographic Segmentation – based on personal attitudes, values, interests, or personality traits.
    Behavioral Segmentation – based on actions or inactions, spending/consumption habits, feature use, session frequency, browsing history, average order value, etc.
    These can lead to named groups of customers such as Trend Setters or Tree Changers.

    By creating calculated fields, either in an ETL process or within a meta-data layer, you can build your own segmentation based on the data attributes you have.
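    As a sketch, a calculated segmentation field can be expressed as simple rules.  The band boundaries and segment names below are illustrative assumptions, not recommended business rules:

```python
# Sketch: a calculated field assigning a demographic segment from age and
# income.  Boundaries and labels are invented for illustration only.

def segment(age, income):
    """Return a segment label for a customer record."""
    if age < 30:
        return "Young Professional" if income >= 60000 else "Young Budget"
    if age < 60:
        return "Established" if income >= 60000 else "Mid Value"
    return "Senior"
```

    The same rules can usually be written once as a calculated field in the meta-data layer so that every report applies an identical segmentation.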

    Derived Attributes

    Derived attributes are fields that are not stored in the original data set but can be derived from one or more fields.  For example, ‘Age’ is very rarely stored, but you can derive it from a ‘date of birth’ field.  Derived attributes are very useful because they often contain logic that is repeatedly used for analysis.  By creating them within an ETL process or at the meta-data layer you reduce the time it takes to create new analysis and ensure consistency and accuracy in the measures being used.

    Common examples of derived attributes include:

    Counter Field – based on a unique ID within the data set.  This allows for easy aggregations.
    Date Time Conversions – using a date field to extract the day of week, month of year, quarter, etc.
    Time Between – by using two date/time fields you can calculate the period elapsed, such as response times for tickets.
    Dimensional Counts – by counting values within a field you can create new counter fields for a specific area, such as a count of narcotics offences, weapons offences or petty crime.  This allows for easier comparative analysis at the report level.
    Higher Order Classifications – product category from product, age band from age.
    Advanced derived attributes can also be the output of data science models run against your data.  For example, customer churn risk or propensity to spend can be modelled and scored.
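    Several of the derived attributes above can be sketched with standard date handling.  This illustrative Python example assumes nothing beyond the standard library:

```python
from datetime import date, datetime

# Sketch: common derived attributes computed from raw date/time fields.

def age_from_dob(dob, today=None):
    """Derive age in whole years from a date of birth."""
    today = today or date.today()
    # Subtract one if this year's birthday has not yet occurred.
    return today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))

def day_of_week(d):
    """Extract the weekday name from a date field."""
    return d.strftime("%A")

def hours_between(start, end):
    """Elapsed hours between two timestamps, e.g. a ticket response time."""
    return (end - start).total_seconds() / 3600
```

    Defining these once at the meta-data layer, rather than per report, is what delivers the consistency the section describes.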

    Data Imputation

    Data imputation is the process of replacing values for missing or inconsistent data within fields.  Rather than treating a missing value as zero, which would skew aggregations, an estimated value helps to facilitate a more accurate analysis of your data.  For example, if the value for an order was missing you could estimate it based on previous orders by that customer or for that bundle of goods.
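    A minimal sketch of this imputation rule, assuming a simple list of order records, might look like the following.  The fallback to an overall mean is an illustrative choice, not a prescribed one:

```python
# Sketch: impute a missing order value with the mean of that customer's
# previous orders, falling back to the overall mean when the customer has
# no known orders.  Record shape is illustrative.

def impute_order_values(orders):
    """orders: list of {"customer": str, "value": float or None}."""
    known = [o["value"] for o in orders if o["value"] is not None]
    overall_mean = sum(known) / len(known)

    # Collect each customer's known order values.
    by_customer = {}
    for o in orders:
        if o["value"] is not None:
            by_customer.setdefault(o["customer"], []).append(o["value"])

    result = []
    for o in orders:
        if o["value"] is None:
            vals = by_customer.get(o["customer"])
            value = sum(vals) / len(vals) if vals else overall_mean
            # Flag imputed rows so downstream analysis can exclude them.
            result.append({**o, "value": value, "imputed": True})
        else:
            result.append({**o, "imputed": False})
    return result
```

    Flagging imputed rows preserves the audit trail, which matters for the evaluation criteria discussed under best practices.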

    Entity Extraction

    Entity extraction is the process of taking unstructured or semi-structured data and extracting meaningful structured data from it.  When applied, you are able to identify entities such as people, places, organizations and concepts, numerical expressions (currency amounts, phone numbers, etc.) and temporal expressions (dates, times, durations, frequencies, etc.).

    As a simple example, by parsing you could extract a person's name from an email address, along with the web domain of the organization to which they belong; or split an envelope-style address into discrete elements such as building name, unit, house number, street, postal code, city, state/province and country.
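    The email parsing example can be sketched as below.  It assumes a ‘first.last@domain’ address format, which real-world data will often violate, so treat it as an illustration of the idea rather than a robust parser:

```python
# Sketch: extract a display name and organisation domain from an email
# address.  Assumes "first.last@company.com" style addresses.

def extract_from_email(email):
    """Split an email into a guessed person name and the web domain."""
    local, _, domain = email.partition("@")
    # Treat dot-separated parts of the local part as name components.
    name = " ".join(part.capitalize() for part in local.split("."))
    return {"name": name, "domain": domain}
```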

    Data Categorization

    Data Categorisation is the process of labelling unstructured data so that it becomes structured and able to be analysed. This falls into two distinct categories:

    Sentiment analysis – the extraction of feelings and emotions from text.  For example, was the customer feedback frustrated or delighted, positive or neutral?

    Topic detection – determining the ‘topic’ of the text.  Was the text about politics, sport or house prices?

    Both of these techniques enable you to analyse unstructured text to get a better understanding of that data.
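    As a toy sketch only, a keyword-based label can illustrate what sentiment analysis produces.  Production sentiment analysis relies on trained language models; the word lists here are invented for demonstration:

```python
# Sketch: a naive keyword-based sentiment label for feedback text.
# The word lists are illustrative; real sentiment analysis uses trained models.

POSITIVE = {"delighted", "great", "love", "excellent"}
NEGATIVE = {"frustrated", "bad", "slow", "broken"}

def sentiment(text):
    """Label text positive, negative or neutral by keyword counts."""
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

    However it is produced, the structured label is what makes the unstructured text available to ordinary reports and aggregations.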

  • Data Enrichment Best Practices

    Data enrichment is rarely a one-off process.  In an analytics environment where new data is being fed into your system on an ongoing basis, your enrichment steps will need to be repeated regularly.  As a result, a number of best practices are required to ensure your desired outcomes are met and that your data remains of a high quality.  These include:

    Reproducibility and Consistency
    Each data enrichment task must be reproducible and generate the same expected results on a consistent basis.  Any process you create needs to be rules driven so that you can run it repeatedly with the confidence that you will always have the same outcome.

    Clear Evaluation Criterion
    Each data enrichment task must have a clear evaluation criterion.  You must be able to assess that the process has run and has been successful.  For example after running a process you are able to compare the recent outcomes with prior jobs and see that the results are as expected.

    Scalability
    Each data enrichment task should be scalable in terms of resources, timeliness and costs.  Assuming that your data will grow over time, any process that you create should be able to be maintained as your data grows or you add other processes to your transformation workloads.  For example, if your process is entirely manual you will very quickly be constrained by the amount you can process within the required time, and the process will be cost intensive – so automate as much as possible using infrastructure that can easily grow with your needs.

    Completeness
    Each data enrichment task must be complete with respect to the data that is input into the system, producing results with the same characteristics.  This means that for any intended output you have anticipated all possible results, including cases where the result is ‘unknown’.  By being complete, you can be assured that when new input data is added to the system you will always have a valid outcome from the enrichment process.

    Generality
    The data enrichment task should be applicable to different data sets.  Ideally the processes that you create will be able to be transferable to different datasets so that you can reuse logic for multiple jobs.  For example day of week extraction should be applied in exactly the same way to any date field.  This ensures consistency of outcome and helps to maintain the business rules associated with your data across all your subject domains.
