The Data Scientist's Guide to Speed Dating Success
The first couple of weeks of the Australian new year is an odd time. In theory the holiday break is over but summer is still in full swing. The roads are uncluttered and when you look around the workplace, there are plenty of vacant seats. Business may have officially resumed but nobody is taking it too seriously. This is the ideal time for channelling creative ideas and planning projects for the year ahead.
We mention all of this in an attempt to explain why the two of us have been busy planning a Yellowfin webinar about speed dating, love and data science.
Our objective is to show how a data science workflow based on the CRISP-DM methodology can be integrated into Yellowfin in order to develop a model that can be deployed and productionised for operational use. Along the way we will demonstrate how data scientists can streamline the data wrangling, feature engineering and visualisation steps.
We intend to achieve all this with a real world project. We will use data science to improve speed dating outcomes and help Tim find love.
We've already started working our way through the key elements of the CRISP-DM methodology. We've established the business understanding, identified that the problem is one of love and created our preliminary plan.
In addition, we've obtained a suitable public speed dating data set, which is now being put to use. Tim has diligently worked his way through this, answering more than 30 questions designed to identify preferences relating to prospective partner attributes, social activities and the speed dating environment.
Subsequent elements of the CRISP-DM methodology are being enabled through the Yellowfin Data Science Workflow.
- Data Understanding – Yellowfin is helping speed up data ingestion, profiling, exploration and visualisation to discover initial insights into the data.
- Data Preparation – Here Yellowfin’s Rich Data Transformation functionality and its Cleansing, Enrichment and Feature selection are being used to produce datasets for the modelling process.
- Modelling – Yellowfin has the ability to consume models developed in dedicated data science and machine learning toolsets through its support for PMML and PFA along with plugins for R and H2O.AI. In this step we will be using both H2O.AI and R to develop candidate models using the prepared datasets.
- Deployment – Usually this means deploying a code representation of the model into an operational system to score or classify new data. Importantly, the code representation must also include all the data prep steps leading up to modelling so that the model will treat new raw data in the same manner as during model development. Yellowfin’s Data Transformation functionality and extensible plugin framework provide a rich processing pipeline that supports the end-to-end data science workflow.
It should be noted that in a larger, more complex project, an evaluation phase would precede Deployment. For the sake of our relatively simple business problem and to keep the webinar to a reasonable time, we’ll be jumping straight to deployment.
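To make the deployment point concrete, here is a minimal sketch of why the data prep steps have to travel with the model. It is plain Python rather than Yellowfin, PMML or H2O.AI code, and everything in it is an assumption for illustration: the two features, the made-up ratings, the nearest-centroid "model" and the function names `fit_pipeline` and `score`.

```python
import json
from statistics import mean, pstdev


def scale(row, means, stds):
    """Standardise one raw row using parameters learned during training."""
    return [(value - m) / s for value, m, s in zip(row, means, stds)]


def fit_pipeline(rows, labels):
    """Learn the data prep parameters and a toy model from raw training rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard against constant columns
    scaled = [scale(row, means, stds) for row in rows]
    # Toy model (an assumption for illustration): one centroid per class.
    centroids = {}
    for label in sorted(set(labels)):
        members = [row for row, lab in zip(scaled, labels) if lab == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return {"means": means, "stds": stds, "centroids": centroids}


def score(pipeline, raw_row):
    """Apply the stored prep steps, then pick the nearest class centroid."""
    scaled = scale(raw_row, pipeline["means"], pipeline["stds"])

    def squared_distance(label):
        centroid = pipeline["centroids"][label]
        return sum((a - b) ** 2 for a, b in zip(scaled, centroid))

    return min(pipeline["centroids"], key=squared_distance)


# Made-up training data: [attraction rating, shared interests rating], 1-10 scale.
rows = [[8, 7], [9, 6], [7, 8], [3, 2], [4, 3], [2, 4]]
labels = ["match", "match", "match", "no_match", "no_match", "no_match"]

pipeline = fit_pipeline(rows, labels)

# "Deploy" by serialising the prep parameters together with the model, then reload.
restored = json.loads(json.dumps(pipeline))

print(score(restored, [8, 8]))
print(score(restored, [2, 2]))
```

Because the scaling parameters are stored alongside the centroids, the reloaded pipeline accepts raw ratings and transforms them exactly as it did during training. Shipping the centroids alone would leave the operational system guessing how to prepare new data.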
So, that’s the plan.
Can the world’s leading data science methodology, a thorough and well-considered model and Yellowfin’s supporting workflows help Tim stand out from the speed dating crowd? When we plug his answers into the data model, will he find his perfect match?
All will be revealed during the webinar. To find out, join us at 14.00 AEST on 15th February 2018.
Rob Aldridge & Tim McIntosh