Why is it so difficult to bridge the worlds of Data Science and Business Intelligence & Analytics?
Talking to data scientists out there, a common story that comes up again and again sounds very much like the one below:
“I decided to build a fancy ML model to predict email CTR at the individual level. I marched on, aggregated a bunch of user-level features in Pig, and built a random forest model to predict email clicks. The idea is that if a user has a consistent, long history of low CTR, we can safely hold back that email from that user.
There was only one problem – all of my work was done on my local machine in R. People appreciated my efforts, but they didn’t know how to consume my model because it was not “productionized” and the infrastructure could not talk to my local model. Hard lesson learned!”
– Robert Chang, Data Scientist at Airbnb
If we zero in on this scenario, the most common issue faced by data science and business intelligence teams in BI deployments is the last mile before value: “productionizing” the data science models that have been built.
More often than not, data scientists are siloed away from the business, using their own tools and datasets, creating and training models on sandboxes (or laptops!) that often never see the light of day in the enterprise. Those models that do reach the enterprise are thrown over the wall to data engineers, who have to make them “work”. Sometimes the code has to be rewritten completely, because the data scientists’ initial goal can differ from the goal in production (e.g. accuracy vs. scalability).
Closing this gap between data science and production is widely cited as the hardest part of bringing models into BI.
Customers are looking for a better way
With this problem to solve, we set out to ensure that data science outputs could be consumed and shared easily between teams, all in a single platform. It was also important to be modelling-tool agnostic. This would allow data scientists to drop models from their own tools into production, and allow the enterprise BI infrastructure to integrate with those models. Whether you want to produce a churn probability score from a data frame or run an entire model over a data pipeline, the BI platform needs to handle both easily.
This is why we decided to support popular standard formats like Predictive Model Markup Language (PMML) and Portable Format for Analytics (PFA), and to create a framework for integrating proprietary or open source API-based data science capabilities like H2O.ai.
Each of these options is full of goodies, so we’ll do a show-and-tell over a series of blogs. Today’s piece focuses on PMML.
Predictive Model Markup Language (PMML)
PMML is an XML-based language and the industry standard for sharing statistical, predictive, and data mining models between PMML-compliant applications.
That means you can create and train a model in your preferred statistical tool, export it to PMML, and point Yellowfin to the location of that file to immediately score production sales and marketing data.
This shortens the deployment workflow and productionizes data science models that range from Naive Bayes, Clustering, and Regression, to Decision Trees, K-Nearest Neighbours, and Neural Networks – so you’ve got quite a few options here!
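To make the format concrete, here is an illustrative, hand-written PMML fragment for a simple linear regression; the field names, coefficients, and model are invented for this example rather than exported from any real tool:

```xml
<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <DataDictionary numberOfFields="3">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="monthly_spend" optype="continuous" dataType="double"/>
    <DataField name="churn_score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="churn" functionName="regression">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="monthly_spend"/>
      <MiningField name="churn_score" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="0.5">
      <NumericPredictor name="age" coefficient="0.01"/>
      <NumericPredictor name="monthly_spend" coefficient="-0.002"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
```

Because the document describes both the expected inputs (the DataDictionary) and the model’s math (the RegressionTable), any PMML-compliant consumer can load it and score incoming rows without seeing the code that trained it.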
What tools generate PMML?
Nearly every popular tool in the data science world.
These range from platforms like R, Python, KNIME, RapidMiner, SAS, SPSS, and Apache Spark to frameworks and libraries like TensorFlow, LightGBM, and XGBoost (in many cases via dedicated converter libraries).
How does it work in Yellowfin 7.4?
After an initial query retrieves data, Yellowfin allows post-processing calculations to be applied to the query results to transform them; we call these Advanced Functions. It’s an open framework that lets our customers bring specific post-query column operations into the platform. In Yellowfin 7.4, the JPMML library enables you to evaluate PMML models and run them over that post-query data.
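Under the hood, evaluating a PMML model means walking the XML and applying the model’s math to each row. As a toy, stdlib-only Python sketch of that idea (the PMML document and field names are invented for the example, and only the simplest RegressionModel case is handled; a real evaluator like JPMML supports many model types, transformations, and validation):

```python
# Toy PMML scorer, for intuition only: parses a PMML document containing a
# single RegressionModel and computes intercept + sum(coefficient * value).
# The PMML below is hand-written for this example.
import xml.etree.ElementTree as ET

PMML_DOC = """<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <RegressionModel modelName="toy" functionName="regression">
    <RegressionTable intercept="0.5">
      <NumericPredictor name="age" coefficient="0.01"/>
      <NumericPredictor name="monthly_spend" coefficient="-0.002"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

# PMML elements live in a default namespace, so lookups need a prefix map.
NS = {"p": "http://www.dmg.org/PMML-4_3"}

def score(pmml_text, row):
    """Score one input row against the first RegressionTable in the PMML."""
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    total = float(table.get("intercept"))
    for pred in table.findall("p:NumericPredictor", NS):
        total += float(pred.get("coefficient")) * row[pred.get("name")]
    return total

# 0.5 + 0.01 * 40 - 0.002 * 100 = 0.7
print(round(score(PMML_DOC, {"age": 40, "monthly_spend": 100}), 4))  # → 0.7
```

This is essentially what a post-query Advanced Function has to do for every result row, which is why delegating it to a battle-tested evaluator library makes sense.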
However, it’s much easier to show it in action. Check out the video below for a behind-the-scenes look at how PMML runs within Yellowfin 7.4:
See all this and more at our launch webinar on October 26th.