Feature Engineering — Hypothesis-driven vs. ML-driven

Piyanka Jain
Published in DataSeries
1 min read · Jan 31, 2020


I was talking with the CTO of a Fortune 100 bank this morning, and we got to discussing feature engineering in AI/ML models. With the advent of ML and AI, many believe the statistical methods of feature engineering are redundant. For example, in a supervised 0/1 classification problem, many now use LASSO to identify important features instead of the correlation matrix (the statistical approach).
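To make the contrast concrete, here is a minimal sketch of the two screening approaches on synthetic data, assuming scikit-learn is installed; the data and the choice of an L1-penalized logistic regression (the LASSO analog for classification) are illustrative, not from the original post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
# The outcome truly depends only on features 0 and 2.
y = (X[:, 0] + 2 * X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Statistical approach: correlation of each feature with the 0/1 outcome.
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# ML approach: an L1 (LASSO-style) penalty shrinks irrelevant
# coefficients toward zero, leaving the informative features.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
coefs = model.coef_.ravel()

print("correlations:", corr.round(2))
print("L1 coefficients:", coefs.round(2))
```

Both methods flag features 0 and 2 here; the point of the post is that neither can invent a feature that was never in the candidate set to begin with.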

While many of these are viable shortcuts for building a first rough model, you might be surprised to find that hypothesis-driven variable creation and combination still delivers far better results than ML-driven feature optimization in the refined model.

In a recent client engagement on predicting customer churn, we created a hypothesis-driven variable based on a brainstorm with the product and marketing teams: a customer who clicks three levels deep on a help page, then calls customer support and doesn't get first-call resolution (FCR). That variable turned out to be the top predictor of customer churn. No ML- or statistics-driven method could have produced this variable combination, or the lift it delivered. So we use hypotheses plus math for feature optimization.
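A feature like the one above is just a conjunction of behavioral signals. As a sketch, assuming event data has already been rolled up per customer, it might look like this; the column names (`help_click_depth`, `called_support`, `got_fcr`) are hypothetical, not from the actual client data.

```python
import pandas as pd

# Hypothetical per-customer rollup of help-page and support-call events.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "help_click_depth": [3, 1, 4],   # deepest help-page level reached
    "called_support": [True, True, False],
    "got_fcr": [False, True, False],  # first-call resolution achieved?
})

# Hypothesis-driven flag: drilled >= 3 levels into help, then called
# support, and still did not get first-call resolution.
customers["deep_help_no_fcr"] = (
    (customers["help_click_depth"] >= 3)
    & customers["called_support"]
    & ~customers["got_fcr"]
).astype(int)

print(customers[["customer_id", "deep_help_no_fcr"]])
```

The point is that this compound condition comes from domain brainstorming; a feature-selection algorithm can only rank it once someone has thought to construct it.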

What has your experience been?

#pragmatic #datascience #bestpractices


Data Literacy and Data Science thought leader, internationally acclaimed best-selling author, keynote speaker, President and CEO of SWAT data science consulting