Rexer Analytics is happy to announce the release of our first “Data Science Insights” article. This is the beginning of a periodic series of short thought pieces on a variety of data science topics we will be writing and releasing in the coming months.
Data Science Insights #1: The Pitfalls of Self-Service Analytics, by Heather Allen, PhD
Companies that produce tools for Data Science have recently placed significant emphasis on the idea of “data science for everyone” and the “Citizen Data Scientist.” An increasing number of software vendors offer tools that will “bring the analysis to you,” and suggest that busy executives simply need to purchase the right analytic engine and all of their business questions will be answered. Even better, these packages will deliver “insights,” answering questions about your business that you never thought to ask.
While analytic tools are becoming more powerful every day, and ever simpler to use thanks to rich graphics engines and simple GUIs, no tool can replace an experienced data analyst. Here are a few pitfalls to watch out for when presented with analytics conducted using these self-service packages:
- “Predictors” that Occur after the Target Event – Often the question we are asking when we begin a predictive analysis is “what variables are having an impact on my target?”, whether that target is attrition, satisfaction, or customer cross-sell. However, our data sources often contain variables or events that are not anchored in time. Any event that occurs after the target cannot possibly be affecting that target. For example, when a student fails to pay her fee bill after she has already dropped out of college, the unpaid bill is not a predictor of her dropout. While a self-serve data engine may notice that those two facts are correlated (unpaid bills and student attrition), the unpaid bill is due to the attrition, not the other way around.
- The Danger of Small Samples – In order for a finding to carry weight, it needs to be built upon a solid foundation of replicable data. If we have a dataset of 10,000 people, but only 10 of them have a value recorded for the variable “eye color,” the insight that blue eyes are related to the purchase of car loans will not be of service to us. Without an analyst to keep a careful watch on the quality and size of the dataset, research engines can churn out numerous “insights” that would not actually advance your knowledge of your population at all.
- False Positives – It is a truth of statistics that the more analyses that are run, the more “significant” findings will emerge. Skilled data scientists have several ways of watching for and avoiding this problem of spurious “significant” findings. These include the use of hold-out samples, raising the bar for statistical significance as the number of analyses considered increases, and several other techniques. Self-service data engines will typically allow you to look at as many relationships as you like but often with no methods for avoiding spurious “significant” findings.
- Lack of Sufficient (or Appropriate) Data Preparation – It is an open secret in our field that up to 90% of the time involved in advanced data analytics is actually expended in the data cleaning and preparation stage. Self-serve data engines often do not provide data cleaning services. They take the data that you feed them, raw and dirty as it may be, and perform analyses on that data as-is. Without careful preparation of the dataset (cleaning up out-of-range values, accounting for missing data, ensuring that data is coded in the most appropriate manner for the intended analysis, creating thoughtful derived data features, etc.), the insights that are produced will be as dirty as the data that produced them. As the old saying goes: garbage in, garbage out.
- Appropriate Definition of the Target Variable – Self-serve data engines are generally unable to guide you through the complex task of defining your target variable. For example, if you’re interested in customer attrition, how do you define “attrition”? If you are a bank, are you interested in customers who close their checking account? Or perhaps customers who close all of their accounts? How does lack of activity contribute to your definition? If a customer does not make a transaction for a year, but leaves their account open, have they attrited? And what time period would you like to use for attrition? A month? Three months? A year? How does the definition of attrition change if you consider customers with zero balances? All of these questions are important to finding the right definition of attrition for you. None of them will be asked by your friendly neighborhood self-serve data engine.
- Is There a Better Way? – A non-analyst may be able to learn enough to generate somewhat appropriate models, or have a software package do that for them. But without a full understanding of the range of potential analytic options and their implications, they may never discover that better modeling options are available. Even within a specific modeling technique, there are often options that can be selected and variable combinations that can be explored which become apparent only when an analyst merges analytic knowledge with insight into the business problem at hand, and which would be unknowable in an automated process.
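To make the False Positives pitfall above concrete, here is a minimal Python sketch. It is an illustration only, not the method of any particular self-service product. It assumes a hypothetical dataset of 200 customers split evenly into buyers and non-buyers, tests 1,000 pure-noise "predictors" against that split with an approximate z-test, and then applies a Bonferroni correction as one simple way of "raising the bar" for significance. All function names and parameter values are made up for the example.

```python
import math
import random
from statistics import mean, variance


def two_sided_p_value(group_a, group_b):
    """Approximate two-sided p-value for a difference in means (z-test)."""
    std_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    z = (mean(group_a) - mean(group_b)) / std_error
    # 2 * (1 - Phi(|z|)) expressed with the complementary error function
    return math.erfc(abs(z) / math.sqrt(2))


def count_spurious_findings(n_predictors=1000, n_customers=200, alpha=0.05, seed=0):
    """Test many random predictors against a target they cannot explain."""
    rng = random.Random(seed)
    # Hypothetical target: first half of customers bought, second half did not.
    bought = [i < n_customers // 2 for i in range(n_customers)]

    p_values = []
    for _ in range(n_predictors):
        # Pure noise: by construction this predictor is unrelated to the target.
        predictor = [rng.gauss(0, 1) for _ in range(n_customers)]
        buyers = [x for x, b in zip(predictor, bought) if b]
        others = [x for x, b in zip(predictor, bought) if not b]
        p_values.append(two_sided_p_value(buyers, others))

    # Naive approach: every p-value below alpha is reported as an "insight".
    naive_hits = sum(p < alpha for p in p_values)
    # Bonferroni: divide alpha by the number of analyses that were run.
    corrected_hits = sum(p < alpha / n_predictors for p in p_values)
    return naive_hits, corrected_hits


if __name__ == "__main__":
    naive, corrected = count_spurious_findings()
    print(f"'Significant' noise predictors at alpha=0.05: {naive}")
    print(f"Still significant after Bonferroni correction: {corrected}")
```

Because every predictor here is random noise, any "significant" result is spurious by construction. At a 0.05 threshold roughly 5% of the tests should still pass by chance, while the corrected threshold should leave few or none. Bonferroni is deliberately conservative; hold-out samples and other corrections are alternatives an analyst might prefer.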
These are just a few of the pitfalls one encounters when attempting to mine insights without the assistance of a well-trained and experienced data scientist. As is true elsewhere in life, the easy answer is often not the best one.