I have seen two different cultures in data science: “mathematicians” and “engineers”.
“Mathematicians” are often academics, or have a PhD. Most of them have a background in statistics or mathematics. They believe in the power of theory, proper assumptions and interpretable models. They like models that are based on proven principles, or at least principles accepted by the academic community.
Good examples of the “mathematician’s” way of thinking can be found in most statistical journals. The authors state clear assumptions about a model before applying it, and they try to base their work on sound statistical principles that other statisticians have followed in the past. An example is using a common distribution (e.g. Poisson or normal) as a model of the data and fitting the model using maximum-likelihood estimation.
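To make the maximum-likelihood example concrete, here is a minimal sketch in plain Python. The counts are made up for illustration, and the helper name `poisson_log_likelihood` is my own, not something taken from a library. It assumes the data are independent Poisson counts, in which case the maximum-likelihood estimate of the rate is simply the sample mean.

```python
import math


def poisson_log_likelihood(rate, counts):
    """Log-likelihood of independent Poisson counts at a given rate."""
    # log P(k | rate) = k * log(rate) - rate - log(k!)
    return sum(k * math.log(rate) - rate - math.lgamma(k + 1) for k in counts)


# Hypothetical counts, invented for this example.
counts = [2, 4, 3, 5, 1, 3, 4, 2]

# Under the Poisson assumption, the maximum-likelihood estimate of the
# rate has a closed form: the sample mean.
rate_mle = sum(counts) / len(counts)

# Sanity check: search a coarse grid of candidate rates and confirm that
# none of them beats the closed-form estimate.
grid = [0.5 + 0.01 * i for i in range(1000)]
best_on_grid = max(grid, key=lambda r: poisson_log_likelihood(r, counts))

print(f"closed-form MLE: {rate_mle:.2f}, best rate on grid: {best_on_grid:.2f}")
```

The point of the sketch is the workflow rather than the numbers: the assumption (a Poisson model) is stated first, and the estimate follows from it.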
“Engineers” care about what works. They don’t mind using a black-box model as long as it works. To them, an interpretable model is useful only if it works better than a black-box model, or if interpretability is a business requirement. “Engineers” usually have a background in computer science.
A famous example of this mentality is Geoff Hinton, one of the most prominent researchers in deep learning and artificial intelligence. He is known for taking a more experimental view of the subject: trying different things out, testing what works and then building a theory. Hinton’s background is in experimental psychology. In psychology, experimental procedures are far more important than any theory. The theory usually comes on top of experimental evidence, in order to create a coherent body of knowledge. Also, in fields like clinical psychology, what matters is whether a technique or a method can help a patient, not so much the theory backing a particular practice.
The question is, where does data science really belong? Should it be a theory-driven science, as statistics has traditionally been, or should it take a more experimental approach?
In my opinion, both approaches have their place in data science. In the real world, what matters is results. So, theory-driven approaches might seem irrelevant if they don’t deliver results. In that sense, the “engineer” mindset might be more appropriate in many contexts. Given the existence of many powerful algorithms that can work as black boxes, and the large datasets that many companies are handling, it is often easier to just use something like a Random Forest or a Gradient Boosting Machine instead of creating a carefully designed model.
On the other hand, not all datasets are huge. There are many contexts, such as sports analytics, where the datasets might be small. Also, creating a more carefully crafted model can help uncover the causal relationships between variables. Sometimes, understanding these relationships can be more important than predictive power. It is surprising how often linear regression or the generalized linear model gives a clear answer about the relationships between variables.
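As a small illustration of that last point, here is a sketch of ordinary least squares with one predictor, written in plain Python. The data (weekly training hours and points per game for six players) are invented for the example, and `fit_line` is a hypothetical helper, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical small dataset: six players, invented for illustration.
hours = [2, 4, 6, 8, 10, 12]      # training hours per week
points = [5, 10, 12, 18, 20, 25]  # points per game

slope, intercept = fit_line(hours, points)
print(f"points = {intercept:.2f} + {slope:.2f} * hours")
```

With only six observations, the model is still easy to read: the slope is the expected change in points for one extra hour of training. Whether that association can be read causally depends on assumptions that the regression itself cannot check.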
So, both approaches have their uses in daily practice. I firmly believe that a complete data scientist should be proficient in more than one of the “sub-disciplines” that led to what we now call data science. Statistics has Bayesian and frequentist approaches. Machine learning grew as a subfield of artificial intelligence, but so did other related fields such as data mining and computational intelligence.
Focusing on a single approach not only restricts your potential, it also fails to do justice to the history of data science. If you are interested in learning more, make sure to check out my book, where I also discuss the different tribes of data scientists and how to hire the best people.
The post Should data science be driven by theory or by experimental evidence? appeared first on TDS.