10/30/2014

What is a data scientist?

I often see references and discussions around "data science" and "data scientists," and every time I do, I start to ponder the topic and think about writing a post around it. I was reminded again this week because my daughter was telling a woman that I know that I was a "data scientist" and she was trying to explain to that person what I do. She's eight years old, so it was really precious listening to her describe this to someone (she also has a small microscope that she loves to use and tells me that she wants to be a scientist when she grows up). There have been a number of blog posts that I've read that specifically draw me to this topic; after all, my website is called data + science. One post in particular is written by Stephen Few. Steve is an expert in data visualization, specifically focusing on best practices and visual perception (which I love and is at the core of what I teach at the University of Cincinnati). I've had the honor of presenting with Steve as well as joining him on a panel discussion on the topic of data visualization. I followed his presentation by opening with a pie chart, but I'll save that story for another post. My wife and I also spent a lovely evening with him, enjoying a wonderful dinner out at one of our favorite restaurants while he was visiting Cincinnati (come back to Cincinnati, Steve, and we will do it again!). I follow Steve's blog, and I find myself agreeing with almost everything he writes on a regular basis. However, as it relates to the post above, I disagree with some of his points.

To summarize the post, Kelly Martin, a Tableau Zen Master and a wonderfully talented visualization designer, wrote to Steve asking him to comment on the "data scientist hype."

She writes:

Kelly is not alone in her thinking. There are many others who talk about the "hype" around data science. Here are a few funny definitions that I've run across for a "data scientist."

Here is a definition returned from Google when searching the term "data scientist definition":

Berkeley has started a new Master of Information and Data Science (MIDS) degree and they have a page here describing "What is Data Science." They appear to have lots of discussion about the importance, but don't really describe what it is other than this one general sentence:

With Google and Berkeley providing definitions like this, it's no wonder people are confused and misusing the term.

As I read Kelly's letter to Steve, my head did a half tilt, and I was intrigued by what his response was going to be. He responds, agreeing with Kelly, as he examines the term in depth. He provides some good history about the origins of the term, referencing John Tukey and William Cleveland.

I agree.

I agree with the fact that many people use this term on their resume as an inflated title, which I will discuss more in depth. However, here's where I disagree, at least in part. Data Science is NOT "what data analysts by any name have been doing for years". In fact, many of the things data scientists do today didn't exist years ago (well, these things existed at places like Bell Labs where they were being developed and used, but they didn't exist in general use like they do today). This is a fairly young field with new advancements and techniques constantly being added. These are NOT old technologies and the skills OFTEN rise to the level of science.

Steve goes on to discuss the

Bachelors or Master’s degree with specialization in Math/Statistics or Quantitative discipline is a must.

PhD or MS degree specializing in a relevant field such as Statistics, Machine Learning, Data Mining.

MS/PhD in CS, Applied Mathematics, or Statistics with a strong math background.

MS or preferably PhD in computer science or similar field with 2-5 years of related experience

BSc, MSc or PhD in Computer Science, Statistic, Math, Physics or comparable field.

Ph.D./M.S. in Mathematics, Statistics, Computer Science or related field.

BS/MS in a quantitative discipline: Statistics, Applied Mathematics, Operations Research, Computer Science, Engineering or Economics.

Advanced degree in Computer Science, Electrical Engineering, Statistics, Mathematics or related field

Academic or industry experience in natural language processing, machine learning, information retrieval

Graduate degree in statistics, econometrics, applied mathematics, physics, computer science, engineering, or quantitative field required.

Expert in SQL

Fluency in SQL/SAS Ability to query large data sets using database management software (SQL) and database mining software (SAS).

Knowledge and experience of relational databases and SQL.

Experience analyzing very large datasets with SQL (Teradata, Oracle, or MySQL) and R, or other statistical package.

Awesome SQL skills,

Fluency in Java or another object oriented language.

Experience with at least one general purpose programming language such as Java, Python, C.

Expert Knowledge of analytical techniques. Specifically working knowledge of concepts such as, clustering, segmentation and time-series (i.e. regression, forecasting).

Strong Python development skills.

Strong programming skill in one of the following languages, Java, Scala, C/C++.

Expert level knowledge of one of the scripting languages such as Python or Perl.

Knowledge of Perl language and/or other programming languages.

Strong object oriented programming skills (Java strongly preferred)

Recent programming experience in a production environment

Comfort with programming in Python or Java. (Python is a plus)

Strong skills in programming languages such as R / SAS, Java / Python, Pig, Mahout or SQL required.

Knowledge of writing APIs for data extraction.

Experience applying machine-learning algorithms to real-world data.

Experience using data mining techniques, and solving analytical problems.

Deep knowledge of machine learning, information retrieval, data mining, statistics, NLP or related field.

Proven experience working with statistical languages such as R.

Knowledge of stochastic modeling.

Experience with large scale optimization techniques (stochastic, convex, dual).

Familiarity with Monte Carlo methods and bootstrapping/bagging techniques and time-series models (ARIMA).

Deep understanding of statistical modeling/machine learning/ data mining concepts, and a track record solving problems with these methods.

Familiar with one or more machine learning or statistics tools such as R, Matlab, and etc. or various libraries for other programming languages.

Network Theory, Machine Learning, and Natural Language Processing.

Solid foundation and keen interest in statistical modeling and machine learning.

Direct experience applying Machine Learning and Data Mining algorithms to solve complex problems of massive scale.

Extensive experience performing statistical analysis in R, Julia, Python, or another fast analysis language.

Natural Language Processing, Machine Learning, Statistical Analysis and Predictive Modeling.

Machine Learning and Data Mining.

Excellent problem solving abilities.

Uses a data-driven approach to solve problems.

Extreme attention to detail.

Innovative in problem solving.

A talent for solving problems.

Highly skilled in problem solving.

Now let's cross reference these with the skills for a financial analyst vs. a data analyst vs. a BI professional vs. a statistician vs. my data science team at my company (Unifund and Recovery Decision Science).

A financial analyst at our company would likely be a finance major or have an MBA with a finance concentration. He/she would be familiar with our accounting software, be very familiar with Excel and would be able to handle things like department budgets, analysing actual results vs. projections, performance results for the company and departments, forecasting, cash management reporting and portfolio performance and amortization. He/she would work regularly with the accounting, finance and operations groups. The financial analyst, however, would not meet most of the criteria listed above.

A data analyst at our company would likely be a business or finance major or an MBA, possibly with an IT or IS background, and would likely have taken courses in probability and statistics. He/she would be intimate with Excel, a serious power user able to do advanced formulas and calculations and program in VBA to create custom functions and macros. He/she would be expert in using our BI tools, Tableau and Cognos, and would have a working knowledge of relational databases, have good SQL skills, and be able to gather the data himself/herself and complete analysis of that data on his/her own. This is where the skills stop. He/she would not work regularly in any statistical package or programming language (e.g., R and Python). Having had some probability and statistics training, he/she would be able to run a regression or multiple regression and be able to interpret the results, but he/she would not be familiar with most of the analytical skills outlined above. Overall, the data analyst would have the database skills needed and the basic statistics needed to perform the job but would not meet the other requirements of a data scientist. He/she might, however, possess some of the skills of a statistician (below).

We don't employ any statisticians at our company, but I do personally know a few. If I were hiring one at our company, this person would have an advanced degree in statistics or applied math (master's or PhD). He or she would be familiar with advanced probability and statistics, and there would be a laundry list of things this person would be familiar with, for example: populations and samples, variation, probability distributions, sampling, hypothesis testing, ANOVA, goodness-of-fit, regression, forecasting time-series data, nonparametric sampling, statistical process control, design of experiments, decision trees and Markov chains. This person is probably so familiar with Bayesian statistics that he/she could rattle off Bayes' theorem on a whiteboard at a moment's notice and would love every minute of it. He/she would also be able to explain to you the difference between the Spearman rank correlation and the Pearson correlation coefficient or tell you Kolmogorov–Smirnov is not an off-brand of vodka but rather a less-than-ideal statistical test that might be used by your bank and probably shouldn't be.
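To make the Spearman versus Pearson distinction concrete, here is a minimal Python sketch. The data and the helper names (pearson, ranks, spearman) are made up for illustration and use only the standard library; a real analysis would more likely lean on a statistics package. The idea is that Pearson measures how linear a relationship is, while Spearman is simply Pearson computed on the ranks, so it only cares whether the relationship is monotonic.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: strength of the *linear* relationship."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def ranks(values):
    """Rank the values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: y always rises with x, but along a curve, not a line.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

print("Pearson: ", round(pearson(x, y), 3))
print("Spearman:", round(spearman(x, y), 3))
```

On data like this, Spearman comes out at essentially 1 because the ordering is perfectly preserved, while Pearson lands noticeably below 1 because the points do not fall on a straight line.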

Before I move on to my data science team, it's interesting to note that Dr. Jerome Friedman, Professor of Statistics at Stanford University, was writing about a very similar topic back in 1997, "Data Mining and Statistics: What's the Connection?"

His observations sound eerily similar to the discussion today around data science.

Wait, are we talking about data science? Apparently Dr. Friedman thought there was industry hype around the term "data mining" in the same manner as "data science" today, but he also points out that it was having a major impact at the time. Fast forward nearly twenty years and data mining has become a very established field.

And even Dr. Friedman had a funny description for a computer scientist in the "data mining" field:

Dr. Friedman's comments are interesting to me because, having somewhat of an IT background, ten years ago I would have defined data mining more around database management and data modeling. It has been only in the last ten years that I began studying and learning the other areas of data mining and data science. I had some programming experience, but I learned SQL so that I could query my own data. We implemented Cognos as a business intelligence tool and SPSS as our analytics platform, so I began learning those. Then I started learning some Python and R and explored machine learning and neural networks. We later converted to R for our analytics platform and Tableau for data visualization and BI, and I became more and more familiar with them. One of the benefits of being an adjunct professor at the University of Cincinnati is that I can take courses for free, so in 2013 I took the graduate courses in Data Mining. My professor for Data Mining was Dr. Yu, who interned at Bell Labs when William Cleveland was there. On the first night of class, Dr. Yu cautioned that before taking this class, students should have already had college level Calculus I, Calculus II, Multivariate Calculus, Linear Algebra or Matrix Methods (these are the prerequisites to our program at UC) and then gone on to take the Statistical Methods class before starting Data Mining I.

The core curriculum for our MSBA program includes: Statistical Methods and Modeling, Optimization Methods and Modeling, Simulation Modeling, Probability Modeling, Statistical Computing and Data Management. Then there are electives in Data Visualization, Data Mining I and II, Simulation Analysis, Financial Engineering, Forecasting and Time Series Methods, Multivariate Statistical Methods, Case Studies, Game Theory and various Business Intelligence courses.

This brings me to the data scientist. At our company, there are four guys that I would consider data scientists. They all have a Master of Science in Quantitative Analysis or Business Analytics from the University of Cincinnati, and one is currently working on his PhD. All four of them were at the top of their class with perfect or near-perfect grades. One was awarded student of the year. They have taken almost every class that the college has to offer, including all of the courses listed above. They are also very advanced Tableau users and have studied data visualization as well. They all use R and SQL on a daily basis, and they can code in Python and HTML. As a point of fact, this entire blog post and all of my posts are written in HTML and loaded up to my hosting server instead of using a traditional blog platform like Blogger, Tumblr or WordPress. This skill is also important, because this team often has to scrape data from the internet or connect through an API to bring additional data into their analysis.

We have a SQL test at our company that we administer to potential candidates in the I.T. department, which is eight questions. I would expect our data science team to be able to get through many of those, at least to question four or five. We also have a rigorous interview process for the data science team. We ask specific technical questions, but we also have a "problem solving" portion of the interview. Notice one of the skills listed over and over again by companies is "problem solving". We agree. Using the whiteboard we rattle off as many problems as we can in an hour and see how the candidate does on the spot. This includes probability problems, estimation problems, logic problems, etc. This process has worked well for us in finding the right people. As an example, one candidate simply responded to one of the first problems with, "that's not me."

I would expect everyone on the data science team to have a good knowledge of everything listed under the statistician, but also be able to understand GLM, ROC, CV, CART, GAM, Nonparametric Smoothing, Supervised and Unsupervised Learning, Clustering, Association Rules, ANN, LDA and NLP. I recognize that I am term-dropping and I am using the acronyms on purpose because I would expect that my data science team would instantly recognize those terms, know what they are used for and how to use them. They should understand the ROC Curve and AUC, what AIC stands for and know if a higher or lower AIC is better or worse. I might ask a question about whether variables should or should not be standardized in an ANN or explain how to determine the optimal number of clusters in a K-means clustering. I might ask, what's the difference between AIC and BIC? As with my examples above in the statistician section, this is just a small sampling of terms and concepts to give you an idea of what we are looking for in a data scientist. So to answer Steve's question in the title of his blog post, are you a data scientist? I would start by asking you, how many of these things in this paragraph did you understand or could you answer in an interview (without using Google to help you)? If you read this list and it was easy to understand and basic for you, then you might be a good candidate, assuming you have the other skills and experience I've discussed. I acknowledge that there are surely statisticians who understand these concepts of data mining as well as traditional business statistics, but this is where I see the primary difference in skill sets. For example, the statisticians whom I know don't typically work on projects like sentiment analysis using a neural network with a support vector machine that they built themselves in Python.
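For readers who want to see the AIC and BIC questions worked through, here is a minimal Python sketch. It assumes Gaussian errors, uses made-up data, and the helper names (gaussian_log_likelihood, aic, bic) are hypothetical ones chosen for this example. It compares a constant-only model with a straight-line fit using the standard definitions AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k counts the estimated parameters (including the error variance) and n is the number of observations.

```python
from math import log, pi

def gaussian_log_likelihood(rss, n):
    """Maximized log-likelihood of a least-squares fit with Gaussian errors."""
    sigma2 = rss / n  # maximum-likelihood estimate of the error variance
    return -0.5 * n * (log(2 * pi) + log(sigma2) + 1)

def aic(log_lik, k):
    """Akaike information criterion: lower is better."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: lower is better, penalty grows with n."""
    return k * log(n) - 2 * log_lik

# Hypothetical data that is roughly linear in x, with a little noise.
x = list(range(10))
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Model A: y = c (one mean parameter plus the variance, so k = 2).
rss_const = sum((b - mean_y) ** 2 for b in y)

# Model B: y = intercept + slope * x (two coefficients plus variance, k = 3).
slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
         / sum((a - mean_x) ** 2 for a in x))
intercept = mean_y - slope * mean_x
rss_line = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

ll_const = gaussian_log_likelihood(rss_const, n)
ll_line = gaussian_log_likelihood(rss_line, n)

aic_const, aic_line = aic(ll_const, 2), aic(ll_line, 3)
bic_const, bic_line = bic(ll_const, 2, n), bic(ll_line, 3, n)

print("constant model: AIC", round(aic_const, 2), "BIC", round(bic_const, 2))
print("linear model:   AIC", round(aic_line, 2), "BIC", round(bic_line, 2))
```

Both criteria prefer the model with the lower value, which here is the straight line. The difference between them is the penalty per parameter: AIC always charges 2, while BIC charges ln(n), so once n is 8 or more BIC punishes extra parameters harder and tends to favor simpler models.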

While there may not be a single definition of a Data Scientist, or even Data Mining after all these years, I completely disagree with Kelly and Stephen Few on this point. This is not "hype" or a crazy, in-fashion term like "big data." Yes, there are certainly people that inflate these titles or use them broadly or incorrectly, but that doesn't mean that a true data scientist is the same as a business analyst or a "data sensemaker." This resume inflation happens in every field. It's no different than another candidate that we had for an I.T. position who put on his resume that he was great at SQL. Unfortunately for him, we have that SQL test and he bombed it, not making it past the first few questions that we consider very basic. The problem in this case is not the ambiguity of the term; it's the fact that people will often inflate their skills or have a different yardstick to measure themselves against. If I ask someone in an interview to rate himself/herself on a scale of 1 to 10, is it reasonable to expect that a candidate will honestly respond with a 2? I think not.

As another example:

Mr. X is someone I know pretty well. He lists on his LinkedIn profile that he is a "Data Scientist." However, he does not possess all of the skills I've outlined above. He does have some database/SQL skills and he does have a solid foundation in mathematics and statistics. However, he would fail miserably with the skills outlined above in the data scientist paragraph. He doesn't use R (or any other statistical package or programming language), he is not familiar with machine learning, and he doesn't build these advanced models, yet he uses the title "Data Scientist" prominently on his profile. I will also point out that his "skills" that have been endorsed do not include many that would give a good indication that he's a "data scientist". Only one skill, Analytics, gives any positive indication. His other skills listed are things like Business Strategy, Management and Product Management.

I searched LinkedIn for "Data Scientist" in my network, and I located Pinar Donmez. She is the Chief Data Scientist at Kabbage. I do not know her personally (she is a complete stranger to me) but she is connected to someone I know. Ignoring that connection, I can make an assessment simply based on her summary, job descriptions, skills listed and skills endorsed by other people. In fact, her skills line up perfectly to the outline above, including database skills, programming skills in Python and R, data mining techniques and machine learning. I am fairly confident that she is a very skilled data scientist in all of the areas I've outlined above without even meeting her or talking with her. The fact that she is connected to one of my connections, Meta Brown, who does great work in the field of Data Mining, adds further confirmation to my assessment.

Labeled "The Sexiest Job of the 21st Century" by Tom Davenport and D.J. Patil in a Harvard Business Review article and with a laundry list of skills and education that I've outlined above, it's no wonder there is a huge shortage of data scientists in the field. Having difficulty filling these roles, some companies have turned to a teamed approach to data science, much in the manner in which Steve describes.

That's a great vision. If such a collaborative approach were achieved with a team it would likely produce excellent results. We could call it the "data sensemaking team."

If nothing else, after reading this ridiculously long post, maybe you have a new appreciation for the complexities of data science and what the term has developed into. I don't see an end in sight for the data scientist (or the collective role). Software will not replace this function. The software tools may make this function easier, but a person needs to have a deep understanding of these mathematical models and not just point and click in software packages to derive answers. Also, there has been limited discussion in this post about the person's knowledge of the business, which is very important. Running models, analysis or predictions in a vacuum is not going to produce good results. The data scientist needs to have a context in which to understand, analyze and interpret the data. This is not a point to skip over; moreover, this is an important point for every role discussed above, including the financial analyst, data analyst, statistician and data scientist.

I hope you find this discussion and counterpoint interesting. If you have any questions, feel free to email me at Jeff@DataPlusScience.com.

Jeffrey A. Shaffer

Data Scientist :)

Follow on Twitter @HighVizAbility