Last year I posted an article called What Do Predictive Analytics Consultants Do? Part 1, describing the general types of activities we engage in. In the present article, I want to talk about the skills and tools one should have to perform Predictive Analytics. Although this is not strictly a “what we do” article, knowing the skills we possess and the tools we use will provide some insight into what we do, without dwelling on algorithms you may never have heard of.
What's Not In This Series?
I am always at a loss in describing the skills of analytics, for there are many. I just completed a new book about analytics (available for FREE—see notes) that takes a different approach than Predictive Analytics using R (also available for FREE), though I am using material from three of its chapters. The new book, Operations Research using Open-Source Tools, is an operations research approach to analytics, covering a different set of methods, skills, and tools. Combined, the two books run over 1,000 pages, so perhaps you can see my dilemma. Hence, this article touches on only the very basics.
The more technical aspects of Predictive Analytics are not in this series for two reasons. First, it is intended for a general audience, including Analytics program managers and managers who have no idea what Predictive Analytics is about. Second, the LinkedIn publishing platform is not designed for technical articles about Analytics: it does not support special formatting for showing code, equations, tables, and so on. WordPress is much better for this, so you will find those kinds of articles at bicorner.com. The most recent is Random Forest using Python.
What is Predictive Analytics?
In case you missed my previous article, here is a high-level description. Predictive Analytics--sometimes used synonymously with predictive modeling--is not synonymous with statistics; it often requires modifying functional forms and using ad hoc procedures, which makes it, to some degree, a part of data science. It does, however, encompass a variety of statistical techniques for modeling, incorporate machine learning, and utilize data mining to analyze current and historical facts and make predictions about the future. Beyond the statistical aspect lies a mathematical modeling and programming dimension, which includes, for example, linear optimization and simulation. Yet analytics goes even further by defining the business case and requirements, which are not covered here. I discussed those in How to Build a Model.
Statistical Modeling and Tools
This assumes that you already know the basics of parametric and some nonparametric statistics. If you are not familiar with these terms, then you are missing a prerequisite. However, this is a gap you can fill with online courses from Coursera. Though I have never taken one, I have many colleagues who swear by them.
By statistical modeling, I am referring to subject matter beyond what is covered in a statistics course for engineering or business. Here we are concerned with linear regression, logistic regression, analysis of variance (ANOVA), multivariate regression, and cluster analysis, as well as goodness-of-fit testing, hypothesis testing, experimental design, and my friends Kolmogorov and Smirnov. Mathematical statistics can be a plus, as it will take you into the underlying theory.
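To make the first of these techniques concrete, here is a minimal sketch of simple linear regression fit by ordinary least squares in plain Python. The function name and the data are invented for illustration; in practice you would reach for R or a statistics package, but the arithmetic underneath is just this:

```python
# Simple linear regression by ordinary least squares.
# Invented data; no libraries required.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical example: advertising spend vs. sales
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 4.2, 5.9, 8.1, 9.8]
slope, intercept = fit_line(spend, sales)
print(f"sales = {slope:.2f} * spend + {intercept:.2f}")  # sales = 1.93 * spend + 0.23
```

Everything beyond this two-variable case--multiple predictors, significance tests, diagnostics--is where the statistical tools below earn their keep.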
The tools one would (or could) use are myriad and are often the tools our company or customer has already deployed. SAS modeling products are well-established tools of the trade. These include SAS/STAT, SAS Enterprise Guide, SAS Enterprise Miner, and others. IBM made its mark on the market with the purchase of Clementine and its repackaging as IBM SPSS Modeler. There are other commercial products, like Tableau. I have to mention Excel here, for it is all many will have to work with. But you have to go beyond the basics and into its data tools, statistical analysis tools, and perhaps its linear programming Solver, plus be able to construct pivot tables, and so on.
If you want to learn SAS Enterprise Guide, SAS has made that very easy to do. Anyone can use SAS EG for learning, regardless of status (non-student, student, professor, professional, etc.), at http://www.sas.com/en_us/software/university-edition.html. But this is restricted to personal use only.
Today, there are a multitude of open-source tools that have become popular, including R and its IDE, RStudio; the S programming package; and the Python programming language (among the most-used languages in 2014). R, for example, is every bit as good as its nemesis SAS, but I have yet to get it to handle the enormous amounts of data that I process with SAS. Part of this is due to server capacity and allocation, so I really don't know how much data R can handle.
For the foregoing methods, data is necessary, and it will probably not be handed to you on a silver platter, ready for consumption. It may be “dirty”, in the wrong format, incomplete, or just not right. Since this is where you may spend an abundant amount of time, you need the skills and tools to process data. Even if this is a secondary task--it has not been for me--you will probably need to know Structured Query Language (SQL) and something about the structure of databases.
If you do not have clean, complete, and reliable data to model with, you are doomed. You may have to remove inconsistencies, impute missing values, and so on. Then you have to analyze the data, perform data reduction, and integrate the data so that it is ready for use. Modeling with “bad” data results in a “bad” model!
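Here is a minimal sketch of one such cleaning step, mean imputation of missing values, in Python. The function name and data are invented for illustration; real projects would weigh this simple approach against more careful imputation methods:

```python
# Mean imputation: fill missing values (None) with the mean of the
# observed values. Invented data for illustration.

def impute_mean(values):
    """Replace None entries with the mean of the non-missing values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [34, None, 41, 29, None, 36]
print(impute_mean(ages))  # [34, 35.0, 41, 29, 35.0, 36]
```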
Databases are plentiful and come in the form of Oracle Exadata, Teradata, Microsoft SQL Server Parallel Data Warehouse, IBM Netezza, and Vertica. The Greenplum Database builds on the foundations of the open-source database PostgreSQL. Or you may need to use a data platform like Hadoop. Also, Excel can store "small amounts" of data across multiple worksheets and has built-in data processing tools.
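As a small illustration of the SQL involved, here is a sketch using Python's built-in sqlite3 module as a stand-in for whatever database the client actually runs. The table and column names are invented; the GROUP BY aggregation is the bread-and-butter pattern of data preparation:

```python
# A typical data-prep query: aggregate raw order records by region.
# sqlite3 (in the Python standard library) stands in for the client's
# real database; table and columns are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("East", 120.0), ("West", 80.0), ("East", 200.0)],
)

rows = conn.execute(
    "SELECT region, SUM(amount) AS total "
    "FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('East', 320.0), ('West', 80.0)]
```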
Mathematical Modeling and Tools
Again, there are prerequisites, like differential and integral calculus and linear algebra. Multivariate calculus is a plus, particularly if you'll be building models involving differential equations and nonlinear optimization. The skills you need beyond the basics include mathematical programming--linear, integer, mixed, and nonlinear. Goal programming, game theory, Markov chains, and queuing theory, to name a few, may also be required. Mathematical studies in real and complex analysis and linear vector spaces, as well as abstract algebraic concepts like groups, fields, and rings, can reveal the foundational theory.
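As a taste of mathematical programming, here is a tiny linear program solved by brute force in Python. The product-mix numbers are invented, and a real model would use a solver (Excel's Solver, Quapro, and the like) rather than enumerating vertices, but with only two variables we can exploit the fact that an LP optimum always lies at a corner point of the feasible region:

```python
# Maximize 3x + 5y subject to linear constraints -- a textbook-style
# product-mix LP with invented numbers. With two variables we simply
# enumerate corner points, since an LP optimum lies at a vertex.
from itertools import combinations

# Each constraint is a*x + b*y <= c (nonnegativity included).
constraints = [
    (1, 0, 4),     # x <= 4
    (0, 2, 12),    # 2y <= 12
    (3, 2, 18),    # 3x + 2y <= 18
    (-1, 0, 0),    # x >= 0
    (0, -1, 0),    # y >= 0
]

def intersect(c1, c2):
    """Point where two constraint boundaries cross, or None if parallel."""
    a1, b1, r1 = c1
    a2, b2, r2 = c2
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None
    return ((r1 * b2 - r2 * b1) / det, (a1 * r2 - a2 * r1) / det)

def feasible(p):
    return all(a * p[0] + b * p[1] <= c + 1e-9 for a, b, c in constraints)

corners = [p for c1, c2 in combinations(constraints, 2)
           if (p := intersect(c1, c2)) is not None and feasible(p)]
best = max(corners, key=lambda p: 3 * p[0] + 5 * p[1])
print(best, 3 * best[0] + 5 * best[1])  # (2.0, 6.0) 36.0
```

Real problems have hundreds or thousands of variables, which is exactly why the solver packages discussed below exist.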
Simulation modeling, including Monte Carlo, discrete- and continuous-time, and discrete event simulation, can be applied in analytics--I have not seen this as common practice in business analytics, but it certainly has its place. These models may rely heavily upon queuing theory, Markov chains, inventory theory, and network theory.
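To illustrate the Monte Carlo idea, here is a minimal sketch in Python. The toy problem (estimating π by random sampling) and the function name are mine; a business simulation would sample costs, demand, or service times with exactly the same sample-and-average pattern:

```python
# Monte Carlo in miniature: estimate pi by sampling random points in
# the unit square and counting those inside the quarter circle.
import random

def estimate_pi(trials, seed=42):
    rng = random.Random(seed)  # seeded for reproducibility
    hits = sum(1 for _ in range(trials)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / trials

print(estimate_pi(100_000))  # close to 3.14159
```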
The corporate mainstay is the powerhouse combination of MATLAB and Simulink. MATLAB is short for MATrix LABoratory (that is why it is spelled with all caps!). Other noteworthy commercial products include Mathematica and Analytica. Octave is an open-source mathematical modeling tool that reads MATLAB code, and there are add-on GUI environments for it (like RStudio for R) floating around in hyperspace. I recently discovered the power of Scilab and the world of modules (packages) available for this open-source gem.
For simulation, Simulink works "on top of" MATLAB functions/code for a variety of simulation models. I wrote the book "Missile Flight Simulation" using MATLAB and Simulink. ExtendSim is an excellent tool for discrete event simulation and the subject of my book "Discrete Event Simulation using ExtendSim". In Scilab, I have used Xcos for discrete event simulation and Quapro for linear programming. Both are featured in my next book.
There is one general analytics tool that I do not know much about yet: BOARD, which, in its newest release, boasts a predictive analytics capability. I will be speaking on predictive analytics at the BOARD User Conference, April 13th-14th in San Diego. Again, I would be remiss not to mention Excel, particularly its Solver add-in for mathematical programming. Another third-party add-in to consider is @RISK.
If you aspire to become an analytics consultant or scientist, you have a wealth of open-source tools, free training, and online tutorials at your fingertips. If you are already working in analytics, you can readily specialize in predictive analytics. If you are already working in predictive analytics, you have what you need to become an expert. All of these tools will work either with your PC's native processing power or through a virtual machine or remote server (for example, when using Hadoop).
A few other articles on Predictive Analytics
- A 12 Step Approach to Analytical Modeling
- Predictive Modeling in Analytics
- What is Humalytica?
- A Dangerous Game We Play
- What is Operations Research?
- How can I be a Data Scientist?
- Why you might not want to be a Data Scientist
- Data Scientists are Dead, Long Live Data Science!
- Why you should care about Statistics
- Call Center Analytics: What's Missing?