Data Visualization for Presentations
Introduction
Data visualization provides a tool for presenting complex information in an easily digestible format. Conveying information efficiently is a vital skill for a data analyst, because it lets the analyst get to the point quickly. When presenting to others, data visualizations allow the analyst to cover the information of interest without getting lost in the large number of fields found in a tabular report.
The information below is an excerpt covering the topic of overfitting from a larger machine learning project. Data visualization was used to explore the relationships between model complexity, fit to the training data, and predictive error. This presentation shows how simple data visualizations can be used to convey a complex idea.
At the end of this essay are links to the data and Python Jupyter Notebooks used to generate the visualizations.
Overfitting Presentation Overview
The basic purpose of developing a machine learning or regression model is to accurately predict the outcome variable of a new record based on a never-before-seen set of inputs (Shmueli, Bruce, & Patel, 2016). To generate this model, the analyst must train it on an existing training dataset. During training there is a risk of overfitting the training data, which lowers the accuracy of the final model when it is given real-world data. Using sets of made-up, hypothetical data, this presentation explores the concept of overfitting, along with predictive accuracy metrics that can be applied to regression models and other supervised machine learning tasks.
To complete this project, Python and Jupyter Notebooks were used, and the notebook is built to run in Google Colaboratory. The Python libraries pandas and NumPy were used to manipulate the hypothetical data. Matplotlib and Seaborn were used for the visualizations. Finally, functions from scikit-learn were used for the metrics calculations.
Overfitting Example: Education vs Salary
Figure 1 - Sample Education and Salary data.
Overfitting is what occurs when a model fits the random noise and variation that exists within a sample training dataset (Shmueli, Bruce, & Patel, 2016). Instead of producing accurate predictions for new data, the trained model produces output based on the randomness and outliers in the training data. The dataset for this discussion is a hypothetical, hand-generated set of records containing an individual’s education (measured in years) alongside their yearly salary (measured in US dollars). A scatterplot of the data can be seen in Figure 1.
Examining the scatterplot in Figure 1 immediately presents potential insight into the underlying relationship between education and salary. It is obvious that there is a general positive relationship between education and salary (in that an increase in education seems to correspond with an increase in salary). However, the relationship does not appear completely linear, which may lead an analyst to make some assumptions about the underlying relationships in the dataset.
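As a rough illustration of the kind of dataset described above, the following sketch generates a small set of hypothetical education and salary records and checks the direction of the relationship numerically. The function names, the linear trend, and the noise level are all assumptions made for this example; they are not taken from the project's CSV files or notebook.

```python
import numpy as np


def make_education_salary(n, seed, noise_sd=6000.0):
    """Return hypothetical (education_years, salary_usd) arrays.

    Assumed for illustration only: salary rises linearly with education,
    plus normally distributed noise. This is not the essay's actual data.
    """
    rng = np.random.default_rng(seed)
    education = rng.uniform(8, 20, size=n)          # years of education
    noise = rng.normal(0.0, noise_sd, size=n)       # random variation
    salary = 20000.0 + 3500.0 * education + noise   # yearly salary in USD
    return education, salary


def scatter_education_salary(education, salary):
    """Draw a scatterplot in the spirit of Figure 1 and return the figure."""
    # Imported here so the data code above runs even without Matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(education, salary)
    ax.set_xlabel("Education (years)")
    ax.set_ylabel("Yearly salary (USD)")
    ax.set_title("Sample education and salary data")
    return fig


education, salary = make_education_salary(n=12, seed=0)

# Pearson correlation as a quick numeric check of the visual impression.
correlation = np.corrcoef(education, salary)[0, 1]
print(f"correlation between education and salary: {correlation:.2f}")
```

Calling scatter_education_salary(education, salary) in a notebook would display the plot. A positive correlation supports the visual impression of an upward trend, but it says nothing about whether that trend is linear.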
Education vs Salary: Potential Models
Figure 2 - Potential Models for the sample dataset.
Figure 2 shows two potential models that an analyst could come up with given the sample dataset. In terms of complexity, the 4th degree polynomial is more complex than the 2nd degree polynomial, but it also fits the training data better. While these complex models do a great job of describing the sample dataset, it is important to take a step back and consider how that sample was obtained. When the sample dataset was captured, there is a good chance that some random “noise” was captured as well (Shmueli, Bruce, & Patel, 2016). Any model trained on this sample must be robust enough to predict new input, rather than merely reproduce the relationships within the training dataset. Figure 3 shows the “shape” of the population data, highlighting how it differs from the random local “shape” of the sample data. When the sample training data and the overall population data are considered together, it becomes apparent that the initially proposed models may not be the best at accurately predicting the yearly salary of new records.
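The comparison between a 2nd and a 4th degree polynomial can be sketched as follows. This is a minimal example under stated assumptions: the training sample is generated on the spot (a linear trend plus noise, not the project's data), the helper names are hypothetical, and NumPy's Polynomial.fit stands in for whatever fitting routine the original notebook used.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical training sample: assumed linear trend plus random noise.
rng = np.random.default_rng(0)
education = np.sort(rng.uniform(8, 20, size=12))
salary = 20000.0 + 3500.0 * education + rng.normal(0.0, 6000.0, size=12)


def fit_polynomial(x, y, degree):
    """Least-squares polynomial fit; Polynomial.fit rescales x for stability."""
    return Polynomial.fit(x, y, deg=degree)


def sum_squared_residuals(model, x, y):
    """Sum of squared differences between observed and fitted values."""
    return float(np.sum((y - model(x)) ** 2))


models = {degree: fit_polynomial(education, salary, degree) for degree in (2, 4)}
training_sse = {
    degree: sum_squared_residuals(model, education, salary)
    for degree, model in models.items()
}

for degree, sse in training_sse.items():
    print(f"degree {degree}: training sum of squared residuals = {sse:,.0f}")
```

Because every 2nd degree polynomial is also a 4th degree polynomial with two coefficients set to zero, the least-squares 4th degree fit can never have a larger training residual than the 2nd degree fit. A better training fit is therefore guaranteed by the added flexibility alone and is not evidence of a better model.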
One of the powers of hypotheticals is that the entire “picture” is already known. At the beginning of an actual data mining project, the analyst will not have the benefit of knowing how the overall dataset will look. Because of this, the analyst must take steps to ensure that the developed model fits the general trends within the sample dataset without fitting the sample’s random quirks. In a real-world situation, the insight from previous knowledge, subject matter expert input, data exploration, and other sources should all be thoroughly examined together to guide the development of a model.
Figure 3 - (Left) Scatterplot of both the sample and population datasets. (Right) The overall data with the potential models from Figure 2.
Education vs Salary: Overfitting
Figure 4 - (Left) A simpler, linear model. (Right) All models examined.
Figure 5 - Model complexity vs fit to the training dataset.
Armed with perfect knowledge of the entire dataset that may be encountered during this hypothetical project, it seems prudent to reconsider the proposed models of education versus salary. While the previous models possessed increasing amounts of complexity, it makes sense to examine a simpler model. Figure 4 shows a linear model derived from linear least-squares regression of the sample dataset. These three models provide an excellent example of the relationship between model complexity and fitting the training data.
As shown in Figure 5, calculating the R squared of each model on the training data and plotting it against model complexity reveals a relationship. As the complexity of a model increases, so does the fit between the model and the data it was trained on. However, the “goodness-of-fit” on the training data is no indication that the model will be useful at predicting new records. This is a direct example of overfitting. While the more complex models do a good job of describing the relationships within the training dataset, they perform poorly when attempting to predict new records.
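The relationship between complexity and training fit can be reproduced in a few lines. The sketch below assumes a hypothetical training sample (a linear trend plus noise) and uses a hand-written r_squared helper; scikit-learn's r2_score computes the same quantity and could be substituted.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical training sample: assumed linear trend plus random noise.
rng = np.random.default_rng(0)
education = rng.uniform(8, 20, size=12)
salary = 20000.0 + 3500.0 * education + rng.normal(0.0, 6000.0, size=12)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


degrees = (1, 2, 4)  # linear, 2nd degree, and 4th degree polynomial models
training_r2 = {}
for degree in degrees:
    model = Polynomial.fit(education, salary, deg=degree)
    training_r2[degree] = r_squared(salary, model(education))

for degree in degrees:
    print(f"degree {degree}: training R^2 = {training_r2[degree]:.3f}")
```

Since each lower degree model is a special case of the higher degree ones, the training R squared cannot decrease as the degree rises. Plotting training_r2 against degrees gives a chart in the style of Figure 5.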
Education vs Salary: Results of Overfitting
Figure 6 - Model complexity vs model error using validation data. (Left) Mean Absolute Error. (Right) Root Mean Squared Error.
As noted above, as model complexity increases so does the model's fit to the training data, creating a model that is “overfitting” the sample dataset it was trained on. A second relationship exists in this example: the relationship between the model predictions and the actual values of new records, such as a validation dataset like the population dataset shown in Figure 3. Using two common predictive error metrics (the Mean Absolute Error and the Root Mean Squared Error), the predictive error of each of the three models can be calculated.
The result is that while training data fit increases as model complexity increases, predictive accuracy on the validation data decreases (and the predictive error increases). This trade-off is a direct result of overfitting the sample training dataset by using overly complex models. Examining Figure 6 reveals that the simple linear model has the lowest measured error of the three models, even though it fit the training data the worst. This is because the linear regression model fit the general trend of the sample training dataset without capturing the randomness of the sample data. The other two models have increasing error as they start to describe the training data more than they predict the outcomes of new records. This is a direct example of the results of overfitting and should serve as a reminder to be aware of this pitfall when choosing a model for a data mining project.
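The validation comparison described above can be sketched as follows. Everything here is an assumption for illustration: the data is generated rather than loaded from the project's CSV files, the larger generated set merely stands in for the population dataset, and the two metric helpers are hand-written equivalents of the mean absolute error and root mean squared error that scikit-learn's metrics module provides.

```python
import numpy as np
from numpy.polynomial import Polynomial


def make_data(n, rng):
    """Hypothetical records: assumed linear trend plus random noise."""
    education = rng.uniform(8, 20, size=n)
    salary = 20000.0 + 3500.0 * education + rng.normal(0.0, 6000.0, size=n)
    return education, salary


rng = np.random.default_rng(0)
train_x, train_y = make_data(12, rng)     # small sample used for fitting
valid_x, valid_y = make_data(500, rng)    # larger stand-in for the population


def mean_absolute_error(y_true, y_pred):
    """Average size of the prediction errors, in salary units."""
    return float(np.mean(np.abs(y_true - y_pred)))


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared error; penalizes large misses more."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


validation_error = {}
for degree in (1, 2, 4):
    model = Polynomial.fit(train_x, train_y, deg=degree)  # fit on training data only
    predictions = model(valid_x)                          # predict unseen records
    validation_error[degree] = (
        mean_absolute_error(valid_y, predictions),
        root_mean_squared_error(valid_y, predictions),
    )

for degree, (mae, rmse) in validation_error.items():
    print(f"degree {degree}: MAE = {mae:,.0f}, RMSE = {rmse:,.0f}")
```

With a small, noisy training sample, the higher degree models will often score worse on the validation data, although the exact ordering depends on the random draw. The Root Mean Squared Error is never smaller than the Mean Absolute Error for the same predictions, so a large gap between the two points to a few very large misses.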
Conclusion
As can be seen, data visualization can convey a significant amount of information in a relatively small package, which aids greatly in the discussion and understanding of complex topics. In addition to showing the information, it is vital that the information be summarized in a complete and understandable fashion. An effective analyst must be able to provide both useful visualizations and understandable explanations. Having a full understanding of the story that the data tells is vital for organizational decision makers, which heightens the need for strong data visualization and presentation skills from business intelligence and data analysts.
Below are links to the Python Jupyter Notebook and datasets used for this project:
Python Code - Project code in Jupyter Notebook
Sample Dataset - Education vs Salary Sample Dataset CSV File
Population Dataset - Education vs Salary Population Dataset CSV File
Reference
Shmueli, G., Bruce, P., & Patel, N. (2016). Data Mining for Business Analytics: Concepts, Techniques, and Applications in XLMiner (3rd ed.). Hoboken, NJ: Wiley.