When you’re getting started with a machine learning (ML) project, one critical principle to keep in mind is that data is everything. It is often said that if ML is the rocket engine, then the fuel is the (high-quality) data fed to ML algorithms. However, deriving truth and insight from a pile of data can be a complicated and error-prone job. To have a solid start for your ML project, it always helps to analyze the data up front, a practice that describes the data by means of statistical and visualization techniques to bring important aspects of that data into focus for further analysis. During that process, it’s important that you get a deep understanding of:
The properties of the data, such as schema and statistical properties;
The quality of the data, like missing values and inconsistent data types;
The predictive power of the data, such as correlation of features against target.
This process lays the groundwork for the subsequent feature selection and engineering steps, and it provides a solid foundation for building good ML models.
There are many different approaches to conducting exploratory data analysis (EDA) out there, so it can be hard to know what analysis to perform and how to do it properly. To consolidate the recommendations on conducting proper EDA, data cleaning, and feature selection in ML projects, we’ll summarize and provide concise guidance from both intuitive (visualization) and rigorous (statistical) perspectives. Based on the results of the analysis, you can then determine corresponding feature selection and engineering recommendations. You can also get more comprehensive instructions in this white paper.
You can also check out the Auto Data Exploration and Feature Recommendation Tool we developed to help you automate the recommended analysis, regardless of the scale of the data, then generate a well-organized report to present the findings.
EDA, feature selection, and feature engineering are often tied together and are important steps in the ML journey. With the complexity of data and business problems that exist today (such as credit scoring in finance and demand forecasting in retail), how the results of proper EDA can influence your subsequent decisions is a big question. In this post, we will walk you through some of the decisions you’ll need to make about your data for a particular project, and choosing which type of analysis to use, along with visualizations, tools, and feature processing.
Let’s start exploring the types of analysis you can choose from.
With this type of analysis, data exploration can be conducted from three different angles: descriptive, correlation, and contextual. Each type introduces complementary information on the properties and predictive power of the data, helping you make an informed decision based on the outcome of the analysis.
Descriptive analysis, or univariate analysis, provides an understanding of the characteristics of each attribute of the dataset. It also offers important evidence for feature preprocessing and selection in a later stage. The following table lists the suggested analysis for attributes that are common, numerical, categorical and textual.