Editor’s note: Today’s post comes to us from Bertrand Cariou at Trifacta, and presents some steps you might take in Cloud Dataprep to clean your data for later use in analytics or in training a machine learning model.
Data quality is a critical component of any analytics and machine learning initiative, and unless you’re working with pristine, highly-controlled data, you’ll likely face data quality issues. To illustrate the process of turning unknown, inconsistent data into trustworthy assets, we will leverage the example of a forecast analyst in the retail (consumer packaged goods) industry. Forecast analysts must be extremely accurate in planning the right quantities to order. Supplying too much product results in wasted resources, whereas supplying too little means that they risk losing profit. On top of that, an empty shelf also risks consumers choosing a competitor’s product, which can have a harmful, long-term impact on the brand.
To strike the right balance between appropriate product stocking levels and razor-thin margins, forecast analysts must continually refine their analysis and predictions, leveraging their own internal data as well as third-party data, over which they have no control.
Every business partner, including suppliers, distributors, warehouses and other retail stores, may provide data (e.g. inventory, forecast, promotions, or past transactions) in various shapes and levels of quality. One company may use pallets instead of boxes as a unit of storage or pounds instead of kilograms, may name and categorize products differently, may use a different date format, or will most likely have product SKUs that combine internal IDs with other suppliers’ IDs. Furthermore, some data may be missing or may have been incorrectly entered.
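To make two of these mismatches concrete, here is a minimal sketch in plain Python (not Cloud Dataprep's own transformation language) that brings weights to kilograms and dates to a single representation. The function names and the list of date layouts are assumptions made for illustration; a real feed would need its layouts confirmed with each partner.

```python
from datetime import date, datetime

LB_TO_KG = 0.45359237  # the pound is defined as exactly this many kilograms

# Hypothetical date layouts that different partners might send.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def to_kilograms(value: float, unit: str) -> float:
    """Convert a weight to kilograms; only 'kg' and 'lb' are handled here."""
    unit = unit.strip().lower()
    if unit == "kg":
        return value
    if unit == "lb":
        return value * LB_TO_KG
    raise ValueError(f"unsupported unit: {unit!r}")


def parse_date(text: str) -> date:
    """Try each known layout in order and return the first one that parses."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")
```

Note that a value such as 03/04/2019 is ambiguous on its own (March 4 or April 3), so in practice the layout usually has to be fixed per data source rather than guessed per value.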
Each of these data issues represents an important risk to reliable forecasting. Forecast analysts must absolutely clean, standardize, and gain trust in the data before they can report and model on it accurately. This post reviews key techniques for cleaning data with Cloud Dataprep and covers new features that may help improve your data quality with minimal effort.
Cleaning data with Cloud Dataprep corresponds to a three-step iterative process:
1. Assessing your data quality
2. Resolving or remediating any issues uncovered
3. Validating cleaned data, at scale
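The loop formed by these three steps can be sketched outside of Cloud Dataprep in a few lines of plain Python. This is only an illustration of the assess, resolve, validate cycle; the function names, the sample rows, and the choice of "missing value" as the only issue checked are all assumptions.

```python
def is_missing(value):
    """Treat None and the empty string as missing."""
    return value is None or value == ""


def assess(rows, required):
    """Step 1: count missing values for each required field."""
    return {f: sum(1 for r in rows if is_missing(r.get(f))) for f in required}


def resolve(rows, defaults):
    """Step 2: fill missing values from a table of defaults (returns new rows)."""
    return [
        {**r, **{f: d for f, d in defaults.items() if is_missing(r.get(f))}}
        for r in rows
    ]


def validate(rows, required):
    """Step 3: re-run the assessment and check that nothing is left."""
    return all(count == 0 for count in assess(rows, required).values())


# Hypothetical inventory rows from a partner feed.
rows = [
    {"sku": "A1", "qty": 5},
    {"sku": "B2", "qty": None},
    {"sku": "", "qty": 3},
]
required = ["sku", "qty"]

issues = assess(rows, required)              # one missing sku, one missing qty
rows = resolve(rows, {"qty": 0})             # first pass: default the quantity
first_pass_ok = validate(rows, required)     # still False: a sku is missing
rows = [r for r in rows if not is_missing(r["sku"])]  # second pass: drop the row
second_pass_ok = validate(rows, required)    # True
```

The first validation fails because only one of the two issues was addressed, which is why the process is iterative: each validation can send you back to another round of resolution.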
Cloud Dataprep constantly profiles the data you’re working on, from the moment you open the grid interface and start preparing data. With Dataprep’s real-time Active Profiling, you can see the impact of each data cleaning step on your data.
The profile results are summarized in each column header as an interactive visual profile that highlights key characteristics of your data. When you click one of these profile bars in a column header, Cloud Dataprep suggests transformations to remediate mismatched or missing values. You can try a transformation, preview its impact, then select it or tweak it. At any point, you can revert to a specific previous step if you don’t like the result.
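As a rough illustration of what such a column profile summarizes, the sketch below sorts each value of a column into valid, mismatched, or missing against an expected type. It is plain Python written for this post, not Cloud Dataprep's implementation; `profile_column` and its `parse` argument are hypothetical names.

```python
from collections import Counter


def profile_column(values, parse):
    """Count valid, mismatched, and missing values in one column.

    `parse` is any callable that raises ValueError or TypeError when a
    value does not fit the expected type (for example, `int` or `float`).
    """
    counts = Counter(valid=0, mismatched=0, missing=0)
    for value in values:
        if value is None or str(value).strip() == "":
            counts["missing"] += 1
            continue
        try:
            parse(value)
        except (TypeError, ValueError):
            counts["mismatched"] += 1
        else:
            counts["valid"] += 1
    return dict(counts)


# A hypothetical quantity column that is expected to hold integers.
quantity = ["12", "7", "n/a", "", None, "3"]
summary = profile_column(quantity, int)
```

Re-running the same function after each cleaning step gives a simple before-and-after comparison, which is the idea behind seeing the impact of every step on the profile.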
With these basic concepts in mind, let’s cover Cloud Dataprep data quality capabilities.
1. Assessing your data quality
As soon as you open a dataset in the grid interface, you can access data quality signals that help you assess data issues and guide your work in cleaning the data.