Modern organizations process greater volumes of text than ever before. Although certain tasks like legal annotation must be performed by experienced professionals with years of domain expertise, other processes involve simpler kinds of sorting, processing, and analysis, where machine learning can often lend a helping hand.
Categorizing text content is a common machine learning task, typically called “content classification,” and it has all kinds of applications, from analyzing sentiment in a review of a consumer product on a retail site, to routing customer service inquiries to the right support agent. AutoML Natural Language helps developers and data scientists build custom content classification models without coding. Google Cloud’s Natural Language API, by contrast, classifies input text into a set of predefined categories. If those categories work for you, the API is a great place to start, but if you need custom categories, then building a model with AutoML Natural Language is very likely your best option.
In this blog post, we’ll guide you through the entire process of using AutoML Natural Language. We’ll use the 20 Newsgroups dataset, which consists of about 20,000 posts, roughly evenly divided across 20 different newsgroups, and is frequently used for content classification and clustering tasks.
As you’ll see, this can be a fun and tricky exercise, since the posts typically use casual language and don’t always stay on topic. Also, some of the newsgroups that we’ll use from the dataset overlap quite a bit; for example, two separate groups cover PC and Mac hardware.
Preparing your data
Let’s start by downloading the data. I’ve included a link to a Jupyter notebook that will download the raw dataset, and then transform it into the CSV format expected by AutoML Natural Language. AutoML Natural Language looks for the text itself or a URL in the first column, and the label in the second column. In our example, we’re assigning one label to each sample, but AutoML Natural Language also supports multiple labels per sample.
To download the data, you can simply run the notebook in the hosted Google Colab environment, or you can find the source code on GitHub.
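If you’d like a feel for the two-column layout described above before opening the notebook, here is a minimal sketch. It is not the notebook’s code: the helper name write_automl_csv and the sample posts are made up for illustration, and it assumes a headerless CSV with the text in the first column and a single label in the second. It uses only Python’s standard csv module.

```python
import csv
import io


def write_automl_csv(samples, out_file):
    """Write (text, label) pairs as two-column CSV rows: text first, label second.

    Hypothetical helper for illustration; assumes no header row is wanted.
    Returns the number of rows written.
    """
    writer = csv.writer(out_file, quoting=csv.QUOTE_ALL)
    count = 0
    for text, label in samples:
        # Collapse newlines and repeated whitespace so each post is one compact cell.
        cleaned = " ".join(text.split())
        if not cleaned:
            # Skip posts that are empty after cleaning.
            continue
        writer.writerow([cleaned, label])
        count += 1
    return count


# Made-up sample posts; the labels are real 20 Newsgroups group names.
samples = [
    ("My SCSI drive is not recognized\nafter the upgrade.", "comp.sys.mac.hardware"),
    ("Which IDE controller works with a 486?", "comp.sys.ibm.pc.hardware"),
    ("   ", "sci.space"),
]

buffer = io.StringIO()
written = write_automl_csv(samples, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, pass a file opened with open("twenty_newsgroups.csv", "w", newline="", encoding="utf-8") in place of the buffer; the filename here is only an example.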
Importing your data
We are now ready to access the AutoML Natural Language UI. Start by clicking the New Dataset button to create a new dataset. Give it a name like twenty_newsgroups and upload the CSV you downloaded in the earlier step.