March 4, 2020

The goal of machine learning (ML) is to extract patterns from existing data to make predictions on new data. Embeddings are an important tool for creating useful representations for input features in ML, and are fundamental to search and retrieval, recommendation systems, and other use cases.
In this blog, we’ll demonstrate a composable, extensible, and reusable implementation of Kubeflow Pipelines to prepare and learn item embeddings for structured data (item2hub pipeline), as well as custom word embeddings from a specialized text corpus (text2hub pipeline). These pipelines export the embeddings as TensorFlow Hub (TF-Hub) models, to be used as representations in various machine learning downstream tasks.
The end-to-end KFP pipelines we’ll show here, and their individual components, are available in Google Cloud AI Hub. You can go through this tutorial, which executes the text2hub pipeline on the Manual of the Operations of Surgery by Joseph Bell as its text corpus, to learn specialized word embeddings in the medical domain.
Before we go into detail on these pipelines, let’s step back and get some background on the goals of ML, the types of data we can use, what exactly embeddings are, and how they are utilized in various ML tasks.
Machine learning fundamentals
As mentioned above, we use ML to discover patterns in existing data and use them to make predictions on new data. The patterns an ML algorithm discovers represent the relationships between the features of the input data and the output target that will be predicted. Typically, you expect that instances with similar feature values will lead to similar predicted output.
Therefore, the representation of these input features and the objective against which the model is trained directly affect the nature and quality of the learned patterns. Input features are typically represented as real (numeric) values, and models are typically trained against labels, that is, a set of existing output data.
For some datasets, it may be straightforward to determine how to represent the input features and train the model. For example, if you’re estimating the price of a house, property size in square meters, age of the building in years, and number of rooms might be useful features, while historical housing prices could make good labels to train the model from.
Other cases are more complicated. How do you represent text data as vectors (lists) of numbers? And what if you don’t have labeled data? For example, can you learn anything useful about how similar two songs are if you only have data about playlists that users create? There are two ideas that can help us use more complex types of data for ML tasks:
- Embeddings, which map discrete values (such as words or product IDs) to vectors of numbers.
- Self-supervised training, where we define a made-up objective instead of using a label. For example, we may not have any data that says that song_1 and song_2 are similar, but we can say that two songs are similar if they appear together in many users’ playlists.
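The playlist idea above can be sketched in a few lines of plain Python: without any labels, we derive a similarity signal purely from how often two songs appear together. The playlists and song IDs here are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical user playlists: the only data we have is which songs
# users grouped together -- there are no explicit similarity labels.
playlists = [
    ["song_1", "song_2", "song_3"],
    ["song_1", "song_2", "song_4"],
    ["song_2", "song_3", "song_5"],
    ["song_1", "song_2", "song_5"],
]

# Count how often each pair of songs co-occurs in the same playlist.
co_counts = Counter()
for playlist in playlists:
    for a, b in combinations(sorted(set(playlist)), 2):
        co_counts[(a, b)] += 1

# song_1 and song_2 appear together in three playlists, so under the
# self-supervised objective they are treated as similar.
print(co_counts[("song_1", "song_2")])
```

These co-occurrence counts are exactly the kind of signal an embedding-learning algorithm turns into vector representations.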
What is an embedding?
As mentioned above, an embedding is a way to represent discrete items (such as words, song titles, etc.) as vectors of floating point numbers. Embeddings usually capture the semantics of an item by placing similar items close together in the embedding space. Take the following two pieces of text, for example: “The squad is ready to win the football match,” and, “The team is prepared to achieve victory in the soccer game.” They share almost none of the same words, but they should be close to one another in the embedding space because their meaning is very similar.
Embeddings can be generated for items such as words, sentences, images, or entities like song_ids, product_ids, customer_ids, and URLs, among others. Generally, we understand that two items are similar if they share the same context, i.e., if they occur with similar items. For example, words that occur in the same textual context seem to be similar, movies watched by the same users are assumed to be similar, and products appearing in shopping baskets tend to be similar. Therefore, a sensible way to learn item embeddings is based on how frequently two items co-occur in a dataset.
Because item similarity from co-occurrence is independent of a given learning task (such as classifying the songs into categories, or tagging words with POS), embeddings can be learned in a self-supervised fashion: directly from a text corpus or song playlists without needing any special labelling. Then, the learned embedding can be re-used in downstream tasks (classification, regression, recommendation, generation, forecasting, etc.) through transfer learning.
A typical use of an item embedding is to search for and retrieve the items that are most similar to a given query item. For example, this can be used to recommend similar and relevant products, services, games, songs, movies, and so on.
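Retrieval by similarity boils down to ranking all candidate items by their cosine similarity to the query item's embedding. The item IDs and 3-dimensional vectors below are hypothetical; at production scale you would use an approximate nearest-neighbor index rather than this brute-force scan.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical learned item embeddings, keyed by product ID.
item_embeddings = {
    "product_1": [0.9, 0.1, 0.4],
    "product_2": [0.8, 0.2, 0.5],
    "product_3": [-0.3, 0.9, 0.1],
}

def most_similar(query_id, embeddings, top_k=2):
    """Return the top_k items most similar to the query, by cosine similarity."""
    query = embeddings[query_id]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in embeddings.items()
        if other != query_id
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_k]

print(most_similar("product_1", item_embeddings))
```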
Pre-trained vs. custom embeddings
TensorFlow Hub is a library for reusable machine learning and a repository of reusable, pre-trained models. These models can be, for example, text embeddings trained on web text or image feature extractors trained on image classification tasks.
More precisely, a pre-trained model shared on TensorFlow Hub is a self-contained piece of a TensorFlow graph, along with its weights and assets, that can be reused across different tasks. By reusing a pre-trained model, you can train a downstream model using a smaller amount of data, improve generalization, or simply speed up training. Each model from TF-Hub provides an interface to the underlying TensorFlow graph so it can be used with little or no knowledge of its internals. Models sharing the same interface can be switched very easily, speeding up experimentation.
For example, you can use the Universal Sentence Encoder model to produce the embedding for a given input text as follows: