How we use AutoML, Multi-task learning and Multi-tower models for Pinterest Ads
Ernest Wang | Software Engineer, Ads Ranking
People come to Pinterest in an exploration mindset, often engaging with ads the same way they do with organic Pins. Within ads, our mission is to help Pinners go from inspiration to action by introducing them to the compelling products and services that advertisers have to offer. A core component of the ads marketplace is predicting Pinners' engagement with the ads we show them. In addition to click prediction, we look at how likely a user is to save or hide an ad. We make these predictions for different ad formats (image, video, carousel) and in the context of the user (e.g., browsing the home feed, performing a search, or looking at a specific Pin).
In this blog post, we explain how key technologies, such as AutoML, DNN, Multi-Task Learning, Multi-Tower models, and Model Calibration, allow for highly performant and scalable solutions as we build out the ads marketplace at Pinterest. We also discuss the basics of AutoML and how it’s used for Pinterest Ads.
AutoML
Pinterest’s AutoML is a self-contained deep learning framework that powers feature injection, feature transformation, model training, and serving. AutoML provides a simple descriptive template for combining a variety of pre-implemented feature transforms so that deep neural networks can learn directly from raw signals, which significantly reduces the manual labor of feature engineering. AutoML also provides rich model representations that employ state-of-the-art machine learning techniques. We developed our ads CTR prediction models with AutoML, which has yielded substantial gains.
Feature processing
While many data scientists and machine learning engineers believe that feature engineering is more of an art than a science, AutoML finds many common patterns in this work and automates the process as much as possible. Deep learning theory has shown that deep neural networks (DNNs) can approximate arbitrary functions given enough capacity. AutoML leverages this property and lets us learn directly from raw features by applying a series of predefined feature transform rules.
AutoML first characterizes features into generic signal formats:
- Continuous: single floating point value feature that can be consumed directly
- OneHot: single-valued categorical data that usually go through an embedding lookup layer, e.g., user country and language
- Indexed: multi-hot categorical features that usually go through embedding and then projection/MLP summarization layers
- Hash_OneHot: one-hot data with unbounded vocabulary size
- Hash_Indexed: indexed data with unbounded vocabulary size
- Dense: a dense floating point valued vector, e.g., GraphSage [6] embeddings
Then the feature transforms are performed according to the signal format and the statistical distribution of the raw signal:
- Continuous and dense features usually go through squashing or normalization
- One-hot and multi-hot encoded signals go through an embedding lookup and are then projected
- Categorical signals with unbounded vocabulary are hashed and converted to one-hot and multi-hot signals
This saves much of the usually tedious feature engineering work and lets machine learning engineers focus on signal quality and modeling techniques.
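As a rough illustration, here is a minimal sketch in PyTorch (with hypothetical names; the actual AutoML descriptive template is not shown in this post) of how the signal formats above might map to transform rules:

```python
import torch
import torch.nn as nn

class FeatureTransforms(nn.Module):
    """Hypothetical dispatch from signal format to transform rule."""

    def __init__(self, vocab_sizes, embed_dim=16, hash_size=100_000):
        super().__init__()
        self.hash_size = hash_size
        self.embeddings = nn.ModuleDict(
            {name: nn.Embedding(size, embed_dim) for name, size in vocab_sizes.items()}
        )

    def continuous(self, x, lo, hi):
        # Continuous/Dense: squash or normalize; here, min-max with clipping (see "Key learnings").
        return (x.clamp(lo, hi) - lo) / (hi - lo)

    def one_hot(self, name, ids):
        # OneHot: single-valued categorical (e.g., user country) -> embedding lookup.
        return self.embeddings[name](ids)

    def indexed(self, name, ids):
        # Indexed: multi-hot categorical -> embedding lookup, then a summarize step (mean pooling here).
        return self.embeddings[name](ids).mean(dim=1)

    def hash_one_hot(self, name, raw_ids):
        # Hash_OneHot: unbounded vocabulary -> hash into a fixed id space, then treat as one-hot.
        # Assumes the embedding table for this signal has hash_size rows.
        return self.one_hot(name, raw_ids % self.hash_size)
```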
Model structure
AutoML leverages state-of-the-art deep learning technologies to empower ranking systems. The model consists of multiple layers that have distinct, yet powerful, learning capabilities.
The representation layer: The input features are formulated in the representation layer. The feature transforms described in the previous section are applied on this layer.
The summarization layer: Features of the same type (e.g., Pin’s category vector and Pinner’s category vector) are grouped together. A common representation (embedding) is learned to summarize the signal group.
The latent cross layer: The latent cross layers concatenate features from multiple signal groups and perform feature crossing with multiplicative layers. Latent crossing enables high-degree interactions among features.
The fully connected layer: The fully connected (FC) layers implement the classic deep feed-forward neural network.
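To make the layer stack concrete, here is a rough PyTorch sketch (layer sizes, mean-style summarization, and pairwise multiplicative crossing are assumptions for illustration, not Pinterest's actual implementation):

```python
import torch
import torch.nn as nn

class AutoMLRankingModel(nn.Module):
    """Sketch of the stack: representation -> summarization -> latent cross -> fully connected."""

    def __init__(self, group_dims, summary_dim=64, hidden_dims=(256, 128)):
        super().__init__()
        # Summarization layer: one projection per signal group
        # (e.g., Pin's category vector, Pinner's category vector).
        self.summarizers = nn.ModuleList([nn.Linear(d, summary_dim) for d in group_dims])
        n_groups = len(group_dims)
        n_cross = n_groups * (n_groups - 1) // 2
        in_dim = n_groups * summary_dim + n_cross * summary_dim
        # Fully connected layers on top of the group summaries and their latent crosses.
        layers, prev = [], in_dim
        for h in hidden_dims:
            layers += [nn.Linear(prev, h), nn.ReLU()]
            prev = h
        self.fc = nn.Sequential(*layers, nn.Linear(prev, 1))

    def forward(self, groups):
        # "groups" is the representation layer output: one transformed tensor per signal group.
        summaries = [s(g) for s, g in zip(self.summarizers, groups)]
        # Latent cross layer: multiplicative interactions between pairs of group embeddings.
        crosses = [summaries[i] * summaries[j]
                   for i in range(len(summaries)) for j in range(i + 1, len(summaries))]
        x = torch.cat(summaries + crosses, dim=1)
        return self.fc(x)
```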
Key learnings
As sophisticated as AutoML is, the framework can be sensitive to errors or noise introduced into the system, so it is critical to ensure the model's stability to maximize its learning power. During development we found that several factors significantly affect the quality of AutoML models:
- Feature importance: AutoML gives us a chance to revisit the signals used in our models. Some signals that stood out in the old GBDT models (see the Calibration section) are not necessarily significant in DNNs, and vice versa. Bad features are not only useless to the model; the noise they introduce can actively degrade it. We therefore built a feature importance report based on the random permutation technique [7] (see the sketch after this list), which has greatly facilitated model development.
- The distribution of feature values: Because AutoML skips manual feature engineering, it relies on feature values being reasonably distributed. The feature transforms defined in the representation layer may, however, sometimes fail to capture extreme values. Such values disrupt the stability of the subsequent neural networks, especially the latent cross layers, where extreme values are amplified and passed on to the next layer. Outliers in both training and serving data must be properly managed.
- Normalization: (Batch) normalization is one of the most commonly used deep learning techniques. Beyond that, we find that min-max normalization with value clipping is particularly useful for the input layer. It is a simple yet effective treatment for the outliers in feature values mentioned above.
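Below is a minimal sketch of the random-permutation feature importance idea [7] referenced above (the production report surely handles grouped and embedding features differently; `model.predict` and `metric` are placeholder interfaces):

```python
import numpy as np

def permutation_importance(model, X, y, metric, n_repeats=3, seed=0):
    """Shuffle one feature column at a time on held-out data and measure how much
    the evaluation metric (e.g., AUC) drops; larger drops indicate more important features."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break the feature's relationship with the label
            drops.append(baseline - metric(y, model.predict(Xp)))
        importances[j] = np.mean(drops)
    return importances
```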
Multi-task and multi-tower
Besides clickthrough rate (CTR) prediction, we also estimate other user engagement rates as a proxy for overall user satisfaction. These engagements include, but are not limited to, good clicks (click-throughs where the user doesn't bounce back immediately) and scroll-ups (the user scrolls up on the ad to reach more content on the landing page). DNNs allow us to learn multi-task models. Multi-task learning (MTL) [1] has several advantages:
- Simplify the system: Training and maintaining a separate model for each engagement type is difficult and tedious. The system becomes much simpler, and engineering velocity improves, when we can train several of them at the same time.
- Save infra cost: With a common underlying model shared by multiple heads, repeated computation is minimized at both training and serving time.
- Transfer knowledge across objectives: Learning different yet correlated objectives simultaneously allows the objectives to share knowledge with one another.
Each of the engagement types is defined as an output head of the model. The loss function of an MTL model takes the form

$$L = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{k}\Big[y_{ij}\log \hat{y}_{ij} + (1 - y_{ij})\log\big(1 - \hat{y}_{ij}\big)\Big]$$

where n denotes the number of examples, k the number of heads, and ŷ and y the prediction and the true label, respectively.
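A minimal PyTorch sketch of such a multi-head loss (assuming a binary cross-entropy per head and equal head weights, which the post does not specify):

```python
import torch
import torch.nn.functional as F

def multi_task_loss(logits, labels):
    """logits, labels: tensors of shape (n_examples, k_heads).
    Averages a per-head binary cross-entropy over all examples and heads."""
    return F.binary_cross_entropy_with_logits(logits, labels)

# Example: 4 examples, 3 heads (e.g., click, good click, scroll-up).
logits = torch.randn(4, 3)
labels = torch.randint(0, 2, (4, 3)).float()
print(multi_task_loss(logits, labels))
```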
Apart from MTL, we face another challenge with the Pinterest Shopping and Standard ads products, which are distinct in many ways:
- Creatives: Standard Ads use images provided by partners as creatives; Shopping Ads images are crawled from partners' catalogs.
- Inventory size: The Shopping Ads inventory is many times larger than that of Standard Ads.
- Features: Shopping Ads have unique product features like texture, color, price, etc.
- User behavior patterns: Pinners with stronger purchase intention tend to engage with Shopping Ads.
We had been training and serving the Shopping and Standard models separately before adopting DNNs. With the help of AutoML, we started consolidating the two models. We then encountered a paradox: although the individual DNN models trained on Shopping or Standard data respectively outperformed the old models, a consolidated model trained on the combined Shopping and Standard data did not outperform either of the old individual models. Our hypothesis was that a single-tower structure fails to learn the distinct characteristics of the two data sources simultaneously.
The shared-bottom, multi-tower model architecture [1] was therefore employed to tackle the problem. We use the existing AutoML layers as the shared bottom for the two data sources. The multi-tower structure is implemented as separate multilayer perceptrons (MLPs) on top of that. Examples from each source only go through their own tower; examples from other sources are masked out. Within each tower, every objective (engagement type) is trained with its own MLP head. Figure 3 illustrates the model architecture.
The multi-tower structure is effective in isolating the interference between training examples from different data sources, while the shared bottom captures the common knowledge of all the data sources.
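A rough PyTorch sketch of the shared-bottom, multi-tower structure (tower sizes, the number of objectives, and the masking mechanics are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTowerModel(nn.Module):
    """Shared bottom + one tower per data source (e.g., Shopping, Standard),
    with one output per engagement objective inside each tower."""

    def __init__(self, in_dim, n_sources=2, n_objectives=3, bottom_dim=128, tower_dim=64):
        super().__init__()
        self.shared_bottom = nn.Sequential(nn.Linear(in_dim, bottom_dim), nn.ReLU())
        self.towers = nn.ModuleList([
            nn.Sequential(nn.Linear(bottom_dim, tower_dim), nn.ReLU(), nn.Linear(tower_dim, n_objectives))
            for _ in range(n_sources)
        ])

    def forward(self, x, source_id):
        """x: (batch, in_dim); source_id: (batch,) long tensor with values in [0, n_sources).
        Each example is routed through its own tower only; the other towers' outputs are
        zeroed out, so gradients from other sources do not flow into them."""
        shared = self.shared_bottom(x)
        logits = torch.stack([tower(shared) for tower in self.towers], dim=1)  # (batch, n_sources, n_objectives)
        mask = F.one_hot(source_id, num_classes=logits.size(1)).unsqueeze(-1).float()
        return (logits * mask).sum(dim=1)  # (batch, n_objectives)
```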
We evaluated the offline AUC and log-loss of the proposed multi-tower model against a single-tower baseline; the results are summarized in Table 1. The proposed model performs better on both Shopping and Standard Ads, with especially significant improvement on the Shopping slice. We further validated the results through online A/B tests, which consistently demonstrated positive gains in line with the offline evaluation.
Calibration
Calibration represents the confidence in the probability predictions, which is essential to ads ranking. For CTR prediction models, calibration is defined as the ratio of the average predicted CTR to the empirical CTR:

$$\text{calibration} = \frac{\text{average predicted CTR}}{\text{empirical CTR}}$$
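For example, a calibration of 1.0 means predictions match observed engagement on average; a quick illustrative check:

```python
import numpy as np

def calibration(predicted_ctr, clicked):
    """Ratio of average predicted CTR to empirical CTR; 1.0 is perfectly calibrated on average."""
    return predicted_ctr.mean() / clicked.mean()

preds = np.array([0.02, 0.01, 0.03, 0.04])
clicks = np.array([0, 0, 1, 0])
print(calibration(preds, clicks))  # 0.1 -> this model heavily under-predicts
```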
The calibration model of the Pinterest Ads ranking system has evolved through three stages:
- GBDT + LR hybrid [5]: Gradient boosted decision trees (GBDT) are trained against the CTR objective. The GBDT model is featurized and fed into a logistic regression (LR) model that optimizes the same objective. LRs by nature generate calibrated predictions.
- Wide & deep: We rely on the wide component (also an LR model) of the wide & deep model [2] for calibration.
- AutoML + calibration layer: A lightweight Platt Scaling model [3] is trained for each of the heads of the AutoML model.
The AutoML + calibration layer approach is the latest milestone for the calibration models.
As described above, we have been relying on LR models to calibrate the predictions of engagement rates. This solution has several drawbacks:
- The wide (LR) model usually contains millions of features, most of which are ID features and cross features, so the model is highly sparse. While the LR does a good job of memorizing many critical signals, it requires a large number of training examples to converge. As a result, our models often have difficulty capturing transient trends in the ads inventory or user behavior (e.g., a trending event or topic).
- The LR model fails to learn the non-linearities and high-order feature interactions that are DNNs' strength.
We push all the sparse features into the AutoML model. AutoML's DNN models tend not to be well calibrated [4], so we train a lightweight Platt Scaling model [3] (essentially an LR model) on a relatively small number of signals for calibration. The signals in the calibration layer include contextual signals (country, device, time of day, etc.), creative signals (video vs. image), and user profile signals (language, etc.). The model is both lean and dense, which enables it to converge fast: we are able to update the calibration layer hourly, while the DNNs are updated daily.
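A minimal sketch of such a Platt Scaling calibration layer (scikit-learn is used here purely for illustration; the feature set and training flow are simplified assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_calibration_layer(dnn_logits, dense_signals, labels):
    """Platt Scaling [3] is essentially a logistic regression on top of the raw model score,
    here augmented with a few dense contextual/creative/user-profile signals."""
    X = np.column_stack([dnn_logits, dense_signals])
    return LogisticRegression(max_iter=1000).fit(X, labels)

def calibrated_ctr(calibrator, dnn_logits, dense_signals):
    X = np.column_stack([dnn_logits, dense_signals])
    return calibrator.predict_proba(X)[:, 1]
```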
The new calibration solution reduced the day-to-day calibration error by as much as 80%.
More specifically, we found two technical nuances about the calibration model: negative downsampling and selection bias.
Negative downsampling
Negative examples in the training data are downsampled to keep the labels balanced [5]. The prediction p generated by the model is rescaled with the negative downsampling rate w to ensure the final prediction q is calibrated:

$$q = \frac{p}{p + \frac{1 - p}{w}}$$
This formula doesn't hold under multi-task learning, because the ratios between the different user engagement types are non-deterministic. Our solution is to set a base downsampling rate on one of the tasks (say, the CTR head); the rescaling multipliers of the other tasks are estimated dynamically in each training batch from the base rate and the ratio of engagement counts between each task and the base.
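A small sketch of the single-task rescaling from [5]; the per-head dynamic estimation described above is omitted, since the exact production logic isn't given in the post:

```python
import numpy as np

def recalibrate(p, w):
    """Undo negative downsampling: p is the raw prediction of a model trained on data
    where negatives were kept with rate w; the result is the calibrated probability q."""
    return p / (p + (1.0 - p) / w)

# Example: with 10% of negatives kept, a raw prediction of 0.5 maps to roughly 0.09.
print(recalibrate(np.array([0.5]), w=0.1))
```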
Selection bias
Our ads models are trained on user action logs, so selection bias is inevitable when the training examples are generated by other models. The intuition is that the new model is never exposed to examples that were not selected by the old model. As a result, we observe that newly trained models are consistently miscalibrated when they are first put into experimentation. The calibration usually corrects itself after the model ramps up, presumably because a larger portion of its training examples are then generated by the model itself, reducing the effect of selection bias.
While we don't aim to fundamentally fix selection bias, a small trick helps us mitigate the issue: we train the calibration layer only on examples generated by the model itself. The underlying DNN models are still trained on all available examples to ensure convergence. Because the calibration model is lightweight, it doesn't need much training data to converge. That way the calibration layer learns from the model's own mistakes, and the results are surprisingly good: newly trained models are as well calibrated as the production model during A/B testing, even with lower traffic.
Conclusion
AutoML has equipped our multi-task ads CTR models with automatic feature engineering and state-of-the-art machine learning techniques. The multi-tower structure enables us to learn from data sources with distinct characteristics by isolating their interference with one another. This work has driven significant improvements in value for Pinners, advertisers, and Pinterest. We also learned that a lightweight Platt Scaling model can effectively calibrate DNN predictions and mitigate selection bias. As future work, we will make the AutoML framework more extensible so we can try more deep learning techniques, such as sequence models.
Acknowledgements
This article summarizes a three-quarter work that involved multiple teams of Pinterest. The author wants to thank Minzhe Zhou, Wangfan Fu, Xi Liu, Yi-Ping Hsu and Aayush Mudgal for their tireless contributions. Thanks to Xiaofang Chen, Ning Zhang, Crystal Lee and Se Won Jang for many meaningful discussions. Thanks to Jiajing Xu, Xin Liu, Mark Otuteye, Roelof van Zwol, Ding Zhou and Randall Keller for the leadership.
References
[1] R. Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997.
[2] J. Chen, B. Sun, H. Li, H. Lu, and X.-S. Hua. Deep ctr prediction in display advertising. In Proceedings of the 24th ACM international conference on Multimedia, pages 811–820, 2016.
[3] J. C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999.
[4] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.
[5] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, et al. Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pages 1–9, 2014.
[6] W. L. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large graphs, 2017.
[7] A. Fisher, C. Rudin, and F. Dominici. All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models simultaneously, 2018.