Posted by Manisha Arora, Nithya Mahadevan, and Aritra Biswas, gPS Data Science team
Overview of Discovery Ads and the Need for Ad Performance Analysis
Discovery ads, launched in May 2019, allow advertisers to easily extend the reach of their social ads to users across YouTube, Google Feed, and Gmail worldwide. They provide brands a new opportunity to reach 3 billion people as they explore their interests and search for inspiration across their favorite Google feeds (YouTube, Gmail, and Discover) -- all with a single campaign. Learn more about Discovery ads here.
Interaction Rate = interactions / impressions
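For example (with purely illustrative numbers), an ad copy that receives 500 interactions over 20,000 impressions has an Interaction Rate of 500 / 20,000 = 2.5%.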
“Customers need a data-driven method to identify the textual & imagery elements in Discovery Ad copies that drive the Interaction Rate of their campaigns.”
- Manisha Arora, Data Scientist
Our analysis approach:
The Data Science team at Google is investing in a machine learning approach to uncover insights from complex, unstructured data and provide machine-learning-based recommendations to our customers. Machine learning helps us study what works in ads at scale, and these insights can greatly benefit the advertisers.
We follow a six-step approach for Discovery Ad Performance Analysis:
- Understand Business Goals
- Build Creative Hypothesis
- Data Extraction
- Feature Engineering
- Machine Learning Modeling
- Analysis & Insight Generation
“Machine Learning helps us study what works in ads at scale and these insights can greatly benefit the advertisers.”
- Manisha Arora, Data Scientist
Once we have a hypothesis we are working towards, the next step is to deep-dive into the technical analysis.
Data Extraction & Pre-processing
Text Feature Extraction
Image Feature Extraction
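As a rough sketch of what this extraction step can look like in Python, assuming the google-cloud-language and google-cloud-vision client libraries (the ad text and image file below are placeholders, not real campaign data):

```python
from google.cloud import language_v1, vision

# Text features: sentiment and basic counts from the ad headline / description.
lang_client = language_v1.LanguageServiceClient()
ad_text = "Low rates on personal loans. Apply today."  # placeholder ad copy
document = language_v1.Document(
    content=ad_text, type_=language_v1.Document.Type.PLAIN_TEXT
)
sentiment = lang_client.analyze_sentiment(
    request={"document": document}
).document_sentiment
text_features = {
    "sentiment_score": sentiment.score,
    "sentiment_magnitude": sentiment.magnitude,
    "word_count": len(ad_text.split()),
    "char_count": len(ad_text),
}

# Image features: faces, logos, and object / label annotations from the ad image.
vision_client = vision.ImageAnnotatorClient()
with open("ad_image.jpg", "rb") as f:  # placeholder image file
    image = vision.Image(content=f.read())
faces = vision_client.face_detection(image=image).face_annotations
logos = vision_client.logo_detection(image=image).logo_annotations
labels = vision_client.label_detection(image=image).label_annotations
image_features = {
    "num_faces": len(faces),
    "num_smiling_faces": sum(
        face.joy_likelihood >= vision.Likelihood.LIKELY for face in faces
    ),
    "has_logo": len(logos) > 0,
    "labels": [label.description for label in labels],  # used in the clustering step later
}
```

In the full pipeline, this kind of extraction would run over every ad copy and image in the dataset, with the resulting features joined to the context in which each ad was served.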
The following is the holistic set of features extracted from the ad content:
Text Feature Design
1. Generic text features
a. These are features returned by Google Cloud’s Language API including sentiment, word / character count, tone (imperative vs indicative), symbols, most frequent words and so on.
2. Industry-specific value propositions
a. These are features that apply only to a specific industry (e.g. finance) and are manually curated by the data science developer in collaboration with specialists and other industry experts.
- For example, for the finance industry, one value proposition can be “Price Offer”. A list of keywords / phrases related to price offers (e.g. “discount”, “low rate”, “X% off”) is curated based on domain knowledge to identify this value proposition in the ad copies. NLP techniques (e.g. WordNet synsets) and manual examination are used to make sure this list is inclusive and accurate.
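To make this concrete, here is a small sketch of how a seed keyword list for a value proposition might be expanded with WordNet synsets before manual review (assuming NLTK with the WordNet corpus downloaded; the seed terms are illustrative):

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet") once

# Illustrative seed terms for a "Price Offer" value proposition.
seed_terms = ["discount", "deal", "cheap", "offer"]

candidates = set(seed_terms)
for term in seed_terms:
    for synset in wn.synsets(term):
        for lemma in synset.lemma_names():
            candidates.add(lemma.replace("_", " ").lower())

# The expanded list is then reviewed manually (and with industry specialists)
# so that only phrases that genuinely signal a price offer are kept.
print(sorted(candidates))
```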
Image Feature Design
1. Generic image features
a. These features apply to all images and include the color profile, whether any logos were detected, how many human faces are included, etc.
b. The face-related features also include some advanced aspects: we look for prominent smiling faces looking directly at the camera, we differentiate between individuals vs. small groups vs. crowds, etc.
2. Object-based features
a. These features are based on the list of objects and labels detected in all the images in the dataset, which can often be a massive list including generic objects like “Person” and specific ones like particular dog breeds.
b. The biggest challenge here is dimensionality: we have to cluster together related objects into logical themes like natural vs. urban imagery.
c. We currently have a hybrid approach to this problem: we use unsupervised clustering approaches to create an initial clustering, but we manually revise it as we inspect sample images. The process is as follows (a code sketch follows the list below):
- Extract object and label names (e.g. Person, Chair, Beach, Table) from the Vision API output and filter out the most uncommon objects
- Convert these names to 50-dimensional semantic vectors using a Word2Vec model trained on the Google News corpus
- Using PCA, extract the top 5 principal components from the semantic vectors. This step takes advantage of the fact that each Word2Vec neuron encodes a set of commonly adjacent words, and different sets represent different axes of similarity and should be weighted differently
- Use an unsupervised clustering algorithm, namely either k-means or DBSCAN, to find semantically similar clusters of words
- We are also exploring augmenting this approach with a combined distance metric:
d(w1, w2) = a * (semantic distance) + b * (co-appearance distance)
where the latter is a Jaccard distance metric
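A minimal sketch of this clustering step, using gensim and scikit-learn (the label list is illustrative, and the publicly available 300-dimensional Google News vectors stand in for the 50-dimensional model described above):

```python
import numpy as np
import gensim.downloader as api
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Object / label names taken from the Vision API output (illustrative list).
labels = ["person", "chair", "beach", "table", "dog", "skyscraper", "forest", "car"]

# Publicly available Google News Word2Vec model (300-d) as a stand-in.
w2v = api.load("word2vec-google-news-300")
kept = [label for label in labels if label in w2v]
vectors = np.vstack([w2v[label] for label in kept])

# Keep the top principal components so that different axes of similarity are
# weighted by how much variance they explain.
components = PCA(n_components=5).fit_transform(vectors)

# Unsupervised clustering into candidate themes (k-means here; DBSCAN is the
# alternative mentioned above). The resulting clusters are revised manually.
cluster_ids = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(components)
for label, cluster_id in zip(kept, cluster_ids):
    print(cluster_id, label)
```

If the combined semantic + co-appearance distance above is used instead, one option is to precompute the pairwise distance matrix and pass it to DBSCAN with metric="precomputed", since k-means cannot work from an arbitrary distance matrix.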
Each of these components represents a choice the advertiser made when creating the messaging for an ad. Now that we have a variety of ads broken down into components, we can ask: which components are associated with ads that perform well or not so well?
We use a fixed effects¹ model to control for unobserved differences in the context in which different ads were served. This is because the features we are measuring are observed multiple times in different contexts, i.e. ad copy, audience group, time of year, and device on which the ad is served.
The trained model estimates the impact of individual keywords, phrases, and image components in the Discovery ad copies. It models Interaction Rate (denoted as ‘IR’ below) as a function of individual ad copy features plus controls:
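As a sketch of this model form (our notation; the exact specification may differ), with fixed effects capturing the serving context:

IR = b0 + b1*x1 + b2*x2 + ... + bk*xk + (audience group effect) + (device effect) + (time effect) + error

where each x is a binary indicator for an ad copy feature (a keyword, phrase, or image component) and each b is the estimated impact of that feature on Interaction Rate.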
We use ElasticNet to spread the effect of features in the presence of multicollinearity and to improve the explanatory power of the model.
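A minimal sketch of this estimation step with scikit-learn, where the fixed effects enter as one-hot encoded context columns (the input file, column names, and regularization settings are illustrative, not the production pipeline):

```python
import pandas as pd
from sklearn.linear_model import ElasticNet

# One row per ad copy x serving context, with binary ad-feature columns
# (e.g. "price_offer", "smiling_face") plus the context the ad was served in.
ads = pd.read_csv("discovery_ad_observations.csv")  # illustrative input

feature_cols = ["price_offer", "call_to_action", "smiling_face", "natural_imagery"]
context_cols = ["audience_group", "device", "month"]

# One-hot encode the context columns so they act as fixed-effect controls.
X = pd.get_dummies(ads[feature_cols + context_cols], columns=context_cols, drop_first=True)
y = ads["interaction_rate"]

# ElasticNet blends L1 and L2 penalties, spreading effects across correlated features.
model = ElasticNet(alpha=0.01, l1_ratio=0.5)
model.fit(X, y)

# Each coefficient approximates the change in interaction rate when a feature is present.
effects = pd.Series(model.coef_, index=X.columns).sort_values(ascending=False)
print(effects.head(10))
```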
“The machine learning model estimates the impact of individual keywords, phrases, and image components in Discovery ad copies.”
- Manisha Arora, Data Scientist
Outputs & Insights
The model outputs a coefficient for each feature: if the mean interaction rate without a feature ‘xx’ is X% and the feature has a coefficient of Y, then the mean interaction rate with feature ‘xx’ included will be (X + Y)%. This can help us determine the expected interaction rate if the most important features are included in the ad copies.
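For example (purely illustrative numbers), if ad copies without a “free delivery” phrase average a 1.0% interaction rate and the model assigns that phrase a coefficient of 0.2, the expected interaction rate for copies that include it is roughly 1.2%.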
Key takeaways (sample insights):
We analyze keywords and imagery tied to the unique value propositions of the product being advertised. There are six key value propositions we study in the model. The following are sample insights we have derived from these analyses.
Shortcomings:
1. The current model considers individual keywords only; it does not consider groups of keywords that might be driving ad performance (for example, the phrase “Buy Now” rather than the individual keywords “Buy” and “Now”).
2. Inference and predictions are based on historical data and aren’t necessarily an indication of future success.
3. Insights are based on industry-level analyses and may need to be tailored for a given advertiser.
DisCat breaks down exactly which features are working well for the ad and which ones have scope for improvement. These insights can help us identify high-impact keywords in the ads, which can then be used to improve ad quality and, in turn, business outcomes. As a next step, we recommend testing the new ad copies with experiments to provide a more robust analysis. The Google Ads A/B testing feature also allows you to create and run experiments to test these insights in your own campaigns.
Summary
Acknowledgement
Notes
1. Greene, W. H. (2011). Econometric Analysis, 7th ed. Prentice Hall; Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: Methods and Applications. Cambridge University Press.