Vivian Ye
7 min read · Mar 22, 2021


Analyzing and Predicting Starbucks’ Promotions

Introduction

Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks.

Not all users receive the same offer, and that is the challenge to solve with this data set.

There are three datasets provided by Starbucks:

  • portfolio.json — offer ids and metadata about each offer (duration, type, etc.)
  • profile.json — demographic data for each customer
  • transcript.json — records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

portfolio.json

  • id (string) — offer id
  • offer_type (string) — type of offer, i.e., BOGO, discount, or informational
  • difficulty (int) — minimum required spend to complete an offer
  • reward (int) — reward given for completing an offer
  • duration (int) — time for offer to be open, in days
  • channels (list of strings) — channels the offer was sent through (web, email, mobile, social)

profile.json

  • age (int) — age of the customer
  • became_member_on (int) — date when customer created an app account
  • gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
  • id (str) — customer id
  • income (float) — customer’s income

transcript.json

  • event (str) — record description (i.e., transaction, offer received, offer viewed, etc.)
  • person (str) — customer id
  • time (int) — time in hours since start of test. The data begins at time t=0
  • value (dict of strings) — either an offer id or a transaction amount, depending on the record

Problem Statement

The task is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.

Here are the main steps I followed, based on the Udacity Data Scientist course:

  1. Exploring the data set
  2. Pre-processing the data
  3. Combining the data set into a final clean data file containing the features that I want to analyze
  4. Splitting data into training and testing data sets
  5. Training the classifier
  6. Analyzing the results

Exploring the Data Set

This first step helps us understand the data and check for missing values.

Let’s read in the data sets provided by Starbucks:

Read in all 3 data files
Check initial portfolio data file
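The code for this step is roughly the following sketch (assuming the JSON files sit in a local data/ folder, as in the Udacity workspace):

```python
import pandas as pd

# The files are JSON-lines files, so orient='records' and lines=True are needed
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

print(portfolio.head())
```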

The “channels” column holds multiple values; let’s separate them into individual columns and convert “duration” from days to hours:

Separate values in “channels” column into multiple columns
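A sketch of that transformation, assuming the four channel values appearing in the data are web, email, mobile, and social:

```python
# One-hot encode the list-valued 'channels' column into one column per channel
for channel in ['web', 'email', 'mobile', 'social']:
    portfolio[channel] = portfolio['channels'].apply(lambda x: int(channel in x))
portfolio = portfolio.drop(columns='channels')

# Convert 'duration' from days to hours so it matches the transcript's time unit
portfolio['duration'] = portfolio['duration'] * 24
```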
Check initial profile data file

It seems there are lots of NaN values in the “profile” data file; let’s check how many:

Number of NaN values in profile data file
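Counting the missing values per column is a one-liner:

```python
# NaN count for each column in the profile data
print(profile.isnull().sum())
```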
Age distribution

Starbucks customers range from teenagers to senior citizens; most are between 30 and 78 years old.

Income range

Customers’ incomes range from under $30,000 to about $120,000; the majority fall between $50,000 and $75,000.
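The two plots above can be reproduced with a pair of histograms (a sketch; the bin count is arbitrary):

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Age distribution
profile['age'].hist(bins=20, ax=ax1)
ax1.set_title('Age distribution')
ax1.set_xlabel('Age')

# Income distribution (pandas drops NaN values before plotting)
profile['income'].hist(bins=20, ax=ax2)
ax2.set_title('Income range')
ax2.set_xlabel('Income ($)')

plt.tight_layout()
plt.show()
```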

Check initial transcript data file

Data Processing

After the initial exploration of the three data sets, I found lots of missing values and columns that hold multiple values, so I can now move on to processing the data.

First, I create a cleaned “profile” data file with the following changes:

  1. Rename “id” to “customer_id” for readability and for merging with the other files later on
  2. Extract “year”, “month”, and “day” from “became_member_on” and add them as separate columns
  3. Extract “m” and “f” from “gender” and add them as separate columns
  4. Divide “age” into groups and add each group as a separate column

Here is what the cleaned profile file looks like:
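A sketch of these four steps (the age bins and the new column names are my assumptions, not necessarily the exact ones used):

```python
profile_clean = profile.copy()

# 1. Rename 'id' so the profile can be merged with the transcript later
profile_clean = profile_clean.rename(columns={'id': 'customer_id'})

# 2. Parse the integer date (e.g., 20170425) and add year/month/day columns
dates = pd.to_datetime(profile_clean['became_member_on'].astype(str),
                       format='%Y%m%d')
profile_clean['year'] = dates.dt.year
profile_clean['month'] = dates.dt.month
profile_clean['day'] = dates.dt.day

# 3. Encode gender as separate binary columns
profile_clean['m'] = (profile_clean['gender'] == 'M').astype(int)
profile_clean['f'] = (profile_clean['gender'] == 'F').astype(int)

# 4. Bin age into groups and one-hot encode the groups (bins are assumptions)
age_group = pd.cut(profile_clean['age'],
                   bins=[0, 20, 40, 60, 80, 120],
                   labels=['lt20', '20_39', '40_59', '60_79', '80plus'])
profile_clean = profile_clean.join(pd.get_dummies(age_group, prefix='age'))
```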

Second, I create a cleaned portfolio file by extracting “bogo”, “discount”, “informational” from “offer_type” and adding them as separate columns:
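A sketch of that step; I also rename “id” to “offer_id” here for the later merge (my assumption):

```python
# One-hot encode 'offer_type' into bogo / discount / informational columns
portfolio_clean = portfolio.rename(columns={'id': 'offer_id'})
portfolio_clean = portfolio_clean.join(pd.get_dummies(portfolio_clean['offer_type']))
```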

Third, I create a cleaned transcript data from the original file:

  1. Rename “person” to “customer_id”
  2. Extract the multiple keys from the “value” column and add each key as a separate column
  3. Check the “event” column, save “transaction” events into one data frame, and save the other event types into another, as sketched below
Event values
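A sketch of the transcript clean-up (the raw “value” dicts mix key spellings such as “offer id” and “offer_id”, so the code guards for both):

```python
transcript_clean = transcript.rename(columns={'person': 'customer_id'})

# Expand the 'value' dict into separate columns (keys vary by event type)
values = pd.DataFrame(transcript_clean['value'].tolist())
transcript_clean = transcript_clean.drop(columns='value').join(values)

# Merge the two offer-id spellings into a single column if both are present
if 'offer id' in transcript_clean.columns and 'offer_id' in transcript_clean.columns:
    transcript_clean['offer_id'] = transcript_clean['offer_id'].fillna(
        transcript_clean['offer id'])
    transcript_clean = transcript_clean.drop(columns='offer id')

print(transcript_clean['event'].value_counts())

# Split purchases from offer events
transactions = transcript_clean[transcript_clean['event'] == 'transaction']
offers = transcript_clean[transcript_clean['event'] != 'transaction']
```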

I can now merge the data files into one data file:
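A sketch of the merge, joining the offer events to the offer metadata and the customer demographics on the renamed id columns:

```python
# Join offer events with offer metadata and customer demographics
merged = (offers
          .merge(portfolio_clean, on='offer_id', how='left')
          .merge(profile_clean, on='customer_id', how='left'))
```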

Metrics

As this is a classification problem, the metrics used in this project are:

  • Accuracy
  • Precision
  • Recall
  • F1-score

Accuracy is the most common classification metric: the fraction of all predictions that are correct. Precision measures how many of the predicted positives are truly positive, while recall measures how many of the actual positives the model identifies. The F1-score combines precision and recall into a single value (their harmonic mean) so the results can be compared across models.
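With scikit-learn, all four metrics are available directly (a sketch with dummy labels):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]  # dummy ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1]  # dummy model predictions

print('Accuracy :', accuracy_score(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))
print('Recall   :', recall_score(y_true, y_pred))
print('F1-score :', f1_score(y_true, y_pred))
```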

Data Modeling

With a merged data file, let’s build a model to predict a Starbucks customer’s response to an offer.
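The first step is the train/test split from the outline (a sketch; the feature list and the binary target column “offer_completed” are illustrative assumptions):

```python
from sklearn.model_selection import train_test_split

# Hypothetical feature selection from the merged data frame
features = ['difficulty', 'duration', 'income', 'm', 'f',
            'bogo', 'discount', 'informational']
X = merged[features].fillna(0)
y = merged['offer_completed']  # hypothetical target: did the customer respond?

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```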

Algorithms and Refinement

I choose the following supervised learning algorithms for the model:

  • Logistic Regression
  • Random Forest

For each model, I use RandomizedSearchCV with 10 iterations to keep the training time manageable.

To tune the model with Logistic Regression:
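A sketch (the parameter grid is my assumption):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

lr_params = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],  # inverse regularization strength
    'penalty': ['l1', 'l2'],
}
lr_search = RandomizedSearchCV(
    LogisticRegression(solver='liblinear', max_iter=1000),
    lr_params, n_iter=10, cv=3, scoring='f1', random_state=42)
lr_search.fit(X_train, y_train)
print(lr_search.best_params_)
```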

To tune the model with Random Forest:
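A sketch (again, the grid is my assumption):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rf_params = {
    'n_estimators': [50, 100, 200, 400],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
}
rf_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    rf_params, n_iter=10, cv=3, scoring='f1', random_state=42)
rf_search.fit(X_train, y_train)
print(rf_search.best_params_)
```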

Model Evaluation and Validation

Logistic Regression result:

Random Forest result:

Comparing the model performance of Logistic Regression and Random Forest:
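The comparison can be produced from the tuned models’ predictions on the held-out test set (a sketch reusing the search objects from above):

```python
from sklearn.metrics import accuracy_score, f1_score

for name, search in [('Logistic Regression', lr_search),
                     ('Random Forest', rf_search)]:
    y_pred = search.best_estimator_.predict(X_test)
    print(f'{name}: accuracy = {accuracy_score(y_test, y_pred):.3f}, '
          f'f1-score = {f1_score(y_test, y_pred):.3f}')
```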

Conclusion

In this project, I analyzed the data sets provided by Starbucks, and tried to build a model to predict a customer’s response to an offer.

I started by exploring the data sets to understand the provided data and check for missing values, then pre-processed each file to produce clean data. The pre-processing took a lot of effort because the data had missing values, mismatched data types, and multiple values in one column. Once the data were cleaned, I merged them into one data file for analysis and model building. I chose two supervised algorithms: Logistic Regression and Random Forest. For each model, I used RandomizedSearchCV with 10 iterations to optimize the training time. Comparing both accuracy and F1-score, Random Forest performed better than Logistic Regression.

Justification

Random Forest (RF) performs better than Logistic Regression (LR) in this project; the accuracy difference between RF and LR is about 0.04 (4%). RF has low bias and is flexible enough to learn highly nonlinear decision boundaries.

LR normally performs better when the number of noise variables is less than or equal to the number of explanatory variables.

Reflection

This project was a challenge, especially the pre-processing: separating multi-value columns into different columns and replacing values with numbers took a lot of my time. It was also interesting, because I learned a lot by doing it.

It has been a journey and I enjoyed every bit of it!

Improvement

With more data, or fewer missing values, we might be able to build a model that predicts which kind of offer should be sent to which group of customers (e.g., by age group or income range).
