Solved – How to use predicted features in prediction

Tags: algorithms, machine learning, predictive-models, time series

I have ratings data.

My ratings data contains some technical features like the Originator (channel), the exact day and hour of the broadcast, the duration of the program, etc., and, obviously, the label, which is the rating.
So the data looks like this:

+---------+------------+-----------------+----------------+----------------------------------+---------------+
| Program | Originator |      date       | Duration (min) | some other technical features…   | Actual rating |
+---------+------------+-----------------+----------------+----------------------------------+---------------+
| Empire  | FOX        | 24/5/2016 21:00 |             58 | …                                | 4.6%          |
| Gotham  | FOX        | 24/5/2016 21:58 |             32 | …                                | 3.1%          |
+---------+------------+-----------------+----------------+----------------------------------+---------------+

Based on the historical ratings data, I need to predict future ratings, where all the features used in training are given, except the label of course.

My problem is:

A very strong feature for rating prediction is the carry-over, i.e. the rating of the program that aired immediately before.

I want to train my model with the carry-over feature, but I'm not sure how I should add it.
Should I train the model with the real carry-over (the actual rating of the previous program)? In the test set the carry-over would only be an approximation of the real carry-over: it would be a prediction rather than the actual rating, because, unlike in the training data, I can't know in advance what the rating of the previous program will be. So the correlation of the carry-over with the real ratings would be weaker in the test set than in the training set.
How should I tackle this problem?

Best Answer

There seem to be (at least) three types of feature engineering to do here. The first one involves the carry-over (which is the focus of your question). The second one involves transformations of other data to forms more usable by ML algorithms. The third one involves competing programs.

I'll begin with the second one, as it's necessary for the first and third ones.


Leaving the carry-over aside for the moment, it looks like some of the features can be transformed for better use (you might have done this already, but it's not indicated in the question).

  • The date column - television viewing probably has strong daily seasonal components, probably weekly seasonal components, and possibly yearly seasonal components (see, for example, Prediction Of TV Ratings With Dynamic Models). It's a priori unlikely that a show airing at 3:30 AM before a workday will have the same rating as a show airing in the early evening of a weekend. It also might be the case that people watch differently in winter and summer, during vacations, and so on.

    Because of this, you might want to transform the date column into a number of features: the hour of the day, the day of the week, an indicator of whether it's a weekend, the month, possibly an indicator of the season, and possibly an indicator of a vacation period.

  • The Program column - Say your test data will be $n$ days into the future, and the particular program is already on the air. The ratings already known for this program are probably a useful indicator of its future rating. Consequently, you might want to add (at least) two columns: the past rating for this program from $n$ days ago or earlier (using an average, for example), and a column for the number of measurements used for that past rating. (If a show was not on the air $n$ days earlier, you could encode these as -1 and 0, respectively.) You could go further and analyze trends for the program, or ratings for similar shows, but perhaps you should start with this.

  • The Originator column - you might want something like one-hot encoding here.

(If you haven't done this already,) these transformations might increase the overall prediction accuracy and decrease the relative importance of the carry-over. Some of these features can also be used as proxies for the carry-over (a sketch of these transformations follows below).
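For concreteness, here is a minimal pandas sketch of these transformations. The column names follow the table above; the seven-day horizon, the helper names, and the assumption that the rating has already been parsed from strings like "4.6%" into plain floats are illustrative choices, not part of the original answer.

```python
import pandas as pd

def add_date_features(df):
    """Decompose the broadcast timestamp into seasonal features."""
    out = df.copy()
    ts = pd.to_datetime(out["date"], dayfirst=True)
    out["date"] = ts
    out["hour"] = ts.dt.hour
    out["day_of_week"] = ts.dt.dayofweek
    out["is_weekend"] = (ts.dt.dayofweek >= 5).astype(int)
    out["month"] = ts.dt.month
    return out

def add_past_program_rating(df, horizon_days=7):
    """For each row, the mean rating of the same program at least
    `horizon_days` earlier, plus the number of past broadcasts used
    (-1 and 0 when there is no such history)."""
    out = df.copy()
    past_mean, past_count = [], []
    for _, row in out.iterrows():
        cutoff = row["date"] - pd.Timedelta(days=horizon_days)
        hist = out[(out["Program"] == row["Program"]) & (out["date"] <= cutoff)]
        if len(hist):
            past_mean.append(hist["Actual rating"].mean())
            past_count.append(len(hist))
        else:
            past_mean.append(-1.0)
            past_count.append(0)
    out["program_past_rating"] = past_mean
    out["program_past_count"] = past_count
    return out

def add_originator_dummies(df):
    """One-hot encode the Originator (channel) column."""
    return pd.get_dummies(df, columns=["Originator"], prefix="channel")

# Tiny illustrative frame; ratings assumed already parsed to floats.
raw = pd.DataFrame({
    "Program": ["Empire", "Gotham"],
    "Originator": ["FOX", "FOX"],
    "date": ["24/5/2016 21:00", "24/5/2016 21:58"],
    "Duration (min)": [58, 32],
    "Actual rating": [4.6, 3.1],
})
features = add_originator_dummies(add_past_program_rating(add_date_features(raw)))
print(features)
```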


I want to train my model with the carry-over feature, but I'm not sure how I should add it. Should I train the model with the real carry-over (the actual rating of the previous program)? ... I can't know in advance what the rating of the previous program will be.

In general, it's best to avoid training on one thing and testing on something that isn't exactly the same. So, as your question implies, it's problematic to train using the actual carry-over and then predict using a predicted carry-over.

Instead of feeding the model a predicted rating for the immediately-preceding program, let's think about what information we would use to make that prediction, and encode that information directly.

  • The popularity of the previous program is possibly affected by its time of day, day of week, and so forth, but that's already "encoded" in the features for the current show, so there is no need to repeat it.

  • Similarly, the popularity of any of the immediately-preceding shows might be determined by their channel, but that adds nothing (we know that FOX will be airing something just before the show we're predicting, for example). This is already implicitly encoded in the other features.

  • The one thing that doesn't seem to be already encoded in the other columns is the past ratings of the shows immediately preceding this show. For example, when Game Of Thrones airs, it's probable that the show immediately following it will enjoy a large carry-over, but we know this because the past ratings of GoT were high. I think it's best to encode this data as a feature and let the predictor learn how to use it.

    One straightforward feature to add, therefore, would be the past rating of the most popular show airing immediately before this one (using the same two-column encoding as before), as sketched below.
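A minimal sketch of one reading of this feature: for each broadcast, copy over the past-rating columns of the broadcast that aired immediately before it on the same channel. It assumes the `program_past_rating` / `program_past_count` columns from the earlier sketch already exist; the function and column names are illustrative.

```python
import pandas as pd

def add_preceding_show_rating(df):
    """Attach, to each broadcast, the past-rating columns of the broadcast
    that aired immediately before it on the same channel (-1 / 0 when there
    is no preceding broadcast). Assumes `date` is already a datetime."""
    out = df.sort_values(["Originator", "date"]).copy()
    grouped = out.groupby("Originator")
    out["preceding_past_rating"] = grouped["program_past_rating"].shift(1).fillna(-1.0)
    out["preceding_past_count"] = grouped["program_past_count"].shift(1).fillna(0).astype(int)
    return out

# Tiny illustrative frame; values are made up.
df = pd.DataFrame({
    "Originator": ["FOX", "FOX", "HBO"],
    "date": pd.to_datetime(["2016-05-24 21:00", "2016-05-24 21:58", "2016-05-24 21:00"]),
    "program_past_rating": [4.4, 3.0, 8.1],
    "program_past_count": [12, 9, 20],
})
print(add_preceding_show_rating(df))
```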


Finally, you might want to add as a feature the rating of the most popular show competing with this one, again based on its past ratings.
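A rough sketch of such a competitor feature, under the assumption that "competing" means overlapping in broadcast time on a different channel; the overlap rule, the -1 fallback, and the column names are illustrative choices, and the past-rating column is the one built in the first sketch.

```python
import pandas as pd

def add_top_competitor_rating(df):
    """Attach, to each broadcast, the highest past rating among overlapping
    broadcasts on other channels (-1 when no overlapping competitor exists).
    Assumes `date` is a datetime and `Duration (min)` is numeric."""
    out = df.copy()
    start = out["date"]
    end = out["date"] + pd.to_timedelta(out["Duration (min)"], unit="m")
    top = []
    for i in out.index:
        # Two broadcasts overlap if each starts before the other ends.
        overlap = (out["Originator"] != out.loc[i, "Originator"]) & \
                  (start < end[i]) & (end > start[i])
        competitors = out.loc[overlap, "program_past_rating"]
        top.append(competitors.max() if len(competitors) else -1.0)
    out["top_competitor_past_rating"] = top
    return out

# Tiny illustrative frame; values are made up.
df = pd.DataFrame({
    "Originator": ["FOX", "HBO"],
    "date": pd.to_datetime(["2016-05-24 21:00", "2016-05-24 21:30"]),
    "Duration (min)": [58, 60],
    "program_past_rating": [4.4, 8.1],
})
print(add_top_competitor_rating(df))
```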