Solved – Modeling time: Probability distribution over time

distributions, machine learning, probability, time series

I'm trying to model users' posting behavior over the course of a day. Say we have a set of users and the times at which they post tweets. For each user, I would like to estimate the likelihood that they post a new tweet at 9:00 am, based on their historical posting behavior.

I'm curious which distribution I could use here. In the literature I have seen people use a Gaussian, but I'm not sure that is suitable since it has a single peak. (A mixture model would be too complex for this task.)

Thus I'm wondering: is there any distribution over time that I could use?

The data I have is two months' worth of tweets. Each tweet contains a timestamp and the author's id. What I'm trying to model is the user's daily activity in terms of posting tweets. E.g., the data for a single user would look like [9:00, 9:12, 17:00, 17:01, 22:22, 22:37, 22:45, 22:47, 22:48…]. So this user posts more tweets at night (around 10:30 pm) but very rarely during work hours. I wish to model the probability P(user posts a tweet | time).

Would really appreciate the answers!

Best Answer

The time-stamp does not measure the magnitude of some variable; it marks points in time per se. And from what you write, you are interested in a binary variable: to tweet or not to tweet, call it $Y$.

One possible modelling approach could be the following (for each individual separately): First, you have to decide how to partition the day into time zones, meaning time slots (half-hours? Hours? Morning-Noon, etc.? It depends on the particulars of your case). Given this partition, your data will be grouped by time zone, for each day, as counts: "XX tweets during time zone 2", etc.

For each day separately, this will give you an empirical frequency distribution for the random variable $X=$"number of tweets per time zone". If you divide the frequency of each time zone by the total tweet count of the day, you will obtain an empirical relative frequency distribution, which can be considered an estimate of the distribution of $Y$ (to tweet or not to tweet) across time zones for this particular day.
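The binning and normalisation described above can be sketched as follows. This is only an illustration: the function name `daily_zone_frequencies`, the choice of 24 hourly time zones, and the sample timestamps are assumptions, not part of the original answer. Days on which the user posted nothing simply do not appear in the result.

```python
from collections import Counter, defaultdict
from datetime import datetime


def daily_zone_frequencies(timestamps, zones_per_day=24):
    """Return {date: [relative frequency per time zone]} for one user.

    zones_per_day is assumed to divide 1440 (minutes per day) evenly.
    """
    minutes_per_zone = 24 * 60 // zones_per_day
    counts = defaultdict(Counter)
    for ts in timestamps:
        zone = (ts.hour * 60 + ts.minute) // minutes_per_zone
        counts[ts.date()][zone] += 1

    freqs = {}
    for day, zone_counts in counts.items():
        total = sum(zone_counts.values())
        # Counter returns 0 for zones with no tweets on that day.
        freqs[day] = [zone_counts[z] / total for z in range(zones_per_day)]
    return freqs


# Hypothetical timestamps for a single user (made up for illustration).
tweets = [
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 1, 9, 12),
    datetime(2024, 1, 1, 17, 0),
    datetime(2024, 1, 1, 22, 22),
    datetime(2024, 1, 2, 22, 45),
]
freqs = daily_zone_frequencies(tweets)
```

Each value in `freqs` is one day's relative frequency vector, i.e. the share of that day's tweets falling in each time zone.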

Denote by $d_{it}$ the time zone $i$ of day $t$, $i=1,...,k$, and by $p_{it}$ the corresponding empirically estimated probability that the person tweets during this time zone. Now go across days and consider the $k$ probability series $p_{it}=$ "Tweet during time zone $i$ of day $t$", $t=1,...,60$, since you say you have data for 60 days.

From here on you can do various things. For example, check each time series for stability: does the pattern remain approximately the same across days? Here, how you have partitioned the day becomes crucial (the smaller the time interval represented by each time zone, the more instability is to be expected).
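One rough way to look at stability is to compute, for each time zone, the mean and standard deviation of its daily probability across days. The sketch below assumes the daily vectors are already available as a list of lists; the function name `zone_stability`, the toy data, and the 0.05 threshold are all hypothetical choices for illustration, not a formal stationarity test.

```python
from statistics import mean, pstdev


def zone_stability(series_by_day):
    """series_by_day: list of per-day probability vectors (one entry per time zone).

    Returns a list of (mean, standard deviation) across days, one pair per zone.
    """
    k = len(series_by_day[0])
    result = []
    for i in range(k):
        values = [day[i] for day in series_by_day]
        result.append((mean(values), pstdev(values)))
    return result


# Toy example: 3 coarse time zones observed over 4 days (made-up numbers).
days = [
    [0.2, 0.3, 0.5],
    [0.2, 0.1, 0.7],
    [0.2, 0.3, 0.5],
    [0.2, 0.1, 0.7],
]
stats = zone_stability(days)

# Flag zones whose day-to-day spread exceeds an arbitrary threshold.
threshold = 0.05
unstable = [i for i, (_, sd) in enumerate(stats) if sd > threshold]
```

Re-running this with a finer or coarser partition of the day shows how the choice of time zones affects the spread.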

If you expect to get more data as days pass, you can adopt a Bayesian updating approach: estimate a prior distribution from these first 60 days of data, and then gradually update the estimated distribution as new data come in. The new estimate will give you the probabilities of when the person will tweet the next day, for each time zone.
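The answer does not name a specific Bayesian model. One simple option, assumed here purely for illustration, is a Dirichlet prior over the $k$ time-zone probabilities with multinomial tweet counts, where updating amounts to adding each day's counts to the prior pseudo-counts. The function names, the uniform starting prior, and the counts are hypothetical.

```python
def update_dirichlet(alpha, counts):
    """Add one day's tweet counts per time zone to the Dirichlet pseudo-counts."""
    return [a + c for a, c in zip(alpha, counts)]


def posterior_mean(alpha):
    """Posterior mean probability that a given tweet falls in each time zone."""
    total = sum(alpha)
    return [a / total for a in alpha]


# Assumed uniform prior over 3 coarse time zones (one pseudo-count each).
alpha = [1.0, 1.0, 1.0]

# Made-up historical daily counts standing in for the first 60 days.
history = [[2, 0, 5], [1, 1, 6]]
for day_counts in history:
    alpha = update_dirichlet(alpha, day_counts)

probs = posterior_mean(alpha)  # estimate for the next day

# A new day arrives: update again and re-estimate.
alpha = update_dirichlet(alpha, [0, 0, 2])
probs_after = posterior_mean(alpha)
```

Under this assumed model, the posterior mean plays the same role as the relative frequencies in the answer, pooled over days and smoothed by the prior.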

You can also view all the time zones together as a Vector Autoregression (VAR) of $k-1$ equations (since the probabilities add up to unity), and do what one can do with VARs, i.e. model tomorrow's probability for each time zone as depending, usually linearly, on the corresponding probabilities of previous days (lag length to be determined by the data and the model specification process).
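As a minimal sketch of the VAR idea, the code below fits a VAR with a single lag by ordinary least squares using NumPy. The lag length of one, the function names `fit_var1` and `forecast_next`, and the simulated data are assumptions made for illustration; in practice the lag length would be chosen from the data. Note that a plain linear VAR does not force forecasts to stay within $[0, 1]$.

```python
import numpy as np


def fit_var1(P):
    """Fit P[t] = c + A @ P[t-1] + noise by ordinary least squares.

    P: array of shape (T, m); rows are days, columns are the probabilities
    of the first k-1 time zones. Returns (c, A).
    """
    P = np.asarray(P, dtype=float)
    X = np.hstack([np.ones((len(P) - 1, 1)), P[:-1]])  # intercept + lagged values
    Y = P[1:]
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)          # shape (m + 1, m)
    return B[0], B[1:].T


def forecast_next(P, c, A):
    """One-step-ahead forecast from the last observed day."""
    P = np.asarray(P, dtype=float)
    return c + A @ P[-1]


# Simulated 60 days for k = 3 time zones (only k - 1 = 2 series are modelled).
rng = np.random.default_rng(0)
true_c = np.array([0.1, 0.2])
true_A = np.array([[0.5, 0.1], [0.2, 0.3]])
rows = [np.array([0.6, 0.1])]
for _ in range(59):
    rows.append(true_c + true_A @ rows[-1] + rng.normal(0.0, 0.01, size=2))
P = np.array(rows)

c_hat, A_hat = fit_var1(P)
next_first = forecast_next(P, c_hat, A_hat)
# The remaining time zone follows from the adding-up constraint.
next_all = np.append(next_first, 1.0 - next_first.sum())
```

The last time zone is recovered from the constraint that the probabilities sum to one, which is why only $k-1$ equations are estimated.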
