= '2022-11-01'
PREDICTION_START = '2023-12-31' PREDICTION_END
Modeling DAU, WAU, MAU with Markov chain (WIP)
Doubtlessly, DAU, WAU, and MAU are critical business metrics. An article “How Duolingo reignited user growth” by Jorge Mazal, former CPO of Duolingo is #1 the Growth section of Lenny’s Newsletter blog. In this article Jorge payed special attention to the methodology Duolingo used to model DAU metrics (see another article “Meaningful metrics: how data sharpened the focus of product teams” by Erin Gustafson). This methodology has multiple strenghts but I focus here on how one can use it for DAU forecasting.
New year is coming soon, so many companies are planning their budgets for the next year these days. Cost estimations are often require DAU forecast. In this article I’ll show you how you can get this prediction using the Duolingo’s growth model and share a DAU & MAU “calculator” designed as a Google Spreadsheet calculator.
Methodology
A quick recap on how the Duolingo’s growth model works. At day \(d\) (\(d=1,2,\ldots,\)) of a user’s lifetime the user can be in one of the following 7 (mutually-exclusive) states:
state | d = 1 | is active today | was active in [d-6, d-1] | was active in [d-29, d-7] | was active before d-30 |
---|---|---|---|---|---|
new | ✅ | ❓ | ❌ | ❌ | ❌ |
current | ❌ | ✅ | ✅ | ❓ | ❓ |
reactivated | ❌ | ✅ | ❌ | ✅ | ❓ |
resurrected | ❌ | ✅ | ❌ | ❌ | ✅ |
at_risk_wau | ❌ | ❌ | ✅ | ❓ | ❓ |
at_risk_mau | ❌ | ❌ | ❌ | ✅ | ❓ |
dormant | ❌ | ❌ | ❌ | ❌ | ✅ |
Having these states defined (as set \(S\)), we can consider a user’s lifetime trajectory as a Markov chain. Let \(M\) be a transition matrix associated with this Markov chain: \(m_{i, j} = P(s_j | s_i)\) are the probabilities that a user moves to state \(s_j\) right after being at state \(s_i\), \(s_i, s_j \in S\). The matrix values are easily fetched from the historical data.
The beauty and simplicity of this approach is that matrix \(M\) fully describes states of the all users in the future. Suppose that vector \(u_0\) of length 7 contains the counts of users being in certain states at some calendar day denoted as 0. Thus, according to the Markov model, in the next day \(u_1\) we expect to have the following amount of users:
\[ \underbrace{ \begin{pmatrix} \#New_1 \\ \#Current_1 \\ \#Reactivated_1 \\ \#Resurrected_1 \\ \#AtRiskWau_1 \\ \#AtRiskMau_1 \\ \#Dormant_1 \end{pmatrix} }_{u_1} = M^T \cdot \underbrace{ \begin{pmatrix} \#New_0 \\ \#Current_0 \\ \#Reactivated_0 \\ \#Resurrected_0 \\ \#AtRiskWau_0 \\ \#AtRiskMau_0 \\ \#Dormant_0 \end{pmatrix} }_{u_0} \]
Applying this formula recursevely, we derive the amount of the users at any arbitrary day \(t > 0\) in the future. The only thing we need to provide despite of the initial distribution \(u_0\) is to the amount of new users that would appear in the product each day in the future. We’ll get it by using historical data on new users appeared in the past and appyling the prophet library.
Now, having \(u_t\) calculated, we can calculate DAU values at day t: \[DAU_t = \#New_t + \#Current_t + \#Reactivated_t +\#Resurrected_t.\]
Additionally, we can easily calculate WAU and MAU metrics: \[WAU_t = DAU_t +\#AtRiskWau_t,\] \[MAU_t = DAU_t +\#AtRiskWau_t + \#AtRiskMau_t.\]
Finally, the algorithm looks like this:
- Calculate initial counts \(u_0\) corresponding to the day right before prediction.
- For each prediction day \(t=1, ..., T\) calculate the expected amount of new users \(\#New_1, \ldots, \#New_T\).
- Calculate recursively \(u_{t+1} = M^T u_t\).
- Calculate DAU, WAU, MAU for each prediction day \(t=1, ..., T\).
Implementation
Suppose that today is 2022-10-31 and we want to predict the DAU metric for the next year. We define PREDICTION_START
and PREDICTION_END
constants to set this prediction range.
We use a simulated dataset which is based on historical data of some SAAS app. The data is stored in the dau_data.csv
file and contains three columns user_id
, date
, registration_date
. Each record indicates a day when a user was active.
The data includes the activity indicators for all users spotted in the app from 2020-11-01 to 2022-10-31. An additional month 2020-10 is included in order to calculate user states correctly (AtRiskMau
and dormant
states requires data 1 month behind).
import pandas as pd
= pd.read_csv('dau_data.csv.gz', compression='gzip')
df print(f'Shape: {df.shape}')
print(f'Total users: {df['user_id'].nunique()}')
print(f'Data range: [{df['date'].min()}, {df['date'].max()}]')
df.head()
Shape: (447048, 3)
Total users: 38093
Data range: [2020-10-01, 2022-10-31]
user_id | date | registration_date | |
---|---|---|---|
0 | 7a010840-b4d1-543d-bd4c-fbb4ae2198c5 | 2020-10-01 | 2020-08-26 |
1 | d565a211-1996-538b-b067-11a38616c8cf | 2020-10-01 | 2020-10-01 |
2 | 41fb29c7-8122-59db-a690-e9d9502f3c38 | 2020-10-01 | 2020-09-02 |
3 | 310c15a2-fe92-5703-be1f-80270c1b83bd | 2020-10-01 | 2020-08-27 |
4 | 234c3035-145f-5694-bd7f-bec58a5f0c5c | 2020-10-01 | 2020-05-31 |
In practice, the most calculations are reasonable to run as SQL queries to a database where the data is stored. To simulate this, I’ll use the duckdb
library.
import duckdb
Predicting new users amount
Let’s start from the new users prediction. One of the easiest ways to do this is to use the prophet library. It simply requires a historical time-series and extends it in the future. The new_users
Series contains such a historical data. A condition that indicates that a new user appeared is df['date'] == df['registration_date']
.
= df[df['date'] == df['registration_date']]\
new_users =pd.to_datetime(df['date']))\
.assign(date'date').size()
.groupby(
new_users.head()
date
2020-10-01 4
2020-10-02 3
2020-10-03 3
2020-10-04 4
2020-10-05 8
dtype: int64
prohet
requires a time-series as a DataFrame containing two columns, so we reformat new_users
Series to new_users_prophet
DataFrame. Another thing we need to prepare is to create the future
variable containing certain days for prediction: from PREDICTION_START
to PREDICTION_END
. The plot illustrates predictions for both past and future dates.
from prophet import Prophet
= Prophet()
m = pd.DataFrame({'ds': new_users.index, 'y': new_users.values})
new_users_prophet
m.fit(new_users_prophet)
= len(pd.date_range(PREDICTION_START, PREDICTION_END))
periods = m.make_future_dataframe(periods=periods)
future ; m.plot(m.predict(future))
21:53:31 - cmdstanpy - INFO - Chain [1] start processing
21:53:31 - cmdstanpy - INFO - Chain [1] done processing
Now, let’s create a Series new_users_pred
where we’ll keep predictions for the new users amount.
= m.predict(future)\
new_users_pred =lambda _df: _df['yhat'].astype(int))\
.assign(yhat={'ds': 'date', 'yhat': 'count'})\
.rename(columnslambda _df: _df['date'] >= PREDICTION_START]\
.loc['date')\
.set_index('count']
[
new_users_pred.head()
date
2022-11-01 49
2022-11-02 47
2022-11-03 43
2022-11-04 39
2022-11-05 41
Name: count, dtype: int64
Getting the states
So we want assign one of 7 states for each day of a user’s lifetime (meaning lifetime within the app). According to the definition, for each day we need to consider at least 30 days in the past. This is when SQL window functions come in. However, since df
data contains only active day indicators, we need to explicitly extend it with the days when a user was not active.
For readability purposes we split the next SQL query into multiple subqueries.
dau
. Simply transforms dates from string to date data type.full_range
. Creates a full sequence of dates for each user.
= '2021-11-01'
DATASET_START = '2022-10-31'
DATASET_END = '2021-10-01'
OBSERVATION_START
= f"""
query WITH
dau AS (
SELECT
user_id,
date::date AS date,
registration_date::date AS registration_date
FROM df
),
full_range AS (
SELECT
user_id, UNNEST(generate_series(greatest(registration_date, '{OBSERVATION_START}'), date '{DATASET_END}', INTERVAL 1 DAY))::date AS date
FROM (
SELECT DISTINCT user_id, registration_date FROM dau
)
),
dau_full AS (
SELECT
fr.user_id,
fr.date,
dau.date IS NOT NULL AS is_active,
registration_date
FROM full_range AS fr
LEFT JOIN dau USING(user_id, date)
),
states AS (
SELECT
user_id,
date,
is_active,
first_value(registration_date IGNORE NULLS) OVER (PARTITION BY user_id ORDER BY date) AS registration_date,
SUM(is_active::int) OVER (PARTITION BY user_id ORDER BY date ROWS BETWEEN 6 PRECEDING and 1 PRECEDING) AS active_days_back_6d,
SUM(is_active::int) OVER (PARTITION BY user_id ORDER BY date ROWS BETWEEN 29 PRECEDING and 1 PRECEDING) AS active_days_back_29d,
CASE
WHEN date = registration_date THEN 'new'
WHEN is_active = TRUE AND active_days_back_6d BETWEEN 1 and 6 THEN 'current'
WHEN is_active = TRUE AND active_days_back_6d = 0 AND IFNULL(active_days_back_29d, 0) > 0 THEN 'reactivated'
WHEN is_active = TRUE AND active_days_back_6d = 0 AND IFNULL(active_days_back_29d, 0) = 0 THEN 'resurrected'
WHEN is_active = FALSE AND active_days_back_6d > 0 THEN 'at_risk_wau'
WHEN is_active = FALSE AND active_days_back_6d = 0 AND ifnull(active_days_back_29d, 0) > 0 THEN 'at_risk_mau'
ELSE 'dormant'
END AS state
FROM dau_full
)
SELECT user_id, date, state FROM states
WHERE date BETWEEN '{DATASET_START}' AND '{DATASET_END}'
ORDER BY user_id, date
"""
= duckdb.sql(query).df() states
= f"""
query SELECT
date,
state_from,
state_to,
COUNT(*) AS cnt,
FROM (
SELECT
date,
state AS state_to,
lag(state) OVER (PARTITION BY user_id ORDER BY date) AS state_from
FROM states
)
WHERE state_from IS NOT NULL
GROUP BY date, state_from, state_to
ORDER BY date, state_from, state_to;
"""
= duckdb.sql(query).df() transitions
= '2022-11-01'
PREDICTION_START = '2023-12-31'
PREDICTION_END
= ['new', 'current', 'reactivated', 'resurrected', 'at_risk_wau', 'at_risk_mau', 'dormant']
states_order
def get_transition_matrix(transitions, date1, date2):
= transitions\
probs lambda _df: _df['date'].between(date1, date2)]\
.loc['state_from', 'state_to'], as_index=False)\
.groupby(['cnt'].sum()\
[
.assign(=lambda _df: _df.groupby('state_from')['cnt'].transform('sum'),
supp=lambda _df: _df['cnt'] / _df['supp']
prob
)
= probs.pivot(index='state_from', columns='state_to', values='prob')\
M =0)\
.reindex(states_order, axis=1)\
.reindex(states_order, axis0)\
.fillna(float)
.astype(
return M
Predicting DAU
def predict_dau(M, state0, date1, date2, new_users):
"""
Predicts DAU over a given date range.
Parameters
----------
M : pandas.DataFrame
Transition matrix representing user state changes.
state0 : pandas.Series
counts of initial state of users.
date1 : str
Start date of the prediction period in 'YYYY-MM-DD' format.
date2 : str
End date of the prediction period in 'YYYY-MM-DD' format.
new_users : int or pandas.Series
The expected amount of new users for each day between date1 and date2.
If a Series, it should have dates as the index.
If an int, the same number is used for each day.
Returns
-------
pandas.DataFrame
DataFrame containing the predicted DAU, WAU, and MAU for each day in the date range,
with columns for different user states and tot.
"""
= pd.date_range(date1, date2)
dates = 'date'
dates.name = []
dau_pred = state0.copy()
new_dau for date in dates:
= (M.transpose() @ new_dau).astype(int)
new_dau if isinstance(new_users, int):
= new_users
new_users_today else:
= new_users.astype(int).loc[date]
new_users_today 'new'] = new_users_today
new_dau.loc[
dau_pred.append(new_dau.tolist())
= pd.DataFrame(dau_pred, index=dates, columns=states_order)
dau_pred 'dau'] = dau_pred['new'] + dau_pred['current'] + dau_pred['reactivated'] + dau_pred['resurrected']
dau_pred['wau'] = dau_pred['dau'] + dau_pred['at_risk_wau']
dau_pred['mau'] = dau_pred['dau'] + dau_pred['at_risk_wau'] + dau_pred['at_risk_mau']
dau_pred[
return dau_pred
def get_state0(date):
= f"""
query SELECT state, count(*) AS cnt
FROM states
WHERE date = '{date}'
GROUP BY state
"""
= duckdb.sql(query).df()
state0 = state0.set_index('state').reindex(states_order)['cnt']
state0
return state0
= get_transition_matrix(transitions, '2022-10-25', '2022-10-31')
M = get_state0(DATASET_END)
state0 = predict_dau(M, state0, PREDICTION_START, PREDICTION_END, new_users_pred)
dau_pred dau_pred
new | current | reactivated | resurrected | at_risk_wau | at_risk_mau | dormant | dau | wau | mau | |
---|---|---|---|---|---|---|---|---|---|---|
2022-11-01 | 49 | 444 | 13 | 13 | 451 | 1269 | 16795 | 519 | 970 | 2239 |
2022-11-02 | 47 | 446 | 13 | 13 | 462 | 1267 | 16830 | 519 | 981 | 2248 |
2022-11-03 | 43 | 448 | 13 | 13 | 470 | 1267 | 16865 | 517 | 987 | 2254 |
2022-11-04 | 39 | 448 | 13 | 13 | 474 | 1268 | 16900 | 513 | 987 | 2255 |
2022-11-05 | 41 | 447 | 13 | 13 | 475 | 1269 | 16935 | 514 | 989 | 2258 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2023-12-27 | 135 | 1172 | 34 | 36 | 1266 | 3247 | 45910 | 1377 | 2643 | 5890 |
2023-12-28 | 134 | 1187 | 34 | 36 | 1282 | 3258 | 45999 | 1391 | 2673 | 5931 |
2023-12-29 | 132 | 1201 | 34 | 36 | 1297 | 3271 | 46088 | 1403 | 2700 | 5971 |
2023-12-30 | 137 | 1213 | 34 | 36 | 1309 | 3285 | 46178 | 1420 | 2729 | 6014 |
2023-12-31 | 151 | 1227 | 34 | 36 | 1323 | 3300 | 46268 | 1448 | 2771 | 6071 |
426 rows × 10 columns
= """
query SELECT date, COUNT(*) AS dau
FROM states
WHERE state IN ('new', 'current', 'reactivated', 'resurrected')
GROUP BY date
ORDER BY date
"""
= duckdb.sql(query).df()\
dau_true 'date')
.set_index('dau']]).plot() pd.concat([dau_true, dau_pred[
Prediction testing
= get_state0('2022-06-30')
state0 = get_transition_matrix(transitions, '2022-06-23', '2022-06-30')
M = predict_dau(M, state0, '2022-07-01', '2022-10-31', new_users)
dau_pred
\
dau_true'dau']], how='inner', lsuffix='_true', rsuffix='_pred')\
.join(dau_pred[[ .plot()