2 RFM
Recency, frequency, and monetary (RFM) values are a quick, easy, and useful segmentation method for labelling customers. Many variations exist in the literature, adding dimensions such as breadth, duration, engagement, and churn rate. To focus on the basics, we will narrow the scope down to recency and frequency.
2.1 Recency
Essentially, recency is ‘how many days ago, counting from today, the customer last shopped with the vendor’, typically stored as an integer. It’s a simple metric that shows when they last shopped, where lower is better.
Recency can quickly tell us which customers are fading out and becoming lost customers because they haven’t shopped in a while. One weakness of this metric is that it does not account for product purchase rhythm: you might buy an orange every day, but a phone only once a year or so. So even if the orange buyer has better recency, the phone buyer may bring in more revenue or profit overall.
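To make the rhythm problem concrete, here is a minimal sketch with made-up numbers; the recency_ratio adjustment is a hypothetical illustration, not part of this chapter’s pipeline:

```r
library(dplyr)

# Scale recency by each customer's typical gap between purchases,
# so 'overdue-ness' becomes comparable across purchase rhythms
df_toy = tibble(customer = c('orange_buyer', 'phone_buyer'),
                recency_days = c(3, 200),     # days since last purchase
                median_gap_days = c(1, 365))  # typical days between purchases

df_toy |>
  mutate(recency_ratio = recency_days / median_gap_days) |>
  arrange(recency_ratio) # the phone buyer is less overdue despite worse raw recency
```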
2.2 Frequency
Frequency is ‘how many times the customer has shopped with the vendor’, as an integer value. It’s another nice and simple metric to determine which customers bring regular activity to the business, where higher is better.
One weakness of this metric concerns new customers. Even if their frequency is the worst, they may have just discovered your business within the past week. Such a customer may become one of your champion customers in the future, but frequency alone cannot indicate that.
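One common workaround, sketched here with made-up data, is to rescale frequency by customer tenure; tenure_days is an assumed extra column, not something computed elsewhere in this chapter:

```r
library(dplyr)

# Frequency per month of tenure separates a week-old customer
# from a long-standing but inactive one
df_toy = tibble(customer = c('new_customer', 'long_inactive'),
                frequency = c(2, 2),       # same raw frequency
                tenure_days = c(7, 700))   # days since first purchase

df_toy |>
  mutate(purchases_per_month = frequency / (tenure_days / 30))
```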
2.3 Basic RFM coding
Let’s code recency and frequency as described above, starting with the raw values.
```r
library(dplyr)
library(lubridate) # for the %m-% month-arithmetic operator

# Start with the latest transaction date in the data
date_latest_transaction = df_transactions |>
  pull(date) |>
  max()

# Sanity check
if (date_latest_transaction > Sys.Date()) {
  cat('The data has future data points. Removing invalid data ... \n')
  df_transactions = df_transactions |>
    filter(date <= Sys.Date())
  date_latest_transaction = max(df_transactions$date) # recompute after removal
}

# First, filter on data for the past year, then compute RFM per customer
df_rfm = df_transactions |>
  filter(date >= date_latest_transaction %m-% months(12)) |>
  summarise(.by = customer_id,
            recency = as.integer(date_latest_transaction - max(date)),
            frequency = n(),
            monetary = sum(sales))

df_rfm |>
  head() |>
  print.data.frame()
```
```
  customer_id recency frequency  monetary
1       16010      10       140  21769200
2       17700      26       283 234886050
3       13675      43        63  16426200
4       17451      20       325 252618750
5       17404       6       103 288592950
6       16881     366        10   5635350
```
You can see from the above output that we now have our RFM values. What should we do with them?
2.4 Approaches to RFM segmentation
To start with the basics, let’s focus on recency and frequency to segment our customers. There are two ways to approach this:

- Manual (hard-coded) segmentation
- Algorithmic clustering segmentation
First, let’s go over the manual segmentation.
2.4.1 Manual segmentation
In order to segment customers manually, it helps to put each of the RFM dimensions into brackets. The most common approach is to use quantiles; let’s cut each dimension into three (terciles).
```r
# Cut the values into terciles; labels = FALSE makes cut() return the bucket
# index itself (1 = lowest third, 3 = highest third), so for recency a value
# of 1 means the customer shopped most recently
df_rfm_quantiles = df_rfm |>
  mutate(R_Q = cut(recency,
                   breaks = quantile(recency, probs = c(0, 1/3, 2/3, 1)),
                   labels = FALSE,
                   include.lowest = TRUE),
         F_Q = cut(frequency,
                   breaks = quantile(frequency, probs = c(0, 1/3, 2/3, 1)),
                   labels = FALSE,
                   include.lowest = TRUE)) |>
  select(customer_id, R_Q, F_Q)

df_rfm_quantiles |>
  head() |>
  print.data.frame()
```
```
  customer_id R_Q F_Q
1       16010   1   3
2       17700   2   3
3       13675   2   2
4       17451   1   3
5       17404   1   3
6       16881   3   1
```
Now we segment the customers according to the following rules, where ‘high’ recency means recently shopped (R_Q == 1, since small day counts land in the first bucket) and ‘high’ frequency means F_Q == 3:

| customer_type | recency | frequency |
|---|---|---|
| Frequent | high | high |
| Repeat | high | med |
| New | high | low |
| Lapsed | med | any |
| Lost | low | any |
```r
# Apply labels to specific quantile combos; case_when() checks conditions in
# order, so 'Repeat' only catches F_Q == 2 because 'Frequent' ran first
df_customer_rfm_type = df_rfm_quantiles |>
  mutate(customer_type = case_when(R_Q == 1 & F_Q == 3 ~ 'Frequent',
                                   R_Q == 1 & F_Q >= 2 ~ 'Repeat',
                                   R_Q == 1 & F_Q == 1 ~ 'New',
                                   R_Q == 2 ~ 'Lapsed',
                                   R_Q == 3 ~ 'Lost',
                                   T ~ NA)) |>
  select(customer_id, customer_type)

df_customer_rfm_type |>
  count(customer_type) |>
  print.data.frame()
```
```
  customer_type    n
1      Frequent  802
2        Lapsed 1433
3          Lost 1420
4           New  232
5        Repeat  402
```
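As an aside, dplyr’s ntile() is a more compact way to build this kind of bucket. It ranks customers into n equal-sized groups, handling ties slightly differently from quantile() plus cut(), so treat this as an alternative sketch rather than an exact drop-in:

```r
library(dplyr)

# Same idea with ntile(): rank customers into 3 equal-sized buckets;
# ntile() splits ties by row order, so bucket edges can differ from cut()
df_rfm_ntiles = df_rfm |>
  mutate(R_Q = ntile(recency, 3),      # 1 = most recent third
         F_Q = ntile(frequency, 3)) |> # 3 = most frequent third
  select(customer_id, R_Q, F_Q)
```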
Quiz

- Would there be a difference between `&&` and `&`, or between `>=` and `==`, in the `case_when()` statements? Why or why not? (Hint: try changing the order of the labels inside `case_when()`.)
- Looking at the RFM image, which other labels seem useful, and how would you code them?
- Before assigning values to quantiles, would you do any kind of preprocessing?
- What are the weaknesses of using quantiles as thresholds for the ranges? How can we determine the optimal number of quantiles?
2.4.2 Algorithmic clustering segmentation
In the last quiz, how did you answer the final question? One answer could be that quantiles can draw boundaries, or thresholds, at arbitrary points in the value ranges, because quantiles prioritise splitting the customers into equally sized groups, which is not the best priority in this case.
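A tiny synthetic demonstration of that failure mode (made-up data, not the chapter’s transactions):

```r
# Two clearly separated groups of values
set.seed(1)
x = c(rnorm(200, mean = 5), rnorm(100, mean = 50))

# Tercile thresholds land inside the first group, splitting it arbitrarily,
# while the obvious gap around 25 is never chosen as a boundary
quantile(x, probs = c(1/3, 2/3))
```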
So instead, we can delegate to a different algorithm that finds the optimal thresholds on our behalf, prioritising a more fitting metric. If not quantiles, then what kind of algorithm would it be? For this exercise, let’s focus on the K-means algorithm, which places cluster centres so as to minimise the total within-cluster sum of squared distances.
Preparing for clustering, AKA pre-processing
As a useful reference, here is a pre-processing cheatsheet for some algorithms: https://www.tmwr.org/pre-proc-table
According to the table linked above, the recommended pre-processing for `nearest_neighbor` is dummy encoding, zero-variance removal, imputation, and normalisation + transformation, with situational use of decorrelation. KNN is similar to K-means; for demonstration, let’s focus on normalisation (`step_range()`). Then, let’s brute force the optimal number of clusters with the `tidyclust` library.
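For intuition, `step_range()` rescales each column to a fixed interval, [0, 1] by default, via min-max scaling: (x - min) / (max - min). A minimal manual equivalent, shown as a sketch rather than as part of the pipeline below:

```r
library(dplyr)

# Manual min-max scaling to [0, 1]: what step_range() does per column
rescale_01 = function(x) (x - min(x)) / (max(x) - min(x))

df_rfm_scaled_manually = df_rfm |>
  mutate(across(c(recency, frequency, monetary), rescale_01))
```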
```r
library(tidymodels) # recipes, rsample, workflows, tune, dials
library(tidyclust)

# Set the algorithm specification to be tuned K-means
spec_k_means = k_means(num_clusters = tune())

# Set input data: normalise all predictors to [0, 1]
df_rfm_range = df_rfm |>
  recipe( ~ recency + frequency + monetary) |>
  step_range(all_predictors()) |>
  prep() |>
  bake(new_data = NULL)

# Set up cross validation
cv_rfm = df_rfm_range |>
  vfold_cv(v = 3)

# Formulation
rec_specification = df_rfm_range |>
  recipe( ~ recency + frequency + monetary)

# Workflow
wf_k_means = rec_specification |>
  workflow(spec_k_means)

# Vector of numbers of clusters to brute force (tune)
grid_clust_num = grid_regular(num_clusters(range = c(5, 9)),
                              levels = 4)

# Tune, collecting within-cluster and total sums of squares
resample_tuned = wf_k_means |>
  tune_cluster(resamples = cv_rfm,
               grid = grid_clust_num,
               control = control_grid(save_pred = TRUE, extract = identity),
               metrics = cluster_metric_set(sse_within_total, sse_total, sse_ratio))

df_resample_metrics = resample_tuned |> collect_metrics()
```
The sse_ratio metric is the within-cluster sum of squares divided by the total sum of squares, so lower means tighter clusters. Below, we look at the relative drop in this ratio each time the number of clusters grows, and pick the cluster count with the largest drop.

```r
# Calculate how much of an improvement it is to add another cluster
df_delta_percents = df_resample_metrics |>
  filter(.metric == "sse_ratio") |>
  select(num_clusters, mean) |>
  mutate(delta_percent = (mean - lag(mean)) / lag(mean)) |>
  filter(!is.na(delta_percent), delta_percent < 0)

# Ensure that only improvements from adding a cluster are included in the analysis
df_best_iterations = df_delta_percents |>
  slice_min(order_by = delta_percent,
            n = 3,
            with_ties = T) |>
  mutate(valid = num_clusters > lag(num_clusters)) |>
  filter(valid == T | is.na(valid)) |>
  select(-valid)

df_best_iterations |> print.data.frame()

number_of_clusters_best = df_best_iterations |>
  slice_head(n = 1) |>
  pull(num_clusters)

cat('Number of clusters chosen: ', number_of_clusters_best, '\n')
```
```
  num_clusters       mean delta_percent
1            9 0.02658353    -0.3706453
2            7 0.04223934    -0.1440936
Number of clusters chosen:  9
```
Number of clusters chosen can be different
The cross-validation folds and K-means initialisations are random, so a different run may choose a different number of clusters. What we see above is very similar to the elbow method, except that the marginal gain from each added cluster is made explicit rather than eyeballed from a plot.
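If you prefer the classic visual elbow, the same tuning metrics can be plotted directly. A sketch with ggplot2, assuming the df_resample_metrics object from above:

```r
library(ggplot2)
library(dplyr)

# Classic elbow plot: mean WSS/TSS ratio against the number of clusters
df_resample_metrics |>
  filter(.metric == 'sse_ratio') |>
  ggplot(aes(x = num_clusters, y = mean)) +
  geom_line() +
  geom_point() +
  labs(x = 'Number of clusters', y = 'Mean WSS / TSS ratio')
```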
Now that we have the optimal number of clusters, let’s use it with stats::kmeans() this time.
```r
# Fit the final K-means model with the chosen number of clusters
object_clusters = df_rfm_range |>
  kmeans(number_of_clusters_best, iter.max = 30)

# Attach customer_id and the assigned cluster back onto the scaled data
df_rfm_clustered_customers = df_rfm_range |>
  cbind((df_rfm |> select(customer_id))) |>
  mutate(cluster = as.factor(object_clusters$cluster))

# logging
cat('Number of customers for each cluster:\n')
df_rfm_clustered_customers |>
  count(cluster) |>
  arrange(-n) |>
  head() |>
  print.data.frame()
```
```
Number of customers for each cluster:
  cluster    n
1       6 1082
2       5  845
3       2  617
4       4  481
5       3  343
6       8  341
```
Your results may vary
K-means starts from random centroids, so cluster sizes and labels differ between runs. In the run narrated here, there were 1663 customers in the biggest cluster and 275 in the smallest.
Let’s see what that looks like in a 3D graph with plotly.

```r
library(plotly)

# 3D scatter of the scaled RFM values, coloured by cluster
plot_ly(color = df_rfm_clustered_customers$cluster,
        size = 3,
        alpha = 0.75,
        x = df_rfm_clustered_customers$recency,
        y = df_rfm_clustered_customers$frequency,
        z = df_rfm_clustered_customers$monetary)
```
Looks like the biggest cluster, cluster 4 in that run, holds the infrequent, long-inactive, low-revenue customers. Looking at transaction data in general, this is a common pattern.
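Reading clusters off a 3D plot can be hard. A quick profiling table is often easier; this sketch summarises each cluster’s size and average scaled RFM values using the objects built above:

```r
library(dplyr)

# Profile each cluster by its size and average scaled RFM values
df_rfm_clustered_customers |>
  summarise(.by = cluster,
            n = n(),
            across(c(recency, frequency, monetary), mean)) |>
  arrange(-n)
```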
Quiz

- What preprocessing techniques would you apply before running the clustering algorithm? Why would this improve the process?
- What alternative clustering algorithms would you consider applying? How would they further optimise the clusters?
- Does using K-means to segment the customers address the weaknesses you mentioned in the 2.4.1 quiz? How could K-means be applied differently? What other methods would be better?
- What is another functional addendum you would make to RFM?