2  RFM

Recency, frequency, and monetary value (RFM) form a quick, easy, and useful segmentation method for labelling customers. There are many variations in the literature that add dimensions such as breadth, duration, engagement, or churn rate. To focus on the basics, the scope here will be narrowed down to recency and frequency.

Figure: an example RFM segmentation with customer labels

2.1 Recency

Essentially, recency is ‘how many days ago the customer last shopped with the vendor’, measured against a reference date (today, or the latest date in the data), typically as an integer value. It’s a simple metric showing when they last shopped, where lower is better.
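
As a minimal sketch, with a single hypothetical last purchase date and today as the reference:

# Hypothetical last purchase date; recency measured against today
date_last_purchase = as.Date('2024-05-01')
recency = as.integer(Sys.Date() - date_last_purchase)
recency # days since the last purchase; lower is better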

Recency quickly tells us which customers are fading out and becoming lost customers because they haven’t shopped in a while. One weakness of this metric is that it does not account for purchase rhythm: you might buy an orange every day, but a phone only once a year or so. So even if the orange buyer has better recency, the phone buyer may bring in more revenue or profit overall.

2.2 Frequency

Frequency is ‘how many times the customer has shopped with the vendor’, again as an integer value. It’s another nice and simple metric for determining which customers bring regular activity to the business, where higher is better.
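
As a toy sketch with hypothetical transactions, one row per purchase:

# Hypothetical transactions: customer A shopped three times, B once
purchases = data.frame(customer_id = c('A', 'A', 'A', 'B'))
purchases |> count(customer_id, name = 'frequency')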

One weakness of this metric concerns new customers. Even if their frequency is the lowest, they may have discovered your business only within the past week. Such a customer may become one of your champion customers in the future, but frequency alone is no indicator of that.

2.3 Basic RFM coding

Let’s code recency and frequency as described above (plus monetary, which we’ll use later), starting with the raw values.

library(dplyr)     # data verbs: pull(), filter(), summarise(), etc.
library(lubridate) # %m-% month arithmetic

# Start with the latest transaction date in the data
date_latest_transaction = df_transactions |> 
  pull(date) |> 
  max()
  
# Sanity check
if (date_latest_transaction > Sys.Date()) {
  
  cat('The data has future data points. Removing invalid data ... \n')
  
  df_transactions = df_transactions |> 
    filter(date <= Sys.Date())
  
}

# First, filter to data from the past year
df_rfm = df_transactions |>
  filter(date >= date_latest_transaction %m-% months(12)) |> 
  summarise(.by = customer_id,
            recency = as.integer(date_latest_transaction - max(date)),
            frequency = n(), # n() takes no arguments; it counts rows per group
            monetary = sum(sales))

df_rfm |> 
  head() |> 
  print.data.frame()
  customer_id recency frequency  monetary
1       16010      10       140  21769200
2       17700      26       283 234886050
3       13675      43        63  16426200
4       17451      20       325 252618750
5       17404       6       103 288592950
6       16881     366        10   5635350

You can see from the above output that we now have our RFM values. What should we do with them?

2.4 Approaches to RFM segmentation

To start with the basics, let’s focus on recency and frequency to segment our customers. There are two ways to approach this:

  1. Manual (hard-coded) segmentation

  2. Algorithmic clustering segmentation

First, let’s go over the manual segmentation.

2.4.1 Manual segmentation

In order to segment customers manually, it helps to put each of the RFM dimensions into brackets. The most common approach is to use quantile breaks. Let’s cut each dimension into three brackets (terciles).

# Cut the values into tercile brackets
df_rfm_quantiles = df_rfm |> 
  mutate(R_Q = cut(recency,
                   breaks = quantile(recency,
                                     probs = c(0, 1/3, 2/3, 1)),
                   labels = FALSE, # integer bin codes: 1 = most recent
                   include.lowest = TRUE),

         F_Q = cut(frequency,
                   breaks = quantile(frequency,
                                     probs = c(0, 1/3, 2/3, 1)),
                   labels = FALSE, # integer bin codes: 3 = most frequent
                   include.lowest = TRUE)) |> 
  select(customer_id, R_Q, F_Q)

df_rfm_quantiles |> 
  head() |> 
  print.data.frame()
  customer_id R_Q F_Q
1       16010   1   3
2       17700   2   3
3       13675   2   2
4       17451   1   3
5       17404   1   3
6       16881   3   1

Now we segment the customers according to the following rules, where ‘high’ recency means recently active (R_Q = 1):

customer_type  recency  frequency
Frequent       high     high
Repeat         high     med
New            high     low
Lapsed         med      any
Lost           low      any
# Apply labels to specific bracket combos
df_customer_rfm_type = df_rfm_quantiles |> 
    mutate(customer_type = case_when(R_Q == 1 & F_Q == 3 ~ 'Frequent',
                                     R_Q == 1 & F_Q == 1 ~ 'New',
                                     R_Q == 1 & F_Q >= 2 ~ 'Repeat',
                                     R_Q == 2 ~ 'Lapsed',
                                     R_Q == 3 ~ 'Lost',
                                     T ~ NA_character_)) |> 
    select(customer_id, customer_type)

df_customer_rfm_type |>
    count(customer_type) |> 
    print.data.frame()
  customer_type    n
1      Frequent  802
2        Lapsed 1433
3          Lost 1420
4           New  232
5        Repeat  402
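
To use these labels downstream, you can join them back onto the RFM table, a small usage sketch:

# Attach the segment label to each customer's raw RFM values
df_rfm_labelled = df_rfm |> 
  left_join(df_customer_rfm_type, by = 'customer_id')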

Quiz

  • Would there be a difference between && and &, or between >= and ==, in the case_when() statements? Why or why not? (hint: try changing the order of the conditions inside case_when())

  • Looking at the RFM image, which other labels seem useful, and how would you code them?

  • Before assigning values to quantiles, would you do any kind of preprocessing?

  • What are the weaknesses of using quantiles as thresholds for the ranges? How can we determine the optimal number of quantile brackets?

2.4.2 Algorithmic clustering segmentation

In the last quiz, how did you answer the final question? One possible answer is that quantiles can draw boundaries at arbitrary points in the value ranges, because quantiles prioritise putting an equal number of customers into each bracket regardless of any natural groupings in the values. That is not the best priority in this case.
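
A quick sketch with hypothetical frequency values shows the problem:

# Hypothetical frequencies: a dense low range plus a few heavy shoppers
freq = c(rep(1:5, each = 20), 80, 90, 100)
quantile(freq, probs = c(0, 1/3, 2/3, 1))
# The tercile breaks land inside the dense region, so the heavy
# shoppers are lumped in with moderately frequent customers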

So instead, we can delegate finding the thresholds to an algorithm that prioritises a more fitting metric. If not quantiles, then what kind of algorithm would it be? For this exercise, let’s focus on the K-means algorithm.

Preparing for clustering, AKA pre-processing

As a useful reference, here is a pre-processing cheatsheet for some algorithms: https://www.tmwr.org/pre-proc-table

According to the table linked above, the recommended pre-processing for nearest_neighbor is dummy encoding, zero-variance removal, imputation, and normalisation + transformation, with situational use of decorrelation. KNN is a different algorithm from K-means, but it is the closest entry in the table, and both rely on distances between points. For demonstration, let’s focus on normalisation (step_range()). Then, let’s brute-force the optimal number of clusters with the tidyclust library.
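
To see why normalisation matters for a distance-based method, compare the raw spread of the three dimensions; without rescaling, Euclidean distances would be dominated by monetary. A quick check, assuming df_rfm from earlier:

# Standard deviations differ by orders of magnitude across dimensions
df_rfm |> 
  summarise(across(c(recency, frequency, monetary), sd))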

library(tidymodels) # recipes, rsample, workflows, dials, tune
library(tidyclust)  # k_means(), tune_cluster(), cluster metrics

# Specify K-means, leaving the number of clusters to be tuned
spec_k_means = k_means(num_clusters = tune())

# Normalise the input data to the [0, 1] range
df_rfm_range = df_rfm |> 
  recipe( ~ recency + frequency + monetary) |>
  step_range(all_predictors()) |> 
  prep() |> 
  bake(new_data = NULL)

# Set up cross validation
cv_rfm = df_rfm_range |> 
    vfold_cv(v = 3)

# Recipe holding the clustering formula
rec_specification = df_rfm_range |> 
  recipe( ~ recency + frequency + monetary)

# Workflow
wf_k_means = rec_specification |> 
  workflow(spec_k_means)

# Grid of cluster counts to brute force (tune)
grid_clust_num = grid_regular(num_clusters(range = c(5, 9)),
                              levels = 4)

# Tune
resample_tuned = wf_k_means |> 
    tune_cluster(resamples = cv_rfm,
                 grid = grid_clust_num,
                 control = control_grid(save_pred = TRUE, extract = identity),
                 metrics = cluster_metric_set(sse_within_total, sse_total, sse_ratio))

df_resample_metrics = resample_tuned |> collect_metrics()

# Calculate how much of an improvement it is to add another cluster
df_delta_percents = df_resample_metrics |> 
  filter(.metric == "sse_ratio") |> 
  select(num_clusters, mean) |> 
  mutate(delta_percent = (mean - lag(mean)) / lag(mean)) |> 
  filter(!is.na(delta_percent), delta_percent < 0)

# Ensure that only improvements from adding a cluster are included in the analysis
df_best_iterations = df_delta_percents |> 
  slice_min(order_by = delta_percent,
            n = 3,
            with_ties = T) |> 
  mutate(valid = num_clusters > lag(num_clusters)) |> 
  filter(valid == T | is.na(valid)) |> 
  select(-valid)

df_best_iterations |> print.data.frame()

number_of_clusters_best = df_best_iterations |> 
  slice_head(n = 1) |> 
  pull(num_clusters)

cat('Number of clusters chosen: ', number_of_clusters_best, '\n')
  num_clusters       mean delta_percent
1            9 0.02658353    -0.3706453
2            7 0.04223934    -0.1440936

Number of clusters chosen:  9

The number of clusters chosen can differ between runs

What we see above is very similar to the elbow method; the difference is that the marginal gain from each additional cluster is made explicit rather than eyeballed from a curve.
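
To make the arithmetic concrete, the delta_percent reported for 9 clusters is the relative drop in sse_ratio from the previous grid value (7 clusters in the run above):

# delta_percent for 9 clusters, relative to the 7-cluster sse_ratio
(0.02658353 - 0.04223934) / 0.04223934 # ≈ -0.3706, matching the table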

Now that we have the optimal number of clusters, let’s fit with stats::kmeans() this time.

# kmeans() initialises randomly; call set.seed() first for reproducibility
object_clusters = df_rfm_range |> 
  kmeans(centers = number_of_clusters_best, iter.max = 30)

df_rfm_clustered_customers = df_rfm_range |>
  cbind((df_rfm |> select(customer_id))) |>
  mutate(cluster = as.factor(object_clusters$cluster))
  
# logging
cat('Number of customers for each cluster:\n')
df_rfm_clustered_customers |> 
  count(cluster) |> 
  arrange(-n) |>
  head() |> 
  print.data.frame()
Number of customers for each cluster:
  cluster    n
1       6 1082
2       5  845
3       2  617
4       4  481
5       3  343
6       8  341

Your results may vary

In the run shown above, the biggest cluster holds 1082 customers. Because K-means starts from random centroids, both the counts and the cluster numbering will vary between runs.

Let’s see what that looks like in a 3D graph with plotly.

library(plotly)

# 3D scatter of the normalised RFM values, coloured by cluster
plot_ly(df_rfm_clustered_customers,
        x = ~recency,
        y = ~frequency,
        z = ~monetary,
        color = ~cluster,
        type = 'scatter3d',
        mode = 'markers',
        marker = list(size = 3),
        opacity = 0.75)

Looks like the biggest cluster (cluster 6 in this run) holds the infrequent, long-inactive, low-revenue customers. This is a common pattern in retail transaction data generally.
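
Rather than rotating the plot, one way to sanity-check this reading is to profile each cluster’s average (normalised) RFM values, a small sketch using the objects built above:

# Average normalised RFM per cluster, biggest clusters first
df_rfm_clustered_customers |> 
  summarise(.by = cluster,
            n = n(),
            across(c(recency, frequency, monetary), mean)) |> 
  arrange(-n)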

Quiz

  • What preprocessing techniques would you apply before applying the clustering algorithm? Why would this improve the process?

  • What alternative clustering algorithms would you consider applying? How would this further optimise the clusters?

  • Does using K-means to segment the customers address the weaknesses you mentioned in the 2.4.1 quiz? How else could K-means be applied? What other methods would be better?

  • What is another functional addendum you would make to RFM?