1  Data

The examples in this book use an open-source dataset, available on GitHub: https://github.com/RodzanIskandar/PowerBI_dashboard_e-commerce_transaction/ETD_clean_data.csv

After a bit of EDA, let’s load the relevant columns and look at the first few rows:

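library(data.table)  # fread()
library(dplyr)       # select(), mutate(), filter()
library(stringr)     # str_replace()
library(lubridate)   # date()

df_transactions = fread('ETD_clean_data.csv') |>
    select(customer_id,
           province,
           date,
           stock_code,
           unit_price,
           quantity,
           sales) |> 
    # drop a trailing ".0" from the ID; the dot is escaped so that it
    # matches a literal "." rather than any character
    mutate(customer_id = str_replace(customer_id, '\\.0$', ''),
           date = date(date)) |> 
    # remove transactions that have no customer attached
    filter(customer_id != 'no customer id')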
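
The `customer_id` cleanup depends on a regular expression that should strip only a literal `.0` suffix. In a regex, an unescaped `.` matches any character, so whether the dot is escaped matters. A minimal sketch of the difference, using two hypothetical IDs that are not taken from the dataset:

```r
library(stringr)

# Hypothetical IDs: one with a trailing ".0", one already clean
ids <- c("16010.0", "16010")

# Unescaped dot: "." matches any character, so the clean ID loses its final "10"
unescaped <- str_replace(ids, ".0$", "")
unescaped
# [1] "16010" "160"

# Escaped dot: only a literal ".0" suffix is removed
escaped <- str_replace(ids, "\\.0$", "")
escaped
# [1] "16010" "16010"
```

The escaped pattern leaves IDs without the suffix untouched, which is the safer behaviour if the column mixes both forms.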

# `description` was not selected above, so there is nothing to drop here
df_transactions |>
  head() |> 
  print.data.frame()
  customer_id    province       date stock_code unit_price quantity  sales
1       16010 DKI Jakarta 2015-11-30    22811AP      44250        6 265500
2       16010 DKI Jakarta 2015-11-30    21713AP      31500        8 252000
3       16010 DKI Jakarta 2015-11-30    22927AP      89250        2 178500
4       16010 DKI Jakarta 2015-11-30    20802AP      24750        6 148500
5       16010 DKI Jakarta 2015-11-30    22052AP       6300       25 157500
6       16010 DKI Jakarta 2015-11-30    22705AP       6300       25 157500
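
In the rows printed above, `sales` looks like `unit_price * quantity`. A quick sanity check of that assumption could be sketched as follows; the six rows are retyped by hand here for illustration, and `df_head` and `n_mismatch` are hypothetical names:

```r
library(dplyr)

# The six rows printed above, retyped by hand (illustration only)
df_head <- data.frame(
  unit_price = c(44250, 31500, 89250, 24750, 6300, 6300),
  quantity   = c(6, 8, 2, 6, 25, 25),
  sales      = c(265500, 252000, 178500, 148500, 157500, 157500)
)

# Count the rows where sales differs from unit_price * quantity
n_mismatch <- df_head |>
  filter(sales != unit_price * quantity) |>
  nrow()

n_mismatch
# [1] 0
```

Running the same `filter()` on the full `df_transactions` would show whether the relationship holds beyond the first six rows; that has not been verified here.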

1.0.1 Quick EDA

Let’s plot total sales per month to get a feel for the overall trend.

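library(ggplot2)

df_transactions |> 
  mutate(date_year = year(date),
         date_month = month(date)) |> 
  # total sales per calendar month
  summarise(.by = c(date_year, date_month),
            sales_monthly = sum(sales)) |> 
  # sort chronologically so the running index follows calendar order
  arrange(date_year, date_month) |> 
  mutate(x_month = row_number()) |> 
  ggplot() +
  geom_line(aes(y = sales_monthly,
                x = x_month)) +
  theme_classic()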
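
The plot above uses a running month index on the x-axis. Because `summarise(.by = ...)` keeps groups in order of first appearance, that index only matches calendar order when the months are in chronological order. One possible alternative, sketched here on hypothetical toy data (`df_toy` and its values are made up), is to collapse each date to the first day of its month with `lubridate::floor_date()` and plot against a real date axis:

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Hypothetical toy transactions, deliberately out of date order
df_toy <- data.frame(
  date  = as.Date(c("2015-02-10", "2015-01-05", "2015-01-20", "2015-03-01")),
  sales = c(100, 200, 50, 75)
)

df_monthly <- df_toy |>
  # map every date to the first day of its month
  mutate(month_start = floor_date(date, "month")) |>
  summarise(.by = month_start,
            sales_monthly = sum(sales)) |>
  arrange(month_start)

# A date on the x-axis keeps the months in calendar order and labels them
ggplot(df_monthly) +
  geom_line(aes(x = month_start, y = sales_monthly)) +
  theme_classic()
```

With a date axis, months that have no transactions show up as a visible gap in spacing rather than being silently skipped by the index.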

Quiz

  • Which columns do not seem relevant?

  • What other EDA would you perform?

  • How would you segment the customers?