Association Rule Mining in R

Association rule mining is the process of determining conditional probabilities within events that contain items or characteristics. Events can range from tweets, to grocery store receipts, to credit card applications.

Items within these events should also not be unique to each event. For example, words are repeated across tweets, multiple customers will buy the same items at the grocery store, and credit card applicants will share specific characterisitcs.

For all of these applications our goal is to estimate the probability that an event will possess item B given that it has item A. This probability is also called the confidence.

In the example above we might say that we are 23% confident that a customer will purchase rice (item B) given they are purchasing chicken (item A). We can use historical transactions (events) to estimate confidence.

Now for a practical implementation using the tidyverse in R! I am using a groceries dataset from Georgia Tech. This dataset contains rows with items separated by commas.

receipt

citrus fruit, semi-finished bread

ready soups, margarine

One transaction per row with items comma separated.

Because each event contains different items I read it using readLines() and reshape into a longer format. The groceries column contains the item name while transaction contains the transaction ID.

link <- "https://cse6040.gatech.edu/datasets/groceries.csv"
groceries <- readLines(link)

# Create long form version of data
groceries_long <- 
  tibble::tibble(groceries) |>
  dplyr::mutate(
    transaction = dplyr::row_number()
  ) |>
  tidyr::separate_rows(
    groceries,
    sep = ","
  )

`groceries`	`transaction`
`citrus fruit`	`1`
`semi-finished bread`	`1`
`tropical fruit`	`2`

Long form data with one item per row with a transaction ID.

With our data in the proper format we can develop two functions. The first function takes a vector of items and returns a vector of comma separated combinations as (A,B) and (B,A).

comb_vec <- function(items) {
  # Gets vector of all 2-level combinations
  
  p <- t(combn(items, 2))
  reg <- glue::glue("{p[, 1]},{p[, 2]}")
  rev <- glue::glue("{p[, 2]},{p[, 1]}")
  c(reg, rev)
}

For example, giving this function c("A", "B", "C") would return c("A,B" "A,C" "B,C" "B,A" "C,A" "C,B"). This is because we want to determine the probabilities of A given B and B given A.

Our final function performs the data mining. The first argument called data takes in the data frame of events and items. The last two arguments item_col and event_id tell the function which columns refer to the items and the event identifier respectively.

pair_assoc <- function(data, item_col, event_id, item_min = 1L) {
  # Derives association pairs for all elements in data
  
  # Count all items
  item_count <-
    data |>
    dplyr::count(
      A = {{ item_col }},
      name = "A Count"
    )
  
  # Get pairs as probabilities
  data |>
    dplyr::group_by({{ event_id }}) |>
    dplyr::filter(length({{ item_col }}) > 1) |>
    dplyr::reframe(comb = comb_vec({{ item_col }})) |>
    dplyr::ungroup() |>
    dplyr::count(
      comb,
      name = "A B Count"
    ) |>
    tidyr::separate(
      col = comb,
      into = c("A", "B"),
      sep = ","
    ) |>
    dplyr::left_join(
      y = item_count,
      by = "A"
    ) |>
    dplyr::mutate(
      Confidence = `A B Count` / `A Count`
    ) |>
    dplyr::arrange(desc(Confidence))
}

This function works in two stages. First, it determines the count of all individual items in the data set. In the example with groceries, this might be the counts of transactions with rice, beans, etc.

`groceries`	`A Count`
`baking powder`	`174`
`berries`	`327`

Counts of individual items serve as the denominator in the confidence computation.

The second stage uses the comb_vec() function to determine all valid item combinations within each group. This stage only returns valid combinations where the confidence is > 0%.

Finally, the function left joins the item counts to the combination counts and computes the confidence values. I called the function and return the result. I am also filtering to only combinations with a confidence of 50% or more with items purchased more than 10 times.

groceries_long |>
  pair_assoc(
    item_col = groceries, 
    event_id = transaction
  ) |>
  dplyr::filter(
    `A Count` >= 10,
    Confidence >= 0.5
  )

Here we can see the head of the results table ordered by confidence from highest to lowest. We observe that the confidence of honey and whole milk is 73%! In other words, 73% of the transactions that contain honey also contain whole milk.

`A`	`B`	`A B Count`	`A Count`	`Confidence`
`honey`	`whole milk`	`11`	`14`	`0.733`
`frozen fruits`	`other vegetables`	`8`	`12`	`0.667`
`cereals`	`whole milk`	`36`	`56`	`0.643`
`rice`	`whole milk`	`46`	`75`	`0.613`

Head of results table.

Association rule mining is a fairly simple and easy to interpret technique to help draw relationships between items and events in a data set.