Association Rule Mining in R

Association rule mining is the process of determining conditional probabilities within events that contain items or characteristics. Events can range from tweets, to grocery store receipts, to credit card applications.

Items should also recur across events rather than being unique to a single event. For example, words are repeated across tweets, multiple customers will buy the same items at the grocery store, and credit card applicants will share specific characteristics.

For all of these applications our goal is to estimate the probability that an event will possess item B given that it has item A. This probability is also called the confidence.

In the grocery store example, we might say that we are 23% confident that a customer will purchase rice (item B) given they are purchasing chicken (item A). We can use historical transactions (events) to estimate confidence.
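
As a minimal sketch of that estimate, the base R snippet below computes the confidence of rice given chicken from five made-up transactions. The data and the variable names (transactions, has_a, has_b) are assumptions for illustration and are not part of the groceries dataset.

```r
# Hypothetical toy transactions (not taken from the groceries dataset)
transactions <- list(
  c("chicken", "rice", "beans"),
  c("chicken", "rice"),
  c("chicken", "bread"),
  c("rice", "beans"),
  c("chicken", "milk")
)

# Flag the transactions that contain item A (chicken) and item B (rice)
has_a <- vapply(transactions, function(items) "chicken" %in% items, logical(1))
has_b <- vapply(transactions, function(items) "rice" %in% items, logical(1))

# Confidence of B given A = count(A and B) / count(A)
confidence <- sum(has_a & has_b) / sum(has_a)

# 2 of the 4 chicken transactions also contain rice, so this is 0.5
confidence
```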

Now for a practical implementation using the tidyverse in R! I am using a groceries dataset from Georgia Tech. This dataset contains rows with items separated by commas.

receipt
citrus fruit, semi-finished bread
ready soups, margarine

One transaction per row, with comma-separated items.

Because each event contains a different number of items, I read the file with readLines() and reshape it into long format. The groceries column contains the item name, while transaction contains the transaction ID.

library(dplyr)  # provides %>%, sym() and desc(), which are used unqualified below
library(tidyr)

link <- "https://cse6040.gatech.edu/datasets/groceries.csv"

groceries <- readLines(link)

# Create long form version of data

groceries_long <- 
  data.frame(groceries) %>%
  dplyr::mutate(
    transaction = dplyr::row_number()
  ) %>%
  tidyr::separate_rows(
    groceries, sep = ","
  )

groceries            transaction
citrus fruit         1
semi-finished bread  1
tropical fruit       2

Long-form data: one item per row, with a transaction ID.
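
For readers without the tidyverse installed, roughly the same reshaping can be sketched in base R. The two input lines below are made up to mimic the file's layout, and the trimws() call is an added precaution against stray spaces around the commas rather than something the original code does.

```r
# Hypothetical stand-in for the result of readLines(link)
lines <- c("citrus fruit,semi-finished bread", "tropical fruit")

# Split each line on commas: one character vector of items per transaction
items <- strsplit(lines, ",", fixed = TRUE)

# Repeat each transaction ID once per item to get one row per item
long <- data.frame(
  groceries   = trimws(unlist(items)),
  transaction = rep(seq_along(items), lengths(items))
)

long
```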

With our data in the proper format we can develop two functions. The first function takes a vector of items and returns a vector of comma separated combinations as (A,B) and (B,A).

comb_vec <- function(items) {
  
  p <- t(combn(items, 2))
  
  c(paste0(p[, 1], ",", p[, 2]), paste0(p[, 2], ",", p[, 1]))

}

For example, giving this function c("A", "B", "C") would return c("A,B", "A,C", "B,C", "B,A", "C,A", "C,B"). Both orders are included because we want the probability of B given A as well as A given B.
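
To see where that output comes from, the steps inside comb_vec() can be run one at a time. This sketch uses only base R; the names forward, reverse and pairs are introduced here for illustration.

```r
items <- c("A", "B", "C")

# combn() returns one unordered pair per column; t() puts one pair per row
p <- t(combn(items, 2))

# Paste each pair in both orders so that (A,B) and (B,A) are both present
forward <- paste0(p[, 1], ",", p[, 2])
reverse <- paste0(p[, 2], ",", p[, 1])
pairs <- c(forward, reverse)

pairs

# n distinct items produce n * (n - 1) ordered pairs
length(pairs) == length(items) * (length(items) - 1)
```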

Our final function performs the data mining. The first argument, data, takes in the data frame of events and items. The next two arguments, item_col and event_id, tell the function which columns hold the items and the event identifier, respectively. A fourth argument, item_min, appears in the signature but is not used in the function body.

pair_assoc <- function(data, item_col, event_id, item_min = 1L) {
  
  # Count all items
  
  item_count <- dplyr::count(data, !!sym(item_col), name = "A Count")
  
  data %>%
    dplyr::group_by( # Group by event identifier
      !!sym(event_id)
    ) %>%
    dplyr::filter( # Keep only events with at least two items, since a pair needs two
      length(!!sym(item_col)) > 1
    ) %>%
    dplyr::summarise( # Create combinations for each event
      comb = comb_vec(!!sym(item_col))
    ) %>%
    dplyr::ungroup( # Ungroup before counting combinations
    ) %>%
    dplyr::count( # Count combinations across all events
      comb, name = "A B Count"
    ) %>%
    tidyr::separate( # Separate combinations into two columns
      col = comb,
      into = c("A","B"), sep = ","
    ) %>%
    dplyr::left_join( # Join counts of item A from item_count
      y = item_count,
      by = c("A" = item_col)
    ) %>%
    dplyr::mutate( # Compute confidence P(B given A)
      Confidence = `A B Count` / `A Count`
    ) %>%
    dplyr::arrange( # Descend by confidence
      desc(Confidence)
    )
  
}

This function works in two stages. First, it determines the count of all individual items in the data set. In the example with groceries, this might be the counts of transactions with rice, beans, etc.

groceries      A Count
baking powder  174
berries        327

Counts of individual items serve as the denominator in the confidence computation.

The second stage uses the comb_vec() function to build every ordered item pair within each event. Because a pair is only generated when both items appear together in at least one event, every combination returned has a confidence greater than 0%.
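
The two stages can also be traced on a tiny example. The snippet below is a base R sketch of the same idea, not the pair_assoc() function itself; the toy data frame and every variable name in it are assumptions for illustration.

```r
# Hypothetical long-form toy data, mirroring the groceries/transaction layout
toy <- data.frame(
  groceries   = c("rice", "beans", "rice", "milk", "beans"),
  transaction = c(1, 1, 2, 2, 3)
)

# Stage 1: count individual items (the denominator)
item_counts <- table(toy$groceries)

# Stage 2: build ordered pairs within each transaction that has two or more items
by_tx <- split(toy$groceries, toy$transaction)
by_tx <- by_tx[lengths(by_tx) > 1]
pairs <- unlist(lapply(by_tx, function(items) {
  p <- t(combn(items, 2))
  c(paste0(p[, 1], ",", p[, 2]), paste0(p[, 2], ",", p[, 1]))
}), use.names = FALSE)

# Only pairs that were seen together at least once get a count
pair_counts <- table(pairs)

# Confidence = pair count / count of the first item (A) in the pair
first_item <- sub(",.*", "", names(pair_counts))
confidence <- as.numeric(pair_counts) / as.numeric(item_counts[first_item])
names(confidence) <- names(pair_counts)

confidence
```

Beans and milk never share a transaction here, so no "beans,milk" pair is produced, and every confidence that does appear is above zero.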

Finally, the function left joins the item counts to the combination counts and computes the confidence values. Below, I call the function and filter the result to combinations with a confidence of at least 50% whose item A was purchased at least 10 times.

groceries_long %>%
  pair_assoc(
    item_col = "groceries", 
    event_id = "transaction"
  ) %>%
  dplyr::filter(
    `A Count` >= 10,
    Confidence >= 0.5
  )

Here we can see the head of the results table, ordered by confidence from highest to lowest. We observe that the confidence of whole milk given honey is 73%! In other words, 73% of the transactions that contain honey also contain whole milk.

A              B                 A B Count  A Count  Confidence
honey          whole milk        11         14       0.733
frozen fruits  other vegetables  8          12       0.667
cereals        whole milk        36         56       0.643
rice           whole milk        46         75       0.613

Head of results table.

Association rule mining is a simple, easy-to-interpret technique for uncovering relationships between items across the events in a data set.