Association rule mining is the process of determining conditional probabilities within events that contain items or characteristics. Events can range from tweets, to grocery store receipts, to credit card applications.
Items should also recur across events rather than being unique to each one. For example, words are repeated across tweets, multiple customers will buy the same items at the grocery store, and credit card applicants will share specific characteristics.
For all of these applications our goal is to estimate the probability that an event will possess item B given that it has item A. This probability is also called the confidence.
In the grocery store example, we might say that we are 23% confident that a customer will purchase rice (item B) given they are purchasing chicken (item A). We can use historical transactions (events) to estimate confidence.
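As a minimal sketch of the definition, the snippet below estimates the confidence of rice given chicken from five made-up transactions. The data and variable names are hypothetical and chosen only so the arithmetic can be checked by hand.

```r
# Five made-up transactions (illustrative only, not from the groceries data)
transactions <- list(
  c("chicken", "rice"),
  c("chicken", "beans"),
  c("chicken", "rice", "beans"),
  c("rice"),
  c("chicken")
)

# Flag which transactions contain item A (chicken) and item B (rice)
has_a <- vapply(transactions, function(x) "chicken" %in% x, logical(1))
has_b <- vapply(transactions, function(x) "rice" %in% x, logical(1))

# Confidence = (events with both A and B) / (events with A)
confidence <- sum(has_a & has_b) / sum(has_a)
print(confidence)
```

Four transactions contain chicken and two of those also contain rice, so the estimate is 2 / 4.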
Now for a practical implementation using the tidyverse in R! I am using a groceries dataset from Georgia Tech. This dataset contains rows with items separated by commas.
receipt |
citrus fruit, semi-finished bread |
ready soups, margarine |
Because each event contains a different number of items, I read the file with readLines() and reshape it into a longer format. The groceries column contains the item name, while transaction contains the transaction ID.
library(magrittr) # Provides the %>% pipe used below

link <- "https://cse6040.gatech.edu/datasets/groceries.csv"
groceries <- readLines(link)

# Create long form version of data
groceries_long <-
  data.frame(groceries) %>%
  dplyr::mutate( # Number each line as one transaction
    transaction = dplyr::row_number()
  ) %>%
  tidyr::separate_rows( # One row per item per transaction
    groceries, sep = ","
  )
groceries | transaction |
citrus fruit | 1 |
semi-finished bread | 1 |
tropical fruit | 2 |
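To try the reshape without downloading the file, here is a base R sketch of the same idea on three made-up lines. It is an equivalent written with strsplit() rather than the tidyr call above, and the names toy_lines and toy_long are hypothetical.

```r
# Three made-up receipt lines standing in for the output of readLines()
toy_lines <- c(
  "citrus fruit,semi-finished bread",
  "tropical fruit",
  "ready soups,margarine"
)

# Split each line on commas; the result is one character vector per transaction
items <- strsplit(toy_lines, ",", fixed = TRUE)

# Repeat each transaction ID once per item to get the long format
toy_long <- data.frame(
  groceries = trimws(unlist(items)),
  transaction = rep(seq_along(items), lengths(items)),
  stringsAsFactors = FALSE
)
print(toy_long)
```

Each item ends up on its own row, tagged with the line number it came from.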
With our data in the proper format we can develop two functions. The first function takes a vector of items and returns a vector of comma separated combinations as (A,B) and (B,A).
comb_vec <- function(items) {
p <- t(combn(items, 2))
c(paste0(p[, 1], ",", p[, 2]), paste0(p[, 2], ",", p[, 1]))
}
For example, giving this function c("A", "B", "C") would return c("A,B", "A,C", "B,C", "B,A", "C,A", "C,B"). This is because we want to determine the probabilities of both B given A and A given B.
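The snippet below repeats the definition of comb_vec() so it runs on its own, checks the three-item example, and shows what happens with a single item. The tryCatch() wrapper is only there for illustration.

```r
comb_vec <- function(items) {
  p <- t(combn(items, 2))
  c(paste0(p[, 1], ",", p[, 2]), paste0(p[, 2], ",", p[, 1]))
}

# Three items give three unordered pairs, each emitted in both orders
pairs <- comb_vec(c("A", "B", "C"))
print(pairs)

# combn() cannot choose 2 items from 1, so a single-item input raises an error
single <- tryCatch(comb_vec("A"), error = function(e) "error")
print(single)
```

The error on a single item is why events with fewer than two items need to be dropped before the combinations are built.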
Our final function performs the data mining. The first argument, data, takes the data frame of events and items. The next two arguments, item_col and event_id, name the columns that hold the items and the event identifier, respectively.
pair_assoc <- function(data, item_col, event_id) {
  # Count all items
  item_count <- dplyr::count(data, !!rlang::sym(item_col), name = "A Count")
  data %>%
    dplyr::group_by( # Group by event identifier
      !!rlang::sym(event_id)
    ) %>%
    dplyr::filter( # Keep events with at least two items (needed to form a pair)
      dplyr::n() > 1
    ) %>%
    # Create combinations for each event. Returning several rows per group
    # from summarise() is deprecated in dplyr >= 1.1.0, where
    # dplyr::reframe() is the replacement.
    dplyr::summarise(
      comb = comb_vec(!!rlang::sym(item_col))
    ) %>%
    dplyr::ungroup( # Ungroup before counting combinations
    ) %>%
    dplyr::count( # Count combinations across all events
      comb, name = "A B Count"
    ) %>%
    tidyr::separate( # Separate combinations into two columns
      col = comb,
      into = c("A", "B"), sep = ","
    ) %>%
    dplyr::left_join( # Join counts of item A from item_count
      y = item_count,
      by = c("A" = item_col)
    ) %>%
    dplyr::mutate( # Compute confidence P(B given A)
      Confidence = `A B Count` / `A Count`
    ) %>%
    dplyr::arrange( # Descend by confidence
      dplyr::desc(Confidence)
    )
}
This function works in two stages. First, it determines the count of all individual items in the data set. In the example with groceries, this might be the counts of transactions with rice, beans, etc.
groceries | A Count |
baking powder | 174 |
berries | 327 |
The second stage uses the comb_vec() function to build every ordered pair of items within each event. Because pairs are generated only from items that actually appear together, every combination in the result co-occurs at least once, so its confidence is greater than 0%.
Finally, the function left joins the item counts to the combination counts and computes the confidence values. Below I call the function and then filter the result to combinations with a confidence of 50% or more whose item A was purchased at least 10 times.
groceries_long %>%
  pair_assoc(
    item_col = "groceries",
    event_id = "transaction" # Argument name must match the function definition
  ) %>%
  dplyr::filter(
    `A Count` >= 10,
    Confidence >= 0.5
  )
Here we can see the head of the results table, ordered by confidence from highest to lowest. The confidence of whole milk given honey is 73%! In other words, 73% of the transactions that contain honey also contain whole milk.
A | B | A B Count | A Count | Confidence |
honey | whole milk | 11 | 14 | 0.733 |
frozen fruits | other vegetables | 8 | 12 | 0.667 |
cereals | whole milk | 36 | 56 | 0.643 |
rice | whole milk | 46 | 75 | 0.613 |
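To sanity-check the two stages on data small enough to verify by hand, here is a base R sketch that mirrors the same logic on a made-up table. It is a cross-check under assumed toy data, not a replacement for pair_assoc(), and names such as toy, ab_count, and a_count are hypothetical.

```r
# Made-up long-format data: one row per item per transaction
toy <- data.frame(
  transaction = c(1, 1, 1, 2, 2, 3, 3, 4),
  item = c("chicken", "rice", "beans", "chicken", "rice",
           "chicken", "beans", "rice"),
  stringsAsFactors = FALSE
)

comb_vec <- function(items) {
  p <- t(combn(items, 2))
  c(paste0(p[, 1], ",", p[, 2]), paste0(p[, 2], ",", p[, 1]))
}

# Stage 1: count every individual item
item_count <- table(toy$item)

# Stage 2: build ordered pairs per transaction, skipping single-item ones
by_txn <- split(toy$item, toy$transaction)
by_txn <- by_txn[lengths(by_txn) > 1]
pair_count <- table(unlist(lapply(by_txn, comb_vec), use.names = FALSE))

# Join the counts and compute confidence P(B given A)
ab <- do.call(rbind, strsplit(names(pair_count), ",", fixed = TRUE))
result <- data.frame(
  A = ab[, 1],
  B = ab[, 2],
  ab_count = as.integer(pair_count),
  a_count = as.integer(item_count[ab[, 1]]),
  stringsAsFactors = FALSE
)
result$confidence <- result$ab_count / result$a_count
result <- result[order(-result$confidence), ]
print(result)
```

In this toy data beans appears in two transactions and both also contain chicken, so the confidence of chicken given beans is 1, while the reverse direction is only 2 / 3. The asymmetry is the reason both (A,B) and (B,A) are generated.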
Association rule mining is a fairly simple and easy to interpret technique to help draw relationships between items and events in a data set.