 # Association Rule Mining in R

Association rule mining is the process of determining conditional probabilities within events that contain items or characteristics. Events can range from tweets, to grocery store receipts, to credit card applications.

Items should also recur across events rather than being unique to a single event. For example, words are repeated across tweets, multiple customers will buy the same items at the grocery store, and credit card applicants will share specific characteristics.

For all of these applications our goal is to estimate the probability that an event will possess item B given that it has item A. This probability is also called the confidence.

In the example above we might say that we are 23% confident that a customer will purchase rice (item B) given they are purchasing chicken (item A). We can use historical transactions (events) to estimate confidence.
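To make the definition concrete, here is a minimal sketch in base R. The five transactions and the helper name `confidence()` are made up for illustration and are not taken from the groceries data. Confidence is simply the number of transactions containing both items divided by the number containing item A.

```r
# Toy transactions (hypothetical, for illustration only)
transactions <- list(
  c("chicken", "rice", "beans"),
  c("chicken", "rice"),
  c("chicken", "bread"),
  c("rice", "beans"),
  c("chicken", "milk")
)

# Confidence of the rule A -> B: P(B | A) = count(A and B) / count(A)
confidence <- function(transactions, a, b) {
  has_a <- vapply(transactions, function(items) a %in% items, logical(1))
  has_b <- vapply(transactions, function(items) b %in% items, logical(1))
  sum(has_a & has_b) / sum(has_a)
}

confidence(transactions, "chicken", "rice") # 2 of the 4 chicken transactions
confidence(transactions, "rice", "chicken") # 2 of the 3 rice transactions
```

Note that confidence is not symmetric: the rule "chicken then rice" and the rule "rice then chicken" share a numerator but divide by different item counts.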

Now for a practical implementation using the `tidyverse` in R! I am using a groceries dataset from Georgia Tech. Each row of this dataset is one transaction, with its items separated by commas.

Because each transaction contains a different number of items, I read the file with `readLines()` and reshape it into a longer format. The `groceries` column contains the item name while `transaction` contains the transaction ID.

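```r
library(tidyverse) # attaches dplyr and tidyr, and provides the %>% pipe

link <- "https://cse6040.gatech.edu/datasets/groceries.csv"

# Read each transaction as one line of comma-separated text
groceries <- readLines(link)

# Create long form version of data: one row per (transaction, item)
groceries_long <-
  data.frame(groceries) %>%
  dplyr::mutate(
    transaction = dplyr::row_number()
  ) %>%
  tidyr::separate_rows(
    groceries, sep = ","
  )
```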
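To show what the long format looks like without downloading anything, here is a small base R sketch on two made-up receipt lines. The name `toy_long` and the items are hypothetical, and `strsplit()` stands in for `tidyr::separate_rows()`; the resulting shape is the same.

```r
# Two made-up receipt lines standing in for the file contents
raw_lines <- c("milk,bread,eggs", "rice,chicken")

# Split each line on commas: one character vector per transaction
items <- strsplit(raw_lines, ",", fixed = TRUE)

# One row per (transaction, item) pair
toy_long <- data.frame(
  groceries = unlist(items),
  transaction = rep(seq_along(items), lengths(items)),
  stringsAsFactors = FALSE
)

toy_long # 5 rows: three for transaction 1, two for transaction 2
```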

With our data in the proper format we can develop two functions. The first function takes a vector of items and returns a vector of comma separated combinations as (A,B) and (B,A).

```r
comb_vec <- function(items) {
  # Each row of p is one unordered pair of items
  p <- t(combn(items, 2))

  # Return both orderings of every pair: "A,B" and "B,A"
  c(paste0(p[, 1], ",", p[, 2]), paste0(p[, 2], ",", p[, 1]))
}
```

For example, giving this function `c("A", "B", "C")` would return `c("A,B", "A,C", "B,C", "B,A", "C,A", "C,B")`. Both orderings are returned because we want to determine the probabilities of A given B and B given A.
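The snippet below repeats the `comb_vec()` definition so it runs on its own and confirms that behaviour on small inputs. The variable names `pairs_abc` and `pairs_ab` are just for illustration.

```r
# Same definition as above, repeated so this snippet is self-contained
comb_vec <- function(items) {
  p <- t(combn(items, 2))
  c(paste0(p[, 1], ",", p[, 2]), paste0(p[, 2], ",", p[, 1]))
}

pairs_abc <- comb_vec(c("A", "B", "C"))
pairs_abc
# [1] "A,B" "A,C" "B,C" "B,A" "C,A" "C,B"

# A two-item event yields exactly one pair in each direction
pairs_ab <- comb_vec(c("A", "B"))
pairs_ab
# [1] "A,B" "B,A"
```

An event with n items produces n * (n - 1) ordered pairs. A single-item event cannot form a pair at all, so such events need to be dropped before calling the function.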

Our final function performs the data mining. The first argument, `data`, takes the data frame of events and items. The `item_col` and `event_id` arguments tell the function which columns hold the items and the event identifier, respectively.

```r
pair_assoc <- function(data, item_col, event_id, item_min = 1L) {
  # NOTE: item_min is accepted but not used in the body below

  # Count all items
  item_count <- dplyr::count(data, !!rlang::sym(item_col), name = "A Count")

  data %>%
    dplyr::group_by( # Group by event identifier
      !!rlang::sym(event_id)
    ) %>%
    dplyr::filter( # Keep events with at least two items (a pair needs two)
      length(!!rlang::sym(item_col)) > 1
    ) %>%
    dplyr::summarise( # Create combinations for each event
      comb = comb_vec(!!rlang::sym(item_col)) # dplyr >= 1.1.0 prefers reframe() here
    ) %>%
    dplyr::ungroup( # Ungroup before counting combinations
    ) %>%
    dplyr::count( # Count combinations across all events
      comb, name = "A B Count"
    ) %>%
    tidyr::separate( # Separate combinations into two columns
      col = comb,
      into = c("A", "B"), sep = ","
    ) %>%
    dplyr::left_join( # Join counts of item A from item_count
      y = item_count,
      by = c("A" = item_col)
    ) %>%
    dplyr::mutate( # Compute confidence P(B given A)
      Confidence = `A B Count` / `A Count`
    ) %>%
    dplyr::arrange( # Descend by confidence
      dplyr::desc(Confidence)
    )
}
```

This function works in two stages. First, it determines the count of all individual items in the data set. In the example with groceries, this might be the counts of transactions with rice, beans, etc.

The second stage uses the `comb_vec()` function to build every ordered pair of items within each event. Because a pair is only generated when both items appear together in at least one event, every combination returned has a confidence greater than 0%.

Finally, the function left joins the item counts to the combination counts and computes the confidence values. Next I call the function and filter the result to combinations with a confidence of at least 50% whose item A was purchased at least 10 times.

```r
groceries_long %>%
  pair_assoc(
    item_col = "groceries",
    event_id = "transaction" # argument name must match the function signature
  ) %>%
  dplyr::filter(
    `A Count` >= 10,
    Confidence >= 0.5
  )
```

Here we can see the head of the results table ordered by confidence from highest to lowest. We observe that the confidence of whole milk given honey is 73%! In other words, 73% of the transactions that contain honey also contain whole milk.
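As a sanity check on the count-then-divide logic, the same confidence values can be computed a different way in base R, from an item-by-transaction incidence matrix. This is an alternative sketch rather than the method used above, and the toy data and the names `toy`, `incidence`, `co_counts`, and `conf` are assumptions made for illustration.

```r
# Toy long-format data (hypothetical), same shape as groceries_long
toy <- data.frame(
  transaction = c(1, 1, 1, 2, 2, 3, 3, 4),
  groceries = c("chicken", "rice", "beans",
                "chicken", "rice",
                "chicken", "bread",
                "rice"),
  stringsAsFactors = FALSE
)

# Incidence matrix: rows are transactions, columns are items (1 = present)
incidence <- (unclass(table(toy$transaction, toy$groceries)) > 0) * 1

# Co-occurrence counts; the diagonal holds the single-item counts
co_counts <- crossprod(incidence)

# Dividing row A by count(A) gives P(B | A) in cell [A, B]
conf <- co_counts / diag(co_counts)
diag(conf) <- NA # the rule A -> A is not meaningful

round(conf, 2)
```

Unlike `pair_assoc()`, this matrix also includes pairs that never co-occur (with a confidence of 0), and it grows with the square of the number of distinct items, so it is best suited to small checks like this one.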

Association rule mining is a fairly simple, easy-to-interpret technique for drawing out relationships between items across the events in a data set.