Random Number Generation

Random number generators are at the core of many applied computer science applications. For example, stochastic simulation models, statistical sampling, and learning techniques all rely on random numbers generated by computers. However, computers are famously not random.

Computers allow us to reproduce complex calculations and workflows repeatedly and achieve the same result. So how can a machine which is designed to follow careful instruction generate randomness?

In fact, the random numbers generated by computers are not random at all. They leverage some starting point (a seed) and generate a series of numbers which follow that appear to have no relation to each other. However, if the same seed is used again the computer can generate the same series of numbers!

This is powerful because it means we can both simulate randomness, but also simulate that same randomness again in the future. But how do we generate random numbers and what form do those numbers take?


Sometimes we need to generate random whole numbers. Other times we need to generate percentages as a value between 0 and 1. And sometimes we need to generate random numbers which follow some distribution (say an exponential).

It turns out that as long as we can generate random numbers between 0 and 1, we can transform them to any reasonable distribution via the inverse transform theorem (this is something I might discuss in a future post). However, for now lets focus on random numbers between 0 and 1.

The most common random number generator is called the linear congruential generator or LCG. This class of generators has the following form (where U is one random value between 0 and 1):

X_{i}=aX_{i-1}+c \mod m

U_i=\frac{X_i}{m}

The values of a, c, and, m are all carefully chosen. There are two critical tests a random number generator must pass to be useful; independence and uniformity. We test that using runs and chi-squared tests, respectively.

An independence test verifies that the next value cannot be predicted by the current value. We say that a random number generator produces independent numbers if the following statement is true. Here A is the number of runs within a random number set.

\text{Independence:} \frac{A-\frac{2n-1}{3}}{\sqrt{\frac{16n-29}{90}}} < z_{0.975}

A run is simply continuous increases or decreases in values. For example: 1, 2, 3, 4, 3 has two runs because it counts up to four (the first run) and back down to three (the second run).

Next we test for uniformity. This sorts all values into buckets and counts them. If we had 1000 random numbers between 0 and 1 and we split them into quartiles we would expect (shown as E here) 250 values in each bucket. The O here is the number of observed values in those buckets.

\text{Uniformity:}\sum\frac{(O-E)^2}{E}<\chi_{0.95,k-1}^2

We can write a little bit of R to help us verify the performance of one very famous random number generator called the desert island generator. This generator chooses the values of 16807, 0, and 2147483647 for a, c, and m, respectively.


random <- function(n, seed = 123) {
  
  values <- numeric(n)
  m <- 2147483647
  
  for (i in seq(n)) {
    seed <- (16807 * seed) %% m
    values[i] <- seed / m
  }
  
  values
}

is_independent <- function(x, alpha = 0.95) {
  
  a <- head(x, -1)
  b <- tail(x, -1)
  
  change <- a > b
  run_change <- change != dplyr::lag(change, 1, 0)
  runs <- max(cumsum(x) + 1)
  
  n <- length(x)
  e <- (2 * n - 1) / 3
  v <- (16 * n - 29) / 90
  
  stat <- round(abs((runs - e) / (v ^ 0.5)), 2)
  crit <- round(qnorm(1 - (alpha / 2)), 2)
  
  msg <- dplyr::if_else(
    stat < crit,
    "Not Independent (Reject",
    "Independent (Accept"
  )
  
  glue::glue("{msg} H0): Stat = {stat}, Critical = {crit}")
}

is_uniform <- function(x, n = 100, alpha = 0.95) {
  
  e <- ceiling(length(x) / n)
  
  stat <- round(sum(((table(ceiling(x * n)) - e) ^ 2) / e), 2)
  crit <- round(qchisq(1 - alpha, n - 1), 2)
  
  msg <- dplyr::if_else(
    stat < crit,
    "Not Independent (Reject",
    "Independent (Accept"
  )
  
  glue::glue("{msg} H0): Stat = {stat}, Critical = {crit}")
}

x <- random(1e6)

is_independent(x)
# > Independent (Accept H0): Stat = 394.49, Critical = 0.06

is_uniform(x)
# > Independent (Accept H0): Stat = 128.36, Critical = 77.05

This implementation tests and verifies that our generator is random for 1M random values!

Advent of Code 2023: Day Three

Advent of Code is an annual series of holiday themed programming challenges which can be completed in any language. There is one problem for each day in December before Christmas!

Day three asks us to consider an engine schematic which needs to be parsed. If a symbol has a number adjacent to it (could be horizontal, vertical, or diagonal) we need to keep it. We sum up all numbers at the end to identify the missing part number. You can see the original problem here.

In the text below, we can see symbols (any value that is not a period or a number). The first star on the second row means that the number 467 is considered but the number 114 is not considered since no part of the number is directly adjacent.

467..114..
...*......
..35..633.
......#...
617*......
.....+.58.
..592.....
......755.
...$.*....
.664.598..

I decided to use R and the tidyverse to solve this problem. The real schematic is much larger than the example shown above. First, I load the text and parse the lines into a useful tibble format.

#### Setup ####

library(tidyr)
library(dplyr)
library(readr)
library(tibble)

lines <- readr::read_lines("~/Documents/data.txt")

digits <- seq(0, 9)
non_symbols <- c(".", digits)

#### Parse Schematic ####

parsed_lines <-
  lines |>
  stringr::str_split("") |>
  purrr::imap(function(value, row) {
    
    tibble::tibble(
      row,
      col = seq_along(value),
      value
    )
    
  }) |>
  dplyr::bind_rows()
  
# A tibble: 19,600 × 3
#   row   col value
#   <int> <int> <chr>
# 1     1     1  .    
# 2     1     2  .    
# 3     1     3  .    
# 4     1     4  .    
# 5     1     5  5    
# 6     1     6  7    
# 7     1     7  3    
# 8     1     8  .    
# 9     1     9  6    
# 10    1    10  1    
# ℹ 19,590 more rows

Next, I parse this data frame into two pieces, symbols and numbers. I want to determine the row-column positions of each number as an instance instead of individual characters. I use the lag function to determine the extent of a number and store these for later use.

number_df <-
  parsed_lines |>
  dplyr::mutate(
    is_num = value %in% digits,
    num_instance = cumsum(is_num != lag(is_num, 1, FALSE))
  ) |>
  dplyr::filter(
    value %in% digits
  ) |>
  dplyr::group_by(num_instance) |>
  dplyr::mutate(
    number = glue::glue_collapse(value)
  ) |>
  dplyr::ungroup() |>
  dplyr::select(
    row,
    col,
    number,
    num_instance
  )

I also parse the symbols. I first identify any character which is not a number or a period. Next, I expand to a series of nudges or adjustments to the row-column pair in all directions where a potential number could be matched.

nudges <-
  tidyr::expand_grid(
    col_nudge = seq(-1, 1),
    row_nudge = seq(-1, 1)
  ) |>
  dplyr::filter(
    !(col_nudge == 0 & row_nudge == 0)
  )

symbol_df <-
  parsed_lines |>
  dplyr::filter(
    !value %in% non_symbols
  ) |>
  dplyr::mutate(
    sym_instance = row_number()
  ) |>
  tidyr::expand_grid(nudges) |>
  dplyr::mutate(
    row = row + row_nudge,
    col = col + col_nudge
  )

Finally, I join these together an ensure that only one instance of a number is present for each symbol (so numbers which are both horizontally and vertically above a symbol are not double counted, for example). This allows me to take the simple sun and find the answer which is 560670! On my machine this implementation takes around 150ms which is fairly quick!

symbol_df |>
  dplyr::inner_join(
    y = number_df,
    by = c("row", "col")
  ) |>
  dplyr::distinct(
    number,
    sym_instance,
    num_instance
  ) |>
  dplyr::pull(number) |>
  as.numeric() |>
  sum()