Create Validated Data in R with dataclass

dataclass is an R package I created to easily define templates for lists and data frames that validate each element. It is useful in R processes that pull from dynamic data sources such as databases and web APIs, where it provides an extra layer of validation around input and output data.

To use dataclass, you specify the expected type, length, range, allowable values, and more for each element in your data. You also decide whether violations of these expectations should throw an error or a warning.

For example, suppose you want to create a data frame in R which contains three columns: date, low_flag, and metric. These columns represent the output of some analytic process in R. Traditionally, you would simply build the data frame and write it out. But how can we be sure that the data is correct? With dataclass, you describe your data declaratively:

library(dataclass)

my_dataclass <-
  dataclass::dataclass(
    # Date, logical, and numeric column
    date = dataclass::dte_vec(),
    low_flag = dataclass::lgl_vec(),
    metric = dataclass::num_vec()
  ) |>
  dataclass::data_validator()

Now we have a template for our data called my_dataclass. Because we want to validate a data frame (as opposed to a list), we called data_validator() to let dataclass know. How do we use it? Call my_dataclass() as a function on the data you want to validate. If we pass in valid inputs, dataclass returns the input data. Invalid inputs, however, throw an error.

tibble::tibble(
  date = Sys.Date(),
  low_flag = TRUE,
  metric = 1
) |>
  my_dataclass()
  
#> # A tibble: 1 × 3
#>   date       low_flag metric
#>   <date>     <lgl>     <dbl>
#> 1 2023-03-21 TRUE          1

tibble::tibble(
  date = Sys.Date(),
  low_flag = TRUE,
  metric = "A string!"
) |>
  my_dataclass()
  
#> Error:
#>   ! The following elements have error-level violations:
#>   ✖ metric: is not numeric
#> Run `rlang::last_error()` to see where the error occurred.
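
The length, range, allowed-value, and error-versus-warning options mentioned earlier can be attached to the same kind of template. The block below is a minimal sketch, not a definitive reference: it assumes the argument names min_val, max_val, allowed, and level (with the value "warn") match your installed version of dataclass, so check ?dataclass::num_vec and ?dataclass::chr_vec before relying on them. The region column and the strict_dataclass name are made up for illustration.

```r
library(dataclass)

# Assumed argument names (verify against the package help pages):
# min_val, max_val, allowed, level
strict_dataclass <-
  dataclass::dataclass(
    date = dataclass::dte_vec(),
    # metric is expected to fall between 0 and 100
    metric = dataclass::num_vec(min_val = 0, max_val = 100),
    # region (hypothetical column) must be one of the listed values
    region = dataclass::chr_vec(allowed = c("north", "south")),
    # a violation in low_flag should warn rather than error
    low_flag = dataclass::lgl_vec(level = "warn")
  ) |>
  dataclass::data_validator()

# Data that satisfies every expectation in the template
valid_data <- tibble::tibble(
  date = Sys.Date(),
  metric = 42,
  region = "north",
  low_flag = TRUE
)

checked <- strict_dataclass(valid_data)
```

If the assumptions hold, checked is the input data, and a metric outside the range or a region outside the allowed set would be reported in the same style as the error above.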

We can also use dataclass to validate lists. Suppose we want to validate that a list contains date, my_data, and note, where these elements are the run date, a data frame, and a string, respectively:

new_dataclass <-
  dataclass::dataclass(
    date = dataclass::dte_vec(1),
    my_data = dataclass::df_like(),
    note = dataclass::chr_vec(1)
  )

Now we can validate a list!

new_dataclass(
  date = Sys.Date(),
  my_data = head(mtcars, 2),
  note = "A note!"
)

#> $date
#> [1] "2023-03-21"
#> 
#> $my_data
#>               mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4
#> 
#> $note
#> [1] "A note!"

new_dataclass(
  date = Sys.Date(),
  my_data = mtcars,
  # note is not a single string!
  note = c(1, 2, 3)
)

#> Error:
#>   ! The following elements have error-level violations:
#>   ✖ note: is not a character
#> Run `rlang::last_error()` to see where the error occurred.
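
Because a violation is raised as an ordinary R error, a scheduled process can catch it with base R's tryCatch() instead of stopping. This is a rough sketch of one way to do that: note_dataclass and validate_or_null() are hypothetical names for illustration and are not part of dataclass.

```r
library(dataclass)

# A small template: note must be a single string
note_dataclass <-
  dataclass::dataclass(
    note = dataclass::chr_vec(1)
  )

# validate_or_null() is a hypothetical helper, not part of dataclass.
# It returns the validated list, or NULL if validation throws an error.
validate_or_null <- function(...) {
  tryCatch(
    note_dataclass(...),
    error = function(e) {
      # Report the failure and carry on
      message("Validation failed: ", conditionMessage(e))
      NULL
    }
  )
}

good <- validate_or_null(note = "A note!")
bad <- validate_or_null(note = c(1, 2, 3))
```

Here good holds the validated list while bad is NULL, so downstream code can branch on is.null() and decide whether to skip, retry, or alert.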

And that’s it! Getting started takes only a few lines and the learning curve is small, while validation adds a valuable safety net to any data science workflow!

If you want to contribute or report bugs, you can visit the dataclass GitHub repository. You can install dataclass from CRAN by running the command below in your R console:

install.packages("dataclass")