dataclass is an R package I created to easily define templates for lists and data frames that validate each element. It is useful in R processes that pull from dynamic data sources such as databases and web APIs, where it adds an extra layer of validation around input and output data.
With dataclass you specify the expected type, length, range, allowable values, and more for each element in your data. You also decide whether violations of these expectations should throw an error or a warning.
For example, suppose you wanted to create a data frame in R which contains three columns: `date`, `low_flag`, and `metric`. These columns represent the output of some analytic process in R. Traditionally, you would simply write these columns as a data frame. But how can we be sure that the data is correct? Simply describe your data in a declarative fashion:
```r
library(dataclass)

my_dataclass <- dataclass::dataclass(
  # Date, logical, and numeric column
  date = dataclass::dte_vec(),
  low_flag = dataclass::lgl_vec(),
  metric = dataclass::num_vec()
) |>
  dataclass::data_validator()
```
Now we have a template for our data called `my_dataclass`. Because we want to validate a data frame (as opposed to a list), we called `data_validator()` to let dataclass know we are validating a data frame. How do we use it? Simply call `my_dataclass()` as a function on the data you want to validate. If we pass in valid inputs, dataclass returns the input data. Invalid inputs, however, throw an error.
```r
tibble::tibble(
  date = Sys.Date(),
  low_flag = TRUE,
  metric = 1
) |>
  my_dataclass()
#> # A tibble: 1 × 3
#>   date       low_flag metric
#>   <date>     <lgl>     <dbl>
#> 1 2023-03-21 TRUE          1

tibble::tibble(
  date = Sys.Date(),
  low_flag = TRUE,
  metric = "A string!"
) |>
  my_dataclass()
#> Error:
#> ! The following elements have error-level violations:
#> ✖ metric: is not numeric
#> Run `rlang::last_error()` to see where the error occurred.
```
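The template above only checks column types. The range and warning-level options mentioned at the start could look roughly like the sketch below. It assumes that `num_vec()` accepts `min_val` and `max_val` arguments and that validators accept `level = "warn"`. These argument names are assumptions that do not appear in the examples above, so check `?dataclass::num_vec` against your installed version. The name `scored_dataclass` is made up for illustration.

```r
library(dataclass)

# Assumed arguments (not shown elsewhere in this post):
#   min_val / max_val bound the allowed values,
#   level = "warn" reports a violation as a warning instead of an error.
scored_dataclass <- dataclass::dataclass(
  date = dataclass::dte_vec(),
  # Out-of-range values in this column are error-level violations
  metric = dataclass::num_vec(min_val = 0, max_val = 1),
  # Problems in this column are reported as warnings only
  low_flag = dataclass::lgl_vec(level = "warn")
) |>
  dataclass::data_validator()

# A row that satisfies every expectation passes through the validator
checked <- tibble::tibble(
  date = Sys.Date(),
  metric = 0.5,
  low_flag = TRUE
) |>
  scored_dataclass()
```

Keeping hard requirements at the error level and softer expectations at the warning level lets a pipeline stop only on data it truly cannot use.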
We can also use dataclass to validate lists. Suppose we want to validate that a list contains `date`, `my_data`, and `note`, where these elements correspond to the run date, a data frame, and a string respectively:
```r
new_dataclass <- dataclass::dataclass(
  date = dataclass::dte_vec(1),
  my_data = dataclass::df_like(),
  note = dataclass::chr_vec(1)
)
```
Now we can validate a list!
```r
new_dataclass(
  date = Sys.Date(),
  my_data = head(mtcars, 2),
  note = "A note!"
)
#> $date
#> [1] "2023-03-21"
#>
#> $my_data
#>               mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4
#>
#> $note
#> [1] "A note!"

new_dataclass(
  date = Sys.Date(),
  my_data = mtcars,
  # note is not a single string!
  note = c(1, 2, 3)
)
#> Error:
#> ! The following elements have error-level violations:
#> ✖ note: is not a character
#> Run `rlang::last_error()` to see where the error occurred.
```
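Because invalid input raises an ordinary R error, a process that pulls from a database or web API can catch it and decide what to do next. The sketch below wraps a list validator in base R's `tryCatch()`. The names `run_dataclass` and `safe_validate()` are hypothetical helpers made up for this illustration and are not part of dataclass.

```r
library(dataclass)

# A small list template, same style as new_dataclass above
run_dataclass <- dataclass::dataclass(
  date = dataclass::dte_vec(1),
  note = dataclass::chr_vec(1)
)

# Hypothetical helper: returns the validated list, or NULL (with a
# message) when the input violates the template
safe_validate <- function(...) {
  tryCatch(
    run_dataclass(...),
    error = function(e) {
      message("Validation failed: ", conditionMessage(e))
      NULL
    }
  )
}

ok <- safe_validate(date = Sys.Date(), note = "A note!")
bad <- safe_validate(date = Sys.Date(), note = c(1, 2, 3))
```

Here `ok` holds the validated list, while `bad` is `NULL` because `note` is not a character vector. Whether to return `NULL`, retry, or stop the job is a design choice that depends on the pipeline.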
And that's it! Getting started takes only a few lines of code. The learning curve is small, while data validation pays off throughout a data science workflow!
You can install dataclass from CRAN by running `install.packages("dataclass")` in your R console. Finally, if you want to contribute or report bugs, you can visit the package's GitHub repository.