Adding "views" to autodb, part two: initial tests

Last time, I showed why I think that adding “views” – i.e. previous decomposition steps, or “pre-decompositions” – to a database schema might ensure that it stays connected. Now we need to work out how to implement them.

library(autodb)
## 
## Attaching package: 'autodb'
## The following object is masked from 'package:stats':
## 
##     decompose
show <- function(x) DiagrammeR::grViz(gv(x), width = "100%")

Test property

The problem we want to solve is that the database schema given by autodb sometimes isn’t connected, so the first thing we need is a function that checks whether a schema is connected.

To do this, we construct a directed adjacency matrix for the foreign key references, and use that matrix to determine whether any relation can reach all the others by following references. Since references are described as a child referring to a parent, we can think of this as looking for a common descendant for all of the relations.

If we consider each relation as being adjacent to itself, we can achieve this by taking higher powers of the adjacency matrix until there’s no further change. Each row of the result shows the ancestors for a relation, including itself, and we check whether any relation has all of the relations as ancestors.

adjacency <- function(x) {
  stopifnot(inherits(x, c("database_schema", "database")))

  nms <- names(x)
  len <- length(x)
  refs <- references(x)
  adj <- matrix(
    FALSE,
    nrow = len,
    ncol = len,
    dimnames = list(nms, nms)
  )
  from <- vapply(refs, \(r) match(r[[1]], nms), integer(1))
  to <- vapply(refs, \(r) match(r[[3]], nms), integer(1))
  for (n in seq_along(refs))
    adj[[from[[n]], to[[n]]]] <- TRUE
  diag(adj) <- TRUE
  adj
}

fixed_point <- function(init, step, endif) {
  old <- init
  current <- step(init)
  while (!endif(old, current)) {
    old <- current
    current <- step(current)
  }
  current
}

is_connected <- function(x) {
  connections <- adjacency(x)
  filled <- fixed_point(connections, \(x) x %*% connections > 0, identical)
  any(apply(filled, 1, all))
}

This is not a particularly efficient implementation – we’re re-calculating everything instead of just accounting for new connections on each iteration – but it works. adjacency calculates the initial adjacency matrix. fixed_point just encapsulates the loop for repeated application, so we don’t have its temporary variables hanging around afterwards.
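To see the closure computation in isolation, here’s a minimal base-R sketch on a hypothetical three-relation chain (a refers to b, b refers to c), not involving autodb at all; fixed_point is repeated so the snippet stands alone:

```r
# Hypothetical chain of three relations, child -> parent: a -> b -> c.
nms <- c("a", "b", "c")
adj <- matrix(FALSE, 3, 3, dimnames = list(nms, nms))
adj["a", "b"] <- TRUE  # a refers to b
adj["b", "c"] <- TRUE  # b refers to c
diag(adj) <- TRUE      # each relation is adjacent to itself

fixed_point <- function(init, step, endif) {
  old <- init
  current <- step(init)
  while (!endif(old, current)) {
    old <- current
    current <- step(current)
  }
  current
}

# Repeated multiplication fills in transitive reachability.
filled <- fixed_point(adj, \(x) x %*% adj > 0, identical)
filled["a", ]               # a reaches a, b, and c: all TRUE
any(apply(filled, 1, all))  # TRUE: a is a common descendant
```

Relation a’s row ends up all TRUE, so the chain counts as connected; note that c’s row stays mostly FALSE, since reachability follows the reference direction.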

Let’s check this against two simple examples:

connected_example <- autodb(ChickWeight)
show(connected_example)

Remember, arrows with black heads represent foreign key references, and columns with black cells indicate relation keys.

nonconnected_example <- normalise(functional_dependency(
  list(
    list("a", "b"),
    list("a", "c"),
    list("b", "d"),
    list("c", "e"),
    list(c("d", "e"), "f")
  ),
  letters[1:6]
))
show(nonconnected_example)
is_connected(connected_example)
## [1] TRUE
is_connected(nonconnected_example)
## [1] FALSE

Good. I could add some more thorough testing, but this is a simple enough function that I don’t want to spend much time on it. Instead, let’s look at the problem we want to solve.

Test case generation

I want this to be test-driven. Specifically, I want to use property tests, so that we can think about general mathematical properties of the algorithm instead of risking getting lost in a forest of individual test cases.

This means writing a test case generator first. I have a more involved generator in autodb’s test suite, but we’ll write a simple version here that just uses integers. I use the hedgehog package for property-based testing, which lets you build up test case generators by composing existing ones. testthat adds higher-level generators on top of it, but I want the lower-level flexibility.

library(hedgehog)
## Loading required package: testthat

First, we generate the number of rows and columns, given upper limits.

  • gen.int(n) generates a number between 1 and n.
  • gen.with(g, f) takes a generator g of some value x, and returns the generator of f(x). Here, we pass it a list of two generators, which gets treated as a single generator for length-two lists.

gen.dims <- function(rows, cols) {
  list(gen.int(rows), gen.int(cols)) |>
    gen.with(unlist)
}

We then write a generator for creating the values for a single column. We’ll keep things simple, and suppose we only use integer values. We don’t want to allow arbitrary vectors of integers, because this allows a high degree of redundancy. When columns get passed into discover, all that matters is which values are equal: for example, these tuples are all equivalent under relabelling of the unique values: \[(1, 6, 4, 1, 1, 6), \, (1, 2, 3, 1, 1, 2), \, (3, 1, 2, 3, 3, 1).\]

We avoid this redundancy by requiring the unique values to start from 1, and increment in order of appearance. We implement this requirement by generating the values from a Chinese restaurant process (with concentration 1). We first generate a vector of indices, where an index value is between 1 and its own position. If it’s equal to its own position, the result number at that index is a new unique value. If it’s less than its own position, then the result number is equal to the number at the given index.

to_crp <- function(x) {
  n <- length(x)
  res <- rep(NA_integer_, n)
  count <- 0L
  for (index in seq_len(n)) {
    res[[index]] <- res[[x[[index]]]]
    if (is.na(res[[index]])) {
      count <- count + 1L
      res[[index]] <- count
    }
  }
  res
}
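As a quick sanity check of the index encoding, here’s a worked example (repeating to_crp so the snippet is self-contained): indices 1 and 2 point at their own positions, so they start new values; index 3 copies position 1, and indices 4 and 5 copy position 3.

```r
# As defined above: an index equal to its own position starts a new
# unique value; a smaller index copies the value at that position.
to_crp <- function(x) {
  n <- length(x)
  res <- rep(NA_integer_, n)
  count <- 0L
  for (index in seq_len(n)) {
    res[[index]] <- res[[x[[index]]]]
    if (is.na(res[[index]])) {
      count <- count + 1L
      res[[index]] <- count
    }
  }
  res
}

to_crp(c(1, 2, 1, 3, 3))  # 1 2 1 1 1
to_crp(c(1, 2, 3))        # 1 2 3: all-new values
to_crp(c(1, 1, 1))        # 1 1 1: a constant column
```

Whatever indices we generate, the result’s unique values always start from 1 and increment in order of appearance.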

gen.crp <- function(n) {
  lapply(seq_len(n), gen.int) |>
    gen.with(unlist) |>
    gen.with(to_crp)
}

We then take each column’s generated values, and combine them together into a data frame.

gen.data.frame_from_dims <- function(dims) {
  gen.list(gen.crp(dims[[1]]), of = dims[[2]]) |>
    gen.with(as.data.frame) |>
    gen.with(\(x) setNames(x, letters[seq_len(dims[[2]])]))
}

Finally, we compose the dimension and data frame generation together, remove duplicate rows from the result, and sort the rows, permuting values afterwards to keep unique value appearances in order. The duplicate rows get dropped by discover anyway, but if we get a failing test case we don’t want to see them, since they’re irrelevant.

sort_and_revalue <- function(x) {
  x <- x[do.call(order, x), , drop = FALSE]
  x[] <- lapply(x, \(y) match(y, unique(y)))
  row.names(x) <- NULL
  x
}
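Here’s a small made-up example of what this normalisation does (repeating sort_and_revalue so the snippet stands alone): after sorting the rows, column values are relabelled so that each column’s unique values again appear in increasing order.

```r
# As defined above: sort rows, then relabel each column's values
# by order of first appearance.
sort_and_revalue <- function(x) {
  x <- x[do.call(order, x), , drop = FALSE]
  x[] <- lapply(x, \(y) match(y, unique(y)))
  row.names(x) <- NULL
  x
}

x <- data.frame(a = c(3, 1, 3), b = c(2, 2, 1))
sort_and_revalue(x)
# row order becomes (1,2), (3,1), (3,2);
# relabelling then gives a = 1,2,2 and b = 1,2,1
```

The relabelled result is equivalent to the input with respect to functional dependencies, but canonical, which keeps shrunk counterexamples easy to read.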

gen.data.frame <- function(rows = 10, cols = 10) {
  gen.dims(rows, cols) |>
    gen.and_then(gen.data.frame_from_dims) |>
    gen.with(unique) |>
    gen.with(sort_and_revalue)
}

Let’s see an example of a data frame generation. I’m setting the seed first: generator results are random, and I want the text to match the displayed results.

set.seed(2025-11-08)
gen.data.frame(5, 3)
## Hedgehog generator:
## Example:
##   a
## 1 1
## 2 2
## Initial shrinks:
##   a
## 1 1
##   a
## 1 1
## 2 2
##   a
## 1 1
##   a
## 1 1
## 2 2

The “shrinks” are simplifications of the original generated example. The idea is that, if a test case fails, the framework first attempts to find a simpler test case that still fails, so the user sees an attempt at a minimal failing example.

Shrinking is part of why I want to avoid redundancy in the column value generator: if we don’t, we can get situations where a column value is “shrunk” to something that’s equivalent with respect to decomposition. We still see some equivalent shrinks here, due to using unique, but fewer than we would have otherwise.

Test property and iteration

With our main test case generator ready, we can think about properties to test. For now, we just kick things off by testing that normalising a data frame always results in a connected schema. We know that this should fail: I showed some failing cases last time.

try(forall(
  gen.data.frame(5, 8) |>
    gen.with(\(x) {
      fds <- discover(x)
      list(data = x, fds = fds, schema = normalise(fds))
    }),
  function(data, fds, schema) expect_true(is_connected(schema)),
  curry = TRUE
))
## Error : Falsifiable after 5 tests, and 6 shrinks
## <expectation_failure/expectation/error/condition>
## is_connected(schema) is not TRUE
## 
## `actual`:   FALSE
## `expected`: TRUE 
## Backtrace:
##      ▆
##   1. ├─base::try(...)
##   2. │ └─base::tryCatch(...)
##   3. │   └─base (local) tryCatchList(expr, classes, parentenv, handlers)
##   4. │     └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
##   5. │       └─base (local) doTryCatch(return(expr), name, parentenv, handler)
##   6. └─hedgehog::forall(...)
##   7.   └─hedgehog:::run.prop(property, counterexample$smallest, curry)
##   8.     ├─base::tryCatch(...)
##   9.     │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
##  10.     │   └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
##  11.     │     └─base (local) doTryCatch(return(expr), name, parentenv, handler)
##  12.     ├─base::withCallingHandlers(...)
##  13.     ├─base::do.call(property, arguments)
##  14.     └─global `<fn>`(data = `<df[,2]>`, fds = `<fnctnl_d>`, schema = `<dtbs_sch>`)
##  15.       └─testthat::expect_true(is_connected(schema))
## Counterexample:
## $data
##   a b
## 1 1 1
## 2 2 1
## 
## $fds
## 1 functional dependency
## 2 attributes: a, b
##  -> b
## 
## $schema
## database schema with 2 relation schemas
## 2 attributes: a, b
## schema constants: b
##   key 1: 
## schema a: a
##   key 1: a
## no references

testthat returning the stack trace makes the error message far more verbose than it needs to be, but we can see that the test does fail. In fact, it fails on a very simple case that we might not have thought of. If any of the variables are constant – i.e. always take the same value – then they are put into a relation with an empty key, which is isolated from the rest of the schema as a special case. This trivially makes the schema non-connected.

If we rejoin the schema to get a flat table, then we do need to join the constants at some point, but I don’t think that showing this explicitly would be very useful. Since constants are already treated as a special case, I think that it’s acceptable to ignore them with respect to connectivity, so we rewrite is_connected to allow ignoring relations with an empty key.

all_nonempty <- function(x) {
  all(lengths(x) > 0)
}

is_connected <- function(x, const = TRUE) {
  stopifnot(inherits(x, c("database_schema", "database")))

  used <- if (const)
    rep(TRUE, length(x))
  else
    vapply(keys(x), all_nonempty, logical(1))

  connections <- adjacency(x[used])
  filled <- fixed_point(connections, \(x) x %*% connections > 0, identical)
  any(apply(filled, 1, all))
}

try(forall(
  gen.data.frame(5, 8) |>
    gen.with(\(x) {
      fds <- discover(x)
      list(data = x, fds = fds, schema = normalise(fds))
    }),
  function(data, fds, schema) expect_true(is_connected(schema, const = FALSE)),
  curry = TRUE
))
## Error : Falsifiable after 1 tests, and 1 shrinks
## <expectation_failure/expectation/error/condition>
## is_connected(schema, const = FALSE) is not TRUE
## 
## `actual`:   FALSE
## `expected`: TRUE 
## Backtrace:
##      ▆
##   1. ├─base::try(...)
##   2. │ └─base::tryCatch(...)
##   3. │   └─base (local) tryCatchList(expr, classes, parentenv, handlers)
##   4. │     └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
##   5. │       └─base (local) doTryCatch(return(expr), name, parentenv, handler)
##   6. └─hedgehog::forall(...)
##   7.   └─hedgehog:::run.prop(property, counterexample$smallest, curry)
##   8.     ├─base::tryCatch(...)
##   9.     │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
##  10.     │   └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
##  11.     │     └─base (local) doTryCatch(return(expr), name, parentenv, handler)
##  12.     ├─base::withCallingHandlers(...)
##  13.     ├─base::do.call(property, arguments)
##  14.     └─global `<fn>`(data = `<df[,1]>`, fds = `<fnctnl_d>`, schema = `<dtbs_sch>`)
##  15.       └─testthat::expect_true(is_connected(schema, const = FALSE))
## Counterexample:
## $data
##   a
## 1 1
## 
## $fds
## 1 functional dependency
## 1 attribute: a
##  -> a
## 
## $schema
## database schema with 1 relation schema
## 1 attribute: a
## schema constants: a
##   key 1: 
## no references

This new failing case is also simple: everything is constant, so there are no relations to check connectivity between. This should count as being connected, so I clearly didn’t write the connection function correctly. Let’s try again.

is_connected <- function(x, const = TRUE) {
  stopifnot(inherits(x, c("database_schema", "database")))
  
  used <- if (const)
    rep(TRUE, length(x))
  else
    vapply(keys(x), all_nonempty, logical(1))
  if (sum(used) == 0)
    return(TRUE)

  connections <- adjacency(x[used])
  filled <- fixed_point(connections, \(x) x %*% connections > 0, identical)
  any(apply(filled, 1, all))
}

is_connected(autodb(data.frame(a = 1L)))
## [1] TRUE

That’s better. We can adjust the property test accordingly.

try(forall(
  gen.data.frame(5, 8) |>
    gen.with(\(x) {
      fds <- discover(x)
      list(data = x, fds = fds, schema = normalise(fds))
    }),
  function(data, fds, schema) expect_true(is_connected(schema, const = FALSE)),
  curry = TRUE
))
## Error : Falsifiable after 3 tests, and 9 shrinks
## <expectation_failure/expectation/error/condition>
## is_connected(schema, const = FALSE) is not TRUE
## 
## `actual`:   FALSE
## `expected`: TRUE 
## Backtrace:
##      ▆
##   1. ├─base::try(...)
##   2. │ └─base::tryCatch(...)
##   3. │   └─base (local) tryCatchList(expr, classes, parentenv, handlers)
##   4. │     └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
##   5. │       └─base (local) doTryCatch(return(expr), name, parentenv, handler)
##   6. └─hedgehog::forall(...)
##   7.   └─hedgehog:::run.prop(property, counterexample$smallest, curry)
##   8.     ├─base::tryCatch(...)
##   9.     │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
##  10.     │   └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
##  11.     │     └─base (local) doTryCatch(return(expr), name, parentenv, handler)
##  12.     ├─base::withCallingHandlers(...)
##  13.     ├─base::do.call(property, arguments)
##  14.     └─global `<fn>`(data = `<df[,7]>`, fds = `<fnctnl_d>`, schema = `<dtbs_sch>`)
##  15.       └─testthat::expect_true(is_connected(schema, const = FALSE))
## Counterexample:
## $data
##   a b c d e f g
## 1 1 1 1 1 1 1 1
## 2 1 1 1 1 2 1 2
## 3 1 1 1 1 2 2 2
## 4 1 1 1 2 1 3 2
## 
## $fds
## 9 functional dependencies
## 7 attributes: a, b, c, d, e, f, g
##      -> a
##      -> b
##      -> c
##    f -> d
## e, g -> d
## f, g -> e
## d, g -> e
## e, f -> g
## d, e -> g
## 
## $schema
## database schema with 4 relation schemas
## 7 attributes: a, b, c, d, e, f, g
## schema constants: a, b, c
##   key 1: 
## schema f: f, d
##   key 1: f
## schema d_e: d, e, g
##   key 1: d, e
##   key 2: d, g
##   key 3: e, g
## schema e_f: e, f
##   key 1: e, f
## references:
## e_f.{f} -> f.{f}

The new failing case is larger, but most of it is unnecessary. The constant variables don’t matter, but the default shrinking for our current generator doesn’t get rid of them. I don’t want to be mucking around with writing custom shrinks, so, to keep things simple, we can remove constants as a final step in generation.

crp_nonconst <- function(x) {
  x <- x[vapply(x, \(y) any(y != 1), logical(1))]
  names(x) <- letters[seq_len(ncol(x))]
  x 
}

try(forall(
  gen.data.frame(5, 8) |>
    gen.with(\(x) {
      x_nonconst <- crp_nonconst(x)
      fds <- discover(x_nonconst)
      list(data = x_nonconst, fds = fds, schema = normalise(fds))
    }),
  function(data, fds, schema) expect_true(is_connected(schema, const = FALSE)),
  curry = TRUE
))
## Error : Falsifiable after 48 tests, and 14 shrinks
## <expectation_failure/expectation/error/condition>
## is_connected(schema, const = FALSE) is not TRUE
## 
## `actual`:   FALSE
## `expected`: TRUE 
## Backtrace:
##      ▆
##   1. ├─base::try(...)
##   2. │ └─base::tryCatch(...)
##   3. │   └─base (local) tryCatchList(expr, classes, parentenv, handlers)
##   4. │     └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
##   5. │       └─base (local) doTryCatch(return(expr), name, parentenv, handler)
##   6. └─hedgehog::forall(...)
##   7.   └─hedgehog:::run.prop(property, counterexample$smallest, curry)
##   8.     ├─base::tryCatch(...)
##   9.     │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
##  10.     │   └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
##  11.     │     └─base (local) doTryCatch(return(expr), name, parentenv, handler)
##  12.     ├─base::withCallingHandlers(...)
##  13.     ├─base::do.call(property, arguments)
##  14.     └─global `<fn>`(data = `<df[,4]>`, fds = `<fnctnl_d>`, schema = `<dtbs_sch>`)
##  15.       └─testthat::expect_true(is_connected(schema, const = FALSE))
## Counterexample:
## $data
##   a b c d
## 1 1 1 1 1
## 2 1 1 2 2
## 3 1 2 2 2
## 4 2 3 1 2
## 
## $fds
## 6 functional dependencies
## 4 attributes: a, b, c, d
## c, d -> a
##    b -> a
## b, d -> c
## a, d -> c
## b, c -> d
## a, c -> d
## 
## $schema
## database schema with 3 relation schemas
## 4 attributes: a, b, c, d
## schema b: b, a
##   key 1: b
## schema a_c: a, c, d
##   key 1: a, c
##   key 2: a, d
##   key 3: c, d
## schema b_c: b, c
##   key 1: b, c
## references:
## b_c.{b} -> b.{b}

This is our first significant failing test case.

The only way to join the schema components together with a pre-decomposition is via a decomposition on the shared attributes \(\{a, c\}\), which are a key for a_c. We could visualise this as below, as I’d originally intended:

The dashed arrow with white head indicates that the pre-decomposition includes the target, plus all of its parents joined in, as one of the relations it’s decomposed into.

However, I don’t think that this visualisation precisely reflects the intended meaning, which is that the pre-decomposition of relations b_c and b has a foreign key reference to relation a_c. This is not conveyed above, since we refer to a_c and b_c in the same manner. Let’s display them a little differently:

We’ve made two changes to the pre-decomposition, to make it more informative:

  • it is now a pre-decomposition only of b_c (and b as a parent), not a_c, so it doesn’t include attribute d;
  • it has a foreign key reference to a_c instead of including it, conveying that b_c and b together refer to a_c before they’re split apart. The arrows are still dashed, to indicate a view is involved, but the black heads indicate a foreign key reference, as usual.

I’m happier with this approach. We now have an idea of what information we need to store for the view:

  • names of the relations it’s a pre-decomposition of
  • attributes within the pre-decomposition (implied by the above, but stating them explicitly allows for visualisation of the below)
  • foreign key references to those relations (those with no reference are used directly)

Next time, we’ll start writing a simple implementation.

Code

Here’s all the relevant code so far, ignoring the random seed:

library(autodb)
library(hedgehog)

fixed_point <- function(init, step, endif) {
  old <- init
  current <- step(init)
  while (!endif(old, current)) {
    old <- current
    current <- step(current)
  }
  current
}

adjacency <- function(x) {
  stopifnot(inherits(x, c("database_schema", "database")))

  nms <- names(x)
  len <- length(x)
  refs <- references(x)
  adj <- matrix(
    FALSE,
    nrow = len,
    ncol = len,
    dimnames = list(nms, nms)
  )
  from <- vapply(refs, \(r) match(r[[1]], nms), integer(1))
  to <- vapply(refs, \(r) match(r[[3]], nms), integer(1))
  for (n in seq_along(refs))
    adj[[from[[n]], to[[n]]]] <- TRUE
  diag(adj) <- TRUE
  adj
}

all_nonempty <- function(x) {
  all(lengths(x) > 0)
}

is_connected <- function(x, const = TRUE) {
  stopifnot(inherits(x, c("database_schema", "database")))
  
  used <- if (const)
    rep(TRUE, length(x))
  else
    vapply(keys(x), all_nonempty, logical(1))
  if (sum(used) == 0)
    return(TRUE)

  connections <- adjacency(x[used])
  filled <- fixed_point(connections, \(x) x %*% connections > 0, identical)
  any(apply(filled, 1, all))
}

gen.dims <- function(rows, cols) {
  list(gen.int(rows), gen.int(cols)) |>
    gen.with(unlist)
}

to_crp <- function(x) {
  n <- length(x)
  res <- rep(NA_integer_, n)
  count <- 0L
  for (index in seq_len(n)) {
    res[[index]] <- res[[x[[index]]]]
    if (is.na(res[[index]])) {
      count <- count + 1L
      res[[index]] <- count
    }
  }
  res
}

gen.crp <- function(n) {
  lapply(seq_len(n), gen.int) |>
    gen.with(unlist) |>
    gen.with(to_crp)
}

gen.data.frame_from_dims <- function(dims) {
  gen.list(gen.crp(dims[[1]]), of = dims[[2]]) |>
    gen.with(\(x) {
      as.data.frame(x) |>
        setNames(letters[seq_len(dims[[2]])])
    })
}

sort_and_revalue <- function(x) {
  x <- x[do.call(order, x), , drop = FALSE]
  x[] <- lapply(x, \(y) match(y, unique(y)))
  row.names(x) <- NULL
  x
}

gen.data.frame <- function(rows = 10, cols = 10) {
  gen.dims(rows, cols) |>
    gen.and_then(gen.data.frame_from_dims) |>
    gen.with(unique) |>
    gen.with(sort_and_revalue)
}

crp_nonconst <- function(x) {
  x <- x[vapply(x, \(y) any(y != 1), logical(1))]
  names(x) <- letters[seq_len(ncol(x))]
  x  
}

try(forall(
  gen.data.frame(5, 8) |>
    gen.with(\(x) {
      fds <- discover(x)
      list(data = x, fds = fds, schema = normalise(fds))
    }),
  function(data, fds, schema) expect_true(is_connected(schema)),
  curry = TRUE
))

try(forall(
  gen.data.frame(5, 8) |>
    gen.with(\(x) {
      x_nonconst <- crp_nonconst(x)
      fds <- discover(x_nonconst)
      list(data = x_nonconst, fds = fds, schema = normalise(fds))
    }),
  function(data, fds, schema) expect_true(is_connected(schema, const = FALSE)),
  curry = TRUE
))

Session information

sessionInfo()
## R version 4.4.3 (2025-02-28 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8 
## [2] LC_CTYPE=English_United Kingdom.utf8   
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.utf8    
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] hedgehog_0.1   testthat_3.2.3 autodb_3.2.3  
## 
## loaded via a namespace (and not attached):
##  [1] vctrs_0.6.5        crayon_1.5.3       cli_3.6.4          knitr_1.50        
##  [5] rlang_1.1.5        xfun_0.51          diffobj_0.3.5      jsonlite_1.9.1    
##  [9] glue_1.8.0         htmltools_0.5.8.1  sass_0.4.9         brio_1.1.5        
## [13] rmarkdown_2.29     evaluate_1.0.3     jquerylib_0.1.4    visNetwork_2.1.2  
## [17] fastmap_1.2.0      yaml_2.3.10        lifecycle_1.0.4    bookdown_0.42     
## [21] DiagrammeR_1.0.11  compiler_4.4.3     RColorBrewer_1.1-3 waldo_0.6.1       
## [25] htmlwidgets_1.6.4  rstudioapi_0.17.1  blogdown_1.21      digest_0.6.37     
## [29] R6_2.6.1           pillar_1.10.1      magrittr_2.0.3     bslib_0.9.0       
## [33] tools_4.4.3        cachem_1.1.0
Mark Webster
Data Scientist

Probability and Statistics, with some programming in R.
