Adding "views" to autodb, part three: summarising the schema

Last updated on 3 December 2025

Last time, we wrote a function to check whether a given database schema is connected – ignoring relations for constants, as a special case – and wrote a property test to show that the schemas generated by normalise and autodb aren’t always connected. Our goal is to add virtual “pre-decomposition” relations to make the schema connected; I’m referring to them as “views”, which is a misnomer, but is much shorter to write.

We start with the code listing provided at the end of the last post. We had the following failing test for schemas being connected:

try(forall(
  gen.data.frame(5, 8) |>
    gen.with(\(x) {
      x_nonconst <- crp_nonconst(x)
      fds <- discover(x_nonconst)
      list(data = x_nonconst, fds = fds, schema = normalise(fds))
    }),
  function(data, fds, schema) expect_true(is_connected(schema, const = FALSE)),
  curry = TRUE
))

We saw this simple failing test case:

test_db <- database_schema(
  relation_schema(
    list(
      b = list(c("a", "b"), list("b")),
      a_c = list(c("a", "c", "d"), list(c("a", "c"), c("a", "d"), c("c", "d"))),
      b_c = list(c("b", "c"), list(c("b", "c")))
    ),
    letters[1:4]
  ),
  list(list("b_c", "b", "b", "b"))
)
show <- function(x) DiagrammeR::grViz(gv(x), width = "100%")
show(test_db)

To add the view that makes this schema connected, we want to implement a function called ensure_connected, to keep similar naming to the ensure lossless argument in autodb, which can also add a relation. This should take a database schema or database, and return one with the additional view information. We’ll then worry about plotting it.

I’m upping the number of cases run per property tests, just to make sure we catch everything:

options(hedgehog.tests = 400, hedgehog.shrinks = 200)

Add attributes for new information

To avoid getting too much into autodb class internals, I think that the simplest approach is to add the new information as extra attributes. Here are the attributes currently used by the two relevant classes:

empty_ds <- database_schema(
  relation_schema(setNames(list(), character()), character()),
  list()
)
names(attributes(empty_ds)) # database_schema

## [1] "names"       "attrs_order" "class"       "references"

names(attributes(create(empty_ds))) # database

## [1] "names"       "attrs_order" "class"       "references"

We can add views to a views attribute, references to the relations they decompose into as decomp_references, and foreign key references from views as virtual_references. This is rather messy, and I would want to change references to contain all the different types of reference instead, but that’s a change that’s too involved for the blog.

To allow for these new attributes, we need to modify our property test to use the function we’ll be writing:

ensure_connected <- identity # placeholder
set.seed(1)
try(forall(
  gen.data.frame(5, 8) |>
    gen.with(\(x) {
      x_nonconst <- crp_nonconst(x)
      fds <- discover(x_nonconst)
      list(
        data = x_nonconst,
        fds = fds,
        schema = normalise(fds)
      )
    }),
  function(data, fds, schema) expect_true(is_connected(
    ensure_connected(schema),
    const = FALSE
  )),
  curry = TRUE
))

## Error : Falsifiable after 12 tests, and 10 shrinks
## <expectation_failure/expectation/error/condition>
## Expected `is_connected(ensure_connected(schema), const = FALSE)` to be TRUE.
## Differences:
## `actual`:   FALSE
## `expected`: TRUE 
## 
## Backtrace:
##      ▆
##   1. ├─base::try(...)
##   2. │ └─base::tryCatch(...)
##   3. │   └─base (local) tryCatchList(expr, classes, parentenv, handlers)
##   4. │     └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
##   5. │       └─base (local) doTryCatch(return(expr), name, parentenv, handler)
##   6. └─hedgehog::forall(...)
##   7.   └─hedgehog:::run.prop(property, counterexample$smallest, curry)
##   8.     ├─base::tryCatch(...)
##   9.     │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
##  10.     │   └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
##  11.     │     └─base (local) doTryCatch(return(expr), name, parentenv, handler)
##  12.     ├─base::withCallingHandlers(...)
##  13.     ├─base::do.call(property, arguments)
##  14.     └─global `<fn>`(data = `<df[,4]>`, fds = `<fnctnl_d>`, schema = `<dtbs_sch>`)
##  15.       └─testthat::expect_true(...)
## Counterexample:
## $data
##   a b c d
## 1 1 1 1 1
## 2 1 1 1 2
## 3 2 1 2 3
## 4 2 2 1 2
## 
## $fds
## 6 functional dependencies
## 4 attributes: a, b, c, d
## b, d -> a
## b, c -> a
## a, d -> b
## a, c -> b
##    d -> c
## a, b -> c
## 
## $schema
## database schema with 3 relation schemas
## 4 attributes: a, b, c, d
## schema d: d, c
##   key 1: d
## schema a_b: a, b, c
##   key 1: a, b
##   key 2: a, c
##   key 3: b, c
## schema a_d: a, d
##   key 1: a, d
## references:
## a_d.{d} -> d.{d}

As we’d expect, it still fails, because we haven’t done anything substantive.

Calculate accumulated attributes

Adding the required views is a large task that I don’t want to do all at once, so for the rest of the post I want to focus on collecting all the summary information we need to decide which views to add.

Previously, we defined a component in the schema as being a non-child relation – i.e. one with no other relations referring to it with a foreign key – and all of its ancestors with respect to foreign key references.

I think that we’re interested in three pieces of information:

which attributes a relation has after joining it to its ancestors (i.e. its pre-decomposition attributes);
which components the relation belongs to (plural because the foreign key references create a directed graph);
whether the relation is a non-parent for a component (i.e. has no children referring to it).

Collecting all of this is a less trivial function than is_connected, so we’ll start with writing a property test for it. Here are some expected properties that I’d expect:

post-join, every relation contains exactly the attributes of its ancestors and itself;
conversely, every relation’s attributes appear in all of its descendants post-join;
every relation is in at least one component;
every non-parent is in exactly one component;
every component contains exactly one non-parent.

To write these properties, I assume that we return the relation-component pairings as a matrix.

component_info <- function(x) NULL # placeholder
prop_component_info <- function(x) {
  ci <- component_info(x)

  # expected schema-attribute pairs post-join
  ao <- attrs_order(x)
  lao <- length(ao)
  as <- attrs(x)
  attr_matrix <- t(outer(ao, as, Vectorize(is.element)))
  adj <- adjacency(x)
  family <- fixed_point(adj, \(x) x %*% adj > 0, identical)
  self_and_ancestor_attrs <- family %*% attr_matrix > 0

  # actual schema-attribute pairs post-join
  postjoin_attrs <- t(outer(ao, ci$attrs, Vectorize(is.element)))

  component_nonparents <- which(ci$components, arr.ind = TRUE)
  nonparent_components <- ci$components[colnames(ci$components), , drop = FALSE]
  
  constant_relations <- vapply(
    keys(x),
    \(ks) any(lengths(ks) == 0),
    logical(1)
  )
  nonconstant_components <- ci$components[!constant_relations, , drop = FALSE]
  nonconstant_components <- nonconstant_components[
    ,
    colSums(nonconstant_components) > 0,
    drop = FALSE
  ]

  errors <- c(
    "relations do not have exactly the attributes of themselves and their ancestors",
    "relations do not transfer all their attributes to component nonparents",
    "there are relations with no component",
    "there are nonparents with no or multiple components",
    "there are components with no or multiple nonparents"
  )

  checks <- c(
    all(postjoin_attrs == self_and_ancestor_attrs),
    all(apply(
      component_nonparents,
      1,
      \(x) all(as[[x[[1]]]] %in% ci$attrs[ci$nonparents][[x[[2]]]])
    )),
    all(rowSums(ci$components) > 0),
    all(rowSums(nonparent_components) == 1),
    all(colSums(nonparent_components) == 1)
  )

  if (all(checks))
    succeed()
  else
    fail(paste("-", errors[!checks], collapse = "\n"))
}

I keep all the properties in a single function: the individual properties are tested against the same generator, so we might as well check all of them against every generated test case.

Since the property is written as a testthat expectation function, this does mean writing an expectation function that correctly collates all of its sub-expectation results. I don’t think the way that I’ve done it above is the approach that’s intended by the testthat package authors – they seem to like using quasiquotation, and I find code that uses it very hard to read – but explicitly pasting error messages together keeps the code reasonably simple, so it’s good enough for me.

We should be able to calculate the summary information from the extended adjacency matrix, as we calculated in is_connected with adjacency and fixed_point:

component_info <- function(x) {
  adj <- adjacency(x)
  filled <- fixed_point(adj, \(x) x %*% adj > 0, identical)
  nonparents <- which(colSums(filled) == 1) # only referenced by itself
  list(
    attrs = apply(
      filled,
      1,
      function(ancs) {
        Reduce(c, attrs(x)[ancs], init = character()) |>
          unique()
      },
      simplify = FALSE
    ),
    nonparents = nonparents,
    components = t(filled[nonparents, , drop = FALSE])
  )
}

This satisfies all of our expected properties:

forall(
  gen.data.frame(5, 8) |>
    gen.with(autodb),
  prop_component_info
)

Here’s the summary information for our original failing test case:

component_info(test_db)

## $attrs
## $attrs$b
## [1] "b" "a"
## 
## $attrs$a_c
## [1] "a" "c" "d"
## 
## $attrs$b_c
## [1] "b" "a" "c"
## 
## 
## $nonparents
## a_c b_c 
##   2   3 
## 
## $components
##       a_c   b_c
## b   FALSE  TRUE
## a_c  TRUE FALSE
## b_c FALSE  TRUE

Generate more general schemas

I’m not sure that this approach is correct, though. If this should work for any given schema, then generating a schema from a data frame might leave off some important test cases.

With that in mind, I want a more direct schema generator. The below implementation is a quick and dirty, simplified version of what I use in the package.

First, we decide how many attributes we have, and assume their names are single letters:

gen.attrs <- function(max_attrs) {
  stopifnot(max_attrs <= length(letters))
  gen.int(max_attrs) |>
    gen.with(\(n) letters[seq_len(n)])
}

Once we have those attributes, we can generate each relation as containing a subset of those attributes. Its keys are then subsets of that subset, which don’t contain each other.

remove_supersets <- function(x) {
  x <- unique(x)
  subsets <- outer(
    x,
    x,
    Vectorize(\(sub, sup) !identical(sub, sup) && all(is.element(sub, sup)))
  )
  x[apply(subsets, 1, Negate(any))]
}

gen.relation_pair_from_attrs <- function(attrs, max_keys) {
  gen.subsequence(attrs) |>
    gen.and_then(\(as) {
      list(
        gen.pure(as),
        gen.subsequence(as) |>
          gen.list(from = 1, to = max_keys) |>
          gen.with(remove_supersets)
      )
    })
}

We generate a random number of these relation pairs, use them to make a relation schema, and use autoref to calculate plausible foreign key references between them.

add_relation_names <- function(x) {
  setNames(
    x,
    vapply(
      x,
      \(rs) if (length(rs[[2]][[1]]) == 0)
        "constants"
      else
        paste(rs[[2]][[1]], collapse = "_"),
      character(1)
    ) |>
      make.unique()
  )
}

gen.database_schema <- function(max_attrs, max_size, max_keys) {
  gen.attrs(max_attrs) |>
    gen.and_then(\(attrs) {
      gen.relation_pair_from_attrs(attrs, max_keys) |>
        gen.list(from = 0, to = max_size) |>
        gen.with(add_relation_names) |>
        gen.with(\(x) relation_schema(x, attrs))
    }) |>
    gen.with(autoref)
}

This doesn’t remove as much redundancy as the data frame generator does, but it works.

We can now see if these more general schemas still satisfy the component_info properties.

try(forall(
  gen.database_schema(5, 5, 3),
  prop_component_info
))

## Error : Falsifiable after 2 tests, and 200 shrinks
## <simpleError in family %*% attr_matrix: requires numeric/complex matrix/vector arguments>
## Counterexample:
## database schema with 0 relation schemas
## 1 attribute: a
## no references

We see that calling prop_component_info on the empty schema throws an error, specifically when we look for each relation’s post-join attributes:

  self_and_ancestor_attrs <- family %*% attr_matrix > 0

This turns out to be because, if there are no relations, attr_matrix is a matrix of list elements, rather than primitives, due to how outer handles empty inputs, so it can’t be used in matrix multiplication:

`dim<-`(t(outer("a", list(), Vectorize(is.element))), 0)

## list()

We didn’t find this earlier because I was lazy when writing the data frame generator: it generates the number of rows and columns using gen.int, where gen.int(n) generates an integer between 1 and n. This ignores the empty case, and unfortunately base R function have various issues concerning empty cases.

If we allow for an empty dimension in the generator, we should find the same problem when generating a data frame.

gen.dims <- function(rows, cols) {
  list(gen.element(seq_len(rows + 1) - 1), gen.element(seq_len(cols + 1) - 1)) |>
    gen.with(unlist)
}

try(forall(
  gen.data.frame(5, 8) |>
    gen.with(autodb),
  prop_component_info
))

## Error : Falsifiable after 7 tests, and 1 shrinks
## <simpleError in family %*% attr_matrix: requires numeric/complex matrix/vector arguments>
## Counterexample:
## database with 1 relation
## 0 attributes
## relation constants: ; 0 records
##   key 1: 
## no references

Voilà.

We can rewrite prop_component_info to handle these empty cases: we add empty-dimension cases for both uses of outer, and use less strict subsetting on the later call to apply.

prop_component_info <- function(x) {
  ci <- component_info(x)

  # expected schema-attribute pairs post-join
  ao <- attrs_order(x)
  lao <- length(ao)
  as <- attrs(x)
  attr_matrix <- if (length(x) == 0 || length(ao) == 0) {
    matrix(logical(), nrow = length(x), ncol = length(ao))
  }else{
    t(outer(ao, as, Vectorize(is.element)))
  }
  adj <- adjacency(x)
  family <- fixed_point(adj, \(x) x %*% adj > 0, identical)
  self_and_ancestor_attrs <- family %*% attr_matrix > 0

  # actual schema-attribute pairs post-join
  postjoin_attrs <- if (length(ao) == 0 || length(ci$attrs) == 0) {
    matrix(logical(), nrow = length(ci$attrs), ncol = length(ao))
  }else{
    t(outer(ao, ci$attrs, Vectorize(is.element)))
  }

  component_nonparents <- which(ci$components, arr.ind = TRUE)
  nonparent_components <- ci$components[colnames(ci$components), , drop = FALSE]

  constant_relations <- vapply(
    keys(x),
    \(ks) any(lengths(ks) == 0),
    logical(1)
  )
  nonconstant_components <- ci$components[!constant_relations, , drop = FALSE]
  nonconstant_components <- nonconstant_components[
    ,
    colSums(nonconstant_components) > 0,
    drop = FALSE
  ]

  errors <- c(
    "relations do not have exactly the attributes of themselves and their ancestors",
    "relations do not transfer all their attributes to component nonparents",
    "there are relations with no component",
    "there are nonparents with no or multiple components",
    "there are components with no or multiple nonparents"
  )

  checks <- c(
    all(postjoin_attrs == self_and_ancestor_attrs),
    all(apply(
      component_nonparents,
      1,
      \(x) all(unlist(as[x[[1]]]) %in% unlist(ci$attrs[ci$nonparents][x[[2]]]))
    )),
    all(rowSums(ci$components) > 0),
    all(rowSums(nonparent_components) == 1),
    all(colSums(nonparent_components) == 1)
  )

  if (all(checks))
    succeed()
  else
    fail(paste("-", errors[!checks], collapse = "\n"))
}

The property should now work again for schemas generated from data frames.

forall(
  gen.data.frame(5, 8) |>
    gen.with(autodb),
  prop_component_info
)

That’s better. Does the property hold for more general database schemas now?

try(forall(
  gen.database_schema(5, 5, 3) |>
    gen.with(\(ds) list(ds, component_info(ds))),
  \(ds, ci) prop_component_info(ds),
  curry = TRUE
))

## Error : Falsifiable after 4 tests, and 200 shrinks
## <expectation_failure/expectation/error/condition>
## - there are relations with no component
## Backtrace:
##      ▆
##   1. ├─base::try(...)
##   2. │ └─base::tryCatch(...)
##   3. │   └─base (local) tryCatchList(expr, classes, parentenv, handlers)
##   4. │     └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
##   5. │       └─base (local) doTryCatch(return(expr), name, parentenv, handler)
##   6. └─hedgehog::forall(...)
##   7.   └─hedgehog:::run.prop(property, counterexample$smallest, curry)
##   8.     ├─base::tryCatch(...)
##   9.     │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
##  10.     │   └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
##  11.     │     └─base (local) doTryCatch(return(expr), name, parentenv, handler)
##  12.     ├─base::withCallingHandlers(...)
##  13.     ├─base::do.call(property, arguments)
##  14.     └─global `<fn>`(`<dtbs_sch>`, `<named list>`)
##  15.       └─global prop_component_info(ds)
## Counterexample:
## [[1]]
## database schema with 4 relation schemas
## 4 attributes: a, b, c, d
## schema constants: 
##   key 1: 
## schema b_d: b, d
##   key 1: b, d
## schema b_d.1: b, d
##   key 1: b, d
## schema d: d, c
##   key 1: d
## references:
## b_d.1.{b, d} -> b_d.{b, d}
## b_d.{b, d} -> b_d.1.{b, d}
## 
## [[2]]
## [[2]]$attrs
## [[2]]$attrs$constants
## character(0)
## 
## [[2]]$attrs$b_d
## [1] "b" "d"
## 
## [[2]]$attrs$b_d.1
## [1] "b" "d"
## 
## [[2]]$attrs$d
## [1] "d" "c"
## 
## 
## [[2]]$nonparents
## constants         d 
##         1         4 
## 
## [[2]]$components
##           constants     d
## constants      TRUE FALSE
## b_d           FALSE FALSE
## b_d.1         FALSE FALSE
## d             FALSE  TRUE

No. We assumed that every connected component of a schema has a non-parent, but here we have a schema where the two relations both refer to each other, making a cycle with no non-parents:

The other two relations are superfluous, here: I clearly didn’t write the more general schema generator in a way that allows for efficient shrinking. Still, it’s at least giving simple failing test cases.

Handle reference cycles

How do we identify components if cycles can mean there’s no non-parent? Well, if there’s a cycle, we can collapse all the relations in that cycle into a single “node” in the reference graph. Once there are no cycles left, we can identify non-parents, but those non-parents might represent several relations in the original graph. This means that, where we originally gave a single non-parent for each component, we might give several, and we can no longer use the non-parent’s name as the component’s label.

component_info <- function(x) {
  adj <- adjacency(x)
  filled <- fixed_point(adj, \(x) x %*% adj > 0, identical)

  # find pairs of relations that reference each other
  bothref <- filled
  bothref <- bothref & t(bothref)
  cycle_groups <- apply(bothref, 1, \(x) which(x)[1])
  # relations in the same cycle should have the same extended adjacency
  stopifnot(tapply(
    apply(bothref, 1, identity, simplify = FALSE),
    cycle_groups,
    \(x) all(duplicated(x)[-1])
  ))
  cycle_filled <- filled[unique(cycle_groups), unique(cycle_groups), drop = FALSE] |>
    unname()

  nonparents <- unique(cycle_groups)[
    which(colSums(cycle_filled) == 1)
  ] |> # cycle is only referenced by itself
    lapply(\(x) which(cycle_groups == x))
  components <- t(filled[vapply(nonparents, `[[`, integer(1), 1), , drop = FALSE])
  # remove components names, since components no longer associated with single
  # nonparent relation
  colnames(components) <- NULL
  list(
    attrs = apply(
      filled,
      1,
      function(ancs) {
        Reduce(c, attrs(x)[ancs], init = character()) |>
          unique()
      },
      simplify = FALSE
    ),
    nonparents = nonparents,
    components = components
  )
}

We need to rewrite the property to account for this change. Specifically, rather than one nonparent per component, we now require at least one nonparent, and relations must transfer their attributes to all nonparents of the component post-join.

prop_component_info <- function(x) {
  ci <- component_info(x)

  # expected schema-attribute pairs post-join
  ao <- attrs_order(x)
  lao <- length(ao)
  as <- attrs(x)
  attr_matrix <- if (length(x) == 0 || length(ao) == 0) {
    matrix(logical(), nrow = length(x), ncol = length(ao))
  }else{
    t(outer(ao, as, Vectorize(is.element)))
  }
  adj <- adjacency(x)
  family <- fixed_point(adj, \(x) x %*% adj > 0, identical)
  self_and_ancestor_attrs <- family %*% attr_matrix > 0

  # actual schema-attribute pairs post-join
  postjoin_attrs <- if (length(ao) == 0 || length(ci$attrs) == 0) {
    matrix(logical(), nrow = length(ci$attrs), ncol = length(ao))
  }else{
    t(outer(ao, ci$attrs, Vectorize(is.element)))
  }

  component_nonparents <- which(ci$components, arr.ind = TRUE)
  nonparent_components <- ci$components[unlist(ci$nonparents), , drop = FALSE]

  constant_relations <- vapply(
    keys(x),
    \(ks) any(lengths(ks) == 0),
    logical(1)
  )
  nonconstant_components <- ci$components[!constant_relations, , drop = FALSE]
  nonconstant_components <- nonconstant_components[
    ,
    colSums(nonconstant_components) > 0,
    drop = FALSE
  ]

  errors <- c(
    "relations do not have exactly the attributes of themselves and their ancestors",
    "relations do not transfer all their attributes to component nonparents",
    "there are relations with no component",
    "there are nonparents with no or multiple components",
    "there are components with no nonparents"
  )

  checks <- c(
    all(postjoin_attrs == self_and_ancestor_attrs),
    all(apply(
      component_nonparents,
      1,
      \(x) {
        all(vapply(
          ci$nonparents[x[[2]]],
          \(y) all(unlist(as[x[[1]]]) %in% unlist(ci$attrs[y])),
          logical(1)
        ))
      }
    )),
    all(rowSums(ci$components) > 0),
    all(rowSums(nonparent_components) == 1),
    all(colSums(nonparent_components) >= 1)
  )

  if (all(checks))
    succeed()
  else
    fail(paste("-", errors[!checks], collapse = "\n"))
}

forall(
  gen.data.frame(5, 8) |>
    gen.with(autodb) |>
    gen.with(\(db) list(db, component_info(db))),
  \(db, ci) prop_component_info(db),
  curry = TRUE
)

forall(
  gen.database_schema(5, 5, 3) |>
    gen.with(\(ds) list(ds, component_info(ds))),
  \(ds, ci) prop_component_info(ds),
  curry = TRUE
)

The more general schemas now pass the properties too. As an example, here’s a somewhat-nasty generated case I saw when rewriting the property:

## database schema with 4 relation schemas
## 2 attributes: a, b
## schema a: a, b
##   key 1: a
## schema a.1: a
##   key 1: a
## schema constants: 
##   key 1: 
## schema a_b: a, b
##   key 1: a, b
## references:
## a.1.{a} -> a.{a}
## a_b.{a} -> a.{a}
## a.{a} -> a.1.{a}
## a.{a, b} -> a_b.{a, b}

And here’s the resulting component information:

component_info(x)

## $attrs
## $attrs$a
## [1] "a" "b"
## 
## $attrs$a.1
## [1] "a" "b"
## 
## $attrs$constants
## character(0)
## 
## $attrs$a_b
## [1] "a" "b"
## 
## 
## $nonparents
## $nonparents[[1]]
##   a a.1 a_b 
##   1   2   4 
## 
## $nonparents[[2]]
## constants 
##         3 
## 
## 
## $components
##            [,1]  [,2]
## a          TRUE FALSE
## a.1        TRUE FALSE
## constants FALSE  TRUE
## a_b        TRUE FALSE

The non-constant relations are connected by two cycles, so they’re in the same component and have the same attributes post-join. The component information is all correct, but the schema itself brings up some interesting points:

Looking at the cycle relations’ attributes and keys, it’s clear that relation a makes the other two redundant. This is not something we’d expect to see in a well-designed schema, but we can handle it if we do see it.
When joining components with views, the existence of cycles raises a question: if a component has several non-parents, a view should only point to one as part of its decomposition. How do we decide which one? I don’t want the order of the relations to matter, since a database is a tuple of relations with no implicit ordering. If the relations in the cycle have the same attributes and keys, then this is arbitrary, but the above example shows a case where they’re not the same. We could remove relations a.1 and a_b as redundant, but schema simplification is a separate operation that I don’t want to couple to ensuring connectedness. I’m not sure what I want to do yet.

Test more properties

After writing the above, I thought of additional property that I’d expect component_info should have: the number of found components is related to whether the schema is already connected. This has two versions, depending on whether we account for “constant” relations, i.e. those with an empty key.

Here are the new properties:

there are zero or one non-constant components (i.e. components with any non-constant relations) if and only if the schema is connected;
there are zero or one components if and only if the schema is connected, including constant relations (“constant-connected”).

We can add these to the property function.

prop_component_info <- function(x) {
  ci <- component_info(x)

  # expected schema-attribute pairs post-join
  ao <- attrs_order(x)
  lao <- length(ao)
  as <- attrs(x)
  attr_matrix <- if (length(x) == 0 || length(ao) == 0) {
    matrix(logical(), nrow = length(x), ncol = length(ao))
  }else{
    t(outer(ao, as, Vectorize(is.element)))
  }
  adj <- adjacency(x)
  family <- fixed_point(adj, \(x) x %*% adj > 0, identical)
  self_and_ancestor_attrs <- family %*% attr_matrix > 0

  # actual schema-attribute pairs post-join
  postjoin_attrs <- if (length(ao) == 0 || length(ci$attrs) == 0) {
    matrix(logical(), nrow = length(ci$attrs), ncol = length(ao))
  }else{
    t(outer(ao, ci$attrs, Vectorize(is.element)))
  }

  component_nonparents <- which(ci$components, arr.ind = TRUE)
  nonparent_components <- ci$components[unlist(ci$nonparents), , drop = FALSE]

  constant_relations <- vapply(
    keys(x),
    \(ks) any(lengths(ks) == 0),
    logical(1)
  )
  nonconstant_ci <- component_info(x[!constant_relations])
  
  errors <- c(
    "relations do not have exactly the attributes of themselves and their ancestors",
    "relations do not transfer all their attributes to component nonparents",
    "there are relations with no component",
    "there are nonparents with no or multiple components",
    "there are components with no nonparents",
    "there are zero or one components, but schema is not constant-connected",
    "schema is constant-connected, but has multiple components",
    "there are zero or one non-constant components, but schema is not connected",
    "schema is connected, but has multiple non-constant components"
  )

  checks <- c(
    all(postjoin_attrs == self_and_ancestor_attrs),
    all(apply(
      component_nonparents,
      1,
      \(x) {
        all(vapply(
          ci$nonparents[x[[2]]],
          \(y) all(unlist(as[x[[1]]]) %in% unlist(ci$attrs[y])),
          logical(1)
        ))
      }
    )),
    all(rowSums(ci$components) > 0),
    all(rowSums(nonparent_components) == 1),
    all(colSums(nonparent_components) >= 1),
    ncol(ci$components) > 1 || is_connected(x, const = TRUE),
    !is_connected(x, const = TRUE) || ncol(ci$components) <= 1,
    ncol(nonconstant_ci$components) > 1 || is_connected(x, const = FALSE),
    !is_connected(x, const = FALSE) || ncol(nonconstant_ci$components) <= 1
  )

  if (all(checks))
    succeed()
  else
    fail(paste("-", errors[!checks], collapse = "\n"))
}

forall(
  gen.data.frame(5, 8) |>
    gen.with(autodb) |>
    gen.with(\(db) list(db, component_info(db))),
  \(db, ci) prop_component_info(db),
  curry = TRUE
)

forall(
  gen.database_schema(5, 5, 3) |>
    gen.with(\(ds) list(ds, component_info(ds))),
  \(ds, ci) prop_component_info(ds),
  curry = TRUE
)

These new properties are passed, so we can be reasonably sure that the component information agrees with connectedness.

Refactor

We’ve finished testing and writing a non-trivial piece of the implementation, so let’s stop and simplify the code a bit.

Whenever we use adjacency to find an adjacency matrix, we immediately calculate the extended adjacency matrix with fixed_point, and don’t use the adjacency matrix for anything else. So, let’s write a function that just gives the extended adjacency matrix.

extended_adjacency <- function(x) {
  adj <- adjacency(x)
  fixed_point(adj, \(x) x %*% adj > 0, identical)
}

We can now use this, instead of adjacency and fixed_point, in is_connected, component_info, and prop_component_info. It more precisely conveys intent, and it means that we have one less intermediate variable lying around for us to wonder whether it’s used later.

Another repeated pattern we can get rid of is the two cases in prop_component_info where we had to add an edge case to outer, since it doesn’t handle empty inputs well. Let’s write a function to encapsulate that idea.

outer_element <- function(elements, sets) {
  if (length(sets) == 0 || length(elements) == 0) {
    matrix(logical(), nrow = length(sets), ncol = length(elements))
  }else{
    t(outer(elements, sets, Vectorize(is.element)))
  }
}

Additionally, prop_componment_info stores length(attrs_order(x)) as a variable for re-use, but doesn’t use it any more, so we can remove it.

The combined changes to the mentioned functions are shown below.

is_connected <- function(x, const = TRUE) {
  stopifnot(inherits(x, c("database_schema", "database")))
  
  used <- if (const)
    rep(TRUE, length(x))
  else
    vapply(keys(x), all_nonempty, logical(1))
  if (sum(used) == 0)
    return(TRUE)

  any(apply(extended_adjacency(x[used]), 1, all))
}

component_info <- function(x) {
  filled <- extended_adjacency(x)

  # find pairs of relations that reference each other
  bothref <- filled
  bothref <- bothref & t(bothref)
  cycle_groups <- apply(bothref, 1, \(x) which(x)[1])
  # relations in the same cycle should have the same extended adjacency
  stopifnot(tapply(
    apply(bothref, 1, identity, simplify = FALSE),
    cycle_groups,
    \(x) all(duplicated(x)[-1])
  ))
  cycle_filled <- filled[unique(cycle_groups), unique(cycle_groups), drop = FALSE] |>
    unname()

  nonparents <- unique(cycle_groups)[
    which(colSums(cycle_filled) == 1)
  ] |> # cycle is only referenced by itself
    lapply(\(x) which(cycle_groups == x))
  components <- t(filled[vapply(nonparents, `[[`, integer(1), 1), , drop = FALSE])
  # remove components names, since components no longer associated with single
  # nonparent relation
  colnames(components) <- NULL
  list(
    attrs = apply(
      filled,
      1,
      function(ancs) {
        Reduce(c, attrs(x)[ancs], init = character()) |>
          unique()
      },
      simplify = FALSE
    ),
    nonparents = nonparents,
    components = components
  )
}

prop_component_info <- function(x) {
  ci <- component_info(x)

  # expected schema-attribute pairs post-join
  ao <- attrs_order(x)
  as <- attrs(x)
  attr_matrix <- outer_element(ao, as)
  family <- extended_adjacency(x)
  self_and_ancestor_attrs <- family %*% attr_matrix > 0

  # actual schema-attribute pairs post-join
  postjoin_attrs <- outer_element(ao, ci$attrs)

  component_nonparents <- which(ci$components, arr.ind = TRUE)
  nonparent_components <- ci$components[unlist(ci$nonparents), , drop = FALSE]

  constant_relations <- vapply(
    keys(x),
    \(ks) any(lengths(ks) == 0),
    logical(1)
  )
  nonconstant_ci <- component_info(x[!constant_relations])
  
  errors <- c(
    "relations do not have exactly the attributes of themselves and their ancestors",
    "relations do not transfer all their attributes to component nonparents",
    "there are relations with no component",
    "there are nonparents with no or multiple components",
    "there are components with no nonparents",
    "there are zero or one components, but schema is not constant-connected",
    "schema is constant-connected, but has multiple components",
    "there are zero or one non-constant components, but schema is not connected",
    "schema is connected, but has multiple non-constant components"
  )

  checks <- c(
    all(postjoin_attrs == self_and_ancestor_attrs),
    all(apply(
      component_nonparents,
      1,
      \(x) {
        all(vapply(
          ci$nonparents[x[[2]]],
          \(y) all(unlist(as[x[[1]]]) %in% unlist(ci$attrs[y])),
          logical(1)
        ))
      }
    )),
    all(rowSums(ci$components) > 0),
    all(rowSums(nonparent_components) == 1),
    all(colSums(nonparent_components) >= 1),
    ncol(ci$components) > 1 || is_connected(x, const = TRUE),
    !is_connected(x, const = TRUE) || ncol(ci$components) <= 1,
    ncol(nonconstant_ci$components) > 1 || is_connected(x, const = FALSE),
    !is_connected(x, const = FALSE) || ncol(nonconstant_ci$components) <= 1
  )

  if (all(checks))
    succeed()
  else
    fail(paste("-", errors[!checks], collapse = "\n"))
}

Finally, we wrote a property function for component_info, so let’s do the same for ensure_connected:

prop_ensure_connected <- function(x) {
  expect_true(is_connected(ensure_connected(x), const = FALSE))
}

Let’s double-check the tests still work after refactoring:

forall(
  gen.data.frame(5, 8) |>
    gen.with(autodb),
  prop_component_info
)

forall(
  gen.database_schema(5, 5, 3),
  prop_component_info
)

Code

Here’s all the relevant code so far, ignoring the random seed, and with implementation and test code separated:

library(autodb)

fixed_point <- function(init, step, endif) {
  old <- init
  current <- step(init)
  while (!endif(old, current)) {
    old <- current
    current <- step(current)
  }
  current
}

adjacency <- function(x) {
  stopifnot(inherits(x, c("database_schema", "database")))

  nms <- names(x)
  len <- length(x)
  refs <- references(x)
  adj <- matrix(
    FALSE,
    nrow = len,
    ncol = len,
    dimnames = list(nms, nms)
  )
  from <- vapply(refs, \(r) match(r[[1]], nms), integer(1))
  to <- vapply(refs, \(r) match(r[[3]], nms), integer(1))
  for (n in seq_along(refs))
    adj[[from[[n]], to[[n]]]] <- TRUE
  diag(adj) <- TRUE
  adj
}

extended_adjacency <- function(x) {
  adj <- adjacency(x)
  fixed_point(adj, \(x) x %*% adj > 0, identical)
}

all_nonempty <- function(x) {
  all(lengths(x) > 0)
}

is_connected <- function(x, const = TRUE) {
  stopifnot(inherits(x, c("database_schema", "database")))
  
  used <- if (const)
    rep(TRUE, length(x))
  else
    vapply(keys(x), all_nonempty, logical(1))
  if (sum(used) == 0)
    return(TRUE)

  any(apply(extended_adjacency(x[used]), 1, all))
}

component_info <- function(x) {
  filled <- extended_adjacency(x)

  # find pairs of relations that reference each other
  bothref <- filled
  bothref <- bothref & t(bothref)
  cycle_groups <- apply(bothref, 1, \(x) which(x)[1])
  # relations in the same cycle should have the same extended adjacency
  stopifnot(tapply(
    apply(bothref, 1, identity, simplify = FALSE),
    cycle_groups,
    \(x) all(duplicated(x)[-1])
  ))
  cycle_filled <- filled[unique(cycle_groups), unique(cycle_groups), drop = FALSE] |>
    unname()

  nonparents <- unique(cycle_groups)[
    which(colSums(cycle_filled) == 1)
  ] |> # cycle is only referenced by itself
    lapply(\(x) which(cycle_groups == x))
  components <- t(filled[vapply(nonparents, `[[`, integer(1), 1), , drop = FALSE])
  # remove components names, since components no longer associated with single
  # nonparent relation
  colnames(components) <- NULL
  list(
    attrs = apply(
      filled,
      1,
      function(ancs) {
        Reduce(c, attrs(x)[ancs], init = character()) |>
          unique()
      },
      simplify = FALSE
    ),
    nonparents = nonparents,
    components = components
  )
}

ensure_connected <- identity # placeholder

library(hedgehog)
options(hedgehog.tests = 400, hedgehog.shrinks = 200)

gen.dims <- function(rows, cols) {
  list(gen.element(seq_len(rows + 1) - 1), gen.element(seq_len(cols + 1) - 1)) |>
    gen.with(unlist)
}

to_crp <- function(x) {
  n <- length(x)
  res <- rep(NA_integer_, n)
  count <- 0L
  for (index in seq_len(n)) {
    res[[index]] <- res[[x[[index]]]]
    if (is.na(res[[index]])) {
      count <- count + 1L
      res[[index]] <- count
    }
  }
  res
}

gen.crp <- function(n) {
  lapply(seq_len(n), gen.int) |>
    gen.with(unlist) |>
    gen.with(to_crp)
}

gen.data.frame_from_dims <- function(dims) {
  gen.list(gen.crp(dims[[1]]), of = dims[[2]]) |>
    gen.with(\(x) {
      as.data.frame(x) |>
        setNames(letters[seq_len(dims[[2]])])
    })
}

sort_and_revalue <- function(x) {
  x <- x[do.call(order, x), , drop = FALSE]
  x[] <- lapply(x, \(y) match(y, unique(y)))
  row.names(x) <- NULL
  x
}

gen.data.frame <- function(rows = 10, cols = 10) {
  gen.dims(rows, cols) |>
    gen.and_then(gen.data.frame_from_dims) |>
    gen.with(unique) |>
    gen.with(sort_and_revalue)
}

crp_nonconst <- function(x) {
  x <- x[vapply(x, \(y) any(y != 1), logical(1))]
  names(x) <- letters[seq_len(ncol(x))]
  x  
}

gen.attrs <- function(max_attrs) {
  gen.int(max_attrs) |>
    gen.with(\(n) letters[seq_len(n)])
}

remove_supersets <- function(x) {
  x <- unique(x)
  subsets <- outer(
    x,
    x,
    Vectorize(\(sub, sup) !identical(sub, sup) && all(is.element(sub, sup)))
  )
  x[apply(subsets, 1, Negate(any))]
}

gen.relation_pair_from_attrs <- function(attrs, max_keys) {
  gen.subsequence(attrs) |>
    gen.and_then(\(as) {
      list(
        gen.pure(as),
        gen.subsequence(as) |>
          gen.list(from = 1, to = max_keys) |>
          gen.with(remove_supersets)
      )
    })
}

add_relation_names <- function(x) {
  setNames(
    x,
    vapply(
      x,
      \(rs) if (length(rs[[2]][[1]]) == 0)
        "constants"
      else
        paste(rs[[2]][[1]], collapse = "_"),
      character(1)
    ) |>
      make.unique()
  )
}

gen.database_schema <- function(max_attrs, max_size, max_keys) {
  gen.attrs(max_attrs) |>
    gen.and_then(\(attrs) {
      gen.relation_pair_from_attrs(attrs, max_keys) |>
        gen.list(from = 0, to = max_size) |>
        gen.with(add_relation_names) |>
        gen.with(\(x) relation_schema(x, attrs))
    }) |>
    gen.with(autoref)
}

outer_element <- function(elements, sets) {
  if (length(sets) == 0 || length(elements) == 0) {
    matrix(logical(), nrow = length(sets), ncol = length(elements))
  }else{
    t(outer(elements, sets, Vectorize(is.element)))
  }
}

prop_component_info <- function(x) {
  ci <- component_info(x)

  # expected schema-attribute pairs post-join
  ao <- attrs_order(x)
  as <- attrs(x)
  attr_matrix <- outer_element(ao, as)
  family <- extended_adjacency(x)
  self_and_ancestor_attrs <- family %*% attr_matrix > 0

  # actual schema-attribute pairs post-join
  postjoin_attrs <- outer_element(ao, ci$attrs)

  component_nonparents <- which(ci$components, arr.ind = TRUE)
  nonparent_components <- ci$components[unlist(ci$nonparents), , drop = FALSE]

  constant_relations <- vapply(
    keys(x),
    \(ks) any(lengths(ks) == 0),
    logical(1)
  )
  nonconstant_ci <- component_info(x[!constant_relations])
  
  errors <- c(
    "relations do not have exactly the attributes of themselves and their ancestors",
    "relations do not transfer all their attributes to component nonparents",
    "there are relations with no component",
    "there are nonparents with no or multiple components",
    "there are components with no nonparents",
    "there are zero or one components, but schema is not constant-connected",
    "schema is constant-connected, but has multiple components",
    "there are zero or one non-constant components, but schema is not connected",
    "schema is connected, but has multiple non-constant components"
  )

  checks <- c(
    all(postjoin_attrs == self_and_ancestor_attrs),
    all(apply(
      component_nonparents,
      1,
      \(x) {
        all(vapply(
          ci$nonparents[x[[2]]],
          \(y) all(unlist(as[x[[1]]]) %in% unlist(ci$attrs[y])),
          logical(1)
        ))
      }
    )),
    all(rowSums(ci$components) > 0),
    all(rowSums(nonparent_components) == 1),
    all(colSums(nonparent_components) >= 1),
    ncol(ci$components) > 1 || is_connected(x, const = TRUE),
    !is_connected(x, const = TRUE) || ncol(ci$components) <= 1,
    ncol(nonconstant_ci$components) > 1 || is_connected(x, const = FALSE),
    !is_connected(x, const = FALSE) || ncol(nonconstant_ci$components) <= 1
  )

  if (all(checks))
    succeed()
  else
    fail(paste("-", errors[!checks], collapse = "\n"))
}

prop_ensure_connected <- function(x) {
  expect_true(is_connected(ensure_connected(x), const = FALSE))
}

forall(
  gen.data.frame(5, 8) |>
    gen.with(autodb),
  prop_component_info
)

forall(
  gen.database_schema(5, 5, 3),
  prop_component_info
)

try(forall(
  gen.data.frame(5, 8) |>
    gen.with(\(x) {
      x_nonconst <- crp_nonconst(x)
      fds <- discover(x_nonconst)
      list(
        data = x_nonconst,
        fds = fds,
        schema = normalise(fds)
      )
    }),
  function(data, fds, schema) prop_ensure_connected(schema),
  curry = TRUE
))

Session information

sessionInfo()

## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8 
## [2] LC_CTYPE=English_United Kingdom.utf8   
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.utf8    
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] hedgehog_0.2   testthat_3.3.0 autodb_3.2.4  
## 
## loaded via a namespace (and not attached):
##  [1] vctrs_0.6.5        crayon_1.5.3       cli_3.6.5          knitr_1.50        
##  [5] rlang_1.1.6        xfun_0.54          diffobj_0.3.6      jsonlite_2.0.0    
##  [9] glue_1.8.0         htmltools_0.5.8.1  sass_0.4.10        brio_1.1.5        
## [13] rmarkdown_2.30     visNetwork_2.1.4   evaluate_1.0.5     jquerylib_0.1.4   
## [17] fastmap_1.2.0      yaml_2.3.10        lifecycle_1.0.4    bookdown_0.45     
## [21] DiagrammeR_1.0.11  compiler_4.5.2     RColorBrewer_1.1-3 waldo_0.6.2       
## [25] htmlwidgets_1.6.4  rstudioapi_0.17.1  blogdown_1.22      digest_0.6.37     
## [29] R6_2.6.1           pillar_1.11.1      magrittr_2.0.4     bslib_0.9.0       
## [33] tools_4.5.2        cachem_1.1.0