Using autodb

library(autodb)
#> 
#> Attaching package: 'autodb'
#> The following object is masked from 'package:stats':
#> 
#>     decompose

Terminology

To avoid confusion, this vignette uses terminology consistently: a database consists of relations, which contain records that all have the same attributes. This is in line with the terms used in the relational model; in Statistics, we would instead talk about tables, which have rows with values for the same variables, with a column for each variable.

For the original data set, which is a flat table, we refer to a data frame instead, which also consists of records with attributes.

When we talk about the type of values an attribute can take, we talk about a class, as in the relational model and R, rather than a value type, as in many other programming languages.
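
For example, in R the role played by a value type in other languages is played by the class:

class(1.5)
#> [1] "numeric"
class(factor("a"))
#> [1] "factor"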

Motivation

autodb is described in the DESCRIPTION file in the following way:

“Automatic normalisation of a data frame to third normal form, with the intention of easing the process of data cleaning. (Usage to design your actual database for you is not advised.)”

Let us begin there.

For data cleaning

Database normalisation works on the principle that our data should only express a piece of information in a single place. This is done to reduce the chance that a modification will accidentally change or remove other pieces of information.

Data cleaning works on the principle that we remove duplicate observations and structural errors, and have a clear understanding of any associations between different observations, or parts of observations. This is done to ensure a high quality for the data used in analysis, but I have found that it is equally important for making sure that I understand the domain I am hoping to model.

Normalisation is helpful for both of these areas. I usually use the original, flat, “tidy” data for the analysis itself. However, for checking the data’s consistency, and for using the data to learn about the domain, normalised data is often helpful, whether it’s used for the analysis or not.

I usually do data checking under the following conditions:

  • No need to store the data in normalised form. My own work normally consists of one-off analyses, where I won’t see the data again once I’ve finished the work. For more permanent data, it would be better to store the data in a database designed to hold it, if you’re able to.
  • Limited documentation. In my own work, I am lucky if I am given a key or a data dictionary for the data set at all. When I am, it usually proves to be wrong somehow. This makes it even more important to do data checking, not just for looking for errors, but also as an exercise in understanding the problem domain before starting to model it.
  • Data that fits in local memory. This process could be done in batches or using a streaming algorithm, but this is a more advanced case.

It’s in this context that I would usually have to do the normalisation by hand, and I would like to have a tool that does it semi-automatically. This package is intended to be that tool, using a minimal amount of additional libraries.

Not for database design

Good database design is a hard and involved process, which requires a strong grasp of the domain that the database will hold information on. Please don’t try to use this package to do it on autopilot. In particular, the conditions described above mean we do things differently to good practice for normalised relational databases:

  • We need to handle constant attributes, which have the same value in every record. In a real database, we would probably expect these not to stay constant when more data is added, and would know which relation to place them in. In one-off data checking, there isn’t more data coming, so we need to handle these attributes automatically, by placing them in a “constants” relation with an empty key, which database designers would probably frown at.
  • We don’t care about database design principles concerned with avoiding errors under deletion, modification, and so on, because the data won’t change. This means that we don’t care about introducing artificial keys, unless we think it would make data modelling more convenient. Similarly, we don’t care about having single-attribute reference relations, which are used to make sure other relations only assign legal values to that relation’s attribute.
  • We have no “derived relations”, also known as “views”, which are additional relations created from the others for lookup convenience. We expect to be using the data frame for modelling, rather than the database, so such lookups aren’t commonly necessary, and are left to the discretion of the user to do manually with base R’s merge function.

To third normal form

autodb gets a given data frame to third normal form: every attribute depends on the whole key(s), and non-key attributes depend on nothing but the key(s). This was chosen because there is an existing algorithm, Bernstein’s synthesis, for normalising to third normal form, and because it’s the highest normal form that is attainable with arbitrary original data frames. There is an additional enhancement available as an option: see the section on avoidable attributes for more details.

Simple examples

For most of these simple examples, we use the ChickWeight dataset, included with base R.

summary(ChickWeight)
#>      weight           Time           Chick     Diet   
#>  Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
#>  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
#>  Median :103.0   Median :10.00   20     : 12   3:120  
#>  Mean   :121.8   Mean   :10.72   10     : 12   4:118  
#>  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
#>  Max.   :373.0   Max.   :21.00   19     : 12          
#>                                  (Other):506

Normalisation with autodb

The simplest way to do normalisation is with the autodb function:

db <- autodb(ChickWeight, name = "Chick Weights")
db
#> database Chick Weights with 2 relations
#> relation Chick: Chick, Diet; 50 records
#>   key 1: Chick
#> relation Time_Chick: Time, Chick, weight; 578 records
#>   key 1: Time, Chick
#> relationships:
#> Time_Chick.{Chick} -> Chick.{Chick}

Plots

There is no plotting method within the autodb package itself. Instead, there are functions to write normalised objects as inputs to the Graphviz visualisation software, using the DOT language.

The generic function to do this is gv, which calls the gv.database method on our database, db:

db_text <- gv(db)
cat(db_text)
#> digraph chick_weights {
#>   rankdir = "LR"
#>   node [shape=plaintext];
#> 
#>   chick [label = <
#>     <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
#>     <TR><TD COLSPAN="3">Chick (50 records)</TD></TR>
#>     <TR><TD PORT="TO_chick">Chick</TD><TD BGCOLOR="black"></TD><TD PORT="FROM_chick">ordered</TD></TR>
#>     <TR><TD PORT="TO_diet">Diet</TD><TD></TD><TD PORT="FROM_diet">factor</TD></TR>
#>     </TABLE>>];
#>   time_chick [label = <
#>     <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
#>     <TR><TD COLSPAN="3">Time_Chick (578 records)</TD></TR>
#>     <TR><TD PORT="TO_time">Time</TD><TD BGCOLOR="black"></TD><TD PORT="FROM_time">numeric</TD></TR>
#>     <TR><TD PORT="TO_chick">Chick</TD><TD BGCOLOR="black"></TD><TD PORT="FROM_chick">ordered</TD></TR>
#>     <TR><TD PORT="TO_weight">weight</TD><TD></TD><TD PORT="FROM_weight">numeric</TD></TR>
#>     </TABLE>>];
#> 
#>   time_chick:FROM_chick -> chick:TO_chick;
#> }

This can be saved to a file for passing to Graphviz elsewhere, or can be passed into grViz in the DiagrammeR package:

DiagrammeR::grViz(db_text)
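
Alternatively, as a small sketch of the file-based route (the file name db.dot is just an example), we could write the DOT text out and render it outside R, e.g. with dot -Tpng db.dot -o db.png on the command line:

writeLines(db_text, "db.dot")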

Each relation is represented as a box, with the top row giving the relation name and the number of records, and the other rows detailing the attributes. In the middle is a grid of cells: each column of cells represents a (candidate) key for the relation, where attributes in that key have their cell filled in black.

We can see that ChickWeight was split into two relations, with a weight for each unique combination of Chick number and time, and the Chick number by itself determining the diet. We can also see the number of records in each relation, and the classes of the attributes in those relations.

gv also has methods for data.frame, and for database_schema, which is introduced later.
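
For example, a quick sketch of the data.frame method, plotting the original flat data frame as a single relation (plot output not shown here):

DiagrammeR::grViz(gv(ChickWeight, name = "ChickWeight"))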

Finding functional dependencies

Having looked at the final result first, we now look at the individual steps. The first step is to find the (non-trivial and minimal) dependencies present in the given data frame. There are various ways to do this; by default, the package uses DFD, a depth-first search algorithm. We run this using the discover function, setting progress to TRUE to see the steps taken:

deps <- discover(ChickWeight, accuracy = 1, progress = TRUE)
#> simplifying data types
#> constructing powerset
#> dependent weight
#> determinants available, starting search
#> dependent Time
#> determinants available, starting search
#> dependent Chick
#> determinants available, starting search
#> dependent Diet
#> determinants available, starting search
#> DFD complete

The accuracy argument expects a number between zero and one, determining what proportion of a data frame’s records need to satisfy a given dependency for the algorithm to consider it valid. Setting accuracy to one, as done here, limits us to exact dependencies. We’ll look at other values later, in the section on approximate dependencies.

deps
#> 2 functional dependencies
#> 4 attributes: weight, Time, Chick, Diet
#> Time, Chick -> weight
#>       Chick -> Diet

After simplifying the data to something quicker to iterate over – specifically, by converting all attributes to have integer values – the algorithm takes each attribute in turn as a possible dependent, and finds sets of the other attributes that determine it.

The result is a list of functional dependencies, in the format determinant set -> dependent, with an attribute named attrs_order that gives all the attribute names in their original order. Each of these three parts has its own (generic) extraction function:

detset(deps)
#> [[1]]
#> [1] "Time"  "Chick"
#> 
#> [[2]]
#> [1] "Chick"
dependent(deps)
#> [1] "weight" "Diet"
attrs_order(deps)
#> [1] "weight" "Time"   "Chick"  "Diet"

The first two, especially, are useful for filtering, as we’ll see later.
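
As a small example of such filtering, we could keep only the dependencies with a single-attribute determinant set, which here should leave just the Chick -> Diet dependency:

deps[lengths(detset(deps)) == 1]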

Normalisation

Now that we have a list of discovered dependencies, we can construct a database schema, where the relation schemas are normalised to third normal form. This is done using a version of Bernstein’s synthesis.

schema <- synthesise(deps)
schema
#> 2 relation schemas
#> 4 attributes: weight, Time, Chick, Diet
#> schema Chick: Chick, Diet
#>   key 1: Chick
#> schema Time_Chick: Time, Chick, weight
#>   key 1: Time, Chick

Like the database before, we can also plot this database schema:

schema_text <- gv(schema)
cat(schema_text)
#> digraph {
#>   rankdir = "LR"
#>   node [shape=plaintext];
#> 
#>   chick [label = <
#>     <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
#>     <TR><TD COLSPAN="2">Chick</TD></TR>
#>     <TR><TD PORT="TO_chick">Chick</TD><TD PORT="FROM_chick" BGCOLOR="black"></TD></TR>
#>     <TR><TD PORT="TO_diet">Diet</TD><TD PORT="FROM_diet"></TD></TR>
#>     </TABLE>>];
#>   time_chick [label = <
#>     <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
#>     <TR><TD COLSPAN="2">Time_Chick</TD></TR>
#>     <TR><TD PORT="TO_time">Time</TD><TD PORT="FROM_time" BGCOLOR="black"></TD></TR>
#>     <TR><TD PORT="TO_chick">Chick</TD><TD PORT="FROM_chick" BGCOLOR="black"></TD></TR>
#>     <TR><TD PORT="TO_weight">weight</TD><TD PORT="FROM_weight"></TD></TR>
#>     </TABLE>>];
#> }
DiagrammeR::grViz(schema_text)

This is similar to the database plot given before, but some information that requires the data frame itself is not present: we have no record counts for the individual relation schemas, and no class information for the attributes. However, we do have automatically-generated names for the individual relations.

Additionally, at this point we have no connections between the relation schemas, since Bernstein’s synthesis doesn’t supply information about foreign key references. We could use this database schema to build a database, but we’d rather add the foreign key references first.

Tuning detection and normalisation

Let’s look at a different dataset for a moment, to look at some cases where we don’t want to use the dependencies as given. We’ll use the Titanic data set, also provided with base R. This data is in array form, so we first convert it to data frame form:

knitr::kable(as.data.frame(Titanic))
Class Sex Age Survived Freq
1st Male Child No 0
2nd Male Child No 0
3rd Male Child No 35
Crew Male Child No 0
1st Female Child No 0
2nd Female Child No 0
3rd Female Child No 17
Crew Female Child No 0
1st Male Adult No 118
2nd Male Adult No 154
3rd Male Adult No 387
Crew Male Adult No 670
1st Female Adult No 4
2nd Female Adult No 13
3rd Female Adult No 89
Crew Female Adult No 3
1st Male Child Yes 5
2nd Male Child Yes 11
3rd Male Child Yes 13
Crew Male Child Yes 0
1st Female Child Yes 1
2nd Female Child Yes 13
3rd Female Child Yes 14
Crew Female Child Yes 0
1st Male Adult Yes 57
2nd Male Adult Yes 14
3rd Male Adult Yes 75
Crew Male Adult Yes 192
1st Female Adult Yes 140
2nd Female Adult Yes 80
3rd Female Adult Yes 76
Crew Female Adult Yes 20

This is a simple set of data, with a single count observation, Freq, for each combination of the four determining attributes. In other words, the relation is already normalised, so we only expect one relation in the normalised database.

If we use autodb again, we get the following database:

DiagrammeR::grViz(gv(autodb(as.data.frame(Titanic))))

Oops! The DFD algorithm found some functional dependencies where the count could be used to determine another attribute. These are clearly spurious: frequency count can’t causally determine age, for example. However, the algorithm still finds these dependencies, because the counts are unique often enough to make these dependencies hold in the given data.

There are two approaches we can take to eliminate these spurious dependencies: not letting them be detected in the first place, and removing them before using synthesise.

To stop them being detected, we can put constraints on what is discovered by discover: we can ask for certain attributes to not be considered as determinants, or we can exclude attributes that inherit from certain classes. In this example, we could exclude Freq from being considered:

titanic_deps_freqonly <- discover(as.data.frame(Titanic), 1, exclude = "Freq")
titanic_deps_freqonly
#> 1 functional dependency
#> 5 attributes: Class, Sex, Age, Survived, Freq
#> Class, Sex, Age, Survived -> Freq

Alternatively, we could exclude all attributes that inherit from “numeric”:

identical(titanic_deps_freqonly, discover(as.data.frame(Titanic), 1, exclude_class = "numeric"))
#> [1] TRUE

These can both be used as arguments to autodb too:

DiagrammeR::grViz(gv(autodb(as.data.frame(Titanic), exclude = "Freq")))

In general, excluding numeric attributes as determinants is useful, because we expect non-integer numbers to be measurements, not parts of a primary key.

Alternatively, we could remove unwanted dependencies before normalisation. Here are the found dependencies, if we don’t exclude anything:

titanic_deps <- discover(as.data.frame(Titanic), 1)
titanic_deps
#> 3 functional dependencies
#> 5 attributes: Class, Sex, Age, Survived, Freq
#>       Sex, Survived, Freq -> Age
#>     Class, Survived, Freq -> Age
#> Class, Sex, Age, Survived -> Freq

We can remove the unwanted dependencies, where Age is the dependent, by keeping only those where Freq is the dependent:

titanic_deps[dependent(titanic_deps) == "Freq"]
#> 1 functional dependency
#> 5 attributes: Class, Sex, Age, Survived, Freq
#> Class, Sex, Age, Survived -> Freq
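
Normalising these filtered dependencies should then give the single-relation schema we expected; a minimal sketch (the variable name is illustrative):

titanic_deps_filtered <- titanic_deps[dependent(titanic_deps) == "Freq"]
DiagrammeR::grViz(gv(normalise(titanic_deps_filtered)))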

Adding cross-references

Getting back to our ChickWeight example, we now have a database schema, consisting of a list of relation schemas. However, we have no information about how these relation schemas are linked to each other. In particular, we have no information about foreign keys. We can add this information using cross_reference:

linked_schema <- cross_reference(schema)
linked_schema
#> database schema with 2 relation schemas
#> 4 attributes: weight, Time, Chick, Diet
#> schema Chick: Chick, Diet
#>   key 1: Chick
#> schema Time_Chick: Time, Chick, weight
#>   key 1: Time, Chick
#> relationships:
#> Time_Chick.{Chick} -> Chick.{Chick}

We could also have used normalise, instead of synthesise and cross_reference separately:

normalise(deps)
#> database schema with 2 relation schemas
#> 4 attributes: weight, Time, Chick, Diet
#> schema Chick: Chick, Diet
#>   key 1: Chick
#> schema Time_Chick: Time, Chick, weight
#>   key 1: Time, Chick
#> relationships:
#> Time_Chick.{Chick} -> Chick.{Chick}

Plotting this updated database schema shows the same relation schemas as before, linked together by foreign key references:

linked_text <- gv(linked_schema)
cat(linked_text)
#> digraph {
#>   rankdir = "LR"
#>   node [shape=plaintext];
#> 
#>   chick [label = <
#>     <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
#>     <TR><TD COLSPAN="2">Chick</TD></TR>
#>     <TR><TD PORT="TO_chick">Chick</TD><TD PORT="FROM_chick" BGCOLOR="black"></TD></TR>
#>     <TR><TD PORT="TO_diet">Diet</TD><TD PORT="FROM_diet"></TD></TR>
#>     </TABLE>>];
#>   time_chick [label = <
#>     <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
#>     <TR><TD COLSPAN="2">Time_Chick</TD></TR>
#>     <TR><TD PORT="TO_time">Time</TD><TD PORT="FROM_time" BGCOLOR="black"></TD></TR>
#>     <TR><TD PORT="TO_chick">Chick</TD><TD PORT="FROM_chick" BGCOLOR="black"></TD></TR>
#>     <TR><TD PORT="TO_weight">weight</TD><TD PORT="FROM_weight"></TD></TR>
#>     </TABLE>>];
#> 
#>   time_chick:FROM_chick -> chick:TO_chick;
#> }
DiagrammeR::grViz(linked_text)

Decomposing the original relation

Finally, once we have our normalised database schema, we can apply it to our original data frame, or a new one with the same structure. This results in a normalised database, as we got from using autodb:

db2 <- decompose(ChickWeight, linked_schema)
DiagrammeR::grViz(gv(db2, name = "Chick Weights"))

We now have the record counts added.

Rejoining a database to a data frame

We can reverse the process of turning a data frame into a database with the rejoin function. This may not be identical to ChickWeight, since the rows may have been rearranged. However, we can use the convenience function df_equiv to check for equivalence under row reordering:

rejoined <- rejoin(db)
summary(rejoined)
#>      weight           Time           Chick     Diet   
#>  Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
#>  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
#>  Median :103.0   Median :10.00   20     : 12   3:120  
#>  Mean   :121.8   Mean   :10.72   10     : 12   4:118  
#>  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
#>  Max.   :373.0   Max.   :21.00   19     : 12          
#>                                  (Other):506
identical(rejoined, ChickWeight)
#> [1] FALSE
df_equiv(rejoined, ChickWeight)
#> [1] TRUE

When rejoined, the relation attributes will be in the original order. However, the record order might have changed.

A larger example

Included in the package is a 447-by-25 data frame called nudge:

DiagrammeR::grViz(gv(nudge, name = "nudge"))

This is the data set for a meta-analysis, looking at the effectiveness of “nudge” interventions. Measurements are taken within a three-layer hierarchy: publications contain studies, which contain effect size measurements. We expect this hierarchy to be apparent in the normalisation.

Getting full dependency information for this relation can take a long time, so here we search for a reduced set, not considering numeric or sample size attributes as determinants:

nudge_deps <- discover(
  nudge,
  accuracy = 1,
  exclude = c("n_study", "n_comparison", "n_control", "n_intervention"),
  exclude_class = "numeric"
)
nudge_schema <- normalise(nudge_deps, remove_avoidable = TRUE)
DiagrammeR::grViz(gv(nudge_schema))

We can see a relation, with many attributes, determined by the effect size ID number, es_id. This contains all of the numeric measurements, as expected in the relation for the lowest level in the hierarchy. As also expected, this has a foreign key reference to a relation for study-level information, determined by the study ID number, study_id.

However, the publication-level relation this refers to is not determined by the publication ID, publication_id, as we would expect; neither is it determined by the publication reference. Instead, it is solely determined by the publication’s title: to use the ID, we would need to supplement it with the publication year. This suggests that some publication ID numbers have been erroneously assigned to several publications, which we can easily test:

nudge_database <- decompose(nudge, nudge_schema, "nudge")
nudge_title_relation <- records(nudge_database)$title
nudge_pid_duplicates <- unique(nudge_title_relation$publication_id[
  duplicated(nudge_title_relation$publication_id)
])
knitr::kable(subset(nudge_title_relation, publication_id %in% nudge_pid_duplicates))
title publication_id reference year
44 Enhanced active choice: A new method to motivate behavior change 95 Keller et al. (2011) 2011
130 Nudging product choices: The effect of position change on snack bar choice 95 Keller et al. (2015) 2015

This would also explain why the publication ID appears in several other relations that look likely to be spurious.

Since the publication ID can’t be trusted, we could consider removing the attribute, and re-running discover. While fine in this case, this is best avoided if running discover on a given data frame is expensive.
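
As a sketch of that first approach (not run here, since discovery can be slow; the variable name is illustrative):

nudge_deps_noid <- discover(
  nudge[, setdiff(names(nudge), "publication_id")],
  accuracy = 1,
  exclude = c("n_study", "n_comparison", "n_control", "n_intervention"),
  exclude_class = "numeric"
)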

Another option is to simply remove the offending relations in the current schemas:

nudge_schema_relfiltered <- nudge_schema[!grepl("publication_id", names(nudge_schema), fixed = TRUE)]

Subsetting a database schema also removes any foreign key relationships involving removed schemas, so the resulting schema is still valid. However, any foreign key reference chains with removed schemas in the middle are broken. Amongst others, in this case we lose the relationship between the three relations for the three hierarchical data levels:

DiagrammeR::grViz(gv(nudge_schema_relfiltered))

We also lose any functional dependencies partially implied by the removed schemas (i.e. transitive dependencies, where removing these schemas’ dependencies means that they’re no longer transitive). This is an issue if those partially-implied dependencies aren’t spurious, because we throw away non-spurious information about the data’s structure.

The better option is to remove all functional dependencies where the publication ID is in the determinant set, to get the same effect without re-running discover:

nudge_deps_filtered <- nudge_deps[!vapply(detset(nudge_deps), is.element, logical(1), el = "publication_id")]
nudge_schema_filtered <- normalise(nudge_deps_filtered, remove_avoidable = TRUE)
DiagrammeR::grViz(gv(nudge_schema_filtered))

The resulting database schema looks to be an improvement. However, we can still see some spurious relations involving reference. One of these combines it with type_experiment to determine location, a study-level attribute. The others combine it with another attribute to determine title.

As mentioned, we’d expect the reference to uniquely identify the publication: however, in the publication-level relation, the title determines the reference, but not vice versa. This means that there are publications that share a reference:

nudge_database_filtered <- decompose(nudge, nudge_schema_filtered, "nudge")
nudge_title_relation_filtered <- records(nudge_database_filtered)$title
nudge_reference_duplicates <- unique(nudge_title_relation_filtered$reference[
  duplicated(nudge_title_relation_filtered$reference)
])
knitr::kable(subset(nudge_title_relation_filtered, reference %in% nudge_reference_duplicates))
title publication_id reference
214 Nudge vs superbugs: A behavioural economics trial to reduce the overprescribing of antibiotics 18 BETA (2018)
399 Energy labels that make cents 19 BETA (2018)

BETA is the Behavioural Economics Team of the Australian Government, so it’s not surprising that they’d have multiple publications/reports per year. Duplicate references are not necessarily an error, but they would be awkward if the references were to be used.

While the reference is useful, the relations mentioned above are not, so we’ll also remove functional dependencies with the reference determining something apart from the year:

nudge_deps_filtered2 <- nudge_deps_filtered[
  !vapply(detset(nudge_deps_filtered), is.element, logical(1), el = "reference") |
    dependent(nudge_deps_filtered) == "year"
]
nudge_schema_filtered2 <- normalise(nudge_deps_filtered2, remove_avoidable = TRUE)
DiagrammeR::grViz(gv(nudge_schema_filtered2))

We now, finally, have the study location in a non-spurious relation. Interestingly, removing dependencies has revealed extra information about study locations, via the new title_type_experiment relation: studies of the same experiment type in a publication always have the same location. Looking at the resulting database shows that this removes many entries of what would be redundant location information if kept in the study relation:

nudge_database_filtered2 <- decompose(nudge, nudge_schema_filtered2, "nudge")
DiagrammeR::grViz(gv(nudge_database_filtered2))

While this is not a dependency we could expect to hold if more data was collected, it’s a reasonable dependency for the given data set, which won’t be added to.

It’s also notable for the previous discussion about information lost if we subset the database schema: the relation isn’t present in the original schema because the full set of functional dependencies make its implied dependency transitive, but removing spurious dependencies made it no longer transitive, so it’s directly represented in a new relation. Specifically, the previous schema, nudge_schema_filtered, has relations showing that title -> reference and reference, type_experiment -> location, and removing the latter makes title, type_experiment -> location no longer transitive. Subsetting the schema, instead of removing the spurious dependencies, would also lose this transitive dependency, so we would not see the new relation.

(As far as I’m aware, the publication ID error mentioned above would not have affected the meta-analysis results.)

Other features

Approximate dependencies and database reduction

Larger datasets can often have entry errors, without an easy way to remove or deal with them. For this reason, we might be interested in “approximate” functional dependencies, which hold after removing a bounded amount of violating records.

Suppose that we normalise nudge again, without any manual dependency removal, but this time allowing approximate dependencies. We can cheat a little here, since the checks above showed there to be two questionable records: one involving a duplicated publication ID, and one involving a duplicated reference. Since nudge has 447 records, we can get rid of the resulting questionable dependencies by setting the accuracy to allow up to two discrepant records.

nudge_approx_cheat_database_schema <- discover(
  nudge,
  accuracy = 1 - 2/nrow(nudge),
  exclude = c("n_study", "n_comparison", "n_control", "n_intervention"),
  exclude_class = "numeric"
) |>
  normalise()
DiagrammeR::grViz(gv(nudge_approx_cheat_database_schema))

Currently decompose doesn’t account for approximate dependencies, resulting in invalid databases, so here we just work with the database schema.

Compare this to the database schema we arrived at manually:

  • The publication ID, the title, and the reference are considered equivalent, as we’d have originally expected.
  • There is an additional relation, showing the approximation attribute to be approximately determined by some publication- and study-level attributes, rather than functionally determined at the effect-size level.

Lowering the accuracy results in more dependencies being found, and so in more relations. The number of relations can get very large. For example, suppose we instead set the accuracy to 0.99, returning any approximate dependencies that hold in at least 443 records.

nudge_approx_database_schema <- discover(
  nudge,
  accuracy = 0.99,
  exclude = c("n_study", "n_comparison", "n_control", "n_intervention"),
  exclude_class = "numeric"
) |>
  normalise()
DiagrammeR::grViz(gv(nudge_approx_database_schema))

This is a little overwhelming, so we’ll use a utility function called reduce. This returns only the named “main” relation, along with any relations it references, directly or indirectly, via foreign keys. Reducing the approximate schema, with the effect size table as the main table, gives us this set of relations:

DiagrammeR::grViz(gv(reduce(nudge_approx_database_schema, "es_id")))

Questionable intermediate relations aside, we can see that there is now a publication-level relation with the publication ID as a key, since it determines the publication attributes once we remove one of the duplicate records we discovered before.

reduce can also be used on databases, where it considers each relation with the largest number of records as a main table. It needs to be used with caution: while the relations it removes are often spurious, it’s possible to find databases where relations not linked to the main relation by foreign key references are required to rejoin to the original data frame, so reduction can remove necessary relations. Its intent is mostly to make glancing at database plots more manageable.
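
For example, a minimal sketch using the filtered nudge database from above, where we’d expect the effect-size-level relation to be picked as the main table, since it has the most records:

DiagrammeR::grViz(gv(reduce(nudge_database_filtered2)))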

Avoidable attributes

The next normal form after third normal form (3NF) is Boyce-Codd normal form (BCNF). Ensuring BCNF is enforced by the database is trickier, as in some cases it can’t be enforced with just relations and foreign key constraints.

However, the package includes an option to convert to enhanced third normal form, also known as LTK form, which can be so enforced. This enhancement is tangential to BCNF, and could also be used to enhance schemas in BCNF.

In brief, the standard normal forms only put constraints on the attributes present within each relation, considered one relation at a time. The enhancement adds a constraint on the attributes present in a relation while also considering their presence in other relations. If an attribute in a relation can be removed, and still be determined from that relation by joining it to others, then the attribute is “avoidable”, and can be removed. If the attribute is in any of the relation’s keys, those keys are replaced by keys that use the attributes not being removed. This removes attributes from relations without removing any information from the database as a whole.

For example, we can take this simple example from Chapter 6 of The Theory of Relational Databases, by David Maier:

avoid_deps <- functional_dependency(
  list(
    list("A", "B"),
    list("B", "A"),
    list(c("A", "C"), "D"),
    list(c("A", "C"), "E"),
    list(c("B", "D"), "C")
  ),
  attrs_order = c("A", "B", "C", "D", "E")
)
avoid_deps
#> 5 functional dependencies
#> 5 attributes: A, B, C, D, E
#>    A -> B
#>    B -> A
#> A, C -> D
#> A, C -> E
#> B, D -> C
avoid_schema <- normalise(avoid_deps)
DiagrammeR::grViz(gv(avoid_schema))

Attributes A and B are equivalent, since relation A has them both as a key. In other words, relation A is a simple lookup relation. Because of this, we could remove B from relation A_C, and replace the key B, D with A, D, which is equivalent when accounting for relation A.

We can have this removal of avoidable attributes done automatically, using the remove_avoidable flag for normalise:

avoid_schema_removed <- normalise(
  avoid_deps,
  remove_avoidable = TRUE
)
DiagrammeR::grViz(gv(avoid_schema_removed))

This schema is now in LTK form, with no remaining avoidable attributes. We could have also removed A from relation A_C instead of B, so this process may not have a unique result. The package’s implementation prefers to remove attributes that appear later in the original relation.

Planned extensions

Handling of duplicate records

It’s expected that an R user might pass a data frame into the package that has duplicate records. At the moment, these are kept when searching for dependencies; this can affect results for approximate dependencies, since duplicates affect the record counts. However, they are removed when the data frame is decomposed into a database. At the moment, I’m not certain whether duplicates are best handled by removing them, or by simply returning an error if any are present.
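
In the meantime, if duplicate records are a concern, one simple workaround is to drop them with unique() before normalising; a small sketch, using an artificially duplicated ChickWeight record:

chick_dup <- rbind(ChickWeight, ChickWeight[1, ]) # duplicate the first record
db_dedup <- autodb(unique(chick_dup))             # unique() drops exact duplicates first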

Missing values

Strictly speaking, autodb does not search for functional dependencies, because these can’t account for missing values. Instead, it searches for a weaker variant, which has been called a literal functional dependency (LFD) in the literature.

The functional dependency X -> Y holds if, for any two possible records whose values in X are equal, their values in Y are also equal. This requires attribute values to take the equality operator, == in R. Additionally, they must take it under binary logic: whether two values are equal can only be true or false, not missing. This disallows the presence of missing values (NA and NaN), since testing for equality with a missing value returns a missing result.
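
For example, in R, equality comparisons involving missing values return missing results:

c(1, 2, NA) == c(1, 3, NA)
#> [1]  TRUE FALSE    NA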

The literal functional dependency X -> Y holds, or the functional dependency X -> Y holds literally, if, for any two possible records whose values in X are identical, their values in Y are also identical, where values are identical if both are non-missing and equal, or both are missing. This requires attribute classes to take the identity operator, identical() with default optional parameter values in R. Again, they must take it under binary logic, but the identity operator cannot return non-binary results anyway: In R, for example, identical() can only return TRUE or FALSE, never NA. There is, therefore, no constraint on missing values in attribute classes, or on attribute classes in general.
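
For example, identical() always returns a definite TRUE or FALSE, even for missing values, and it distinguishes missing values of different classes:

identical(NA, NA)
#> [1] TRUE
identical(NA_real_, NA)
#> [1] FALSE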

LFDs are more generic than standard functional dependencies: since practically every class takes the identity operator, they make essentially no assumptions about the attribute classes present in a data set.

There are other FD variants that handle missing values, but, unlike most of them, LFDs still satisfy Armstrong’s axioms, allowing them to be used in normalisation in the same way as normal FDs. For example, they still respect transitivity: if X -> Y and Y -> Z literally, then X -> Z literally.

However, ignoring the special status of missing values in this way ignores important structural information. In the relational model, and ideally in relational databases, we want to have no missing values at all, partly to avoid awkward questions about what trinary logic means with respect to filtering or joining tables. For example, take the following data frame:

df_nas <- data.frame(
  patient = c(1L, 2L, 3L, 4L),
  trial_entry_date = as.Date(c("2022/05/02", "2022/06/06", "2022/04/01", "2022/03/19")),
  death_date = as.Date(c(NA, NA, "2022/10/07", NA))
)
knitr::kable(df_nas)
patient trial_entry_date death_date
1 2022-05-02 NA
2 2022-06-06 NA
3 2022-04-01 2022-10-07
4 2022-03-19 NA

autodb currently treats NA as just another value, which results in the initial data frame not being split in the database schema:

DiagrammeR::grViz(gv(autodb(df_nas)))

In this case, a missing death date represents no death, and we would prefer to move death information to a separate relation, containing only patients with a death date, in this case patient 3:

ideal_db <- decompose(
  df_nas,
  database_schema(
    relation_schema(
      list(
        patient = list(c("patient", "trial_entry_date"), list("patient")),
        patient_death = list(c("patient", "death_date"), list("patient"))
      ),
      names(df_nas)
    ),
    list(list("patient_death", "patient", "patient", "patient"))
  )
)
records(ideal_db)$patient_death <- subset(records(ideal_db)$patient_death, !is.na(death_date))
DiagrammeR::grViz(gv(ideal_db))

This makes more of the data’s structure apparent, and enforceable, in the database structure, since it makes it clear that the death date is optional.

Removing missing values in this way requires additional steps in dependency detection, as well as normalisation, since dependency detection is the only part of the process where the original table is searched for structure.

Handling some special cases

Disjunctive existence unions / option types

As a more general case of the missing values section, we sometimes have attribute sets Y1, Y2, … that are mutually exclusive for being non-missing, where the choice is determined by some attribute set X.

For example, I commonly get data sets where certain values can be given as either single values, or endpoints of an interval. Sometimes this is done within a single column, sometimes it’s done across several:

df_options <- data.frame(
  id = 1:10,
  single_col = c("2.3", "2.3", "5.7", "2.4-7.1", "0-10", "2.4-10", "5.3-13.1", "2.4-25.8", "2.4-25.8", "5.6-25.8"),
  value = c(2.3, 2.3, 5.7, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
  lower_bound = c(NA_real_, NA_real_, NA_real_, 2.4, 0, 2.4, 5.3, 2.4, 2.4, 5.6),
  upper_bound = c(NA_real_, NA_real_, NA_real_, 7.1, 10, 10, 13.1, 25.8, 25.8, 25.8),
  type = c("value", "value", "value", "interval", "interval", "interval", "interval", "interval", "interval", "interval")
)
knitr::kable(df_options)
id single_col value lower_bound upper_bound type
1 2.3 2.3 NA NA value
2 2.3 2.3 NA NA value
3 5.7 5.7 NA NA value
4 2.4-7.1 NA 2.4 7.1 interval
5 0-10 NA 0.0 10.0 interval
6 2.4-10 NA 2.4 10.0 interval
7 5.3-13.1 NA 5.3 13.1 interval
8 2.4-25.8 NA 2.4 25.8 interval
9 2.4-25.8 NA 2.4 25.8 interval
10 5.6-25.8 NA 5.6 25.8 interval

autodb has no way to know about these relationships, so it ignores them:

db_options <- autodb(df_options)
DiagrammeR::grViz(gv(db_options))

However, looking at the three parent tables shows that something is going on, since whether the individual values are missing determines the type:

knitr::kable(records(db_options)$value)
value type
1 2.3 value
3 5.7 value
4 NA interval
knitr::kable(records(db_options)$lower_bound)
lower_bound type
1 NA value
4 2.4 interval
5 0.0 interval
7 5.3 interval
10 5.6 interval
knitr::kable(records(db_options)$upper_bound)
upper_bound type
1 NA value
4 7.1 interval
5 10.0 interval
7 13.1 interval
8 25.8 interval

As a general solution for finding these sorts of relationships, without manually checking relations myself, I like taking each attribute with missing values, and adding a companion attribute that states whether the value is present or missing:

df_options_with_presence <- data.frame(
  id = df_options$id,
  single_col = df_options$single_col,
  value = df_options$value,
  value_present = !is.na(df_options$value),
  lower_bound = df_options$lower_bound,
  lower_bound_present = !is.na(df_options$lower_bound),
  upper_bound = df_options$upper_bound,
  upper_bound_present = !is.na(df_options$upper_bound),
  type = df_options$type
)
knitr::kable(df_options_with_presence)
id single_col value value_present lower_bound lower_bound_present upper_bound upper_bound_present type
1 2.3 2.3 TRUE NA FALSE NA FALSE value
2 2.3 2.3 TRUE NA FALSE NA FALSE value
3 5.7 5.7 TRUE NA FALSE NA FALSE value
4 2.4-7.1 NA FALSE 2.4 TRUE 7.1 TRUE interval
5 0-10 NA FALSE 0.0 TRUE 10.0 TRUE interval
6 2.4-10 NA FALSE 2.4 TRUE 10.0 TRUE interval
7 5.3-13.1 NA FALSE 5.3 TRUE 13.1 TRUE interval
8 2.4-25.8 NA FALSE 2.4 TRUE 25.8 TRUE interval
9 2.4-25.8 NA FALSE 2.4 TRUE 25.8 TRUE interval
10 5.6-25.8 NA FALSE 5.6 TRUE 25.8 TRUE interval

This helps to make the structure more apparent in the database schema:

db_options_with_presence <- autodb(df_options_with_presence)
DiagrammeR::grViz(gv(db_options_with_presence))
knitr::kable(records(db_options_with_presence)$value_present)
value_present lower_bound_present upper_bound_present type
1 TRUE FALSE FALSE value
4 FALSE TRUE TRUE interval

This is still reasonable when we exclude the nullable attributes as determinants:

DiagrammeR::grViz(gv(autodb(df_options_with_presence, exclude_class = "numeric")))

Dealing with the single-column version takes more work: even with an attribute like type above, indicating that single_col contains several different kinds of value, it’s not clear how to proceed with the normalisation. I’ve found it easier to start by splitting the data by the type value, and normalising each group separately, to look for differences:

db_options_split <- lapply(split(df_options[, c("id", "single_col", "type")], df_options$type), autodb)
DiagrammeR::grViz(gv(db_options_split$value))
DiagrammeR::grViz(gv(db_options_split$interval))

In this case, it’s not especially informative, but it’s more helpful when the option type affects functional dependencies for other sets of attributes.

As a rather abstract example, we could suppose that these measurements are done with two different types of machine (device): both can return an interval, but only one can return a value instead.

df_options_device <- cbind(
  df_options,
  device = c(1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L)
)
knitr::kable(df_options_device)
id single_col value lower_bound upper_bound type device
1 2.3 2.3 NA NA value 1
2 2.3 2.3 NA NA value 1
3 5.7 5.7 NA NA value 1
4 2.4-7.1 NA 2.4 7.1 interval 2
5 0-10 NA 0.0 10.0 interval 1
6 2.4-10 NA 2.4 10.0 interval 2
7 5.3-13.1 NA 5.3 13.1 interval 1
8 2.4-25.8 NA 2.4 25.8 interval 1
9 2.4-25.8 NA 2.4 25.8 interval 2
10 5.6-25.8 NA 5.6 25.8 interval 2

The usual normalisation doesn’t show anything:

DiagrammeR::grViz(gv(autodb(df_options_device)))

However, splitting does show something:

db_options_device_split <- lapply(
  split(
    df_options_device[, c("id", "single_col", "type", "device")],
    df_options_device$type
  ),
  autodb
)
DiagrammeR::grViz(gv(db_options_device_split$value))
DiagrammeR::grViz(gv(db_options_device_split$interval))

Seeing that all measurements with value type used the same device is something that then might raise our attention. Of course, this wouldn’t happen if there were more than two types of device, with some able to give values and some not.

This is much easier when the measurement is split across multiple columns, and we add presence companion attributes:

df_options_deviceall <- cbind(
  df_options_with_presence,
  device = df_options_device$device
)
DiagrammeR::grViz(gv(autodb(df_options_deviceall)))
db_options_device_splitall <- lapply(
  split(
    df_options_deviceall,
    df_options_deviceall$type
  ),
  autodb
)
DiagrammeR::grViz(gv(db_options_device_splitall$value))
DiagrammeR::grViz(gv(db_options_device_splitall$interval))