Enum Labels and Ordering

Overview

Many software packages for manipulating and analyzing tabular data have special features for working with categorical variables. These include:

Value labels or formats (Stata, SAS and SPSS)
Categoricals (Pandas)
Factors (R)
CategoricalVectors (Julia)

These features can result in more efficient storage and faster runtime performance, but more importantly, facilitate analysis by indicating that a variable should be treated as categorical and by permitting the logical order of the categories to differ from their lexical order. And in the case of value labels, they permit the analyst to work with variables in numeric form (e.g., in expressions, when fitting models) while generating output (e.g., tables, plots) that is labeled with informative strings.

While these features are of limited use in some disciplines, others rely heavily on them (e.g., social sciences, epidemiology, clinical research, etc.). Thus, before these disciplines can begin to use Frictionless in a meaningful way, both the standards and the software tools need to support these features. This pattern addresses necessary extensions to the Table Schema.

Principles

Before describing the proposed extensions, here are the principles on which they are based:

Extensions should be software agnostic (i.e., no additions to the official schema targeted toward a specific piece of software). While the extensions are intended to support the use of features not available in all software, the resulting data package should continue to work as well as possible with software that does not have those features.
Related to (1), extensions should only include metadata that describe the data themselves—not instructions for what a specific software package should do with the data. Users who want to include the latter may do so within a sub-namespace such as custom (e.g., see Issues #103 and #663).
Extensions must be backward compatible (i.e., not break existing tools, workflows, etc. for working with Frictionless packages).

It is worth emphasizing that the scope of the proposed extensions is strictly limited to the information necessary to make use of the features for working with categorical data provided by the software packages listed above. Previous discussions of this issue have occasionally included references to additional variable-level metadata (e.g., multiple sets of category labels such as both “short labels” and longer “descriptions”, or links to common data elements, controlled vocabularies or rdfTypes). While these additional metadata are undoubtedly useful, we speculate that the large majority of users who would benefit from the extensions propopsed here would not have and/or utilize such information, and therefore argue that these should be considered under a separate proposal.

Implementations

Our proposal to add a field-specific enumOrdered property has been raised here and here.

Discussions regarding supporting software providing features for working with categorical variables appear in the following GitHub issues:

and in the Frictionless Data forum:

Finally, while we are unaware of any existing implementations intended for general use, it is likely that many users are already exploiting the fact that arbitrary fields may be added to the table schema to support internal implementations.

Proposed extensions

We propose two extensions to Table Schema:

Add an optional field-specific enumOrdered property, which can be used when contructing a categorical (or factor) to indicate that the variable is ordinal.
Add an optional field-specific enumLabels property for use when data are stored using integer or other codes rather than using the category labels. This contains an object mapping the codes appearing in the data (keys) to what they mean (values), and can be used by software to construct corresponding value labels or categoricals (when supported) or to translate the values when reading the data.

These extensions are fully backward compatible, since they are optional and not providing them is valid.

Here is an example of a categorical variable using extension (1):

{
  "fields": [
    {
      "name": "physical_health",
      "type": "string",
      "constraints": {
        "enum": [
          "Poor",
          "Fair",
          "Good",
          "Very good",
          "Excellent",
        ]
      }
      "enumOrdered": true
    }
  ],
  "missingValues": ["Don't know","Refused","Not applicable"]
}

This is our preferred strategy, as it provides all of the information necessary to support the categorical functionality of the software packages listed above, while still yielding a useable result for software without such capability. As described below, value labels or categoricals can be created automatically based on the ordering of the values in the enum array, and the missingValues can be incorporated into the value labels or categoricals if desired. In those cases where it is desired to have more control over how the value labels are constructed, this information can be stored in a separate file in JSON format or as part of a custom extension to the table schema. Since such instructions do not describe the data themselves (but only how a specific software package should handle them), and since they are often software- and/or user-specific, we argue that they should not be included in the official table schema.

Alternatively, those who wish to store their data in encoded form (e.g., this is the default for data exports from REDCap, a commonly-used platform for collecting data for clinical studies) may use extension (2) to do so:

{
  "fields": [
    {
      "name": "physical_health",
      "type": "integer",
      "constraints": {
        "enum": [1,2,3,4,5]
      }
      "enumOrdered": true,
      "enumLabels": {
        "1": "Poor",
        "2": "Fair",
        "3": "Good",
        "4": "Very good",
        "5": "Excellent"
      }
    }
  ],
  "missingValues": ["Don't know","Refused","Not applicable"]
}

Note that although the field type is integer, the keys in the enumLabels object must be wrapped in double quotes because this is required by the JSON file format.

A second variant of the example above is the following:

{
  "fields": [
    {
      "name": "physical_health",
      "type": "integer",
      "constraints": {
        "enum": [1,2,3,4,5]
      }
      "enumOrdered": true,
      "enumLabels": {
        "1": "Poor",
        "2": "Fair",
        "3": "Good",
        "4": "Very good",
        "5": "Excellent",
        ".a": "Don't know",
        ".b": "Refused",
        ".c": "Not applicable"
      }
    }
  ],
  "missingValues": [".a",".b",".c"]
}

This represents encoded data exported from software with support for value labels. The values .a, .b, etc. are known as extended missing values (Stata and SAS only) and provide 26 unique missing values for numeric fields (both integer and float) in addition to the system missing value (“.”); in SPSS these would be replaced with specially designated integers, typically negative (e.g., -97, -98 and -99).

Specification

A field with an enum constraint or an enumLabels property MAY have an enumOrdered property that MUST be a boolean. A value of true indicates that the field should be treated as having an ordinal scale of measurement, with the ordering given by the order of the field’s enum array or by the lexical order of the enumLabels object’s keys, with the latter taking precedence. Fields without an enum constraint or an enumLabels property or for which the enumLabels keys do not include all values observed in the data (excluding any values specified in the missingValues property) MUST NOT have an enumOrdered property since in that case the correct ordering of the data is ambiguous. The absence of an enumOrdered property MUST NOT be taken to imply enumOrdered: false.
A field MAY have an enumLabels property that MUST be an object. This property SHOULD be used to indicate how the values in the data (represented by the object’s keys) are to be labeled or translated (represented by the corresponding value). As required by the JSON format, the object’s keys must be listed as strings (i.e., wrapped in double quotes). The keys MAY include values that do not appear in the data and MAY omit some values that do appear in the data. For clarity and to avoid unintentional loss of information, the object’s values SHOULD be unique.

Suggested implementations

Note: The use cases below address only reading data from a Frictionless data package; it is assumed that implementations will also provide the ability to write Frictionless data packages using the schema extensions proposed above. We suggest two types of implementations:

Additions to the official Python Frictionless Framework to generate software-specific scripts that may be executed by a specific software package to read data from a Frictionless data package and create the appropriate value labels or categoricals, as described below. These scripts can then be included along with the data in the package itself.
Software-specific extension packages that may be installed to permit users of that software to read data from a Frictionless data package directly, automatically creating the appropriate value labels or categoricals as described below.

The advantage of (1) is that it doesn’t require users to install another software package, which may in some cases be difficult or impossible. The advantage of (2) is that it provides native support for working with Frictionless data packages, and may be both easier and faster once the package is installed. We are in the process of implementing both approaches for Stata; implementations for the other software listed above are straightforward.

Software that supports value labels (Stata, SAS or SPSS)

In cases where a field has an enum constraint but no enumLabels property, automatically generate a value label mapping the integers 1, 2, 3, … to the enum values in order, use this to encode the field (thereby changing its type from string to integer), and attach the value label to the field. Provide option to skip automatically dropping values specified in the missingValues property and instead add them in order to the end of the value label, encoded using extended missing values if supported.
In cases where the data are stored in encoded form (e.g., as integers) and a corresponding enumLabels property is present, and assuming that the keys in the enumLabels object are limited to integers and extended missing values (if supported), use the enumLabels object to generate a value label and attach it to the field. As with (1), provide option to skip automatically dropping values specified in the missingValues property and instead add them in order to the end of the value label, encoded using extended missing values if supported.
Although none of Stata, SAS or SPSS currently permits designating a specific variable as ordered, Stata permits attaching arbitrary metadata to individual variables. Thus, in cases where the enumOrdered property is present, this information can be stored in Stata to inform the analyst and to prevent loss of information when generating Frictionless data packages from within Stata.

Software that supports categoricals or factors (Pandas, R, Julia)

In cases where a field has an enum constraint but no enumLabels property, automatically define a categorical or factor using the enum values in order, and convert the variable to categorical or factor type using this definition. Provide option to skip automatically dropping values specified in the missingValues property and instead add them in order to the end of the enum values when defining the categorical or factor.
In cases where the data are stored in encoded form (e.g., as integers) and a corresponding enumLabels property is present, translate the data using the enumLabels object, define a categorical or factor using the values of the enumLabels object in lexical order of the keys, and convert the variable to categorical or factor type using this definition. Provide option to skip automatically dropping values specified in the missingValues property and instead add them to the end of the enumLabels values when defining the categorical or factor.
In cases where a field has an enumOrdered property, use that when defining the categorical or factor.

All software

Although the extensions proposed here are intended primarily to support the use of value labels and categoricals in software that supports them, they also provide additional functionality when reading data into any software that can handle tabular data. Specifically, the enumLabels property may be used to support the use of enums even in cases where value labels or categoricals are not being used. For example, it is standard practice in software for analyzing genetic data to code sex as 0, 1 and 2 (corresponding to “Unknown”, “Male” and “Female”) and affection status as 0, 1 and 2 (corresponding to “Unknown”, “Unaffected” and “Affected”). In such cases, the enumLabels property may be used to confirm that the data follow the standard convention or to indicate that they deviate from it; it may also be used to translate those codes into human-readable values, if desired.

Notes

While this pattern is designed as an extension to Table Schema fields, it could also be used to document enum values of properties in profiles, such as contributor roles.

This pattern originally included a proposal to add an optional field-specific missingValues property similar to that described in the pattern “missing values per field” appearing in this document above. The objective was to provide a mechnanism to distinguish between so-called system missing values (i.e., values that indicate only that the corresponding data are missing) and other values that convey meaning but are typically excluded when fitting statistical models. The latter may be represented by extended missing values (.a, .b, .c, etc.) in Stata and SAS, or in SPSS by negative integers that are then designated as missing by using the MISSING VALUES command. For example, values such as “NA”, “Not applicable”, ”.”, etc. could be specified in the resource level missingValues property, while values such as “Don’t know” and “Refused”—often used when generating tabular summaries and occasionally used when fitting certain statistical models—could be specified in the corresponding field level missingValues property. The former would still be converted to null before type-specific string conversion (just as they are now), while the latter could be used by capable software when creating value labels or categoricals.

While this proposal was consistent with the principles outlined at the beginning (in particular, existing software would still yield a usable result when reading the data), we realized that it would conflict with what appears to be an emerging consensus regarding field-specific missingValues; i.e., that they should replace the less specific resource level missingValues for the corresponding field rather than be combined with them (see the discussion here as well as the “missing values per field” pattern above). While there is good reason for replacing rather than combining here (e.g., it is more explicit), it would unfortunately conflict with the idea of using the field-specific missingValues in conjunction with the resource level missingValues as just described; namely, if the field-specific property replaced the resource level property then the system missing values would no longer be converted to null, as desired.

For this reason, we have dropped the proposal to add a field-specific missingValues property from this pattern, and assert that implementation of this pattern by software should assume that if a field-specific missingValues property is added to the table schema it should, if present, replace the resource level missingValues property for the corresponding field. We do not believe that this change represents a substantial limitation when creating value labels or categoricals, since system missing values can typically be easily distinguished from other missing values when exported in CSV format (e.g., ”.” in Stata or SAS, “NA” in R, or "" in Pandas).