Logic, motivation, and use

This part of the documentation explains the motivation for switching to an XML format and how this change simplifies the process of adding content to eRegs.

regulations-parser

The process of creating the content that powers eRegs begins with a Python program called regulations-parser. The parser takes as its input the raw XML from the Federal Register. In the past, the parser generated the JSON layers described in the API section, and these were stored in the database as indicated on the diagram. Now, the parser generates a set of XML files: the first file represents the document that originates the regulation (e.g. 2011-31712.xml for the original document of Regulation C), and each subsequent XML document represents a notice that modifies the previous version of the regulation. Thus, if 2011-31712 is the original version and 2012-31311 is the notice representing the chronologically next modification, then "applying" the notice 2012-31311 to the regulation 2011-31712 generates the new regulation 2012-31311. The next notice then modifies the regulation 2012-31311, and so on.
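
The sketch below illustrates the shape of this apply-a-notice step in Python. The function names (load_regml, apply_notice) are placeholders for illustration only, not the actual regulations-parser API, and apply_notice is a stub standing in for the real change-set logic.

    from pathlib import Path


    def load_regml(path: str) -> str:
        """Read a RegML document from disk."""
        return Path(path).read_text()


    def apply_notice(version_xml: str, notice_xml: str) -> str:
        """Produce the next compiled version from the current version plus a
        notice. The real parser applies the notice's change set; this stub
        only stands in for that step."""
        return version_xml + "\n<!-- notice changes merged here -->"


    # Regulation C: the original document plus its chronologically next notice.
    original = load_regml("2011-31712.xml")
    notice = load_regml("2012-31311.xml")
    next_version = apply_notice(original, notice)  # i.e. version 2012-31311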

The benefit of this structure is that fixes made upstream propagate downstream. In the old pipeline, modifying the individual JSON files was too complicated, so any changes had to be made to the source Federal Register XML and the entire parser had to be run again. This made the process of compiling regulations hopelessly opaque and slow, especially for larger regulations like Regulation Z. With the introduction of RegML, initial mistakes in compilation can be fixed in the root version, and then incremental fixes can be made in the notices, which are typically much smaller than a fully compiled regulation. Once you have fixed version N and notice N+1, version N+1, which is obtained by applying notice N+1 to version N, retains those fixes. This makes it possible to fix mistakes in regulation compilation incrementally.
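
Continuing the sketch above, recompiling the chain of versions is just a fold over the ordered notices, so a fix made to the root version (or to any notice) flows into every later version the next time the chain is rebuilt. The function name and notice list here are again illustrative.

    def rebuild_versions(root_xml: str, ordered_notices: list[str]) -> dict[str, str]:
        """Recompile every version by applying notices in chronological order.
        A fix made to root_xml (or to any notice) shows up in every version
        built from that point onward."""
        versions = {}
        current = root_xml
        for notice_path in ordered_notices:
            current = apply_notice(current, load_regml(notice_path))
            versions[Path(notice_path).stem] = current
        return versions


    # Only the first notice's file name comes from the text above; any later
    # notices would simply be appended to this list in chronological order.
    all_versions = rebuild_versions(original, ["2012-31311.xml"])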

Additionally, fixing an XML file that can be validated against a schema is much easier than fixing the typically malformed, non-semantic XML that the Federal Register provides. RegML was designed to capture the entire semantics of a regulation, whereas the Federal Register XML is designed to be compiled into a printable PDF. In particular, RegML is designed to have a one-to-one correspondence between XML nodes and Django database models, and to separate visual representation from semantic markup. As a result, fixing errors in a RegML file requires substantially less time and effort.
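
As a rough illustration of what schema validation buys you, the following sketch checks a RegML document with lxml. It assumes the schema is an XSD saved locally; both file names are placeholders for wherever you keep the schema and the document.

    from lxml import etree

    schema = etree.XMLSchema(etree.parse("regml-schema.xsd"))
    doc = etree.parse("2011-31712.xml")

    if schema.validate(doc):
        print("document is valid RegML")
    else:
        for error in schema.error_log:
            # Each entry pinpoints the offending line and message, which is
            # what makes fixing a RegML file far faster than fixing raw
            # Federal Register XML.
            print(f"line {error.line}: {error.message}")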

The logic of RegML

RegML is designed to capture regulation semantics, separate model structure from presentation, and maintain an isomorphism between XML nodes and database rows. You can find the schema for RegML on GitHub. The schema is amply documented and should be mostly self-explanatory; it captures both the structure of a regulatory document, in the form of a hierarchy of nodes, and the semantics of the text that indicate linkages between different parts of the regulation, such as definitions and references.
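
To make "hierarchy plus linkages" concrete, the sketch below walks a parsed RegML tree, printing the node hierarchy and collecting cross-references. The element and attribute names used for the references (ref, target) are assumptions for illustration; consult the schema for the actual vocabulary and namespace.

    from lxml import etree

    doc = etree.parse("2011-31712.xml")
    root = doc.getroot()

    # Print the document's node hierarchy by local tag name and nesting depth.
    for element in root.iter('*'):
        depth = len(list(element.iterancestors()))
        print("  " * depth + etree.QName(element).localname)

    # Collect cross-reference linkages, assuming <ref target="..."> style
    # links; the real element and attribute names live in the schema.
    targets = [ref.get("target") for ref in root.iter("{*}ref")]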

Powering eRegs 2.0

As documented in the API section and the introduction, the new eRegs backend replaces the Byzantine collection of JSON layers with storage logic based on nested sets. Now, instead of breaking the RegML file into layers and then reassembling them on the site backend, the entire RegML tree can be imported directly into the database. This eliminates the need to upload large files (an update to Regulation Z can be as large as ~10 GB) and makes it much easier to store the canonical representation of a regulation in a repository. It also aids automation: an updated production branch of a repository can be pulled down and inserted into the database automatically, without passing through multiple pipeline stages.
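
To show the nested-set idea in miniature, the sketch below assigns left/right bounds to each node of a parsed RegML tree, which is essentially what a direct import has to compute before writing rows. The row layout and field names are illustrative only, not the actual eRegs models.

    from lxml import etree


    def number_tree(element, rows, counter=1):
        """Depth-first traversal assigning nested-set (left, right) bounds."""
        left = counter
        counter += 1
        row = {"tag": etree.QName(element).localname, "left": left, "right": None}
        rows.append(row)
        for child in element.iterchildren('*'):  # child elements only
            counter = number_tree(child, rows, counter)
        row["right"] = counter
        return counter + 1


    rows = []
    number_tree(etree.parse("2011-31712.xml").getroot(), rows)
    # Each dict in `rows` maps onto one database row; fetching a subtree then
    # becomes a simple "left BETWEEN parent.left AND parent.right" query.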