GEP 6: A unified architecture

Author	Hans-Martin von Gaudecker
Status	Provisionally accepted
Type	Standards Track
Created	2025-02-17
Resolution	Provisionally Accepted

Abstract

This GEP outlines a unified architecture for GETTSIM based on a DAG that encompasses:

- Namespaces for (policy) functions.
- Decorators for including or excluding functions from the DAG based on the evaluation date of the policy environment.
- Pre-processing of parameters.

Implementing this structure will make GETTSIM much more flexible and more natural to use / extend.

Motivation and Scope

While GETTSIM’s overall architecture has proven extremely useful, we are hitting limits in at least three directions:

- GETTSIM follows a law-to-code approach, but function names often become artificially long to ensure uniqueness, thereby hitting self-imposed character limits. For example, arbeitsl_geld_2_eink_m, erziehungsgeld_eink_relev_kind_y and kinderzuschl_eink_relev_m_bg represent similar concepts in very different legal contexts. Excessive concatenation and abbreviations make these names difficult to understand and disconnect them from the legal framework. In most cases, such complexity is unnecessary, as the module in which a function resides already provides sufficient context. In the examples above, those would be Arbeitslosengeld 2, Erziehungsgeld, and Kinderzuschlag.
- Handling functions that change over the years is not robust (examples in Issue 449).
- Parameter files do not handle well the cases in which functions expect parameters in a different form than the law specifies them (example: Issue 444).

These issues severely limit the development of GETTSIM. We have been spending far too much time finding names that adhere to our self-imposed character limits. Functions whose interfaces change over the years are a constant pain. Adding parameters is sometimes awkward because there is no obvious way to pre-process them.

The proposed changes will affect all areas of GETTSIM:

- The DAG will account for the namespace of GETTSIM’s policy functions, which allows for non-unique function names across GETTSIM’s subdirectories. Namespaces will affect how GETTSIM expects and returns input and output data.
- Building the DAG will depend on the date for which the user wants to perform calculations. Instead of a long list of if-statements in the policy_functions module, picking functions by year will be done in a more natural and more robust way.
- We extend the DAG to the parameters of the taxes and transfers system, thereby allowing for pre-processing of parameters. This clear way to pre-process parameters will allow specifying parameters exactly as they are written into the law.

Usage and Impact

The use of namespaces will have the usual benefits as in (Python) code: Disambiguation while keeping things that belong together in one place. We largely have this structure already for the parameters (think about each [x]_param dictionary as a namespace).

Currently, for example, erziehungsgeld_eink_relev_kind_y is used as the relevant income for the parental leave benefit that was in place before 2009 (Erziehungsgeld). It is calculated at the level of the child and later aggregated to the level of the receiving parent. Similarly, kinderzuschl_eink_relev_m_bg is the income relevant for means-testing the additional child benefit (Kinderzuschlag), and arbeitsl_geld_2_eink_m is the income used for means-testing the monthly subsistence payment (Bürgergeld).

Having all of these concepts in the same namespace forces us to distinguish them by including the potentially shortened name of the main transfer (erziehungsgeld, kinderzuschl, and arbeitsl_geld_2 in the examples) in the function’s name. Given the self-imposed character limit of 20 for user-facing columns, this requires abbreviating words in the function name to a great extent, making it difficult to read, search, and understand the codebase. Even so, kinderzuschl_eink_relev_m_bg is 8 characters too long.

These issues go beyond naming and could impact how future developers create policy functions. Suppose you want to create a function that checks whether erziehungsgeld_eink_relev_kind_y exceeds a certain threshold (see for example erziehungsgeld_ohne_abzug_m). Given the strict 32-character limit for function names, making the function’s purpose clear would be challenging. This constraint may lead developers to incorporate such checks in downstream functions of the DAG, potentially violating our law-to-code approach.

Another thing that has been a source of confusion is that something like kinderzuschl is overloaded. It is both the name of the transfer, i.e., a function with a monetary amount once the appropriate suffixes are added, and a prefix for all kinds of related functions that are required to calculate that transfer.

With namespaces, we would have:
```
├── erziehungsgeld
│   ├── einkommen_kind_m
│   ├── betrag_kind_m
│   └── betrag_m
├── kinderzuschlag
│   ├── einkommen_m_bg
│   └── betrag_m_bg
└── arbeitslosengeld_2
    ├── einkommen_m_bg
    └── betrag_m_bg
```
This means that:
- betrag is the convention for the monetary amount of a tax/transfer.
- Names are unique within a namespace. It will be possible to have a betrag_m_bg function within the kinderzuschlag namespace and an identically named function within arbeitslosengeld_2. The same holds for einkommen_m_bg. Hence, there is no ambiguity about which income is meant.
- The namespace will generally be represented as a tuple in GETTSIM’s internal infrastructure. This tuple will be the node identifier passed to the DAG; the value is the function. In pytree terminology, we will call the tuple the “path” and the function the “leaf”. The last element of the path will be called the “leaf name”.
- Within the code, it will be possible to refer to other functions residing in the same namespace without having to prefix them with the entire path. For example, within the arbeitslosengeld_2 namespace, it will be possible to use einkommen_m_bg as an argument to betrag_m_bg, and it is clear that the namespace-local version is meant. If this function is needed within the kinderzuschlag namespace, we need a “qualified name”, which uses the entire path with elements separated by double underscores. In this case, it would be arbeitslosengeld_2__einkommen_m_bg.
  
  (Note that the most readable separator would be a dot, but that does not work. In order to use the identifier as a function argument, it must be a valid Python identifier.)
- For functions defined in GETTSIM itself, the function loader will work by creating namespaces at the directory level. E.g., the above examples for kinderzuschlag and arbeitslosengeld_2 may be generated from the following package structure:
```
├── kinderzuschlag
│   └── kinderzuschlag.py
│       ├── betrag_m_bg
│       └── einkommen_m_bg
└── arbeitslosengeld_2
    ├── arbeitslosengeld_2.py
    │   └── betrag_m_bg
    └── einkommen.py
        └── einkommen_m_bg
```
  where the innermost level consists of the functions defined in each module. This balances the size of the namespace (which should be reasonably large to avoid having to use lots of qualified names) with distributing code across multiple files (a file should often be much smaller than a namespace). E.g., at the time of this writing, there are about 1000 lines of code within the directory transfers/arbeitsl_geld_2, most of which should live in one namespace. However, to keep an overview of the code, they need to be distributed across multiple files.
- Interacting with the policy environment in GETTSIM works with nested dictionaries. Paths are as defined above; leaves could be functions (for overriding / expanding the policy environment) or data columns (inputs and outputs).
  
  There will be built-in renaming functionality to make interaction with any data set on the user side very easy. Some details are described below, but there will be a separate GEP on the user-facing interface.
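
To make the relation between nested dictionaries, paths, and qualified names concrete, here is a minimal sketch. The helper names `to_qualified_name`, `to_path`, and `flatten` are hypothetical and not part of GETTSIM; only the tuple representation and the double-underscore separator are taken from the text above.

```python
from typing import Any

SEPARATOR = "__"  # double underscore, as in the qualified names above


def to_qualified_name(path: tuple[str, ...]) -> str:
    """Join a path into a qualified name that is a valid Python identifier."""
    return SEPARATOR.join(path)


def to_path(qualified_name: str) -> tuple[str, ...]:
    """Split a qualified name back into its path."""
    return tuple(qualified_name.split(SEPARATOR))


def flatten(
    tree: dict[str, Any], prefix: tuple[str, ...] = ()
) -> dict[tuple[str, ...], Any]:
    """Map every path in a nested dictionary to its leaf."""
    flat: dict[tuple[str, ...], Any] = {}
    for key, value in tree.items():
        path = (*prefix, key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Leaves are left as None here; in practice they would be functions or columns.
tree = {
    "kinderzuschlag": {"einkommen_m_bg": None, "betrag_m_bg": None},
    "arbeitslosengeld_2": {"einkommen_m_bg": None, "betrag_m_bg": None},
}
paths = list(flatten(tree))
qualified_names = [to_qualified_name(path) for path in paths]
```

Note that the round trip between path and qualified name in this sketch only works if no single path element contains a double underscore itself.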

A current example for functions changing over the years would be midijob_bemessungsentgelt_m. The relevant code in policy_environment is:

```python
import datetime

from gettsim.germany.social_insurance_contributions.eink_grenzen import (
    midijob_bemessungsentgelt_m_ab_10_2022,
    midijob_bemessungsentgelt_m_bis_09_2022,
)

# [...]

# Pick the implementation that is in force on the evaluation date.
if date >= datetime.date(year=2022, month=10, day=1):
    functions["midijob_bemessungsentgelt_m"] = midijob_bemessungsentgelt_m_ab_10_2022
else:
    functions["midijob_bemessungsentgelt_m"] = midijob_bemessungsentgelt_m_bis_09_2022
```

Adding a function that changes over time always means making changes in two completely unrelated places. It is very easy to get this wrong.

A future implementation may look something like:

```python
@policy_function(
    start_date="2003-04-01",
    end_date="2022-09-30",
    leaf_name="midijob_bemessungsentgelt_m",
)
def midijob_bemessungsentgelt_m_bis_09_2022(...):
    pass


@policy_function(start_date="2022-10-01", leaf_name="midijob_bemessungsentgelt_m")
def midijob_bemessungsentgelt_m_ab_10_2022(...):
    pass
```

The behavior will be such that if the evaluation date is between the start and end date, the name found under leaf_name will be used as the leaf name.

The defaults of start_date and end_date are such that they will cover the entire period one may want to use GETTSIM for.

If two or more functions with the same path are added to the DAG (i.e. there are overlapping start and end dates), an error will be raised.
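
The date-based selection and the overlap error could be sketched roughly as follows. This is an illustration under stated assumptions, not the planned implementation: the `PolicyFunction` class, the `active_functions` helper, the default dates, the inclusive treatment of both bounds, and the placeholder function bodies are all hypothetical.

```python
import datetime
from dataclasses import dataclass
from typing import Callable

# Hypothetical defaults standing in for "the entire period one may want to
# use GETTSIM for".
DEFAULT_START = "1900-01-01"
DEFAULT_END = "2100-12-31"


@dataclass(frozen=True)
class PolicyFunction:
    function: Callable
    leaf_name: str
    start_date: datetime.date
    end_date: datetime.date

    def is_active(self, date: datetime.date) -> bool:
        # Assumption: both bounds are inclusive.
        return self.start_date <= date <= self.end_date


def policy_function(start_date=DEFAULT_START, end_date=DEFAULT_END, leaf_name=None):
    """Wrap a function together with its validity period and leaf name."""

    def decorator(func):
        return PolicyFunction(
            function=func,
            leaf_name=leaf_name or func.__name__,
            start_date=datetime.date.fromisoformat(start_date),
            end_date=datetime.date.fromisoformat(end_date),
        )

    return decorator


def active_functions(candidates, date):
    """Return {leaf_name: function} for `date`; fail on overlapping periods."""
    active = {}
    for candidate in candidates:
        if not candidate.is_active(date):
            continue
        if candidate.leaf_name in active:
            raise ValueError(
                f"Several functions named {candidate.leaf_name!r} are active on {date}."
            )
        active[candidate.leaf_name] = candidate.function
    return active


@policy_function(
    start_date="2003-04-01",
    end_date="2022-09-30",
    leaf_name="midijob_bemessungsentgelt_m",
)
def midijob_bemessungsentgelt_m_bis_09_2022():
    return "bis_09_2022"  # placeholder body


@policy_function(start_date="2022-10-01", leaf_name="midijob_bemessungsentgelt_m")
def midijob_bemessungsentgelt_m_ab_10_2022():
    return "ab_10_2022"  # placeholder body


candidates = [
    midijob_bemessungsentgelt_m_bis_09_2022,
    midijob_bemessungsentgelt_m_ab_10_2022,
]
chosen = active_functions(candidates, datetime.date(2023, 1, 1))
```

In this sketch, the validity period lives next to the function definition, so adding a new version of a function requires a change in one place only.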

The yaml-files with parameters:
- Will live in the same directory as the other functions (=same namespace).
- Will be parsed to custom data classes, which will become nodes in the DAG.

Parsing the parameters will happen much like how policy functions are parsed. There is a standard way of mapping dictionary contents in the yaml-files to corresponding data classes. Dates are selected by the policy_environment date. If there are changes in the structure of the parameters over time, a mechanism similar to the start_date and end_date for the policy functions will be used, based on the YYYY-MM-DD keys in the yaml-files.
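
One way to picture the date selection is the sketch below. It assumes that the entry with the latest date key on or before the policy date applies; the `ScalarParam` class, the `select_param` helper, and all numeric values are made up for illustration and do not come from GETTSIM or from any law.

```python
import datetime
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalarParam:
    """Hypothetical data class for a scalar parameter node in the DAG."""

    name: str
    value: float
    valid_from: datetime.date


def select_param(name, spec, date):
    """Pick the entry with the latest date key on or before `date`."""
    applicable = [key for key in spec if key <= date]
    if not applicable:
        raise ValueError(f"Parameter {name!r} is not defined on {date}.")
    valid_from = max(applicable)
    return ScalarParam(name=name, value=spec[valid_from], valid_from=valid_from)


# Stand-in for the parsed content of a yaml-file; the values are made up.
beitragssatz_spec = {
    datetime.date(2000, 1, 1): 0.01,
    datetime.date(2010, 1, 1): 0.02,
}
param = select_param("beitragssatz", beitragssatz_spec, datetime.date(2015, 6, 30))
```

The resulting object could then be passed to policy functions like any other node, instead of a large params dictionary.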

Functions will not have [x]_params arguments containing potentially large and unstructured dicts any more. Instead, functions will only use the policy parameters they require. These could be scalars, homogeneous dictionaries, the inputs for piecewise_polynomial parameters, or custom objects.

The namespace makes clear which parameter is meant: say, the function beitrag in the namespace arbeitslosenversicherung will have an input beitragssatz. If we need parameters that are external to the current namespace, we will need the same verbose qualified-name syntax as for functions from other namespaces (e.g., sozialversicherung__rente__beitrag__beitragsbemessungsgrenze_m).
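
The lookup of function arguments, whether they refer to other functions or to parameters, might work along the following lines. The sketch assumes that a namespace-local name takes precedence and that anything else is read as a qualified name; `resolve_argument` is a hypothetical helper, and lookups in parent namespaces are not covered.

```python
def resolve_argument(argument, namespace, nodes):
    """Return the path that `argument` refers to from within `namespace`.

    Assumption: a namespace-local node wins; otherwise the argument is
    interpreted as a qualified name with double-underscore separators.
    """
    local_path = (*namespace, argument)
    if local_path in nodes:
        return local_path
    qualified_path = tuple(argument.split("__"))
    if qualified_path in nodes:
        return qualified_path
    raise KeyError(f"Cannot resolve {argument!r} from namespace {namespace}.")


# Paths of all nodes (functions and parameters) known to the DAG.
nodes = {
    ("arbeitslosengeld_2", "einkommen_m_bg"),
    ("arbeitslosengeld_2", "betrag_m_bg"),
    ("kinderzuschlag", "einkommen_m_bg"),
    ("kinderzuschlag", "betrag_m_bg"),
    ("arbeitslosenversicherung", "beitrag"),
    ("arbeitslosenversicherung", "beitragssatz"),
}

local = resolve_argument("beitragssatz", ("arbeitslosenversicherung",), nodes)
external = resolve_argument(
    "arbeitslosengeld_2__einkommen_m_bg", ("kinderzuschlag",), nodes
)
```

Here `local` points to the parameter in the function's own namespace, while `external` reaches into arbeitslosengeld_2 from within kinderzuschlag.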

Backward compatibility

The interface will get a complete overhaul, parts of which will be described in a separate GEP. The most important direct consequence is that a template for the structure of the input data will be generated for each application, which the user can then fill in. The template will be a nested dictionary (or yaml-file) with the paths as keys and the values left empty. The user may then fill in the values with the column names in her dataset. A similar renaming functionality will be added for the target data. Because of this, we will not need to work around character limits anymore: users will never need to access the internal structure of GETTSIM.
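
As a rough illustration of such a template, consider the sketch below. The helpers `input_template` and `column_mapping`, the chosen paths, and the user column names are all hypothetical; the actual interface is left to the separate GEP.

```python
def input_template(paths):
    """Build a nested dictionary with the paths as keys and values left empty."""
    template = {}
    for path in paths:
        node = template
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = None
    return template


def column_mapping(filled, prefix=()):
    """Map each user column name to the path it fills; skip empty entries."""
    mapping = {}
    for key, value in filled.items():
        path = (*prefix, key)
        if isinstance(value, dict):
            mapping.update(column_mapping(value, path))
        elif value is not None:
            mapping[value] = path
    return mapping


# Hypothetical input paths required for some application.
paths = [
    ("arbeitslosengeld_2", "einkommen_m_bg"),
    ("kinderzuschlag", "einkommen_m_bg"),
]
template = input_template(paths)

# The user fills in the column names found in her dataset (made-up names).
template["arbeitslosengeld_2"]["einkommen_m_bg"] = "alg2_income"
template["kinderzuschlag"]["einkommen_m_bg"] = "kiz_income"
mapping = column_mapping(template)
```

With a mapping like this, user-side column names can be arbitrary, which is why internal names no longer need to respect a character limit.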

Alternatives

Continuing with the status quo does not seem to be an option. We learned so much in the past five years that now seems to be a good time to put those lessons into code.

Discussion

There have been various discussions and preliminary implementations of some parts of this GEP:

Pull requests:
- #787 Model classes for policy functions and policy environments
- #720 Combined decorator for policy information
- #638 Don’t use functions in compute_taxes_and_transfers that are not active
- #804 Namespaces for policy functions
Issues:
- #781: Summary of interface discussion from 2024 GETTSIM workshop
Zulip

Copyright

This document has been placed in the public domain.
