GEP 6 — A unified architecture#
Author |
Status | Provisionally accepted
Type | Standards Track
Created | 2025-02-17
Resolution |
Abstract#
This GEP outlines a unified architecture for GETTSIM based on a DAG that encompasses:
Namespaces for (policy) functions.
Decorators for including or excluding functions from the DAG based on the evaluation date of the policy environment.
Pre-processing of parameters.
Implementing this structure will make GETTSIM much more flexible and more natural to use and extend.
Motivation and Scope#
While GETTSIM’s overall architecture has proven extremely useful, we are hitting limits in at least three directions:
GETTSIM follows a law-to-code approach, but function names often become artificially long to ensure uniqueness, thereby hitting self-imposed character limits. For example, arbeitsl_geld_2_eink_m, erziehungsgeld_eink_relev_kind_y, and kinderzuschl_eink_relev_m_bg represent similar concepts in very different legal contexts. Excessive concatenation and abbreviations make these names difficult to understand and disconnect them from the legal framework. In most cases, such complexity is unnecessary, as the module in which a function resides already provides sufficient context. In the examples above, those would be Arbeitslosengeld 2, Erziehungsgeld, and Kinderzuschlag.
Handling functions that change over the years is not robust (examples in Issue 449).
Parameters files do not handle cases well when functions expect parameters in a different form than the law specifies them (example: Issue 444).
These issues severely limit the development of GETTSIM. We have been spending far too much time finding names that adhere to our self-imposed character limits. Functions with changing interfaces over the years are a constant pain. Adding parameters is sometimes awkward as there is no obvious way to pre-process them.
The proposed changes will affect all areas of GETTSIM:
The DAG will account for the namespace of GETTSIM’s policy functions, which allows for non-unique function names across GETTSIM’s subdirectories. Namespaces will affect how GETTSIM expects and returns input and output data.
Building the DAG will depend on the date for which the user wants to perform calculations. Instead of a long list of if-statements in the policy_functions module, picking functions by year will be done in a more natural and more robust way.
We extend the DAG to the parameters of the taxes and transfers system, thereby allowing for pre-processing of parameters. This clear way to pre-process parameters will allow specifying parameters exactly as they are written into the law.
Usage and Impact#
The use of namespaces will have the usual benefits as in (Python) code: Disambiguation while keeping things that belong together in one place. We largely have this structure already for the parameters (think about each [x]_params dictionary as a namespace).
Currently, for example, erziehungsgeld_eink_relev_kind_y is used as the relevant income for the parental leave benefit that was in place before 2009 (Erziehungsgeld). It is calculated at the level of the child and later aggregated to the level of the receiving parent. Similarly, kinderzuschl_eink_relev_m_bg is the income relevant for means-testing the additional child benefit (Kinderzuschlag), and arbeitsl_geld_2_eink_m is the income used for means-testing the monthly subsistence payment (Bürgergeld).
Having all of these concepts in the same namespace forces us to distinguish them by including the potentially shortened name of the main transfer (erziehungsgeld, kinderzuschl, and arbeitsl_geld_2 in the examples) in the function’s name. Given the self-imposed character limit of 20 for user-facing columns, this requires abbreviating words in the function name to a great extent, making it difficult to read, search, and understand the codebase. Even so, kinderzuschl_eink_relev_m_bg is 8 characters too long.
These issues go beyond naming and could impact how future developers create policy functions. Suppose you want to create a function that checks whether erziehungsgeld_eink_relev_kind_y exceeds a certain threshold (see for example erziehungsgeld_ohne_abzug_m). Given the strict 32-character limit for function names, making the function’s purpose clear would be challenging. This constraint may lead developers to incorporate such checks in downstream functions of the DAG, potentially violating our law-to-code approach.
Another thing that has been a source of confusion is that something like kinderzuschl is overloaded. It is both the name of the transfer, i.e., a function with a monetary amount once the appropriate suffixes are added, and a prefix for all kinds of related functions that are required to calculate that transfer.
With namespaces, we would have:
├── erziehungsgeld
│   ├── einkommen_kind_m
│   ├── betrag_kind_m
│   └── betrag_m
├── kinderzuschlag
│   ├── einkommen_m_bg
│   └── betrag_m_bg
└── arbeitslosengeld_2
    ├── einkommen_m_bg
    └── betrag_m_bg
This means that:
betrag is the convention for the monetary amount of a tax/transfer.
Names are unique within a namespace. It will be possible to have a betrag_m_bg function within the kinderzuschlag namespace and a function of the same name within arbeitslosengeld_2. The same holds for einkommen_m_bg. Hence, there is no ambiguity about which income is meant.
The namespace will generally be represented as a tuple in GETTSIM’s internal infrastructure. This tuple will be the node identifier passed to the DAG; the value is the function. In pytree terminology, we will call the tuple the “path” and the function the “leaf”. The last element of the path will be called the “leaf name”.
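To make the path and leaf terminology concrete, here is a minimal sketch, not GETTSIM's implementation. It assumes the namespaces are held in a plain nested dictionary and flattens it into a mapping from tuple paths to functions. The helper flatten_to_paths and the two placeholder functions are hypothetical names introduced only for this illustration.

```python
def flatten_to_paths(tree, prefix=()):
    """Flatten a nested dict of functions into {path tuple: function}."""
    flat = {}
    for name, value in tree.items():
        path = (*prefix, name)
        if isinstance(value, dict):
            # An inner dict is a namespace: descend and extend the path.
            flat.update(flatten_to_paths(value, path))
        else:
            # Anything else is treated as a leaf (here: a function).
            flat[path] = value
    return flat


# Placeholder functions (hypothetical); real policy functions would take inputs.
def _kinderzuschlag_betrag():
    return "kinderzuschlag"


def _arbeitslosengeld_2_betrag():
    return "arbeitslosengeld_2"


# The same leaf name appears in two namespaces without any clash.
tree = {
    "kinderzuschlag": {"betrag_m_bg": _kinderzuschlag_betrag},
    "arbeitslosengeld_2": {"betrag_m_bg": _arbeitslosengeld_2_betrag},
}

flat = flatten_to_paths(tree)
leaf_names = {path[-1] for path in flat}
```

In this sketch, the keys of flat play the role of node identifiers for the DAG, and leaf_names shows that both functions share the leaf name betrag_m_bg while their paths differ.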
Within the code, it will be possible to refer to other functions residing in the same namespace without having to prefix them with the entire path. For example, within the arbeitslosengeld_2 namespace, it will be possible to use einkommen_m_bg as an argument to betrag_m_bg, and it is clear that the namespace-local version is meant. If this function is needed within the kinderzuschlag namespace, we need a “qualified name”, which uses the entire path with elements separated by double underscores. In this case, it would be arbeitslosengeld_2__einkommen_m_bg.
(Note that the most readable separator would be a dot, but that does not work: to be usable as a function argument, the name must be a valid Python identifier.)
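The lookup rule for local and qualified names could be sketched roughly as follows. This is an illustration under assumptions, not the actual resolution logic: the helper names qualified_name and resolve_argument are hypothetical, the sketch assumes that individual path elements never contain a double underscore, and it assumes that a namespace-local match takes precedence over a qualified one.

```python
SEPARATOR = "__"


def qualified_name(path):
    """Join a tuple path into a name that is a valid Python identifier."""
    return SEPARATOR.join(path)


def resolve_argument(argument, namespace, known_paths):
    """Resolve an argument name used inside `namespace` to a full path.

    Assumption of this sketch: the namespace-local name wins; otherwise the
    argument is interpreted as a qualified name starting at the root.
    """
    local = (*namespace, argument)
    if local in known_paths:
        return local
    absolute = tuple(argument.split(SEPARATOR))
    if absolute in known_paths:
        return absolute
    raise KeyError(f"Cannot resolve {argument!r} from namespace {namespace!r}")


known_paths = {
    ("kinderzuschlag", "einkommen_m_bg"),
    ("kinderzuschlag", "betrag_m_bg"),
    ("arbeitslosengeld_2", "einkommen_m_bg"),
    ("arbeitslosengeld_2", "betrag_m_bg"),
}

# Inside arbeitslosengeld_2, the short name refers to the local function.
local_path = resolve_argument("einkommen_m_bg", ("arbeitslosengeld_2",), known_paths)

# Inside kinderzuschlag, the qualified name is needed to reach the other namespace.
foreign_path = resolve_argument(
    "arbeitslosengeld_2__einkommen_m_bg", ("kinderzuschlag",), known_paths
)

# A dot-separated name could not be used as a function argument.
dotted_is_identifier = "arbeitslosengeld_2.einkommen_m_bg".isidentifier()
underscored_is_identifier = qualified_name(foreign_path).isidentifier()
```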
For functions defined in GETTSIM itself, the function loader will work by creating namespaces at the directory level. E.g., the above examples for kinderzuschlag and arbeitslosengeld_2 may be generated from the following package structure:
├── kinderzuschlag
│   └── kinderzuschlag.py
│       ├── betrag_m_bg
│       └── einkommen_m_bg
└── arbeitslosengeld_2
    ├── arbeitslosengeld_2.py
    │   └── betrag_m_bg
    └── einkommen.py
        └── einkommen_m_bg
where the innermost level consists of the functions defined in the module. This balances the size of the namespace (should be reasonably large to avoid having to use lots of qualified names) with distributing code across multiple files (size of a file should often be much smaller than a namespace). E.g., at the time of this writing, there are about 1000 lines of code within the directory transfers/arbeitsl_geld_2; most of these should live in one namespace. However, to keep an overview of the code, they need to be distributed across multiple files.
Interacting with the policy environment in GETTSIM works with nested dictionaries. Paths are as defined above; leaves could be functions (for overriding / expanding the policy environment) or data columns (inputs and outputs).
There will be built-in renaming functionality to make interaction with any data set on the user side very easy. Some details are described below, but there will be a separate GEP on the user-facing interface.
A current example of a function changing over the years is midijob_bemessungsentgelt_m. The relevant code in policy_environment is:

from gettsim.germany.social_insurance_contributions.eink_grenzen import (
    midijob_bemessungsentgelt_m_ab_10_2022,
)
from gettsim.germany.social_insurance_contributions.eink_grenzen import (
    midijob_bemessungsentgelt_m_bis_09_2022,
)

[...]

if date >= datetime.date(year=2022, month=10, day=1):
    functions["midijob_bemessungsentgelt_m"] = midijob_bemessungsentgelt_m_ab_10_2022
else:
    functions["midijob_bemessungsentgelt_m"] = midijob_bemessungsentgelt_m_bis_09_2022

Adding a function that changes over time always means making changes in two completely unrelated places. It is very easy to get this wrong.
A future implementation may look something like:

@policy_function(
    start_date="2003-04-01",
    end_date="2022-09-30",
    leaf_name="midijob_bemessungsentgelt_m",
)
def midijob_bemessungsentgelt_m_bis_09_2022(...):
    pass


@policy_function(start_date="2022-10-01", leaf_name="midijob_bemessungsentgelt_m")
def midijob_bemessungsentgelt_m_ab_10_2022(...):
    pass

The behavior will be such that if the evaluation date is between the start and end date, the name found under leaf_name will be used as the leaf name.
The defaults of start_date and end_date are such that they will cover the entire period one may want to use GETTSIM for.
If two or more functions with the same path are added to the DAG (i.e. there are overlapping start and end dates), an error will be raised.
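One way such a decorator and the date-based selection might fit together is sketched below. This is a hypothetical illustration, not the planned implementation: the DatedFunction class, the select_active helper, the default dates, the inclusive treatment of both boundary dates, and the choice to raise the overlap error at selection time are all assumptions of this sketch. The function bodies are placeholders.

```python
import datetime
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical defaults chosen to span any plausible evaluation date.
DEFAULT_START_DATE = datetime.date(1900, 1, 1)
DEFAULT_END_DATE = datetime.date(2100, 12, 31)


@dataclass(frozen=True)
class DatedFunction:
    """A function together with its leaf name and validity period."""

    function: Callable
    leaf_name: str
    start_date: datetime.date
    end_date: datetime.date


def policy_function(
    *,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
    leaf_name: Optional[str] = None,
):
    """Attach a validity period and a leaf name to a function."""

    def decorator(function: Callable) -> DatedFunction:
        start = datetime.date.fromisoformat(start_date) if start_date else DEFAULT_START_DATE
        end = datetime.date.fromisoformat(end_date) if end_date else DEFAULT_END_DATE
        return DatedFunction(
            function=function,
            # Fall back to the Python name if no leaf name is given.
            leaf_name=leaf_name or function.__name__,
            start_date=start,
            end_date=end,
        )

    return decorator


def select_active(candidates, date):
    """Return {leaf_name: function} for all candidates valid on `date`."""
    active = {}
    for candidate in candidates:
        if not (candidate.start_date <= date <= candidate.end_date):
            continue
        if candidate.leaf_name in active:
            # Two functions claim the same leaf name on the same date.
            raise ValueError(f"Overlapping definitions of {candidate.leaf_name!r}")
        active[candidate.leaf_name] = candidate.function
    return active


@policy_function(
    start_date="2003-04-01",
    end_date="2022-09-30",
    leaf_name="midijob_bemessungsentgelt_m",
)
def midijob_bemessungsentgelt_m_bis_09_2022():
    return "bis_09_2022"  # placeholder body


@policy_function(start_date="2022-10-01", leaf_name="midijob_bemessungsentgelt_m")
def midijob_bemessungsentgelt_m_ab_10_2022():
    return "ab_10_2022"  # placeholder body


candidates = [
    midijob_bemessungsentgelt_m_bis_09_2022,
    midijob_bemessungsentgelt_m_ab_10_2022,
]
selected = select_active(candidates, datetime.date(2023, 1, 1))
```

With this design, each function carries its own validity period, so adding a new version touches only the module where the function is defined.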
The yaml-files with parameters:
Will live in the same directory as the other functions (=same namespace).
Will be parsed to custom data classes, which will become nodes in the DAG.
Parsing the parameters will happen much like how policy functions are parsed. There is a standard way of mapping dictionary contents in the yaml-files to corresponding data classes. Dates are selected by the policy_environment date. If there are changes in the structure of the parameters over time, a mechanism similar to the start_date and end_date arguments of the policy functions will be used, based on the YYYY-MM-DD keys in the yaml-files.
Functions will not have [x]_params arguments containing potentially large and unstructured dicts any more. Instead, functions will only use the policy parameters they require. These could be scalars, homogeneous dictionaries, the inputs for piecewise_polynomial parameters, or custom objects.
The namespace makes clear which parameter is meant: say, the function beitrag in the namespace arbeitslosenversicherung will have an input beitragssatz. If we need parameters which are external to the current namespace, we will need the same verbose qualified-name syntax as for functions (sozialversicherung__rente__beitrag__beitragsbemessungsgrenze_m).
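As a rough illustration of parameters becoming date-dependent nodes, consider the following sketch. It is not the planned parser: the dictionary RAW_PARAMETERS stands in for the already-loaded content of a yaml-file, its numbers are made-up placeholders rather than legal values, and the names ScalarParameter, parameters_at, and the einkommen_m argument are hypothetical.

```python
import datetime
from dataclasses import dataclass

# Stand-in for parsed yaml content; the values are placeholders, not actual law.
RAW_PARAMETERS = {
    "beitragssatz": {
        datetime.date(2000, 1, 1): 0.03,
        datetime.date(2010, 1, 1): 0.02,
    },
}


@dataclass(frozen=True)
class ScalarParameter:
    """A single parameter value together with the date it became valid."""

    value: float
    valid_from: datetime.date


def parameters_at(raw, namespace, date):
    """Return {path: ScalarParameter} for the values in force on `date`."""
    nodes = {}
    for name, dated_values in raw.items():
        valid_dates = [d for d in dated_values if d <= date]
        if not valid_dates:
            # The parameter does not exist yet on this date.
            continue
        latest = max(valid_dates)
        nodes[(*namespace, name)] = ScalarParameter(
            value=dated_values[latest], valid_from=latest
        )
    return nodes


def beitrag(einkommen_m, beitragssatz):
    """Use only the one parameter that is needed, not a whole params dict."""
    return einkommen_m * beitragssatz


nodes = parameters_at(
    RAW_PARAMETERS, ("arbeitslosenversicherung",), datetime.date(2015, 6, 1)
)
beitragssatz = nodes[("arbeitslosenversicherung", "beitragssatz")]
result = beitrag(1000.0, beitragssatz.value)
```

The parameter shares the namespace of the function that uses it, so beitrag can name its input simply beitragssatz.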
Backward compatibility#
The interface will get a complete overhaul, parts of which will be described in a separate GEP. The most important direct consequence is that the structure of the input data will be generated for each application, which can then be filled by the user. Its format will be a nested dictionary (or yaml-file) with the paths as keys and values left empty. The user may then fill in the values with the column names in her dataset. A similar renaming functionality will be added to the target data. Because of this, we will not need to work around character limits anymore: users will never need to access the internal structure of GETTSIM.
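The generated input structure and the renaming step might look roughly like the sketch below. The details are left to the separate GEP, so everything here is an assumption: the helpers input_template and column_renames, the choice of None for empty values, the example paths, and the user column name kiz_income are all hypothetical.

```python
def input_template(paths):
    """Build a nested dict with the given paths as keys and empty values."""
    template = {}
    for path in paths:
        level = template
        for key in path[:-1]:
            # Create intermediate namespaces on demand.
            level = level.setdefault(key, {})
        level[path[-1]] = None
    return template


def column_renames(filled_template, prefix=()):
    """Map user column names to paths, skipping entries left empty."""
    renames = {}
    for key, value in filled_template.items():
        path = (*prefix, key)
        if isinstance(value, dict):
            renames.update(column_renames(value, path))
        elif value is not None:
            renames[value] = path
    return renames


# Paths required for some application (illustrative only).
required_paths = [
    ("kinderzuschlag", "einkommen_m_bg"),
    ("arbeitslosengeld_2", "einkommen_m_bg"),
]
template = input_template(required_paths)

# The user fills in the column names of their own dataset.
template["kinderzuschlag"]["einkommen_m_bg"] = "kiz_income"

renames = column_renames(template)
```

Because the mapping is driven by the user's own column names, the length of GETTSIM's internal paths places no constraint on the dataset.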
Alternatives#
Continuing with the status quo does not seem to be an option. We learned so much in the past five years that now seems to be a good time to put those lessons into code.
Discussion#
There have been various discussions and preliminary implementations of some parts of this GEP:
Pull requests:
Issues:
#781: Summary of interface discussion from 2024 GETTSIM workshop
Copyright#
This document has been placed in the public domain.