(gep-6)=

# GEP 6 — A unified architecture

```{list-table}
- * Author
  * [Hans-Martin von Gaudecker](https://github.com/hmgaudecker)
- * Status
  * Provisionally accepted
- * Type
  * Standards Track
- * Created
  * 2025-02-17
- * Resolution
  * [Provisionally Accepted](https://gettsim.zulipchat.com/#narrow/channel/309998-GEPs/topic/GEP.2006/near/505695738)
```

## Abstract

This GEP outlines a unified architecture for GETTSIM based on a DAG that encompasses:

1. Namespaces for (policy) functions.
1. Decorators for including or excluding functions from the DAG based on the evaluation
   date of the policy environment.
1. Pre-processing of parameters.

Implementing this structure will make GETTSIM much more flexible and more natural to use
/ extend.

## Motivation and Scope

While GETTSIM's overall architecture has proven extremely useful, we are hitting limits
in at least three directions:

1. GETTSIM follows a law-to-code approach, but function names often become artificially
   long to ensure uniqueness, thereby hitting self-imposed character limits. For
   example, `arbeitsl_geld_2_eink_m`, `erziehungsgeld_eink_relev_kind_y` and
   `kinderzuschl_eink_relev_m_bg` represent similar concepts in very different legal
   contexts. Excessive concatenation and abbreviations make these names difficult to
   understand and disconnect them from the legal framework. In most cases, such
   complexity is unnecessary, as the module in which a function resides already provides
   sufficient context — in the examples above, those would be Arbeitslosengeld 2,
   Erziehungsgeld, and Kinderzuschlag.
1. Handling functions that change over the years is not robust (examples in
   [Issue 449](https://github.com/ttsim-dev/gettsim/issues/449)).
1. Parameter files do not cope well with cases in which functions expect parameters in a
   form different from the one the law specifies (example:
   [Issue 444](https://github.com/ttsim-dev/gettsim/issues/444)).

These issues severely limit the development of GETTSIM. We have been spending far too
much time finding names that adhere to our self-imposed character limits. Functions with
changing interfaces over the years are a constant pain. Adding parameters is sometimes
awkward because there is no obvious way to pre-process them.

The proposed changes will affect all areas of GETTSIM:

1. The DAG will account for the namespace of GETTSIM's policy functions, which allows
   for non-unique function names across GETTSIM's subdirectories. Namespaces will affect
   how GETTSIM expects and returns input and output data.
1. Building the DAG will depend on the date for which the user wants to perform
   calculations. Instead of a long list of if-statements in the `policy_environment`
   module, picking functions by year will be done in a more natural and more robust way.
1. We extend the DAG to the parameters of the taxes and transfers system, thereby
   allowing for pre-processing of parameters. This clear way to pre-process parameters
   will allow specifying parameters exactly as they are written into the law.

## Usage and Impact

1. The use of namespaces will have the usual benefits as in (Python) code:
   Disambiguation while keeping things that belong together in one place. We largely
   have this structure already for the parameters (think about each `[x]_param`
   dictionary as a namespace).

   Currently, for example, `erziehungsgeld_eink_relev_kind_y` is used as the relevant
   income for the parental leave benefit that was in place before 2009 (Erziehungsgeld).
   It is calculated at the level of the child and later aggregated to the level of the
   receiving parent. Similarly, `kinderzuschl_eink_relev_m_bg` is the income relevant
   for means-testing the additional child benefit (Kinderzuschlag), and
   `arbeitsl_geld_2_eink_m` is the income used for means-testing the monthly subsistence
   payment (Bürgergeld).

   Having all of these concepts in the same namespace forces us to distinguish them by
   including the potentially shortened name of the main transfer—`erziehungsgeld`,
   `kinderzuschl`, and `arbeitsl_geld_2` in the examples—in the function's name. Given
   the self-imposed character limit of 20 for user-facing columns, this requires
   abbreviating words in the function name to a great extent, making it difficult to
   read, search, and understand the codebase. Even so, `kinderzuschl_eink_relev_m_bg` is
   8 characters too long.

   These issues go beyond naming and could impact how future developers create policy
   functions. Suppose you want to create a function that checks whether
   `erziehungsgeld_eink_relev_kind_y` exceeds a certain threshold (see for example
   `erziehungsgeld_ohne_abzug_m`). Given the 20-character limit noted above, making the
   function's purpose clear in its name would be challenging. This constraint may
   lead developers to incorporate such checks in downstream functions of the DAG,
   potentially violating our law-to-code approach.

   Another thing that has been a source of confusion is that something like
   `kinderzuschl` is overloaded. It is both the name of the transfer, i.e., a function
   with a monetary amount once the appropriate suffixes are added, and a prefix for all
   kinds of related functions that are required to calculate that transfer.

   With namespaces, we would have:

   ```
   ├── erziehungsgeld
   │   ├── einkommen_kind_m
   │   ├── betrag_kind_m
   │   └── betrag_m
   ├── kinderzuschlag
   │   ├── einkommen_m_bg
   │   └── betrag_m_bg
   └── arbeitslosengeld_2
       ├── einkommen_m_bg
       └── betrag_m_bg
   ```

   This means that:

   - `betrag` is the convention for the monetary amount of a tax/transfer.

   - Names are unique within a namespace. It will be possible to have a `betrag_m_bg`
     function within the `kinderzuschlag` namespace and an identically named function
     within `arbeitslosengeld_2`. The same holds for `einkommen_m_bg`. Hence, there is
     no ambiguity about which income is meant.

   - The namespace will generally be represented as a tuple in GETTSIM's internal
     infrastructure. This tuple will be the node identifier passed to the DAG; the
     value is the function. In pytree terminology, we will call the tuple the "path" and
     the function the "leaf". The last element of the path will be called the "leaf
     name".

   - Within the code, it will be possible to refer to other functions residing in the
     same namespace without having to prefix them with the entire path. For example,
     within the `arbeitslosengeld_2` namespace, it will be possible to use
     `einkommen_m_bg` as an argument to `betrag_m_bg`, and it is clear that the
     namespace-local version is meant. If the same function is needed within the
     `kinderzuschlag` namespace, we need a "qualified name", which consists of the
     entire path with elements separated by double underscores. In this case, it would
     be `arbeitslosengeld_2__einkommen_m_bg`.

     _(Note that the most readable separator would be a dot, but that does not work: to
     be usable as a function argument, the name must be a valid Python identifier.)_

   - For functions defined in GETTSIM itself, the function loader will work by creating
     namespaces at the directory level. E.g., the above examples for `kinderzuschlag`
     and `arbeitslosengeld_2` may be generated from the following package structure:

     ```
     ├── kinderzuschlag
     │   └── kinderzuschlag.py
     │       ├── betrag_m_bg
     │       └── einkommen_m_bg
     └── arbeitslosengeld_2
         ├── arbeitslosengeld_2.py
         │   └── betrag_m_bg
         └── einkommen.py
             └── einkommen_m_bg
     ```

     where the innermost level consists of the functions defined in each module. This
     balances the size of the namespace (which should be reasonably large to avoid lots
     of qualified names) against the need to distribute code across multiple files (a
     file should often be much smaller than a namespace). E.g., at the time of this
     writing, there are about 1000 lines of code within the directory
     `transfers/arbeitsl_geld_2`; most of these should live in one namespace. However,
     to keep an overview of the code, they need to be distributed across multiple files.

   - Interacting with the policy environment in GETTSIM works with nested dictionaries.
     Paths are as defined above; leaves could be functions (for overriding / expanding
     the policy environment) or data columns (inputs and outputs).

     There will be built-in renaming functionality to make interaction with any data set
     on the user side very easy. Some details are described below, but there will be a
     separate GEP on the user-facing interface.
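
   The path, qualified-name, and namespace-local lookup rules from the list above can
   be sketched as follows. This is a minimal illustration, not GETTSIM's
   implementation: the helper names `flatten`, `to_qualified_name`, and
   `resolve_argument` are hypothetical, and the leaves are placeholder strings rather
   than functions.

   ```python
   # Hypothetical helpers for illustration; none of these names are GETTSIM's API.
   QNAME_SEP = "__"


   def to_qualified_name(path):
       """Join a path tuple into a qualified name (a valid Python identifier)."""
       return QNAME_SEP.join(path)


   def flatten(tree, prefix=()):
       """Turn a nested dict into a flat {path: leaf} mapping."""
       flat = {}
       for key, value in tree.items():
           path = (*prefix, key)
           if isinstance(value, dict):
               flat.update(flatten(value, path))
           else:
               flat[path] = value
       return flat


   def resolve_argument(arg, namespace, paths):
       """Map an argument name to a path: namespace-local first, then qualified."""
       local = (*namespace, arg)
       if local in paths:
           return local
       qualified = tuple(arg.split(QNAME_SEP))
       if qualified in paths:
           return qualified
       raise KeyError(f"Cannot resolve {arg!r} from namespace {namespace!r}")


   # Placeholder leaves; in practice these would be functions or data columns.
   tree = {
       "kinderzuschlag": {"einkommen_m_bg": "leaf", "betrag_m_bg": "leaf"},
       "arbeitslosengeld_2": {"einkommen_m_bg": "leaf", "betrag_m_bg": "leaf"},
   }
   paths = set(flatten(tree))

   # Inside `arbeitslosengeld_2`, the short name refers to the local function.
   local_path = resolve_argument("einkommen_m_bg", ("arbeitslosengeld_2",), paths)

   # Inside `kinderzuschlag`, the qualified name is needed to reach the same function.
   foreign_path = resolve_argument(
       "arbeitslosengeld_2__einkommen_m_bg", ("kinderzuschlag",), paths
   )
   ```

   Both lookups end up at the same path, while a bare `einkommen_m_bg` used inside
   `kinderzuschlag` would resolve to that namespace's own function.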

1. A current example for functions changing over the years would be
   `midijob_bemessungsentgelt_m`. The relevant code in `policy_environment` is:

   ```python
   from gettsim.germany.social_insurance_contributions.eink_grenzen import (
       midijob_bemessungsentgelt_m_ab_10_2022,
   )
   from gettsim.germany.social_insurance_contributions.eink_grenzen import (
       midijob_bemessungsentgelt_m_bis_09_2022,
   )

   [...]

   if date >= datetime.date(year=2022, month=10, day=1):
       functions["midijob_bemessungsentgelt_m"] = midijob_bemessungsentgelt_m_ab_10_2022
   else:
       functions["midijob_bemessungsentgelt_m"] = midijob_bemessungsentgelt_m_bis_09_2022
   ```

   Adding a function that changes over time always means making changes in two
   completely unrelated places. It is very easy to get this wrong.

   A future implementation may look something like:

   ```python
   @policy_function(
       start_date="2003-04-01",
       end_date="2022-09-30",
       leaf_name="midijob_bemessungsentgelt_m",
   )
   def midijob_bemessungsentgelt_m_bis_09_2022(...):
       pass

   @policy_function(start_date="2022-10-01", leaf_name="midijob_bemessungsentgelt_m")
   def midijob_bemessungsentgelt_m_ab_10_2022(...):
       pass
   ```

   The behavior will be such that if the evaluation date lies between the start and end
   date (both inclusive), the function will enter the DAG under the name given by
   `leaf_name`.

   The defaults of `start_date` and `end_date` are such that they will cover the entire
   period one may want to use GETTSIM for.

   If two or more functions with the same path are added to the DAG (i.e., their start
   and end dates overlap), an error will be raised.
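
   One way such a decorator and the date-based selection might work is sketched below.
   The `policy_function` arguments follow the example above; everything else (the
   `PolicyFunction` class, `active_functions`, the default dates, and the placeholder
   function bodies) is an assumption made for illustration. The sketch only detects
   conflicts at the evaluation date, which is a simplification.

   ```python
   import datetime
   from dataclasses import dataclass
   from typing import Callable, Optional

   # Assumed defaults, chosen wide enough to cover any evaluation date of interest.
   DEFAULT_START = datetime.date(1900, 1, 1)
   DEFAULT_END = datetime.date(2100, 12, 31)


   @dataclass(frozen=True)
   class PolicyFunction:
       """Hypothetical wrapper storing a function with its validity period."""

       function: Callable
       leaf_name: str
       start_date: datetime.date
       end_date: datetime.date

       def is_active(self, date: datetime.date) -> bool:
           # Both bounds are treated as inclusive.
           return self.start_date <= date <= self.end_date


   def policy_function(
       start_date: Optional[str] = None,
       end_date: Optional[str] = None,
       leaf_name: Optional[str] = None,
   ):
       """Attach a validity period and a leaf name to a function."""

       def decorator(func: Callable) -> PolicyFunction:
           start = datetime.date.fromisoformat(start_date) if start_date else DEFAULT_START
           end = datetime.date.fromisoformat(end_date) if end_date else DEFAULT_END
           return PolicyFunction(func, leaf_name or func.__name__, start, end)

       return decorator


   def active_functions(candidates, date: datetime.date) -> dict:
       """Return {leaf_name: function} for `date`; raise if two candidates collide."""
       active = {}
       for candidate in candidates:
           if not candidate.is_active(date):
               continue
           if candidate.leaf_name in active:
               raise ValueError(f"Multiple active functions for {candidate.leaf_name!r}")
           active[candidate.leaf_name] = candidate.function
       return active


   @policy_function(
       start_date="2003-04-01",
       end_date="2022-09-30",
       leaf_name="midijob_bemessungsentgelt_m",
   )
   def midijob_bemessungsentgelt_m_bis_09_2022():
       return "bis_09_2022"  # placeholder body


   @policy_function(start_date="2022-10-01", leaf_name="midijob_bemessungsentgelt_m")
   def midijob_bemessungsentgelt_m_ab_10_2022():
       return "ab_10_2022"  # placeholder body


   candidates = [
       midijob_bemessungsentgelt_m_bis_09_2022,
       midijob_bemessungsentgelt_m_ab_10_2022,
   ]
   before = active_functions(candidates, datetime.date(2022, 9, 30))
   after = active_functions(candidates, datetime.date(2022, 10, 1))
   ```

   With this design, the validity period lives next to the function definition, so
   adding a new version of a function requires a change in one place only.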

1. The yaml-files with parameters

   - Will live in the same directory as the functions that use them (= same namespace).
   - Will be parsed to custom data classes, which will become nodes in the DAG.

   Parsing the parameters will happen much like how policy functions are parsed. There
   is a standard way of mapping dictionary contents in the yaml-files to corresponding
   data classes. The relevant values are selected by the `policy_environment` date. If
   the structure of the parameters changes over time, a mechanism similar to the
   `start_date` and `end_date` of the policy functions will be used, based on the
   `YYYY-MM-DD` keys in the yaml-files.

   Functions will no longer have `[x]_params` arguments containing potentially large and
   unstructured dicts. Instead, functions will only use the policy parameters they
   require. These could be scalars, homogeneous dictionaries, the inputs for
   `piecewise_polynomial` parameters, or custom objects.

   The namespace makes clear what we are talking about. Say, the function `beitrag` in
   the namespace `arbeitslosenversicherung` will have an input `beitragssatz`. If we need
   parameters which are external to the current namespace, we will need the same verbose
   syntax as in 1. (`sozialversicherung__rente__beitrag__beitragsbemessungsgrenze_m`).
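
The parameter handling described in the last item could look roughly like the sketch
below. It assumes the yaml-file has already been parsed into a dictionary keyed by
`YYYY-MM-DD` strings. The `ScalarParam` class, the `select_param` helper, the
`einkommen_m` argument, and the numeric values are placeholders for illustration, not
GETTSIM's actual data classes or parameters.

```python
import datetime
from dataclasses import dataclass

# Stand-in for a parsed yaml-file; the values are placeholders, not real parameters.
RAW_PARAMS = {
    "beitragssatz": {
        "2019-01-01": 0.025,
        "2020-01-01": 0.024,
    }
}


@dataclass(frozen=True)
class ScalarParam:
    """Hypothetical data class for a single-valued parameter."""

    name: str
    value: float
    valid_from: datetime.date


def select_param(name: str, raw: dict, date: datetime.date) -> ScalarParam:
    """Pick the entry with the latest start date that is not after `date`."""
    by_date = {datetime.date.fromisoformat(k): v for k, v in raw[name].items()}
    candidates = [d for d in by_date if d <= date]
    if not candidates:
        raise ValueError(f"No value for {name!r} on {date}")
    valid_from = max(candidates)
    return ScalarParam(name=name, value=by_date[valid_from], valid_from=valid_from)


def beitrag(einkommen_m: float, beitragssatz: float) -> float:
    """The function requests only the scalar it needs, not a whole params dict."""
    return einkommen_m * beitragssatz


param = select_param("beitragssatz", RAW_PARAMS, datetime.date(2019, 7, 1))
result = beitrag(einkommen_m=2000.0, beitragssatz=param.value)
```

Because the selected parameter is an ordinary value, it can become a node in the DAG
just like a function, and any pre-processing step can be expressed as a function that
takes it as input.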

## Backward compatibility

The interface will get a complete overhaul, parts of which will be described in a
separate GEP. The most important direct consequence is that the structure of the input
data will be generated for each application, which can then be filled by the user. The
format of that will be a nested dictionary (or yaml-file) with the paths as keys and
values left empty. The user may then fill in the values with the column names in her
dataset. Similar renaming functionality will be added for the target data. Because of
this, we will not need to work around character limits anymore — users will never need
to access the internal structure of GETTSIM.
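
A rough sketch of this workflow is shown below. The helper names `make_template` and
`rename_columns`, the chosen paths, and the user column names are all hypothetical;
the actual interface is left to the separate GEP.

```python
def make_template(paths):
    """Build a nested dict with the given paths as keys and empty (None) leaves."""
    template = {}
    for path in paths:
        node = template
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = None
    return template


def rename_columns(data, mapping, prefix=()):
    """Map user columns to internal paths according to the filled-in template."""
    renamed = {}
    for key, value in mapping.items():
        path = (*prefix, key)
        if isinstance(value, dict):
            renamed.update(rename_columns(data, value, path))
        elif value is not None:
            renamed[path] = data[value]
    return renamed


# Paths assumed to be required as inputs for some application.
required = [
    ("arbeitslosengeld_2", "einkommen_m_bg"),
    ("kinderzuschlag", "einkommen_m_bg"),
]
template = make_template(required)

# The user fills in the column names of their own dataset.
template["arbeitslosengeld_2"]["einkommen_m_bg"] = "alg2_income"
template["kinderzuschlag"]["einkommen_m_bg"] = "kiz_income"

user_data = {"alg2_income": [800.0, 0.0], "kiz_income": [1200.0, 300.0]}
internal = rename_columns(user_data, template)
```

The user only ever sees their own column names, while the internal paths can be as
long and descriptive as needed.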

## Alternatives

Continuing with the status quo does not seem to be an option. We learned so much in the
past five years that now seems to be a good time to put those lessons into code.

## Discussion

There have been various discussions and preliminary implementations of some parts of
this GEP:

- Pull requests:
  - [#787](https://github.com/ttsim-dev/gettsim/pulls/787) Model classes for policy
    functions and policy environments
  - [#720](https://github.com/ttsim-dev/gettsim/pulls/720) Combined decorator for policy
    information
  - [#638](https://github.com/ttsim-dev/gettsim/pulls/638) Don’t use functions in
    compute_taxes_and_transfers that are not active
  - [#804](https://github.com/ttsim-dev/gettsim/pulls/804) Namespaces for policy
    functions
- Issues:
  - [#781](https://github.com/ttsim-dev/gettsim/issues/781): Summary of interface
    discussion from 2024 GETTSIM workshop
- [Zulip](https://gettsim.zulipchat.com/#narrow/channel/309998-GEPs/topic/GEP.2006)

## Copyright

This document has been placed in the public domain.