25 changes: 23 additions & 2 deletions CLAUDE.md
@@ -1,4 +1,4 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

@@ -31,6 +31,14 @@ mvn test -Dtest=SimPathsStartTest

CLI help: `java -jar singlerun.jar -h` or `java -jar multirun.jar -h`

### Key CLI flags

- `-c <CC>` country code (`EL`, `IT`, `HU`, `PL`); `-s` start year; `-e` end year; `-p` population size; `-g true|false` show GUI.
- `-t true|false` (`--training`) — use the training-data subset under `input/<CC>/InitialPopulations/training/` and `EUROMODoutput/training/` (uses `TaxDonorParserTraining`). On `multirun.jar` this **overrides** `parameter_args.trainingFlag` from the YAML config.
- `singlerun.jar -Setup` — setup phase only (build the H2 input DB, no simulation). Multi-run equivalent is `-DBSetup`.
- `multirun.jar -r <seed>` random seed, `-n <N>` max runs, `-f` output to file, `-config <file.yml>` custom config (default `config/default.yml`).
- **Training auto-detect**: if `-t` is omitted and `input/<CC>/InitialPopulations/*.csv` is empty, `Parameters.trainingFlag` is flipped to `true` automatically and a notice is printed to stdout (`SimPathsStart.java:363-368, 520-525`). To diagnose which mode is active at runtime, look for either `Training-data flag set explicitly via CLI: -t ...` or `auto-switching to training data` in the console output.
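
The runtime diagnosis described in the last bullet can be scripted. The following is a minimal sketch, assuming the console output has been captured as text (for example from a log file). The two marker strings are the ones quoted above. The function name and return labels are hypothetical and are not part of SimPaths.

```python
from typing import Optional

# Marker strings quoted in the note above; matching is by substring.
EXPLICIT_MARKER = "Training-data flag set explicitly via CLI: -t"
AUTO_MARKER = "auto-switching to training data"


def training_mode_from_console(console_text: str) -> Optional[str]:
    """Classify how the training flag was decided, based on console output.

    Returns "explicit" if the CLI marker is found, "auto" if the
    auto-detect marker is found, and None if neither line is present.
    The first matching line wins.
    """
    for line in console_text.splitlines():
        if EXPLICIT_MARKER in line:
            return "explicit"
        if AUTO_MARKER in line:
            return "auto"
    return None
```

A result of `None` only means that neither marker appeared in the captured text; it says nothing about which data set was used.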

## Architecture

### Entity Hierarchy
@@ -66,10 +74,22 @@ CLI help: `java -jar singlerun.jar -h` or `java -jar multirun.jar -h`
### Data Inputs

- `input/input.mv.db` — H2 database with processed EU-SILC starting population
- `input/[COUNTRY]/` — Country-specific Excel parameter files, EUROMOD output CSVs
- `input/[COUNTRY]/InitialPopulations/` — actual starting-population CSVs; `…/training/` holds the shipped training subset
- `input/[COUNTRY]/EUROMODoutput/` — EUROMOD donor CSVs; `…/training/` holds the training subset
- `input/[COUNTRY]/` — country-specific Excel parameter files (e.g. `EUROMODpolicySchedule.xlsx`)
- `input/DatabaseCountryYear.xlsx` — Cross-country/year index
- `config/default.yml` — Default multi-run parameters (population size, year range, run count)
- `config/alignment_*.yml` — Staged alignment configurations
- `config/test_create_database.yml`, `config/test_run.yml` — Configs used by the integration test

### Repository layout (beyond `src/`)

- `scripts/` — shell wrappers for batch multi-runs (`run_alignment_multiruns.sh`, `run_multiruns-alignPopOFF.sh`, `run_TEST_multiruns.sh`, …)
- `input_processing/` — Stata do-files that prepare model inputs upstream of the Java pipeline (master conditions, regression-estimate cleaning, lag-structure generation)
- `tools/generate_simpaths_eu_variable_codebook.py` — variable codebook generator
- `validation/` — Stata validation against EU-SILC/EUROMOD targets
- `documentation/` — supplementary documentation
- `output/` — timestamped simulation outputs (created at runtime)

### Tax/Benefit Imputation

@@ -87,6 +107,7 @@ JUnit 5 + Mockito. Tests in `src/test/java/simpaths/`:
- `experiment/SimPathsMultiRunTest` — Multi-run configuration
- `experiment/PersonTest` — Person entity logic
- `data/MahalanobisDistanceTest` — Statistical matching
- `integrationtest/RunSimPathsIntegrationTest` — End-to-end run using `config/test_create_database.yml` + `config/test_run.yml`

## Branch Conventions

62 changes: 53 additions & 9 deletions README.md
@@ -1,10 +1,37 @@
# SimPathsEU

by Matteo Richiardi, Patryk Bronka, Justin van de Ven, Mariia Vartuzova, David Sonnewald
by CeMPA (Centre for Microsimulation and Policy Analysis).

## Documentation

The entire SimPaths documentation is available on its [website](https://simpaths.github.io/SimPaths/), which includes: a detailed description of its building blocks; instructions on how to set up and run the model; and information about contributing to the model's development. The `documentation/` directory contains supplementary materials that complement this README (model specifications, variable references, etc.).

## Introduction

SimPaths is a family of models for individual and household life course events, all sharing common components. The framework is designed to project life histories through time, building up a detailed picture of career paths, family (inter)relations, health, and financial circumstances. The framework builds upon standardised assumptions and data sources, which facilitates adaptation to alternative countries. This repository, **SimPathsEU**, covers Greece (`EL`), Hungary (`HU`), Italy (`IT`), and Poland (`PL`), and integrates with EUROMOD for tax and benefit policy simulation. Careful attention is paid to model validation, and sensitivity of projections to key assumptions. The modular nature of the SimPaths framework is designed to facilitate analysis of alternative assumptions concerning the tax and benefit system, sensitivity to parameter estimates and alternative approaches for projecting labour/leisure and consumption/savings decisions. Projections for a workhorse model parameterised to the UK context are reported in [Bronka, P., Richiardi, M., & van de Ven, J. (2023). *SimPaths: an open-source microsimulation model for life course analysis* (No. CEMPA6/23), Centre for Microsimulation and Policy Analysis at the Institute for Social and Economic Research*](https://www.microsimulation.ac.uk/publications/publication-557738/), which closely reflect observed data throughout a 10-year validation window.
SimPaths is a family of models for individual and household life course events, all sharing common components. The framework is designed to project life histories through time, building up a detailed picture of career paths, family (inter)relations, health, and financial circumstances. The framework builds upon standardised assumptions and data sources, which facilitates adaptation to alternative countries. This repository, **SimPathsEU**, covers Greece (`EL`), Hungary (`HU`), Italy (`IT`), and Poland (`PL`), and integrates with EUROMOD for tax and benefit policy simulation. Careful attention is paid to model validation, and sensitivity of projections to key assumptions. The modular nature of the SimPaths framework is designed to facilitate analysis of alternative assumptions concerning the tax and benefit system, sensitivity to parameter estimates and alternative approaches for projecting labour/leisure and consumption/savings decisions.


## License

Released under the terms in [`license.txt`](license.txt).

## Repository layout

```
SimPathsEU/
├── src/                  # Java source (main + tests)
├── input/                # H2 DB + per-country starting populations and EUROMOD outputs
│   ├── <CC>/InitialPopulations/{,training/}
│   └── <CC>/EUROMODoutput/{,training/}
├── input_processing/     # Stata do-files that prepare regression estimates and inputs
├── config/               # YAML configs (default.yml, alignment_*.yml, test_*.yml)
├── scripts/              # Bash wrappers for batch multi-run scenarios
├── validation/           # Stata validation against EU-SILC / EUROMOD targets
├── documentation/        # Supplementary documentation
├── output/               # Simulation outputs (created at runtime)
├── pom.xml
└── README.md
```

## Getting Started

@@ -30,11 +57,11 @@ However, please note that _training_ data is provided. It allows the simulation
1. **Java Development Kit (JDK):** the project targets **Java 19 or later** (see `pom.xml`, which pins `source`/`target` to 19). Install a compatible JDK, e.g. OpenJDK 19+ from [Adoptium](https://adoptium.net/).
2. **Maven:** required to build from the command line. See [installation instructions](https://maven.apache.org/install.html). (Not required if you only build via the IDE.)
3. **Download an IDE** (integrated development environment) of your choice - we recommend [IntelliJ IDEA](https://www.jetbrains.com/idea/download/); download the Community (free) or Ultimate (paid) edition, depending on your needs.
4. Clone your forked repository to your local machine. Import the cloned repository into IntelliJ as a Maven project
4. Clone your forked repository to your local machine. Import the cloned repository into IntelliJ as a Maven project.

### Compiling and running SimPaths with Maven in the CLI
### Compiling and running SimPaths with Maven from the CLI

SimPaths can also be compiled by Maven ([installation instructions here](https://maven.apache.org/install.html)) and run from the command line without an IDE. After cloning the repository and setting up the JDK, in the root directory you can run:
SimPaths can also be compiled with Maven ([installation instructions here](https://maven.apache.org/install.html)) and run from the command line without an IDE. After cloning the repository and setting up the JDK, in the root directory you can run:
```
$ mvn clean package
```
@@ -71,10 +98,10 @@ $ mvn verify -Dit.test=RunSimPathsIntegrationTest # run just the integration
- `-p` Simulated population size
- `-g` [true/false] show/hide gui
- `-r` Re-write policy schedule from detected policy files
- `-Setup` do setup phases (creating input populations database) only
- `-Setup` perform the setup phase only (build the input population database, then exit)
- `--rebuild-db` Force a rebuild of `input/input.mv.db` instead of reusing it (headless mode)
- `--reuse-existing-db` Reuse `input/input.mv.db` if present, otherwise build it (headless mode)
- `-t` [true/false] use training data subset. When `true`, reads from `input/<COUNTRY>/InitialPopulations/training/` and `input/<COUNTRY>/EUROMODoutput/training/`, and uses `TaxDonorParserTraining` (which drops `deh`/`drgn1`/`lcs` and uses `idhh` as the tax-unit identifier). When `false` (default), reads from `InitialPopulations/` and `EUROMODoutput/` directly and uses the standard `TaxDonorDataParser`. If `-t` is omitted, auto-detection applies: if `input/<COUNTRY>/InitialPopulations/` contains no `*.csv` files, the simulator falls back to training data and prints a console message.
- `-t` [true/false] use training data subset. When `true`, reads from `input/<COUNTRY>/InitialPopulations/training/` and `input/<COUNTRY>/EUROMODoutput/training/`. When `false` (default), reads from `InitialPopulations/` and `EUROMODoutput/` directly. If `-t` is omitted, auto-detection applies: if `input/<COUNTRY>/InitialPopulations/` contains no `*.csv` files, the simulator falls back to training data and prints a console message.
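
The directory selection behind `-t` can be illustrated with a short sketch. It is not the simulator's Java code: the function name, signature, and return shape are assumptions made for illustration, and only the directory names and the fallback rule come from the flag description above.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_input_dirs(
    input_root: Path, country: str, training: Optional[bool] = None
) -> Tuple[Path, Path, bool]:
    """Return (populations_dir, euromod_dir, training_used).

    `training=None` models an omitted -t flag: training data is used
    when no CSV sits directly under InitialPopulations/.
    """
    populations = input_root / country / "InitialPopulations"
    euromod = input_root / country / "EUROMODoutput"

    if training is None:
        # Auto-detect: a missing directory or one without *.csv files
        # triggers the fallback to the training subset.
        has_csv = populations.is_dir() and any(populations.glob("*.csv"))
        training = not has_csv

    if training:
        return populations / "training", euromod / "training", True
    return populations, euromod, False
```

For example, `resolve_input_dirs(Path("input"), "IT")` returns the `training/` subdirectories whenever `input/IT/InitialPopulations/` holds no CSV files.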

**Important:** the country (`-c`) and start year (`-s`) must be specified when creating or rebuilding the input population database — the resulting `input/input.mv.db` is country- and year-specific.

@@ -153,9 +180,26 @@ $ java -jar multirun.jar -r 100 -p 50000 -n 20 -s 2017 -e 2020 -g false -f

Run `java -jar singlerun.jar -h` or `java -jar multirun.jar -h` to show these help messages.

#### Output layout

Each simulation writes a timestamped subdirectory under `output/` (named `YYYYMMDDHHMMSS`), e.g.:

```
output/
├── <YYYYMMDDHHMMSS>/     # one run's artefacts
│   ├── database/         # H2 snapshot of the simulated population
│   └── input/            # copy of the inputs used for the run (for reproducibility)
└── logs/
    ├── run_<seed>.txt    # console log when multirun is invoked with -f
    └── run_<seed>.log    # logger output for the same run
```

Batch scripts in `scripts/` move each scenario's outputs into `output/<scenario-name>/` after the runs finish.
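
Because run directories are named with a fixed-width `YYYYMMDDHHMMSS` timestamp, the most recent run can be found by sorting the names. The helper below is a hypothetical convenience sketch, not something shipped with the repository. It assumes run directories sit directly under the given root and that their names are exactly 14 digits.

```python
import re
from pathlib import Path
from typing import Optional

# Exactly 14 digits: YYYYMMDDHHMMSS.
RUN_DIR_PATTERN = re.compile(r"\d{14}")


def latest_run_dir(output_root: Path) -> Optional[Path]:
    """Return the newest timestamped run directory, or None if there is none.

    Fixed-width numeric names sort lexicographically in chronological
    order, so no date parsing is needed. Other entries such as logs/
    are ignored.
    """
    runs = [
        entry
        for entry in output_root.iterdir()
        if entry.is_dir() and RUN_DIR_PATTERN.fullmatch(entry.name)
    ]
    return max(runs, key=lambda entry: entry.name, default=None)
```

The same function works on `output/<scenario-name>/` if the batch scripts have already moved the runs there.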


### Batch scenario scripts

Helper Bash scripts in `scripts/` run `multirun.jar` across multiple alignment configs in sequence and move each scenario's CSV output into `output/<scenario-name>/`:
Helper Bash scripts in `scripts/` run `multirun.jar` across multiple alignment configs in sequence and move each scenario's output into `output/<scenario-name>/`:
- `run_alignment_multiruns.sh` — full set of alignment scenarios


@@ -166,7 +210,7 @@ $ POP_SIZE=10000 RUNS_PER_SCENARIO=2 ./scripts/run_alignment_multiruns.sh

### Contributing

1. Create a new branch for your contributions. This will likely be based on either the `main` branch of this repository (if you seek to modify the stable version of the model) or `develop` (if you seek to modify the most recent version of the model). Please see branch naming convention below.
1. Create a new branch for your contributions. This will likely be based on either the `main` branch of this repository (if you seek to modify the stable version of the model) or `develop` (if you seek to modify the most recent version of the model).
2. Make your changes, add your code, and write tests if applicable.
3. Commit your changes.
4. Push your changes to your fork.
Binary file not shown.
Binary file removed documentation/SimPaths_Variable_CodebookUK.xlsx
Binary file removed input/SimPaths_Variable_Codebook.xlsx
@@ -167,13 +167,13 @@ global r1a_if_condition "dcpst == 2 & dag >= ${age_can_retire} & flag_deceased !
global r1b_if_condition "ssscp != 1 & dcpst == 1 & dag >= ${age_can_retire} & flag_deceased != 1"

* Wages
global W1fa_if_condition "dgn == 0 & dag >= ${age_seek_employment} & dag <= ${age_force_retire} & flag_deceased != 1"
global W1fa_if_condition "dgn == 0 & dag >= ${age_seek_employment} & dag <= ${age_force_retire} & deh_c4 != 0 & flag_deceased != 1"

global W1ma_if_condition "dgn == 1 & dag >= ${age_seek_employment} & dag <= ${age_force_retire} & flag_deceased != 1"
global W1ma_if_condition "dgn == 1 & dag >= ${age_seek_employment} & dag <= ${age_force_retire} & deh_c4 != 0 & flag_deceased != 1"

global W1fb_if_condition "dgn == 0 & dag >= ${age_seek_employment} & dag <= ${age_force_retire} & previouslyWorking == 1 & flag_deceased != 1"
global W1fb_if_condition "dgn == 0 & dag >= ${age_seek_employment} & dag <= ${age_force_retire} & deh_c4 != 0 & previouslyWorking == 1 & flag_deceased != 1"

global W1mb_if_condition "dgn == 1 & dag >= ${age_seek_employment} & dag <= ${age_force_retire} & previouslyWorking == 1 & flag_deceased != 1"
global W1mb_if_condition "dgn == 1 & dag >= ${age_seek_employment} & dag <= ${age_force_retire} & deh_c4 != 0 & previouslyWorking == 1 & flag_deceased != 1"

* Capital income
global i1a_if_condition "dag >= ${age_becomes_semi_responsible} & flag_deceased != 1"
@@ -51,9 +51,8 @@ if _rc {
exit 601
}

local dir_doc "`dir_w'/documentation"
local out_dta "`dir_doc'/key_function_income_thresholds_clean2018.dta"
local out_xlsx "`dir_doc'/key_function_income_thresholds_clean2018.xlsx"
local dir_out "`dir_w'/input_processing"
local out_xlsx "`dir_out'/key_function_income_thresholds_2018.xlsx"

local ref_year 2018
local weeks_per_month = 365.25 / (7 * 12)
@@ -248,7 +247,6 @@ sort country approach

format lo_monthly hi_monthly lo_weekly hi_weekly %12.2f
compress
save "`out_dta'", replace

export excel using "`out_xlsx'", sheet("results") firstrow(variables) sheetreplace

@@ -267,6 +265,5 @@ putexcel A11=("Output values") B11=("Local currency per week, exact and rounde
putexcel A12=("Do-file") B12=("01_key_function_income_thresholds_from_EUROMOD_2018.do")

di as txt "Saved results to:"
di as txt " `out_dta'"
di as txt " `out_xlsx'"
list country approach bu_id lo_hi_weekly_exact lo_hi_weekly_round, noobs abbreviate(32)
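
The `weeks_per_month` local in the do-file above is 365.25 / (7 × 12), roughly 4.348 weeks per month. The sketch below reproduces that constant in Python. The conversion direction (monthly amount divided by weeks per month gives a weekly amount) is an assumption inferred from the `lo_monthly`/`lo_weekly` variable names, since the visible hunks do not show the conversion, and the function name is hypothetical.

```python
# Average number of weeks in a month: 365.25 days / (7 days * 12 months).
WEEKS_PER_MONTH = 365.25 / (7 * 12)


def monthly_to_weekly(amount_monthly: float) -> float:
    """Convert a monthly amount to a weekly amount (assumed direction)."""
    return amount_monthly / WEEKS_PER_MONTH
```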
@@ -9,10 +9,12 @@
* DATA: Longitudinal EU-SILC UDB version, 2005 - 2020
* AUTHORS: Clare Fenwick, Daria Popova, Ashley Burdett,
* Aleksandra Kolndrekaj
* LAST UPDATE: Jan 2026 AB
* LAST UPDATE: March 2026 AB
*
********************************************************************************
* NOTES:
* ENSURE YOU HAVE ALREADY RUN THE 00_master_conditions.do FILE.
*
* Before running these files, the cumulative panel for each file type
* (D, H, R, P) must be constructed. These cumulative panels should be created
* following the procedure set out in *GESIS Papers 2022/10*. The do-files to
@@ -115,7 +117,7 @@ global dir_ind "/Users/ashleyburdett/Library/CloudStorage/Box-Box"
// Aleksandra - C:/Users/ak25793/Box

* Working directory
global dir_work "$dir_ind/CeMPA shared area/_SimPaths/_SimPathsEU/initial_populations/PL"
global dir_work "$dir_ind/CeMPA shared area/_SimPaths/_SimPathsEU/input_processing/initial_populations/PL"

* Directory containing do files
global dir_do "$dir_work/do_files"
@@ -146,21 +148,21 @@ global dir_data_05_20 "$dir_data/orig_panel_2005_2020"
* DEFINE PARAMETERS & PROCESS IF CONDITIONS
*******************************************************************************/

do "$dir_ind/CeMPA shared area/_SimPaths/_SimPathsEU/00_master_conditions.do"
do "$dir_ind/CeMPA shared area/_SimPaths/_SimPathsEU/input_processing/00_master_conditions_PL.do"


/*******************************************************************************
* EXECUTE FILES
*******************************************************************************/
//do "$dir_do/01_prepare_pooled_data.do"

do "$dir_do/02_create_variables_PL.do"
do "$dir_do/02_create_variables_${country}.do"

do "$dir_do/03_create_benefit_units_PL.do"
do "$dir_do/03_create_benefit_units_${country}.do"

do "$dir_do/04_reweight_PL.do"
do "$dir_do/04_reweight_${country}.do"

do "$dir_do/05_drop_hholds_and_slice_PL.do"
do "$dir_do/05_drop_hholds_slice_and_refactoring_${country}.do"

do "$dir_do/06_check_yearly_data_PL.do"
do "$dir_do/06_check_yearly_data_${country}.do"

42 changes: 37 additions & 5 deletions input_processing/data_construction/PL/01_prepare_pooled_data_PL.do
@@ -1,5 +1,5 @@
********************************************************************************
* PROJECT: ESPON
* PROJECT: SimPaths EU
* DO-FILE NAME: 01_prepare_pooled_data.do
* DESCRIPTION: Compiles panel dataset from EU-SILC
********************************************************************************
@@ -23,10 +23,42 @@ merge these chunks of data into one cumulative dataset (separately for the
D-,H-,R- and P-data).
*/
/*
Initial populations: cross-sectional SILC for 2011-2023 (income 2010-2022),
2023 (income 2022)
Estimation sample: longitudinal SILC with observations from 2011-2023
(income 2010-2022)
STRUCTURE OF THIS FILE

The script builds a person-level panel dataset for a single country by
sequentially merging the four EU-SILC master files produced by the panel
construction scripts (01-04 in eu_silc_do_2025/).

Files are merged in the following order, with R as the base:

R (Personal Register) — loaded first as the base. Contains all persons
in the sample including children under 16. Key identifiers: upid
(unique person ID across releases), uhid (unique household ID), year.

P (Personal Data) — merged 1:1 on year+upid+uhid. Contains income and
personal variables for adults aged 16 and above only. After this merge:
- Adults (in both R and P): have full R and P variables
- Children (in R only, not P): retained with R variables only
- Records in P but not R: dropped (should not occur in clean data)

D (Household Register) — merged 1:m on year+uhid. D is household-level
so one D row maps to multiple persons. keep if _merge==3 retains only
persons whose household appears in D. A small number of households may
not merge — this is suspected to be an edge case from the cross-release
deduplication in 01_create_masterD.do but has not been fully investigated.

H (Household Data) — merged 1:m on year+uhid, same logic as D.

KEY IDENTIFIERS
upid — unique personal ID across releases (country + rotation group +
dropout year + pid). Not the same as the raw pid in the source data.
uhid — unique household ID across releases (same construction logic).
year — income reference year.

OUTPUT
${country}-SILC_pooled_all_obs_01.dta — person-level panel for the target
country, containing all household members (adults and children) with
combined R, P, D, and H variables. Flag variables (*_f, *_i) are dropped.
*/
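
The merge order described in the header comment can be sketched outside Stata. The following pure-Python illustration is not the do-file's implementation. It assumes each file is a list of dictionaries keyed by `year`, `upid`, and `uhid` as named above; the function name and any other variable names are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def merge_silc(
    r_rows: List[Row], p_rows: List[Row], d_rows: List[Row], h_rows: List[Row]
) -> List[Row]:
    """Merge R, P, D and H rows into one person-level list, with R as the base."""
    p_by_person = {(p["year"], p["upid"], p["uhid"]): p for p in p_rows}
    d_by_household = {(d["year"], d["uhid"]): d for d in d_rows}
    h_by_household = {(h["year"], h["uhid"]): h for h in h_rows}

    merged: List[Row] = []
    for r in r_rows:
        household_key = (r["year"], r["uhid"])
        # D and H: keep only persons whose household is matched
        # (the equivalent of `keep if _merge==3`).
        if household_key not in d_by_household or household_key not in h_by_household:
            continue
        row = dict(r)
        # P: adults gain P variables; children (R only) keep R variables.
        # P records without an R record are never visited, so they drop out.
        row.update(p_by_person.get((r["year"], r["upid"], r["uhid"]), {}))
        row.update(d_by_household[household_key])
        row.update(h_by_household[household_key])
        merged.append(row)
    return merged
```

Iterating over R rows is what guarantees that children are retained and that P-only records are dropped.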

********************************************************************************