simpaths · Mariia-Var · May 7, 2026 · May 1, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -1,4 +1,4 @@
-# CLAUDE.md
+# CLAUDE.md
 
 This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
 
@@ -31,6 +31,14 @@ mvn test -Dtest=SimPathsStartTest
 
 CLI help: `java -jar singlerun.jar -h` or `java -jar multirun.jar -h`
 
+### Key CLI flags
+
+- `-c <CC>` country code (`EL`, `IT`, `HU`, `PL`); `-s` start year; `-e` end year; `-p` population size; `-g true|false` show GUI.
+- `-t true|false` (`--training`) — use the training-data subset under `input/<CC>/InitialPopulations/training/` and `EUROMODoutput/training/` (uses `TaxDonorParserTraining`). On `multirun.jar` this **overrides** `parameter_args.trainingFlag` from the YAML config.
+- `singlerun.jar -Setup` — setup phase only (build the H2 input DB, no simulation). Multi-run equivalent is `-DBSetup`.
+- `multirun.jar -r <seed>` random seed, `-n <N>` max runs, `-f` output to file, `-config <file.yml>` custom config (default `config/default.yml`).
+- **Training auto-detect**: if `-t` is omitted and `input/<CC>/InitialPopulations/*.csv` matches no files, `Parameters.trainingFlag` is set to `true` automatically and a notice is printed to stdout (`SimPathsStart.java:363-368, 520-525`). To see which mode is active at runtime, look for either `Training-data flag set explicitly via CLI: -t ...` or `auto-switching to training data` in the console output.
+
 ## Architecture
 
 ### Entity Hierarchy
@@ -66,10 +74,22 @@ CLI help: `java -jar singlerun.jar -h` or `java -jar multirun.jar -h`
 ### Data Inputs
 
 - `input/input.mv.db` — H2 database with processed EU-SILC starting population
-- `input/[COUNTRY]/` — Country-specific Excel parameter files, EUROMOD output CSVs
+- `input/[COUNTRY]/InitialPopulations/` — actual starting-population CSVs; `…/training/` holds the shipped training subset
+- `input/[COUNTRY]/EUROMODoutput/` — EUROMOD donor CSVs; `…/training/` holds the training subset
+- `input/[COUNTRY]/` — country-specific Excel parameter files (e.g. `EUROMODpolicySchedule.xlsx`)
 - `input/DatabaseCountryYear.xlsx` — Cross-country/year index
 - `config/default.yml` — Default multi-run parameters (population size, year range, run count)
 - `config/alignment_*.yml` — Staged alignment configurations
+- `config/test_create_database.yml`, `config/test_run.yml` — Configs used by the integration test
+
+### Repository layout (beyond `src/`)
+
+- `scripts/` — shell wrappers for batch multi-runs (`run_alignment_multiruns.sh`, `run_multiruns-alignPopOFF.sh`, `run_TEST_multiruns.sh`, …)
+- `input_processing/` — Stata do-files that prepare model inputs upstream of the Java pipeline (master conditions, regression-estimate cleaning, lag-structure generation)
+- `tools/generate_simpaths_eu_variable_codebook.py` — variable codebook generator
+- `validation/` — Stata validation against EU-SILC/EUROMOD targets
+- `documentation/` — supplementary documentation
+- `output/` — timestamped simulation outputs (created at runtime)
 
 ### Tax/Benefit Imputation
 
@@ -87,6 +107,7 @@ JUnit 5 + Mockito. Tests in `src/test/java/simpaths/`:
 - `experiment/SimPathsMultiRunTest` — Multi-run configuration
 - `experiment/PersonTest` — Person entity logic
 - `data/MahalanobisDistanceTest` — Statistical matching
+- `integrationtest/RunSimPathsIntegrationTest` — End-to-end run using `config/test_create_database.yml` + `config/test_run.yml`
 
 ## Branch Conventions
 
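The training-flag precedence described under "Key CLI flags" can be sketched as follows. This is a minimal Python illustration, not the code in `SimPathsStart.java`: the helper name `resolve_training_flag` and the reason labels are hypothetical, and the sketch assumes that only CSVs directly inside `InitialPopulations/` count as full data (files under `training/` do not).

```python
from pathlib import Path
from typing import Optional, Tuple
import tempfile


def resolve_training_flag(
    cli_value: Optional[bool], input_root: Path, country: str
) -> Tuple[bool, str]:
    """Return (training_flag, reason). Hypothetical helper, not SimPaths code."""
    if cli_value is not None:
        # An explicit -t always wins; no auto-detection is attempted.
        return cli_value, "explicit"
    population_dir = input_root / country / "InitialPopulations"
    # Non-recursive glob: CSVs under InitialPopulations/training/ are ignored.
    has_full_data = population_dir.is_dir() and any(population_dir.glob("*.csv"))
    if has_full_data:
        return False, "default"
    # -t omitted and no full-population CSVs: fall back to training data.
    return True, "auto-detected"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "PL" / "InitialPopulations" / "training").mkdir(parents=True)
        print(resolve_training_flag(None, root, "PL"))
```

Returning the reason next to the flag mirrors the two console messages the documentation tells users to look for: one for an explicit `-t`, one for the automatic switch.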

diff --git a/README.md b/README.md
@@ -1,10 +1,37 @@
 # SimPathsEU
 
-by Matteo Richiardi, Patryk Bronka, Justin van de Ven, Mariia Vartuzova, David Sonnewald
+by CeMPA (Centre for Microsimulation and Policy Analysis).
+
+## Documentation
+
+The full SimPaths documentation is available on its [website](https://simpaths.github.io/SimPaths/), which includes a detailed description of the model's building blocks, instructions on how to set up and run the model, and information about contributing to its development. The `documentation/` directory contains supplementary materials that complement this README (model specifications, variable references, etc.).
 
 ## Introduction
 
-SimPaths is a family of models for individual and household life course events, all sharing common components. The framework is designed to project life histories through time, building up a detailed picture of career paths, family (inter)relations, health, and financial circumstances. The framework builds upon standardised assumptions and data sources, which facilitates adaptation to alternative countries. This repository, **SimPathsEU**, covers Greece (`EL`), Hungary (`HU`), Italy (`IT`), and Poland (`PL`), and integrates with EUROMOD for tax and benefit policy simulation. Careful attention is paid to model validation, and sensitivity of projections to key assumptions. The modular nature of the SimPaths framework is designed to facilitate analysis of alternative assumptions concerning the tax and benefit system, sensitivity to parameter estimates and alternative approaches for projecting labour/leisure and consumption/savings decisions. Projections for a workhorse model parameterised to the UK context are reported in [Bronka, P., Richiardi, M., & van de Ven, J. (2023). *SimPaths: an open-source microsimulation model for life course analysis* (No. CEMPA6/23), Centre for Microsimulation and Policy Analysis at the Institute for Social and Economic Research*](https://www.microsimulation.ac.uk/publications/publication-557738/), which closely reflect observed data throughout a 10-year validation window.
+SimPaths is a family of models for individual and household life course events, all sharing common components. The framework is designed to project life histories through time, building up a detailed picture of career paths, family (inter)relations, health, and financial circumstances. The framework builds upon standardised assumptions and data sources, which facilitates adaptation to alternative countries. This repository, **SimPathsEU**, covers Greece (`EL`), Hungary (`HU`), Italy (`IT`), and Poland (`PL`), and integrates with EUROMOD for tax and benefit policy simulation. Careful attention is paid to model validation, and sensitivity of projections to key assumptions. The modular nature of the SimPaths framework is designed to facilitate analysis of alternative assumptions concerning the tax and benefit system, sensitivity to parameter estimates and alternative approaches for projecting labour/leisure and consumption/savings decisions. 
+
+
+## License
+
+Released under the terms in [`license.txt`](license.txt).
+
+## Repository layout
+
+```
+SimPathsEU/
+├── src/                   # Java source (main + tests)
+├── input/                 # H2 DB + per-country starting populations and EUROMOD outputs
+│   ├── <CC>/InitialPopulations/{,training/}
+│   └── <CC>/EUROMODoutput/{,training/}
+├── input_processing/      # Stata do-files that prepare regression estimates and inputs
+├── config/                # YAML configs (default.yml, alignment_*.yml, test_*.yml)
+├── scripts/               # Bash wrappers for batch multi-run scenarios
+├── validation/            # Stata validation against EU-SILC / EUROMOD targets
+├── documentation/         # Supplementary documentation
+├── output/                # Simulation outputs (created at runtime)
+├── pom.xml
+└── README.md
+```
 
 ## Getting Started
 
@@ -30,11 +57,11 @@ However, please note that _training_ data is provided. It allows the simulation
 1. **Java Development Kit (JDK):** the project targets **Java 19 or later** (see `pom.xml`, which pins `source`/`target` to 19). Install a compatible JDK, e.g. OpenJDK 19+ from [Adoptium](https://adoptium.net/).
 2. **Maven:** required to build from the command line. See [installation instructions](https://maven.apache.org/install.html). (Not required if you only build via the IDE.)
 3. **Download an IDE** (integrated development environment) of your choice - we recommend [IntelliJ IDEA](https://www.jetbrains.com/idea/download/); download the Community (free) or Ultimate (paid) edition, depending on your needs.
-4. Clone your forked repository to your local machine. Import the cloned repository into IntelliJ as a Maven project
+4. Clone your forked repository to your local machine. Import the cloned repository into IntelliJ as a Maven project.
 
-### Compiling and running SimPaths with Maven in the CLI
+### Compiling and running SimPaths with Maven from the CLI
 
-SimPaths can also be compiled by Maven ([installation instructions here](https://maven.apache.org/install.html)) and run from the command line without an IDE. After cloning the repository and setting up the JDK, in the root directory you can run:
+SimPaths can also be compiled with Maven ([installation instructions here](https://maven.apache.org/install.html)) and run from the command line without an IDE. After cloning the repository and setting up the JDK, in the root directory you can run:
 ```
 $ mvn clean package
 ```
@@ -71,10 +98,10 @@ $ mvn verify -Dit.test=RunSimPathsIntegrationTest     # run just the integration
 - `-p` Simulated population size
 - `-g` [true/false] show/hide gui
 - `-r` Re-write policy schedule from detected policy files
-- `-Setup` do setup phases (creating input populations database) only
+- `-Setup` perform the setup phase only (build the input population database, then exit)
 - `--rebuild-db` Force a rebuild of `input/input.mv.db` instead of reusing it (headless mode)
 - `--reuse-existing-db` Reuse `input/input.mv.db` if present, otherwise build it (headless mode)
-- `-t` [true/false] use training data subset. When `true`, reads from `input/<COUNTRY>/InitialPopulations/training/` and `input/<COUNTRY>/EUROMODoutput/training/`, and uses `TaxDonorParserTraining` (which drops `deh`/`drgn1`/`lcs` and uses `idhh` as the tax-unit identifier). When `false` (default), reads from `InitialPopulations/` and `EUROMODoutput/` directly and uses the standard `TaxDonorDataParser`. If `-t` is omitted, an auto-detect kicks in: if `InitialPopulations/<country>/*.csv` is empty, the simulator falls back to training data and prints a console message.
+- `-t` [true/false] use the training data subset. When `true`, reads from `input/<COUNTRY>/InitialPopulations/training/` and `input/<COUNTRY>/EUROMODoutput/training/`. When `false` (default), reads from `input/<COUNTRY>/InitialPopulations/` and `input/<COUNTRY>/EUROMODoutput/` directly. If `-t` is omitted, auto-detection applies: if `input/<COUNTRY>/InitialPopulations/*.csv` matches no files, the simulator falls back to training data and prints a console message.
 
 **Important:** the country (`-c`) and start year (`-s`) must be specified when creating or rebuilding the input population database — the resulting `input/input.mv.db` is country- and year-specific.
 
@@ -153,9 +180,26 @@ $ java -jar multirun.jar -r 100 -p 50000 -n 20 -s 2017 -e 2020 -g false -f
 
 Run `java -jar singlerun.jar -h` or `java -jar multirun.jar -h` to show these help messages.
 
+#### Output layout
+
+Each simulation writes a timestamped subdirectory under `output/` (named `YYYYMMDDHHMMSS`), e.g.:
+
+```
+output/
+├── <YYYYMMDDHHMMSS>/            # one run's artefacts
+│   ├── database/                # H2 snapshot of the simulated population
+│   └── input/                   # copy of the inputs used for the run (for reproducibility)
+└── logs/
+    ├── run_<seed>.txt           # console log when multirun is invoked with -f
+    └── run_<seed>.log           # logger output for the same run
+```
+
+Batch scripts in `scripts/` move each scenario's outputs into `output/<scenario-name>/` after the runs finish.
+
+
 ### Batch scenario scripts
 
-Helper Bash scripts in `scripts/` run `multirun.jar` across multiple alignment configs in sequence and move each scenario's CSV output into `output/<scenario-name>/`:
+Helper Bash scripts in `scripts/` run `multirun.jar` across multiple alignment configs in sequence and move each scenario's output into `output/<scenario-name>/`:
 - `run_alignment_multiruns.sh` — full set of alignment scenarios
 
 
@@ -166,7 +210,7 @@ $ POP_SIZE=10000 RUNS_PER_SCENARIO=2 ./scripts/run_alignment_multiruns.sh
 
 ### Contributing
 
-1. Create a new branch for your contributions. This will likely be based on either the `main` branch of this repository (if you seek to modify the stable version of the model) or `develop` (if you seek to modify the most recent version of the model).  Please see branch naming convention below.
+1. Create a new branch for your contributions. This will likely be based on either the `main` branch of this repository (if you seek to modify the stable version of the model) or `develop` (if you seek to modify the most recent version of the model).
 2. Make your changes, add your code, and write tests if applicable.
 3. Commit your changes.
 4. Push your changes to your fork.
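The README's "Output layout" section says each run writes to a subdirectory of `output/` named `YYYYMMDDHHMMSS`. A small Python sketch of how the most recent run could be located from that naming convention is given here. The helper names `parse_run_timestamp` and `latest_run_dir` are hypothetical and not part of the repository. Directories such as `logs/` or `<scenario-name>/` are skipped because their names are not 14-digit timestamps.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional
import re
import tempfile


def parse_run_timestamp(name: str) -> Optional[datetime]:
    """Parse a YYYYMMDDHHMMSS directory name; return None if it is not one."""
    if re.fullmatch(r"[0-9]{14}", name) is None:
        return None
    try:
        return datetime.strptime(name, "%Y%m%d%H%M%S")
    except ValueError:
        # Fourteen digits that do not form a real date, e.g. 30 February.
        return None


def latest_run_dir(output_root: Path) -> Optional[Path]:
    """Return the run directory with the newest timestamp, or None."""
    best_stamp: Optional[datetime] = None
    best_dir: Optional[Path] = None
    for child in output_root.iterdir():
        if not child.is_dir():
            continue
        stamp = parse_run_timestamp(child.name)
        if stamp is not None and (best_stamp is None or stamp > best_stamp):
            best_stamp, best_dir = stamp, child
    return best_dir
```

Parsing the name into a `datetime` instead of comparing strings rejects malformed names early, at the cost of a few extra lines.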

diff --git a/documentation/PL_InitialPopulations_column_mapping_AB.xlsx b/documentation/PL_InitialPopulations_column_mapping_AB.xlsx
diff --git a/...SimPathsEU_Variable_Codebook_updated.xlsx → ...ntation/SimPathsEU_Variable_Codebook.xlsx b/...SimPathsEU_Variable_Codebook_updated.xlsx → ...ntation/SimPathsEU_Variable_Codebook.xlsx
diff --git a/documentation/SimPathsEU_Variable_Codebook_changes.xlsx b/documentation/SimPathsEU_Variable_Codebook_changes.xlsx
diff --git a/documentation/SimPaths_Variable_CodebookUK.xlsx b/documentation/SimPaths_Variable_CodebookUK.xlsx
diff --git a/input/PL/DoFilesTargets/TargetsPlots/disability_targets_ts.gph b/input/PL/DoFilesTargets/TargetsPlots/disability_targets_ts.gph
diff --git a/input/PL/DoFilesTargets/TargetsPlots/disability_targets_ts.png b/input/PL/DoFilesTargets/TargetsPlots/disability_targets_ts.png
diff --git a/input/PL/DoFilesTargets/TargetsPlots/employment_targets_ts.gph b/input/PL/DoFilesTargets/TargetsPlots/employment_targets_ts.gph
diff --git a/input/PL/DoFilesTargets/TargetsPlots/employment_targets_ts.png b/input/PL/DoFilesTargets/TargetsPlots/employment_targets_ts.png
diff --git a/input/PL/DoFilesTargets/TargetsPlots/inSchool_targets_ts.gph b/input/PL/DoFilesTargets/TargetsPlots/inSchool_targets_ts.gph
diff --git a/input/PL/DoFilesTargets/TargetsPlots/inSchool_targets_ts.png b/input/PL/DoFilesTargets/TargetsPlots/inSchool_targets_ts.png
diff --git a/input/PL/DoFilesTargets/TargetsPlots/partnered_BUlogic_targets_ts.gph b/input/PL/DoFilesTargets/TargetsPlots/partnered_BUlogic_targets_ts.gph
diff --git a/input/PL/DoFilesTargets/TargetsPlots/partnered_BUlogic_targets_ts.png b/input/PL/DoFilesTargets/TargetsPlots/partnered_BUlogic_targets_ts.png
diff --git a/input/PL/DoFilesTargets/TargetsPlots/retirement_targets_ts.gph b/input/PL/DoFilesTargets/TargetsPlots/retirement_targets_ts.gph
diff --git a/input/PL/DoFilesTargets/TargetsPlots/retirement_targets_ts.png b/input/PL/DoFilesTargets/TargetsPlots/retirement_targets_ts.png
diff --git a/input/PL/DoFilesTargets/alignment_targets_disability.xlsx b/input/PL/DoFilesTargets/alignment_targets_disability.xlsx
diff --git a/input/PL/DoFilesTargets/alignment_targets_employment.xlsx b/input/PL/DoFilesTargets/alignment_targets_employment.xlsx
diff --git a/input/PL/DoFilesTargets/alignment_targets_inSchool.xlsx b/input/PL/DoFilesTargets/alignment_targets_inSchool.xlsx
diff --git a/input/PL/DoFilesTargets/alignment_targets_partnered_share.xlsx b/input/PL/DoFilesTargets/alignment_targets_partnered_share.xlsx
diff --git a/input/PL/DoFilesTargets/alignment_targets_retirement.xlsx b/input/PL/DoFilesTargets/alignment_targets_retirement.xlsx
diff --git a/input/SimPaths_Variable_Codebook.xlsx b/input/SimPaths_Variable_Codebook.xlsx
diff --git a/input_processing/00_master_conditions.do → input_processing/00_master_conditions_PL.do b/input_processing/00_master_conditions.do → input_processing/00_master_conditions_PL.do
@@ -167,13 +167,13 @@ global r1a_if_condition "dcpst == 2 & dag >= ${age_can_retire} & flag_deceased !
 global r1b_if_condition "ssscp != 1 & dcpst == 1 & dag >= ${age_can_retire} & flag_deceased != 1"
 
 * Wages
-global W1fa_if_condition "dgn == 0 & dag >= ${age_seek_employment} & dag <= ${age_force_retire} & flag_deceased != 1"
+global W1fa_if_condition "dgn == 0 & dag >= ${age_seek_employment} & dag <= ${age_force_retire} & deh_c4 != 0 & flag_deceased != 1"
 
-global W1ma_if_condition "dgn == 1 & dag >= ${age_seek_employment} & dag <= ${age_force_retire} & flag_deceased != 1"
+global W1ma_if_condition "dgn == 1 & dag >= ${age_seek_employment} & dag <= ${age_force_retire} & deh_c4 != 0 & flag_deceased != 1"
 
-global W1fb_if_condition "dgn == 0 & dag >= ${age_seek_employment} & dag <= ${age_force_retire} & previouslyWorking == 1 & flag_deceased != 1"
+global W1fb_if_condition "dgn == 0 & dag >= ${age_seek_employment} & dag <= ${age_force_retire} & deh_c4 != 0 & previouslyWorking == 1 & flag_deceased != 1"
 
-global W1mb_if_condition "dgn == 1 & dag >= ${age_seek_employment} & dag <= ${age_force_retire} & previouslyWorking == 1 & flag_deceased != 1"
+global W1mb_if_condition "dgn == 1 & dag >= ${age_seek_employment} & dag <= ${age_force_retire} & deh_c4 != 0 & previouslyWorking == 1 & flag_deceased != 1"
 
 * Capital income 
 global i1a_if_condition "dag >= ${age_becomes_semi_responsible} & flag_deceased != 1" 

diff --git a/input_processing/01_key_function_income_thresholds_from_EUROMOD_2018.do b/input_processing/01_key_function_income_thresholds_from_EUROMOD_2018.do
@@ -51,9 +51,8 @@ if _rc {
     exit 601
 }
 
-local dir_doc "`dir_w'/documentation"
-local out_dta  "`dir_doc'/key_function_income_thresholds_clean2018.dta"
-local out_xlsx "`dir_doc'/key_function_income_thresholds_clean2018.xlsx"
+local dir_out "`dir_w'/input_processing"
+local out_xlsx "`dir_out'/key_function_income_thresholds_2018.xlsx"
 
 local ref_year 2018
 local weeks_per_month = 365.25 / (7 * 12)
@@ -248,7 +247,6 @@ sort country approach
 
 format lo_monthly hi_monthly lo_weekly hi_weekly %12.2f
 compress
-save "`out_dta'", replace
 
 export excel using "`out_xlsx'", sheet("results") firstrow(variables) sheetreplace
 
@@ -267,6 +265,5 @@ putexcel A11=("Output values")   B11=("Local currency per week, exact and rounde
 putexcel A12=("Do-file")         B12=("01_key_function_income_thresholds_from_EUROMOD_2018.do")
 
 di as txt "Saved results to:"
-di as txt "  `out_dta'"
 di as txt "  `out_xlsx'"
 list country approach bu_id lo_hi_weekly_exact lo_hi_weekly_round, noobs abbreviate(32)
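The `weeks_per_month` constant in this do-file is `365.25 / (7 * 12)`, roughly 4.348 weeks per month. The arithmetic can be checked with a short Python sketch. It assumes that weekly thresholds are obtained by dividing monthly amounts by this factor; the hunks shown here do not include the conversion step, so that direction is an assumption, and `monthly_to_weekly` is a hypothetical helper name.

```python
# Average number of weeks in a month: 365.25 days per year,
# 7 days per week, 12 months per year.
WEEKS_PER_MONTH = 365.25 / (7 * 12)


def monthly_to_weekly(amount_per_month: float) -> float:
    """Convert a monthly amount to its weekly equivalent (assumed direction)."""
    return amount_per_month / WEEKS_PER_MONTH


if __name__ == "__main__":
    # Two decimals, matching the %12.2f display format in the do-file.
    print(f"{WEEKS_PER_MONTH:.6f}")
    print(f"{monthly_to_weekly(1000.0):.2f}")
```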
diff --git a/input_processing/data_construction/PL/00_master_data_set_construction_PL.do b/input_processing/data_construction/PL/00_master_data_set_construction_PL.do
@@ -9,10 +9,12 @@
 * DATA:         	    Longitudinal EU-SILC UDB version, 2005 - 2020 
 * AUTHORS: 				Clare Fenwick, Daria Popova, Ashley Burdett, 
 * 						Aleksandra Kolndrekaj
-* LAST UPDATE:          Jan 2026 AB
+* LAST UPDATE:          March 2026 AB
 * 
 ********************************************************************************
 * NOTES:
+*	ENSURE YOU HAVE ALREADY RUN THE 00_master_conditions_PL.do FILE.
+*
 *   Before running these files, the cumulative panel for each file type 
 * 	(D, H, R, P) must be constructed. These cumulative panels should be created 
 * 	following the procedure set out in *GESIS Papers 2022/10*. The do-files to 
@@ -115,7 +117,7 @@ global dir_ind "/Users/ashleyburdett/Library/CloudStorage/Box-Box"
 // Aleksandra - C:/Users/ak25793/Box
 
 * Working directory
-global dir_work "$dir_ind/CeMPA shared area/_SimPaths/_SimPathsEU/initial_populations/PL"
+global dir_work "$dir_ind/CeMPA shared area/_SimPaths/_SimPathsEU/input_processing/initial_populations/PL"
 
 * Directory containing do files
 global dir_do "$dir_work/do_files"
@@ -146,21 +148,21 @@ global dir_data_05_20 "$dir_data/orig_panel_2005_2020"
 * DEFINE PARAMETERS & PROCESS IF CONDITIONS
 *******************************************************************************/
 
-do "$dir_ind/CeMPA shared area/_SimPaths/_SimPathsEU/00_master_conditions.do"
+do "$dir_ind/CeMPA shared area/_SimPaths/_SimPathsEU/input_processing/00_master_conditions_PL.do"
 
 
 /*******************************************************************************
 * EXECUTE FILES
 *******************************************************************************/
 //do "$dir_do/01_prepare_pooled_data.do"
 
-do "$dir_do/02_create_variables_PL.do"
+do "$dir_do/02_create_variables_${country}.do"
 
-do "$dir_do/03_create_benefit_units_PL.do"
+do "$dir_do/03_create_benefit_units_${country}.do"
 
-do "$dir_do/04_reweight_PL.do"
+do "$dir_do/04_reweight_${country}.do"
 
-do "$dir_do/05_drop_hholds_and_slice_PL.do"
+do "$dir_do/05_drop_hholds_slice_and_refactoring_${country}.do"
 
-do "$dir_do/06_check_yearly_data_PL.do"
+do "$dir_do/06_check_yearly_data_${country}.do"
 
diff --git a/input_processing/data_construction/PL/01_prepare_pooled_data_PL.do b/input_processing/data_construction/PL/01_prepare_pooled_data_PL.do
@@ -1,5 +1,5 @@
 ********************************************************************************
-* PROJECT:              ESPON
+* PROJECT:              SimPaths EU
 * DO-FILE NAME:         01_prepare_pooled_data.do
 * DESCRIPTION:          Compiles panel dataset from EU-SILC  
 ********************************************************************************
@@ -23,10 +23,42 @@ merge these chunks of data into one cumulative dataset (separately for the
 D-,H-,R- and P-data).
 */
 /*
-Initial populations: cross-sectional SILC for 2011-2023 (income 2010-2022), 
-2023 (income 2022)
-Estimation sample: longitudinal SILC with observations from 2011-2023 
-(income 2010-2022)
+STRUCTURE OF THIS FILE
+
+  The script builds a person-level panel dataset for a single country by
+  sequentially merging the four EU-SILC master files produced by the panel
+  construction scripts (01-04 in eu_silc_do_2025/).
+
+  Files are merged in the following order, with R as the base:
+
+    R (Personal Register) — loaded first as the base. Contains all persons
+      in the sample including children under 16. Key identifiers: upid
+      (unique person ID across releases), uhid (unique household ID), year.
+
+    P (Personal Data) — merged 1:1 on year+upid+uhid. Contains income and
+      personal variables for adults aged 16 and above only. After this merge:
+      - Adults (in both R and P): have full R and P variables
+      - Children (in R only, not P): retained with R variables only
+      - Records in P but not R: dropped (should not occur in clean data)
+
+    D (Household Register) — merged 1:m on year+uhid. D is household-level
+      so one D row maps to multiple persons. keep if _merge==3 retains only
+      persons whose household appears in D. A small number of households may
+      not merge — this is suspected to be an edge case from the cross-release
+      deduplication in 01_create_masterD.do but has not been fully investigated.
+
+    H (Household Data) — merged 1:m on year+uhid, same logic as D.
+
+  KEY IDENTIFIERS
+    upid  — unique personal ID across releases (country + rotation group +
+             dropout year + pid). Not the same as the raw pid in the source data.
+    uhid  — unique household ID across releases (same construction logic).
+    year  — income reference year.
+
+  OUTPUT
+    ${country}-SILC_pooled_all_obs_01.dta — person-level panel for the target
+    country, containing all household members (adults and children) with
+    combined R, P, D, and H variables. Flag variables (*_f, *_i) are dropped.
 */
 
 ********************************************************************************
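The merge order described in the header comment (R as the base, then P, D and H) can be illustrated outside Stata. The following is a minimal pure-Python sketch, not the do-file's code. The keys `year`, `upid` and `uhid` come from the comment. The other column names (`age`, `income`, `region`, `hh_size`) and the helpers `index_by` and `build_panel` are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def index_by(rows: List[Row], keys: Tuple[str, ...]) -> Dict[tuple, Row]:
    """Index rows by a key tuple; the key must be unique on this side."""
    index: Dict[tuple, Row] = {}
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key in index:
            raise ValueError(f"duplicate key {key}")
        index[key] = row
    return index


def build_panel(r_rows: List[Row], p_rows: List[Row],
                d_rows: List[Row], h_rows: List[Row]) -> List[Row]:
    """Merge R, P, D and H into one person-level panel, with R as the base."""
    p_index = index_by(p_rows, ("year", "upid", "uhid"))
    d_index = index_by(d_rows, ("year", "uhid"))
    h_index = index_by(h_rows, ("year", "uhid"))
    panel: List[Row] = []
    for person in r_rows:
        person_key = (person["year"], person["upid"], person["uhid"])
        household_key = (person["year"], person["uhid"])
        # Keep only persons whose household appears in both D and H.
        if household_key not in d_index or household_key not in h_index:
            continue
        merged = dict(person)
        # Children are absent from P, so they keep their R variables only.
        merged.update(p_index.get(person_key, {}))
        merged.update(d_index[household_key])
        merged.update(h_index[household_key])
        panel.append(merged)
    # P records without a matching R record never enter the panel.
    return panel
```

Because R drives the loop, P records with no R counterpart are dropped and children are retained, which matches the behaviour listed in the comment.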