Authors: Divyam Madaan, Sumit Chopra, Kyunghyun Cho
TL;DR: The key to temporal generalization is not to design new algorithms, but to identify the reasonable assumptions about how the data generating process evolves over time.
Machine learning (ML) models often struggle to maintain performance under distribution shifts, leading to inaccurate predictions on unseen future data. In this work, we investigate whether and under what conditions models can generalize to future data when relying solely on past data. We explore two primary approaches: convex combinations of past model parameters (parameter interpolation) and explicit extrapolation beyond the convex hull of past parameters (parameter extrapolation). We benchmark several methods within these categories on a diverse set of temporal tasks, including language modeling, news summarization, news tag prediction, academic paper categorization, satellite image-based land use classification over time, and historical yearbook photo gender prediction. Our empirical findings show that none of the evaluated methods consistently outperforms the simple baseline of using the latest available model parameters in all scenarios. In the absence of access to future data or robust assumptions about the underlying data-generating process, these results underscore the inherent difficulties of generalizing and extrapolating to future data and warrant caution when evaluating claims of such generalization.
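To make the two families concrete, here is a minimal sketch of the distinction. It is illustrative only and not necessarily one of the benchmarked methods: the `Params` type, the function names, and the linear extrapolation rule are assumptions made for this example, with plain Python lists standing in for model tensors. Interpolation uses non-negative weights that sum to 1, so the result stays inside the convex hull of past checkpoints; extrapolation allows a negative weight, so the result can leave it.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a model state dict: parameter name -> flat values.
Params = Dict[str, List[float]]


def combine(checkpoints: Sequence[Params], weights: Sequence[float]) -> Params:
    """Return the weighted sum of checkpoints, parameter by parameter."""
    if not checkpoints or len(checkpoints) != len(weights):
        raise ValueError("need one weight per checkpoint")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    first = checkpoints[0]
    return {
        name: [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(len(first[name]))
        ]
        for name in first
    }


def interpolate_uniform(checkpoints: Sequence[Params]) -> Params:
    """Uniform average: all weights are non-negative (inside the convex hull)."""
    n = len(checkpoints)
    return combine(checkpoints, [1.0 / n] * n)


def extrapolate_linear(checkpoints: Sequence[Params], step: float = 1.0) -> Params:
    """Assumed rule: last + step * (last - previous).

    The weights (-step, 1 + step) still sum to 1, but one is negative,
    so the result lies outside the convex hull of the two checkpoints.
    """
    if len(checkpoints) < 2:
        raise ValueError("need at least two checkpoints to extrapolate")
    return combine(checkpoints[-2:], [-step, 1.0 + step])


# Toy checkpoints from three consecutive time steps.
history = [{"w": [0.0, 1.0]}, {"w": [1.0, 3.0]}, {"w": [2.0, 5.0]}]
averaged = interpolate_uniform(history)
extrapolated = extrapolate_linear(history)
latest = history[-1]  # the "latest available parameters" baseline
```

In this framing, the latest-parameters baseline is the special case of `combine` that puts weight 1 on the final checkpoint and 0 on all others.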
(Left) Performance degrades over time. The widening performance gap between a stale model trained once in January 2012 (red) and a monthly updated model (green) illustrates the decay in performance over time. The evaluation was conducted using data from March 2012 onward. (Right) Temporal generalization framework. Across sequential learning stages (
To set up the environment and install the necessary dependencies, run the following command:
uv venv --python 3.11 && uv pip install -r requirements.txt

The NewsRoom dataset is used for news summarization and language modeling.
- Download: Obtain the dataset from the official Newsroom website.
- Processing: After downloading, process the dataset using the create_monthly_dataset.py script.
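As a rough illustration of what a monthly split involves, the sketch below buckets records by calendar month. It is not the logic of create_monthly_dataset.py: the `group_by_month` helper, the `date` key, and the ISO-formatted date strings are all assumptions made for this example, and the real dataset fields and date format may differ.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List


def group_by_month(records: Iterable[dict], date_key: str = "date") -> Dict[str, List[dict]]:
    """Bucket records under 'YYYY-MM' keys, returned in chronological order."""
    buckets = defaultdict(list)
    for record in records:
        # Assumes ISO-formatted dates such as "2012-01-15" (hypothetical format).
        stamp = datetime.fromisoformat(record[date_key])
        buckets[stamp.strftime("%Y-%m")].append(record)
    # Zero-padded 'YYYY-MM' keys sort chronologically as plain strings.
    return dict(sorted(buckets.items()))


# Hypothetical toy records, not taken from the dataset.
articles = [
    {"date": "2012-02-03", "text": "b"},
    {"date": "2012-01-15", "text": "a"},
    {"date": "2012-01-20", "text": "c"},
]
monthly = group_by_month(articles)
```

Each bucket would then serve as one time step for monthly fine-tuning and evaluation.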
- Installation: Install the wildtime package using pip:
  pip install wildtime==1.1.3
- Dataset Access: The datasets within Wild-Time can be downloaded and prepared using scripts provided within the wildtime package.
- Training: From the project root, run time_vectors/experiment_scripts/finetune_month_models.sh (monthly fine-tuning on NewsRoom splits):
  cd time_vectors/experiment_scripts
  bash finetune_month_models.sh
- Evaluation: After training, run time_vectors/experiment_scripts/run_eval_month_summ.sh:
  cd time_vectors/experiment_scripts
  bash run_eval_month_summ.sh
Each dataset has a script in wilds_time/eval-stream/ that trains every method with multiple seeds and evaluates parameter aggregation strategies:
cd wilds_time
bash eval-stream/yearbook.sh
bash eval-stream/huffpost.sh
bash eval-stream/fmow.sh
bash eval-stream/arxiv.sh

We'd love to accept your contributions to this project. Please feel free to open an issue or submit a pull request as necessary. If you have implementations of this repository in other ML frameworks, please reach out so we may highlight them here.
The code is based on Time vectors and Wild-Time. We thank the authors for their amazing work and releasing the code base.
This codebase is released under MIT License.
If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:
@inproceedings{
madaan2026temporal,
title={Temporal Generalization: A Reality Check},
author={Divyam Madaan and Sumit Chopra and Kyunghyun Cho},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=Wz0ILlbh9U}
}
