Hello! Welcome to the repository for WaveMind!
Electroencephalography (EEG) interpretation using multimodal large language models (MLLMs) offers a novel approach to analyzing brain signals. However, the inherent complexity of brain activity, encompassing both cognitive functions representing subjective consciousness and non-cognitive processes associated with homeostasis, creates distinct supervisory modalities during training. This divergence hinders the generalization of existing EEG-MLLMs across tasks and impedes fluent natural language interaction. To address these limitations, we introduce WaveMind, the first LLM framework specifically designed to interpret EEG data by projecting diverse neural signals into a shared semantic space. We synthesize the WaveMind-Instruct dataset, comprising 362k instructions, with GPT assistance. WaveMind achieves remarkable performance on four downstream classification tasks and supports fluent, open-ended dialogue about brain activity. Ablation studies reveal significant synergies both between supervision modalities and across tasks, demonstrating the importance of comprehensive modeling of brain signals for developing general-purpose EEG interpretation systems.
We have open-sourced our models, data, and code here.
- Environment Configuration
pip install uv
uv sync
source .venv/bin/activate
Note: The project root path is automatically detected. If needed, set export WaveMind_ROOT_PATH_=/path/to/WaveMind for backward compatibility.
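The root-path fallback mentioned above can be resolved roughly as follows. This is a minimal sketch only; `find_project_root` is a hypothetical helper, and the project's actual detection logic may differ:

```python
import os
from pathlib import Path

def find_project_root(start=None):
    """Resolve the WaveMind root: prefer the WaveMind_ROOT_PATH_ env var,
    otherwise walk upward until a directory containing pyproject.toml is found."""
    env = os.environ.get("WaveMind_ROOT_PATH_")
    if env:
        return Path(env)
    here = (start or Path.cwd()).resolve()
    for parent in (here, *here.parents):
        if (parent / "pyproject.toml").exists():
            return parent
    return here  # fall back to the starting directory

root = find_project_root()
```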
Unfortunately, due to privacy and licensing restrictions, we cannot publicly release the preprocessed dataset. However, we provide the preprocessing pipeline and code so you can preprocess the data yourself.
- Raw Data Access
For some datasets that are difficult to download, we provide convenient download scripts.
| Data | Description | Download Script | Folder Name | Link |
|---|---|---|---|---|
| TUAB | A corpus of EEGs annotated as normal or abnormal. | Link | edf | Link |
| TUEV | A subset of TUEG containing annotations of EEG segments as one of six classes. | Link | edf | Link |
| ImageNet-EEG | EEG data from 6 subjects viewing images from 40 ImageNet categories. | Link | Refer to File Tree | Link |
| THING-EEG | EEG data from 10 subjects viewing the corresponding images. | N/A | Data | Link |
| SEED | EEG data with corresponding emotional state labels. | N/A | Preprocessed_EEG | Link |
- Data Preprocessing
Please refer to here for details.
We provide a unified Python script to process all EEG datasets:
# Process all datasets + generate RAG ground truth
python data/preprocess_wavemind.py --all --seed 42
# Process a specific dataset
python data/preprocess_wavemind.py --dataset SEED --seed 42
python data/preprocess_wavemind.py --dataset TUAB --seed 42
python data/preprocess_wavemind.py --dataset TUEV --seed 42
python data/preprocess_wavemind.py --dataset ImageNetEEG --seed 42
python data/preprocess_wavemind.py --dataset THING-EEG --seed 42
# Generate RAG ground truth NPY files only
python data/preprocess_wavemind.py --rag-only
# Available datasets: SEED, TUAB, TUEV, ImageNetEEG, THING-EEG, all
Each dataset can also be processed independently via its own process.py (e.g., data/SEED/process.py).
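The unified entry point's command-line interface can be sketched with stdlib `argparse`, mirroring only the flags shown above (`--all`, `--dataset`, `--seed`, `--rag-only`); the real `preprocess_wavemind.py` may define additional options:

```python
import argparse

DATASETS = ["SEED", "TUAB", "TUEV", "ImageNetEEG", "THING-EEG"]

def build_parser():
    # Sketch of the CLI shown in the README; the actual script may differ.
    p = argparse.ArgumentParser(description="Unified WaveMind preprocessing (sketch)")
    p.add_argument("--dataset", choices=DATASETS, help="process a single dataset")
    p.add_argument("--all", action="store_true", help="process every dataset")
    p.add_argument("--rag-only", action="store_true",
                   help="only generate RAG ground-truth NPY files")
    p.add_argument("--seed", type=int, default=42, help="random seed for splits")
    return p

args = build_parser().parse_args(["--dataset", "SEED", "--seed", "42"])
```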
Stage 1: Dual-Representation Alignment
IMPORTANT: The --config-name parameter is now mandatory.
Available config presets (in EEG_Encoder/examples/):
- train_atms.yaml: Quick-start for ATMSmodify training (recommended)
- eval_sd.yaml: Subject-Dependent evaluation
- eval_si.yaml: Subject-Independent evaluation
- advanced_shm.yaml: Advanced shared-memory configuration
Basic training with ATMSmodify:
python EEG_Encoder/run_CLIPtraining.py --config-name=train_atms
Training with custom overrides:
python EEG_Encoder/run_CLIPtraining.py --config-name=base \
experiment.models=[ATMSmodify] \
experiment.gpu_number=[0] \
training.DEFAULT_EPOCHS=30 \
    experiment.datasets=[ImageNetEEG]
Evaluation (Subject Dependent):
python EEG_Encoder/run_CLIPtraining.py --config-name=eval_sd \
    advanced.model_checkpoint_name=/path/to/checkpoint.pth
Evaluation (Subject Independent):
python EEG_Encoder/run_CLIPtraining.py --config-name=eval_si \
    advanced.model_checkpoint_name=/path/to/checkpoint.pth
Available Models: MLP, ATMS, ShallowFBCSPNet, channelNet, NICE, ATMSmodify (primary), EEGITNet, CBraMod, NeuroLM-B, NeuroLM-L
Encoder Checkpoint: The trained ATMSmodify checkpoint can be found in EEG_Encoder/Resource/Checkpoint/ALL/
Full configuration options: See EEG_Encoder/examples/base.yaml for all available parameters.
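The `--config-name` plus `key=value` syntax above appears to follow a Hydra/OmegaConf-style convention, where dotted paths select nested config entries. A stdlib-only sketch of how such overrides map onto a nested config dict; `apply_overrides` is an illustration, not the project's actual parser:

```python
import ast

def _parse_value(raw):
    """Parse an override value: Python literals when possible ('30' -> 30,
    '[0]' -> [0]); bracketed bare words ('[ATMSmodify]') become string lists."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        if raw.startswith("[") and raw.endswith("]"):
            return [s.strip() for s in raw[1:-1].split(",") if s.strip()]
        return raw  # plain string, e.g. a config name

def apply_overrides(config, overrides):
    """Apply dotted 'a.b.c=value' overrides onto a nested dict, in place."""
    for item in overrides:
        path, _, raw = item.partition("=")
        node = config
        *parents, leaf = path.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = _parse_value(raw)
    return config

cfg = apply_overrides({}, [
    "experiment.models=[ATMSmodify]",
    "experiment.gpu_number=[0]",
    "training.DEFAULT_EPOCHS=30",
])
```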
Stage 2: Cold Start Training
We use LLaVA-Pretrain to teach the EEG-MLLM to recognize the CLIP space before EEG instruction tuning.
- Download LLaVA-Pretrain, extract it, and place it under EEGLLM/LLaVA/playground/data/LLaVA-Pretrain.
- Update the corresponding options in the script below, then start training.
bash ./EEGLLM/examples/stage2_pretrain/pretrain.sh
Stage 3: EEG Instruction Tuning
Update the corresponding options in the script, then start training.
bash ./EEGLLM/examples/stage3_finetune/finetune_lora_eeg.sh
Run:
bash ./Data_Engineering/Script/Test_data/construct_WaveMind.sh
Run the evaluation script to evaluate WaveMind on WaveMind-Bench.
CUDA_VISIBLE_DEVICES=0 python ./EEGLLM/Evaluation/Evaluation_Classification.py --model_path /path/to/model
Please refer to the script for more configuration details.
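Classification on WaveMind-Bench uses MCQ-style items, so scoring reduces to matching predicted option letters against gold answers. A minimal sketch of that reduction; the record fields `pred` and `answer` are hypothetical, and the actual evaluation script defines its own format:

```python
def mcq_accuracy(records):
    """Fraction of records whose predicted option letter matches the gold answer.
    Comparison is case-insensitive and ignores surrounding whitespace."""
    if not records:
        return 0.0
    correct = sum(
        1 for r in records
        if r["pred"].strip().upper() == r["answer"].strip().upper()
    )
    return correct / len(records)

records = [
    {"pred": "A", "answer": "A"},
    {"pred": "b", "answer": "B"},
    {"pred": "C", "answer": "D"},
]
```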
/path/to/WaveMind
├── data
│   ├── ImageNetEEG
│   │   ├── eeg_signals_raw_with_mean_std.pth -> raw file (download required)
│   │   ├── Image -> raw file (download required)
│   │   └── process.py -> independent dataset processor
│   ├── preprocess_wavemind.py -> unified preprocessing entry point
│   ├── SEED
│   │   ├── Preprocessed_EEG -> raw file (download required)
│   │   └── process.py -> independent dataset processor
│   ├── THING-EEG
│   │   ├── Data -> raw file (download required)
│   │   ├── data_config.json
│   │   ├── download.py
│   │   └── process.py -> independent dataset processor
│   ├── Total
│   │   ├── CLIP_groundTruth -> generated file (RAG features)
│   │   ├── data_label.h5 -> generated file
│   │   ├── dataset_weights.pth -> auto-generated during training
│   │   └── ....
│   ├── TUAB
│   │   ├── download.exp -> download script
│   │   ├── edf -> raw file (download required)
│   │   ├── save -> cache dir
│   │   └── process.py -> independent dataset processor
│   ├── TUEV
│   │   ├── download.exp -> download script
│   │   ├── edf -> raw file (download required)
│   │   ├── eegs.npz -> cache file
│   │   └── process.py -> independent dataset processor
│   ├── Utils.py -> shared utilities (filtering, channel mapping, HDF5 I/O)
│   └── README.md -> data preprocessing details
├── EEG_Encoder
│   ├── Resource
│   │   ├── Checkpoint -> EEG Encoder checkpoints
│   │   └── ....
│   ├── run_CLIPtraining.py -> script to train the EEG Encoder
│   └── ....
├── EEGLLM
│   ├── Evaluation -> evaluation on WaveMind_Bench
│   └── ....
├── Data_Engineering
│   ├── data
│   │   ├── EEG_data -> WaveMind_Bench EEG data location
│   │   ├── Test_data -> WaveMind_Bench MCQ data location
│   │   └── ...
│   ├── Script
│   └── Test_data -> scripts for WaveMind_Bench data generation
└── ....
- Thanks to L. Dong et al. for their contribution to EEG feature alignment (cognitive activity alignment); we refer to their work: EEG_Image_decode
- Thanks to torcheeg and PyHealth for providing preprocessing tools.
- This README Template comes from HuatuoGPT-o1.
- Thanks to my co-authors for their effort.
This project is released under the Apache License 2.0. We welcome the use and extension of our work by the community.