Add CappedHyperSphereNorm variation and LM Head norm option #803

Open
klei22 wants to merge 1 commit into ReaLLMASIC:master from klei22:codex/add-optional-norm-for-lm-head-vectors-fpwgak

Conversation

klei22 (Collaborator) commented May 3, 2026

This pull request introduces support for applying optional normalization variants to the lm_head (language modeling head), allowing experimentation with different normalization strategies. It adds configuration, argument parsing, and implementations for several normalization types, including a new CappedHyperSphereNorm. The changes also add an experiment YAML to enable systematic comparison of these variants.

Support for lm_head normalization:

  • Added new configuration options in GPTConfig for specifying normalization variants and parameters for the lm_head, including type, radius, scale, gain, and radius learning (see the sketch after this list).
  • Updated argument parsing in train_args.py to accept the new lm_head normalization options from the command line.
  • Modified model initialization and the forward pass in model.py to build, apply, and use the specified lm_head normalization in all relevant code paths.
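
For orientation, a minimal sketch of what these GPTConfig additions might look like. Only norm_variant_lm_head appears verbatim on this page; the remaining field names below are illustrative placeholders, not the PR's actual names:

from dataclasses import dataclass
from typing import Optional

@dataclass
class GPTConfig:
    n_embd: int = 768
    # ... existing fields elided ...

    # Optional lm_head normalization; None leaves the head untouched.
    norm_variant_lm_head: Optional[str] = None      # e.g. "cappedhyperspherenorm"
    # Placeholder names for the radius / scale / gain / radius-learning knobs:
    lm_head_norm_radius: Optional[float] = None     # e.g. default to sqrt(n_embd) when unset
    lm_head_norm_scale: float = 1.0
    lm_head_norm_gain: float = 1.0
    lm_head_norm_radius_learning: bool = False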

New normalization variant:

  • Implemented CappedHyperSphereNorm, a normalization layer that projects vectors back onto a hypersphere only when they exceed its radius, and registered it in the normalization dictionary.
  • Added cappedhyperspherenorm to the list of valid normalization choices in argument parsing.

Experimentation and configuration:

  • Added a new YAML experiment config (default_inf_lm_head_norm_comparison.yaml) to systematically compare different lm_head normalization strategies, including the new variant, across multiple head dimensions and other settings.

Copilot AI left a comment

Pull request overview

Adds an optional normalization step to the model's lm_head weights/logit computation so that different normalization strategies can be compared, including a new CappedHyperSphereNorm. Also provides an exploration YAML to compare variants under default_inf-style settings.

Changes:

  • Added norm_variant_lm_head and associated radius/scale/gain/radius_learning config + CLI args.
  • Implemented and registered CappedHyperSphereNorm in the norm variations registry.
  • Updated model forward paths to route lm_head logits computation through a normalization-aware helper; added an exploration config for systematic comparisons.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Summary per file:

  • variations/norm_variations.py: Adds CappedHyperSphereNorm and registers it in norm_dictionary.
  • train_args.py: Adds CLI args for selecting and configuring lm_head norm variants and parameters.
  • model.py: Builds the optional lm_head_norm module and applies it during logits computation across forward paths.
  • gpt_conf.py: Adds GPTConfig fields for lm_head normalization selection and parameters.
  • explorations/default_inf_lm_head_norm_comparison.yaml: New experiment config to sweep lm_head norm variants and head dims.


Comment thread: model.py

if self.config.norm_variant_abs is not None:
    self.transformer['post_abs_norm'] = self.build_norm_from_variant(config, "norm_variant_abs", "norm_abs")
if self.config.norm_variant_lm_head is not None:
    self.transformer['lm_head_norm'] = self.build_norm_from_variant(config, "norm_variant_lm_head", "norm_lm_head")
Comment thread: model.py
Comment on lines +255 to +258

def compute_lm_head_logits(self, x, lm_head_module):
    # Run the lm_head weight through the optional norm before the logit projection.
    weight = self.apply_lm_head_norm(lm_head_module.weight)
    return F.linear(x, weight, lm_head_module.bias)
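
The apply_lm_head_norm helper itself is not shown in this thread. A minimal sketch of what it plausibly does, assuming it simply routes the weight through the optional lm_head_norm module built during init (the helper name is taken from the diff; the fallback behavior is inferred, not confirmed):

def apply_lm_head_norm(self, weight):
    # Hypothetical reconstruction: pass the lm_head weight rows through the
    # configured norm module, or return them unchanged when no
    # norm_variant_lm_head was set.
    if 'lm_head_norm' in self.transformer:
        return self.transformer['lm_head_norm'](weight)
    return weight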

Comment on lines +207 to +215

class CappedHyperSphereNorm(nn.Module):
    """Project vectors onto a sqrt(n_embd) hypersphere only when outside the radius."""

    def __init__(self, config):
        super().__init__()
        self.radius = math.sqrt(config.n_embd)

    def forward(self, x):
        norms = x.norm(2, dim=-1, keepdim=True)
        # Rescale only vectors whose norm exceeds the radius; leave the rest unchanged.
        scale = torch.where(norms > self.radius, self.radius / (norms + 1e-8), torch.ones_like(norms))
        return x * scale
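
As a quick illustration of the capping behavior, a standalone usage sketch (the Cfg stub, tensor shapes, and scaling factors are invented for the demo; it reuses CappedHyperSphereNorm exactly as defined above):

import math
import torch
import torch.nn as nn

class Cfg:  # hypothetical stub; the real GPTConfig carries many more fields
    n_embd = 64

norm = CappedHyperSphereNorm(Cfg())
x = torch.randn(2, 3, Cfg.n_embd)

big = norm(100.0 * x)    # norms of roughly 800 get capped back to sqrt(64) = 8
small = norm(0.01 * x)   # norms of roughly 0.08 stay inside the sphere

print(big.norm(dim=-1).max())        # ~8.0
print(torch.equal(small, 0.01 * x))  # True: vectors inside the radius pass through unscaled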