Replace dead wikitext s3 link in quicktour with HF dataset mirror #2064

Open
adityasingh2400 wants to merge 1 commit into huggingface:main from adityasingh2400:doc-quicktour-wikitext-hf-dataset-1625
@adityasingh2400
Summary

The wget URL in docs/source-doc-builder/quicktour.mdx points at
s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip,
which has been dead since the research.metamind.io takedown. It now
301-redirects to a host that no longer responds (or returns 403 directly
on the S3 path, as reported in #1683). Anyone following the quicktour
hits the dead link on the very first command.

PR #1846 fixed the related Salesforce blog link in the surrounding prose
but not the actual download command, so both #1625 and #1683 are still
open.

This swaps the URL to the canonical wikitext-103-raw-v1.zip mirrored
on the Hub at huggingface.co/datasets/mattdangerw/wikitext-103-raw.
The archive contents and layout are identical, so the existing
unzip wikitext-103-raw-v1.zip step and all downstream paths
(data/wikitext-103-raw/wiki.{train,test,valid}.raw in
test_quicktour.py, test_pipeline.py, and documentation.rs) keep
working without further changes.
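
As a quick sanity check of the layout claim, a minimal sketch along these lines could confirm that the three raw files are present after unzipping. The helper name `check_wikitext_layout` and its directory argument are hypothetical and not part of the quicktour or the test suite.

```shell
# Hypothetical helper: verify that an extracted wikitext directory holds
# the three files the quicktour and tests expect.
# Usage: check_wikitext_layout <directory>
check_wikitext_layout() {
    local dir="$1"
    local split
    for split in train test valid; do
        if [ ! -f "$dir/wiki.$split.raw" ]; then
            # Report the first missing file and fail.
            echo "missing: $dir/wiki.$split.raw" >&2
            return 1
        fi
    done
    echo "layout ok: $dir"
}

# Example, after the quicktour's wget and unzip steps have been run:
#   check_wikitext_layout data/wikitext-103-raw
```

The function only checks for file presence; it does not compare contents against the original S3 archive.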

Verification

$ curl -sI https://huggingface.co/datasets/mattdangerw/wikitext-103-raw/resolve/main/wikitext-103-raw-v1.zip | head -5
HTTP/2 302
content-type: text/plain; charset=utf-8
content-length: 1403
location: https://cas-bridge.xethub.hf.co/...wikitext-103-raw-v1.zip...

The Hub redirects to its CDN and serves the full 192 MB zip with the
expected filename.
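
The header check above stops at the first 302. To see the status of the final response, one option is to follow redirects with `curl -sIL` and read the last status line. A minimal sketch, where the helper name `last_http_status` is hypothetical:

```shell
# Hypothetical helper: read HTTP response headers on stdin (for example
# the output of `curl -sIL <url>`) and print the status code of the last
# response in the redirect chain.
last_http_status() {
    # Strip carriage returns, then remember the code from each status line.
    tr -d '\r' | awk '/^HTTP\// { code = $2 } END { print code }'
}

# Example (needs network access):
#   curl -sIL https://huggingface.co/datasets/mattdangerw/wikitext-103-raw/resolve/main/wikitext-103-raw-v1.zip | last_http_status
```

Some CDNs treat HEAD requests differently from GET, so a full download remains the more reliable check.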

Fixes #1625
Fixes #1683

The wikitext-103-raw-v1.zip URL in docs/source-doc-builder/quicktour.mdx
has 301-redirected to a dead host since the research.metamind.io takedown;
PR huggingface#1846 fixed the related blog link but not the actual download flow.
Switch the example to pull the dataset from Salesforce/wikitext on the
Hub so the quicktour runs end-to-end again.

Fixes huggingface#1625
Fixes huggingface#1683

Development

Successfully merging this pull request may close these issues.

wikitext-103-raw-v1.zip is not available on the amazonaws anymore
Tokenizer Quickstart Tutorial: Broken Links