Fast CSV processing for data scientists who have real work to do.
✅ Working Features:
- Auto-detect encoding (UTF-8, Latin-1, Windows-1252)
- Auto-detect delimiter (comma, tab, semicolon, pipe)
- Read compressed files (.gz)
- Handle files with BOM, mixed quotes, ragged lines
- One-command data cleaning with `quick_clean()`
- Export to Parquet or CSV
- 2-5x faster than pandas for files over 100K rows
- Memory-safe preview with `peek()`
- File analysis with `info()`
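Auto-detection of this kind can be approximated with the standard library; a minimal sketch of delimiter sniffing using `csv.Sniffer` (illustrative only — not RapidCSV's actual implementation):

```python
import csv

def sniff_delimiter(sample: str) -> str:
    """Guess the delimiter from a small text sample."""
    # Restrict candidates to the delimiters RapidCSV advertises
    dialect = csv.Sniffer().sniff(sample, delimiters=",\t;|")
    return dialect.delimiter

print(sniff_delimiter("a;b;c\n1;2;3\n"))  # ;
```

In practice the sample would be the first few kilobytes of the file, read once before the real parse.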
❌ Known Limitations:
- No full streaming for files larger than memory (only `peek()`)
- Limited to Polars-supported data types
- No Excel support
- No remote file support (S3, HTTP)
- Estimated 30% of planned features implemented
```bash
# From source (PyPI package coming soon)
git clone https://github.com/yourusername/rapidcsv
cd rapidcsv
pip install -e .
```

```python
import rapidcsv as rc

# Just works - auto-detects encoding and delimiter
df = rc.read("messy_data.csv")

# Clean common issues with one command
df_clean = df.quick_clean()

# Save for next time
df_clean.to_parquet("clean_data.parquet")
```

```python
# Auto-detect everything
df = rc.read("data.csv")

# Or specify explicitly
df = rc.read(
    "data.csv",
    encoding="latin-1",
    separator="\t",
    error_bad_lines="skip",
)

# Preview large files
preview = rc.peek("huge_file.csv", rows=1000)

# Get file info without loading
info = rc.info("huge_file.csv")
print(f"Size: {info['size_mb']:.1f} MB")
print(f"Encoding: {info['detected_encoding']}")
print(f"Delimiter: {info['detected_delimiter']}")
```

`quick_clean()` performs these operations:
- Strip whitespace from all strings
- Standardize null values (NA, null, None, -, "" → None)
- Drop completely empty rows
- Remove duplicate rows
- Clean column names (spaces → underscores, lowercase)
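Two of these steps are easy to picture in plain Python; a rough sketch of null standardization and column-name cleanup (RapidCSV applies these on Polars columns, so its internals differ):

```python
import re

NULL_TOKENS = {"NA", "null", "None", "-", ""}

def standardize_cell(value: str):
    """Strip whitespace, then map common null tokens to None."""
    stripped = value.strip()
    return None if stripped in NULL_TOKENS else stripped

def fix_column_name(name: str) -> str:
    """Lowercase and replace runs of whitespace with underscores."""
    return re.sub(r"\s+", "_", name.strip()).lower()

print(standardize_cell("  NA "))       # None
print(fix_column_name("Order  Date"))  # order_date
```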
```python
# All cleaning operations
df_clean = df.quick_clean()

# Or selectively
df_clean = df.quick_clean(
    strip_whitespace=True,
    standardize_nulls=True,
    drop_empty_rows=False,
    drop_duplicate_rows=True,
    fix_column_names=True,
    report=True,  # Print what was cleaned
)
```

Benchmarked on a MacBook Pro M1 with messy CSV files:
| File Size | Rows | Pandas | RapidCSV | Speedup |
|---|---|---|---|---|
| 3 MB | 10K | 0.04s | 0.12s | 0.3x* |
| 30 MB | 100K | 0.31s | 0.11s | 2.9x |
| 148 MB | 500K | 1.68s | 0.33s | 5.1x |
*Small files have overhead from auto-detection
Times include both loading and cleaning operations.
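Timing your own files follows the usual `perf_counter` pattern; a generic harness sketch (the `time_call` helper is hypothetical, not part of RapidCSV):

```python
import time

def time_call(fn, *args, repeats=3, **kwargs):
    """Return the best-of-N wall-clock time for fn(*args, **kwargs)."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args, **kwargs)
        best = min(best, time.perf_counter() - t0)
    return best

# e.g. compare time_call(rc.read, "data.csv") against time_call(pd.read_csv, "data.csv")
```

Best-of-N is used rather than the mean so that one-off cache misses or OS jitter don't skew the comparison.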
RapidCSV provides helpful error messages:
```python
# Instead of: "UnicodeDecodeError: 'utf-8' codec can't decode byte..."
# You get: "Failed to decode file 'data.csv' with detected encoding 'utf-8' (confidence: 73.2%).
#          Try specifying encoding explicitly: rc.read('data.csv', encoding='latin-1')"
```

See the `examples/` directory:
- `basic_usage.py` - Complete workflow example
- `performance_comparison.py` - Benchmark vs pandas
This is alpha software (v0.1.0). What works:
- ✅ Core CSV reading with auto-detection
- ✅ Basic data cleaning operations
- ✅ Parquet/CSV export
- ✅ Compressed file support
- ✅ Basic error handling
What's missing:
- ❌ Full streaming for huge files
- ❌ Advanced cleaning options
- ❌ Remote file support
- ❌ Data profiling reports
- ❌ CLI tool
This project is under active development. Issues and PRs welcome!
MIT