Burton-David/rapidcsv

RapidCSV

Fast CSV processing for data scientists who have real work to do.

Features

✅ Working Features:

  • Auto-detect encoding (UTF-8, Latin-1, Windows-1252)
  • Auto-detect delimiter (comma, tab, semicolon, pipe)
  • Read compressed files (.gz)
  • Handle files with BOM, mixed quotes, ragged lines
  • One-command data cleaning with quick_clean()
  • Export to Parquet or CSV
  • 2-5x faster than pandas for files over 100K rows
  • Memory-safe preview with peek()
  • File analysis with info()

⚠️ Current Limitations:

  • No full streaming for files larger than memory (only peek)
  • Limited to Polars-supported data types
  • No Excel support
  • No remote file support (S3, HTTP)
  • Estimated 30% of planned features implemented
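Delimiter auto-detection, the first feature above, can be approximated with the standard library's `csv.Sniffer`. The sketch below illustrates the idea only; it is not rapidcsv's internal code, and `sniff_delimiter` is a name invented for this example:

```python
import csv

def sniff_delimiter(sample: str) -> str:
    """Guess the delimiter from a text sample by restricting the
    sniffer to the same candidates rapidcsv advertises:
    comma, tab, semicolon, pipe."""
    dialect = csv.Sniffer().sniff(sample, delimiters=",\t;|")
    return dialect.delimiter

print(sniff_delimiter("a;b;c\n1;2;3\n"))  # ;
```

In practice a library would sniff only the first few kilobytes of the file, so detection cost stays constant regardless of file size.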

Installation

# From source (PyPI package coming soon)
git clone https://github.com/Burton-David/rapidcsv
cd rapidcsv
pip install -e .

Quick Start

import rapidcsv as rc

# Just works - auto-detects encoding and delimiter
df = rc.read("messy_data.csv")

# Clean common issues with one command
df_clean = df.quick_clean()

# Save for next time
df_clean.to_parquet("clean_data.parquet")

Core Functions

Reading Files

# Auto-detect everything
df = rc.read("data.csv")

# Or specify explicitly
df = rc.read("data.csv", 
             encoding="latin-1",
             separator="\t",
             error_bad_lines="skip")

# Preview large files
preview = rc.peek("huge_file.csv", rows=1000)

# Get file info without loading
info = rc.info("huge_file.csv")
print(f"Size: {info['size_mb']:.1f} MB")
print(f"Encoding: {info['detected_encoding']}")
print(f"Delimiter: {info['detected_delimiter']}")

Data Cleaning

quick_clean() performs these operations:

  1. Strip whitespace from all strings
  2. Standardize null values (NA, null, None, -, "" → None)
  3. Drop completely empty rows
  4. Remove duplicate rows
  5. Clean column names (spaces → underscores, lowercase)

# All cleaning operations
df_clean = df.quick_clean()

# Or selectively
df_clean = df.quick_clean(
    strip_whitespace=True,
    standardize_nulls=True,
    drop_empty_rows=False,
    drop_duplicate_rows=True,
    fix_column_names=True,
    report=True  # Print what was cleaned
)
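The five steps above can be sketched in pure Python over a list of row dicts. This is a minimal illustration of the documented behaviour, not rapidcsv's implementation (which operates on Polars DataFrames); `quick_clean_rows` and `NULL_TOKENS` are names invented for this example:

```python
NULL_TOKENS = {"NA", "null", "None", "-", ""}

def quick_clean_rows(rows):
    """Apply the five documented quick_clean steps to a list of dicts."""
    cleaned, seen = [], set()
    for row in rows:
        new = {}
        for col, val in row.items():
            key = col.strip().replace(" ", "_").lower()        # step 5
            if isinstance(val, str):
                val = val.strip()                              # step 1
            new[key] = None if val in NULL_TOKENS else val     # step 2
        if all(v is None for v in new.values()):               # step 3
            continue
        fingerprint = tuple(sorted(new.items()))               # step 4
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(new)
    return cleaned

rows = [
    {" First Name ": "  Ada "},   # whitespace in key and value
    {" First Name ": "NA"},       # becomes an all-null row, dropped
    {" First Name ": "Ada"},      # duplicate of the first row, dropped
]
print(quick_clean_rows(rows))  # [{'first_name': 'Ada'}]
```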

Performance

Benchmarked on MacBook Pro M1 with messy CSV files:

File Size   Rows   Pandas   RapidCSV   Speedup
3 MB        10K    0.04s    0.12s      0.3x*
30 MB       100K   0.31s    0.11s      2.9x
148 MB      500K   1.68s    0.33s      5.1x

*Small files have overhead from auto-detection

Times include both loading and cleaning operations.
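Wall-clock numbers like these are usually collected as a best-of-N over repeated runs. A minimal timing helper along those lines (illustrative only; the project's actual benchmark lives in examples/performance_comparison.py):

```python
import time

def time_call(fn, *args, repeats=3, **kwargs):
    """Return the best wall-clock time over `repeats` runs of fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best
```

Taking the minimum rather than the mean filters out one-off interference from other processes and cold caches.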

Error Handling

RapidCSV provides helpful error messages:

# Instead of: "UnicodeDecodeError: 'utf-8' codec can't decode byte..."
# You get: "Failed to decode file 'data.csv' with detected encoding 'utf-8' (confidence: 73.2%). 
#           Try specifying encoding explicitly: rc.read('data.csv', encoding='latin-1')"
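Translating a raw UnicodeDecodeError into a message like the one above boils down to catching the exception and re-raising with context. A sketch of that pattern, assuming nothing about rapidcsv's internals (`read_with_friendly_errors` is a name invented here):

```python
def read_with_friendly_errors(path, encoding="utf-8"):
    """Read a text file, converting a low-level decode failure into
    an actionable error message in the style shown above."""
    try:
        with open(path, encoding=encoding) as f:
            return f.read()
    except UnicodeDecodeError as exc:
        raise ValueError(
            f"Failed to decode file '{path}' with encoding '{encoding}'. "
            f"Try specifying encoding explicitly: "
            f"rc.read('{path}', encoding='latin-1')"
        ) from exc
```

Chaining with `from exc` keeps the original traceback available for debugging while the user-facing message stays readable.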

Examples

See the examples/ directory:

  • basic_usage.py - Complete workflow example
  • performance_comparison.py - Benchmark vs pandas

Development Status

This is alpha software (v0.1.0). What works:

  • ✅ Core CSV reading with auto-detection
  • ✅ Basic data cleaning operations
  • ✅ Parquet/CSV export
  • ✅ Compressed file support
  • ✅ Basic error handling

What's missing:

  • ❌ Full streaming for huge files
  • ❌ Advanced cleaning options
  • ❌ Remote file support
  • ❌ Data profiling reports
  • ❌ CLI tool

Contributing

This project is under active development. Issues and PRs welcome!

License

MIT
