Skip to content

Benchmark Optimization Session 2026 03 05

Wei Lin edited this page Mar 5, 2026 · 1 revision

Benchmark Optimization Session — 2026-03-05

Summary

Iterative optimization of MiniPdf's Excel-to-PDF converter, improving benchmark scores from 95.5% avg / 64 passing (≥0.99) to 97.1% avg / 100 passing across 150 test cases. The session expanded the benchmark from 120 to 150 test cases (adding classic121–classic150 for style/border/fill scenarios) and applied 15+ code changes across ExcelReader.cs, ExcelToPdfConverter.cs, PdfTextBlock.cs, PdfPage.cs, and PdfWriter.cs.


Before / After

Metric Before After Delta
Average score 95.5% 97.1% +1.6%
Cases ≥ 0.99 64 / 120 100 / 150 +36
Cases < 0.80 5 3 −2
Test cases 120 150 +30

Score Distribution (150 cases)

Bucket Count
≥ 0.99 (Excellent) 100
0.95 – 0.98 22
0.90 – 0.94 16
0.80 – 0.89 9
< 0.80 3

Changes Made

1. Font Style Support (ExcelReader.cs)

Added FontStyleInfo record and ReadFontStyles() method (replacing ReadFontColors()) to parse <sz>, <b>, <i> elements from styles.xml. Extended ExcelCell with FontSize, Bold, Italic fields.

2. Border Support (ExcelReader.cs + ExcelToPdfConverter.cs)

  • Added BorderSide and CellBorderInfo records
  • ReadBorders() + ReadBorderSide() parse <borders> from styles.xml
  • Renderer draws AddLine() for each border side with style-dependent width:
    • thin → 0.5pt, medium → 1.0pt, thick → 1.5pt

3. Pattern Fill Support (ExcelReader.cs)

Enhanced ReadFillColors() to handle non-solid patterns (darkGray, lightGray, mediumGray, etc.) by blending foreground/background colors with pattern-specific tint ratios.

4. Vertical Alignment (ExcelReader.cs + ExcelToPdfConverter.cs)

  • Added VerticalAlignment field to ExcelCell
  • ReadCellXfStyles() now returns a 6-tuple including vertical alignment list
  • Renderer positions text at top/center/bottom of the row height using descent-based baseline calculation:
    var descent = options.FontSize * 0.31f;
    if (verticalAlignment == "top")
        cellY = currentY - cellFontSize;
    else if (verticalAlignment == "center")
        cellY = currentY - (rowHeight - textBlock) / 2f - cellFontSize + descent;
    else // "bottom" (default)
        cellY = currentY - rowHeight + descent + lineHeight * (lines.Length - 1);

5. Column Width Fix (Major Impact)

Root cause: CalculateNaturalColumnWidths() auto-sized columns to content width for multi-column sheets without explicit widths. LibreOffice uses a fixed default of 8.43 character units (47.4pt).

Fix: Use default column width when no explicit widths are present. This single change improved scores from 95.7% → 96.7% and added ~16 cases to ≥0.99.

6. Calibri Fitting Scale (ExcelToPdfConverter.cs)

Tuned FittingChars() scale factor from 0.93 → 0.86 to approximate Calibri character metrics (which LibreOffice uses) on Helvetica glyph widths. This allows more characters per column, matching LibreOffice's character count more closely.

Also applied CalibriFittingScale to MeasureHelveticaWidth() used by FitNumericText(), preventing over-aggressive number reformatting.

private const double CalibriFittingScale = 0.86;

7. Number Formatting Fixes (ExcelReader.cs)

  • Negative currency sign placement: $-180,000-$180,000 (minus before prefix for single-section formats)
  • Parenthesized negatives: (#,##0) format no longer adds an extra minus sign
  • Date format handling: FormatExcelDate() accepts format code with full ConvertExcelDateFormat() method supporting yyyy/MM/dd/HH/mm/ss with month-vs-minute context detection
  • FormatGeneral: Lowered integer threshold from 1e15 → 1e10 for scientific notation; added near-integer rounding (e.g., 9999999.99 → 10000000)
  • FitNumericText for all cells: Applied number-fitting to all cells (not just clipped ones), matching LibreOffice's behavior of reformatting numbers that exceed column width even in the last column

8. Boolean Center Alignment (ExcelToPdfConverter.cs)

Excel booleans (TRUE/FALSE, type b) default to center alignment. Added detection so boolean cells with no explicit alignment use "center".

9. Margin and Padding Tuning (ExcelToPdfConverter.cs)

  • MarginLeft: 50 → 54 (matching LibreOffice's default 54pt left margin)
  • ColumnPadding: 4 → 2 (reducing inter-column gap to match reference PDFs)
  • Minimum column padding clamp: 4 → 2

10. Fill Rectangle Sizing (ExcelToPdfConverter.cs)

  • Fill height uses rowHeight (not lineHeight × maxLinesInRow) to properly cover explicit row heights
  • Fill width uses exact column width (no extra padding), matching LibreOffice's cell fill rendering

11. Auto Row Height for Large Fonts (ExcelToPdfConverter.cs)

When a cell uses a font larger than the default, auto-calculates row height using maxCellFontSize × 1.3 (with lineHeight as minimum):

var autoRowHeight = maxCellFontSize > options.FontSize
    ? Math.Max(lineHeight, maxCellFontSize * 1.3f)
    : lineHeight;

12. Merged Cell Alignment Fix (ExcelToPdfConverter.cs)

Center/right alignment in merged cells now uses the full merged span width instead of only the first column's width.

13. PDF Clipping Infrastructure (PdfTextBlock.cs, PdfPage.cs, PdfWriter.cs)

Added optional ClipRect property to PdfTextBlock, AddText() now accepts a clip rectangle, and BuildContentStream() renders q ... re W n ... Q clipping operators. While full-text clipping was ultimately not used (it hurt PyMuPDF text extraction), the infrastructure remains for future use.


Approaches Tried but Reverted

Approach Result Reason
Full text + PDF clipping (no truncation) Score dropped PyMuPDF text extraction grouped text differently with q/Q wrappers
Tz horizontal text scaling (86%) 92/150 Narrower text diverged from reference (which also had merged spans)
Column padding = 0 91/150 Text from adjacent cells merged in extraction
Column padding = 1 91/150 Same merging issue as padding=0
MarginLeft = 55 Slightly worse Pixel comparison too sensitive to layout shifts
MarginRight = 54 (symmetric) Mixed No net improvement
Chart title font size 15 → 13 −7 cases Over-correction for chart rendering
Chart legend text removal Neutral Legend text was actually table header text
Chart Y-axis desiredTicks 6 → 5 Mixed Hurt some cases while helping others
Overflow guard (post-truncation check) Worse Too aggressive, removed valid text
Fill width += columnPadding Worse Created overlapping fills
Per-cell lineHeight for mixed fonts Regressed classic128 Base lineHeight more consistent
Baseline shift for mixed font sizes No effect PyMuPDF normalizes origin.y in line grouping

Remaining Failures (50 cases)

Chart Rendering (31 cases)

Most chart failures stem from:

  • Y-axis scale differencesNiceAxisScale() produces different tick values than LibreOffice
  • Chart title/label positioning — Calibri vs Helvetica width differences
  • Legend rendering — Text placement and visibility differences
  • Unsupported chart types — Stock OHLC, 3D, combo charts use fallback renderers

Non-Chart (19 cases)

Category Examples Root Cause
Long text pagination classic09 (60.1%) LibreOffice wraps/paginates at ~60 lines; MiniPdf clips
CJK/Unicode classic57 (87.6%), classic23 (91.5%) Helvetica font lacks CJK glyphs; width estimation differs
Visual margin drift classic18 (95.7%), classic60 (97.4%) Multi-page tables accumulate small positioning differences
Text span merging classic13 (98.8%), classic49 (97.4%) Adjacent cell text merges in PyMuPDF extraction
Fill/border rendering classic134 (98.9%), classic137 (98.4%) Pattern fill blending, checkerboard grid precision

Key Technical Insights

  1. FittingChars scale is critical — The ratio between Helvetica and Calibri character widths (0.86) determines how many characters fit per column. Too aggressive (0.82) causes visual overflow; too conservative (0.93) truncates too early.

  2. Column width default matters more than character widths — Switching from auto-sized columns to Excel's 8.43 char-unit default was the single highest-impact fix (+16 cases).

  3. PDF clipping hurts text extraction — While technically correct, adding q/Q graphics state wrappers disrupts PyMuPDF's text line grouping, causing text_similarity to drop.

  4. MeasureHelveticaWidth must use the same scale as FittingChars — Without this, FitNumericText reformats numbers that actually fit the column.

  5. Bottom alignment is Excel's default — Text baseline should be positioned at row_bottom + descent, not at the row top. This improved visual scores across many cases.

  6. Margin alignment matters — Matching LibreOffice's 54pt left margin improved visual similarity for styled cases.


Build & Test Commands

# Build
cd d:\git\MiniPdf
dotnet build src/MiniPdf/MiniPdf.csproj

# Generate PDFs (with --no-cache to pick up source changes)
cd tests/MiniPdf.Scripts
dotnet run --no-cache convert_xlsx_to_pdf.cs

# Run comparison
cd tests/MiniPdf.Benchmark
$env:PYTHONIOENCODING="utf-8"
python compare_pdfs.py --minipdf-dir "..\MiniPdf.Scripts\pdf_output" --reference-dir "reference_pdfs" --report-dir "reports"

# Quick check
python _quick_check.py

# Full pipeline
python run_benchmark.py --skip-generate

Files Modified

File Lines Changed Key Changes
src/MiniPdf/ExcelReader.cs ~200 Font styles, borders, pattern fills, vertical alignment, number formatting
src/MiniPdf/ExcelToPdfConverter.cs ~150 Column widths, Calibri scale, margins, padding, vertical alignment, boolean alignment, merged cell alignment, fill/border rendering, auto row height
src/MiniPdf/PdfTextBlock.cs +10 ClipRect property
src/MiniPdf/PdfPage.cs +2 AddText clipRect parameter
src/MiniPdf/PdfWriter.cs +15 Clip rectangle rendering in content stream

Clone this wiki locally