Benchmark Optimization Session 2026 03 05

Benchmark Optimization Session — 2026-03-05

Summary

Iterative optimization of MiniPdf's Excel-to-PDF converter, improving benchmark scores from 95.5% avg / 64 passing (≥0.99) to 97.1% avg / 100 passing across 150 test cases. The session expanded the benchmark from 120 to 150 test cases (adding classic121–classic150 for style/border/fill scenarios) and applied 15+ code changes across ExcelReader.cs, ExcelToPdfConverter.cs, PdfTextBlock.cs, PdfPage.cs, and PdfWriter.cs.

Before / After

Metric	Before	After	Delta
Average score	95.5%	97.1%	+1.6%
Cases ≥ 0.99	64 / 120	100 / 150	+36
Cases < 0.80	5	3	−2
Test cases	120	150	+30

Score Distribution (150 cases)

Bucket	Count
≥ 0.99 (Excellent)	100
0.95 – 0.98	22
0.90 – 0.94	16
0.80 – 0.89	9
< 0.80	3

Changes Made

1. Font Style Support (`ExcelReader.cs`)

Added FontStyleInfo record and ReadFontStyles() method (replacing ReadFontColors()) to parse <sz>, <b>, <i> elements from styles.xml. Extended ExcelCell with FontSize, Bold, Italic fields.

2. Border Support (`ExcelReader.cs` + `ExcelToPdfConverter.cs`)

Added BorderSide and CellBorderInfo records
ReadBorders() + ReadBorderSide() parse <borders> from styles.xml
Renderer draws AddLine() for each border side with style-dependent width:
- thin → 0.5pt, medium → 1.0pt, thick → 1.5pt

3. Pattern Fill Support (`ExcelReader.cs`)

Enhanced ReadFillColors() to handle non-solid patterns (darkGray, lightGray, mediumGray, etc.) by blending foreground/background colors with pattern-specific tint ratios.

4. Vertical Alignment (`ExcelReader.cs` + `ExcelToPdfConverter.cs`)

Added VerticalAlignment field to ExcelCell
ReadCellXfStyles() now returns a 6-tuple including vertical alignment list

Renderer positions text at top/center/bottom of the row height using descent-based baseline calculation:

var descent = options.FontSize * 0.31f;
if (verticalAlignment == "top")
    cellY = currentY - cellFontSize;
else if (verticalAlignment == "center")
    cellY = currentY - (rowHeight - textBlock) / 2f - cellFontSize + descent;
else // "bottom" (default)
    cellY = currentY - rowHeight + descent + lineHeight * (lines.Length - 1);

5. Column Width Fix (Major Impact)

Root cause: CalculateNaturalColumnWidths() auto-sized columns to content width for multi-column sheets without explicit widths. LibreOffice uses a fixed default of 8.43 character units (47.4pt).

Fix: Use default column width when no explicit widths are present. This single change improved scores from 95.7% → 96.7% and added ~16 cases to ≥0.99.

6. Calibri Fitting Scale (`ExcelToPdfConverter.cs`)

Tuned FittingChars() scale factor from 0.93 → 0.86 to approximate Calibri character metrics (which LibreOffice uses) on Helvetica glyph widths. This allows more characters per column, matching LibreOffice's character count more closely.

Also applied CalibriFittingScale to MeasureHelveticaWidth() used by FitNumericText(), preventing over-aggressive number reformatting.

private const double CalibriFittingScale = 0.86;

7. Number Formatting Fixes (`ExcelReader.cs`)

Negative currency sign placement: $-180,000 → -$180,000 (minus before prefix for single-section formats)
Parenthesized negatives: (#,##0) format no longer adds an extra minus sign
Date format handling: FormatExcelDate() accepts format code with full ConvertExcelDateFormat() method supporting yyyy/MM/dd/HH/mm/ss with month-vs-minute context detection
FormatGeneral: Lowered integer threshold from 1e15 → 1e10 for scientific notation; added near-integer rounding (e.g., 9999999.99 → 10000000)
FitNumericText for all cells: Applied number-fitting to all cells (not just clipped ones), matching LibreOffice's behavior of reformatting numbers that exceed column width even in the last column

8. Boolean Center Alignment (`ExcelToPdfConverter.cs`)

Excel booleans (TRUE/FALSE, type b) default to center alignment. Added detection so boolean cells with no explicit alignment use "center".

9. Margin and Padding Tuning (`ExcelToPdfConverter.cs`)

MarginLeft: 50 → 54 (matching LibreOffice's default 54pt left margin)
ColumnPadding: 4 → 2 (reducing inter-column gap to match reference PDFs)
Minimum column padding clamp: 4 → 2

10. Fill Rectangle Sizing (`ExcelToPdfConverter.cs`)

Fill height uses rowHeight (not lineHeight × maxLinesInRow) to properly cover explicit row heights
Fill width uses exact column width (no extra padding), matching LibreOffice's cell fill rendering

11. Auto Row Height for Large Fonts (`ExcelToPdfConverter.cs`)

When a cell uses a font larger than the default, auto-calculates row height using maxCellFontSize × 1.3 (with lineHeight as minimum):

var autoRowHeight = maxCellFontSize > options.FontSize
    ? Math.Max(lineHeight, maxCellFontSize * 1.3f)
    : lineHeight;

12. Merged Cell Alignment Fix (`ExcelToPdfConverter.cs`)

Center/right alignment in merged cells now uses the full merged span width instead of only the first column's width.

13. PDF Clipping Infrastructure (`PdfTextBlock.cs`, `PdfPage.cs`, `PdfWriter.cs`)

Added optional ClipRect property to PdfTextBlock, AddText() now accepts a clip rectangle, and BuildContentStream() renders q ... re W n ... Q clipping operators. While full-text clipping was ultimately not used (it hurt PyMuPDF text extraction), the infrastructure remains for future use.

Approaches Tried but Reverted

Approach	Result	Reason
Full text + PDF clipping (no truncation)	Score dropped	PyMuPDF text extraction grouped text differently with `q`/`Q` wrappers
`Tz` horizontal text scaling (86%)	92/150	Narrower text diverged from reference (which also had merged spans)
Column padding = 0	91/150	Text from adjacent cells merged in extraction
Column padding = 1	91/150	Same merging issue as padding=0
MarginLeft = 55	Slightly worse	Pixel comparison too sensitive to layout shifts
MarginRight = 54 (symmetric)	Mixed	No net improvement
Chart title font size 15 → 13	−7 cases	Over-correction for chart rendering
Chart legend text removal	Neutral	Legend text was actually table header text
Chart Y-axis desiredTicks 6 → 5	Mixed	Hurt some cases while helping others
Overflow guard (post-truncation check)	Worse	Too aggressive, removed valid text
Fill width += columnPadding	Worse	Created overlapping fills
Per-cell lineHeight for mixed fonts	Regressed classic128	Base lineHeight more consistent
Baseline shift for mixed font sizes	No effect	PyMuPDF normalizes origin.y in line grouping

Remaining Failures (50 cases)

Chart Rendering (31 cases)

Most chart failures stem from:

Y-axis scale differences — NiceAxisScale() produces different tick values than LibreOffice
Chart title/label positioning — Calibri vs Helvetica width differences
Legend rendering — Text placement and visibility differences
Unsupported chart types — Stock OHLC, 3D, combo charts use fallback renderers

Non-Chart (19 cases)

Category	Examples	Root Cause
Long text pagination	classic09 (60.1%)	LibreOffice wraps/paginates at ~60 lines; MiniPdf clips
CJK/Unicode	classic57 (87.6%), classic23 (91.5%)	Helvetica font lacks CJK glyphs; width estimation differs
Visual margin drift	classic18 (95.7%), classic60 (97.4%)	Multi-page tables accumulate small positioning differences
Text span merging	classic13 (98.8%), classic49 (97.4%)	Adjacent cell text merges in PyMuPDF extraction
Fill/border rendering	classic134 (98.9%), classic137 (98.4%)	Pattern fill blending, checkerboard grid precision

Key Technical Insights

FittingChars scale is critical — The ratio between Helvetica and Calibri character widths (0.86) determines how many characters fit per column. Too aggressive (0.82) causes visual overflow; too conservative (0.93) truncates too early.
Column width default matters more than character widths — Switching from auto-sized columns to Excel's 8.43 char-unit default was the single highest-impact fix (+16 cases).
PDF clipping hurts text extraction — While technically correct, adding q/Q graphics state wrappers disrupts PyMuPDF's text line grouping, causing text_similarity to drop.
MeasureHelveticaWidth must use the same scale as FittingChars — Without this, FitNumericText reformats numbers that actually fit the column.
Bottom alignment is Excel's default — Text baseline should be positioned at row_bottom + descent, not at the row top. This improved visual scores across many cases.
Margin alignment matters — Matching LibreOffice's 54pt left margin improved visual similarity for styled cases.

Build & Test Commands

# Build
cd d:\git\MiniPdf
dotnet build src/MiniPdf/MiniPdf.csproj

# Generate PDFs (with --no-cache to pick up source changes)
cd tests/MiniPdf.Scripts
dotnet run --no-cache convert_xlsx_to_pdf.cs

# Run comparison
cd tests/MiniPdf.Benchmark
$env:PYTHONIOENCODING="utf-8"
python compare_pdfs.py --minipdf-dir "..\MiniPdf.Scripts\pdf_output" --reference-dir "reference_pdfs" --report-dir "reports"

# Quick check
python _quick_check.py

# Full pipeline
python run_benchmark.py --skip-generate

Files Modified

File	Lines Changed	Key Changes
`src/MiniPdf/ExcelReader.cs`	~200	Font styles, borders, pattern fills, vertical alignment, number formatting
`src/MiniPdf/ExcelToPdfConverter.cs`	~150	Column widths, Calibri scale, margins, padding, vertical alignment, boolean alignment, merged cell alignment, fill/border rendering, auto row height
`src/MiniPdf/PdfTextBlock.cs`	+10	ClipRect property
`src/MiniPdf/PdfPage.cs`	+2	AddText clipRect parameter
`src/MiniPdf/PdfWriter.cs`	+15	Clip rectangle rendering in content stream

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Benchmark Optimization Session 2026 03 05

Benchmark Optimization Session — 2026-03-05

Summary

Before / After

Score Distribution (150 cases)

Changes Made

1. Font Style Support (`ExcelReader.cs`)

2. Border Support (`ExcelReader.cs` + `ExcelToPdfConverter.cs`)

3. Pattern Fill Support (`ExcelReader.cs`)

4. Vertical Alignment (`ExcelReader.cs` + `ExcelToPdfConverter.cs`)

5. Column Width Fix (Major Impact)

6. Calibri Fitting Scale (`ExcelToPdfConverter.cs`)

7. Number Formatting Fixes (`ExcelReader.cs`)

8. Boolean Center Alignment (`ExcelToPdfConverter.cs`)

9. Margin and Padding Tuning (`ExcelToPdfConverter.cs`)

10. Fill Rectangle Sizing (`ExcelToPdfConverter.cs`)

11. Auto Row Height for Large Fonts (`ExcelToPdfConverter.cs`)

12. Merged Cell Alignment Fix (`ExcelToPdfConverter.cs`)

13. PDF Clipping Infrastructure (`PdfTextBlock.cs`, `PdfPage.cs`, `PdfWriter.cs`)

Approaches Tried but Reverted

Remaining Failures (50 cases)

Chart Rendering (31 cases)

Non-Chart (19 cases)

Key Technical Insights

Build & Test Commands

Files Modified

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally

Benchmark Optimization Session 2026 03 05

Benchmark Optimization Session — 2026-03-05

Summary

Before / After

Score Distribution (150 cases)

Changes Made

1. Font Style Support (ExcelReader.cs)

2. Border Support (ExcelReader.cs + ExcelToPdfConverter.cs)

3. Pattern Fill Support (ExcelReader.cs)

4. Vertical Alignment (ExcelReader.cs + ExcelToPdfConverter.cs)

5. Column Width Fix (Major Impact)

6. Calibri Fitting Scale (ExcelToPdfConverter.cs)

7. Number Formatting Fixes (ExcelReader.cs)

8. Boolean Center Alignment (ExcelToPdfConverter.cs)

9. Margin and Padding Tuning (ExcelToPdfConverter.cs)

10. Fill Rectangle Sizing (ExcelToPdfConverter.cs)

11. Auto Row Height for Large Fonts (ExcelToPdfConverter.cs)

12. Merged Cell Alignment Fix (ExcelToPdfConverter.cs)

13. PDF Clipping Infrastructure (PdfTextBlock.cs, PdfPage.cs, PdfWriter.cs)

Approaches Tried but Reverted

Remaining Failures (50 cases)

Chart Rendering (31 cases)

Non-Chart (19 cases)

Key Technical Insights

Build & Test Commands

Files Modified

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally

1. Font Style Support (`ExcelReader.cs`)

2. Border Support (`ExcelReader.cs` + `ExcelToPdfConverter.cs`)

3. Pattern Fill Support (`ExcelReader.cs`)

4. Vertical Alignment (`ExcelReader.cs` + `ExcelToPdfConverter.cs`)

6. Calibri Fitting Scale (`ExcelToPdfConverter.cs`)

7. Number Formatting Fixes (`ExcelReader.cs`)

8. Boolean Center Alignment (`ExcelToPdfConverter.cs`)

9. Margin and Padding Tuning (`ExcelToPdfConverter.cs`)

10. Fill Rectangle Sizing (`ExcelToPdfConverter.cs`)

11. Auto Row Height for Large Fonts (`ExcelToPdfConverter.cs`)

12. Merged Cell Alignment Fix (`ExcelToPdfConverter.cs`)

13. PDF Clipping Infrastructure (`PdfTextBlock.cs`, `PdfPage.cs`, `PdfWriter.cs`)