-
Notifications
You must be signed in to change notification settings - Fork 15
Benchmark Optimization Session 2026 03 05
Iterative optimization of MiniPdf's Excel-to-PDF converter, improving benchmark scores from 95.5% avg / 64 passing (≥0.99) to 97.1% avg / 100 passing across 150 test cases. The session expanded the benchmark from 120 to 150 test cases (adding classic121–classic150 for style/border/fill scenarios) and applied 15+ code changes across ExcelReader.cs, ExcelToPdfConverter.cs, PdfTextBlock.cs, PdfPage.cs, and PdfWriter.cs.
| Metric | Before | After | Delta |
|---|---|---|---|
| Average score | 95.5% | 97.1% | +1.6% |
| Cases ≥ 0.99 | 64 / 120 | 100 / 150 | +36 |
| Cases < 0.80 | 5 | 3 | −2 |
| Test cases | 120 | 150 | +30 |
| Bucket | Count |
|---|---|
| ≥ 0.99 (Excellent) | 100 |
| 0.95 – 0.98 | 22 |
| 0.90 – 0.94 | 16 |
| 0.80 – 0.89 | 9 |
| < 0.80 | 3 |
Added FontStyleInfo record and ReadFontStyles() method (replacing ReadFontColors()) to parse <sz>, <b>, <i> elements from styles.xml. Extended ExcelCell with FontSize, Bold, Italic fields.
- Added
BorderSideandCellBorderInforecords -
ReadBorders()+ReadBorderSide()parse<borders>fromstyles.xml - Renderer draws
AddLine()for each border side with style-dependent width:- thin → 0.5pt, medium → 1.0pt, thick → 1.5pt
Enhanced ReadFillColors() to handle non-solid patterns (darkGray, lightGray, mediumGray, etc.) by blending foreground/background colors with pattern-specific tint ratios.
- Added
VerticalAlignmentfield toExcelCell -
ReadCellXfStyles()now returns a 6-tuple including vertical alignment list - Renderer positions text at top/center/bottom of the row height using descent-based baseline calculation:
var descent = options.FontSize * 0.31f; if (verticalAlignment == "top") cellY = currentY - cellFontSize; else if (verticalAlignment == "center") cellY = currentY - (rowHeight - textBlock) / 2f - cellFontSize + descent; else // "bottom" (default) cellY = currentY - rowHeight + descent + lineHeight * (lines.Length - 1);
Root cause: CalculateNaturalColumnWidths() auto-sized columns to content width for multi-column sheets without explicit widths. LibreOffice uses a fixed default of 8.43 character units (47.4pt).
Fix: Use default column width when no explicit widths are present. This single change improved scores from 95.7% → 96.7% and added ~16 cases to ≥0.99.
Tuned FittingChars() scale factor from 0.93 → 0.86 to approximate Calibri character metrics (which LibreOffice uses) on Helvetica glyph widths. This allows more characters per column, matching LibreOffice's character count more closely.
Also applied CalibriFittingScale to MeasureHelveticaWidth() used by FitNumericText(), preventing over-aggressive number reformatting.
private const double CalibriFittingScale = 0.86;-
Negative currency sign placement:
$-180,000→-$180,000(minus before prefix for single-section formats) -
Parenthesized negatives:
(#,##0)format no longer adds an extra minus sign -
Date format handling:
FormatExcelDate()accepts format code with fullConvertExcelDateFormat()method supportingyyyy/MM/dd/HH/mm/sswith month-vs-minute context detection - FormatGeneral: Lowered integer threshold from 1e15 → 1e10 for scientific notation; added near-integer rounding (e.g., 9999999.99 → 10000000)
- FitNumericText for all cells: Applied number-fitting to all cells (not just clipped ones), matching LibreOffice's behavior of reformatting numbers that exceed column width even in the last column
Excel booleans (TRUE/FALSE, type b) default to center alignment. Added detection so boolean cells with no explicit alignment use "center".
-
MarginLeft: 50 → 54 (matching LibreOffice's default 54pt left margin) -
ColumnPadding: 4 → 2 (reducing inter-column gap to match reference PDFs) - Minimum column padding clamp: 4 → 2
- Fill height uses
rowHeight(notlineHeight × maxLinesInRow) to properly cover explicit row heights - Fill width uses exact column width (no extra padding), matching LibreOffice's cell fill rendering
When a cell uses a font larger than the default, auto-calculates row height using maxCellFontSize × 1.3 (with lineHeight as minimum):
var autoRowHeight = maxCellFontSize > options.FontSize
? Math.Max(lineHeight, maxCellFontSize * 1.3f)
: lineHeight;Center/right alignment in merged cells now uses the full merged span width instead of only the first column's width.
Added optional ClipRect property to PdfTextBlock, AddText() now accepts a clip rectangle, and BuildContentStream() renders q ... re W n ... Q clipping operators. While full-text clipping was ultimately not used (it hurt PyMuPDF text extraction), the infrastructure remains for future use.
| Approach | Result | Reason |
|---|---|---|
| Full text + PDF clipping (no truncation) | Score dropped | PyMuPDF text extraction grouped text differently with q/Q wrappers |
Tz horizontal text scaling (86%) |
92/150 | Narrower text diverged from reference (which also had merged spans) |
| Column padding = 0 | 91/150 | Text from adjacent cells merged in extraction |
| Column padding = 1 | 91/150 | Same merging issue as padding=0 |
| MarginLeft = 55 | Slightly worse | Pixel comparison too sensitive to layout shifts |
| MarginRight = 54 (symmetric) | Mixed | No net improvement |
| Chart title font size 15 → 13 | −7 cases | Over-correction for chart rendering |
| Chart legend text removal | Neutral | Legend text was actually table header text |
| Chart Y-axis desiredTicks 6 → 5 | Mixed | Hurt some cases while helping others |
| Overflow guard (post-truncation check) | Worse | Too aggressive, removed valid text |
| Fill width += columnPadding | Worse | Created overlapping fills |
| Per-cell lineHeight for mixed fonts | Regressed classic128 | Base lineHeight more consistent |
| Baseline shift for mixed font sizes | No effect | PyMuPDF normalizes origin.y in line grouping |
Most chart failures stem from:
-
Y-axis scale differences —
NiceAxisScale()produces different tick values than LibreOffice - Chart title/label positioning — Calibri vs Helvetica width differences
- Legend rendering — Text placement and visibility differences
- Unsupported chart types — Stock OHLC, 3D, combo charts use fallback renderers
| Category | Examples | Root Cause |
|---|---|---|
| Long text pagination | classic09 (60.1%) | LibreOffice wraps/paginates at ~60 lines; MiniPdf clips |
| CJK/Unicode | classic57 (87.6%), classic23 (91.5%) | Helvetica font lacks CJK glyphs; width estimation differs |
| Visual margin drift | classic18 (95.7%), classic60 (97.4%) | Multi-page tables accumulate small positioning differences |
| Text span merging | classic13 (98.8%), classic49 (97.4%) | Adjacent cell text merges in PyMuPDF extraction |
| Fill/border rendering | classic134 (98.9%), classic137 (98.4%) | Pattern fill blending, checkerboard grid precision |
-
FittingChars scale is critical — The ratio between Helvetica and Calibri character widths (0.86) determines how many characters fit per column. Too aggressive (0.82) causes visual overflow; too conservative (0.93) truncates too early.
-
Column width default matters more than character widths — Switching from auto-sized columns to Excel's 8.43 char-unit default was the single highest-impact fix (+16 cases).
-
PDF clipping hurts text extraction — While technically correct, adding
q/Qgraphics state wrappers disrupts PyMuPDF's text line grouping, causing text_similarity to drop. -
MeasureHelveticaWidthmust use the same scale asFittingChars— Without this,FitNumericTextreformats numbers that actually fit the column. -
Bottom alignment is Excel's default — Text baseline should be positioned at
row_bottom + descent, not at the row top. This improved visual scores across many cases. -
Margin alignment matters — Matching LibreOffice's 54pt left margin improved visual similarity for styled cases.
# Build
cd d:\git\MiniPdf
dotnet build src/MiniPdf/MiniPdf.csproj
# Generate PDFs (with --no-cache to pick up source changes)
cd tests/MiniPdf.Scripts
dotnet run --no-cache convert_xlsx_to_pdf.cs
# Run comparison
cd tests/MiniPdf.Benchmark
$env:PYTHONIOENCODING="utf-8"
python compare_pdfs.py --minipdf-dir "..\MiniPdf.Scripts\pdf_output" --reference-dir "reference_pdfs" --report-dir "reports"
# Quick check
python _quick_check.py
# Full pipeline
python run_benchmark.py --skip-generate| File | Lines Changed | Key Changes |
|---|---|---|
src/MiniPdf/ExcelReader.cs |
~200 | Font styles, borders, pattern fills, vertical alignment, number formatting |
src/MiniPdf/ExcelToPdfConverter.cs |
~150 | Column widths, Calibri scale, margins, padding, vertical alignment, boolean alignment, merged cell alignment, fill/border rendering, auto row height |
src/MiniPdf/PdfTextBlock.cs |
+10 | ClipRect property |
src/MiniPdf/PdfPage.cs |
+2 | AddText clipRect parameter |
src/MiniPdf/PdfWriter.cs |
+15 | Clip rectangle rendering in content stream |