docx_comment_parser

A fast, memory-efficient C++17 shared library (DLL/SO) that extracts all comment metadata from .docx files, with full Python bindings via pybind11.

Features

Feature	Details
Comment fields	id, author, date, initials, full text, paragraph style
Anchoring	referenced document text (via `commentRangeStart/End`)
Threading	parent/reply relationships (OOXML 2016+ `commentsExtended.xml`)
Resolution	`done` flag, earliest/latest dates, per-author filtering
Batch parsing	Thread-pool with configurable parallelism
Memory	ZIP entries inflated one-at-a-time; SAX for document body; no full DOM
Dependencies	libxml2, zlib (packaged on Linux and macOS; installable through vcpkg on Windows)
Python	pybind11 extension module, GIL released during batch parsing
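
To make the "Comment fields" row concrete, the sketch below reads the same attributes straight from `word/comments.xml` using only the Python standard library. It is an illustration of where the data lives inside a .docx archive, not the library's implementation (which uses libxml2 in C++). The helper name `read_comments` and the sample XML are assumptions made for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by comments.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical sample content, standing in for a real document's comments part
COMMENTS_XML = f"""<?xml version="1.0" encoding="UTF-8"?>
<w:comments xmlns:w="{W}">
  <w:comment w:id="0" w:author="Alice" w:date="2024-01-15T10:30:00Z" w:initials="A">
    <w:p><w:r><w:t>Please </w:t></w:r><w:r><w:t>rephrase.</w:t></w:r></w:p>
  </w:comment>
</w:comments>"""


def read_comments(docx_bytes):
    """Return one dict per <w:comment> found in word/comments.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/comments.xml"))
    result = []
    for node in root.findall(f"{{{W}}}comment"):
        result.append({
            "id": int(node.get(f"{{{W}}}id")),
            "author": node.get(f"{{{W}}}author", ""),
            "date": node.get(f"{{{W}}}date", ""),
            "initials": node.get(f"{{{W}}}initials", ""),
            # Plain text is the concatenation of every <w:t> run in the comment
            "text": "".join(t.text or "" for t in node.iter(f"{{{W}}}t")),
        })
    return result


# Build a minimal in-memory archive containing only the comments part
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/comments.xml", COMMENTS_XML)

comments = read_comments(buf.getvalue())
print(comments)
```

A real .docx contains many more parts; only `word/comments.xml` is needed for these four attributes and the comment text.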

Building

Prerequisites

Linux / macOS

sudo apt install libxml2-dev zlib1g-dev   # Debian/Ubuntu
brew install libxml2 zlib                  # macOS
pip install pybind11 cmake

Windows

Install vcpkg, then:

vcpkg install libxml2 zlib pybind11

CMake (recommended)

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)             # macOS: use -j$(sysctl -n hw.ncpu)
# Optionally run tests:
cd build && ctest --output-on-failure

This produces:

build/libdocx_comment_parser.so (Linux) / .dylib (macOS) / .dll (Windows)
build/_docx_comment_parser*.so – Python extension

pip (Python only)

pip install pybind11
pip install .

Python Usage

import docx_comment_parser as dcp

# ── Single file ──────────────────────────────────────────────────────────────
parser = dcp.DocxParser()
parser.parse("report.docx")

for c in parser.comments():
    print(f"[{c.id}] {c.author} ({c.date}): {c.text[:80]}")
    if c.referenced_text:
        print(f"  ↳ anchored to: '{c.referenced_text[:60]}'")
    if c.is_reply:
        print(f"  ↳ reply to comment #{c.parent_id}")

# Filter by author
for c in parser.by_author("Alice"):
    print(c.to_dict())

# Get full thread for a root comment
for c in parser.thread(0):
    indent = "  " if c.is_reply else ""
    print(f"{indent}[{c.id}] {c.author}: {c.text}")

# Stats
s = parser.stats()
print(f"Total: {s.total_comments}, Authors: {s.unique_authors}")
print(f"Date range: {s.earliest_date} → {s.latest_date}")

# ── Batch (parallel) ─────────────────────────────────────────────────────────
import glob

bp = dcp.BatchParser(max_threads=0)   # 0 = auto
files = glob.glob("/documents/**/*.docx", recursive=True)
bp.parse_all(files)

errors = bp.errors()   # fetch the error map once instead of on every iteration
for f in files:
    if f in errors:
        print(f"ERROR {f}: {errors[f]}")
        continue
    s = bp.stats(f)
    print(f"{f}: {s.total_comments} comments by {len(s.unique_authors)} authors")

bp.release_all()   # free memory
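
As a rough picture of what the statistics above represent, the following sketch computes the same four values from plain dictionaries. It assumes `to_dict()` yields at least `author` and `date` keys matching the field names in this README, and that all dates share one ISO-8601 format so that string comparison matches chronological order. The function `summarize` is hypothetical and is not part of the library.

```python
def summarize(rows):
    """Compute totals, authors and date range from comment dictionaries."""
    dates = [r["date"] for r in rows if r.get("date")]
    return {
        "total_comments": len(rows),
        "unique_authors": sorted({r["author"] for r in rows}),
        # Uniformly formatted ISO-8601 strings sort chronologically
        "earliest_date": min(dates, default=""),
        "latest_date": max(dates, default=""),
    }


# Hypothetical rows, shaped like the output assumed for c.to_dict()
rows = [
    {"id": 0, "author": "Alice", "date": "2024-01-15T10:30:00Z"},
    {"id": 1, "author": "Bob", "date": "2024-01-16T08:00:00Z"},
    {"id": 2, "author": "Alice", "date": "2024-01-14T17:45:00Z"},
]

summary = summarize(rows)
print(summary)
```

If a document mixes time zone offsets, the strings would need to be parsed into datetimes before comparing.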

C++ Usage

#include <iostream>

#include "docx_comment_parser.h"

int main() {
    // Single file
    docx::DocxParser parser;
    parser.parse("report.docx");

    for (const auto& c : parser.comments()) {
        std::cout << c.id << " | " << c.author << " | " << c.text << "\n";
    }

    // Batch: parse several files on a pool of 4 worker threads
    docx::BatchParser bp(/*threads=*/4);
    bp.parse_all({"a.docx", "b.docx", "c.docx"});
    for (const auto& [path, err] : bp.errors())  // structured bindings (C++17)
        std::cerr << "Failed: " << path << ": " << err << "\n";
    bp.release_all();  // free all parsed results

    return 0;
}

CommentMetadata fields

Field	Type	Source
`id`	`int`	`w:id` attribute
`author`	`str`	`w:author`
`date`	`str`	`w:date` (ISO-8601)
`initials`	`str`	`w:initials`
`text`	`str`	Full plain-text of comment body
`paragraph_style`	`str`	Style of first paragraph in comment
`referenced_text`	`str`	Document text anchored by this comment
`is_reply`	`bool`	True if this is a threaded reply
`parent_id`	`int`	id of parent comment (-1 if root)
`replies`	`list[CommentRef]`	Direct replies (populated on parent)
`para_id`	`str`	OOXML 2016+ paragraph ID
`para_id_parent`	`str`	Parent paragraph ID (before id resolution)
`done`	`bool`	Resolved/done flag (`commentsExtended.xml`)
`thread_ids`	`list[int]`	Ordered ids in this thread (root only)
`paragraph_index`	`int`	0-based paragraph in document body
`run_index`	`int`	0-based run within paragraph
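
The relationship between `para_id`, `para_id_parent`, `parent_id`, and `thread_ids` can be sketched as follows. This is a simplified model under stated assumptions, not the library's code: the `Comment` dataclass and `resolve_threads` are hypothetical, threads are treated as flat (every reply points at the root), and `thread_ids` is assumed to start with the root's own id.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Comment:
    id: int
    para_id: str
    para_id_parent: str = ""      # raw value, before id resolution
    parent_id: int = -1           # -1 marks a root comment
    is_reply: bool = False
    thread_ids: List[int] = field(default_factory=list)


def resolve_threads(comments):
    """Turn paragraph-ID links into integer parent ids and thread lists."""
    by_para = {c.para_id: c for c in comments if c.para_id}
    for c in comments:
        parent = by_para.get(c.para_id_parent)
        if parent is not None:
            c.parent_id = parent.id
            c.is_reply = True
    for c in comments:
        if not c.is_reply:
            # Root first, then its replies in document order
            c.thread_ids = [c.id] + [r.id for r in comments if r.parent_id == c.id]
    return comments


# Hypothetical paragraph IDs
comments = resolve_threads([
    Comment(0, "0A1B2C3D"),
    Comment(1, "1B2C3D4E", para_id_parent="0A1B2C3D"),
    Comment(2, "2C3D4E5F", para_id_parent="0A1B2C3D"),
    Comment(3, "3D4E5F60"),
])

for c in comments:
    print(c.id, c.parent_id, c.is_reply, c.thread_ids)
```

Comments from documents without `commentsExtended.xml` would have empty paragraph IDs and stay roots in this model.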

Architecture

docx_comment_parser/
├── include/
│   ├── docx_comment_parser.h   # Public API (CommentMetadata, DocxParser, BatchParser)
│   ├── zip_reader.h            # ZIP reader interface (zlib only, no libzip)
│   └── xml_utils.h             # Lightweight libxml2 helpers
├── src/
│   ├── zip_reader.cpp          # Memory-mapped ZIP + inflate
│   ├── docx_parser.cpp         # Core: comments.xml (DOM) + document.xml (SAX)
│   └── batch_parser.cpp        # Thread-pool batch processing
├── python/
│   └── python_bindings.cpp     # pybind11 module
├── tests/
│   └── test_docx_parser.cpp    # Self-contained test suite
├── CMakeLists.txt
└── setup.py

Memory model

ZIP entries are memory-mapped and inflated one at a time; no entry's data is kept in memory while another is being read.
comments.xml is parsed with libxml2 DOM (typically < 100 KB).
document.xml (which can be very large) is streamed with libxml2 SAX2; only the anchor text accumulator is kept in memory.
BatchParser runs one DocxParser per thread; each file's results can be freed individually with release() to reclaim memory after use.
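
The streaming approach for document.xml can be illustrated with Python's built-in `xml.sax` module, used here in place of libxml2 SAX2. The sketch keeps only a text accumulator per open comment range and never builds a tree. The class `AnchorHandler`, the function `anchored_text`, and the sample XML are assumptions for illustration only.

```python
import io
import xml.sax
import xml.sax.handler

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


class AnchorHandler(xml.sax.handler.ContentHandler):
    """Collect text that falls between commentRangeStart and commentRangeEnd."""

    def __init__(self):
        super().__init__()
        self.open_ids = set()   # comment ranges currently open
        self.parts = {}         # comment id -> list of text chunks
        self.in_text = False    # True while inside a <w:t> element

    def startElementNS(self, name, qname, attrs):
        uri, local = name
        if uri != W:
            return
        if local == "commentRangeStart":
            cid = int(attrs.get((W, "id")))
            self.open_ids.add(cid)
            self.parts.setdefault(cid, [])
        elif local == "commentRangeEnd":
            self.open_ids.discard(int(attrs.get((W, "id"))))
        elif local == "t":
            self.in_text = True

    def endElementNS(self, name, qname):
        if name == (W, "t"):
            self.in_text = False

    def characters(self, content):
        if self.in_text:
            for cid in self.open_ids:
                self.parts[cid].append(content)


def anchored_text(stream):
    """Stream-parse a document.xml byte stream; return {comment id: text}."""
    handler = AnchorHandler()
    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_namespaces, True)
    parser.setContentHandler(handler)
    parser.parse(stream)
    return {cid: "".join(chunks) for cid, chunks in handler.parts.items()}


# Hypothetical document body with one commented range
DOCUMENT_XML = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    '<w:r><w:t>Intro </w:t></w:r>'
    '<w:commentRangeStart w:id="0"/>'
    '<w:r><w:t>anchored </w:t></w:r><w:r><w:t>text</w:t></w:r>'
    '<w:commentRangeEnd w:id="0"/>'
    '<w:r><w:t> tail</w:t></w:r>'
    '</w:p></w:body></w:document>'
).encode("utf-8")

anchors = anchored_text(io.BytesIO(DOCUMENT_XML))
print(anchors)
```

Because the handler tracks a set of open ids, overlapping comment ranges each receive the text they cover.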

License

MIT
