[Project Page] [Datasets] [Checkpoints]
FP-THD (Full Page Transcription of Historical Documents) is a pipeline for transcribing historical documents while preserving their special features. It extends an existing text line recognition method with a layout analysis model: the layout analysis model extracts text lines from historical page images, and an OCR model then processes those lines to generate a fully digitized page.
The repository is organized into four principal parts corresponding to the pipeline:
- Layout analysis (Layout_analysis): This corresponds to the layout analysis component, which locates the text lines on each page.
- Extraction of page line images (Extraction_of_page_line_images): This component extracts individual text line images from the input to prepare them for subsequent processing.
- OCR: The Masked Autoencoder with Vision Transformer (MAE-ViT) model that recognizes the text.
- Result representation (Result_representation): This component visualizes or displays the final output of the processing pipeline.
The other directories are input (where the input images are placed), train (for preparing the data and training the models from scratch), and args (where the required arguments are placed).
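Before running anything, it can help to confirm that the directories described above are present. The following is a minimal sketch, not part of the repository: the helper name `missing_dirs` is hypothetical, and the directory names are simply those listed in this README.

```python
from pathlib import Path

# Directory names taken from the repository description in this README.
EXPECTED_DIRS = [
    "Layout_analysis",
    "Extraction_of_page_line_images",
    "OCR",
    "Result_representation",
    "input",
    "train",
    "args",
]


def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected directory names that do not exist under root.

    Hypothetical helper for illustration; the order of `expected` is kept.
    """
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]
```

Calling `missing_dirs(".")` from the repository root returns an empty list when every directory is in place.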
Follow the steps below to set up and run the pipeline.
Go to the input/ directory and place your images there.
Clone the layout analysis repository (pero-ocr):
git clone https://github.com/DCGM/pero-ocr
Copy the layout analysis code using the following command:
cp -r pero-ocr/pero_ocr/* ./Layout_analysis/
Then replace every occurrence of "pero_ocr" with "Layout_analysis". You can use this command:
grep -rl --include="*.py" "pero_ocr" | xargs sed -i 's/pero_ocr/Layout_analysis/g'
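Because `sed -i` behaves differently across platforms, the same replacement can also be sketched in Python. This is only an illustrative equivalent of the command above, assuming the source files are UTF-8 encoded; the function name `rename_package` is hypothetical and is not provided by the repository.

```python
from pathlib import Path


def rename_package(root, old="pero_ocr", new="Layout_analysis"):
    """Replace every occurrence of `old` with `new` in the .py files under root.

    Hypothetical helper mirroring the grep/sed command; assumes UTF-8 files.
    Returns the list of files that were modified.
    """
    changed = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8")
        if old in text:
            # Plain substring replacement, like sed's s/old/new/g.
            path.write_text(text.replace(old, new), encoding="utf-8")
            changed.append(path)
    return changed
```

Running `rename_package(".")` from the repository root touches the same set of Python files as the shell command, and the returned list makes it easy to review what changed.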
Make sure the Layout_analysis folder is set up before proceeding. The general layout analysis model (printed and handwritten), with European printed OCR specialized to Czech newspapers, can be downloaded here: download model. Place the model in the args folder together with the config.ini file.
The OCR folder contains ./model/ and ./utils/ from the public HTR-VT repository.
To execute the OCR step, place the pretrained model in the checkpoints folder.
Go to this link and download the model: FP-THD model.
To run the pipeline, execute the following command:
python main_pipeline.py --config-path ./args/config.ini --image-folder ./input/ --cropped-lines-folder ./Extraction_of_page_line_images/ --save-dir ./args/checkpoints/ --out-dir Result_representation/ --exp-name text_xmls --image-extension tif
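When the pipeline is launched from a script rather than typed by hand, the command above can be assembled as an argument list. The sketch below is an illustration under stated assumptions: the helper names `count_images` and `build_command` are hypothetical, the flags and default paths are copied from the command above, and the image extension is assumed to be given without a leading dot, as in `--image-extension tif`.

```python
from pathlib import Path


def count_images(folder, extension="tif"):
    """Count the files in `folder` whose suffix is `.extension` (hypothetical helper)."""
    return sum(1 for p in Path(folder).glob(f"*.{extension}") if p.is_file())


def build_command(image_folder="./input/", extension="tif"):
    """Assemble the argument list for main_pipeline.py.

    The flags and paths mirror the command shown in this README; only the
    image folder and extension are exposed as parameters here.
    """
    return [
        "python", "main_pipeline.py",
        "--config-path", "./args/config.ini",
        "--image-folder", image_folder,
        "--cropped-lines-folder", "./Extraction_of_page_line_images/",
        "--save-dir", "./args/checkpoints/",
        "--out-dir", "Result_representation/",
        "--exp-name", "text_xmls",
        "--image-extension", extension,
    ]
```

A caller could first check that `count_images("./input/")` is greater than zero and then pass `build_command()` to `subprocess.run(..., check=True)` from the repository root.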
After execution, the output results will be available in the Result_representation folder.
This work was supported by the Aragon Regional Government (Spain) through the project PROY_S11_24. We acknowledge the help provided by the public code repositories HTR-VT and pero-ocr.
