1. Introduction

Full page transcription of historical documents (FP-THD) is a pipeline for transcribing historical documents while preserving their special features. In this work, we extend an existing text line recognition method with a layout analysis model: the layout analysis model extracts the text lines from historical page images, and an OCR model then transcribes them to produce a fully digitized page.

2. Overview architecture

3. Repository organization

The repository is organized into four principal parts corresponding to the pipeline:

Layout analysis (Layout_analysis): This component performs the layout analysis of the input pages.
Extraction of page line images (Extraction_of_page_line_images): This component extracts the individual text line images from the input to prepare them for subsequent processing.
OCR (OCR): The Masked Autoencoder with Vision Transformer (MAE-ViT) model, which is responsible for recognizing the text.
Result representation (Result_representation): This component visualizes or displays the final output of the processing pipeline.

The remaining directories are input (for the input images), train (for preparing the data and training the models from scratch), and args (for the required arguments).
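Before running anything, it can help to confirm that a checkout contains the directories described above. The following is a minimal sketch, not part of the repository: the helper name `missing_dirs` is hypothetical, and the directory names are simply the ones listed in this section.

```python
from pathlib import Path

# Directory names taken from the repository description above.
EXPECTED_DIRS = [
    "Layout_analysis",
    "Extraction_of_page_line_images",
    "OCR",
    "Result_representation",
    "input",
    "train",
    "args",
]


def missing_dirs(repo_root, expected=EXPECTED_DIRS):
    """Return the expected directory names that are absent under repo_root."""
    root = Path(repo_root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumes the script is started from the repository root.
    missing = missing_dirs(".")
    print("Missing directories:", missing or "none")
```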

4. Pipeline Launch Guide

Follow the steps below to set up and run the pipeline.

Step 1: Prepare Your Input Data

Go to the input/ directory and place your images there.
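To check which files the pipeline is likely to pick up, you can list the images in input/ that match the extension you intend to pass as --image-extension in Step 5. This is a sketch under assumptions: the helper `list_images` is hypothetical, and it matches extensions case-insensitively, which may differ from how main_pipeline.py selects files.

```python
from pathlib import Path


def list_images(folder, extension="tif"):
    """Return the sorted files in `folder` whose suffix matches `extension`.

    The comparison is case-insensitive (an assumption of this sketch).
    """
    suffix = "." + extension.lower().lstrip(".")
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == suffix
    )


if __name__ == "__main__":
    folder = Path("input")
    if folder.is_dir():
        images = list_images(folder)
        print(f"{len(images)} .tif image(s) found in {folder}/")
    else:
        print("No input/ directory found in the current directory")
```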

Step 2: Layout analysis

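Clone the pero-ocr repository, which provides the layout analysis code. Note that git clone expects the repository URL, not the /tree/master browser URL:

git clone https://github.com/DCGM/pero-ocr.git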

Copy the layout analysis code into the Layout_analysis folder using the following command:

cp -r  pero-ocr/pero_ocr/* ./Layout_analysis/

Next, replace every occurrence of "pero_ocr" in the Python files with "Layout_analysis". You can use this command:

grep -rl --include="*.py" "pero_ocr" | xargs sed -i 's/pero_ocr/Layout_analysis/g'
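Because sed -i rewrites files in place, you may want to preview which files would be touched first. The sketch below is a hypothetical Python equivalent of the command above, with two differences that are choices of this sketch: it takes an explicit root directory instead of searching from the current directory, and it defaults to a dry run. It assumes the source files are UTF-8 encoded.

```python
from pathlib import Path


def rename_package(root, old="pero_ocr", new="Layout_analysis", dry_run=True):
    """Replace `old` with `new` in every .py file under `root`.

    Returns the files that contain `old`. With dry_run=True nothing is written.
    """
    changed = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8")  # assumes UTF-8 sources
        if old not in text:
            continue
        changed.append(path)
        if not dry_run:
            path.write_text(text.replace(old, new), encoding="utf-8")
    return changed


if __name__ == "__main__":
    root = Path("Layout_analysis")
    if root.is_dir():
        # Dry run: only list the files that would be rewritten.
        for path in rename_package(root):
            print("would rewrite", path)
```

Calling `rename_package("Layout_analysis", dry_run=False)` would then apply the replacement.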

Make sure the Layout_analysis folder is in place before proceeding. The pretrained model "General layout analysis (printed and handwritten) with European printed OCR specialized to Czech newspapers" can be downloaded here: download model. Place the model in the args folder, together with the config.ini file.
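A quick way to confirm that the configuration file is where the pipeline expects it is to open it and list its sections. This sketch assumes that config.ini is a standard INI file readable by Python's configparser; the helper name `config_sections` is hypothetical, and no particular section names are assumed.

```python
import configparser
from pathlib import Path


def config_sections(config_path):
    """Return the section names of an INI file, raising if the file is missing."""
    path = Path(config_path)
    if not path.is_file():
        raise FileNotFoundError(f"config file not found: {path}")
    parser = configparser.ConfigParser()
    parser.read(path, encoding="utf-8")
    return parser.sections()


if __name__ == "__main__":
    try:
        print("Sections:", config_sections("./args/config.ini"))
    except FileNotFoundError as err:
        print(err)
```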

Step 4: OCR

The OCR folder contains ./model/ and ./utils/ from the public HTR-VT repository.

To execute the OCR step, you need to place the pretrained model in the checkpoints folder.

Go to this link and download the model: FP-THD model.

Step 5: Run the Main Program

To run the pipeline, execute the following command:

python main_pipeline.py --config-path ./args/config.ini  --image-folder ./input/ --cropped-lines-folder ./Extraction_of_page_line_images/  --save-dir ./args/checkpoints/ --out-dir Result_representation/ --exp-name text_xmls --image-extension tif
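If you prefer to launch the pipeline from a script, for example to vary the image extension between runs, the command above can be assembled as an argument list. This is only a sketch: the helpers `build_command` and `run_pipeline` are hypothetical, while the flags and paths are copied from the command above.

```python
import shlex
import subprocess
import sys


def build_command(image_extension="tif"):
    """Return the pipeline command from this step as an argument list."""
    return [
        sys.executable, "main_pipeline.py",
        "--config-path", "./args/config.ini",
        "--image-folder", "./input/",
        "--cropped-lines-folder", "./Extraction_of_page_line_images/",
        "--save-dir", "./args/checkpoints/",
        "--out-dir", "Result_representation/",
        "--exp-name", "text_xmls",
        "--image-extension", image_extension,
    ]


def run_pipeline(command):
    """Run the command from the repository root and raise if it fails."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only print the command here; call run_pipeline(...) to execute it.
    print(shlex.join(build_command()))
```

Passing the arguments as a list avoids shell quoting issues if any path contains spaces.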

Step 6: View the Results

After execution, the output results will be available in the Result_representation folder.

5. Acknowledgement

This work was supported by the Aragon Regional Government (Spain) through the project PROY_S11_24. We acknowledge the help provided by the public code repositories HTR-VT and Pero-ocr.
