This tool extracts text based on positional and alignment criteria from standardized text outputs, such as PDFs. It supports two main extraction methods: Label Extraction and Row Extraction, allowing users to specify an anchor and direction for precise text extraction.
To set up the Text Extraction Tool, follow these steps:
- Ensure Node.js is Installed:
  - Visit the official Node.js website and download the installer for your operating system. The LTS version is recommended.
  - Follow the installation prompts to install Node.js and npm.
  - Verify the installation by running `node -v` and `npm -v` in your terminal or command prompt.
- Clone the Repository:
  - Clone the project repository to your local machine using Git, or download the ZIP file and extract it.
- Install Dependencies:
  - Navigate to the project directory in your terminal or command prompt.
  - Run `npm install` to install the project's dependencies as listed in `package.json`. This command installs TypeScript, Jest for testing, and other necessary packages.
- Compile TypeScript:
  - The project uses TypeScript, which must be compiled to JavaScript before execution. Running `npm run start` compiles the TypeScript files and then executes the `dist/index.js` file.
- Start the Application:
  - Use the `npm start` command to compile the TypeScript files and run the application. This command is defined in `package.json` under the `scripts` section and performs both compilation and execution in one step.
  - Run `npm test` to run the Jest tests.
Edit input.json to switch between label and row extraction methods. The structure for each method is defined as follows:
    export interface Label {
      id: "label";
      position: "right" | "left" | "above" | "below";
      textAlignment: "right" | "left";
      anchor: string;
    }

    export interface Row {
      id: "row";
      position: "right" | "left";
      tiebreaker: number | "last";
      anchor: string;
    }

Example input.json:

    {
      "method": "label",
      "position": "above",
      "textAlignment": "left",
      "anchor": "Example Anchor"
    }

- Label: Targets text adjacent to an anchor.
- Row: Focuses on text at the same vertical level as the anchor.
- Filtering: Narrows down lines based on their spatial relation to the anchor.
- Sorting: Orders filtered lines by proximity to the anchor, considering alignment.
- Extraction: Returns the line that best matches the criteria.
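
The filtering, sorting, and extraction steps for the label method can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the `TextLine` shape, the top-left coordinate origin, the exact-match anchor lookup, and the scoring rule in `score` are all hypothetical, since the README does not describe the real data model or distance metric.

```typescript
// Hypothetical line geometry; the README does not specify the real data model.
// Assumes a top-left origin, so y values grow downward.
interface TextLine {
  text: string;
  left: number;
  right: number;
  top: number;
  bottom: number;
}

// Mirrors the Label interface shown earlier in this README.
interface Label {
  id: "label";
  position: "right" | "left" | "above" | "below";
  textAlignment: "right" | "left";
  anchor: string;
}

// Filtering: is the line entirely on the requested side of the anchor?
function isOnSide(line: TextLine, anchor: TextLine, position: Label["position"]): boolean {
  switch (position) {
    case "above": return line.bottom <= anchor.top;
    case "below": return line.top >= anchor.bottom;
    case "left": return line.right <= anchor.left;
    case "right": return line.left >= anchor.right;
  }
}

// Sorting key (assumed rule): gap along the search axis plus misalignment.
// For above/below, misalignment compares the edges chosen by textAlignment;
// for left/right, it compares the top edges.
function score(line: TextLine, anchor: TextLine, config: Label): number {
  const edge = (l: TextLine): number =>
    config.textAlignment === "left" ? l.left : l.right;
  switch (config.position) {
    case "above":
      return anchor.top - line.bottom + Math.abs(edge(line) - edge(anchor));
    case "below":
      return line.top - anchor.bottom + Math.abs(edge(line) - edge(anchor));
    case "left":
      return anchor.left - line.right + Math.abs(line.top - anchor.top);
    case "right":
      return line.left - anchor.right + Math.abs(line.top - anchor.top);
  }
}

// Extraction: return the text of the best-scoring candidate, if any.
function extractLabel(lines: TextLine[], config: Label): string | undefined {
  // Assumption: the anchor is the first line whose text matches exactly.
  const anchor = lines.filter((l) => l.text === config.anchor)[0];
  if (!anchor) return undefined;
  const candidates = lines
    .filter((l) => l !== anchor && isOnSide(l, anchor, config.position))
    .sort((a, b) => score(a, anchor, config) - score(b, anchor, config));
  const best = candidates[0];
  return best ? best.text : undefined;
}

// Made-up sample data for illustration only.
const exampleLines: TextLine[] = [
  { text: "Header", left: 10, right: 50, top: 80, bottom: 90 },
  { text: "Far", left: 200, right: 240, top: 85, bottom: 95 },
  { text: "Invoice No", left: 10, right: 60, top: 100, bottom: 110 },
  { text: "12345", left: 10, right: 40, top: 120, bottom: 130 },
];

console.log(
  extractLabel(exampleLines, {
    id: "label",
    position: "above",
    textAlignment: "left",
    anchor: "Invoice No",
  })
); // prints "Header"
```

In this sketch "Far" is slightly closer vertically, but "Header" wins because its left edge lines up with the anchor. A real implementation may weight proximity and alignment differently.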
Remember to adjust input.json for different documents and extraction needs.
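
Because input.json is edited by hand, it can help to validate it before running an extraction. The sketch below is one possible approach and is not taken from the project: `parseConfig`, `isOneOf`, and `ExtractionConfig` are hypothetical names. The interfaces above use an `id` field while the example input.json uses `method`, so the sketch accepts either key as an assumption.

```typescript
// Mirrors the interfaces shown earlier in this README.
interface Label {
  id: "label";
  position: "right" | "left" | "above" | "below";
  textAlignment: "right" | "left";
  anchor: string;
}

interface Row {
  id: "row";
  position: "right" | "left";
  tiebreaker: number | "last";
  anchor: string;
}

// Hypothetical union type; the discriminant is the `id` field.
type ExtractionConfig = Label | Row;

const LABEL_POSITIONS = ["right", "left", "above", "below"] as const;
const ROW_POSITIONS = ["right", "left"] as const;
const ALIGNMENTS = ["right", "left"] as const;

// Type guard: true when value is one of the allowed string literals.
function isOneOf<T extends string>(value: unknown, allowed: readonly T[]): value is T {
  return typeof value === "string" && (allowed as readonly string[]).indexOf(value) !== -1;
}

// Parses the JSON text of input.json and checks each field against the interfaces.
function parseConfig(json: string): ExtractionConfig {
  const raw: unknown = JSON.parse(json);
  if (typeof raw !== "object" || raw === null) {
    throw new Error("config must be a JSON object");
  }
  const { id, method, position, textAlignment, tiebreaker, anchor } =
    raw as Record<string, unknown>;
  // Assumption: accept `id` (as in the interfaces) or `method` (as in the example file).
  const kind = id !== undefined ? id : method;
  if (typeof anchor !== "string") {
    throw new Error("anchor must be a string");
  }
  if (kind === "label") {
    if (!isOneOf(position, LABEL_POSITIONS)) throw new Error("invalid label position");
    if (!isOneOf(textAlignment, ALIGNMENTS)) throw new Error("invalid textAlignment");
    return { id: "label", position, textAlignment, anchor };
  }
  if (kind === "row") {
    if (!isOneOf(position, ROW_POSITIONS)) throw new Error("invalid row position");
    if (tiebreaker === "last") return { id: "row", position, tiebreaker: "last", anchor };
    if (typeof tiebreaker === "number") return { id: "row", position, tiebreaker, anchor };
    throw new Error('tiebreaker must be a number or "last"');
  }
  throw new Error('extraction method must be "label" or "row"');
}
```

Returning a discriminated union lets calling code branch on `config.id` and have the compiler check that only label fields or only row fields are used in each branch.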