Indonesian-TreeTagger-Docker-Build

Docker file to build the TreeTagger specifically for Indonesian.

License

The contents of this repository are licensed under the Apache License Version 2.0, as stated in the LICENSE file. HOWEVER, by building this Docker file you are implicitly agreeing to the TreeTagger license, because you download and use the TreeTagger code when building and running this Docker container. Part of the TreeTagger license stops you from redistributing the TreeTagger code, so please do not upload your built Docker container to a registry such as Docker Hub.

Build the docker container

The easiest way to do this is to build directly from the GitHub repository (the resulting Docker image is roughly 139MB):

docker build -t indonesian-treetagger:1.0.0 https://github.com/UCREL/Indonesian-TreeTagger-Docker-Build.git#main

Run the docker container

We are assuming you have built the docker container and tagged it as indonesian-treetagger:1.0.0.

Tagging to stdout

echo "Dia adalah abdi negara" | docker run --rm -i indonesian-treetagger:1.0.0

Output should be:

        reading parameters ...
        tagging ...
         finished.
Dia     PRP     dia
adalah  VB      adalah
abdi negara     NN      abdi negara
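
The three progress lines (reading parameters ..., tagging ..., finished.) do not appear in the redirected TSV file shown in the next section, which suggests they are written to stderr rather than stdout. Assuming that holds for your build, the sketch below hides them so that only the tagged lines reach the terminal. The function name tag_quiet is hypothetical and not part of this repository.

```shell
# Hypothetical helper: tag text read from stdin and hide the progress messages.
# Assumes the image was tagged indonesian-treetagger:1.0.0 and that the
# progress messages are written to stderr.
tag_quiet() {
  docker run --rm -i indonesian-treetagger:1.0.0 2>/dev/null
}

# Usage:
#   echo "Dia adalah abdi negara" | tag_quiet
```

Note that discarding stderr also hides genuine error messages, so drop the redirect when debugging.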

Tagging and outputting to a TSV file

echo "Dia adalah abdi negara" | docker run --rm -i indonesian-treetagger:1.0.0 > output_file.tsv

The TSV file should contain the lines below (the column headers are shown here only to explain what each column represents and will not appear in your file):

token	POS	lemma
Dia	PRP	dia
adalah	VB	adalah
abdi negara	NN	abdi negara

POS = Part Of Speech
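
Because the columns are separated by tabs, the file can be processed with standard command line tools. As a rough sketch, the helper below counts how often each POS tag occurs in a TSV file produced as above. The name pos_counts is hypothetical, and the code assumes the three-column layout shown in this section.

```shell
# Hypothetical helper: print each POS tag and its frequency for the TSV
# file given as $1, one "tag<TAB>count" pair per line, sorted by tag.
pos_counts() {
  # Split on tabs only, so tokens that contain spaces stay in column 1.
  awk -F'\t' 'NF >= 2 { count[$2]++ }
              END { for (tag in count) print tag "\t" count[tag] }' "$1" | sort
}

# Usage:
#   pos_counts output_file.tsv
```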

Indonesian Tagger details

The tagger has been built to handle both abbreviations and Multi Word Units (MWUs). A good example of an MWU is abdi negara, which the tagger outputs as a single token.
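
Since an MWU is emitted as one token that contains a space, the output must be split on tabs rather than on general whitespace. A minimal sketch, assuming a TSV file produced as in the previous section, that lists the MWU tokens the tagger found. The name list_mwus is hypothetical.

```shell
# Hypothetical helper: print every token (column 1) that contains a space,
# i.e. every Multi Word Unit, from the TSV file given as $1.
list_mwus() {
  awk -F'\t' '$1 ~ / / { print $1 }' "$1"
}

# Usage:
#   list_mwus output_file.tsv
```

For the example sentence used in this README, this would be expected to print abdi negara.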

The tagger uses the UI POS tagset, which can be found here.

The Indonesian TreeTagger was trained by Prihantoro on the idn-tagged-corpus, using an additional lexicon that was created in two parts:

  1. Data from the idn-tagged-corpus, which was first lemmatised using MorphInd before being added to the lexicon.
  2. Simplex and compound words from Kateglo. Because Kateglo uses a different POS tagset, its tags were mapped to the UI POS tagset.

The abbreviation and MWU lexicon data were created by Prihantoro and can be found within the TreeTagger software.

Acknowledgements

We thank:
