Indonesian-TreeTagger-Docker-Build

A Docker file for building TreeTagger specifically for Indonesian.

License

The contents of this repository are licensed under the Apache License, Version 2.0, as stated in the LICENSE file. HOWEVER, by building this Docker file you implicitly agree to the TreeTagger license, because you download and use the TreeTagger code when building and running the Docker container. The TreeTagger license prohibits redistributing the TreeTagger code, so please do not upload your built Docker container to a registry such as Docker Hub.

Build the docker container

The easiest way to build it is to run the following command (the built Docker container is roughly 139 MB):

docker build -t indonesian-treetagger:1.0.0 https://github.com/UCREL/Indonesian-TreeTagger-Docker-Build.git#main

Run the docker container

We are assuming you have built the docker container and tagged it as indonesian-treetagger:1.0.0.

Tagging to stdout

echo "Dia adalah abdi negara" | docker run --rm -i indonesian-treetagger:1.0.0

Output should be:

        reading parameters ...
        tagging ...
         finished.
Dia     PRP     dia
adalah  VB      adalah
abdi negara     NN      abdi negara

Tagging and outputting to a TSV file

echo "Dia adalah abdi negara" | docker run --rm -i indonesian-treetagger:1.0.0 > output_file.tsv

The TSV file should contain the rows below. (We have added column headers here to explain what each column represents; these headers will not appear in your file.)

token	POS	lemma
Dia	PRP	dia
adalah	VB	adalah
abdi negara	NN	abdi negara

POS = Part Of Speech
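
Once the output is in TSV form it can be processed with standard command-line tools. The following is a minimal sketch, not part of this repository: `count_pos_tags` is a hypothetical helper name, and it assumes every line of interest has exactly three tab-separated columns (token, POS, lemma) as in the sample above. The sample rows are recreated with `printf` so the sketch runs without Docker.

```shell
# Hypothetical helper (not part of this repository): count how often each
# POS tag occurs. Reads tab-separated token/POS/lemma lines on stdin.
count_pos_tags() {
    # Split on tabs only, because a token such as "abdi negara" contains a space.
    awk -F '\t' 'NF == 3 { counts[$2]++ }
                 END { for (tag in counts) print tag "\t" counts[tag] }' | sort
}

# Recreate the three sample rows shown above and count their tags.
printf 'Dia\tPRP\tdia\nadalah\tVB\tadalah\nabdi negara\tNN\tabdi negara\n' | count_pos_tags
```

To apply the same helper to a real run, redirect the file produced earlier: `count_pos_tags < output_file.tsv`.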

Indonesian Tagger details

The tagger has been built to handle both abbreviations and Multi Word Units (MWU). A good example of an MWU is abdi negara.
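
Because an MWU is emitted as a single token that contains a space, any script that reads the tagger output should split columns on tabs rather than on general whitespace. As a rough illustration, the sketch below lists the MWU tokens in tagger output. The helper name `list_mwu_tokens` is hypothetical, and the check "token contains a space" is an assumption based on the abdi negara example, not a documented rule of the tagger.

```shell
# Hypothetical helper (not part of this repository): print tokens that contain
# a space, which is how an MWU such as "abdi negara" appears in the output.
# Assumes tab-separated token/POS/lemma lines on stdin.
list_mwu_tokens() {
    awk -F '\t' 'NF == 3 && $1 ~ / / { print $1 }'
}

# Sample rows copied from the example output, so Docker is not needed here.
printf 'Dia\tPRP\tdia\nadalah\tVB\tadalah\nabdi negara\tNN\tabdi negara\n' | list_mwu_tokens
```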

The tagger uses the UI POS tagset, which can be found here.

The Indonesian TreeTagger was trained by Prihantoro on the idn-tagged-corpus using an additional lexicon that was created in two parts:

Data from the idn-tagged-corpus, which was first lemmatised using MorphInd before being added to the lexicon.
Simplex and compound words from Kateglo. Because Kateglo uses a different POS tagset, its tags were mapped to the UI POS tagset.

The abbreviations and MWU lexicon data was created by Prihantoro and can be found within the TreeTagger software.

Acknowledgements

We thank:

Helmut Schmid for creating and releasing the TreeTagger software.
Prihantoro for creating the Indonesian resources for TreeTagger as well as guiding us through how to use the Indonesian TreeTagger.
