Docker file to build the TreeTagger specifically for Indonesian.
The contents of this repository are licensed under the Apache License Version 2.0, as stated in the LICENSE file. HOWEVER, by building this Docker file you are implicitly agreeing to the TreeTagger license, because the TreeTagger code is downloaded and used when building and running this docker container. Part of the TreeTagger license prohibits re-distributing the TreeTagger code, so please do not upload your built docker container to a registry such as Docker Hub.
The easiest way to build the docker container is to run the following (docker container size roughly 139MB):
docker build -t indonesian-treetagger:1.0.0 https://github.com/UCREL/Indonesian-TreeTagger-Docker-Build.git#main
We are assuming you have built the docker container and tagged it as indonesian-treetagger:1.0.0.
echo "Dia adalah abdi negara" | docker run --rm -i indonesian-treetagger:1.0.0
Output should be:
reading parameters ...
tagging ...
finished.
Dia PRP dia
adalah VB adalah
abdi negara NN abdi negara
echo "Dia adalah abdi negara" | docker run --rm -i indonesian-treetagger:1.0.0 > output_file.tsv
The TSV file should contain the following (note that we have added column headers here to explain what each column represents; these headers should not be in your file):
token POS lemma
Dia PRP dia
adalah VB adalah
abdi negara NN abdi negara
POS = Part Of Speech
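Once the output is in a TSV file, the three columns can be processed with standard command line tools. The following is a minimal sketch, not part of this repository, that counts how often each POS tag occurs. It assumes the columns are separated by a single tab; the function name pos_counts and the inline sample data are hypothetical and only used for illustration.

```shell
#!/usr/bin/env bash
# Assumption: each line has three tab-separated columns: token, POS, lemma.

# Print "count<TAB>POS" for every POS tag, most frequent first.
# Reads the files given as arguments, or standard input if none are given.
pos_counts() {
  awk -F'\t' 'NF == 3 { count[$2]++ }
              END { for (tag in count) print count[tag] "\t" tag }' "$@" \
    | sort -k1,1nr -k2,2
}

# Sample data mirroring the tagger output shown above (hypothetical stand-in
# for output_file.tsv so the sketch runs without Docker).
sample=$(printf 'Dia\tPRP\tdia\nadalah\tVB\tadalah\nabdi negara\tNN\tabdi negara')
printf '%s\n' "$sample" | pos_counts
```

With a real file, the same function could be called as pos_counts output_file.tsv.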
The tagger has been built to handle both abbreviations and Multi Word Units (MWU). A good example of an MWU is abdi negara, which is tagged as a single token.
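To see which MWUs the tagger found in a piece of text, one option is to filter the output for tokens that contain a space. The sketch below is an illustration under two assumptions: the output columns are tab-separated, and an MWU always appears as one token containing a space (as abdi negara does in the example output). The function name list_mwus is hypothetical.

```shell
#!/usr/bin/env bash
# Assumption: tab-separated columns (token, POS, lemma), and an MWU is
# emitted as a single token whose text contains at least one space.

# Print "token<TAB>POS" for every token that contains a space.
# Reads the files given as arguments, or standard input if none are given.
list_mwus() {
  awk -F'\t' 'NF == 3 && $1 ~ / / { print $1 "\t" $2 }' "$@"
}

# Sample data mirroring the tagger output shown earlier, so the sketch
# runs without Docker.
printf 'Dia\tPRP\tdia\nadalah\tVB\tadalah\nabdi negara\tNN\tabdi negara\n' | list_mwus
```

Abbreviations that contain no space would not be matched by this filter.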
The tagger uses the UI POS tagset, which can be found here.
The Indonesian TreeTagger was trained by Prihantoro on the idn-tagged-corpus using an additional lexicon that was created in two parts:
- Data from the idn-tagged-corpus was used to create a lexicon. This corpus data was first lemmatised using MorphInd before being added to the lexicon.
- Simplex and compound words from Kateglo. The POS tagset used in Kateglo differed from the UI POS tagset, so it was mapped to the UI POS tagset.
The abbreviations and MWU lexicon data was created by Prihantoro and can be found within the TreeTagger software.
We thank:
- Helmut Schmid for creating and releasing the TreeTagger software.
- Prihantoro for creating the Indonesian resources for TreeTagger as well as guiding us through how to use the Indonesian TreeTagger.