Thai NLP — WASM

@2021 API Foundry by Virach Labo team BabyAI.ORG

Prof.Dr.Virach Sornlertlamvanich
Musashino University & SIIT Thammasat University
virach@gmail.com

Contributors

Prof.Dr.Virach Sornlertlamvanich @Virach

Dr.Sumeth Yuenyong @Sumeth

Dr.Titipakorn Prakayaphun @Titipakorn

Kitiya Suriyachay @BeeKitiya

Nannam Aksorn @NannamAksorn

A Thai Character Cluster (TCC) is the smallest stand-alone character unit under Thai spelling rules. Treating a Thai character string as a sequence of character clusters reduces the search space of possible word segmentation positions, because a word boundary cannot fall inside a cluster. Since character cluster boundaries can be identified without ambiguity, applying the TCC algorithm does not affect accuracy in higher-level language processing.
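As a rough illustration of the search-space reduction described above, the sketch below counts candidate segmentations at character level and at cluster level. The cluster split is written by hand for illustration and is an assumption, not output of this project's TCC module; `count_segmentations` is a hypothetical helper.

```python
# Illustrative only: the cluster split below is hand-written (an assumption),
# not produced by the TCC implementation described in this document.
text = "ประเทศไทย"
clusters = ["ป", "ระ", "เท", "ศ", "ไท", "ย"]  # assumed TCC split of `text`

# Sanity check: the clusters must reassemble the original string.
assert "".join(clusters) == text


def count_segmentations(num_units: int) -> int:
    """Number of ways to cut a sequence of units into contiguous pieces.

    Each of the (num_units - 1) gaps between units is either a boundary or not.
    """
    return 2 ** max(num_units - 1, 0)


char_level = count_segmentations(len(text))         # gaps between code points
cluster_level = count_segmentations(len(clusters))  # gaps between clusters

print(f"characters: {len(text)}, candidate segmentations: {char_level}")
print(f"clusters:   {len(clusters)}, candidate segmentations: {cluster_level}")
```

Under these assumptions a word segmenter only has to consider the gaps between clusters, so the number of candidate segmentations shrinks exponentially with the number of merged characters.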

Contributors

Prof.Dr.Virach Sornlertlamvanich @Virach

Nannam Aksorn @NannamAksorn

Sornlertlamvanich, V., Charoenporn, T., and Isahara, H. (1997). ORCHID: Thai Part-Of-Speech Tagged Corpus. Tech. Rep. TR-NECTEC-1997-001, National Electronics and Computer Technology Center, Thailand, pp. 5-19.

Sornlertlamvanich, V., and Tanaka, H. (1996a). The Automatic Extraction of Open Compounds from Text Corpora. The 16th International Conference on Computational Linguistics (COLING-96), pp. 1143-1146.

Sornlertlamvanich, V., and Tanaka, H. (1996b). Extracting Open Compounds from Text Corpora. The Second Annual Meetings of the Association for Natural Language Processing, pp. 213-216.

Dataset: BKD20

Attribution-ShareAlike
CC BY-SA

Citation:

Any publication or work based on this corpus should cite the following paper.

Suriyachay, K., Charoenporn, T., and Sornlertlamvanich, V. (2019). Thai Named Entity Tagged Corpus Annotation Scheme and Self Verification. The 9th Language and Technology Conference (LTC2019).

This corpus is designed and constructed following the annotation scheme proposed for the ORCHID corpus, the first open online Thai POS-tagged corpus. It is organized into seven separate entity types: DATe, LOCation, MEAsurement, NAMe, ORGanization, PERson, and TIMe, each abbreviated by its first three characters, plus one further category (Other). The corpus uses the BIO annotation scheme.

BIO annotation scheme:

B - The word at the beginning of an entity

I - A word inside an entity

O - The word does not belong to any type of entity

Category	Format	Description	Example
Date	B-DAT	Beginning of a date	วันที่ (Date)
Date	I-DAT	Inside of a date	14 กุมภาพันธ์ (February 14)
Location	B-LOC	Beginning of a location name	เมือง (City)
Location	I-LOC	Inside of a location name	นิวยอร์ค (New York)
Measurement	B-MEA	Beginning of a measurement name	ห้า (Five)
Measurement	I-MEA	Inside of a measurement name	เล่ม (Books)
Name	B-NAM	Beginning of any proper name except location, person, and organization names	ศึก (League)
Name	I-NAM	Inside of any proper name	ลาลีกา (La Liga)
Organization	B-ORG	Beginning of an organization name	บริษัท (Corp.)
Organization	I-ORG	Inside of an organization's name	โตโยต้า มอเตอร์ (Toyota Motor)
Person	B-PER	Beginning of a person name	นาย (Mister)
Person	I-PER	Inside of a person's name	ณัฐวุฒิ สะกิดใจ (Natthawut Sakidjai)
Time	B-TIM	Beginning of a time	สิบ (Ten)
Time	I-TIM	Inside of a time	นาฬิกา (O'clock)
Other	O	Word does not belong to any type of entity
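To make the tag scheme in the table concrete, here is a minimal sketch of grouping BIO-tagged words back into entity spans. The function name `bio_to_entities` is hypothetical and not part of this project's API; the sample words and tags are taken from the location example in this document.

```python
from typing import List, Optional, Tuple


def bio_to_entities(words: List[str], tags: List[str]) -> List[Tuple[str, List[str]]]:
    """Group words into (entity_type, words) spans according to their BIO tags."""
    entities: List[Tuple[str, List[str]]] = []
    current_type: Optional[str] = None
    current_words: List[str] = []

    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A new entity starts here; close any entity that is still open.
            if current_type is not None:
                entities.append((current_type, current_words))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continue the open entity of the same type.
            current_words.append(word)
        else:
            # "O" (or a stray I- tag): close any open entity.
            if current_type is not None:
                entities.append((current_type, current_words))
            current_type, current_words = None, []

    if current_type is not None:
        entities.append((current_type, current_words))
    return entities


words = ["สหภาพ", "ยุโรป", "<space>", "อาเซียน", "ลดลง"]
tags = ["B-LOC", "I-LOC", "O", "B-LOC", "O"]
entities = bio_to_entities(words, tags)
print(entities)
```

A stray `I-` tag with no matching open entity is treated like `O` here; that is a design choice of this sketch, not a rule stated by the corpus.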

In the corpus, each line generally consists of three tab-separated components: the word, its part of speech (POS), and its entity category tag. Some lines contain only EOS, which marks the end of a sentence.

Example labeled data from the location corpus file (shown here on one line, with the components of each token joined by /):

สหรัฐอเมริกา/NPRP/B-LOC <space>/PUNC/O ญี่ปุ่น/NPRP/B-LOC <space>/PUNC/O สหภาพ/NPRP/B-LOC ยุโรป/NPRP/I-LOC <space>/PUNC/O อาเซียน/NCMN/B-LOC <space>/PUNC/O ลดลง/VSTA/O เฉลี่ย/VACT/O <space>/PUNC/O 28.2/DCNM/O
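The line format described above (word, POS, and tag separated by tabs, with EOS lines ending a sentence) could be read with something like the following sketch. `parse_corpus` is a hypothetical helper, and the sample string is built by hand from the first tokens of the example; the EOS line in it is added for illustration.

```python
from typing import List, Tuple

Token = Tuple[str, str, str]  # (word, POS, entity tag)


def parse_corpus(text: str) -> List[List[Token]]:
    """Parse tab-separated corpus lines into sentences of (word, POS, tag)."""
    sentences: List[List[Token]] = []
    current: List[Token] = []

    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.strip() == "EOS":
            # End of sentence: store the tokens collected so far.
            sentences.append(current)
            current = []
            continue
        # Raises ValueError if a line does not have exactly three fields.
        word, pos, tag = line.split("\t")
        current.append((word, pos, tag))

    if current:
        # Keep a trailing sentence that has no closing EOS line.
        sentences.append(current)
    return sentences


sample = (
    "สหรัฐอเมริกา\tNPRP\tB-LOC\n"
    "<space>\tPUNC\tO\n"
    "ญี่ปุ่น\tNPRP\tB-LOC\n"
    "EOS\n"
)
sentences = parse_corpus(sample)
print(sentences)
```

Since the description says lines are "generally" three components, a real reader would probably need to decide how to handle lines with a different field count; this sketch simply lets the error surface.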

Contributors

Prof.Dr.Virach Sornlertlamvanich @Virach

Kitiya Suriyachay @BeeKitiya

Dr.Titipakorn Prakayaphun @Titipakorn

Suriyachay, K., Charoenporn, T., and Sornlertlamvanich, V. (2019). Thai Named Entity Tagged Corpus Annotation Scheme and Self Verification. The 9th Language and Technology Conference (LTC2019).

WS performs Thai word tokenization using an HMM and the Viterbi algorithm, which computes the probabilities of candidate words together with their POS tags. The input is a string of Thai words; the output is the tokenized words with their POS tags.
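For readers unfamiliar with Viterbi decoding over an HMM, the sketch below shows the general technique on a toy model. It only assigns tags to an already tokenized sequence, whereas WS also selects the tokenization; all probabilities, the placeholder tokens `w1`..`w3`, and the `viterbi` function itself are assumptions made for illustration, not this project's model or API.

```python
import math
from typing import Dict, List


def viterbi(
    obs: List[str],
    states: List[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the most probable state sequence for `obs` (log-space Viterbi)."""
    if not obs:
        return []
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back: List[Dict[str, str]] = [{}]

    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor that maximizes the path score into s.
            prev = max(states, key=lambda p: best[t - 1][p] + math.log(trans_p[p][s]))
            best[t][s] = (
                best[t - 1][prev]
                + math.log(trans_p[prev][s])
                + math.log(emit_p[s][obs[t]])
            )
            back[t][s] = prev

    # Backtrack from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy parameters (invented for this example).
states = ["NCMN", "VACT"]
start_p = {"NCMN": 0.6, "VACT": 0.4}
trans_p = {"NCMN": {"NCMN": 0.3, "VACT": 0.7}, "VACT": {"NCMN": 0.8, "VACT": 0.2}}
emit_p = {
    "NCMN": {"w1": 0.5, "w2": 0.1, "w3": 0.4},
    "VACT": {"w1": 0.1, "w2": 0.8, "w3": 0.1},
}
tags = viterbi(["w1", "w2", "w3"], states, start_p, trans_p, emit_p)
print(tags)
```

Working in log space avoids numeric underflow on long sequences; the toy model assumes every probability is nonzero, so a real implementation would need smoothing or a guard for unseen words.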

Contributors

Prof.Dr.Virach Sornlertlamvanich @Virach

Dr.Sumeth Yuenyong @Sumeth

Dr.Titipakorn Prakayaphun @Titipakorn

Sornlertlamvanich, V., Charoenporn, T., and Isahara, H. (1997). ORCHID: Thai Part-Of-Speech Tagged Corpus. Tech. Rep. TR-NECTEC-1997-001, National Electronics and Computer Technology Center, Thailand, pp. 5-19.

Sornlertlamvanich, V., and Tanaka, H. (1996a). The Automatic Extraction of Open Compounds from Text Corpora. The 16th International Conference on Computational Linguistics (COLING-96), pp. 1143-1146.

Sornlertlamvanich, V., and Tanaka, H. (1996b). Extracting Open Compounds from Text Corpora. The Second Annual Meetings of the Association for Natural Language Processing, pp. 213-216.

SS is our standard tool for dividing a sequence of words into sentences for downstream processing such as sentence classification. The input is the words along with their POS tags. The output is the sentences, separated by |.
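The output convention described above can be pictured with a small sketch. Here the per-word sentence-end flags stand in for whatever the SS model predicts; `join_sentences`, the flags, the placeholder tokens, and the choice to concatenate words without spaces are all assumptions for illustration.

```python
from typing import List


def join_sentences(words: List[str], ends_sentence: List[bool], sep: str = "|") -> str:
    """Concatenate words, inserting `sep` after each word flagged as a sentence end.

    No separator is added after the final word.
    """
    pieces: List[str] = []
    for i, (word, is_end) in enumerate(zip(words, ends_sentence)):
        pieces.append(word)
        if is_end and i < len(words) - 1:
            pieces.append(sep)
    return "".join(pieces)


# Placeholder tokens; the flags are hypothetical model decisions.
words = ["w1", "w2", "w3", "w4"]
ends_sentence = [False, True, False, True]
output = join_sentences(words, ends_sentence)
print(output)

# Splitting on the separator recovers the individual sentences.
print(output.split("|"))
```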

Contributors

Prof.Dr.Virach Sornlertlamvanich @Virach

Dr.Sumeth Yuenyong @Sumeth

Dr.Titipakorn Prakayaphun @Titipakorn

Sornlertlamvanich, V., Charoenporn, T., and Isahara, H. (1997). ORCHID: Thai Part-Of-Speech Tagged Corpus. Tech. Rep. TR-NECTEC-1997-001, National Electronics and Computer Technology Center, Thailand, pp. 5-19.

Sornlertlamvanich, V., and Tanaka, H. (1996a). The Automatic Extraction of Open Compounds from Text Corpora. The 16th International Conference on Computational Linguistics (COLING-96), pp. 1143-1146.

Sornlertlamvanich, V., and Tanaka, H. (1996b). Extracting Open Compounds from Text Corpora. The Second Annual Meetings of the Association for Natural Language Processing, pp. 213-216.
