Prof. Dr. Virach Sornlertlamvanich
Musashino University & SIIT Thammasat University
virach@gmail.com
A Thai character cluster (TCC) is the smallest stand-alone character unit under the Thai spelling rules. Treating a Thai character string as a sequence of character clusters, rather than individual characters, reduces the search space of possible word segmentation positions. Because character cluster boundaries can be identified without ambiguity, applying the TCC algorithm does not affect the accuracy of higher-level language processing.
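The reduction in search space can be illustrated with a small count. The sketch below is not the TCC algorithm itself: the clusters are written by hand for illustration, and the function name `boundary_counts` is hypothetical. It only compares how many positions a word segmenter would have to consider with and without clustering.

```python
def boundary_counts(text, clusters):
    """Return (positions between characters, positions between clusters).

    `clusters` is assumed to be a cluster split of `text` that was
    produced elsewhere; this function does not compute the clusters.
    """
    if "".join(clusters) != text:
        raise ValueError("clusters must concatenate back to the text")
    char_positions = max(len(text) - 1, 0)
    cluster_positions = max(len(clusters) - 1, 0)
    return char_positions, cluster_positions


# Hand-written cluster split, for illustration only.
text = "ประเทศไทย"
clusters = ["ป", "ระ", "เท", "ศ", "ไท", "ย"]

chars, clus = boundary_counts(text, clusters)
print("candidate positions:", chars, "vs", clus)
# Each position is either a word boundary or not, so the number of
# possible segmentations is at most 2 ** positions.
print("possible segmentations:", 2 ** chars, "vs", 2 ** clus)
```

Since a word boundary can never fall inside a cluster, every position that clustering removes is one the segmenter no longer needs to test.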
Sornlertlamvanich, V., Charoenporn, T., and Isahara, H. (1997). ORCHID: Thai Part-Of-Speech Tagged Corpus. Tech. Rep. TR-NECTEC-1997-001, National Electronics and Computer Technology Center, Thailand, pp. 5-19.
Sornlertlamvanich, V., and Tanaka, H. (1996a). The Automatic Extraction of Open Compounds from Text Corpora. The 16th International Conference on Computational Linguistics (COLING-96), pp. 1143-1146.
Sornlertlamvanich, V., and Tanaka, H. (1996b). Extracting Open Compounds from Text Corpora. The Second Annual Meetings of the Association for Natural Language Processing, pp. 213-216.
Dataset: BKD20
Attribution-ShareAlike
CC BY-SA
Citation:
Any publication or work based on this corpus should cite the following paper.
Suriyachay, K., Charoenporn, T., and Sornlertlamvanich, V. (2019). Thai Named Entity Tagged Corpus Annotation Scheme and Self Verification. The 9th Language and Technology Conference (LTC2019).
This corpus is designed and constructed following the annotation scheme proposed for the ORCHID corpus, the first open online Thai POS-tagged corpus. The corpus is managed separately for each of seven entity types: DATe, LOCation, MEAsurement, NAMe, ORGanization, PERson, and TIMe, each abbreviated by its first three characters. One additional category (Other) covers words outside any entity. The corpus uses the BIO annotation scheme.
BIO annotation scheme:
B - The word begins an entity
I - The word is inside an entity
O - The word does not belong to any type of entity
| Category | Format | Description | Example |
|---|---|---|---|
| Date | B-DAT | Beginning of a date | วันที่ (Date) |
| | I-DAT | Inside of a date | 14 กุมภาพันธ์ (February 14) |
| Location | B-LOC | Beginning of a location name | เมือง (City) |
| | I-LOC | Inside of a location name | นิวยอร์ค (New York) |
| Measurement | B-MEA | Beginning of a measurement name | ห้า (Five) |
| | I-MEA | Inside of a measurement name | เล่ม (Books) |
| Name | B-NAM | Beginning of any proper name except location, person, and organization names | ศึก (League) |
| | I-NAM | Inside of any proper name | ลาลีกา (La Liga) |
| Organization | B-ORG | Beginning of an organization name | บริษัท (Corp.) |
| | I-ORG | Inside of an organization's name | โตโยต้า มอเตอร์ (Toyota Motor) |
| Person | B-PER | Beginning of a person name | นาย (Mister) |
| | I-PER | Inside of a person's name | ณัฐวุฒิ สะกิดใจ (Natthawut Sakidjai) |
| Time | B-TIM | Beginning of a time | สิบ (Ten) |
| | I-TIM | Inside of a time | นาฬิกา (O'clock) |
| Other | O | Word does not belong to any type of entity | |
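To show how the B/I/O tags combine into whole entities, here is a minimal sketch that groups a tagged word sequence into entity spans. The function name `bio_to_entities` is hypothetical, and dropping an I- tag that has no matching B- tag before it is an assumption of this sketch, not a rule stated for the corpus.

```python
def bio_to_entities(tagged):
    """Group (word, tag) pairs into (category, [words]) entities."""
    entities = []
    current = None  # the entity being extended, if any
    for word, tag in tagged:
        if tag.startswith("B-"):
            # A B- tag always opens a new entity.
            current = (tag[2:], [word])
            entities.append(current)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # An I- tag extends the open entity of the same category.
            current[1].append(word)
        else:
            # "O", or an I- tag with no matching open entity (ignored here).
            current = None
    return entities


# Words and tags taken from the table above.
sample = [
    ("เมือง", "B-LOC"),
    ("นิวยอร์ค", "I-LOC"),
    ("นาย", "B-PER"),
    ("ณัฐวุฒิ สะกิดใจ", "I-PER"),
]
for category, words in bio_to_entities(sample):
    print(category, words)
```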
In the corpus, each line generally consists of three components separated by tabs. The first component is a word, the second is the part of speech (POS) of that word, and the last is the category, or tag, of that word. A line that contains only EOS marks the end of a sentence.
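A reader for this line format might look like the sketch below. The function name `read_corpus` and the sample lines are assumptions for illustration: the words and tags come from the table above, while the POS values `POS_A` and `POS_B` are placeholders rather than tags from the corpus. Raising an error on lines that do not have three fields is also a choice of this sketch.

```python
def read_corpus(lines):
    """Parse tab-separated (word, POS, tag) lines into a list of sentences.

    A line containing only "EOS" closes the current sentence.
    """
    sentences, current = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.strip() == "EOS":
            sentences.append(current)
            current = []
            continue
        fields = line.split("\t")  # split on tabs only; words may contain spaces
        if len(fields) != 3:
            raise ValueError(f"expected 3 tab-separated fields: {line!r}")
        current.append(tuple(fields))
    if current:
        sentences.append(current)  # keep a trailing sentence without EOS
    return sentences


# Made-up sample; POS_A and POS_B are placeholders, not corpus tags.
sample_lines = [
    "เมือง\tPOS_A\tB-LOC\n",
    "นิวยอร์ค\tPOS_B\tI-LOC\n",
    "EOS\n",
]
for sentence in read_corpus(sample_lines):
    print(sentence)
```

With a real corpus file, the same function can be given an open file object, for example `read_corpus(open(path, encoding="utf-8"))`, where `path` is wherever the file is stored.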
Example labeled data in location corpus file:
WS performs word tokenization for Thai using an HMM model and the Viterbi algorithm, which computes the probabilities of candidate words together with their POS tags. The input of this model is words. The output is the tokenized words with their POS tags.
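As a rough illustration of the decoding step, the following is a generic Viterbi sketch over a toy HMM. It is not the WS implementation: it only finds the most probable tag sequence for an already given word sequence, and the tag names, probabilities, and the floor value for unseen events are all invented for this example.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    if not observations:
        return []
    floor = 1e-12  # assumed stand-in probability for unseen events

    def lp(p):
        return math.log(max(p, floor))

    # best[t][s] = (log-probability of the best path ending in s, previous state)
    first = observations[0]
    best = [{s: (lp(start_p.get(s, 0)) + lp(emit_p[s].get(first, 0)), None)
             for s in states}]
    for obs in observations[1:]:
        row = {}
        for s in states:
            score, prev = max(
                (best[-1][p][0] + lp(trans_p[p].get(s, 0)), p) for p in states
            )
            row[s] = (score + lp(emit_p[s].get(obs, 0)), prev)
        best.append(row)

    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    return path[::-1]


# Toy model with invented numbers and simplified tag names.
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.8, "VERB": 0.2}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"แมว": 0.5, "กิน": 0.1, "ปลา": 0.4},
          "VERB": {"แมว": 0.1, "กิน": 0.8, "ปลา": 0.1}}

words = ["แมว", "กิน", "ปลา"]  # "cat", "eat", "fish"
print(list(zip(words, viterbi(words, states, start_p, trans_p, emit_p))))
```

Working in log-probabilities avoids numeric underflow on long inputs. How WS scores competing word candidates is not described here, so that part is left out of the sketch.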
SS is our standard tool for dividing sequences of words into sentences for downstream processing such as sentence classification. The input of this model is words along with their POS tags. The output is the sentences, separated by "|".