About Us
@2021 API Foundry by Virach Labo team BabyAI.ORG

Prof.Dr.Virach Sornlertlamvanich
Musashino University & SIIT Thammasat University
virach@gmail.com

Contributors
Prof.Dr.Virach Sornlertlamvanich @Virach
Dr.Sumeth Yuenyong @Sumeth
Dr.Titipakorn Prakayaphun @Titipakorn
Kitiya Suriyachay @BeeKitiya
Nannam Aksorn @NannamAksorn

TCC (Thai Character Cluster) is the smallest stand-alone character unit under the Thai spelling rules. Treating a Thai character string as a sequence of character clusters reduces the search space of possible word segmentation positions. Because character cluster boundaries can be identified without ambiguity, applying the TCC algorithm does not affect the accuracy of higher-level language processing.
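As a rough illustration of grouping characters into clusters, the sketch below attaches Thai combining marks (above/below vowels and tone marks) to the preceding character and leading vowels to the following one. It is a deliberately simplified approximation, not the full TCC rule set; the function name `simple_clusters` and the two rules are assumptions made for this example.

```python
import unicodedata

# Vowels written before the consonant they belong to.
LEADING_VOWELS = set("\u0e40\u0e41\u0e42\u0e43\u0e44")


def simple_clusters(text):
    """Split text into approximate character clusters (simplified, not full TCC)."""
    clusters = []
    pending = ""  # leading vowel(s) waiting for the next base character
    for ch in text:
        if unicodedata.category(ch) == "Mn":
            # Combining mark: never starts a cluster if something precedes it.
            if pending:
                pending += ch
            elif clusters:
                clusters[-1] += ch
            else:
                clusters.append(ch)
        elif ch in LEADING_VOWELS:
            pending += ch
        else:
            clusters.append(pending + ch)
            pending = ""
    if pending:
        clusters.append(pending)
    return clusters


if __name__ == "__main__":
    for word in ["วันที่", "เมือง"]:
        print(word, "->", simple_clusters(word))
```

Every character lands in exactly one cluster, so joining the clusters reproduces the input, and a word segmenter only needs to consider cluster boundaries as candidate split points.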

Contributors
Prof.Dr.Virach Sornlertlamvanich @Virach
Nannam Aksorn @NannamAksorn

Sornlertlamvanich, V., Charoenporn, T., and Isahara, H. (1997). ORCHID: Thai Part-Of-Speech Tagged Corpus. Tech. Rep. TR-NECTEC-1997-001, National Electronics and Computer Technology Center, Thailand, pp. 5-19.

Sornlertlamvanich, V., and Tanaka, H. (1996a). The Automatic Extraction of Open Compounds from Text Corpora. The 16th International Conference on Computational Linguistics (COLING-96), pp. 1143-1146.

Sornlertlamvanich, V., and Tanaka, H. (1996b). Extracting Open Compounds from Text Corpora. The Second Annual Meetings of the Association for Natural Language Processing, pp. 213-216.


Dataset: BKD20

License: CC BY-SA (Attribution-ShareAlike)

Citation:

Any publication or work based on this corpus should cite the following paper.

Suriyachay, K., Charoenporn, T., and Sornlertlamvanich, V. (2019). Thai Named Entity Tagged Corpus Annotation Scheme and Self Verification. The 9th Language and Technology Conference (LTC2019).

This corpus is designed and constructed following the annotation scheme proposed for the ORCHID corpus, the first open online Thai POS-tagged corpus. The corpus is managed separately for each of seven entity types: DATe, LOCation, MEAsurement, NAMe, ORGanization, PERson, and TIMe, each abbreviated by its first three characters, plus one additional category, Other (O). The corpus uses the BIO annotation scheme.

BIO annotation scheme:

    B - The word is the beginning of an entity
    I - The word is inside an entity
    O - The word does not belong to any entity type

| Category | Format | Description | Example |
|---|---|---|---|
| Date | B-DAT | Beginning of a date | วันที่ (Date) |
| Date | I-DAT | Inside of a date | 14 กุมภาพันธ์ (February 14) |
| Location | B-LOC | Beginning of a location name | เมือง (City) |
| Location | I-LOC | Inside of a location name | นิวยอร์ค (New York) |
| Measurement | B-MEA | Beginning of a measurement name | ห้า (Five) |
| Measurement | I-MEA | Inside of a measurement name | เล่ม (Books) |
| Name | B-NAM | Beginning of any proper name except location, person, and organization names | ศึก (League) |
| Name | I-NAM | Inside of any proper name | ลาลีกา (La Liga) |
| Organization | B-ORG | Beginning of an organization name | บริษัท (Corp.) |
| Organization | I-ORG | Inside of an organization name | โตโยต้า มอเตอร์ (Toyota Motor) |
| Person | B-PER | Beginning of a person name | นาย (Mister) |
| Person | I-PER | Inside of a person name | ณัฐวุฒิ สะกิดใจ (Natthawut Sakidjai) |
| Time | B-TIM | Beginning of a time | สิบ (Ten) |
| Time | I-TIM | Inside of a time | นาฬิกา (O'clock) |
| Other | O | Word does not belong to any type of entity | |
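To make the tagging scheme concrete, here is a minimal sketch that groups BIO-tagged words back into entity spans. The function name `bio_to_spans` is hypothetical and not part of the corpus tooling, and the handling of a stray I- tag (treated as outside any entity) is an assumption of this example.

```python
def bio_to_spans(tagged):
    """Group (word, tag) pairs into (entity_type, [words]) spans."""
    spans = []
    current_type, current_words = None, []
    for word, tag in tagged:
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B":
            # A B- tag always starts a new entity, closing any open one.
            if current_words:
                spans.append((current_type, current_words))
            current_type, current_words = entity_type, [word]
        elif prefix == "I" and current_type == entity_type:
            # An I- tag continues the open entity of the same type.
            current_words.append(word)
        else:
            # O tag, or an I- tag with no matching open entity (assumed outside).
            if current_words:
                spans.append((current_type, current_words))
            current_type, current_words = None, []
    if current_words:
        spans.append((current_type, current_words))
    return spans


if __name__ == "__main__":
    example = [
        ("บริษัท", "B-ORG"),
        ("โตโยต้า มอเตอร์", "I-ORG"),
        ("<space>", "O"),
        ("นาย", "B-PER"),
        ("ณัฐวุฒิ สะกิดใจ", "I-PER"),
    ]
    for entity_type, words in bio_to_spans(example):
        print(entity_type, words)
```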

Each line of the corpus generally consists of three tab-separated components: the word, its part of speech (POS), and its category tag. Some lines contain only EOS, which marks the end of a sentence.
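A file in this layout could be read along the following lines. This is a sketch under the assumption that every non-EOS line has exactly three tab-separated fields; the function name `read_sentences` and the sample lines are made up for illustration.

```python
def read_sentences(lines):
    """Parse tab-separated word/POS/tag lines into a list of sentences."""
    sentences, current = [], []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        if line.strip() == "EOS":
            # EOS closes the current sentence.
            if current:
                sentences.append(current)
            current = []
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}: {line!r}")
        current.append(tuple(fields))
    if current:
        sentences.append(current)  # file ended without a final EOS
    return sentences


if __name__ == "__main__":
    sample = "สหภาพ\tNPRP\tB-LOC\nยุโรป\tNPRP\tI-LOC\nEOS\nลดลง\tVSTA\tO\nEOS\n"
    for sentence in read_sentences(sample.splitlines()):
        print(sentence)
```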

Example of labeled data from the location corpus file (shown inline, with each token written as word/POS/tag):

สหรัฐอเมริกา/NPRP/B-LOC <space>/PUNC/O ญี่ปุ่น/NPRP/B-LOC <space>/PUNC/O สหภาพ/NPRP/B-LOC ยุโรป/NPRP/I-LOC <space>/PUNC/O อาเซียน/NCMN/B-LOC <space>/PUNC/O ลดลง/VSTA/O เฉลี่ย/VACT/O <space>/PUNC/O 28.2/DCNM/O
Contributors
Prof.Dr.Virach Sornlertlamvanich @Virach
Kitiya Suriyachay @BeeKitiya
Dr.Titipakorn Prakayaphun @Titipakorn



WS performs Thai word tokenization using an HMM model and the Viterbi algorithm, which computes the probabilities of candidate words together with their POS tags. The input is a string of Thai words. The output is the tokenized words with their POS tags.
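The Viterbi decoding step can be sketched with a toy HMM. The tag names NCMN and VACT appear in the corpus example on this page, but the three-word sentence and every probability below are invented for illustration, and the sketch tags words that are already segmented rather than reproducing the WS model itself.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    if not observations:
        return []

    def log_p(p):
        # Floor zero probabilities so the logarithm stays defined.
        return math.log(max(p, 1e-12))

    # Each column maps state -> (best log score ending here, previous state).
    first = observations[0]
    columns = [{
        s: (log_p(start_p.get(s, 0.0)) + log_p(emit_p[s].get(first, 0.0)), None)
        for s in states
    }]
    for obs in observations[1:]:
        prev = columns[-1]
        column = {}
        for s in states:
            score, back = max(
                (prev[r][0] + log_p(trans_p[r].get(s, 0.0)), r) for r in states
            )
            column[s] = (score + log_p(emit_p[s].get(obs, 0.0)), back)
        columns.append(column)

    # Backtrack from the best final state.
    state = max(states, key=lambda s: columns[-1][s][0])
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


# Toy model: all numbers are made up for this example.
states = ["NCMN", "VACT"]
start_p = {"NCMN": 0.8, "VACT": 0.2}
trans_p = {
    "NCMN": {"NCMN": 0.3, "VACT": 0.7},
    "VACT": {"NCMN": 0.8, "VACT": 0.2},
}
emit_p = {
    "NCMN": {"แมว": 0.5, "ปลา": 0.5},
    "VACT": {"กิน": 1.0},
}
words = ["แมว", "กิน", "ปลา"]  # "cat", "eat", "fish"
tags = viterbi(words, states, start_p, trans_p, emit_p)

if __name__ == "__main__":
    print(list(zip(words, tags)))
```

Working in log space avoids numeric underflow on long inputs, which is why the probabilities are summed as logarithms rather than multiplied directly.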

Contributors
Prof.Dr.Virach Sornlertlamvanich @Virach
Dr.Sumeth Yuenyong @Sumeth
Dr.Titipakorn Prakayaphun @Titipakorn

Sornlertlamvanich, V., Charoenporn, T., and Isahara, H. (1997). ORCHID: Thai Part-Of-Speech Tagged Corpus. Tech. Rep. TR-NECTEC-1997-001, National Electronics and Computer Technology Center, Thailand, pp. 5-19.

Sornlertlamvanich, V., and Tanaka, H. (1996a). The Automatic Extraction of Open Compounds from Text Corpora. The 16th International Conference on Computational Linguistics (COLING-96), pp. 1143-1146.

Sornlertlamvanich, V., and Tanaka, H. (1996b). Extracting Open Compounds from Text Corpora. The Second Annual Meetings of the Association for Natural Language Processing, pp. 213-216.


SS is our standard tool for dividing sequences of words into sentences for downstream processing such as sentence classification. The input is words along with their POS tags. The output is the sentences separated by |.
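The input and output shapes described above can be illustrated with a small sketch. It does not implement the SS model: the boundary positions are passed in as if they came from a model, and the function name `format_sentences` and its parameters are hypothetical.

```python
def format_sentences(tagged_words, boundaries):
    """Join words into one string, inserting '|' after each boundary index.

    tagged_words: list of (word, pos) pairs.
    boundaries: indices of words that end a sentence (assumed model output).
    """
    boundary_set = set(boundaries)
    last = len(tagged_words) - 1
    parts = []
    for i, (word, _pos) in enumerate(tagged_words):
        parts.append(word)
        # No trailing separator after the final word.
        if i in boundary_set and i != last:
            parts.append("|")
    return "".join(parts)


if __name__ == "__main__":
    tagged = [("แมว", "NCMN"), ("กิน", "VACT"), ("ปลา", "NCMN"), ("ลดลง", "VSTA")]
    print(format_sentences(tagged, boundaries=[2]))
```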

Contributors
Prof.Dr.Virach Sornlertlamvanich @Virach
Dr.Sumeth Yuenyong @Sumeth
Dr.Titipakorn Prakayaphun @Titipakorn

Sornlertlamvanich, V., Charoenporn, T., and Isahara, H. (1997). ORCHID: Thai Part-Of-Speech Tagged Corpus. Tech. Rep. TR-NECTEC-1997-001, National Electronics and Computer Technology Center, Thailand, pp. 5-19.

Sornlertlamvanich, V., and Tanaka, H. (1996a). The Automatic Extraction of Open Compounds from Text Corpora. The 16th International Conference on Computational Linguistics (COLING-96), pp. 1143-1146.

Sornlertlamvanich, V., and Tanaka, H. (1996b). Extracting Open Compounds from Text Corpora. The Second Annual Meetings of the Association for Natural Language Processing, pp. 213-216.
