Supervised Entity Tagger for Indonesian Labor Strike Tweets using Oversampling Technique and Low Resource Features
Abstract: We propose an entity tagger for Indonesian tweets sent during labor
strike events, built with supervised learning methods. The aim of the tagger is
to extract the date, the location and the person or organization involved in
the strike. We used SMOTE (Synthetic Minority Oversampling Technique) as the
oversampling technique and conducted several experiments on Twitter data to
evaluate settings that vary the machine learning algorithm and the training
data size. To test the low-resource features, we also ran the system without
the word list feature and without word normalization. Our results indicate
that, with low-resource features, good accuracy can be achieved when each type
of machine learning algorithm is given its own treatment. We tried the Naïve
Bayes, C4.5, Random Forest and SMO (Sequential Minimal Optimization)
algorithms, using Weka as the machine learning tool. For Naïve Bayes, which
depends on the class probability distribution of the data, the best accuracy
was achieved by removing duplicate data. For C4.5 and Random Forest, SMOTE gave
higher accuracy than either the original data or the data with duplicates
removed. For SMO, there was no significant difference across training data
sizes.
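As a rough illustration of the oversampling idea named in the abstract, the sketch below creates synthetic minority-class points by interpolating between a minority sample and one of its nearest minority neighbours. It is a minimal pure-Python sketch under stated assumptions, not the implementation used by the authors: the function name `smote`, its parameters, and the numeric feature vectors are all hypothetical, and it assumes purely numeric features.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic points interpolated between minority samples.

    Hypothetical helper for illustration; assumes numeric feature vectors.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of the base sample,
        # excluding the base sample itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Made-up numeric feature vectors standing in for a rare entity class.
minority_vectors = [
    [1.0, 0.0, 2.0],
    [1.5, 0.5, 2.5],
    [0.5, 0.0, 3.0],
    [1.0, 1.0, 2.0],
]
new_vectors = smote(minority_vectors, n_synthetic=6, k=2)
print(len(new_vectors))
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already covered by that class, unlike plain duplication, which only repeats existing rows.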
Keywords: Indonesian Entity Tagger, SMOTE, supervised learning, word level
feature, word window feature, labor strike tweets
Authors: Ayu Purwarianti, Lisa Madlberger, Muhammad Ibrahim
Journal Code: jptkomputergg160151