Streamed Sampling on Dynamic Data as Support for Classification Model
Abstract: Data mining
on dynamically changing data faces several problems, such as unknown data size
and changing class distribution. Random sampling is commonly applied to
extract a general synopsis from a very large database. In this research,
Vitter’s reservoir algorithm is used to retrieve k records from the
database and place them in the sample. The sample is used as input for the classification
task in data mining. The sample is a backing sample, stored as a table
that contains the values of id, priority and timestamp. Priority indicates
how likely a record is to remain in the sample over time. Kullback-Leibler
divergence is applied to measure the similarity between the database and sample
distributions. The results show that it is possible to take random samples
continuously while transactions occur. A Kullback-Leibler divergence
within the interval from 0 to 0.0001 is a very good measure for maintaining a similar class distribution
between database and sample. The sample is always up to date with new
transactions and keeps a similar class distribution. A classifier built from a balanced
class distribution was shown to perform better than one built from an imbalanced one.
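The two techniques named in the abstract can be illustrated with a minimal sketch. It assumes the basic reservoir scheme (Algorithm R) over an in-memory stream of class labels. It does not reproduce the priority and timestamp bookkeeping of the backing sample. The function names, the eps guard, and the direction of the divergence (database relative to sample) are assumptions made for illustration, not details taken from the research.

```python
import math
import random
from collections import Counter


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for n, record in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Record n replaces a random slot with probability k / n.
            j = rng.randrange(n)
            if j < k:
                sample[j] = record
    return sample


def class_distribution(labels, classes):
    """Relative frequency of each class, in the order given by `classes`."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[c] / total for c in classes]


def kl_divergence(p, q, eps=1e-12):
    """D(P || Q) in nats; eps (an assumption) guards against zeros in q."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    # Hypothetical database with a 70/30 class split.
    database = ["a"] * 7000 + ["b"] * 3000
    sample = reservoir_sample(database, k=1000, rng=random.Random(0))
    p = class_distribution(database, ["a", "b"])
    q = class_distribution(sample, ["a", "b"])
    print(kl_divergence(p, q))
```

A divergence near zero means the class proportions in the sample closely match those in the database. A check such as comparing the value against the 0.0001 bound mentioned above could then decide whether the sample is acceptable.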
Keywords: random sample,
relative entropy, skewness, Kullback-Leibler divergence, dynamic classification
Author: Astried Silvanie,
Taufik Djatna, Heru Sukoco
Journal Code: jptkomputergg130119