Binarization of Ancient Document Images based on Multipeak Histogram Assumption
Abstract: In document
binarization, text is segmented from the background. This is an important step,
since the binarization outcome determines the success rate of the optical
character recognition (OCR). In ancient documents, which are commonly noisy,
binarization becomes more difficult. The noise can reduce binarization
performance, and thus the OCR rate. This paper proposes a new binarization
approach based on the assumption that the histograms of noisy documents contain
multiple peaks. The proposed method comprises three steps: histogram
calculation, histogram smoothing, and tracking the first valley of the
histogram to determine the binarization threshold. In our simulations, we
used a set of ancient Jawi document images with natural noise. This set is
composed of 24 document tiles containing two noise types: show-through and
uneven background. To measure performance, we designed and implemented a point
compilation scheme. On average, the proposed method performed better than the
Otsu method, with a total point score of 7.5 compared with 4.5 for the Otsu
method. Our results show that as long as the histogram fulfills
the multipeak assumption, the proposed method can perform satisfactorily.
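The three steps named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the smoothing filter, the window size, or the exact valley-tracking rule, so everything below is an assumption made for illustration: 8-bit grayscale input, a moving-average filter, a scan that starts at the dark end of the histogram, and dark text on a lighter background. The names `first_valley_threshold`, `binarize`, and `window` are hypothetical and are not taken from the paper.

```python
import numpy as np


def first_valley_threshold(gray, window=9):
    """Return a threshold located at the first valley of the smoothed histogram.

    Assumptions (not from the abstract): `gray` is an 8-bit grayscale array,
    smoothing is a moving average of odd length `window`, and the scan runs
    from intensity 0 upward.
    """
    # Step 1: histogram calculation (one bin per 8-bit intensity level).
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)

    # Step 2: histogram smoothing with a moving average (assumed filter).
    kernel = np.ones(window) / window
    smooth = np.convolve(hist, kernel, mode="same")

    # Step 3: climb to the first peak, then descend to the first valley.
    i = 0
    while i < 255 and smooth[i + 1] >= smooth[i]:
        i += 1  # still rising (or flat) toward the first peak
    while i < 255 and smooth[i + 1] < smooth[i]:
        i += 1  # strictly falling toward the first valley
    return i


def binarize(gray, threshold):
    """Return a boolean text mask, assuming text is darker than background."""
    return gray <= threshold
```

If the smoothed histogram never rises and falls within the 8-bit range, the scan ends at 255 and every pixel is labeled as text, which is consistent with the abstract's remark that the method depends on the multipeak assumption holding.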
Authors: Fitri Arnia, Khairul Munadi
Journal Code: jptkomputergg170173