Automatic Generation and Evaluation of a Stop Word List for Chinese Patents
Abstract: Stop word elimination is an important preprocessing step in
information retrieval and text mining, and its accuracy directly affects the
final retrieval and mining results. In information retrieval, eliminating stop
words reduces the storage space of the index; in text mining, it greatly
reduces the dimension of the vector space, which saves storage and speeds up
computation. However, Chinese patents are legal documents that contain
technical information, and general-purpose Chinese stop word lists are not
suitable for them. This paper proposes two methods for generating a stop word
list for Chinese patents: one based on word frequency and the other based on
statistics. In experiments on real patent data, the accuracy of the two methods
is compared on several corpora of different sizes and against a general stop
word list. The results indicate that both methods can extract stop words
suitable for Chinese patents, and that the statistics-based method is slightly
more accurate than the frequency-based one.
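
The abstract does not spell out either method, so the following is only a minimal sketch of what a frequency-based approach could look like, not the authors' procedure. It assumes the patents have already been segmented into tokens; the function names, the `top_k` and `min_doc_ratio` parameters, the precision measure, and the toy data are all hypothetical choices made for illustration.

```python
from collections import Counter


def frequency_stop_words(docs, top_k=3, min_doc_ratio=0.8):
    """Return candidate stop words from pre-segmented documents.

    docs: list of token lists, one list per patent (assumed input format).
    A word is a candidate if it appears in at least `min_doc_ratio` of the
    documents; candidates are ranked by total frequency and the top `top_k`
    are returned. Both thresholds are illustrative assumptions.
    """
    term_freq = Counter()  # total occurrences across the corpus
    doc_freq = Counter()   # number of documents containing the word
    for tokens in docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    n_docs = len(docs)
    ranked = [
        word
        for word, _ in term_freq.most_common()
        if doc_freq[word] / n_docs >= min_doc_ratio
    ]
    return ranked[:top_k]


def precision_against(reference, extracted):
    """Share of extracted words that also appear in a reference stop list."""
    if not extracted:
        return 0.0
    hits = sum(1 for word in extracted if word in reference)
    return hits / len(extracted)


# Toy, hand-made token lists standing in for segmented patent text.
docs = [
    ["所述", "的", "装置", "包括", "所述", "电机"],
    ["所述", "的", "方法", "包括", "步骤"],
    ["所述", "的", "系统", "包括", "传感器", "的"],
]

extracted = frequency_stop_words(docs, top_k=3)
# Hypothetical reference list used only to show the comparison step.
reference = {"所述", "的"}
print(extracted, precision_against(reference, extracted))
```

Filtering on document frequency before ranking keeps a word that is repeated many times in a single patent from being mistaken for a corpus-wide stop word; a statistics-based variant would replace the ranking criterion with a different score.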
Author: Deng Na, Chen Xu
Journal Code: jptkomputergg150179