Text categorization is one of the well-studied problems in data mining and information retrieval: given a large data set of documents, each associated with its corresponding category, the task is to learn to assign new documents to the correct category. This research proposes a novel approach to classifying English and Persian documents that combines a competitive neural text categorizer with a new document representation, which we call string vectors. Traditional approaches to text categorization require encoding documents into numerical vectors, which leads to two main problems: huge dimensionality and sparse distribution. Although many feature selection methods have been developed to address the first problem, the reduced dimension still remains large, and if a feature selection method reduces the dimension excessively, the robustness of document categorization is degraded. To solve both problems, this research encodes documents into string vectors and feeds them to the novel competitive neural text categorizer. Extensive experiments were conducted on several benchmarks. The results indicate that this method can significantly improve document classification performance, by up to 13.8% compared with the best traditional algorithm on the standard Reuters-21578 dataset.
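To make the contrast between the two encodings concrete, the following is a minimal illustrative sketch and not the implementation evaluated in this research. It assumes, purely for illustration, that a string vector is a fixed-length list of a document's most frequent words; the helper names `tokenize`, `numerical_vector`, and `string_vector`, the toy corpus, and the frequency-based ranking are all hypothetical choices that the abstract does not specify.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def numerical_vector(tokens, vocabulary):
    """Traditional encoding: one count per vocabulary word (mostly zeros)."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]


def string_vector(tokens, d):
    """Assumed encoding: the d most frequent words, ties broken alphabetically.

    Documents with fewer than d distinct words are padded with empty strings.
    """
    counts = Counter(tokens)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return (ranked[:d] + [""] * d)[:d]


# Toy corpus, invented for illustration only.
corpus = [
    "oil prices rise as crude oil supply falls",
    "wheat and corn exports rise",
    "bank raises interest rates as bank lending slows",
]
docs = [tokenize(t) for t in corpus]
vocabulary = sorted({w for doc in docs for w in doc})

dense = numerical_vector(docs[0], vocabulary)
zero_fraction = dense.count(0) / len(dense)
sv = string_vector(docs[0], d=3)

print(len(dense), round(zero_fraction, 2))  # vector length grows with the vocabulary
print(sv)                                   # fixed length d, no zero entries
```

Even on this three-document corpus, more than half of the entries of the numerical vector are zero and its length grows with the vocabulary, whereas the string vector keeps a fixed length `d` chosen in advance.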
Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.