
Investigating the Relevance of Arabic Text Classification Datasets Based on Supervised Learning (Cited by: 1)

Abstract: Training and testing models for text classification depends mainly on pre-classified text document datasets. Recently, seven datasets have emerged for Arabic text classification: Single-Label Arabic News Articles Dataset (SANAD), Khaleej, Arabiya, Akhbarona, KALIMAT, Waten2004, and Khaleej2004. This study investigates which of these datasets can provide significant training and fair evaluation for text classification (TC). The investigation uses well-known and accurate learning models: naive Bayes (NB), random forest (RF), K-nearest neighbor (KNN), support vector machines (SVM), and logistic regression (LR). We present relevance and time measures for training the models on these datasets, giving Arabic language researchers a solid basis of comparison for selecting an appropriate dataset. The performance of the five learning models across the seven datasets is measured and compared with the performance of the same models trained on a well-known English language dataset. The analysis of the relevance and time scores shows that training the SVM model on Khaleej and Arabiya obtained the most significant results in the shortest amount of time, with an accuracy of 82%.
Source: Journal of Electronic Science and Technology (电子科技学刊, English edition), CAS, CSCD, 2022, No. 2, pp. 187-208 (22 pages)
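
The abstract describes scoring each model on each dataset by a relevance measure and by training time. The authors' implementation is not given here, so the following is only a minimal sketch of that kind of measurement, assuming accuracy as the relevance measure and wall-clock fitting time as the time measure. The `NaiveBayes` class, the `evaluate` helper, and the toy English documents are hypothetical stand-ins for illustration, not the models or datasets used in the study.

```python
import math
import time
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over whitespace tokens."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, texts):
        return [self._predict_one(text) for text in texts]

    def _predict_one(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in text.split():
                if word in self.vocab:  # words unseen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def evaluate(model, train, test):
    """Return (accuracy, training_seconds) for one model on one split."""
    train_texts, train_labels = train
    test_texts, test_labels = test
    start = time.perf_counter()
    model.fit(train_texts, train_labels)
    seconds = time.perf_counter() - start
    predictions = model.predict(test_texts)
    correct = sum(p == y for p, y in zip(predictions, test_labels))
    return correct / len(test_labels), seconds


# Toy pre-classified documents (made up for this sketch).
train = (
    ["team wins match", "player scores goal",
     "bank raises rates", "stock market falls"],
    ["sports", "sports", "finance", "finance"],
)
test = (["goal in the match", "market rates"], ["sports", "finance"])

accuracy, seconds = evaluate(NaiveBayes(), train, test)
print(f"accuracy={accuracy:.2f} training_time={seconds:.6f}s")
```

To compare several models across several datasets in the same spirit, `evaluate` would be called once per (model, dataset) pair and the resulting accuracy and time pairs tabulated side by side.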
