ROMANIAN TOPIC MODELING – AN EVALUATION OF PROBABILISTIC VERSUS TRANSFORMER-BASED TOPIC MODELING FOR DOMAIN CATEGORIZATION
DOI: https://doi.org/10.59277/RRST-EE.2023.3.8

Keywords: Topic modeling, Latent Dirichlet allocation (LDA), Bidirectional encoder representations from transformers topic (BERTopic), Clustering, Classification, Romanian documents

Abstract
A primary challenge for digital library systems digitizing millions of volumes is to automatically analyze and group the huge document collection into categories while identifying patterns and extracting the main themes. Topic modeling is a common method for doing this on unlabeled texts. Because researchers use a wide range of datasets and evaluation criteria, comparing the performance and outputs of existing unsupervised algorithms is difficult. This paper introduces a domain-based evaluation of topic modeling applied to Romanian documents. Several variants of Latent Dirichlet Allocation (LDA), combined with dimensionality reduction techniques, were compared to Transformer-based topic models. Experiments were conducted on two datasets of different text lengths: abstracts of novels and full-text documents. Models were evaluated using coherence and silhouette scores and validated on classification and clustering tasks. The results highlight meaningful topics extracted from both datasets.
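For illustration, the sketch below outlines the two kinds of pipeline the abstract compares, assuming the gensim and bertopic Python packages. The docs list is a hypothetical placeholder for the Romanian corpora (novel abstracts or full-text documents); BERTopic's UMAP/HDBSCAN stages need a corpus of realistic size to run, and its multilingual default embedding model stands in here for whatever Romanian-specific embeddings the paper actually uses.

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel
    from gensim.models.coherencemodel import CoherenceModel
    from bertopic import BERTopic

    # Placeholder corpus: in practice, the Romanian datasets would be loaded here,
    # and a few hundred documents are needed for BERTopic's clustering to work.
    docs = ["rezumatul unui roman istoric ...", "un document juridic complet ..."]
    tokenized = [doc.lower().split() for doc in docs]

    # Probabilistic pipeline: LDA over a bag-of-words representation.
    dictionary = Dictionary(tokenized)
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=10)

    # c_v topic coherence, one of the evaluation scores named in the abstract.
    coherence = CoherenceModel(model=lda, texts=tokenized,
                               dictionary=dictionary,
                               coherence="c_v").get_coherence()
    print(f"LDA c_v coherence: {coherence:.3f}")

    # Transformer-based pipeline: BERTopic with its multilingual default;
    # a Romanian sentence-embedding model could be passed via embedding_model.
    topic_model = BERTopic(language="multilingual")
    topics, probs = topic_model.fit_transform(docs)
    print(topic_model.get_topic_info())

Both pipelines end with a list of topics per document, which is what makes the downstream comparison on classification and clustering tasks possible.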
License
Copyright (c) 2023 REVUE ROUMAINE DES SCIENCES TECHNIQUES — SÉRIE ÉLECTROTECHNIQUE ET ÉNERGÉTIQUE
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.