
Authors

  • Melania Nițu, University Politehnica of Bucharest, Faculty of Automatic Control and Computers, Splaiul Independentei 313, 060042, Bucharest, Romania

DOI:

https://doi.org/10.59277/RRST-EE.2023.3.8

Keywords:

Topic modeling, Latent Dirichlet Allocation (LDA), Bidirectional Encoder Representations from Transformers for Topic modeling (BERTopic), Clustering, Classification, Romanian documents

Abstract

A primary challenge for digital library systems when digitizing millions of volumes is to automatically analyze and group the huge document collection into categories while identifying patterns and extracting the main themes. Topic modeling is a common method for analyzing such unlabeled texts. Given the wide range of datasets and evaluation criteria used by researchers, comparing the performance and outputs of existing unsupervised algorithms is a complex task. This paper introduces a domain-based topic modeling evaluation applied to Romanian documents. Several variants of Latent Dirichlet Allocation (LDA) combined with dimensionality reduction techniques were compared to Transformer-based models for topic modeling. Experiments were conducted on two datasets of varying text lengths: abstracts of novels and full-text documents. Evaluations were performed against coherence and silhouette scores, while validation considered classification and clustering tasks. Results highlighted meaningful topics extracted from both datasets.

References

(1) R. Dobrescu, D. Merezeanu, From information to knowledge transmission of meaning, Rev. Roum. Sci. Techn., 62, 1, pp. 115–118 (2017).

(2) R.-I. Mogoş, C.-N. Bodea, Recommender systems for engineering education, Rev. Roum. Sci. Techn., 64, 4, pp. 435–442 (2019).

(3) A. Rajaraman, J.D. Ullman, Data Mining: Mining of Massive Datasets, pp. 1–17 (2011).

(4) S.T. Dumais, Latent semantic analysis, Annual Review of Information Science and Technology, 38, 1, pp. 188–230 (2004).

(5) T. Hofmann, Unsupervised learning by probabilistic latent semantic analysis, Machine Learning, 42, 1, pp. 177–196 (2001).

(6) D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research, 3, 4-5, pp. 993–1022 (2003).

(7) L. Hong, B.D. Davison, Empirical study of topic modeling in Twitter, ACM, pp. 80–88 (2010).

(8) N. Gillis, The why and how of nonnegative matrix factorization, arXiv:1401.5226 (2014).

(9) M. Grootendorst, BERTopic: neural topic modeling with a class-based TF-IDF procedure, arXiv:2203.05794 (2022).

(10) L. McInnes, J. Healy, Accelerated hierarchical density based clustering, ICDMW, pp. 33–42 (2017).

(11) D. Angelov, Top2Vec: distributed representations of topics, arXiv:2008.09470 (2020).

(12) S. Palani, P. Rajagopal, S. Pancholi, T-BERT - model for sentiment analysis of micro-blogs integrating topic model and BERT, arXiv:2106.01097 (2021).

(13) C.M. Bishop, Pattern recognition and machine learning, New York, Springer (2006).

(14) L. van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research, 9, pp. 2579–2605 (2008).

(15) L. McInnes, J. Healy, UMAP: uniform manifold approximation and projection for dimension reduction, arXiv:1802.03426 (2018).

(16) L.-M. Neagu, T.-M. Cotet, M. Dascalu, S. Trausan-Matu, E. Chisu, E. Simion, Semantic recommendations and topic modeling based on the chronology of Romanian literary life, SETE, pp. 164–174 (2019).

(17) F. Lind, J.-M. Eberl, S. Galyga, T. Heidenreich, H.G. Boomgaarden, B.H. Jimenez, R. Berganza, A bridge over the language gap: topic modelling for text analyses across languages for country comparative research, REMINDER project (2019).

(18) M.D. Hoffman, D.M. Blei, F. Bach, Online learning for latent Dirichlet allocation, NIPS, pp. 856–864 (2010).

(19) N. Reimers, I. Gurevych, Sentence-BERT: sentence embeddings using siamese BERT-networks, EMNLP-IJCNLP, pp. 3982–3992 (2019).

(20) M. Masala, S. Ruseti, M. Dascalu, RoBERT-a Romanian BERT model, COLING, pp. 6626–6637 (2020).

(21) F. Murtagh, Multilayer perceptrons for classification and regression, Neurocomputing, 2, pp. 183–197 (1991).

(22) T. Chen, C. Guestrin, XGBoost: a scalable tree boosting system, KDD, pp. 785–794 (2016).

(23) J.A. Hartigan, M.A. Wong, A k-means clustering algorithm, Journal of the Royal Statistical Society, Series C (Applied Statistics), 28, 1, pp. 100–108 (1979).

(24) R.L. Thorndike, Who belongs in the family?, Psychometrika, 18, pp. 267–276 (1953).

(25) G. Bouma, Normalized (pointwise) mutual information in collocation extraction (2009).

(26) M. Röder, A. Both, A. Hinneburg, Exploring the space of topic coherence measures, WSDM, pp. 399–408 (2015).

(27) J. Chang, S. Gerrish, W. Chong, J.L. Boyd-Graber, D.M. Blei, Reading tea leaves: how humans interpret topic models, NeurIPS, pp. 288–296 (2009).

(28) M. Bastian, S. Heymann, M. Jacomy, Gephi: An open source software for exploring and manipulating networks, AAAI ICWSM, pp. 361–362 (2009).


Published

2023-10-12

Section

Electronics & Information Technology

How to Cite

Nițu, M., et al. (2023). Revue Roumaine des Sciences Techniques — Série Électrotechnique et Énergétique, 68(3), 295–300. https://doi.org/10.59277/RRST-EE.2023.3.8