Enabling the Latent Semantic Analysis of Large-Scale Information Retrieval Datasets by Means of Out-of-Core Heterogeneous Systems
Abstract:
Latent Semantic Analysis (LSA) has been widely and successfully applied in many Natural Language Processing (NLP) applications, usually on fairly small or medium-sized datasets and without real time constraints. However, LSA is both time- and space-consuming, which complicates its integration into real-time NLP applications (such as information retrieval or question answering) operating on large-scale datasets. For this reason, an implementation of LSA that both enables and accelerates as much as possible its execution on large-scale datasets would be most useful in these data-intensive, real-time NLP scenarios. However, to the best of our knowledge, such an implementation of LSA has not been achieved so far. Towards this end, a new, out-of-core, scalable, heterogeneous LSA (hLSA) system has been built and run on the clinical decision support large-scale dataset from the Text REtrieval Conference (TREC) 2015 competition. Results show that the out-of-core hLSA system can process this large-scale dataset (that is, 631,302 documents), with a full-rank term-document matrix of 566 GB, fairly fast and, moreover, with better precision (at least for one of the topics) than the TREC 2015 competing systems.
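For readers unfamiliar with the underlying technique, the sketch below shows conventional, in-core LSA: a truncated SVD of a TF-IDF term-document matrix, followed by cosine-similarity ranking in the latent space, using scikit-learn. The toy corpus, query, and parameter values are purely illustrative assumptions; this is not the paper's out-of-core hLSA implementation, which is designed precisely for matrices (such as the 566 GB one reported here) that do not fit in memory.

```python
# Minimal, in-core LSA sketch with scikit-learn (illustrative only; the paper's
# hLSA system replaces this step with an out-of-core, multi-CPU/GPU pipeline).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the TREC 2015 clinical documents.
docs = [
    "acute myocardial infarction treated with thrombolysis",
    "patient presents with chest pain and shortness of breath",
    "seasonal influenza vaccination guidelines for adults",
]

# Build the sparse TF-IDF document-term matrix.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Project documents into a low-dimensional latent semantic space via truncated SVD.
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(X)

# Rank documents against a query by cosine similarity in the latent space.
query_vector = lsa.transform(vectorizer.transform(["myocardial infarction chest pain"]))
scores = cosine_similarity(query_vector, doc_vectors)[0]
print(sorted(enumerate(scores), key=lambda s: s[1], reverse=True))
```

At the scale reported in the abstract, both the matrix construction and the SVD step above would exceed main memory, which is the gap the out-of-core hLSA system addresses.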
Year of publication:
2020
Keywords:
- Latent Semantic Analysis
- Heterogeneous system
- Multi-CPU
- Parallel computing
- GPU
- Information retrieval
- Question answering
- Distributed system
Source:
Document type:
Conference Object
Status:
Restricted access
Knowledge areas:
- Data mining
- Computer science
Subject areas:
- Library and archive operations
- Special computer methods