Topic Identification from Spanish Unstructured Health Texts

Abstract:

Topic models extract topics from a collection of documents and allow the documents to be classified by topic. In this work, the Latent Dirichlet Allocation (LDA) model was applied to extract topics from documents containing medical information. The corpus consisted of 220 digital documents written in Spanish, which contain information about different health conditions. Pre-processing comprised tokenization, stop-word removal, and lemmatization, and defined the medical terms used to represent the documents. The documents were then represented as a document-term matrix. An important step was the use of a medical glossary, built from terminology extracted from the Internet, to assign weights to the terms. Applying LDA produced two new matrices: a document-topic matrix and a topic-term matrix. In total, 25 topics were identified; they can be visualized with heat maps, word clouds, and an interactive tool called PyLDAvis. The application was developed in Python using libraries such as Spacy, Scikit-learn, Tmtoolkit, and PyLDAvis, among others.