Topic Identification from Spanish Unstructured Health Texts


Abstract:

Topic models make it possible to extract topics from documents and to classify the documents by topic. In this work, the Latent Dirichlet Allocation (LDA) model was applied to extract topics from documents containing medical information. The corpus consisted of 220 digital documents written in Spanish, covering different health conditions. Preprocessing consisted of tokenization, stop-word removal, and lemmatization, in order to define the medical terms that represent the documents. The documents were then represented as a document-term matrix. An important step was the use of a medical glossary, built from terminology extracted from the Internet, to assign weights to the terms. Applying LDA yielded two new matrices: a document-topic matrix and a topic-term matrix. In total, 25 topics were identified; they can be visualized with heat maps, word clouds, and an interactive tool called pyLDAvis. The application was developed in Python using libraries such as spaCy, scikit-learn, tmtoolkit, and pyLDAvis, among others.
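The representation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three toy documents, the stop-word list, the glossary, and the weight of 2 are all hypothetical, and lemmatization is omitted so that the example needs only the Python standard library. It tokenizes each document, removes stop words, builds a document-term matrix, and boosts the counts of terms found in the glossary.

```python
import re
from collections import Counter

# Hypothetical toy data: the 220 documents and the medical glossary
# used in the paper are not reproduced here.
DOCS = [
    "La diabetes es una enfermedad que afecta la glucosa en la sangre.",
    "La hipertensión es una enfermedad de la presión de la sangre.",
    "El asma afecta los pulmones y la respiración.",
]
STOP_WORDS = {"la", "el", "los", "las", "es", "una", "un", "que", "en", "de", "y"}
GLOSSARY = {"diabetes", "glucosa", "hipertensión", "presión", "asma", "pulmones", "sangre"}
GLOSSARY_WEIGHT = 2  # assumed integer boost, so the matrix stays count-like


def tokenize(text):
    """Lowercase the text and split it into alphabetic Spanish tokens."""
    return re.findall(r"[a-záéíóúüñ]+", text.lower())


def preprocess(text):
    """Tokenize and drop stop words (lemmatization is omitted in this sketch)."""
    return [token for token in tokenize(text) if token not in STOP_WORDS]


def build_weighted_dtm(docs):
    """Return (vocabulary, matrix), with one row per document and one column per term."""
    counts = [Counter(preprocess(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [
        [c[term] * (GLOSSARY_WEIGHT if term in GLOSSARY else 1) for term in vocab]
        for c in counts
    ]
    return vocab, matrix


vocab, dtm = build_weighted_dtm(DOCS)
for row in dtm:
    print(row)
```

A matrix of this shape could then be passed to an LDA implementation, for example scikit-learn's `LatentDirichletAllocation`, whose `fit_transform` returns the document-topic matrix and whose `components_` attribute holds the topic-term matrix.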

Year of publication:

2021

Keywords:

  • LDA
  • Topic model
  • Medical text

Source:

Scopus

Document type:

Conference Object

Status:

Restricted access

Knowledge areas:

    Dewey subject areas:

    • Medicine and health
    • Human physiology
    • Diseases
    Processed with AI

    Sustainable Development Goals:

    • SDG 3: Good health and well-being
    • SDG 8: Decent work and economic growth
    • SDG 9: Industry, innovation and infrastructure
    Processed with AI