Using Reddit data for multi-label text classification of Twitter users' interests


Abstract:

Automatically inferring users' interest groups is a challenging task in social network research, with applications in marketing and recommendation systems. Manually labeling documents is difficult and expensive, yet it is essential for training an automatic text classifier. Currently, several approaches treat the problem as a multi-label prediction task. In this work, a methodology is proposed to categorize data automatically using Reddit and Twitter data. First, a dataset of 42,100 posts from the popular forum site Reddit is collected to train a model on labeled data. Then, a dataset of tweets from 1,573 profiles, averaging 100 tweets per user, is collected so that the trained model can predict each user's topics of interest. Finally, the data were categorized automatically with an average precision of 75.62%.
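
The workflow described in the abstract (train on labeled Reddit posts, then predict on a user's tweets) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which model was used, so the sketch swaps in a simple nearest-centroid scorer over TF-IDF vectors. The posts, tweets, label names, the 0.2 threshold, and all function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def fit_idf(token_lists):
    """Smoothed inverse document frequency computed over the training posts."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf(tokens, idf):
    """L2-normalised sparse TF-IDF vector; terms unseen in training are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(posts):
    """posts: list of (text, label) pairs. Returns the IDF table and one summed vector per label."""
    token_lists = [tokenize(text) for text, _ in posts]
    idf = fit_idf(token_lists)
    centroids = defaultdict(lambda: defaultdict(float))
    for tokens, (_, label) in zip(token_lists, posts):
        for term, weight in tfidf(tokens, idf).items():
            centroids[label][term] += weight
    return idf, {label: dict(vec) for label, vec in centroids.items()}


def predict_interests(tweets, idf, centroids, threshold=0.2):
    """Treat all tweets of one user as a single document and keep every label
    whose similarity reaches the threshold, so a user can receive several labels."""
    user_vec = tfidf(tokenize(" ".join(tweets)), idf)
    scores = {label: cosine(user_vec, c) for label, c in centroids.items()}
    return sorted(label for label, s in scores.items() if s >= threshold), scores


# Hypothetical labeled posts standing in for the Reddit training set.
posts = [
    ("The team won the football match last night", "sports"),
    ("Great goal in the football league final", "sports"),
    ("New GPU benchmarks for the latest graphics card", "technology"),
    ("Python library release adds faster GPU support", "technology"),
    ("Homemade pasta recipe with fresh tomato sauce", "cooking"),
    ("Easy tomato soup recipe", "cooking"),
]
# Hypothetical tweets from a single user.
tweets = ["Watching the football final tonight", "Just upgraded my GPU for Python work"]

idf, centroids = train(posts)
labels, scores = predict_interests(tweets, idf, centroids)
print(labels)  # ['sports', 'technology']
```

Thresholding each label's score independently, rather than keeping only the best-scoring label, is what makes the prediction multi-label: the example user is assigned both "sports" and "technology". A real system would tune the threshold on held-out data.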

Publication year:

2019

Keywords:

  • TF-IDF
  • Classification
  • LDA
  • Text
  • Word2Vec
  • Twitter
  • Reddit

Source:

Scopus
Google

Document type:

Conference Object

Status:

Restricted access

Knowledge areas:

  • Machine learning
  • Computer science

Dewey subject areas:

    Processed with AI

Sustainable Development Goals:

  • SDG 9: Industry, innovation and infrastructure
  • SDG 17: Partnerships for the goals
  • SDG 8: Decent work and economic growth

    Processed with AI