Using Reddit data for multi-label text classification of Twitter users interests
Abstract:
Automatically inferring users' interest groups is a challenging task in social network research, with applications in marketing and recommendation systems. Manually labeling documents is difficult and expensive, but it is essential for training an automatic text classifier. Currently, several approaches treat the problem as a multi-label prediction task. This work proposes a methodology for automatically categorizing data using Reddit and Twitter data. First, a dataset of 42,100 posts from the popular forum site Reddit is collected to train a model on labeled data. Then, a dataset of tweets from 1573 profiles, averaging 100 tweets per user, is collected, and the trained model is used to predict each user's topics of interest. Finally, the data were automatically categorized with an average precision of 75.62%.
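The train-on-Reddit, predict-on-Twitter flow summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not state which classifier was used, so the sketch assumes TF-IDF features with one centroid per label and a cosine-similarity threshold for multi-label assignment. The toy posts, the labels, the 0.2 threshold, and all function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase whitespace tokenization; keep alphabetic tokens only.
    return [t for t in text.lower().split() if t.isalpha()]


def fit_idf(docs):
    # Smoothed inverse document frequency over the training corpus.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def normalize(vec):
    # Scale a sparse vector to unit L2 length (empty if it has no weight).
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def tfidf(text, idf):
    # Unit-length TF-IDF vector; terms unseen in training are ignored.
    counts = Counter(t for t in tokenize(text) if t in idf)
    return normalize({t: c * idf[t] for t, c in counts.items()})


def dot(a, b):
    # Dot product of sparse vectors; equals cosine similarity for unit vectors.
    return sum(v * b.get(t, 0.0) for t, v in a.items())


def train_centroids(posts, idf):
    # One unit-length centroid per label, built from labeled posts.
    sums = {}
    for text, label in posts:
        acc = sums.setdefault(label, Counter())
        for t, v in tfidf(text, idf).items():
            acc[t] += v
    return {label: normalize(acc) for label, acc in sums.items()}


def predict_interests(tweets, idf, centroids, threshold=0.2):
    # Assumption: a user's tweets are concatenated into one profile document,
    # and every label whose similarity reaches the threshold is assigned.
    profile = tfidf(" ".join(tweets), idf)
    return sorted(label for label, c in centroids.items() if dot(profile, c) >= threshold)


def precision(predicted, actual):
    # Fraction of predicted labels that are correct for one user.
    if not predicted:
        return 0.0
    return len(set(predicted) & set(actual)) / len(predicted)


# Hypothetical labeled posts standing in for the Reddit training set.
reddit_posts = [
    ("the team won the football match last night", "sports"),
    ("football players train before every match", "sports"),
    ("new python library for machine learning released", "technology"),
    ("python code runs faster on the new laptop", "technology"),
]
idf = fit_idf([text for text, _ in reddit_posts])
centroids = train_centroids(reddit_posts, idf)

# Hypothetical tweets standing in for one Twitter profile.
user_tweets = ["watching the football match tonight", "writing python code all day"]
predicted = predict_interests(user_tweets, idf, centroids)
print(predicted, precision(predicted, ["sports"]))
```

Averaging the per-user precision over all profiles gives a figure comparable in kind to the average precision reported in the abstract, although the exact metric definition used in the paper is not given in this record.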
Year of publication:
2019
Keywords:
- TF-IDF
- classification
- LDA
- TEXT
- Word2Vec
Source:
scopus
google
Document type:
Conference Object
Status:
Restricted access
Knowledge areas:
- Machine learning
- Computer science
Dewey subject areas:
Sustainable Development Goals:
- SDG 9: Industry, innovation and infrastructure
- SDG 17: Partnerships for the goals
- SDG 8: Decent work and economic growth