A Methodological Framework for Creating Large-Scale Corpus for Natural Language Processing Models
Abstract:
Currently, there is a boom in introducing Machine Learning models to various aspects of everyday life. A relevant field consists of Natural Language Processing (NLP) that seeks to model human language. A key and basic component for these models to learn properly consists of the data. This article proposes a methodological framework for constructing a large-scale corpus to feed NLP models. The development of this framework emerges from the problem of finding inputs in languages other than English to feed NLP models. With an approach focused on producing a high-quality resource, the construction phases were designed along with the considerations that must be taken. The stages implemented consist of the corpus characterization to be obtained, collecting documents, cleaning, translation, storage, and evaluation. The proposed approach implemented automatic translators to take advantage of the vast amount of English literature and implemented through non-cost libraries. Finally, a case study was developed, resulting in a corpus in Spanish with more than 170,000 documents within a specific domain, i.e., opinions on textile products. Through the evaluations carried out, it is established that the proposed framework can build a large-scale and high-quality corpus.
Año de publicación:
2021
Keywords:
- Corpus construction
- Large-scale corpus
- Corpus in Spanish
- Supplies for NLP
- Methodological framework
Fuente:

Tipo de documento:
Conference Object
Estado:
Acceso restringido
Áreas de conocimiento:
- Inteligencia artificial
- Ciencias de la computación
Áreas temáticas:
- Lingüística
- Lengua
- Ciencias sociales