A Representation Method for Cellular Lines based on SVM and Text Mining

Abstract:

One important problem in bioinformatics is the discovery of new interactions between cellular lines and chemical compounds. In silico methods for cell-line screening are fundamental to reducing cost and time in the drug discovery process. To build these methods, we need to represent cell lines computationally. Current methods for modeling cell-line interactions rely on comparing gene expression profiles. However, these profiles are usually unknown. In this work, we present a method to characterize and represent cell lines by text processing the related scientific literature. We collect abstracts of scientific papers about cellular lines from Cellosaurus and PubMed. These documents are then represented as TF-IDF vectors. We build a classification data set from the document vectors, with the cell line identifier as the target class. We then apply a multiclass SVM classification method. We use Support Vector Domain Description to describe and characterize each cell line by its corresponding hyperplane, obtained with one-vs-rest training. We evaluated several classifier configurations, using micro-averaged precision as the metric to choose the best classifier, and were able to differentiate cellular lines from a set of more than 200.
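The two ends of the pipeline described above, TF-IDF document vectors and micro-averaged precision, can be illustrated with a minimal standard-library sketch. It is not the authors' implementation. The toy abstracts, the whitespace tokenizer, the unsmoothed `log(N / df)` weighting, and the function names `tfidf_vectors` and `micro_precision` are all assumptions made for illustration, and the SVM training step is omitted.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Assumption: plain idf = log(N / df); libraries often use smoothed variants.
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        # Term frequency is the count normalized by document length.
        vectors.append(
            [(counts[tok] / total) * idf[tok] if total else 0.0 for tok in vocab]
        )
    return vocab, vectors


def micro_precision(y_true, y_pred):
    """Micro-averaged precision: pooled TP / (pooled TP + pooled FP)."""
    tp, fp = Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[pred] += 1  # correct prediction counts as TP for that class
        else:
            fp[pred] += 1  # wrong prediction counts as FP for the predicted class
    pooled_tp = sum(tp.values())
    pooled_fp = sum(fp.values())
    denom = pooled_tp + pooled_fp
    return pooled_tp / denom if denom else 0.0


# Hypothetical toy abstracts, labeled with the cell line they discuss.
docs = [
    "hela cervical cancer cells",
    "mcf-7 breast cancer cells",
    "hela cells proliferation",
]
labels = ["HeLa", "MCF-7", "HeLa"]

vocab, vectors = tfidf_vectors(docs)

# Hypothetical predictions from some classifier, used only to show the metric.
score = micro_precision(
    ["HeLa", "MCF-7", "HeLa", "MCF-7"],
    ["HeLa", "MCF-7", "MCF-7", "MCF-7"],
)
```

In this sketch a term that occurs in every document, such as "cells", gets weight zero, so only terms that discriminate between cell lines contribute to the vectors. When each document has exactly one predicted label, micro-averaged precision coincides with accuracy, because every prediction is either a true positive or a false positive for its predicted class.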