ProtDCal: A program to compute general-purpose-numerical descriptors for sequences and 3D-structures of proteins
Abstract:
Background: The exponential growth of protein structural and sequence databases is enabling multifaceted approaches to understanding the long-sought sequence-structure-function relationship. Advances in computation now make it possible to apply well-established data mining and pattern recognition techniques to these data to learn models that effectively relate structure and function. However, extracting meaningful numerical descriptors of protein sequence and structure is a key issue that requires an efficient and widely available solution.
Results: Here we introduce ProtDCal, a new computational software suite capable of generating tens of thousands of features, considering both sequence-based and 3D-structural descriptors. We demonstrate, by means of principal component analysis and Shannon entropy tests, how ProtDCal's sequence-based descriptors provide new and more relevant information not encoded by currently available servers for sequence-based protein feature generation. The wide diversity of the 3D-structure-based features generated by ProtDCal is shown to provide additional complementary information and effectively completes its general protein encoding capability. As a demonstration of the utility of ProtDCal's features, prediction models of N-linked glycosylation sites are trained and evaluated. Classification performance compares favourably with that of contemporary predictors of N-linked glycosylation sites, in spite of not using domain-specific features as input information.
Conclusions: ProtDCal provides a friendly, cross-platform graphical user interface developed in the Java programming language, and is freely available at: http://bioinf.sce.carleton.ca/ProtDCal/. ProtDCal introduces local and group-based encoding, which enhances the diversity of the information captured by the computed features. Furthermore, we have shown that adding structure-based descriptors contributes non-redundant additional information to the feature-based characterization of polypeptide systems. This software is intended to provide a useful tool for general-purpose encoding of protein sequences and structures for applications in protein classification, similarity analyses and function prediction.
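The abstract mentions Shannon entropy tests as one way of judging how much information a descriptor carries. The following is a minimal sketch of that general idea, not ProtDCal's own procedure: the equal-width binning, the number of bins, and the function name `shannon_entropy` are assumptions made for illustration only.

```python
import math
from collections import Counter


def shannon_entropy(values, n_bins=10):
    """Shannon entropy (in bits) of one descriptor after equal-width binning.

    Hypothetical helper: the binning scheme is an assumption, since the
    abstract does not state how the descriptor values were discretized.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant descriptor carries no information
    width = (hi - lo) / n_bins
    # Map each value to a bin index; the maximum value falls in the last bin.
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
    counts = Counter(bins)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Under these assumptions, a descriptor whose values spread evenly over the bins scores close to log2(n_bins), while a descriptor that is nearly constant across the protein set scores close to zero.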
Year of publication:
2015
Keywords:
- Protein descriptors
- Protein function modelling
- PROTDCAL
- Protein feature generation
- Data Mining
Source:


Document type:
Article
Status:
Open access
Knowledge areas:
- Protein
- Computer science
Subject areas:
- Computer science
- Biochemistry
- Human physiology