Evaluation of the influence of activity cliffs on QSAR modeling
Date
2015-05-20
Authors
Velázquez Libera, José Luís
Publisher
Universidad Central “Marta Abreu” de Las Villas. Facultad de Matemática, Física y Computación. Departamento de Computación
Abstract
QSAR modeling is a cheminformatics tool whose use has spread to many areas of society. The main assumption behind QSAR approaches is the continuity of the space of Structure-Activity Relationships (SAR), which can be disrupted by the presence of activity cliffs. Recent studies have shown that activity cliffs negatively affect the predictive ability of QSAR models. However, no studies have been reported that evaluate whether removing them from data sets prior to modeling is beneficial, detrimental, or without significant effect.
The goal of this study was to evaluate the effect of removing activity cliffs on the predictive ability of QSAR models based on machine learning algorithms. For this purpose, a procedure was designed and implemented to identify the activity cliffs and remove the most influential ones from the data sets. Nine machine learning algorithms were used to model the five selected data sets. The performance of the QSAR models obtained from the data sets "without activity cliffs" was evaluated against that of the models obtained from the original data sets.
The evaluation showed that removing the activity cliffs did not lead to statistically significant changes in the continuity of the SAR. However, statistically significant improvements were observed in the modelability of the training sets, specifically those processed with the algorithm that aggregates the similarity matrices by geometric mean. Removing the activity cliffs also produced statistically significant improvements in the training and validation of the models, but not in the classification of the external validation subsets, where in general there were no statistically significant changes. Nevertheless, classification improved for the class that was worst classified by the models obtained from the original training subsets. This last result was statistically significant for the activity cliff removal algorithm that does not fuse similarity matrices, which indicates a tendency toward more balanced classification.
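As an illustration of the kind of procedure the abstract describes, the following is a minimal sketch and not the author's implementation. It assumes that an activity cliff in a classification data set is a pair of compounds whose fused similarity is at or above a threshold while their class labels differ, and that the "most influential" compounds are those taking part in the most cliff pairs. The function names, the 0.8 threshold, and the toy matrices are all hypothetical.

```python
import math

def fuse_geometric_mean(matrices):
    """Element-wise geometric mean of equally sized similarity matrices."""
    n = len(matrices[0])
    k = len(matrices)
    return [[math.prod(m[i][j] for m in matrices) ** (1.0 / k)
             for j in range(n)] for i in range(n)]

def find_activity_cliffs(similarity, labels, threshold=0.8):
    """Pairs (i, j), i < j, that are similar but belong to different classes.

    The threshold value is an assumption made for this sketch.
    """
    n = len(labels)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if similarity[i][j] >= threshold and labels[i] != labels[j]]

def most_influential(cliffs, top=1):
    """Rank compounds by the number of cliff pairs they take part in."""
    counts = {}
    for i, j in cliffs:
        counts[i] = counts.get(i, 0) + 1
        counts[j] = counts.get(j, 0) + 1
    ranked = sorted(counts, key=lambda c: (-counts[c], c))
    return ranked[:top]

# Toy example: two hypothetical similarity matrices for four compounds.
sim_a = [[1.0, 0.9, 0.8, 0.2],
         [0.9, 1.0, 0.95, 0.1],
         [0.8, 0.95, 1.0, 0.3],
         [0.2, 0.1, 0.3, 1.0]]
sim_b = [[1.0, 0.9, 0.9, 0.3],
         [0.9, 1.0, 0.9, 0.2],
         [0.9, 0.9, 1.0, 0.4],
         [0.3, 0.2, 0.4, 1.0]]
labels = [1, 0, 0, 1]  # hypothetical active (1) / inactive (0) classes

fused = fuse_geometric_mean([sim_a, sim_b])
cliffs = find_activity_cliffs(fused, labels)
to_remove = most_influential(cliffs, top=1)
cleaned = [i for i in range(len(labels)) if i not in to_remove]
```

In this toy case compound 0 is similar to compounds 1 and 2 but has the opposite class, so it takes part in both cliff pairs and is the one dropped. Skipping the fusion step and passing a single similarity matrix to `find_activity_cliffs` would correspond to the variant without matrix fusion.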
Keywords
QSAR, Cheminformatics, Activity Cliffs