Novel proximity models for the virtual screening of chemoinformatic datasets
Date
2011-06-26
Authors
Hernández Díaz, Yoandy
Publisher
Universidad Central “Marta Abreu” de Las Villas
Abstract
Similarity searching is an important feature of modern chemical information management systems, giving access to the rich information held in today's very large chemical repositories. In essence, given a molecular representation, a similarity measure, and a search algorithm, the technique returns a list of dataset molecules ranked in decreasing order of similarity to a query molecule specified by the user. Researchers have consequently taken an interest in how effective representations and similarity measures are in these tasks. Their studies, however, have focused predominantly on binary representations and the corresponding similarity measures, and little work has been done on other types of numerical description. Machine Learning techniques have also been applied to feature selection, though not in a manner consistent with the neighborhood principle. These precedents, together with the need for new methods suited to each chemical context, motivate this work. It comprises the computational implementation, in the Java environment, of 21 proximity models: 9 are novel in Chemoinformatics, originate in Psychology, and are based on the concept of relational agreement; the other twelve are measures already established in the specialized literature. The new similarity measures were then compared and validated in “early retrieval” using nine pharmacological datasets of international interest from Medicinal Chemistry, represented by numerical descriptors selected by Machine Learning, and an efficient search algorithm. The results show that, on average, the new models outperform the reference ones and that more than half of them rank among the ten most powerful models.
Similarity searching is an important capability of modern chemical information management systems, providing access to the rich information contained in today's enormous chemical repositories. Basically, given a molecular representation, a similarity measure, and a matching algorithm, the technique returns a list of dataset molecules ranked in decreasing order of similarity to a query or reference molecule specified by the user. As a consequence, researchers have taken an interest in the performance of molecular representations and similarity measures in these tasks. However, their studies have focused predominantly on binary representations and the corresponding similarity measures, and little work has been done on other types of numerical description. Machine Learning techniques have also been applied to descriptor selection, though not consistently with the neighborhood principle. These precedents, together with the need for new methods suitable for each chemical context, constitute the motivation for this work. It comprises the computational implementation, in the Java environment, of 21 proximity models: 12 are novel in Chemoinformatics, come from the field of Psychology, and are based on the concept of relational agreement; the other nine are measures already established in the specialized literature. The new similarity measures were then compared and validated in “early retrieval” using nine pharmacological datasets from Medicinal Chemistry, represented by real-valued descriptors selected by Machine Learning, and an efficient matching algorithm. The results show that, on average, the new models outperform the reference ones, and more than half of them are among the top-10 models.
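As an illustration of the search procedure the abstract describes (representation, similarity measure, ranked output), the following is a minimal sketch in Python. It is not the thesis's Java implementation and does not reproduce any of its 21 proximity models: cosine similarity is used only as a familiar stand-in for a measure on real-valued descriptors, and all names (`rank_by_similarity`, `actives_in_top_k`, the toy molecules and their descriptor values) are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Set, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two real-valued descriptor vectors.

    Stand-in measure for illustration only; not one of the thesis's models.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention assumed here for all-zero vectors
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Vector,
    dataset: Dict[str, Vector],
    similarity: Callable[[Vector, Vector], float] = cosine_similarity,
) -> List[Tuple[str, float]]:
    """Return (molecule id, score) pairs in decreasing order of similarity."""
    scored = [(mol_id, similarity(query, desc)) for mol_id, desc in dataset.items()]
    # Ties are broken by identifier so the ranking is deterministic.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


def actives_in_top_k(ranking: List[Tuple[str, float]], actives: Set[str], k: int) -> int:
    """Count known actives among the first k ranked molecules (early retrieval)."""
    return sum(1 for mol_id, _ in ranking[:k] if mol_id in actives)


# Hypothetical toy dataset: three made-up descriptors per molecule.
toy_dataset = {
    "mol_a": [1.0, 0.0, 2.0],
    "mol_b": [0.9, 0.1, 2.1],
    "mol_c": [0.0, 3.0, 0.0],
    "mol_d": [1.0, 1.0, 1.0],
}
query = [1.0, 0.0, 2.0]

ranking = rank_by_similarity(query, toy_dataset)
for mol_id, score in ranking:
    print(f"{mol_id}\t{score:.4f}")
```

Because the similarity function is passed as a parameter, a different proximity model can be swapped in without changing the ranking code, which is the kind of comparison across measures that the abstract reports.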
Keywords
Information Management Systems, Chemical Repositories, Search Algorithm, Molecular Similarity, Virtual Screening, Chemoinformatic Data, Machine Learning, Computing in Chemistry