Generación Automática de Conjuntos de Entrenamientos para Weka
Fecha
2013-07-04
Autores
Rodríguez Morales, Adrián
Título de la revista
ISSN de la revista
Título del volumen
Editor
Universidad Central “Marta Abreu” de Las Villas
Resumen
Las técnicas de aprendizaje automático tienen gran aplicación en los algoritmos de clasificación, los cuales infieren en la frontera de decisión a partir de un conjunto de instancias de entrenamiento, siendo el núcleo de usos fascinantes. La diversidad de dominios—medicina, industria o educación—proporciona sin duda problemas dispares en los que involucra a tipo de atributos, volumen de instancias y distribución de datos. Todas estas características han llevado a la implementación de diferentes estrategias para abordar cada problema de la manera más adecuada, ya que el rendimiento del sistema de aprendizaje depende en parte del diseño de su algoritmo. Se han logrado progresos considerables refinando dichos algoritmos, tanto, que el desarrollo de técnicas ha alcanzado su nivel de madurez ofreciendo miles de métodos, todos ellos ciertamente competitivos y capaces de ajustar modelos precisos a partir de muestras del problema a resolver. No obstante, y a pesar del avance en la clasificación de datos, quedan aún cuestiones pendientes, sin ir más lejos, cómo las características intrínsecas de los datos afectan a los sistemas de aprendizaje. Esto, juntamente con el poco margen de mejora y la incertidumbre en la habilidad de las técnicas para capturar completamente el conocimiento que encierran los datos, induce a mirar otros elementos que forman parte del proceso de aprendizaje. Es entonces cuando los datos acaparan el protagonismo. Esta tesis se adentra en el estudio de la complejidad de los datos y su papel en la definición del comportamiento de las técnicas de aprendizaje supervisado, y explora la generación artificial de conjuntos de datos mediante estimadores de complejidad.
Machine learning techniques have a wide range of practical applications, and algorithms for supervised classification, which infer a decision boundary from a set of training instances, are at the core of fascinating uses. The diversity of domains—medicine, industry, or learning—provides extremely disparate data sets regarding properties such as the type of attributes, volume of instances, and data distribution. All of these characteristics have led to the implementation of different strategies to tackle each problem properly, since learner performance depends partly on the algorithm design. Tremendous progress has been made in refining such algorithms. Actually, the development of techniques has reached an advanced state of maturity offering thousands of methods, all of them very competitive, and providing accurate models from data which are generalized from a sample of the problem at hand. However, despite the progress in data classification, questions such as how the intrinsic characteristics of the data sets affect learners remain unanswered. This, coupled with the little leeway for improvement and the uncertainty of the ability of techniques to fully capture the underlying knowledge of data, in duce’s us to look toward other elements involved in the learning process. At this point, data steal the limelight from learners. This thesis takes a close view of data complexity and its role shaping the behavior of machine learning techniques in supervised learning and explores the generation of synthetic data sets through complexity estimates.
Machine learning techniques have a wide range of practical applications, and algorithms for supervised classification, which infer a decision boundary from a set of training instances, are at the core of fascinating uses. The diversity of domains—medicine, industry, or learning—provides extremely disparate data sets regarding properties such as the type of attributes, volume of instances, and data distribution. All of these characteristics have led to the implementation of different strategies to tackle each problem properly, since learner performance depends partly on the algorithm design. Tremendous progress has been made in refining such algorithms. Actually, the development of techniques has reached an advanced state of maturity offering thousands of methods, all of them very competitive, and providing accurate models from data which are generalized from a sample of the problem at hand. However, despite the progress in data classification, questions such as how the intrinsic characteristics of the data sets affect learners remain unanswered. This, coupled with the little leeway for improvement and the uncertainty of the ability of techniques to fully capture the underlying knowledge of data, in duce’s us to look toward other elements involved in the learning process. At this point, data steal the limelight from learners. This thesis takes a close view of data complexity and its role shaping the behavior of machine learning techniques in supervised learning and explores the generation of synthetic data sets through complexity estimates.
Descripción
Palabras clave
Conjuntos de Entrenamientos, Aprendizaje Supervisado, Complejidad de los Datos, Estimadores de Complejidad