Topic segmentation of scientific-technical texts using a window of lower paragraphs to measure lexical cohesion
Date
2008-07-08
Authors
Hernández Rojas, Laritza
Publisher
Universidad Central “Marta Abreu” de Las Villas
Abstract
This research was carried out in the Data Mining department of CENATAV, which is responsible for processing and extracting information from digital documents at that institution. Its purpose was to develop a method for automatically segmenting texts by topic over collections of scientific-technical documents, achieving considerable lexical cohesion in the resulting segments and avoiding their unnecessary interruption, with effectiveness similar or superior to that of existing methods. To this end, the theoretical framework of the research was first developed by critically studying and analyzing the state of the art in topic segmentation methods. A new topic segmentation method, named TextLec, was then designed to be better suited than earlier proposals. Finally, the proposed method was validated on text corpora representative of the universe under study and compared with some of the methods found. The work is justified by its theoretical value, scientific novelty, practical and social relevance, and methodological usefulness. It rests on the use of lexical cohesion as a signal of topic change; the Vector Space Model as the representation of text units; the cosine measure to determine the similarity between two text units; Skorochod'ko's computational theory of the linear structure of discourse; and a window of lower paragraphs (those below each paragraph), used to locate the farthest paragraph cohesive with each paragraph and to avoid interrupting topics. The work concludes that the proposed objective was met.
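The building blocks named in the abstract (Vector Space Model, cosine measure, window of lower paragraphs) can be illustrated with a minimal sketch. This is not the TextLec algorithm itself, whose details are not given in this record. The function names, the raw term-frequency weighting, the window size and the cohesion threshold are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def term_vector(paragraph):
    """Bag-of-words term-frequency vector (a deliberately simple Vector Space Model)."""
    return Counter(re.findall(r"\w+", paragraph.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def farthest_cohesive(paragraphs, window=3, threshold=0.2):
    """For each paragraph i, return the index of the farthest paragraph among the
    next `window` paragraphs whose cosine similarity with i reaches `threshold`.
    Returns i itself when no paragraph in the window is cohesive with it.
    Both `window` and `threshold` are hypothetical parameters."""
    vectors = [term_vector(p) for p in paragraphs]
    result = []
    for i, vec in enumerate(vectors):
        farthest = i
        last = min(i + window, len(vectors) - 1)
        for j in range(i + 1, last + 1):
            if cosine(vec, vectors[j]) >= threshold:
                farthest = j  # keep the most distant cohesive paragraph seen so far
        result.append(farthest)
    return result


paragraphs = [
    "lexical cohesion signals topic change",
    "vector space model represents text units",
    "lexical cohesion links distant paragraphs",
    "cosine similarity compares text units",
]
links = farthest_cohesive(paragraphs, window=2, threshold=0.3)
print(links)
```

In this toy input, paragraph 0 shares no terms with paragraph 1 but is cohesive with paragraph 2. A boundary placed between paragraphs 0 and 1 would therefore interrupt a topic that continues further down, which is the situation the lower-paragraph window is meant to detect.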
Keywords
Topic Segmentation, Scientific-Technical Texts, Window of Lower Paragraphs, Measurement, Lexical Cohesion, TextLec Method