Representación de textos y su reducción de dimensionalidad
Fecha
2005-06-28
Autores
Valdés Vera, Libernys
Título de la revista
ISSN de la revista
Título del volumen
Editor
Universidad Central “Marta Abreu” de Las Villas
Resumen
En este trabajo se realizó el diseñó del sistema CorpusMiner que permite el procesamiento de corpus textuales, desde su representación inicial hasta lograr obtener su resumen extracto. Este procesamiento requiere transitar por diferentes etapas. En esta investigación nos focalizamos en la representación del corpus y en la reducción de dimensionalidad como parte de la representación y como parte de otras etapas del procesamiento. La representación parte de un corpus textual en idioma Inglés y se transforma realizando lematización, homogeneización ortográfica y expandiendo las contracciones y abreviaturas. Luego, se genera la representación vector space model VSM de la forma término-documento o término-sentencia. Se permite eliminar o no las palabras gramaticales. La representación VSM puede ser pesada utilizando TF-IDF y normalizada utilizando la suma total de frecuencias por documentos. Calculamos la calidad de los términos utilizando las medidas de entropía, skewness, kurtossis, calidad de término I y II. Se puede reducir la dimensionalidad de la matriz VSM utilizando las medidas de calidad de términos. La reducción de dimensionalidad no sólo se aplica a la etapa de representación textual. En CorpusMiner hemos incorporado técnicas de selección de rasgos que permiten obtener las palabras claves que caracterizan a los grupos de documentos obtenidos como parte de un procesamiento intermedio que permitirá la futura extracción de las oraciones relevantes que conformarán el extracto. Las formas de identificar las palabras claves son: ID3, relevancia calculada en el agrupamiento, y las medidas de calidad de términos por grupos.
This work deals with the design of the CorpusMiner system which allows to process text corpora to obtain an extractive abstract of the contents of related texts. The process goes through several stages. In the current research Project we concentrated on corpus representation and dimensionality reduction both as part of representing the original corpus and also during further processing. The initial processing for corpus representation comprises transformation processes such as lemmatization, spelling homogenization and expansion of abbreviations and contractions. Then a vector space modeling VMS is applied in which the corpus is represented either as a term-document or term-sentence matrix. The values in the matrix may be weighted using TF-IDF or normalized (word frequency divided by word total for each document). Term quality can be calculated by using measures such as entropy, skewness, kurtosis, quality I and quality II. These measures of term quality can be used for further reducing dimensionality. CorpusMiner incorporates feature selection techniques which allow to enumerate the keywords that are characteristic of the document clusters that will be abstracted in the form of the cluster’s most representative sentences. Keyword identification may be achieved by means of ID3, calculated cluster’s relevance and cluster’s term quality.
This work deals with the design of the CorpusMiner system which allows to process text corpora to obtain an extractive abstract of the contents of related texts. The process goes through several stages. In the current research Project we concentrated on corpus representation and dimensionality reduction both as part of representing the original corpus and also during further processing. The initial processing for corpus representation comprises transformation processes such as lemmatization, spelling homogenization and expansion of abbreviations and contractions. Then a vector space modeling VMS is applied in which the corpus is represented either as a term-document or term-sentence matrix. The values in the matrix may be weighted using TF-IDF or normalized (word frequency divided by word total for each document). Term quality can be calculated by using measures such as entropy, skewness, kurtosis, quality I and quality II. These measures of term quality can be used for further reducing dimensionality. CorpusMiner incorporates feature selection techniques which allow to enumerate the keywords that are characteristic of the document clusters that will be abstracted in the form of the cluster’s most representative sentences. Keyword identification may be achieved by means of ID3, calculated cluster’s relevance and cluster’s term quality.
Descripción
Palabras clave
Representación de Textos, Reducción de Dimensionaldad, CorpusMiner, Minería de Textos