Comparing string similarity measures for reducing inconsistency in integrating data from different sources


Abstract:

The Web has dramatically increased the need for efficient and flexible mechanisms to provide integrated views over multiple heterogeneous information sources. When multiple sources need to be integrated, each source may represent data differently. A common problem is the possible inconsistency of the data: the very same term may have different values, due to misspelling, a permuted word order, spelling variants and so on. In this paper, we present an improvement from our previous work for reducing inconsistency found in existing databases. The objective of our method is integration and standardization of different values that refer to the same term. All the values that refer to a same term are clustered by measuring their degree of similarity. The clustered values can be assigned to a common value that could be substituted for the original values. The paper describes and compares five different similarity measures for clustering and evaluates their performance on real-world data. The method we present may work well in practice but it is time-consuming.

Año de publicación:

2001

Keywords:

    Fuente:

    scopusscopus

    Tipo de documento:

    Conference Object

    Estado:

    Acceso restringido

    Áreas de conocimiento:

    • Análisis de datos
    • Ciencias de la computación

    Áreas temáticas:

    • Funcionamiento de bibliotecas y archivos