Gopal, Kreshna (2007-08). Efficient case-based reasoning through feature weighting, and its application in protein crystallography. Doctoral Dissertation. Thesis uri icon

abstract

  • Data preprocessing is critical for machine learning, data mining, and pattern recognition. In particular, selecting relevant and non-redundant features in highdimensional data is important to efficiently construct models that accurately describe the data. In this work, I present SLIDER, an algorithm that weights features to reflect relevance in determining similarity between instances. Accurate weighting of features improves the similarity measure, which is useful in learning algorithms like nearest neighbor and case-based reasoning. SLIDER performs a greedy search for optimum weights in an exponentially large space of weight vectors. Exhaustive search being intractable, the algorithm reduces the search space by focusing on pivotal weights at which representative instances are equidistant to truly similar and different instances in Euclidean space. SLIDER then evaluates those weights heuristically, based on effectiveness in properly ranking pre-determined matches of a set of cases, relative to mismatches. I analytically show that by choosing feature weights that minimize the mean rank of matches relative to mismatches, the separation between the distributions of Euclidean distances for matches and mismatches is increased. This leads to a better distance metric, and consequently increases the probability of retrieving true matches from a database. I also discuss how SLIDER is used to improve the efficiency and effectiveness of case retrieval in a case-based reasoning system that automatically interprets electron density maps to determine the three-dimensional structures of proteins. Electron density patterns for regions in a protein are represented by numerical features, which are used in a distance metric to efficiently retrieve matching patterns by searching a large database. These pre-selected cases are then evaluated by more expensive methods to identify truly good matches - this strategy speeds up the retrieval of matching density regions, thereby enabling fast and accurate protein model-building. This two-phase case retrieval approach is potentially useful in many case-based reasoning systems, especially those with computationally expensive case matching and large case libraries.
  • Data preprocessing is critical for machine learning, data mining, and pattern
    recognition. In particular, selecting relevant and non-redundant features in highdimensional
    data is important to efficiently construct models that accurately describe the
    data. In this work, I present SLIDER, an algorithm that weights features to reflect
    relevance in determining similarity between instances. Accurate weighting of features
    improves the similarity measure, which is useful in learning algorithms like nearest
    neighbor and case-based reasoning. SLIDER performs a greedy search for optimum
    weights in an exponentially large space of weight vectors. Exhaustive search being
    intractable, the algorithm reduces the search space by focusing on pivotal weights at
    which representative instances are equidistant to truly similar and different instances in
    Euclidean space. SLIDER then evaluates those weights heuristically, based on
    effectiveness in properly ranking pre-determined matches of a set of cases, relative to
    mismatches.
    I analytically show that by choosing feature weights that minimize the mean rank of
    matches relative to mismatches, the separation between the distributions of Euclidean
    distances for matches and mismatches is increased. This leads to a better distance metric,
    and consequently increases the probability of retrieving true matches from a database. I
    also discuss how SLIDER is used to improve the efficiency and effectiveness of case
    retrieval in a case-based reasoning system that automatically interprets electron density
    maps to determine the three-dimensional structures of proteins. Electron density patterns
    for regions in a protein are represented by numerical features, which are used in a distance metric to efficiently retrieve matching patterns by searching a large database.
    These pre-selected cases are then evaluated by more expensive methods to identify truly
    good matches - this strategy speeds up the retrieval of matching density regions, thereby
    enabling fast and accurate protein model-building. This two-phase case retrieval
    approach is potentially useful in many case-based reasoning systems, especially those
    with computationally expensive case matching and large case libraries.

publication date

  • August 2007