Machine learning algorithms have been increasingly integrated into applications that significantly affect human lives. This has spurred interest in designing algorithms that train machine learning models to minimize training error while imposing a certain level of fairness. In this paper, we consider the problem of fair clustering of data sets. In particular, given a set of items, each associated with a vector of nonsensitive attribute values and a categorical sensitive attribute (e.g., gender or race), our goal is to find a clustering of the items that minimizes the loss (i.e., clustering objective) function and imposes fairness as measured by Rényi correlation. We propose an efficient and scalable in-processing algorithm, driven by findings from the field of combinatorial optimization, that heuristically solves the underlying optimization problem and allows for regulating the trade-off between clustering quality and fairness. The approach does not restrict the analysis to a specific loss function but instead considers a more general form that satisfies certain desirable properties, which broadens the scope of the algorithm's applicability. We demonstrate the effectiveness of the algorithm for the specific case of k-means clustering, as it is one of the most extensively studied and widely adopted clustering schemes. Our numerical experiments reveal that the proposed algorithm significantly outperforms existing methods by providing a more effective mechanism to regulate the trade-off between loss and fairness.
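To make the two quantities being traded off concrete, the following is a minimal sketch, not the algorithm proposed in this paper. It assumes that fairness is scored by the Rényi (Hirschfeld-Gebelein-Rényi maximal) correlation between the cluster label and the sensitive attribute, which for two discrete variables equals the second-largest singular value of the normalized joint distribution matrix. The function names and the additive penalty with weight `lam` are hypothetical choices made for illustration only.

```python
import numpy as np

def renyi_correlation(labels, sensitive):
    """Renyi (HGR) maximal correlation between two discrete variables."""
    # Map raw values to contiguous integer codes 0..m-1 and 0..n-1.
    _, a = np.unique(np.asarray(labels).ravel(), return_inverse=True)
    _, b = np.unique(np.asarray(sensitive).ravel(), return_inverse=True)
    # Empirical joint distribution of (cluster label, sensitive value).
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)
    joint /= joint.sum()
    pa = joint.sum(axis=1)  # marginal over cluster labels
    pb = joint.sum(axis=0)  # marginal over sensitive values
    # Q[i, j] = P(i, j) / sqrt(P(i) * P(j)); its largest singular value is 1.
    q = joint / np.sqrt(np.outer(pa, pb))
    s = np.linalg.svd(q, compute_uv=False)
    # The second-largest singular value is the maximal correlation.
    return float(s[1]) if s.size > 1 else 0.0

def kmeans_loss(X, labels, centers):
    """Sum of squared distances from each item to its assigned center."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diffs = X - centers[np.asarray(labels)]
    return float(np.sum(diffs ** 2))

def penalized_objective(X, labels, centers, sensitive, lam):
    """Hypothetical trade-off: clustering loss plus lam times unfairness."""
    return kmeans_loss(X, labels, centers) + lam * renyi_correlation(labels, sensitive)
```

Under these assumptions, a value of 0 indicates that cluster membership is statistically independent of the sensitive attribute, and a value of 1 indicates that some function of one is fully determined by the other. A larger `lam` places more weight on fairness relative to clustering quality.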
History: Rema Padman served as the senior editor for this article.
Data Ethics & Reproducibility Note: The code capsule is available on Code Ocean at https://codeocean.com/capsule/9556728/tree and in the e-Companion to this article (available at https://doi.org/10.1287/ijds.2022.0005).