Achieving High Data Utility K-Anonymization Using Similarity-Based Clustering Model

Mohammad Rasool SARRAFI AGHDAM, Noboru SONEHARA

IEICE TRANSACTIONS on Information and Systems   Vol.E99-D   No.8   pp.2069-2078
Publication Date: 2016/08/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.2015INP0019
Type of Manuscript: Special Section PAPER (Special Section on Security, Privacy and Anonymity of Internet of Things)
Keyword: anonymization, privacy preserving data mining, K-anonymity, algorithm

In data sharing, privacy has become one of the main concerns, particularly when the shared datasets describe individuals and contain private, sensitive information. A model widely used to protect the privacy of individuals when publishing micro-data is k-anonymity. It reduces the confidence with which private sensitive information can be linked to a specific individual by generalizing the identifier attributes of each individual so that they are indistinguishable from those of at least k-1 others in the dataset. K-anonymity can also be defined as clustering with the constraint of a minimum of k tuples in each group. However, the accuracy of the data in a k-anonymous dataset decreases because of the large information loss caused by generalization and suppression. Also, most current approaches are designed for numerical continuous attributes; for categorical attributes they do not perform efficiently and depend on hierarchical taxonomies of the attributes, which often do not exist. In this paper, we propose a new model for k-anonymization called Similarity-Based Clustering (SBC). It is based on clustering, and it measures similarity and calculates distances between tuples containing numerical and categorical attributes without hierarchical taxonomies. Based on this model, a bottom-up greedy algorithm is proposed. Our extensive study on two real datasets shows that, compared with existing well-known algorithms, the proposed algorithm offers much higher data utility and significantly reduces information loss. Data utility is maintained above 80% over a wide range of k values.
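
The abstract does not state the SBC similarity measure or the details of the greedy algorithm, so the following is only a rough sketch of the general idea: a taxonomy-free distance over mixed numerical and categorical attributes, used by a bottom-up greedy merge that stops once every group holds at least k tuples. The distance here is a Gower-style stand-in (range-normalized difference for numbers, exact-match test for categories) and the merge uses average linkage; both are assumptions for illustration, not the authors' method. All names (mixed_distance, cluster_distance, greedy_k_clusters, ranges) are hypothetical.

```python
from itertools import product


def mixed_distance(a, b, ranges):
    """Gower-style distance in [0, 1] between two tuples (assumed measure).

    ranges[i] is the value range of numeric attribute i,
    or None when attribute i is categorical.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            # Categorical: 0 on exact match, 1 otherwise (no taxonomy needed).
            total += 0.0 if x == y else 1.0
        else:
            # Numeric: absolute difference scaled by the attribute range.
            total += abs(x - y) / r if r else 0.0
    return total / len(ranges)


def cluster_distance(c1, c2, ranges):
    """Average pairwise distance between two clusters (average linkage)."""
    pairs = product(c1, c2)
    return sum(mixed_distance(a, b, ranges) for a, b in pairs) / (len(c1) * len(c2))


def greedy_k_clusters(records, k, ranges):
    """Bottom-up greedy merge until every cluster has at least k tuples."""
    if k < 1 or len(records) < k:
        raise ValueError("need k >= 1 and at least k records")
    clusters = [[r] for r in records]  # start with singleton clusters
    while True:
        small = [c for c in clusters if len(c) < k]
        if not small:
            return clusters
        c = min(small, key=len)  # smallest undersized cluster
        others = [o for o in clusters if o is not c]
        nearest = min(others, key=lambda o: cluster_distance(c, o, ranges))
        clusters = [o for o in others if o is not nearest] + [c + nearest]


# Toy data: (age, occupation); age range assumed to be 30, occupation categorical.
records = [(25, "nurse"), (27, "nurse"), (52, "engineer"), (55, "engineer")]
ranges = [30, None]
clusters = greedy_k_clusters(records, k=2, ranges=ranges)
```

Each resulting cluster could then be generalized to a common representation (for example, an age interval and a set of occupations) to produce the k-anonymous output; how SBC performs that step and measures the resulting information loss is described in the full paper, not in this abstract.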