History

Cristin result ID: 2080488

Last modified: 28 February 2023, 17:37

NVI reporting year: 2022

Result

Academic chapter/article/conference paper

2022

Neural Text Sanitization with Explicit Measures of Privacy Risk

Anthi Papadopoulou
Yunhao Yu
Pierre Lison and
Lilja Øvrelid

Book

The 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing

ISBN:

978-1-955917-65-0

Publisher

Association for Computational Linguistics

NVI level 1


About the result

Academic chapter/article/conference paper

Publication year: 2022

Pages: 217-229

ISBN:

978-1-955917-65-0


Classification

Field (NPI)

Field: ICT

- Subject area: Science and technology

Description

English

Title

Neural Text Sanitization with Explicit Measures of Privacy Risk

Abstract

We present a novel approach for text sanitization, which is the task of editing a document to mask all (direct and indirect) personal identifiers and thereby conceal the identity of the individual(s) mentioned in the text. In contrast to previous work, the approach relies on explicit measures of privacy risk, making it possible to explicitly control the trade-off between privacy protection and data utility. The approach proceeds in three steps. A neural, privacy-enhanced entity recognizer is first employed to detect and classify potential personal identifiers. We then determine which entities, or combinations of entities, are likely to pose a re-identification risk through a range of privacy risk assessment measures. We present three such measures of privacy risk, respectively based on (1) span probabilities derived from a BERT language model, (2) web search queries and (3) a classifier trained on labelled data. Finally, a linear optimization solver decides which entities to mask to minimize the semantic loss while simultaneously ensuring that the estimated privacy risk remains under a given threshold. We evaluate the approach both in the absence and presence of manually annotated data. Our results highlight the potential of the approach, as well as the issues that specific types of personal data can introduce to the process.
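
The final step described in the abstract (choosing which entities to mask so that semantic loss is minimal while the estimated risk stays under a threshold) can be illustrated with a small sketch. This is not the authors' implementation. It assumes, purely for illustration, that each entity has a numeric masking cost, that each entity combination has an estimated risk score, and that a combination counts as safe once at least one of its members is masked. It also swaps the linear optimization solver for an exhaustive search, which is only practical for a handful of entities. The function name `choose_masks`, the entity labels, and all numbers are hypothetical.

```python
from itertools import combinations


def choose_masks(costs, combo_risks, threshold):
    """Return (masked_entities, total_cost) with the lowest total cost.

    costs: dict mapping entity -> semantic loss if that entity is masked
           (hypothetical values, not taken from the paper)
    combo_risks: dict mapping frozenset of entities -> estimated
           re-identification risk of leaving that combination visible
    threshold: highest risk tolerated for a fully unmasked combination
    """
    # Assumption: a combination is neutralized when any one member is masked.
    risky = [combo for combo, risk in combo_risks.items() if risk > threshold]
    entities = sorted(costs)
    best, best_cost = None, float("inf")
    # Exhaustive search over all subsets; stands in for a real solver.
    for size in range(len(entities) + 1):
        for subset in combinations(entities, size):
            masked = set(subset)
            if all(masked & combo for combo in risky):
                cost = sum(costs[e] for e in masked)
                if cost < best_cost:
                    best, best_cost = masked, cost
    return best, best_cost


# Hypothetical entities, masking costs, and risk estimates.
costs = {"name": 3.0, "city": 1.0, "employer": 2.0, "date": 0.5}
combo_risks = {
    frozenset({"name"}): 0.9,
    frozenset({"city", "employer"}): 0.6,
    frozenset({"city", "date"}): 0.3,
    frozenset({"date"}): 0.1,
}

masked, cost = choose_masks(costs, combo_risks, threshold=0.5)
print(sorted(masked), cost)  # ['city', 'name'] 4.0
```

Raising the threshold in this toy setup leaves more text unmasked (higher utility, weaker protection), while lowering it forces more masking, which is the kind of explicit trade-off the abstract refers to.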
