Cristin result ID: 2080492
Last modified: 28 February 2023, 16:10
NVI reporting year: 2022
Result
Academic article
2022

The text anonymization benchmark (TAB): A dedicated corpus and evaluation framework for text anonymization

Contributors:
  • Ildikó Pilán
  • Pierre Lison
  • Lilja Øvrelid
  • Anthi Papadopoulou
  • David Sánchez and
  • Montserrat Batet

Journal

Computational Linguistics
ISSN 0891-2017
e-ISSN 1530-9312
NVI level 2

About the result

Academic article
Year of publication: 2022
Published online: 2022
Printed: 2022
Volume: 48
Issue: 4
Pages: 1053 - 1101
Open Access

Import sources

Scopus-ID: 2-s2.0-85143251867

Description

Title

The text anonymization benchmark (TAB): A dedicated corpus and evaluation framework for text anonymization

Abstract

We present a novel benchmark and associated evaluation metrics for assessing the performance of text anonymization methods. Text anonymization, defined as the task of editing a text document to prevent the disclosure of personal information, currently suffers from a shortage of privacy-oriented annotated text resources, making it difficult to properly evaluate the level of privacy protection offered by various anonymization methods. This paper presents TAB (Text Anonymization Benchmark), a new, open-source annotated corpus developed to address this shortage. The corpus comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR) enriched with comprehensive annotations about the personal information appearing in each document, including their semantic category, identifier type, confidential attributes, and co-reference relations. Compared with previous work, the TAB corpus is designed to go beyond traditional de-identification (which is limited to the detection of predefined semantic categories), and explicitly marks which text spans ought to be masked in order to conceal the identity of the person to be protected. Along with presenting the corpus and its annotation layers, we also propose a set of evaluation metrics that are specifically tailored toward measuring the performance of text anonymization, both in terms of privacy protection and utility preservation. We illustrate the use of the benchmark and the proposed metrics by assessing the empirical performance of several baseline text anonymization models. The full corpus along with its privacy-oriented annotation guidelines, evaluation scripts, and baseline models are available on: https://github.com/NorskRegnesentral/text-anonymization-benchmark.

Contributors

Ildikó Pilán

  • Affiliation:
    Author
    at the Department of Statistical Analysis and Machine Learning for User-Motivated Applications (SAMBA) at Norsk Regnesentral
Active Cristin person

Pierre Lison

  • Affiliation:
    Author
    at the Department of Statistical Analysis and Machine Learning for User-Motivated Applications (SAMBA) at Norsk Regnesentral

Lilja Øvrelid

  • Affiliation:
    Author
    at the Language Technology Research Group at the University of Oslo

Anthi Papadopoulou

  • Affiliation:
    Author
    at the Language Technology Research Group at the University of Oslo

David Sánchez

  • Affiliation:
    Author
    at Universitat Rovira i Virgili