Cristin result ID: 1882431
Last modified: 19 March 2021, 10:16
NVI reporting year: 2020
Result
Academic article
2020

Document similarity

Contributors:
  • Claus Huitfeldt and
  • C. Michael Sperberg-McQueen

Journal

Balisage Series on Markup Technologies
ISSN 1947-2609
e-ISSN 1947-2609
NVI level 1

About the result

Academic article
Year of publication: 2020
Published online: 2020
Printed: 2020
Volume: 25

Description

Title

Document similarity

Abstract

In recent years, the development of tools and methods for measuring document similarity has become a thriving field in informatics, computer science, and digital humanities. Historically, questions of document similarity have been (and still are) important or even crucial in a large variety of situations. Typically, similarity is judged by criteria which depend on context. The move from traditional to digital text technology has not only provided new possibilities for the discovery and measurement of document similarity, but has also posed new challenges. Some of these challenges are technical, others conceptual. This paper argues that a particular, well-established, traditional way of starting with an arbitrary document and constructing a document similar to it, namely transcription, may fruitfully be brought to bear on questions concerning similarity criteria for digital documents. Some simple similarity measures are presented and their application to marked-up documents is discussed. We conclude that when documents are encoded in the same vocabulary, n-grams constructed to include markup can be used to recognize structural similarities between documents.
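
The closing claim of the abstract, that n-grams which include markup can expose structural similarity, can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenizer, the choice of bigrams, and the use of Jaccard overlap as the similarity measure are all assumptions made for illustration, and the function names (tokens, ngrams, jaccard, similarity) are hypothetical.

```python
import re

# Assumed tokenization: each start tag, end tag, and word is one token.
# Attributes are ignored and self-closing tags are treated as start tags.
TOKEN_RE = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^>]*>|(\w+)")


def tokens(doc):
    """Split a marked-up string into a flat list of tag and word tokens."""
    out = []
    for match in TOKEN_RE.finditer(doc):
        slash, name, word = match.groups()
        if name:
            out.append("<" + slash + name + ">")
        else:
            out.append(word.lower())
    return out


def ngrams(seq, n=2):
    """Return the set of n-grams (as tuples) over a token sequence."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(doc_a, doc_b, n=2):
    """Jaccard similarity of the markup-inclusive n-gram sets of two documents."""
    return jaccard(ngrams(tokens(doc_a), n), ngrams(tokens(doc_b), n))


if __name__ == "__main__":
    a = "<p>the cat <hi>sat</hi></p>"
    b = "<p>the dog <hi>sat</hi></p>"  # same structure, one word differs
    c = "<l>the cat sat</l>"           # same words, different structure
    print(similarity(a, b), similarity(a, c))
```

In this toy example the two documents that share element structure score higher than the two that share only their words, because n-grams such as (<hi>, sat) and (</hi>, </p>) carry structural information. The comparison is only meaningful when both documents use the same element names, which matches the abstract's condition that they be encoded in the same vocabulary.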

Contributors

Claus Huitfeldt

  • Affiliation:
    Author
    at the Department of Philosophy (Institutt for filosofi og førstesemesterstudier), University of Bergen

C. Michael Sperberg-McQueen

  • Affiliation:
    Author
    in the USA