
Cristin result ID: 2282055

Last modified: 12 July 2024, 10:23

Result

Academic lecture

2024

Out of balance, out of sight: Issues with the design and accessibility of a corpus of fake and real news

Nele Poldvere, Zia Uddin and Aleena Thomas

Presentation

Event name: ICAME 45

Location: Vigo

Date from: 18 June 2024

Date to: 22 June 2024

About the result

Academic lecture

Year of publication: 2024

Description

English

Title

Out of balance, out of sight: Issues with the design and accessibility of a corpus of fake and real news

Abstract

Fake news is a topic that has only recently caught the attention of (corpus) linguists (Grieve & Woodfield, 2023; Sousa Silva, 2022; Trnavac & Põldvere, 2024). Such research has sought to identify differences in linguistic features between fake and real news based on carefully designed corpora. An example of such a corpus is the new PolitiFact-Oslo Corpus (Põldvere et al., 2023), a large dataset of fake and real news in English based on recent events (post-2019). However, in its current form the corpus has some limitations, owing to the highly specific and sensitive nature of fake news. The present methodological study seeks solutions to these limitations with a view to facilitating future corpus-building efforts around fake news, a highly promising area of study for linguists.

As the name implies, the PolitiFact-Oslo Corpus relies on the fact-checking website PolitiFact.com for its data, with each news item individually labelled for veracity by experts (from ‘True’ to ‘Pants on Fire’). In contrast to many other fake news datasets (e.g., DeClarE in Popat et al., 2018), the corpus is the result of a combination of automatic and manual procedures, which gives greater control over what is included. In addition to this manual approach to text selection, the corpus is accompanied by important metadata about the texts, such as their text type (e.g., social media) and source (e.g., X). That said, the corpus currently has two major limitations. Firstly, there is a noticeable imbalance between the fake and real news samples (358,516 vs. 70,401 words, respectively), which is due to the preference of PolitiFact and other fact-checkers to debunk false information rather than to find support for true information. This limitation has serious implications for fake news analysis and detection model development based on the corpus (Põldvere et al., 2023). Secondly, due to copyright and privacy issues, the corpus is currently not publicly available, which is hardly in line with current open science practices.

We offer some solutions. To address the imbalance between the fake and real news samples, we have decided to widen the range of fact-checkers consulted rather than to extend the timeline. The additional fact-checkers are found via Google’s Fact Check Explorer, which provides quick and easy access to more instances of (mostly or half) true news. The challenge is to ensure comparability of the ratings between the fact-checkers (what is ‘Mostly True’ according to one fact-checker may be ‘Half True’ according to another) as well as balance in terms of the metadata (text type, source). The lack of access to the corpus is a much more complex problem to solve. Inspired by current practices in corpus linguistics, we are exploring opportunities to release text snippets, rather than the full texts, via an online interface. This, however, is complicated by the legal challenges of distributing fake news data in our national context. We seek solutions to these challenges, too.


Contributors

Nele Poldvere

Author
at Russia, Central Europe and the Balkans, University of Oslo

Zia Uddin

Author

Aleena Thomas

Author
at Sustainable Communication Technologies, SINTEF AS
