Cristin result ID: 2282054
Last modified: 12 July 2024, 10:20
Result
Academic lecture
2024

The grammar of fake news: Corpus linguistics meets machine learning

Contributors:
  • Zia Uddin
  • Nele Poldvere
  • Aleena Thomas

Presentation

Name of the event: ICAME 45
Place: Vigo
Date from: 18 June 2024
Date to: 22 June 2024

About the result

Academic lecture
Year published: 2024

Description

Title

The grammar of fake news: Corpus linguistics meets machine learning

Abstract

In this study, we investigate the grammar of fake news by bringing together insights from corpus linguistics and machine learning. While the former offers a robust corpus-based framework for the register analysis of grammatical features, namely multidimensional analysis (Biber, 1988), the latter contributes the methodological capabilities for the automatic identification of fake news based on these features. Fake news detection has made remarkable progress in natural language processing and machine learning (e.g., Rashkin et al., 2017; Põldvere et al., 2023), but it has not taken full advantage of the linguistic resources that are available. Based on the new PolitiFact-Oslo Corpus (Põldvere et al., 2023), we aim (i) to describe the grammatical differences between fake and real news across a variety of text types in a large corpus, and (ii) to develop an efficient deep learning-based approach to fake news detection based on these differences.

A common distinction in multidimensional register analysis is between informational and involved styles of communication. The former tends to contain more nouns and is common in registers with dense styles of communication, such as news reportage; the latter is characterized by a more frequent use of pronouns, verbs and adjectives and is common in spontaneous conversation, with lower levels of information density. Departing from the view that fake news is a register in its own right, Grieve and Woodfield (2023) analyzed 49 grammatical features in a small collection of fake and real news texts by a single journalist. They found fake news to be more similar to involved styles of communication through its use of, e.g., present-tense verbs, emphatics and predicative adjectives, whereas real news shared features with informational styles of communication.

In contrast to Grieve and Woodfield (2023), this study makes use of a large corpus of fake and real news in English: the PolitiFact-Oslo Corpus. The main strengths of the corpus are that the texts have been individually labelled for veracity by experts and are accompanied by important metadata about the text types (e.g., social media, news and blog) and sources (e.g., X, The Gateway Pundit). At present, the corpus contains 428,917 words of fake and real news, and it is growing.

To extract the grammatical features, we used the Multidimensional Analysis Tagger (Nini, 2019), and then trained an efficient deep learning model (an attention-based Long Short-Term Memory network; LSTM) on the features to discriminate between fake and real news. The trained model was then used to detect fake news texts automatically. The preliminary results based on a sample from the corpus indicate that there are systematic differences between fake and real news, which are by and large indicative of the distinction between involved and informational styles of communication, respectively. However, these differences are not the same across the text types, with social media showing lower levels of information density in fake news than the news and blog text types. Our machine learning model based on the grammatical features also shows promising results (LSTM mean accuracy: 90%), particularly when compared with models without the grammatical features.
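The pipeline described in the abstract (grammatical-feature extraction followed by an attention-based LSTM classifier) can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the feature count, hidden size, single-cell NumPy LSTM and random untrained weights are all illustrative assumptions, and the actual model would be trained with a deep learning framework on features produced by the Multidimensional Analysis Tagger.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the talk): each position in
# a text carries binary indicators for a handful of grammatical features.
n_features = 8   # hypothetical number of grammatical-feature flags per position
hidden = 16      # hypothetical LSTM hidden size

# Randomly initialised, untrained weights for a single LSTM cell.
W = rng.normal(0, 0.1, size=(4 * hidden, n_features + hidden))
b = np.zeros(4 * hidden)

def lstm_forward(seq):
    """Run a single-layer LSTM over seq (T, n_features); return all hidden states."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    states = []
    for x in seq:
        z = W @ np.concatenate([x, h]) + b
        i, f, o, g = np.split(z, 4)          # input, forget, output gates + candidate
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c = f * c + i * g
        h = o * np.tanh(c)
        states.append(h)
    return np.stack(states)                  # shape (T, hidden)

# Attention pooling: score each hidden state, softmax, weighted sum.
v = rng.normal(0, 0.1, size=hidden)          # attention query vector

def attend(states):
    scores = states @ v
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ states                  # shape (hidden,)

w_out = rng.normal(0, 0.1, size=hidden)

def predict_fake_prob(seq):
    """Probability that a text is fake, from its grammatical-feature sequence."""
    pooled = attend(lstm_forward(seq))
    return sigmoid(pooled @ w_out)

# Toy usage: a 'text' of 20 positions with random feature flags.
doc = rng.integers(0, 2, size=(20, n_features)).astype(float)
p = predict_fake_prob(doc)
assert 0.0 < p < 1.0
```

The attention step is what makes the model inspectable: the softmax weights show which positions in a text contribute most to the fake/real decision, which is useful when relating the classifier back to the register analysis.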

Contributors

Zia Uddin

  • Affiliation:
    Author

Nele Poldvere

  • Affiliation:
    Author
    at Russia, Central Europe and the Balkans at the University of Oslo

Aleena Thomas

  • Affiliation:
    Author
    at Sustainable Communication Technologies at SINTEF AS