Machine Learning + Data Science
Fake or Nah? A linguistic approach to detecting misinformation using machine learning and linguistic features (2025)
Misinformation has become one of the most pressing linguistic and technological challenges of the last decade, spreading rapidly across digital platforms and reshaping public understanding of politics, public health, and current events. While computational approaches to fake-news detection often rely on lexical or topic-based features, emerging research suggests that stylistic and structural linguistic patterns offer a powerful, interpretable source of insight. This study presents Fake or Nah?, a machine-learning model designed to classify fake and true news by integrating TF-IDF lexical vectors with linguistically motivated identifiers.
Using a balanced Kaggle dataset of more than 44,000 articles, the study extracts identifiers such as question and exclamation mark frequency, VADER sentiment, profanity rate, noun density, pronoun ratios, and vocabulary richness. Welch's t-tests reveal significant stylistic differences: fake news uses substantially more rhetorical punctuation, carries more negative sentiment, contains more profanity, and has lower noun density than factual reporting. Three models were trained and evaluated: a TF-IDF Logistic Regression, an identifier-only Logistic Regression, and a hybrid model combining both feature sets. The hybrid model achieved the strongest performance (≈98% accuracy, AUC ≈ 0.99), indicating that linguistic features meaningfully enhance predictive power while improving interpretability.
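To make the identifier and testing steps concrete, here is a minimal sketch using only the Python standard library. It is not the study's code: the feature definitions (punctuation counts per word, type-token ratio as a stand-in for vocabulary richness), the function names, and the toy texts are all assumptions for illustration. VADER sentiment and noun density are left out because they need external libraries.

```python
import math
import re
from statistics import mean, variance


def stylistic_features(text):
    """Return a few per-article stylistic identifiers (illustrative definitions)."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    n = max(len(tokens), 1)  # avoid division by zero on empty text
    return {
        "question_rate": text.count("?") / n,
        "exclamation_rate": text.count("!") / n,
        "type_token_ratio": len(set(tokens)) / n,  # assumed proxy for vocabulary richness
    }


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Unlike Student's t-test, this does not assume equal group variances.
    Each sample needs at least two values, and at least one must vary.
    """
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Invented toy texts, not drawn from the study's dataset.
fake_texts = [
    "You won't BELIEVE this!!! Is it true?",
    "Shocking! They lied to you!",
    "What are they hiding?! Wake up!",
]
true_texts = [
    "The council approved the budget on Tuesday.",
    "Officials said the report will be published next month.",
    "Is the rate rising? Analysts expect a decision in June.",
]

fake_rates = [stylistic_features(t)["exclamation_rate"] for t in fake_texts]
true_rates = [stylistic_features(t)["exclamation_rate"] for t in true_texts]
t_stat, dof = welch_t(fake_rates, true_rates)
print(f"Welch t = {t_stat:.2f}, df = {dof:.2f}")
```

On a real corpus, a p-value would come from the t distribution with the computed degrees of freedom, for example through a statistics library, and the identifier columns would be concatenated with the TF-IDF vectors before fitting the hybrid classifier.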
Completed for AI4ALL Ignite & Linguistics 103: Psycholinguistics (Glendale Community College)