Fake News Detection
Identification of fake news using NLP text classification
Overview
A model that detects fake news with up to 99.6% accuracy, built by benchmarking several Machine Learning algorithms and feature extractors and tuning their parameters with RandomizedSearchCV.
Introduction
Fake news contains misleading information that could be fact-checked: fabricated claims about a country's statistics, for example, or exaggerated costs of certain services. Organizations such as the House of Commons and the Crosscheck project try to address the issue by holding authors accountable. Their scope is limited, however, because they rely on manual human detection; in a world where millions of articles are published or removed every minute, manual verification is neither feasible nor scalable. One solution is to develop a system that provides a credible automated index, score, or rating for the credibility of different publishers and news contexts.
Dataset Exploration
The dataset for this NLP task was acquired on Kaggle, where it was uploaded by user Clement Bisaillon. The original modelling on this dataset was presented in Intelligent, Secure and Dependable Systems in Distributed and Cloud Environments (ISDDC 2017), edited by Issa Traore, Isaac Woungang, and Ahmed Awad. The objective the dataset was built for has always been the same: create an algorithm that determines whether a given article is fake or real news. The upload is divided into two CSV files, ‘Fake.csv’ and ‘True.csv’; combined, they hold approximately 45,000 records. The dataset has only four columns: title, text, subject and date. The title is the article’s headline, and the text column contains the body of the article. The subject column, which most likely describes the genre each record belongs to, does not add much context, and its values are neither consistent nor balanced. The date column is simply the article’s publication date. The classes are roughly balanced: around 21,500 records in the ‘True’ file and approximately 23,500 in the ‘Fake’ file.
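Loading and combining the two files can be sketched as follows. This is a minimal illustration: the inline DataFrames are hypothetical stand-ins for `pd.read_csv("Fake.csv")` and `pd.read_csv("True.csv")`, and the labels (0 = real, 1 = fake) follow the convention described in the preprocessing section.

```python
import pandas as pd

# Stand-ins for the two Kaggle files; on the real dataset these would be
# fake = pd.read_csv("Fake.csv") and true = pd.read_csv("True.csv").
fake = pd.DataFrame({
    "title": ["Shocking claim!"],
    "text": ["An entirely fabricated story."],
    "subject": ["News"],
    "date": ["May 1, 2017"],
})
true = pd.DataFrame({
    "title": ["Senate passes bill"],
    "text": ["WASHINGTON (Reuters) - The Senate voted on Tuesday."],
    "subject": ["politicsNews"],
    "date": ["May 2, 2017"],
})

# Label the records: 0 = real, 1 = fake
true["label"] = 0
fake["label"] = 1

# Combine into a single frame (~45,000 rows on the full dataset)
df = pd.concat([true, fake], ignore_index=True)
print(df.shape)
```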
Data Preprocessing
The first preprocessing step was to remove the subject and date columns from the dataset completely. As mentioned previously, the subject column adds no useful context and is not relevant to the objective; likewise, the date column only indicates the publication date, which has no bearing on whether an article is fake or real. Next, the title and text of each article were concatenated into a single string. To label the data, each ‘True’ record was assigned 0 and each ‘Fake’ record was assigned 1. Another important step was removing the word ‘Reuters’ from the headlines of the True dataset: because so many true records contain it, leaving it in would bias the model, and fake articles containing the word ‘Reuters’ could be incorrectly classified as true news. The text data was then transformed into feature vectors. Along the way, stop words were removed to save processing time and filter out uninformative tokens, and tokenization split the phrases and sentences into smaller tokens. The final step was splitting the data into train and test sets.
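The steps above can be sketched in a few lines of pandas and scikit-learn. The toy frame is a hypothetical stand-in for the combined dataset, and the column name `content` is an illustrative choice, not necessarily the name used in the project:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the combined dataset (label: 0 = real, 1 = fake)
df = pd.DataFrame({
    "title": ["Senate passes bill (Reuters)", "Shocking claim!"],
    "text": ["The Senate voted on Tuesday.", "A fabricated story."],
    "subject": ["politicsNews", "News"],
    "date": ["May 2, 2017", "May 1, 2017"],
    "label": [0, 1],
})

# Drop the columns that carry no signal for the fake/real objective
df = df.drop(columns=["subject", "date"])

# Strip the tell-tale source tag so the model cannot key on it
df["title"] = df["title"].str.replace("Reuters", "", regex=False)

# Merge headline and body into a single input string
df["content"] = df["title"] + " " + df["text"]

# Hold out a test set for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    df["content"], df["label"], test_size=0.2, random_state=42
)
```

Stop-word removal and tokenization are not shown here because the vectorizers used in the next section (e.g. `stop_words="english"` on CountVectorizer or TfidfVectorizer) handle both internally.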
Model Building & Benchmarking
To build the classification model, different feature extraction methods were combined with different classification algorithms. For feature extraction we chose three models: HashingVectorizer, CountVectorizer and TfidfVectorizer. Multinomial Naïve Bayes, Logistic Regression, Random Forest and KNN were used as classification algorithms.
To tune the feature extractors’ and classifiers’ parameters, RandomizedSearchCV was used with 20 iterations and 3-fold cross-validation. To find the best-performing combination of feature extractor and classification algorithm, every combination was trained using RandomizedSearchCV. The table below shows the parameter grids that were fed to RandomizedSearchCV for each model. RandomizedSearchCV was preferred over GridSearchCV because the team could not afford an exhaustive search, given limited time and computing power.
In each training iteration, one combination of feature extractor and classifier was chosen and assembled into a pipeline, which was passed to RandomizedSearchCV along with the respective parameters to tune. After fitting, RandomizedSearchCV returned the trained pipeline.
Right after that, the pipeline was evaluated on the test set using the accuracy score, which was chosen because the dataset is balanced. Finally, at the end of each iteration, the results were saved to a global variable.
As a result, every combination of models was fitted and evaluated, and all accuracy scores were recorded and compared. The best combination of models, with their hyper-parameters, is shown in the table below.
Surprisingly, the best feature extractor in our case, TfidfVectorizer, uses character bigrams to produce features and does not lowercase the text. This configuration achieved a great result, but at the cost of computing time, since as many as 4,381 features came out of the extractor. The figures below illustrate the top 20 most frequent bigrams in the corpus, and the distances between the top 100 most frequent bigrams projected onto a two-dimensional scatterplot using TruncatedSVD.
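The winning extractor configuration, and the TruncatedSVD projection behind the scatterplots, can be sketched as follows. The three snippets are hypothetical stand-ins for the corpus (which produced roughly 4,381 bigram features), and projecting each bigram's document profile is one plausible reading of how the 2-D coordinates were obtained:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical snippets standing in for the news corpus
docs = [
    "WASHINGTON (Reuters) - The Senate voted on Tuesday.",
    "Officials confirmed the budget measure passed.",
    "A shocking secret the media will not report!",
]

# The winning configuration: character bigrams, no lowercasing
vect = TfidfVectorizer(analyzer="char", ngram_range=(2, 2), lowercase=False)
X = vect.fit_transform(docs)
print(X.shape)  # (number of documents, number of distinct character bigrams)

# Reduce each bigram's per-document weight profile to two dimensions,
# giving one 2-D point per bigram for a scatterplot
svd = TruncatedSVD(n_components=2, random_state=42)
coords = svd.fit_transform(X.T)
print(coords.shape)
```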
Results
Our system takes a URL as input and classifies the article as true or fake. To implement this, we used NLP text classification in Machine Learning.
We used different feature extraction methods (HashingVectorizer, CountVectorizer, TfidfVectorizer) and different algorithms (Logistic Regression, SVC, KNN, Random Forest Classifier, MultinomialNB) and ran RandomizedSearchCV on each combination. The maximum accuracy we could achieve was 99.553%.
Conclusion
We were able to distinguish real news from fake news using NLP with an accuracy of 99.559%, using TF-IDF features and a Random Forest Classifier.