
Disaster Tweets

Disaster Tweets NLP classifier


Overview

Millennials and Gen Z are quick to capture and post anything exciting happening around them, and we think we have found a way to put that habit to good use. Our idea is to build a model that can identify social media posts about newly emerging disasters and send details such as location and incident information to agencies like local emergency services and news outlets as quickly as possible. To identify such disaster posts, we used one of the most popular social media platforms: Twitter.

Project details

Problem Statement: The project builds an NLP model capable of classifying tweets that announce disasters. Our dataset: https://www.kaggle.com/c/nlp-getting-started
The goal is to predict whether a given tweet is about a real disaster or not. If so, predict a 1.
If not, predict a 0. The dataset includes train and test splits with the tweet text, id, keyword, location, and target (train only) columns. The training set contains 7,613 records and the test set 3,263.

Basic analysis showed that the training data contains 110 duplicate rows and that the keyword and location columns include some NULL values: in the train set, 61 missing keywords and 2,482 missing locations; in the test set, 26 missing keywords and 1,105 missing locations.

EDA

Exploratory Data Analysis refers to the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations. We analyzed our dataset graphically using the seaborn and wordcloud Python libraries, applying stopword removal before evaluating the text.

The following bar chart shows the distribution of the target column in the training dataset. The distribution is unbalanced, with fewer disaster tweets than non-disaster tweets.
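A chart like this can be reproduced with a short seaborn call. The sketch below assumes the training CSV has been loaded into a DataFrame named train_df (an assumed name):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# train_df is assumed to hold the Kaggle training split.
train_df = pd.read_csv("train.csv")

# Plot class balance: 0 = non-disaster, 1 = disaster.
sns.countplot(x="target", data=train_df)
plt.title("Target distribution in the training set")
plt.show()
```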

WordCloud - a visual representation of words that gives greater prominence to words that appear more frequently.

The keyword column contains some NULL values; the horizontal bar chart shows the unique keywords in both the train and test datasets.

The graph below shows the word counts in the train dataset for both target labels.

The graph below shows the most common keywords in the train dataset.

The pie chart shows the percentage of the dataset in which the word ‘Disaster’ is used.

The histogram below shows the difference in text length after data cleaning; the distribution of text lengths is more even for the cleaned data. To clean the tweets in the train dataset, we removed stopwords, URLs, emojis, and some punctuation marks using the nltk library (WordNetLemmatizer, stopwords).

Data Cleaning

Figure: Original Data shows what the raw data looked like before the cleanup process, and Figure: Cleaned Data shows what the data looks like afterwards. The cleaned data was obtained by completing the following procedures.

●  All the letters were converted to lowercase.

●  All newlines were removed.

●  Elements such as IP addresses and usernames were removed.

●  Words with apostrophes were converted to two individual words. (Ex: “you’re” to “you are”)

●  Common words were taken out, such as “as”, “a”, “and”.

●  All non-alphanumeric characters and digits were removed, except for “!” marks, as disaster comments commonly include “!”.
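Taken together, these steps can be expressed as a single cleaning function. Below is a minimal sketch using the nltk stopword list and WordNetLemmatizer mentioned earlier; the contraction map and exact regular expressions are illustrative assumptions, not the project's exact code:

```python
import re
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem import WordNetLemmatizer    # requires nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

# Illustrative subset of the apostrophe expansions described above.
CONTRACTIONS = {"you're": "you are", "it's": "it is", "don't": "do not"}

def clean_tweet(text: str) -> str:
    text = text.lower()                                   # lowercase all letters
    text = text.replace("\n", " ")                        # remove newlines
    text = re.sub(r"https?://\S+", "", text)              # strip URLs
    text = re.sub(r"@\w+", "", text)                      # strip usernames/handles
    text = re.sub(r"\d{1,3}(?:\.\d{1,3}){3}", "", text)   # strip IP-like tokens
    for short, full in CONTRACTIONS.items():              # expand apostrophized words
        text = text.replace(short, full)
    text = re.sub(r"[^a-z!\s]", " ", text)                # drop digits/punctuation, keep "!"
    tokens = [lemmatizer.lemmatize(t) for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)
```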

The accuracy obtained from running the uncleaned tweets through a basic neural network was 74.1%, while the cleaned tweets yielded 74.5%. The accuracy improvement from training our basic model was quite minimal, as is evident in the model accuracy figures (Figure: Training with Uncleaned Data & Figure: Training with Cleaned Data).

Although accuracy did not improve drastically, we observed that training on the cleaned data took far less time than training on the uncleaned data.

We believe the better training time is due to the number of features in the dataset: the cleaning process reduced the feature count, which allowed far fewer input nodes in our model. In conclusion, using the cleaned data proved beneficial because it shortened the training time of each model without impeding accuracy.

Data Balancing

The dataset has a large imbalance toward non-disaster tweets, as can be seen in Figure: Dataset Distribution. Data imbalance is problematic when training a model because it leads to under-representation of the minority class.

Figure: Training with Unbalanced Dataset shows the results of training a model with the imbalanced data: the validation loss did not decrease after each epoch, and the validation accuracy stayed roughly the same. To fix this, we used SMOTE, which balances the dataset by creating more samples of the minority class. See Figure: SMOTE Balancing code.
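In code, the balancing step is a short call to the imbalanced-learn library (a sketch; X_train and y_train are assumed names for the vectorized features and labels):

```python
from imblearn.over_sampling import SMOTE

# Synthesize minority-class samples until both classes are the same size.
smote = SMOTE(random_state=42)
X_balanced, y_balanced = smote.fit_resample(X_train, y_train)
```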

As seen in Figure: Model Performance After Balancing Dataset, the model then behaves as a neural network should: the validation loss decreases and the validation accuracy increases after each epoch. In conclusion, SMOTE balances the dataset and helps the model train properly.

Classical Machine Learning Models

Classical machine learning algorithms were used to obtain a benchmark, which lets us understand their limitations and examine how deep learning neural networks can outperform them. The model used for this section of the project was a pipeline with a Count-Vectorizer, a TF-IDF transformer, data balancing, and a machine learning classifier. Multiple classifiers were tried: Random Forest, Logistic Regression, Decision Tree, and Multinomial Naive Bayes.
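One way to assemble such a pipeline is with imbalanced-learn's Pipeline, which accepts a sampling step alongside scikit-learn transformers (a sketch; hyperparameters and variable names are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline allows sampling steps

clf = Pipeline([
    ("vect", CountVectorizer()),          # raw tweets -> token counts
    ("tfidf", TfidfTransformer()),        # reweight counts by TF-IDF
    ("smote", SMOTE(random_state=42)),    # balance classes during fit only
    ("model", RandomForestClassifier()),  # swap in LogisticRegression, etc.
])
clf.fit(train_texts, train_labels)
preds = clf.predict(test_texts)
```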

As seen in Table: Machine Learning Classifiers Performance, the Random Forest Classifier obtained the best accuracy of 78.0%, but its precision, recall, and F1-score for disaster tweets were far from optimal, as can be examined in Figure: Random Forest Classifier Performance.

Neural Networks

The basic neural network structure was used to try different NLP processing methods, where different lengths of character and word n-grams of the disaster tweets were tested to measure the performance of each method.

The basic neural network was structured as seen in Figure: Basic Neural Network Structure. It has an input layer whose number of nodes equals the number of features. The hidden layers were 1000 → 5000 → 1000 → 100 → 10 nodes, all with the ReLU activation function. Lastly, the output layer was a single node with a sigmoid activation, since there are only two possible outputs.
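Assuming the network was built in Keras (the framework is not named in the text), the structure above corresponds to a model like this sketch:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_basic_nn(n_features: int) -> keras.Model:
    """Basic NN: input = feature count, hidden 1000-5000-1000-100-10, sigmoid output."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(1000, activation="relu"),
        layers.Dense(5000, activation="relu"),
        layers.Dense(1000, activation="relu"),
        layers.Dense(100, activation="relu"),
        layers.Dense(10, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # two classes -> one sigmoid node
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```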


Count Vectorizer on Basic Neural Network Structure

Figure: Countvectorizer Performance shows the performance of the different Count-Vectorizer configurations with the basic neural network. With character n-grams, accuracy increased as the number of characters increased; with word n-grams, accuracy decreased as the number of words increased. The best Count-Vectorizer model used a word n-gram of 1, with an accuracy of 73.1%.
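The two configurations differ only in the analyzer and n-gram range passed to scikit-learn's CountVectorizer (a sketch; the exact character n-gram lengths tested are not stated, so 4 is illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Character n-grams: accuracy rose as n grew.
char_vect = CountVectorizer(analyzer="char", ngram_range=(4, 4))

# Word n-grams: accuracy fell as n grew; unigrams (n=1) performed best.
word_vect = CountVectorizer(analyzer="word", ngram_range=(1, 1))

X_char = char_vect.fit_transform(train_texts)
X_word = word_vect.fit_transform(train_texts)
```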

TF-IDF Vectorizer with Basic Neural Network Model

As seen in Figure: TF-IDF Vectorizer Performance, a trend similar to the Count-Vectorizer results was present: the basic neural network performed worse as the word n-gram size increased. A word n-gram of 1 was the best TF-IDF model, with an accuracy of 74.5%.

Combining TF-IDF and Count-Vectorizer with Basic Neural Network Model

In an attempt to increase the number of features in the dataset, the outputs of the Count-Vectorizer and the TF-IDF Vectorizer were combined into one dataset used to train the basic neural network.
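With scipy's sparse utilities, combining the two feature matrices amounts to a horizontal stack (a sketch under the same assumed variable names as above):

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Build both feature sets with word unigrams (the best setting for each).
count_vect = CountVectorizer(ngram_range=(1, 1))
tfidf_vect = TfidfVectorizer(ngram_range=(1, 1))
X_count = count_vect.fit_transform(train_texts)
X_tfidf = tfidf_vect.fit_transform(train_texts)

# Concatenate the matrices column-wise, doubling the feature count.
X_combined = hstack([X_count, X_tfidf])
```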

As seen in Figure: TF-IDF & Count-Vectorizer Performance, the accuracy of the basic neural network increased drastically when the two best NLP processing methods were combined, with the best accuracy rising to 77.5%.

Word Embedding

To improve model performance, GloVe word embeddings were chosen. The glove.6B models (trained on 6 billion tokens) with 50, 100, 200, and 300 dimensions were used.

Figure 1 illustrates the ANN with the GloVe embedding implemented; the architecture shown is an example using the 50-dimensional GloVe model.

First, the cleaned texts enter the input layer and are passed to a TextVectorization layer, which transforms sentences into numeric form for feature extraction. The vectors are then passed to the embedding layer, where the 50 GloVe dimensions with pretrained weights are added. Finally, three hidden Dense layers with decreasing sizes lead to the output layer.
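In Keras terms, the pipeline might look like the following sketch. The vocabulary size, sequence length, Dense widths, and pooling step are assumptions; the GloVe file is the standard glove.6B.50d.txt:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Text -> integer sequences (vocabulary size and sequence length assumed).
vectorizer = layers.TextVectorization(max_tokens=20000, output_sequence_length=50)
vectorizer.adapt(train_texts)
vocab = vectorizer.get_vocabulary()

# Load the pretrained 50-dimensional GloVe vectors.
glove = {}
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        word, *vec = line.split()
        glove[word] = np.asarray(vec, dtype="float32")

# Map each vocabulary entry to its GloVe vector (zeros for unknown words).
embedding_matrix = np.zeros((len(vocab), 50))
for i, word in enumerate(vocab):
    if word in glove:
        embedding_matrix[i] = glove[word]

model = keras.Sequential([
    keras.Input(shape=(1,), dtype="string"),
    vectorizer,
    layers.Embedding(len(vocab), 50,
                     embeddings_initializer=keras.initializers.Constant(embedding_matrix),
                     trainable=False),            # keep pretrained weights fixed
    layers.GlobalAveragePooling1D(),              # collapse the sequence (assumed)
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```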

 

GloVe Models Comparison

In the same way, different models were trained with the different GloVe models. The ANN architecture was the same for all of them; only the shapes of the embedding and Dense layers changed. Table 1 describes the best-performing models for each GloVe dimensionality, and Figures 2-5 illustrate the models' accuracy and loss curves.

It turned out that the 50-dimensional GloVe model works best for us, although the model had to train for 700 epochs to converge.

Output Probability Thresholding

The figure above describes the prediction process:

  1. Raw text is cleaned

  2. Cleaned text is passed to the model

  3. The model returns a set of probabilities

  4. The maximum probability is calculated

  5. The maximum probability is thresholded

  6. The classification label is returned

Instead of using the default evaluation and prediction under the neural network's hood, a manual process was used. By default, the maximum probability is thresholded to produce a binary value. To achieve better accuracy, we iterated over threshold values from 0 to 1 in steps of 0.1, evaluating the model on both the training and test sets at each value. Using the ANN, GloVe, and probability thresholding allowed us to achieve 77.5% accuracy, but this performance is still no better than the previous models.
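The threshold sweep itself is a few lines of NumPy (a sketch; model, X_val, and y_val are assumed names):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Try thresholds 0.0, 0.1, ..., 1.0 and keep the most accurate one.
probs = model.predict(X_val).ravel()
best_thr, best_acc = 0.5, 0.0
for thr in np.arange(0.0, 1.01, 0.1):
    acc = accuracy_score(y_val, (probs >= thr).astype(int))
    if acc > best_acc:
        best_thr, best_acc = thr, acc
print(f"best threshold = {best_thr:.1f}, accuracy = {best_acc:.3f}")
```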

RNN

Next, an RNN was tried as the neural network architecture; Figure 7 represents it. Essentially, the two Dense hidden layers were replaced with two Bidirectional GRUs. With the RNN, accuracy finally broke through and grew to 80%.
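A sketch of that substitution, reusing the vectorizer and embedding matrix from the GloVe model above (the GRU widths are assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(1,), dtype="string"),
    vectorizer,
    layers.Embedding(len(vocab), 50,
                     embeddings_initializer=keras.initializers.Constant(embedding_matrix),
                     trainable=False),
    # The Dense hidden layers are swapped for two Bidirectional GRUs.
    layers.Bidirectional(layers.GRU(64, return_sequences=True)),
    layers.Bidirectional(layers.GRU(32)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```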

Output Probability Selection

As mentioned above, the maximum of the calculated set of probabilities was used to finalize each prediction. We therefore tried using the mean probability instead of the maximum, and also lowered the threshold through iterative selection. Fortunately, this helped increase accuracy: the ANN model went up to 82%, although the approach did not work with the RNN.
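The change amounts to swapping the aggregation function applied before thresholding (a sketch; probs_set stands in for the set of probabilities the model returns per tweet, and the threshold value is illustrative):

```python
import numpy as np

# probs_set: one row of model probabilities per tweet (origin of the set assumed).
# Toy stand-in so the snippet runs; in practice this comes from the network.
probs_set = np.random.rand(8, 10)

max_scores = probs_set.max(axis=1)    # original rule: maximum probability
mean_scores = probs_set.mean(axis=1)  # alternative rule: mean probability

threshold = 0.4  # lowered from 0.5 by iterative selection (illustrative value)
predictions = (mean_scores >= threshold).astype(int)
```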

Best Models Comparison & Confusion Matrices

The RNN with the maximum probability underperformed the ANN with the mean probability by 2% in accuracy. Not only accuracy improved, but also precision. The figures below show the confusion matrices of the two best models so far. The RNN has approximately equal precision and recall, while the better model improved precision at the cost of recall. In our problem, decreasing false positives is much more important: when a tweet is flagged as an actual disaster, it is crucial that the identification is correct.

Conclusion

To conclude, the best result obtained was 82% accuracy. In the future, different neural network activation functions and architectures, other word embeddings, and combining and adding more features could be explored.

Demo Project

A web application using the Django backend framework was created for the project demo.
