Detection of Hate Speech in Social Media

Abstract

The objective of our research is to detect hate speech on social media. Every day a huge amount of data is generated by the users of different social media platforms. For this research we created a data set by collecting data from Twitter. The data set consists of tweets from people of different races and religions. We followed a machine learning approach. Because Naïve Bayes (NB) and Support Vector Machine (SVM) are among the most popular algorithms for sentiment analysis and text classification, we used both in this work. NB achieved an accuracy of 94.63% and SVM an accuracy of 92.32%. The effects of an event on social media are not confined to the internet; they affect real-life events as well. Moreover, content spreads faster on social media than on other media. Many people post hateful content on social media; it hurts other people's feelings, difficulties arise, and people have to face further consequences. By detecting hate speech we can control such content and avoid these situations. Our work therefore has value in keeping social media free from some harmful content and from conflict between people of different beliefs.

CHAPTER 1

INTRODUCTION

1.1 Introduction

If we look at the statistics, we can clearly see that the number of people using social media is increasing at great speed. Every day millions of people join social media platforms such as Facebook, Twitter, and Instagram. Social media has therefore become a very important communication medium. With social media, people can send messages very quickly, so a particular topic can spread very quickly. Unfortunately, the same holds for hate speech, which can also spread very fast and can become a cause of conflict among different groups in society. According to the Cambridge Dictionary, “Hate speech is a public speech that expresses hate or encourages violence towards a person or group based on something such as race, religion, sex, or sexual orientation”. The criteria by which hate speech is defined vary from one part of the world to another. In America, for example, most such incidents arise from issues such as anti-African American, anti-immigrant, or anti-White sentiment. In a country like India, by contrast, most hate speech is based on religion, caste, or political views.

It is now very important to have control over the huge scale of data on social media. Several studies have used different methodologies to detect hate speech in data from social media such as Twitter, Facebook, or other sites. Two methods are popular. One is the bad-word method, in which a data set of hateful words is created. The other is the machine learning approach. Machine learning is used in many fields, such as business data analysis, prediction systems, recommendation systems, speech and handwriting recognition, bioinformatics, sentiment analysis, and more. Its use has proved efficient and successful in the fields just mentioned. Machine learning and data mining techniques are a good option for dealing with enormous collections of data, and different machine learning algorithms serve this purpose, for example SVM, Naïve Bayes, Random Forest, and Decision Tree.

In our research we use machine learning techniques to detect hate speech; we applied the SVM and Naïve Bayes algorithms and compare their performance. We collected data from Twitter. To annotate a tweet we considered different factors, such as hate against another religion, race, ethnicity, political view, tradition, gender, or sexual orientation. In this article we describe the process of our work in different sections.

1.2 Motivation

People want a pleasant environment on social media. A single line of words can hurt a lot of people, and the consequences of social media activity are not confined to the virtual world. Social media is one of the most frequently used communication platforms today. It brings people under the same umbrella regardless of national borders, socio-economic background, and so on. Social media helps any message to be sent very quickly, to become far-reaching, and even to go viral when a topic catches public attention. In the same way, hate speech can spread easily and rapidly, and that can be a cause of conflict between different groups in our society. We have seen such hate speech in the recent past, especially concerning religion and nationality.

1.3 Objective of this work

Our research aimed to create a new data set consisting of random users' tweets; Twitter user information will never be published. The set includes tweets that are not hate speech as well as hate speech expressing hatred based on religion, nationality, ethnicity, and gender. In this research we used machine learning techniques, because machine learning is a frequently used approach for text classification and researchers have put great effort into using ML algorithms to analyse sentiment in text data.

1.4 Expected Outcome

Maintain a healthy environment on social media.

Avoid unpleasant and embarrassing situations.

Get rid of conflicts between groups in society.

Reduce misuse of social media.

When this kind of abusive and disrespectful speech can be detected in real time on social media, it will be much easier to filter it or to take the necessary measures against the people responsible for it. This can be a very efficient way to maintain a good environment on social media such as Facebook and Twitter. It also helps us avoid many embarrassing situations when spending time on social media.

CHAPTER 2

BACKGROUND

2.1 Introduction

There is little work on detecting hate speech on social media, and from the Bangladesh perspective the number of studies is very small. Most of the work has been done in English and other languages and is not related to the Bangladesh context. We found two studies related to Bangladesh: one works with English text and the other uses Bangla text. Because it is very complicated to collect Bangla tweets automatically, the authors of the latter collected data manually from many sites, and the resulting data set is very small, consisting of only 200-300 words.

2.2 Related Work

Several studies have addressed hate speech detection. In this section we discuss a few previous studies and their outcomes. In “Detecting Hate Speech on the World Wide Web”, William Warner and Julia Hirschberg give a brief discussion of hate speech, present a clear concept of it, and propose a method for detecting it. They collected data from the American Jewish Congress (AJC) and Yahoo. They selected texts that had been marked as offensive by readers of those sites and that qualified as hate speech according to their definition. After analysis they divided the data by stereotype, such as anti-Hispanic, anti-African American, anti-immigrant, anti-white, and anti-Semitic language, and built a classifier for each stereotype. A template-based strategy is used for classification. They used an SVM classifier with a linear kernel and performed 10-fold cross-validation for each classifier. Classification performance and an error report are presented in two separate tables. The overall accuracy was 96%.
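The 10-fold cross-validation mentioned above can be illustrated with a minimal sketch. It is not taken from the cited paper; the function name `k_fold_indices` and the sample count of 25 are assumptions chosen only for illustration.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Hypothetical data set of 25 samples split into 10 folds.
folds = list(k_fold_indices(25, k=10))
```

Each sample appears in exactly one test fold, so every sample is used for testing once and for training in the remaining nine rounds.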

In “Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter”, the authors collected about 16 thousand tweets and annotated them. They analysed the features to find those that yield the best performance and improve hate speech detection. They defined 13 conditions; a tweet is considered hate speech if it fulfils one or more of these predefined conditions. They tested the impact of various features on performance using logistic regression with 10-fold cross-validation. For model selection they used a grid search over all possible features. For each tweet, n-grams were collected for n = 1, 2, 3, 4, and their results show differences between the n-gram lengths.
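As a rough illustration of collecting n-grams for n = 1 to 4, the sketch below works on any sequence, so it can be applied to a list of word tokens or to the characters of a string. The helper names and the sample sentence are assumptions, not part of the cited work.

```python
def ngrams(sequence, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def ngrams_up_to(sequence, max_n=4):
    """Collect n-grams for n = 1 .. max_n, keyed by n."""
    return {n: ngrams(sequence, n) for n in range(1, max_n + 1)}


# Hypothetical example sentence, tokenized on whitespace.
tokens = "this is a short example tweet".split()
grams = ngrams_up_to(tokens, max_n=4)
```

A sequence of six tokens yields six unigrams, five bigrams, four trigrams, and three 4-grams.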

In “Detecting Offensive Language in Social Media to Protect Adolescent Online Safety”, the authors analyse previous text mining methods. They propose an approach named Lexical Syntactical Feature (LSF), which can be used to detect offensive text as well as to predict a user's potential to send abusive content. According to their research, combining context-specific features, structure features, and style features with lexical features may improve existing machine learning methods and the prediction of offensiveness.

“Offensive Language Detection Using Multi-level Classification” used a total of 1525 messages from a few group messaging services. The authors annotated 68% of the data as “OK”, meaning not offensive, and 32% as offensive, considering criteria such as slurs, racism, homophobia, extremism, crude language, provocative language, taboos, and unrefined language. 10% of the data was used for testing and the rest for training. After processing the data they applied a three-level classification using the Weka software. They first ran several classifiers and, considering different criteria, chose the best-performing classifier for each level. For the three levels they used the Complement Naïve Bayes classifier, the Multinomial Updatable Naïve Bayes classifier, and DTNB (Decision Table/Naive Bayes hybrid classifier), respectively. Across these levels the percentage of correctly classified instances varied from 84.26% to 96.72%.

2.3 Comparative Studies

Will be added

2.4 Scope of the Problem

First, we had to go through several steps to collect the data. Collecting data requires a Twitter API key and password, and for these we had to apply for a developer account. We sent Twitter the information about the work and its purpose, and they replied that they needed additional information to approve the application. After we provided that information, the request was approved by Twitter. We then generated the API key and password for collecting tweets.

2.5 Challenges

We faced several challenges in the course of our research work….

2.5.3 Time

Last, but not least, our biggest challenge in this work was time, as all of us work full time and study alongside it. Agreeing on appointments and meeting up was one of the most challenging parts of the job. The project itself was a very interesting subject and it was fun to get the job done, and we learnt quite a lot from the process about being well structured and well planned.

2.6 Summary

In this chapter many related works on sentiment analysis of social media data have been described. From the above it is clear that this topic is considered very important by many researchers across the world, while research from the Bangladesh perspective is scarce. Although more and more research and projects are being carried out in this area, people are still trying to find more effective and easier ways of doing this work.

CHAPTER 3

REQUIREMENT SPECIFICATION

3.1 Anaconda

Anaconda is an open-source distribution of the Python and R programming languages for scientific computing (data science, data mining and machine learning applications, predictive analytics, large-scale data processing, etc.) that aims to simplify package management and deployment. Package versions are managed by the package management system conda. The Anaconda distribution is used by over 15 million users and includes more than 1500 popular data-science packages. Anaconda supports different operating systems, such as macOS, Linux, and Windows.

The requirements for installing Anaconda are discussed below.

Hardware requirements

*Additional space recommended if the repository will be used to store packages built by the customer. With an empty repository, a base install requires 2 GB.

3.1.1 Installing Anaconda

Anaconda is open-source software and anyone can use it for free. To use it you have to download and install it. The official website provides Anaconda for different operating systems. For the Windows operating system we chose the version for Windows 10; one can choose either the Python 3.6 version or the Python 2.7 version as required. The site then asks for an email address; you may provide it or skip this option, and the download will start. After the download completes, run the downloaded file to install it. The installer opens with a minimal interface; follow the instructions, read the license agreement, and click OK. The installation process then starts, and after a while the package is installed.

3.1.2 Anaconda Navigator

Anaconda Navigator is a graphical alternative to the command line that comes with the Anaconda distribution. It is available on major operating systems such as Windows, macOS, and Linux. It makes it easy for users to launch applications. Conda packages, environments, and channels can be managed with Anaconda Navigator without using the command line. Navigator can search for packages both in a local Anaconda repository and on Anaconda Cloud. Installing packages in an environment, running them, and updating them is less complex in Anaconda Navigator than on the command line.

3.2 Jupyter Notebook

Jupyter Notebook comes with the full Anaconda package; it was previously known as IPython Notebook. It runs in the computer's browser like a web application. Code is written in input boxes called “cells”. An advantage of Jupyter Notebook is that a particular block of code can be run separately, whereas a conventional interpreter runs the whole script at once. A notebook can contain code, text, mathematical plots, and other kinds of media. Notebook files can be saved on any local disk and generally carry the “.ipynb” extension.

A screenshot of jupyter notebook will be added here

Using the “Download As” option in the web interface, a Jupyter Notebook can be saved in different standard formats such as HTML, PDF, and Python file, among others. It can also be converted from the shell using “jupyter nbconvert” or the nbconvert library. The nbconvert library is also provided as a service through NbViewer, which takes the URL of a notebook document and converts it to HTML or another format. This simplifies displaying the notebook to the user.

Python Notebook interface

Jupyter Notebook provides a browser-based REPL built on a number of open-source libraries. To support programming in different languages, Jupyter Notebook can connect to many kernels. There are already 49 Jupyter kernels available for languages such as R, Python, Julia, and Haskell. The Python kernel ships with Jupyter Notebook as the default kernel.

The notebook interface was added to IPython in its 0.12 release in December 2011. In 2015 it was renamed Jupyter Notebook. Although its interface is similar to those of Maple, SageMath, and Mathematica, it has gained vast popularity in recent years and overtook Mathematica in popularity in 2018.

3.2.1 Installing Jupyter Notebook

3.2.1.1 Prerequisite

Python

As discussed previously, Jupyter can run code in several languages, and each language has its own requirements. Python is required to run Python code in Jupyter Notebook. We installed it using Anaconda, whose installation process is described in the previous section.

3.2.1.2 Installing Jupyter using Anaconda and conda

For new users, installing Anaconda is the best course. Anaconda conveniently installs Jupyter Notebook, Python, and other popular packages for data science, machine learning, and scientific computing.

Installation steps:

First, download Anaconda. It is best to download Anaconda’s latest Python 3 version (current version is Python 3.5).

Install the version of Anaconda that you downloaded, following the instructions on the download page.

When the installation is complete, Jupyter Notebook is ready to run.

3.3 Natural Language Toolkit

The Natural Language Toolkit, commonly known as NLTK, is a suite of libraries and programs, written in Python, for symbolic and statistical natural language processing (NLP) of English text. NLTK was developed by Edward Loper and Steven Bird in the Department of Computer and Information Science at the University of Pennsylvania. It includes graphical demonstrations and sample data. NLTK is accompanied by a book that explains the underlying concepts behind the language processing tasks the toolkit supports, as well as by a cookbook.

NLTK is designed to support research and teaching in NLP and closely related areas, including cognitive science, empirical linguistics, information retrieval, artificial intelligence, and machine learning. NLTK has been used effectively as a teaching tool, as an individual study tool, and as a platform for prototyping and building research systems. As many as 32 universities in the US, and at least 25 countries, use NLTK in their courses. NLTK supports classification, stemming, tokenization, parsing, tagging, and semantic reasoning functionalities.

3.3.1 Library highlights

Lexical analysis: Word and text tokenizer

n-gram and collocations

Part-of-speech tagger

Word lemmatizer

Tree model and Text chunker for capturing

Named-entity recognition

CHAPTER 4

METHODOLOGY

4.1 Modules:

4.1.1 Creating the Dataset

Creating the dataset consists of a few steps: selecting the data source, collecting the data, and labeling the data. These steps are discussed below.

4.1.1.1 Data Collection

We chose Twitter as the data source for this work. Data can be collected from Twitter through the Twitter API, which requires a few steps. First, you need to create a Twitter developer account; to do so you apply and state exactly what you are going to do with the account and the purpose of collecting data from Twitter. Once the application is approved, you obtain an access token and secret, with which you can collect tweets on a particular topic, as required, by using particular keywords.

4.1.1.2 Labeling Data

After collection, each tweet has to be identified as hate speech or not. The data was labeled manually into exactly two categories: tweets that appear hateful were labeled “hate_speech” and all other tweets were considered “non-hate tweets”. A hate tweet is given the class label 1 and a non-hate tweet the class label 0.
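The mapping from manual annotations to numeric class labels could look like the following minimal sketch. The annotation string "non_hate" and the sample annotations are assumptions made for illustration; only the 1/0 encoding comes from the text.

```python
from collections import Counter

# "hate_speech" follows the text; "non_hate" is an assumed spelling for the other category.
LABEL_TO_CLASS = {"hate_speech": 1, "non_hate": 0}


def encode_labels(annotations):
    """Convert manual annotations to numeric class labels (1 = hate, 0 = non-hate)."""
    return [LABEL_TO_CLASS[a] for a in annotations]


# Hypothetical annotations for four tweets.
annotations = ["non_hate", "hate_speech", "non_hate", "non_hate"]
class_labels = encode_labels(annotations)
class_balance = Counter(class_labels)  # how many tweets fall into each class
```

Counting the labels in this way is a quick check on how balanced the two classes are before training.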

4.1.2 Hate Speech Detection

4.1.2.2 Preprocessing

The preprocessing method used in this work has the following steps.

1) Retweet removal

2) Cleaning text

3) Lowercasing

4) Spelling correction

5) Negation handling

6) Stop word removal.

For preprocessing tweets we used the Natural Language Toolkit (NLTK), one of the commonly used NLP tools. NLTK is a Python-based platform for working with NLP. It was developed by Steven Bird and Edward Loper and has been available to users since 2001 under the Apache 2.0 license. It provides various tools for classification, POS tagging, parsing, tokenization, stemming, and semantic reasoning, along with 50 corpora and lexical resources. NLTK supports research in NLP and closely related areas such as machine learning, artificial intelligence, linguistics, and information retrieval.
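The preprocessing steps listed above can be sketched with the Python standard library alone. This is an illustrative approximation, not the pipeline actually used in this work: the stop-word list, the negation list, and the regular expressions are assumptions, and spelling correction is left out because it needs an external dictionary.

```python
import re

# Tiny illustrative stop-word list (assumption); NLTK ships a much fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "this", "that", "i"}
NEGATIONS = {"not", "no", "never"}


def is_retweet(text):
    """Treat tweets that start with 'RT @' as retweets."""
    return text.startswith("RT @")


def clean_text(text):
    """Strip URLs, @mentions, the '#' sign, and non-letter characters."""
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    text = text.replace("#", " ")
    text = re.sub(r"[^A-Za-z' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def handle_negation(tokens):
    """Merge a negation word with the token after it, e.g. not good -> not_good."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def preprocess(tweets):
    """Apply retweet removal, cleaning, lowercasing, negation handling, stop-word removal."""
    processed = []
    for text in tweets:
        if is_retweet(text):
            continue
        tokens = clean_text(text).lower().split()
        tokens = handle_negation(tokens)
        processed.append([t for t in tokens if t not in STOP_WORDS])
    return processed


# Hypothetical input tweets.
sample = [
    "RT @someone: copied text",
    "This is NOT good at all!! http://example.com #fail @user",
]
result = preprocess(sample)
```

Negation handling runs before stop-word removal, matching the order of the list, so that a negation is attached to the following word before any words are dropped.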

2. Feature Extraction: After gathering a large tweet corpus, we built and trained classifiers for tweet sentiment analysis. We examined two classifiers: Naïve Bayes and Support Vector Machine. For each classifier we extracted the same features from the tweets. To build the feature set, we processed each tweet, extracted meaningful features, and created a feature matrix using the unigram technique. For example, if a positive tweet contains the word “sorrow”, a feature for classification would be whether or not a tweet contains the word “sorrow”.
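A unigram presence feature matrix of the kind described here could be built as in the following minimal sketch. The function names and the three sample tweets are assumptions for illustration; each column answers the question "does this tweet contain the word?".

```python
def build_vocabulary(tokenized_tweets):
    """Map each distinct unigram to a column index (sorted for reproducibility)."""
    vocab = sorted({tok for tweet in tokenized_tweets for tok in tweet})
    return {tok: idx for idx, tok in enumerate(vocab)}


def to_feature_matrix(tokenized_tweets, vocab):
    """Binary matrix: entry [i][j] is 1 if tweet i contains unigram j, else 0."""
    matrix = []
    for tweet in tokenized_tweets:
        row = [0] * len(vocab)
        for tok in tweet:
            if tok in vocab:
                row[vocab[tok]] = 1
        matrix.append(row)
    return matrix


# Hypothetical, already tokenized tweets.
tweets = [["full", "of", "sorrow"], ["happy", "day"], ["sorrow", "again"]]
vocab = build_vocabulary(tweets)
X = to_feature_matrix(tweets, vocab)
sorrow_column = [row[vocab["sorrow"]] for row in X]
```

Here `sorrow_column` marks the first and third tweets, the ones that contain the word “sorrow”.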

3. Training Module: The generated data is used as the training dataset to train the model for sentiment analysis. When the model is evaluated on the test dataset, it outputs the tweet sentiment labels. We use this dataset for detecting hate speech.
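Separating the labeled data into a training part and a test part might be sketched as follows. The 20% test fraction and the fixed seed are assumptions; the text does not state the split that was actually used.

```python
import random


def train_test_split(samples, labels, test_fraction=0.2, seed=42):
    """Shuffle indices with a fixed seed and split them into train and test parts."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(samples) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([samples[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([samples[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test


# Hypothetical data: ten placeholder tweets with alternating labels.
samples = [f"tweet {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]
(train_x, train_y), (test_x, test_y) = train_test_split(samples, labels)
```

Shuffling indices rather than the samples themselves keeps each tweet paired with its own label.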

4. Classification and Evaluation: We used a supervised learning approach to classify tweets and compared the performance of the NB and SVM algorithms on our dataset.

Naïve Bayes :

Naïve Bayes is a probabilistic ML algorithm that is frequently used in tasks such as text classification, recommendation systems, sentiment analysis, and spam detection. In spite of being a simple algorithm, it can outperform many more sophisticated algorithms. It is based on Bayes' theorem, named after Thomas Bayes. Its popularity has grown rapidly since the 1960s. The theorem is given by the equation below.
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

Here P(A) and P(B) are the probabilities of observing A and B independently of each other. P(A | B) and P(B | A) are conditional probabilities: P(A | B) is the probability of A given that B has occurred, and P(B | A) is the probability of B given that A has occurred.
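A small worked example may make the theorem concrete. Take A to be "the tweet is hateful" and B to be "the tweet contains a given word". All counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# Hypothetical counts: 100 tweets, 20 of them hateful;
# the word occurs in 12 hateful tweets and in 4 non-hateful tweets.
n_total, n_hate = 100, 20
n_word_in_hate, n_word_in_other = 12, 4

p_hate = n_hate / n_total                                # P(A)   = 0.20
p_word_given_hate = n_word_in_hate / n_hate              # P(B|A) = 0.60
p_word = (n_word_in_hate + n_word_in_other) / n_total    # P(B)   = 0.16

p_hate_given_word = posterior(p_word_given_hate, p_hate, p_word)
```

The result is 0.75, which agrees with counting directly: of the 16 tweets that contain the word, 12 are hateful.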

The Naïve Bayes classifier assumes that the value of one feature is independent of the value of any other feature, given the class. Each feature contributes independently to the overall probability, so a change in the value of one feature does not affect the others. This assumption is also a drawback of Naïve Bayes, because features in real text are often correlated. Naive Bayes models are also known as independence Bayes and simple Bayes. There are different NB classifier models: Gaussian naive Bayes, multinomial naive Bayes, and Bernoulli naive Bayes. One advantage of the algorithm is that it can be coded easily in a programming language such as Python or R and can be trained very efficiently. Another advantage is that it runs quickly and works well with very large amounts of data. For these reasons the algorithm is used in many applications.
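To show how little code the algorithm needs, here is a minimal multinomial Naïve Bayes sketch in plain Python. It is not the implementation used in this work; the class name, the Laplace smoothing parameter `alpha`, and the toy training data are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Minimal multinomial Naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {t for tokens in docs for t in tokens}
        self.total_words = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def log_score(self, tokens, c):
        """log P(c) plus the sum of log P(token | c); unseen tokens are ignored."""
        score = math.log(self.class_counts[c] / sum(self.class_counts.values()))
        denom = self.total_words[c] + self.alpha * len(self.vocab)
        for t in tokens:
            if t in self.vocab:
                score += math.log((self.word_counts[c][t] + self.alpha) / denom)
        return score

    def predict(self, tokens):
        return max(self.classes, key=lambda c: self.log_score(tokens, c))


# Hypothetical toy data: label 1 = hateful, label 0 = not hateful.
docs = [["you", "are", "awful"], ["awful", "people"], ["nice", "day"], ["you", "are", "nice"]]
labels = [1, 1, 0, 0]
model = MultinomialNB().fit(docs, labels)
pred_a = model.predict(["awful", "you"])
pred_b = model.predict(["nice"])
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a word that was never seen in one class from forcing that class's probability to zero.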

Support Vector Machine:

In machine learning, support-vector machines (SVMs), also known as support-vector networks, are supervised learning models. In supervised learning the class labels are known. When the class labels are unknown, supervised learning is not possible and unsupervised learning is used instead. There is also a support-vector clustering algorithm, created by Hava Siegelmann and Vladimir Vapnik. In this work we used supervised learning, as we know the target classes. SVM is available in many tools, such as Weka, scikit-learn, MATLAB, kernlab, and OpenCV.

An SVM draws a separating line (in general, a hyperplane) between the two categories of data in a data set. The data points closest to this boundary are known as support vectors, and the distance between the boundary and these points is the margin. There are two kinds of margin: hard margin and soft margin. When the data of the two classes can be separated linearly, a hard margin is used; when the data is not linearly separable, a soft margin is used.
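The idea of a soft-margin linear SVM can be sketched with plain batch subgradient descent on the regularized hinge loss. This is an illustrative toy, not the solver used by libraries such as scikit-learn or Weka; the regularization strength `lam`, learning rate `lr`, number of epochs, and the six data points are all assumptions.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize (lam/2)*||w||^2 + mean(max(0, 1 - y*(w.x + b))) by subgradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # point lies inside the margin or is misclassified
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable 2-D points with labels +1 / -1.
X = [[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -3], [-2, -3]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
```

Only points whose margin is below 1 contribute to the update, which reflects the fact that the boundary is determined by the points nearest to it. A larger `lam` favours a wider margin at the cost of tolerating more margin violations.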

Tasks such as classification, detection, and regression can be performed with support vector machines. Over the years SVM has become very popular in industrial work. It is frequently used in fields such as face detection, image categorization, text classification, bioinformatics, handwriting recognition, and more.

CHAPTER 5

RESULT

Using the SVM and NB algorithms we obtained accuracies of at least 90%. The accuracy is given below, and the rates of precision and recall, together with the true positive, false positive, true negative, and false negative counts, are presented in a table. The resulting dataset had a size of 5000 and consists of both hate tweets and non-hate tweets. Based on the experimental results, Naïve Bayes achieved an accuracy of 94.63% and SVM an accuracy of 92.32%. The results also showed that, instead of using word bigrams or word trigrams alone, it was better to use the union of word bigrams and word trigrams. Adding character n-grams and negative sentiment to the feature sets was not needed. While a previous study reported that RFDT, BLR, and SVM had the same performance in detecting hate speech, we found that NB performed better than SVM.
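For reference, accuracy, precision, and recall follow directly from the true/false positive and negative counts. The sketch below uses a handful of hypothetical predictions, not the results of this work.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) with `positive` as the hate-speech class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall); ratios with an empty denominator become 0.0."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical gold labels and classifier predictions for eight tweets.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy, precision, recall = metrics(tp, fp, tn, fn)
```

Precision measures how many tweets flagged as hateful really are hateful, while recall measures how many of the hateful tweets were found; reporting both alongside accuracy matters when the two classes are not equally frequent.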

CHAPTER 6

CONCLUSION AND FUTURE SCOPE

6.1 Conclusion

In this paper we discussed some work previously done by other researchers and described their techniques and methodology. We also made a comparative study, although our main focus was to present our own work. In our research we collected raw data, processed it, and created a dataset of 5000 clean tweets consisting of both hate tweets and non-hate tweets. The data was annotated manually and then used in our classifiers. We applied two different algorithms, SVM and NB. The accuracy of SVM was 92.32% and the accuracy of NB was 94.63%. In our view this work has social value in making the environment of social media pleasant and user friendly. By keeping bullies away, social media can be made more convenient for peaceful people. In future work we intend to research the detection of hate speech in the Bengali language with improved techniques.

6.2 Future Scope

Applying this approach as a real-time application on social media such as Facebook and Twitter can prevent this kind of activity by users. The accuracy of the process can be improved by using more advanced algorithms. By analysing data with similar techniques, many crimes in the virtual world can be predicted or detected.

References

[1]

Wikipedia, “Naive Bayes classifier,” 25 07 2019. [Online]. Available: https://en.wikipedia.org/wiki/Naive_Bayes_classifier.
