Instagram Fake Account Detection using Machine Learning

Kapil Bhargava
10 min read · Dec 20, 2020


Introduction

In the current era, a presence on social media has become part and parcel of life. These platforms are used not just for sharing thoughts and moments but also, very widely, for promoting businesses. As a result, fields like digital marketing and targeted customer acquisition have evolved rapidly. Being open platforms, they are also exploited by clever people for fake promotion by inflating pseudo-popularity metrics. Instagram is one of the most widely used emerging social platforms, with an active user count of around 1 billion. A survey found that roughly one in ten accounts on Instagram is fake. In this article, we describe our machine learning approach to detecting and classifying fake accounts on Instagram.

Real Account: Real accounts are active accounts with a profile picture, a well-written biography, a legible username, and media posts.

Fake Account: Fake accounts usually have strange usernames and very low follower counts, and they contain zero or very few posts. The main intention behind creating such accounts is to inflate the popularity metrics of some users by increasing their follower counts. These accounts are created in bulk using automated tools.

Fake account example

Dataset Exploration and Preprocessing

Dataset Used

We used the publicly available InstaFake dataset from GitHub (link to the dataset) and increased the number of samples by 20% through manual collection and labeling. We also added one more feature, User Biography Emoji Count, to our modified dataset.

Different features used in the dataset are as follows:

Input Features: User Follower Count, User Following Count, User Biography Count, User Media Count, User has Profile Pic, User is Private, Username Digit Count, Username Length, User Biography Emoji Count

Output Label: Is Fake (0 means real, 1 means fake)

In the real world, data may be organized and clean, or it can be messy enough to make our lives difficult. It is often said that almost 80% of the time spent on a project goes into data preprocessing alone, shaping the data into what we want it to be. Therefore, data preprocessing plays a crucial role in any project. In our project too, we performed a few steps to ensure that we feed good data to our machine learning algorithms.

Preprocessing

Handling Missing Values: We added one extra feature, User Biography Emoji Count, which was not present in the public dataset, and replaced the missing values in that column with the column mean. A few other columns also had missing values; these were handled with pandas' fillna method, again replacing them with the mean of the column values.
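A minimal sketch of this step, assuming the dataset has already been loaded into a pandas DataFrame df (the column name below is an assumption for illustration):

```python
import pandas as pd

# Hypothetical name for the feature we added manually.
col = "user_biography_emoji_count"

# Replace missing values in that column with the column mean.
df[col] = df[col].fillna(df[col].mean())

# The same idea applied to every numeric column at once.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
```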

Class Imbalance Problem: Gathering fake account profiles is not easy, as they are rare to find among the many real accounts, and creating the required dataset takes a considerable amount of valuable human time. That is why the number of fake account samples is small compared to real ones: we had 1140 samples of real accounts and 253 samples of fake accounts. To build good ML classifiers, we had to solve this problem. We solved it using SMOTE (Synthetic Minority Oversampling Technique), an oversampling technique that increases the number of samples in the minority class. We assume that you are familiar with how SMOTE works; otherwise, you can spend time reading the paper (Link to the paper). After oversampling, we had 800 samples for each class.
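A minimal sketch of the oversampling step using the imbalanced-learn library (the label column name is_fake is an assumption, and the features are assumed to be numeric already; see the label-encoding step below):

```python
import pandas as pd
from imblearn.over_sampling import SMOTE

X = df.drop(columns=["is_fake"])  # input features
y = df["is_fake"]                 # 0 = real, 1 = fake

# SMOTE synthesizes new minority-class samples by interpolating
# between a minority sample and its nearest minority neighbors.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print(pd.Series(y_resampled).value_counts())
```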

Label encoding of columns: Column values must be numeric before applying SMOTE, so we performed label encoding on three columns that previously held yes/no values, encoding them as 0 or 1. We used scikit-learn's label encoder.
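A sketch of the encoding step with scikit-learn's LabelEncoder (the three yes/no column names are assumptions for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical names for the three yes/no columns.
for col in ["user_has_profile_pic", "user_is_private", "is_fake"]:
    # LabelEncoder maps the sorted unique values ("no", "yes") to 0, 1.
    df[col] = LabelEncoder().fit_transform(df[col])
```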

Visualization of data using a seaborn pairplot: It is always good to visualize data before proceeding with implementation, as it helps us understand properties of the data such as the degree of separation between classes. It also helps us select suitable algorithms for implementation.
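A single seaborn call produces pairwise scatter plots of all features, colored by class (again assuming the label column is named is_fake):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise scatter plots of every feature pair, colored by class,
# give a quick view of how separable real and fake accounts are.
sns.pairplot(df, hue="is_fake")
plt.show()
```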

Model Selection

In our project we are dealing with a binary classification problem: classifying a given user's Instagram account as real or fake. We therefore considered the standard binary classifiers, which also allows a detailed comparative study and the selection of the optimal one among them. Below we briefly describe the candidate algorithms; in later sections we will see why they are considered a good fit for our project.

Decision Tree:

It is a tree-based classifier that makes decisions based on feature values. It forms a tree-like structure: at each internal node, a condition on some feature is evaluated, and based on the result we follow one of the branches to the next level. Leaf nodes contain the classification results. The splitting features are selected using metrics such as entropy and the Gini index.

Logistic Regression: It is a very popular linear classifier. It passes a linear combination of the features through the sigmoid function to obtain the probability of the positive class:

h(x) = σ(w·x + b), where σ(z) = 1 / (1 + e^(−z))

Here h(x) is the target function, and w, b, and x are the weight vector, bias (intercept), and sample point respectively. The decision boundary is the hyperplane w·x + b = 0, and the weights are learned using gradient descent.
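As a sketch of how the weights can be learned, here is a minimal NumPy implementation of gradient descent on the log loss (illustrative only, not the scikit-learn solver we actually use later):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=1000):
    """Learn w, b by gradient descent on the log loss.
    X: (n_samples, n_features), y: (n_samples,) with 0/1 labels."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)            # predicted probabilities h(x)
        grad_w = X.T @ (p - y) / len(y)   # gradient of the loss w.r.t. w
        grad_b = np.mean(p - y)           # gradient w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```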

Support Vector Machine (SVM): SVM is a maximum-margin classifier. It defines its decision boundary with the help of support vectors, the positive and negative data points whose perpendicular distance from the decision boundary is smallest. This smallest distance, measured with the unnormalized weights, is called the minimum functional margin.

The SVM classifier selects the classification boundary that maximizes the minimum geometric margin (a scale-invariant version of the functional margin). Kernels can also be used in SVM to map the data into higher dimensions, which helps in classifying non-linearly separable data.

Neural Network: Neural networks are artificial simulations, in computers, of the biological neural networks of the human brain. They are built from perceptrons, artificial equivalents of biological neurons, connected to each other in a layered fashion. The first layer is called the input layer, the internal layers are called hidden layers, and the last layer is the output layer, which holds the classification result. Input data is fed into the input layer, processed through the hidden layers, and the final output is produced at the output layer.

In the later sections of the blog, we will implement several of these classifiers for our problem and carry out a thorough comparative analysis of their performance, which will help us understand which algorithm works better in this scenario, why, and which factors affect its performance.

Implementation

In the first half of the blog, we saw that our dataset was unbalanced and thus faced the class imbalance problem; how to deal with it is explained in the data preprocessing part of the blog. After the preprocessing stage, we have a balanced dataset. For the imbalanced dataset we do not rely on accuracy alone, as other metrics such as recall, precision, and F1 score play a crucial role in describing model performance. We therefore divided our implementation into two paths: one considering all features, and the other considering only the important features after feature selection.

We will implement the classifiers along both paths and then compare how much class imbalance affects classifier performance relative to the balanced, corrected dataset.

STEP 1: Import the necessary Python libraries that we are going to use in our project.

STEP 2: Load the dataset using pandas read_json, read_csv, etc., depending on the format of the dataset. We used both read_json and read_csv in our project.
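A sketch of these first two steps (the file names below are placeholders, not the actual dataset paths):

```python
import pandas as pd

# Placeholder file names for the two data sources.
df_json = pd.read_json("fake_accounts.json")  # the public InstaFake data
df_csv = pd.read_csv("manual_accounts.csv")   # our manually labeled samples

# Combine both sources into a single DataFrame.
df = pd.concat([df_json, df_csv], ignore_index=True)
```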

STEP 3: After preprocessing the dataset in the data preprocessing stage of the blog, we have two data frames: one unbalanced and one balanced. We will use both of these during implementation.

Let us begin the step-wise implementation of each of our classifiers.

Logistic Regression

Import the required libraries for logistic regression. Use the train_test_split function to split the data frame (after sampling) into training and testing sets. Create a logistic regression classifier and train it with the training data and labels. Use the predict function on the test data to obtain the predicted class values. Lastly, use accuracy_score, confusion_matrix, and classification_report to produce the desired results. Repeat the procedure with only the relevant features.
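A minimal sketch of these steps with scikit-learn, reusing the resampled data from the SMOTE step (the split ratio and max_iter value are assumptions):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

X_train, X_test, y_train, y_test = train_test_split(
    X_resampled, y_resampled, test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=1000)  # raise max_iter so the solver converges
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("Train accuracy:", clf.score(X_train, y_train))
print("Test accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```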

Training accuracy with all features: 0.901665344964314

Testing accuracy with all features: 0.9026128266033254

Training accuracy with 4 important features: 0.8953626634958383

Testing accuracy with 4 important features: 0.8883610451306413

Decision Tree

Import the required libraries for the decision tree. Create a decision tree classifier object and perform the same steps as in logistic regression.
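The same pipeline with a decision tree looks like this (the entropy criterion is an assumption; scikit-learn defaults to Gini):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion="entropy", random_state=42)
tree.fit(X_train, y_train)

print("Train accuracy:", tree.score(X_train, y_train))
print("Test accuracy:", tree.score(X_test, y_test))
```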

Training accuracy with all features: 1.0

Testing accuracy with all features: 0.9619952494061758

Training accuracy with 4 important features: 1.0

Testing accuracy with 4 important features: 0.98321

Support Vector Machines (SVM)

Import the required libraries for the SVM. Create an SVM classifier object and perform the same steps as in logistic regression.
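A sketch covering both kernels we report below (default hyperparameters are an assumption):

```python
from sklearn.svm import SVC

for kernel in ("linear", "rbf"):
    svm = SVC(kernel=kernel)
    svm.fit(X_train, y_train)
    print(kernel,
          "train:", svm.score(X_train, y_train),
          "test:", svm.score(X_test, y_test))
```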

USING ALL FEATURES:

Training Accuracy with the linear kernel: 0.9111816019032514

Testing Accuracy with the linear kernel: 0.8954869358669834

Training Accuracy with RBF kernel: 0.9135606661379857

Testing Accuracy with RBF kernel: 0.9121140142517815

USING ONLY 4 IMPORTANT FEATURES

Training Accuracy with the linear kernel: 0.8947681331747919

Testing Accuracy with the linear kernel: 0.8883610451306413

Training Accuracy with RBF kernel: 0.9114149821640903

Testing Accuracy with RBF kernel: 0.9121140142517815

Neural Network

Import the required libraries for the neural network. Create an MLP classifier object and perform the same steps as in logistic regression. ReLU is used as the activation function, and the layer configuration has also been set.
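A sketch with scikit-learn's MLPClassifier (the hidden-layer sizes and iteration count here are assumptions, not our exact configuration):

```python
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import log_loss

mlp = MLPClassifier(hidden_layer_sizes=(32, 16), activation="relu",
                    max_iter=500, random_state=42)
mlp.fit(X_train, y_train)

print("Train accuracy:", mlp.score(X_train, y_train))
print("Test accuracy:", mlp.score(X_test, y_test))
print("Test loss:", log_loss(y_test, mlp.predict_proba(X_test)))
```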

Before Sampling:

Train Accuracy: 96.55172413793103 %

Train Loss: 0.4866034207701727

Test Accuracy: 95.70200573065902 %

Test Loss: 0.4386049217973333

After Sampling:

Train Accuracy: 96.19349722442506 %

Train Loss: 0.19235956573174792

Test Accuracy: 92.87410926365796 %

Test Loss: 0.33758514479955243

Result Analysis

Now that the implementation of our classifiers on our dataset is done, it is time to analyze the outputs we obtained. We will carry out a comparative study of which classifiers perform well or poorly, and the reasons behind it.

Below are some important analysis points drawn from our classifiers and their outputs:

· Decision trees are easy to visualize and interpret.

· The decision tree divides the feature space into sub-regions and thereby obtains its decision boundaries, while logistic regression searches for a single linear boundary, which becomes difficult as the number of features increases.

· The decision boundary plots below show that the decision tree's classification boundary is better than the others', which is also confirmed by accuracy and the other performance metrics, where the decision tree turns out to be the best of all.

· The decision tree also handles outliers well, as they do not hamper its performance, which is not the case with the other classifiers.

· Decision trees generally work extremely well on classification problems, need less data cleaning, and achieve good performance compared with other binary classifiers.

Conclusion

After implementing the classifiers on our dataset, we now have an understanding of which classifiers perform better than others, why, and the factors affecting this. Dataset dimensions, data quality, relevance of features, and selection of hyperparameters are a few important things to keep in mind when implementing a classifier, both to achieve good results and to understand the comparative study better.

Among the implemented models, based on the output analysis and performance metrics, the decision tree classifier works very well, with high accuracy and strong values for the other metrics such as recall, precision, and F1 score.

Authors

Kapil Bhargava, MTECH CSE, IIITD (LinkedIn, Email)

· Data exploration and preprocessing, decision tree, logistic regression, Support vector machines, analysis.

Shivank Agrahari, MTECH CSE, IIITD (LinkedIn, Email, Github, Medium)

· Overview, Data exploration and preprocessing, decision tree, neural networks, analysis.

Guided and Supported By

We extend our sincere gratitude to Professor Dr. Tanmoy Chakraborty for his constant support and guidance throughout this machine learning course project.

1. Professor: LinkedIn, Facebook, Twitter

2. Professor Website: faculty.iiitd.ac.in/~tanmoy/

3. Teaching Fellow: Ms. Ishita Bajaj

4. Teaching Assistants: Pragya Srivastava, Shiv Kumar Gehlot, Chhavi Jain, Vivek Reddy, Shikha Singh, and Nirav Diwan

