Hinglish Sentiment Analysis
In India we are seeing the growing popularity of Hinglish, an informal language that mixes English and Hindi words, or writes Hindi words in the Roman script instead of the traditional Devanagari script. For example, "mereko yeh dekhna pasand hai" means "I like watching this". Very little work has been carried out on analyzing and classifying sentiments written in Hinglish. In this blog I explain my proposed approach, which aims at correctly classifying the sentiments received on YouTube videos and Instagram posts.
Proposed Approach
The aim of the proposed approach is to create a user-friendly website where users can enter the link of their YouTube video or Instagram post, along with their login credentials, and receive a detailed analysis of the comments on their content, segregated into positive, negative, and neutral comments. The platform is intended especially for content creators who wish to analyze how their content is being perceived by their audience. Comments from the posts and videos are extracted using an API in Python for analysis. To perform the sentiment analysis, I extracted a YouTube dataset from a GitHub repository that contains a mixture of English and Hinglish comments. The comments are first passed through the Google Translator API to convert the Hinglish comments into English for analysis. Data preprocessing and cleaning is the next step, extracting only the useful words and producing structured data for analysis. The structured data then has to be converted into a form the model can understand: tokenization turns the human-readable sentences into a vectorized form that is machine-understandable. For classifying the type of sentiment I use a deep learning approach based on LSTMs (Long Short-Term Memory networks), a special kind of RNN (Recurrent Neural Network) that is widely used for text data mining tasks such as sentiment analysis.
A. Training The Model:
1) Data Collection:
There are a number of datasets available for sentiment analysis, but choosing the right one for accurate and satisfactory results is a tedious task. The dataset was collected from Kaggle and consists of Yelp reviews. The reviews contain a mix of languages other than English and total over 3.9 lakh (390,000) samples, which are biased towards positive comments; the distribution between the three categories of comments is not equal. Our model is trained on 90 thousand comments taken from this dataset, consisting purely of English comments, with 30 thousand comments of each category, i.e. positive (Fig. 3), negative (Fig. 4), and neutral (Fig. 2). The dataset has been divided in a ratio of 70% to 30%, of which 70% is used for training and the rest for testing.
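As a minimal sketch of this step (the CSV file name and the "text" and "sentiment" column names are assumptions made for illustration), the Yelp data can be balanced to 30,000 comments per class and split 70/30 with pandas and scikit-learn:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Yelp reviews CSV (file and column names are assumed for illustration)
df = pd.read_csv("yelp_reviews.csv")  # columns: "text", "sentiment"

# Keep an equal number of comments per class to remove the positive bias
balanced = (
    df.groupby("sentiment", group_keys=False)
      .apply(lambda g: g.sample(n=30000, random_state=42))
)

# 70% training / 30% testing split, stratified so each class stays balanced
train_df, test_df = train_test_split(
    balanced, test_size=0.3, stratify=balanced["sentiment"], random_state=42
)
```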
2) Data preprocessing
a) Removal of unwanted symbols and emojis:
The modified data acquired from Kaggle, containing an equal number of positive, negative, and neutral comments after the changes described above, goes through a preprocessing step to remove text that is not useful for analyzing the sentiment of the given text. In the training data we use the re (regular expression) library to remove unwanted symbols such as !@#$%&*, which do not give any relevant or significant information about the sentiment of the sentence.
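A minimal cleaning function along these lines (the exact set of characters kept is an assumption) could look like:

```python
import re

def clean_comment(text: str) -> str:
    """Strip symbols, emojis, and extra whitespace, keeping only words."""
    # Keep letters, digits, and spaces; drop punctuation, symbols, and emojis
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Collapse repeated whitespace and lowercase the result
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_comment("Loved it!!! 😍 #awesome"))  # -> "loved it awesome"
```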
b) Tokenization:
Tokenization is a preprocessing step that converts the given sentence into a form that can be fed into our model for classification. In tokenization each word of the sentence is broken down into a small token (assigned a number), and the sentence is stored as an array of these numbers, which can be used for the required computation by the model. This preprocessing step can be performed with the Tokenizer API provided by Keras.
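A sketch of this step with the Keras Tokenizer (the vocabulary size of 5,000 comes from the embedding layer described later; the padded sequence length of 200 is an assumption):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["i like watching this", "this video is boring"]

# Build the word index over the training texts, keeping the 5,000 most frequent words
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(texts)

# Replace each word with its integer index and pad every sequence to the same length
sequences = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(sequences, maxlen=200)
```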
3) Classification Model:
For classification we have used deep learning. Text mining, i.e. processing text to extract significant information from it, is a complex job. According to a number of studies and implementations in this field, deep learning is a more accurate method for working on text data, for example when performing sentiment analysis. One major advantage of deep learning over classical machine learning is that a deep learning architecture scans the entire data to find correlations and performs feature engineering by itself. Although the process is slower than machine learning algorithms, it is known to give better results. In our model we have used several different layers to get a satisfactory result. The model used for our classification is a sequential model, in which the layers form a stack with each layer feeding into the next; a complete sketch of this stack is given after the optimizer subsection below.
a) Embedding layer:
Keras provides an Embedding layer that can be used in neural networks for textual data. This layer works only on data that has been encoded using tokenization, so that every word is represented by a unique integer. Word embedding is a feature learning technique that maps words or phrases from the vocabulary (the set of unique words in the entire textual data) to vectors of real numbers. The embedding layer is the first hidden layer of the network and takes three arguments: the size of the vocabulary, the output size, which defines the dimension of the vector space the words are embedded into, and the input sequence length expected by the model. The vocabulary size used in our model is 5,000. The embedding layer is a flexible layer that is initialized with random weights and then learns an embedding for all the words in the training dataset.
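As a sketch of these three arguments (only the vocabulary size of 5,000 comes from the text; the output dimension of 64 and input length of 200 are assumptions):

```python
from tensorflow.keras.layers import Embedding

# Vocabulary of 5,000 words, each mapped to a 64-dimensional vector;
# every input sequence is padded/truncated to 200 tokens
embedding = Embedding(input_dim=5000, output_dim=64, input_length=200)
```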
b) Dropout:
Dropout is a regularization method used in deep learning models to prevent overfitting during training. It can be applied to all the hidden layers as well as the input layer, or to only a few hidden layers, depending on the dataset and the problem being solved. As the name suggests, dropout means dropping units of the hidden or visible layers: a dropped unit is temporarily cut off from the network by removing all of its incoming and outgoing connections. Each training pass therefore sees a slightly different, thinned configuration of the model compared with the original one. In our model we have used a dropout rate of 0.3, which means that 30% of the hidden units will be dropped out.
c) LSTM:
Long Short-Term Memory (LSTM) networks are a special type of Recurrent Neural Network (RNN). Their speciality is that they can process not only single data points, such as images, but also sequences of data, such as audio or video. The architecture of an LSTM typically consists of a cell, an input gate, an output gate, and a forget gate. The cell remembers values over arbitrary intervals of time, while the three gates regulate the flow of information into and out of the cell. Each cell is responsible for capturing the dependencies between the elements of the input sequence. LSTMs were introduced to eradicate two major problems faced when using RNNs, namely the vanishing gradient and exploding gradient problems. LSTMs are popularly used in fields such as text data mining and sentiment analysis.
d) Dense layer:
The dense layer is a non-linear, fully connected layer: every neuron in the current layer is connected to every neuron in the previous layer. An important feature of the dense layer is that it takes in all possible combinations of the features from the previous layer to produce its result. The dense layer is the final layer in the model; it provides the output and classifies the sentiments as positive, neutral, or negative.
e) Activation Function used:
We have used the softmax function, a mathematical function popularly used in logistic regression, which converts a vector of numbers into a vector of probabilities. Softmax is a very commonly used activation function for neural networks, generally applied in classification when there are more than two possible outcomes (multiclass classification). In our case there are three possible outcomes, i.e. whether the sentence is positive, neutral, or negative.
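A small numerical illustration of this conversion (the raw scores are made up for illustration):

```python
import numpy as np

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    exp = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return exp / exp.sum()

# Raw scores for (positive, neutral, negative)
print(softmax(np.array([2.0, 1.0, 0.1])))  # -> approx [0.66, 0.24, 0.10]
```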
f) Loss Function used:
We have used the categorical_crossentropy loss in our model, which is used for multiclass classification when the desired result is not in the form of integers but in the form of a probability vector. This loss function is best suited when a softmax activation function is used for classification.
g) Optimizer used:
In our model, we have used the Adam optimizer, a stochastic gradient descent-based method, to improve the model by reducing the loss.
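Putting the layers from this section together, here is a minimal sketch of the model in Keras. The vocabulary size of 5,000, dropout rate of 0.3, three-class softmax output, categorical cross-entropy loss, and Adam optimizer come from the text; the embedding dimension, number of LSTM units, and sequence length are assumptions.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dropout, LSTM, Dense

model = Sequential([
    # Map each of the 5,000 vocabulary indices to a 64-dimensional vector
    Embedding(input_dim=5000, output_dim=64, input_length=200),
    # Randomly drop 30% of the units during training to reduce overfitting
    Dropout(0.3),
    # LSTM layer that reads the embedded sequence
    LSTM(64),
    # Final fully connected layer giving one probability per class
    Dense(3, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```

Training would then call model.fit() on the tokenized, padded sequences and one-hot encoded sentiment labels.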
B. Testing our model:
The model trained in the earlier section will now be tested on the Yelp dataset, which was divided into training (70%) and testing (30%) splits, and then also on the comments of the post whose link the user enters on the created website.
1) Dataset collection:
The comments for sentiment analysis are collected through our website, where the user enters the link of the video whose comments they wish to analyze (for YouTube) or their user credentials along with the post number (for Instagram). Using the Google YouTube API in Python we extract all the comments of the video from the link entered by the user. Similarly, for extracting Instagram comments we have used the Instagram API in Python. Once the user enters the link of the post/video, all the comments are extracted and sent to our model for classification as positive, negative, or neutral.
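A sketch of the YouTube side of this step using the google-api-python-client package (the API key, video ID, and page limit are placeholders; the Instagram extraction is omitted here):

```python
from googleapiclient.discovery import build

def fetch_comments(video_id: str, api_key: str, max_pages: int = 5):
    """Return the top-level comment texts of a YouTube video."""
    youtube = build("youtube", "v3", developerKey=api_key)
    comments, page_token = [], None
    for _ in range(max_pages):
        response = youtube.commentThreads().list(
            part="snippet",
            videoId=video_id,
            textFormat="plainText",
            maxResults=100,
            pageToken=page_token,
        ).execute()
        for item in response["items"]:
            comments.append(
                item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
            )
        page_token = response.get("nextPageToken")
        if not page_token:
            break
    return comments
```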
2) Working upon Hinglish Sentences:
After researching on the internet, we found that no sufficient or appropriate Hinglish dataset was available for training our model. Only a few datasets exist, with very few examples, and they fail to provide the desired results.
To work on Hinglish sentences I have used the Google Trans API in Python to convert the Hinglish sentences into English, which can then be sent to our model for sentiment analysis.
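A sketch of this translation step, assuming the "Google Trans API" above refers to the unofficial googletrans package and its synchronous interface:

```python
from googletrans import Translator

translator = Translator()

# Hinglish (romanized Hindi) comment translated into English before classification
result = translator.translate("mereko yeh dekhna pasand hai", dest="en")
print(result.text)  # e.g. "I like watching this"
```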
3) Tokenization:
The comments extracted from the posts/videos go through the same tokenizer as described in the training section, converting the sentences into a form that the model can accept.
4) Prediction:
The comments extracted from the posts/videos that the user wants to analyze are classified by our trained model using the model.predict() method offered by Keras. This predicts the sentiment of the test data extracted from the comments and classifies each as positive, negative, or neutral.
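A minimal sketch of this step, reusing the tokenizer and model from the earlier sketches (the label order is an assumption and must match the encoding used during training):

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

labels = ["negative", "neutral", "positive"]  # assumed order; must match training labels

# `tokenizer`, `model`, and `comments` come from the earlier sketches
sequences = pad_sequences(tokenizer.texts_to_sequences(comments), maxlen=200)

# model.predict returns one probability per class; take the most likely one
probabilities = model.predict(sequences)
predicted = [labels[i] for i in np.argmax(probabilities, axis=1)]
```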
Conclusion
Through this study, I tried to create a platform in the form of a website where users can analyze the comments received on their Instagram posts or YouTube videos, extracted via the link of the post/video entered on our website. For training our model we used a dataset of about 90,000 comments from Yelp. For classification we used deep learning techniques, namely LSTMs, to achieve a satisfactory accuracy. Through this research I concluded that little or almost no study has been performed on Hinglish sentiment analysis, and we handled Hinglish comments using a translation API. Our aim is to provide the best results possible and to display them to the user in the form of graphs, giving a clearer picture of the distribution of the comments on their posts.