## Words Sentiment Score

We have explained how to get a sentiment score for words in Python. Instead of building our own lexicon, we can use a pre-trained one like VADER, which stands for Valence Aware Dictionary and sEntiment Reasoner and is specifically attuned to sentiments expressed in social media.

You can install the VADER library with pip, `pip install vaderSentiment`, or use the version that ships with NLTK. You can have a look at the VADER documentation.

## Examples of Sentiment Scores

The VADER library returns four values:

• pos: the proportion of the text that is rated as positive
• neu: the proportion of the text that is rated as neutral
• neg: the proportion of the text that is rated as negative
• compound: the sum of the lexicon ratings of the words in the text, adjusted by VADER's rules and normalized to take values from -1 to 1
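To make the normalization behind `compound` concrete, here is a minimal sketch. It assumes the formula found in the vaderSentiment source, `x / sqrt(x^2 + alpha)` with `alpha = 15`; the function name `normalize_sum` and the sample inputs are only illustrative, and the real library applies its rule-based adjustments before this step.

```python
import math

def normalize_sum(valence_sum, alpha=15):
    """Squash an unbounded sum of word valences into the range [-1, 1]."""
    norm = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    # Clip as a safeguard, so the result never leaves [-1, 1]
    return max(-1.0, min(1.0, norm))

# Larger sums move towards -1 or 1 but never exceed them
for s in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(s, round(normalize_sum(s), 4))
```

The practical consequence is that `compound` saturates: a text with many strongly rated words approaches 1 or -1 but never goes beyond them.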

Notice that the `pos`, `neu` and `neg` proportions add up to 1 (up to rounding). Also, the `compound` score is a very useful metric when we want a single measure of sentiment. Typical threshold values are the following:

• positive: compound score >= 0.05
• neutral: compound score between -0.05 and 0.05 (exclusive)
• negative: compound score <= -0.05
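The thresholds above can be sketched as a small helper. The name `label_from_compound` and the `threshold` parameter are my own for illustration; they are not part of the VADER API.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score to 'positive', 'neutral' or 'negative'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Sample values only
for score in (0.8476, 0.0, -0.6597):
    print(score, label_from_compound(score))
```

Keeping the cutoff as a parameter makes it easy to widen or narrow the neutral band for a particular data set.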

Let’s see these features in practice. We will work with a sample of tweets obtained from NLTK.

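```
import pandas as pd
import nltk
from nltk.corpus import twitter_samples
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

nltk.download('twitter_samples')

# get 5000 positive and 5000 negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

analyzer = SentimentIntensityAnalyzer()
```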

Let’s get an arbitrary positive tweet and then a negative one.

```
# positive: idx stands for the position of the tweet you want to inspect
all_positive_tweets[idx]
```

Let’s have a look at the tweet.

`"@metalgear_jp @Kojima_Hideo I want you're T-shirts ! They are so cool ! :D"`

Let’s get its sentiment score:

```
# polarity_scores expects a single string, so pass one tweet (idx is its position)
analyzer.polarity_scores(all_positive_tweets[idx])
```

The output is 56.8% positive and 43.2% neutral. The compound score is 0.8476.

`{'neg': 0.0, 'neu': 0.432, 'pos': 0.568, 'compound': 0.8476}`

Let’s do the same for a negative tweet.

```
# negative: idx stands for the position of the tweet you want to inspect
all_negative_tweets[idx]
```

which is:

`'I feel lonely someone talk to me guys and girls :(\n\[email protected] @imarieuda @EiroZPegasus @AMYSQUEE @UdotV'`

Let’s get the sentiment score:

```
# polarity_scores expects a single string, so pass one tweet (idx is its position)
analyzer.polarity_scores(all_negative_tweets[idx])
```

The output is 70.7% neutral and 29.3% negative. The compound score is -0.6597.

`{'neg': 0.293, 'neu': 0.707, 'pos': 0.0, 'compound': -0.6597}`

## Important Note about Sentiment Scores

In most NLP tasks we need to apply data cleansing first. In my opinion, this should be avoided when we run sentiment analysis with VADER. Notice that:

• VADER is case sensitive. The sentence `This is great` has a different score than the sentence `This is GREAT`.
• Punctuation matters. Exclamation marks, for example, increase the intensity of the sentiment.
• Emoticons also have a score, and often a very strong one. Try `<3`, `:)`, `:p` and `:(`.
• Words after `@` and `#` have a neutral score.
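A quick way to check these points yourself is to score small variations of the same sentence side by side. The sketch below is only an illustration: `compare_variants` and `VARIANTS` are hypothetical names, and the import is guarded so that the snippet still loads if `vaderSentiment` is not installed. No expected scores are listed, since they depend on the installed lexicon version.

```python
try:
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
except ImportError:  # the comparison loop below is skipped in that case
    SentimentIntensityAnalyzer = None

# Same sentence with different casing, punctuation and emoticons
VARIANTS = [
    "This is great",
    "This is GREAT",
    "This is great!!!",
    "This is great :)",
    "This is great :(",
]

def compare_variants(score_fn, variants):
    """Return (text, compound) pairs so the variations can be compared."""
    return [(text, score_fn(text)["compound"]) for text in variants]

if SentimentIntensityAnalyzer is not None:
    analyzer = SentimentIntensityAnalyzer()
    for text, compound in compare_variants(analyzer.polarity_scores, VARIANTS):
        print(f"{compound:+.4f}  {text}")
```

If lowercasing or stripping punctuation changes the printed scores, that is the information a cleansing step would throw away.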

## Get the Sentiment Score of Thousands of Tweets

We will show how you can run a sentiment analysis on many tweets. We will work with the 10K sample of tweets obtained from NLTK. We start our analysis by creating a `pandas` data frame with two columns: `tweets`, and `my_labels`, which takes the values 0 (negative) and 1 (positive).

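```
# label positive tweets with 1 and negative tweets with 0
my_labels = [1]*len(all_positive_tweets)
negative_labels = [0]*len(all_negative_tweets)
my_labels.extend(negative_labels)

# append the negative tweets to the positive ones (modifies the list in place)
all_positive_tweets.extend(all_negative_tweets)

df = pd.DataFrame({'tweets' : all_positive_tweets,
                   'my_labels' : my_labels})

df
```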

Now, we will add four new columns, `neg`, `neu`, `pos` and `compound`, using a lambda function.

```
df['neg'] = df['tweets'].apply(lambda x:analyzer.polarity_scores(x)['neg'])
df['neu'] = df['tweets'].apply(lambda x:analyzer.polarity_scores(x)['neu'])
df['pos'] = df['tweets'].apply(lambda x:analyzer.polarity_scores(x)['pos'])
df['compound'] = df['tweets'].apply(lambda x:analyzer.polarity_scores(x)['compound'])
df
```
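The four `apply` calls score every tweet four times. As an alternative, the sketch below scores each text once and collects all four fields together. `score_texts` is a hypothetical helper, not part of VADER or pandas, and the usage shown in the comments assumes the `df` and `analyzer` objects from this post.

```python
def score_texts(texts, score_fn):
    """Call score_fn once per text and keep the four VADER-style fields."""
    keys = ("neg", "neu", "pos", "compound")
    rows = []
    for text in texts:
        scores = score_fn(text)  # one call per tweet instead of four
        rows.append({key: scores[key] for key in keys})
    return rows

# Assumed usage, in place of the four apply calls (not executed here):
#   scores = pd.DataFrame(score_texts(df['tweets'], analyzer.polarity_scores),
#                         index=df.index)
#   df = df.join(scores)
```

On 10K tweets the difference is small, but the single pass keeps the four columns consistent by construction and scales better to larger collections.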

## Analyze the Sentiment Score

Since we have tidied the data and gathered the required information in a structured format, we can apply any kind of analysis. For example, let’s have a look at the `compound` score for the positive and negative labels.

```
df.groupby('my_labels')['compound'].describe()
```

Let’s also have a look at the boxplot.

```
df.boxplot(by='my_labels', column='compound', figsize=(12,8))
```
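Another simple check is how often the sign of the `compound` score agrees with the label. The sketch below assumes a cutoff of 0, because the labels are binary and have no neutral class. `agreement_rate` is a hypothetical helper, and the numbers in the example are toy values, not results from the tweet sample.

```python
def agreement_rate(compounds, labels, threshold=0.0):
    """Share of items whose compound score falls on the side of its 0/1 label."""
    predicted = [1 if c > threshold else 0 for c in compounds]
    hits = sum(p == y for p, y in zip(predicted, labels))
    return hits / len(labels)

# Toy values: the last score is positive but its label is 0, so 3 of 4 agree
print(agreement_rate([0.85, 0.10, -0.66, 0.20], [1, 1, 0, 0]))  # 0.75
```

With the data frame built above, the call would be `agreement_rate(df['compound'], df['my_labels'])`.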

## Discussion

VADER appears to be a reliable tool for sentiment analysis, especially on social media comments. As we can see from the box plot above, the positive labels achieved a much higher `compound` score, with the majority above 0.5. In contrast, the negative labels got a much lower `compound` score, with the majority lying below 0.