Who are the happiest Twitter users? Part 1 – Or: Localized Sentiment Analysis with the TwitterAPI

Alright. First post. The page isn’t really set up yet, but I should write something anyway. I will mostly use this space to showcase my own projects or interesting tools I stumble upon. In this post I’ll present the TwitterAPI and a simple sentiment analyzer for Twitter, which I’ve used to compare the average Twitter sentiment for the 100 most populous cities in the US[1].


The United States of America is a vast country and encompasses a lot of very different regions. Nevertheless, the people of America do not seem to differ too much from region to region. For example, the language shows no signs of separating into distinct dialects. Sure, there is the typical southern speech pattern, but compared with the regional differences of European languages, everyone speaks pretty much exactly the same. On the other hand, a difference in political inclination can easily be seen on maps showing vote percentages. I would like to know if there are differences between regions in other areas of life. Are people in San Francisco more positive than people in New York? To find out, we can look at sentiment data from both cities and check whether one shows a significant difference. An easy way to get such data is to look at Twitter feeds from those locations. We receive random statements of people talking about anything they want. If the people in both towns are actually pretty much the same, then the statements of both should not differ too much in their sentiment. In statistics we’d call both being the same our “null hypothesis”. This shouldn’t be too difficult.

Twitter API

The Key and Secret are what you need to have your program access the API.

You can generate new keys and secrets if you accidentally post them in a blog.

First, to get the Twitter data we need credentials for the Twitter API. You can get those by creating an app on apps.twitter.com[2]; an “app” here is just a name for the project for which you are using the credentials. Since we don’t need to post status updates or access our regular Twitter account, the ‘read only’ access level is enough, and we should only need the “Consumer Key” and “Consumer Secret”.

To make our life easier when we write the code we can use one of the many libraries for the Twitter API. In this project we’ll use the TwitterAPI by geduldig[3] for Python. For other languages you can check the official Twitter list of libraries on dev.twitter.com[4], which is a good resource in general.

geduldig’s TwitterAPI offers a TwitterOAuth class which reads your credentials from a text file, so you don’t have to expose your own Key and Secret in your code. By default it looks in ‘TwitterAPI/credentials.txt’ (i.e. ‘/usr/local/lib/python2.7/dist-packages/TwitterAPI/credentials.txt’). With the credentials saved and the library installed we can finally start to code.
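If I remember the format correctly, that file simply holds one key=value pair per line (placeholders here, not real keys; the two token lines only become necessary later for the Streaming API):

```
consumer_key=YOUR_CONSUMER_KEY
consumer_secret=YOUR_CONSUMER_SECRET
access_token_key=YOUR_ACCESS_TOKEN_KEY
access_token_secret=YOUR_ACCESS_TOKEN_SECRET
```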

Testing the API

Before we start the actual analyzer, we should see if our setup works. Let’s read a completely random tweet:

from TwitterAPI import TwitterAPI, TwitterOAuth, TwitterRestPager

creds = TwitterOAuth.read_file()
api = TwitterAPI(creds.consumer_key,
                 creds.consumer_secret,
                 auth_type='oAuth2')

restPager = TwitterRestPager(api,
                             'search/tweets',
                             {'q': 'lolrandom', 'count': 1})
tweetIterator = restPager.get_iterator()
message = next(tweetIterator)
print(message['text'])
@LordoftheNexus I'm trying really hard to justify my lolrandom joke

What chu lookin at?

Me too @Tiny__Squid. Anyway. We searched for a tweet containing the search term ‘lolrandom’ by feeding the TwitterRestPager the authorized api, the search command, and a JSON object containing the search term and the number of tweets we want. Next we get the iterator, which will help us in our actual project by repeatedly requesting more tweets. In our little demo here we just want one though, so we can get the message with a manual next(). The responses are (kind of) JSON objects which may contain a tweet or an error message. I wing it here for simplicity, but you should normally check if there actually is a tweet in there by asking if 'text' in message and do error handling if not.
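That check can be sketched as a small helper (handle_message is my own hypothetical name, not part of the library; the 'limit' and 'disconnected' cases show up again later when we stream):

```python
def handle_message(message):
    """Return the tweet text if the response contains one; otherwise
    report what kind of control message arrived and return None."""
    if 'text' in message:
        return message['text']
    if 'limit' in message:
        # Twitter tells us how many matching tweets were withheld
        print('{} tweets missed'.format(message['limit']['track']))
    elif 'disconnected' in message:
        print('Disconnected: {}'.format(message['disconnected']['reason']))
    return None

print(handle_message({'text': 'hello world'}))  # hello world
```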

We aren’t interested in search terms though. We want tweets from 100 specific locations. To do that we use api.request('statuses/filter', {'locations': '-74,40,-73,41'}) instead of the TwitterRestPager. This provides us with an iterator that continuously streams tweets from the specified location. We need two pairs of geo coordinates defining a bounding box around our target location (first the southwest corner, then the northeast corner, each as longitude,latitude), which means we can choose the precision or size of the area. In this example we’d receive tweets from anyone between latitude 40° and 41° North and longitude 73° and 74° West. This is rather complicated, and there is actually a search command which would let us search for “places”. But those are only used when people manually tag a place and can be used even if the user is somewhere else. Geo location data, on the other hand, is transmitted with every tweet when a user has that setting activated. The Streaming API requires the other authentication type (oAuth1), so we have to add our access_token_key and access_token_secret to the credentials.

Where is Glendale?

I’m from Germany, but I think I have an above-average grasp of the United States due to a number of friends and acquaintances living there. Nevertheless, I cannot name the 100 largest cities off the top of my head, and a good number of them don’t mean anything to me. Good thing we have people who love making lists on Wikipedia for basically any topic, so it comes as no surprise that we have a list of the most populous cities of the United States[5]. And lucky for us, they list the coordinates as well.


Spreadsheet data saved in a Google document

I just pasted the data into a Google spreadsheet and exported it as a CSV.

So where is Glendale? I don’t know, but let’s find out. We copy the Wiki data and put it in a spreadsheet. Most of the data is uninteresting to us, so we remove everything but the city and location columns. We can remove the [1]s, this degree thing °, etc., and convert the locations into individual floats by using some regex magic. In the end we have a nice CSV with city names and corresponding coordinates. I still don’t know where exactly Glendale is. But with our nicely formatted CSV and the power of Python we can find out.
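I won’t reproduce my exact spreadsheet formulas, but the same cleanup can be sketched in Python; the input format here ('40.6643°N 73.9385°W') is an assumption about how the pasted Wikipedia column looks:

```python
import re

def parse_coords(raw):
    """Turn a coordinate string like '40.6643°N 73.9385°W' into a
    (latitude, longitude) pair of signed floats."""
    m = re.search(r'([\d.]+)°([NS])\s+([\d.]+)°([EW])', raw)
    lat = float(m.group(1)) * (1 if m.group(2) == 'N' else -1)
    lon = float(m.group(3)) * (1 if m.group(4) == 'E' else -1)
    return (lat, lon)

print(parse_coords('40.6643°N 73.9385°W'))  # (40.6643, -73.9385)
```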

import pandas as pd

with open('LocationData.csv','r') as csvFile:
  df = pd.read_csv(csvFile)

print(df.head())
  City         Latitude Longitude
0 New York     40.6643  -73.9385
1 Los Angeles  34.0194  -118.4108
2 Chicago      41.8376  -87.6818
3 Houston      29.7805  -95.3863
4 Philadelphia 40.0094  -75.1333

Introduction of pandas. Somewhat more successful in spreading than its namesake, pandas is a popular data analysis library for Python. It provides a very nice data frame that makes working with labeled data more fun. In the code above we create a data frame directly from the CSV file. The .head() call shows that the data is structured in three columns: City (name), Latitude, and Longitude. The type of each column can be set in the read_csv() call. To access a single column we can use various methods. .loc[] is the standard method and should usually be used to select one or more labels. Columns can also be referred to by their name via df.City. Comparing this single column with our target value of ‘Glendale’ returns a boolean Series. We can use this to search for the city.
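To make those access patterns concrete, here’s a tiny two-row stand-in for our data frame (values copied from the table above):

```python
import pandas as pd

# A miniature version of LocationData.csv, built in memory
df = pd.DataFrame({'City': ['New York', 'Glendale'],
                   'Latitude': [40.6643, 33.5331],
                   'Longitude': [-73.9385, -112.1899]})

print(df.loc[1, 'City'])            # label-based access: Glendale
print(df['City'].equals(df.City))   # True, both select the same column
mask = df.City == 'Glendale'        # comparing a column gives a boolean Series
print(mask.tolist())                # [False, True]
```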

print(df[df.City == 'Glendale'])
   City     Latitude Longitude
88 Glendale 33.5331  -112.1899

There it is. Glendale is the 88th largest city in the US and located at 33.53 N 112.19 W. Which is…

huh. That looks like it borders on Phoenix, and Mesa isn’t far either. Interesting that these are still individual cities. Glendale is the proud home of the Arrowhead Towne Center mall and the Thunderbird post-grad business school. Population of 226,721[6].

Now we can easily access the coordinates of any large city in the US. But since we need an area for our TwitterAPI request, we have to extrapolate. I’ll just decide that we go one degree north, south, east, and west from the point and that way span an area of two by two degrees around the city. This way we might receive tweets that are not exactly from that city, or we miss tweets that are cut off, but I think they are still representative of each general area. Also, curiously, cities further up north get a smaller area this way, as the longitude lines approach the north pole and move closer together[7]. ¯\_(ツ)_/¯
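To put a number on that shrinking effect, here’s a back-of-the-envelope sketch using the rule of thumb that one degree of longitude spans about 111.32 km times the cosine of the latitude (spherical-earth approximation, so the constant is rough):

```python
import math

KM_PER_DEGREE = 111.32  # length of one degree of longitude at the equator

def box_width_km(lat_deg, lon_span_deg=2.0):
    """Approximate east-west width of a box lon_span_deg degrees wide,
    centered at the given latitude (spherical-earth approximation)."""
    return lon_span_deg * KM_PER_DEGREE * math.cos(math.radians(lat_deg))

print(round(box_width_km(33.53), 1))  # around Glendale, AZ
print(round(box_width_km(61.22), 1))  # around Anchorage, AK: much narrower
```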

Calling Twitter

Now that we have set up our data frame, we can finally start gathering tweets. For each city we will call the Streaming API and specify the bounding box around that location. Since the iterator will return tweets forever (or at least until we time out), we have to set a limit. Our first goal will be to collect 100 tweets from each city. In total that will be 10,000 tweets.

from TwitterAPI import TwitterAPI, TwitterOAuth
import pandas as pd

LIMIT = 100

creds = TwitterOAuth.read_file()
api = TwitterAPI(creds.consumer_key,
                 creds.consumer_secret,
                 creds.access_token_key,
                 creds.access_token_secret,
                 auth_type = "oAuth1")

with open('LocationData.csv','r') as csvFile:
  locations = pd.read_csv(csvFile,
                          index_col='City',
                          dtype={'Latitude':float, 'Longitude':float})

allTweets = []
for city in locations.index:
  print("Next Stop: {}".format(city))
  tweets = []

  # String representing the bounding box around the location coordinates
  (lat, lon) = locations.loc[city,:]
  geobox = "{0:.2f},{1:.2f},{2:.2f},{3:.2f}".format(lon-1,lat-1,lon+1,lat+1)

  # Continuously request tweets with that geo location
  for item in api.request('statuses/filter',
                          {'locations': geobox}).get_iterator():
    if len(tweets) >= LIMIT:
      break

    if 'text' in item:
      tweets.append(item['text'])
    elif 'limit' in item:
      print('{} tweets missed'.format(item['limit']['track']))
    elif 'disconnected' in item:
      print('Disconnected: {}'.format(item['disconnected']['reason']))

  allTweets.append( [city,tweets] )

# Save all Tweets with their respective location in a CSV
df = pd.DataFrame(allTweets)
with open('TwitterData.csv','w') as outputFile:
  df.to_csv(outputFile)

Twitter hasn't reached Alaska yet.

Now we just have to run this script and wait for the 10,000 tweets to accumulate. To get a time estimate we can track how long it takes to receive 10 tweets with time.time()[8]. Some cities have a higher tweet frequency than others, which would be interesting to study on its own[9]. New York, Chicago, and so on can deliver about a tweet per second or more. Tweets from Anchorage, on the other hand, may take 30-60 seconds each. It also depends on the time of day and the day of the week. When I started my script over here in Germany it was 2 pm, which means it was 5 am in LA. It would be best to take samples over the course of multiple time periods so that our data is more varied and unbiased. Since we have to wait on our data, I’ll publish this article now and explain the sentiment analysis in the second part once we have enough data.
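The estimate itself is just a rule of three; a hypothetical helper (not part of the script above) might look like this:

```python
import time  # time.time() gives wall-clock seconds since the epoch

def estimate_total_seconds(sample_seconds, sample_size=10, target=100):
    """Extrapolate the total collection time for `target` tweets from
    how long the first `sample_size` tweets took to arrive."""
    per_tweet = sample_seconds / float(sample_size)
    return per_tweet * target

# Timestamp with time.time() before and after the first 10 tweets;
# if they took 45 seconds, 100 tweets should take about:
print(estimate_total_seconds(45.0))  # 450.0
```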

References

1, 5. https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population
2. https://apps.twitter.com
3. https://github.com/geduldig/twitterapi
4. https://dev.twitter.com/resources/twitter-libraries
6. in 2010
7. https://en.wikipedia.org/wiki/Geographic_coordinate_system#Expressing_latitude_and_longitude_as_linear_units
8. not to be confused with time.clock(), which is deprecated since Python 3.3
9. does frequency correlate well with population?