On Twitter and Anonymity

Over the past few weeks, I’ve found myself fascinated by the context that location data can provide to a narrative. The advent of mobile applications and the cheap and reliable equipment they run on has allowed the internet to slip silently into the very center of our lives. Many of these products have been augmented or improved by way of GPS, and have given birth to a vast new data resource for people like me to explore. Twitter, among all the modern web applications, has risen to the top as one of the most efficient forms of real-time communication available. It conveniently supplies the world with an outlet to say pretty much anything (so long as it’s 140 characters or fewer), and it collects vast amounts of consumer information in the process.

The ‘Twitterverse,’ as it is sometimes called, generates more than 500,000,000 tweets per day, each one providing a clue about the user, their interests & opinions, their habits, and increasingly where they choose to go. I read somewhere that roughly 120,000,000 of those per day, more than 20% of the world’s tweets, communicate not just words but location. This is an interesting opportunity because it allows us to tie people’s sentiment to their conversations and the places they go. It adds a layer of context to what would have ordinarily been the noisy anonymity of the web. This is not a new idea. As a matter of fact, Twitter itself has been mapping this stuff for several years:

Visualization: Europe

Twitter, much like most other large social networks, has developed an API (application programming interface) for people to use when they want to connect to this vast ocean of data. The API is how the mobile apps, extensions, and third parties we use connect to Twitter, allowing us to interact with it. There are a handful of ways to get to it, but my chosen method was a Raspberry Pi, some Python scripting, sqlite3, and a bit of manual labor with R for text analysis. The bulk of the Twitter API is well documented on their developer page, but I cheated a little bit and used the Tweepy library with Python to simplify automating everything. I have this running on my Raspberry Pi day and night, and the data used in my analysis was gathered from December 15th through 19th. I realize this may not be a representative sample of Baltimore’s true online presence, but it’s certainly a place to start. There are roughly 45,000 tweets contained in this database, but I continue to collect more. The process for gathering them is pretty straightforward:

  1. Sip from the firehose of streaming tweets, taking only those inside a bounding box around Baltimore
  2. Gather only the tweets with Lat/Lon coordinates (only about 20% of tweets have geotags)
  3. Record the necessary data from this stream
  4. Dump the collected data continuously to an SQLite database for later evaluation
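
For anyone who wants to try the filtering in steps 1 and 2, here is a minimal sketch. It is not the script running on the Pi. It assumes each tweet arrives as a decoded JSON dictionary in the Twitter v1.1 layout, where a geotag is a GeoJSON point under the `coordinates` key. The bounding box values and the function names are made up for illustration, and the Tweepy wiring that would feed tweets into it is left out.

```python
# Hypothetical bounding box around Baltimore: (west, south, east, north).
# These corner values are illustrative, not the ones used for this post.
BALTIMORE_BOX = (-76.90, 39.10, -76.30, 39.50)


def extract_point(tweet):
    """Return (lat, lon) for a geotagged tweet dict, or None if it has no point."""
    coords = tweet.get("coordinates")
    if not coords or coords.get("type") != "Point":
        return None
    lon, lat = coords["coordinates"]  # GeoJSON order is longitude, then latitude
    return lat, lon


def in_box(point, box=BALTIMORE_BOX):
    """True when a (lat, lon) pair falls inside the bounding box."""
    lat, lon = point
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north


def keep(tweet):
    """Steps 1 and 2: keep only geotagged tweets that fall inside the box."""
    point = extract_point(tweet)
    return point is not None and in_box(point)
```

A tweet with no geotag, or one tagged outside the box, is simply dropped before anything is written to disk.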

Once I had a database of tweets I could move it over at my discretion for analysis on my laptop. I extracted 4 major pieces of information:

  1. WHO: The user name of the person tweeting
  2. WHAT: The unedited text of the tweet, including hashtags and urls.
  3. WHEN: The datetime stamp attached to the tweet
  4. WHERE: The Lat/Lon coordinates attached to the tweet
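
Storing those four pieces can be sketched with Python's built-in `sqlite3` module. The table and column names below are assumptions for illustration, not the actual schema behind this post, and the tweet is again assumed to be a v1.1-style dictionary.

```python
import sqlite3

# Hypothetical schema: one row per tweet, one column per piece of information.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tweets (
    user_name  TEXT,  -- WHO
    text       TEXT,  -- WHAT
    created_at TEXT,  -- WHEN (UTC timestamp string, stored as delivered)
    lat        REAL,  -- WHERE
    lon        REAL
)
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store(conn, tweet):
    """Write the who/what/when/where of one geotagged tweet."""
    lon, lat = tweet["coordinates"]["coordinates"]  # GeoJSON order: lon, lat
    conn.execute(
        "INSERT INTO tweets VALUES (?, ?, ?, ?, ?)",
        (tweet["user"]["screen_name"], tweet["text"], tweet["created_at"], lat, lon),
    )
    conn.commit()
```

Committing after every insert is slow but means nothing is lost if the collector is unplugged mid-stream; batching commits would be the obvious trade if volume grew.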

It was from this data that I’ve assembled a snapshot of Baltimore’s “twitterverse.”

Before I dive much deeper into this post, however, I would like to preface it by saying that the internet is a very strange place.

Skimming Twitter activity and poring over the things I uncover is a bizarre experience, and I find myself struggling with whether or not this counts as voyeurism. There’s a pretty broad spectrum of material to be found here, ranging from the benign to the downright appalling. Some of this content is sexually explicit, profane, racially charged, violent, or otherwise sprinkled with goodness. The bulk of what I’ve gathered is conversational; people talking with one another in a more or less casual way. The thoughts and opinions they share are often deeply personal, which I find interesting given how public their chosen forum is. I liken this to walking through a crowded airport terminal shouting out the minute details of my personal life for anyone to hear. These people are putting their lives out there in an incredibly public way, and I can’t help but worry that they don’t realize it. Adding location to this mix only adds to the problem, because they lose a degree of anonymity in the process. I can say, however, that I’ve learned a tremendous amount about WHO this city really is, and have gained a great deal of respect for my own choices when communicating online.

With that out of the way, I wanted to put my finger on the city’s pulse and watch how residents spent their time. To start, I ran a quick histogram of tweet volume by hour from the data set. After adjusting for time zones (tweets are timestamped in UTC), you can see the lifecycle of the day. The early morning hours are dominated by night owls, whose activity tapers off towards sunrise. Around 7 AM, people crawl out of bed, and the daytime chatter picks up. The peak of activity happens between 7 and 10 PM. There’s also a notable dip during rush hour, presumably because people have begun to heed tweeting & driving PSAs:

When people tweet
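
The hourly counts behind a chart like this can be sketched in a few lines of Python. Two assumptions are baked in: timestamps are stored as v1.1-style `created_at` strings, and a fixed UTC-5 offset stands in for Baltimore's local time, which only holds for a winter sample like this one. The function names are illustrative.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-5 (Eastern Standard Time). Fine for a December sample;
# a year-round data set would need real time zone rules for daylight saving.
EASTERN = timezone(timedelta(hours=-5))

# Assumption: timestamps look like "Wed Dec 18 14:03:11 +0000 2013".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def local_hour(created_at):
    """Convert a UTC timestamp string to the local hour of day (0-23)."""
    utc_time = datetime.strptime(created_at, CREATED_AT_FORMAT)
    return utc_time.astimezone(EASTERN).hour


def tweets_per_hour(timestamps):
    """Return a 24-element list of tweet counts, indexed by local hour."""
    counts = Counter(local_hour(ts) for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]
```

Skipping the time zone shift would slide the whole curve five hours to the right and make the evening peak look like a middle-of-the-night one.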

I’ve also mapped these people against a basic shape of the city for some geographical context. The bounding box I’ve drawn is larger than the city footprint so as to capture things like the airport and the suburban areas in Baltimore and Anne Arundel Counties:

Each dot is one tweet

There are a few hot spots, usually high schools, shopping centers, or hospitals. Unsurprisingly, there are also a couple of dead zones where people are few, like the industrial areas by the Port of Baltimore and Sparrows Point. The people there are outnumbered by buildings and boats, which don’t tweet quite so often.

There are a handful of people in this dataset who are rather prolific, averaging more than 10 posts per hour. This allows for a more or less constant stream of location data for specific individuals. These people are effectively telling me where they live and work. Unfortunately, most of these people’s activity was concentrated around only one spot, usually a house or building. This made generating heat maps problematic because they stood out as bright dots on the map. Their tweets look a lot like this:

To avoid this problem, I computed the standard deviation of the lat/lon coordinates for each user’s tweets. I knew that anyone with a low SD would be tweeting from a single point, and I passed over them for analysis. I decided instead to follow the @Baltimore311 account through its daily routine. They are pretty active, tweeting several times per hour from all over the city. Every time they receive a call, they update their Twitter feed, which in turn updates the world on the location of the 311 request. Complaints vary a bit, but are usually related to parking, dirty alleys, abandoned vehicles, or requests to mow city-owned lots. In total, there were about 570 updates in the days I collected data, and you can see them jump around as new calls are opened or closed. That’s because each update is geocoded by the phone reporting the issue. In essence, this is hundreds of different people, all reporting their location to Twitter through the 311 app.

@Baltimore311
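
The standard deviation check used to set aside single-spot tweeters can be sketched as below. The cutoff value is a guess for illustration (0.005 degrees of latitude is roughly half a kilometer), not the threshold used for this post, and the input is assumed to be simple `(user, lat, lon)` rows pulled from the database.

```python
from collections import defaultdict
from statistics import pstdev

# Hypothetical cutoff in degrees; anything tighter is treated as "one spot".
MIN_SPREAD_DEG = 0.005


def stationary_users(rows, min_spread=MIN_SPREAD_DEG):
    """Return the set of users whose tweets barely move.

    rows is an iterable of (user, lat, lon). A user with a single tweet has
    a standard deviation of zero, so they are counted as stationary too.
    """
    lats, lons = defaultdict(list), defaultdict(list)
    for user, lat, lon in rows:
        lats[user].append(lat)
        lons[user].append(lon)
    return {
        user
        for user in lats
        if pstdev(lats[user]) < min_spread and pstdev(lons[user]) < min_spread
    }
```

Everyone returned by `stationary_users` would be left out of the heat map, which keeps a prolific couch tweeter from showing up as the brightest dot in the city.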

Digging deeper, I started to look at specific tweeters with lots of activity. One individual, whose identity I will not reveal, stood out from the rest as the most interesting user of all. This person, who I will call the “Professor,” displays 3 behaviors that set them apart from the crowd:

  1. They have a nearly insatiable appetite for adult content posted to Twitter.
  2. They tweet letter grades in response to this content, indicating their assessment of its quality.
  3. They do this dozens of times per hour from the comfort (and privacy?) of their home.

The Professor does nothing else with their Twitter account, and in the few days I was collecting data they tweeted several hundred times. Naturally, I took the liberty of tallying the scores to see how picky this person is. There was a great deal of variety in the scoring (A, AA, AA+, F-, D++, A+, etc.), so I’ve lumped them together into some sensible categories. Overall, they seem to be pretty generous with the passing grades, with only 8% scoring worse than a C. This makes me wonder what it would take to earn an “F-” from the Professor, and before you ask, no, I didn’t look.
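
Lumping the grades into categories can be sketched like this. The original tallying was a bit of manual labor with R, so this Python version is only an illustration. It assumes a grade shows up as a standalone token such as "A", "AA+", "D++" or "F-", and it collapses every variant down to its base letter. The sample tweet format and the function names are made up.

```python
import re
from collections import Counter

# Assumption: a grade is one repeated letter followed by optional +/- signs.
GRADE_TOKEN = re.compile(r"([ABCDF])\1*[+-]*")


def grade_letter(text):
    """Return the base letter of the first grade-like token, or None."""
    for token in text.split():
        if GRADE_TOKEN.fullmatch(token):
            return token[0]
    return None


def tally(texts):
    """Count tweets per base letter, ignoring tweets with no grade."""
    letters = (grade_letter(text) for text in texts)
    return Counter(letter for letter in letters if letter is not None)


def share_below_c(counts):
    """Fraction of graded tweets that scored a D or an F."""
    total = sum(counts.values())
    return (counts["D"] + counts["F"]) / total if total else 0.0
```

One known weakness: a tweet that starts with the ordinary English word "A" would be counted as a grade, so this only suits an account that tweets little besides grades.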

This also highlights the importance of knowing the size and shape of your internet footprint. The Professor is almost certainly unaware that I’m writing this, but they have inadvertently told me where they live, what they do with their free time, and given me a peek deep into their personal life. That’s a pretty eye-opening thought, and one that goes far beyond the scope of a simple blog about Baltimore. When I first set out to do this, I had envisioned finding the unique “fingerprint” of Baltimore’s online world and writing about it (I still plan to do this). What I found instead was the obvious but scary reality that people are generally too passive with their privacy, leaving the door wide open to anyone, good or bad.