Saturday, March 13, 2010

Twitter corpus v0.1

I am pleased to announce that we have published the first version of our Twitter corpus. All the data was collected from Twitter's streaming API over a period of about two months (November 11th 2009 until February 1st 2010). You can download the corpus from our social media website, which we set up recently. An accompanying paper gives some statistics about the corpus. One thing that might interest all the 13-year-old girls out there is that it seems Justin Bieber > Nick Jonas (see Table 3 in the paper). I believe the Twitter corpus will be of interest to anyone working in social media research and/or NLP. We plan to release subsequent versions as we get more data (and we might release older data, starting from April 2009, but more on this later).