32MTweets
Approaching Social and Contextual Biases in Tweeting
Twitter data is both a promised land and a nightmare for any sociologist. More than 1.3 billion accounts have been created on the microblogging platform since 2006, and the number of monthly active Twitter users worldwide reached 330 million in January 2018. Those users now send more than 500 million tweets per day. These impressive figures often lay the ground for research that looks at Twitter as a new form of public sphere, especially in media and opinion research. Nevertheless, strong biases arise whenever Twitter data are used to describe such things as "para-journalistic" coverage of an issue, opinion formation or even election outcomes.
Among those biases, the most important stem from two sources. The first (a) is the difference between having a Twitter account and actively tweeting: most users have very little activity on the social network (it is estimated that 44% have never sent a tweet and only 8% have sent more than 50 tweets), and among those with regular activity the probability of having many followers, and thus of being retweeted, varies a lot. The second (b) is a set of well-known socio-demographic biases (mostly gender, age and profession, as well as race in some countries).
Studying this issue is very hard because there is no systematic data documenting user attributes. A lot of research has therefore been carried out to infer user attributes from Twitter profile information, tweeting behavior, the linguistic content of tweets or social network information gathered from retweet patterns. Very stimulating results have been obtained in inferring gender (Rao et al., 2010; Liu & Ruths, 2013), age (Schler et al., 2006; Al Zamal et al., 2011), occupation and social class (Sloan et al., 2014; Preotiuc-Pietro et al., 2015; Mac Kim et al., 2016), location (Jones et al., 2007), political orientation (Thomas et al., 2006; Rao et al., 2010) and ethnicity (Pennacchiotti & Popescu, 2011; Rao et al., 2011). Other data have also been used, such as Twitter account lists to infer profession (Ke et al., 2016) or website visitor demographics (Goel et al., 2012; Culotta et al., 2015).
The 32MTweets project aims at analysing a corpus of 32.8 million tweets sent from France between 2014 and 2017. The tweets were collected with their GPS geolocation using the Twitter API, restricted by bounding box. The goal is to infer geographically the intensity of Twitter activity from various variables that can be gathered for very specific locations. Geographical inference is particularly interesting in the case of Twitter because it makes it possible to compare the influence of various kinds of variables on the tweeting activity of a given area: socio-demographic variables (such as age, gender and profession), morphological variables (such as the population density of the area or the public transportation system), contextual variables (such as the average income or the share of unemployed people) and political ones (such as participation in local elections).
All tweets are attributed to the IRIS zone they were sent from (IRIS zones are the smallest geographical unit used by INSEE, the French national office for statistical information; they usually count around 2,500 inhabitants). Census (and other) data are collected to describe every IRIS. This process resulted in a dataset of 47,484 IRIS zones counting at least one tweet between 2014 and 2017. Since some information is only available at the town level (e.g. political participation), another dataset was created with 33,881 towns (most of them small or very small towns that contain only one IRIS).
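The attribution step described above can be illustrated with a minimal sketch. This is not the project's actual pipeline, which is not documented here: it assumes each zone is available as a simple polygon of (longitude, latitude) vertices and uses a plain ray-casting test. The zone codes, the coordinates and the function names (`point_in_polygon`, `count_tweets_per_zone`) are hypothetical and chosen for illustration only.

```python
from collections import Counter


def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    `polygon` is a list of (lon, lat) vertices; the last vertex is
    implicitly connected back to the first one.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def count_tweets_per_zone(tweets, zones):
    """Count tweets per zone; tweets falling in no zone are ignored."""
    counts = Counter()
    for lon, lat in tweets:
        for code, polygon in zones.items():
            if point_in_polygon(lon, lat, polygon):
                counts[code] += 1
                break  # a tweet is attributed to at most one zone
    return counts


# Hypothetical zones (two adjacent unit squares), NOT real IRIS geometries.
zones = {
    "ZONE_A": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "ZONE_B": [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
}
# Hypothetical tweet coordinates as (lon, lat); the last one is outside both zones.
tweets = [(0.5, 0.5), (0.2, 0.8), (1.5, 0.5), (3.0, 3.0)]

counts = count_tweets_per_zone(tweets, zones)
print(dict(counts))
```

With tens of millions of tweets and tens of thousands of zones, a nested loop like this would be far too slow; a spatial index or a dedicated GIS library would be the usual choice. The sketch only shows the logic of the attribution, after which per-zone counts can be joined with census variables.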
Published on September 18, 2018
Project participants
Gilles Bastin
Sophie Kuegler
Etienne Dublé