Extension of the paper "Testing Propositions Derived from Twitter Studies: Generalization and Replication in Computational Social Science" by Hai Liang and King-wa Fu

Research dataset

Existing data we used

The authors of the paper provided the datasets they used to test the 10 propositions. Of the 3 datasets, we used only the ego timelines dataset. It contains tweets and associated user information, such as the creation date of the tweet, the associated user ID, the tweet ID, hashtags, URLs, retweet information, etc. This dataset was sufficient to test the assumptions about how Twitter usage evolves from month to month. For the prediction task, however, we needed recent information on each user account to determine whether the user had remained on Twitter.

Extension of the dataset

We needed recent data about the users in the ego timelines dataset. After creating a Twitter developer account, we built a new dataset using the Twitter API. We used the Python library tweepy and stored the results in a CSV file. We requested user data by user ID; when a user could no longer be found, we assumed the account had been deleted and stored the ID in a separate dataset of unreachable users.
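The collection step described above could look roughly like the following. This is a minimal sketch, not the exact script we ran: the function names (collect_users, make_tweepy_fetcher), the CSV column names, and the file paths are hypothetical, and the sketch assumes tweepy 4.x with access to the v1.1 user lookup endpoint. It also assumes that both "not found" and "forbidden" responses mean the account is unreachable, which is broader than a strict deletion check.

```python
import csv


def collect_users(user_ids, fetch_user, found_csv, missing_csv):
    """Look up each ID; write found profiles to one CSV and unreachable IDs to another.

    fetch_user(user_id) must return a dict with the keys listed in `fields`,
    or raise LookupError when the account cannot be reached.
    Returns the pair (number found, number unreachable).
    """
    fields = ["user_id", "screen_name", "statuses_count"]  # hypothetical columns
    n_found = n_missing = 0
    with open(found_csv, "w", newline="", encoding="utf-8") as f_found, \
            open(missing_csv, "w", newline="", encoding="utf-8") as f_missing:
        found = csv.DictWriter(f_found, fieldnames=fields)
        missing = csv.writer(f_missing)
        found.writeheader()
        missing.writerow(["user_id"])
        for uid in user_ids:
            try:
                row = fetch_user(uid)
            except LookupError:
                # Assumption: an unreachable account is treated as deleted.
                missing.writerow([uid])
                n_missing += 1
            else:
                found.writerow(row)
                n_found += 1
    return n_found, n_missing


def make_tweepy_fetcher(consumer_key, consumer_secret, access_token, access_token_secret):
    """Build a fetch_user function backed by tweepy (assumes tweepy 4.x)."""
    import tweepy  # imported lazily so collect_users works without tweepy installed

    auth = tweepy.OAuth1UserHandler(
        consumer_key, consumer_secret, access_token, access_token_secret
    )
    api = tweepy.API(auth, wait_on_rate_limit=True)

    def fetch_user(user_id):
        try:
            user = api.get_user(user_id=user_id)
        except (tweepy.errors.NotFound, tweepy.errors.Forbidden) as exc:
            # Assumption: both errors are mapped to "unreachable".
            raise LookupError(user_id) from exc
        return {
            "user_id": user.id,
            "screen_name": user.screen_name,
            "statuses_count": user.statuses_count,
        }

    return fetch_user
```

With real credentials, a call such as collect_users(ids, make_tweepy_fetcher(...), "users.csv", "unreachable_ids.csv") would produce the two files. Keeping the lookup behind a plain function makes it possible to try the CSV logic with a stub before spending API quota.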