The tweet-ids enable the collection of tweets from the Twitter API that are older than nine months

This website (Footnote 2) was used to collect tweet-ids (Footnote 3); it provides researchers with metadata from a (third-party-collected) corpus of Dutch tweets (Tjong Kim Sang and Van den Bosch, 2013). The tweet-ids enable the collection of tweets from the Twitter API that are older than nine months (i.e., the historic limit when requesting tweets based on a search query). The R package ‘rtweet’ and its complementary ‘lookup_status’ function were used to collect tweets in JSON format. The JSON file comprises a table with the tweets’ information, such as the creation time, the tweet text, and the source (i.e., type of Twitter client).

Data cleaning and preprocessing

The JSON files (Footnote 4) were converted into an R data frame object. Non-Dutch tweets, retweets, and automated tweets (e.g., forecast-, advertisement-, and traffic-related tweets) were removed. In addition, we excluded tweets based on three user-related criteria: (1) We removed tweets that belonged to the top 0.5 percentile of user activity (e.g., users who created more than 2000 tweets within four weeks) because we considered them non-representative of the normal user population. (2) Tweets from users with early access to the 280-character limit were removed. (3) Tweets from users who were not represented in both the pre- and post-CLC datasets were removed; this procedure ensured a consistent user sample over time (within-group design, N_users = 109,661). All cleaning procedures and corresponding exclusion numbers are presented in Table 2.
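The two user-level filters, criteria (1) and (3), can be illustrated with a minimal sketch. The study itself worked in R; Python is used here only for illustration. The function name `filter_users`, the `(user_id, text)` tuple layout, and the order in which the filters are applied are assumptions, as is the reading of "top 0.5 percentile" as the most active 0.5% of users.

```python
from collections import Counter


def filter_users(pre, post, top_fraction=0.005):
    """Apply two user-level exclusion rules to pre- and post-CLC tweets.

    pre, post: lists of (user_id, text) tuples (hypothetical layout).
    top_fraction: share of most active users to drop (assumed meaning
    of "top 0.5 percentile of user activity").
    """
    # Count tweets per user across both periods.
    counts = Counter(u for u, _ in pre) + Counter(u for u, _ in post)

    # Rank users from most to least active and mark the top fraction.
    ranked = sorted(counts, key=counts.get, reverse=True)
    n_drop = int(len(ranked) * top_fraction)
    heavy_users = set(ranked[:n_drop])

    # Within-group design: keep only users present in both periods.
    in_both = {u for u, _ in pre} & {u for u, _ in post}
    keep = in_both - heavy_users

    pre_kept = [t for t in pre if t[0] in keep]
    post_kept = [t for t in post if t[0] in keep]
    return pre_kept, post_kept
```

With a small user sample, `int(len(ranked) * top_fraction)` rounds down to zero, so no heavy users are dropped; on a sample the size of the one reported above, the cutoff would remove a few hundred accounts.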

The tweet texts were converted to ASCII encoding. URLs, line breaks, tweet headers, screen names, and references to screen names were removed. URLs add to the character count when located within the tweet. However, URLs do not add to the character count when they are located at the end of a tweet. To prevent a misrepresentation of the actual character limit that users had to deal with, tweets with URLs (but not media URLs such as added pictures or videos) were excluded.
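A minimal sketch of this text cleaning is given below, in Python rather than the R used in the study. The regular expressions and the helper names `has_url` and `clean_tweet` are hypothetical, and transliterating accented characters before dropping the remaining non-ASCII characters is an assumption about how the ASCII conversion was done. Telling media URLs apart from other URLs requires tweet metadata that is not described here, so the sketch does not attempt it.

```python
import re
import unicodedata

# Hypothetical patterns; real tweets may need more careful matching.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def has_url(text):
    """Return True if the tweet text contains a URL (candidate for exclusion)."""
    return URL_RE.search(text) is not None


def clean_tweet(text):
    """Strip URLs, screen-name references, and line breaks; convert to ASCII."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Decompose accented characters, then drop anything outside ASCII
    # (assumed approach to the ASCII conversion).
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse line breaks and repeated whitespace into single spaces.
    return " ".join(text.split())
```

For example, `clean_tweet("@jan Hallo\nwereld café")` yields `"Hallo wereld cafe"`.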

Token and bigram analysis

The R package (Footnote 5) ‘quanteda’ was used to tokenize the tweet texts into tokens (e.g., isolated words and punctuation marks). In addition, token-frequency matrices were computed with: the frequency pre-CLC [f(token pre)], the relative frequency pre-CLC [P(token pre)], the frequency post-CLC [f(token post)], the relative frequency post-CLC [P(token post)], and T-scores. The T-test is similar to a standard T-statistic and computes the statistical difference between means (i.e., the relative word frequencies). Negative T-scores indicate a relatively higher occurrence of a token pre-CLC, whereas positive T-scores indicate a relatively higher occurrence of a token post-CLC. The T-score equation used in the analysis is presented as Eqs. (1) and (2). N is the total number of tokens per dataset (i.e., pre- and post-CLC). This equation is based on the method for linguistic computations by Church et al. (1991; Tjong Kim Sang, 2011).
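The T-score computation can be sketched as follows. Eqs. (1) and (2) are not reproduced in this text, so the exact form used here is an assumption in the spirit of Church et al. (1991): the relative frequency is P = f / N, and T = (P_post - P_pre) / sqrt(P_post / N_post + P_pre / N_pre). The function name `t_scores` is hypothetical, and Python is used for illustration although the study worked in R.

```python
import math
from collections import Counter


def t_scores(pre_tokens, post_tokens):
    """Compute a T-score per token from two non-empty token lists.

    Assumed form: T = (P_post - P_pre) / sqrt(P_post/N_post + P_pre/N_pre),
    where P = f / N and the variance of P is approximated by P / N.
    """
    f_pre, f_post = Counter(pre_tokens), Counter(post_tokens)
    n_pre, n_post = len(pre_tokens), len(post_tokens)

    scores = {}
    for token in set(f_pre) | set(f_post):
        p_pre = f_pre[token] / n_pre      # relative frequency pre-CLC
        p_post = f_post[token] / n_post   # relative frequency post-CLC
        # Every token occurs in at least one list, so this is never zero.
        std_err = math.sqrt(p_post / n_post + p_pre / n_pre)
        scores[token] = (p_post - p_pre) / std_err
    return scores
```

Under this sign convention a token that is relatively more frequent post-CLC receives a positive score and one that is relatively more frequent pre-CLC receives a negative score, matching the interpretation given above.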

Part-of-speech (POS) analysis

The R package (Footnote 6) ‘openNLP’ was used to classify and count POS categories in the tweets (i.e., adjectives, adverbs, articles, conjunctions, interjections, nouns, numerals, prepositions, pronouns, punctuation, verbs, and miscellaneous). The POS tagger operates using a maximum entropy (maxent) probability model in order to predict the POS category based on contextual features (Ratnaparkhi, 1996). The Dutch maxent model used for the POS classification was trained on CoNLL-X Alpino Dutch Treebank data (Buchholz and Marsi, 2006). The openNLP POS model has been reported with an accuracy score of 87.3% when used for English social media data (Horsmann et al., 2015). An ostensible limitation of the current study is the reliability of the POS tagger. However, similar analyses were performed for both the pre-CLC and post-CLC datasets, meaning the accuracy of the POS tagger should be consistent over both datasets. Therefore, we assume there are no systematic confounds.
