Are men and women tweeted equal? A corpus linguistic approach in R

Jan 3, 2019 10 min read R, Linguistics, Text mining

I’ve been wanting to try out the excellent rtweet package for some time now, and I remembered an interesting corpus linguistic study from the mid-nineties that would be fun to replicate with Twitter data and expand with further analyses.

In this post, we’ll use simple word frequency and n-gram analyses of word collocations and common linguistic constructions to explore what Twitter users are tweeting about men and women these days. More specifically, we’ll explore the following:

Words co-occurring in tweets with the words ‘men’ and ‘women’
Hashtags co-occurring in tweets with ‘men’ and ‘women’
Bigrams ending in ‘men’ and ‘women’ - e.g. “successful men/women”
Bigrams starting with the possessive pronouns ‘his’ and ‘her’, followed by a noun
Trigram and four-gram constructions starting with “s/he is ..” and “s/he is a/an ..”

We won’t cover how to download tweets with rtweet, but instead move right to the analyses.

First, we’ll need to load a few other packages.

library(tidyverse)
library(tidytext)
library(wordcloud)
library(tm)
library(widyr)
library(udpipe)
library(magrittr)
library(gridExtra)

White men and black women

The first tweets that we’ll explore all contain the word ‘men’ or ‘women’.

df <- read_csv("mwTweets.csv")

names(df)

##  [1] "X1"                      "user_id"                
##  [3] "status_id"               "created_at"             
##  [5] "screen_name"             "text"                   
##  [7] "source"                  "display_text_width"     
##  [9] "reply_to_status_id"      "reply_to_user_id"       
## [11] "reply_to_screen_name"    "is_quote"               
## [13] "is_retweet"              "favorite_count"         
## [15] "retweet_count"           "hashtags"               
## [17] "symbols"                 "urls_url"               
## [19] "urls_t.co"               "urls_expanded_url"      
## [21] "media_url"               "media_t.co"             
## [23] "media_expanded_url"      "media_type"             
## [25] "ext_media_url"           "ext_media_t.co"         
## [27] "ext_media_expanded_url"  "ext_media_type"         
## [29] "mentions_user_id"        "mentions_screen_name"   
## [31] "lang"                    "quoted_status_id"       
## [33] "quoted_text"             "quoted_created_at"      
## [35] "quoted_source"           "quoted_favorite_count"  
## [37] "quoted_retweet_count"    "quoted_user_id"         
## [39] "quoted_screen_name"      "quoted_name"            
## [41] "quoted_followers_count"  "quoted_friends_count"   
## [43] "quoted_statuses_count"   "quoted_location"        
## [45] "quoted_description"      "quoted_verified"        
## [47] "retweet_status_id"       "retweet_text"           
## [49] "retweet_created_at"      "retweet_source"         
## [51] "retweet_favorite_count"  "retweet_retweet_count"  
## [53] "retweet_user_id"         "retweet_screen_name"    
## [55] "retweet_name"            "retweet_followers_count"
## [57] "retweet_friends_count"   "retweet_statuses_count" 
## [59] "retweet_location"        "retweet_description"    
## [61] "retweet_verified"        "place_url"              
## [63] "place_name"              "place_full_name"        
## [65] "place_type"              "country"                
## [67] "country_code"            "geo_coords"             
## [69] "coords_coords"           "bbox_coords"            
## [71] "status_url"              "name"                   
## [73] "location"                "description"            
## [75] "url"                     "protected"              
## [77] "followers_count"         "friends_count"          
## [79] "listed_count"            "statuses_count"         
## [81] "favourites_count"        "account_created_at"     
## [83] "verified"                "profile_url"            
## [85] "profile_expanded_url"    "account_lang"           
## [87] "profile_banner_url"      "profile_background_url" 
## [89] "profile_image_url"       "query"

dim(df)

## [1] 214741     90

range(df$created_at)

## [1] "2018-12-17 11:37:27 UTC" "2018-12-18 07:53:15 UTC"

range(df$created_at)[2]-range(df$created_at)[1]

## Time difference of 20.26333 hours

As you can see, there’s plenty of metadata surrounding the 214,741 tweets that I downloaded over a roughly 20-hour period on December 17 and 18, 2018.

Next, we’ll do some preprocessing where we remove some stopwords and create a ‘gender’ variable to categorize our tweets.

df                             %<>%
  select(text, screen_name)    %>%
  mutate(text = tolower(text)) %>%
  # "women" is tested first because the pattern "men" also matches inside "women"
  mutate(gender = case_when(str_detect(text, "women") &
                            str_detect(text, " men")  ~ "both",
                            str_detect(text, "women") ~ "women",
                            str_detect(text, "men")   ~ "men")) %>%
  unnest_tokens(word, text)    %>%   # one row per word token
  anti_join(stop_words[stop_words$lexicon=="SMART",], by = "word") %>%
  mutate(word = removeWords(word,c(stopwords(),"t.co","https","amp","'s","’s"))) %>%
  add_count(word)              %>%
  # keep words seen more than once; drop empty tokens and tweets mentioning both
  filter(n > 1, word != "", gender != "both") %>%
  select(-n)

table(df$gender)

## 
##     men   women 
##  891328 1295095

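Note that the data frame now has one row per word, so these are counts of word tokens rather than tweets. Still, they suggest that more of the text comes from tweets containing ‘women’ than from tweets containing ‘men’.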
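
To count tweets rather than tokens, the classification can be tabulated before tokenizing. The following is a minimal sketch on a few invented tweets, not the code used for this post. The word-boundary patterns (`\\bmen\\b`, `\\bwomen\\b`) are an assumption added here so that words such as ‘government’ are not treated as mentions of ‘men’.

```r
library(dplyr)
library(stringr)

# Toy tweets, invented for illustration only
toy <- tibble(
  id   = 1:4,
  text = c("Women in tech", "Men and women agree",
           "men at work", "the government said")
)

tweet_gender <- toy %>%
  mutate(text      = tolower(text),
         has_women = str_detect(text, "\\bwomen\\b"),
         has_men   = str_detect(text, "\\bmen\\b"),
         gender    = case_when(has_women & has_men ~ "both",
                               has_women           ~ "women",
                               has_men             ~ "men"))

# Still one row per tweet here, so this counts tweets, not word tokens
tweet_counts <- count(tweet_gender, gender)
```

Tweets that match neither pattern get `NA`, which makes them easy to inspect or drop.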

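Using the pairwise_count function from the widyr package, we can create new data frames with the words that most often co-occur with ‘men’ and ‘women’. Because screen_name is the grouping feature, co-occurrence is counted per user rather than per tweet.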

# Words that most often share a user with "men"
word_pairs_men <- df      %>%
  filter(gender == "men") %>%
  pairwise_count(word, screen_name, sort = TRUE) %>%
  filter(item1 == "men")  %>% 
  top_n(20, n)

# Words that most often share a user with "women"
word_pairs_women <- df      %>%
  filter(gender == "women") %>%
  pairwise_count(word, screen_name, sort = TRUE)  %>%
  filter(item1 == "women")  %>% 
  top_n(20, n)
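
To see what pairwise_count actually counts, here is a small sketch on invented data. For every pair of words, it returns the number of users (the feature column) in which both words occur.

```r
library(dplyr)
library(widyr)

# Toy data, invented for illustration: one row per (user, word)
toy <- tibble(
  screen_name = c("a", "a", "b", "b", "c", "c"),
  word        = c("men", "white", "men", "white", "men", "tall")
)

# Number of users whose words include both members of each pair
toy_pairs <- toy %>%
  pairwise_count(word, screen_name, sort = TRUE) %>%
  filter(item1 == "men")
```

Here ‘white’ shares two users with ‘men’ and ‘tall’ shares one. Grouping by a tweet identifier instead of screen_name would give per-tweet co-occurrence.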

Let’s put the data back together and plot the most common words.

word_pairs <- rbind(word_pairs_men, word_pairs_women) %>%
  mutate(order = rev(row_number()), item1 = factor(item1, levels = c("men", "women")))

word_pairs %>% 
  ggplot(aes(x = order, y = n, fill = item1)) + 
  geom_col(show.legend = FALSE) + 
  scale_x_continuous(breaks = word_pairs$order, 
                     labels = word_pairs$item2, 
                     expand = c(0,0)) + 
  facet_wrap(~item1, scales = "free") +
  scale_fill_manual(values = c("steelblue", "indianred")) + coord_flip() + labs(x = "words") +
  theme_minimal() +
  theme(axis.text  = element_text(size = 14),
        axis.title   = element_text(size = 18),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        strip.text.x = element_text(size=22, face="bold"),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.y = element_blank())

It seems that tweets tend to characterise people on the basis of skin colour. A more direct way to explore this is to analyse two-word collocations, or bigrams, where ‘men’ or ‘women’ appears as the second word in the pair.

Let’s first do the necessary preprocessing and create a word cloud, starting with collocations that have ‘men’ as the second word.

men <- read_csv("mwTweets.csv") %>%
  select(screen_name, text)     %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ")     %>%
  filter(word2 == "men")

menCount <- men      %>%
  count(word1,word2) %>%
  select(word1,n)    %>%
  arrange(desc(n))   %>%
  anti_join(stop_words[stop_words$lexicon=="SMART",],by = c("word1" = "word")) 
  
wordcloud(words = menCount$word1, freq = menCount$n, min.freq = 30, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(9,"Blues")[4:9])

Let’s skip the code and do the same for collocations with ‘women’.
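
For reference, the skipped step might look roughly like the sketch below. It wraps the bigram counting in a hypothetical helper, `count_preceding()` (not part of the original code), and runs it on a few invented tweets instead of mwTweets.csv.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical helper: count the words that directly precede `target`
count_preceding <- function(tweets, target) {
  smart <- stop_words[stop_words$lexicon == "SMART", ]
  tweets %>%
    select(screen_name, text) %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    filter(word2 == target) %>%
    count(word1, sort = TRUE) %>%
    anti_join(smart, by = c("word1" = "word"))   # drop stop words such as "the"
}

# Toy tweets, invented for illustration only
toy <- tibble(
  screen_name = c("a", "b", "c"),
  text = c("Black women lead", "black women and the women", "trans women vote")
)

womenCountToy <- count_preceding(toy, "women")
```

Calling the same helper with "men" as the target would cover both cases without repeating the pipeline.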

To make comparisons a little easier, we can plot the frequencies more precisely with bar charts.

menCountTop <- menCount            %>%
  filter(word1!="amp",word1!="ii") %>%
  mutate(row = rev(row_number()))  %>%
  top_n(20,n)

menPlot <- menCountTop %>%
  ggplot(aes(row, n, fill = n)) +
  geom_col(show.legend = FALSE,width = .9) +
  coord_flip() +
  scale_x_continuous( 
    breaks = menCountTop$row,
    labels = menCountTop$word1,
    expand = c(0,0)) + 
  theme_minimal() + 
  theme(axis.text    = element_text(size = 14),
        axis.title   = element_text(size = 18),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.y = element_blank()) +
  ggtitle("\"__  Men\"") +
  scale_fill_gradient(low=brewer.pal(9,"Blues")[2],high=brewer.pal(9,"Blues")[9])

womenCountTop <- womenCount        %>%
  mutate(row = rev(row_number()))  %>%
  top_n(20,n)

womenPlot <- womenCountTop %>%
  ggplot(aes(row, n, fill = n)) +
  geom_col(show.legend = FALSE,width = .9) +
  coord_flip() +
  scale_x_continuous( 
    breaks = womenCountTop$row,
    labels = womenCountTop$word1,
    expand = c(0,0)) + 
  theme_minimal() + 
  theme(axis.text    = element_text(size = 14),
        axis.title   = element_text(size = 18),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        #axis.title.y = element_text(margin = margin(r = 40,l=40)),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.y = element_blank()) +
  ggtitle("\"__  Women\"") +
  scale_fill_gradient(low=brewer.pal(9,"Reds")[2],high=brewer.pal(9,"Reds")[9])

grid.arrange(menPlot,womenPlot, ncol = 2)

There are plenty of observations to make in these charts. They confirm that skin colour frequently precedes ‘men’ and ‘women’. Interestingly, the relative frequency of ‘black’ and ‘white’ is reversed for the two genders, though I kind of suspected that ‘white men’ would be a prominent collocation. We can also observe that the sexual orientation of men is highlighted, and that ‘trans’ appears more frequently before ‘women’.

Lastly, let’s compute and visualize the frequency of the most common hashtags co-occurring with ‘men’ and ‘women’ that also contain the forms ‘men’ and ‘women’.

df <- read_csv("mwTweets.csv") %>%
  select(X1,screen_name,text)  %>% 
  mutate(text = tolower(text))

remove_reg <- "&amp;|&lt;|&gt;"

df <- df %>%
  filter(!str_detect(text, "^rt\\b")) %>%   # text is already lowercased, so match "rt"
  mutate(text = str_remove_all(text, remove_reg)) %>%
  unnest_tokens(hashtag, text, token = "tweets") %>%
  filter(!hashtag %in% stop_words$word,
         !hashtag %in% str_remove_all(stop_words$word, "'")) %>%
  filter(str_detect(hashtag, "^#")) %>%
  mutate(hashtag = str_remove(hashtag,"#")) %>%
  filter(str_detect(hashtag,"men|women")) %>%
  filter(hashtag != "men", hashtag != "women",hashtag != "mens", hashtag != "womens", !str_detect(hashtag,"ment"))

tags <- df %>%
  group_by(hashtag) %>%
  count() %>%
  arrange(desc(n))

tags <- tags %>%
  mutate(gender = case_when( str_detect(hashtag,"women") ~ "f",
                            !str_detect(hashtag,"women") ~ "m"))

menTags <- tags %>%
  filter(gender == "m")

womenTags <- tags %>%
  filter(gender == "f")

wordcloud(words = menTags$hashtag, freq = menTags$n, min.freq = 3, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(9,"Blues")[4:9])

wordcloud(words = womenTags$hashtag, freq = womenTags$n, min.freq = 3, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(9,"Reds")[4:9])

For whatever reason, the hashtags co-occurring with ‘men’ revolve around fashion, style and grooming. By contrast, the hashtags co-occurring with ‘women’ reflect career choices (STEM, tech, business). More generally, the construction “women in X” appears to be highly productive and frequent.
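
The hashtag filter can be tried out on a few invented tweets. Note that this sketch swaps the tweet tokenizer for a plain regular expression (`#\\w+`) with str_extract_all, which is a simplification and not what the code above does.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Toy tweets, invented for illustration only
toy <- tibble(text = c("Great panel #WomenInTech #STEM",
                       "New looks #mensfashion #menswear",
                       "RT @someone: #womenintech"))

toy_tags <- toy %>%
  mutate(text = tolower(text)) %>%
  filter(!str_detect(text, "^rt\\b")) %>%                # drop retweets
  mutate(hashtag = str_extract_all(text, "#\\w+")) %>%   # list of hashtags per tweet
  unnest(cols = hashtag) %>%
  mutate(hashtag = str_remove(hashtag, "^#")) %>%
  filter(str_detect(hashtag, "men|women")) %>%           # keep gendered hashtags
  count(hashtag, sort = TRUE)
```

The retweet and the ‘stem’ hashtag are dropped, leaving ‘mensfashion’, ‘menswear’ and ‘womenintech’ with one occurrence each.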

His name, her father

Next, we’ll perform frequency analyses of the words that follow the possessive pronouns, starting with ‘his’.

his <- read_csv("hisTweets.csv") %>%
  select(text)                   %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ")     %>%
  filter(word1 == "his")

his

## # A tibble: 123,762 x 2
##    word1 word2   
##    <chr> <chr>   
##  1 his   hand    
##  2 his   anger   
##  3 his   daughter
##  4 his   gift    
##  5 his   abusive 
##  6 his   abusive 
##  7 his   back    
##  8 his   failure 
##  9 his   views   
## 10 his   2nd     
## # ... with 123,752 more rows

We’re also going to perform part-of-speech tagging and narrow the lexical items down to nouns. To do this, we’ll use the udpipe package.

# udmodel <- udpipe_download_model(language = "english")

udmodel <- udpipe_load_model("english-ud-2.0-170801.udpipe")

include <- udpipe(x = his$word2,
                  object = udmodel)

include <- include      %>%
  select(token,upos)    %>%
  filter(upos =="NOUN") %>%
  select(token) 

his <- his %>%
  filter(word2 %in% include$token)

his

## # A tibble: 86,966 x 2
##    word1 word2        
##    <chr> <chr>        
##  1 his   hand         
##  2 his   anger        
##  3 his   daughter     
##  4 his   gift         
##  5 his   failure      
##  6 his   views        
##  7 his   substitutions
##  8 his   basement     
##  9 his   fan          
## 10 his   supporters   
## # ... with 86,956 more rows

Now, we can count and prepare a bar chart of the most frequent nouns following ‘his’.

hisCount <- his         %>%
  count(word1,word2)    %>%
  arrange(desc(n))      %>%
  select(word2,n)       %>%
  mutate(row = rev(row_number()))

hisPlot <- hisCount     %>%
  top_n(20,n)           %>%
  ggplot(aes(row, n, fill = n)) +
  geom_col(show.legend = FALSE,width = .9) +
  coord_flip() +
  scale_x_continuous( 
    breaks = hisCount$row,
    labels = hisCount$word2,
    expand = c(0,0)) +
  theme_minimal() + 
  theme(axis.text    = element_text(size = 14),
        axis.title   = element_text(size = 18),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.y = element_blank()) +
  ggtitle("\"His __\"") +
  scale_fill_gradient(low=brewer.pal(9,"Blues")[2],high=brewer.pal(9,"Blues")[9])

I’ve completed the same steps for nouns following ‘her’. Let’s plot and compare the results!

grid.arrange(hisPlot,herPlot, ncol = 2)

Apparently, tweets about possessions and attributes are often concerned with family relations and body parts.

“What is s/he?”

In this section, we’ll examine the most frequent trigram constructions of the form “s/he is X”.

We’ll again use part-of-speech tagging and only consider adjectives in the place of X.

heTweets <- read_csv("heTweets.csv") 

he <- heTweets                 %>%
  select(screen_name, text)    %>% 
  mutate(text = tolower(text)) %>%
  mutate(text = str_replace_all(text, "he's", "he is"))     %>%  # expand every "he's", not only the first
  unnest_tokens(trigram, text, token = "ngrams", n = 3)     %>%
  separate(trigram, c("word1", "word2","word3"), sep = " ") %>%
  filter(word1 == "he", word2 == "is")

udmodel <- udpipe_load_model("english-ud-2.0-170801.udpipe")

include <- udpipe(x = he$word3,
                  object = udmodel)

include <- include      %>%
  select(token,upos)    %>%
  filter(upos == "ADJ") %>%
  select(token) 

he <- he %>%
  filter(word3 %in% include$token)

heCount <- he              %>%
  count(word1,word2,word3) %>%
  arrange(desc(n))         %>%
  select(word3,n)          %>%
  mutate(row = rev(row_number()))
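
Expanding the contraction is easy to get subtly wrong, so here is a small check on invented strings. str_replace() rewrites only the first match in each string, while str_replace_all() rewrites every match. The pattern with a word boundary and the typographic apostrophe is an assumption added for this sketch, not part of the pipeline above.

```r
library(stringr)

# Toy strings, invented for illustration only
toy <- tolower(c("he's gone and he's late", "He’s right"))

# Only the first straight-apostrophe "he's" in each string is expanded
first_only <- str_replace(toy, "he's", "he is")

# Every "he's" / "he’s" is expanded; \\b keeps "she's" from matching
all_forms <- str_replace_all(toy, "\\bhe['’]s\\b", "he is")
```

Keep in mind that “he’s” can also stand for “he has”, so a blanket expansion to “he is” will sweep in phrases such as “he’s gone”.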

Just like above, the same steps have been completed for tweets with “she is X”.

We’ll again prepare bar charts highlighting the most common words. The code is pretty much redundant, so let’s skip that here.
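
One way to avoid repeating the plotting code would be a small helper function. The sketch below is an assumption about how that could look, not the code used for the figures. `plot_top_words()` and its arguments are hypothetical names, and the data are placeholders.

```r
library(dplyr)
library(ggplot2)
library(RColorBrewer)

# Hypothetical helper: horizontal bar chart of the most frequent words.
# `counts` needs a `word` column and a count column `n`.
plot_top_words <- function(counts, title, palette = "Blues", top = 20) {
  top_counts <- counts %>%
    arrange(desc(n)) %>%
    head(top) %>%                          # unlike top_n(), this does not keep ties
    mutate(row = rev(row_number()))        # the largest count gets the top bar
  ggplot(top_counts, aes(row, n, fill = n)) +
    geom_col(show.legend = FALSE, width = .9) +
    coord_flip() +
    scale_x_continuous(breaks = top_counts$row,
                       labels = top_counts$word,
                       expand = c(0, 0)) +
    scale_fill_gradient(low  = brewer.pal(9, palette)[2],
                        high = brewer.pal(9, palette)[9]) +
    theme_minimal() +
    ggtitle(title)
}

# Placeholder counts, invented for illustration only
toyCount <- tibble(word = c("alpha", "beta", "gamma"), n = c(30, 20, 10))
toyPlot  <- plot_top_words(toyCount, "\"He is __\"")
```

Passing "Reds" as the palette would give the matching chart for the other gender, and the two plots can be combined with grid.arrange as before.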

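The first question on my mind is: where did he go? Without looking at some tweets, I can’t say for sure why ‘gone’ should top the list for tweets about men. One possible contributor is the preprocessing, which expands “he’s” to “he is”, so that “he’s gone” (often short for “he has gone”) is counted as “he is gone”. Conversely, it is perhaps not too surprising that tweets about women tend to focus on looks.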

X is what s/he is!

Finally, let’s objectify both genders and find the most common four-grams starting with “s/he is a(n) ..”. We’ll limit results to nouns this time.

Since the code is largely redundant, let’s just see what we get!
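
For the curious, the four-gram step might look roughly like this sketch on invented tweets. It leaves out the part-of-speech filter, which needs the udpipe model, and only shows the n-gram matching.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Toy tweets, invented for illustration only
toy <- tibble(text = c("She is a doctor now",
                       "I think she is an engineer",
                       "she is a doctor",
                       "she is happy today"))

sheToy <- toy %>%
  mutate(text = tolower(text)) %>%
  unnest_tokens(fourgram, text, token = "ngrams", n = 4) %>%
  separate(fourgram, c("word1", "word2", "word3", "word4"), sep = " ") %>%
  filter(word1 == "she", word2 == "is", word3 %in% c("a", "an")) %>%
  count(word4, sort = TRUE)
```

On these toy tweets, ‘doctor’ is counted twice and ‘engineer’ once, while “she is happy today” is ignored because its third word is not an article.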

The words describing men are somewhat more negative than those describing women in tweets. However, I bet that, in most cases, the words ‘traitor’, ‘idiot’, ‘baby’, ‘disgrace’, ‘racist’ and ‘coward’ are used in reference to Donald Trump.

Conclusion

It is important to keep in mind that the analyses presented here are based on tweets created during a fairly short time window (approx. 20 hours). It would therefore be interesting to compare the resulting word frequencies with those from a set of different tweets.

Nevertheless, we saw clear differences in how men and women were characterised in the downloaded tweets. It would of course be very useful to dig into the contexts in which men and women are mentioned. However, my main goal here was to explore what can be done with simple word frequency and n-gram analyses of specific linguistic constructions.
