Tutorial on Wordbank and the wordbankr package

Introduction to CDI & Stanford Wordbank
How to Use the Stanford Wordbank Website
In-Depth Analysis with wordbankr

The MacArthur-Bates Communicative Development Inventories (MB-CDIs)

The MB-CDIs are parent-report instruments that capture information about children’s developing abilities in early language: vocabulary comprehension, production, gestures, and grammar.

Korean MB-CDI

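# item-level information for the Korean Words & Sentences form
items_kor_ws <- wordbankr::get_item_data(language = "Korean", form = "WS")
items_kor_ws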

Korean MB-CDI

summary

Wordbank

http://wordbank.stanford.edu/

Wordbank

http://wordbank.stanford.edu/data

Wordbank

Vocabulary Norms

show how children’s vocabulary size changes with age
show typical language development

Item Trajectories

track the use of specific words or phrases over time
show how a child’s language skills are developing
help identify potential areas of difficulty

Vocabulary Norms

web snapshot

Vocabulary Norms

library(tidyverse)  # provides read_csv(), %>%, the dplyr/tidyr verbs, and ggplot2
read_csv("files/wordbank_vocab_data.csv")

Item Trajectories

web snapshot

Item Trajectories

read_csv("files/wordbank_item_trajectories.csv")

Wordbank

Cross-Linguistic Trajectories

similar to item trajectories,
but allow cross-linguistic comparisons

Semantic Networks

show relationships between different words
show how a child’s vocabulary is organized

Data Export Tools

export data from the Wordbank website
for in-depth analysis in software such as R or SPSS

Cross-Linguistic Trajectories: mommy

web snapshot

Cross-Linguistic Trajectories: mommy

read_csv("files/wordbank_crosslinguistic_mommy.csv") %>% arrange(age)

Cross-Linguistic Trajectories: food

web snapshot

Cross-Linguistic Trajectories: food

read_csv("files/wordbank_crosslinguistic_food.csv") %>% arrange(age)

Cross-Linguistic Trajectories: eat

web snapshot

Cross-Linguistic Trajectories: eat

read_csv("files/wordbank_crosslinguistic_eat.csv") %>% arrange(age)

Semantic Networks

Snapshot at 6 - 13 mo

Semantic Networks

Snapshot at 6 - 17 mo

Semantic Networks

Fourtassi, Bian, and Frank, 2020

Data Export Tools

By-Child

Data Export Tools

By-Word

Data Export Tools

Full Child-by-Word

By-Word Summary Data

read_csv("files/wordbank_item_data.csv")

By-Word Summary Data

set.seed(1234)
item <- sample(641, 3)
read_csv("files/wordbank_item_data.csv") %>% 
  filter(item_id %in% item)

By-Word Summary Data

read_csv("files/wordbank_item_data.csv") %>% 
  filter(item_id %in% item) %>% 
  pivot_longer(cols = c(paste(18:36)), names_to = "age", values_to = "proportion")

By-Word Summary Data

read_csv("files/wordbank_item_data.csv") %>% 
  filter(item_id %in% item) %>% 
  pivot_longer(cols = c(paste(18:36)), names_to = "age", values_to = "proportion") %>%
  ggplot(aes(y = proportion, x = age, col = item_definition)) + geom_point() + 
  theme_classic(base_family = "NanumGothic") + geom_hline(yintercept = .5, col = "grey40", linetype = "dashed")
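
The reshaping step and the dashed 50% line can be tried out on a small table first. The sketch below uses made-up proportions for two hypothetical items (not Wordbank data) and assumes only that the export has one column per age, as in the code above. It converts the age column from character to integer after pivoting, then reports the first age at which each item reaches the 50% line.

```r
library(dplyr)
library(tidyr)

# Made-up proportions for two hypothetical items; the columns "18".."21"
# mimic the one-column-per-age layout of the By-Word export.
toy <- tibble(
  item_definition = c("ball", "because"),
  `18` = c(0.40, 0.05),
  `19` = c(0.55, 0.10),
  `20` = c(0.70, 0.30),
  `21` = c(0.85, 0.45)
)

toy_long <- toy %>%
  pivot_longer(cols = -item_definition, names_to = "age",
               values_to = "proportion") %>%
  mutate(age = as.integer(age))  # column names arrive as character

# First age at which each item reaches the 50% line; NA if it never does.
first_half <- toy_long %>%
  group_by(item_definition) %>%
  summarise(age_50 = if (any(proportion >= 0.5)) min(age[proportion >= 0.5])
                     else NA_integer_)
first_half
```

Keeping age numeric also lets ggplot2 treat the x axis as continuous rather than as a set of labels.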

wordbankr

langcog/wordbankr

install.packages("wordbankr")

or, from GitHub:

devtools::install_github("langcog/wordbankr")

wordbankr

library(wordbankr)

help(package = "wordbankr")

ls("package:wordbankr")

##  [1] "fit_aoa"                 "fit_vocab_quantiles"    
##  [3] "get_administration_data" "get_crossling_data"     
##  [5] "get_crossling_items"     "get_instrument_data"    
##  [7] "get_instruments"         "get_item_data"          
##  [9] "get_sources"             "summarise_items"

wordbankr: instruments

get_instruments()

wordbankr: sources

get_sources()

wordbankr: sources

get_sources(language = "Korean")

wordbankr: get_administration_data()

admins_kor_ws <- get_administration_data(language = "Korean", form = "WS")
admins_kor_ws

n_distinct(admins_kor_ws$data_id)

## [1] 1370

wordbankr: summary

admins_kor_ws %>% group_by(sex) %>% count(age) %>% spread(sex, n)

wordbankr: get_instrument_data()

inst_kor_ws <- get_instrument_data(language = "Korean", form = "WS")
inst_kor_ws

wordbankr: get_instrument_data() + get_administration_data()

inst_kor_ws %>% 
  inner_join(admins_kor_ws %>% select(data_id, age, sex))

wordbankr: production

inst_kor_ws %>% 
  inner_join(admins_kor_ws %>% select(data_id, age, sex))

wordbankr: production

inst_kor_ws %>% 
  inner_join(admins_kor_ws %>% select(data_id, age, sex)) %>% 
  filter(value == "produces" & !is.na(sex))

wordbankr: production

inst_kor_ws %>% 
  inner_join(admins_kor_ws %>% select(data_id, age, sex)) %>% 
  filter(value == "produces" & !is.na(sex)) %>% 
  group_by(data_id, age, sex)

wordbankr: production

inst_kor_ws %>% 
  inner_join(admins_kor_ws %>% select(data_id, age, sex)) %>% 
  filter(value == "produces" & !is.na(sex)) %>% 
  group_by(data_id, age, sex) %>% 
  count() %>% rename(production = n) -> data_kor_ws
data_kor_ws
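
One thing to keep in mind with this pipeline: filtering to "produces" rows before counting leaves out any child who produces no words at all, so vocabulary sizes of zero never appear. A possible alternative, sketched here on made-up data (toy_inst and toy_admins are hypothetical stand-ins for inst_kor_ws and admins_kor_ws), is to sum a logical inside summarise() instead.

```r
library(dplyr)

# Made-up responses: child 1 produces two words, child 2 produces none.
toy_inst <- tibble(
  data_id = c(1, 1, 1, 2, 2, 2),
  value   = c("produces", "produces", "", "", "understands", NA)
)
toy_admins <- tibble(
  data_id = c(1, 2),
  age     = c(24, 18),
  sex     = c("Female", "Male")
)

toy_vocab <- toy_inst %>%
  inner_join(toy_admins, by = "data_id") %>%
  filter(!is.na(sex)) %>%
  group_by(data_id, age, sex) %>%
  # Count "produces" rows without dropping children whose count is zero.
  summarise(production = sum(value == "produces", na.rm = TRUE),
            .groups = "drop")
toy_vocab
```

Whether zero-vocabulary children should be kept depends on the question being asked; the point is only that the choice is made explicitly.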

wordbankr: production growth snapshot

ggplot(data_kor_ws, aes(x = age, y = production, col = sex)) +
  labs(x = "Age (months)", y = "Productive vocabulary size") + theme_classic()

wordbankr: production growth snapshot

ggplot(data_kor_ws, aes(x = age, y = production, col = sex)) +
  labs(x = "Age (months)", y = "Productive vocabulary size") + theme_classic() +
  geom_jitter(size = 0.5)

wordbankr: production growth snapshot

ggplot(data_kor_ws, aes(x = age, y = production, col = sex)) +
  labs(x = "Age (months)", y = "Productive vocabulary size") + theme_classic() +
  geom_jitter(colour = "grey", size = 0.5) +
  geom_smooth(method = "lm", formula = y ~ splines::ns(x, df = 2))

wordbankr: fit_vocab_quantiles()

data_quantiles <- fit_vocab_quantiles(
  vocab_data = data_kor_ws %>% mutate(language = "Korean", form = "WS"),
  measure = production, 
  group = sex, 
  quantiles = "standard")

data_quantiles
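
To get a feel for what a quantile curve summarises, raw quantiles can be computed by hand at each age. The sketch below is a cruder stand-in on simulated data, not the method used by fit_vocab_quantiles(), which returns smooth model-based curves. The five probabilities are chosen here for illustration.

```r
set.seed(1)

# Simulated vocabulary sizes (not Wordbank data): 50 children at each of three ages.
toy <- data.frame(
  age        = rep(c(18, 24, 30), each = 50),
  production = c(rpois(50, 40), rpois(50, 150), rpois(50, 300))
)

probs <- c(0.10, 0.25, 0.50, 0.75, 0.90)  # chosen for illustration

# One column per age, one row per quantile.
raw_q <- sapply(split(toy$production, toy$age), quantile, probs = probs)
raw_q
```

Reading along a row shows how a given quantile grows with age; reading down a column shows the spread among children of the same age.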

wordbankr: fit_vocab_quantiles()

ggplot(data_kor_ws, aes(x = age, y = production)) +
  labs(x = "Age (months)", y = "Productive vocabulary size")+theme_classic() + facet_wrap(~sex)

wordbankr: fit_vocab_quantiles()

ggplot(data_kor_ws, aes(x = age, y = production, col = sex)) +
  labs(x = "Age (months)", y = "Productive vocabulary size")+theme_classic() + facet_wrap(~sex) +
  geom_jitter(size = 0.5)

wordbankr: fit_vocab_quantiles()

ggplot(data_kor_ws, aes(x = age, y = production)) +
  labs(x = "Age (months)", y = "Productive vocabulary size")+theme_classic() + facet_wrap(~sex) +
  geom_jitter(colour = "grey", size = 0.5) +
  geom_line(data = data_quantiles, aes(y = production, x = age, col = quantile), inherit.aes = FALSE, size = 1)

wordbankr: fit_aoa()

fit_aoa(
  inst_kor_ws %>% inner_join(admins_kor_ws %>% select(data_id, age, sex)),
  measure = "produces",
  method = "glmrob",
  proportion = 0.5
) -> aoa_list

aoa_list

wordbankr: fit_aoa()

aoa_list %>% filter(!is.na(aoa)) -> aoa_list
aoa_list

items_kor_ws # from get_item_data()

wordbankr: fit_aoa()

aoa_list %>% 
  inner_join(items_kor_ws %>% select(num_item_id, definition, uni_lemma)) -> aoa_list
aoa_list %>% arrange(aoa)

Some ideas

Use data from Wordbank to explore questions about language learning.

Investigate the relationship between vocabulary size and other reported measures, e.g.,
- grammar proficiency (“complexity” items), age, gender, and socio-economic status
Explore the relationship between lexical category and the proportion of words that children know
- to what extent some types of words are acquired more easily than others

Examples 1: lexical class

items_kor_ws # from get_item_data()

Examples 1: lexical class

summarise_items(items_kor_ws) ->
  item_summary
item_summary

Examples 1: lexical class

unique(item_summary$lexical_category)

## [1] "other"          "nouns"          "function_words" "predicates"    
## [5] NA

unique(item_summary$lexical_class)

## [1] "other"          "nouns"          "function_words" "verbs"         
## [5] "adjectives"     NA

Examples 1: lexical class

ggplot(item_summary %>% filter(!is.na(lexical_class)), aes(y = production, x = age)) + theme_classic()

Examples 1: lexical class

ggplot(item_summary %>% filter(!is.na(lexical_class)), aes(y = production, x = age)) + theme_classic() +
  geom_jitter(size = .5, col = "grey", alpha = .5)

Examples 1: lexical class

ggplot(item_summary %>% filter(!is.na(lexical_class)), aes(y = production, x = age, col = lexical_class)) + theme_classic() +
  geom_jitter(size = .5)

Examples 1: lexical class

ggplot(item_summary %>% filter(!is.na(lexical_class)), aes(y = production, x = age)) + theme_classic() +
  geom_jitter(size = .5, col = "grey", alpha = .5) +
  geom_smooth(method = "lm", formula = y ~ splines::ns(x, df = 2), aes(col = lexical_class))

Examples 2

items <- items_kor_ws %>% filter(category == "games_routines")
items

Examples 2

item_summary %>% filter(item_id %in% items$item_id)

Examples 2

item_summary %>% filter(item_id %in% items$item_id) %>%
  ggplot(aes(y = production, x = age)) + theme_classic(base_family = "NanumGothic") +
  facet_wrap(~definition, ncol = 5)+ labs(colour="Items")

Examples 2

item_summary %>% filter(item_id %in% items$item_id) %>%
  ggplot(aes(y = production, x = age)) + theme_classic(base_family = "NanumGothic") +
  facet_wrap(~definition, ncol = 5)+ labs(colour="Items") + geom_point(col = "grey", alpha = .5)

Examples 2

item_summary %>% filter(item_id %in% items$item_id) %>%
  ggplot(aes(y = production, x = age)) + theme_classic(base_family = "NanumGothic") +
  facet_wrap(~definition, ncol = 5)+ labs(colour="Items") + geom_point(col = "grey", alpha = .5) +
  geom_smooth(method = "lm", formula = y ~ splines::ns(x, df = 2))

Examples 2

item_summary %>% filter(item_id %in% items$item_id) %>%
  ggplot(aes(y = production, x = age)) + theme_classic(base_family = "NanumGothic") +
  facet_wrap(~paste(definition, uni_lemma, sep = "_"), ncol = 5)+ labs(colour="Items") + geom_point(col = "grey", alpha = .5) +
  geom_smooth(method = "lm", formula = y ~ splines::ns(x, df = 2))

Some ideas

Integrative data analysis

(1) Estimate the age of acquisition (AoA) of items of interest from Wordbank
(2) Measure the frequency of these items in child-directed speech from corpus data (e.g., childes-db)

Examine the relationship between (1) and (2)
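
Once both measures sit in one table, the comparison itself is short. The sketch below uses made-up values for five hypothetical items (neither the AoA estimates nor the frequencies are real) and assumes the two sources have already been joined on a shared key such as uni_lemma.

```r
# Made-up values for five hypothetical items (not real estimates).
toy <- data.frame(
  uni_lemma = c("ball", "dog", "eat", "because", "yesterday"),
  aoa       = c(17, 18, 21, 30, 33),     # months
  freq      = c(900, 750, 400, 60, 20)   # tokens per million, child-directed speech
)

# Frequencies are usually heavily skewed, so work on the log scale.
toy$log_freq <- log(toy$freq)

fit <- lm(aoa ~ log_freq, data = toy)
coef(fit)
cor(toy$aoa, toy$log_freq)
```

A negative slope would mean that more frequent words tend to be acquired earlier; with real data the same table could also carry lexical category as a covariate.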

Examples 3

full_kor_ws <- get_instrument_data(language = "Korean", form = "WS", administrations = TRUE, iteminfo = TRUE)
full_kor_ws

Examples 3

full_kor_ws %>% 
  dplyr::mutate(produces = !is.na(value) & value == 
                  "produces", 
                understands = !is.na(value) & 
                  (value == "understands" | value == "produces"))

Examples 3

full_kor_ws %>% 
  dplyr::mutate(produces = !is.na(value) & value == 
                  "produces", 
                understands = !is.na(value) & 
                  (value == "understands" | value == "produces")) %>% 
  dplyr::select(age, num_item_id, uni_lemma, category, definition, produces, understands)

Examples 3

full_kor_ws %>% 
  dplyr::mutate(produces = !is.na(value) & value == 
                  "produces", 
                understands = !is.na(value) & 
                  (value == "understands" | value == "produces")) %>% 
  dplyr::select(age, num_item_id, uni_lemma, category, definition, produces, understands) %>% 
  tidyr::gather("measure_name", "value", produces, understands)

Examples 3

full_kor_ws %>% 
  dplyr::mutate(produces = !is.na(value) & value == 
                  "produces", 
                understands = !is.na(value) & 
                  (value == "understands" | value == "produces")) %>% 
  dplyr::select(age, num_item_id, uni_lemma, category, definition, produces, understands) %>% 
  tidyr::gather("measure_name", "value", produces, understands) %>% 
  dplyr::filter(measure_name == "produces")

Examples 3

full_kor_ws %>% 
  dplyr::mutate(produces = !is.na(value) & value == 
                  "produces", 
                understands = !is.na(value) & 
                  (value == "understands" | value == "produces")) %>% 
  dplyr::select(age, num_item_id, uni_lemma, category, definition, produces, understands) %>% 
  tidyr::gather("measure_name", "value", produces, understands) %>% 
  dplyr::filter(measure_name == "produces") %>% 
  dplyr::group_by(age, num_item_id, uni_lemma, category, definition)

Examples 3

full_kor_ws %>% 
  dplyr::mutate(produces = !is.na(value) & value == 
                  "produces", 
                understands = !is.na(value) & 
                  (value == "understands" | value == "produces")) %>% 
  dplyr::select(age, num_item_id, uni_lemma, category, definition, produces, understands) %>% 
  tidyr::gather("measure_name", "value", produces, understands) %>% 
  dplyr::filter(measure_name == "produces") %>% 
  dplyr::group_by(age, num_item_id, uni_lemma, category, definition) %>% 
  dplyr::summarise(num_true = sum(value), 
                   num_false = dplyr::n() -  num_true) -> 
  item_data; item_data

Examples 3

item_data %>% filter(definition == "파이팅") -> word_data; word_data

cbind(ages = word_data$age, 
      data_prop = word_data$num_true/
        (word_data$num_true + word_data$num_false)) %>% data.frame()

Examples 3

inv_logit <- function(x) 1/(exp(-x) + 1)
ages <- dplyr::tibble(age = c(min(item_data$age):max(item_data$age)))
ages

robustbase::glmrob(cbind(num_true, num_false) ~ age, 
                   data = word_data, 
                   family = "binomial")

## 
## Call:  robustbase::glmrob(formula = cbind(num_true, num_false) ~ age,      family = "binomial", data = word_data) 
## 
## Coefficients:
## (Intercept)          age  
##     -3.9171       0.1513  
## 
## Number of observations: 19 
## Fitted by method  'Mqle'

Examples 3

robustbase::glmrob(cbind(num_true, num_false) ~ age, 
                   data = word_data,
                   family = "binomial") %>%
  stats::predict(ages)

##          1          2          3          4          5          6          7 
## -1.1932928 -1.0419722 -0.8906516 -0.7393311 -0.5880105 -0.4366900 -0.2853694 
##          8          9         10         11         12         13         14 
## -0.1340489  0.0172717  0.1685923  0.3199128  0.4712334  0.6225539  0.7738745 
##         15         16         17         18         19 
##  0.9251950  1.0765156  1.2278362  1.3791567  1.5304773

Examples 3

robustbase::glmrob(cbind(num_true, num_false) ~ age, 
                   data = word_data,
                   family = "binomial") %>%
  stats::predict(ages) %>% 
  inv_logit() -> mod_prop; cbind(ages, mod_prop)

Examples 3

inner_join(cbind(ages, mod_prop), 
           cbind(ages, 
      data_prop = word_data$num_true/
        (word_data$num_true + word_data$num_false)))

Examples 3

inner_join(cbind(ages, mod_prop), 
           cbind(ages, 
      data_prop = word_data$num_true/
        (word_data$num_true + word_data$num_false))) %>%
  gather(key = "type",
         value = "proportion",
         c(mod_prop, data_prop)) -> plot_prop; plot_prop

Examples 3

plot_prop %>% 
  ggplot(aes(y = proportion, x = age, col = type)) + theme_classic(base_family = "NanumGothic") + ggtitle("파이팅") + coord_cartesian(ylim = c(0,1))

Examples 3

plot_prop %>% filter(type == "data_prop") %>%
  ggplot(aes(y = proportion, x = age, col = type)) + theme_classic(base_family = "NanumGothic") + ggtitle("파이팅") + coord_cartesian(ylim = c(0,1)) +
  geom_point()

Examples 3

plot_prop %>% 
  ggplot(aes(y = proportion, x = age, col = type)) + theme_classic(base_family = "NanumGothic") + ggtitle("파이팅") + coord_cartesian(ylim = c(0,1)) +
  geom_point()

Examples 3

plot_prop %>% 
  ggplot(aes(y = proportion, x = age, col = type)) + theme_classic(base_family = "NanumGothic") + ggtitle("파이팅") + coord_cartesian(ylim = c(0,1)) +
  geom_point() + geom_hline(yintercept = .5, col = "grey70") + geom_vline(xintercept = 26, col = "grey70")
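
The vertical line at 26 months can be checked against the coefficients printed earlier. On the logit scale the fitted curve reaches 50% where intercept + slope × age = 0, so the crossing age is -intercept / slope. The sketch below plugs in the printed values; it is a hand calculation, and fit_aoa() may differ in details such as rounding or age limits.

```r
inv_logit <- function(x) 1 / (exp(-x) + 1)

# Coefficients as printed in the glmrob output above.
b0 <- -3.9171  # intercept
b1 <- 0.1513   # slope per month of age

# Age at which the fitted proportion equals 0.5.
aoa <- -b0 / b1
aoa

# Fitted proportion at the whole month nearest the crossing point.
p26 <- inv_logit(b0 + b1 * 26)
p26
```

The crossing falls just before 26 months, which is why the fitted proportion at 26 is slightly above one half.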

Resources

MacArthur-Bates Communicative Development Inventories
- mb-cdi.stanford.edu

Wordbank
- wordbank.stanford.edu
- github.com/langcog/wordbankr

Learning R
- R for Data Science by Hadley Wickham and Garrett Grolemund
- Data Visualization with ggplot2 by Hadley Wickham