1. Introduction to CDI & Stanford Wordbank
2. How to Use the Stanford Wordbank Website
3. In-Depth Analysis with wordbankr

The MacArthur-Bates Communicative Development Inventories (MB-CDIs)

are parent-report instruments that capture information about children’s developing abilities in early language:
vocabulary comprehension, production, gestures, and grammar.

Korean MB-CDI

items_kor_ws

Korean MB-CDI

summary

Wordbank

Vocabulary Norms

  • show children’s vocabulary size across age
  • show typical language development

Item Trajectories

  • track the use of specific words or phrases over time
  • show how a child’s language skills are developing
  • identify potential areas of difficulty

Vocabulary Norms

read_csv("files/wordbank_vocab_data.csv")

Item Trajectories

read_csv("files/wordbank_item_trajectories.csv")

Wordbank

Cross-Linguistic Trajectories

  • similar to item trajectories, but
  • allow comparisons of the same word across languages

Semantic Networks

  • show relationships between different words
  • show how a child’s vocabulary is organized

Data Export Tools

  • export data from the Wordbank website
  • for conducting in-depth analyses using software such as R or SPSS

Cross-Linguistic Trajectories: mommy

read_csv("files/wordbank_crosslinguistic_mommy.csv") %>% arrange(age)

Cross-Linguistic Trajectories: food

read_csv("files/wordbank_crosslinguistic_food.csv") %>% arrange(age)

Cross-Linguistic Trajectories: eat

read_csv("files/wordbank_crosslinguistic_eat.csv") %>% arrange(age)

Semantic Networks

Data Export Tools

By-Word Summary Data

read_csv("files/wordbank_item_data.csv")

By-Word Summary Data

set.seed(1234)
item <- sample(641, 3)
read_csv("files/wordbank_item_data.csv") %>% 
  filter(item_id %in% item) 

By-Word Summary Data

read_csv("files/wordbank_item_data.csv") %>% 
  filter(item_id %in% item) %>% 
  pivot_longer(cols = as.character(18:36), names_to = "age", values_to = "proportion") 

By-Word Summary Data

read_csv("files/wordbank_item_data.csv") %>% 
  filter(item_id %in% item) %>% 
  pivot_longer(cols = as.character(18:36), names_to = "age", values_to = "proportion") %>%
  ggplot(aes(y = proportion, x = age, col = item_definition)) + geom_point() + 
  theme_classic(base_family = "NanumGothic") +
  geom_hline(yintercept = .5, col = "grey40", linetype = "dashed") 

wordbankr

langcog/wordbankr



install.packages("wordbankr")
# or, from GitHub:
devtools::install_github("langcog/wordbankr")

wordbankr

library(wordbankr)
help(package = "wordbankr")
ls("package:wordbankr")
##  [1] "fit_aoa"                 "fit_vocab_quantiles"    
##  [3] "get_administration_data" "get_crossling_data"     
##  [5] "get_crossling_items"     "get_instrument_data"    
##  [7] "get_instruments"         "get_item_data"          
##  [9] "get_sources"             "summarise_items"

wordbankr: instruments

get_instruments()

wordbankr: sources

get_sources()

wordbankr: sources

get_sources(language = "Korean")

wordbankr: get_administration_data()

admins_kor_ws <- get_administration_data(language = "Korean", form = "WS")
admins_kor_ws
n_distinct(admins_kor_ws$data_id)
## [1] 1370

wordbankr: summary

admins_kor_ws %>% group_by(sex) %>% count(age) %>% spread(sex, n)

wordbankr: get_instrument_data()

inst_kor_ws <- get_instrument_data(language = "Korean", form = "WS")
inst_kor_ws

wordbankr: get_instrument_data() + get_administration_data()

inst_kor_ws %>% 
  inner_join(admins_kor_ws %>% select(data_id, age, sex))
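
The join above relies on `data_id` being the key shared by the two tables. A minimal sketch on made-up toy tables (the values are illustrative, not Wordbank data) shows what such a join keeps and how a per-child count can follow from it. Base R `merge()` and `aggregate()` stand in for `inner_join()` and `count()` only to keep the sketch dependency-free; the names `inst_toy`, `admins_toy`, `joined_toy`, and `production_toy` are hypothetical.

```r
# Toy stand-ins for inst_kor_ws and admins_kor_ws (values are made up).
inst_toy <- data.frame(
  data_id = c(1, 1, 2, 2, 3),
  item_id = c("item_1", "item_2", "item_1", "item_2", "item_1"),
  value   = c("produces", "", "produces", "produces", ""),
  stringsAsFactors = FALSE
)
admins_toy <- data.frame(
  data_id = c(1, 2),
  age     = c(18, 24),
  sex     = c("Female", "Male"),
  stringsAsFactors = FALSE
)

# merge() keeps only rows whose data_id appears in both tables,
# like dplyr::inner_join(..., by = "data_id").
joined_toy <- merge(inst_toy, admins_toy, by = "data_id")
joined_toy

# Productive vocabulary per child: count the rows marked "produces".
produced <- joined_toy[joined_toy$value == "produces", ]
production_toy <- aggregate(
  x   = list(production = produced$value),
  by  = list(data_id = produced$data_id, age = produced$age, sex = produced$sex),
  FUN = length
)
production_toy
```

Child 3 has no administration record and so drops out of the join. Note also that any filter-then-count approach drops a child who produces none of the items, rather than recording a count of zero.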

wordbankr: production

inst_kor_ws %>% 
  inner_join(admins_kor_ws %>% select(data_id, age, sex)) 

wordbankr: production

inst_kor_ws %>% 
  inner_join(admins_kor_ws %>% select(data_id, age, sex)) %>% 
  filter(value == "produces" & !is.na(sex))

wordbankr: production

inst_kor_ws %>% 
  inner_join(admins_kor_ws %>% select(data_id, age, sex)) %>% 
  filter(value == "produces" & !is.na(sex)) %>% 
  group_by(data_id, age, sex)

wordbankr: production

inst_kor_ws %>% 
  inner_join(admins_kor_ws %>% select(data_id, age, sex)) %>% 
  filter(value == "produces" & !is.na(sex)) %>% 
  group_by(data_id, age, sex) %>% 
  count() %>% rename(production = n) -> data_kor_ws
data_kor_ws

wordbankr: production growth snapshot

ggplot(data_kor_ws, aes(x = age, y = production, col = sex)) +
  labs(x = "Age (months)", y = "Productive vocabulary size") + theme_classic() 

wordbankr: production growth snapshot

ggplot(data_kor_ws, aes(x = age, y = production, col = sex)) +
  labs(x = "Age (months)", y = "Productive vocabulary size") + theme_classic() +
  geom_jitter(size = 0.5) 

wordbankr: production growth snapshot

ggplot(data_kor_ws, aes(x = age, y = production, col = sex)) +
  labs(x = "Age (months)", y = "Productive vocabulary size") + theme_classic() +
  geom_jitter(colour = "grey", size = 0.5) +
  geom_smooth(method = "lm", formula = y ~ splines::ns(x, df = 2)) 
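
The `formula = y ~ splines::ns(x, df = 2)` argument asks `geom_smooth()` to fit a linear model on a natural cubic spline basis instead of a straight line. A minimal sketch of the same kind of fit outside ggplot2, on simulated data (the growth curve, noise level, and the names `toy`, `fit_line`, `fit_spline`, and `grid` are assumptions chosen for illustration):

```r
set.seed(1)
# Simulated vocabulary sizes that grow non-linearly with age (not real data).
toy <- data.frame(age = rep(18:36, each = 10))
toy$production <- 600 / (1 + exp(-(toy$age - 27) / 3)) + rnorm(nrow(toy), sd = 40)

# Straight line versus natural spline with 2 degrees of freedom.
fit_line   <- lm(production ~ age, data = toy)
fit_spline <- lm(production ~ splines::ns(age, df = 2), data = toy)

# The spline model has one more coefficient than the line
# (intercept + 2 basis terms versus intercept + slope).
length(coef(fit_line))
length(coef(fit_spline))

# Predicted curve over the observed age range.
grid <- data.frame(age = 18:36)
grid$fitted <- predict(fit_spline, newdata = grid)
head(grid)
```

With `df = 2` the curve can bend once, which is often enough to follow early vocabulary growth without chasing noise; a larger `df` gives a more flexible curve.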

wordbankr: fit_vocab_quantiles()

data_quantiles <- fit_vocab_quantiles(
  vocab_data = data_kor_ws %>% mutate(language = "Korean", form = "WS"),
  measure = production, 
  group = sex, 
  quantiles = "standard")

data_quantiles

wordbankr: fit_vocab_quantiles()

ggplot(data_kor_ws, aes(x = age, y = production)) +
  labs(x = "Age (months)", y = "Productive vocabulary size")+theme_classic() + facet_wrap(~sex)

wordbankr: fit_vocab_quantiles()

ggplot(data_kor_ws, aes(x = age, y = production, col = sex)) +
  labs(x = "Age (months)", y = "Productive vocabulary size")+theme_classic() + facet_wrap(~sex) +
  geom_jitter(size = 0.5) 

wordbankr: fit_vocab_quantiles()

ggplot(data_kor_ws, aes(x = age, y = production)) +
  labs(x = "Age (months)", y = "Productive vocabulary size")+theme_classic() + facet_wrap(~sex) +
  geom_jitter(colour = "grey", size = 0.5) +
  geom_line(data = data_quantiles, aes(y = production, x = age, col = quantile), inherit.aes = FALSE, size = 1) 

wordbankr: fit_aoa()

fit_aoa(
  inst_kor_ws %>% inner_join(admins_kor_ws %>% select(data_id, age, sex)),
  measure = "produces",
  method = "glmrob",
  proportion = 0.5
) -> aoa_list

aoa_list

wordbankr: fit_aoa()

aoa_list %>% filter(!is.na(aoa)) -> aoa_list
aoa_list
# items_kor_ws comes from get_item_data(language = "Korean", form = "WS")
items_kor_ws

wordbankr: fit_aoa()

aoa_list %>% 
  inner_join(items_kor_ws %>% select(num_item_id, definition, uni_lemma)) -> aoa_list
aoa_list %>% arrange(aoa)

Some ideas

Use data from Wordbank to explore questions about language learning.

  • Investigate the relationship between vocabulary size and other reported variables, e.g.,
    • grammar proficiency (“complexity” items), age, gender, and socio-economic status
  • Explore the relationship between lexical category and the proportion of words that children know
    • to what extent are some types of words acquired more easily than others?

Examples 1: lexical class

items_kor_ws # from get_item_data()

Examples 1: lexical class

summarise_items(items_kor_ws) ->
  item_summary
item_summary

Examples 1: lexical class

unique(item_summary$lexical_category)
## [1] "other"          "nouns"          "function_words" "predicates"    
## [5] NA
unique(item_summary$lexical_class)
## [1] "other"          "nouns"          "function_words" "verbs"         
## [5] "adjectives"     NA

Examples 1: lexical class

ggplot(item_summary %>% filter(!is.na(lexical_class)), aes(y = production, x = age)) + theme_classic()

Examples 1: lexical class

ggplot(item_summary %>% filter(!is.na(lexical_class)), aes(y = production, x = age)) + theme_classic() +
  geom_jitter(size = .5, col = "grey", alpha = .5) 

Examples 1: lexical class

ggplot(item_summary %>% filter(!is.na(lexical_class)), aes(y = production, x = age, col = lexical_class)) + theme_classic() +
  geom_jitter(size = .5)

Examples 1: lexical class

ggplot(item_summary %>% filter(!is.na(lexical_class)), aes(y = production, x = age)) + theme_classic() +
  geom_jitter(size = .5, col = "grey", alpha = .5) +
  geom_smooth(method = "lm", formula = y ~ splines::ns(x, df = 2), aes(col = lexical_class))

Examples 2

items <- items_kor_ws %>% filter(category == "games_routines")
items

Examples 2

item_summary %>% filter(item_id %in% items$item_id) 

Examples 2

item_summary %>% filter(item_id %in% items$item_id) %>%
  ggplot(aes(y = production, x = age)) + theme_classic(base_family = "NanumGothic") +
  facet_wrap(~definition, ncol = 5)+ labs(colour="Items") 

Examples 2

item_summary %>% filter(item_id %in% items$item_id) %>%
  ggplot(aes(y = production, x = age)) + theme_classic(base_family = "NanumGothic") +
  facet_wrap(~definition, ncol = 5)+ labs(colour="Items") + geom_point(col = "grey", alpha = .5) 

Examples 2

item_summary %>% filter(item_id %in% items$item_id) %>%
  ggplot(aes(y = production, x = age)) + theme_classic(base_family = "NanumGothic") +
  facet_wrap(~definition, ncol = 5)+ labs(colour="Items") + geom_point(col = "grey", alpha = .5) +
  geom_smooth(method = "lm", formula = y ~ splines::ns(x, df = 2))

Examples 2

item_summary %>% filter(item_id %in% items$item_id) %>%
  ggplot(aes(y = production, x = age)) + theme_classic(base_family = "NanumGothic") +
  facet_wrap(~paste(definition, uni_lemma, sep = "_"), ncol = 5)+ labs(colour="Items") + geom_point(col = "grey", alpha = .5) +
  geom_smooth(method = "lm", formula = y ~ splines::ns(x, df = 2))

Some ideas

Integrative data analysis

  1. Estimate the age of acquisition (AoA) of items of interest from Wordbank
  2. Measure the frequency of these items in child-directed speech from corpus data (e.g., childes-db)
  • Examine the relationship between (1) and (2)
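
A minimal sketch of the final step, assuming the first two steps have already produced one AoA estimate and one child-directed frequency per item. The table below is entirely made up (hypothetical items and numbers, not Wordbank or childes-db results); only the shape of the analysis is the point.

```r
# Hypothetical per-item table: AoA in months and frequency per million words.
toy_items <- data.frame(
  item      = c("word_a", "word_b", "word_c", "word_d", "word_e", "word_f"),
  aoa       = c(19, 21, 24, 26, 30, 33),
  frequency = c(5200, 3100, 900, 1200, 150, 60)
)

# Frequencies are heavily skewed, so work on the log scale.
toy_items$log_freq <- log10(toy_items$frequency)

# Rank-based correlation between AoA and log frequency.
rho <- cor(toy_items$aoa, toy_items$log_freq, method = "spearman")
rho

# Linear model: estimated change in AoA (months) per tenfold increase in frequency.
fit <- lm(aoa ~ log_freq, data = toy_items)
coef(fit)
```

In a real analysis the two tables would first need to be matched on a shared key, for example `uni_lemma`, before any correlation is computed.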

Examples 3

full_kor_ws <- get_instrument_data(language = "Korean", form = "WS", administrations = TRUE, iteminfo = TRUE) 
full_kor_ws

Examples 3

full_kor_ws %>% 
  dplyr::mutate(produces = !is.na(value) & value == 
                  "produces", 
                understands = !is.na(value) & 
                  (value == "understands" | value == "produces")) 

Examples 3

full_kor_ws %>% 
  dplyr::mutate(produces = !is.na(value) & value == 
                  "produces", 
                understands = !is.na(value) & 
                  (value == "understands" | value == "produces")) %>% 
  dplyr::select(age, num_item_id, uni_lemma, category, definition, produces, understands)

Examples 3

full_kor_ws %>% 
  dplyr::mutate(produces = !is.na(value) & value == 
                  "produces", 
                understands = !is.na(value) & 
                  (value == "understands" | value == "produces")) %>% 
  dplyr::select(age, num_item_id, uni_lemma, category, definition, produces, understands) %>% 
  tidyr::gather("measure_name", "value", produces, understands)

Examples 3

full_kor_ws %>% 
  dplyr::mutate(produces = !is.na(value) & value == 
                  "produces", 
                understands = !is.na(value) & 
                  (value == "understands" | value == "produces")) %>% 
  dplyr::select(age, num_item_id, uni_lemma, category, definition, produces, understands) %>% 
  tidyr::gather("measure_name", "value", produces, understands) %>% 
  dplyr::filter(measure_name == "produces") 

Examples 3

full_kor_ws %>% 
  dplyr::mutate(produces = !is.na(value) & value == 
                  "produces", 
                understands = !is.na(value) & 
                  (value == "understands" | value == "produces")) %>% 
  dplyr::select(age, num_item_id, uni_lemma, category, definition, produces, understands) %>% 
  tidyr::gather("measure_name", "value", produces, understands) %>% 
  dplyr::filter(measure_name == "produces") %>% 
  dplyr::group_by(age, num_item_id, uni_lemma, category, definition) 

Examples 3

full_kor_ws %>% 
  dplyr::mutate(produces = !is.na(value) & value == 
                  "produces", 
                understands = !is.na(value) & 
                  (value == "understands" | value == "produces")) %>% 
  dplyr::select(age, num_item_id, uni_lemma, category, definition, produces, understands) %>% 
  tidyr::gather("measure_name", "value", produces, understands) %>% 
  dplyr::filter(measure_name == "produces") %>% 
  dplyr::group_by(age, num_item_id, uni_lemma, category, definition) %>% 
  dplyr::summarise(num_true = sum(value), 
                   num_false = dplyr::n() -  num_true) -> 
  item_data; item_data

Examples 3

item_data %>% filter(definition == "파이팅") -> word_data; word_data
cbind(ages = word_data$age, 
      data_prop = word_data$num_true/
        (word_data$num_true + word_data$num_false)) %>% data.frame()

Examples 3

inv_logit <- function(x) 1/(exp(-x) + 1)
ages <- dplyr::tibble(age = c(min(item_data$age):max(item_data$age)))
ages
robustbase::glmrob(cbind(num_true, num_false) ~ age, 
                   data = word_data, 
                   family = "binomial")
## 
## Call:  robustbase::glmrob(formula = cbind(num_true, num_false) ~ age,      family = "binomial", data = word_data) 
## 
## Coefficients:
## (Intercept)          age  
##     -3.9171       0.1513  
## 
## Number of observations: 19 
## Fitted by method  'Mqle'

Examples 3

robustbase::glmrob(cbind(num_true, num_false) ~ age, 
                   data = word_data,
                   family = "binomial") %>%
  stats::predict(ages)
##          1          2          3          4          5          6          7 
## -1.1932928 -1.0419722 -0.8906516 -0.7393311 -0.5880105 -0.4366900 -0.2853694 
##          8          9         10         11         12         13         14 
## -0.1340489  0.0172717  0.1685923  0.3199128  0.4712334  0.6225539  0.7738745 
##         15         16         17         18         19 
##  0.9251950  1.0765156  1.2278362  1.3791567  1.5304773

Examples 3

robustbase::glmrob(cbind(num_true, num_false) ~ age, 
                   data = word_data,
                   family = "binomial") %>%
  stats::predict(ages) %>% 
  inv_logit() -> mod_prop; cbind(ages, mod_prop)

Examples 3

inner_join(cbind(ages, mod_prop), 
           cbind(ages, 
      data_prop = word_data$num_true/
        (word_data$num_true + word_data$num_false))) 

Examples 3

inner_join(cbind(ages, mod_prop), 
           cbind(ages, 
      data_prop = word_data$num_true/
        (word_data$num_true + word_data$num_false))) %>%
  gather(key = "type",
         value = "proportion",
         c(mod_prop, data_prop)) -> plot_prop; plot_prop

Examples 3

plot_prop %>% 
  ggplot(aes(y = proportion, x = age, col = type)) + theme_classic(base_family = "NanumGothic") + ggtitle("파이팅") + coord_cartesian(ylim = c(0,1)) 

Examples 3

plot_prop %>% filter(type == "data_prop") %>%
  ggplot(aes(y = proportion, x = age, col = type)) + theme_classic(base_family = "NanumGothic") + ggtitle("파이팅") + coord_cartesian(ylim = c(0,1)) +
  geom_point() 

Examples 3

plot_prop %>% 
  ggplot(aes(y = proportion, x = age, col = type)) + theme_classic(base_family = "NanumGothic") + ggtitle("파이팅") + coord_cartesian(ylim = c(0,1)) +
  geom_point() 

Examples 3

plot_prop %>% 
  ggplot(aes(y = proportion, x = age, col = type)) + theme_classic(base_family = "NanumGothic") + ggtitle("파이팅") + coord_cartesian(ylim = c(0,1)) +
  geom_point() + geom_hline(yintercept = .5, col = "grey70") + geom_vline(xintercept = 26, col = "grey70")
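
The vertical line at 26 months can be recovered from the fitted coefficients. On the logit scale the model is intercept + slope * age, and a proportion of 0.5 corresponds to a logit of 0, so the crossing age is -intercept / slope. A short check using the coefficients printed in the `glmrob` output earlier (the variable names `intercept`, `slope`, and `aoa` are introduced here for illustration):

```r
# Coefficients copied from the glmrob output shown earlier.
intercept <- -3.9171
slope     <- 0.1513

inv_logit <- function(x) 1 / (exp(-x) + 1)

# Solve intercept + slope * age = 0 for age.
aoa <- -intercept / slope
aoa          # about 25.9 months
round(aoa)   # 26

# Sanity check: the modelled proportion at that age is 0.5.
inv_logit(intercept + slope * aoa)
```

More generally, for a target proportion `p` the crossing age is `(qlogis(p) - intercept) / slope`, where `qlogis()` is the logit function from base R's stats package.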

Resources