stylest2 vignette

Christian Baehr

2024-03-22

About stylest2

This vignette describes the usage of stylest2 for estimating speaker (author) style distinctiveness.

Installation

The development version of stylest2 can be installed from GitHub with:

install.packages("devtools")
devtools::install_github("ArthurSpirling/stylest2")
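A release may also be available on CRAN; if so, the stable version can be installed in the usual way (this assumes the package has been published to CRAN):

install.packages("stylest2")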

Load the package

stylest2 is built to interface with quanteda. A quanteda dfm object is required to fit a model in stylest2, so we recommend installing quanteda as well.

library(stylest2)
library(quanteda)
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.

Example: Fitting a model to English novels

dfm

We will be using excerpts from the opening lines of novels by Jane Austen, George Eliot, and Elizabeth Gaskell. The excerpts were obtained from the full texts of the novels available on Project Gutenberg: http://gutenberg.org.

data(novels)
title author text
1 A Dark Night’s Work Gaskell, Elizabeth Cleghorn In the county town of a certain shire there lived (about forty years ago) one Mr. Wilkins, a conveyancing attorney of considerable standing. The certain shire was but a small county, and the principal town in it contained only about four thousand inhabitants; so in saying that Mr. Wilkins was the principal lawyer in Hamley, I say very little, unless I add that he transacted all the legal business of the gentry for twenty miles round. His grandfather had established the connection; his father had consolidated and strengthened it, and, indeed, by his wise and upright conduct, as well as by his professional skill, had obtained for himself the position of confidential friend to many of the surrounding families of distinction.
4 Brother Jacob Eliot, George Among the many fatalities attending the bloom of young desire, that of blindly taking to the confectionery line has not, perhaps, been sufficiently considered. How is the son of a British yeoman, who has been fed principally on salt pork and yeast dumplings, to know that there is satiety for the human stomach even in a paradise of glass jars full of sugared almonds and pink lozenges, and that the tedium of life can reach a pitch where plum-buns at discretion cease to offer the slightest excitement? Or how, at the tender age when a confectioner seems to him a very prince whom all the world must envy–who breakfasts on macaroons, dines on meringues, sups on twelfth-cake, and fills up the intermediate hours with sugar-candy or peppermint–how is he to foresee the day of sad wisdom, when he will discern that the confectioner’s calling is not socially influential, or favourable to a soaring ambition?
8 Emma Austen, Jane Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to unite some of the best blessings of existence; and had lived nearly twenty-one years in the world with very little to distress or vex her. She was the youngest of the two daughters of a most affectionate, indulgent father; and had, in consequence of her sister’s marriage, been mistress of his house from a very early period. Her mother had died too long ago for her to have more than an indistinct remembrance of her caresses; and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection. Sixteen years had Miss Taylor been in Mr. Woodhouse’s family, less as a governess than a friend, very fond of both daughters, but particularly of Emma. Between them it was more the intimacy of sisters.

The data should be transformed into a quanteda dfm object, and the dfm should include a document variable (docvar) named “author”.

The corpus should have at least one variable by which the texts can be grouped — the most common examples are a “speaker” or “author” attribute. Here, we will use novels$author.

novels_tok <- tokens(novels$text)
novels_dfm <- dfm(novels_tok)

unique(novels$author)
#> [1] "Gaskell, Elizabeth Cleghorn" "Eliot, George"              
#> [3] "Austen, Jane"
docvars(novels_dfm)["author"] <- novels$author

Tokenization options can be passed to the tokens() function before generating a document-feature matrix; see the quanteda documentation for more information about tokens().


novels_tok <- tokens(novels$text, 
                     remove_punct = TRUE,
                     remove_symbols = TRUE,
                     remove_numbers = TRUE,
                     remove_separators = TRUE,
                     split_hyphens = TRUE)
novels_dfm <- dfm(novels_tok)
docvars(novels_dfm)["author"] <- novels$author

Using stylest2_select_vocab

This function uses n-fold cross-validation to identify the set of terms that maximizes the model’s rate of correctly predicting the speakers of out-of-sample texts. In each fold, the model is fit on the remaining folds and used to predict the authors of the held-out texts; the candidate cutoff with the lowest average miss rate across folds is selected.

(Vocabulary selection is optional; the model can be fit using all the terms in the support of the corpus.)
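If you skip vocabulary selection entirely, the model can be fit on the full dfm. A minimal sketch, assuming that the terms argument of stylest2_fit() may simply be omitted so that every feature in the dfm is used (see ?stylest2_fit):

# Fit on every term in the dfm, without prior vocabulary selection
mod_full <- stylest2_fit(dfm = novels_dfm)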

Setting the seed before this step, to ensure reproducible runs, is recommended:

set.seed(1234)

Below are examples of stylest2_select_vocab using the defaults and with custom parameters:

vocab_with_defaults <- stylest2_select_vocab(dfm = novels_dfm)
vocab_custom <- stylest2_select_vocab(dfm = novels_dfm, 
                                      smoothing = 1, 
                                      nfold = 10, 
                                      cutoffs = c(50, 75, 99))

Let’s look inside the vocab_with_defaults object.

# Percentile with best prediction rate
vocab_with_defaults$cutoff_pct_best
#> [1] 90

# Rate of INCORRECTLY predicted speakers of held-out texts
vocab_with_defaults$cv_missrate_results
#>   50% 60% 70%      80%      90%      99%
#> 1 100 100 100 66.66667 66.66667 66.66667
#> 2  60  60  60 60.00000 20.00000 60.00000
#> 3 100 100 100 75.00000 50.00000 50.00000
#> 4 100 100 100 60.00000 60.00000 40.00000
#> 5  50  50  50 50.00000 50.00000 50.00000

# Data on the setup:

# Percentiles tested
vocab_with_defaults$cutoff_candidates
#> [1] 50 60 70 80 90 99

# Number of folds
vocab_with_defaults$nfold
#> [1] 5
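As a quick sanity check, we can average the miss rates across folds ourselves; the cutoff with the smallest mean miss rate should match cutoff_pct_best (the 90th percentile here).

# Mean miss rate across folds for each candidate cutoff
colMeans(vocab_with_defaults$cv_missrate_results)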

Fitting a model

Using a percentile to select terms

With the best percentile identified as 90, we can select the terms above that percentile to use in the model. Be sure to use the same dfm (built with the same tokenization choices) here as in the previous step.

terms_90 <- stylest2_terms(dfm = novels_dfm, cutoff = 90)
#> Warning in stylest2_terms(dfm = novels_dfm, cutoff = 90): Detected multiple
#> texts with the same author. Collapsing to author-level dfm for stylest2_fit()
#> function.

Fitting the model: basic

Below, we fit the model using the terms above the 90th percentile, with the same dfm as before, leaving the smoothing value for term frequencies at its default of 0.5.

mod <- stylest2_fit(dfm = novels_dfm, terms = terms_90)
#> Warning in fit_term_usage(dfm = dfm, smoothing = smoothing, terms = terms, :
#> Detected multiple texts with the same author. Collapsing to author-level dfm
#> for stylest2_fit() function.

The model contains detailed information about token usage by each of the authors (see mod$rate); exploring this is left as an exercise.
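As a starting point for that exercise, here is a minimal sketch; it assumes that mod$rate is an author-by-term matrix of smoothed usage rates, so inspect its structure before relying on that.

# Check the structure of the estimated term-usage rates
str(mod$rate)
# If mod$rate is an author-by-term matrix (as assumed here), the terms
# Jane Austen uses most heavily can be viewed with:
# head(sort(mod$rate["Austen, Jane", ], decreasing = TRUE))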

Fitting the model: adding custom term weights

A new feature is the option to specify custom term weights, in the form of a named numeric vector. The intended use case is the mean cosine distance from the embedding representation of each word to all other words in the vocabulary, but the weights can be anything the user desires.

An example term_weights is shown below. When this argument is passed to stylest2_fit(), the weights for terms in the model vocabulary will be extracted. Any term not included in term_weights will be assigned a default weight of 0.

term_weights <- c(0.1, 0.2, 0.001)
names(term_weights) <- c("the", "and", "Floccinaucinihilipilification")

term_weights
#>                           the                           and 
#>                         0.100                         0.200 
#> Floccinaucinihilipilification 
#>                         0.001
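For illustration, weights of this kind could be derived from word embeddings. The sketch below is purely hypothetical: emb is a toy matrix of random numbers standing in for real word vectors, with one row per term.

# Toy embedding matrix (3 terms x 5 dimensions); real word vectors would
# normally come from a pre-trained embedding model
set.seed(1)
emb <- matrix(rnorm(15), nrow = 3,
              dimnames = list(c("the", "and", "of"), NULL))
emb_norm <- emb / sqrt(rowSums(emb^2))    # normalise rows to unit length
cos_dist <- 1 - emb_norm %*% t(emb_norm)  # pairwise cosine distances
term_weights_emb <- rowMeans(cos_dist)    # mean distance of each term to all
                                          # terms (includes the zero self-distance)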

Below is an example of fitting the model with term_weights:

mod <- stylest2_fit(dfm = novels_dfm,  terms = terms_90, term_weights = term_weights)
#> Warning in fit_term_usage(dfm = dfm, smoothing = smoothing, terms = terms, :
#> Detected multiple texts with the same author. Collapsing to author-level dfm
#> for stylest2_fit() function.

The weights are stored in mod$term_weights.

Using the model

By default, stylest2_predict() returns the posterior probabilities of authorship for each prediction text.

predictions <- stylest2_predict(dfm = novels_dfm, model = mod)

stylest2_predict() can optionally return the log odds of authorship for each speaker over each text, as well as the average contribution of each term in the model to speaker distinctiveness.

predictions <- stylest2_predict(dfm = novels_dfm, model = mod,
                                speaker_odds = TRUE, term_influence = TRUE)

We can examine the mean log odds that Jane Austen wrote Pride and Prejudice (in-sample).

# Pride and Prejudice
novels$text[14]
#> [1] "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters. \"My dear Mr. Bennet,\" said his lady to him one day, \"have you heard that Netherfield Park is let at last?\" Mr. Bennet replied that he had not. \"But it is,\" returned she; \"for Mrs. Long has just been here, and she told me all about it.\" Mr. Bennet made no answer. \"Do you not want to know who has taken it?\" cried his wife impatiently. \"_You_ want to tell me, and I have no objection to hearing it.\" This was invitation enough."

predictions$speaker_odds$log_odds_mean[14]
#>     text14 
#> 0.02689195

predictions$speaker_odds$log_odds_se[14]
#>     text14 
#> 0.01669951
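The same quantities are available for every text, so we can also summarize how distinctive the openings are overall:

# Distribution of the mean log odds of true authorship across all texts
summary(predictions$speaker_odds$log_odds_mean)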

Predicting the speaker of a new text

In this example, the model is used to predict the author of a new text, in this case the opening passage of Northanger Abbey by Jane Austen.

Note that a prior may be specified, and may be useful for handling texts containing out-of-sample terms. Here, we do not specify a prior, so a uniform prior is used.

na_text <- "No one who had ever seen Catherine Morland in her infancy would have supposed 
            her born to be an heroine. Her situation in life, the character of her father 
            and mother, her own person and disposition, were all equally against her. Her 
            father was a clergyman, without being neglected, or poor, and a very respectable 
            man, though his name was Richard—and he had never been handsome. He had a 
            considerable independence besides two good livings—and he was not in the least 
            addicted to locking up his daughters."

na_text_dfm <- dfm(tokens(na_text))
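Note that the model was fit on a dfm built with several tokens() options (punctuation, symbols, numbers, and separators removed; hyphens split). For out-of-sample prediction it is generally safer to preprocess the new text the same way; a minimal sketch of an alternative dfm built with matching options:

# Tokenize the new text with the same options used for the training dfm;
# this alternative dfm could be passed to stylest2_predict() instead
na_text_dfm_alt <- dfm(tokens(na_text,
                              remove_punct = TRUE,
                              remove_symbols = TRUE,
                              remove_numbers = TRUE,
                              remove_separators = TRUE,
                              split_hyphens = TRUE))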

pred <- stylest2_predict(dfm = na_text_dfm, model = mod)

Viewing the result, and recovering the log probabilities calculated for each speaker, is simple:

pred$posterior$predicted
#> [1] "Austen, Jane"

pred$posterior$log_probs
#> 1 x 3 Matrix of class "dgeMatrix"
#>       Austen, Jane Eliot, George Gaskell, Elizabeth Cleghorn
#> text1   -0.1478721     -6.999755                    -1.99109
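Returning to the earlier note on priors: a uniform prior was used implicitly above. The sketch below constructs an explicit uniform prior over the candidate authors; the prior argument name and the scale it expects are assumptions here, so consult ?stylest2_predict before using it.

# Build an explicit uniform prior over the candidate authors
authors <- unique(novels$author)
uniform_prior <- setNames(rep(1 / length(authors), length(authors)), authors)
# Hypothetical call; the `prior` argument name is assumed, not confirmed:
# pred_prior <- stylest2_predict(dfm = na_text_dfm, model = mod, prior = uniform_prior)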

The terms with the highest mean influence can be obtained:

#> [1] "the" "of"  "i"   "was" "her" "his"

And the least influential terms:

#> [1] "him"   "three" "days"  "know"  "much"  "from"

Issues

Please submit any bugs, error reports, etc. on GitHub at: https://github.com/ArthurSpirling/stylest2/issues.