Getting Started with spell.replacer

library(spell.replacer)

Introduction

The spell.replacer package provides probabilistic spelling correction for character vectors in R. It uses the Jaro-Winkler string distance metric combined with word frequency data from the Corpus of Contemporary American English (COCA) to automatically correct misspelled words.
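To get a feel for the similarity measure involved, the Jaro-Winkler distance can be computed directly with the stringdist package. This is purely illustrative; stringdist is shown here as a stand-in, and the snippet is not spell.replacer's internal code:

library(stringdist)

# Jaro-Winkler distance: 0 means identical strings, 1 means maximally different.
# The p argument is the Winkler prefix-scaling factor.
stringdist("recieve", "receive", method = "jw", p = 0.1)
stringdist("recieve", "retrieve", method = "jw", p = 0.1)

A smaller distance means a closer match, so "receive" would be preferred over "retrieve" as a correction for "recieve".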

Basic Usage

The main function is spell_replace(), which takes a character vector and returns it with corrected spellings:

# Example text with misspellings
text <- c("This is a smple text with some mispelled words.",
          "We can corect them automaticaly.")

# Apply spell correction
corrected_text <- spell_replace(text)
print(corrected_text)
#> [1] "This is a simple text with some spelled words."
#> [2] "We can correct them automatically."

Note that correction is probabilistic: in this example "mispelled" was replaced with "spelled", the closest high-frequency match, rather than the intended "misspelled".

How It Works

The package uses a two-step process:

  1. Identify misspelled words: Uses the hunspell package to identify words not found in standard dictionaries
  2. Find corrections: For each misspelled word, calculates Jaro-Winkler distance to words in the COCA frequency list and selects the best match
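The two steps above can be sketched for a single word as follows. This is an illustrative reimplementation using the hunspell and stringdist packages, not the package's actual internals, and correct_word_sketch is a hypothetical helper name:

library(hunspell)
library(stringdist)

correct_word_sketch <- function(word, wordlist) {
  # Step 1: leave words that the dictionary already recognizes untouched
  if (hunspell_check(word)) return(word)
  # Step 2: choose the frequency-list entry with the smallest
  # Jaro-Winkler distance to the misspelled word
  d <- stringdist(word, wordlist, method = "jw", p = 0.1)
  wordlist[which.min(d)]
}

In the real package, ties and distance thresholds also matter (see the threshold parameter below), but the core idea is this dictionary check followed by a nearest-match lookup.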

Customizing Correction

You can adjust the correction behavior with several parameters:

# More restrictive threshold (fewer corrections)
conservative <- spell_replace(text, threshold = 0.08)

# Ignore potential proper names
text_with_names <- "John went to Bostan yesterday."
corrected_names <- spell_replace(text_with_names, ignore_names = TRUE)
print(corrected_names)
#> [1] "John went to Boston yesterday."

Single Word Correction

You can also correct individual words using the correct() function:

# Correct a single word
corrected_word <- correct("recieve", coca_list)
print(corrected_word)
#> [1] "receive"

Working with Data Frames

One of the main benefits of spell.replacer is that it integrates seamlessly with tidyverse workflows. You can easily apply spell correction to entire columns of text data:

library(dplyr)

# Example dataframe with text column
docs <- data.frame(
  id = 1:3,
  text = c("This docment has misspellings.",
           "Anothr exmple with erors.",
           "The finl text sampel.")
)

# Apply spell correction using tidy syntax
docs %>%
  mutate(text = spell_replace(text))

Performance

The package processes approximately 1,000 words per second, making it suitable for large-scale text processing tasks. For example:

  • A 100,000 word corpus would take about 1.7 minutes
  • A 1,000,000 word corpus would take about 16 minutes
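These estimates follow directly from the stated rate; a small helper (hypothetical, for illustration only) makes the arithmetic explicit:

# Estimated runtime in minutes at ~1,000 words per second
est_minutes <- function(n_words, rate = 1000) n_words / rate / 60

est_minutes(100000)   # ~1.7 minutes
est_minutes(1000000)  # ~16.7 minutes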

This makes spell.replacer practical for preprocessing large text datasets before analysis.

Word Frequency Data

The package includes the coca_list dataset with the 100,000 most frequent words from COCA:

# Most frequent words
head(coca_list, 10)
#>  [1] "the"  "and"  "of"   "to"   "a"    "in"   "that" "is"   "i"    "for"

# Check if a word is in the list
"hello" %in% coca_list
#> [1] TRUE

# Find the frequency rank of a word
which(coca_list == "hello")
#> [1] 2579