Generates n-grams from text files

It generates n-gram tokens along with their frequencies. The data may be saved to a file in plain text format or as an R object.

Super class

wordpredictor::Base -> TokenGenerator

Methods

Public methods

TokenGenerator$new()
TokenGenerator$generate_tokens()
TokenGenerator$clone()

Method `new()`

It initializes the current object. It is used to set the input file name, the tokenization options and the verbosity level.

Usage

TokenGenerator$new(fn = NULL, opts = list(), ve = 0)

Arguments

fn

The path to the input file.

opts

The options for generating the n-gram tokens.

n. The n-gram size.
save_ngrams. Whether the n-gram data should be saved.
min_freq. N-grams with a frequency less than min_freq are ignored.
line_count. The number of lines to process at a time.
stem_words. Whether words should be reduced to their stems.
dir. The directory where the output file should be saved.
format. The output format. There are two options:
- plain. The data is stored as plain text.
- obj. The data is stored as an R object.

ve

The level of detail in the information messages.
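
As a rough sketch of how the opts fields listed above fit together, the list below sets every field explicitly. The values are illustrative assumptions only; this page does not state the package defaults, so they should not be read as such.

```r
# Illustrative values only; the package defaults are not stated on this page.
tg_opts <- list(
  n = 3,               # generate 3-grams
  save_ngrams = TRUE,  # write the n-gram data to disk
  min_freq = 2,        # ignore n-grams seen fewer than 2 times
  line_count = 5000,   # process the input 5000 lines at a time
  stem_words = FALSE,  # keep words as they are, without stemming
  dir = tempdir(),     # save the output in the session temp directory
  format = "plain"     # either "plain" or "obj"
)
```

A list like this would be passed as the opts argument, for example TokenGenerator$new(fn, tg_opts, ve = 0).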

Method `generate_tokens()`

It generates n-gram tokens and their frequencies from the input file. The tokens may be saved to a file as plain text or as an R object.

Usage

TokenGenerator$generate_tokens()

Returns

A data frame containing the n-gram tokens along with their frequencies.
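
To make the shape of such a result concrete, here is a minimal base R sketch of n-gram frequency counting with a min_freq filter. It is not the package's implementation: the helper count_ngrams and the column names ngram and freq are hypothetical, and the tokenization (lower-casing and splitting on whitespace) is a simplifying assumption.

```r
# Illustrative only: a base R n-gram counter, not the package's implementation.
# The function name and the column names "ngram" and "freq" are hypothetical.
count_ngrams <- function(lines, n = 2, min_freq = 1) {
  ngrams <- unlist(lapply(lines, function(line) {
    # Lower-case the line and split it on runs of whitespace
    words <- strsplit(tolower(trimws(line)), "\\s+")[[1]]
    words <- words[nzchar(words)]
    # Lines with fewer than n words contribute no n-grams
    if (length(words) < n) {
      return(character(0))
    }
    # Slide a window of n words over the line
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(i) {
      paste(words[i:(i + n - 1)], collapse = " ")
    }, character(1))
  }))
  if (length(ngrams) == 0) {
    return(data.frame(ngram = character(0), freq = integer(0)))
  }
  # Count each distinct n-gram
  counts <- table(ngrams)
  df <- data.frame(
    ngram = names(counts),
    freq = as.integer(counts),
    stringsAsFactors = FALSE
  )
  # Drop n-grams with frequency less than min_freq
  df <- df[df$freq >= min_freq, , drop = FALSE]
  # Most frequent first, ties broken alphabetically
  df <- df[order(-df$freq, df$ngram), , drop = FALSE]
  rownames(df) <- NULL
  df
}

res <- count_ngrams(c("the cat sat", "the cat ran"), n = 2)
print(res)
```

The sketch counts within each line separately, so no n-gram spans a line break. Whether TokenGenerator treats line boundaries the same way is not stated here.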

Examples

# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("test-clean.txt")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve, rp = "./")
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code

# The n-gram size
n <- 4
# The test file name
tfn <- paste0(ed, "/test-clean.txt")
# The token generation options are set
tg_opts <- list("n" = n, "save_ngrams" = TRUE, "dir" = ed)
# The TokenGenerator object is created
tg <- TokenGenerator$new(tfn, tg_opts, ve = ve)
# The n-gram tokens are generated
tg$generate_tokens()

# The test environment is removed. Comment out the line below to keep
# the files generated by the function so they can be viewed
em$td_env()

Method `clone()`

The objects of this class are cloneable with this method.

Usage

TokenGenerator$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.
