It generates n-gram tokens along with their frequencies. The data may be saved to a file in plain text format or as an R object.
Super class
wordpredictor::Base
-> TokenGenerator
Methods
Method new()
It initializes the current object. It is used to set the input file name, the tokenization options and the verbose option.
Usage
TokenGenerator$new(fn = NULL, opts = list(), ve = 0)
Arguments
fn
The path to the input file.
opts
The options for generating the n-gram tokens.
n. The n-gram size.
save_ngrams. If the n-gram data should be saved.
min_freq. All n-grams with frequency less than min_freq are ignored.
line_count. The number of lines to process at a time.
stem_words. If words should be transformed to their stems.
dir. The directory where the output file should be saved.
format. The format for the output. There are two options.
plain. The data is stored in plain text.
obj. The data is stored as an R object.
ve
The level of detail in the information messages.
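As a rough illustration of the arguments above, an options list that names every documented field might look like the sketch below. The values are illustrative assumptions only and are not the package defaults; the list is built with base R and is not passed to TokenGenerator here.

```r
# Illustrative values only (assumptions, not the package defaults)
tg_opts <- list(
  "n" = 2,                 # n-gram size
  "save_ngrams" = TRUE,    # save the n-gram data to a file
  "min_freq" = 2,          # ignore n-grams with frequency below 2
  "line_count" = 5000,     # number of lines to process at a time
  "stem_words" = FALSE,    # do not transform words to their stems
  "dir" = tempdir(),       # directory for the output file
  "format" = "plain"       # "plain" for plain text, "obj" for an R object
)
# Show the structure of the options list
str(tg_opts)
```

Such a list would be supplied as the opts argument of TokenGenerator$new().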
Method generate_tokens()
It generates n-gram tokens and their frequencies from the input file. The tokens may be saved to a file as plain text or as an R object.
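To make the idea of n-gram tokens and frequencies concrete, here is a minimal base R sketch. It is not the package's implementation: the helper count_ngrams, the underscore separator and the output column names are all assumptions made for illustration. It only mirrors the roles of the n and min_freq options described above.

```r
# Hypothetical helper, not part of wordpredictor: counts n-grams in base R
count_ngrams <- function(lines, n = 2, min_freq = 1) {
  # Lower-case each line and split it into words on whitespace
  words_per_line <- strsplit(tolower(trimws(lines)), "\\s+")
  ngrams <- unlist(lapply(words_per_line, function(w) {
    w <- w[nzchar(w)]
    # Lines with fewer than n words contribute no n-grams
    if (length(w) < n) return(character(0))
    # Join each window of n consecutive words (separator is an assumption)
    starts <- seq_len(length(w) - n + 1)
    vapply(starts, function(i) paste(w[i:(i + n - 1)], collapse = "_"),
           character(1))
  }))
  if (length(ngrams) == 0) {
    return(data.frame(ngram = character(0), freq = integer(0)))
  }
  # Count how often each n-gram occurs
  freq <- table(ngrams)
  df <- data.frame(ngram = names(freq), freq = as.integer(freq),
                   stringsAsFactors = FALSE)
  # Drop n-grams with frequency less than min_freq
  df <- df[df$freq >= min_freq, , drop = FALSE]
  # Most frequent n-grams first
  df[order(-df$freq, df$ngram), , drop = FALSE]
}

lines <- c("the cat sat", "the cat ran", "a dog sat")
bigrams <- count_ngrams(lines, n = 2, min_freq = 2)
print(bigrams)
```

With these three lines, only the bigram "the_cat" occurs twice, so it is the only row that survives the min_freq filter.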
Examples
# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("test-clean.txt")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve, rp = "./")
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code
# The n-gram size
n <- 4
# The test file name
tfn <- paste0(ed, "/test-clean.txt")
# The token generation options are set
tg_opts <- list("n" = n, "save_ngrams" = TRUE, "dir" = ed)
# The TokenGenerator object is created
tg <- TokenGenerator$new(tfn, tg_opts, ve = ve)
# The n-gram tokens are generated
tg$generate_tokens()
# The test environment is removed. Comment out the line below to
# view the files generated by the function
em$td_env()