
Generates n-gram tokens along with their frequencies. The data may be saved to a file in plain text format or as an R object.

Super class

wordpredictor::Base -> TokenGenerator

Methods

Inherited methods


    Method new()

    It initializes the current object. It is used to set the input file name, the tokenization options and the verbosity level.

    Usage

    TokenGenerator$new(fn = NULL, opts = list(), ve = 0)

    Arguments

    fn

    The path to the input file.

    opts

    The options for generating the n-gram tokens.

    • n. The n-gram size.

    • save_ngrams. Whether the n-gram data should be saved.

    • min_freq. All n-grams with a frequency less than min_freq are ignored.

    • line_count. The number of lines to process at a time.

    • stem_words. Whether words should be transformed to their stems.

    • dir. The directory where the output file should be saved.

    • format. The format of the output. There are two options.

      • plain. The data is stored in plain text.

      • obj. The data is stored as an R object.

    ve

    The level of detail in the information messages.
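As a minimal sketch of how the options above fit together, the list below sets every documented option explicitly. The values are illustrative assumptions, not the package defaults, and `tg_opts` is just a name chosen for this example.

```r
# Illustrative option values only; these are assumptions, not package defaults
tg_opts <- list(
  n = 2,                # the n-gram size (bigrams)
  save_ngrams = TRUE,   # save the generated n-gram data
  min_freq = 2,         # ignore n-grams seen fewer than 2 times
  line_count = 5000,    # number of lines to process at a time
  stem_words = FALSE,   # keep words as they are, without stemming
  dir = tempdir(),      # directory for the output file
  format = "plain"      # "plain" for plain text, "obj" for an R object
)

# Show the structure of the options list
str(tg_opts)
```

A list like this would be passed as the second argument, following the usage shown above: `TokenGenerator$new(fn, tg_opts, ve = 0)`.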


    Method generate_tokens()

    It generates n-gram tokens and their frequencies from the input file. The tokens may be saved to a file as plain text or as an R object.

    Usage

    TokenGenerator$generate_tokens()

    Returns

    The data frame containing n-gram tokens along with their frequencies.
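To make the idea of n-gram tokens with frequencies concrete, here is a small base R sketch that counts n-grams in a character vector and returns them as a data frame. It is not the package's implementation: `count_ngrams` and the column names `token` and `freq` are hypothetical, and the real method reads from a file and may clean and split text differently.

```r
# Hypothetical helper for illustration; not part of wordpredictor
count_ngrams <- function(lines, n = 2, min_freq = 1) {
  # Build the n-grams of each line separately, so that no n-gram
  # spans two lines
  ngrams <- unlist(lapply(lines, function(line) {
    # Lower-case the line and split it on whitespace
    words <- strsplit(tolower(trimws(line)), "\\s+")[[1]]
    words <- words[nzchar(words)]
    # A line with fewer than n words has no n-grams
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(i) {
      paste(words[i:(i + n - 1)], collapse = " ")
    }, character(1))
  }))
  # Return an empty data frame if no n-grams were found
  if (length(ngrams) == 0) {
    return(data.frame(token = character(0), freq = integer(0)))
  }
  # Count how often each n-gram occurs
  counts <- table(ngrams)
  df <- data.frame(
    token = names(counts),
    freq = as.integer(counts),
    stringsAsFactors = FALSE
  )
  # Drop n-grams below the minimum frequency
  df <- df[df$freq >= min_freq, , drop = FALSE]
  # Most frequent first, ties broken alphabetically
  df <- df[order(-df$freq, df$token), , drop = FALSE]
  rownames(df) <- NULL
  df
}

res <- count_ngrams(c("the cat sat", "the cat ran"), n = 2)
print(res)
```

Raising `min_freq` plays the same role as the `min_freq` option described above: with `min_freq = 2`, only the bigram that occurs twice remains in the result.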

    Examples

    # Start of environment setup code
    # The level of detail in the information messages
    ve <- 0
    # The name of the folder that will contain all the files. It will be
    # created in the current directory. NULL implies tempdir will be used
    fn <- NULL
    # The required files. They are default files that are part of the
    # package
    rf <- c("test-clean.txt")
    # An object of class EnvManager is created
    em <- EnvManager$new(ve = ve, rp = "./")
    # The required files are downloaded
    ed <- em$setup_env(rf, fn)
    # End of environment setup code
    
    # The n-gram size
    n <- 4
    # The test file name
    tfn <- paste0(ed, "/test-clean.txt")
    # The token generation options are set
    tg_opts <- list("n" = n, "save_ngrams" = TRUE, "dir" = ed)
    # The TokenGenerator object is created
    tg <- TokenGenerator$new(tfn, tg_opts, ve = ve)
    # The n-gram tokens are generated
    tg$generate_tokens()
    
    # The test environment is removed. Comment out the line below to
    # view the files generated by the function
    em$td_env()


    Method clone()

    The objects of this class are cloneable with this method.

    Usage

    TokenGenerator$clone(deep = FALSE)

    Arguments

    deep

    Whether to make a deep clone.
