
The ModelGenerator class provides a method for generating n-gram models. The models can be customized by specifying data cleaning and tokenization options.

Details

The class exposes a method that generates an n-gram model. The model can be customized by specifying the data cleaning and tokenization options.

The data cleaning options include removal of punctuation, stop words, extra spaces, non-dictionary words and bad words. The tokenization options include the n-gram size and word stemming.
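
To make the cleaning and tokenization steps concrete, here is a minimal base R sketch of how text might be cleaned and split into n-grams. It is an illustration only, not the package's implementation: `make_ngrams` is a hypothetical helper, and the cleaning is limited to lower-casing, punctuation removal and collapsing extra spaces.

```r
# Hypothetical helper (not part of wordpredictor): clean a line of text
# and split it into overlapping n-grams.
make_ngrams <- function(text, n = 2) {
  # Basic cleaning: lower case, strip punctuation, collapse extra spaces
  text <- tolower(text)
  text <- gsub("[[:punct:]]", "", text)
  text <- trimws(gsub("\\s+", " ", text))
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  # Lines shorter than n words produce no n-grams
  if (length(words) < n) {
    return(character(0))
  }
  # Slide a window of n words across the line
  starts <- seq_len(length(words) - n + 1)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

lines <- c("The cat sat.", "The cat  ran, the cat sat!")
bigrams <- unlist(lapply(lines, make_ngrams, n = 2))
# Frequency table of the bigrams, most frequent first
freq <- sort(table(bigrams), decreasing = TRUE)
print(freq)
```

An n-gram model is built from frequency counts of this kind. The package's own cleaner and tokenizer cover more cases, such as stop words, dictionary checks and stemming.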

Super class

wordpredictor::Base -> ModelGenerator

Methods


    Method new()

    It initializes the current object. It sets the model name and description, the n-gram size, sample size, input and output file names, working directory, data cleaner options, token generator options and verbosity level.

    Usage

    ModelGenerator$new(
      name = NULL,
      desc = NULL,
      fn = NULL,
      df = NULL,
      n = 4,
      ssize = 0.3,
      dir = ".",
      dc_opts = list(),
      tg_opts = list(),
      ve = 0
    )

    Arguments

    name

    The model name.

    desc

    The model description.

    fn

    The model file name.

    df

    The name of the input text file. It should be the short file name, not a full path, and the file should be present in the data directory.

    n

    The n-gram size of the model.

    ssize

    The sample size as a proportion of the input file.

    dir

    The directory containing the input and output files.

    dc_opts

    The data cleaner options.

    tg_opts

    The token generator options.

    ve

    The level of detail in the information messages.
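
    As a sketch of how the two option lists might be filled in, the block below builds `dc_opts` and `tg_opts` as plain R lists. The option names are assumptions based on the cleaning and tokenization features described above and are not confirmed by this page; check them against the package's data cleaner and token generator documentation before use.

```r
# Assumed data cleaner option names (not confirmed by this page)
dc_opts <- list(
  remove_punct = TRUE,        # strip punctuation
  remove_stop = FALSE,        # keep stop words
  remove_extra_space = TRUE,  # collapse repeated spaces
  remove_non_dict = TRUE,     # drop non-dictionary words
  remove_bad = FALSE          # keep bad words
)

# Assumed token generator option names (not confirmed by this page)
tg_opts <- list(
  stem_words = FALSE,  # do not reduce words to their stems
  min_freq = 2         # assumed threshold for dropping rare n-grams
)

# The lists would then be passed to the constructor, for example:
# mg <- ModelGenerator$new(..., dc_opts = dc_opts, tg_opts = tg_opts)
str(dc_opts)
str(tg_opts)
```

    Passing empty lists, as in the Usage block, leaves both sets of options at their defaults.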


    Method generate_model()

    It generates the model using the parameters passed to the object's constructor. It generates an n-gram model file and saves it to the model directory.

    Usage

    ModelGenerator$generate_model()

    Examples

    # Start of environment setup code
    # The level of detail in the information messages
    ve <- 0
    # The name of the folder that will contain all the files. It will be
    # created in the current directory. NULL implies tempdir will be used
    fn <- NULL
    # The required files. They are default files that are part of the
    # package
    rf <- c("input.txt")
    # An object of class EnvManager is created
    em <- EnvManager$new(ve = ve, rp = "./")
    # The required files are downloaded
    ed <- em$setup_env(rf, fn)
    # End of environment setup code
    
    # ModelGenerator class object is created
    mg <- ModelGenerator$new(
        name = "default-model",
        desc = "1 MB size and default options",
        fn = "def-model.RDS",
        df = "input.txt",
        n = 4,
        ssize = 0.99,
        dir = ed,
        dc_opts = list(),
        tg_opts = list(),
        ve = ve
    )
    # The n-gram model is generated
    mg$generate_model()
    
    # The test environment is removed. Comment out the line below to
    # keep the files generated by the function so they can be viewed
    em$td_env()


    Method clone()

    The objects of this class are cloneable with this method.

    Usage

    ModelGenerator$clone(deep = FALSE)

    Arguments

    deep

    Whether to make a deep clone.
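
    The difference between a shallow and a deep clone can be sketched with two small classes. This assumes the class is built on R6, as the `$new()` and `$clone(deep = FALSE)` signatures suggest; `Inner` and `Outer` are hypothetical classes used only for illustration.

```r
library(R6)

# Hypothetical classes used only to illustrate clone semantics
Inner <- R6Class("Inner", public = list(value = 1))
Outer <- R6Class("Outer",
  public = list(
    inner = NULL,
    initialize = function() {
      self$inner <- Inner$new()
    }
  )
)

a <- Outer$new()

# Shallow clone: the nested R6 object is shared with the original,
# so a change made through the clone is visible in the original
b <- a$clone()
b$inner$value <- 2
shared <- a$inner$value

# Deep clone: the nested R6 object is copied as well,
# so the original is unaffected by changes to the clone
d <- a$clone(deep = TRUE)
d$inner$value <- 3
independent <- a$inner$value
```

    With the default `deep = FALSE`, fields that hold other R6 objects keep pointing at the same instances, so `deep = TRUE` is the safer choice when the copy will be modified independently.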
