The DataSampler class provides a method for generating training, testing and validation data sets from a given input text file.

It also provides a method for generating a sample file of given size or number of lines from an input text file. The contents of the sample file may be cleaned or randomized.

Super class

wordpredictor::Base -> DataSampler

Methods


    Method new()

    It initializes the current object. It sets the directory used for the input and output files and the verbosity level of the information messages.

    Usage

    DataSampler$new(dir = ".", ve = 0)

    Arguments

    dir

    The directory for storing the input and output files.

    ve

    The level of detail in the information messages.


    Method generate_sample()

    Generates a sample file of given size from the given input file. The file is saved to the directory given by the dir object attribute. Once the file has been generated, its contents may be cleaned or randomized.

    Usage

    DataSampler$generate_sample(fn, ss, ic, ir, ofn, is, dc_opts = NULL)

    Arguments

    fn

    The input file name. It is the short file name relative to the dir attribute.

    ss

    The number of lines or proportion of lines to sample.

    ic

    Whether the sample file should be cleaned.

    ir

    Whether the sample file contents should be randomized.

    ofn

    The output file name. The file is saved to the directory given by the dir attribute.

    is

    Whether the sampled data should be saved to a file.

    dc_opts

    The options for cleaning the data.

    Examples

    # Start of environment setup code
    # The level of detail in the information messages
    ve <- 0
    # The name of the folder that will contain all the files. It will be
    # created in the current directory. NULL implies tempdir will be used
    fn <- NULL
    # The required files. They are default files that are part of the
    # package
    rf <- c("input.txt")
    # An object of class EnvManager is created
    em <- EnvManager$new(ve = ve, rp = "./")
    # The required files are downloaded
    ed <- em$setup_env(rf, fn)
    # End of environment setup code
    
    # The sample file name
    sfn <- paste0(ed, "/sample.txt")
    # An object of class DataSampler is created
    ds <- DataSampler$new(dir = ed, ve = ve)
    # The sample file is generated
    ds$generate_sample(
        fn = "input.txt",
        ss = 0.5,
        ic = FALSE,
        ir = FALSE,
        ofn = "sample.txt",
        is = TRUE
    )
    
    # The test environment is removed. Comment out the line below to
    # keep the files generated by the function for inspection
    em$td_env()
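
The ss argument accepts either a line count or a proportion of lines. The following base R sketch illustrates one way such a value could be interpreted. The sample_lines helper and its randomize argument are hypothetical and are not part of wordpredictor. The rule that values below 1 are proportions, and that lines are drawn at random, are assumptions made for illustration, so the package may behave differently.

```r
# Hypothetical helper, not part of wordpredictor: draws a sample of
# lines from a text file using only base R.
sample_lines <- function(path, ss, randomize = FALSE) {
    lines <- readLines(path, warn = FALSE)
    n <- length(lines)
    # Assumption: ss < 1 is a proportion, otherwise it is a line count
    k <- if (ss < 1) round(n * ss) else min(as.integer(ss), n)
    # Pick k line indices without replacement
    idx <- sample.int(n, k)
    # Keep the original line order unless randomization is requested
    if (!randomize) idx <- sort(idx)
    lines[idx]
}

# Usage on a small temporary file
tmp <- tempfile(fileext = ".txt")
all_lines <- paste("line", 1:10)
writeLines(all_lines, tmp)
set.seed(1)
# Half of the lines, in their original order
half <- sample_lines(tmp, ss = 0.5)
# Exactly three lines, in random order
three <- sample_lines(tmp, ss = 3, randomize = TRUE)
unlink(tmp)
```

Sorting the indices when randomize is FALSE keeps the sampled lines in the order in which they appear in the input file.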


    Method generate_data()

    It generates training, testing and validation data sets from the given input file. It first reads the input file given by the fn parameter. It then partitions the data into training, testing and validation sets, according to the percs parameter. The files are named train.txt, test.txt and va.txt and are saved to the directory given by the dir attribute.

    Usage

    DataSampler$generate_data(fn, percs)

    Arguments

    fn

    The input file name. It should be relative to the dir attribute.

    percs

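    The sizes of the training, testing and validation sets, given as a named list of proportions, for example list("train" = 0.8, "test" = 0.1, "validate" = 0.1).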

    Examples

    # Start of environment setup code
    # The level of detail in the information messages
    ve <- 0
    # The name of the folder that will contain all the files. It will be
    # created in the current directory. NULL implies tempdir will be
    # used
    fn <- NULL
    # The required files. They are default files that are part of the
    # package
    rf <- c("input.txt")
    # An object of class EnvManager is created
    em <- EnvManager$new(ve = ve)
    # The required files are downloaded
    ed <- em$setup_env(rf, fn)
    # End of environment setup code
    
    # The names of the data sets to generate
    fns <- c("train", "test", "validate")
    # An object of class DataSampler is created
    ds <- DataSampler$new(dir = ed, ve = ve)
    # The train, test and validation files are generated
    ds$generate_data(
        fn = "input.txt",
        percs = list(
            "train" = 0.8,
            "test" = 0.1,
            "validate" = 0.1
        )
    )
    
    # The test environment is removed. Comment out the line below to
    # keep the files generated by the function for inspection
    em$td_env()
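
The partitioning step can be pictured with a short base R sketch. The split_lines helper is hypothetical and is not the package's implementation. It assumes the proportions sum to 1 and that the partitions are contiguous blocks of lines taken in input order.

```r
# Hypothetical helper, not part of wordpredictor: splits lines into
# named partitions whose sizes follow the given proportions.
split_lines <- function(lines, percs) {
    p <- unlist(percs, use.names = FALSE)
    # Assumption: the proportions must sum to 1
    stopifnot(isTRUE(all.equal(sum(p), 1)))
    n <- length(lines)
    # Cumulative cut points, e.g. 0, 0.8n, 0.9n and n
    cuts <- round(cumsum(c(0, p)) * n)
    parts <- lapply(seq_along(p), function(i) {
        # Lines after the previous cut point, up to the current one
        lines[cuts[i] + seq_len(cuts[i + 1] - cuts[i])]
    })
    names(parts) <- names(percs)
    parts
}

# Usage with the proportions from the example above
lines <- paste("line", 1:20)
parts <- split_lines(
    lines,
    list("train" = 0.8, "test" = 0.1, "validate" = 0.1)
)
# Number of lines in each partition
sizes <- sapply(parts, length)
```

Computing cut points from the cumulative proportions means every line lands in exactly one partition, even when a proportion does not divide the line count evenly.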


    Method clone()

    The objects of this class are cloneable with this method.

    Usage

    DataSampler$clone(deep = FALSE)

    Arguments

    deep

    Whether to make a deep clone.
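
The effect of the deep argument is easiest to see on a small class. The sketch below assumes the R6 package is installed and uses two hypothetical classes, Inner and Outer, that are unrelated to wordpredictor. It shows the general R6 cloning behavior rather than anything specific to DataSampler.

```r
library(R6)

# Hypothetical classes used only to illustrate clone(deep = ...)
Inner <- R6Class("Inner", public = list(value = 1))
Outer <- R6Class("Outer", public = list(
    inner = NULL,
    initialize = function() {
        # A nested R6 object, which has reference semantics
        self$inner <- Inner$new()
    }
))

a <- Outer$new()
# A shallow clone copies the field, so both objects point to the
# same nested Inner object
shallow <- a$clone()
# A deep clone also clones the nested R6 object
deep <- a$clone(deep = TRUE)

# Changing the nested object through the original
a$inner$value <- 2
shallow_value <- shallow$inner$value  # 2, the nested object is shared
deep_value <- deep$inner$value        # 1, the deep clone has its own copy
```

For objects that hold only plain R values in their fields, a shallow clone and a deep clone behave the same way.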
