Skip to contents

It provides a method for generating training, testing and validation data sets from a given input text file.

It also provides a method for generating a sample file of given size or number of lines from an input text file. The contents of the sample file may be cleaned or randomized.

Super class

wordpredictor::Base -> DataSampler

Methods


Method new()

It initializes the current object. It is used to set the verbose option.

Usage

DataSampler$new(dir = ".", ve = 0)

Arguments

dir

The directory for storing the input and output files.

ve

The level of detail in the information messages.


Method generate_sample()

Generates a sample file of given size from the given input file. The file is saved to the directory given by the dir object attribute. Once the file has been generated, its contents may be cleaned or randomized.

Usage

DataSampler$generate_sample(fn, ss, ic, ir, ofn, is, dc_opts = NULL)

Arguments

fn

The input file name. It is the short file name relative to the dir attribute.

ss

The number of lines or proportion of lines to sample.

ic

If the sample file should be cleaned.

ir

If the sample file contents should be randomized.

ofn

The output file name. It will be saved to the dir.

is

If the sampled data should be saved to a file.

dc_opts

The options for cleaning the data.

Examples

# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("input.txt")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve, rp = "./")
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code

# The sample file name
sfn <- paste0(ed, "/sample.txt")
# An object of class DataSampler is created
ds <- DataSampler$new(dir = ed, ve = ve)
# The sample file is generated
ds$generate_sample(
    fn = "input.txt",
    ss = 0.5,
    ic = FALSE,
    ir = FALSE,
    ofn = "sample.txt",
    is = TRUE
)

# The test environment is removed. Comment the below line, so the
# files generated by the function can be viewed
em$td_env()


Method generate_data()

It generates training, testing and validation data sets from the given input file. It first reads the file given as a parameter to the current object. It partitions the data into training, testing and validation sets, according to the perc parameter. The files are named train.txt, test.txt and va.txt and are saved to the given output folder.

Usage

DataSampler$generate_data(fn, percs)

Arguments

fn

The input file name. It should be relative to the dir attribute.

percs

The size of the training, testing and validation sets.

Examples

# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be
# used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("input.txt")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve)
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code

# The files to clean
fns <- c("train", "test", "validate")
# An object of class DataSampler is created
ds <- DataSampler$new(dir = ed, ve = ve)
# The train, test and validation files are generated
ds$generate_data(
    fn = "input.txt",
    percs = list(
        "train" = 0.8,
        "test" = 0.1,
        "validate" = 0.1
    )
)

# The test environment is removed. Comment the below line, so the
# files generated by the function can be viewed
em$td_env()


Method clone()

The objects of this class are cloneable with this method.

Usage

DataSampler$clone(deep = FALSE)

Arguments

deep

Whether to make a deep clone.

Examples


## ------------------------------------------------
## Method `DataSampler$generate_sample`
## ------------------------------------------------

# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("input.txt")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve, rp = "./")
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code

# The sample file name
sfn <- paste0(ed, "/sample.txt")
# An object of class DataSampler is created
ds <- DataSampler$new(dir = ed, ve = ve)
# The sample file is generated
ds$generate_sample(
    fn = "input.txt",
    ss = 0.5,
    ic = FALSE,
    ir = FALSE,
    ofn = "sample.txt",
    is = TRUE
)

# The test environment is removed. Comment the below line, so the
# files generated by the function can be viewed
em$td_env()

## ------------------------------------------------
## Method `DataSampler$generate_data`
## ------------------------------------------------

# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be
# used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("input.txt")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve)
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code

# The files to clean
fns <- c("train", "test", "validate")
# An object of class DataSampler is created
ds <- DataSampler$new(dir = ed, ve = ve)
# The train, test and validation files are generated
ds$generate_data(
    fn = "input.txt",
    percs = list(
        "train" = 0.8,
        "test" = 0.1,
        "validate" = 0.1
    )
)

# The test environment is removed. Comment the below line, so the
# files generated by the function can be viewed
em$td_env()