
It provides a memory-efficient method for removing unneeded characters from text files and is suitable for cleaning large text files.

Details

It provides a method for cleaning text files. It can remove bad words, stop words, non-dictionary words, extra spaces, punctuation and non-alphabet characters, and it can convert text to lower case. It supports large text files.
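
The character-level steps listed above can be sketched with base R regular expressions. This is only an illustration under stated assumptions: `clean_text_sketch` is a hypothetical helper, not a function of the package, and the order of the steps is assumed.

```r
# Hypothetical helper, not part of wordpredictor: applies a few of the
# cleaning steps described above using base R regular expressions.
clean_text_sketch <- function(lines) {
    # Convert to lower case
    lines <- tolower(lines)
    # Remove punctuation symbols
    lines <- gsub("[[:punct:]]+", "", lines)
    # Remove anything that is not a lower case letter or a space
    lines <- gsub("[^a-z ]+", "", lines)
    # Collapse repeated spaces, then trim leading and trailing space
    lines <- gsub(" {2,}", " ", lines)
    lines <- trimws(lines)
    lines
}

clean_text_sketch("  We're about 90percent done,  with this room! ")
#> [1] "were about percent done with this room"
```

Word-based steps such as stop word or dictionary filtering need a word list and are not covered by this sketch.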

Super class

wordpredictor::Base -> DataCleaner

Methods


    Method new()

    It initializes the current object. It is used to set the file name and verbose options.

    Usage

    DataCleaner$new(fn = NULL, opts = list(), ve = 0)

    Arguments

    fn

    The path to the file to clean.

    opts

    The options for data cleaning.

    • min_words. The minimum number of words per sentence.

    • line_count. The number of lines to read and clean at a time.

    • save_data. If the combined processed lines should be saved.

    • output_file. Name of the output file used to store the data.

    • sw_file. The stop words file path.

    • dict_file. The dictionary file path.

    • bad_file. The bad words file path.

    • to_lower. If the words should be converted to lower case.

    • remove_stop. If stop words should be removed.

    • remove_punct. If punctuation symbols should be removed.

    • remove_non_dict. If non-dictionary words should be removed.

    • remove_non_alpha. If non-alphabet symbols should be removed.

    • remove_extra_space. If leading, trailing and double spaces should be removed.

    • remove_bad. If bad words should be removed.

    ve

    The level of detail in the information messages.
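
The examples on this page pass only `output_file` in `opts`, which suggests that unspecified options fall back to defaults. A minimal sketch of that kind of merge, using base R only, is shown below. The default values here are assumptions chosen for illustration; they are not the package's documented defaults.

```r
# Illustrative defaults only. These values are assumptions, not the
# defaults used by the package.
default_opts <- list(
    min_words = 2,
    line_count = 1000,
    to_lower = TRUE,
    remove_stop = FALSE,
    remove_punct = TRUE
)
# Options supplied by the caller
user_opts <- list(line_count = 5000, remove_stop = TRUE)
# modifyList overrides matching entries and keeps the rest unchanged
opts <- utils::modifyList(default_opts, user_opts)
str(opts)
```

A list built this way could then be passed as the `opts` argument of `DataCleaner$new()`.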


    Method clean_file()

    It provides an efficient method for cleaning text files. It removes unneeded characters from the given text file with several options.

    It allows removing punctuation, bad words, stop words, non-alphabetical symbols and non-dictionary words. It reads the given text file in chunks of line_count lines, removes unneeded characters from each chunk and then saves the cleaned lines to an output text file.

    File cleaning progress is displayed if the verbose option was set in the class constructor. It is suitable for cleaning large text files.
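
The chunked read, clean and write cycle described above can be sketched in base R as follows. This is not the package's implementation: `clean_file_sketch` is a hypothetical helper, and its cleaning step is a simple placeholder that keeps only lower case letters and spaces.

```r
# Hypothetical helper, not the package's implementation: cleans a file
# in chunks so that at most `line_count` lines are held in memory.
clean_file_sketch <- function(input_file, output_file, line_count = 1000) {
    in_con <- file(input_file, open = "r")
    # Close each connection when the function exits, even on error
    on.exit(close(in_con), add = TRUE)
    out_con <- file(output_file, open = "w")
    on.exit(close(out_con), add = TRUE)
    repeat {
        # Read the next chunk of lines from the current position
        lines <- readLines(in_con, n = line_count, warn = FALSE)
        if (length(lines) == 0) break
        # Placeholder cleaning: lower case, letters and spaces only
        lines <- tolower(lines)
        lines <- gsub("[^a-z ]+", "", lines)
        lines <- trimws(gsub(" {2,}", " ", lines))
        # Drop lines that became empty after cleaning
        lines <- lines[nchar(lines) > 0]
        # Append the cleaned chunk to the output file
        writeLines(lines, out_con)
    }
    invisible(output_file)
}

# Usage with temporary files
input_file <- tempfile(fileext = ".txt")
output_file <- tempfile(fileext = ".txt")
writeLines(c("Hello, World!", "  Two   spaces ", "12345"), input_file)
clean_file_sketch(input_file, output_file, line_count = 2)
readLines(output_file)
#> [1] "hello world" "two spaces"
```

Because each chunk is written before the next one is read, memory use depends on `line_count` rather than on the size of the file.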

    Usage

    DataCleaner$clean_file()

    Examples

    # Start of environment setup code
    # The level of detail in the information messages
    ve <- 0
    # The name of the folder that will contain all the files. It will be
    # created in the current directory. NULL implies tempdir will be used
    fn <- NULL
    # The required files. They are default files that are part of the
    # package
    rf <- c("test.txt")
    # An object of class EnvManager is created
    em <- EnvManager$new(ve = ve, rp = "./")
    # The required files are downloaded
    ed <- em$setup_env(rf, fn)
    # End of environment setup code
    
    # The cleaned test file name
    cfn <- paste0(ed, "/test-clean.txt")
    # The test file name
    fn <- paste0(ed, "/test.txt")
    # The data cleaning options
    dc_opts <- list("output_file" = cfn)
    # The data cleaner object is created
    dc <- DataCleaner$new(fn, dc_opts, ve = ve)
    # The sample file is cleaned
    dc$clean_file()
    
    # The test environment is removed. Comment out the line below so that
    # the files generated by the function can be viewed
    em$td_env()


    Method clean_lines()

    It cleans the given lines of text using the options passed to the current object.

    Usage

    DataCleaner$clean_lines(lines)

    Arguments

    lines

    The input sentences.

    Returns

    The cleaned lines of text.
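
The word-based options (remove_stop, remove_bad, remove_non_dict) and min_words act on whole words rather than characters. One way such filtering could work on a vector of lines is sketched below. `remove_words_sketch` and the stop word list are hypothetical, and dropping lines with fewer than `min_words` words is an assumption about how that option behaves.

```r
# Hypothetical helper, not part of wordpredictor: removes every word
# found in `words` from each line, then drops lines that are too short.
remove_words_sketch <- function(lines, words, min_words = 2) {
    # Split each line into words on single spaces
    tokens <- strsplit(lines, " ", fixed = TRUE)
    # Keep only the words that are not in the removal list
    tokens <- lapply(tokens, function(w) w[!(w %in% words)])
    # Keep lines that still have at least min_words words
    keep <- vapply(tokens, length, integer(1)) >= min_words
    # Join the remaining words back into lines
    vapply(tokens[keep], paste, character(1), collapse = " ")
}

# A tiny stop word list, assumed for illustration
sw <- c("a", "the", "with", "there")
remove_words_sketch(
    c("once upon a time there was a kingdom with a castle", "the a"),
    sw
)
#> [1] "once upon time was kingdom castle"
```

Non-dictionary filtering would be the inverse test: keeping only the words that do appear in a dictionary list.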

    Examples

    # The level of detail in the information messages
    ve <- 0
    # Test data is read
    l <- c(
        "If you think I'm wrong, send me a link to where it's happened",
        "We're about 90percent done with this room",
        "This isn't how I wanted it between us.",
        "Almost any cute breed can become ornamental",
        "Once upon a time there was a kingdom with a castle",
        "That's not a thing any of us are granted'",
        "Why are you being so difficult? she asks."
    )
    # The expected results
    res <- c(
        "if you think wrong send me a link to where its happened",
        "were about percent done with this room",
        "this how i wanted it between us",
        "almost any cute breed can become ornamental",
        "once upon a time there was a kingdom with a castle",
        "thats not a thing any of us are granted",
        "why are you being so difficult she asks"
    )
    # The DataCleaner object is created
    dc <- DataCleaner$new(ve = ve)
    # The lines are cleaned
    cl <- dc$clean_lines(l)
    # The cleaned lines are printed
    print(cl)


    Method clone()

    The objects of this class are cloneable with this method.

    Usage

    DataCleaner$clone(deep = FALSE)

    Arguments

    deep

    Whether to make a deep clone.

    Examples

    
    ## ------------------------------------------------
    ## Method `DataCleaner$clean_lines`
    ## ------------------------------------------------
    
    # The level of detail in the information messages
    ve <- 0
    # Test data is read
    l <- c(
        "If you think I'm wrong, send me a link to where it's happened",
        "We're about 90percent done with this room",
        "This isn't how I wanted it between us.",
        "Almost any cute breed can become ornamental",
        "Once upon a time there was a kingdom with a castle",
        "That's not a thing any of us are granted'",
        "Why are you being so difficult? she asks."
    )
    # The expected results
    res <- c(
        "if you think wrong send me a link to where its happened",
        "were about percent done with this room",
        "this how i wanted it between us",
        "almost any cute breed can become ornamental",
        "once upon a time there was a kingdom with a castle",
        "thats not a thing any of us are granted",
        "why are you being so difficult she asks"
    )
    # The DataCleaner object is created
    dc <- DataCleaner$new(ve = ve)
    # The lines are cleaned
    cl <- dc$clean_lines(l)
    # The cleaned lines are printed
    print(cl)
    #> [1] "if you think wrong send me a link to where its happened"
    #> [2] "were about percent done with this room"                 
    #> [3] "this how i wanted it between us"                        
    #> [4] "almost any cute breed can become ornamental"            
    #> [5] "once upon a time there was a kingdom with a castle"     
    #> [6] "thats not a thing any of us are granted"                
    #> [7] "why are you being so difficult she asks"