Data frame cleaner

main

nlper.dataframe_cleaner.__init__.main(config: str)

Executes the data frame cleaning pipeline.

Parameters

config (str) – Path to config

application

class nlper.dataframe_cleaner.application.Application(config_path: str)

Data frame cleaner application, starts by initializing read and write objects.

Parameters

config_path (str) – Path to the configuration file

check_if_should_save(type: str) → None

Decides, based on the config file, whether to save the data frames after a particular procedure.

Parameters

type (str) – Name of procedure to save after
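A minimal sketch of how such a config-driven decision could look. The `save_after` key and the procedure names are hypothetical and are not taken from the nlper config format.

```python
from typing import Any, Dict

# Hypothetical config layout: one boolean flag per procedure name.
EXAMPLE_CONFIG: Dict[str, Any] = {
    "save_after": {"reduce": True, "clean": False, "trim": True},
}


def should_save(config: Dict[str, Any], procedure: str) -> bool:
    """Return True when the config asks for a save after `procedure`."""
    # A missing section or an unknown procedure name means "do not save".
    return bool(config.get("save_after", {}).get(procedure, False))
```

Defaulting to "do not save" keeps a partially filled config from writing files unexpectedly.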

clean_dataframes() → None

Cleans the text in every data frame using the cleaner. Saves the cleaned data frames if specified in the config file.

load_language_model() → None

Initializes and obtains the language model from SpaCy.

read_files() → None

Calls data frames reading procedure using file reader.

reduce_dataframes() → None

Calls data frame reduction of every data frame using reducer. Saves reduced data frame if specified in a config file.

run() → None

Executes data frame cleaning process.

trim_dataframes() → None

Trims the text in every data frame using the trimmer. Saves the trimmed data frames if specified in the config file.

cleaner

class nlper.dataframe_cleaner.cleaner.Cleaner(config: Dict[str, Any], data: pandas.core.frame.DataFrame)

Cleans raw text data frame obtaining data in unified format.

  • Removes unwanted characters

  • Hides numbers, date and time in different formats

  • Lemmatizes text

Parameters
  • config (dict) – Configuration dictionary

  • data (pd.DataFrame) – Raw text data frame to clean

clean_dataframe(**kw) → Any

Calculates function execution time.

Parameters
  • args (callable) – Function to be measured

  • kw (any, optional) – Additional parameters

Returns

Function execution result

Return type

any
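The description above reads like the docstring of a timing wrapper rather than of the cleaning method itself. A minimal sketch of such a wrapper follows; `measure_time` and `clean_rows_example` are hypothetical names, not part of nlper.

```python
import functools
import time
from typing import Any, Callable


def measure_time(func: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap `func` so that every call reports its wall-clock duration."""

    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args: Any, **kw: Any) -> Any:
        start = time.perf_counter()
        result = func(*args, **kw)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.3f} s")
        return result

    return wrapper


@measure_time
def clean_rows_example(rows):
    """Hypothetical stand-in for a cleaning step."""
    return [row.strip() for row in rows]
```

When `functools.wraps` is omitted, documentation tools pick up the wrapper's signature and docstring instead of the decorated method's, which may be why `(**kw) → Any` appears in this listing.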

convert_list_to_text_in_dataframe() → None

Converts data stored as lists of sentences into single multi-sentence texts using cleaning utils.

hide_numbers() → None

Hides numbers in the data frame, processing each column separately.

static hide_numbers_for_column(column_data: pandas.core.series.Series) → pandas.core.series.Series

Hides numbers in a single data frame column using cleaning utils.

Parameters

column_data (pd.Series) – Column in data frame to clean.

Returns

Cleaned data frame column

Return type

pd.Series
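To illustrate what hiding numbers might involve, here is a sketch that replaces digit groups (including simple date and time patterns) with a placeholder. The `<NUM>` token and the regular expression are assumptions; the actual cleaning utils may use different placeholders for numbers, dates and times.

```python
import re

NUMBER_TOKEN = "<NUM>"  # hypothetical placeholder, not taken from nlper

# A run of digits, optionally continued by separator+digits groups,
# so "1,250.50", "2020-01-15" and "10:30" each match as one unit.
_NUMBER_RE = re.compile(r"\d+(?:[.,:/-]\d+)*")


def hide_numbers(text: str) -> str:
    """Replace every numeric expression in `text` with NUMBER_TOKEN."""
    return _NUMBER_RE.sub(NUMBER_TOKEN, text)
```

On a pandas column, a function like this could be applied with `column_data.map(hide_numbers)`.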

lemmatize_text() → None

Parallelizes text lemmatization across the data frame using Python multiprocessing. Lemmatization is computationally expensive, so parallelization greatly reduces the required time.
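One way to parallelize per-column work is sketched below, assuming one task per column. The function names are hypothetical, and `lowercase_column` only stands in for the expensive lemmatization step; how nlper actually partitions the work is not stated here.

```python
import multiprocessing
from typing import Callable

import pandas as pd


def process_columns(dataframe: pd.DataFrame,
                    column_func: Callable[[pd.Series], pd.Series],
                    mapper=map) -> pd.DataFrame:
    """Apply `column_func` to each column and reassemble the data frame."""
    columns = [dataframe[name] for name in dataframe.columns]
    processed = list(mapper(column_func, columns))
    return pd.concat(processed, axis=1)


def process_columns_in_parallel(dataframe: pd.DataFrame,
                                column_func: Callable[[pd.Series], pd.Series],
                                processes: int = 2) -> pd.DataFrame:
    """Same as process_columns, but each column runs in a worker process."""
    # column_func must be a picklable, module-level function.
    with multiprocessing.Pool(processes=processes) as pool:
        return process_columns(dataframe, column_func, mapper=pool.map)


def lowercase_column(column_data: pd.Series) -> pd.Series:
    """Cheap stand-in for an expensive per-column step."""
    return column_data.str.lower()
```

On platforms that start workers with spawn, `process_columns_in_parallel` should be called from under an `if __name__ == "__main__":` guard.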

static lemmatize_text_for_column(column_data: pandas.core.series.Series, clean_utils: nlper.utils.clean_utils.CleanUtils) → pandas.core.series.Series

Calls text lemmatization on single data frame column using cleaning utils. Method uses progress_map to visualize progress of lemmatization using tqdm.

Parameters
  • column_data (pd.Series) – Column in data frame to lemmatize.

  • clean_utils (object) – Cleaning utility class

Returns

Column in data frame with lemmatized text

Return type

pd.Series

lemmatize_text_for_dataframe(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Lemmatizes text in the data frame, processing each column separately.

Parameters

dataframe (pd.DataFrame) – Data frame to lemmatize text in.

Returns

Data frame with lemmatized text

Return type

pd.DataFrame

static remove_characters_for_column(column_data: pandas.core.series.Series) → pandas.core.series.Series

Calls removal of unwanted characters in data frame column using cleaning utils.

Parameters

column_data (pd.Series) – Column in data frame to remove characters from.

Returns

Data frame column with removed characters

Return type

pd.Series
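The following sketch shows one possible character-removal step. Which characters count as unwanted is an assumption here (anything other than word characters, whitespace and basic punctuation); the actual rule lives in the cleaning utils.

```python
import re

# Assumption: keep letters, digits, underscore, whitespace and . , ! ? ' -
_UNWANTED_RE = re.compile(r"[^\w\s.,!?'-]")
_WHITESPACE_RE = re.compile(r"\s+")


def remove_unwanted_characters(text: str) -> str:
    """Replace unwanted characters with spaces, then collapse whitespace."""
    cleaned = _UNWANTED_RE.sub(" ", text)
    return _WHITESPACE_RE.sub(" ", cleaned).strip()
```

Replacing with a space rather than deleting avoids gluing together words that were separated only by a removed character.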

remove_characters_for_dataframe() → None

Removes unwanted characters from the data frame, processing each column separately.

reducer

class nlper.dataframe_cleaner.reducer.Reducer(config: Dict[str, Any], data: pandas.core.frame.DataFrame)

Reduces data frame by removing unnecessary columns and rows.

Parameters
  • config (dict) – Configuration dictionary

  • data (pd.DataFrame) – Raw data frame to reduce

merge_columns() → None

Merges particular columns and concatenates the results into a new data frame.

merge_columns_as_text_or_summary(columns_to_merge_on: List[str]) → Optional[pandas.core.series.Series]

Finds columns to merge by names and calls merging method.

Parameters

columns_to_merge_on (list) – Names of columns to merge

Returns

Merged column

Return type

pd.Series, optional

merge_or_create_column(column: pandas.core.series.Series, series: Optional[pandas.core.series.Series] = None) → pandas.core.series.Series

Merges columns into a single column.

Parameters
  • column (pd.Series) – Column to be merged

  • series (pd.Series, optional) – Existing column to merge on, if None then just assigns pd.Series

Returns

New merged column

Return type

pd.Series
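A sketch of the merge-or-create behaviour described above, assuming text columns are joined row by row with a single space. The separator and the handling of missing values are assumptions.

```python
from typing import Optional

import pandas as pd


def merge_or_create_column(column: pd.Series,
                           series: Optional[pd.Series] = None) -> pd.Series:
    """Append `column` to `series` row by row, or start from `column`."""
    if series is None:
        # Nothing to merge onto yet: the column becomes the result.
        return column.copy()
    # Assumption: texts are joined with one space; a row becomes NaN
    # if either side is missing, since no na_rep is given.
    return series.str.cat(column, sep=" ")
```

Calling it repeatedly, passing the previous result as `series`, folds any number of columns into one.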

organize_columns() → None

Removes duplicated text and merges columns.

reduce_dataframe() → pandas.core.frame.DataFrame

Executes data frame reducing process. Drops columns specified by config file.

Returns

Cleaned data frame

Return type

pd.DataFrame
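Dropping configured columns can be sketched with `DataFrame.drop`. The `columns_to_drop` config key is hypothetical.

```python
from typing import Any, Dict

import pandas as pd


def drop_configured_columns(dataframe: pd.DataFrame,
                            config: Dict[str, Any]) -> pd.DataFrame:
    """Return a copy of `dataframe` without the columns named in the config."""
    columns = config.get("columns_to_drop", [])  # hypothetical key
    # errors="ignore" skips names that are not present in this data frame.
    return dataframe.drop(columns=columns, errors="ignore")
```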

remove_duplicates_in_lead_and_text_columns() → None

Checks if lead and text columns contain the same text, calls the removal method if so.
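As an illustration only, the check could be as simple as testing whether the text begins with the lead. That interpretation of "contain the same text" is an assumption, and `remove_lead_from_text` is a hypothetical helper.

```python
def remove_lead_from_text(lead: str, text: str) -> str:
    """Strip `lead` from the start of `text` when the text repeats it."""
    if lead and text.startswith(lead):
        return text[len(lead):].lstrip()
    return text
```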

unify_dataframe_content() → None

Unifies data frame content, processing each column separately.

trimmer

class nlper.dataframe_cleaner.trimmer.Trimmer(config: Dict[str, Any], data: pandas.core.frame.DataFrame)

Trims texts by length in data frame.

Rows whose text length is below the minimum threshold are removed entirely.

Parameters
  • config (dict) – Configuration dictionary

  • data (pd.DataFrame) – Data frame to trim

remove_below_lower_length_limit() → None

Removes rows whose text length is below the set threshold value. A different minimum text length threshold is applied to each column.

The index of data frame is reset after all removal operations.
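A sketch of the row filter with per-column thresholds, followed by the index reset. Measuring length in characters and passing thresholds as a `min_lengths` mapping are assumptions.

```python
from typing import Dict

import pandas as pd


def remove_below_lower_length_limit(dataframe: pd.DataFrame,
                                    min_lengths: Dict[str, int]) -> pd.DataFrame:
    """Keep only rows where every listed column meets its minimum length."""
    keep = pd.Series(True, index=dataframe.index)
    for column, min_length in min_lengths.items():
        # Assumption: length is counted in characters.
        keep &= dataframe[column].str.len() >= min_length
    # Reset the index once, after all removals have been applied.
    return dataframe[keep].reset_index(drop=True)
```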

trim_dataframe(**kw) → Any

Calculates function execution time.

Parameters
  • args (callable) – Function to be measured

  • kw (any, optional) – Additional parameters

Returns

Function execution result

Return type

any

static trim_text_for_column(column_data: pandas.core.series.Series, threshold: int, trim_utils: nlper.utils.trim_utils.TrimUtils) → pandas.core.series.Series

Calls text length trimming on single data frame column using trimming utils. Method uses progress_map to visualize progress of length trimming using tqdm.

Parameters
  • column_data (pd.Series)

  • threshold (int)

  • trim_utils (TrimUtils)

Return type

pd.Series
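For illustration, trimming a single text to an upper limit might look like the sketch below. Counting length in whitespace-separated words is an assumption; the trimming utils may count characters or sentences instead.

```python
def trim_text(text: str, threshold: int) -> str:
    """Cut `text` down to at most `threshold` words."""
    words = text.split()
    if len(words) <= threshold:
        # Already within the limit: return the text unchanged.
        return text
    return " ".join(words[:threshold])
```

On a pandas column, this could be applied with `column_data.map(lambda text: trim_text(text, threshold))`.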

trim_text_for_dataframe(data: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Trims text length in the data frame, processing each column separately.

Parameters

data (pd.DataFrame) – Data frame to trim length

Returns

Data frame with trimmed text

Return type

pd.DataFrame

trim_to_upper_length_limit() → None

Parallelizes text length trimming across the data frame using Python multiprocessing. The trimming process is computationally expensive, so parallelization greatly reduces the required time.