Data frame cleaner¶
main¶
-
nlper.dataframe_cleaner.__init__.main(config: str)¶ Executes the data frame cleaning pipeline.
- Parameters
config (str) – Path to config
application¶
-
class
nlper.dataframe_cleaner.application.Application(config_path: str)¶ Data frame cleaner application. Starts by initializing the read and write objects.
- Parameters
config_path (str) – Path to the config file
-
check_if_should_save(type: str) → None¶ Decides, based on the config file, whether to save the data frame after a particular procedure.
- Parameters
type (str) – Name of the procedure to save after
-
clean_dataframes() → None¶ Cleans the text of every data frame using the cleaner. Saves each cleaned data frame if specified in the config file.
-
load_language_model() → None¶ Initializes and obtains the language model from spaCy.
-
read_files() → None¶ Reads the data frames using the file reader.
-
reduce_dataframes() → None¶ Reduces every data frame using the reducer. Saves each reduced data frame if specified in the config file.
-
run() → None¶ Executes data frame cleaning process.
-
trim_dataframes() → None¶ Trims the text of every data frame using the trimmer. Saves each trimmed data frame if specified in the config file.
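The methods above suggest a staged pipeline in which each stage may persist its result, depending on the config. The following is a minimal sketch of that pattern, not nlper's implementation. The `save_after` config key, the stage order (reduce, clean, trim), and the helper names `should_save` and `run_pipeline` are all assumptions made for illustration, and plain lists stand in for data frames so the sketch runs without dependencies.

```python
from typing import Callable, Dict, List, Tuple

Frame = List[str]  # stand-in for a pandas DataFrame


def should_save(config: Dict[str, List[str]], procedure: str) -> bool:
    """Return True if the config asks for a save after `procedure`."""
    # `save_after` is a hypothetical key, not nlper's real config schema.
    return procedure in config.get("save_after", [])


def run_pipeline(
    frame: Frame,
    stages: List[Tuple[str, Callable[[Frame], Frame]]],
    config: Dict[str, List[str]],
) -> Tuple[Frame, Dict[str, Frame]]:
    """Run the stages in order, keeping a snapshot after each stage marked for saving."""
    saved: Dict[str, Frame] = {}
    for name, stage in stages:
        frame = stage(frame)
        if should_save(config, name):
            saved[name] = list(frame)  # a real pipeline would write to disk here
    return frame, saved


# Toy stages; the order is an assumption, not taken from the source.
stages = [
    ("reduce", lambda rows: [r for r in rows if r.strip()]),
    ("clean", lambda rows: [r.lower() for r in rows]),
    ("trim", lambda rows: [r[:5] for r in rows]),
]
result, saved = run_pipeline(
    ["Hello World", "  ", "ABC"], stages, {"save_after": ["clean"]}
)
```

Keeping the save decision in one small function means every stage consults the config in the same way.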
cleaner¶
-
class
nlper.dataframe_cleaner.cleaner.Cleaner(config: Dict[str, Any], data: pandas.core.frame.DataFrame)¶ Cleans a raw text data frame to obtain data in a unified format.
Removes unwanted characters
Hides numbers, dates and times in different formats
Lemmatizes text
- Parameters
config (dict) – Configuration dictionary
data (pd.DataFrame) – Raw text data frame to clean
-
clean_dataframe(**kw) → Any¶ Calculates function execution time.
- Parameters
args (callable) – Function to be measured
kw (any, optional) – Additional parameters
- Returns
Function execution result
- Return type
any
-
convert_list_to_text_in_dataframe() → None¶ Converts data stored as a list of sentences into a single multi-sentence text using the cleaning utils.
-
hide_numbers() → None¶ Hides numbers in the data frame, one column at a time.
-
static
hide_numbers_for_column(column_data: pandas.core.series.Series) → pandas.core.series.Series¶ Hides numbers in a single data frame column using the cleaning utils.
- Parameters
column_data (pd.Series) – Column in data frame to clean.
- Returns
Cleaned data frame column
- Return type
pd.Series
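As a rough illustration of hiding numbers, dates and times, the sketch below replaces them with placeholder tokens using regular expressions. The tokens (`<date>`, `<time>`, `<number>`), the patterns, and the use of a plain list in place of a `pd.Series` are assumptions made for this example. The source does not show how the cleaning utils do this.

```python
import re
from typing import List

# Order matters: dates and times are masked before bare numbers,
# otherwise their digits would be replaced piece by piece.
_PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "<date>"),                 # 2021-03-15
    (re.compile(r"\b\d{1,2}[./-]\d{1,2}[./-]\d{2,4}\b"), "<date>"),   # 15.03.2021
    (re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b"), "<time>"),          # 14:30, 14:30:05
    (re.compile(r"\b\d+(?:[.,]\d+)?\b"), "<number>"),                 # 42, 3,50
]


def hide_numbers(text: str) -> str:
    """Replace dates, times and numbers in `text` with placeholder tokens."""
    for pattern, token in _PATTERNS:
        text = pattern.sub(token, text)
    return text


def hide_numbers_for_column(column: List[str]) -> List[str]:
    """Apply `hide_numbers` to every text in a column (a list stands in for pd.Series)."""
    return [hide_numbers(text) for text in column]


masked = hide_numbers_for_column(
    ["Meeting on 15.03.2021 at 14:30 with 42 guests", "Paid 3,50 on 2021-03-15"]
)
```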
-
lemmatize_text() → None¶ Parallelizes text lemmatization for the data frame using Python multiprocessing. Lemmatization is computationally expensive, so parallelization greatly reduces the required time.
-
static
lemmatize_text_for_column(column_data: pandas.core.series.Series, clean_utils: nlper.utils.clean_utils.CleanUtils) → pandas.core.series.Series¶ Lemmatizes the text in a single data frame column using the cleaning utils. The method uses progress_map to visualize lemmatization progress with tqdm.
- Parameters
column_data (pd.Series) – Column in data frame to lemmatize.
clean_utils (object) – Cleaning utility class
- Returns
Column in data frame with lemmatized text
- Return type
pd.Series
-
lemmatize_text_for_dataframe(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame¶ Lemmatizes the text in the data frame, one column at a time.
- Parameters
dataframe (pd.DataFrame) – Data frame to lemmatize text in.
- Returns
Data frame with lemmatized text
- Return type
pd.DataFrame
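The per-column split described above lends itself to parallel execution, since each column can be processed independently. The sketch below shows that idea under several assumptions: `toy_lemmatize` is a crude suffix stripper standing in for a real spaCy lemmatizer, a dict of lists stands in for a data frame, and the `map_fn` parameter is a hypothetical hook that makes the mapping strategy swappable. It is not how nlper distributes the work.

```python
from typing import Callable, Dict, Iterable, List

Column = List[str]


def toy_lemmatize(token: str) -> str:
    """Crude suffix stripper; a placeholder for a real lemmatizer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def lemmatize_column(column: Column) -> Column:
    """Lemmatize every text in one column. Defined at module level so it can be pickled."""
    return [" ".join(toy_lemmatize(t) for t in text.split()) for text in column]


def lemmatize_table(
    table: Dict[str, Column],
    map_fn: Callable[..., Iterable[Column]] = map,
) -> Dict[str, Column]:
    """Treat each column as an independent task and process them via `map_fn`."""
    names = list(table)
    results = map_fn(lemmatize_column, [table[name] for name in names])
    return dict(zip(names, results))


table = {"lead": ["dogs barked"], "text": ["cats are sleeping"]}
sequential = lemmatize_table(table)  # built-in map: no parallelism

# For real parallelism, pass a process pool's map instead:
#     import multiprocessing
#     with multiprocessing.Pool() as pool:
#         parallel = lemmatize_table(table, pool.map)
```

Passing the mapping function in keeps the column logic testable without spawning processes. The worker must be a module-level function because multiprocessing pickles it to send it to the worker processes.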
-
static
remove_characters_for_column(column_data: pandas.core.series.Series) → pandas.core.series.Series¶ Removes unwanted characters from a data frame column using the cleaning utils.
- Parameters
column_data (pd.Series) – Column in data frame to remove characters from.
- Returns
Data frame column with removed characters
- Return type
pd.Series
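A minimal sketch of character removal for one column follows. Which characters count as unwanted is not stated in the source, so the whitelist below (word characters, whitespace, and basic punctuation) is an assumption, as is the use of a plain list in place of a `pd.Series`.

```python
import re
from typing import List

# Assumption: keep word characters, whitespace and basic punctuation; drop the rest.
_UNWANTED = re.compile(r"[^\w\s.,!?'-]")
_SPACES = re.compile(r"\s+")


def remove_characters(text: str) -> str:
    """Replace unwanted characters with spaces, then collapse runs of whitespace."""
    text = _UNWANTED.sub(" ", text)
    return _SPACES.sub(" ", text).strip()


def remove_characters_for_column(column: List[str]) -> List[str]:
    """Apply `remove_characters` to every text in a column."""
    return [remove_characters(text) for text in column]


cleaned = remove_characters_for_column(["Breaking*** news ~~ today!", "a\t\tb\n"])
```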
-
remove_characters_for_dataframe() → None¶ Removes unwanted characters from the data frame, one column at a time.
reducer¶
-
class
nlper.dataframe_cleaner.reducer.Reducer(config: Dict[str, Any], data: pandas.core.frame.DataFrame)¶ Reduces a data frame by removing unnecessary columns and rows.
- Parameters
config (dict) – Configuration dictionary
data (pd.DataFrame) – Raw data frame to reduce
-
merge_columns() → None¶ Merges particular columns and concatenates them into a new data frame.
-
merge_columns_as_text_or_summary(columns_to_merge_on: List[str]) → Optional[pandas.core.series.Series]¶ Finds columns to merge by names and calls merging method.
- Parameters
columns_to_merge_on (list) – Names of columns to merge
- Returns
Merged column
- Return type
pd.Series, optional
-
merge_or_create_column(column: pandas.core.series.Series, series: Optional[pandas.core.series.Series] = None) → pandas.core.series.Series¶ Merges columns into a single column.
- Parameters
column (pd.Series) – Column to be merged
series (pd.Series, optional) – Existing column to merge on; if None, the column is simply assigned
- Returns
New merged column
- Return type
pd.Series
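The merge-or-create pattern can be sketched as a small accumulator: the first matching column becomes the result, and each later column is merged into it. The sketch assumes that merging means joining texts row by row with a space, which the source does not specify. Lists stand in for `pd.Series`, and the helper `merge_columns_by_name` is a hypothetical name.

```python
from typing import Dict, List, Optional

Column = List[str]


def merge_or_create_column(column: Column, series: Optional[Column] = None) -> Column:
    """Merge `column` into `series` row by row, or start a new series from `column`."""
    if series is None:
        return list(column)
    if len(series) != len(column):
        raise ValueError("columns must have the same number of rows")
    # Assumption: merging joins the two texts with a space, skipping empty parts.
    return [" ".join(part for part in (a, b) if part) for a, b in zip(series, column)]


def merge_columns_by_name(table: Dict[str, Column], names: List[str]) -> Optional[Column]:
    """Merge the named columns that exist in `table`; return None if none are found."""
    merged: Optional[Column] = None
    for name in names:
        if name in table:
            merged = merge_or_create_column(table[name], merged)
    return merged


table = {"title": ["A", "B"], "body": ["x", ""], "id": ["1", "2"]}
merged = merge_columns_by_name(table, ["title", "body"])
missing = merge_columns_by_name(table, ["summary"])
```

Returning `None` when no column matches mirrors the `Optional` return type of `merge_columns_as_text_or_summary`.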
-
organize_columns() → None¶ Removes duplicated text and merges columns.
-
reduce_dataframe() → pandas.core.frame.DataFrame¶ Executes the data frame reducing process. Drops columns specified by the config file.
- Returns
Cleaned data frame
- Return type
pd.DataFrame
-
remove_duplicates_in_lead_and_text_columns() → None¶ Checks if lead and text columns contain the same text, calls the removal method if so.
-
unify_dataframe_content() → None¶ Unifies the data frame content, one column at a time.
trimmer¶
-
class
nlper.dataframe_cleaner.trimmer.Trimmer(config: Dict[str, Any], data: pandas.core.frame.DataFrame)¶ Trims texts by length in data frame.
For texts shorter than the minimum threshold, the whole row is removed
- Parameters
config (dict) – Configuration dictionary
data (pd.DataFrame) – Data frame to trim
-
remove_below_lower_length_limit() → None¶ Removes rows whose text length is below the set threshold value. A different minimum text length threshold is applied to each column.
The index of the data frame is reset after all removal operations.
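The row filter described above can be sketched as follows. It assumes that length is measured in whitespace-separated words and that a row is dropped as soon as any thresholded column is too short. Neither detail is stated in the source. A list of dicts stands in for the data frame, and the `min_lengths` mapping is a hypothetical name.

```python
from typing import Dict, List

Row = Dict[str, str]


def remove_below_lower_length_limit(rows: List[Row], min_lengths: Dict[str, int]) -> List[Row]:
    """Keep only rows in which every thresholded column meets its minimum length."""
    # Assumption: length is counted in words; each column has its own threshold.
    return [
        row
        for row in rows
        if all(len(row[col].split()) >= limit for col, limit in min_lengths.items())
    ]
    # The returned list is indexed 0..n-1 again, which plays the role of the
    # index reset; in pandas, DataFrame.reset_index(drop=True) would do this.


rows = [
    {"lead": "short", "text": "one two three"},
    {"lead": "long enough lead", "text": "one two three four"},
    {"lead": "two words", "text": "x"},
]
kept = remove_below_lower_length_limit(rows, {"lead": 2, "text": 3})
```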
-
trim_dataframe(**kw) → Any¶ Calculates function execution time.
- Parameters
args (callable) – Function to be measured
kw (any, optional) – Additional parameters
- Returns
Function execution result
- Return type
any
-
static
trim_text_for_column(column_data: pandas.core.series.Series, threshold: int, trim_utils: nlper.utils.trim_utils.TrimUtils) → pandas.core.series.Series¶ Trims text length in a single data frame column using the trimming utils. The method uses progress_map to visualize trimming progress with tqdm.
- Parameters
column_data (pd.Series) – Column in data frame to trim.
threshold (int) – Length threshold to trim to
trim_utils (object) – Trimming utility class
- Returns
Column in data frame with trimmed text
- Return type
pd.Series
-
trim_text_for_dataframe(data: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame¶ Trims text length in the data frame, one column at a time.
- Parameters
data (pd.DataFrame) – Data frame to trim length
- Returns
Data frame with trimmed text
- Return type
pd.DataFrame
-
trim_to_upper_length_limit() → None¶ Parallelizes text length trimming for the data frame using Python multiprocessing. Trimming is computationally expensive, so parallelization greatly reduces the required time.
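Trimming to an upper length limit can be sketched per column as shown below. The sketch assumes the threshold counts whitespace-separated words and that columns without a threshold are left unchanged. Both are illustrative choices, not confirmed by the source. A dict of lists stands in for the data frame, and `trim_text` and `trim_table` are hypothetical names.

```python
from typing import Dict, List

Column = List[str]


def trim_text(text: str, threshold: int) -> str:
    """Keep at most `threshold` words of `text` (assumption: length is counted in words)."""
    return " ".join(text.split()[:threshold])


def trim_text_for_column(column: Column, threshold: int) -> Column:
    """Trim every text in one column to the given threshold."""
    return [trim_text(text, threshold) for text in column]


def trim_table(table: Dict[str, Column], thresholds: Dict[str, int]) -> Dict[str, Column]:
    """Trim each column that has a threshold; leave the other columns unchanged."""
    return {
        name: trim_text_for_column(column, thresholds[name]) if name in thresholds else list(column)
        for name, column in table.items()
    }


trimmed = trim_table(
    {"lead": ["one two three"], "text": ["w x y z"], "id": ["7"]},
    {"lead": 2, "text": 3},
)
```

Because each column is trimmed independently, the per-column calls could be handed to a process pool in the same way as any other column-wise task.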