Utils

clean utils

class nlper.utils.clean_utils.CleanUtils

Utils for cleaning text

static convert_list_to_text(text_as_list: List[str]) → str

Converts list of texts to single string, filtering the empty texts.

Parameters

text_as_list (list) – List of texts

Returns

Joined text

Return type

str
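A minimal sketch of the behaviour described above, not the library's implementation. The function name `convert_list_to_text_sketch`, the single-space separator and the treatment of whitespace-only strings as empty are assumptions made for illustration.

```python
from typing import List


def convert_list_to_text_sketch(text_as_list: List[str]) -> str:
    """Join the non-empty texts into one string (hypothetical helper)."""
    # Assumption: whitespace-only entries count as empty, and the
    # remaining texts are joined with a single space.
    return " ".join(text for text in text_as_list if text and text.strip())


joined = convert_list_to_text_sketch(["First sentence.", "", "Second sentence.", "  "])
```

Here `joined` holds the two non-empty sentences separated by one space.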

get_language_model() → None

Obtains the language model.

static hide_numbers(text: str, number_replacement: str = '<num>') → str

Replaces numbers, dates and times in text with the specified token. Supports various number, date and time formats.

Examples of supported formats:

  • 22 -> <num>

  • 11:45 -> <num>

  • 99.99 -> <num>

  • 596,789 -> <num>

  • 6;15 -> <num>

  • 5.99 -> <num>

  • 29/12/2010 -> <num>

  • 10€0 -> <num>

  • 15.V.2030 -> <num>

  • 10.01.2020 -> <num>

  • 29/XII/1990 -> <num>

  • 22—22-2222 -> <num>

  • 22……22.2000 -> <num>

  • 01.01.2000 12:15 -> <num> <num>

Parameters
  • text (str) – Text to hide numbers in

  • number_replacement (str) – Token to replace numbers with

Returns

Text with replaced numbers

Return type

str
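The idea can be illustrated with a single regular expression. This is a simplified sketch under stated assumptions, not the pattern used by `CleanUtils.hide_numbers`: the names `NUMBER_PATTERN` and `hide_numbers_sketch` are hypothetical, and the pattern only covers digit groups joined by `.`, `,`, `:`, `;` or `/`. Roman-numeral months, currency symbols and the unusual separators from the list above are not handled.

```python
import re

# Assumption: a "number" is a run of digits, optionally followed by more
# digit runs joined by one of the separators . , : ; /
NUMBER_PATTERN = re.compile(r"\d+(?:[.,:;/]\d+)*")


def hide_numbers_sketch(text: str, number_replacement: str = "<num>") -> str:
    """Replace every match of NUMBER_PATTERN with the replacement token."""
    return NUMBER_PATTERN.sub(number_replacement, text)


hidden = hide_numbers_sketch("Paid 99.99 at 11:45 on 29/12/2010.")
```

A date followed by a time, such as `01.01.2000 12:15`, is matched as two separate numbers because the space is not one of the assumed separators.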

lemmatize(text: str) → str

Lemmatizes text using language model from SpaCy.

Parameters

text (str) – String text to be lemmatized

Returns

Lemmatized text

Return type

str

static remove_characters_for_text(text: str) → str

Executes removal of various types of unwanted characters from text.

Parameters

text (str) – Text to remove characters from

Returns

Text with removed characters

Return type

str

static remove_hmtl_elements(text: str) → str

Removes html elements from text.

Example: <p>sample text</p> -> sample text

Parameters

text (str) – Text to remove html elements from

Returns

Text with removed html elements

Return type

str
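A rough sketch of tag stripping with a regular expression, shown only to illustrate the example above. Whether the library uses a regular expression or an HTML parser is not stated here, so the approach and the name `remove_html_elements_sketch` are assumptions.

```python
import re

# Assumption: a tag is anything between "<" and the next ">".
TAG_PATTERN = re.compile(r"<[^>]+>")


def remove_html_elements_sketch(text: str) -> str:
    """Drop the tags and keep the text between them."""
    return TAG_PATTERN.sub("", text)


stripped = remove_html_elements_sketch("<p>sample text</p>")
```

A pattern like this is fine for simple markup but does not handle comments, scripts or a `>` inside attribute values; a real HTML parser is the safer choice for arbitrary input.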

static remove_non_text_characters(text: str) → str

Removes non text characters from text.

Types of characters to remove:

  • tab and new line characters

  • curly braces and brackets with text inside - {}, []

  • characters: _ - ~ + .. : /

  • unnecessary whitespaces

  • various non-standard characters: <>()|&©ø,;~*’”`”„”“‟‶‚’‘‛⁏;- — ― –⁋%^‰&*$#@!

Types of characters to replace:

  • ?! to be replaced with a dot (.)

Parameters

text (str) – Text to remove non text characters from

Returns

Text with removed characters

Return type

str
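The following sketch covers a subset of the rules listed above: tabs and new lines, bracketed text, `?` and `!`, and surplus whitespace. It is an illustration under assumptions, not the library's code. The name `remove_non_text_sketch`, the order of the steps, and the choice to replace any run of `?` or `!` with one dot are all hypothetical.

```python
import re


def remove_non_text_sketch(text: str) -> str:
    """Apply a few of the documented cleaning rules in sequence."""
    # Tabs and new lines become spaces so that words stay separated.
    text = re.sub(r"[\t\n]", " ", text)
    # Curly braces and square brackets are removed together with their content.
    text = re.sub(r"\{[^{}]*\}|\[[^\[\]]*\]", "", text)
    # Assumption: any run of "?" or "!" is replaced with a single dot.
    text = re.sub(r"[?!]+", ".", text)
    # Collapse repeated whitespace left behind by the previous steps.
    return re.sub(r"\s+", " ", text).strip()


cleaned = remove_non_text_sketch("Really?! Yes {x}\nok")
```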

static remove_special_characters(text: str, characters: Tuple[str] = '\\xao') → str

Removes special characters.

Example:

  • xao

Parameters
  • text (str) – Text to remove special characters from

  • characters (tuple) – Special characters to remove

Returns

Text with removed characters

Return type

str

config utils

nlper.utils.config_utils.read_config(config_path: str, logger: Any = None) → Dict

Reads and returns config from yaml file.

Parameters
  • config_path (str) – Path to yaml config

  • logger (logger) – Logger, logs the loaded config

Returns

Loaded config dictionary

Return type

dict

dataframe utils

class nlper.utils.dataframe_utils.ColumnsWithDuplicates

An enumeration.

class nlper.utils.dataframe_utils.DataFrameUtils

Utils for pandas data frame

static drop_columns(dataframe: pandas.core.frame.DataFrame, columns: list, inplace=True) → Optional[pandas.core.frame.DataFrame]

Drops particular columns.

Parameters
  • dataframe (pd.DataFrame) – Data frame to drop columns from

  • columns (list) – List of columns to drop

  • inplace (bool) – Boolean flag whether to drop inplace, default True

Returns

Data frame with dropped columns

Return type

pd.DataFrame, optional

static remove_empty_rows(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Removes empty rows from data frame.

Parameters

dataframe (pd.DataFrame) – Data frame to remove empty rows from

Returns

Data frame with dropped rows

Return type

pd.DataFrame

class nlper.utils.dataframe_utils.EnumWithListing

Enum that supports listing the values of its elements

class nlper.utils.dataframe_utils.OutputColumns

An enumeration.

lang utils

class nlper.utils.lang_utils.LangUtils

Utils for SpaCy language model

By default, defines a list of special-case tokens for the tokenizer.

set_language_model(spacy_lang: str = 'pl_spacy_model', disable_options=None) → spacy

Loads the SpaCy language model and adds the special cases to the tokenizer. By default tries to load the Polish SpaCy model and falls back to the English model if the Polish one is not available.

Parameters
  • spacy_lang (str) – Name of language model

  • disable_options – List of SpaCy options to disable, for example ‘NER’ for accelerated text parsing

Returns

Loaded Spacy language model

Return type

spacy

tokenize_text(text: str) → List[str]

Tokenizes text using SpaCy language model

Parameters

text (str) – Text to tokenize

Returns

List of tokens

Return type

list

class nlper.utils.lang_utils.Token

Special language tokens:

  • <num> - any numerical value including date and time in different formats

  • <unk> - unknown / out of vocabulary word

  • <pad> - padding for short text sequences in batch

  • <sos> - start of sequence for the beginning of summary generation

  • <eos> - end of sequence for marking the end of generated summary

class nlper.utils.lang_utils.VocabConfig(stoi=None, itos=None)

Utils for vocabulary

Parameters
  • stoi (defaultdict, optional) – Dictionary with text token and assigned index

  • itos (list, optional) – List of text tokens

indices_from_text(text: str) → torch.Tensor

Converts text token to tensor of corresponding indices.

Parameters

text (str) – Text to convert

Returns

Tensor with indices

Return type

torch.Tensor

set_vocab_from_field(text: object) → None

Assigns dictionary of text tokens and indices, together with list of tokens from torchtext vocabulary.

Parameters

text (torchtext.Vocab) – Torchtext vocabulary

set_vocab_from_file(filepath: str = None) → None

Loads vocabulary from file.

Parameters

filepath (str) – Path to file with vocabulary

text_from_indices(indices: List[int]) → str

Converts list of indices into corresponding text - sequence of tokens.

Parameters

indices (list) – List of indices

Returns

Text from tokens

Return type

str
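The relationship between `stoi` and `itos` can be sketched with plain Python containers. This is an illustration only: the vocabulary contents, the whitespace tokenization, the use of index 0 for `<unk>` and the `*_sketch` function names are assumptions, and the sketch returns a list of integers where the documented method returns a `torch.Tensor`.

```python
from collections import defaultdict
from typing import List

# Hypothetical vocabulary: special tokens first, then two example words.
itos = ["<unk>", "<pad>", "<sos>", "<eos>", "<num>", "kot", "pies"]

# defaultdict(int) returns 0 for unseen tokens, which is the assumed
# index of "<unk>" in this sketch.
stoi = defaultdict(int, {token: index for index, token in enumerate(itos)})


def indices_from_text_sketch(text: str) -> List[int]:
    """Map each whitespace-separated token to its index."""
    return [stoi[token] for token in text.split()]


def text_from_indices_sketch(indices: List[int]) -> str:
    """Map indices back to tokens and join them with spaces."""
    return " ".join(itos[index] for index in indices)


indices = indices_from_text_sketch("kot <num> zebra")
restored = text_from_indices_sketch(indices)
```

The out-of-vocabulary word `zebra` maps to the `<unk>` index, so the round trip does not restore it.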

time utils

nlper.utils.time_utils.timeit(f: Callable) → Callable

Decorates function.

Parameters

f (Callable) – Function to decorate

Returns

Decorated function

Return type

Callable
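The entry above does not say what the decorator does. Judging by its name, it probably measures execution time, and the sketch below shows a common way to write such a decorator. Both the timing behaviour and the names `timeit_sketch` and `add` are assumptions, not the library's code.

```python
import functools
import time
from typing import Callable


def timeit_sketch(f: Callable) -> Callable:
    """Wrap f so that each call prints how long it took."""

    @functools.wraps(f)  # keep the name and docstring of the wrapped function
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = f(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{f.__name__} took {elapsed:.6f} s")
        return result

    return wrapper


@timeit_sketch
def add(a: int, b: int) -> int:
    return a + b


total = add(2, 3)
```

The wrapper passes the return value through unchanged, so decorated functions can be used exactly as before.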

torch utils

nlper.utils.torch_utils.get_device() → object

Returns the available device

Returns

device

Return type

object

nlper.utils.torch_utils.to_gpu(tensor: torch.Tensor) → torch.Tensor

Transfers torch tensor to available device

Parameters

tensor (torch.Tensor) – Tensor to be transferred to device

Returns

Tensor transferred to device

Return type

torch.Tensor

train test splitter

class nlper.utils.train_test_splitter.TrainTestSplitter(config: str = None, filepath: str = None, valid: bool = True)

Splits data frame into train, test and valid parts.

Parameters
  • config (str, optional) – Path to yaml config file

  • filepath (str) – Path to pandas data frame

  • valid (bool) – Flag whether to include split for valid part

build_paths() → None

Builds paths required for saving the split data frames

create_dirs() → None

Creates directories for saving the split data frames

read_file() → None

Reads pandas data frame from path

run() → None

Executes train, test, valid data frame split

save_data(train: pandas.core.frame.DataFrame, test: pandas.core.frame.DataFrame, val: Optional[pandas.core.frame.DataFrame] = None, columns: Tuple[str] = ('text', 'summary')) → None

Saves specified columns of train, test and valid data frames into csv files.

Parameters
  • train (pd.DataFrame) – Train part data frame

  • test (pd.DataFrame) – Test part data frame

  • val (pd.DataFrame) – Valid part data frame

  • columns (tuple) – Columns to save

set_config_from_filepath() → None

Specifies directory name to write split data frame parts

split_data() → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]

Splits data frame into train, test and valid parts.

Returns

Train, test and valid data frames

Return type

tuple
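A three-way split can be sketched on a plain list of rows without pandas. The split ratios, the shuffling, the seed and the name `split_rows_sketch` are assumptions for illustration; the library's ratios and method are not documented in this section.

```python
import random
from typing import List, Tuple


def split_rows_sketch(
    rows: List, test_fraction: float = 0.1, valid_fraction: float = 0.1,
    valid: bool = True, seed: int = 0,
) -> Tuple[List, List, List]:
    """Shuffle rows and cut them into train, test and valid parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(len(shuffled) * test_fraction)
    # With valid=False the valid part stays empty and its rows go to train.
    n_valid = int(len(shuffled) * valid_fraction) if valid else 0
    test = shuffled[:n_test]
    valid_part = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, test, valid_part


rows = list(range(100))
train, test, valid_part = split_rows_sketch(rows)
```

Every row ends up in exactly one part, so no example leaks between the train and evaluation sets.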

train utils

nlper.utils.train_utils.calculate_rouge(hypothesis: str, reference: str) → Optional[List[dict]]

Calculates ROUGE scores, a set of metrics for evaluating machine translation and text summarization. ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation and compares model output with the target text. For the text summarization task we consider two base accuracy measures, recall and precision:

  • Recall - number of overlapping words divided by the number of words in the reference summary

  • Precision - number of overlapping words divided by the number of words in the model-generated summary

Additionally, we consider the F1-score, which combines recall and precision.

For the best overview of model performance, we should measure recall, precision and F1-score values.

There are a few types of metrics:

  • ROUGE-1 - measures overlapping unigrams

  • ROUGE-2 - measures overlapping bigrams

  • ROUGE-L - measures the longest common subsequence (LCS), which takes into account in-sequence matches that reflect sentence-level word order

Parameters
  • hypothesis (str) – Model generated text sequence

  • reference (str) – Reference text sequence

Returns

List of precision, recall and F1-score for Rouge-1, Rouge-2 and Rouge-L metrics

Return type

list
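The recall, precision and F1 definitions above can be worked through for ROUGE-1 and ROUGE-2 in a few lines. This is a sketch under assumptions: it uses whitespace tokenization and n-gram counting instead of whichever ROUGE package `calculate_rouge` relies on, and the names `ngram_counts`, `rouge_n_sketch` and the result keys `r`, `p`, `f` are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_sketch(hypothesis: str, reference: str, n: int = 1) -> Dict[str, float]:
    """Compute recall, precision and F1 over overlapping n-grams."""
    hyp = ngram_counts(hypothesis.split(), n)
    ref = ngram_counts(reference.split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((hyp & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(hyp.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"r": recall, "p": precision, "f": f1}


rouge_1 = rouge_n_sketch("the cat sat", "the cat sat on the mat", n=1)
rouge_2 = rouge_n_sketch("the cat sat", "the cat sat on the mat", n=2)
```

In this example all three hypothesis words appear in the six-word reference, so ROUGE-1 precision is 1.0 and recall is 0.5. A short summary can score perfect precision with low recall, which is why both values are reported together with F1.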

nlper.utils.train_utils.draw_attention_matrix(attention: torch.Tensor, original: str, summary: str, config=None, epoch=None, batch_id=None) → None

Draws plot with heatmap of attention using matplotlib for particular training step. If config specified, saves plot in specified location

Parameters
  • attention (torch.Tensor) – Matrix with attention values

  • original (str) – Original text to summarize

  • summary (str) – Model generated summary text

  • config (dict, optional) – Config, if passed then the plot is saved in config-defined location

  • epoch (int, optional) – Current training epoch for naming purpose in plot saving operation

  • batch_id (int, optional) – Current batch number for naming purpose in plot saving operation

trim utils

class nlper.utils.trim_utils.TrimUtils

Utils for trimming text to specified length

static calculate_cumulative_sentences_lengths(sentences: List[Any]) → List[int]

Calculates cumulative length of sentences.

Parameters

sentences (list) – List of sentences

Returns

List of cumulative lengths

Return type

np.array

get_language_model() → None

Obtains the language model.

static get_last_sentence_index(lengths: List[int], threshold: int) → int

Obtains index of last sentence in sequence before trimming.

Parameters
  • lengths (list) – List of sentence lengths

  • threshold (int) – Text sequence length threshold

Returns

Index of last sentence

Return type

int

get_parsed_text(text: str) → Any

Parses text through language model from SpaCy.

Parameters

text (str) – String text to be parsed

Returns

SpaCy parsed text

Return type

spacy.tokens

static join_sentences(sentences: List[Any]) → str

Joins list of sentences into single text.

Parameters

sentences (list) – List of sentences

Returns

Joined text

Return type

str

static remove_text_below_lower_length_threshold(threshold: int) → Callable

Returns a lambda function that checks whether the sequence length is above the specified length threshold.

Parameters

threshold (int) – Length threshold value

Returns

Anonymous lambda function

Return type

callable
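A function that returns a threshold check can be sketched as a closure. How the library measures sequence length is not stated here, so counting whitespace-separated tokens is an assumption, as is the name `above_length_threshold_sketch`.

```python
from typing import Callable


def above_length_threshold_sketch(threshold: int) -> Callable[[str], bool]:
    """Return a predicate that is True for texts longer than threshold."""
    # Assumption: length is the number of whitespace-separated tokens.
    return lambda text: len(text.split()) > threshold


is_long_enough = above_length_threshold_sketch(3)
texts = ["too short", "this one has enough tokens"]
kept = [text for text in texts if is_long_enough(text)]
```

Returning the predicate rather than applying it directly lets the same check be passed to a filter or to a data frame column operation.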

static trim_sentences(sentences: List[str], index: int) → List[str]

Trims sequence of sentences to particular index.

Parameters
  • sentences (list) – List of sentences

  • index (int) – Index to trim at

Returns

Trimmed list of sentences

Return type

list
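The static helpers above can be combined into a trimming pipeline: compute cumulative lengths, find the last sentence that fits, cut the list there and join the result. The sketch below is an illustration under assumptions. It measures length in whitespace-separated tokens, takes sentences as ready-made strings where the library segments text with SpaCy, and uses hypothetical `*_sketch` names.

```python
from itertools import accumulate
from typing import List


def cumulative_lengths_sketch(sentences: List[str]) -> List[int]:
    """Running total of token counts, one entry per sentence."""
    return list(accumulate(len(sentence.split()) for sentence in sentences))


def last_sentence_index_sketch(lengths: List[int], threshold: int) -> int:
    """Number of leading sentences whose running total fits the threshold."""
    # The totals never decrease, so counting the entries that fit
    # gives the exclusive end index for slicing.
    return sum(1 for total in lengths if total <= threshold)


def trim_sketch(sentences: List[str], threshold: int) -> str:
    """Keep whole sentences from the start while the total length fits."""
    lengths = cumulative_lengths_sketch(sentences)
    index = last_sentence_index_sketch(lengths, threshold)
    return " ".join(sentences[:index])


sentences = ["One two three.", "Four five.", "Six seven eight nine."]
trimmed = trim_sketch(sentences, threshold=5)
```

Cutting at sentence boundaries keeps the trimmed text grammatical, at the cost of sometimes ending well below the threshold.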

trim_text_to_upper_length_threshold(text: str, threshold: int) → str

Trims text to specified maximum length threshold. If text length is above the threshold, removes the last sentences until the total length does not exceed the threshold value. If the text length is below threshold, leaves the text unchanged.

Parameters
  • text (str) – Text to be trimmed

  • threshold (int) – Maximum length threshold

Returns