Utils¶

clean utils¶

class nlper.utils.clean_utils.CleanUtils¶

Utils for cleaning text

static convert_list_to_text(text_as_list: List[str]) → str¶

Converts list of texts to single string, filtering the empty texts.

Parameters: text_as_list (list) – List of texts
Returns: Joined text
Return type: str
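
The joining behaviour described above can be illustrated with a minimal sketch. It is not the library's implementation: the function name `convert_list_to_text_sketch`, the single-space separator and the treatment of whitespace-only entries as empty are all assumptions made for illustration.

```python
from typing import List


def convert_list_to_text_sketch(text_as_list: List[str]) -> str:
    """Join non-empty texts into one string (separator is an assumption)."""
    # Drop empty and whitespace-only entries before joining.
    non_empty = [text.strip() for text in text_as_list if text and text.strip()]
    return " ".join(non_empty)


joined = convert_list_to_text_sketch(["First sentence.", "", "  ", "Second sentence."])
print(joined)
```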

get_language_model() → None¶: Obtains the language model.

static hide_numbers(text: str, number_replacement: str = '<num>') → str¶

Hides numbers, dates and times in text with the specified token. Supports various number, date and time formats.

Supported formats example:

22 -> <num>
11:45 -> <num>
99.99 -> <num>
596,789 -> <num>
6;15 -> <num>
5.99 -> <num>
29/12/2010 -> <num>
100 -> <num>
15.V.2030 -> <num>
10.01.2020 -> <num>
29/XII/1990 -> <num>
22—22-2222 -> <num>
22……22.2000 -> <num>
01.01.2000 12:15 -> <num> <num>

Parameters

text (str) – Text to hide numbers in
number_replacement (str) – Token to replace numbers with

Returns

Text with replaced numbers

Return type

str
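
A simplified sketch of this kind of replacement is shown below. It is an assumption-laden illustration, not the pattern used by the library: `NUMBER_PATTERN` and `hide_numbers_sketch` are hypothetical names, and the regular expression covers only digit groups joined by `.`, `,`, `:`, `;` or `/`. Roman-numeral months and the dash or ellipsis separators from the list above are not handled.

```python
import re

# Digit groups optionally chained by one separator character (assumed subset).
NUMBER_PATTERN = re.compile(r"\d+(?:[.,:;/]\d+)*")


def hide_numbers_sketch(text: str, number_replacement: str = "<num>") -> str:
    """Replace every matched number-like span with the replacement token."""
    return NUMBER_PATTERN.sub(number_replacement, text)


hidden = hide_numbers_sketch("Meeting on 01.01.2000 12:15, fee 99.99")
print(hidden)
```

Because a separator only counts when digits follow it, punctuation that ends a clause (such as the comma after `12:15`) stays in the text.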

lemmatize(text: str) → str¶

Lemmatizes text using language model from SpaCy.

Parameters: text (str) – String text to be lemmatized
Returns: Lemmatized text
Return type: str

static remove_characters_for_text(text: str) → str¶

Executes removal of various types of unwanted characters from text.

Parameters: text (str) – Text to remove characters from
Returns: Text with removed characters
Return type: str

static remove_hmtl_elements(text: str) → str¶

Removes html elements from text.

Example: <p>sample text</p> -> sample text

Parameters: text (str) – Text to remove html elements from
Returns: Text with removed html elements
Return type: str
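
The example above can be reproduced with a short regular-expression sketch. This is only an approximation under stated assumptions: `remove_html_elements_sketch` and `TAG_PATTERN` are hypothetical names, and the pattern strips anything between `<` and `>` without parsing the markup, so it may differ from what the library does.

```python
import re

# Matches a single tag such as <p>, </p> or <a href="...">.
TAG_PATTERN = re.compile(r"<[^>]+>")


def remove_html_elements_sketch(text: str) -> str:
    """Strip tag-like spans and keep the text between them."""
    return TAG_PATTERN.sub("", text)


cleaned = remove_html_elements_sketch("<p>sample text</p>")
print(cleaned)
```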

static remove_non_text_characters(text: str) → str¶

Removes non text characters from text.

Types of characters to remove:

tab and new line characters
curly braces and brackets with text inside - {}, []
characters: _ - ~ + .. : /
unnecessary whitespaces
various non-standard characters: <>()|&©ø,;~*’”`”„”“‟‶‚’‘‛⁏;- — ― –⁋%^‰&*$#@!

Types of characters to replace:

?! to be replaced with dot .

Parameters: text (str) – Text to remove non text characters from
Returns: Text with removed characters
Return type: str

static remove_special_characters(text: str, characters: Tuple[str] = '\\xao') → str¶

Removes special characters.

Example: xao

Parameters

text (str) – Text to remove special characters from
characters (tuple) – Special characters to remove

Returns

Text with removed characters

Return type

str

config utils¶

nlper.utils.config_utils.read_config(config_path: str, logger: Any = None) → Dict¶

Reads and returns config from yaml file.

Parameters

config_path (str) – Path to yaml config
logger (logger) – Logger, logs the loaded config

Returns

Loaded config dictionary

Return type

dict

dataframe utils¶

class nlper.utils.dataframe_utils.ColumnsWithDuplicates¶: An enumeration.

class nlper.utils.dataframe_utils.DataFrameUtils¶

Utils for pandas data frame

static drop_columns(dataframe: pandas.core.frame.DataFrame, columns: list, inplace=True) → Optional[pandas.core.frame.DataFrame]¶

Drops particular columns.

Parameters

dataframe (pd.DataFrame) – Data frame to drop columns from
columns (list) – List of columns to drop
inplace (bool) – Boolean flag whether to drop inplace, default True

Returns

Data frame with dropped columns

Return type

pd.DataFrame, optional

static remove_empty_rows(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame¶

Removes empty rows from data frame.

Parameters: dataframe (pd.DataFrame) – Data frame to remove empty rows from
Returns: Data frame with dropped rows
Return type: pd.DataFrame

class nlper.utils.dataframe_utils.EnumWithListing¶: Enum supporting listing of elements value
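
An enum that can list the values of its elements might look like the sketch below. The class names, the `values` method and the members `TEXT` and `SUMMARY` are hypothetical; the actual members and method name in the library are not documented here.

```python
from enum import Enum
from typing import Any, List


class EnumWithListingSketch(Enum):
    """Base enum exposing the values of all members (method name is assumed)."""

    @classmethod
    def values(cls) -> List[Any]:
        # Iterating over an Enum class yields members in definition order.
        return [member.value for member in cls]


class OutputColumnsSketch(EnumWithListingSketch):
    # Hypothetical members, chosen only for illustration.
    TEXT = "text"
    SUMMARY = "summary"


print(OutputColumnsSketch.values())
```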

class nlper.utils.dataframe_utils.OutputColumns¶: An enumeration.

lang utils¶

class nlper.utils.lang_utils.LangUtils¶

Utils for SpaCy language model

By default, defines a list of special-case tokens for the tokenizer.

set_language_model(spacy_lang: str = 'pl_spacy_model', disable_options=None) → spacy¶

Loads the SpaCy language model and adds the special cases to the tokenizer. By default, tries to load the Polish SpaCy model and falls back to the English model if the Polish one is not available.

Parameters

spacy_lang (str) – Name of language model
disable_options – List of SpaCy options to disable, for example ‘NER’ for accelerated text parsing

Returns

Loaded Spacy language model

Return type

spacy

tokenize_text(text: str) → List[str]¶

Tokenizes text using SpaCy language model

Parameters: text (str) – Text to tokenize
Returns: List of tokens
Return type: list

class nlper.utils.lang_utils.Token¶

Special language tokens and what they represent:

<num> - any numerical value including date and time in different formats
<unk> - unknown / out of vocabulary word
<pad> - padding for short text sequences in batch
<sos> - start of sequence for the beginning of summary generation
<eos> - end of sequence for marking the end of generated summary

class nlper.utils.lang_utils.VocabConfig(stoi=None, itos=None)¶

Utils for vocabulary

Parameters

stoi (defaultdict, optional) – Dictionary with text token and assigned index
itos (list, optional) – List of text tokens

indices_from_text(text: str) → torch.Tensor¶

Converts text tokens to a tensor of corresponding indices.

Parameters: text (str) – Text to convert
Returns: Tensor with indices
Return type: torch.Tensor

set_vocab_from_field(text: object) → None¶

Assigns dictionary of text tokens and indices, together with list of tokens from torchtext vocabulary.

Parameters

text (torchtext.Vocab) – Torchtext vocabulary

set_vocab_from_file(filepath: str = None) → None¶

Loads vocabulary from file.

Parameters: filepath (str) – Path to file with vocabulary

text_from_indices(indices: List[int]) → str¶

Converts list of indices into corresponding text - sequence of tokens.

Parameters: indices (list) – List of indices
Returns: Text from tokens
Return type: str
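
The relationship between `stoi` and `itos` can be sketched with plain Python containers. This is a hedged illustration only: the sample vocabulary, the whitespace tokenization, the fallback to the `<unk>` index and the `_sketch` function names are assumptions, and a list of integers stands in for the `torch.Tensor` returned by the real method.

```python
from collections import defaultdict
from typing import DefaultDict, List

UNK = "<unk>"

# itos maps index -> token; stoi maps token -> index.
itos: List[str] = [UNK, "<pad>", "<sos>", "<eos>", "cats", "like", "milk"]
# Unknown tokens fall back to the index of <unk> (assumed behaviour).
stoi: DefaultDict[str, int] = defaultdict(lambda: itos.index(UNK))
stoi.update({token: index for index, token in enumerate(itos)})


def indices_from_text_sketch(text: str) -> List[int]:
    """Map whitespace-separated tokens to their vocabulary indices."""
    return [stoi[token] for token in text.split()]


def text_from_indices_sketch(indices: List[int]) -> str:
    """Map indices back to tokens and join them into one string."""
    return " ".join(itos[index] for index in indices)


indices = indices_from_text_sketch("cats like tea")
round_trip = text_from_indices_sketch(indices)
print(indices, round_trip)
```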

time utils¶

nlper.utils.time_utils.timeit(f: Callable) → Callable¶

Decorates function.

Parameters: f (Callable) – Function to decorate
Returns: Decorated function
Return type: Callable
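
The documentation does not say what the decorator does beyond decorating the function. Assuming, from its name, that it measures execution time, a minimal sketch could look like this. The name `timeit_sketch` and the printed message format are hypothetical.

```python
import functools
import time
from typing import Any, Callable


def timeit_sketch(f: Callable) -> Callable:
    """Wrap f so that each call reports its wall-clock duration (assumed behaviour)."""

    @functools.wraps(f)  # keep the wrapped function's name and docstring
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        result = f(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{f.__name__} took {elapsed:.6f} s")
        return result

    return wrapper


@timeit_sketch
def add(a: int, b: int) -> int:
    return a + b


total = add(2, 3)
```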

torch utils¶

nlper.utils.torch_utils.get_device() → object¶

Returns the available device

Returns: device
Return type: object

nlper.utils.torch_utils.to_gpu(tensor: torch.Tensor) → torch.Tensor¶

Transfers torch tensor to available device

Parameters: tensor (torch.Tensor) – Tensor to be transferred to device
Returns: Tensor transferred to device
Return type: torch.Tensor

train test splitter¶

class nlper.utils.train_test_splitter.TrainTestSplitter(config: str = None, filepath: str = None, valid: bool = True)¶

Splits data frame into train, test and valid parts.

Parameters

config (str, optional) – Path to yaml config file
filepath (str) – Path to pandas data frame
valid (bool) – Flag whether to include split for valid part

build_paths() → None¶: Builds paths required for saving the split data frames

create_dirs() → None¶: Creates directories for saving the split data frames

read_file() → None¶: Reads pandas data frame from path

run() → None¶: Executes train, test, valid data frame split

save_data(train: pandas.core.frame.DataFrame, test: pandas.core.frame.DataFrame, val: Optional[pandas.core.frame.DataFrame] = None, columns: Tuple[str] = ('text', 'summary')) → None¶

Saves specified columns of train, test and valid data frames into csv files.

Parameters

train (pd.DataFrame) – Train part data frame
test (pd.DataFrame) – Test part data frame
val (pd.DataFrame) – Valid part data frame
columns (tuple) – Columns to save

set_config_from_filepath() → None¶: Specifies directory name to write split data frame parts

split_data() → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]¶

Splits data frame into train, test and valid parts.

Returns: Train, test and valid data frames
Return type: tuple
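
A three-way split can be sketched on a plain list, as below. The library works on pandas data frames and its split ratios are not documented here, so the 20% test and 10% valid shares, the fixed seed and the name `split_data_sketch` are all assumptions.

```python
import random
from typing import List, Sequence, Tuple


def split_data_sketch(
    rows: Sequence[str],
    test_size: float = 0.2,   # assumed share of rows for the test part
    valid_size: float = 0.1,  # assumed share of rows for the valid part
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle rows and cut them into train, test and valid parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    n_valid = int(len(shuffled) * valid_size)
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, test, valid


documents = [f"doc-{i}" for i in range(10)]
train, test, valid = split_data_sketch(documents)
print(len(train), len(test), len(valid))
```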

train utils¶

nlper.utils.train_utils.calculate_rouge(hypothesis: str, reference: str) → Optional[List[dict]]¶

Calculates ROUGE scores, a set of metrics for evaluating machine translation and text summarization. ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation and compares model output with the target text. For the text summarization task we consider two base accuracy measures, recall and precision:

Recall - number of overlapping words divided by the number of words in the reference summary
Precision - number of overlapping words divided by the number of words in the model-generated summary

Additionally we consider F1-score of both recall and precision.

For the best overview of a model performance, we should measure recall, precision and F-score values.

There are a few types of metrics:

ROUGE-1 - measures overlapping unigrams
ROUGE-2 - measures overlapping bigrams
ROUGE-L - measures the longest common subsequence (LCS), taking into account in-sequence matches that reflect sentence-level word order

Parameters

hypothesis (str) – Model generated text sequence
reference (str) – Reference text sequence

Returns

List of precision, recall and F1-score for Rouge-1, Rouge-2 and Rouge-L metrics

Return type

list
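
The definitions of recall, precision and F1-score given above can be made concrete with a small standard-library sketch. It is not the function's actual implementation: `rouge_n_sketch`, `lcs_length`, the whitespace tokenization and the result keys `p`, `r` and `f` are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_sketch(hypothesis: str, reference: str, n: int = 1) -> Dict[str, float]:
    """Precision, recall and F1 over overlapping n-grams."""
    hyp = ngrams(hypothesis.split(), n)
    ref = ngrams(reference.split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((hyp & ref).values())
    precision = overlap / max(sum(hyp.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"p": precision, "r": recall, "f": f1}


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, the basis of ROUGE-L."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


hypothesis = "the cat sat on the mat"
reference = "the cat lay on the mat"
rouge_1 = rouge_n_sketch(hypothesis, reference, n=1)
rouge_2 = rouge_n_sketch(hypothesis, reference, n=2)
lcs = lcs_length(hypothesis.split(), reference.split())
print(rouge_1, rouge_2, lcs)
```

In this example five of six unigrams and three of five bigrams overlap, and the longest common subsequence has five tokens.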

nlper.utils.train_utils.draw_attention_matrix(attention: torch.Tensor, original: str, summary: str, config=None, epoch=None, batch_id=None) → None¶

Draws plot with heatmap of attention using matplotlib for particular training step. If config specified, saves plot in specified location

Parameters

attention (torch.Tensor) – Matrix with attention values
original (str) – Original text to summarize
summary (str) – Model generated summary text
config (dict, optional) – Config, if passed then the plot is saved in config-defined location
epoch (int, optional) – Current training epoch for naming purpose in plot saving operation
batch_id (int, optional) – Current batch number for naming purpose in plot saving operation

trim utils¶

class nlper.utils.trim_utils.TrimUtils¶

Utils for trimming text to specified length

static calculate_cumulative_sentences_lengths(sentences: List[Any]) → List[int]¶

Calculates cumulative length of sentences.

Parameters: sentences (list) – List of sentences
Returns: List of cumulative lengths
Return type: np.array
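
A running total of sentence lengths can be sketched with `itertools.accumulate`. The sketch assumes that each sentence is a list of tokens and that length means the token count; the name `calculate_cumulative_lengths_sketch` is hypothetical, and a plain list is returned rather than a NumPy array.

```python
from itertools import accumulate
from typing import List


def calculate_cumulative_lengths_sketch(sentences: List[List[str]]) -> List[int]:
    """Return the running total of sentence lengths, measured in tokens."""
    return list(accumulate(len(sentence) for sentence in sentences))


cumulative = calculate_cumulative_lengths_sketch(
    [["A", "short", "one"], ["Two", "words"], ["Four", "more", "tokens", "here"]]
)
print(cumulative)
```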

get_language_model() → None¶: Obtains the language model.

static get_last_sentence_index(lengths: List[int], threshold: int) → int¶

Obtains index of last sentence in sequence before trimming.

Parameters

lengths (list) – List of sentence lengths
threshold (int) – Text sequence length threshold

Returns

Index of last sentence

Return type

int

get_parsed_text(text: str) → Any¶

Parses text through language model from SpaCy.

Parameters: text (str) – String text to be parsed
Returns: SpaCy parsed text
Return type: spacy.tokens

static join_sentences(sentences: List[Any]) → str¶

Joins list of sentences into single text.

Parameters: sentences (list) – List of sentences
Returns: Joined text
Return type: str

static remove_text_below_lower_length_threshold(threshold: int) → Callable¶

Returns a lambda function that checks whether the sequence length is above the specified length threshold.

Parameters: threshold (int) – Length threshold value
Returns: Anonymous lambda function
Return type: callable
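
A function that builds such a length check might be sketched as follows. The whitespace-based token count, the strict "greater than" comparison and the name `length_filter_sketch` are assumptions; the library may measure length differently.

```python
from typing import Callable


def length_filter_sketch(threshold: int) -> Callable[[str], bool]:
    """Build a predicate that is True for texts longer than threshold tokens."""
    return lambda text: len(text.split()) > threshold


is_long_enough = length_filter_sketch(3)
texts = ["too short", "this one is long enough"]
kept = [text for text in texts if is_long_enough(text)]
print(kept)
```

Returning the predicate instead of a boolean lets the same check be passed to `filter` or to a data frame column's `apply` call.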

static trim_sentences(sentences: List[str], index: int) → List[str]¶

Trims sequence of sentences to particular index.

Parameters

sentences (list) – List of sentences
index (int) – Index to trim at

Returns

Trimmed list of sentences

Return type

list

trim_text_to_upper_length_threshold(text: str, threshold: int) → str¶

Trims text to specified maximum length threshold. If text length is above the threshold, removes the last sentences until the total length does not exceed the threshold value. If the text length is below threshold, leaves the text unchanged.

Parameters

text (str) – Text to be trimmed
threshold (int) – Maximum length threshold

Returns: Trimmed text
Return type: str
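
The trimming rule can be sketched end to end with the standard library. This is an illustration under assumptions, not the library's code: sentences are split with a naive regular expression instead of SpaCy, length is counted in whitespace-separated tokens, and `trim_text_sketch` is a hypothetical name.

```python
import re
from itertools import accumulate
from typing import List


def trim_text_sketch(text: str, threshold: int) -> str:
    """Drop trailing sentences until the token count fits the threshold."""
    # Naive split after ., ! or ? followed by whitespace (SpaCy is used in the library).
    sentences: List[str] = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = list(accumulate(len(s.split()) for s in sentences))
    if not lengths or lengths[-1] <= threshold:
        return text  # already short enough, leave unchanged
    # Cumulative lengths are non-decreasing, so this counts the leading sentences that fit.
    keep = sum(1 for length in lengths if length <= threshold)
    return " ".join(sentences[:keep])


sample = "One two three. Four five. Six seven eight nine."
trimmed = trim_text_sketch(sample, threshold=5)
print(trimmed)
```

In this sketch, a text whose first sentence alone exceeds the threshold is trimmed to an empty string.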
