Utils¶
clean utils¶
-
class
nlper.utils.clean_utils.CleanUtils¶ Utils for cleaning text
-
static
convert_list_to_text(text_as_list: List[str]) → str¶ Converts list of texts to single string, filtering the empty texts.
- Parameters
text_as_list (list) – List of texts
- Returns
Joined text
- Return type
str
-
get_language_model() → None¶ Obtains the language model.
-
static
hide_numbers(text: str, number_replacement: str = '<num>') → str¶ Replaces numbers, dates and times in text with the specified token. Supports various number, date and time formats.
Supported formats example:
* 22 -> <num>
* 11:45 -> <num>
* 99.99 -> <num>
* 596,789 -> <num>
* 6;15 -> <num>
* 5.99 -> <num>
* 29/12/2010 -> <num>
* 100 -> <num>
* 15.V.2030 -> <num>
* 10.01.2020 -> <num>
* 29/XII/1990 -> <num>
* 22—22-2222 -> <num>
* 22……22.2000 -> <num>
* 01.01.2000 12:15 -> <num> <num>
- Parameters
text (str) – Text to hide numbers in
number_replacement (str) – Token to replace numbers with
- Returns
Text with replaced numbers
- Return type
str
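The replacement behaviour described for hide_numbers can be illustrated with a small regular-expression sketch. This is not the library's implementation: the function name `hide_numbers_sketch` and the pattern are assumptions made for illustration, and the pattern covers only part of the listed formats (digit groups joined by `.`, `,`, `:`, `;` or `/`, with an optional Roman-numeral month). The dash and ellipsis variants are not handled.

```python
import re

# Hypothetical pattern: a run of digits, optionally followed by
# separator + digits (or a Roman-numeral month such as V or XII).
_NUMBER_PATTERN = re.compile(r"\d+(?:[.,:;/](?:\d+|[IVX]+))*")


def hide_numbers_sketch(text: str, number_replacement: str = "<num>") -> str:
    """Replace each number-like run in text with number_replacement."""
    return _NUMBER_PATTERN.sub(number_replacement, text)


if __name__ == "__main__":
    print(hide_numbers_sketch("Meeting on 29/XII/1990 at 11:45"))
```

A trailing sentence period is left in place because the pattern only consumes a separator when digits or a Roman numeral follow it.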
-
lemmatize(text: str) → str¶ Lemmatizes text using language model from SpaCy.
- Parameters
text (str) – String text to be lemmatized
- Returns
Lemmatized text
- Return type
str
-
static
remove_characters_for_text(text: str) → str¶ Executes removal of various types of unwanted characters from text.
- Parameters
text (str) – Text to remove characters from
- Returns
Text with removed characters
- Return type
str
-
static
remove_hmtl_elements(text: str) → str¶ Removes html elements from text.
Example: <p>sample text</p> -> sample text
- Parameters
text (str) – Text to remove html elements from
- Returns
Text with removed html elements
- Return type
str
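One way to obtain the effect shown in the example above is to keep only the text nodes of the input, using the standard library HTML parser. This is a sketch under assumptions, not the code behind remove_hmtl_elements: the names `_TextExtractor` and `remove_html_elements_sketch` are hypothetical, and the library may well use a different technique.

```python
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores all tags (illustrative helper)."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def remove_html_elements_sketch(text: str) -> str:
    """Return text with HTML tags dropped and their inner text kept."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


if __name__ == "__main__":
    print(remove_html_elements_sketch("<p>sample text</p>"))
```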
-
static
remove_non_text_characters(text: str) → str¶ Removes non text characters from text.
Types of characters to remove:
* tab and new line characters
* curly braces and brackets with text inside - {}, []
* characters: _ - ~ + .. : /
* unnecessary whitespaces
* various non-standard characters: <>()|&©ø,;~*’”`”„”“‟‶‚’‘‛⁏;- — ― –⁋%^‰&*$#@!
Types of characters to replace:
* ?! to be replaced with dot .
- Parameters
text (str) – Text to remove non text characters from
- Returns
Text with removed characters
- Return type
str
-
static
remove_special_characters(text: str, characters: Tuple[str] = '\\xao') → str¶ Removes special characters.
Example: * xao
- Parameters
text (str) – Text to remove special characters from
characters (tuple) – Special characters to remove
- Returns
Text with removed characters
- Return type
str
config utils¶
-
nlper.utils.config_utils.read_config(config_path: str, logger: Any = None) → Dict¶ Reads and returns config from yaml file.
- Parameters
config_path (str) – Path to yaml config
logger (logger) – Logger, logs the loaded config
- Returns
Loaded config dictionary
- Return type
dict
dataframe utils¶
-
class
nlper.utils.dataframe_utils.ColumnsWithDuplicates¶ An enumeration.
-
class
nlper.utils.dataframe_utils.DataFrameUtils¶ Utils for pandas data frame
-
static
drop_columns(dataframe: pandas.core.frame.DataFrame, columns: list, inplace=True) → Optional[pandas.core.frame.DataFrame]¶ Drops particular columns.
- Parameters
dataframe (pd.DataFrame) – Data frame to drop columns from
columns (list) – List of columns to drop
inplace (bool) – Flag whether to drop in place, default True
- Returns
Data frame with dropped columns
- Return type
pd.DataFrame, optional
-
static
remove_empty_rows(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame¶ Removes empty rows from data frame.
- Parameters
dataframe (pd.DataFrame) – Data frame to remove empty rows from
- Returns
Data frame with dropped rows
- Return type
pd.DataFrame
-
class
nlper.utils.dataframe_utils.EnumWithListing¶ Enum supporting listing of elements value
-
class
nlper.utils.dataframe_utils.OutputColumns¶ An enumeration.
lang utils¶
-
class
nlper.utils.lang_utils.LangUtils¶ Utils for SpaCy language model
By default defines a list of special-case tokens for the tokenizer.
-
set_language_model(spacy_lang: str = 'pl_spacy_model', disable_options=None) → spacy¶ Loads the SpaCy language model and adds the special cases to the tokenizer. By default tries to load the Polish SpaCy model and falls back to the English model if the first one is not available.
- Parameters
spacy_lang (str) – Name of language model
disable_options – List of SpaCy options to disable, for example ‘NER’ for accelerated text parsing
- Returns
Loaded Spacy language model
- Return type
spacy
-
tokenize_text(text: str) → List[str]¶ Tokenizes text using SpaCy language model
- Parameters
text (str) – Text to tokenize
- Returns
List of tokens
- Return type
list
-
-
class
nlper.utils.lang_utils.Token¶ Special language tokens:
<num> - any numerical value including date and time in different formats
<unk> - unknown / out of vocabulary word
<pad> - padding for short text sequences in batch
<sos> - start of sequence for the beginning of summary generation
<eos> - end of sequence for marking the end of generated summary
-
class
nlper.utils.lang_utils.VocabConfig(stoi=None, itos=None)¶ Utils for vocabulary
- Parameters
stoi (defaultdict, optional) – Dictionary with text token and assigned index
itos (list, optional) – List of text tokens
-
indices_from_text(text: str) → torch.Tensor¶ Converts text token to tensor of corresponding indices.
- Parameters
text (str) – Text to convert
- Returns
Tensor with indices
- Return type
torch.Tensor
-
set_vocab_from_field(text: object) → None¶ Assigns dictionary of text tokens and indices, together with list of tokens from torchtext vocabulary.
- Parameters
text (torchtext.Vocab) – Torchtext vocabulary
-
set_vocab_from_file(filepath: str = None) → None¶ Loads vocabulary from file.
- Parameters
filepath (str) – Path to file with vocabulary
-
text_from_indices(indices: List[int]) → str¶ Converts list of indices into corresponding text - sequence of tokens.
- Parameters
indices (list) – List of indices
- Returns
Text from tokens
- Return type
str
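The stoi / itos pair described above can be pictured as a two-way lookup between tokens and indices. The sketch below is an illustration under assumptions, not the VocabConfig implementation: the class name `VocabSketch` is hypothetical, it returns a plain list of ints instead of a torch.Tensor, it splits text on whitespace, and it maps unknown tokens to the index of `<unk>`.

```python
from collections import defaultdict
from typing import Dict, List


class VocabSketch:
    """Toy vocabulary: itos is a list of tokens, stoi maps token -> index."""

    def __init__(self, tokens: List[str]) -> None:
        # Reserve index 0 for the unknown token (an assumption).
        self.itos: List[str] = ["<unk>"] + [t for t in tokens if t != "<unk>"]
        self.stoi: Dict[str, int] = defaultdict(lambda: 0)
        for index, token in enumerate(self.itos):
            self.stoi[token] = index

    def indices_from_text(self, text: str) -> List[int]:
        # Whitespace tokenization stands in for the SpaCy tokenizer.
        return [self.stoi[token] for token in text.split()]

    def text_from_indices(self, indices: List[int]) -> str:
        return " ".join(self.itos[i] for i in indices)


if __name__ == "__main__":
    vocab = VocabSketch(["<num>", "cat", "dog"])
    ids = vocab.indices_from_text("cat dog bird")
    print(ids, vocab.text_from_indices(ids))
```

Because unknown words collapse to a single index, converting text to indices and back is lossy for out-of-vocabulary tokens.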
time utils¶
-
nlper.utils.time_utils.timeit(f: Callable) → Callable¶ Decorates function.
- Parameters
f (Callable) – Function to decorate
- Returns
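The entry above does not say what the decorator does beyond wrapping a function. Judging by the name alone, it presumably measures execution time. A generic timing decorator, assuming that intent, might look like the sketch below. The name `timeit_sketch` and the choice to print the elapsed time are assumptions, not the library's behaviour.

```python
import functools
import time
from typing import Any, Callable


def timeit_sketch(f: Callable) -> Callable:
    """Wrap f so that each call reports how long it took."""

    @functools.wraps(f)  # keep the wrapped function's name and docstring
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        try:
            return f(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            print(f"{f.__name__} took {elapsed:.6f} s")

    return wrapper


if __name__ == "__main__":
    @timeit_sketch
    def slow_sum(n: int) -> int:
        return sum(range(n))

    slow_sum(100_000)
```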
torch utils¶
-
nlper.utils.torch_utils.get_device() → object¶ Returns the available device
- Returns
device
- Return type
object
-
nlper.utils.torch_utils.to_gpu(tensor: torch.Tensor) → torch.Tensor¶ Transfers torch tensor to available device
- Parameters
tensor (torch.Tensor) – Tensor to be transferred to device
- Returns
- Return type
torch.Tensor
train test splitter¶
-
class
nlper.utils.train_test_splitter.TrainTestSplitter(config: str = None, filepath: str = None, valid: bool = True)¶ Splits data frame into train, test and valid parts.
- Parameters
config (str, optional) – Path to yaml config file
filepath (str) – Path to pandas data frame
valid (bool) – Flag whether to include split for valid part
-
build_paths() → None¶ Builds paths required for saving the split data frames
-
create_dirs() → None¶ Creates directories for split data frames to save
-
read_file() → None¶ Reads pandas data frame from path
-
run() → None¶ Executes train, test, valid data frame split
-
save_data(train: pandas.core.frame.DataFrame, test: pandas.core.frame.DataFrame, val: Optional[pandas.core.frame.DataFrame] = None, columns: Tuple[str] = ('text', 'summary')) → None¶ Saves specified columns of train, test and valid data frames into csv files.
- Parameters
train (pd.DataFrame) – Train part data frame
test (pd.DataFrame) – Test part data frame
val (pd.DataFrame) – Valid part data frame
columns (tuple) – Columns to save
-
set_config_from_filepath() → None¶ Specifies directory name to write split data frame parts
-
split_data() → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]¶ Splits data frame into train, test and valid parts.
- Returns
Train, test and valid data frames
- Return type
tuple
train utils¶
-
nlper.utils.train_utils.calculate_rouge(hypothesis: str, reference: str) → Optional[List[dict]]¶ Calculates ROUGE scores, a set of metrics for evaluating machine translation and text summarization. ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation and compares model output with the target text. For the text summarization task we consider two base accuracy measures, recall and precision:
* Recall - number of overlapping words, divided by number of words in reference summary
* Precision - number of overlapping words, divided by number of words in model generated summary
Additionally we consider the F1-score, which combines recall and precision.
For the best overview of model performance, we should measure recall, precision and F1-score values.
There are several types of metrics:
* ROUGE-1 - measures overlapping unigrams
* ROUGE-2 - measures overlapping bigrams
* ROUGE-L - measures the longest common subsequence (LCS); takes into account in-sequence matches that reflect sentence-level word order
- Parameters
hypothesis (str) – Model generated text sequence
reference (str) – Reference text sequence
- Returns
List of precision, recall and F1-score for Rouge-1, Rouge-2 and Rouge-L metrics
- Return type
list
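The recall, precision and F1 definitions above can be made concrete for the unigram case (ROUGE-1). The sketch below is a hand-rolled illustration, not the code behind calculate_rouge: the function name `rouge_1_sketch`, the whitespace tokenization and the dictionary keys are assumptions. Overlap is counted with clipped word counts, so a repeated word is matched at most as many times as it appears in both texts.

```python
from collections import Counter
from typing import Dict


def rouge_1_sketch(hypothesis: str, reference: str) -> Dict[str, float]:
    """Unigram precision, recall and F1 between hypothesis and reference."""
    hyp_counts = Counter(hypothesis.split())
    ref_counts = Counter(reference.split())

    # Counter intersection keeps the minimum count per word (clipped overlap).
    overlap = sum((hyp_counts & ref_counts).values())
    hyp_total = sum(hyp_counts.values())
    ref_total = sum(ref_counts.values())

    precision = overlap / hyp_total if hyp_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"p": precision, "r": recall, "f": f1}


if __name__ == "__main__":
    print(rouge_1_sketch("a b c d", "a b e"))
```

For the hypothesis "a b c d" and the reference "a b e", two words overlap, giving precision 2/4, recall 2/3 and F1 4/7.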
-
nlper.utils.train_utils.draw_attention_matrix(attention: torch.Tensor, original: str, summary: str, config=None, epoch=None, batch_id=None) → None¶ Draws plot with heatmap of attention using matplotlib for particular training step. If config specified, saves plot in specified location
- Parameters
attention (torch.Tensor) – Matrix with attention values
original (str) – Original text to summarize
summary (str) – Model generated summary text
config (dict, optional) – Config, if passed then the plot is saved in config-defined location
epoch (int, optional) – Current training epoch for naming purpose in plot saving operation
batch_id (int, optional) – Current batch number for naming purpose in plot saving operation
trim utils¶
-
class
nlper.utils.trim_utils.TrimUtils¶ Utils for trimming text to specified length
-
static
calculate_cumulative_sentences_lengths(sentences: List[Any]) → List[int]¶ Calculates cumulative length of sentences.
- Parameters
sentences (list) – List of sentences
- Returns
List of cumulative lengths
- Return type
np.array
-
get_language_model() → None¶ Obtains the language model.
-
static
get_last_sentence_index(lengths: List[int], threshold: int) → int¶ Obtains index of last sentence in sequence before trimming.
- Parameters
lengths (list) – List of sentence lengths
threshold (int) – Text sequence length threshold
- Returns
Index of last sentence
- Return type
int
-
get_parsed_text(text: str) → Any¶ Parses text through language model from SpaCy.
- Parameters
text (str) – String text to be parsed
- Returns
SpaCy parsed text
- Return type
spacy.tokens
-
static
join_sentences(sentences: List[Any]) → str¶ Joins list of sentences into single text.
- Parameters
sentences (list) – List of sentences
- Returns
Joined text
- Return type
str
-
static
remove_text_below_lower_length_threshold(threshold: int) → Callable¶ Returns a lambda function that checks whether sequence length is above the specified length threshold.
- Parameters
threshold (int) – Length threshold value
- Returns
Anonymous lambda function
- Return type
callable
-
static
trim_sentences(sentences: List[str], index: int) → List[str]¶ Trims sequence of sentences to particular index.
- Parameters
sentences (list) – List of sentences
index (int) – Index to trim at
- Returns
Trimmed list of sentences
- Return type
list
-
trim_text_to_upper_length_threshold(text: str, threshold: int) → str¶ Trims text to specified maximum length threshold. If text length is above the threshold, removes the last sentences until the total length does not exceed the threshold value. If the text length is below threshold, leaves the text unchanged.
- Parameters
text (str) – Text to be trimmed
threshold (int) – Maximum length threshold
- Returns
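The trimming described above combines two of the helpers in this class: cumulative sentence lengths and the index of the last sentence that still fits. A minimal sketch of that idea follows. It is not the TrimUtils implementation: the function name is hypothetical, the input is assumed to be already split into sentences (the library uses SpaCy for this), and length is assumed to be counted in whitespace-separated tokens.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List


def trim_sentences_to_threshold_sketch(sentences: List[str], threshold: int) -> str:
    """Keep leading sentences whose cumulative length stays within threshold."""
    lengths = [len(sentence.split()) for sentence in sentences]
    cumulative = list(accumulate(lengths))  # e.g. [3, 2, 4] -> [3, 5, 9]

    # Number of sentences whose running total is <= threshold.
    keep = bisect_right(cumulative, threshold)
    return " ".join(sentences[:keep])


if __name__ == "__main__":
    parts = ["One two three.", "Four five.", "Six seven eight nine."]
    print(trim_sentences_to_threshold_sketch(parts, 5))
```

Whole sentences are dropped from the end, so the result can be shorter than the threshold but never longer.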