File I/O

main

data frame reader

class nlper.file_io.dataframe_reader.FileReader(path: str, allowed_extensions: Sequence = ('.jsonl', '.jl'))

Extracts raw data files into pandas data frames. Starts by fetching the file paths and names.

Parameters
  • path (str) – Path to folder with raw data files

  • allowed_extensions (sequence) – Allowed file extensions

read_json_lines_files() → Dict[str, pandas.core.frame.DataFrame]

Reads raw JSON Lines files into pandas data frames and stores them in a dict with the file name as key.

Example output:

{ 'BBC' : pd.DataFrame(...), 'CNN' : pd.DataFrame(...) }

Returns

Dictionary with file names and data frames

Return type

dict
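
The behaviour described above can be approximated with plain pandas. The following is a minimal sketch, not the nlper implementation: the helper name read_json_lines_folder is hypothetical, and it assumes that every matching file is valid JSON Lines and that the dictionary key is the file name without its extension, as the example output suggests.

```python
from pathlib import Path
from typing import Dict, Sequence

import pandas as pd


def read_json_lines_folder(
    path: str, allowed_extensions: Sequence[str] = (".jsonl", ".jl")
) -> Dict[str, pd.DataFrame]:
    """Hypothetical stand-in for FileReader.read_json_lines_files()."""
    frames: Dict[str, pd.DataFrame] = {}
    # Sort the directory listing so the key order is deterministic.
    for file_path in sorted(Path(path).iterdir()):
        if file_path.is_file() and file_path.suffix in allowed_extensions:
            # lines=True makes pandas parse one JSON object per line.
            frames[file_path.stem] = pd.read_json(file_path, lines=True)
    return frames
```

Files with other extensions are skipped silently in this sketch; whether FileReader does the same is not stated here.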

data frame writer

class nlper.file_io.dataframe_writer.FileWriter(path: str, output_type: str = 'pickle')

Saves pandas data frames to files. Currently supports the CSV and Pickle formats.

Parameters
  • path (str) – Path to folder to save files

  • output_type (str) – Format of saved files

merge_dataframes() → pandas.core.frame.DataFrame

Merges multiple data frames into a single one.

Returns

Merged data frame

Return type

pd.DataFrame

resolve_output_format_type_and_save(data: pandas.core.frame.DataFrame, name: str) → None

Resolves the output format type and saves a single data frame. Currently supports only the Pickle and CSV file types.

Parameters
  • data (pd.DataFrame) – Data frame to save.

  • name (str) – Name under which to save the data frame

save_dataframe(name: str) → None

Calls method to resolve output format and save single data frame.

Parameters

name (str) – Name of data frame to save

save_dataframes(name: str) → None

Calls method to resolve output format and save multiple data frames.

Parameters

name (str) – Name under which to save the data frames

save_file(data: Any, name: str, merge_data: Any = None, output_type: str = None) → str

Decides how to save the data frames to files based on the passed arguments. If merge_data is set to True, all data frames are merged into a single one.

Parameters
  • data (dict, pd.DataFrame) – Dictionary with file names as keys and data frames as values, or a single data frame.

  • name (str) – Name of output file(s)

  • merge_data (bool, optional) – Flag indicating whether to merge multiple data frames into one

  • output_type (str, optional) – Format in which to save the data frame(s). If not specified, the one passed to the __init__ method is used.

Returns

File saving location

Return type

str
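
The dispatch described for save_file can be sketched with plain pandas. This is an illustration under stated assumptions, not the FileWriter code: the function save_frames and its directory argument are hypothetical, merging is assumed to mean stacking rows with pd.concat, and the "<name>_<key>" file naming for unmerged dictionaries is invented for the example.

```python
import os
from typing import Dict, Union

import pandas as pd


def save_frames(
    data: Union[Dict[str, pd.DataFrame], pd.DataFrame],
    directory: str,
    name: str,
    merge_data: bool = False,
    output_type: str = "pickle",
) -> str:
    """Hypothetical stand-in for FileWriter.save_file()."""
    if output_type not in ("pickle", "csv"):
        raise ValueError(f"Unsupported output type: {output_type}")
    os.makedirs(directory, exist_ok=True)

    if isinstance(data, pd.DataFrame):
        # A single data frame is saved under the given name.
        frames = {name: data}
    elif merge_data:
        # Assumption: merging means stacking the rows of all frames.
        frames = {name: pd.concat(list(data.values()), ignore_index=True)}
    else:
        # Assumption: one file per frame, named "<name>_<key>".
        frames = {f"{name}_{key}": frame for key, frame in data.items()}

    extension = "pkl" if output_type == "pickle" else "csv"
    for file_name, frame in frames.items():
        target = os.path.join(directory, f"{file_name}.{extension}")
        if output_type == "pickle":
            frame.to_pickle(target)
        else:
            frame.to_csv(target, index=False)
    # Return the saving location, mirroring the documented return value.
    return directory
```

For example, save_frames(frames, "out", "news", merge_data=True, output_type="csv") would write a single out/news.csv and return "out".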

file type resolver

class nlper.file_io.file_type_resolver.FileTypesResolver

Readers for the supported file types.

reader

class nlper.file_io.reader.CsvReader
class nlper.file_io.reader.HtmlReader
class nlper.file_io.reader.JsonReader
class nlper.file_io.reader.Reader
open_file(filepath: str) → Any

Safely opens and returns file specified in file path.

Parameters

filepath (str) – File path to open

Returns

Opened file, or FileNotFoundError if the file does not exist

Return type

any
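
One way to open a file "safely" is shown below. It is a sketch only: the helper read_text_file is hypothetical, it reads text rather than returning a file object, and it assumes that a missing file should raise FileNotFoundError, which the description above leaves open.

```python
def read_text_file(filepath: str, encoding: str = "utf-8") -> str:
    """Hypothetical helper: return the file's text or raise FileNotFoundError."""
    try:
        # The context manager closes the handle even if reading fails.
        with open(filepath, "r", encoding=encoding) as handle:
            return handle.read()
    except FileNotFoundError as error:
        # Re-raise with the offending path spelled out in the message.
        raise FileNotFoundError(f"File not found: {filepath}") from error
```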

class nlper.file_io.reader.TextReader

writer

class nlper.file_io.writer.CsvWriter
class nlper.file_io.writer.JsonWriter
class nlper.file_io.writer.PickleWriter
class nlper.file_io.writer.Writer
static create_dir(directory: str) → None

Creates a directory for split data frame parts if it does not exist.

Parameters

directory (str) – Directory to create
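
A create-if-missing helper of this kind is usually a thin wrapper around the standard library. The sketch below assumes that missing parent directories should be created as well, which the description does not state.

```python
import os


def create_dir(directory: str) -> None:
    """Create the directory (and any missing parents) unless it already exists."""
    # exist_ok=True makes repeated calls harmless.
    os.makedirs(directory, exist_ok=True)
```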

write(path: str, file: Any) → None

Safely writes file to specified location.

Parameters
  • path (str) – Path to save file

  • file (any) – File to save
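
As an illustration of what a concrete writer might do, the sketch below pickles an arbitrary object to a path. The function name write_pickle is hypothetical, and creating the parent folder before writing is an assumption about what "safely" means here.

```python
import os
import pickle
from typing import Any


def write_pickle(path: str, file: Any) -> None:
    """Hypothetical writer: pickle `file` to `path`, creating parent folders."""
    parent = os.path.dirname(path)
    if parent:
        # Make sure the target folder exists before opening the file.
        os.makedirs(parent, exist_ok=True)
    # The context manager closes the handle even if pickling fails.
    with open(path, "wb") as handle:
        pickle.dump(file, handle)
```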