Utilities

Basic and General Utils Functions

class impresso_essentials.utils.IssueDir(journal, date, edition, path)

Bases: tuple

Create new instance of IssueDir(journal, date, edition, path)

date

Alias for field number 1

edition

Alias for field number 2

journal

Alias for field number 0

path

Alias for field number 3
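As an illustration of the field order documented above, here is a minimal sketch that uses a stand-in namedtuple rather than the real class. The journal alias, date, edition and path values are hypothetical, and storing the date as a datetime.date is an assumption.

```python
from collections import namedtuple
from datetime import date

# Stand-in with the documented field order; not imported from impresso_essentials.
IssueDir = namedtuple("IssueDir", ["journal", "date", "edition", "path"])

# All values below are hypothetical.
issue = IssueDir("GDL", date(1900, 1, 2), "a", "/data/GDL/1900/01/02/a")

print(issue.journal)  # field number 0
print(issue[1])       # field number 1, same object as issue.date

# Being a tuple, an instance can also be unpacked positionally.
journal, issue_date, edition, path = issue
```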

class impresso_essentials.utils.SourceType(value)

Bases: StrEnum

Enumeration of all media source types in Impresso.

MG = 'monograph'
NP = 'newspaper'
RB = 'radio_broadcast'
RM = 'radio_magazine'
RS = 'radio_schedule'
classmethod has_value(value: str) bool

Check if enum contains given value

Parameters:
  • cls (Self) – This SourceType class

  • value (str) – Value to check

Returns:

True if the value provided is in this enum’s values, False otherwise.

Return type:

bool
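The membership check can be sketched with a stand-in enum. SourceTypeSketch below is a hypothetical name, it mixes in str instead of using StrEnum so that it also runs on Python versions before 3.11, and the body of has_value is an assumption about how such a check might be written.

```python
from enum import Enum


class SourceTypeSketch(str, Enum):
    """Stand-in mirroring the documented members (not the library class)."""

    MG = "monograph"
    NP = "newspaper"
    RB = "radio_broadcast"
    RM = "radio_magazine"
    RS = "radio_schedule"

    @classmethod
    def has_value(cls, value: str) -> bool:
        # True only when `value` equals one of the member values.
        return any(member.value == value for member in cls)


print(SourceTypeSketch.has_value("newspaper"))
print(SourceTypeSketch.has_value("NP"))
```

Note that the check is made against member values such as 'newspaper', not against member names such as NP.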

class impresso_essentials.utils.Timer

Bases: object

Basic timer

stop() str

Stop the timer.

Returns:

Elapsed time since the start tick in seconds.

Return type:

str

tick() str

Perform a tick with the timer.

Returns:

Elapsed time since last tick in seconds.

Return type:

str
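The difference between tick() and stop() can be illustrated with a small stand-in timer. SketchTimer is a hypothetical name, and both the use of time.perf_counter and the two-decimal string formatting are assumptions, not the library's behaviour.

```python
import time


class SketchTimer:
    """Stand-in timer: tick() measures since the last tick, stop() since the start."""

    def __init__(self) -> None:
        self.start = time.perf_counter()
        self.last = self.start

    def tick(self) -> str:
        now = time.perf_counter()
        elapsed = now - self.last
        self.last = now  # the next tick is measured from here
        return f"{elapsed:.2f}"

    def stop(self) -> str:
        # Total time since the timer was created.
        return f"{time.perf_counter() - self.start:.2f}"


timer = SketchTimer()
time.sleep(0.01)
print("step took", timer.tick(), "seconds")
print("total", timer.stop(), "seconds")
```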

impresso_essentials.utils.bytes_to(bytes_nb: int, to_unit: str, bsize: int = 1024) float

Convert bytes to the specified unit.

Supported target units: ‘k’ (kilobytes), ‘m’ (megabytes), ‘g’ (gigabytes), ‘t’ (terabytes), ‘p’ (petabytes), ‘e’ (exabytes).

Parameters:
  • bytes_nb (int) – The number of bytes to be converted.

  • to_unit (str) – The target unit for conversion.

  • bsize (int, optional) – The base size used for conversion (default is 1024).

Returns:

The converted value in the specified unit.

Return type:

float

Raises:

KeyError – If the specified target unit is not supported.
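The conversion described above amounts to dividing by a power of the base size. The following is a minimal sketch under that assumption; bytes_to_sketch is a hypothetical name and the exponent table is inferred from the list of supported units.

```python
def bytes_to_sketch(bytes_nb: int, to_unit: str, bsize: int = 1024) -> float:
    """Convert a byte count to the given unit (stand-in, not the library code)."""
    # Each unit is one more power of the base size than the previous one.
    exponents = {"k": 1, "m": 2, "g": 3, "t": 4, "p": 5, "e": 6}
    # An unsupported unit raises KeyError on the dictionary lookup.
    return float(bytes_nb) / (bsize ** exponents[to_unit])


print(bytes_to_sketch(2048, "k"))
print(bytes_to_sketch(5 * 1024**3, "g"))
print(bytes_to_sketch(1_000_000, "m", bsize=1000))
```

Passing bsize=1000 gives decimal (SI) units instead of binary ones.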

impresso_essentials.utils.chunk(l_to_chunk: list, chunksize: int) Generator

Yield successive chunks of at most chunksize elements from a list.

Parameters:
  • l_to_chunk (list) – List to chunk down.

  • chunksize (int) – Size of each chunk.

Yields:

Generator – Each chunk of the list.
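Chunking of this kind is commonly written with list slicing. The sketch below assumes that approach; chunk_sketch is a hypothetical name and the last chunk may be shorter than chunksize.

```python
from typing import Generator


def chunk_sketch(l_to_chunk: list, chunksize: int) -> Generator[list, None, None]:
    """Yield consecutive slices of `l_to_chunk` holding at most `chunksize` items."""
    for start in range(0, len(l_to_chunk), chunksize):
        yield l_to_chunk[start : start + chunksize]


for part in chunk_sketch([1, 2, 3, 4, 5], 2):
    print(part)
```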

impresso_essentials.utils.get_list_intersection(list1: list, list2: list) list

Compute the intersection between two lists.

Parameters:
  • list1 (list) – First list to intersect.

  • list2 (list) – Second list to intersect.

Returns:

List of the elements common to both arguments.

Return type:

list
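One common way to intersect two lists is through sets. The sketch below assumes that approach, which means duplicates are dropped and the order of the result is not guaranteed; list_intersection_sketch is a hypothetical name and the real function may behave differently on both points.

```python
def list_intersection_sketch(list1: list, list2: list) -> list:
    """Return the elements present in both lists (set-based stand-in)."""
    # Sets discard duplicates and ordering, so elements must be hashable.
    return list(set(list1) & set(list2))


common = list_intersection_sketch([1, 2, 2, 3], [2, 3, 4])
print(sorted(common))
```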

impresso_essentials.utils.get_pkg_resource(file_manager: ExitStack, path: str, package: str = 'impresso_essentials') PosixPath

Return the resource at path in package, using a context manager.

Note

The context manager file_manager needs to be instantiated prior to calling this function and should be closed once the package resource is no longer of use.

Parameters:
  • file_manager (contextlib.ExitStack) – Context manager.

  • path (str) – Path to the desired resource in given package.

  • package (str, optional) – Package name. Defaults to “impresso_essentials”.

Returns:

Path to desired managed resource.

Return type:

pathlib.PosixPath
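The note about instantiating and closing the context manager can be illustrated with the standard library alone. The sketch below assumes the function is built on importlib.resources, which the source does not confirm; pkg_resource_sketch is a hypothetical name, and the stdlib json package is used only because it is always available.

```python
from contextlib import ExitStack
from importlib import resources
from pathlib import Path


def pkg_resource_sketch(file_manager: ExitStack, path: str, package: str) -> Path:
    """Return a filesystem path to `path` inside `package` (stand-in)."""
    ref = resources.files(package) / path
    # Registering as_file() on the ExitStack keeps the path valid
    # until the stack is closed by the caller.
    return file_manager.enter_context(resources.as_file(ref))


# The ExitStack is created before the call and closed once the resource is no longer needed.
with ExitStack() as file_manager:
    init_file = pkg_resource_sketch(file_manager, "__init__.py", "json")
    print(init_file.name)
```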

impresso_essentials.utils.id_to_issuedir(canonical_id: str, issue_path: str) IssueDir

Instantiate an IssueDir object from a canonical ID and the path to the issue.

Parameters:
  • canonical_id (str) – Canonical ID of the issue.

  • issue_path (str) – Local path to the issue files.

Returns:

IssueDir instance for the issue.

Return type:

IssueDir
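The mapping from an ID to the four IssueDir fields can be sketched as follows. The ID layout alias-YYYY-MM-DD-edition is an assumption that is not stated in this reference, as is the conversion of the date part to datetime.date; id_to_issuedir_sketch, the ID and the path are hypothetical.

```python
from collections import namedtuple
from datetime import date

# Stand-in with the documented field order; not imported from impresso_essentials.
IssueDir = namedtuple("IssueDir", ["journal", "date", "edition", "path"])


def id_to_issuedir_sketch(canonical_id: str, issue_path: str) -> IssueDir:
    """Build an IssueDir from an ID assumed to look like <alias>-<YYYY>-<MM>-<DD>-<edition>."""
    # Splitting from the right tolerates hyphens inside the alias.
    journal, year, month, day, edition = canonical_id.rsplit("-", 4)
    return IssueDir(journal, date(int(year), int(month), int(day)), edition, issue_path)


# Hypothetical ID and path.
print(id_to_issuedir_sketch("GDL-1900-01-02-a", "/data/GDL/1900/01/02/a"))
```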

impresso_essentials.utils.init_logger(_logger: RootLogger, level: int = 20, file: str | None = None) RootLogger

Initialises the root logger.

Parameters:
  • _logger (logging.RootLogger) – Logger instance to initialise.

  • level (int, optional) – Desired logging level. Defaults to logging.INFO (20).

  • file (str | None, optional) – Path of the file to log to, if any. Defaults to None.

Returns:

The initialised logger.

Return type:

logging.RootLogger
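A logger initialisation of this shape can be sketched with the standard logging module. The handler choice (file handler when file is given, stream handler otherwise) and the message format are assumptions, and init_logger_sketch is a hypothetical name; a named logger is used here to avoid touching the root logger.

```python
import logging
from typing import Optional


def init_logger_sketch(
    logger: logging.Logger, level: int = logging.INFO, file: Optional[str] = None
) -> logging.Logger:
    """Set the level and attach one handler to `logger` (stand-in)."""
    logger.setLevel(level)
    # Assumption: log to `file` when given, otherwise to the console.
    handler = logging.FileHandler(file) if file else logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


log = init_logger_sketch(logging.getLogger("sketch"), level=logging.INFO)
log.info("logger ready")
```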

impresso_essentials.utils.partitioner(bag: Bag, path: str, nb_partitions: int) None

Partition a Dask bag into n partitions and write each to a separate file.

Parameters:
  • bag (dask.bag.core.Bag) – The Dask bag to be partitioned.

  • path (str) – Directory path where partitioned files will be saved.

  • nb_partitions (int) – Number of partitions to create.

Returns:

The function writes partitioned files to the specified path.

Return type:

None

impresso_essentials.utils.timestamp(ts_format: str = '%Y-%m-%dT%H:%M:%SZ', with_space: bool = False) str

Return an ISO-formatted timestamp.

Parameters:
  • ts_format (str, optional) – Timestamp format to use for the returned timestamp. Defaults to “%Y-%m-%dT%H:%M:%SZ”.

  • with_space (bool, optional) – Format the timestamp with spaces. If True, the format used will be “%Y-%m-%d %H:%M:%S”. Defaults to False.

Returns:

Timestamp formatted according to a provided format.

Return type:

str
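The interplay between ts_format and with_space can be sketched as below. Using the current UTC time is an assumption suggested by the trailing Z of the default format, and timestamp_sketch is a hypothetical name.

```python
from datetime import datetime, timezone


def timestamp_sketch(
    ts_format: str = "%Y-%m-%dT%H:%M:%SZ", with_space: bool = False
) -> str:
    """Format the current time (assumed UTC) as a timestamp string (stand-in)."""
    if with_space:
        # with_space overrides whatever format was passed in.
        ts_format = "%Y-%m-%d %H:%M:%S"
    return datetime.now(timezone.utc).strftime(ts_format)


print(timestamp_sketch())
print(timestamp_sketch(with_space=True))
```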

impresso_essentials.utils.user_confirmation(question: str, default: str | None = None) bool

Ask the user a yes/no question via input() and return the answer.

Parameters:
  • question (str) – String question presented to the user.

  • default (str | None, optional) – Presumed answer if the user just hits <Enter>. Should be one of “yes”, “no” or None. Defaults to None.

Raises:

ValueError – The default value provided is not valid.

Returns:

User’s answer to the asked question.

Return type:

bool
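The handling of the default answer can be sketched as follows. The accepted spellings, the prompt suffixes and the retry loop are assumptions; user_confirmation_sketch and its ask parameter are hypothetical, the latter added only so that the logic can be exercised without a terminal.

```python
from typing import Callable, Optional


def user_confirmation_sketch(
    question: str,
    default: Optional[str] = None,
    ask: Callable[[str], str] = input,  # hypothetical hook, not in the real signature
) -> bool:
    """Ask until a yes/no answer is given; an empty answer falls back to `default`."""
    valid = {"yes": True, "y": True, "no": False, "n": False}
    if default not in ("yes", "no", None):
        raise ValueError(f"Invalid default answer: {default!r}")
    prompt = {None: " [y/n] ", "yes": " [Y/n] ", "no": " [y/N] "}[default]
    while True:
        answer = ask(question + prompt).strip().lower()
        if answer == "" and default is not None:
            return valid[default]
        if answer in valid:
            return valid[answer]
        # Anything else: ask again.


# Scripted answer instead of keyboard input.
print(user_confirmation_sketch("Proceed?", default="no", ask=lambda _prompt: "y"))
```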

impresso_essentials.utils.user_question(variable_to_confirm: str) None

Ask the user if the identified variable is correct.

Parameters:

variable_to_confirm (str) – Variable to be checked by the user.

impresso_essentials.utils.validate_against_schema(json_to_validate: dict[str, Any], path_to_schema: str = 'schemas/json/versioning/manifest.schema.json') None

Validate a dict corresponding to a JSON against a provided JSON schema.

Parameters:
  • json_to_validate (dict[str, Any]) – JSON data to validate against a schema.

  • path_to_schema (str, optional) – Path to the JSON schema to validate against. Defaults to “schemas/json/versioning/manifest.schema.json”.

Raises:

e – The provided JSON could not be validated against the provided schema.

Text Processing Utils Functions

Reusable util functions to perform some text processing.

impresso_essentials.text_utils.is_stopword_or_all_stopwords(text: str, languages: list | None = None) bool

Check if all tokens in the text are stopwords in the given languages.

Parameters:
  • text (str) – The text to check.

  • languages (list | None, optional) – List of languages to consider for stopwords. If not defined, will be set to [“french”, “german”]. Defaults to None.

Returns:

True if the text is a stopword or all tokens are stopwords, False otherwise.

Return type:

bool

impresso_essentials.text_utils.normalize_text(text: str) str

Remove spaces and tabs for the search but keep newline characters.

Parameters:

text (str) – Text to normalize.

Returns:

Normalized text.

Return type:

str
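The described normalization can be sketched with a single regular expression. The sketch assumes that only space and tab characters are removed; normalize_text_sketch is a hypothetical name.

```python
import re


def normalize_text_sketch(text: str) -> str:
    """Remove spaces and tabs while leaving newline characters untouched."""
    return re.sub(r"[ \t]+", "", text)


print(repr(normalize_text_sketch("a b\tc\nd  e")))
```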

impresso_essentials.text_utils.search_text(article_text: str, search_text: str) list[tuple[int, int]]

Look for all occurrences of the search_text within the given article text.

Parameters:
  • article_text (str) – Article in which to find occurrences of search_text.

  • search_text (str) – Text to search within the article_text.

Returns:

Start and end indices of occurrences within the article.

Return type:

list[tuple[int, int]]
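The shape of the return value can be illustrated with a simplified exact-match search. This sketch does no whitespace normalization, reports non-overlapping matches only and uses exclusive end indices, all of which are assumptions; search_text_sketch is a hypothetical name.

```python
import re
from typing import List, Tuple


def search_text_sketch(article_text: str, search_text: str) -> List[Tuple[int, int]]:
    """Return (start, end) spans of literal, non-overlapping matches (stand-in)."""
    if not search_text:
        return []  # an empty pattern would otherwise match at every position
    pattern = re.escape(search_text)  # treat the search text literally
    return [match.span() for match in re.finditer(pattern, article_text)]


print(search_text_sketch("abcabc", "bc"))
```

With exclusive end indices, article_text[start:end] gives back the matched text for each span.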

impresso_essentials.text_utils.segment_and_trim_sentences(article: str, language: str, max_length: int) list[str]

Segment the given article into trimmed sentences based on a max_length.

Parameters:
  • article (str) – Full-text article to segment into sentences.

  • language (str) – Two-letter language code of article.

  • max_length (int) – Maximum length for each segmented sentence.

Returns:

List of resulting trimmed sentences.

Return type:

list[str]

impresso_essentials.text_utils.tokenise(text: str, language: str) list[str]

Apply whitespace rules to the given text and language, separating it into tokens.

Parameters:
  • text (str) – The input text to separate into a list of tokens.

  • language (str) – Language of the text.

Returns:

List of tokens with punctuation as separate tokens.

Return type:

list[str]
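The documented output, punctuation as separate tokens, can be illustrated with a regular expression. This sketch ignores the language argument entirely, whereas the real function applies language-specific whitespace rules that are not described here; tokenise_sketch is a hypothetical name.

```python
import re
from typing import List


def tokenise_sketch(text: str, language: str = "fr") -> List[str]:
    """Split into word tokens and single punctuation tokens (language is ignored)."""
    # \w+ grabs runs of word characters; [^\w\s] grabs one punctuation mark at a time.
    return re.findall(r"\w+|[^\w\s]", text)


print(tokenise_sketch("Hello, world!"))
```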