Utilities

Basic and General Utils Functions

class impresso_essentials.utils.DataStage(value)

Bases: StrEnum

Enumeration of all stages requiring a versioning manifest.

Each member corresponds to a data stage and the associated string is used to name each generated manifest accordingly.

CANONICAL = 'canonical'
CAN_CONSOLIDATED = 'canonical-consolidated'
CLASSIF_IMAGES = 'classif-images'
EMB_DOCS = 'emb-docs'
EMB_ENTITIES = 'emb-entities'
EMB_IMAGES = 'emb-images'
EMB_PARAGRAPHS = 'emb-paragraphs'
EMB_SENTS = 'emb-sents'
EMB_WORDS = 'emb-words'
ENTITIES = 'entities'
LANGIDENT = 'langident'
LANGIDENT_OCRQA = 'langid-ocrqa'
LINGPROC = 'lingproc'
MYSQL_CIS = 'mysql-ingestion'
NEWS_AGENCIES = 'newsagencies'
OCRQA = 'ocrqa'
PASSIM = 'passim'
REBUILT = 'rebuilt'
SOLR_TEXT = 'solr-text-ingestion'
TEXT_REUSE = 'textreuse'
TOPICS = 'topics'
classmethod has_value(value: str) bool

Check if enum contains given value

Parameters:
  • cls (Self) – This DataStage class

  • value (str) – Value to check

Returns:

True if the value provided is in this enum’s values, False otherwise.

Return type:

bool
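
As a rough illustration of how has_value behaves, the sketch below rebuilds a reduced stand-in enum with three of the members listed above. The name StageSketch and the (str, Enum) base are assumptions chosen so the snippet runs without impresso_essentials installed; the actual implementation may differ.

```python
from enum import Enum


class StageSketch(str, Enum):
    """Reduced, hypothetical stand-in for DataStage (three members only)."""

    CANONICAL = "canonical"
    REBUILT = "rebuilt"
    LANGIDENT = "langident"

    @classmethod
    def has_value(cls, value: str) -> bool:
        # Compare against the members' string values, not their keys.
        return value in {member.value for member in cls}


print(StageSketch.has_value("rebuilt"))  # True: 'rebuilt' is a member value
print(StageSketch.has_value("REBUILT"))  # False: 'REBUILT' is a key, not a value
```

Note that the check is made against values such as 'rebuilt', so passing a key such as 'REBUILT' is expected to return False in this sketch.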

class impresso_essentials.utils.IssueDir(provider, alias, date, edition, path)

Bases: tuple

Create new instance of IssueDir(provider, alias, date, edition, path)

alias

Alias for field number 1

date

Alias for field number 2

edition

Alias for field number 3

path

Alias for field number 4

provider

Alias for field number 0
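
The field numbering above follows from the tuple layout. The sketch below uses a hypothetical stand-in, IssueDirSketch, built with collections.namedtuple in the documented field order; the example values (provider name, path, and the use of datetime.date for the date field) are illustrative assumptions only.

```python
from collections import namedtuple
from datetime import date

# Hypothetical stand-in mirroring the documented field order.
IssueDirSketch = namedtuple(
    "IssueDirSketch", ["provider", "alias", "date", "edition", "path"]
)

issue = IssueDirSketch(
    provider="example-provider",  # illustrative value
    alias="GDL",
    date=date(1900, 1, 2),        # assumption: the date is a datetime.date
    edition="a",
    path="/path/to/GDL/1900/01/02/a",  # illustrative path
)

# Fields can be read by name or by position.
print(issue.alias, issue[1])     # GDL GDL
print(issue.provider, issue[0])  # example-provider example-provider
```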

class impresso_essentials.utils.SourceMedium(value)

Bases: StrEnum

Enumeration of all mediums of media sources in Impresso.

AO = 'audio'
PT = 'print'
TPS = 'typescript'
classmethod has_value(value: str) bool

Check if enum contains given value

Parameters:
  • cls (Self) – This source medium

  • value (str) – Value to check

Returns:

True if the value provided is in this enum’s values, False otherwise.

Return type:

bool

class impresso_essentials.utils.SourceType(value)

Bases: StrEnum

Enumeration of all types of media sources in Impresso.

EC = 'encyclopedia'
MG = 'monograph'
NP = 'newspaper'
RB = 'radio_broadcast'
RM = 'radio_magazine'
RS = 'radio_schedule'
classmethod has_value(value: str) bool

Check if enum contains given value

Parameters:
  • cls (Self) – This source type

  • value (str) – Value to check

Returns:

True if the value provided is in this enum’s values, False otherwise.

Return type:

bool

class impresso_essentials.utils.Timer

Bases: object

Basic timer

stop() str

Stop the timer.

Returns:

Elapsed time since the start tick in seconds.

Return type:

str

tick() str

Perform a tick with the timer.

Returns:

Elapsed time since last tick in seconds.

Return type:

str
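
The difference between tick() and stop() can be pictured with a small stand-in class. TimerSketch below is hypothetical: its attribute names and the two-decimal string format are assumptions, since the exact output format of Timer is not documented here.

```python
import time


class TimerSketch:
    """Hypothetical stand-in for Timer; the real output format may differ."""

    def __init__(self) -> None:
        self.start = time.perf_counter()
        self.last_tick = self.start

    def tick(self) -> str:
        # Elapsed seconds since the previous tick (or since creation).
        now = time.perf_counter()
        elapsed = now - self.last_tick
        self.last_tick = now
        return f"{elapsed:.2f}"

    def stop(self) -> str:
        # Elapsed seconds since the timer was created.
        return f"{time.perf_counter() - self.start:.2f}"


timer = TimerSketch()
step_time = timer.tick()
total_time = timer.stop()
print(f"step: {step_time}s, total: {total_time}s")
```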

impresso_essentials.utils.bytes_to(bytes_nb: int, to_unit: str, bsize: int = 1024) float

Convert bytes to the specified unit.

Supported target units: ‘k’ (kilobytes), ‘m’ (megabytes), ‘g’ (gigabytes), ‘t’ (terabytes), ‘p’ (petabytes), ‘e’ (exabytes).

Parameters:
  • bytes_nb (int) – The number of bytes to be converted.

  • to_unit (str) – The target unit for conversion.

  • bsize (int, optional) – The base size used for conversion (default is 1024).

Returns:

The converted value in the specified unit.

Return type:

float

Raises:

KeyError – If the specified target unit is not supported.
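
The conversion amounts to dividing by a power of bsize. The following is a minimal sketch under that assumption; bytes_to_sketch and its exponent table are illustrative and not taken from the library source.

```python
def bytes_to_sketch(bytes_nb: int, to_unit: str, bsize: int = 1024) -> float:
    """Illustrative re-implementation of the documented behaviour."""
    # Power of bsize associated with each supported unit letter.
    exponents = {"k": 1, "m": 2, "g": 3, "t": 4, "p": 5, "e": 6}
    # An unsupported unit raises KeyError from the dictionary lookup.
    return float(bytes_nb) / (bsize ** exponents[to_unit])


print(bytes_to_sketch(1048576, "m"))           # 1.0
print(bytes_to_sketch(1500, "k", bsize=1000))  # 1.5
```

Passing bsize=1000 gives decimal (SI) units instead of binary ones.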

impresso_essentials.utils.chunk(l_to_chunk: list, chunksize: int) Generator

Yield successive chunks of size chunksize from a list.

Parameters:
  • l_to_chunk (list) – List to chunk down.

  • chunksize (int) – Size of each chunk.

Yields:

Generator – Each chunk of the list.
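
A generator of this kind is commonly written with slicing. The sketch below (chunk_sketch is a hypothetical name) shows the expected shape of the output, including a shorter final chunk when the list length is not a multiple of chunksize.

```python
from typing import Generator


def chunk_sketch(l_to_chunk: list, chunksize: int) -> Generator[list, None, None]:
    """Illustrative chunking generator based on list slicing."""
    for start in range(0, len(l_to_chunk), chunksize):
        # The last slice may hold fewer than chunksize items.
        yield l_to_chunk[start:start + chunksize]


print(list(chunk_sketch([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]
```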

impresso_essentials.utils.disable_interrupts()

Context manager to temporarily disable keyboard interrupts.
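
One common way to build such a context manager is to swap the SIGINT handler for the duration of the block. The sketch below assumes that approach; disable_interrupts_sketch is a hypothetical name and the library may implement this differently. Signal handlers can only be changed from the main thread.

```python
import signal
from contextlib import contextmanager


@contextmanager
def disable_interrupts_sketch():
    """Ignore Ctrl+C inside the with-block, then restore the old handler."""
    # signal.signal returns the handler that was previously installed.
    previous_handler = signal.signal(signal.SIGINT, signal.SIG_IGN)
    try:
        yield
    finally:
        signal.signal(signal.SIGINT, previous_handler)


handler_before = signal.getsignal(signal.SIGINT)
with disable_interrupts_sketch():
    # A critical section (e.g. writing a file) would go here.
    handler_inside = signal.getsignal(signal.SIGINT)
handler_after = signal.getsignal(signal.SIGINT)
```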

impresso_essentials.utils.get_list_intersection(list1: list, list2: list) list

Compute the intersection between two lists.

Parameters:
  • list1 (list) – First list to intersect.

  • list2 (list) – Second list to intersect.

Returns:

List of intersection of both arguments.

Return type:

list
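
A minimal sketch of a list intersection, assuming a set-based approach (so duplicates are dropped and the order of the result is not guaranteed); get_list_intersection_sketch is a hypothetical name.

```python
def get_list_intersection_sketch(list1: list, list2: list) -> list:
    """Illustrative set-based intersection of two lists."""
    return list(set(list1).intersection(list2))


common = get_list_intersection_sketch(["GDL", "JDG", "NZZ"], ["NZZ", "GDL", "EXP"])
# Sorted here only to make the printed order deterministic.
print(sorted(common))  # ['GDL', 'NZZ']
```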

impresso_essentials.utils.get_pkg_resource(file_manager: ExitStack, path: str, package: str = 'impresso_essentials') PosixPath

Return the resource at path in package, using a context manager.

Note

The context manager file_manager needs to be instantiated prior to calling this function and should be closed once the package resource is no longer of use.

Parameters:
  • file_manager (contextlib.ExitStack) – Context manager.

  • path (str) – Path to the desired resource in given package.

  • package (str, optional) – Package name. Defaults to “impresso_essentials”.

Returns:

Path to desired managed resource.

Return type:

pathlib.PosixPath
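
The note about the context manager can be illustrated with the standard importlib.resources API. This is a sketch under the assumption that the function registers the resource with the ExitStack; get_pkg_resource_sketch is a hypothetical name, and the json package with decoder.py is an arbitrary standard-library stand-in for a real package resource.

```python
from contextlib import ExitStack
from importlib import resources
from pathlib import Path


def get_pkg_resource_sketch(file_manager: ExitStack, path: str, package: str) -> Path:
    """Illustrative lookup of a packaged file, kept alive by the ExitStack."""
    ref = resources.files(package).joinpath(path)
    # The resource stays available until file_manager is closed.
    return file_manager.enter_context(resources.as_file(ref))


file_manager = ExitStack()  # instantiated before the call
try:
    resource_path = get_pkg_resource_sketch(file_manager, "decoder.py", "json")
    resource_exists = resource_path.exists()
finally:
    file_manager.close()  # closed once the resource is no longer needed
```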

impresso_essentials.utils.get_provider_for_alias(media_alias: str) str

Get the provider for a given media alias.

Parameters:

media_alias (str) – The media alias to get the provider for.

Returns:

The provider for the given media alias.

Return type:

str

impresso_essentials.utils.get_src_info_for_alias(media_alias: str, provider: str | None = None, medium: bool | None = True) str

Get the source information for a given media alias.

Parameters:
  • media_alias (str) – The media alias to get the source information for.

  • provider (str | None, optional) – The provider for the media. If None, the provider will be determined from the media alias. Defaults to None.

Returns:

The source medium for the given media alias.

Return type:

str

impresso_essentials.utils.id_to_issuedir(canonical_id: str, issue_path: str, provider: str | None = None) IssueDir

Instantiate an IssueDir object from a canonical ID and the path to the issue.

Parameters:
  • canonical_id (str) – Canonical ID of the issue.

  • issue_path (str) – Local path to the issue files.

  • provider (str | None, optional) – Provider associated with that alias. Defaults to None, in which case it is deduced from the alias (slight overhead).

Returns:

IssueDir instance for the object

Return type:

IssueDir
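
The sketch below shows how a canonical issue ID might be split into IssueDir fields. It assumes the ID has the form alias-yyyy-mm-dd-edition, which is not stated in this reference; IssueDirTuple and id_to_issuedir_sketch are hypothetical names, and unlike the documented function the sketch does not deduce a missing provider.

```python
from collections import namedtuple
from datetime import date

# Hypothetical stand-in for IssueDir, in the documented field order.
IssueDirTuple = namedtuple(
    "IssueDirTuple", ["provider", "alias", "date", "edition", "path"]
)


def id_to_issuedir_sketch(canonical_id, issue_path, provider=None):
    """Illustrative parsing, assuming '<alias>-<yyyy>-<mm>-<dd>-<edition>'."""
    # Split from the right so only the last four hyphens are used.
    alias, year, month, day, edition = canonical_id.rsplit("-", 4)
    issue_date = date(int(year), int(month), int(day))
    return IssueDirTuple(provider, alias, issue_date, edition, issue_path)


parsed = id_to_issuedir_sketch(
    "GDL-1900-01-02-a", "/path/to/GDL/1900/01/02/a", provider="example-provider"
)
print(parsed.alias, parsed.date.isoformat(), parsed.edition)  # GDL 1900-01-02 a
```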

impresso_essentials.utils.init_logger(_logger: RootLogger, level: int = 20, file: str | None = None) RootLogger

Initialises the root logger.

Parameters:
  • _logger (logging.RootLogger) – Logger instance to initialise.

  • level (int, optional) – Desired logging level. Defaults to logging.INFO.

  • file (str | None, optional) – Path to a file the log output is written to, if any. Defaults to None.

Returns:

The initialised logger.

Return type:

logging.RootLogger
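
A logger initialisation of this kind typically sets a level and attaches one handler. The sketch below assumes a file handler when file is given and a stream handler otherwise; init_logger_sketch and the format string are assumptions, and a named logger is used instead of the root logger to keep the example side-effect free.

```python
import logging


def init_logger_sketch(_logger, level=logging.INFO, file=None):
    """Illustrative logger setup; handler choice and format are assumptions."""
    _logger.setLevel(level)
    formatter = logging.Formatter("%(asctime)s %(name)-12s %(levelname)-8s %(message)s")
    # Write to a file if a path is given, otherwise to the console (stderr).
    handler = logging.FileHandler(file) if file is not None else logging.StreamHandler()
    handler.setFormatter(formatter)
    _logger.addHandler(handler)
    return _logger


sketch_logger = init_logger_sketch(logging.getLogger("impresso_sketch"), level=logging.DEBUG)
sketch_logger.info("Logger initialised")
```

The default level of 20 in the signature above is the numeric value of logging.INFO.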

impresso_essentials.utils.partitioner(bag: Bag, path: str, nb_partitions: int) None

Partition a Dask bag into n partitions and write each to a separate file.

Parameters:
  • bag (dask.bag.core.Bag) – The Dask bag to be partitioned.

  • path (str) – Directory path where partitioned files will be saved.

  • nb_partitions (int) – Number of partitions to create.

Returns:

The function writes partitioned files to the specified path.

Return type:

None

impresso_essentials.utils.timestamp(ts_format: str = '%Y-%m-%dT%H:%M:%SZ', with_space: bool = False) str

Return an ISO-formatted timestamp.

Parameters:
  • ts_format (str, optional) – Timestamp format to use for the returned timestamp. Defaults to “%Y-%m-%dT%H:%M:%SZ”.

  • with_space (bool, optional) – Format the timestamp with spaces. If True, the format used will be “%Y-%m-%d %H:%M:%S”. Defaults to False.

Returns:

Timestamp formatted according to a provided format.

Return type:

str
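
The two output shapes can be reproduced with datetime.strftime. The sketch below assumes the timestamp is taken in UTC (suggested by the trailing Z of the default format, but not confirmed here); timestamp_sketch is a hypothetical name.

```python
from datetime import datetime, timezone


def timestamp_sketch(ts_format: str = "%Y-%m-%dT%H:%M:%SZ", with_space: bool = False) -> str:
    """Illustrative timestamp helper; UTC is an assumption."""
    if with_space:
        # with_space overrides the provided format.
        ts_format = "%Y-%m-%d %H:%M:%S"
    return datetime.now(timezone.utc).strftime(ts_format)


iso_ts = timestamp_sketch()                   # e.g. 2024-01-31T12:00:00Z
spaced_ts = timestamp_sketch(with_space=True)  # e.g. 2024-01-31 12:00:00
print(iso_ts, "|", spaced_ts)
```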

impresso_essentials.utils.user_confirmation(question: str, default: str | None = None) bool

Ask a yes/no question via input() and return the user’s answer.

Parameters:
  • question (str) – String question presented to the user.

  • default (str | None, optional) – Presumed answer if the user just hits <Enter>. Should be one of “yes”, “no” and None. Defaults to None.

Raises:

ValueError – The default value provided is not valid.

Returns:

User’s answer to the asked question.

Return type:

bool
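
The documented behaviour (default answer on <Enter>, ValueError for an invalid default) can be sketched as follows. user_confirmation_sketch is a hypothetical name, and the ask parameter is an addition made only so the example can run without a terminal; the accepted short answers and prompt suffixes are assumptions.

```python
def user_confirmation_sketch(question, default=None, ask=input):
    """Illustrative yes/no prompt; 'ask' is a hypothetical injection point."""
    valid = {"yes": True, "y": True, "no": False, "n": False}
    prompts = {None: " [y/n] ", "yes": " [Y/n] ", "no": " [y/N] "}
    if default not in prompts:
        raise ValueError(f"Invalid default answer: {default!r}")
    while True:
        choice = ask(question + prompts[default]).strip().lower()
        if choice == "" and default is not None:
            # The user just hit <Enter>: fall back to the default.
            return valid[default]
        if choice in valid:
            return valid[choice]
        print("Please respond with 'yes' or 'no' (or 'y' or 'n').")


# Simulated answers instead of real keyboard input.
accepted_default = user_confirmation_sketch("Proceed?", default="yes", ask=lambda _: "")
explicit_no = user_confirmation_sketch("Proceed?", ask=lambda _: "n")
print(accepted_default, explicit_no)  # True False
```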

impresso_essentials.utils.user_question(variable_to_confirm: str) None

Ask the user if the identified variable is correct.

Parameters:

variable_to_confirm (str) – Variable to be checked by the user.

impresso_essentials.utils.validate_against_schema(json_to_validate: dict[str, Any], path_to_schema: str = 'schemas/json/versioning/manifest.schema.json') None

Validate a dict corresponding to a JSON against a provided JSON schema.

Parameters:
  • json_to_validate (dict[str, Any]) – JSON data to validate against a schema.

  • path_to_schema (str, optional) – Path to the JSON schema to validate against. Defaults to “schemas/json/versioning/manifest.schema.json”.

Raises:

e – The provided JSON could not be validated against the provided schema.

impresso_essentials.utils.validate_granularity(value: str) str | None

Validate that the granularity value provided is valid.

Statistics are computed on three granularity levels: corpus, title and year.

Parameters:

value (str) – Granularity value to validate

Raises:

ValueError – The provided granularity isn’t one of corpus, title and year.

Returns:

The provided value, in lower case, or None if not valid.

Return type:

Optional[str]
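
A minimal sketch of this validation, assuming the value is lower-cased before being checked against the three documented levels; validate_granularity_sketch is a hypothetical name.

```python
def validate_granularity_sketch(value: str) -> str:
    """Illustrative check against the three documented granularity levels."""
    lowered = value.lower()
    if lowered not in {"corpus", "title", "year"}:
        raise ValueError(f"Granularity must be one of corpus, title, year; got {value!r}.")
    return lowered


print(validate_granularity_sketch("Title"))  # title
```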

impresso_essentials.utils.validate_source(source: str, return_value_str: bool = False, medium: bool = True) SourceType | SourceMedium | str | None

Validate that the provided source type or medium is in the SourceType or SourceMedium Enum (key or value).

Parameters:
  • source (str) – Source type or medium key or value to validate.

  • return_value_str (bool, optional) – Whether to return the source type or medium’s value if it was valid. Defaults to False.

  • medium (bool, optional) – Whether to validate a source medium (True) or a source type (False). Defaults to True.

Raises:

e – The provided str is neither a source type key nor value.

Returns:

The SourceType, SourceMedium or value string if valid.

Return type:

SourceType | str | None | SourceMedium

impresso_essentials.utils.validate_stage(data_stage: str, return_value_str: bool = False) DataStage | str | None

Validate that the provided data stage is in the DataStage Enum (key or value).

Parameters:
  • data_stage (str) – Data stage key or value to validate.

  • return_value_str (bool, optional) – Whether to return the data stage’s value if it was valid. Defaults to False.

Raises:

e – The provided str is neither a data stage key nor value.

Returns:

The corresponding DataStage or value string if valid.

Return type:

DataStage | str | None
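
The "key or value" lookup can be pictured with a reduced stand-in enum. In the sketch below, StageLookupSketch and validate_stage_sketch are hypothetical names, and raising ValueError for an unknown stage is an assumption, since the exception type is not specified above.

```python
from enum import Enum


class StageLookupSketch(str, Enum):
    """Reduced, hypothetical stand-in for DataStage (two members only)."""

    CANONICAL = "canonical"
    MYSQL_CIS = "mysql-ingestion"


def validate_stage_sketch(data_stage: str, return_value_str: bool = False):
    """Accept either a key ('MYSQL_CIS') or a value ('mysql-ingestion')."""
    if data_stage in StageLookupSketch.__members__:
        stage = StageLookupSketch[data_stage]  # lookup by key
    else:
        stage = StageLookupSketch(data_stage)  # lookup by value, ValueError if unknown
    return stage.value if return_value_str else stage


print(validate_stage_sketch("MYSQL_CIS", return_value_str=True))  # mysql-ingestion
print(validate_stage_sketch("canonical") is StageLookupSketch.CANONICAL)  # True
```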

Text Processing Utils Functions

Reusable util functions to perform some text processing.

impresso_essentials.text_utils.insert_whitespace(token: str, next_t: str | None, prev_t: str | None, lang: str | None) bool

Determine whether a whitespace should be inserted after a token.

Parameters:
  • token (str) – Current token.

  • next_t (str) – Following token.

  • prev_t (str) – Previous token.

  • lang (str) – Language of text.

Returns:

Whether a whitespace should be inserted after the token.

Return type:

bool

impresso_essentials.text_utils.is_stopword_or_all_stopwords(text: str, languages: list | None = None) bool

Check if all tokens in the text are stopwords in the given languages.

Parameters:
  • text (str) – The text to check.

  • languages (list | None, optional) – List of languages to consider for stopwords. If not defined, will be set to [“french”, “german”]. Defaults to None.

Returns:

True if the text is a stopword or all tokens are stopwords, False otherwise.

Return type:

bool

impresso_essentials.text_utils.normalize_text(text: str) str

Remove spaces and tabs for the search but keep newline characters.

Parameters:

text (str) – Text to normalize.

Returns:

Normalized text.

Return type:

str
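
A normalisation of this kind can be sketched with a single regular expression that removes spaces and tabs but leaves newlines untouched; normalize_text_sketch is a hypothetical name and the exact pattern used by the library is an assumption.

```python
import re


def normalize_text_sketch(text: str) -> str:
    """Illustrative normalisation: drop spaces and tabs, keep newlines."""
    return re.sub(r"[ \t]+", "", text)


normalized = normalize_text_sketch("Le  Temps\tde\nGenève")
print(repr(normalized))  # 'LeTempsde\nGenève'
```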

impresso_essentials.text_utils.search_text(article_text: str, text_to_search: str) list[tuple[int, int]]

Look for all occurrences of the text_to_search within the given article text.

Parameters:
  • article_text (str) – Article in which to find occurrences of text_to_search.

  • text_to_search (str) – Text to search within the article_text.

Returns:

Start and end indices of occurrences within the article.

Return type:

list[tuple[int, int]]
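
The shape of the return value can be illustrated with a plain substring search. This sketch makes several assumptions: matching is exact (no whitespace normalisation), matches do not overlap, and the end index is exclusive; search_text_sketch is a hypothetical name.

```python
def search_text_sketch(article_text: str, text_to_search: str) -> list[tuple[int, int]]:
    """Illustrative exact-match search returning (start, end) pairs."""
    occurrences = []
    if not text_to_search:
        return occurrences  # nothing to look for
    start = article_text.find(text_to_search)
    while start != -1:
        end = start + len(text_to_search)  # exclusive end index (assumption)
        occurrences.append((start, end))
        start = article_text.find(text_to_search, end)
    return occurrences


matches = search_text_sketch("la vie est la vie", "la vie")
print(matches)  # [(0, 6), (11, 17)]
```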

impresso_essentials.text_utils.segment_and_trim_sentences(article: str, language: str, max_length: int) list[str]

Segment the given article into trimmed sentences based on a max_length.

Parameters:
  • article (str) – Full-text article to segment into sentences.

  • language (str) – Two-letter language code of article.

  • max_length (int) – Maximum length for each segmented sentence.

Returns:

List of resulting trimmed sentences.

Return type:

list[str]

impresso_essentials.text_utils.tokenise(text: str, language: str) list[str]

Apply whitespace rules to the given text and language, separating it into tokens.

Parameters:
  • text (str) – The input text to separate into a list of tokens.

  • language (str) – Language of the text.

Returns:

List of tokens with punctuation as separate tokens.

Return type:

list[str]