Utilities
Basic and General Utils Functions
- class impresso_essentials.utils.DataStage(value)
Bases:
StrEnum
Enum all stages requiring a versioning manifest.
Each member corresponds to a data stage and the associated string is used to name each generated manifest accordingly.
- CANONICAL = 'canonical'
- CAN_CONSOLIDATED = 'canonical-consolidated'
- CLASSIF_IMAGES = 'classif-images'
- EMB_DOCS = 'emb-docs'
- EMB_ENTITIES = 'emb-entities'
- EMB_IMAGES = 'emb-images'
- EMB_PARAGRAPHS = 'emb-paragraphs'
- EMB_SENTS = 'emb-sents'
- EMB_WORDS = 'emb-words'
- ENTITIES = 'entities'
- LANGIDENT = 'langident'
- LANGIDENT_OCRQA = 'langid-ocrqa'
- LINGPROC = 'lingproc'
- MYSQL_CIS = 'mysql-ingestion'
- NEWS_AGENCIES = 'newsagencies'
- OCRQA = 'ocrqa'
- PASSIM = 'passim'
- REBUILT = 'rebuilt'
- SOLR_TEXT = 'solr-text-ingestion'
- TEXT_REUSE = 'textreuse'
- TOPICS = 'topics'
- classmethod has_value(value: str) bool
Check if enum contains given value
- Parameters:
cls (Self) – This DataStage class
value (str) – Value to check
- Returns:
True if the value provided is in this enum’s values, False otherwise.
- Return type:
bool
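As an illustration of the membership check described above, here is a minimal sketch built on a stand-in enum rather than the library class. `StageSketch` and its two members are hypothetical; only the behaviour of `has_value` follows the description, and the actual implementation may differ.

```python
from enum import Enum


class StageSketch(str, Enum):
    """Hypothetical stand-in enum; not the library's DataStage."""

    CANONICAL = "canonical"
    REBUILT = "rebuilt"

    @classmethod
    def has_value(cls, value: str) -> bool:
        # True when `value` equals the string value of any member.
        return value in {member.value for member in cls}


print(StageSketch.has_value("canonical"))  # checks against member values
print(StageSketch.has_value("CANONICAL"))  # member names are not values
```

Note that the check is against values such as 'canonical', not against member names such as CANONICAL.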
- class impresso_essentials.utils.IssueDir(provider, alias, date, edition, path)
Bases:
tuple
Create new instance of IssueDir(provider, alias, date, edition, path)
- alias
Alias for field number 1
- date
Alias for field number 2
- edition
Alias for field number 3
- path
Alias for field number 4
- provider
Alias for field number 0
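The field numbering above is that of a named tuple. The sketch below re-declares a stand-in with the same field order to show how positional and named access line up. `IssueDirSketch`, the sample values, and the use of `datetime.date` for the date field are all assumptions made for illustration.

```python
from collections import namedtuple
from datetime import date

# Hypothetical stand-in with the documented field order; not the library class.
IssueDirSketch = namedtuple(
    "IssueDirSketch", ["provider", "alias", "date", "edition", "path"]
)

issue = IssueDirSketch(
    provider="some-provider",        # field number 0
    alias="ABC",                     # field number 1
    date=date(1900, 1, 2),           # field number 2 (type is an assumption)
    edition="a",                     # field number 3
    path="/data/ABC/1900/01/02/a",   # field number 4 (made-up path)
)

# Named and positional access refer to the same fields.
print(issue.provider == issue[0], issue.path == issue[4])
```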
- class impresso_essentials.utils.SourceMedium(value)
Bases:
StrEnum
Enum all mediums of media sources in Impresso.
- AO = 'audio'
- PT = 'print'
- TPS = 'typescript'
- classmethod has_value(value: str) bool
Check if enum contains given value
- Parameters:
cls (Self) – This source medium
value (str) – Value to check
- Returns:
True if the value provided is in this enum’s values, False otherwise.
- Return type:
bool
- class impresso_essentials.utils.SourceType(value)
Bases:
StrEnum
Enum all types of media sources in Impresso.
- EC = 'encyclopedia'
- MG = 'monograph'
- NP = 'newspaper'
- RB = 'radio_broadcast'
- RM = 'radio_magazine'
- RS = 'radio_schedule'
- classmethod has_value(value: str) bool
Check if enum contains given value
- Parameters:
cls (Self) – This source type
value (str) – Value to check
- Returns:
True if the value provided is in this enum’s values, False otherwise.
- Return type:
bool
- class impresso_essentials.utils.Timer
Bases:
object
Basic timer
- stop() str
Stop the timer.
- Returns:
Elapsed time since the start tick in seconds.
- Return type:
str
- tick() str
Perform a tick with the timer.
- Returns:
Elapsed time since last tick in seconds.
- Return type:
str
- impresso_essentials.utils.bytes_to(bytes_nb: int, to_unit: str, bsize: int = 1024) float
Convert bytes to the specified unit.
Supported target units: ‘k’ (kilobytes), ‘m’ (megabytes), ‘g’ (gigabytes), ‘t’ (terabytes), ‘p’ (petabytes), ‘e’ (exabytes).
- Parameters:
bytes_nb (int) – The number of bytes to be converted.
to_unit (str) – The target unit for conversion.
bsize (int, optional) – The base size used for conversion (default is 1024).
- Returns:
The converted value in the specified unit.
- Return type:
float
- Raises:
KeyError – If the specified target unit is not supported.
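The conversion described above amounts to dividing by a power of the base size. The following is a minimal sketch under that assumption; `bytes_to_sketch` and its exponent table are illustrative, not the library's code.

```python
def bytes_to_sketch(bytes_nb: int, to_unit: str, bsize: int = 1024) -> float:
    """Convert a byte count to the given unit (illustrative re-implementation)."""
    # Each unit is one more power of the base size than the previous one.
    exponents = {"k": 1, "m": 2, "g": 3, "t": 4, "p": 5, "e": 6}
    # An unsupported unit raises KeyError from the dictionary lookup.
    return float(bytes_nb) / (bsize ** exponents[to_unit])


print(bytes_to_sketch(1024, "k"))           # 1.0
print(bytes_to_sketch(3 * 1024 ** 2, "m"))  # 3.0
```

Passing bsize=1000 would give decimal units instead of binary ones.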
- impresso_essentials.utils.chunk(l_to_chunk: list, chunksize: int) Generator
Yield successive n-sized chunks from list.
- Parameters:
l_to_chunk (list) – List to chunk down.
chunksize (int) – Size of each chunk.
- Yields:
Generator – Each chunk of the list.
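A common way to write such a chunking generator is to slice at fixed offsets. The sketch below is an illustrative equivalent (`chunk_sketch` is a hypothetical name), showing that the last chunk may be shorter than chunksize.

```python
from typing import Generator


def chunk_sketch(l_to_chunk: list, chunksize: int) -> Generator[list, None, None]:
    """Yield successive slices of at most `chunksize` items."""
    for start in range(0, len(l_to_chunk), chunksize):
        yield l_to_chunk[start : start + chunksize]


print(list(chunk_sketch([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]
```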
- impresso_essentials.utils.disable_interrupts()
Context manager to temporarily disable keyboard interrupts.
- impresso_essentials.utils.get_list_intersection(list1: list, list2: list) list
Compute the intersection between two lists.
- Parameters:
list1 (list) – First list to intersect.
list2 (list) – Second list to intersect.
- Returns:
List of intersection of both arguments.
- Return type:
list
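One plausible way to compute such an intersection is through sets. This is a sketch under that assumption (`list_intersection_sketch` is a hypothetical name); with a set-based approach, duplicates are dropped and the order of the result is not guaranteed, which may or may not match the library's behaviour.

```python
def list_intersection_sketch(list1: list, list2: list) -> list:
    """Return the elements present in both lists (set-based, order not preserved)."""
    return list(set(list1) & set(list2))


common = list_intersection_sketch([1, 2, 2, 3], [2, 3, 4])
print(sorted(common))  # [2, 3]
```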
- impresso_essentials.utils.get_pkg_resource(file_manager: ExitStack, path: str, package: str = 'impresso_essentials') PosixPath
Return the resource at path in package, using a context manager.
Note
The context manager file_manager needs to be instantiated prior to calling this function and should be closed once the package resource is no longer of use.
- Parameters:
file_manager (contextlib.ExitStack) – Context manager.
path (str) – Path to the desired resource in given package.
package (str, optional) – Package name. Defaults to “impresso_essentials”.
- Returns:
Path to desired managed resource.
- Return type:
pathlib.PosixPath
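The note above describes a caller-owned context manager. The sketch below shows that pattern with the standard library only, assuming the resource is resolved through `importlib.resources` (an assumption about the implementation). `get_pkg_resource_sketch` is a hypothetical stand-in, and the standard-library `json` package is used purely as a demonstration target.

```python
from contextlib import ExitStack
from importlib import resources
from pathlib import Path


def get_pkg_resource_sketch(file_manager: ExitStack, path: str, package: str) -> Path:
    """Return a filesystem path to `path` inside `package` (illustrative)."""
    ref = resources.files(package) / path
    # The ExitStack owns the resource; it is released when the stack is closed.
    return file_manager.enter_context(resources.as_file(ref))


# The caller creates the context manager first and closes it when done.
with ExitStack() as file_manager:
    resource_path = get_pkg_resource_sketch(file_manager, "decoder.py", package="json")
    print(resource_path.name)  # decoder.py
```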
- impresso_essentials.utils.get_provider_for_alias(media_alias: str) str
Get the provider for a given media alias.
- Parameters:
media_alias (str) – The media alias to get the provider for.
- Returns:
The provider for the given media alias.
- Return type:
str
- impresso_essentials.utils.get_src_info_for_alias(media_alias: str, provider: str | None = None, medium: bool | None = True) str
Get the source information (the source medium by default) for a given media alias.
- Parameters:
media_alias (str) – The media alias to get the source information for.
provider (str | None, optional) – The provider for the media. If None, the provider will be determined from the media alias. Defaults to None.
- Returns:
The source medium for the given media alias.
- Return type:
str
- impresso_essentials.utils.id_to_issuedir(canonical_id: str, issue_path: str, provider: str | None = None) IssueDir
Instantiate an IssueDir object from a canonical ID and the path to the issue.
- Parameters:
canonical_id (str) – Canonical ID of the issue.
issue_path (str) – Local path to the issue files.
provider (str | None, optional) – Provider associated with that alias. Defaults to None, in which case it is deduced from the alias (slight overhead).
- Returns:
IssueDir instance for the object
- Return type:
IssueDir
- impresso_essentials.utils.init_logger(_logger: RootLogger, level: int = 20, file: str | None = None) RootLogger
Initialises the root logger.
- Parameters:
_logger (logging.RootLogger) – Logger instance to initialise.
level (int, optional) – desired level of logging. Defaults to logging.INFO.
file (str | None, optional) – _description_. Defaults to None.
- Returns:
the initialised logger
- Return type:
logging.RootLogger
- impresso_essentials.utils.partitioner(bag: Bag, path: str, nb_partitions: int) None
Partition a Dask bag into n partitions and write each to a separate file.
- Parameters:
bag (dask.bag.core.Bag) – The Dask bag to be partitioned.
path (str) – Directory path where partitioned files will be saved.
nb_partitions (int) – Number of partitions to create.
- Returns:
The function writes partitioned files to the specified path.
- Return type:
None
- impresso_essentials.utils.timestamp(ts_format: str = '%Y-%m-%dT%H:%M:%SZ', with_space: bool = False) str
Return an ISO-formatted timestamp.
- Parameters:
ts_format (str, optional) – Timestamp format to use for the returned timestamp. Defaults to “%Y-%m-%dT%H:%M:%SZ”.
with_space (bool, optional) – Format the timestamp with spaces. If True, the format used will be “%Y-%m-%d %H:%M:%S”. Defaults to False.
- Returns:
Timestamp formatted according to a provided format.
- Return type:
str
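The interaction between ts_format and with_space can be sketched as follows. `timestamp_sketch` is a hypothetical re-implementation, and using the current UTC time is an assumption suggested by the trailing “Z” in the default format.

```python
from datetime import datetime, timezone


def timestamp_sketch(
    ts_format: str = "%Y-%m-%dT%H:%M:%SZ", with_space: bool = False
) -> str:
    """Return the current time as a formatted string (illustrative)."""
    if with_space:
        # As described above, with_space switches to the space-separated format.
        ts_format = "%Y-%m-%d %H:%M:%S"
    # Assumption: the timestamp is taken in UTC.
    return datetime.now(timezone.utc).strftime(ts_format)


print(timestamp_sketch())                 # e.g. 2024-01-31T12:00:00Z (value varies)
print(timestamp_sketch(with_space=True))  # e.g. 2024-01-31 12:00:00 (value varies)
```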
- impresso_essentials.utils.user_confirmation(question: str, default: str | None = None) bool
Ask a yes/no question via input() and return the user’s answer.
- Parameters:
question (str) – String question presented to the user.
default (str | None, optional) – Presumed answer if the user just hits <Enter>. Should be one of “yes”, “no” and None. Defaults to None.
- Raises:
ValueError – The default value provided is not valid.
- Returns:
User’s answer to the asked question.
- Return type:
bool
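The prompt loop described above can be sketched as below. `confirm_sketch` is a hypothetical stand-in; the accepted short answers (“y”, “n”) and the injectable `ask` parameter, added here so the function can be exercised without a terminal, are assumptions and not part of the documented signature.

```python
from typing import Callable, Optional


def confirm_sketch(
    question: str,
    default: Optional[str] = None,
    ask: Callable[[str], str] = input,  # hypothetical hook for non-interactive use
) -> bool:
    """Ask a yes/no question until a valid answer is given (illustrative)."""
    answers = {"yes": True, "y": True, "no": False, "n": False}
    if default not in ("yes", "no", None):
        raise ValueError(f"Invalid default answer: {default!r}")
    while True:
        choice = ask(f"{question} [yes/no] ").strip().lower()
        if choice == "" and default is not None:
            # Empty input falls back to the presumed answer.
            return answers[default]
        if choice in answers:
            return answers[choice]


# Non-interactive usage with canned answers.
print(confirm_sketch("Proceed?", ask=lambda _prompt: "y"))               # True
print(confirm_sketch("Proceed?", default="no", ask=lambda _prompt: ""))  # False
```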
- impresso_essentials.utils.user_question(variable_to_confirm: str) None
Ask the user if the identified variable is correct.
- Parameters:
variable_to_confirm (str) – Variable to be checked by the user.
- impresso_essentials.utils.validate_against_schema(json_to_validate: dict[str, Any], path_to_schema: str = 'schemas/json/versioning/manifest.schema.json') None
Validate a dict corresponding to a JSON against a provided JSON schema.
- Parameters:
json_to_validate (dict[str, Any]) – JSON data to validate against a schema.
path_to_schema (str, optional) – Path to the JSON schema to validate against. Defaults to “schemas/json/versioning/manifest.schema.json”.
- Raises:
e – The provided JSON could not be validated against the provided schema.
- impresso_essentials.utils.validate_granularity(value: str) str | None
Validate that the granularity value provided is valid.
Statistics are computed on three granularity levels: corpus, title and year. TODO: add provider?
- Parameters:
value (str) – Granularity value to validate
- Raises:
ValueError – The provided granularity isn’t one of corpus, title and year.
- Returns:
The provided value, in lower case, or None if not valid.
- Return type:
Optional[str]
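A minimal sketch of the validation described above, assuming the three granularity levels named in the description and a case-insensitive comparison. `validate_granularity_sketch` is a hypothetical name; this version always raises on invalid input rather than returning None.

```python
def validate_granularity_sketch(value: str) -> str:
    """Return the lower-cased granularity, or raise ValueError (illustrative)."""
    lowered = value.lower()
    if lowered not in {"corpus", "title", "year"}:
        raise ValueError(
            f"Granularity {value!r} is not one of 'corpus', 'title', 'year'."
        )
    return lowered


print(validate_granularity_sketch("Year"))  # year
```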
- impresso_essentials.utils.validate_source(source: str, return_value_str: bool = False, medium: bool = True) SourceType | SourceMedium | str | None
Validate the provided source type if it’s in the SourceType Enum (key or value).
- Parameters:
source (str) – Source type or medium key or value to validate.
return_value_str (bool, optional) – Whether to return the source type or medium’s value if it was valid. Defaults to False.
medium (bool, optional) – Whether to validate a source medium (True) or a source type (False). Defaults to True.
- Raises:
e – The provided str is neither a source type key nor value.
- Returns:
The SourceType or value string if valid.
- Return type:
SourceType | str | None | SourceMedium
- impresso_essentials.utils.validate_stage(data_stage: str, return_value_str: bool = False) DataStage | str | None
Validate the provided data stage if it’s in the DataStage Enum (key or value).
- Parameters:
data_stage (str) – Data stage key or value to validate.
return_value_str (bool, optional) – Whether to return the data stage’s value if it was valid. Defaults to False.
- Raises:
e – The provided str is neither a data stage key nor value.
- Returns:
The corresponding DataStage or value string if valid.
- Return type:
DataStage | str | None
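The "key or value" lookup described above can be sketched with a stand-in enum. `StageDemo` and `validate_stage_sketch` are hypothetical names; the exception type raised by the library is not stated in the documentation, so the ValueError raised here is an assumption.

```python
from enum import Enum
from typing import Union


class StageDemo(str, Enum):
    """Hypothetical stand-in enum; not the library's DataStage."""

    CANONICAL = "canonical"
    SOLR_TEXT = "solr-text-ingestion"


def validate_stage_sketch(
    data_stage: str, return_value_str: bool = False
) -> Union[StageDemo, str]:
    """Accept either a member name (key) or a member value (illustrative)."""
    try:
        stage = StageDemo[data_stage]  # lookup by key, e.g. "SOLR_TEXT"
    except KeyError:
        stage = StageDemo(data_stage)  # lookup by value; ValueError if unknown
    return stage.value if return_value_str else stage


print(validate_stage_sketch("SOLR_TEXT") is StageDemo.SOLR_TEXT)     # True
print(validate_stage_sketch("canonical", return_value_str=True))     # canonical
```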
Text Processing Utils Functions
Reusable util functions to perform some text processing.
- impresso_essentials.text_utils.insert_whitespace(token: str, next_t: str | None, prev_t: str | None, lang: str | None) bool
Determine whether a whitespace should be inserted after a token.
- Parameters:
token (str) – Current token.
next_t (str | None) – Following token.
prev_t (str | None) – Previous token.
lang (str | None) – Language of text.
- Returns:
Whether a whitespace should be inserted after the token.
- Return type:
bool
- impresso_essentials.text_utils.is_stopword_or_all_stopwords(text: str, languages: list | None = None) bool
Check if all tokens in the text are stopwords in the given languages.
- Parameters:
text (str) – The text to check.
languages (list | None, optional) – List of languages to consider for stopwords. If not defined, will be set to [“french”, “german”]. Defaults to None.
- Returns:
True if the text is a stopword or all tokens are stopwords, False otherwise.
- Return type:
bool
- impresso_essentials.text_utils.normalize_text(text: str) str
Remove spaces and tabs for the search but keep newline characters.
- Parameters:
text (str) – Text to normalize.
- Returns:
Normalized text.
- Return type:
str
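The normalisation described above (dropping spaces and tabs while keeping newlines) can be sketched with a single regular expression. `normalize_text_sketch` is a hypothetical name, and the library may handle further whitespace characters differently.

```python
import re


def normalize_text_sketch(text: str) -> str:
    """Remove spaces and tabs but keep newline characters (illustrative)."""
    return re.sub(r"[ \t]+", "", text)


print(repr(normalize_text_sketch("a b\tc\nd e")))  # 'abc\nde'
```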
- impresso_essentials.text_utils.search_text(article_text: str, text_to_search: str) list[tuple[int, int]]
Look for all occurrences of the text_to_search within the given article text.
- Parameters:
article_text (str) – Article in which to find occurrences of text_to_search.
text_to_search (str) – Text to search within the article_text.
- Returns:
Start and end indices of occurrences within the article.
- Return type:
list[tuple[int, int]]
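To illustrate the shape of the return value, here is a simplified sketch that finds exact, non-overlapping matches and reports each as a (start, end) pair with an exclusive end index. `search_text_sketch` is a hypothetical name; the exclusive end and the exact-match strategy are assumptions, and the library function may normalise whitespace before matching.

```python
def search_text_sketch(article_text: str, text_to_search: str) -> list[tuple[int, int]]:
    """Return (start, end) index pairs of exact matches (illustrative)."""
    spans: list[tuple[int, int]] = []
    if not text_to_search:
        return spans
    start = article_text.find(text_to_search)
    while start != -1:
        end = start + len(text_to_search)
        spans.append((start, end))
        # Continue after the current match so matches do not overlap.
        start = article_text.find(text_to_search, end)
    return spans


print(search_text_sketch("the cat and the dog", "the"))  # [(0, 3), (12, 15)]
```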
- impresso_essentials.text_utils.segment_and_trim_sentences(article: str, language: str, max_length: int) list[str]
Segment the given article into trimmed sentences based on a max_length.
- Parameters:
article (str) – Full-text article to segment into sentences.
language (str) – Two-letter language code of article.
max_length (int) – Maximum length for each segmented sentence.
- Returns:
List of resulting trimmed sentences.
- Return type:
list[str]
- impresso_essentials.text_utils.tokenise(text: str, language: str) list[str]
Apply whitespace rules to the given text and language, separating it into tokens.
- Parameters:
text (str) – The input text to separate into a list of tokens.
language (str) – Language of the text.
- Returns:
List of tokens with punctuation as separate tokens.
- Return type:
list[str]