BBOX visualizer JSON extractor

JSON-builder script (CLI)

Module that generates a JSON file used to visualize the bounding boxes of a canonical newspaper element (issue, page or content-item).

Usage:

python json_builder.py <element_ID> --level <level of bboxes> --output <output_path.json> --verbose --log-file <path/to/log_file>

  • element_id (positional) : ID of the element you want to extract the JSON from

  • level : level of the bounding boxes you want to visualize, one of {regions, paragraphs, lines, tokens}

  • output : path where the corresponding JSON with the bounding boxes will be written.

  • verbose : set the log level to DEBUG; otherwise it is INFO.

  • log-file : path to the log file to use; otherwise logs are printed to stdout.
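The options above can be mirrored with a small argparse parser. This is only a sketch of the interface as documented here, not the script's actual parser: the default level for the CLI is assumed to match the function default ("regions"), and the element ID in the example is a made-up placeholder.

```python
import argparse

# Levels listed in the usage notes above.
LEVELS = ("regions", "paragraphs", "lines", "tokens")


def make_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="json_builder.py")
    parser.add_argument("element_id", help="ID of the issue, page or content-item")
    # Assumption: the CLI default matches the function default, "regions".
    parser.add_argument("--level", choices=LEVELS, default="regions")
    parser.add_argument("--output", default=None, help="path of the JSON to write")
    parser.add_argument("--verbose", action="store_true", help="log at DEBUG level")
    parser.add_argument("--log-file", default=None, help="log file; stdout if unset")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
# The element ID below is a hypothetical placeholder.
args = make_parser().parse_args(
    ["GDL-1900-01-02-a-p0001", "--level", "lines", "--output", "boxes.json"]
)
print(args.element_id, args.level, args.output, args.verbose, args.log_file)
```

Note that argparse exposes `--log-file` as `args.log_file`.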

impresso_essentials.bbox_visualizer.json_builder.build_bbox_json(element_id: str, level: str = 'regions', output_path: str = None) → dict

Build the JSON of the bounding boxes of a page, CI, or issue at the specified level.

Parameters:
  • element_id (str) – The ID of the page, CI, or issue.

  • level (str) – The level at which to extract the bounding boxes. Options: “regions”, “paragraphs”, “lines”, “tokens”.

  • output_path (str) – Optional output file path

Returns:

The JSON structure containing the bounding boxes

Return type:

dict

Raises:

ValueError – If the level is not recognized or if the element_id is invalid
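The documented contract (validate the level, return the structure, and write it to disk only when output_path is given) can be sketched as follows. The function name `write_bbox_json` and the shape of the `bboxes` argument are hypothetical; the real build_bbox_json also fetches and parses the element, which is not shown here.

```python
import json
from typing import Optional

VALID_LEVELS = {"regions", "paragraphs", "lines", "tokens"}


def write_bbox_json(
    bboxes: dict, level: str = "regions", output_path: Optional[str] = None
) -> dict:
    """Hypothetical helper: validate the level, optionally write, return the dict."""
    if level not in VALID_LEVELS:
        raise ValueError(
            f"Unrecognized level {level!r}; expected one of {sorted(VALID_LEVELS)}"
        )
    if output_path is not None:
        # Only touch the filesystem when a path was requested.
        with open(output_path, "w", encoding="utf-8") as handle:
            json.dump(bboxes, handle, ensure_ascii=False, indent=2)
    return bboxes
```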

Helpers to extract bboxes from the canonical data

Helper functions to extract bounding boxes from the manifest, with improvements to reduce S3 calls.

impresso_essentials.bbox_visualizer.get_bbox.create_image_url(canonical_page_json: dict) → str

Creates the URL for the page image using the base IIIF URL.

Parameters:

canonical_page_json (dict) – JSON object of the page in canonical format.

Returns:

URL of the page image.

Return type:

str
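As an illustration of building an image URL from an IIIF base, the sketch below appends the region/size/rotation/quality.format segments defined by the IIIF Image API 2.x. Which parameters the library actually uses is not stated here, so the "full/full/0/default.jpg" suffix and the helper name are assumptions.

```python
def image_url_from_base(iiif_base: str) -> str:
    """Append IIIF Image API 2.x segments to a base URL (assumed parameters)."""
    # region=full, size=full, rotation=0, quality=default, format=jpg
    return f"{iiif_base.rstrip('/')}/full/full/0/default.jpg"


print(image_url_from_base("https://example.org/iiif/page-1/"))
```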

impresso_essentials.bbox_visualizer.get_bbox.create_s3_path(element_id: str) → str

Constructs the S3 path based on the provided id.

Parameters:

element_id (str) – Identifier string for page, CI, or issue.

Returns:

S3 path string.

Return type:

str

Raises:

ValueError – If the id format is invalid.
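Before a path can be built, the kind of element has to be inferred from the ID. The sketch below shows one way to do that and to raise ValueError on anything else. The ID patterns (a "-p<digits>" suffix for pages, "-i<digits>" for content-items, a date plus edition letter for issues) are assumptions made for illustration, and the actual S3 layout is deliberately not reproduced.

```python
import re


def classify_element_id(element_id: str) -> str:
    """Return 'page', 'content-item' or 'issue' for an ID (assumed patterns)."""
    if re.fullmatch(r".+-p\d+", element_id):
        return "page"
    if re.fullmatch(r".+-i\d+", element_id):
        return "content-item"
    if re.fullmatch(r".+-\d{4}-\d{2}-\d{2}-[a-z]", element_id):
        return "issue"
    raise ValueError(f"Invalid element ID format: {element_id!r}")
```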

impresso_essentials.bbox_visualizer.get_bbox.get_base_url(canonical_page_json: dict) → str

Retrieves the base URL of the IIIF server from page JSON in canonical format.

Parameters:

canonical_page_json (dict) – JSON object of the page in canonical format. This should contain either "iiif" or "iiif_img_base_uri".

Returns:

The IIIF base URL.

Return type:

str
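The parameter note names two possible keys, "iiif" and "iiif_img_base_uri". A minimal sketch of a lookup that accepts either one is shown below; the order in which the keys are tried, the error raised when both are missing, and the helper name are assumptions, and any further normalization the library may apply to the value is not shown.

```python
def pick_iiif_base(page_json: dict) -> str:
    """Return the first IIIF base found under either documented key."""
    # Assumption: prefer "iiif_img_base_uri", then fall back to "iiif".
    for key in ("iiif_img_base_uri", "iiif"):
        value = page_json.get(key)
        if value:
            return value
    raise KeyError('page JSON has neither "iiif_img_base_uri" nor "iiif"')
```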

impresso_essentials.bbox_visualizer.get_bbox.get_ci_bounding_boxes(rebuilt_ci_json: dict, level: str = 'regions') → dict

Extract bounding boxes at the specified level from the rebuilt manifest of a CI.

Parameters:
  • rebuilt_ci_json (dict) – The JSON dict of a CI from the rebuilt manifest

  • level (str) – The level at which to extract the bounding boxes. Options: “regions” (bounding boxes of the regions) or “tokens” (bounding boxes of the tokens). Default: “regions”.

Returns:

A dictionary keyed by image URL whose values hold the bounding boxes (coordinates), the CI type and the CI ID.

Return type:

dict

impresso_essentials.bbox_visualizer.get_bbox.get_ci_type(ci_id: str) → str

Get the type of a CI, looked up by its ID in the canonical manifest of the issue.

Uses a cache to avoid repeated S3 calls.

Parameters:

ci_id (str) – The ID of the CI

Returns:

The mapped CI type

Return type:

str
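The caching idea can be illustrated with functools.lru_cache: the issue manifest is loaded once and reused for every CI of that issue. Everything below is a stand-in: `load_issue` fakes the S3 read, the manifest shape is invented, and deriving the issue ID by dropping the last "-" segment of the CI ID is an assumption.

```python
from functools import lru_cache

# Records each simulated remote read so the cache effect is visible.
calls = []


@lru_cache(maxsize=128)
def load_issue(issue_id: str) -> dict:
    """Hypothetical stand-in for reading an issue manifest from S3."""
    calls.append(issue_id)
    return {"id": issue_id, "types": {f"{issue_id}-i0001": "article"}}


def ci_type(ci_id: str) -> str:
    """Look up a CI type, reusing the cached issue manifest."""
    # Assumption: the issue ID is the CI ID without its last "-" segment.
    issue_id = ci_id.rsplit("-", 1)[0]
    return load_issue(issue_id)["types"].get(ci_id, "unknown")


first = ci_type("GDL-1900-01-02-a-i0001")
second = ci_type("GDL-1900-01-02-a-i0002")
print(first, second, len(calls))
```

Because lru_cache returns the same dict object on every hit, callers should treat the cached manifest as read-only.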

impresso_essentials.bbox_visualizer.get_bbox.get_issue_bounding_boxes(canonical_issue_json: dict, level: str = 'regions') → dict

Extract bounding boxes from the issue manifest at the specified level.

Parameters:
  • canonical_issue_json (dict) – The JSON dict of an issue from the canonical manifest

  • level (str) – The level at which to extract the bounding boxes. Options: “regions”, “tokens”.

Returns:

A dictionary mapping image URLs to lists of bounding boxes.

Return type:

dict
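Since an issue spans several pages, a result of this shape can be assembled by merging per-page mappings. The sketch below assumes each page contributes a dict of image URL to list of boxes; the helper name and the box format ([x, y, width, height]) are illustrative, not taken from the library.

```python
from collections import defaultdict


def merge_page_boxes(per_page: list) -> dict:
    """Merge several {image_url: [boxes]} dicts into one mapping."""
    merged = defaultdict(list)
    for page_boxes in per_page:
        for url, boxes in page_boxes.items():
            merged[url].extend(boxes)
    return dict(merged)


pages = [
    {"https://example.org/p1": [[0, 0, 10, 10]]},
    {"https://example.org/p2": [[5, 5, 20, 20]]},
    {"https://example.org/p1": [[1, 1, 2, 2]]},
]
issue_boxes = merge_page_boxes(pages)
```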

impresso_essentials.bbox_visualizer.get_bbox.get_page_bounding_boxes(canonical_page_json: dict, level: str = 'regions') → dict

Extract bounding boxes from the page manifest at the specified level.

Parameters:
  • canonical_page_json (dict) – JSON object of the page in canonical format.

  • level (str) – The level at which to extract the bounding boxes. Options: “regions”, “paragraphs”, “lines”, “tokens”.

Returns:

A dictionary mapping the base image URL to a list of bounding boxes

Return type:

dict

Raises:

ValueError – If the level is not recognized
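The four levels form a nesting (regions contain paragraphs, which contain lines, which contain tokens), so extraction at a given level amounts to descending that many steps and collecting coordinates. The sketch below uses an invented page structure with keys "regions", "paragraphs", "lines", "tokens" and "coords"; the canonical format's real key names are not given here and may differ.

```python
LEVELS = ["regions", "paragraphs", "lines", "tokens"]


def collect_boxes(page: dict, level: str = "regions") -> list:
    """Collect coordinates at the requested level of a nested page dict."""
    if level not in LEVELS:
        raise ValueError(f"Unrecognized level: {level!r}")
    depth = LEVELS.index(level)
    nodes = page.get("regions", [])
    # Descend one nesting step per level below "regions".
    for child_key in LEVELS[1 : depth + 1]:
        nodes = [child for node in nodes for child in node.get(child_key, [])]
    return [node["coords"] for node in nodes if "coords" in node]


# Hypothetical page with one region, one paragraph, one line and two tokens.
page = {
    "regions": [
        {
            "coords": [0, 0, 100, 50],
            "paragraphs": [
                {
                    "coords": [0, 0, 100, 20],
                    "lines": [
                        {
                            "coords": [0, 0, 100, 10],
                            "tokens": [
                                {"coords": [0, 0, 30, 10]},
                                {"coords": [35, 0, 40, 10]},
                            ],
                        }
                    ],
                }
            ],
        }
    ]
}
token_boxes = collect_boxes(page, "tokens")
```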