BBOX visualizer JSON extractor
JSON-builder script (CLI)
Module for generating a JSON file used to visualize the bounding boxes of a canonical newspaper element (issue, page or content-item).
- Usage:
python json_builder.py <element_ID> --level <level of bboxes> --output <output_path.json> --verbose --log-file <path/to/log_file>
element_id (positional) : ID of the element you want to extract the JSON from
level : level of the bounding boxes you want to visualize, one of {regions, paragraphs, lines, tokens}
output : path where the corresponding JSON with the bounding boxes will be written.
verbose : set the log level to DEBUG; otherwise it is INFO.
log-file : path to the log file to use; otherwise logs are printed to stdout.
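The options listed above map naturally onto an `argparse` parser. The following is a minimal sketch of how such a parser could look, not the actual code of `json_builder.py`: the function name `build_parser`, the default values and the example ID are assumptions made for illustration.

```python
import argparse

# Levels documented for the CLI.
LEVELS = ("regions", "paragraphs", "lines", "tokens")


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented CLI options."""
    parser = argparse.ArgumentParser(prog="json_builder.py")
    parser.add_argument("element_id", help="ID of the issue, page or content-item")
    # Default of "regions" is assumed from the build_bbox_json signature.
    parser.add_argument("--level", choices=LEVELS, default="regions")
    parser.add_argument("--output", default=None, help="path of the output JSON")
    parser.add_argument("--verbose", action="store_true", help="log at DEBUG level")
    parser.add_argument("--log-file", default=None, help="log to this file instead of stdout")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["GDL-1900-01-02-a-p0001", "--level", "lines", "--verbose"])
print(args.element_id, args.level, args.verbose, args.log_file)
```

Note that `argparse` exposes `--log-file` as the attribute `log_file`.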
- impresso_essentials.bbox_visualizer.json_builder.build_bbox_json(element_id: str, level: str = 'regions', output_path: str = None) dict
Build the JSON of the bounding boxes of a page, CI, or issue at the specified level.
- Parameters:
element_id (str) – The ID of the page, CI, or issue.
level (str) – The level at which to extract the bounding boxes. Options: “regions”, “paragraphs”, “lines”, “tokens”.
output_path (str) – Optional output file path.
- Returns:
The JSON structure containing the bounding boxes
- Return type:
dict
- Raises:
ValueError – If the level is not recognized or if the element_id is invalid
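The contract described for `build_bbox_json` (reject an unknown level, return the dict, and write it to disk only when an output path is given) can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `validate_level` and `write_if_requested` are hypothetical helper names.

```python
import json
from typing import Optional

# Levels documented for build_bbox_json.
VALID_LEVELS = ("regions", "paragraphs", "lines", "tokens")


def validate_level(level: str = "regions") -> str:
    """Return the level unchanged, or raise ValueError if it is not recognized."""
    if level not in VALID_LEVELS:
        raise ValueError(f"Unrecognized level {level!r}; expected one of {VALID_LEVELS}")
    return level


def write_if_requested(bboxes: dict, output_path: Optional[str] = None) -> dict:
    """Write the dict as JSON when a path is given, and always return it."""
    if output_path is not None:
        with open(output_path, "w", encoding="utf-8") as handle:
            json.dump(bboxes, handle, ensure_ascii=False, indent=2)
    return bboxes
```

A caller would typically wrap the call in `try`/`except ValueError` to report an invalid level or ID.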
Helpers to extract bboxes from the canonical data
Helper functions to extract bounding boxes from the manifest, with improvements to reduce S3 calls.
- impresso_essentials.bbox_visualizer.get_bbox.create_image_url(canonical_page_json: dict) str
Creates the URL for the page image using the base IIIF URL.
- Parameters:
canonical_page_json (dict) – JSON object of the page in canonical format.
- Returns:
URL of the page image.
- Return type:
str
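For orientation, a IIIF Image API request appends `{region}/{size}/{rotation}/{quality}.{format}` to the image's base URI. The sketch below builds such a URL; the exact parameters used by `create_image_url` are not documented here, so the helper name `image_url_from_base` and the `full/full/0/default.jpg` suffix are assumptions.

```python
def image_url_from_base(base_url: str, size: str = "full") -> str:
    """Build a IIIF image URL for the whole page from a base URI.

    Assumption: full region, no rotation, default quality, JPEG output.
    IIIF Image API 3.0 uses "max" instead of "full" for the size segment.
    """
    return f"{base_url.rstrip('/')}/full/{size}/0/default.jpg"


# example.org is a placeholder host, not a real IIIF server.
print(image_url_from_base("https://example.org/iiif/some-page-id"))
```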
- impresso_essentials.bbox_visualizer.get_bbox.create_s3_path(element_id: str) str
Constructs the S3 path based on the provided id.
- Parameters:
element_id (str) – Identifier string for page, CI, or issue.
- Returns:
S3 path string.
- Return type:
str
- Raises:
ValueError – If the id format is invalid.
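Before a path can be built, the ID has to be recognized as an issue, page or CI identifier. The sketch below shows one way to do that check and raise `ValueError` on a malformed ID. It assumes IDs of the form `<alias>-YYYY-MM-DD-<edition>` with an optional `-pNNNN` (page) or `-iNNNN` (CI) suffix. That pattern, the example IDs and the function name `classify_element_id` are assumptions, and the sketch deliberately does not guess the S3 bucket layout.

```python
import re

# Assumed ID shape: alias, date, one-letter edition, optional page/CI suffix.
_ID_PATTERN = re.compile(
    r"^(?P<alias>[A-Za-z0-9_]+)-(?P<date>\d{4}-\d{2}-\d{2})-(?P<edition>[a-z])"
    r"(?:-(?P<kind>[pi])(?P<num>\d{4}))?$"
)


def classify_element_id(element_id: str) -> str:
    """Return "issue", "page" or "content-item"; raise ValueError otherwise."""
    match = _ID_PATTERN.match(element_id)
    if match is None:
        raise ValueError(f"Invalid element ID: {element_id!r}")
    kind = match.group("kind")
    if kind == "p":
        return "page"
    if kind == "i":
        return "content-item"
    return "issue"
```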
- impresso_essentials.bbox_visualizer.get_bbox.get_base_url(canonical_page_json: dict) str
Retrieves the base URL of the IIIF server from page JSON in canonical format.
- Parameters:
canonical_page_json (dict) – JSON object of the page in canonical format. This should contain either "iiif" or "iiif_img_base_uri".
- Returns:
The IIIF base URL.
- Return type:
str
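Since the page JSON may carry the base URL under either of two keys, retrieval amounts to a lookup with a fallback. A minimal sketch follows. The precedence given to "iiif_img_base_uri", the trailing-slash handling, the `KeyError` and the helper name `base_url_from_page` are assumptions, not documented behaviour.

```python
def base_url_from_page(canonical_page_json: dict) -> str:
    """Return the IIIF base URL stored under either documented key."""
    # Assumed precedence: the more explicit key is checked first.
    for key in ("iiif_img_base_uri", "iiif"):
        value = canonical_page_json.get(key)
        if value:
            return value.rstrip("/")
    raise KeyError('page JSON has neither "iiif_img_base_uri" nor "iiif"')


# example.org is a placeholder host.
print(base_url_from_page({"iiif": "https://example.org/iiif/some-page-id/"}))
```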
- impresso_essentials.bbox_visualizer.get_bbox.get_ci_bounding_boxes(rebuilt_ci_json: dict, level: str = 'regions') dict
Extract the bounding boxes of a CI at the specified level from its rebuilt manifest.
- Parameters:
rebuilt_ci_json (dict) – The JSON dict of a CI from the rebuilt manifest
level (str) – The level at which to extract the bounding boxes. Options: “regions” (extract the bounding boxes of the regions), “tokens” (extract the bounding boxes of the tokens). Default: “regions”.
- Returns:
A dictionary keyed by image URL, holding the bounding boxes (coordinates) together with the CI type and CI ID.
- Return type:
dict
- impresso_essentials.bbox_visualizer.get_bbox.get_ci_type(ci_id: str) str
Get the type of the CI from its ID from the canonical manifest of the issue.
Uses a cache to avoid repeated S3 calls.
- Parameters:
ci_id (str) – The ID of the CI
- Returns:
The mapped CI type
- Return type:
str
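The caching idea can be illustrated with `functools.lru_cache`: the issue manifest is loaded once and then reused for every CI of that issue. This is a sketch of the general technique, not the library's cache. `load_issue_types` stands in for the S3 read and returns toy data, and deriving the issue ID by dropping the last `-` segment of the CI ID is an assumption.

```python
from functools import lru_cache

# Counts how many times the (simulated) remote read happens.
calls = {"count": 0}


@lru_cache(maxsize=128)
def load_issue_types(issue_id: str) -> dict:
    """Stand-in for reading the canonical issue from S3 (toy data)."""
    calls["count"] += 1
    return {f"{issue_id}-i0001": "article", f"{issue_id}-i0002": "advertisement"}


def ci_type(ci_id: str) -> str:
    """Look up a CI's type, hitting the 'remote' store once per issue."""
    issue_id = ci_id.rsplit("-", 1)[0]  # assumed: CI ID = issue ID + "-iNNNN"
    return load_issue_types(issue_id)[ci_id]


print(ci_type("GDL-1900-01-02-a-i0001"), ci_type("GDL-1900-01-02-a-i0002"))
print("remote reads:", calls["count"])
```

Caching per issue rather than per CI means that all content-items of one issue share a single read.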
- impresso_essentials.bbox_visualizer.get_bbox.get_issue_bounding_boxes(canonical_issue_json: dict, level: str = 'regions') dict
Extract bounding boxes from the issue manifest at the specified level from the rebuilt manifest.
- Parameters:
canonical_issue_json (dict) – The JSON dict of an issue from the canonical manifest
level (str) – The level at which to extract the bounding boxes. Options: “regions”, “tokens”.
- Returns:
A dictionary mapping image URLs to lists of bounding boxes.
- Return type:
dict
- impresso_essentials.bbox_visualizer.get_bbox.get_page_bounding_boxes(canonical_page_json: dict, level: str = 'regions') dict
Extract bounding boxes from the manifest at the specified level.
- Parameters:
canonical_page_json (dict) – JSON object of the page in canonical format.
level (str) – The level at which to extract the bounding boxes. Options: “regions”, “paragraphs”, “lines”, “tokens”.
- Returns:
A dictionary mapping the base image URL to a list of bounding boxes
- Return type:
dict
- Raises:
ValueError – If the level is not recognized
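Extracting boxes at a given level amounts to descending a nested structure (regions, then paragraphs, lines and tokens) and collecting the coordinates found at the requested depth. The sketch below assumes a page JSON in which regions are stored under "r", paragraphs under "p", lines under "l", tokens under "t" and coordinates under "c". Those key names, the toy page, the placeholder URL and the function name `collect_boxes` are assumptions for illustration, and the real output may carry more fields per box.

```python
# Keys to descend through, starting from the list of regions (assumed schema).
LEVEL_KEYS = {
    "regions": [],
    "paragraphs": ["p"],
    "lines": ["p", "l"],
    "tokens": ["p", "l", "t"],
}


def collect_boxes(page: dict, image_url: str, level: str = "regions") -> dict:
    """Map the image URL to the list of coordinates found at `level`."""
    if level not in LEVEL_KEYS:
        raise ValueError(f"Unrecognized level: {level!r}")
    nodes = page.get("r", [])
    for key in LEVEL_KEYS[level]:
        # Flatten one nesting level: replace each node by its children.
        nodes = [child for node in nodes for child in node.get(key, [])]
    return {image_url: [node["c"] for node in nodes if "c" in node]}


toy_page = {"r": [{"c": [0, 0, 100, 50], "p": [{"c": [0, 0, 100, 20], "l": [
    {"c": [0, 0, 100, 10], "t": [{"c": [0, 0, 40, 10]}, {"c": [45, 0, 50, 10]}]}
]}]}]}
print(collect_boxes(toy_page, "https://example.org/iiif/p1", "tokens"))
```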