Introduction
METS offers more scope for representing complex structure than any other standard.
The METAe project group therefore chose METS for its challenging task of digitizing historic books and journals (1850-1920).
While METS describes the structure of objects well, a schema for the content and layout information of each piece of the object was missing. The METAe project group therefore introduced the ALTO schema, which holds not only all the text of a page but also the coordinates of every word, paragraph, text block and illustration on that page. This makes it possible to fully describe and reconstruct the layout and segmentation of the original digitized page.
During the METAe project, ALTO became a valuable extension schema for METS, at least for printed materials.
History
METS offers more scope for representing complex structure than any other standard. The METAe project group therefore chose METS for its challenging task of digitizing historic books and journals (1850-1920).
While METS describes the structure of objects well, a schema for the content and layout information of each piece of the object was missing. The METAe project group therefore introduced the ALTO schema, which holds not only all the text of a page but also the coordinates of every word, paragraph, text block and illustration on that page. During the METAe project, ALTO became a valuable extension schema for METS, at least for printed materials.
METS/ALTO XML Objects in Real Life
CCS developed docWORKS/METAe as content conversion software. Scanned images are processed (pre-processing, layout analysis, OCR, structure analysis) and exported as standard XML objects based on the METS and ALTO XML schemas. From the rich METS/ALTO XML object, derivatives (PDF, METS/TEI, METS/TXT) can easily be built using XSL style sheets.
Several national and general libraries, as well as other cultural and educational institutions, already use docWORKS to digitize and preserve their books, newspapers and journals, for example:
Harvard University Library
Library of Congress
Stanford University Library
University of Texas at Austin
Royal Danish Library
National Library of Finland
National Library of Norway
National Library of the Netherlands
ALTO in NDNP
For the NDNP (National Digital Newspaper Program), the Library of Congress was looking for a METS extension schema describing the layout and content of printed pages. ALTO was a perfect fit, as it had already proven itself in the digitization of books and journals in previous years. In response to NDNP-related requests, the ALTO schema was extended to cover all of the program's needs.
ALTO 1.1, released and published by the Library of Congress, contains some adaptations to the technical requirements of NDNP.
ALTO Description
ALTO is a standardized XML format that stores the layout information and OCR-recognized text of pages of any kind of printed document, such as books, journals and newspapers. It is designed to be used as an extension schema to METS (Metadata Encoding and Transmission Standard), where METS provides metadata and structural information while ALTO contains content and physical information.
Each ALTO file contains a style section that lists the different styles (for paragraphs and fonts). The layout section describes what is on the page. A page is divided into several regions (print space, left margin, right margin, top margin and bottom margin). For each region, all objects detected inside it are listed.
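The file layout described above can be illustrated with a minimal Python sketch. The XML fragment is a hypothetical, heavily simplified sample written for this example: it omits the XML namespace and most attributes that real ALTO files carry, and the element names (Styles, Layout, Page, PrintSpace, TextBlock, TextLine, String) should be checked against the ALTO schema version in use. The function name words_by_region is also an assumption, not part of any ALTO tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified ALTO-like sample (no namespace, few attributes).
SAMPLE = """<alto>
  <Description><MeasurementUnit>mm10</MeasurementUnit></Description>
  <Styles>
    <TextStyle ID="TXT_0" FONTFAMILY="Times" FONTSIZE="10"/>
    <ParagraphStyle ID="PAR_0" ALIGN="Left"/>
  </Styles>
  <Layout>
    <Page ID="P1" WIDTH="2100" HEIGHT="2970">
      <TopMargin ID="TM"/>
      <LeftMargin ID="LM"/>
      <RightMargin ID="RM"/>
      <BottomMargin ID="BM"/>
      <PrintSpace ID="PS">
        <TextBlock ID="TB1" HPOS="200" VPOS="300" WIDTH="1700" HEIGHT="100">
          <TextLine>
            <String CONTENT="Historic" HPOS="200" VPOS="300" WIDTH="400" HEIGHT="60"/>
            <SP/>
            <String CONTENT="books" HPOS="620" VPOS="300" WIDTH="300" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def words_by_region(alto_xml):
    """Map each page region to its words as (text, hpos, vpos, width, height)."""
    root = ET.fromstring(alto_xml)
    page = root.find("./Layout/Page")
    regions = {}
    for region in page:  # direct children: the margins and the print space
        regions[region.tag] = [
            (
                s.get("CONTENT"),
                int(s.get("HPOS")),
                int(s.get("VPOS")),
                int(s.get("WIDTH")),
                int(s.get("HEIGHT")),
            )
            for s in region.iter("String")
        ]
    return regions


regions = words_by_region(SAMPLE)
unit = ET.fromstring(SAMPLE).findtext("./Description/MeasurementUnit")
```

Here regions maps each region name to the words found inside it, and unit holds the measurement unit declared in the file. With real ALTO files the tag names would need to be qualified with the schema's namespace.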
Measurements in ALTO XML files are given in 1/10 mm or in 1/1200 inch. For presentation purposes one might want to create low-resolution images. To use the coordinates in the ALTO file at any resolution, they need to be transformed into pixels.
Transforming inch1200 values into pixels depends on the image resolution, expressed in pixels per inch (dpi). Convert the values into pixels as follows:
pixel = value * resolution / 1200
For 1/10 mm values (one inch is 25.4 mm, i.e. 254 tenths of a millimetre), convert the values into pixels as follows:
pixel = value * resolution / 254
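The two conversion formulas can be combined into one small helper. The following is a minimal Python sketch; the function name to_pixels and the unit keys "inch1200" and "mm10" are choices made for this example, and the resolution is assumed to be given in pixels per inch.

```python
# Number of measurement units per inch for each unit (keys are illustrative).
UNITS_PER_INCH = {
    "inch1200": 1200,  # 1/1200 inch
    "mm10": 254,       # 1/10 mm; one inch is 25.4 mm
}


def to_pixels(value, resolution_dpi, unit):
    """Convert an ALTO measurement to pixels at the given resolution (dpi)."""
    if unit not in UNITS_PER_INCH:
        raise ValueError(f"unsupported measurement unit: {unit!r}")
    return value * resolution_dpi / UNITS_PER_INCH[unit]


# One inch in either unit corresponds to 300 pixels at 300 dpi.
one_inch_a = to_pixels(1200, 300, "inch1200")
one_inch_b = to_pixels(254, 300, "mm10")
```

The result is a float; round it or truncate it as appropriate before using it as an image coordinate.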