Uses of Metadata
From Project Gutenberg, the first producer of free electronic books (ebooks).
There have been various attempts to create software that turns Project Gutenberg into a user-friendly virtual library. This article describes some issues and concepts for designing such software. Much of the proposed solution centers on storing metadata in a separate file for each etext. Metadata files make it possible to apply virtual corrections, additions and enhancements to an etext without ever modifying the etext itself: the software applies the information when it loads the etext, and only to the copy it holds in RAM while running. The metadata could also include many kinds of useful or interesting information that would not be appropriate to include in the etext itself.
The most important type considered here, though, is "fancy-formatting" data. The reason to store such data, rather than relying on a program such as GutenMark to generate all of it automatically, is this: although much of the fancy formatting for an etext can be generated automatically and correctly, some of the formatting for the same etext will often come out wrong, because there are too many things a program cannot anticipate unless it is extremely sophisticated, and that would make it excessively large, slow and complex. So, rather than trying to write a perfect program, one can write a program that formats most of the text correctly and uses a small amount of external data to tell it how to avoid specific mistakes. The result is faster and "perfect" formatting that needs far less data than a complete formatting record would. The degree of perfection depends only on human reviewers spotting imperfections and telling the program about them so that it can generate corrective metadata.
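The idea of applying metadata only to the in-memory copy can be sketched as follows. This is a minimal illustration, not an existing tool: the `Correction` record, its fields and the `apply_corrections` function are all hypothetical names chosen for this example, and the sample lines are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correction:
    """One virtual correction (hypothetical record layout)."""
    line: int   # 1-based line number in the unmodified etext
    old: str    # text expected on that line
    new: str    # text to show instead


def apply_corrections(etext_lines, corrections):
    """Return a corrected copy; the list passed in is never changed."""
    lines = list(etext_lines)  # work on a copy held in memory
    for c in corrections:
        idx = c.line - 1
        # Only apply a correction if the expected text is really there.
        if 0 <= idx < len(lines) and c.old in lines[idx]:
            lines[idx] = lines[idx].replace(c.old, c.new, 1)
    return lines


# Made-up sample data standing in for an etext and its metadata file.
etext = ["It was teh best of times,", "it was the worst of times,"]
metadata = [Correction(line=1, old="teh", new="the")]
in_memory = apply_corrections(etext, metadata)
```

Checking that the expected text is present before replacing it means a stale correction is silently skipped instead of damaging a line that has since changed.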
Procedure for Generating Formatting Metadata:
- An etext is run through an auto-formatter program.
- A person reviews the result for imperfections.
- If the reviewer finds a mistake, they point it out to the program.
- For each mistake the reviewer points out, the program generates some data to tell itself how to avoid it in the future.
- If possible, the program should simplify the data by storing the corrections as additional processing rules rather than as a list of individual exceptions.
- The corrective data is stored in the etext's official metadata file.
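The procedure above might look roughly like this in code. The sketch assumes a toy auto-formatter whose only heuristic is "every ALLCAPS word is emphasis"; the marker `[EMPH ...]`, the metadata keys `never_emphasis` and `exceptions`, and the `promote_after` threshold are all inventions for this example, and JSON is just one possible storage format.

```python
import json
import re


def auto_format(line, line_no, meta):
    """Toy auto-formatter: mark ALLCAPS words as emphasis unless told not to."""
    def tag(match):
        word = match.group(0)
        if word in meta["never_emphasis"] or [line_no, word] in meta["exceptions"]:
            return word  # the metadata says this is not emphasis
        return "[EMPH " + word + "]"
    return re.sub(r"\b[A-Z]{2,}\b", tag, line)


def record_mistake(meta, line_no, word, promote_after=2):
    """Store a reviewer's correction, promoting repeats to a general rule."""
    meta["exceptions"].append([line_no, word])
    hits = [e for e in meta["exceptions"] if e[1] == word]
    if len(hits) >= promote_after:
        # Simplify: one rule replaces several individual exceptions.
        meta["never_emphasis"].append(word)
        meta["exceptions"] = [e for e in meta["exceptions"] if e[1] != word]


# Made-up sample text; the reviewer flags "NASA" on lines 1 and 2.
meta = {"never_emphasis": [], "exceptions": []}
text = ["NASA said it was VERY cold.", "NASA agreed."]
record_mistake(meta, 1, "NASA")
record_mistake(meta, 2, "NASA")

formatted = [auto_format(line, n, meta) for n, line in enumerate(text, start=1)]
stored = json.dumps(meta, sort_keys=True)  # what would go in the metadata file
```

After the second report, the two per-line exceptions collapse into a single rule, so the stored data stays small even if the word appears hundreds of times.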
So why not just store complete HTML tags in the metadata, along with the positions at which to insert them?
- Because people will certainly have differing preferences for various aspects of the layout, colors and rendering.
- Storing it in a form specific to one file format, such as HTML markup in this case, would mean more complex parsing to convert it to other file formats.
- The formatting metadata only needs to say what type of text is where, not which fonts, styles or colors to use.
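To illustrate the difference, the sketch below records only the type of each piece of text and leaves presentation to separate renderers. The type names (`chapter_title`, `paragraph`), the `HTML_THEME` mapping and both renderer functions are assumptions made for this example, not a proposed standard.

```python
import html

# Format-neutral description: what kind of text is where.
blocks = [
    ("chapter_title", "Loomings"),
    ("paragraph", "Call me Ishmael."),
]

# A reader's preferences live in the theme, not in the metadata.
HTML_THEME = {"chapter_title": "h2", "paragraph": "p"}


def render_html(blocks, theme=HTML_THEME):
    """Render the typed blocks as HTML using the tags named in the theme."""
    return "\n".join(
        "<{tag}>{text}</{tag}>".format(tag=theme[kind], text=html.escape(text))
        for kind, text in blocks
    )


def render_plain(blocks):
    """Render the same blocks as plain text, with titles in capitals."""
    out = [text.upper() if kind == "chapter_title" else text
           for kind, text in blocks]
    return "\n\n".join(out)


as_html = render_html(blocks)
as_text = render_plain(blocks)
```

Because the metadata names text types rather than tags, adding another output format means writing one more renderer, with no HTML parsing involved.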
Metadata Types to Consider Including
- Fancy-formatting data. Rather than being complete, this would mostly contain data to tell an auto-formatting program how to avoid specific mistakes it would otherwise make in the etext.
- Start line and line count of the main text, so the software can strip copyright information and producer's notes.
- The book's title and author's name
- The location of the title and author name in the text so a quality title page can be created.
- Start line and line count of the table of contents, if present.
- Start lines and line counts of any front matter and back matter: prologue, preface, foreword, author's note, dedication, introduction, afterword and so on. These might be best formatted differently from the main text.
- Positions to place illustrations. These could be specified as centered over the text, or inline.
- Name translations for chapters, sections and other divisions, if their spellings in the table of contents differ from those in the main text.
- "Proper" capitalizations for chapter and division names which are in ALLCAPS
- Sometimes the name or number of a chapter, section or other division isn't present in the main text, especially if it's the first one in the book. By noting such instances, these can be inserted by the program for consistency.
- Line numbers of all chapter names and numbers for reliable hyperlinking plus better layout and formatting.
- A list of proper names, to ensure they are always properly capitalized. If a name can be confused with an ordinary word, each occurrence of it as a name should be recorded.
- Acronyms and other text for which being in ALLCAPS doesn't indicate emphasis.
- Which markup styles are used, since this can differ from etext to etext
- Missing or mismatched quotes.
- Mismatched markup for bold, italics and so on.
- Spelling and typo corrections.
- Hyphenation positions which differ from those expected, so the text can be reflowed more flexibly.
- Suggested themes: font faces, styles, colors, text flow, justification, backgrounds, illustrations.
- Index
- Glossary. Its entries could also be used for tooltip-style definitions.
- Bibliography
- Web links
- Notes about the book: publication history, the social and political atmosphere and the geographical location in which it was written, and so on.
- Author biography
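A few of the items above could be held in a record like the one sketched here. The `EtextMetadata` class, its field names and the JSON round trip are hypothetical choices for illustration; the sample lines and values are made up, and a real metadata file would carry many more of the listed types.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EtextMetadata:
    """A small subset of the metadata types listed above (hypothetical layout)."""
    title: str
    author: str
    main_start: int        # 1-based line where the main text begins
    main_line_count: int   # number of lines in the main text
    chapter_lines: list = field(default_factory=list)  # lines holding chapter names
    chapter_names: dict = field(default_factory=dict)  # ALLCAPS name -> proper name


def main_text(lines, meta):
    """Return only the main text, dropping notes before and after it."""
    start = meta.main_start - 1
    return lines[start:start + meta.main_line_count]


# Made-up sample etext and metadata.
lines = ["Producer's note", "", "CHAPTER ONE", "It began.", "", "End of etext"]
meta = EtextMetadata(
    title="Example",
    author="A. Writer",
    main_start=3,
    main_line_count=2,
    chapter_lines=[3],
    chapter_names={"CHAPTER ONE": "Chapter One"},
)

body = main_text(lines, meta)
display = [meta.chapter_names.get(line, line) for line in body]

# Round trip through JSON, as a metadata file on disk might be stored.
restored = EtextMetadata(**json.loads(json.dumps(asdict(meta))))
```

The etext lines themselves are never rewritten; the proper capitalization exists only in `display`, the copy prepared for the reader.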