Software Redesign
From Project Gutenberg, the first producer of free electronic books (ebooks).
This page is intended for working out the design for a new implementation of the PG catalog software/web site.
Problems to be addressed
The rewrite should address the following problems:
Full MARC Compatibility
Don't extract the information from the PG header of posted books. This has always been a source of serious headaches. Despite claims to the contrary, the PG header has never been machine-readable. Also, the WWers' conventions are incompatible with MARC.
Make the PPers enter the data of the books they are processing. Or harvest the DP database. The catalog should be up-to-date before the book is posted. Posting just adds the ebook number.
Full Work / Expression / Manifestation / Item Structure
See below.
Allow Cataloging of Yet Unpublished Books
Follow the book from copyright clearance to publishing with status indicator. Users should be able to get an idea of when the book will be ready. Notify them by email?
Support for International PGs
Central database accessed from local servers. Filter records based on copyright region.
Metadata Interchange
- with DP
- with PG copyright clearance system
- with TEI (read and create teiHeader)
- available software for reading TEI bibliographic information:
- http://refdb.sourceforge.net
- only for reading especially prepared TEI bibliographies!
- what else?
- with LoC (snarf LoC records)
- German language PG etext metadata at
- http://www.ark.in-berlin.de/PG-Posted-DE.MODS.xml (MODS format)
- http://www.ark.in-berlin.de/PG-Posted-DE.txt (Endnote/Refer format)
- metadata of about 70 percent of German language PG etexts included
Link with Wiki
Respecting the work / expression / manifestation hierarchy. Separate pages for work etc.?
Questions
- What changes to the database structure would be needed?
- What additional information would need to be captured?
- What information is captured now?
- In what specific ways is the current system not MARC compatible?
- How can we do this redesign in pieces?
- How should we deal with the early PG texts that are not based on any specific published copy?
- What will the process be for transferring our current data to the new form? -- MarcMaker might be useful.
- Can we use existing GPL catalog software like Evergreen?
- What about transcriptions of sections of a book? Journal articles? Both already exist in PG.
- Can we have a non-displaying notes field on bib records for catalogers' use?
Work / Expression / Manifestation / Item Structure
Just a suggestion, please modify.
Work
As in "Did you read that book?".
In PG: a catalog entity that has zero or more authors, a uniform title and subject headings.
Expression
The work or a translation or adaptation thereof.
In PG: a catalog entity that points to a work entity and has zero or more translators, illustrators etc. and a title statement.
Manifestation
An edition, on paper or other medium.
In PG: a catalog entity that points to an expression and has an (international) PG etext number.
Item
A particular copy of a manifestation, like the one with the coffee stain on page 42 that is shelved on shelf 69.
Here the analogy breaks down somewhat:
In PG: the TXT version, HTML version / the MP3 version, OGG version, etc.
See also: http://www.loc.gov/cds/downloads/FRBR.PDF
Proposed Database Structure
The database contains Groups of MARC fields. Groups can form acyclic directed graphs. (Note that there are no tables for books, authors, subjects etc.)
See also:
Recommendation for 'Digital Surrogates'
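A minimal sketch of how such groups could be held in memory, to make the "acyclic directed graph" requirement concrete. The names MarcField, MarcGroup, Catalog and add_pointer are hypothetical and not part of any existing PG code. The only rule taken from the text above is that pointers between groups must never form a cycle.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MarcField:
    tag: str    # e.g. "100", "240", "245"
    value: str  # raw subfield text, e.g. "$a Twain, Mark, $d 1835-1910"


@dataclass
class MarcGroup:
    group_id: int
    fields: list[MarcField] = field(default_factory=list)
    pointers: list[int] = field(default_factory=list)  # ids of groups pointed to


class Catalog:
    """Hypothetical container: one flat store of groups, no book/author tables."""

    def __init__(self) -> None:
        self.groups: dict[int, MarcGroup] = {}

    def add_group(self, group: MarcGroup) -> None:
        self.groups[group.group_id] = group

    def _reaches(self, start: int, target: int) -> bool:
        """Return True if target can be reached from start by following pointers."""
        stack, seen = [start], set()
        while stack:
            current = stack.pop()
            if current == target:
                return True
            if current in seen:
                continue
            seen.add(current)
            stack.extend(self.groups[current].pointers)
        return False

    def add_pointer(self, source: int, target: int) -> None:
        """Link source -> target, refusing any link that would close a cycle."""
        if self._reaches(target, source):
            raise ValueError(f"pointer {source} -> {target} would create a cycle")
        self.groups[source].pointers.append(target)


catalog = Catalog()
catalog.add_group(MarcGroup(1, [MarcField("100", "$a Twain, Mark, $d 1835-1910")]))
catalog.add_group(MarcGroup(2, [MarcField("240", "$a Huckleberry Finn")]))
catalog.add_pointer(2, 1)  # work group points to author group
```

Checking for cycles when a pointer is added, rather than when a page is generated, keeps the pointer resolution step simple.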
Examples of MARC Groups
(Note that the cataloger doesn't have to respect the w/e/m/i hierarchy. She can just as well put everything into one big group, like LoC does. All the records we'll have to copy from the current system will also be put into "one big group". The advantage of w/e/m/i is that if you want to add a subject heading to all 'Hamlet's, you just have to add it in one place.)
An author would be represented by a group of MARC fields that contains exactly one "100" MARC field.
100 $a Twain, Mark, $d 1835-1910
500 $a Clemens, Samuel Langhorne, $d 1835-1910
A work would be represented by a group of MARC fields that contains exactly one "240" MARC field. The "work group" can point to zero or more "author groups".
>>> pointer to author
240 $a Huckleberry Finn
650 $a Slavery
651 $a Mississippi
An expression would be represented by a group of MARC fields that contains exactly one "245" MARC field:
>>> pointer to work
245 $a The Adventures of Huckleberry Finn

>>> pointer to work
>>> pointer to translator
240 $a Huckleberry Finn, German
245 $a Die Abenteuer des Huckleberry Finn
A manifestation would be represented by a group of MARC fields that contains at least one "010" MARC field. The 010 contains the (internationalized) ebook number if and only if the book is legally downloadable in that area. BIBREC pages are generated only for MARC groups containing ebook numbers that match the area the download server is running in. (The database can thus run in a central location and serve multiple copyright regions.)
>>> pointer to expression
010 (US) 12345
010 (AU) 4321
010 (EU) 555
250 $a 3. Edition
260 $a New York : $b Chelsea House, $c 1922.
300 $a 139 p. : $b ill. ; $c 24 cm.

>>> pointer to expression
>>> pointer to reader
010 (US) 23456
245 $a The Adventures of Huckleberry Finn $h Audio Book $c Read by Joshua Hutchinson
N.B. "010" should probably be replaced by our own private field.
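The region check for manifestations could look roughly like the following sketch. EbookNumber and ebook_number_for are hypothetical names, and the region codes are simply the ones used in the example above. A manifestation gets a BIBREC page on a given server only if the lookup returns a number.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EbookNumber:
    region: str  # copyright region the number is valid for, e.g. "US"
    number: int  # ebook number within that region


def ebook_number_for(numbers: List[EbookNumber], server_region: str) -> Optional[int]:
    """Return the ebook number valid in server_region, or None if there is none."""
    for entry in numbers:
        if entry.region == server_region:
            return entry.number
    return None


# The three region-specific numbers from the manifestation example.
numbers = [EbookNumber("US", 12345), EbookNumber("AU", 4321), EbookNumber("EU", 555)]

au_number = ebook_number_for(numbers, "AU")  # 4321: page is generated on the AU server
ca_number = ebook_number_for(numbers, "CA")  # None: no page on a hypothetical CA server
```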
To generate a BIBREC page, the catalog software will start with the MARC group containing the ebook number and resolve all pointers. This will result in one big MARC group containing all fields (with "near" fields overriding "far" ones).
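One possible reading of "near overrides far" is sketched below: groups are visited level by level, starting from the group with the ebook number, and a field is dropped if a nearer group has already supplied a field with the same tag. The GROUPS dictionary and the resolve function are hypothetical, and the override rule is an assumption; the text above does not define it precisely.

```python
# Hypothetical store: group id -> (fields, pointers).
# Fields are (tag, value) pairs; pointers are ids of other groups.
GROUPS = {
    "work": (
        [("240", "$a Huckleberry Finn"), ("650", "$a Slavery"), ("651", "$a Mississippi")],
        [],
    ),
    "expression": (
        [("245", "$a The Adventures of Huckleberry Finn")],
        ["work"],
    ),
    "audio": (
        [
            ("010", "(US) 23456"),
            ("245", "$a The Adventures of Huckleberry Finn $h Audio Book $c Read by Joshua Hutchinson"),
        ],
        ["expression"],
    ),
}


def resolve(start, groups):
    """Merge every group reachable from start into one flat list of fields."""
    merged = []
    seen_tags = set()   # tags already supplied by a strictly nearer group
    visited = {start}
    level = [start]     # all groups at the current distance from start
    while level:
        level_tags = set()
        next_level = []
        for group_id in level:
            fields, pointers = groups[group_id]
            for tag, value in fields:
                if tag not in seen_tags:
                    merged.append((tag, value))
                    level_tags.add(tag)
            for target in pointers:
                if target not in visited:
                    visited.add(target)
                    next_level.append(target)
        # Groups at the same distance do not override each other.
        seen_tags |= level_tags
        level = next_level
    return merged


record = resolve("audio", GROUPS)
```

Here the audio book's own 245 replaces the 245 of the expression it points to, while 240, 650 and 651 are inherited from the work. Overriding by whole tag is a simplification: an author and a translator would both carry a 100 field, so a real implementation would probably need to take the role of each pointer into account.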
Views needed/Caching
It seems like all the various views of the catalog should be fully cached and regenerated only when the underlying data is modified.
The views I see are:
- Item information
- Author information
- Items by Author list
- Items by Subject list
- Items by Title list
- Items by Year list
- more?
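A minimal sketch of the caching idea, assuming each view can be rendered by a function of a view name and a key (an ebook number, an author, a subject, and so on). ViewCache and the render callback are hypothetical. A cached page is rebuilt only after it has been explicitly invalidated by a catalog change.

```python
class ViewCache:
    """Hypothetical cache: render each page once, rebuild only after invalidation."""

    def __init__(self, render):
        self._render = render  # function (view_name, key) -> page text
        self._pages = {}

    def get(self, view_name, key):
        cache_key = (view_name, key)
        if cache_key not in self._pages:
            self._pages[cache_key] = self._render(view_name, key)
        return self._pages[cache_key]

    def invalidate(self, view_name, key):
        # Called by the catalog whenever data behind this page is modified.
        self._pages.pop((view_name, key), None)


render_calls = []


def render(view_name, key):
    render_calls.append((view_name, key))
    return f"<page for {view_name} {key}>"


cache = ViewCache(render)
cache.get("items_by_author", "Twain, Mark")         # rendered
cache.get("items_by_author", "Twain, Mark")         # served from the cache
cache.invalidate("items_by_author", "Twain, Mark")  # e.g. a new book was posted
cache.get("items_by_author", "Twain, Mark")         # rendered again
```

The harder part, not shown here, is working out which pages a given change affects: a new subject heading on a work touches the item page of every manifestation below it as well as the subject list.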