Software Redesign

This page is intended for working out the design for a new implementation of the PG catalog software/web site.

Problems to be addressed

The rewrite should address the following problems:

Full MARC Compatibility

Don't extract the information from the PG header of posted books. This has always been a source of serious headaches. Despite claims to the contrary, the PG header has never been machine readable. Also, the WWers' conventions are incompatible with MARC.

Have the PPers enter the data for the books they are processing, or harvest the DP database. The catalog should be up to date before the book is posted; posting then just adds the ebook number.

Full Work / Expression / Manifestation / Item Structure

See below.

Allow Cataloging of Not Yet Published Books

Follow the book from copyright clearance to publishing with a status indicator. Users should be able to get an idea of when the book will be ready. Notify them by email?
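A minimal sketch of such a status indicator. Only copyright clearance and posting are named above; the intermediate stage and all identifiers here are assumptions for illustration.

```python
from enum import IntEnum


class Status(IntEnum):
    # Hypothetical stages: only the first and last come from the text above.
    COPYRIGHT_CLEARED = 1
    IN_PROCESSING = 2
    POSTED = 3


def advance(status):
    """Move a book to the next stage; a posted book stays posted."""
    return Status(min(status + 1, Status.POSTED))


current = advance(Status.COPYRIGHT_CLEARED)
```

A catalog page for an unpublished book could show the name of the current stage, and an email notification could be sent when the stage reaches POSTED.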

Support for International PGs

Central database accessed from local servers. Filter records based on copyright region.

Metadata Interchange

Link with Wiki

Respecting the work / expression / manifestation hierarchy. Separate pages for work etc.?

Questions

Work / Expression / Manifestation / Item Structure

Just a suggestion, please modify.

Work

As in "Did you read that book?".

In PG: a catalog entity that has zero or more authors, a unified title and subject headings.

Expression

The work or a translation or adaptation thereof.

In PG: a catalog entity that points to a work entity and has zero or more translators, illustrators etc. and a title statement.

Manifestation

An edition, on paper or other medium.

In PG: a catalog entity that points to an expression and has an (international) PG etext number.

Item

A particular copy of a manifestation, like the one with the coffee stain on page 42 that is shelved on shelf 69.

Here the analogy breaks down somewhat:

In PG: the TXT version, HTML version / the MP3 version, OGG version, etc.


See also: http://www.loc.gov/cds/downloads/FRBR.PDF

Proposed Database Structure

The database contains groups of MARC fields. Groups can point to other groups, forming directed acyclic graphs. (Note that there are no tables for books, authors, subjects etc.)
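As a rough illustration, a group could be modeled as a list of fields plus a list of pointers, with a check that refuses any pointer that would make the graph cyclic. The class and function names are hypothetical, and this is an in-memory sketch rather than a database schema.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """A group of MARC fields plus pointers to other groups (hypothetical model)."""
    name: str
    fields: list = field(default_factory=list)    # (tag, value) pairs
    pointers: list = field(default_factory=list)  # other Group objects


def would_create_cycle(source, target):
    """Return True if a pointer source -> target would close a cycle."""
    stack, seen = [target], set()
    while stack:
        group = stack.pop()
        if group is source:
            return True
        if id(group) not in seen:
            seen.add(id(group))
            stack.extend(group.pointers)
    return False


def add_pointer(source, target):
    """Add a pointer, keeping the graph acyclic."""
    if would_create_cycle(source, target):
        raise ValueError("pointer would make the graph cyclic")
    source.pointers.append(target)


author = Group("author", [("100", "$a Twain, Mark, $d 1835-1910")])
work = Group("work", [("240", "$a Huckleberry Finn")])
add_pointer(work, author)  # a work may point to its author
```

Pointing the author group back at the work would raise ValueError, because the work already reaches the author.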

See also:

Recommendation for 'Digital Surrogates'

Examples of MARC Groups

(Note that the cataloger doesn't have to respect the w/e/m/i hierarchy. She can just as well put everything into one big group, like LoC does. All the records we'll have to copy from the current system will also be put into "one big group". The advantage of w/e/m/i is that if you want to add a subject heading to all 'Hamlet's, you just have to add it in one place.)

An author would be represented by a group of MARC fields that contains exactly one "100" MARC field.

 100 $a Twain, Mark, $d 1835-1910
 500 $a Clemens, Samuel Langhorne, $d 1835-1910
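The notation used in these examples (a tag followed by `$`-prefixed subfields) can be split into parts with a few lines of code. This is only a sketch of the text notation on this page, not of the binary MARC format; the function names are made up.

```python
import re


def parse_field(line):
    """Split a line like '100 $a Twain, Mark, $d 1835-1910' into (tag, subfields)."""
    tag, _, rest = line.strip().partition(" ")
    # Each subfield starts with '$' and a one-character code.
    subfields = [
        (code, value.strip())
        for code, value in re.findall(r"\$(\w)\s*([^$]*)", rest)
    ]
    return tag, subfields


def is_author_group(fields):
    """An author group contains exactly one 100 field."""
    return sum(1 for tag, _ in fields if tag == "100") == 1


twain = [
    parse_field(" 100 $a Twain, Mark, $d 1835-1910"),
    parse_field(" 500 $a Clemens, Samuel Langhorne, $d 1835-1910"),
]
```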

A work would be represented by a group of MARC fields that contains exactly one "240" MARC field. The "work group" can point to zero or more "author groups".

 >>> pointer to author
 240 $a Huckleberry Finn
 650 $a Slavery
 651 $a Mississippi

An expression would be represented by a group of MARC fields that contains exactly one "245" MARC field:

 >>> pointer to work
 245 $a The Adventures of Huckleberry Finn

 >>> pointer to work
 >>> pointer to translator
 240 $a Huckleberry Finn, German
 245 $a Die Abenteuer des Huckleberry Finn

A manifestation would be represented by a group of MARC fields that contains at least one "010" MARC field. The 010 contains the (internationalized) ebook number if and only if the book is legally downloadable in that area. BIBREC pages are generated only for MARC groups containing ebook numbers that match the area the download server is running in. (The database can thus run in a central location and serve multiple copyright regions.)

 >>> pointer to expression
 010 (US) 12345
 010 (AU) 4321
 010 (EU) 555
 250 $a 3. Edition
 260 $a New York : $b Chelsea House, $c 1922.
 300 $a 139 p. : $b ill. ; $c 24 cm.

 >>> pointer to expression
 >>> pointer to reader
 010 (US) 23456
 245 $a The Adventures of Huckleberry Finn $h Audio Book $c Read by Joshua Hutchinson

N.B. "010" should probably be replaced by our own private field.
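The per-region filtering of manifestations could look roughly like this. The sketch assumes each group is a list of (tag, value) pairs and that an ebook number field reads "(REGION) NUMBER" as in the examples above; the function names are hypothetical.

```python
def ebook_number(fields, region):
    """Return the ebook number for `region`, or None if the group has none."""
    for tag, value in fields:
        if tag == "010":
            area, number = value.split()
            if area == f"({region})":
                return int(number)
    return None


def visible_groups(groups, region):
    """Keep only the manifestations that may be served in `region`."""
    return [g for g in groups if ebook_number(g, region) is not None]


print_edition = [
    ("010", "(US) 12345"),
    ("010", "(AU) 4321"),
    ("010", "(EU) 555"),
    ("250", "$a 3. Edition"),
]
audio_book = [
    ("010", "(US) 23456"),
    ("245", "$a The Adventures of Huckleberry Finn $h Audio Book"),
]
```

With this sample data a server running in the EU would generate a BIBREC page only for the print edition, while a US server would generate one for both.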

To generate a BIBREC page, the catalog software will start with the MARC group containing the ebook number and resolve all pointers. This will result in one big MARC group containing all fields (with "near" fields overriding "far" ones).
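A minimal sketch of this pointer resolution. It assumes groups are plain dicts with "fields" and "pointers" keys, and it interprets "overriding" as: a nearer group's fields replace all farther fields with the same tag. Both the representation and that interpretation are assumptions.

```python
def resolve(group):
    """Flatten a group and everything it points to into one tag -> values dict."""
    merged = {}
    for target in group.get("pointers", []):
        # Far fields are collected first ...
        for tag, values in resolve(target).items():
            merged.setdefault(tag, []).extend(values)
    # ... then this group's own (nearer) fields replace far fields of the same tag.
    own = {}
    for tag, value in group["fields"]:
        own.setdefault(tag, []).append(value)
    merged.update(own)
    return merged


work = {"fields": [("240", "$a Huckleberry Finn"),
                   ("650", "$a Slavery"),
                   ("651", "$a Mississippi")]}
english = {"pointers": [work],
           "fields": [("245", "$a The Adventures of Huckleberry Finn")]}
german = {"pointers": [work],
          "fields": [("240", "$a Huckleberry Finn, German"),
                     ("245", "$a Die Abenteuer des Huckleberry Finn")]}
manifestation = {"pointers": [english], "fields": [("010", "(US) 12345")]}

record = resolve(manifestation)
```

Here the record inherits the subject headings from the work, and resolving the German expression yields its own 240 rather than the work's. Whether repeatable fields such as 650 should be overridden or accumulated is left open.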

Views needed/Caching

It seems that all the various views of the catalog should be fully cached and regenerated only when the underlying data is modified.
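One simple way to get this behavior is a cache that renders a page on first request and drops it when the catalog entry changes. The class below is an illustrative in-memory sketch; the names and the idea of keying pages by ebook number are assumptions.

```python
class ViewCache:
    """Cache rendered pages; re-render only after an explicit invalidation."""

    def __init__(self, render):
        self._render = render  # function: key -> page text
        self._pages = {}
        self.renders = 0       # counts how often a page was actually rendered

    def get(self, key):
        if key not in self._pages:
            self._pages[key] = self._render(key)
            self.renders += 1
        return self._pages[key]

    def invalidate(self, key):
        """Call this whenever the data behind `key` is modified."""
        self._pages.pop(key, None)


titles = {12345: "The Adventures of Huckleberry Finn"}
cache = ViewCache(lambda number: f"BIBREC {number}: {titles[number]}")
first = cache.get(12345)
second = cache.get(12345)  # served from the cache, not rendered again
```

A view that depends on several groups (for example a work page listing all its expressions) would need to be invalidated when any of those groups changes.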

The views I see are: