Volunteers' FAQ
From Project Gutenberg, the first producer of free ebooks.
About the Basics
V.1. How do I get started as a Project Gutenberg volunteer?
What you actually need to do to produce a PG text can be stated very simply:
- Borrow or buy an eligible book.
- Send us a copy of the front and back of the title page, and wait for an OK.
- Turn the book into electronic text.
- Send it to us.
That's it! All the rest of the producing parts of the FAQ are about the details of how different people approach these steps.
Different people find their own ways into PG work, and once in, find their own niches. If you have your own ideas, don't let anything here stop you from pursuing them.
Most people now start by registering at Distributed Proofreaders http://www.pgdp.net, proofing a page or two, and going on from there.
Some people just read the FAQs, go up to their attic, pull an eligible book off the shelf, send TP&V [V.25] in, and start typing or scanning. The next time we hear from them is when they send in [V.46] the completed eBook for posting. It can be as simple as that.
Some people just download existing PG texts, re-proof them very carefully and send in corrections.
Some people find regular collaborators through gutvol-d or the distributed proofing sites, earn a reputation as reliable proofers, and continue working as proofers.
Most people start small, and after a little experience of distributed proofreading or other proofing, begin their PG career as producers.
If you're a typist, cheer now, because you can ignore all the complicated paraphernalia of computer interfaces, and scanners, and the quality of OCR software and the mistakes it makes. You can just sit down at the keyboard with your eligible [V.18] book.
If you're not a typist, start thinking about scanners. It may be a while before you're ready to start scanning for yourself, but it's never too early to find out about them.
As soon as you have a solid grasp of how to turn a book into an etext, please start thinking about how you're going to become a producer. While proofing work is valuable, PG can only add books when someone makes the effort to actually make etexts from them, and the people who run distributed and co-operative proofing projects have to do a lot of work before and after the proofing step; we want to spread that around as widely as possible. Project Gutenberg needs more producers!
Whatever you do, if you are working outside a Distributed Proofing project, don't just hang around expecting someone to offer you a task to undertake. There is no "head office" where overworked staff occasionally need interns to do filing and odd-jobs. There are maybe 200 fairly regular contributors to PG, producers and significant proofers. We almost never meet each other in person. We have jobs, and families, and other interests. We work for PG when we can, and when we want to. In many ways, you could look at us as 200 unrelated people, each doing our own etext project, using Project Gutenberg as an umbrella group that sets loose standards, files copyright proofs and provides secure placement for the finished texts. Since we each have our own self-assigned single-person tasks, there isn't too much room to delegate some of that work to a beginner. By all means, volunteer for some tasks — on the Volunteers' Board, or in gutvol-d — but you should think in terms of defining your own tasks, and making your own contribution.
Orientation.
Absolutely everyone — scanners, typists, proofers — should first spend some time working on a distributed or co-operative proofing project. This will allow you to get a feel for what happens in making an etext from paper pages without committing you to more than a few hours' work.
This is not in any way an institutional requirement, since we don't have any institutional requirements, but it is very good advice. Many volunteers start eagerly, wanting to do lots of PG work, and then drop out because they took on too much, too fast, without understanding the nature of the work. Don't let that happen to you. Take it in small chunks.
Check out these distributed proofing sites:
- PG's Distributed Proofreaders: http://www.pgdp.net/
- Distributed Proofreaders of Europe
and spend a few hours over a couple of weeks just processing some pages for real.
These two sites are very similar, and either may tackle any text. The original Distributed Proofreaders produces texts in all languages, but a large part of its production is in English. Distributed Proofreaders of Europe also produces texts in English, but it concentrates largely on other Western and Eastern European languages.
While you're doing that, you should also join a couple of PG mailing lists [V.12] — gutvol-d and either the weekly or monthly Newsletter list. Reading these will start to get you connected to what's going on. Browse the Volunteers' Board — there may be some offers going, and there's a lot of experience captured in some of those "back-issues", so don't confine yourself to the front page.
Inform yourself on e-text issues generally, not just within Project Gutenberg. Explore The On-Line Books Page [R.5] and find other eBooks available on-line.
Have a look at our In-Progress List and some lists of suggestions from others [B.4].
Look at sites like Pluckerbooks <http://www.pluckerbooks.com/>, Memoware <http://www.memoware.com> and Bookshare <http://www.bookshare.org> to learn how our work is being used as a basis and copied and converted and amplified in many other projects.
Above all, read a few Project Gutenberg eBooks! You don't have to read them in full; you don't need to spend weeks poring over Dostoyevsky or studying Shakespeare. Just download a few and skim them — you'll absorb what a PG text should be quite painlessly, and maybe you'll get caught up in the story! If you're looking for light reading, and can't think of something that you specifically want, how about these all-time favorites:
- The Gift of the Magi, by O. Henry
- The Lady, or the Tiger?, by Frank R. Stockton
- A Christmas Carol, by Charles Dickens
- Alice in Wonderland, by Lewis Carroll
- Anne of Green Gables, by Lucy Maud Montgomery
- The Marvelous Land of Oz, by L. Frank Baum
- A Princess of Mars, by Edgar Rice Burroughs
- Heidi, by Johanna Spyri
- A Connecticut Yankee in King Arthur's Court, by Mark Twain
- Black Beauty, by Anna Sewell
- Tarzan of the Apes, by Edgar Rice Burroughs
- Tom Swift and his Motor-Cycle, by Victor Appleton
- Rebecca Of Sunnybrook Farm, by Kate Douglas Wiggin
- Little Lord Fauntleroy, by Frances Hodgson Burnett
- Aesop's Fables
- Grimms' Fairy Tales
- The Art of War, by Sun Tzu
- Dracula, by Bram Stoker
- Swiss Family Robinson, by Johann David Wyss
- The War of the Worlds, by H.G. Wells
If you have a taste for detectives and mysteries, there's
- The Adventures of Sherlock Holmes, by Arthur Conan Doyle
- Monsieur Lecoq, by Emile Gaboriau
- The Mysterious Affair at Styles, by Agatha Christie
- Arsene Lupin, by Edgar Jepson & Maurice Leblanc
- Edgar Allan Poe's "The Gold-Bug" and
- "The Murders in the Rue Morgue" in The Works of Edgar Allan Poe V. 1
For the excessive buckling of various swashes, see:
- The Prisoner of Zenda, by Anthony Hope
- The Man in the Iron Mask, by Dumas, Pere
- The Three Musketeers, by Alexandre Dumas
- Treasure Island, by Robert Louis Stevenson
- The Scarlet Pimpernel, by Baroness Orczy
Effen youse got a hankerin' for a Western, there's:
- Riders of the Purple Sage, by Zane Grey
- The Virginian, Horseman Of The Plains, by Owen Wister
- Back to God's Country, by James Oliver Curwood
- Selected Stories by Bret Harte
- Jean of the Lazy A, by B. M. Bower
Or if you prefer your fiction more domesticated, there's:
- Little Women, by Louisa May Alcott
- Pride and Prejudice, by Jane Austen
- The Warden, by Anthony Trollope
- The Heir of Redclyffe, by Charlotte M Yonge
- Mother, by Kathleen Norris
For something to raise a smile, you can rely on:
- The Devil's Dictionary, by Ambrose Bierce
- The Wallet of Kai Lung, by Ernest Bramah
- The Importance of Being Earnest, by Oscar Wilde
- Three Men in a Boat, by Jerome K. Jerome
- Piccadilly Jim, by P. G. Wodehouse
If poetry is your thing, you have lots to choose from:
- Shakespeare's Sonnets
- Project Gutenberg's Book of English Verse
- The Home Book of Verse, edited by Burton Stevenson
- The Complete Poems of Henry Wadsworth Longfellow
- Leaves of Grass, by Walt Whitman
Now, that's just a handful from our over 10,000 eBooks, so don't tell me you can't find anything to read! If you do have ideas of your own, download GUTINDEX.ALL and browse through the whole list, or Browse by Author on the website at <http://www.gutenberg.org/catalog/world/search>.
Download a few. Read them on your PC, or reformat them and print them out, or convert them for your PDA. Get used to working with and formatting text. Look at the formatting decisions that earlier volunteers have made — they're not entirely consistent; different people make different choices, different books require different methods, and PG conventions have shifted slightly over the last 10 years — but they're all perfectly readable and convertible today.
If you find typos [R.26] in any of them, tell us! That's also a part of being a Gutenberg volunteer. Our eBooks improve with time!
If you're thinking of making the best use of your time looking for errors in posted texts, a good start would be to download 40 or 50 texts, and run a spelling checker and gutcheck [P.1] on them all, spending only 5 or 10 minutes on each. Having had a quick look at all of them, concentrate on the ones that seem to have most problems — where automated checkers see 10 problems, a careful human will usually be able to pick up 20.
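The triage described above can be sketched in a few lines of Python. This is only an illustration: the file names and tallies are made up, and the sketch assumes you have jotted down by hand how many problems the spelling checker and gutcheck flagged in each text. It does not read either tool's output.

```python
def rank_by_problem_count(problem_counts):
    """Return (filename, count) pairs, most-flagged texts first."""
    return sorted(problem_counts.items(), key=lambda item: item[1], reverse=True)


# Hypothetical tallies noted after a 5 or 10 minute look at each text.
counts = {"text_a.txt": 3, "text_b.txt": 17, "text_c.txt": 0, "text_d.txt": 9}

# Concentrate on the two texts that look worst.
for name, flagged in rank_by_problem_count(counts)[:2]:
    # Rule of thumb from the text: a careful human finds about twice
    # as many problems as the automated checkers flag.
    print(f"{name}: {flagged} flagged, perhaps {2 * flagged} findable by hand")
```

The factor of two is just the FAQ's rule of thumb, not a measured figure.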
Getting Productive
OK, so you've seen what etexts should look like, you know what we do, and proofing hasn't scared you off. It's time to step up and become a producer. If you're not a typist and you don't have a scanner, take a detour down to the Scanning FAQ [S.1] now, and come back when your scanner is set up. If you're a typist or you've already got a scanner, read on . . .
Get a book. Just do it, OK?
Ya gotta start somewhere, right? And finding an eligible book is definitely somewhere.
Finding an eligible book is a threshold for many beginning volunteers — it's the first major step on the way to producing. For a lot of people, it's also the toughest barrier they have to cross. Fortunately, the barrier is only psychological, and can be crossed in a few minutes.
It's an unfamiliar process, and one that a lot of beginners feel some anxiety about. Don't. It's quite straightforward: it's just buying a book — you've done that, haven't you? Don't over-think it, don't worry about whether you're making the "right" choice, don't spend months comparing lists and choosing. Just do it. Once you've got your first, you'll wonder what all the fuss was about. Thanks to the wonders of the internet, your book can be on its way to you in an hour if you have $20 to spend.
Typists blessed with a good local library don't even have to buy their books — they can just borrow one and type it up! (You may be able to scan a library book, but get some experience with scanning first, and avoid damage!)
Let's deal with the decisions and other issues of picking one.
Copyright
For your first book, don't try getting fancy with copyright issues. Choose one that was published before 1923, and you're in the clear for U.S. and PG copyright purposes. You can read the dates just as well as we can — with books printed before 1923, there are no hidden catches: "Pre-'23 is free". Just read the TP&V [V.25] of the book, and see that it was printed before 1923, and you have no problems. Of course, reprints [V.19] of books copyrighted pre-1923 (and various other cases) are also clear, but if you have any concerns, just stick to pre-'23 editions.
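As a minimal sketch, the "Pre-'23 is free" rule of thumb amounts to a single comparison. The function name below is hypothetical, and the check covers only the simple case described above; it ignores reprints and the other exceptions, and it is no substitute for sending in the TP&V and waiting for clearance.

```python
CUTOFF_YEAR = 1923  # from the text: "Pre-'23 is free"


def is_simple_pre_1923_case(printed_year):
    """True if the year printed on the TP&V falls under the simple rule."""
    return printed_year < CUTOFF_YEAR


for year in (1901, 1922, 1923, 1927):
    verdict = "simple case" if is_simple_pre_1923_case(year) else "not the simple case"
    print(year, verdict)
```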
Which book?
The answer to this question is different for everyone, but see how much you agree with the following statements:
"I have a favorite book, and I'd really like to produce that."
Well, hey, this is no problem! You already know what you want. Go check out whether the book is already on-line [V.29].
"I'd like to work on an important book, but I don't know which."
Well, everybody's definition of "important" is different, but some people have put their various ideas forward already; you can see whether you agree with them! The InProg List contains some, with the notation "Suggested book to transcribe" beside them. Steve Harris keeps a list of unproduced possibles at Steveharris.net. John Mark Ockerbloom's "Books Requested" page lists titles that people have asked for. [B.4] Your problem if you fall into this category is that other people probably wanted to produce "important" books too, and lots are already done.
"I just want an easy, trouble-free book to start with."
Your first book doesn't have to be War and Peace (we've already got that anyway!). Here's a tip: try looking for children's or what we would nowadays call "Young Adult" books. These are typically short, and may have large print, which makes life much easier if you're scanning. They age well: children's stories from a century or more ago are still readable and interesting to children today. We have many children's and YA eBooks: not just the classics like Grimm and Andersen and Heidi and Oz and Peter Pan and William Tell, but lesser-known but still enchanting stories like The Counterpane Fairy, or Lang's Fairy books. There are series, like the Motor Girls, or the (Country) Twins series, or the Bobbsey Twins. There is lots and lots of material here for you to start with, and these books are relatively plentiful, since they were made to take the kind of treatment children dish out, and many of them have been in school libraries or attics for years.
Whatever your choice, pick a book that you'll like; you'll be living with it up close and personal for a while. Light reading, adventure fiction, and books aimed at younger readers are safe first choices for most people. If you admire 19th Century scientists or scholars, and want to immortalize their work, great! But don't feel that you have to dive in at the deep end just because someone else wants you to.
Getting your book: a practical exercise
The Search
At this point, you've got a list of books — maybe just one, maybe several by an author or two, maybe just a genre like "Children's Books" with some specific ideas. Maybe your mind is still wide-open.
Before used booksellers had the Net, finding a particular old book was a daunting job. Booksellers had informal networks among themselves and exchanged catalogs so that each would know something about what was available elsewhere, but, for a buyer, finding a particular book was still hit-and-miss. Now, however, a number of large sites provide a service to booksellers, where they can list their inventories for people to search from anywhere.
So now we go hunt for them on the Net. No, you don't have to buy them on the Net — you can rummage in booksales and garage sales and used bookstores, and that's its own kind of fun, though on a physical hunt, what you need is to bring a long list of "already done" books with you. But even if you never buy over the Net, it's a vast source of information about what books are available, which are plentiful, and which are cheap. It gives you some experience of what to expect when you do your in-person browsing.
Here's a story of a typical Net-hunt. And you can follow along with it at home. :-) Your results, and the sites you end up at, will be different from mine, but even if you don't end up buying a book on this hunt, you'll get some experience of what's involved. C'mon, do it with me — see if you can find a better bargain!
I'm starting with two lists, and I'll follow up whatever seems promising. I'd like to spend about $20, though I might go to $30. Definitely not interested in $50 and up. I'm keeping in mind that I'll have to add a bit for delivery (usually up to $10 within the U.S.), but it can get expensive if you're in Perth and ordering from a bookstore in Munich.
I'm also avoiding anything that might be tricky to clear on this search, and confining myself to books printed before 1923.
Of course, by the time you read this, some of these books may already have been produced, so if you're actually thinking of buying any, check carefully first!
My first shortlist consists of books that caught my eye from David Price's In-Progress List, Steve Harris's site, and The On-Line Books Requested page [B.4], and it reads:
- Louisa May Alcott: The Inheritance
- E. W. Hornung: Irralie's Bushranger
- E. W. Hornung: Stingaree
- A. A. Milne: The Dover Road
- A. A. Milne: Once on a Time
- Samuel Richardson: Pamela
- Oscar Wilde: The Critic as Artist
As well as following along with my list, you should try finding two or three books of your own, from those sites or from your own preferences, and search for them in the same ways that I do.
Everyone has their own searching technique and their own favorite sites to search. For this session, I'm opening up three copies of my browser — one for Alibris <http://www.alibris.com>, one for Abebooks <http://www.abebooks.com>, and one for the Catalog of the Library of Congress <http://catalog.loc.gov>. I'll do my initial searches on Alibris and Abebooks, and keep the LoC site handy for reference.
In Alibris, I head straight for the Advanced Search page, since they allow searching by date, and I immediately put "before 1923" into every search, which avoids having to scan through modern reprints. In Abebooks, I choose "Hardcover" in their advanced search, which is not quite as good a filter, but does at least screen out recent paperback editions.
In each of the sites, I just enter the author's surname and one word from the title of each book, and look at the search results.
Louisa May Alcott's "Inheritance" looks like it's going to be tough. I don't find it in either of my two bookstores. On doing a little checking with modern bookstores, I find it was her first novel, written when she was 17, and as far as I can see, not published during her life: apparently only recently published — the LoC site has nothing prior to 1997. A disappointing start to my search. I understand why it's very desirable to get it online, but this one's going to be very tough to clear, and I'm staying away from it.
E. W. Hornung's "Irralie's Bushranger" is also elusive: it doesn't show up at either of my sites, so I check out the LoC to confirm I have the title right, and yes, there it is: "Irralie's Bushranger, a story of Australian adventure, 1896." So I widen my search by visiting <http://www.trussel.com/f_books.htm> and searching many of the sites there. Still no luck. If I were particularly eager to get this book, there are several things I might do at this point: I might register a "want" with one of the sites, asking to be notified when a copy is listed; I might use the OCLC WorldCat search (which Abebooks calls "Find it at a local library") where I can locate libraries that have copies; or I might even contact some individual booksellers and make a request that they look for it. Some booksellers actually specialize in looking for hard-to-find books; but of course I expect I'd have to pay a bit more for it when they do find it, and given my success with the rest of my list, and my price bracket, there seems no need to go that far today.
Hornung's "Stingaree", by contrast, seems to be everywhere, in several editions, and cheap. It must have been a bestseller in its day, which is not surprising from the author of "Raffles". 1902, 1905, 1909 editions abound. The cheapest are 1910 and 1907 editions for $4.95 and $5.00 from booksellers listed at Abebooks.
Milne's "Dover Road" is available from both sites. There seems to have been a Putnam's printing in 1922 of "Three Plays: The Dover Road. The Truth About Blayds. The Great Broxopp." of which lots of copies survive. There also seem to be later printings which would qualify as reprints if I were desperate, but the 1922 edition is priced from $12.00 to $50.00, so I'll take the 1922 $12.00 copy from Abebooks. As a bonus, I don't see the other two plays listed as being online anywhere, so I'll get three texts (and short ones, too! — 279 pages for all three) for the price and effort of one.
Milne's "Once on a Time" is a bit less common, but once again a Putnam's printing of 1922 keeps it in the race. There are a couple of booksellers in England selling for 15 pounds (which just about makes my $20 threshold) and 20 pounds, and an ex-library copy going for $25.
There are lots of eligible copies of "Pamela" available, ranging from a fourth edition at a mere $4,999 (no, thanks!) to a 1921 printing at $6.60 at Alibris. I'll take that one, please.
Wilde's "Critic as Artist" is fairly widely available. A 1905 edition of "Intentions: the Decay of Lying; Pen Pencil and Poison; the Critic as Artist; the Truth of Masks" is available at Alibris for $8.80 (with other copies of the same edition there and on Abebooks in the $20-$30 range), and Abebooks lists a London 1919 edition at $12.50. There are several copies listed in both places as "undated" and "reprints"; I'm avoiding these, since while it's quite likely that they might be clearable, I'm not taking risks on this search.
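If you like to keep score while hunting, the budget check for this first list can be sketched as below. The prices are the cheapest dollar prices quoted above; the $20 target and the $10 delivery allowance are the figures I gave at the start of the hunt, and the function names are just illustrative. "Once on a Time" is left out because its best prices were quoted in pounds.

```python
# Cheapest dollar prices found in the search above.
finds = {
    "Stingaree (1910)": 4.95,
    "Three Plays, incl. The Dover Road (1922)": 12.00,
    "Pamela (1921)": 6.60,
    "Intentions, incl. The Critic as Artist (1905)": 8.80,
}

TARGET = 20.00        # preferred price for the book itself
MAX_DELIVERY = 10.00  # "usually up to $10 within the U.S."


def within_target(price, target=TARGET):
    """True if the book's listed price is at or under the target."""
    return price <= target


def worst_case_total(price, delivery=MAX_DELIVERY):
    """Listed price plus the largest delivery charge expected."""
    return round(price + delivery, 2)


# List the finds from cheapest to dearest, with a worst-case delivered cost.
for title, price in sorted(finds.items(), key=lambda kv: kv[1]):
    if within_target(price):
        print(f"{title}: ${price:.2f}, up to ${worst_case_total(price):.2f} delivered")
```

Overseas delivery would need a larger allowance than the one assumed here.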
My second list isn't a list — just a vague category: children's books that are easy to do.
I go to Alibris' Advanced Search, and enter "Child's" in the title, and pre-1923 in the date, and, excluding titles already on-line, immediately get:
- A Child's History of France $13.20
- A Child's Story of the Bible $5.50
- First Lessons in Botany or The Child's Book of Flowers $13.20
- The Child's Book of American Biography $11.00
- The Child's First Bible $8.80
- The Child's Music World $8.80
and so on through quite a list.
OK. That's a good start. But my choice so far is unimaginative. I need better search terms. So I go to main search engines with the terms "children's antiquarian books" and find a half-dozen or so sites that specialize in them. I can browse around there, though it's slower going without searches to focus my results. I find <http://www.bookrescue.com>, specializing in children's books. Wading through the miles and miles of Alcotts and Barries and Burnetts, which are mostly already online, I think, I find a couple of authors from them who must have been popular, because they seem to have published lots of books before 1923: Angela Brazil and Dorothy Canfield. (I only got as far as the "C"s!)
I could of course stop here and buy some, but today I want to see what else is out there.
Back at Alibris and Abebooks, armed with my authors to search by, I turn up 4 pre-1923 books under $20 for Angela Brazil:
- A Terrible Tomboy
- The Youngest Girl in the Fifth
- A Fourth Form Friendship
- A Pair of Schoolgirls
and several between $20 and $30.
Dorothy Canfield immediately yields multiple copies of:
- The Brimming Cup
- Home Fires in France
- Hillsboro People
- Understood Betsy
- Rough Hewn
- The Real Motive
and others, and I haven't even got to $20 yet, nor to the letter "D".
A browse through the Ebay Collectible and Antiquarian Books section also throws up a respectable list of eligibles. I won't even bother counting that.
In 20 minutes, I found five of the seven on my search list. In less than an hour after that, I found over 16 eligible children's books, all under or around $20 and all available to buy online.
Before committing to one, though, I would double-check that the book hasn't been transcribed online, and isn't In Progress.
Double-checking your selection
If you're concerned that the book you have chosen duplicates another that might be in progress, and want to double-check, you can e-mail the Posting Team asking them to check whether any recent clearances have come in for that title.
Duplications do happen — there's no way of avoiding them when different people are making independent decisions — but they are rare.
Dealing with used booksellers
As a class, used booksellers are very pleasant people — remarkably friendly, knowledgeable and helpful, even to people buying on a typical Gutenberger's budget.
Some of them are not, however, models of ideal data organization when it comes to Internet listings. There are lots of one- or two-person operations dealing with an inventory of many thousands of books, and having located your book online, you should check that it's still available.
You can place an order through the site and wait for the confirmation, or you can simply call the bookseller. Not all booksellers' contact details are listed, so it's not always an option, but when you do phone you're likely to be speaking immediately to someone who can tell you for sure whether the book is still there, can pull the book off the shelf and answer questions about it, and can take your credit card details on the spot and dispatch the book immediately.
Copyright Clearance
As soon as your book arrives, send us the information needed for Copyright Clearance first. Even if your book is a true-blue, no-questions-asked pre-1923 edition, we should know about it as soon as possible so that it can go onto the In-Progress list for others to see that someone has started on it.
Wait for the confirmation e-mail before starting any serious work. Some people have thought that "Copyright 1923" plus some wishful thinking would be good enough, and, unfortunately, it isn't. Some people have gone ahead and produced the whole book before sending in the clearance, only to be disappointed, all their work wasted.
Books published in 1922 or earlier are clearable, but some people, ever optimists, overlook that little "1927" in small print on the verso. Sometimes there is no copyright date on the front, and other optimists assume that these books are OK. They may be; they may not be. Don't get caught in the copyright trap.
As soon as you have what you think might be an eligible book, do not start on it. Do not ask another volunteer's opinion. Just send in the TP&V and wait for the confirmation e-mail to find out for sure.
Even when your TP&V clearly says "Copyright 1901", send it in. We need to get it into the clearance files so that we can register it as being In-Progress.
Producing
If you're a typist, there's not much more you need to know from this point: you can just get on with the job, with maybe a few tips from the FAQ. In fact, if you're a typist, you might wonder why the rest of us make such a fuss about scanners, and settings, and OCR. Take pity on us! We just can't produce the way you can. Smile indulgently, ignore all the scanner jargon, and submit your completed text while we're still saying bad words about the guttering on a greyscale image of page 372. :-)
If you are using a scanner to copy a book for the first time, be patient with yourself. Some people start off with too high expectations of what they can achieve. Believe it or not, scanning does work effectively; it just doesn't work perfectly. And often, you need a little practice before your scans work right with your OCR. The Scanning FAQ [S.1] has lots of specific tips you can try. Start by scanning a double-page about a third of the way through the book. Scan in Black and White and in Greyscale, at 300dpi and 400dpi. Try 600 dpi if it seems like a good idea. Put it through your OCR and see what comes out. Move your scanner so that you can be comfortable while placing the book and turning pages. Allow yourself an hour to experiment with different settings, and different pages. Put the sample images included with the Scanning FAQ through your OCR and see how the output compares to the text produced by other packages. That first hour finding out about how your setup works will be the most valuable hour of scanning you will ever do.
Having figured out what settings you want to use for this book, make sure you implement the best speed you can. Usually this means telling the scanner to scan only as much area as the book covers. This is quite important, since the scanner will by default scan its whole area, and you don't need all that; it just wastes time and makes your images bigger.
You may also be able to set your OCR or scanner software to auto-scan pages with some preset delay, like 5 seconds. This also speeds things up, because the scanner isn't waiting for you to hit the keyboard, and you have both hands free at all times to turn the page and replace the book. It takes a few pages to get into the rhythm; if you miss a page-turn, don't worry — you can get it on the next scan.
Using a reasonably modern but quite ordinary home/office type flatbed scanner, you should be able to scan 200 pages an hour [S.9] of a typical book, at good quality. 400 pages an hour is not unheard-of. Now, it may fairly be said that scanning offers all the fun of ironing, without the sense of adventure :-), but if you have got your settings right, you will probably be able to do the whole job in less than two hours. And now you're really on the road!
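The arithmetic behind that estimate is simple enough to sketch. The function name is hypothetical; the rates are the 200 and 400 pages an hour mentioned above, and 279 pages is the length of the Milne "Three Plays" volume from the book hunt earlier in this answer.

```python
def scan_hours(pages, pages_per_hour=200):
    """Estimated scanning time in hours at a steady rate."""
    if pages_per_hour <= 0:
        raise ValueError("pages_per_hour must be positive")
    return pages / pages_per_hour


# Compare the typical rate with the "not unheard-of" one.
for rate in (200, 400):
    minutes = scan_hours(279, rate) * 60
    print(f"279 pages at {rate} pages/hour: about {minutes:.0f} minutes")
```

At the typical rate, any book of up to 400 pages fits inside the two hours mentioned above, not counting the first hour spent experimenting with settings.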
V.2. What experience do I need to produce or proof a text?
None.
For producing, you will have to be able to type pretty well, or have a scanner.
For proofing someone else's text, when you don't have a copy of the book in front of you, you should be reasonably familiar with the language used in the book, and the styles of the time: Chaucer's English was quite different from ours, and even 19th Century novelists wrote some phrases unfamiliar to us today.
That's it. You don't need experience in publishing, editing, or computers.
V.3. How do I produce a text?
There are acres of words in this FAQ about that, but it all boils down to 4 simple steps:
- Get an eligible book — pre-1923, or one of the exceptions. Pull it from your attic, borrow it from a library or a friend, buy it in your local bookstore, in a flea-market or on-line. We don't care which.
- Send us a copy of the front and back of the title page so we can file proof of copyright clearance.
- Copy the text from the book into a computer text file. We don't care whether you type it, scan it, voice-dictate it, or think of some totally new way to do it. Just get it into a file.
- Send us the computer text file.
That's all there is to it!
V.4. Do I need any special equipment?
You need the use of a computer of some kind, and Internet access is usual, though we have had some volunteers contribute texts on floppy disks.
If you intend to scan books, you will need a scanner, but if you're just typing or proofing you won't.
V.5. Do I need to be able to program?
Absolutely not! Very little of Project Gutenberg's work involves programming, and it is never necessary to any part of volunteering.
V.6. I am a programmer, and I would like to help by programming. What can I do?
See the Tools FAQ.
V.7. What does a Gutenberg volunteer actually do?
We buy or borrow eligible books, scan, type, and proofread. There are a few other activities, but they consume only a very small fraction of volunteer time.
V.8. Can I produce a book in my own language?
Yes! We want to encourage people to produce books in all languages, and we cheer when we can add a new language to the list.
V.9. Does it have to be a book? Can I produce pieces from a magazine or other periodical?
Magazines, newspapers, and other publications are just fine. For copyright clearance, they work just the same way as a book.
You do need to check the length of your piece [V.17]; we don't want a zillion separate one- or two-page files. If the piece you have in mind isn't long enough, you can add other pieces to it, or even most or all of the magazine. If the work was serialized over multiple issues, you can join them together for your PG text, but you do have to get copyright clearance for every issue of the magazine from which you copy material.
We have many copies of periodicals posted already to PG, including Scientific American, The Mirror, and Punch.
If you have lots of old periodicals, you could even take one piece from several, and make a new text which is a "theme" anthology of those pieces. You can give it an appropriate title: "Civil War Commentaries from X magazine 1892-1898."
V.10. Do I have to produce in plain ASCII text?
Certainly not if it doesn't make sense. To take an extreme example, if you're working in Japanese or Arabic, or creating audio files, there is no point in trying to reproduce that in ASCII!
Where the text can largely be expressed in ASCII, we do want to post an ASCII version, even if it is somewhat degraded compared to the original. However, we will post your file in as many open formats as you want to create, so that your original work is available for those who have the software to read it.
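As an illustration of what a "somewhat degraded" ASCII version can mean, here is a minimal sketch in Python. It is not a PG tool, and the function name to_ascii is our own invention; it simply strips accents and replaces anything that still has no ASCII equivalent with a question mark. A real conversion usually needs hand-checking afterwards.

```python
import unicodedata


def to_ascii(text):
    """Return a rough 7-bit ASCII rendering of text (illustrative only)."""
    # Split accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII is replaced by "?" so the loss is visible.
    return stripped.encode("ascii", "replace").decode("ascii")
```

For example, to_ascii("Brontë") gives "Bronte". Characters with no simple base letter come out as "?", which at least makes it obvious where a human decision is needed.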
V.11. Where do I sign up as a volunteer?
You don't. We have no formal sign-up process, no list of volunteers, no roll-call. If you produce a PG eBook, or help to produce one, you are a volunteer.
V.12. How do PG volunteers communicate, keep in touch, or co-ordinate work?
We are very scattered geographically: U.S., Australia, Brazil, Taiwan, Germany, South Africa, Italy, India, England, and all over the world, so we can't really meet for coffee on Thursdays. :-)
Most co-operation and co-ordination goes on by private e-mail. This is efficient for volunteers who have worked with each other before, since they know each other's interests and skills, but not so easy for beginners to break in on, since they don't.
There are a few Project Gutenberg mailing lists; see "How to subscribe to our mailing lists" for instructions.
The Project Gutenberg Weekly and Monthly Newsletters, gweekly and gmonthly, are one-way announcements, which allow PG to communicate with non-volunteers who are interested in the eBooks we produce, but they also contain notes and requests for assistance from volunteers.
The Volunteers' Discussion Mailing list, gutvol-d, is an e-mail discussion forum for subscribers about any Gutenberg topic.
The Volunteers' List, gutvol-l, is for private announcements for active volunteers.
The DP Forums at the Distributed Proofreaders sites focus mostly on DP issues, but are lively and interesting: http://www.pgdp.net/phpBB2/ .
There are some other, specialized, closed lists for people who do specific work within PG:
The Posted List, posted, is for people who perform indexing on our texts. An e-mail is sent to this list every time we post a text (see the FAQ "How does a text get produced?" [V.16] section 5: Notification) and the members of the list use it to update their catalogs.
The Whitewashers' List, pgww, is for Posting Team internal messages.
V.13. Where can I find a list of books that need proofing?
There is no central list of this kind. There are distributed proofing projects, currently at
- PG's Distributed Proofreaders: <http://www.pgdp.net/>
where you can proof parts of a book. This is advisable when you're just starting out because it gives you some feel for what the work is like.
You can also look up existing, posted texts from the archives and proof them. Just as there always seems to be one more bug in any given program, there always seems to be one more typo in any given text! Download a few, and scan quickly for problems by doing a spellcheck or other automated check; if you can find any problems quickly, then there are likely others to be discovered by a careful proofing.
V.14. Is there a list of books that Project Gutenberg wants?
No. Project Gutenberg, as such, does not "want" any specific books. Individual volunteers choose what books to produce. Nobody gives orders to volunteers about what they should work on. Nobody has an official "hit-list" of books to add to the archives.
Of course, individual volunteers and non-volunteers have their preferences, and may suggest books to transcribe. Lists of such suggestions pop up every so often, and are often useful to people looking for ideas.
There are usually some suggestions in David Price's InProgress list. The On-Line Books Page has a section where people can list requests, and Steve Harris has a site devoted to lists of books not yet in Gutenberg or elsewhere. Treat all of these lists with some caution, since someone may have started or even finished one of their suggestions since they were last updated.
- PG Books In Progress <http://www.dprice48.freeserve.co.uk/GutIP.html>
- On-Line Requested List <http://onlinebooks.library.upenn.edu/in-progress.html#requests>
- Steve Harris' "To-do"s <http://www.steveharris.net/PGList.htm>
V.15. I have one book I'd like to contribute. Can I do just that without signing up?
Well, since there is no formal sign-up, of course you can! A lot of texts have been contributed by people who just wanted to immortalize one favorite book. Many of them had already created the eBook before they even heard of Project Gutenberg, and we're always delighted to add these to the archive!
About production
V.16. How does a text get produced?
As stated back in the Basics section, all you need to do is:
- Borrow or buy an eligible book.
- Send us a copy of the front and back of the title page.
- Turn the book into electronic text.
- Send it to us.
That's all you actually need to know in order to be a producer. But if you're interested in the details of how other people actually do this, and want to know what else happens behind the scenes, here's a full, blow-by-blow account.
1. Finding an eligible book
Volunteers find eligible books [V.18] in all sorts of ways. Some lucky people have them in their bookshelves, or their attic. A lot of people have a good library nearby, where they can find books, or request them on interlibrary loan. Some people are big eBay fans; others like to hunt for bargains on specialist booksites. And of course lots of volunteers enjoy rummaging through actual used bookstores, or local markets, or yard sales.
Even if you're not going to take on a book yourself right now, search for some on the Net and find out about how to get a copy. Next time you pass an antiquarian bookstore, or a book market, drop in and browse. Ask your local library about interlibrary loans. Eligible books aren't hard to find once you know where to look.
2. Copyright Clearance
New volunteers sometimes find it hard to understand why this is so important, and why, in particular, Project Gutenberg is so careful about it. At base, it's simple: by keeping a filed copy of the TP&V [V.25] of every book we produce, we can at any time protect our publications against claims from publishers that they "own" the work, and thus we can keep them available to the public.
The copyright laws can be difficult to understand, and sometimes it may take serious research to prove that a particular edition is actually in the public domain. If you're not legally-inclined, just keep repeating "Pre-'23 is free" if you're in the U.S.A. and stick to books published before 1923. If you do want to delve deeper, read our Copyright Rules page at <http://www.gutenberg.org/howtos/copyright-howto> and then go on to reading the Library of Congress Copyright Office official papers at <http://www.copyright.gov/>. If you're in another country, find out about your own copyright laws.
Volunteers send in the TP&V from the book for us to inspect. This not only gives us the proof to file, it also lets us know that someone is really working on the text so that we can list it as being In Progress for the information of others who might be interested.
3. Scanning, typing, proofing and editing
This makes up the bulk of PG's effort, and is discussed at great length elsewhere in this FAQ. There are many, many ways to create an etext from a paper book, and different people use different methods, but it all boils down to making a text file. For a typical book, it will probably take 40 hours of a volunteer's time. All that happens here is that somebody makes the effort to transcribe one paper book into a file that can be shared around the world and for all time.
Since "The Slashdot Incident", when Distributed Proofreaders was featured on Slashdot, a popular technical news site, most of PG's production has gone through DP. The production steps there are more formalized, but it's still the same basic formula of scan, OCR, proofread. The big difference is that no one person has to do it all.
4. Posting
[Note: this information is quite specific to the process we go through now. It is quite likely to change as we improve the automation of the tasks.]
Posting is done by the Posting Team. The basic job is to receive the text from the producer, check that it has been copyright cleared, check that it conforms to Project Gutenberg standards, check it for correctness (which can be anything from XML validity to simple spelling), add the Project Gutenberg header and copy the text to the two PG servers.
In a simple case, where everything goes right, this can take as little as fifteen minutes. In a complicated case, where we have to convert formats, or there are a lot of errors in the text, or there are problems with the copyright clearance, it can take hours or even days while we wait for responses, or do a lot of editing, or find conversion tools.
Michael Hart used to do this work entirely alone, but in September 2001, he created the Posting Team to handle the load. (The Posting Team are nicknamed the "Whitewashers" in honor of Tom Sawyer's victims. :-)
Transferring the file
You send the text to us [V.46] either by Web, by FTP (with a username and password that any of the Posting Team can give you privately), or by e-mail.
If you're FTPing, you should e-mail one or more of us as well, to let us know what you've uploaded.
One problem is files that don't transfer correctly. Especially by e-mail, some files get damaged on the way. It's better to ZIP the file before sending, if possible, to prevent some common problems with text files. The use of compression formats other than Zip can also create problems. Members of the Posting Team work on multiple platforms — DOS, Windows, Linux, Solaris — and zipping and unzipping programs are commonly available for all of these. Other compression methods, like Stuffit or bzip2, are not so readily available, and may give us trouble.
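If you don't have a zipping program handy, the step above can be sketched in a few lines of Python using the standard zipfile module. This is just one way to do it, and the function name zip_text_file is hypothetical; any ordinary Zip tool produces an equivalent result.

```python
import zipfile
from pathlib import Path


def zip_text_file(text_path):
    """Put one text file into a .zip next to it and return the zip's path."""
    text_path = Path(text_path)
    zip_path = text_path.with_suffix(".zip")
    # ZIP_DEFLATED is the ordinary Zip compression that common tools can read.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # arcname stores only the file name, not your local folder structure.
        archive.write(text_path, arcname=text_path.name)
    return zip_path
```

Calling zip_text_file("mybook.txt") would create mybook.zip in the same folder, containing just mybook.txt.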
We log in via ssh to pglaf.org, the Unix system on which we work when posting and the same one that you uploaded the file to. There we unzip the file and glance at the top of it.
Checking Clearance.
We then check it for copyright clearance. The one and only absolute rule that we never bend, no matter what, is that we will not post a file that doesn't have a clearance. If it ain't in the clearance files, it don't get posted.
Most regulars know that they should include their clearance line in the web form or e-mail submitting the text, but not everybody does, and not everybody remembers every time. This can be frustrating, when clearance is not included and not obvious.
When you received your clearance on a book, you got what we call a "Clearance Line", something like this:
The Works Of Homer [Iliad/Odyssey] Tr. George Chapman Jim Tinsley 06/14/01 ok
These are saved in files that we posters can access. We regard this information as private, so we don't publish the details of who has cleared what.
When we get the text, we check whether the submitter has cleared it. If there is a clearance line in the e-mail notifying us about the text, there's no problem. If we can find the title of the text under the submitter's name in the clearance files, there's no problem. Unfortunately, sometimes we can't find it. There are two usual reasons: either the text submitted is part of the work cleared (for example, submitting one play from a collection), or the text hasn't been cleared yet. If the clearance isn't straightforward, we can go back and forth and round and round in e-mails for a while.
This is why it's important to paste the clearance line into the web form that you use to upload your etext, or into your e-mail, if for some reason you couldn't use the web form.
If the title of the text you're sending isn't the same as the title of the text cleared, BE SURE to paste in the clearance line AND explain that the text you're sending is PART of the cleared book. Please also list the titles of the other parts; it really does cause confusion and delay when this is not clear.
Checking and Editing
Sometimes, people send in a book in a non-text format like Word Perfect or Microsoft Word, or send a text with unwrapped lines. In that case, we try to get the submitter to fix them, but if they can't, we have to convert the file to straight text before starting.
Some producers, particularly inexperienced ones, want to add non-standard annotations and mark-up and symbols to the text. This can get ticklish; we don't want to discourage them, but we need to keep texts reasonably standard. Usually, we can work something out. Maybe the book should be added in both text and HTML, for example.
Assuming that it's a plain text file, we next run gutcheck and a quick spellcheck on the file. This tells us immediately whether it adheres to PG standards and whether there is any serious problem with it.
If the file looks clean, we may skim it, looking for potential problems or formatting issues. For clean texts, the only things we usually need to change are unindented quotations or inconsistent chapter headings (a lot of people seem to mix "CHAPTER III" with "Chapter 14" and have irregular numbers of blank lines) or spacing and a few 8-bit characters. Occasionally, we have to rewrap a text. We also look out for included publishers' trademarks, which we normally prefer to remove (trademarks are NOT subject to copyright expiration: Macmillan(TM), the publishing house, is still around and trading), unnecessary or downright odd indentation or centering, stray page numbers, and prefaces or introductions or appendices that may not be in the public domain. If the file has lots of 8-bit characters, we probably need to make a separate 7-bit version, and post both.
If the gutcheck and spellcheck don't look clean, or if conversion is required, we may spend a lot more than 15 minutes on it. In a bad case, we may have to get the file re-proofed.
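To give a feel for the kind of automated checking described above, here is a rough Python sketch that flags three of the problems mentioned: over-long lines, 8-bit characters, and chapter headings that mix Roman and Arabic numerals. It is emphatically not gutcheck and does far less; the function name quick_check and the default width of 80 are our own assumptions for the example.

```python
import re

# Matches headings such as "CHAPTER III" or "Chapter 14" at the start of a line.
HEADING = re.compile(
    r"^chapter\s+(?:(?P<roman>[ivxlcdm]+)|(?P<arabic>\d+))\b", re.IGNORECASE
)


def quick_check(text, max_width=80):
    """Return a list of (line_number, message) pairs; 0 means whole-file."""
    problems = []
    styles = set()
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > max_width:
            problems.append((number, "line longer than %d characters" % max_width))
        if any(ord(ch) > 127 for ch in line):
            problems.append((number, "non-ASCII (8-bit) character"))
        match = HEADING.match(line.strip())
        if match:
            styles.add("roman" if match.group("roman") else "arabic")
    if len(styles) > 1:
        problems.append((0, "chapter headings mix Roman and Arabic numerals"))
    return problems
```

An empty list does not mean the text is clean, only that none of these three simple checks fired. Running something like this before you submit can still save a round of e-mails.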
If you are conscious that you're doing something non-standard, and really mean it to stay, say so in your e-mail. (For example, I recently posted a text containing a family-tree representation that had lines over 80 characters. Now, I would have left that one alone anyway, but it helped that the submitter drew my attention to it in the e-mail.) If it's too non-standard, the poster may not allow it to stay, but at least you can discuss it. When a text needs a lot of non-standard formatting or markup, you really need to ask yourself whether you shouldn't be submitting it in HTML, with all the bells and whistles, and settle for something more normal in the text variant.
Mostly, errors are obvious, and there are at least some obvious errors in most texts. When errors are completely obvious, we just fix them without feedback to the producer unless you have specifically asked for feedback in your e-mail.
We're getting more HTML formats now, which is great, but incoming HTML often needs a lot of work, because people who are not experienced with HTML often make mistakes. The W3C validator <http://validator.w3.org> is the official checker for valid HTML, but, for the average volunteer, it's awkward to use. However, if you're submitting an HTML format, please use Tidy, which you can get from <http://tidy.sourceforge.net>, to check your text before sending it. If you're using CSS along with your HTML, you need to check that separately at the W3C CSS Validator <http://jigsaw.w3.org/css-validator/> as well.
Header and Footer
We add the PG header and footer. If there is a header and footer already there, we strip them off first, since recent changes in the header mean that a lot of people send files with headers that are out of date. We have written programs to help with this.
We get the number for the text from a program called "ticket" that Brett Fishburne wrote, that dispenses the next number. That way, if two or three of us are posting at the same time, we won't all grab the same number. Given the number, we know the filenames, according to the rules for post-10K texts listed at [R.35], and finally zip up the file.
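For the curious, the idea of a number dispenser that is safe when several people use it at once can be sketched as follows. This is not Brett Fishburne's "ticket" program, whose internals are not described here; the function name next_number, the counter file and the lock-file approach are all assumptions made for illustration.

```python
import os
import time


def next_number(counter_path, retries=50, delay=0.1):
    """Return the next number, using a lock file so two callers never collide."""
    lock_path = counter_path + ".lock"
    for _ in range(retries):
        try:
            # O_EXCL makes creation fail if someone else already holds the lock.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            time.sleep(delay)  # another poster is mid-update; wait and retry
            continue
        try:
            try:
                with open(counter_path) as f:
                    last = int(f.read().strip())
            except FileNotFoundError:
                last = 0  # no counter file yet, so start from 1
            number = last + 1
            with open(counter_path, "w") as f:
                f.write(str(number))
            return number
        finally:
            # Always release the lock, even if reading or writing failed.
            os.close(fd)
            os.remove(lock_path)
    raise TimeoutError("could not get the lock on " + counter_path)
```

The point of the design is that reading the last number and writing the new one happen while only one caller holds the lock, so two posters working at the same moment always get different numbers.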
Posting
We now transfer the posted files to two servers: ftp.ibiblio.org (which also serves as gutenberg.org) and ftp.archive.org. (This is usually the point at which we realize that we forgot to make a change we noticed while checking. Aaaargh!)
Currently, we usually do this by uploading the files in one big zip to pglaf.org, where a timed job automatically puts the files onto both servers. At the moment, this happens at 20 minutes past the hour, every hour.
5. Notification
At this point, the book is posted, but nobody knows about it! We need to do something about that. . . .
We compose an e-mail to the "posted" e-mail list, cc: the producer, with the line that is to go into GUTINDEX.ALL, the master list of PG files.
The "posted" list has only a few subscribers. These are the people who index and create links to PG texts, and include both PG volunteers and the maintainers of other sites that link to PG texts.
They also commonly download the texts to get more information for their indexes, and tell us if there is anything wrong with the files.
This e-mail is simply the official notification to all these people and the producer that the file has been posted. Here's a sample of such an e-mail:
To: "Posted Etexts for Project Gutenberg" <posted@listserv.unc.edu>
Subject: [posted] Posted (#5301, Duncan) !
From: "Jim Tinsley" <jtinsley@pobox.com>
Date: Tue, 25 Jun 2002 06:21:27 -0400 (EDT)
Cc: you@example.com
Mar 2004 The Imperialist, by Sara Jeannette Duncan [SJD#4][mprlsxxx.xxx]5301
There may also be some remarks, if the text is in any way non-standard, or if files other than plain text were posted with it.
From this e-mail, you can, if you want to see any corrections made, immediately download the posted file and compare it to your version. Since the notification is made after the file has been copied to the servers, it should be there waiting for you.
To find out how to download a book that has just been posted, see the FAQ "R.3. How can I download a PG text without using the web catalog?" [R.3]
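If you maintain your own index and want to pick the pieces out of a posting line like the one in the sample e-mail above, a regular expression will do it. The sketch below is inferred from that single sample line only, so treat the pattern and the field names (month, year, title, author, tag, filename, number) as assumptions; in particular, the meaning of the first bracketed item is not explained in this FAQ, so it is simply called "tag" here.

```python
import re

# Pattern guessed from one sample: "Mon YYYY Title, by Author [tag][file]NNNN"
POSTING_LINE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2}) (?P<year>\d{4}) "
    r"(?P<title_and_author>.+?)\s*"
    r"\[(?P<tag>[^\]]*)\]\[(?P<filename>[^\]]*)\]\s*"
    r"(?P<number>\d+)$"
)


def parse_posting_line(line):
    """Split a posting line into a dict of fields, or return None if it doesn't fit."""
    match = POSTING_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    whole = fields.pop("title_and_author")
    # Split on the last ", by " so titles containing ", by " still work.
    title, sep, author = whole.rpartition(", by ")
    fields["title"] = title if sep else whole
    fields["author"] = author if sep else None
    fields["year"] = int(fields["year"])
    fields["number"] = int(fields["number"])
    return fields
```

Lines that don't match the guessed pattern return None rather than a half-filled result, so you can spot and handle them by hand.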
6. Indexing
From the "posted" list, the posting line is added to GUTINDEX.ALL. A daily automatic job then adds a skeleton index entry to the website database, containing title, subtitle, author, etext number, language and character set. Our indexers then begin the cataloging process for the website, which is much more thorough. This includes work like finding authors' dates of birth and death, getting the Library of Congress classification, and the other information that makes up the website's searchable index. That process takes extra time, which is why the website's searchable catalog must always lag behind the actual titles posted.
7. Corrections
It's remarkable how many people who went over and over the text to the point of hating it suddenly see problems with it when they download it a couple of days after it's posted! Something psychological there, I expect. Anyhow, if you do download your text and see problems with it, don't worry, just e-mail whoever posted it, or any other member of the Posting Team. No, you're not stupid, or if you are, you're in good company, because we've all done it! There's no big deal about replacing the posted file with a corrected copy immediately.
Over time, other readers may submit corrections. If you find an error in a PG etext, see the FAQ "I've found some obvious typos in a Project Gutenberg text. How should I report them?" [R.26]
When the corrections are small, as most are, we will just make the change to the existing text. We never make a new edition when we get corrections immediately after posting; we just update the file.
V.17. How long must a text be to qualify for PG?
Most of Project Gutenberg's work is focused on digitizing and distributing items that originated as print materials. Thus, the usual size of a PG eBook depends on the printed item or set of items it was based on. For multi-volume works, for example, we tend to assign a unique eBook number to each volume (since each volume was originally printed as a separate book). When short items are digitized, such as short stories, pamphlets, or periodicals, we again usually treat each item as a separate eBook (even though they might be quite short). Combining series of related short items into a longer item is acceptable, at the discretion of the eBook producer. This might be done with a series of pamphlets, or a set of poems, or a collection of short stories on a theme or from a single author, but many separate print sources.
In cases where copyright clearance may not be obtainable for a set of related items, but can be obtained for a single item, we can redistribute that shorter item as a single eBook. This situation comes up frequently for periodicals where individual stories are copyrighted separately, and may also come up for some monographs.
While we would like to specify a "minimum" size for all eBooks, the variety and complexity of items that volunteers identify for the collection makes it difficult to be completely firm. Items with little text but many graphics (such as children's stories), song lyrics (accompanied by an MP3 or sheet music), and other items offer important variety to the Gutenberg collection, but might have less text than a more typical book-length work.
As a rule of thumb, most eBooks should have at least 25K of text. This cutoff applies to general fiction and non-fiction, reference works, and serials. When an individual work (such as a short story) is below this cutoff, we often seek to combine similar items together.
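If you want a quick way to see whether a piece, or a handful of pieces you are thinking of combining, reaches the rule of thumb above, a sketch like this will do. It assumes "25K" means 25 x 1024 bytes and measures the text as UTF-8; both are our assumptions, and the function names are made up for the example.

```python
MIN_BYTES = 25 * 1024  # assumption: "25K" read as 25 x 1024 bytes


def total_size(pieces, encoding="utf-8"):
    """Total size in bytes of several pieces of text."""
    return sum(len(piece.encode(encoding)) for piece in pieces)


def long_enough(pieces, minimum=MIN_BYTES):
    """True if the pieces together reach the rule-of-thumb minimum."""
    return total_size(pieces) >= minimum
```

Remember that this is only a rule of thumb; as explained above, shorter items can still be acceptable.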
V.18. What books are eligible?
A book is "eligible" for posting if we can legally publish it. This is the case if:
- it is in the public domain in the U.S.A., or,
- the copyright holder has granted unlimited non-exclusive distribution rights to PG.
V.19. Are reprints or facsimiles eligible?
A reprint or facsimile of a book that would be eligible is itself eligible.
For example, if a book published in 1995 is a reprint of a book published in 1900, then it is eligible. However, the onus is on us to prove that it is a reprint, and if it doesn't say on the TP&V that it is a reprint, confirming its eligibility may be impractical.
V.20. What is the difference between a reprint and a facsimile?
A facsimile retains the page layout and formatting of the original. A reprint keeps the same words, but may lay the pages out differently. For our copyright purposes, there is no difference — we can use either.
V.21. What is the difference between a reprint and a "new edition"?
A reprint contains only the words and pictures that were printed in the original. A new edition is in some way changed; it has different text, or pictures. It may be abridged, or expanded. It may have material added or changed, using other versions of the book.
A new edition gets a new copyright, and has to be cleared based on its own copyright date and status, not the date of the original printing of the title. See also the FAQ "How come my paper book of Shakespeare says it's 'Copyright 1988'?" [C.16] for an example.
Please note that we are talking here about a new edition of the printed book, not a new (corrected) edition number for Project Gutenberg naming purposes.
V.22. What book should I work on?
Nobody in Gutenberg is going to set assignments for you. You decide what book to process. Just pick one that no-one else has already done, or is working on. It's also sensible to pick one that you'll like — you'll be living with it for a while. On a practical note, it's probably better to start with a short book or even a short story, since a long book can take quite a while to produce.
Start by thinking of books written before 1923. Pick a book you like, and check it out. If it's already done or still in copyright, try other books by the same author.
Visit the Project Gutenberg site and download a full list of Gutenberg books in GUTINDEX.ALL. Have a look at the List of Books In Progress and Complete and other "requested" lists [B.4]. Look for authors you like, and see what books by them aren't yet available.
Check out your old books. Maybe you have an eligible edition that would be of great help to the project.
Try your library. They may have some eligible editions — books we can prove to be in the public domain — and you will certainly come away with ideas. Ask your librarian. Librarians are keen to help on projects like this.
Browse second-hand bookshops in your area. There are lots of treasures to be picked up very cheaply.
Search for literature pages and bookshops on the Internet.
If all else fails, you can always ask on the Volunteers' Board or try the gutvol-d mailing list [V.12] for ideas. Others may know of books that people are especially looking for, or projects already started where you could help out.
V.23. I have a book in mind, but I don't have an eligible copy.
First, determine whether there are any eligible copies of the book, by finding out the date it was published, possibly from the Catalog of the Library of Congress [B.5] and checking the Public Domain and Copyright Rules [B.1]. If there is a public domain edition, the next problem is to find one to work with.
V.24. Where can I find an eligible book?
The most commonly used outlets are used bookstores, garage sales, library sales, charity shops and any other place that sells old books.
The Internet is a wonderful medium for finding used and antiquarian books — used bookstores all over the world have found ways of co-operating and listing their inventories on the Net, so that whether you live in Los Angeles, Moscow or Perth, you can still find that book you're looking for in a shop in a laneway of Amsterdam. Most on-line listings will quote the publication year of the book, so you can check that it's pre-1923.
Two such sites that allow second-hand booksellers to list their inventory are:
- Advanced Book Exchange <http://www.abebooks.com>
- Alibris <http://www.alibris.com>
The book search page at trussel.com [B.5] has a list of many such Net bookshops, or you can simply visit any search engine and search for Used or Antiquarian Bookshops. You can often buy eligible books through these sites very cheaply.
If you still can't find the book you need, post a message on the Volunteers' Board or to the gutvol-d mailing list; maybe someone else can find it for you.
Sometimes, it may be possible for you to work from a later edition, so long as somebody who has an eligible edition can check it to make sure that no changes have been made. Sometimes, you may be able to find a modern reprint; reprints may be eligible, as long as they say they are reprints of an edition that would be eligible.
If you can type, or can scan without damaging the book, you can borrow books long enough to produce them. Even if your local library doesn't have the books you want, they may well be able to get them for you on inter-library loan. Ask your librarian about it.
V.25. What is "TP&V"?
This is an abbreviation for "Title Page and Verso", and means a paper or image copy of the front and back of the title page.
Even if the back is blank, we need to have an image of it for the files, to show that it is blank, so that if, in ten years' time, somebody queries our right to publish, we can show that we haven't just lost it.
Publishers print copyright information, like title, author, copyright year and owner, and whether the book was a reprint, on the TP&V, and by filing this, we can prove that the book we produced was in the public domain.
Sending us the TP&V is the One True Way to getting PG copyright clearance [V.37].
V.26. What is "Posting"?
Posting is the final stage in the production process, where the file is given a number and official PG header, and copied onto our FTP servers for distribution. See section 4 of the FAQ "How does a text get produced?" [V.16] for a blow-by-blow account.
V.27. I think I've found an eligible book that I'd like to work on. What do I do next?
Make sure nobody else is working on it, and that it's not already online somewhere.
V.28. What books are currently being worked on?
Check out David Price's In Progress List (a.k.a. "the InProg List") online at <http://www.dprice48.freeserve.co.uk/GutIP.html>. David gets the information from Copyright Clearances that have been done, and organizes it into a list. It can never be 100% up to date, since clearances come in all the time, but it's the best online facility we have, and it's much more clearly presented than the original clearance files.
V.29. How do I find out if my book is already on-line somewhere?
There's no foolproof method; some student somewhere could have scanned it and put it on her college web page without announcing it anywhere. However, there are some regular places to check.
It may sound obvious, but you should always look in the PG archives first. Download GUTINDEX.ALL and keep it handy. Search the InProg List [B.1].
The main place to search for your book is the On-Line Books Page <http://onlinebooks.library.upenn.edu/>, which specializes in indexing books that people make available on-line.
If you still don't see your book on-line anywhere, hit your favorite search engine, and give it the title, author's last name, and preferably a few uncommon words from the first page of the book. Sometimes one of those solo efforts shows up in a general search.
V.30. My book is not on the In-Progress list, and I can't find it on-line. Is it safe to go ahead and buy it?
Probably. It could have been cleared, but not included in the InProg list yet. If the amount of money to buy it is a consideration, you can e-mail any of the members of the Posting Team, and ask them to check the latest clearances for you. Even this isn't foolproof; another volunteer could be placing their order at the same time you're placing yours. Such duplications do happen, but they are very rare.
V.31. My book is on-line, but not in Project Gutenberg. What should I do?
If the on-line file is from the same edition as the one you have (e.g. not a different translation) then you may be able to submit that file, perhaps slightly edited, to Gutenberg using the clearance from your paper copy. See "I've found an eligible text elsewhere on the Net, but it's not in the PG archives. Can I just submit it to PG?" [V.62] for how to do that.
And of course, you can always still make your own version for PG. It's surprising how often even very similar paper editions have small differences that can be interesting or significant.
V.32. My book is already on-line in Project Gutenberg, but my printed book is different from the version already archived. Can I add my version?
Yes! In fact, assuming that the version already there is in the public domain, you can piggyback on the work already done by what is called "comparative retyping". For example, let's say that you have a later edition than the existing file; you can just take the existing file, edit it to match your paper version, and submit it as a new file. Of course, you must have Copyright Cleared [V.37] your paper version as well.
V.33. I see a book that was being worked on three years ago. Is anyone still working on it?
Maybe, maybe not. Some people abandon books, some people who are regular producers clear them and put them at the bottom of the pile, perhaps for years (though they will get round to them sometime), and some people just simply take two or three years to produce a book.
Once, we put names and contact details on the public InProg list, but for privacy and spam-prevention reasons, we've taken them off. However, the Posting Team have access to the master list of cleared files, and will send a message on your behalf to the person who originally cleared the book, asking if the project is still active, or if the producer wants help.
So if you really want to check this situation out, e-mail one of the Posting Team.
V.34. I've decided which book to produce. How do I tell PG I'm working on it?
As soon as you get Copyright Clearance [V.37], your book is entered in the "cleared" files. David Price will take these, and add your entry in his next release of the In Progress List.
V.35. I have a two- or three-volume set. Should I submit them as one text, or one text for each volume?
Both.
Quite a lot of 18th and 19th Century books, even straightforward novels, were published as multipart sets. When you have such a set, you should usually submit one text for each volume, and a "complete" text with the contents of all volumes together.
People who do this often complete and submit one volume at a time, until they've finished, and then contribute the "complete" file.
V.36. I have one physical book, with multiple works in it (like a collection of plays). Should I submit each text separately?
If the works are clearly separate, stand-alone texts, and are long enough [V.17] to warrant inclusion on their own in the archives, then yes, you should, and you may also submit a "complete" version as well, if it seems appropriate. This most commonly happens in a collection of plays, though essays and other works may also fit the criteria. Collections of poetry rarely do, since most poems are too short to submit as stand-alone texts.
Sometimes the book includes a preface or introduction or glossary covering all the works in it. In this case, you can decide whether to include these with each of the parts, or save them for the "complete" version.
V.37. How do I get copyright clearance?
Basically we need to see images of the front and back of the title page of the book, which is where copyright information is usually shown. This is called "TP&V", for "Title Page and Verso" [V.25].
To Submit Online:
As of late 2002, we have a new automated upload procedure using a web page. This is by far the fastest and easiest way to get clearance. You need scanned images (PNG, JPEG, TIFF or GIF) of the two pages, at a resolution good enough that the text can be read clearly, though the files don't need to be huge.
Just go to http://copy.pglaf.org/ and follow the instructions.
There are two other, older ways to submit a text for clearance, but we would ask you not to use them unless you really can't use the web form.
To submit by paper mail:
Photocopy the front and back of the title page, even if the back is blank, write your e-mail address on it, and send the photocopies to:
MICHAEL STERN HART
405 WEST ELM STREET
URBANA, IL 61801-3231
USA
This is called Title Page & Verso, or TP&V for short, and is needed for copyright research. A colored envelope is best, to make sure your letter is easily recognized as TP&V.
E-mail Michael <hart@pobox.com> when you send them, so he knows they're on the way. It's a good idea to check back with him by e-mail after a week or so if you haven't heard from him.
About this, Michael says: "Please include always your e-mail name and address, and mark the envelope with some distinctive mark and or color. Colored envelopes fine. Just something so I can find it easily, the mail here is slow and deep, like snow. Please send a note to: <hart@pobox.com> for more info."
To submit by e-mail:
Scan the front and back of the title page, even if the back is blank, and e-mail the images to Greg Newby <gbnewby@pglaf.org> as TIFF, JPEG or GIF in medium resolution. Make sure that the print is legible before you send.
Whichever method you use, you should expect to get an e-mail back after about a week, with one line containing the Author, Title, your name and date with the word "OK" at the end. This means that your text has been cleared.
A Clearance Line looks something like:
The Works Of Homer [Iliad/Odyssey] Tr. George Chapman Jim Tinsley 06/14/01 ok
If you don't get any response, e-mail to check that your TP&V was received OK. If the word at the end of the line is not "OK", then your text is not eligible, and a comment will probably be appended explaining why it is not eligible.
Don't start work on your book until you get that OK! It's very sickening to do all that work, and then find out that your text can't legally be put on-line!
V.38. I have a two- or three-volume set. Do I have to get a separate clearance on each physical book?
Yes.
Some multi-volume works, notably reference books and translations, were published in a series, and it may be that the first volume is 1922, but the others are 1923 or later, so we have to clear each individually.
V.39. I have one physical book, with multiple works in it (like a collection of plays). Do I have to get a separate clearance for each work?
No. Since they were all printed together, one TP&V will suffice for all, but . . .
You should list each separate title included, if you intend to submit each title separately (see the FAQ "I have one physical book, with multiple works in it like a collection of plays. Should I submit each work separately?" [V.36]). If, say, you clear a "Collected Plays of Sheridan", and later submit an eBook of "The School for Scandal", we will have trouble finding your clearance unless we have made a note that "School for Scandal" is part of the contents of "Collected Plays".
In a case like this, you should include, on your paper or e-mail, something like:
George Bernard Shaw. Plays Unpleasant. 1905. Contents:
- Preface to Unpleasant Plays
- Widowers' Houses
- The Philanderer
- Mrs. Warren's Profession
You only need to do this when you are going to submit each part separately, which is commonly the case with plays, and sometimes essays, stories and novellas. Taking a different example, the "Collected Poems of Emily Dickinson", we would not need to list the contents, since we wouldn't publish each poem separately.
There is one exceptional case: if your book was printed after 1923, but contains stories or plays some of which are stated to be reprints of pre-1923 editions, you should give as much detail as possible about what you intend to submit.
V.40. Who will check up on my progress? When?
Nobody. There are no schedules or timetables. You're welcome to contact other volunteers [V.12] with comments or questions, though.
V.41. How long should it take me to complete a book?
Most books get done in between one and three months, but this varies wildly. It depends on the amount of time you can afford to give it, the length of the book and, if you're not typing, the quality of the scan — if the book scans badly, you need to put more time into proofing.
Some very productive volunteers manage to turn out an e-text a week; some books can take a year or more.
Scanning itself doesn't take too long. Even if it takes you as much as two minutes per page to scan, you will still complete a 300 page book in 10 hours, and you will probably be scanning much faster than that [S.9]. The problem is that the text generated by the scanner and your OCR package is usually faulty. There are many cute scanner errors, mistaking b for h, or e for c, so that "heard" is scanned as "beard" or "ear" as "car". Makes the story more interesting sometimes!
So now you need to do a first proof of the e-text. Read it carefully, correct scanning mistakes, and make sure that you haven't left out pages or got them in the wrong order. Unless your scan was exceptionally good, this is the time-burner in the process.
When you've done the first proof, you can either do a second proof yourself, or send it to another volunteer for second proofing.
If you're a typist, of course, you can skip right over the messy scanning and scan-correction process. Yay typists!!
V.42. I want/don't want my name published on my e-text
No problem. When you send the e-text for posting, mention exactly what, if anything, you want the Credits Line [V.47] to say.
V.43. I'd like to put a copy of my finished e-text, or another Gutenberg text, on my own web page.
Great! PG encourages the widest possible distribution of e-texts. We like to publish everything in plain text, which is the most accessible format, since everybody can read plain text. But once it's available in plain text, it's open to you or anyone else to convert it to other formats like HTML for further distribution.
If you are reposting a text, though, please be careful to check that your posting complies with the conditions spelled out in the header, especially for copyrighted works.
V.44. I've scanned, edited and proofed my text. How do I find someone to second-proof it?
You can post a request on the Volunteers' Board, or on the gutvol-d Mailing List. You will probably get some offers there. In a difficult case, you might ask Michael Hart to add it to the "Requests for Assistance" section of the next Newsletter.
In general, the best way to handle it is to make a co-operative proofing project out of it. This is like a miniature version of the distributed proofreading sites, without the page images.
There are always people looking for proofing work, but many beginners take on more than they can handle and don't finish the job. This can be very disappointing if you give the whole thing to one volunteer who then vanishes without trace. You can minimize the risk by splitting the book into chunks of about 20-30 pages each, or one chapter each if that's around the right size. Write explicit instructions about what you want proofers to do when they spot a suspected error, like fix it or mark it with an asterisk. (Marking is probably safer with beginners who don't have the book or an image of the page to refer to.)
Give the first chapter to the first person who responds, the second to the second, and so on. As you hand out the chapters, let the proofers know that if they're not returned within three or five days, you'll assume they've quit. Three days is plenty of time for 20 pages. If someone returns a chapter, you can give them another. If someone doesn't get back to you within the time set, assume they're not going to, and recycle that chapter to someone else. No hard feelings, no problem.
This process of "co-operative proofing" ensures that beginning proofers don't choke on the work, and that one vanishing volunteer doesn't hold up the whole project.
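If your e-text has no page or chapter markers to split on, one way to cut it into hand-out chunks is by line count, breaking only at blank lines so that no paragraph is divided. This is only a rough sketch: the function name and the default of 1,000 lines per chunk are assumptions made for illustration, not PG rules.

```python
def split_into_chunks(text, max_lines=1000):
    """Split text into chunks of roughly max_lines lines each.

    Breaks fall only on blank lines, so a paragraph is never divided.
    A single paragraph longer than max_lines becomes an oversized chunk.
    """
    chunks, current = [], []
    for paragraph in text.split("\n\n"):
        para_lines = paragraph.split("\n")
        # Start a new chunk if this paragraph (plus its blank separator)
        # would push the current chunk past the limit.
        if current and len(current) + len(para_lines) + 1 > max_lines:
            chunks.append("\n".join(current))
            current = []
        if current:
            current.append("")  # restore the blank line between paragraphs
        current.extend(para_lines)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Joining the chunks again with a blank line between them ("\n\n".join(chunks)) gives back the original text, so nothing is lost when the proofed pieces come home.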
V.45. I've gone over and over my text. I can't find any more errors, and I'm sick of looking at it. What should I do now?
We all know that feeling! Particularly with your first book, you've probably gone through a patch when you thought you'd never finish — and when you do, you can't stand the idea of looking at it again. Heh. Cheer up — the first twenty texts are the worst! :-) And you'll feel a lot better when you see your text available for everyone to read.
You have three choices:
- You can send it for posting as it is. [V.46]
- You can put it aside for a week or so, and come back to it with fresh eyes.
- You can ask in any of the standard ways [V.12] for someone else to second-proof it for you. This has a lot to recommend it; it gets other sets of eyes looking at the text, it relieves the pressure that you may feel, it may rekindle your enthusiasm for the text, it allows you to "meet" other volunteers, and possibly form partnerships for future PG collaboration. Above all, it gives new proofers a chance to get their feet wet, and this is good for them, and good for PG. You are not only contributing a text, you're helping to train and encourage the next generation of producers.
V.46. Where and how can I send my text for posting?
As of late 2002, we have a new automated upload procedure using a web page. This has a lot of good things going for it, because we keep a record of what's uploaded, you get an e-mailed copy of the notification, you don't have to fiddle with FTP, and we can make up the header automatically from the information you enter, which saves time and prevents keying errors. Please use this method unless for some reason you really aren't able to.
As always, it's better to ZIP your file first, because it'll take less time to transfer.
Just go to <http://upload.pglaf.org/>, fill in the form, specify the file to upload, and hit "Send" at the bottom.
And you're done!
If, for some reason, you can't use this page, there are two backup options: you can e-mail it, or you can upload it by FTP. Whichever you use, it is always best to ZIP the file first if you can.
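If you don't have a ZIP utility handy, the zipping step can be done with a few lines of Python's standard zipfile module. This is a minimal sketch; the function name zip_text is made up for this example, and any ZIP program will do the same job.

```python
import zipfile
from pathlib import Path


def zip_text(path):
    """Compress one file into a .zip archive beside it; return the archive path."""
    source = Path(path)
    archive = source.with_suffix(".zip")  # e.g. hamlet.txt -> hamlet.zip
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname stores just the filename, not the full directory path.
        zf.write(source, arcname=source.name)
    return archive
```

Calling zip_text("hamlet.txt") would leave hamlet.zip in the same directory, ready to upload or attach.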
If you are comfortable with sending files by FTP, this is better than e-mail. First, you will need a username and password, which you can get by e-mailing any of the Posting Team.
Log in to pglaf.org using the username and password supplied. Change to binary mode with the "bin" command and "put" your file.
Summary instructions:
ftp pglaf.org
login: yourlogin
password: yourpassword
bin
put yourfile.ext
quit
Here is a sample session:
>ftp pglaf.org
Connected to pglaf.org.
220-Access from unknown@127.0.0.1 logged.
220 FTP Server
User (pglaf.org:(none)): xxxxxxxx
331 Password required for xxxxxxxx.
Password: xxxxxxxx
230 User xxxxxxxx logged in.
ftp> bin
200 Type set to I.
ftp> put MYFILE.ZIP
200 PORT command successful.
150 Opening BINARY mode data connection for MYFILE.ZIP.
226 Transfer complete.
ftp: 172313 bytes sent in 17.34Seconds 9.94Kbytes/sec.
ftp> quit
When you are in the work directory, you will not be able to list files, but they do exist and they are there.
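If you would rather script the upload than type the commands, the same steps can be sketched with Python's standard ftplib module. Treat this as an illustration only: the function names are invented for this example, the username and password are whatever the Posting Team gave you, and nothing here is an official PG tool.

```python
from ftplib import FTP
from pathlib import Path


def upload(ftp, path):
    """Send one file over an FTP connection that is already logged in."""
    source = Path(path)
    with source.open("rb") as handle:
        # storbinary sets binary mode itself (the "bin" step) before
        # sending, then returns the server's final reply.
        return ftp.storbinary(f"STOR {source.name}", handle)


def upload_to_host(host, user, password, path):
    """Connect, log in, upload one file, and disconnect."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        return upload(ftp, path)
```

A call such as upload_to_host("pglaf.org", "yourlogin", "yourpassword", "hamlet.zip") would mirror the manual session above. You still need to e-mail the Posting Team afterwards.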
When you have uploaded your file, e-mail a note to any or all of the Posting Team, including your
- filename
- credits line as you want it on your text
- clearance line you received [V.37]
An ideal note might be:
Subject: Upload for posting: Hamlet
I have uploaded to pglaf.org by FTP:
Hamlet, by William Shakespeare
File is: hamlet.zip
Credits line is:
Produced by John Doe <jdoe@example.com>
Clearance was given as:
Hamlet William Shakespeare John Doe 05/03/02 ok
If you'd rather send it by e-mail, send the e-mail, including the Credits Line and Clearance Line as in the sample above, to any or all of the Posting Team, with your text as an attachment. Again, ZIPped is better, since it avoids certain damage that can happen to a plain text e-mail along the way.
Do not add the Project Gutenberg header or footer to your file, unless we specifically asked you to. If you do add it, we'll just have to strip it off again, since we add headers automatically when posting. There are times, perhaps when you're working in an unusual non-editable format, when we may give you a header and ask you to add it, but this is rare.
Please read section "4: Posting" of the FAQ "How does a text get produced?" [V.16] for more detail about what happens in posting. Especially, if you want to draw some peculiarities of this text to the Posting Team's attention, or want feedback on any minor edits done during posting, you should say so in the e-mail you send.
Don't assume that we know anything when you send the e-mail. We don't know what you want us to put on the Credits Line. We don't know that this is an unusual text, and needs some kind of special reformatting. We don't know that the text should be split into two volumes before posting. We don't know that you would really like us to check it closely before posting. You have to tell us, exactly and precisely, what you want on the Credits Line. If the text needs some specific work, you have to tell us exactly what that is. And please do that in your e-mail, not in the text itself. Remember that we could be dealing with five or ten other texts at the same time, and even if the poster you discussed it with two weeks ago is the same one who posts the book, he may not remember.
V.47. What is the "Credits Line"?
The Credits line is a line that the Posting Team can insert into each PG text naming the producer or producers of a particular text.
You should decide what you want on the credits line of your text; it's really not up to us.
Most credits lines are something like:
- Produced by John Doe <jdoe@example.com>.
If you don't want to be mentioned by name at all, just say, in your e-mail:
- Please omit the Credits Line for this text. I want to contribute it anonymously.
If you do want to be mentioned, please give the exact wording you want us to use. Most people want their name only; they don't want us to include their e-mail addresses. Others want to make their e-mail addresses public so that readers can contact them with comments. That is entirely up to you, but you do need to tell us. If you do want to include your e-mail, remember that having it permanently on the net is a spam-magnet, and we can't effectively remove or change it later.
Occasionally, a Credits Line can spill onto more than one line, for example:
This text was converted to HTML by Jane Roe <jroe@example.com>
from an original ASCII text scanned by Jack Went
and proofed by Jill Hill
V.48. How soon after I send it will my text be posted?
First read the "Posting" section of the FAQ "How does a book get produced?" [V.16] to understand the process.
You should expect some response within three or four days. We try to get to all submissions within that time. In most cases, that response will be simply the official notification that it has been posted. If there is a query on your text, for example if we can't find the copyright clearance or if we have trouble converting or correcting your text, we will probably e-mail you back directly with questions.
If you don't hear from us within four days, send a follow-up e-mail; it could be that your original note never got to us, or just fell through the cracks.
If your file happens to arrive while one of us is logged in and working, it could get posted within the hour. Some frequent contributors who know our habits know just how to time their uploads!
V.49. I found a problem with my posted text. What do I do?
Most postings go smoothly, but problems can happen. Sometimes, one of the servers is down. Sometimes a file gets corrupted for some unknown reason. Sometimes, let's face it, we screw up.
Usually, one of the indexers will tell us about it, but if you catch it first, e-mail whoever sent out your notification e-mail and explain the problem. Don't worry; your original file will be quite safe, since we keep these long after posting them.
V.50. Someone has e-mailed me about my posted text, pointing out errors.
Great!
Since you're the original producer, you're in the best position to decide whether these are real errors. If they're right about it, tell the Posting Team and we'll correct the text.
V.51. Someone has e-mailed me about my posted text, thanking me.
Nice feeling, isn't it? :-)
About Proofing
V.52. What role does proofing play in Project Gutenberg?
A very big one!
Typists' work doesn't usually need many corrections, but unfortunately, scanners and OCR packages are far from perfect, and scanned text varies from "almost-right" down to "maybe I should consider typing instead of scanning". Proofing is the process that turns a scan into a readable e-text.
Proofing a typist's work is straightforward; you just read it, and keep an eye out for mistakes. Typists typically have few mistakes in their texts, but the errors that they do make tend to be hard to spot. Proofing OCRed text has its quirks, and you can expect many, many errors to correct.
The only thing that all proofers agree on is to differ in their methods. Some people scan and almost complete the proofing process within their OCR package, others do no editing at all within their OCR. Some spell-check first, others spell-check last. Some work through in one pass, doggedly line by line, others make several light passes. Some start at the end and work backwards! Some proofers mark all queries with special characters like asterisks (*) in the text, most just make all the obvious changes and mark only the dubious ones. Some people always send their texts out for proofing, others prefer to do it all themselves.
So this guide is not prescriptive; this is not the "only way" to do it. The only rule is that, at the end of the process, your e-text should be as error-free as you can make it, and should conform to Gutenberg's editing standards, which are mostly just common sense guidelines to make readable text.
The aim of this FAQ is to give you an understanding of what text looks like when it comes fresh off the scanner, and an overview of the whole process by which it becomes a publishable e-text.
V.53. What is Distributed Proofing?
It has always been common for volunteers to share proofing work among themselves — you take the first five chapters, I'll take the next, and so on.
When you're just starting as a PG volunteer, you should go to one of the Distributed Proofing sites [B.4] and do some work there to get a grounding in the basics and a feel for whether you would like to continue working in PG. In distributed proofing, you get a very short section, as little as a page of text at a time, and usually an image file of the page as it was scanned. You then make the text match the image. This is a great start, since all you have to do is read, compare and correct. However, other work also needs to be done, and will normally be done by the project managers of these sites. The samples below give you an idea of the whole process, and also some ideas of what proofing a whole book from start to finish is like.
Most people now start at Project Gutenberg's Distributed Proofreaders <http://www.pgdp.net>.
V.54. What do I need to proof an e-text?
You actually need only two things: the e-text itself and a text editor or word-processor that can handle book-sized files and save them as text.
Nearly all word processors and text editors in current use will work. Volunteers today use many common programs, including WordPerfect, Microsoft Word, WordPad, DOS EDIT, vi, Brief, Crisp, EditPad, MetaPad, emacs, AbiWord, and the word processors from Open Office and AppleWorks. Since all of them contain the necessary basic functions, the best program is the one you're most comfortable with.
Be cautious with recent, powerful word-processors that "auto-correct" text, or use "smart quotes" or any other such automatic retyping or formatting feature, since they can Do Bad Things to your e-text without your consent! When using any such package, it is best to switch off any feature that makes changes without asking you.
Two utilities which may come in useful are a spell-checker and a version difference checker. These may be built into your word processor, or you may have them as separate packages.
A spell-checker is like a chain-saw: a powerful tool, but one to be used very carefully. It is very easy to say "Yes" to the wrong change, and make a really bad mess of the text. Spell-checkers have problems with proper names, foreign words, archaic usages, and dialects. Incautious use can leave you with a text such as that immortalized in the "Owed two a Spell in Chequer":
- Eye half a spell in chequer,
- It cane with my Pea Sea.
- It plane lee marques four my revue
- Miss steaks eye can knot sea.
Every e-text should pass through a spell-checker at some point, but the human half of the partnership needs a very light hand on the confirmations of change!
A difference checker, such as FC or COMP for MS-DOS, diff for Unix or ExamDiff <http://www.prestosoft.com/examdiff/examdiff.htm> for Windows, may also come in handy. A difference checker compares two versions of the text, and points out the changes. This is important when you've sent a text out for proofing, and you get it back with changes. Rather than re-reading the whole text, you can use a difference checker to highlight the changes so that you can verify them against the printed text. As a proofer, you can use it to compare the original text with what you're sending back to ensure that you've only changed what you meant to change.
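If none of those tools is to hand, Python's standard difflib module can do the same comparison. The sketch below is one possible way to list only the lines that changed; the function name changed_lines is an assumption for this example.

```python
import difflib


def changed_lines(original, proofed):
    """Return unified-diff lines showing how proofed differs from original."""
    return list(difflib.unified_diff(
        original.splitlines(),
        proofed.splitlines(),
        fromfile="original",
        tofile="proofed",
        lineterm="",  # the inputs carry no newlines, so add none to headers
        n=0,          # no context lines: show only what actually changed
    ))


# Example: print the differences between two versions of a text.
# for line in changed_lines(old_text, new_text):
#     print(line)
```

Lines starting with "-" come from the original and lines starting with "+" from the proofed version. Two identical texts produce an empty list.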
V.55. Do I need to have a paper copy of the book I'm proofing?
No.
Your job as proofer is to ensure that the e-text you're working on is readable in itself, and contains no obvious errors. Where you think there might be an error, but you're not sure, you mark the spot in the e-text, and let the volunteer who has the paper book look it up.
V.56. What's the difference between "first proof" and "second proof"?
These are fuzzy terms used to indicate how accurate the e-text is, and what type of work is needed to improve it. Quite commonly, the same volunteer who scans the book proofs the whole thing in one or two passes. Sometimes, given a good scan, the text can be sent out for "first proof" with little or no preparatory fixing-up. Often, the scanner makes quite a lot of corrections, then sends the text out for "second proof".
A text is ready for first proofing when it's obvious that there are plenty of errors, but it's possible to figure out, in almost every case, what the correct text should be without needing to refer to the book.
The objective of first proofing is to eliminate all the obvious errors, so that if you speed-read through the text, you probably won't notice any.
Second proofing involves taking a text that has been first-proofed and correcting all the remaining, more subtle errors. Often, some simple errors such as incorrect spacing and quotes may be left for second proofing. Texts that have been typed instead of scanned will always be of at least second-proof quality.
V.57. What do I do with an e-text sent to me for proofing?
First, establish reasonable expectations. A typical book takes 10-15 hours of concentrated effort, and when you first start, you're climbing a learning curve. For your first session, decide to mark out a chapter or two — something like 500 to 1,000 lines — and work only on that. If you get through 1,000 lines in your first sitting, you have done extremely well! It's a good idea to send this first 1,000 lines or so back immediately. The volunteer who sent you the e-text will comment on it, and let you know about any style guidelines you may have breached or common errors you may have missed. Most beginning proofers do make mistakes, so don't worry about it — it's easier to correct these in 1,000 lines than to go back over them in 15,000 lines!
You will usually receive the e-text as an attachment to your e-mail. It's better to send e-texts as attachments than to paste them as text into the body of the e-mail to make sure that the text isn't changed by different e-mail clients. It's better to send e-mailed attachments as ZIP files [R.20], since e-mails sent as text can be damaged along the way. But whether you receive a TXT file or a ZIP file that you have to open, you should save the .TXT file to your hard disk and open it with your editor.
It may be that the text you see appears double-spaced — every second line is blank — or that all the text is on one incredibly long line. This is a familiar effect when moving between a DOS/Windows computer and a Mac or Unix system, but it can happen between any two editors. It is caused by the use of different characters to mark the end of a line. If you have this problem, ask whoever sent you the text to re-send it, telling them what kind of computer and editor you have.
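If asking for a re-send is awkward, the line endings can also be converted on your own machine. The sketch below is one possible approach in Python, not a PG-recommended tool; it assumes the usual conventions of CR+LF for DOS/Windows, a lone CR for older Macs, and LF for Unix, and the function names are made up for this example.

```python
def normalize_line_endings(text, newline="\n"):
    """Convert CR+LF and lone CR line endings to the chosen newline."""
    # Handle CR+LF first so it is not turned into two line breaks.
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    return unified.replace("\n", newline)


def convert_file(src, dst, newline="\n", encoding="latin-1"):
    """Rewrite src to dst with consistent line endings."""
    # newline="" stops Python from translating line endings on its own.
    with open(src, "r", encoding=encoding, newline="") as handle:
        text = handle.read()
    with open(dst, "w", encoding=encoding, newline="") as handle:
        handle.write(normalize_line_endings(text, newline))
```

Pass newline="\r\n" if the text is headed for a DOS/Windows editor. The latin-1 default is only a guess at the file's encoding; change it to match your text.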
Now you make any changes that obviously need to be made, and mark any places where the text looks wrong, but you're not sure what the right text should be. You can usually use asterisks (*) to mark these dubious spots, but you might use other characters if the text already contains asterisks. When in doubt, mark them all, and let the volunteer with the text sort them out!
It is usually best not to make global changes to line lengths by reformatting lots of paragraphs, since the person who sent you the e-text may want to use a difference checker when you return it, and changed line-lengths throughout mean that every line will be different.
When working on a long text, or when making a lot of changes, it may be wise to save several versions of the text with different filenames at different stages so that if something goes badly wrong, you can revert to the last good version. This applies especially to saving the text just before performing a spell-check.
When you're finished with the e-text, make sure you save it as a plain text file (.TXT) and send it back by zipping it if you can, and attaching it to an e-mail.
V.58. What kinds of errors will I have to correct?
Each text has its own peculiarities, but there are a number of well-known scanning errors you will be dealing with all the time.
Punctuation is always a problem. Periods, commas and semi-colons are often confused, as are colons and semi-colons. There are also usually a number of extra or missing spaces in the e-text.
The problem of quotes can assume nightmarish proportions in a text which contains a lot of dialog, particularly when single and double quotes are nested.
The numeral 1, the lower-case letter l, the exclamation mark ! and the capital I are routinely confused, and often, single or double quotes may be mistaken for one of these.
Lower-case m is often mistaken for rn or ni.
The letters h and b, and the letters e and c, are commonly misread. These are probably the hardest errors of all to catch, since ear/car, eat/cat, he/be, hear/bear and heard/beard are all common words which no spell-checker will flag as problems.
For example:
" Hello1' caIled jirnmy breczily. 11Anyone home ? " There seemed to he no-oneabout. Only tbe eat beard him."
should read:
"Hello!" called Jimmy breezily, "Anyone home?" There seemed to be no-one about. Only the cat heard him.
As well as scanner errors, which affect one letter at a time, you have to keep an eye out for editing mistakes by the volunteer who scanned the text or by previous proofers. These are typically cases where a whole line, paragraph or page has been omitted or misplaced. They show up as sentences that don't make sense, or paragraphs that don't follow from the previous one.
This means that you have to keep reading the flow of the text, so that you can spot context errors as well as typos.
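Because pairs like ear/car pass any spell-check, one option is a small script that lists every occurrence of such words so that a human can check each one in context. The sketch below uses only the pairs named above; the names CONFUSABLE_PAIRS and flag_confusables are assumptions for this example. It will produce many false alarms, especially for "he" and "be", so it is an aid to careful reading and not a replacement for it.

```python
import re

# Word pairs named in this FAQ. Both words in each pair are valid
# spellings, so a spell-checker accepts either one.
CONFUSABLE_PAIRS = [
    ("ear", "car"), ("eat", "cat"), ("he", "be"),
    ("hear", "bear"), ("heard", "beard"),
]

# Map every word to its look-alike, in both directions.
SUSPECTS = {}
for first, second in CONFUSABLE_PAIRS:
    SUSPECTS[first] = second
    SUSPECTS[second] = first


def flag_confusables(text):
    """Return (line number, word, look-alike) for each suspect word found."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in re.finditer(r"[A-Za-z]+", line):
            word = match.group().lower()
            if word in SUSPECTS:
                hits.append((number, word, SUSPECTS[word]))
    return hits
```

Run on the line "Only the eat beard him.", it reports "eat" (possibly "cat") and "beard" (possibly "heard"), both on line 1.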
V.59. How long does it take to proof an e-text?
This depends on how long the e-text is, how clean the text is when you start, and how thorough you're being, as well as how much time per day you can give it and how fast you can proof.
On a first proof, it can take a very long time to get the e-text to a readable condition if it scanned badly. As a beginner, you would be unlikely to be given such a difficult text to work with. First proofs are usually done by the same person who did the scanning, and are only given out in the context of established scanning/proofing teams.
You might expect to proof anywhere between 500 and 2,000 lines per hour during a second proof. A short novel or novella might have as few as 6,000 or 7,000 lines; War and Peace weighs in at about 54,000 lines. Most novels run to 10,000 to 15,000 lines. So you might spend anything between 5 and 30 hours second-proofing a standard book, with 10 to 15 hours being typical.
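The arithmetic behind those figures is simple enough to check for your own text. This short sketch just divides the line count by the proofing speed, using the 500 and 2,000 lines-per-hour figures quoted above; the function and constant names are invented for this example.

```python
# Proofing speeds quoted in this FAQ, in lines per hour.
SLOW_LINES_PER_HOUR = 500
FAST_LINES_PER_HOUR = 2000


def hours_range(lines):
    """Return (fastest, slowest) hours to second-proof a text of `lines` lines."""
    fastest = lines / FAST_LINES_PER_HOUR
    slowest = lines / SLOW_LINES_PER_HOUR
    return fastest, slowest


# Example: estimate the time for a few text lengths.
# for lines in (10_000, 15_000, 54_000):
#     print(lines, hours_range(lines))
```

A 10,000-line novel comes out at 5 to 20 hours and a 15,000-line novel at 7.5 to 30 hours, which is where the overall range of 5 to 30 hours comes from.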
For an average novel, a week or two for second proofing is good going. A month is reasonable.
Proofing an e-text is a significant amount of work, and you may find it psychologically more comfortable to take on a chunk at a time — say 1,000 lines per session — and send that proofed section back, rather than wait until the whole job is done before sending anything back. This helps to avoid the fairly common case where you keep falling behind where you expect to be until you dread the thought of getting back to the text, and finally just abandon it.
If you find after a while that you just don't want to continue, please tell the person who sent you the text that you're not going ahead with it. It's very frustrating for the volunteer who scanned the book, and who wants to get it posted, to wait for two or three months, only to have to start all over again with another proofer.
V.60. Are there any special techniques for proofing?
The classic way to proof is to open the text in your editor or word processor, and just start reading carefully.
This method has received a major boost since editors and word processors have added a feature of showing squiggly red underlines under words not in their dictionary. While this is very useful, you still need to read carefully, since not all errors produce misspelled words. The classic, and very common, example of this is scanning "he" for "be". These visual spellchecks also commonly do not check words beginning with capitals. Capitalized words are commonly names not in the dictionary, and when checking of capitalized words is switched off, they will not query "Tbe". Other errors that a spellchecker doesn't look for include missing spaces, mismatched quotes and misplaced punctuation. For these, you can try gutcheck [P.1]. And of course, no automatic check will find omitted lines or words. Worse, spellcheckers will query words not in their dictionary that might be quite correct, and this can be quite troublesome when dealing with older texts or dialect.
Still, if your concentration is up to the job, scrolling through a text with non-dictionary words underlined in red is a fast and effective way of giving a text the final once-over.
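As a rough illustration of the kind of mechanical checks a spellchecker does not make, here is a minimal sketch in Python. It is not gutcheck and does not reproduce gutcheck's actual rules; the function name and the three checks are assumptions chosen only to show the idea, and they will raise false alarms on things like spaced ellipses or quotations that legitimately run over several paragraphs.

```python
import re

def quick_checks(paragraph):
    """Return a list of warnings for one paragraph of plain text.

    Hypothetical helper for illustration only; not gutcheck's logic.
    """
    warnings = []
    # An odd number of double quotes suggests a missing or extra quote.
    if paragraph.count('"') % 2 != 0:
        warnings.append("unbalanced double quotes")
    # Whitespace directly before punctuation is usually a scanning error.
    if re.search(r"\s[,;:!?.]", paragraph):
        warnings.append("space before punctuation")
    # Punctuation followed directly by a letter suggests a missing space.
    if re.search(r"[,;:!?][A-Za-z]", paragraph):
        warnings.append("missing space after punctuation")
    return warnings

if __name__ == "__main__":
    print(quick_checks('He said, "Stop ,now.'))
```

Anything a check like this flags still needs a human decision; it only points at places worth a second look.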
Volunteers have also used other techniques for proofing. Some people can't sit at their screen and read for hours; many people don't want to.
Some people just use the good old-fashioned method of printing out the text to be proofed, and blue-pencilling the mistakes.
It is becoming fairly common now for people to load the text onto their PDA, and read it from that. Mistakes found can be bookmarked or jotted down and fixed when they go back to their PC.
Getting your computer to read the text aloud is a very effective way of achieving high accuracy. Modern PCs have audio capabilities built in, and it is possible to find free or cheap shareware "read-aloud" text-to-speech packages for just about everything. Some PDAs are also capable of doing text-to-speech.
The first time you try text-to-speech, it will probably sound and feel a little strange, but you will quickly learn to hear errors in words. This can be very effective, but you should have given the text at least a light proofing before you begin; it is hard to deal with a high number of errors using a text-to-speech method.
When proofing by a speech program, you either set your text-to-speech program to pronounce all punctuation, or, if that is not possible, you make a special version of your text to feed it, first doing a global replace of "," with " comma ", ";" with " semi-colon ", and so on. Mark a block of 500 to 1,000 lines for reading aloud, and set the reading speed to whatever is comfortable for you. Then you sit down with the original book in front of you, and listen. When you hear an error, mark the place in the text with a light pencil. Stopping the reading at every error, editing the text and restarting is possible, but it breaks the flow, and ends up taking longer. When the reading is done, go to your keyboard and correct the errors found.
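If your text-to-speech program cannot pronounce punctuation, the global replace described above can be sketched in Python as follows. Only the comma and semi-colon replacements come from this FAQ; the other entries in the table, and the function name, are assumptions that you should adjust to taste.

```python
# Spoken words for punctuation marks. The first two entries are the ones
# named in the FAQ; the rest are illustrative assumptions.
SPOKEN = {
    ",": " comma ",
    ";": " semi-colon ",
    ":": " colon ",
    ".": " period ",
    "?": " question mark ",
    "!": " exclamation mark ",
    '"': " quote ",
}

def speakable(text):
    """Return a copy of text with punctuation spelled out as words."""
    for mark, word in SPOKEN.items():
        text = text.replace(mark, word)
    return text

if __name__ == "__main__":
    print(speakable("I took the sheep, and cut their throats;"))
```

Run this on a copy of your text, never on the working file itself, since the result is only meant to be listened to.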
V.61. What actually happens during a proof?
Stage One — The original Scan
We start with a scanned e-text, in this case a paragraph from The Odyssey. The paragraph used as an example here has been "enhanced" with more errors than in the real scanned text, so that you can see samples of many problems all in one place.
We begin by looking at the original OCRed text, of which our sample section reads:
1There Periniedes and Eurylochus held the victims, but l
drew my sharp sword from my thigh, and dug a pit, as it were
a cubit in length and breadth, and about it poured a drink-
offering to all the dead, first with mead and thereafter with
sweet wine, and for the third time with water, And 1 sprink-
BOOK XL
ODYSSEY X, 24-56.
173
ODYSS.EY XI, %4-56. 173
lef white incal thereon, and entreated with many prayers
strengthless beads of the dead, and prornised that on my
return to Ithaea 1 would offer in my halls a barren heifer,
the best 1 had, and fil the pyre with treasure, and apart unto
Teiresias alone sacrifice a black rarn without spot, the fairest
of my flock. But when 1 bad hesought the tribes of the
d with vows and prayers, 1 took the sheep and cut their
s over the trench. and the dark blood flowed forth,
he spirits of the dead that he departed gathered
from out of Erebus.
It's clear that we should tidy up the page headings and numbers that have been scanned in with the main text, and that we should separate the paragraphs and remove the spaces inserted by the scan at the start of some lines. We also need to restore some of the text that got lost in the scan. Since there isn't much of it, we just type it in. Having done this, we get to . . .
Stage Two — First pass through the scanned text
At this point, we have a complete text. All of the words are actually there, and we have eliminated page breaks and other extraneous artifacts of scanning. Again, mileage varies: some people like to preserve page breaks and numbering until much later, to make it easy to refer back from the e-text to the book.
Our job in this phase is to fix all of the obvious scanning errors and double-check that we really do have all the text. Our aim here is to create an e-text that is ready for First Proof. In fact, since it's fairly clear what all the words are, this text could be considered ready for first proof.
1There Periniedes and Eurylochus held the victims, but l drew my sharp sword from my thigh, and dug a pit, as it were a cubit in length and breadth, and about it poured a drink- offering to all the dead, first with mead and there after with sweet wine, and for the third time with water. And 1 sprink- led white incal thereon, and entreated with many prayers the strengthless beads of the dead, and prornised that on my return to Ithaea 1 would offer in my halls a barren heifer, the best 1 had, and fill the pyre with treasure, and apart unto Teiresias alone sacrifice a black rarn without spot, the fairest of my flock. But when 1 bad besought the tribes of the dead with vows and prayers, 1 took the sheep and cut their throats over the trench. and the dark blood flowed forth, and lo, the spirits of the dead that he departed gathered them from out of Erebus.
Now we convert those numeral 1s to capital Is and to quotes, where appropriate, we straighten up the quotes and we deal with other obvious scanning errors, which brings us to . . .
Stage Three — The First Proof
At this point, we could hand over the text to an experienced proofer who doesn't have a copy of the book. This would be called a "first proof". An e-text is at first proof stage when there are still plenty of errors, but in each case it's pretty obvious what the correct word is. The excerpt now looks like normal text.
Unfortunately, in stage two above, we accidentally deleted a line.
'There Periniedes and Eurylochus held the victims, but l drew my sharp sword from my thigh, and dug a pit, as it were a cubit in length and breadth, and about it poured a drink- offering to all the dead, first with mead and there after with sweet wine, and for the third time with water. And I sprink- led white incal thereon, and entreated with many prayers the strengthless beads of the dead, and prornised that on my return to Ithaea I would offer in my halls a barren heifer, Teiresias alone sacrifice a black rarn without spot, the fairest of my flock. But when I bad besought the tribes of the dead with vows and prayers, I took the sheep and cut their throats over the trench, and the dark blood flowed forth, and lo, the spirits of the dead that he departed gathered them from out of Erebus.
Stage Four — Corrections from First Proof
We receive the first proof back from the proofer, and find that it has been mostly corrected.
The corrections made were "l/I", "there after/thereafter", "prornised/promised", "bad/had", and "rarn/ram".
We have also wrapped the lines — at 60 characters in this case, but it is commonly as much as 70 characters per line. Sentences which look wrong, but where it isn't clear what the right text should be, have been marked with asterisks (*).
'There Periniedes and Eurylochus held the victims, but I drew my sharp sword from my thigh, and dug a pit, as it were a cubit in length and breadth, and about it poured a drink-offering to all the dead, first with mead and thereafter with sweet wine, and for the third time with water. And I sprinkled white incal * thereon, and entreated with many prayers the strengthless beads of the dead, and promised that on my return to Ithaea I would offer in my halls a barren heifer, * Teiresias alone sacrifice a black ram without spot, the fairest of my flock. But when I had besought the tribes of the dead with vows and prayers, I took the sheep and cut their throats over the trench, and the dark blood flowed forth, and lo, the spirits of the dead that he departed gathered them from out of Erebus.
We look up the text where the first proofer has asterisked it, and make the corrections.
Stage Five — From Second Proof to The Final Result
The text is now ready for second proofing. An e-text is ready for second proofing when you can skim through the text without noticing that there are errors.
We can either do a second proof ourselves, or send it out for second proofing.
Second proofing involves a very careful reading of the text, looking for small errors. In some ways, it's much harder than first proofing, since it's very easy to let your eyes run on auto-pilot and in doing so, miss subtle errors.
Having performed the second proof, which caught errors like "beads/heads", "Ithaea/Ithaca", "Periniedes/Perimedes" and "he/be", we now have our final e-text.
'There Perimedes and Eurylochus held the victims, but I drew my sharp sword from my thigh, and dug a pit, as it were a cubit in length and breadth, and about it poured a drink-offering to all the dead, first with mead and thereafter with sweet wine, and for the third time with water. And I sprinkled white meal thereon, and entreated with many prayers the strengthless heads of the dead, and promised that on my return to Ithaca I would offer in my halls a barren heifer, the best I had, and fill the pyre with treasure, and apart unto Teiresias alone sacrifice a black ram without spot, the fairest of my flock. But when I had besought the tribes of the dead with vows and prayers, I took the sheep and cut their throats over the trench, and the dark blood flowed forth, and lo, the spirits of the dead that be departed gathered them from out of Erebus.
Hooray! At long last we have an e-text to post, which can be downloaded, read and enjoyed by anyone in the world from now on.
About Net searching
V.62. I've found an eligible text elsewhere on the Net, but it's not in the PG archives. Can I just submit it to PG?
You can submit it, but you can't "just" submit it.
We wish we could give a permanent home to all the etexts that people have produced and placed on the Net, but without proof of their public domain [C.10] status, we can't.
We need to be able to prove that the eBooks we publish are in the public domain, so, in order to use one of the many texts that are just floating around the Net, you need to find a matching paper edition that we can prove is eligible [V.18].
(By the way, please be sure that it isn't already in the PG archive. A lot of texts circulating on the Net originated at PG, and people quite often submit them back to us.)
Before you get into this, you should check whether the text you have found is likely to be in the public domain in the U.S. A quick way to verify this is to hit the Library of Congress Catalog site at <http://catalog.loc.gov> and search for the title or author. If you find no publications before 1923, then you should probably move on; the Library of Congress doesn't list every book, and in particular doesn't list all books published outside the U.S., but, if there isn't a pre-1923 copy there, it may be difficult to follow up on. If you're not dissuaded, do a search on the Net for used book shops that might have pre-1923 copies.
Sometimes, with a text on the Net, you know who typed it; it's on someone's website, or the transcriber is named in the text. Sometimes, the text has just been floating around Usenet or old gopher sites for years, with no attribution.
The first thing to remember is that we would like to give credit to the original transcriber if they want it, and if we can identify them.
The next thing to consider is that the original transcriber may well have an eligible copy of the book, and may be able to provide TP&V [V.25] for it.
So, if you can locate the original transcriber, it makes sense to e-mail them, explain what you propose to do, and ask them whether they can help with copyright clearance and whether they would like to be credited in the PG edition. Often, you will get no response, or a response but no prospect of material that will help with clearance, but sometimes you will get lucky.
If the transcriber can't help with TP&V, it's up to you to find a matching paper edition of the same book. This may not be as hard as it sounds. Libraries can help, and may get editions for you on interlibrary loan.
This is an ideal way for students, academics and librarians to contribute texts to PG, since you probably have access to a good library with stocks of old books to find matching paper editions.
If you find a matching paper edition, you then need to compare the etext you found with the book. Legally, what we're trying to prove here is that we have done "due diligence" — that we have done our best to prove that the etext is indeed a copy of a public domain work.
The minimum "due diligence" we can perform is to compare the first and last pages of each chapter (or every 20 pages where the book is not neatly divided into chapters of about that size). You should list all of the differences between the book and the etext that you find on those pages. It is to be expected that there will be some minor differences of punctuation, spacing and spelling, and even perhaps of wording. Minor differences are OK, but we do need to list them, to prove that we did the comparison. When you have your lists, you can send in the TP&V as normal, accompanied by your lists, for clearance.
Many texts floating round without attribution, and indeed many with attribution, could do with a thorough checking, and another option you have is "comparative retyping", where you go through the whole etext, proofing it carefully against the cleared paper book, and changing everything that is different in the etext to match the paper edition. If you do this, you don't need to produce a list of differences, since there won't be any by the time you've finished; you can just submit it as a normal text — and it may well be a lot cleaner! However, if you do take this path, please do a very thorough job on the proofing and comparison.
If the etext you find has been marked up, in HTML for example, you should remove all HTML for the PG edition, because, even though the text itself has been proved to be in the public domain, the original transcribers may hold copyright on the HTML markup, even if you can't find them. If you do want to make an HTML edition of it for PG, strip out all of the original markup and then re-add your own markup.
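One possible way to strip the original markup is sketched below, using only Python's standard html.parser module. The class and function names are hypothetical, and this is an illustration rather than a PG-approved tool: it keeps every piece of text between tags, including the contents of any script or style elements, so the result still needs checking by eye.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect the text between tags and discard the tags themselves."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_markup(html):
    """Return the text content of an HTML string with all tags removed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

if __name__ == "__main__":
    print(strip_markup("<p>Call me <i>Ishmael</i>.</p>"))
```

Note that italics and other emphasis are lost along with the tags, so you will need to restore them in your own way when you proof against the paper edition.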
If you do find the producer and he or she wants to be identified, you may submit a double credits line like:
- Transcribed by Sally Wright <theoriginaltranscriber@example.com>
- Produced for PG by You <you@example.com>
V.63. I've found an eligible text elsewhere on the Net, but it's not in the PG archives. Why should I submit it to PG?
The first reason is file safety.
Yes, we accept that the file is already available to everyone today, but it may not be safe in the long term. We've seen college students who put books on their personal site, and then lose that site when they graduate. We've seen individuals who transcribe several books, and later lose interest, or move, or die, and the work they've done is lost. We've seen small projects with a few volunteers who produce and post books for a few years, but then break up or run out of funds to maintain their site. We've seen large institutions drop their collections as part of a cost-cutting exercise. We've even seen organizations lock public domain works up behind licenses, requiring users to commit to registration and a "no copying" agreement before downloading them.
Whenever a set of etexts is published and distributed by only one person or organization, there is a danger that their etexts will disappear from the Net sometime. We want all etexts to be spread as widely as possible, copied as much as possible, so that no one event or loss, or whim of a sponsor, can obliterate them.
We think that the PG collection is, for that reason, the safest place to put a text for its long-term survival. There are copies of the PG archives all over the world, on public servers and private CDs. PG publications are widely converted, collected and read on PDAs. Other text projects copy works from PG.
The PG archive is so valuable, yet free and easily portable, that even if every current PG volunteer vanished overnight, people around the world would copy and preserve it. Even if PG itself decided to withdraw all our texts, we couldn't do it, because so many people have made copies.
The second reason is legal safety.
Unlike some other projects and individual efforts, PG retains documentary proof of the public domain status of its texts. This is more valuable than it might appear at first glance.
Publishers often claim a new copyright [C.17] on works that they republish, and as time goes on, it becomes harder and harder to prove that a particular book is in the public domain. Walk into your local bookstore and check out how many works by Shakespeare, Poe, Dickens, and Twain have copyright notices on them! People who want to translate these, or create derivative works like screenplays or lyrics or films must first prove that they are basing their work on a public domain edition, but the creeping copyright practices of commercial publishers make that difficult.
Here's a practical example: we were approached by a film student who wanted to make a short piece based on characters from James Joyce's "Ulysses". But before he could do that, he needed to confirm that the material on which he was basing his movie was in the public domain, and all the editions he could find were copyrighted. However, because PG had already established the public domain status of Ulysses, we could point him to our established PD version, and even tell him where to find a paper copy published in 1922. Without that evidence, he could not have made his project.
V.64. I have already scanned or typed a book; it's on my web site. How can I get it included in the Gutenberg archives?
Great! We get these a lot, but it's always nice to see another!
You need to send us the TP&V [V.25] so that we can prove that your edition is in the public domain. If you don't have the TP&V, you will need to find a matching paper book with eligible TP&V for us to be able to use it.
V.65. I have already scanned or typed a book; it's on my web site. The world can already access it. Why should I add it to the Gutenberg archives?
The Project Gutenberg archives are widely copied and searched, and much safer and more permanent than any individual website can possibly be. We aim to keep this collection together over not just years, but centuries. You took the trouble to transcribe this book. We can relate; that's what we do, as well. We know you want this work to survive you and your ISP, and we believe we can do that. And it's not as if you have to take it off your website when we make a copy; you're just using your candle to light another!
If you want to let readers know that your site has other related material, you can put that information in the Credits Line [V.47]. Taking a real-world example, you could ask us to add this to the Credits line for a C. M. Yonge text:
- A web page for Charlotte M. Yonge will be found at www.menorot.com/cmyonge.htm
V.66. I have already scanned or typed a book, but it's not in plain text format. Can I submit it to PG?
Yes, of course. We'll be happy to discuss format options with you, and we're quite experienced in converting between multiple formats and deciding which formats work best and will have the longest life. All you need is to get us a copy of your TP&V [V.25].
About what goes into the texts
V.73. Why does PG format texts the way it does?
PG texts are formatted as plain ASCII, with 60-70 characters per line, with a hard return [CR/LF] at end of line, and some people ask "Why do it this way? You could omit the hard returns and let the reader's word processor or Reader software wrap the lines. You could use '8-bit' accented characters for non-English characters. You could use ' - ' instead of '--' for an em-dash." And so on, through a different choice we could make for every formatting feature. And the answer, of course, is that we could do it differently, and sometimes we do, but mostly we keep to one consistent style.
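As a rough illustration of the line layout just described, the following Python sketch rewraps one paragraph to a chosen width and ends every line with CR/LF. The function name and the default width of 70 are assumptions for the example; most editors can do the same job, and this is not the tool PG itself uses.

```python
import textwrap

def pg_wrap(paragraph, width=70):
    """Wrap one paragraph to at most `width` characters per line,
    ending each line with a hard return (CR/LF)."""
    lines = textwrap.wrap(paragraph, width=width)
    return "\r\n".join(lines) + "\r\n"

if __name__ == "__main__":
    sample = ("There Perimedes and Eurylochus held the victims, but I drew "
              "my sharp sword from my thigh, and dug a pit, as it were a "
              "cubit in length and breadth.")
    print(pg_wrap(sample, width=60), end="")
```

Wrap a paragraph at a time, so that the blank lines between paragraphs are preserved.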
We'll be discussing each of the formatting decisions below, not only giving the summary PG answer, but also discussing the plusses and minuses of each, and the possible options.
Like any question beginning "Why does/doesn't PG . . . ?", the answer is "Because that's what the volunteers and readers want!". These conventions have been worked out over the years, largely by Michael Hart, our founder and chief volunteer, in conjunction with all of us volunteers, as the result of feedback from readers.
We are guided throughout by the principle that we want to produce texts in the simplest format that will adequately express the content. Quoting Michael Hart (1994):
Etext as developed and distributed by Project Gutenberg since 1971 was never intended to be a copy of a paper or a parchment [remember, first Project Gutenberg Etext was typed in from parchment replicas of the US Declaration of Independence]. The major purposes of Project Gutenberg have always been:
- to encourage the creation and distribution of electronic texts for the general audience.
- to provide these Etexts in a manner available to everyone in terms of price and accessibility [i.e. no special hardware or software], and no price tag attached to the Etexts themselves.
- to make the Etexts as readily usable as possible, with no forms or other paperwork required, and as easily readable to the human eyes as to computer programs, and in fact, more readable than paper.
There is sometimes a conflict between "simplest format" and "adequately express the content"; further, different people have different views on what is "simple" or "adequate". You, the producer of the text, have spent the time and effort to make the eBook available to the world, you have thought more about it than anyone else, and we respect your informed judgment. However, please make sure that your judgment has been informed, by studying the precedents and reasons behind our guidelines.
Where a simple, standard PG-ASCII layout does not, in your view, "adequately express the content", you should think of making your text in another open format, perhaps HTML or XML or TeX, that allows you to use more characters, more formatting options, and images. We are always happy to accept these kinds of files. In these cases, you should also provide a standard PG-ASCII version, even if you feel it is unacceptably degraded, for those who cannot use your preferred format.
Just ten years ago, presentation as plain ASCII was not only a universal standard, it was effectively the only way that most people could view the books. The first version of the HTML specification had been drafted, but was unknown among the general public. XML did not exist. SGML was (as it still is) the province of specialists. Specialized eBook readers and PDAs had not yet appeared.
In 2004, plain vanilla ASCII is still readable everywhere, but people also want to convert our texts into other formats for more convenient loading on readers and web sites. We therefore have to keep in mind that our works will be processed by automatic conversion programs, none of which is perfect, and we have evolved some "defensive formatting" practices, which, while retaining the universality of plain text, also supply clues to automatic converters about how they should treat the layout. These do help to keep converters from making at least the worst mistakes. The most significant "defensive formatting" practices are indenting unwrappable text like quotations, and using _underscores_ rather than CAPITALS for italics. Different volunteers have different priorities: at one extreme, some people want to make the best plain text they can, giving no weight to conversion issues; at the other, some people emphasize the cues that will allow automatic reformatters to convert the texts well, even if that causes some ugliness in the plain text. Most of us operate somewhere between, making the choices we feel are best depending on the context. Getting a text on-line is the important thing; which choices you make in doing so is a matter of detail.
About the characters you use
V.74. What characters can I use?
- You should use plain ASCII for straight English texts.
- When producing a text partly or completely in a language that requires accents, you should use the appropriate ISO-8859 character set for the language, and specify which you are using, and also provide a 7-bit plain ASCII version with the accents stripped.
- When producing a text in a language that doesn't use one of the ISO-8859 character sets, you should use the encoding most commonly used for that language. [e.g. Chinese, Big 5]
- When producing a text containing more characters than can be found in any one of the ISO-8859 character sets, you should use Unicode.
You should use plain ASCII wherever possible — that is, the letters and numbers and punctuation available on a standard U.S. keyboard, without accented letters. The immediate and major exception to this is when you are typing a text written in a language like French or German that requires accents.
There is a problem with using non-ASCII characters. They do not display consistently on all computers; in fact, they do not even display consistently on the same computer! On my computer, for example, what looks like an e-acute in this editor just shows as a black box in another editor, or even using a different font in the same editor. And this is by no means confined to some theoretical minority; we have to deal with it all the time when posting texts.
Further, standards are changing: ten years ago, the character set Codepage 850 [MS-DOS] was very common; now it's rare except in some texts that have survived those ten years.
We want to preserve these texts over centuries, not just decades, and at the moment there is no single clear standard that we can use across all texts. Unicode may perhaps be a future standard, but, right now, it's not something that people use every day, and it's not supported by a lot of common software.
ASCII, while limited, is supported by almost all computers everywhere, so we make a point of always supplying an ASCII version where possible, even if the ASCII version is degraded when compared to the 8-bit original. When we get a text in, say, German, we post two versions of it — one with accents and one without.
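One way to make a first draft of the accent-stripped 7-bit version is sketched below. It relies on Unicode decomposition from Python's standard unicodedata module; the function name is hypothetical, and this is an assumption about how you might do it, not a description of PG's own process. Letters that have no unaccented base form (the German sharp s, for example) are simply dropped by this sketch, so a real conversion needs a hand-made replacement table for those and a careful check afterwards.

```python
import unicodedata

def strip_accents(text):
    """Return a plain-ASCII approximation of text with accents removed.

    Characters that cannot be decomposed into an ASCII base letter
    are silently dropped, so check the result by eye.
    """
    # NFKD splits an accented letter into base letter + combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors="ignore" discards the combining accents.
    return decomposed.encode("ascii", "ignore").decode("ascii")

if __name__ == "__main__":
    print(strip_accents("d\u00e9bris fa\u00e7ade caf\u00e9"))
```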
V.75. What is ASCII?
Don't get scared by the computer jargon; ASCII (pronounced ASS-key) is just a name for the set of unaccented letters, numbers and other symbols on a standard U.S. keyboard.
ASCII (American Standard Code for Information Interchange) is a set of common characters, including just about everything that you can type in on an English-language keyboard. It includes the letters A-Z, a-z, space, numbers, punctuation and some basic symbols. Every character in this document is an ASCII character, and each character is identified with a number from 0 through 127 internally in the computer.
Just about every computer in the world can show ASCII characters correctly, which makes it ideal for PG's purpose of providing texts that can be read by anyone, anywhere, but ASCII does not include accented characters, Greek letters, Arabic script and other non-English characters, which causes some problems when we produce texts that need non-ASCII characters.
V.76. So what is ISO-8859? What is Codepage 437? What is Codepage 1252? What is MacRoman?
Today's computers mostly work on the basis of dealing with one "byte" at a time. A byte is a unit of storage that can contain any number from 0 through 255 (256 values in all). It's very convenient for computers to associate one character with each of these numbers, so that we can have up to 256 "letters" viewable from the values stored in one byte. The first 128 values, zero through 127, are defined by ASCII: for example, the number 65 represents a capital "A", 97 represents a lowercase "a", 49 stands for the digit "1", 45 for the hyphen "-", and so on.
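If you have Python to hand, you can confirm these ASCII values for yourself. This small sketch uses only the built-in ord() and chr() functions; the variable name is just for illustration.

```python
# Map each sample character to its numeric ASCII value.
codes = {ch: ord(ch) for ch in "Aa1-"}

if __name__ == "__main__":
    for ch, number in codes.items():
        print(repr(ch), number)
    # chr() goes the other way, from number back to character.
    print(chr(65))
```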
ASCII doesn't define characters for the values 128 through 255, and in early days computer manufacturers used these values to hold non-ASCII characters like accented letters and box-drawing lines. Of course, 128 wasn't nearly enough values to hold all of the characters that people needed to use for different languages, so they made the character sets switchable, so that a PC in France could use a different set of accented letters from a PC in Poland. Microsoft's version of this was called Codepages. Each Codepage held a different set of non-ASCII characters. Codepage 437, and later Codepage 850, were commonly used for English and some major Western European languages on MS-DOS.
MacRoman was Apple's first codepage, containing most of the accented letters in Latin-derived languages, and MacRoman is still in common use on Apple Macs today.
Later, the International Organization for Standardization (ISO) got around to looking at the problem, and defined ISO-8859-1, ISO-8859-2 and so on, as the standards for different language groups. These sets all define the characters 160 through 255 as accented letters and other symbols, and define the 32 characters from 128 through 159 as control characters.
Since Microsoft Windows has no use for the control characters 128 through 159, Windows fonts commonly use Codepage 1252, which has ASCII in the first 128 characters, ISO-8859-1 in characters 160 through 255, and other symbols in the characters 128 through 159. Just to make an already chaotic system worse, all characters can be defined differently in different fonts!
Of course, most of these codepages are incompatible with each other. For example, the byte value 232 shows as a lower-case "e" with a grave accent in ISO-8859-1 and CP1252, a capital letter "E" with diaeresis in MacRoman, a Latin capital letter "Thorn" in CP850, a Cyrillic lower-case "Sha" in ISO-8859-5, a Greek capital letter "Phi" in CP437, and so on. So if you view a text intended for one of these character sets with a program that assumes a different character set, you see gibberish.
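The byte-232 example can be reproduced with the codecs that ship with Python. In this sketch the codec names are Python's own spellings of the character sets named above, and the function name is hypothetical.

```python
# Python's names for the character sets discussed in the text.
CODECS = ["iso-8859-1", "cp1252", "mac_roman", "cp850", "iso-8859-5", "cp437"]

def decode_byte(value):
    """Show how a single byte value is interpreted by each character set."""
    return {name: bytes([value]).decode(name) for name in CODECS}

if __name__ == "__main__":
    for name, char in decode_byte(232).items():
        # Print the Unicode code point rather than the character itself,
        # so the output is safe on any terminal.
        print(f"{name:12} U+{ord(char):04X}")
```

The same trick, applied to a whole file instead of one byte, is a quick way to guess which character set an unlabelled text was typed in: decode it several ways and see which version is not gibberish.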
The good news, for mostly-English texts at least, is that ISO-8859-1, Codepage 1252 and Unicode agree on the numerical values of the accented characters and symbols to be represented by the values 160 through 255. And everybody accepts ASCII — a pure ASCII file is valid ISO-8859-anything, valid Codepage-anything, and valid Unicode UTF-8.
For more detail about the mappings between Unicode and other formats, you can view:
Unicode<-->ISO-8859 mappings at
Unicode<-->Windows mappings at
and Unicode<-->Apple mappings at
If you're not confused enough by now, please read the excellent guide to the whole "alphabet soup" problem at <http://aspell.com/charsets/>.
V.77. What is Unicode?
Recognizing that no single set of 256 characters can hold all of the symbols necessary for true multi-lingual texts, ISO 10646 was created. This defined the Universal Character Set (UCS) using 31 bits, which has the potential for a staggering 2 billion characters.
The Unicode Consortium is a group of computer industry companies who define the Unicode standard. Unicode accepts the ISO 10646 standards, and adds some restrictions and implementation processes. It plans for a modest million or so characters; however, this is enough for all living and extinct languages, and imaginable future ones too.
Using 4 bytes for each character is wasteful, though, when most characters need only one or two, and there are programming problems with implementing 4-byte characters, so Unicode provides Transformation Formats (UTF) which allow the characters to be encoded using fewer bytes where possible. UTF-8 and UTF-16 are common.
UTF-8, which is the most practical of these from the PG point of view, allows ASCII to be encoded normally, and usually uses two or three bytes for other non-ASCII characters.
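To see how many bytes UTF-8 spends on different characters, you can try a sketch like the following in Python. The sample characters are our own choice for illustration, written as escape codes so that the example itself stays plain ASCII.

```python
def utf8_length(ch):
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(ch.encode("utf-8"))

# Sample characters, chosen only for illustration.
SAMPLES = {
    "Latin capital A": "A",
    "e with acute accent": "\u00e9",
    "Greek capital Omega": "\u03a9",
    "Arabic letter alef": "\u0627",
    "Devanagari letter a": "\u0905",
    "Ogham letter beith": "\u1681",
}

if __name__ == "__main__":
    for name, ch in SAMPLES.items():
        print(f"{name:22} {utf8_length(ch)} byte(s)")
```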
Because of the extra work needed to support this extra space, and the fact that most people work mostly in one or maybe two languages, Unicode is being adopted only slowly, and most computer programs in 2002 do not fully support it. But when you need to mix Arabic, Greek, Ogham and Sanskrit in one text, it's the only possible answer!
For more about this, go straight to the source at <http://www.unicode.org>.
V.78. What is Big-5?
Big 5 is an encoding of a set of 13,000+ traditional Chinese characters.
V.79. What are "8-bit" and "7-bit" texts?
For practical purposes, 7-bit texts are plain ASCII; 8-bit texts have accented letters.
This comes from computer jargon. You can represent the 128 characters of ASCII using 7 bits (binary digits), but to represent the 256 characters of the various codepages and ISO-8859 standards, which include accented letters, you need 8 bits. Hence, we call a text that uses non-ASCII characters in a character set like Codepage 850 or ISO-8859-1 an "8-bit" text.
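One way to see the distinction: a file is 7-bit if every byte in it is below 128. The following Python sketch assumes you have the file contents as bytes; the name is_seven_bit is hypothetical.

```python
def is_seven_bit(data):
    """Return True if every byte fits in 7 bits, i.e. the data is plain ASCII."""
    return all(byte < 128 for byte in data)


plain = b"cafe"
accented = "caf\u00e9".encode("latin-1")  # the e-acute becomes byte 233

print(is_seven_bit(plain))
print(is_seven_bit(accented))
```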
When we post a text as both 8-bit and 7-bit, as we do when ASCII is not enough to render the text acceptably, we name the file with an "8" or a "7" at the start. So, for example, Crime and Punishment by Dostoevsky is named 8crmp10 for the 8-bit version with accents, and 7crmp10 for the 7-bit version without accents.
See also FAQ [R.35]: "What do the filenames of the texts mean?"
V.80. I have an English text with some quotations from a language that needs accents — what should I do about the accents?
If stripping the accents would unacceptably degrade the book, then submit two versions, one "8-bit" with the accents included and one "7-bit" plain ASCII, and we will post both.
This is a hard choice. What constitutes "unacceptable degradation"?
Clearly this is a decision that all of us in PG have to make. It's a very common problem, and different people have different views. For that matter, different print publishers have different views; you will see the words "debris", "facade" and "cafe" printed with and without accents in different books, and even in different editions of the same book.
There is no clear line, no definitive answer to what level of degradation is acceptable. Most producers feel that there is no point in making a separate version when dealing only with a few foreign words thrown in among the English, but when, for example, some significant dialog between the characters is in French or Spanish, it's not acceptable to lose that content. You, the producer, need to decide this on a case-by-case basis. If you're not sure, discuss it with one of the Posting Team.
If you have made the text with accents, you can choose to make your own 7-bit version and send it to us, or just send the 8-bit version and we'll make the 7-bit version from it. Some people prefer to make their own 7-bit editions; most don't. Whether you use a Microsoft Codepage, one of the ISO standards or MacRoman doesn't matter — we can convert any of them for you.
In 2004, the balance of opinion has shifted sharply towards retaining at least one file which preserves everything, and at least one file in the lowest common denominator: ASCII. So it is now the norm for producers to submit texts in higher character sets like the ISO-8859 family or Unicode, and we down-shift them during posting so that we can post both the original file and an ASCII version. However, if you are working in Unicode, we really do appreciate it if you also send an ISO-8859 file, since that conversion usually needs some attention from someone familiar with the text.
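For producers who want to make their own 7-bit version, here is a minimal Python sketch of one possible approach: decompose each accented letter and drop the accent marks. This is an assumption about how such a conversion could be done, not a description of the tools the Posting Team uses, and the name to_seven_bit is invented for the example. Characters with no plain-letter decomposition come out as "?" and need a decision by hand.

```python
import unicodedata


def to_seven_bit(text):
    """Strip accents by decomposing characters and dropping combining marks.

    Anything that still is not ASCII afterwards is replaced with '?',
    as a flag that a human needs to decide what to do with it.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.encode("ascii", "replace").decode("ascii")


print(to_seven_bit("d\u00e9bris, fa\u00e7ade, caf\u00e9"))
```

Searching the result for "?" characters that were not in the original is a quick way to find the places that need attention.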
V.81. I have some Greek quotations in my book. How can I handle them?
There is no way to show Greek letters in ASCII. You have three options:
You can just replace the Greek words with [Greek] to indicate to the reader that you have omitted it.
You can "transliterate" the Greek to ASCII. Greek letters do have a correspondence to plain "Latin" letters — for example, the Greek letter "delta" can be represented by the letter "d". There is a clear and simple PG guide to transliteration written by Robert Brewer below. This practice has had a long and honorable history: words like "amphora" and "hubris", for example, are straight transliteration from the Greek. This is usually the best option for the general case of a few quotations in an otherwise non-Greek book. Other transliteration schemes have been designed for more scholarly work, but this is simple and effective for most Greek tags you will encounter in PG work.
See also our Greek transliteration HowTo.
If there is enough Greek to warrant it, and no other accented characters, you may be able to use the ISO-8859-7 character set, and submit both 7-bit and 8-bit versions [V.79]. ISO-8859-7 is for modern rather than classical Greek, but, if necessary, you will surely be able to express the Greek fully in Unicode. However accurate your Greek, that still leaves the issue of what to do with the 7-bit ASCII version, where transliteration is probably still your best bet.
V.82. I want to produce a book in a language like Spanish or French with accented characters. What should I do?
Use the appropriate ISO-8859 Character set [V.76] for your 8-bit version.
About the formatting of a text file
This section of the FAQ goes into great detail about all kinds of formatting questions. However, looked at from a higher level, the only real issue is that we want to render texts clearly, with formatting that reflects the original, so that readers of the plain text format can read them easily, and people converting them to other formats can do so reliably. When you come across a case that is not covered by the detailed guidelines below, keep this ultimate aim in mind, and make the best decision you can. Don't get hung up for hours or days over a question of formatting — if you want advice, look at how other people have handled the same situation in previous texts, or ask other volunteers for their ideas.
V.83. How long should I make my lines of text?
For normal prose, such as you find in a novel, your lines should mostly be 60 to 70 characters long, not shorter than 55, not longer than 75 except where it can't be helped. Never, ever longer than 80, except where you're trying to render a non-text structure, like a family tree.
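If you want to check a finished text against these limits, a short script can list the offending lines. This Python sketch is only one way to do it; the name long_lines and the default limit of 75 are choices made for the example.

```python
def long_lines(text, limit=75):
    """Return (line number, length) for every line longer than the limit."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]


sample = "short line\n" + "x" * 81 + "\n" + "y" * 70
for number, length in long_lines(sample):
    print("line", number, "is", length, "characters long")
```

Lines it reports may still be fine if they render a non-text structure, so treat the output as a list of places to look at rather than a list of errors.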
For poetry, make the text look as much like the book as possible. This also applies to some plays where the lines are clearly intended to be broken at specific points, whether blank verse or not.
V.84. Why should I break lines at all? Why not make the text as one line per paragraph, and let the reader wrap it?
We could either use 70-character lines and let readers unwrap them if they want to, or use infinite-length lines and let readers wrap them if they want to. We choose to wrap the lines so that they are readable on even the simplest of text editors and viewers.
V.85. Why use a CR/LF at end of line?
CR/LF can lead to double-spacing, notably on Mac and Unix, but at least there is a CR in there for Mac users, and there is an LF for *nix users.
If you don't know or care what this is about, please skip blithely on.
There are three differing standards for how to represent the end of a line of text. In brief, Apple Macs use the CR character. Unix and its variants use the LF character. Microsoft systems, from MS-DOS through Windows, use both together.
If you want the history behind these:
CR stands for Carriage Return, and comes from the old typewriter / teletype idea of a command to move the print head from the right of the page back to the left when it reaches the end;
LF stands for Line Feed, and comes from the old typewriter / teletype idea of a command to move the print head down a line;
CR/LF together indicate moving down a line and back to the left of the page.
The history is not relevant to today's computers in principle, but in practice they all use one of these legacy conventions, and there's nothing we can do about it but pick one.
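For the curious, converting any of the three conventions to CR/LF is a small job. The Python sketch below is illustrative only, and the name to_crlf is made up; it first reduces every line ending to LF and then expands each LF to CR/LF, so a file with mixed endings does not end up double-spaced.

```python
def to_crlf(text):
    """Normalize CR, LF and CR/LF line endings to CR/LF."""
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    return unified.replace("\n", "\r\n")


mixed = "Mac line\rUnix line\nDOS line\r\nlast line"
print(repr(to_crlf(mixed)))
```

When trying this on a real file, read and write it in binary mode, or open it with newline="" in Python, so that the language does not translate the line endings behind your back.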
V.86. One space or two at the end of a sentence?
Whichever you prefer, but if using two spaces, please use them only at the end of a sentence, not after abbreviations like "Dr." and "per cent.", and not after non-sentence-ending punctuation like the question-mark in the sentence: "Must you go? when the night is yet so black!"
Many people have strong views on either side of the "one space or two?" question, and we're not about to try and argue with them. Use whichever is most natural for you.
However, if using two, you take responsibility for deciding where the sentence ends. You can't just place two spaces after every period, question-mark and exclamation mark, since periods are also used for abbreviations and ellipses, and question-marks and exclamation-marks don't always end sentences.
V.87. How do I indicate paragraphs?
Just leave a blank line before each paragraph.
V.88. Should I indent the start of every paragraph?
No.
Printers do this when publishing paper books because they do not leave blank lines in the text, but there is no need for indenting in our eBooks.
V.89. Are there any places where I should indent text?
Yes. You should always make poetry look like the original, and that may mean indenting some lines, for example:
    I was a child and she was a child,
      In a kingdom by the sea;
    But we loved with a love that was more than love —
      I and my Annabel Lee;
Even when poetry doesn't have indented lines, it is a good idea to indent quotations embedded in prose. Remember, others will be converting your text later — to HTML, to PDA reader formats, to formats that don't even exist yet — and much of this conversion will be done automatically, by computer programs. It is very hard for a program to know when it can and can't re-wrap lines to fit a screen size unless it has a clear signal that this line should not be wrapped. This is one of the biggest problems with auto-converting PG texts.
Just about all formatting programs "know" that lines that are indented shouldn't be wrapped, so by indenting lines just a space or two, you can prevent
  I think that I shall never see
  A poem lovely as a tree.
from turning into
I think that I shall never see A poem lovely as a tree.
in some future reader's eBook.
You don't really need to do this in texts where the whole book is poetry or blank verse, since these will probably be recognized as whole books that shouldn't be rewrapped, but when there are a few lines of quotation amid an acre of straight prose, a few spaces will be a life-saver. Even in the original plain text version, the extra spaces serve to set the quotation off from the main text.
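To show how a converter might honor that signal, here is a minimal Python sketch of a rewrapper that refills ordinary paragraphs but leaves any indented line exactly as it is. It is an assumption about how such a tool could work, not a description of any particular converter, and the name rewrap is hypothetical.

```python
import textwrap


def rewrap(text, width=70):
    """Rewrap unindented prose to the given width; keep indented lines as-is."""
    out = []
    pending = []  # consecutive unindented lines waiting to be refilled

    def flush():
        if pending:
            out.extend(textwrap.wrap(" ".join(pending), width=width))
            pending.clear()

    for line in text.splitlines():
        if not line.strip():
            flush()
            out.append("")        # blank line: paragraph boundary
        elif line[0] in " \t":
            flush()
            out.append(line)      # indented: do not touch
        else:
            pending.append(line.strip())
    flush()
    return "\n".join(out)


sample = (
    "Some prose that\nruns on.\n\n"
    "  I think that I shall never see\n"
    "  A poem lovely as a tree.\n\n"
    "More prose."
)
print(rewrap(sample, width=40))
```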
You shouldn't get carried away and indent things 20 spaces for this reason, though. Anything up to four spaces is reasonable; more is excessive. If you're indenting many short verses in this way, keep your number of spaces for indentation consistent throughout the book.
There are some other times when you may judge it best to indent, where text is indented in the paper book, like newspaper headlines or pictures of handwritten notes.
V.90. Can I use tabs (the TAB key) to indent?
No.
The problem with tab characters is that they act differently in different applications. Typically a tab will move the text to the next tab stop, which might be four spaces on your PC, but 20, or none, on someone else's. The effects are unpredictable.
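The effect is easy to demonstrate. In this Python sketch, the same line is expanded with two different tab-stop settings, and a small helper (lines_with_tabs, a name invented here) lists the lines of a text that still contain tab characters.

```python
def lines_with_tabs(text):
    """Return the numbers of the lines that contain a tab character."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if "\t" in line
    ]


line = "a\tb"
print(repr(line.expandtabs(4)))  # tab stops every 4 columns
print(repr(line.expandtabs(8)))  # tab stops every 8 columns
print(lines_with_tabs("one\n\ttwo\nthree"))
```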
V.91. How should I treat dashes (hyphens) between words?
In typography, there are four standard types of dashes: the hyphen, the en-dash, the em-dash, and the three-em-dash.
Originally, printers called these the "em-dash" because it was the same width as the capital letter M in whichever font they were using, the "en-dash" because it was the same width as the capital letter N, and the "three-em-dash" because it was as long as three capital Ms.
The hyphen is used for hyphenated words, like "en-dash" itself, or "to-day" or "drawing-room". For this, you just press the single dash or hyphen key on your keyboard.
In typography, the en-dash is a little longer than the hyphen, and is typically used for duration, where you could substitute the word "to". For example, if you were printing "1830-1874", or "9:00-5:30", you would use an en-dash instead of a hyphen. The en-dash is also sometimes used as hyphenation between words that are already hyphenated, for example, "bed-room-sitting-room" might use an en-dash as its central dash to emphasize that it is a different type of separator from the plain hyphens before "room". However, there is no ASCII character for an en-dash, and we use the hyphen in these cases. (HTML and some character sets do provide separate entities for en-dash and em-dash.)
The em-dash is shown in print as a longer dash, and for PG purposes, you should render it as two hyphens with no spaces around them.
You use the em-dash as a kind of parenthesis — as I am doing here — or to indicate a break in thought or subject within a sentence. There is no ASCII equivalent of the em-dash; there is no key on your keyboard that you can press to get one. For PG texts, we represent the em-dash as two dashes with no space between or around them--like this.
The em-dash can also be used at the end of a sentence or speech to indicate that the speaker stopped or trailed off. For example:
"When I saw you with Emily, I thought you were-- I thought she was--"
In a case like this, the context may demand a space following the em-dash, not because of the em-dash as such, but to make the break between the statements or sentences clear.
These two hyphens represent one character, so you should never break them at line end, with one hyphen at the end of the first line and the other at the start of the second. If you have an em-dash near line end, you can break the line either before or after the em-dash, but never in the middle.
The fourth type of dash, the three-em-dash, is used to represent a missing word, or an undetermined number of missing letters. You will often see it in a sentence like:
- Dr. P——— was known for his honesty.
or
- Dr. ——— was known for his honesty.
where there is a convention that the character's name has been redacted. Logically, we should represent the three-em-dash as six dashes, but you may reduce that to four. Whichever you choose, do use it consistently in the text you're producing.
Unlike the em-dash, you should leave a space in such cases wherever a space would have been before the letters were replaced by dashes.
Here's a summary table of the dashes:
| Name | ASCII | Used for |
|---|---|---|
| Hyphen | - | Hyphenated Words |
| En-dash | - | Durations, like 1914-18 |
| Em-dash | -- | Break in sentence or parenthetical comment |
| Three-em-dash | ------ | Indicating a word that was edited out. |
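If your source text already contains the typographic dash characters, for example because it came from OCR or a word processor, the table above can be applied mechanically. This Python sketch assumes the Unicode code points U+2013 (en-dash), U+2014 (em-dash) and U+2E3B (three-em-dash); the name to_pg_dashes is invented for the example. It does not touch spacing, since whether a space belongs beside a dash is a judgment for the producer.

```python
# Mapping from typographic dashes to the ASCII forms in the summary table.
DASH_TABLE = str.maketrans({
    "\u2013": "-",       # en-dash
    "\u2014": "--",      # em-dash
    "\u2e3b": "------",  # three-em-dash
})


def to_pg_dashes(text):
    """Replace typographic dash characters with their ASCII renderings."""
    return text.translate(DASH_TABLE)


print(to_pg_dashes("1914\u201318"))
print(to_pg_dashes("two dashes\u2014like this"))
print(to_pg_dashes("Dr. \u2e3b was known for his honesty."))
```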
V.92. How should I treat dashes replacing letters?
If the dashes obviously represent individual letters, use the same number of hyphens. Otherwise, you can use a three-em-dash (see above: 6 or 4 hyphens) in such places.
A common convention when a character in a novel is using bad language, or when reference is given to a character whose full name is not being used, is to replace the letters with dashes. For example,
"That D---l, Mr. C------s will regret his hasty actions!"
In this case, it is clear that "D---l" is meant to represent "Devil" and that there is a character whose name begins with "C" and ends in "s" whose name is not spelled out in full. Where the book makes it clear how many letters are represented by hyphens, just use that number of hyphens.
Where the number of letters omitted is not clear, you can decide how long you want to make your extended dash. Typographers often use the "three-em-dash" for this, so called because it is as wide as three capital Ms. Logically, since we represent an em-dash by two hyphens, we might represent a three-em-dash as six, but if you feel that six hyphens is too long, you can choose a shorter length, like four. If you do, keep it consistent within your text:
It was in the town of S----, walking on M---- Street, that Sowerby came upon Dr. T---- taking the morning air.
V.93. What about hyphens at end of line?
Remove the hyphens from single words that were wrapped by the printer at line-end on the paper copy. Where two words are joined with a hyphen, you can leave the hyphen at end of the text line.
Books are usually printed with words broken at end of line to make the right side of the text perfectly even. You should remove all such hyphens. For example, in the sentence:
  Mary's mouth tightened as she saw the marks on the car-
  pet, and her hands balled into fists.
you should remove the hyphen from "carpet".
Words which are strung together and hyphenated by the author pose a different question. It is perfectly OK from the point of view of a reader of the plain text version for such a hyphen to occur at end of line, for example:
  Now that the guns were silent, convoys brought badly-
  needed medical supplies and food.
However, be aware that if somebody later rewraps the text for use in a different format like HTML, it is possible that they will introduce a space where it should not be:
Now that the guns were silent, convoys brought badly- needed medical supplies and food.
so there is still a small disadvantage to having a hyphen at line-end.
Sometimes it's not entirely clear whether the hyphen is there because it has to be, or just because it happens to fall at the end of the line:
  Daisy rushed to the door, but there were no letters for her to-
  day, and she retreated sadly.
Sometimes "today" is written as "to-day", especially in older works. So which is this? Should we remove the hyphen or not? In this case, the best thing to do is search the rest of the text for the same word, and see whether it is consistently hyphenated or not in other places.
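That search can be done by eye in any editor, or with a few lines of script. The Python sketch below counts how often a word appears hyphenated and unhyphenated in the middle of a line; the name hyphen_votes is hypothetical. A hyphen followed directly by a line break is not counted either way, since that is the doubtful case.

```python
import re


def hyphen_votes(text, left, right):
    """Count mid-line uses of 'left-right' and of 'leftright' in the text."""
    hyphenated = re.findall(
        rf"\b{re.escape(left)}-{re.escape(right)}\b", text, re.IGNORECASE
    )
    joined = re.findall(
        rf"\b{re.escape(left + right)}\b", text, re.IGNORECASE
    )
    return len(hyphenated), len(joined)


sample = (
    "there were no letters for her to-\n"
    "day, and she retreated sadly.\n"
    "She wrote to-day. He came to-day too. Not today."
)
print(hyphen_votes(sample, "to", "day"))
```

If the first count clearly outweighs the second, keep the hyphen when you rejoin the word; if the reverse, drop it.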
V.94. What should I do with italics?
There are three different ways volunteers have rendered italics: like THIS, like _this_ and like /this/. Pick one, and use it consistently in your text. As of 2004, the consensus is overwhelmingly in favor of _underscores_, and unless you have some very specific reason — for example, a text where underscores are needed as part of the content — that's what you should use. However, be aware that in older texts, you may see the alternate methods.
There are really two questions here: "How should I render italics?" and "When should I render italics?"
The original PG standard for italics was to render emphasis italics as CAPITALS, using underscores for an italicized _I_, and do nothing for non-emphasis italics like foreign words and names of ships. Since about the end of 2002, the consensus has moved away from this convention.
It had two drawbacks:
- if you do want to preserve italics for non-emphasis words, you may end up with a very ugly text where there are too many capitals.
- it is impossible to convert CAPITALS reliably back into italics, since the original text might have had a capital letter, or even been all capitals in the first place. This is especially true of automatic conversion for people who want to read PG texts on eBook readers.
To overcome these problems, most volunteers now use _underscores_ or, occasionally, /slants/ to render italics. These allow you to preserve all italics without creating an ugly plain-text, and to remove the ambiguity of CAPITALS. Underscores are now the effective standard for italics in PG texts.
Using underscores means that there is no ambiguity, so you don't have to use them for emphasis only, as in the old days. You can and should use them anywhere the original text is italicized.
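The lack of ambiguity is what makes later conversion practical. As a sketch of what a converter might do, this Python function turns _underscored_ spans into HTML italics. It assumes underscores are used only for italics in the text, and the name italics_to_html is made up for the example; a real converter would also need to escape characters that are special in HTML.

```python
import re


def italics_to_html(text):
    """Turn _underscored_ spans into <i>...</i> markup."""
    return re.sub(r"_([^_]+)_", r"<i>\1</i>", text)


print(italics_to_html("The _Nautilus_ sailed, and _I_ stayed behind."))
```

No comparable rule exists for CAPITALS, because a program cannot tell an emphasized word from one that was capitalized in the original.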
V.95. Yes, but I have a long passage of my book in italics! I can't really CAPITALIZE or _otherwise_ /mark/ all that text, can I?
No, you really can't. On the other hand, if the author intended that section to stand out, you don't want to ignore that information and withhold it from future readers.
What you can do is format it differently from the rest of the text. For example, if you're averaging a 68-character line throughout normal paragraphs, you could reasonably use shorter lines, like 58 characters, for the italicized section. Going a step further, you could shorten the lines and indent them a space or two as well. This will give a clear signal to future readers and converters that this section is to be treated specially.
V.96. Should I capitalize the first word in each chapter?
No.
Capitalization of the first word is often used in printed material to emphasize the break at the start of a section or chapter on the paper, but it is not necessary in an eBook, and leads to the same kind of ambiguity as does the capitalization of italics, and for far less reason.
If you feel you really must capitalize the first word, we probably won't stop you, but if so, please do it consistently throughout the book, not just in one or two places, so that a future reader can be certain that these capitalized words were a chapter-head convention, and not otherwise intended for emphasis.
V.97. What is a Transcriber's Note? When should I add one?
A Transcriber's Note is a small section you can add to a text you produce to give the reader some information about changes you made to the book when rendering it into text.
A Transcriber's Note is not the same as a footnote — a footnote is part of the text you have transcribed; a Transcriber's Note is a note that you add to the text, explaining something you have done or omitted. If there is a Transcriber's Note, it may be at the top or the end of the text, and it should be clearly marked so that a reader cannot confuse it with the main text or an introduction.
The main thing is to ensure that a reader cannot confuse text that you have added with text that was in the original book.
Transcriber's Notes are rarely needed, but if, for example, you found misprints in the text, or things that might look like misprints even though they're not, you may note them here, if it seems relevant. If there is an image in the book that is important to the content, you may describe it in a note. If there was unusual typography that you had to represent in some uncommon way, you might well explain that here.
You don't need to add a Transcriber's Note just for common conversions like italics, and you should not use such a note to add your own comments or views about the text or the author. It's just there to let the reader know what decision you have made about rendering the text.
Here are some examples of Transcribers' Notes:
- Transcriber's Note:
- The irregular inclusion or omission of commas between repeated words ("well, well"; "there there", etc.) in this etext is reproduced faithfully from the 1914 edition . . .
- Transcriber's Note:
- Inserted music notation is represented like [MUSIC — 2 bars, melody] or [MUSIC — 4-part, 8 bars]
- [Transcriber's Note: This letter was handwritten in the original.]
- Transcriber's Note:
- The spelling "Freindship" is thus in the original book.
- Transcriber's Note: Some words which appear to be typos are printed thus in the original book. A list of these possible misprints follows:
If there is an image that is important to the content you may describe it at the point in the text where it appears, for example:
- [Transcriber's Note: Here there is a map of three islands just West of and parallel to a coastline running SW to NE, with a big X marked on the North of the middle island. A spur of land extends from the mainland, sheltering the islands from the north-east.]
Transcriber's Notes that apply to the whole text should be placed at the start or end of the text — your choice. Notes that pertain to a specific point in the text, like the map example above, should be placed at the point in the text where they are relevant, but not interrupting a paragraph except where it cannot be avoided.
V.98. Should I keep page numbers in the e-text?
No. But there are exceptional cases . . .
In general, the page numbers of the original book are irrelevant when making a reader's edition for PG; they are annoying and intrusive for anyone trying to read it, and if you did keep them, they would probably be removed by anyone converting it. Get rid of them!
But there are a few books where page numbers are appropriate. Non-fiction books that use page numbers as internal cross-references are the prime example; if, on page 204, the text reads
- "Our studies of plants (see pp. 141-145) show that this is true."
and this kind of cross-reference is frequent throughout the text, then it is probably best to keep the page numbers, since it is otherwise very difficult to honor the author's intent.
In the more common case where cross-references exist, but are not frequent, and not essential to the text, you have several choices: leave the cross-references in, meaningless though the page numbers are, remove the cross-references, change the cross-references to something relevant (like "Start of Chapter 12" instead of "pages 141-145"), or, if you can make it work in context, insert references in the text for the cross-references to point to, like [Reference: Plants] and then reformat the cross-reference like "Our studies of plants (see [Reference: Plants]) show that this is true."
There are a few other cases, where the text you create is likely to be the subject of study or reference, in which it may also be desirable to retain page numbering.
When there are pages at the end of the book with notes referring to page numbers, the simplest answer is to change the page number references to chapter numbers, and add a quote from the page referred to if it's not already in the book's end-notes. That way, a reader can search for the phrase.
V.99. In the exceptional cases where I keep page numbers, how should I format them?
Within brackets of your choice, with one space either side, simply added to the text at the exact point of the page break. Unless there is some [142] special reason, you shouldn't insert a line break or new paragraph when indicating a page number; just insert it in the text, as I did with "142" above.
You should use whichever of round brackets, (143) square brackets, [144] or curly brackets {145} is not used (or least used) within the main text itself, and then use it consistently. Try to make sure that your page numbers cannot be confused with anything else.
Don't run your[146]page[147]numbers right up against words with spaces omitted; this just makes the text hard to read. Use spaces before and after.
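A quick way to find page numbers that have been run up against words is a pattern search. This Python sketch assumes square brackets were chosen for page numbers and that nothing else in the text uses bracketed numbers; the name jammed_page_numbers is invented for the example.

```python
import re

# A bracketed number with a non-space character directly before or after it.
JAMMED = re.compile(r"(?<=\S)\[(\d+)\]|\[(\d+)\](?=\S)")


def jammed_page_numbers(text):
    """Return the page numbers that lack a space on at least one side."""
    return [int(m.group(1) or m.group(2)) for m in JAMMED.finditer(text)]


print(jammed_page_numbers("Don't run your[146]page[147]numbers together."))
print(jammed_page_numbers("Unless there is some [142] special reason."))
```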
Where the page break is at the start of a chapter or headed section, you can put it on a line of its own, for example:
[148]
CHAPTER XI. PLANTS
Where a paragraph begins on a new page, you should put the page number at the start of the paragraph, as:
[149] With the extinction of the dinosaurs …
V.100. Should I keep Tables of Contents?
Yes, but just keep the contents themselves, and not the page numbers for each chapter or section, except where you have kept the page numbers in the whole text. When you have removed the page numbers from the book, it doesn't make much sense to leave them in the TOC.
Here, for example, is a typical TOC. In the original text, each chapter had a page number beside it:
  THE DUKE'S CHILDREN

  CONTENTS

   1  When the Duchess was Dead
   2  Lady Mary Palliser
   3  Francis Oliphant Tregear
   4  It is Impossible
   5  Major Tifto
   6  Conservative Convictions
   8  He is a Gentleman
   9  'In Media Res'
  10  Why not like Romeo if I Feel like Romeo?
  11  Cruel
  12  At Richmond
Note that I have indented the lines here, to give a sign to automatic converters that these lines should not be wrapped into one paragraph.
V.101. Should I keep Indexes and Glossaries?
If you are working from a pre-1923 publication, then yes.
If you are working from a modern reprint, you must be careful not to take any of the text that might have been added by the modern publisher. If you have any doubt about whether the index or glossary was part of the original printing, you should leave it out. Often with reprints, under your Clearance Line [V.37], you may see an instruction not to use indexes. In such cases, or if there is any doubt at all, don't.
V.102. How do I handle a break from one scene to another, where the book uses blank lines, or a row of asterisks?
Use a blank line, followed by a line of 3 or 5 spaced asterisks or dashes, followed by another blank line.
In a printed book, where the point of view switches from one character to another, or some other break in the narrative is made without a new chapter or headed section, the publisher will often denote the break just by a couple of blank lines. This gives the reader a cue to notice that the point of view has switched, and avoids confusion.
However, a printed book cannot be edited or changed, while an eBook will be edited and converted over its lifetime, and it is likely that if you denote this break just by a couple of blank lines, as in the book, your break may be lost. For example, in automated conversion to a PDA reader format, it is common to merge multiple blank lines into one.
In making a PG e-text, you may indicate this break by a couple of additional blank lines, but, if your text is later converted into another format such as HTML, the extra blank lines may get lost in the editing or rendering. Or the person doing the conversion may simply think that the extra blank line was a mistake, and remove it. To guard against this, you should add an unambiguous visual break such as a line of spaced asterisks:
- * * * * *
The exact layout of your break is not really important, and you can use whatever format you prefer. Blank line followed by five spaced asterisks followed by another blank. Or you could use two blank lines, and dashes instead of asterisks. Just make sure that future readers can be in no doubt that you intended to indicate a break that was really in the original printed text.
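To see why the visible marker matters, consider a converter that merges runs of blank lines, as described above. This Python sketch imitates that behavior; the function name collapse_blank_lines is hypothetical, and real converters will differ in detail.

```python
import re


def collapse_blank_lines(text):
    """Merge any run of blank lines into a single blank line."""
    return re.sub(r"\n{3,}", "\n\n", text)


blank_only = "End of one scene.\n\n\n\nStart of the next."
marked = "End of one scene.\n\n* * * * *\n\nStart of the next."

print(collapse_blank_lines(blank_only))
print(collapse_blank_lines(marked))
```

After conversion, the first sample looks like an ordinary paragraph break, while the second still shows the scene break.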
V.103. How should I treat footnotes?
In a printed text, the most common treatment for footnotes is to put them at the end of the page to which they refer. Sometimes, editors gather them all at the end of the book. Footnotes are a real formatting problem for an eBook without defined physical pages; there is no agreement between readers about which is the best way to render them.
There are three basic ways of rendering footnotes in an e-text:
You can insert them right into the text, in brackets, at the point in the paragraph where they occur, with or without an indication that they were originally footnotes. This is only reasonable in a text with very short footnotes.
You can insert them after the paragraph to which they refer, either contiguous with the paragraph or as a new "paragraph" of their own, as I am doing with this one. If the text contains any footnotes longer than a line, [1] you should not try to just append them to the paragraph; you should make a new "paragraph" of them, with a blank line before and after.
[1] Some footnotes can go on not only for several lines, but for several pages!
You can gather all footnotes at the end of the e-text, or to the end of the chapter to which they refer.
Of these three, gathering all footnotes to the end of the chapter or the end of the whole text is probably the friendliest option, since it preserves the original intention of allowing the reader to continue reading the main text without interruption. However, it may involve some renumbering and general note-keeping on your part, and may not be needed where there are only a few short footnotes. You can see an ideal example of this kind of footnote marking in our edition of Darwin's The Voyage of the Beagle.
V.104. My book leaves a space before punctuation like semicolons, question marks, exclamation marks and quotes. Should I do the same?
No.
If you look closely at these "spaces", you will see that they are not as wide as a normal space — they tend to be half to three-quarters as wide. These don't actually represent spaces as such; they were just a convention used by typesetters to make the text feel less cramped, and they did not express any specific intent on the part of the author.
OCR software tends to see them as full spaces, and one of the jobs you typically have to do when editing a text that has been OCRed is to remove them.
In some texts, this also happens following an opening quote, so your OCR might read a sentence as:
" Hello ! How are you to-day ? "
which you should correct to:
"Hello! How are you to-day?"
Samples of this can be seen in the images used for the FAQ "Why am I getting a lot of mistakes in my OCRed text?" [S.17]
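If you have a lot of these to remove, a short script can do most of the work. What follows is only a sketch in Python, not an official PG tool. The function name tighten_punctuation is made up for this example, and it assumes straight double quotes that are balanced within the line; a line with an odd number of quote marks has its quotes left alone. It does not touch apostrophes, so spaced contractions like "do n't" survive. Check the result by eye afterwards.

```python
import re

def tighten_punctuation(line):
    """Remove the typesetter's pseudo-spaces that OCR reads as full spaces."""
    # Drop spaces before semicolons, question marks and exclamation marks.
    line = re.sub(r" +([;?!])", r"\1", line)
    # Split on double quotes: with balanced quotes, the odd-numbered pieces
    # are the text inside the quotes, so trim the spaces around them.
    parts = line.split('"')
    if len(parts) % 2 == 1:  # an even number of quote marks on this line
        for i in range(1, len(parts), 2):
            parts[i] = parts[i].strip(" ")
    return '"'.join(parts)

print(tighten_punctuation('" Hello ! How are you to-day ? "'))
# prints "Hello! How are you to-day?"
```

Running it one line or one paragraph at a time keeps the quote counting simple.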
V.105. My book leaves a space in the middle of contracted words like "do n't", "we 'll" and "he 's". Should I do the same?
Unlike the pseudo-spaces before punctuation, these really were intended as spaces indicating the break between words — that is, where we would nowadays contract two words into one, the author or editor has made the contraction, but left them as two separate words.
Since this effect was intended, it is usual to leave the spaces in. Some people who really do n't like this style of spelling do remove them, but generally volunteers want to preserve the text as printed.
V.106. How should I handle tables?
Just line up the information neatly in columns. If you use a non-proportional font [W.5] you will be able to do this reliably. You can also use the dash character "-", the underscore "_" and the pipe character "|" to make borders if you really need to, but it's usually better to omit them. It is, though, often good to indent your table a little, to set it off from the main text, and to avoid the danger of having it automatically wrapped by some converter later. For example, from "The Albert N'Yanza, Great Basin of the Nile" by Sir Samuel White Baker:
    TABLE No. 1.

    Table for Increased Reading of Thermometer, using 0 degrees 80 as the
    Result of Observations for its Error.

    Month.          1861.    1862.    1863.    1864.    1865.

    January. . .       --    0'143    0'314    0'487    0'659
    February . .       --     '157     '328     '501     '673
    March . . .     0'000     '172     '344     '516     '688
    April . . .      '014     '186     '358     '530     '702
    May . . . .      '028     '200     '372     '544     '716
    June . . . .     '043     '214     '387     '559     '730
    July . . . .     '057     '228     '401     '573     '744
    August . . .     '071     '243     '415     '587     '758
    September. .     '086     '257     '430     '602     '772
    October . .      '100     '271     '444     '616     '786
    November . .     '114     '285     '458     '630    0'800
    December . .    0'129    0'300    0'473    0'645       --
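If you would rather let a script do the lining up, here is a minimal sketch in Python. It is an illustration only: the function name format_table and its indent and gap parameters are invented for this example. It assumes every row has the same number of cells, left-aligns the first column, right-aligns the rest, and indents the whole table.

```python
def format_table(rows, indent=4, gap=3):
    """Line up rows of cells in columns for a plain text file."""
    # Each column is as wide as its widest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    lines = []
    for row in rows:
        label = row[0].ljust(widths[0])                    # names go to the left
        figures = [cell.rjust(width)                       # figures go to the right
                   for cell, width in zip(row[1:], widths[1:])]
        line = " " * indent + (" " * gap).join([label] + figures)
        lines.append(line.rstrip())
    return "\n".join(lines)

rows = [
    ["Month.", "1861.", "1862."],
    ["January. . .", "--", "0'143"],
    ["February . .", "--", "'157"],
]
print(format_table(rows))
```

View the output in a non-proportional font, as with any table you line up by hand.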
V.107. How should I format letters or journal entries?
Make them look like they are in the printed book. If the signature is indented in the book, indent it in the letter. For example:

"Sir,

No consideration would induce me to
change my resolve in this matter, but I am
willing to engage your services as my agent
for a fee of 100 pounds.

                            "H. Middleton"
When a letter appears in the middle of lots of prose, using shorter lines for the letter is an effective way of making the letter stand out, without resorting to indenting the whole thing.
When the book is largely composed of letters or entries, as happens in an epistolary novel or the publication of somebody's letters or journal, you might reasonably leave two or three blank lines between entries to give the reader a visual cue that the next is not just a new paragraph, but a new entry. Whichever number you choose, keep it consistent throughout the book. For example:
10 pm.--I have visited him again and found him sitting in a corner brooding. When I came in he threw himself on his knees before me and implored me to let him have a cat, that his salvation depended upon it. I was firm, however, and told him that he could not have it, whereupon he went without a word, and sat down, gnawing his fingers, in the corner where I had found him. I shall see him in the morning early.


20 July.--Visited Renfield very early, before attendant went his rounds. Found him up and humming a tune. He was spreading out his sugar, which he had saved, in the window, and was manifestly beginning his fly catching again, and beginning it cheerfully and with a good grace. I looked around for his birds, and not seeing them, asked him where they were. He replied, without turning round, that they had all flown away. There were a few feathers about the room and on his pillow a drop of blood. I said nothing, but went and told the keeper to report to me if there were anything odd about him during the day.


11 am.--The attendant has just been to see me to say that Renfield has been very sick and has disgorged a whole lot of feathers. "My belief is, doctor," he said, "that he has eaten his birds, and that he just took and ate them raw!"


11 pm.--I gave Renfield a strong opiate tonight, enough to make even him sleep, and took away his pocketbook to look at it. The thought that has been buzzing about my brain lately is complete, and the theory proved.
This is different from the case mentioned in the FAQ [V.102] "How do I handle a break from one scene to another, where the book uses blank lines, or a row of asterisks?". In that case, we added a row of asterisks because future reformatting or conversion could cause confusion about the scene break that was explicitly signalled by the blank lines on paper. In this case, each new letter or journal entry cannot be mistaken by a careful reader, so we don't need asterisks or dashes to signal that; we're just adding a bit of extra space to make it more readable.
V.108. What can I do with the British pound sign?
The British pound sign cannot be expressed in ASCII, but is very common in the works of English novelists. It evolved as a stylized version of the letter L (from the Latin "libra"), and it's entirely appropriate to represent it as such, either like:
- The horse cost L8 12s. 6d.
or
- The horse cost 8l. 12s. 6d.
This works particularly well where an amount is expressed in pounds, shillings and pence (librae, solidi, denarii).
Where there is a simple number of pounds, you may prefer just to use the word:
- She was a handsome widow with 500 pounds a year.
V.109. What can I do with the degree symbol?
Just type out the word "degrees" or the abbreviation "deg." — for example:
- By the time we reached Cairo it was 115 degrees in the shade.
Geographical degrees are more awkward, but should be handled the same way:
- It was at 30 deg. 15' E, 14 deg. 45' N.
In general, any symbol can be represented in words.
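If your scanned or typed text already contains the pound sign and the degree symbol, a small script can make these two replacements for you. This is a hedged sketch in Python, not a PG utility: the name replace_symbols is invented here, and the choice of "deg." before further figures and "degrees" elsewhere is just one way to follow the examples above.

```python
import re

POUND = "\u00a3"   # the British pound sign
DEGREE = "\u00b0"  # the degree symbol

def replace_symbols(text):
    """Replace the pound sign and degree symbol with plain ASCII."""
    # A pound sign in front of a number becomes the letter L: "L8 12s. 6d."
    text = re.sub(POUND + r"\s*(?=\d)", "L", text)
    # A degree symbol followed by more figures (such as minutes) becomes "deg."
    text = re.sub(r"(\d)\s*" + DEGREE + r"(?=\s*\d)", r"\1 deg.", text)
    # Any other degree symbol after a number becomes the word "degrees"
    text = re.sub(r"(\d)\s*" + DEGREE, r"\1 degrees", text)
    return text

print(replace_symbols("The horse cost \u00a38 12s. 6d."))
print(replace_symbols("It was at 30\u00b0 15' E, 14\u00b0 45' N."))
```

It does not attempt the "8l. 12s. 6d." style or the word "pounds"; those choices are easier to make by hand.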
V.110. How should I handle . . . ellipses?
Just as I did above . . . and here! Leave one space before and after each dot. Do not break an ellipsis over the end of a line. In principle, an ellipsis is one symbol, like an em-dash, and should not be broken at line end.
A special case arises when an ellipsis follows a sentence instead of being in the middle. . . . In this case, put the period after the last letter of the sentence, as you normally would, then follow the usual format for ellipses. You end up with four dots, with spaces everywhere except before the first.
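If your text came to you with unspaced dots ("..." and "...."), the two rules above can be applied mechanically. The following Python is a rough sketch under that assumption; the function name space_ellipses is made up for this example, and it handles only runs of exactly three or four dots on a single line.

```python
import re

def space_ellipses(text):
    """Respace unspaced ellipses: ' . . . ' in mid-sentence, '. . . . ' after one."""
    # Four dots: the sentence's own period, then the three-dot ellipsis.
    text = re.sub(r" *\.{4} *", ". . . . ", text)
    # Three dots: an ellipsis in mid-sentence, one space before and after each dot.
    text = re.sub(r" *\.{3} *", " . . . ", text)
    return text

print(space_ellipses("Just as I did above...and here!"))
# prints Just as I did above . . . and here!
```

Run it before you rewrap the text, then check that no ellipsis has ended up broken across a line end.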
V.111. How should I handle chapter and section headings?
For a standard novel, you can choose either four blank lines before the chapter heading and two lines after, or three lines before and one line after, but whichever you use, do try to keep it consistent throughout.
Normally, you should move chapter headings to the left rather than try to imitate the centering that is used in some books.
V.112. My book has advertisements at the end. Should I keep them?
Most people seem to think "no", and "no" is the safe choice, but opinions vary.
The typical arguments are: "The ads are not part of the author's intent, so you should remove them." vs. "They give a flavor of the original book, so you should keep them". This latter is particularly cogent when the ads are for other books by the same author.
Decide which of these statements best fits your own views in the case you're looking at; after that, it's up to you!
V.113. Can I keep Lists of Illustrations, even when producing a plain text file?
Yes. As in the case of the Table of Contents, there is no point in including page numbers when your text doesn't have them, but the list of illustrations itself may go in.
V.114. Can I include the captions of Illustrations, even when producing a plain text file?
Yes.
You can format them as short paragraphs of their own, in brackets, with the word Illustration: followed by the caption, something like:
- [Frontispiece: A Flash of Light]
or
- [Illustration: Goldsmith at Trinity College]
Don't interrupt a paragraph to insert one, unless the reader really needs to know that the original illustration was in the middle of the paragraph; place the note between paragraphs instead.
V.115. Can I include images with my text file?
Yes, as I have done with the zipped version of the plain-text format of the original posted FAQ. In general, though, if you want to include images it makes much more sense to make an HTML version of the book and include them there, where they are anchored into the text in a predictable way, and to leave them out of the text version. There are exceptional cases, such as this one: I included images with the plain-text FAQ because I wanted you to be able to experiment with them using your own OCR package.
Before etext #10,000, images supplied with a plain text file were included in the ZIP file, but could not be downloaded separately. For example, if your file was named abcde10.txt, and you included images pic1.gif, pic2.gif and pic3.gif, then abcde10.zip would contain all four files, but only abcde10.zip and abcde10.txt would be posted. The images were therefore available only within the zip file, so even if you included images, you couldn't assume that the reader would be able to see them.
For etexts after 10,000, the directory structure allows us to post the images directly in the etext's directory. However, this is not something that we always want to do: in the vast majority of cases, it makes sense to include the images only with the HTML, where they are expected, and properly anchored and displayed.
If you do include images with plain text, be sure to mention them by filename in a note at the appropriate places in the text file; otherwise readers may not even realize they're there. For example:
- [Illustration: Goldsmith at Trinity College — see goldtrin.gif]
If you do include images with a text file, don't make them too big. Readers downloading zip files of plain text expect them to be relatively small; don't burden them with huge downloads they don't want. Use the same kind of rules and processing that you would for an HTML file, or better still, include the images only with the HTML version.
About formatting poetry
V.116. I'm producing a book of poetry. How should I format it?
Make it look like the original.
The only formatting change that you might consider is to limit the amount of centering. Often, in a poetry book, the title of a poem may be centered, when the body of the verse isn't. This can work on paper, particularly when the page is narrow, but "centering" the title on a 70-column line can mean that the title ends up far to the right of the body of the poem, which looks untidy. And even if you center the title correctly over the body of this poem, the next poem may have longer lines, and so its title may not have the same center as the first poem, and the title of one will be off-center with the title of the next!
If you have this kind of formatting in your book, you should consider moving all of the poem titles to the left margin rather than try to keep compensating for different line centers. It's more consistent, and easier to read, if you just left-align all titles. To see a not-quite-successful attempt at centering the titles over the poems, take a look at the Poems of Emily Dickinson, available from <http://www.gutenberg.org/dirs/etext00/1mlyd10a.txt>
In that case, it would have been better to left-align the numbers and titles. Centering isn't really an effective formatting choice in etexts.
V.117. I'm producing a novel with some short quotations from poems. How should I format them?
As nearly as possible like they look in the book, with the exception that you should indent the whole verse anywhere between 1 and 4 spaces from the left. This is to give a signal to automatic conversion programs that these lines should not be wrapped.
For an example of a novel with many differently formatted quotations embedded, see the "a" version of Clotel, file clotl10a.txt, Etext number 2046, from the year 2000, which you can find at <http://www.gutenberg.org/dirs/etext00/clotl10a.txt>
Some of these quotations touch the left-hand column; today, we would think it better to insert at least one space before every line.
About formatting plays
V.118. How should I format Act and Scene headings?
Pretty much like chapter headings. You can use 4 blank lines between acts, and 3 blank lines between scenes, or 3 between acts and 2 between scenes. If your book has "END OF ACT/SCENE" footers, leave them in the etext.
You may center act/scene headers and footers if they are centered in the book, but it's usually best to left-align them, for the same reasons it's usually best to left-align poem titles in poetry.
V.119. How should I format stage directions?
Generally, in brackets.
In printed texts, it is common to show stage directions as italics inside brackets. You don't have the option of italics in plain text, and you shouldn't need to use _underscores_ or /slants/, and certainly not CAPITALS, to indicate italics for stage directions. Normal text within the brackets is all you need. It will be immediately clear to a reader that bracketed text consists of stage directions.
[Square brackets] are most common for stage directions, but (round) or {curly} brackets will work too, if there's a reason why they are preferable in the case of your text. Just make sure that you use the same kind of brackets consistently and only for stage directions — don't use round brackets for stage directions if characters' speeches also contain text in round brackets.
Some printed plays follow the convention of not closing brackets when the direction is at the end of a speech or scene. For example: [Exeunt.
Where the book doesn't close the bracket in a case like this, you shouldn't either.
V.120. How should I format blank verse?
Just like normal verse in poetry. Make it look like the printed book. Left-align it, and make one line of etext the same length as one line of print.
Sometimes in blank verse, a speech may start mid-line, and the print reflects that by leaving a space on the left, and starting mid-way. In a case like that, do the same in the etext.
About some typical formatting issues
V.121. Sample 1: Typical formatting issues of a novel.
Look at the image novel.gif. It shows a page of a novel, with several typical formatting decisions to be made.
We note that there is no end-quote on the first paragraph, but that's OK, since the second paragraph is a continuation by the same speaker, so the first paragraph doesn't need a closequote. There is also an italicized "I", which will end up with underscores, but there is nothing else to give us any difficulty.
In the second paragraph, we have an ellipsis, an italicized French word with an accented letter, the British pound symbol, and an italicized "Here".
The ellipsis is simple.
Let's assume we're making this into a 7-bit text, so we're going to convert the non-ASCII character a-circumflex and the pound sign. The a-circumflex just goes to an "a", but we have several choices we can make about the pound sign.
The italicized "Here" is clearly for emphasis, so we will mark that up. The word "flaneur" is italicized because it is not English, but possibly also for emphasis . . . if the sentence had read "The Major is a fool", with the word "fool" italicized, it would clearly be emphasis. As it stands, we don't know whether emphasis is intended. This doesn't matter if we are just using _underscores_ or /slants/ to render italics, but if we use CAPITALS, we're going to have to impose our best guess on one side or the other.
The third paragraph shows some vaguely familiar squiggles — Greek letters! We hit the PG transliteration guide at [V.81] and spell it out . . . rough-breathing upsilon = hu; beta = b; rho = r; iota = i; final sigma = s. So the Greek word transliterates as "hubris". Since hubris is a familiar word, we don't need to make a fuss about it, though we may _italicize_ it.
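For anyone who likes to check such things by machine, the letter-by-letter step above can be sketched in Python. This is an illustration only: the function name transliterate is invented, and the table covers just the five letters named in this example, not the full PG transliteration guide at [V.81]. It assumes the Greek word is available as Unicode text, and it will raise a KeyError on any letter outside the table.

```python
import unicodedata

ROUGH_BREATHING = "\u0314"  # combining mark that adds an "h" sound
LETTERS = {
    "\u03c5": "u",  # upsilon
    "\u03b2": "b",  # beta
    "\u03c1": "r",  # rho
    "\u03b9": "i",  # iota
    "\u03c2": "s",  # final sigma
}

def transliterate(word):
    """Spell out a Greek word in ASCII, using only the letters in LETTERS."""
    out = []
    # NFD splits each accented letter into the base letter plus its marks.
    for ch in unicodedata.normalize("NFD", word):
        if ch == ROUGH_BREATHING:
            out.insert(len(out) - 1, "h")  # the "h" goes before the letter
        elif unicodedata.combining(ch):
            continue                       # drop accents and other marks
        else:
            out.append(LETTERS[ch])
    return "".join(out)

print(transliterate("\u1f55\u03b2\u03c1\u03b9\u03c2"))  # prints hubris
```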
We then have a note, which we will format a little differently from the main text to help it stand out, and a new chapter heading.
We should certainly indent the second line of the Byron quotation to preserve its original form, but we have the option whether or not to indent the first line a little to signal to any future automatic converter that this is not to be rewrapped.
In the first paragraph of the new chapter, we need to get rid of the hyphenation of "Wentworth" at line-end and fix the two em-dashes.
In the second paragraph of the new chapter, we have a long dash between "d" and "l", clearly meant to denote "devil", so we will fill it in with three dashes, and we see a three-em-dash after "Lord H", so we can use six, or possibly four, dashes for that.
Finally, we have a table, a list of money values against names.
Depending on the standards we've chosen to use throughout the book, we could render these details in a variety of ways. For illustration, here are two acceptable possibilities:

"I shall go down to Wokingham", said Middleton, "a few days
before the election, and the Major will stay here. I
understand that there will be no other candidate, and _I_
shall take the seat.

"The Major is a . . . _flaneur_. He has no interest beyond
his own advancement. I can buy him for a hundred pounds.
_Here_ is his answer."

Wallace wondered at the _hubris_ of his friend, and
examined the note Middleton thrust upon him.

"Sir,

No consideration would induce me to
change my resolve in this matter, but I am
willing to engage your services as my agent
for a fee of 100 pounds.

                              H. Middleton"




CHAPTER XV

THE ELECTION

  Now hatred is by far the longest pleasure;
    Men love in haste, but they detest at leisure.
                                        ----BYRON


On hearing of Middleton's visit, Mr. Wentworth began his
preparations. Meeting with Thomas Lake and Riley at the
back of the tap-room of The Bull--where the landlord saw
to it that they remained undisturbed--he laid out their
plan of campaign.

"That d---l Middleton shall not have the seat," he raved,
"not for Lord H------; no, nor for a hundred Lords! We
shall see to it that every man's hand is turned against
him when he arrives."

Lake unfolded a paper from his vest-pocket and smoothed it
on the table. "Here are the expenses we should undertake."

    Doran          L13 10s.
    Titwell        L 8  7s. 6d.
    St. Charles    L25

or

"I shall go down to Wokingham", said Middleton, "a few days
before the election, and the Major will stay here. I
understand that there will be no other candidate, and _I_
shall take the seat.

"The Major is a . . . flaneur. He has no interest beyond
his own advancement. I can buy him for L100. HERE is his
answer."

Wallace wondered at the hubris of his friend, and examined
the note Middleton thrust upon him.

    "Sir,

    No consideration would induce me to change my resolve
    in this matter, but I am willing to engage your services as
    my agent for a fee of L100.

                                              H. Middleton"




CHAPTER XV

THE ELECTION

Now hatred is by far the longest pleasure;
  Men love in haste, but they detest at leisure.
                                      ----Byron


On hearing of Middleton's visit, Mr. Wentworth began his
preparations. Meeting with Thomas Lake and Riley at the
back of the tap-room of The Bull--where the landlord saw
to it that they remained undisturbed--he laid out their
plan of campaign.

"That d---l Middleton shall not have the seat," he raved,
"not for Lord H----; no, nor for a hundred Lords! We
shall see to it that every man's hand is turned against
him when he arrives."

Lake unfolded a paper from his vest-pocket and smoothed it
on the table. "Here are the expenses we should undertake."

    Doran          13l. 10s.
    Titwell         8l.  7s. 6d.
    St. Charles    25l.
V.122. Sample 2: Typical formatting issues of non-fiction
While non-fiction is not in principle any more difficult to format than fiction, many non-fiction books have lots of features like illustrations, tables, section sub-headings and footnotes, that require some extra work on the part of the producer. If the illustrations are essential, you should consider adding an HTML format file to allow you to present them.
See the page image nonfic.gif. This presents many formatting changes: the centered title will go to the left; the italicized chapter contents will become regular text, and the em-dashes will become "--"; the degree symbol needs to be replaced with ASCII "deg.", and of course we need to render the table readably. After all that, we have to deal with the footnote.
Here is a reasonable rendering of this page:
CHAPTER XI

STRAIT OF MAGELLAN.--CLIMATE OF THE SOUTHERN COASTS

Strait of Magellan--Port Famine--Ascent of Mount Tarn--
Forests--Edible Fungus--Zoology--Great Sea-weed--
Leave Tierra del Fuego--Climate--Fruit-trees and
Productions of the Southern Coasts--Height of Snow-line
on the Cordillera--Descent of Glaciers to the Sea--
Icebergs formed--Transportal of Boulders--Climate
and Productions of the Antarctic Islands--Preservation
of Frozen Carcasses--Recapitulation.

An equable climate, evidently due to the large area of sea compared
with the land, seems to extend over the greater part of the
southern hemisphere; and, as a consequence, the vegetation partakes
of a semi-tropical character. Tree-ferns thrive luxuriantly in Van
Diemen's Land (lat. 45 degrees), and I measured one trunk no less
than six feet in circumference. An arborescent fern was found by
Forster in New Zealand in 46 degrees, where orchideous plants are
parasitical on the trees. In the Auckland Islands, ferns, according
to Dr. Dieffenbach [82] have trunks so thick and high that they may
be almost called tree-ferns; and in these islands, and even as far
south as lat. 55 degrees in the Macquarrie Islands, parrots
abound.

  On the Height of the Snow-line, and on the Descent of
  the Glaciers in South America.

  [For the detailed authorities for the following table,
  I must refer to the former edition:]

                                  Height in feet
  Latitude                        of Snow-line     Observer
  ----------------------------------------------------------------
  Equatorial region; mean result  15,748           Humboldt.
  Bolivia, lat. 16 to 18 deg. S.  17,000           Pentland.
  Central Chile, lat. 33 deg. S.  14,500 - 15,000  Gillies, and
                                                   the Author.
  Chiloe, lat. 41 to 43 deg. S.   6,000            Officers of the
                                                   Beagle and the
                                                   Author.
  Tierra del Fuego, 54 deg. S.    3,500 - 4,000    King.

In Eyre's Sound, in the latitude of Paris, there are immense
glaciers, and yet the loftiest neighbouring mountain is only 6200
feet high. Some of the icebergs were loaded with blocks of no
inconsiderable size, of granite and other rocks, different from the
clay-slate of the surrounding mountains. The glacier furthest from
the pole, surveyed during the voyages of the Adventure and Beagle,
is in lat. 46 degrees 50 minutes, in the Gulf of Penas. It is 15
miles long, and in one part 7 broad and descends to the sea-coast.
But even a few miles northward of this glacier, in Laguna de San
Rafael, some Spanish missionaries encountered "many icebergs, some
great, some small, and others middle-sized," in a narrow arm of the
sea, on the 22nd of the month corresponding with our June, and in a
latitude corresponding with that of the Lake of Geneva!
In this case, I made some decisions. I made the lines in the contents at the top a bit shorter than usual, to help them stand out. I decided to use the full word "degrees" rather than "deg." where I could, but not in the table, where I shortened the entries as much as possible while preserving the sense. Since I was using the full word "degrees", I decided to go the whole hog and use the word "minutes" for the minutes symbol as well (though the minutes symbol, a single quote, is in the ASCII set), since it seemed to make the text more readable than using the word degrees with the minutes symbol. I also made a choice about the table layout.
You might prefer different choices in some of these cases, and, as in our example of fiction above, there was more than one way to do it. However, this is a reasonable rendering.
What happened to the footnote? And how did it become [82] rather than the [1] of the original? In this case, I decided to put all footnotes at the end of the whole text, and renumber them accordingly. So the footnote on this page became number 82 in the overall text, and down at the end of the whole text, I would put:
[82] See the German Translation of this Journal; and for the other facts, Mr. Brown's Appendix to Flinders's Voyage.
I could also have transcribed this as:
. . . Forster in New Zealand in 46 degrees, where orchideous plants are parasitical on the trees. In the Auckland Islands, ferns, according to Dr. Dieffenbach [*] have trunks so thick and high that they may be almost called tree-ferns; and in these islands, and even as far south as lat. 55 degrees in the Macquarrie Islands, parrots abound.

[*] See the German Translation of this Journal; and for the other facts, Mr. Brown's Appendix to Flinders's Voyage.
if I chose to put each footnote with its own paragraph.
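The renumbering involved in moving footnotes to the end of the text is the kind of note-keeping a script can help with. Here is a minimal sketch in Python, offered as one possible approach rather than a PG tool. The function name gather_footnotes and the input layout are assumptions made for this example: each page is a pair of its text, with markers [1], [2] and so on numbered from 1 on that page, and the list of that page's footnotes in order.

```python
import re

def gather_footnotes(pages):
    """Renumber per-page footnotes in one sequence and collect them at the end."""
    body, endnotes = [], []
    count = 0  # footnotes seen on earlier pages
    for text, notes in pages:
        def renumber(match, offset=count):
            # [1] on this page becomes [offset + 1] in the whole text
            return "[%d]" % (int(match.group(1)) + offset)
        body.append(re.sub(r"\[(\d+)\]", renumber, text))
        for i, note in enumerate(notes, start=1):
            endnotes.append("[%d] %s" % (count + i, note))
        count += len(notes)
    return "\n\n".join(body), "\n\n".join(endnotes)

pages = [
    ("First page text [1] with two notes [2].", ["Note one.", "Note two."]),
    ("according to Dr. Dieffenbach [1] have trunks", ["See the German Translation."]),
]
text, notes = gather_footnotes(pages)
print(text)
print(notes)
```

Any other bracketed numbers in the text would be renumbered too, so look over the result before relying on it.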
V.123. Sample 3: Typical formatting issues of poetry
Poetry is easy to format: just be sure to use a non-proportional font, and make it look as much like the text as possible. To avoid ragged-looking centering, left-align titles.
In a whole book of poetry, there is no need to leave an indentation before every line; unlike a verse lost in fields of prose, there is little danger that someone will wrap it by mistake.
Look at the image poetry.gif. On this page, we have an enlarged first letter to start each poem, and capitals following — we can remove all that. The titles are centered, so we will move them left.
There are line-numbers at every fifth line, and these are common in poetry, especially where footnotes reference lines. We will keep these out on the right-hand margin.
The third poem obviously intends the centering of its last lines in each verse as a feature, so we will keep that as best we can.
The resulting etext looks like:
Mistress Mary

Mistress Mary, quite contrary,
How does your garden grow?
With cockle-shells, and silver bells,
And pretty maids all in a row.


Ozymandias.

I met a traveller from an antique land
Who said: Two vast and trunkless legs of stone
Stand in the desert. . . . Near them, on the sand,
Half sunk, a shattered visage lies, whose frown,
And wrinkled lip, and sneer of cold command,               5
Tell that its sculptor well those passions read
Which yet survive, stamped on these lifeless things,
The hand that mocked them, and the heart that fed:
And on the pedestal these words appear:
'My name is Ozymandias, king of kings:                    10
Look on my works, ye Mighty, and despair!'
Nothing beside remains. Round the decay
Of that colossal wreck, boundless and bare
The lone and level sands stretch far away.

NOTE:
9 these words appear: in some editions : this legend clear.


The Rosary.

The hours I spent with thee, dear heart,
Are as a string of pearls to me;
I count them over, every one apart,
            My rosary.

Each hour a pearl, each pearl a prayer,                    5
To still a heart in absence wrung;
I tell each bead unto the end--and there
          A cross is hung.

Oh, memories that bless--and burn!
Oh, barren gain--and bitter loss!                         10
I kiss each bead, and strive at last to learn
         To kiss the cross,
            Sweetheart,
         To kiss the cross.
V.124. Sample 4: Typical formatting issues of plays
Look at the image play.gif. Stage directions are indicated by italics and square brackets. We don't have to do much special work with this — lose the italics, but keep the square brackets. The setting for scene I, act II is also italicized, but without square brackets. If we wanted to emphasize this, we could use shorter lines or add square brackets, but it probably isn't necessary here. We're using 4 blank lines between acts and 3 between scenes, so we mark these accordingly. We leave one blank line between speeches. And following these simple conventions, we get:
JACK. There's a sensible, intellectual girl! the only girl I ever cared for in my life. [ALGERNON is laughing immoderately.] What on earth are you so amused at?

ALGERNON. Oh, I'm a little anxious about poor Bunbury, that is all.

JACK. If you don't take care, your friend Bunbury will get you into a serious scrape some day.

ALGERNON. I love scrapes. They are the only things that are never serious.

JACK. Oh, that's nonsense, Algy. You never talk anything but nonsense.

ALGERNON. Nobody ever does.

[JACK looks indignantly at him, and leaves the room. ALGERNON lights a cigarette, reads his shirt-cuff, and smiles.]

END OF THE FIRST ACT




SECOND ACT



SCENE I

Garden at the Manor House. A flight of grey stone steps leads up to the house. The garden, an old-fashioned one, full of roses. Time of year, July. Basket chairs, and a table covered with books, are set under a large yew-tree.

[MISS PRISM discovered seated at the table. CECILY is at the back watering flowers.]

MISS PRISM. [Calling.] Cecily, Cecily! Surely such a utilitarian occupation as the watering of flowers is rather Moulton's duty than yours? Especially at a moment when intellectual pleasures await you. Your German grammar is on the table. Pray open it at page fifteen. We will repeat yesterday's lesson.
About problems with the printed books
V.125. I found some distasteful or offensive passages in a book I'm producing. Should I omit them?
Please don't. Readers understand that books are works of their time and place, reflecting the opinions and prejudices of the people who wrote them, and the people they observed. We shouldn't try to pretend those prejudices out of existence. It may be, in a century or two, that our descendants are repulsed by our prejudices.
It is perfectly normal, for all kinds of reasons, not to want to produce a particular book, but producing one while deliberately removing passages is censorship, and is unfair to our readers.
If you find it too disturbing to handle the content, you can of course abandon the book, or pass it along to some other volunteer.
V.126. Some paragraphs in my book, where a character is speaking, have quotes at the start, but not at the end. Should I close those quotes?
Probably not.
When one character is making a speech that spans more than one paragraph, it is usual not to close the quotes until the speech is finished. This avoids confusion about whether the next paragraph is the same speaker or another — once a character has started speaking, there are no closequotes until the speech is finished. However, there are openquotes at the start of each new paragraph during the speech. This makes the quotes unbalanced, but it isn't a misprint; it's deliberate.
If this is not the case, if the same character is not continuing the speech in the next paragraph, then you may have found a typo in the book. [R.26]
V.127. The spelling in my book is British English (colour, centre). Should I change these to American spellings?
No.
Stay true to the edition you have. And this applies the other way, as well: if you have an American edition of a work by an English author, please leave the spelling as it is.
V.128. I'm nearly sure that some words in my printed book are typos. Should I change them?
The first thing to be aware of is that typos in books are not as rare as most people think. You may never have noticed typos in your normal reading, but under the kind of scrutiny that a book gets while being produced for PG, they often do become noticeable. It's quite common to find anything up to ten typos in a book.
Before you decide it's a typo, though, check that the same word doesn't occur elsewhere in the book with the same spelling. Often, the words or spelling used by pre-20th Century authors may just not be familiar to you.
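One quick way to make this check, if you already have the text in a file, is to count how often each spelling occurs. The Python below is a small sketch for that purpose only; the function name word_counts is made up, and it treats a word as any run of letters and apostrophes, ignoring capitalization.

```python
import re
from collections import Counter

def word_counts(text):
    """Count each word in the text, ignoring capitalization."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

counts = word_counts("He stood on the tock. The rock was wet; the Rock held.")
print(counts["tock"], counts["rock"])  # prints 1 2
```

A spelling that turns up many times is more likely to be the author's usage than a slip; one that turns up once deserves a closer look, but the count alone proves nothing.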
When you find something that you believe to be a typo, you have four options: pretend you didn't see it :-), change the typo and add a transcriber's note [V.97], change the typo without a transcriber's note, or leave the typo as it is and add a transcriber's note. If you are adding a note, do it at the top or bottom of the file; don't try to work it into the text, and don't use the [sic] convention, since the reader won't know whether the [sic] was added by you or an earlier publisher.
In general, it's safest to leave the typo in place and add a note at the end of the file, listing the words you believe to be typos; that is the least contaminating and intrusive method. When adding the note, you don't need to leave a mark in the main text. You can just say something like:
- [Transcriber's Note: "haw" near the end of chapter 15 appears to be a misprint for "hawk".]
The danger in making changes is that you may be wrong, and we really don't want to corrupt the text. This is particularly so in some old books where archaic usages, now obsolete, may look downright wrong to modern eyes. Sometimes, though, a typo is just so blindingly obvious that it warrants immediate replacement. Even in these cases, conscientious people will sometimes add a note, something like:
- [Transcriber's Note: in chapter 12, I have changed "he stood on the tock" to "he stood on the rock".]
V.129. Having investigated what looks like a typo, I find it isn't. Do I need to do anything?
Often in PG work, you come across an odd word or usage. Might be a typo; might not. You check it out, and find that it is deliberate: perhaps it is a word from local dialect that just happens to resemble a different word, or perhaps the author is using an odd word or spelling to make a point with the language. Especially if it's an isolated incident, and especially if it's not obvious, you can add a transcriber's note at the end of the file noting that the word is thus in your edition, and that it is probably right. This may prevent some well-intentioned converter from changing it.
It's rare that you will need to do this; you may encounter such a case only once in a hundred PG books, but it is an option.
V.130. Aarrgh! Some pages are missing! Do I have to abandon the book?
No. It happens more often than you might think, and we're quite used to dealing with it.
Finish the book, and ask other volunteers to help by finding another copy of the book to fill in the missing section. For something like this, you can try asking on [V.12] gutvol-d, or in the Content Provider forums on one of the [B.2] DP Sites, or ask Michael Hart to put a note in the Newsletter asking for assistance. We can post the book incomplete, and put a Transcriber's Note [V.97] in the header asking any future reader who has a copy to fill in the gap.
V.131. Some words are spelled inconsistently in my book (e.g. sometimes "surprise", sometimes "surprize"). Should I make them consistent?
No.
English spelling didn't really standardize until the start of the 20th Century (and even then it fractured; e.g. "standardize" vs. "standardise") and the further back you go, the more inconsistent it becomes. Shakespeare, for example, signed his own name with several different spellings.
Where your printed edition genuinely uses alternate spellings of the same word, you should preserve them.
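To find out which spellings to check against the printed page, you can have a script list words that appear in more than one form. Here is a minimal sketch in Python that handles only the "s" versus "z" case from the question; the function name s_z_variants is made up for illustration, and this is not a PG tool. It reports the variants and how often each occurs; it changes nothing in the text.

```python
import re
from collections import Counter, defaultdict

def s_z_variants(text):
    """Group words that differ only by 's' versus 'z' and report
    groups with more than one spelling.

    Hypothetical helper for illustration; it only reports counts.
    """
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    groups = defaultdict(dict)
    for word, count in counts.items():
        key = word.replace("z", "s")  # same key for "surprise" and "surprize"
        groups[key][word] = count
    return {key: forms for key, forms in groups.items() if len(forms) > 1}

sample = "A surprise! What a surprize. No surprise there."
variants = s_z_variants(sample)
```

Each reported pair is something to compare with the book: if the printed edition really uses both forms, keep both. Expect some false alarms, since pairs such as "prise" and "prize" are simply different words.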