Scanning FAQ (old)

This page is partially or entirely outdated: Please see current guidance in the help section. Please visit Distributed Proofreaders to view forums where best practices are actively discussed and maintained.

What is a scanner?

A scanner is a machine that makes an image, a picture of the page that is fed to it, and sends that image to your computer. It only makes an image, like a camera does; it doesn’t turn that image into text.

What types of scanners are there?

The most common type of scanner, the kind you’re likely to find in your local computer store, is a flatbed scanner. It has a glass bed usually a bit bigger than Letter paper size (or A4 if you live in Europe! :-) and most of the common models are optimized for typical office correspondence. One of these may cost anything from under $100 to $400, depending on its features, or you can pick them up cheaper second-hand. You use this by placing the paper or book face-down flat onto the glass, and scanning from there. This is the kind of scanner most commonly used by PG volunteers.

Some stores will call sheetfed scanners a different category. These are flatbed scanners with Automatic Document Feed (ADF), but they are fundamentally the same machine, and the ADF sheetfeeder unit may often be bought as an accessory to the flatbed scanner. Recently, a few sheetfed scanners have appeared that are very small, without a full flatbed, just a narrow strip that the paper rolls through. Avoid these for PG work; you often need to be able to scan the book flat.

Hand scanners, as their name implies, are much smaller, and typically very cheap, or even thrown in free. You use these by holding them in your hand and running them along the text like a brush. These are really not intended for PG work; you need a very steady hand movement to get them to scan a page of text into a readable image, and they shouldn’t be considered as an option for a 400-page book — scanning and OCR is tough enough without that!

You can think of production scanners as industrial-strength flatbed scanners. The basic mechanisms are the same, but a production scanner will certainly have ADF (sheetfeeder), more features and speed, and be rated for very high volume scanning. Production scanners are used by publishers, businesses with high-volume paper processing needs, and print shops. This last is useful, because you may be able to get some scanning done by a print shop. It can’t hurt to ask. If you’re thinking about buying one of these babies (and who among us hasn’t? :-), be sure you have $2000 or more to spend.

Drum scanners are mostly used by publishers for professional, high-quality artwork. The paper is placed on the surface of a drum that rotates past a fixed scanning head. The drum can be very large. Because the sensors don’t have to move, the electronics and optics can be of higher quality, and produce very accurate, high-definition images. They are exactly what you would want for making professional quality scans of old movie posters, but they’re expensive, and not very useful for scanning War and Peace to OCR.

Planetary scanners are a different breed to all the others. They are really not scanners at all, but a very high-end digital camera on a stand. You place the book face-up with the pages open, with the camera looking straight down on it. It takes a picture, and passes it on to the connected computer. Planetary scanners are ideal for old, fragile, valuable books that can’t be exposed to the stress of normal scanning. They typically come supplied with specialized software, sometimes even their own dedicated computer, and they are very, very expensive — $20,000+.

Which scanner should I get?

For most people, the answer is simple. Unless you have a lot of money and are sure you will be scanning a lot of books, you should get a normal, consumer-or-office type flatbed scanner, with or without an ADF sheetfeeder.

Having decided that, you’re faced with the question of which scanner to buy. More good news! The market in scanners is very competitive, and there are many top-line vendors all watching each other’s features like hawks, eager to deliver the highest-spec machine they can. There are only a couple of critical factors in this decision; most of it is about getting the best buy.

For PG work, you really need an optical resolution no less than 300 by 300 dpi (dots per inch), and 600 by 600 is very desirable. Obviously, more is better, but it would be very rare to need more than 600 dpi for PG work. Pay no attention to the “interpolated” or “enhanced” resolution, where the software “guesses” what dots should fill in the gaps — you’re only interested in the optical resolution. The good news is that it’s very difficult to find modern scanners with a maximum optical resolution of less than 600 dpi, but if you’re buying second-hand, you should check this out first.

You will also need a scanning surface on the glass big enough to place your book with two facing pages flat. Again, the good news is that it’s very hard to find a flatbed whose scanning surface is too small for PG work, since these scanners tend to be designed to handle office paper, which is about the right size. Most flatbed scanners have scanning surfaces of about 8.5” by 11.5”, and this is standard for PG work. If you’re working on books with very large pages, you may need to resign yourself to scanning one page at a time, but buying a scanner with a big flatbed for these rare occasions will be much more expensive.

You must make sure that you get a scanner that will connect correctly to your computer. There are four main types of connections commonly available: SCSI, USB, FireWire (IEEE 1394) and parallel.

SCSI (Small Computer System Interface) is the highest-quality option, but it means that you need a SCSI card in your computer, and must be willing to figure out how to install it. If you’re already a SCSI enthusiast, you don’t need to read further; if you’re not, I suggest you avoid it unless you enjoy tinkering. Production scanners mostly require SCSI.

Parallel-port connections used to be common, as a cheaper, easier alternative to SCSI. Since the introduction of USB they have become rarer, but you will still see them for sale second-hand. These plug into your printer port, and don’t require any further engineering skills.

Most new scanners hook up using a USB (Universal Serial Bus) interface, which is a no-muss, no-fuss “plug-in and go” option, but be sure, if you have an old PC, that it actually has a USB port and that your operating system supports it; some older Windows PCs and Macs may not. If your PC doesn’t support USB, you should probably look at Parallel-port scanners.

If you’re buying second-hand — and used scanners can be very cheap — make absolutely sure that you’re getting the original software that came with the scanner, and that that software will work with your current operating system on your PC.

Having ensured that your choice of scanners passes these tests, you’re now free to indulge your tastes for any extras you like. Color is nice, but rarely used, since we mostly transcribe older books that have no color printing. Higher resolutions are comforting to have, both since you may occasionally find them useful and because it shows that the optics are of higher quality than you actually need for your PG scans.

If you are nervous about your choice of scanner, or how easy it is to get one working, feel free to contact other PG volunteers for their opinions, as described in the FAQ “How do PG volunteers communicate?” [V.12].

What is ADF?

ADF stands for Automatic Document Feed, and it’s just a jargon term for a sheetfeeder, where you put in a stack of pages to be scanned and go away while that’s happening instead of putting in each page manually.

Should I get ADF?

That depends. Yes, ADF is a great idea, and can be a huge work-saver, and if you have the cash to spend, it may well be worth it. But ADF has a dirty little secret: like any other gizmo with moving parts, it occasionally jams. The sheetfeeders built into these low-cost machines are aimed at handling typical office paper straight from the laser printer — large, smooth, good quality, with perfectly-cut, perfectly-aligned edges. In your PG work, you will be dealing with hundred-year-old pages of various thicknesses and textures, usually much smaller than the sheetfeeder was designed to work with. And you will have to have cut the pages, and may leave ragged edges in doing so.

Under these conditions, you may find that paper often jams in your sheetfeeder, and it defeats the purpose if you have to stand over the scanner while it works, or if you end up having to lift the cover and use your scanner as an ordinary flatbed, or, worse, if your paper gets scrunched up as if a dog had been playing with it.

And of course, in order to feed the pages through, you will have to cut them out of the book, destroying it. (It may be possible, with the help of a bookbinder, to have the pages professionally cut, and later re-bound.)

With ADF, you probably won’t actually scan much faster than scanning flat, but you won’t have to keep turning over the pages during that time.

So when you’re making that choice, think carefully. If money isn’t a problem, or you do expect to be working with cut sheets, then go ahead and get a sheetfeeder — it’s great when it works! But don’t be disappointed when it doesn’t work all the time.

What’s a “TWAIN driver” and why do I need one?

A TWAIN driver (see twain.org) is a piece of software that installs onto your Windows PC or Mac and controls your scanner from there. With any modern scanner, there will be a TWAIN driver included in its software package. Once installed, you shouldn’t have to think about it again, or even know it’s there.

A modern OCR package will usually find your TWAIN driver and use it to control the scanner. This is very handy. There may also be a small scanning package with your TWAIN driver, which will provide a screen where you can make fine adjustments to scanner settings, and start scans. You probably won’t need this, since your OCR package will probably do it for you, but it may be useful for semi-manual control of the scanner.

Unix-based systems like Linux use SANE (http://www.sane-project.org/) rather than TWAIN drivers.

How do I scan a book?

This depends on whether you have cut the pages out, or whether you are working with an intact book.

If you have cut the pages out, and you have an ADF, then you will obviously feed them through that.

If you don’t have an ADF, there usually isn’t much point in cutting the pages. Most modern OCR will recognize a “dual-page” or “two-up” scan, and, if yours does, then that’s normally the best option. Scanning the uncut book, open and flat, is the most common scanning method used in PG.

Take the book and place it open, flat on the scanner glass. To fit both pages on the glass, you may need to position it lengthways, at 90 degrees to its natural angle. Most OCR software will recognize that the image has been rotated through a right-angle, and will correct it when it reads the text.

A common problem with scanning an opened book is “guttering”, which happens when the spine of the book is not pressed flat enough, and the inside of each page, where it meets the spine, is curved against the glass. There’s more about this, and an example, scan3, in the FAQ [S.17] “Why am I getting a lot of mistakes in my OCRed text?”. To avoid guttering, make sure that the spine is held down throughout the scan. (Some people put a weight on the spine to hold the spine down on each scan; others just press their hand against it.)

Another common problem is light scattering, when too much light gets into the scanner. The scanner head detects light, and you want the only light source to be the scanner itself, not ambient room light or sunlight. Scanners have covers that are intended to be closed while scanning, for a controlled light level, but when you’re scanning a book held open and flat, you can’t close the cover fully. In a bad case, this can leave the scan looking like overexposed film; you can see an example in scan4 of the FAQ [S.17] “Why am I getting a lot of mistakes in my OCRed text?”. If this happens, just make sure that your room is dim while you scan, and don’t have a ray of bright sunlight bouncing around the inside of the scanner!

Occasionally, when scanning cut pages with very thin paper, you may get a shadow of the text on the other side showing through. If this happens, you can try covering the inside of the scanner lid, which is normally white, with a piece of black paper.

Many modern OCR packages will control the scanner automatically, and you may be able to set your OCR so that it does an automatic timed scan every, say, 30 seconds. This is a great timesaver, since you don’t have to go back and forth between the scanner and the screen. Just set your timer, hold down the book for the scan, take the book up, turn the page, put it down again, and wait for the next scan to start. Set the timer for whatever interval you are comfortable with. Highly recommended, if your OCR or scanning package can do it.

By default, most scanners will always scan the entire area of the flatbed, but usually, your book will occupy only about half of it. Look for a setting on your OCR or scanning package which allows you to reduce the area that the head scans. Just scan enough to get the image of your pages. This makes the time for each scan and subsequent OCR recognition shorter, and in a really good case can cut your total scanning and OCR time in half.
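If you are on a Linux system using SANE (mentioned under TWAIN drivers above), the same idea can be sketched with SANE's command-line program, scanimage. This is only a rough sketch: the function name and the book measurements are made up for illustration, and the geometry, resolution and mode options come from the scanner backend, so their names and accepted values may differ on your device. The code only builds and prints the command; it does not run the scanner.

```python
import shlex

def build_scanimage_command(width_mm, height_mm, resolution_dpi=300, mode="Gray",
                            left_mm=0, top_mm=0, image_format="tiff"):
    """Return an argument list for SANE's scanimage covering only part of the bed.

    The geometry (-l, -t, -x, -y), --resolution and --mode options are
    provided by the scanner backend, so names and accepted values vary by
    device; check `scanimage --help` for your own scanner.
    """
    return [
        "scanimage",
        f"--format={image_format}",
        "--resolution", str(resolution_dpi),
        "--mode", mode,
        "-l", str(left_mm),
        "-t", str(top_mm),
        "-x", str(width_mm),
        "-y", str(height_mm),
    ]

# Hypothetical example: an open book covering 216 mm x 150 mm of the glass.
command = build_scanimage_command(width_mm=216, height_mm=150, resolution_dpi=400)
print(shlex.join(command), "> page.tiff")
```

Measure your own book on the glass and substitute those numbers; a trial scan will quickly show whether the region is big enough.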

Scanning all pages together is usually fastest, but you may prefer to scan each double-page, then correct it in your OCR package’s editor, then scan the next. This is a more leisurely approach favored by some volunteers.

My book won’t open flat enough for a good scan, and I don’t want to cut the pages.

Well, then, you have a difficult choice to make, but you do still have several options:

You can accept a poor-quality scan, and spend a lot of time fixing up the guttering on the margins.

You can bite the bullet, and cut the pages.

You can type the book, or find a typist who will work on it for you.

You can find a print shop or bookbinder who will cut the pages professionally, and re-bind the book when you’re done. You may even replace it with a fresh new binding that will give the book a new lease of life.

Take your choice.

Most books will open flat enough for an adequate scan, though you may have to put stress on the spine to do it.

If you have a really precious book, and you can’t find a typist, you might consider the options of a digital camera [S.11] or finding someone with a planetary scanner [S.2] to scan it for you.

Michael Hart said: “I would give up every book I own, including my first edition of the OED, my Civil War edition of the Merriam Webster’s Unabridged, etc., etc., etc., so everyone could use it any time they wanted rather than that only I or my friends could use it . . . and obviously I could use it too.”

Fortunately, it rarely comes to that.

How long does it take to scan a book?

Putting the book flat on the glass means that you scan two pages at a time. A reasonable modern scanner will scan the area of two typical pages at 400dpi in anywhere from 20 to 40 seconds — let’s call it 30 seconds for two pages. That’s four pages a minute, or 240 pages an hour. You could reasonably get through a 400 page book in two hours, even allowing for an occasional break or glitch.
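If you want to plug in your own numbers, the arithmetic above can be sketched in a few lines. The function name and its defaults (two pages per scan, 30 seconds per scan) are just the figures from this answer, not anything your scanner software provides, and the result covers pure scanning time only.

```python
import math

def scanning_minutes(total_pages, pages_per_scan=2, seconds_per_scan=30):
    """Estimate pure scanning time in minutes, ignoring breaks and glitches."""
    scans = math.ceil(total_pages / pages_per_scan)
    return scans * seconds_per_scan / 60

# A 400-page book scanned two pages at a time needs 200 scans.
print(scanning_minutes(400), "minutes")
```

Allow extra on top of this figure for trial scans, page turning that overruns, and the occasional break.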

Of course, you should also allow time for scanning a few trial pages with different settings before you start, to decide which settings to use. Ten minutes spent here can save you hours of proofreading time.

There are two big tips that can save you a lot of scanning time:

If your OCR or scanner control package has a timer setting, that automatically keeps scanning without user intervention, you can forget about the screen and just keep turning the pages as needed.

You should set your scanner just to scan the area the book covers on the glass. By default, your software will probably scan the full area of the glass, and usually, your book won’t need that. By scanning only what you need, you may typically save anything from 20% to 70% of the time taken to scan the full area. If your book is small enough to open flat across the scanner instead of “down” the side, 400 pages an hour is not out of the question with this trick.
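To get a feel for the second tip, here is a rough sketch that assumes, purely for illustration, that scan time is proportional to the area scanned. Real scanners will not follow this exactly. The function name and the book dimensions are hypothetical.

```python
def time_saved_fraction(bed_w, bed_h, scan_w, scan_h):
    """Fraction of scan time saved, assuming time is proportional to area scanned."""
    if scan_w > bed_w or scan_h > bed_h:
        raise ValueError("scan region must fit on the bed")
    return 1 - (scan_w * scan_h) / (bed_w * bed_h)

# Hypothetical book: the open pages cover 8.5 x 6 inches of an 8.5 x 11.5 inch bed.
saved = time_saved_fraction(8.5, 11.5, 8.5, 6.0)
print(f"{saved:.0%} of the full-bed scan time saved")
```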

What scanner settings are best?

For a given book, scanner, PC and OCR software, there must be some “ideal” scanner settings, but if you change any of these components, the ideal scanner settings will change with them. Some OCR packages recognize greyscale better than black and white; some don’t like greyscale at all. Some books have small print needing higher resolution; some are speckled so that higher resolution leads to more errors.

Obviously, the best settings also depend on the individual book, and some books will require you to get downright creative with the settings, but most PG books are scanned in Black and White or greyscale, somewhere between 300dpi and 600dpi.

This decision is a trade-off between speed and accuracy, and an illustration of the difference between principle and practice. In principle, a true-color, 9600dpi scan is a much better rendering of the page than a B&W 400dpi scan. In practice, all that extra information doesn’t usually help the OCR make better distinctions between letters, and the larger and more detailed the scan, the longer it takes to make the scan, the more disk space the image file takes, and the more processing time and memory the OCR package needs to recognize it.
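The disk-space side of this trade-off is easy to estimate. The sketch below works out the uncompressed size of a full-bed scan, assuming 1 bit per pixel for black and white, 8 for greyscale and 24 for true color; real files are usually compressed and will be smaller. The function name and the 8.5 by 11.5 inch area are illustrative choices.

```python
BITS_PER_PIXEL = {"bw": 1, "grey": 8, "color": 24}

def raw_image_bytes(width_in, height_in, dpi, mode):
    """Uncompressed size in bytes of a scan of the given area."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * BITS_PER_PIXEL[mode] // 8

for dpi, mode in [(400, "bw"), (600, "grey"), (9600, "color")]:
    size = raw_image_bytes(8.5, 11.5, dpi, mode)
    print(f"{dpi:>5} dpi {mode:<5} {size / 1_000_000:,.1f} MB")
```

Under these assumptions the B&W 400dpi scan comes to about 2 MB, while the true-color 9600dpi scan would run to roughly 27,000 MB.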

A further paradox emerges when considering higher vs. lower resolutions: depending on the paper and ink quality, you may see more errors start to appear on very high resolution scans. These are caused by small imperfections in the paper or ink spots that show up on the high-res scan, and that the OCR tries to interpret as letters or punctuation.

So, in summary, bigger is better, but only up to a point.

Brightness is a setting often neglected, that can make quite a big difference to your results. Look at the scanned image: if you see lots of dark patches, make your scan lighter; if your letters appear thin and faded, make your scan darker.

See the FAQ “Why am I getting a lot of mistakes in my OCRed text?” for some typical scans and results.

Can I use a digital camera in place of a scanner?

Digital cameras are getting better resolution all the time, and some volunteers have experimented with making a kind of home-made planetary scanner from a digital camera and a stand. So far, the results don’t quite match a dedicated scanner, but as digital cameras improve, this may become a common option. One problem, which planetary scanners use specialized software to correct, is that the natural curve of the pages near the middle of the book tends to give a foreshortened aspect to the letters there, which can cause problems for OCR software, much as guttering does.

Whatever the current problems, the prospect of using digital cameras is exciting, because it will mean that non-typists will be able to produce old books borrowed from libraries without worrying about scan quality vs. damage to the spine.

What is OCR?

OCR stands for Optical Character Recognition. This is very important software that looks at the picture of the page that your scanner has supplied, and turns it into text.

When the scanner delivers the image of the page, that image is only a picture. You can’t, for example, search for text in it, or edit the text to add a blank line. Your editor or word processor can’t work with it. The OCR program does the job of “reading” and “typing” the image for you. OCR packages call this “reading” or “recognizing”.

What differences are there between OCR packages?

One word: huge. All OCR packages do the same job, but they do it in different ways, with different features, and with different levels of accuracy. OCR can save you a lot of time, or cost you a lot of time. It’s really worth putting some effort into making sure you get the right OCR package, and, once you have it, into understanding how to use it. It’ll save you time in the long run.

How accurate should OCR be?

OCR packages commonly say that they are “99%+” accurate, or something like that. Let’s analyze what that actually means. Say there are 1,000 characters (letters) on each page; then with 99.9% accuracy, you would expect to have to make 1 correction per page. With 99% accuracy, that would be up to 10 corrections per page. And in a 400-page book, this all adds up.
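Here is the same calculation as a small sketch, so you can try other accuracy figures or page sizes. The function name is made up, and the 1,000 characters per page is just the round number used above; real pages vary.

```python
def expected_errors(accuracy_percent, chars_per_page=1000, pages=1):
    """Expected number of misread characters at a given per-character accuracy."""
    error_rate = (100 - accuracy_percent) / 100
    return chars_per_page * pages * error_rate

for accuracy in (99.9, 99.0):
    per_page = expected_errors(accuracy)
    per_book = expected_errors(accuracy, pages=400)
    print(f"{accuracy}% accurate: about {per_page:.0f} per page, "
          f"{per_book:.0f} per 400-page book")
```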

But there’s a “Your Mileage May Vary” clause built into that. Typically, the manufacturers test their OCR on fresh, laser-printed or press-printed copy with perfect scans, and this is fair, since they are aiming their products primarily at businesses that process these kinds of materials. You are not dealing with fresh print; you’re dealing with old books, yellowed, spotted, marked, imperfectly printed in the first place, and possibly using unfamiliar fonts. And it’s unlikely that you will have the patience to get a perfect scan on every page. The result is that the accuracy of OCR for typical PG work doesn’t match the accuracy on images of perfect, fresh paper.

Apart from the scan quality, OCR also has to contend with different fonts and sizes for the letters.

However, if you’re getting more than 10 errors per page, you should look at some examples of OCR in the FAQ “Why am I getting a lot of mistakes in my OCRed text?”.

Which OCR package should I get?

The accuracy of OCR software has improved enormously in the last few years, and OCR technology looks likely to keep improving even faster than software in general. Further, there is competition in this area, and products leapfrog each other with new versions regularly. The brands most commonly mentioned by PG volunteers (mid-2002) are Abbyy, OmniPage and TextBridge [P.1], and trial versions of all three have been available for download over the Web, and may still be when you read this. [Warning: these are big downloads — 40MB or more.]

Most common OCR packages will offer two main working options: to scan a page and view/edit the resulting text on the spot before saving, and to scan a whole batch of pages together and view/edit them all later. Some people like to fix up one page at a time; others prefer to get all of the OCR work done at once, then get the whole text into their editor. Most OCR software will cater for both, and if this is important to you, you should check that the OCR you’re buying supports the way you want to work.

If you intend to work in a language other than English, make sure that the OCR you buy supports the characters in your language.

Some OCR software has a “training” or “learning” mode. Using this mode, it scans and “reads” or “recognizes” a page, then you correct that page, and the OCR “learns” from its mistakes and tries to do better on the letters it misread when it recognizes the next page. If you’re dealing with a very rare font, this can make a difference to your OCR quality, but modern OCR packages come with enough inbuilt font knowledge for most languages, and you probably won’t need this.

If possible, try a couple of OCR packages before you decide. If you want opinions on specific versions, contact other PG volunteers and ask for their opinions, as described in the FAQ “How do PG volunteers communicate?”

What types of mistakes do OCR packages typically make?

Each text has its own peculiarities, but there are a number of well-known scanning errors you will be dealing with all the time.

Punctuation is always a problem. Periods, commas and semi-colons are often confused, as are colons and semi-colons. There are also usually a number of extra or missing spaces in the e-text.

The problem of quotes can assume nightmarish proportions in a text which contains a lot of dialog, particularly when single and double quotes are nested.

The numeral 1, the lower-case letter l, the exclamation mark ! and the capital I are routinely confused, and often, single or double quotes may be mistaken for one of these.

Lower-case m is often mistaken for rn or ni.

The letter pairs h/b and e/c are commonly misread, and these are probably the hardest of all to catch, since ear/car, eat/cat, he/be, hear/bear, heard/beard are all common words which no spell-checker will flag as problems.

For example:

" Hello1' caIled jirnmy breczily.  11Anyone home ? "

There seemed to he no-oneabout. Only tbe eat beard him."

should read:

"Hello!" called Jimmy breezily, "Anyone home?"

There seemed to be no-one about. Only the cat heard him.
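Some of these errors leave tell-tale signs that a simple script can flag for you. The sketch below is a rough first pass, not a tool used by PG: it looks for only two warning signs, a digit touching a letter and a capital letter straight after a lower-case one. The pattern and function name are illustrative choices. It will not catch rn-for-m errors or real-word errors such as he/be and eat/cat, which still need a human eye.

```python
import re

# Two simple warning signs: a digit touching a letter,
# or a capital letter directly after a lower-case letter.
SUSPICIOUS = re.compile(r"[A-Za-z]\d|\d[A-Za-z]|[a-z][A-Z]")

def suspicious_tokens(text):
    """Return whitespace-separated tokens that match a warning-sign pattern."""
    return [token for token in text.split() if SUSPICIOUS.search(token)]

sample = "\" Hello1' caIled jirnmy breczily.  11Anyone home ? \""
print(suspicious_tokens(sample))
```

On the first line of the example above, this flags Hello1', caIled and 11Anyone, but lets jirnmy and breczily through.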

Why am I getting a lot of mistakes in my OCRed text?

If you’re new to OCR, you may have come with the idea that OCR is almost perfect, and just makes a few mistakes now and then. No. It’s slightly amazing that OCR works at all, and when it does, it isn’t perfect.

You might reasonably expect to average anything up to 10 errors per page for typical PG work; if you’re seeing more, then there is a problem with the printed book, your scan, or your OCR package.

Problems with the printed book fall into three categories: bad printing, age, and unusual fonts. Bad printing consists of problems like too much or too little ink on the press at the time the book was printed, and irregularities in the print where the metal type was damaged. Age causes yellowing — even browning — of the paper, and faded print. Unusual fonts may be hard for OCR to recognize, and very tightly-spaced print may make adjacent letters seem to touch, which confuses OCR software.

There are many ways for you to have problems with your scan. Obviously, if your scanner is defective or the glass is dirty, you will notice it immediately, but there are many mistakes you can make that will result in a poor-quality image, and cause later problems for your OCR.

You may not be able to control the quality of the paper you have to work with, but there is a lot you can do about the quality of your scan.

The two mistakes that people inexperienced with scanners most commonly make are not holding the spine down firmly enough to get a flat image of the paper, and not setting the brightness correctly, or letting too much light get in. In your early scans, watch out for these problems.

First, if you haven’t already, read the FAQ “How do I scan a book?” and check that you’re following the basic recommendations there.

Now let’s look at some samples, and see the kinds of problems you might encounter.

A disclaimer about these samples: specific OCR packages are named, but you should not take these as a fair and comprehensive comparative review of the software. The object of this exercise is to show typical scanning conditions and problems, and the resulting OCR output. OCR packages have quite a range of variance within themselves, may work better on some texts than others, may improve with “training” or different settings, and I have even seen the same OCR package produce different text from the same image with the same settings! Further, since OCR quality is improving rapidly, and packages leapfrog each other in quality, the next version of a particular brand may be vastly better than any of the software mentioned here. Of particular interest in this context is the leap in quality between OmniPage 10 and OmniPage 11.

Scan 1 — A Perfect Scan

Scan 1 is as near to a perfect scan as you can expect in PG work. It comes from The Founder of New France by Charles W. Colby. It is only a 300 dpi image, but given the quality of the print and of the scan, 300 dpi is all we need. Ironically, it comes from Gardner Buchanan, who complains about the age and infirmity of his scanner in his description of how he produces a text. The moral is that you don’t have to have the latest equipment to get good results!

It doesn’t really need any comment, and all of the packages except gocr rendered it perfectly. Note the fake “space” before the semicolon — if you look closely at the image, you will see why the OCR packages mistook it for a full space, as discussed in the FAQ [V.104] “My book leaves a space before punctuation like semicolons, question marks, exclamation marks and quotes. Should I do the same?”

Champlain was now definitely committed to
the task of gaining for France a foothold in
North America. This was to be his steady
purpose, whether fortune frowned or smiled.
At times circumstances seemed favourable ;
at other times they were most disheartening.
Hence, if we are to understand his life and
character, we must consider, however briefly,
the conditions under which he worked.

gocr 0.3.6 converted this as:

Champtain was now definitely committed to
the task of gaining for France a foothotd in
_orth America.  This was to be his steady
purpose, whether fortune frowned or smiled.
At times circumstances seemed favourable .,
at other times they were most disheartening.
_ence, if we are to understand his life and
character, we must consider, however brieRy,
the conditions under which he worked.

Scan 2 — A Typical Scan

Scan 2 is a paragraph from Baroness Orczy’s Castles in the Air. Notice the ink-splotch above the capital “I” in the first line, which will give our OCR some problems. The page is also unevenly inked elsewhere, and I have scanned it with the brightness level a bit too high.

I have made two separate scans, one at 300 dpi and one at 400 dpi, both black and white. The page was cleanly cut, and carefully placed straight onto the scanner glass with the cover down. The original print is somewhere between the size of Times New Roman 10 and 11, with capital letters about 2.2 millimeters high, but better and more clearly spaced. These scans are fairly typical for PG work. Because of the relatively large letters, and the reasonable scan, there isn’t much difference between the text produced from the 300 dpi scan and the 400 dpi scan.

I actually cut this book to get the pages out so that I could feed it through my ADF, but the paper is so thick and textured that it sticks together, and jams when feeding through. The thick, absorbent paper, combined with the uneven inking, means that, no matter how good the scan, any OCR has to contend with the irregular edges of letters, which are clearly visible even at 300 dpi.

Here is the output for these scans from some OCR software packages. I changed just one thing: Abbyy recognized the em-dashes as such, and output them as a special character in Codepage 1252 for em-dashes, which isn’t available in ASCII, so I converted that to the PG standard 2 dashes.
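For the curious, that one conversion can be sketched as follows. In Codepage 1252 the em-dash is stored as the single byte 0x97. The function name and the sample bytes are made up for illustration; this is not the tool actually used to prepare these samples.

```python
EM_DASH = "\u2014"  # the character Codepage 1252 stores as byte 0x97

def to_pg_dashes(raw_bytes):
    """Decode Codepage 1252 bytes and replace each em-dash with two hyphens."""
    text = raw_bytes.decode("cp1252")
    return text.replace(EM_DASH, "--")

print(to_pg_dashes(b"francs\x97a goodly sum in those days, Sir\x97was practically"))
```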

Abbyy FineReader 6 (output from each of the two scans in turn):

 Yes, indeed, I was on the track of M. Aristide Fournier,
 and of one of the most important hauls of enemy goods
 which had ever been made in France. Not only that. I
 had also before me one of the most brutish criminals it
 had ever been my misfortune to come across. A bully, a
 fiend of cruelty. In very truth my fertile brain %vas
 seething with plans for eventually laying that abominable
 ruffian by the heels: hanging would be a merciful pun-
 ishment for such a miscreant. Yes, indeed, five thousand
 francs--a goodly sum in those days, Sir--was practically
 assured me. But over and above mere lucre there was
 the certainty that in a few days' time I should see the
 light of gratitude shining out of a pair of lustrous blue
 eyes, and a winning smile chasing away the look of
 fear and of sorrow from the sweetest face I had seen for
 many a day.

 Yes, indeed, Twas on the track of M. Aristide Fournier,
 and of one of the most important hauls of enemy goods
 which had ever been made in France. Not only that. I
 had also before me one of the most brutish criminals it
 had ever been my misfortune to come across. A bully, a
 fiend of cruelty. In very truth my fertile brain was
 seething with plans for eventually laying that abominable
 ruffian by the heels: hanging would be a merciful pun-
 ishment for such a miscreant. Yes, indeed, five thousand
 francs — a goodly sum in those days, Sir — was practically
 assured me. But over and above mere lucre there was
 the certainty that in a few days' time I should see the
 light of gratitude shining out of a pair of lustrous blue
 eyes, and a winning smile chasing away the look of
 fear and of sorrow from the sweetest face I had seen for
 many a day.

gocr 0.3.6:

 __e_, indeed, f___as on_the track of h_. hristide Fournier,
 3nd of one of the most im__ant hau1s of enem)_ goods
 ___hich had e__er been made in France.  h?ot onl3_ that.  I
 had a1so before me one of the most brUtish crimînat_s it
 h__4 e___er been m31 misfortune to co_me acro__3.  A bu113_, a
 tiend oí cruelt__.  In very truth m3_ fertiIe brain ___as
 s_e_1_::_g __-ith planS for e__entua113_ _ay:ng that abominab1e
 ru_iin b.__ t1_e hee1s . hanginig __ou1d be a n_erciful pun-
 i;__,i__gnt íor such a miscreanf.  yes, in_i__ee3, fj_1e thou3and
 francî-a b_ood13_ sum in those days, _ir-_vas practica1l3_

 a3_ured me.  _ut o___er and above n_ere lucre there was
 the certaint_v that in a few_ da3_s' ti_e I shou1d see the
 lib_ht of gratitude shininb_ out of a pair _f _usLtrous btue
 e3_e3_, and a ___inning smi1e chasing a__ay the Ioo_ of
 _ear and of sorrow from the s__eetest iace T had Seen fof
 man)_ a day.
 Yes, indeed, f___as on the track of h__. Ariseide Fournier,
 and of one of the most important hau1s _f enemy goods
 ___hich had ever been made in France.  NoEUR on1y that.  I
 had also before me one of the most brutish crimina1s it
 h_ad ever been my misfo__tune to come acros__.  A bu11y, a
 fiend of crue1ty.  _n very truth my fertib brain _vas
 seeî3_:i_g __ith plans for e__entua11p 1aying _at abom_in_ ab1e
 ru_an by the heels. hanging _____ou1d _ a merciful pun-
 iï_h_ment for such a miscreant.  Yes, indeed, five thou__and
 f_ancs-a b_ood1y sum in those days, _ir-_vas practica1ly
 a3îured me.  But over and above mere _ucre th.ere was
 th_e certainty that in a few days' ti_e _ shou1d see the
 1i__t of gratjtude shining out of a pair o_, _userous b1ue
  b                                .
 e__es, and a __inning smi1e chasing away the l_k of
 _,ear and of sorrow from the s___,eetest face _ _ad _.een _o_
 many a day.           .             .

Recognita Standard 3.2.7AK:

 ~'es, indeed, ~w-as on the track of ltT. Aristide Fournier,
 and of one of the most important hauls of enemy goods
 "=hich had ever been made in France. ~Tot only that. I
 ha~i also before me one of the most brutish criminals it
 had ever been my misfortune to come across. A bully-, a
 fiend of cruelty. In very truth my fertiIe brain was
 s; ething w-ith plans for eventually iaying that abominable
 ruffian by the heels : hanging ~-ould be a merciful pun-
 ishment for such a miscreant. ires, indeed, five thousand
 franes-a goodly sum in those days, Sir-was practically
 as~ured me. But over and above mere lucre there was
 thP certainty that in a few days' time I should see the
 light of gratitude shining out of a pair of lustrous btue
 ey·es, and a winning smile chasing away the hk of
 fear and of sorrow from the sweetest face I had seen for
 many a day.
 Yes, indeed, l~was on the track of h~i. Aristide Fournier,
 and of one of the most important hauls of enemy goods
 w~hich had ever been made in France. lVot only that. I
 had also before mP one of the most brutish criminals it
 had ever been my misfortune to come acrass. A bully, a
 fiend of cruelty. In very truth my fertile brain was
 seething with plans for ez~entually laying that abomin_ able
 ruffian by the heels : hanging ~~.-ould be a merciful pun-
 ishment for such a miscreant. Yes, indeed, five thousand
 f:ancs-a goodly sum in those days, Sir-was practically
 assured me. But over and above mere lucre there was
 the certainty that in a few days' time I should~ see the
 Iight of gratitude shining out of a pair of iEustrous blue
 eyes, and a w inning smile chasing away the Iook of
 fear and of sorrow from the s"-eetest face ~ had seen ~'or
 rr~any a day.

OmniPage Pro 10:

     Yes, indeed, twas on the track of 11T. Aristide Fournier,
 and of one of the most important hauls of enemy goods
 which had ever been made in France. Not only that. I
 ha(i also before me one of the most brutish criminals it
 had ever been my misfortune to come across. A bully, a
 fiend of cruelty. In very truth my fertile brain was
 seething with plans for eventually laying that abominable 
 ruffian by the heels: hanging would be a merciful pun-
 ishment for such a miscreant. Yes, indeed, five thousand
 francs-a goodly sum in those days, Sir-was practically
 assured me. But over and above mere lucre there was
 the certainty that in a few days' time I should see the 
 light of gratitude shining out of a pair of lustrous blue
 eyes, and a winning smile chasing away the look of
 fear and of sorrow from the sweetest face I had seen for 
 many a day.
     Yes, indeed, fwas on the track of h-I. Aristide Fournier,
 and of one of the most important hauls of enemy goods
 which had ever been made in France. Not only that. I
 had also before me one of the most brutish criminals it
 had ever been my misfortune to come across. A bully, a
 fiend of cruelty. In very truth my fertile brain was
 seething with plans for eventually laying that abominable
 ruffian by the heels: hanging would be a merciful pun-
 ishment for such a miscreant. Yes, indeed, five thousand
 francs-a goodly sum in those days, Sir-was practically
 assured me. But over and above mere lucre there was
 the certainty that in a few days' time I should see the
 light of gratitude shining out of a pair of lustrous blue
 eyes, and a winning smile chasing away the look of
 fear and of sorrow from the sweetest face I had seen for
 many a day.

OmniPage Pro 11:

 Yes, indeed, twas on the track of AT. Aristide Fournier, 
 and of one of the most important hauls of enemy goods 
 which had ever been made in France. Not only that. I 
 had also before me one of the most brutish criminals it 
 had ever been my misfortune to come across. A bully, a 
 fiend of cruelty. In very truth my fertile brain was 
 seething with plans for eventually laying that abominable 
 ruffian by the heels: hanging would be a merciful pun-
 ishment for such a miscreant. Yes, indeed, five thousand 
 francs-a goodly sum in those days, Sir-was practically 
 assured me. But over and above mere lucre there was 
 the certainty that in a few days' time I should see the 
 light of gratitude shining out of a pair of lustrous blue 
 eyes, and a winning smile chasing away the look of 
 fear and of sorrow from the sweetest face I had seen for 
 many a day.
 Yes, indeed, fwas on the track of h-I. Aristide Fournier, 
 and of one of the most important hauls of enemy goods 
 which had ever been made in France. Not only that. I 
 had also before me one of the most brutish criminals it 
 had ever been my misfortune to come across. A bully, a 
 fiend of cruelty. In very truth my fertile brain was 
 seething with plans for eventually laying that abominable 
 ruffian by the heels: hanging would be a merciful pun-
 ishment for such a miscreant. Yes, indeed, five thousand 
 francs-a goodly sum in those days, Sir-was practically 
 assured me. But over and above mere lucre there was 
 the certainty that in a few days' time I should see the 
 light of gratitude shining out of a pair of lustrous blue 
 eyes, and a winning smile chasing away the look of 
 fear and of sorrow from the sweetest face I had seen for 
 many a day.

TextBridge Millennium Pro:

 Yes, indeed, rwas on the track of M. Aristide Fournier,
 and of one of the most important hauls of enemy goods
 which had ever been made in France. Not only that. I
 hail also before me one of the most brutish criminals it
 had ever been my misfortune to come across. A bully, a
 fiend of cruelty. In very truth my fertile brain was
 seething with plans for eventually laying that abominable
 ruffian by the heels: hanging would be a merciful pun-
 ishment for such a miscreant. Yes, indeed, five thousand
 francs-a goodly sum in those days, Sir-was practically
 assured me. But over and above mere lucre there was
 the certainty that in a few days' time I should see the
 light of gratitude shining out of a pair of lustrous blue
 eyes, and a winning smile chasing away the look of
 fear and of sorrow from the sweetest face I had seen for
 many a day.   
  Yes, indeed, f was on the track of M. Aristide Fournier,
 and of one of the most important hauls of enemy goods
 which had ever been made in France. Not only that. I
 had also before me one of the most brutish criminals it
 had ever been my misfortune to come across. A bully, a
 fiend of cruelty. In very truth my fertile brain was
 seething with plans for eventually laying that abominable
 ruffian by the heels: hanging would be a merciful pun-
 ishment for such a miscreant. Yes, indeed, five thousand
 francs-a goodly sum in those days, Sir-was practically
 assured me. But over and above mere lucre there was
 the certainty that in a few days' time I should see the
 light of gratitude shining out of a pair of lustrous blue
 eyes, and a winning smile chasing away the look of
 fear and of sorrow from the sweetest face I had seen for
 manyaday.   

Scan 3 — Guttering and Smaller Print

Scan 3 is a paragraph from The Egoist by George Meredith. It was scanned in a dim room, with the scanner cover open and the book held open, flat against the scanner glass. However, the spine was not pressed firmly enough against the glass, and as a result you can see that the words on the left-hand edge (which were near the spine) appear to be slanted, a bit distorted, and not well lit. This problem is familiar to people who scan for PG — everybody gets distracted sometimes, and fails to keep enough pressure on the spine. As you see from the results below, it caused problems for all of the OCR packages on the words affected. If you find this kind of “guttering” regularly in your own scans, where the characters near the spine are not being recognized correctly by your OCR, you need to make sure that your book is down as flat as possible before making a scan.

I have made two separate scans, one at 300 dpi and one at 400 dpi, both black and white. Because of the smaller size and the guttering problem, the 400 dpi scan made for better quality text in this case.

Here’s the output from the sample OCR:

Abbyy FineReader 6:

 NEITHER Clara nor Vernon appeared at the mid-day table,
 n Middleton talked with Miss Dale on classical matters,
 like a good-natured giant giving a child the jump from
 stone to stone across a brawling mountain ford, so that an
 uncdified audience might really suppose, upon seeing her
 over the difficulty, she had done something for herself. Sir
 \Villoughby was proud of her, and therefore anxious to
 soltlo her business while he was in the humour to lose her.
 He hoped to finish it by shooting a word or two at Vernon
 before dinner. Clara's petition to be set free, released from
 him, had vaguely frightened even more than it offended hia
 nrido.
 NEITHER Clara nor Vernon appeared at the mid-day table.
 Dr. Middleton talked with Miss Bale on classical matters,
 like a good-natured giant giving a child the jump from
 stone to stone across a brawling mountain ford, so that an
 unedified audience might really suppose, upon seeing her
 over the difficulty, she had done something for herself. Sir
 "VVilloughby was proud of her, and therefore anxious to
 settle her business while he was in the humour to lose her.
 He hoped to finish it by shooting a word or two at Vernon
 before dinner. Clara's petition to be set free, released from
 him, had vaguely frightened even more than it offended his
 pride.

gocr 0.3.6:

 __,,,____,_ Cl,_I._c nor Vernon a__e_Ped _t tl_le _id_da_ tab1e_
 _, _ii_(__etoiI f,,_lk(;cl with _MiSs _ale _U_1d_ abS8iG_l I_i_t_t_l.__
 i,_i,;,_ .,, _(_u_-i,L_t_ii.e(l 6iiLIblt 6'7_V. ill_ _ C 'll .  tf e__Ul__b rU_l
 gt(),ii_, tu _fj(),I(, ,_uruSS.,__ T__ Illl_ g UlOUUt_lU  o_ _ 8O .t _' t_ail
 u,,_,_ifj(;il ;,_i((ic,IGG l_i_' lt re_ y 8UE)_OB_'_ U_Oll 8eelll6  lttr
 _,__i. t_ic (li__icu1ty, SIIe t1_d iluI_e 8ol_eth_ng_ fo_ be_.Self.  _i__
 _ji___()_i___lIl)y w,,s prui_il of heT_ and k__eTefope an_iouS  to
 _(_(.__u l___i. i)i__, ii,ess wIlile he Wa8 in the hU_ouT to luse Iier_
 j__ l_()_)(_(l t() tiiIish it b_ ShOOtiltg a WOTd o__ t_O &t Verno_
 _o__(),__ (li,_iIci._  Cl__T_'S _eti_tio_ tO be Set fTee_.Te1ea8ecl fro_
 )ii))),, lIL_Ll v_b__uely f_.ighteUe  eVen _OTe kba_ lt OfEe_ded hi_
 pi_i..(l_u- .  _  ,  ,   — .___ _ _,- - -__-
  ________ Cl__i.a nop Vernon appeared &t t'h_e _id_day t__le_
 D_. _id(lle_oi_ t_lked with Miss _ale ,on _ _Ssi__l __i tt_r_'_
 iij_e _ 6ood-n___tLi_.ed 6iai_t 6_i_ing & Ghild the ___np _'_.on_
 _tune to _tone aGro_S a braWlin( __ inOU__taiß _foPd_ So t2_at a__
 u__p,(_ified ___idiei_Ge _ni62it real y 8uppO.8e_ upon _seeii_6 l_e_
 o______ the difhculty_ she had done _o_neth_n6 fop ber_elf_  _i_
 _viljoli____k)y w__s proud of heT, and the_efo_e an_iouS to
 ___.tle li__i. i)u__inesS Whike he W_S î_ the hum'ou_ to_ lose her_
 __e l_op(_d to finish it by 8hooting a wopd o_ tWo ak Verno__ _
 _eforR_ _(in_icr_  Clara's petition to _ Set _free, releaSed fro_
 )ii__, h_d va6uely frigbte_ed eve_ _ore tban it o_e_ded hiD
 pi.icle.  -.  -  -   -  -  - '

Recognita Standard 3.2.7AK:

 ~rFr~rrmx Clara nor Vernon apneared at the mid-da~'table.
 Dr. bLidrlleton talkc;d wi.th Miss Dale vn elassieal matters,
 like a ~n~a-mZtured giant gi.ving a child th© jucnp frvm
 stonc to stone across a brawling mounta,in ford, so that au
 uiicilificd .ruciicucc mil;·ht really suppasc, upon seeixig hor
 ·n~er thc ciillicul.ty, she had clouo something for herself. Sir
 ~Villcm;;lrlry wvs proua of her, and therefors angiaus to
 sct.tla lrur tn~sincss while he was in the humoar to lose her.
 lle lu,hcot to iinish it by shooting a word ar two at Vernon
 bol'ore ~linncr. Clara's petition to bo set froe, released £rom
 JGGnt., hvd vagucly frighteued even more than it offended hia
 ri~le.
 p
 NEITfi~R Clara nor Vernon appeareci at the xnid-day table.
 Dr. Middleton talked with Miss Dalo on classics,l rnatters',
 like a good-natured giant giving a child the jtimp from
 stone to stone across a brawling mountain ford, so that an
 unedified audience might really suppose, upon ~ seeing her
 over the difficulty, she had done something for herself. Sir
 yillon ;hby was proud of her, and therefore anxiotis to
 scttle luer business while he w~as in the hurxiour to lose her:
 He hoped to finish it by shooting a word or two at Vernon
 before dinner. Clara's petition to be set free, released from
 jcLm, had vaguely frighteued even more than it offended his
 pride.

OmniPage Pro 10:

     NF r~rn,Px Clara nor Vernon appeared at the mid-dap table.
 Dr. Middleton talked with Miss Dale on classical matter,
 like .t good-natured giant giving a child the jump from
 stone to stone across a brawling mountain ford, so that an
 uneVified audience might really suppose, upon seeing her
 over the difficulty, she had done something for herself. Sir
 jV;llo,r;;lrl>y was proud of her, and therefore anxious to
 set.tlo lror Uusiness while he was in the humour to lose her.
 Ile. lropcol to finish it by shooting a word or two at Vernon
 bol'ore dinner. Clara's petition to beset free, released from
 )zinc, had vaguely frightened even more than it offended his
 pride.
     NEITHER Clara nor Vernon appeared at the mid-day table.
 Dr. Middleton talked with Miss Bale on classical matters',
 like a good-natured giant giving a child the jump from
 stone to stone across a brawling mountain ford, so that an
 unedified audience might really suppose, upon ~ seeing her
 over the difficulty, she had done something for herself. Sir
 yillou ;hby was proud of her, and therefore anxious to
 settle her business while he was in the humour to lose her.
 He hoped to finish it by shooting a word or two at Vernon
 before dinner. Clam's petition to be set free, released from
 him, had vaguely frightened even more than it offended his
 pride.

OmniPage Pro 11:

 NF f,rnMR Clara nor Vernon appeared at the mid-day table. 
 Dr. Middleton talked with Miss Dale on classical matters, 
 like .t good-natared giant giving a child the jump from 
 stone to stone across a brawling mountain ford, so that an 
 une(lifie(l audience might really suppose, upon seeing her 
 over the difficulty, she had done something for herself. Sir 
 jVillon;hl)y was proud of her, and therefore anxious to 
 setale leer business while he was in the humour to lose her. 
 lle hoped to finish it by shooting a word or two at Vernon
 bofore dinner. Clara's petition to beset free, released from 
 )lint, had vaguely frightened even more than it offended his 
 pride.
 -.2 ..1_ - ____
 NEITHER Clara nor Vernon appeared at the mid-day table. 
 Dr. Middleton talked with Miss Dale on classical matters', 
 like a good-natured giant giving a child the jump from 
 stone to stone across a brawling mountain ford, so that an 
 unedified audience might really suppose, upon,seeing her 
 over the difficulty, she had done something for herself. Sir 
 Willoughby was proud of her, and therefore anxious to 
 settle her business while he was in the huniour to lose her. 
 Il"e hoped to finish it by shooting a word or two at Vernon 
 before dinner. Clara's petition to be set free, released from 
 hint, had vaguely frightened even more than it offended his 
 pride. - -

TextBridge Millennium Pro:

 NErr'!'~~ Clara nor Vernon appeared at the mid.day table.
 pr. ~1id(lIeto11 talked with Miss Dale on classical matters,
 like a good-natured giant giving a child the jump from
 stone to stone across a brawling mountain ford, so that au
 ~1edifi~ tLU(llCIlCC might really suppose, upon seeing h er
 over the (hjiheulty, she had done something for herself. Sir
 wiflouighby was proud of her, and therefore anxious to
 settle her business while he was in the humour to lose her.
 lie ho1)ed to finish it by shooting a word or two at Vernon
 before dinner. Clara's petition to be set free, released from
 him, had vaguely frightened even more than it offended his
 prú~t~.
THER Clara nor Vernon appeared at the mid-day table.
 Pr. Middleton talked with Miss Dale on classical matters,
 like a good-natured giant giving a child the jump from
 stone to stone across a brawling mountain ford, so that an
 une(lified audience might really suppose, upon - seeing her
 over the difficulty, she had done something for herself. Sir
 Willoughby was proud of her, and therefore anxious to
 settle hier l)uSifleSS while he was in the humour to lose her.
 lie hoped to finish it by shooting a word or two at Vernon
 before dinner. Clara's petition to be set free, released from
 hirn~, had vaguely frightened even more than it offended his
 pri(le.

Scan 4 — A Really Bad Case!

Scan 4 is a paragraph from Pope’s translation of Homer’s Odyssey. This is a very, very tough one. It was obviously a cheap printing to begin with, using thin, poor-quality paper in a page size of 6” by 4.5”, with capital letters about 1.5 mm high, a little bigger than Times New Roman size 8. Text this small really needs a higher-resolution scan. The book was falling apart when I got it, the ink was fading and flaking, and there was no point in even thinking about trying to scan it flat, so I cut the pages. To add an extra challenge, I scanned the sample with the cover open in a medium-lit room for the 300 dpi and 400 dpi scans, but closed the cover for the 600 dpi scan to show the best quality I could possibly get. (I was pleased to note that Abbyy, while recognizing the page in the 300 dpi and 400 dpi images, flashed up a suggestion that I should lower the brightness of the scan.)
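
To see why small print calls for a higher resolution, it helps to work out how many pixels tall a capital letter ends up in the image. The sketch below is just that arithmetic (millimeters to inches, times dpi), using the 1.5 mm capitals here and the 2.2 mm capitals from the earlier sample; the function name is my own, and nothing in it is specific to any OCR package.

```python
MM_PER_INCH = 25.4


def cap_height_px(height_mm: float, dpi: int) -> float:
    """Height in pixels of a letter height_mm tall when scanned at dpi."""
    return height_mm / MM_PER_INCH * dpi


for height_mm in (2.2, 1.5):
    for dpi in (300, 400, 600):
        pixels = cap_height_px(height_mm, dpi)
        print(f"{height_mm} mm capitals at {dpi} dpi: about {pixels:.0f} pixels tall")
```

On these numbers, the 1.5 mm capitals scanned at 600 dpi come out about 35 pixels tall, roughly the same as the 2.2 mm capitals scanned at 400 dpi.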

This particular book was one I sporadically tried to produce, without success, on an older scanner and a bundled OCR program over a period of two years, back in 98/99. Eventually, in 2000, it was the first book processed through Charles Franks’ Distributed Proofreaders site. The initial text produced by the OCR was very poor, but the human volunteers made up for it! Thanks, guys! Today, just two years later, with a better scanner and better OCR, I could have done it myself, as you will see from the best of the results of the 600 dpi scans. That’s how much things have improved recently.

A separate point to note here is that you can see the “three-quarter space” effect before the exclamation mark and semi-colon that was discussed in [V.104].

The results of the OCR are:

Abbyy FineReader 6:

 " Ah me ! on what inhospitable coast,
 On Tvh.it new region is Ulysses toss'd ;
 Possess'd by wild barbarians fierce in arms ;
 Or men. whose bosom tender pity warms ?
 What sounds are these that gather from the shores ?
 The voice of nymphs that haunt the sylvan bowers,
 The fair-hair'd Pryads of the shady wood ;
 Or azure daughters of the silver flood ;
 Or human voir-e? but issuing1 from the shades,
 AVhv cease I straight to learn what sound invades?"
 " Ah me ! on what inhospitable coast,
 On what new region is Ulysses toss'd ;
 Possess'd by wild barbarians fierce in arms ;
 Or men, whose bosom tender pity warms '?
 "What sounds are these that gather from the shores ?
 The voice of nymphs that haunt the sylvan bowers,
 The fair-hair'd Dryads of the shady wood ;
 Or azure daughters of the silver flood ;
 Or human voice? but issuing from the shades,
 Why cease I straight to learn what sound invades?"
 " Ah me ! on what inhospitable coast,
 On what new region is Ulysses toss'd ;
 Possess'd by wild barbarians fierce in arms ;
 Or men, whose bosom tender pity warms ?
 "What sounds are these that gather from the shores ?
 The voice of nymphs that haunt the sylvan bowers,
 The fair-hair'd*Dryads of the slrady wood ;
 Or azure daughters of the silver flood ;
 Or human voice? but issuing from the shades,
 Why cease I straight to learn what sound invades?"

gocr 0.3.6:

[The 300 and 400 dpi scans produced nothing recognizable. The result of the 600 dpi scan is below.]

    _hh i_3e ! o_1 ___l_at_ i__l__sl__ it_nble CoaSt_
 On ___l_,__ _)e_v i_e_io__ i__ ___ _._____ses toss'd ;
 _(3s3gs3_d l3.__ ___iiíi l3_3__b___i_c_i3_ fie_Ce in il__S- _
 Or i11pn, __-i)c3se l_osonl te_1de_ _it____ __ai_n3__ ?
 ___l_at __o__i1ds Qre tlipse tliat g__tl_p_r fE_oi33 the shoTes ?
 '_ilie __oi__e of i)____ E1)l3l3s tl3nT 1i_n__nt the s__l__inn bo_Ye_5_
 3'l_e fni___i____ir'd _____-ads of' il_e sli__d__ i___oOd _
 Op az(_pe da_____litc__s of _tlie sil __?r t1ood ;
 Or l___i31_nn ___)i___? l3__t i3____ii_6 fi_oi11 tlie __hiade__ _
 __'!3.__ _ea___e _ s_rai__li.t to l_ar_i1- i_ — li__t so_nd-in__ad_S___

Recognita Standard 3.2.7AK:

 .: lh nt"'. on w-hat inlu,;y:t, I,:e co;;~t,
 On ~cli^t ne~- re~ion i.. 1= 1-.-:.:e~ tm:'d ;
 Possea'd 1n- wil~l L;,rba~:c, .~ fierce in arm~ ;
 Or u.~u. w-Ln.e bossum tender pit~- warna'?
 ~l-u:lt .<,:~;;::;3s are tll~ce that ~atl:er from the shnre~ ?
 'I'l.e -;;o'.re :,; nwtthil: tW ,t l:aa;nt the s~-l:c 1llJOR'er5,
 'lhe :a,:~-h ~;r'd~It.wa~i~ ot' tl:e ~Il;;dv vood;
 Or az.lre dau~~l.ts~: oY tl:c ·:iv-~~r floo;:3 ;
 C?r humnn ~-<:i: e'? l,~:tt i~~; from tl:c· ~had~~,
 11-lts- cea~e I ctrai rlit to learn ~s-l:, t socud incades %"
 " ~h me ! ou "-Mat iuMospita~le coast,
 On ~i-lmt ne~c reyion is L 1~-~ses to~s'd ;
 Pos:e;s'd 1"~ w-iMl lrvrbaria:ns fiet~ce in arms ;
 Or m~ n, "-hose hosom tender pit~- warm5 ?
 Marcellohat ~ounds are tlmse tMat ~;atMer from t:he shores ?
 ~t'I~e ~-oi~~e of n~-Inhhs t.hat liaunt the s~-l~~a n howers
 .
 Tlie fair-hnir'd D~ vads ot tl:e shad~- "-ood ;
 Or aznre dau~liters of tMe sil~-~r fiood ;
 Or lmman ~-oi:~e'? but iauin~ frotn the shades, a
 lVly cea.~e I straibht to learn "-Mat souud in~ad°s?"
 " Ah me ! on what inhospitable coast
 On ~~-hat new r e~ion is L;1 ~-sses toss'd ~
 ,
 Possess'd 1J~- "-ilil I:OII'uai'la ils fierce in arms_ ·
 Or men, whose hosom tender pit~l ~varn~s ?
 ~'G'l~at somnds are these tliat ~atl~er from the shores ?
 ~I'Iie v oice of n~-mpl~S that ~munt the sy Ivan bowers,
 Tlie fair -hair'd DMarcello-ads of tl~e slmdy wood ;
 Or azure daylltcrs of tlle silver flood ;
 Or lm:nan voice? uut issL~ing from the shades,
 ~~'lm cea~e I strai~ht to Iearn ~~-lmt so~nd inv ades ?"

OmniPage Pro 10:

 On "M.^t new reion is 1=1;-a:e~ to-s'd ;
 P"::e:~'d hw "ild Larba.:an~ fierce in arms ;
 Or inn. "-hnse bo.,om tender pity warms
 What <m-,n ds are thFSe that gather from the shores?
 '1-l.e vo_,e o2 u~vnhit: thm hn,,-,nt The sylvan bowers,
 The is ;r-ha;r'd h.-;-ads of the liz-Ay iNood
 Or azure dau_ht;- of tl:c o=1 cr flooj ;
 Or hnnmn wire? l,11t i — rii:g from the shadP3,
 Al-ly cease I straiAlit to learn what sound invades?"
     'Wh me ! on what inhospitable coast,
 On what new region is L fusses toss'd ;
 Possess'd br wild barbaric ns fierce in arms ;
 Or men, whose bosom tender pith- warms
 AN-hat sounds are these that gather from the shores ?
 The voice of nymphs that Haunt the sylvan bowers,
 The fair-hair'd IWvads of the shady -wood ;
 Or azure daughters of the silver flood ;
 Or human voice? bat iauina from the shades,
 Why cease I straight to learn what sound invades?"
     " Ah me! on what inhospitable coast,
 On what new region is Ll ysses toss'd ;
 Possess'd bv -wild barbarians fierce in arms ;
 Or men, whose bosom tender pity warnis ?
 AVlia± sounds are these that gatller from the shores
 The voice of nYI11pliS that haunt the -sylvan bowers,
 The fair -hair'd D.-yads of the shady wood ;
 Or azure daughters of the silver flood ;
 Or human voice? lout issuing from the shades,
 Why cease I straight to learn what sound invades?"

OmniPage Pro 11:

 .` lh in-' on what inhospital,le co-st, 
 On xclznt near region is t 1:-sse~ toss'(: ; 
 Possess'd bY Mild barbarians fierce in aims ; 
 Or inn. whose boson tender pity warms
 What <m-,n ds are tlipse that gather from the shores ? 
 '_I-I.e 1-o=,- of nv:npii? that haunt the sylvan bowers, 
 She ra;r-ha;r'd 1):, ads of the shad- wood ;
 Or az.ire dau_lit~- of tl:e silo-:-r flood ;
 Or human voice? l,,tt i?snina from the shadpq, 
 Al-lry cease I straiAit to learn shat sound invades?"
 ''' :Ah me ! on what inhospitable coast, 
 On iyhat new region is Ulysses toss'd ; 
 Possess'd br wild barbarimis fierce in arms ; 
 Or men, whose bosom tender pity warms 
 AN-hat sounds are tliese that gather from the shores ? 
 The voice of nymphs that haunt the sylvan bowers, 
 The fair-hair'd D~ yads of the shady -wood
 ;
 Or azure dau.L-hters of the silver flood ;
 Or human voice? but issuing from the shades, 
 Why cease I straight to learn what sound invades?"
 " Ah me! on what inhospitable coast, 
 On what new region is Ulysses toss'd ; 
 Possess'd by -wild barbarians fierce in arms ; 
 Or n1en, whose bosom tender pity warnis ? 
 AVliat sounds are these that gather from the shores 
 The voice of nyniplis that haunt the sylvan bowers, 
 The fair-hair'd Dryads of the shady Wood ;
 Or azure daughters of the silver flood ;
 Or human voice? but issuing from the shades, 
 Why cease I straight to learn what sound invades?"

TextBridge Millennium Pro:

     no on what inhe~ptaEie coast,
 On what new realun is hivs,e' to5sd
 ,s~s Ä-~d liv wild lie il)~m.ihI fir see in al-rn~
 Or u~,-n. w'linse bo,uuiu tender pity warnls
 Wl at ~ are t1ie~e that ~atler from the shores ?
 'n.e a oro of imvntpirs tint he~nt the sad van bowers,
 'flie tah'-ha~r'd D~vahs ct the shady wood
 1)1' az Ire dauul~t ~ of tl,e shvr flood
 Or liunian vi i 'I ? h'tt is- eng from the shades,
 \VIiv cea-~e I straight to learn w hat sound invades 1"
   Ah me on what inhospitable coast,
 On what new region is U vases toss'd
 Possess'd by wild barbarians fierce in arms
 Or men, whose bosom tender pity warms ~
 What sounds are these that gather from the shores?
 The voi'e of nymphs that haunt the sylvan bowers,
 The fair-baird Prvads of tl~e shady wood
 Or azure daughters of the silver flood
 Or human vuiae? but issuing fi'om the shades,
 Why cease I straigl~t to learn what sound invades?"
   Ah me on what inhospitable coast,
 On what new region is Ulysses toss'd
 Possess'd by wild barbarians fierce in arms
 Or men, whose bosom tender pity warms?
 What sounds are these that gather from the shores?
 rfhe voice of nymphs that haunt the sylvan bowers,
 The fair-hair'd Dtyads of the shady wood;
 Or azure daughters of 'the silver flood
 Or human voice? but issuing from the shades,
 Why cease I straigl~t to learn what sOund invades?"

Conclusion

Small mistakes in scanning, like letting too much light in, getting your scanner settings wrong for the page, or not pressing the paper flat enough, can make a major difference to the final quality of the text that you will have to correct.

Sometimes, no matter what you do with your scanner, problems with the paper or the print will make it difficult for your OCR package to give good output.

Generally, a higher resolution is better within the range 300 dpi to 600 dpi, but you only need the higher resolutions for more difficult material.

Different OCR packages will produce widely differing texts from the same images. Given a really good image, most OCR software will work acceptably, but when you have lower quality material to work with, the gap between OCR packages shows clearly.

I got an OCR package bundled with my scanner. Is it good enough to use?

That depends on how well your package performs on the actual scans that you do, and how much you value your time vs. money. Most scanners are bundled with OCR software, but these OCR packages are often older or “brain-damaged” versions, with their functionality deliberately lowered. It’s unlikely that you’ll get a current-version, top-of-the-line OCR package thrown in for free.

You may have to pay extra for better OCR, but it means that you spend less time making corrections. The question is how much better you want your OCR to be.

Save the images from the FAQ “Why am I getting a lot of mistakes in my OCRed text?” and try processing them with the OCR you have. Compare the quality of the text produced with the quality of the samples. This should give you some idea of how your OCR compares to others.

Try a few pages from your book with your OCR. How many mistakes do you see on each page? Do you find that acceptable?
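
If you have a corrected copy of a page to compare against, counting the mistakes can be automated roughly. The following is a minimal sketch, assuming you have both the raw OCR text and a hand-corrected version as strings; count_differences is a made-up name, and it counts mismatched words rather than individual characters.

```python
import difflib


def count_differences(ocr_text: str, reference: str) -> int:
    """Rough count of words in the OCR text that differ from the reference."""
    ocr_words = ocr_text.split()
    ref_words = reference.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)
    # Total number of words that line up exactly between the two texts.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return max(len(ref_words), len(ocr_words)) - matched


# One line from the samples, with and without the OCR error.
correct = "fiend of cruelty. In very truth my fertile brain was"
ocr = "fiend of cruelty. In very truth my fertile brain %vas"
print(count_differences(ocr, correct))
# prints: 1
```

This is only a rough measure: a word split in two or two words run together will count as more than one difference, so treat the number as a guide to relative quality rather than an exact error count.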

I want to include some images with an HTML version. How should I scan them?

If it’s color in the printed book, then it’s desirable to use color in the scan. Otherwise, try both greyscale and B&W, and see which gives you the best image. It’s usually better to scan images in a higher resolution than you’re going to use, and then use an image manipulation package to reduce them [H.10] to a size appropriate for your HTML file. The HTML FAQ has details on how to present them. An initial scan at 600dpi is often good. Image manipulation programs will also allow you to “clean up” the pictures, by increasing contrast, despeckling, or other filtering.
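
As a rough illustration of scanning high and then reducing, the sketch below works out the pixel dimensions before and after. The 4 by 6 inch illustration and the 600-pixel target width are made-up figures for the example, not recommendations, and the function only does the arithmetic; the actual resizing is done in your image manipulation package.

```python
def scaled_size(width_px: int, height_px: int, target_width_px: int) -> tuple[int, int]:
    """Return (width, height) reduced to target_width_px, keeping the aspect ratio."""
    if target_width_px >= width_px:
        return width_px, height_px  # never enlarge the scan
    scale = target_width_px / width_px
    return target_width_px, round(height_px * scale)


# A hypothetical 4 x 6 inch illustration scanned at 600 dpi is 2400 x 3600 pixels.
scan_width, scan_height = 4 * 600, 6 * 600
print(scaled_size(scan_width, scan_height, 600))
# prints: (600, 900)
```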

I want to include some images with an HTML version. What type of image should I use?

GIF, JPEG and PNG images are supported by current browsers, and you should stick with those unless you have a specific reason not to.

GIF and PNG tend to be more efficient — provide better quality at a given file size — for simple line-drawings; JPEG is usually better for photographic images.

Will PG store scanned page images of my book?

Yes, Project Gutenberg would be pleased to include your scans. However, if the scans you used came from another online source such as The Internet Archive, links to those scans are provided as part of the copyright clearance instead, and there is probably no need to keep an additional copy in Project Gutenberg.

While page images cannot be searched, or converted to other text-based formats for reading, they do have some value — for checking possible errors in the transcription, for holding images that might not have been preserved in HTML, for checking cited page numbers, for re-printing, and just generally for anyone who wants in-depth information on the source paper. This is not our core purpose, and the page images must be seen as an adjunct to the text rather than a main feature. However, disk space and bandwidth are now plentiful enough that it is practical to preserve these, if only for the relatively few people who might make use of them.

We do have to be careful in our use of space and bandwidth, though. To use 40 KB per page is reasonable, given today’s resources; to use 140 KB per page is not. Thus, we insist on maximally-compressed black and white page images only, for normal pages, and the best size-to-quality ratio we can get for pictures.
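
To see why the per-page figure matters, here is a small sketch of the arithmetic. The 400-page book is only an illustrative assumption, the function name is hypothetical, and the conversion assumes 1 MB = 1024 KB.

```python
def page_images_total_mb(pages, kb_per_page):
    """Approximate total size in MB of a set of page images.

    Hypothetical helper; assumes every page takes about kb_per_page KB
    and that 1 MB = 1024 KB.
    """
    return pages * kb_per_page / 1024


if __name__ == "__main__":
    for kb in (40, 140):
        total = page_images_total_mb(400, kb)
        print(f"400 pages at {kb} KB per page: about {total:.1f} MB")
```

At 40 KB per page the example book comes to roughly 16 MB; at 140 KB per page it comes to roughly 55 MB for the same book.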

Our current guidelines on the submission of page images are:

  1. PG is now accepting page images of books posted. Page images will be posted only as an addition to an etext posted in the normal way — we will not post page images without plain text.
  2. Page images are an option; they are not and will not be required for the posting of a text.
  3. All page images should be good enough to work reasonably well with OCR packages, up to 600 dpi, and should be stored as black-and-white TIFFs with CCITT-4 (aka ITU-G4 or Fax Group 4) compression. This is important, so that we keep the overall file size down to a sustainable level. With this compression, a typical 600 dpi page can be stored in about 40 KB. Our ability to post these images depends on the file sizes staying fairly reasonable. Pages such as color pictures or greyscale photos that cannot reasonably be stored as black-and-white only should be stored as TIFF or JPEG with the best compression you can get for that image. (Note: IrfanView for Windows does this nicely, individually or in batch. With ImageMagick v 6.x, use: convert myimage.png -compress group4 myimage.tif)
  4. Each page image should be a separate file and named with the page number within the set; e.g. 001.tif, 002.tif, etc. Separate, non-page images, such as covers or color images scanned separately from the pages, should have suitable names, such as “cover.jpg” or “072-image.tif”. All page images for the book will be zipped into one file, to be called FILENUMBER-page-images, e.g. 12345-page-images.zip for etext #12345, and stored in the main directory for that etext. It will unzip to a subdirectory ./page-images, but we will not post separate page images in that directory, since that would double the space used, and we believe that people who want to consult the images will probably want them all. So, for now at least, if you want the images, you must download the ZIP file.
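
The naming and packaging convention in item 4 can be sketched as follows. This is only an illustration, not the procedure PG itself uses to build the archive: the function names are hypothetical, and storing the files without further ZIP compression is an assumption made here because Group 4 TIFFs and JPEGs are already compressed.

```python
import zipfile
from pathlib import Path


def page_image_name(page_number, ext="tif"):
    """Return a zero-padded page image name, e.g. 1 -> '001.tif'."""
    return f"{page_number:03d}.{ext}"


def build_page_image_zip(image_dir, etext_number, out_dir):
    """Zip every file in image_dir into FILENUMBER-page-images.zip.

    Entries are stored under page-images/ so that the archive unzips
    into a ./page-images subdirectory. Hypothetical helper.
    """
    # List the images before creating the ZIP, so the archive never
    # tries to include itself if out_dir and image_dir are the same.
    images = sorted(p for p in Path(image_dir).iterdir() if p.is_file())
    zip_path = Path(out_dir) / f"{etext_number}-page-images.zip"
    # ZIP_STORED: no extra compression (an assumption of this sketch).
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as zf:
        for path in images:
            zf.write(path, arcname=f"page-images/{path.name}")
    return zip_path
```

For example, build_page_image_zip("scans", 12345, "upload") would write upload/12345-page-images.zip, where "scans" and "upload" are placeholder directory names.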

Page images submitted to Distributed Proofreaders [B.2] are automatically saved, and, while not publicly available today, will probably become so in the future.

To store page images or pictures at a higher resolution than we can reasonably post today, you might consider the Internet Archive.