I’ve been correcting the texts in the newspaper articles on the CDNC (California Digital Newspaper Collection) for almost 4 years, and I am now 12th in the Text Correctors Hall of Fame on CDNC, with over 160,000 lines of text corrected. During that time, I’ve come up with some techniques I’d like to share. I haven’t seen any real guidelines on how to correct these texts. Much of it is common sense, but I have decided to write down my methods and share them. By correcting texts, I mean that I (and others) retype and correct the searchable texts for the online scans of old California newspapers.
How CDNC works:
The CDNC is based on an online software system called VERIDIAN that we will work with here. It is hosted by UC Riverside. Like any software system, it has its quirks but we work with it to get the best corrections we can.
Note: Sometimes in old newspapers you will read items that are, by today’s standards, racist or that disparage some group of people. In some cases the journalistic style is raw or brutal. I simply grit my teeth and get on with the correcting.
I don’t believe we should change or remove these texts; you should leave the offensive word or article as is. This is the value of historic texts: they show us things that were common 100 years ago but are uncommon today. Even if you remove them from the text, those words still exist in the scanned images.
For this tutorial, let’s work on a column of advertisements from the last page of the Healdsburg Enterprise, February 21, 1878. This is simply an example of what I was working on this evening (January 21, 2018).
To start correcting, you need to have an account and log in. Creating one is simple and costs nothing. You can then select the random issue presented to you at login or you can search for a specific issue and jump in. I’m currently working on the HEALDSBURG ENTERPRISE from 1878. It’s a local paper and the time period interests me.
Once you select an issue, the system presents you with an image of a page of a newspaper on the right and on the left a smaller column with two tabs, ISSUE and ARTICLE. Let’s call that smaller box on the left side the CORRECTING BOX.
If the ARTICLE tab is selected in the CORRECTING BOX, you will see lines of text and on the right in the scanned image you will see a highlighted section of the newspaper.
If the ISSUE tab is selected, you will see a column of headers that correspond to the columns on the newspaper image. Selecting one of the headers will also highlight a column on the right and trigger a switch to the ARTICLE tab. (See How to Correct the ISSUE headings below.)
The texts we correct are used by search engines and the system (Veridian) to find items in the old papers. Events, names, and other key data are what people search for, so careful correcting matters.
To begin, select CORRECT THIS TEXT from the CORRECTING BOX in the ARTICLE tab.
There are now four buttons at the top of the correcting box:
SAVE will be grayed out until you make a correction to something in the correcting box.
SAVE&EXIT is also gray until you make a correction. It will save your work and take you back out of correcting mode.
CANCEL discards your changes.
NEXT is always available – it will take you to the next column available for correction.
The RED BOX on the left in the correcting box is where you type corrections. The RED BOX on the right is where you are on the original page. In this case the red box on the right side does not highlight all the text on the line on the left, so we simply add in the missing text in the CORRECTING BOX.
You can see in the above image, there are also some graphics embedded in the original text (the right pointing finger for example) that are transcribed as random characters, in this case ‘&T’. When corrected, we will delete them.
The image above is a corrected version of this advertisement. I’ve added in the texts omitted by the software and corrected mis-transcribed texts.
On line 5, the software pushed ‘HAIRPRODUC’ into one word; correcting it to ‘HAIR PRODUCER’ makes the line complete. On line 21, you can see the ‘&T’ is deleted and the line begins with the text immediately after the right pointing finger.
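In the case of Mrs. Moore’s address, 1008½ Market Street, the scanner can’t easily do fractions, so I enter the ½ character by holding ALT and typing 171 on the numeric keypad. (Strictly speaking, ½ is not an ASCII character; ALT-171 is a Windows Alt code.) Most of these character codes are easily looked up, and VERIDIAN seems to handle most of them well.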
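For anyone curious where the 171 comes from, here is a minimal Python sketch. It assumes a typical US Windows setup, where Alt codes typed without a leading zero are looked up in the old DOS code page 437. The variable names are only for illustration.

```python
import unicodedata

half = "\u00bd"  # the ½ character

# Unicode name and code point of the character
print(unicodedata.name(half), hex(ord(half)))

# In code page 437, byte 171 is the ½ character (assumed to be what ALT-171 uses)
from_alt_code = bytes([171]).decode("cp437")

# ½ lies outside the 7-bit ASCII range (0-127)
is_ascii = half.isascii()
```

If ALT-171 gives you a different character, your system is probably using a different code page, and pasting the ½ from a character map works just as well.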
FONTS are not important in correcting. Fancy fonts sometimes trip up the software, and your ability to transcribe them is much better than any machine’s. Simply stick to spelling and capitalization and ignore the rest.
Small CAPS – In cases where SMALL CAPS are used (“Bank of Healdsburg” above), I use normal capitalization (upper and lower case) in the corrected text.
Numbers should be reviewed carefully. Often you will see $IOO for one hundred that should be $100, with a capital I for 1 and a capital letter O for zero. Your browser settings and the font you use will make this easy to see or easy to miss. In my case, I use the CHROME browser for correcting as it allows some automated spell checking on the highlighted texts and points out incorrect numbers in a consistent way.
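If you keep a copy of your text outside the editor, a small script can point out suspicious dollar amounts like the $IOO example. This is only a rough sketch in Python, not part of VERIDIAN; the function name and the list of look-alike letters are my own assumptions and certainly incomplete.

```python
import re

# Letters that OCR commonly substitutes for digits (illustrative, incomplete list)
LOOKALIKES = str.maketrans({"I": "1", "l": "1", "O": "0", "o": "0"})

# A dollar sign followed by a run of digits, look-alike letters, commas or periods
AMOUNT = re.compile(r"\$[0-9IlOo][0-9IlOo,.]*")

def suggest_amounts(line):
    """Return (found, suggested) pairs for dollar amounts containing look-alike letters."""
    pairs = []
    for match in AMOUNT.finditer(line):
        found = match.group()
        suggested = found.translate(LOOKALIKES)
        if suggested != found:
            pairs.append((found, suggested))
    return pairs

print(suggest_amounts("A reward of $IOO will be paid"))
```

It only makes suggestions; you still need to check each one against the scanned image.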
Depending on the font and page condition, the scanner and OCR system can confuse similar-looking characters. It has a hard time distinguishing ‘b’ from ‘h’, ‘e’ from ‘r’, and ‘3’ from ‘8’, for example. More modern papers have fewer issues than the older ones with their serif fonts and poor print quality. I’ve gone back to reread my corrections many times only to find ‘he’ instead of ‘be’, ‘.l’ instead of ‘J’, or ‘I’ instead of ‘i’. You will find these throughout the texts and you should scan for them as you work.
I also use the mouse very little once I’m in correcting mode. I’ve trained myself to use TAB to move down a line and SHIFT+TAB to move up a line. In VERIDIAN you work one line at a time.
Use Control+LEFT or RIGHT ARROW to move a word left or right.
Use Control+SHIFT+LEFT or RIGHT ARROW to highlight words for easy replacement or copying.
HOME and END will take you to the start and end of the line, and SHIFT+HOME or SHIFT+END will select from the cursor to the start or end of the line.
Control+A will select all the text in the line.
Control+Z will undo your commands.
I don’t use a Mac, so I can’t tell you what to do on that keyboard.
Running Out of Room – One problem is when the correcting box editor runs out of lines and the article has more text to enter.
In this case, I continue to type the texts from the image into the last line, basically adding on until done. In this example, you can’t see all of the text in the last line in the CORRECTING BOX. It is:
“F. J. SCHWAB, CUSTOM Boot and Shoe MAKER, North Side of Plaza, Healdsburg. HAVING OPENED A BOOT AND SHOE shop at the above location, I am prepared to make ANY STYLE OF BOOT OR SHOE TO ORDER. None but the Best of Materi-als Used and Perfect Fit Guaranteed. Give us a trial and satisfy yourself of the superiority of my work.”
That’s a long line! I tried to keep the capitalization and punctuation. I haven’t found a limit for adding in texts in this fashion, though it is possible there is one. If you do this, you should check your data after you save to make sure it worked as expected.
If a lot of text is missing, you can go back up in the correcting box and begin to merge lines to free up space to add the texts. I try to keep these grouped together, merging a single advertisement, for example.
Hyphens. If a word is broken between two lines and the hyphen is the last character on the line, VERIDIAN will remove it when it generates the text view of the article. However, in the example above, the hyphen in ‘Materi-als’ is not the last character on the line, so VERIDIAN does not eliminate it when it creates the text for that article. So, within a line, don’t hyphenate; type the word whole.
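To make the hyphen rule concrete, here is a small Python sketch that imitates the behavior I have observed. It is not VERIDIAN’s actual code, and the function name join_lines is made up for this illustration.

```python
def join_lines(lines):
    """Join corrected lines into running text, dropping a hyphen only at the end of a line."""
    text = ""
    for line in lines:
        line = line.strip()
        if text.endswith("-"):
            # Previous line ended with a hyphen: remove it and join the word halves
            text = text[:-1] + line
        elif text:
            text = text + " " + line
        else:
            text = line
    return text

# Hyphen at the end of a line is removed; a hyphen inside a line survives
print(join_lines(["None but the Best of Materi-", "als Used"]))
print(join_lines(["None but the Best of Materi-als Used", "and Perfect Fit Guaranteed."]))
```

The second call shows why a hyphen left in the middle of a long merged line ends up in the searchable text.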
You can see the text for any article two ways: 1 – pick a column on the right/image side of the screen, then right click and select TEXT OF THIS ARTICLE from the menu or 2 – return to viewing mode and the text for the last selected article will appear in the correcting box. Your corrections will appear after you save your work.
Damaged images. I’ve also been dealing with a number of pages of the HEALDSBURG ENTERPRISE that are damaged. Humans are much better suited to piecing this together than computers are, so try your best; you can only enhance things.
In these cases, I use ellipses ‘ … ‘ to indicate texts that can’t be read and put any word guesses in parentheses. So, for this example, a line like
“(ers) of Berlin and Rome would prefer to … terms with the Pope,” shows how I handle this. In the next line I made an assumption that “Vatican” was correct though difficult to make out, but (the) was a guess.
Correcting the ISSUE headings.
The texts here are also found in the ARTICLE view; they are in a separate box just below the ARTICLE tab. There is an EDIT command next to the texts on both tabs. Often, these need correcting too, and sometimes VERIDIAN gets them completely wrong. In the CORRECTING BOX, under either tab, select [EDIT] to make the change; this is the only place you can access them.
Texts like this can be a real puzzle. In this case, I simply tried to use the space allocated to type in as much text as makes sense. Thus, if someone was searching for JULIUS KING, they would find this.
As mentioned above, here is the case where I had to squeeze two ads into the available lines, as the editor had no more lines to use yet there was more to be added. In this case I pushed the entire ad for LANNAN & DEMPSEY onto three lines, leaving me one line to add the entire text for ARTISTIC PHOTOGRAPHS.
When I encounter this — the selection box on the right overlaps two (or more) lines — I simply make room for all of the text and add them as a single line. This image is after I made the corrections.
Also, the text below it will read ‘RUPTURE Use no more Metallic’ and the next line ‘Trusses! No More suffer-ing from Iron Hoops or Steel Springs! ROWE’S’
Sometimes, when you encounter overlapping lines, you will see one or more lines repeated. I always go back and review what I’ve corrected and try to adjust this in a way that makes sense, but I always try to eliminate the duplicated lines and keep the flow of the texts in order.
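If you work on a copy of the lines outside the editor, spotting repeats can be automated. The following is a hedged Python sketch, not a VERIDIAN feature; drop_repeated_lines is a name I made up, and it only catches a line that repeats the one directly before it.

```python
def drop_repeated_lines(lines):
    """Remove a line when it repeats the line directly before it, ignoring case and extra spaces."""
    kept = []
    previous_key = None
    for line in lines:
        key = " ".join(line.split()).lower()
        if key and key == previous_key:
            continue  # same text as the line just kept, so skip it
        kept.append(line)
        previous_key = key
    return kept

print(drop_repeated_lines([
    "RUPTURE Use no more Metallic",
    "RUPTURE  use no more metallic",
    "Trusses! No More suffering",
]))
```

Near-duplicates with different OCR errors will slip through, so a read-through by eye is still needed.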
Train Schedules, Weather tables, Long Lists.
I often do not correct all the lines in train timetables, tide tables, and other things that use multiple columns across. The example above is how VERIDIAN will render them. The image below is how I will correct this example. I’ve extracted as much information as possible, in this case the texts across the top and bottom, and cleared out the gibberish on the remaining lines. A searcher could find the “METEOROLOGICAL OBSERVATIONS” or the man’s name if needed.
Train schedules might simply list the railroad or other identifying texts and leave out the actual schedule. On long lists of names that newspapers of record usually publish, you can slog through them or do part of them, or skip them altogether.
Use of do and … in long lists – When you see “do” in a long list, it is an abbreviation of “ditto” and is used along with a double quote to indicate repetition of a preceding entry in the list. Above, you can see the red underlined “do” showing sugar per pound.
I use an ellipsis (… ) to quickly show a column break, in this case between the description and the prices. It is normally pasted in and removes the need to replicate the longer series of dots used in the printed columns.
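The same substitution can be done in bulk on text you have copied out of the editor. This Python sketch is only an illustration of the idea; the pattern and the name collapse_leaders are my own assumptions, and it treats any run of three or more dots as a column leader.

```python
import re

# Three or more dots, optionally separated by spaces, plus any space before them
LEADER = re.compile(r"\s*(?:\.\s*){3,}")

def collapse_leaders(line):
    """Replace a run of leader dots with a single ellipsis character."""
    return LEADER.sub(" … ", line).strip()

print(collapse_leaders("Sugar, per lb . . . . . . 12c"))
print(collapse_leaders("Coffee......25c"))
```

Two dots in a row, such as an abbreviation followed by a period, are left alone.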
Classified Tags are often found at the last line of an ad. I often delete them from the texts as they are pretty much unsearchable and they are preserved in the images.
Classified Ads are often repeated from issue to issue in the same newspaper, and the first one corrected can serve as a template to copy from. I do this by opening another window of the browser and navigating to the corrected issue, then finding the classified, going into CORRECTING MODE, copying with Control+C, and pasting with Control+V from one window to the next. (One nice thing about having a PREMIUM account is the tracking of RECENT ACTIVITY.)
You can replicate errors in your corrected texts if you are not careful, and you can get confused about which window is the “corrected” and which is the “correctee”. Using cut and paste between browser windows is highly productive, however, and if applied carefully, can save many keystrokes.
In many court related notices and in some state decrees, an image of the official seal is used. For most cases, I simply remove the system’s attempt to translate these lines, as shown in the above images. If you feel the seal contains some relevant text for searching, by all means add that to the correction.
I also rewrite the opening headers as shown above. Again, the objective is to keep the text searchable.
Using Voice Recognition.
If I encounter a page of text that is very badly garbled by the software, I’ll turn on voice recognition and dictate the texts. This works well for speeches, sermons, laws and announcements, but not so well for shorter news and advertising copy. You have to be careful here too, as voice recognition will often transcribe your spoken words incorrectly, or you will lose the capitalization of the original text. For example, the two systems I tried had problems with THIRD vs 3rd and other numbers.
I believe correcting the scanned newspaper texts in CDNC is a good thing to do. It makes the texts of the newspapers available to search engines and internet users in general. It also affords you a chance to relive history. I have found some fascinating stories while correcting the collection; for example, here are some things I liked and added to this blog so I wouldn’t forget them.
Without corrections, as you will see below, I’d estimate up to 50% of the old newspapers, prior to about 1920, would be incomplete and thus unsearchable. Later newspapers used fonts more suitable to optical character recognition (OCR), and the papers themselves are newer and so in better shape for scanning.
I welcome comments here on this blog, but I insist that they be politely written or they will not be approved and shared. I also hope that others who share this correcting habit and have better ideas or methods will add them here; I will attempt to accommodate you as best I can. Quite possibly, with more input, we might generate a style guide of sorts.
For those who would like more background on the effort to digitize and publish newspaper archives, the CDNC site offers a PDF called Digitizing California’s Newspapers: A Guide and Best-Practices. Check your downloads area after you click this link.
And finally, the OCR process can produce some funny lines. It almost makes me want to leave them as is: