{"id":1780,"date":"2018-01-29T16:57:42","date_gmt":"2018-01-30T00:57:42","guid":{"rendered":"http:\/\/www.sergneri.net\/wordpress\/?p=1780"},"modified":"2018-01-29T16:57:42","modified_gmt":"2018-01-30T00:57:42","slug":"correcting-texts-for-the-cdnc-california-digital-newspaper-collection","status":"publish","type":"post","link":"https:\/\/www.sergneri.net\/wordpress\/index.php\/2018\/01\/29\/correcting-texts-for-the-cdnc-california-digital-newspaper-collection\/","title":{"rendered":"Correcting Texts for the CDNC &#8211; California  Digital Newspaper Collection"},"content":{"rendered":"<p>\t\t\t\t<strong>Introduction <\/strong><br \/>\nI&#8217;ve been correcting the texts in the newspaper articles on the CDNC for almost 4 years, I am now 12th in the Text Correctors Hall of Fame on CDNC with over 160,000 lines of text corrected. During that time, I&#8217;ve come up with some techniques I&#8217;d like to share. There are no real guidelines on how to correct these texts that I&#8217;ve seen, much of it is common sense but I have decided to write down my methods and share them. By correcting texts, I mean I (and others) are retyping in and correcting searchable texts for the online scans of old California newspapers.<\/p>\n<p><strong>How CDNC works:<\/strong><br \/>\nThe CDNC is based on an online software system called VERIDIAN that we will work with here. It is hosted by UC Riverside. 
Like any software system, it has its quirks but we work with it to get the best corrections we can.<\/p>\n<p><a href=\"https:\/\/cdnc.ucr.edu\/site\/about_us.html\" target=\"_blank\" rel=\"noopener noreferrer\">The CDNC ABOUT page is here, <\/a> read this first, good background info!<br \/>\n<a href=\"https:\/\/cdnc.ucr.edu\/cgi-bin\/cdnc\" target=\"_blank\" rel=\"noopener noreferrer\">You can find the collection here.<\/a><br \/>\n<a href=\"https:\/\/cdnc.ucr.edu\/cgi-bin\/cdnc?a=ur&amp;command=ShowRegisterNewUserPage&amp;opa=e%3d-------en--20--1--txt-txIN--------1&amp;e=-------en--20--1--txt-txIN--------1\" target=\"_blank\" rel=\"noopener noreferrer\">You can create an account here.<\/a><\/p>\n<p><em>Note: Sometimes in old newspapers you will read items that are, by today&#8217;s standards, racist or that disparage some group of people. In some cases the journalistic style is raw or brutal. I simply grit my teeth and get on with the correcting.<br \/>\nI don&#8217;t believe we should change or remove these texts, you should leave the offensive word or article as is. This is the value of historic texts, it shows us things that were common 100 years ago but are uncommon today. If you remove them from the text, those words still exist in the scanned images.<\/em><\/p>\n<p>For this tutorial, let&#8217;s work on a column of advertisements from the last page of the <em>Healdsburg Enterprise, February 21, 1878<\/em>. This is simply an example of <a href=\"https:\/\/cdnc.ucr.edu\/cgi-bin\/cdnc?a=d&amp;d=HE18780228.2.27&amp;e=-------en--20--1--txt-txIN--------1\" target=\"_blank\" rel=\"noopener noreferrer\">what I was working on this evening<\/a> (January 21, 2018).<\/p>\n<p><strong>To start correcting<\/strong>, you need to have an account and log in. Creating one is simple and costs nothing. You can then select the random issue presented to you at login or you can search for a specific issue and jump in. 
I&#8217;m currently working on the HEALDSBURG ENTERPRISE from 1878. It&#8217;s a local paper and the time period interests me.<\/p>\n<p>Once you select an issue, the system presents you with an image of a page of a newspaper on the right and on the left a smaller column with two tabs, <strong>ISSUE <\/strong>and <strong>ARTICLE<\/strong>. Let&#8217;s call that smaller box on the left side the <strong>CORRECTING BOX<\/strong>.<\/p>\n<div id=\"attachment_1852\" style=\"width: 710px\" class=\"wp-caption alignleft\"><a href=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/veridian-reader-mode-article-red-box.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1852\" class=\"size-large wp-image-1852\" src=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/veridian-reader-mode-article-red-box-1024x611.png\" alt=\"\" width=\"700\" height=\"418\" \/><\/a><p id=\"caption-attachment-1852\" class=\"wp-caption-text\">Veridian Article Mode \u2013 corrected texts \u2013 Click to enlarge.<\/p><\/div>\n<p>If the <strong>ARTICLE tab<\/strong> is selected in the CORRECTING BOX, you will see lines of text and on the right in the <strong>scanned image<\/strong> you will see a highlighted section of the newspaper.<\/p>\n<p>If <strong>ISSUE tab<\/strong> is selected, you will see a column of headers that correspond to the columns on the newspaper image. Selecting one of the headers will also highlight a column on the right and trigger a switch to the <strong>ARTICLE <\/strong>tab. (See How to Correct the ISSUE headings below.)<\/p>\n<p>The texts we correct are used by search engines and the system (Veridian) to find items in the old papers. 
Events, names, and other key data matter for searching, so careful correcting is important.<\/p>\n<div id=\"attachment_1873\" style=\"width: 743px\" class=\"wp-caption alignleft\"><a href=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Cor-Box-Tabs.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1873\" class=\"size-full wp-image-1873\" src=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Cor-Box-Tabs.png\" alt=\"\" width=\"733\" height=\"291\" \/><\/a><p id=\"caption-attachment-1873\" class=\"wp-caption-text\">Details &#8211; Correcting Box<\/p><\/div>\n<p>To begin, select <strong>CORRECT THIS TEXT<\/strong> from the CORRECTING BOX in the ARTICLE tab.<\/p>\n<div id=\"attachment_1874\" style=\"width: 710px\" class=\"wp-caption alignleft\"><a href=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Complete-paragraph-before-1.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1874\" class=\"size-large wp-image-1874\" src=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Complete-paragraph-before-1-1024x709.png\" alt=\"\" width=\"700\" height=\"485\" \/><\/a><p id=\"caption-attachment-1874\" class=\"wp-caption-text\">Before correcting<\/p><\/div>\n<p>There are now four buttons at the top of the correcting box:<br \/>\n<strong>SAVE<\/strong> will be grayed out until you make a correction to something in the correcting box.<br \/>\n<strong>SAVE&#038;EXIT<\/strong> is also gray until you make a correction; it saves your work and takes you back out of correcting mode.<br \/>\n<strong>CANCEL<\/strong> discards your changes.<br \/>\n<strong>NEXT<\/strong> is always available &#8211; it will take you to the next column available for correction. <\/p>\n<p>The <strong>RED BOX<\/strong> on the left in the correcting box is where you type corrections. 
The <strong>RED BOX <\/strong>on the right is where you are on the original page. In this case the red box on the right side does not highlight all the text on the line on the left, so we simply add in the missing text in the CORRECTING BOX.<\/p>\n<p><strong>Graphics<\/strong>:<br \/>\nAs you can see in the above image, there are also some <strong>graphics<\/strong> embedded in the original text (the right pointing finger, for example) that are transcribed as random characters, in this case &#8216;&amp;T&#8217;. When correcting, we will delete them.<\/p>\n<div id=\"attachment_1785\" style=\"width: 710px\" class=\"wp-caption alignleft\"><a href=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Complete-paragraph-corrected-1.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1785\" class=\"size-large wp-image-1785\" src=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Complete-paragraph-corrected-1-1024x873.png\" alt=\"\" width=\"700\" height=\"597\" \/><\/a><p id=\"caption-attachment-1785\" class=\"wp-caption-text\">Complete text corrected.<\/p><\/div>\n<p>The image above is a corrected version of this advertisement. I&#8217;ve added in the texts omitted by the software and corrected mis-transcribed texts.<\/p>\n<p>On line 5, the software pushed &#8216;HAIRPRODUC&#8217; into one word; correcting it to &#8216;HAIR PRODUCER&#8217; makes the line complete. On line 21, you can see the &#8216;&amp;T&#8217; is deleted and the line begins with the text immediately after the right pointing finger.<\/p>\n<p><strong>Fractions<\/strong>.<br \/>\nIn the case of Mrs. Moore&#8217;s address, 1008<strong>\u00bd<\/strong> Market Street, the scanner can&#8217;t easily do fractions, so I enter \u00bd by typing its Windows Alt code, <strong>ALT-171<\/strong>, on the numeric keypad. Most Alt codes are easily looked up and VERIDIAN seems to handle most of them well.<\/p>\n<p><strong>FONTS<\/strong> are not important in correcting. 
Fancy fonts sometimes trip up the software and your ability to transcribe them is much better than any machine&#8217;s. Simply stick to spelling and capitalization and ignore the rest.<\/p>\n<div id=\"attachment_1829\" style=\"width: 710px\" class=\"wp-caption alignleft\"><a href=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/CAPITALS.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1829\" class=\"size-large wp-image-1829\" src=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/CAPITALS-1024x250.png\" alt=\"\" width=\"700\" height=\"171\" \/><\/a><p id=\"caption-attachment-1829\" class=\"wp-caption-text\">Small CAPS &#8211; Bank of Healdsburg<\/p><\/div>\n<p><strong>Small CAPS<\/strong> &#8211; In cases where SMALL CAPS are used (&#8220;Bank of Healdsburg&#8221; above), I use normal capitalization (upper and lower case) in the corrected text.<\/p>\n<p><strong>Numbers <\/strong>should be reviewed carefully; often you will see $IOO for one hundred that should be $100, with capital I for 1 and capital letter O for zero. Your browser settings and the font you use will make this easy to see or easy to miss. In my case, I use the CHROME browser for correcting as it allows some automated spell checking on the highlighted texts and points out incorrect numbers in a consistent way.<\/p>\n<p>Depending on the font and page condition, the scanner and OCR system <strong>can confuse similar-looking letters<\/strong>. It has a hard time distinguishing <strong>&#8216;b&#8217;<\/strong> from <strong>&#8216;h&#8217;<\/strong>, <strong>&#8216;e&#8217;<\/strong> from <strong>&#8216;r&#8217;<\/strong>, and <strong>&#8216;3&#8217;<\/strong> from <strong>&#8216;8&#8217;<\/strong>, for example. More modern papers have fewer issues than the older ones with the serif fonts and poor quality. 
I&#8217;ve gone back to reread my corrections many times only to find &#8216;he&#8217; instead of &#8216;be&#8217; or &#8216;.l&#8217; vs &#8216;J&#8217;, and &#8216;I&#8217; vs &#8216;i&#8217;. You will find these throughout the texts and you should scan for them as you work.<\/p>\n<p><strong>Navigation<\/strong>.<br \/>\nI also use the <strong>mouse <\/strong>very little once I&#8217;m in correcting mode; I&#8217;ve trained myself to use <strong>TAB <\/strong>to move down a line, and <strong>SHIFT+TAB <\/strong>to move up a line. In VERIDIAN you work one line at a time.<br \/>\nUse <strong>Control+LEFT or RIGHT ARROW<\/strong> to move a word left or right.<br \/>\nWith <strong>Control+SHIFT+LEFT or RIGHT ARROW<\/strong>, you can highlight words for easy replacement or copying.<br \/>\n<strong>END <\/strong>and <strong>HOME <\/strong>will take you to the end or the start of the line, and <strong>SHIFT+END <\/strong>or <strong>SHIFT+HOME <\/strong>will select to the end or the start of the line.<br \/>\n<strong>Control+A <\/strong>will select all the text in the line.<br \/>\n<strong>Control+Z<\/strong> will undo your last change.<br \/>\nI don&#8217;t use a Mac so can&#8217;t tell you what to do on that keyboard.<\/p>\n<p><strong>Running Out of Room<\/strong> &#8211; One problem is when the correcting box editor runs out of lines and the article has more text to enter.<\/p>\n<div id=\"attachment_1803\" style=\"width: 710px\" class=\"wp-caption alignleft\"><a href=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Extra-Texts-at-end.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1803\" class=\"size-large wp-image-1803\" src=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Extra-Texts-at-end-1024x469.png\" alt=\"\" width=\"700\" height=\"321\" \/><\/a><p id=\"caption-attachment-1803\" class=\"wp-caption-text\">One line left for entry and more text at right.<\/p><\/div>\n<p>In this case, I continue to type 
the texts from the image into the last line, basically adding on until done. In this example, you can&#8217;t see all of the text in the last line in the CORRECTING BOX.  It is: <\/p>\n<ul>\n&#8220;F. J. SCHWAB, CUSTOM Boot and Shoe MAKER, North Side of Plaza, Healdsburg. HAVING OPENED A BOOT AND SHOE shop at the above location, I am prepared to make ANY STYLE OF BOOT OR SHOE TO ORDER. None but the Best of Materi-als Used and Perfect Fit Guaranteed. Give us a trial and satisfy yourself of the superiority of my work.&#8221;<\/ul>\n<p>That&#8217;s a long line! I tried to keep the capitalization and punctuation. I haven&#8217;t found a limit for adding in texts in this fashion, though it is possible there is one. If you do this, you should check your data after you save to make sure it worked as expected.<\/p>\n<p>If a lot of text is missing, you can go back up in the correcting box and begin to merge lines to free up space to add the texts. I try to keep these grouped together, merging the lines of a single advertisement, for example.<\/p>\n<p><strong>Hyphens<\/strong>. If a word is broken between two lines and the hyphen is the last character on the line, VERIDIAN will remove it when it generates the text view of the article. However, in the example above, the hyphen in &#8216;Materi-als&#8217; is not the last character in the line, so VERIDIAN does not eliminate it when it creates the text for that article. So, inline, don&#8217;t hyphenate.<\/p>\n<p>You can see the text for any article two ways: 1 &#8211; pick a column on the right\/image side of the screen, then right click and select TEXT OF THIS ARTICLE from the menu or 2 &#8211; return to viewing mode and the text for the last selected article will appear in the correcting box. Your corrections will appear after you save your work.<\/p>\n<p><strong>Damaged images.<\/strong> I&#8217;ve also been dealing with a number of pages of the HEALDSBURG ENTERPRISE that are damaged. 
Humans are much better suited to piecing this together than computers are so try your best, you can only enhance things.<\/p>\n<div id=\"attachment_1805\" style=\"width: 710px\" class=\"wp-caption alignleft\"><a href=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/damaged-texts-1.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1805\" class=\"size-large wp-image-1805\" src=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/damaged-texts-1-1024x206.png\" alt=\"\" width=\"700\" height=\"141\" \/><\/a><p id=\"caption-attachment-1805\" class=\"wp-caption-text\">Damaged Texts.<\/p><\/div>\n<p>In these cases, I use ellipses &#8216; &#8230; &#8216; to indicate texts that can&#8217;t be read and put any word guesses in parentheses. So, for this example, a line like<br \/>\n&#8220;(ers) of Berlin and Rome would prefer to &#8230; terms with the Pope,&#8221; shows how I handle this. In the next line I made an assumption that &#8220;Vatican&#8221; was correct though difficult to make out, but (the) was a guess.<\/p>\n<p><strong>Correcting the ISSUE headings. <\/strong><\/p>\n<div id=\"attachment_1790\" style=\"width: 710px\" class=\"wp-caption alignleft\"><a href=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/veridian-reader-mode-issue.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1790\" class=\"size-large wp-image-1790\" src=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/veridian-reader-mode-issue-1024x692.png\" alt=\"\" width=\"700\" height=\"473\" \/><\/a><p id=\"caption-attachment-1790\" class=\"wp-caption-text\">Issue column on the left.<\/p><\/div>\n<p>The texts here are also found in the ARTICLE view; they are in a separate box just below the ARTICLE tab. There is an EDIT command next to the texts on both tabs. Often, this needs correcting too and sometimes VERIDIAN gets them completely wrong. 
In the CORRECTING BOX, under either tab, select [EDIT] to make the change; this is the only place you can access them.<\/p>\n<p><strong>Vertical texts.<\/strong><\/p>\n<div id=\"attachment_1811\" style=\"width: 710px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Vertical-Adv.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1811\" class=\"wp-image-1811 size-large\" src=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Vertical-Adv-1024x500.png\" alt=\"\" width=\"700\" height=\"342\" \/><\/a><p id=\"caption-attachment-1811\" class=\"wp-caption-text\">Click image to enlarge.<\/p><\/div>\n<p>Texts like this can be a real puzzle. In this case, I simply tried to use the space allocated to type in as much text as makes sense. Thus, if someone were searching for JULIUS KING, they would find this.<\/p>\n<div id=\"attachment_1812\" style=\"width: 710px\" class=\"wp-caption alignleft\"><a href=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Vertical-Adv-corrected.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1812\" class=\"size-large wp-image-1812\" src=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Vertical-Adv-corrected-931x1024.png\" alt=\"\" width=\"700\" height=\"770\" \/><\/a><p id=\"caption-attachment-1812\" class=\"wp-caption-text\">Click to enlarge.<\/p><\/div>\n<p>As mentioned above, here is the case where I had to squeeze two ads into two lines as the editor had no more lines to use yet there was more to be added. 
In this case I pushed the entire ad for LANNAN &amp; DEMPSEY onto three lines, leaving me one line to add the entire text for ARTISTIC PHOTOGRAPHS.<\/p>\n<p><strong>Overlapping Lines.<\/strong><\/p>\n<div id=\"attachment_1818\" style=\"width: 710px\" class=\"wp-caption alignleft\"><a href=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Overlapping-Lines.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1818\" class=\"size-large wp-image-1818\" src=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Overlapping-Lines-1024x538.png\" alt=\"\" width=\"700\" height=\"368\" \/><\/a><p id=\"caption-attachment-1818\" class=\"wp-caption-text\">Overlapping Lines. Click to enlarge<\/p><\/div>\n<p>When I encounter this &#8212; the selection box on the right overlaps two (or more) lines &#8212; I simply make room for all of the text and add it as a single line. This image is after I made the corrections.<br \/>\nAlso, the text below it will read &#8216;RUPTURE Use no more Metallic&#8217; and the next line &#8216;Trusses! No More suffer-ing from Iron Hoops or Steel Springs! ROWE&#8217;S&#8217;<\/p>\n<p>Sometimes, when you encounter overlapping lines, you will see one or more lines repeated. I always go back and review what I&#8217;ve corrected and try to adjust this in a way that makes sense, but I always try to eliminate the duplicated lines and keep the flow of the texts in order. 
<\/p>\n<p><strong>Train Schedules, Weather tables, Long Lists.<\/strong><\/p>\n<div id=\"attachment_1859\" style=\"width: 710px\" class=\"wp-caption alignleft\"><a href=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Weather-Report.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1859\" class=\"size-large wp-image-1859\" src=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Weather-Report-1024x941.png\" alt=\"\" width=\"700\" height=\"643\" \/><\/a><p id=\"caption-attachment-1859\" class=\"wp-caption-text\">Multi-column tables.<\/p><\/div>\n<p>I often do not correct all the lines in train timetables, tide tables, and other things that\u00a0 use multiple columns across. The example above is how VERIDIAN will render them. The image below is how I will correct this example. I&#8217;ve extracted as much information as possible, in this case the texts across the top and bottom, and cleared out the gibberish on the remaining lines. A searcher could find the &#8220;METEOROLOGICAL OBSERVATIONS&#8221; or the man&#8217;s name if needed.<\/p>\n<div id=\"attachment_1860\" style=\"width: 710px\" class=\"wp-caption alignleft\"><a href=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Weather-Report-Corrected.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1860\" class=\"size-large wp-image-1860\" src=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Weather-Report-Corrected-1024x894.png\" alt=\"\" width=\"700\" height=\"611\" \/><\/a><p id=\"caption-attachment-1860\" class=\"wp-caption-text\">Multi-column text corrected.<\/p><\/div>\n<p>Train schedules might simply list the railroad or other identifying texts and leave out the actual schedule. 
On long lists of names that newspapers of record usually publish, you can slog through them or do part of them, or skip them altogether.<\/p>\n<div id=\"attachment_1861\" style=\"width: 710px\" class=\"wp-caption alignleft\"><a href=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Rail-Time-Table.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1861\" class=\"size-large wp-image-1861\" src=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Rail-Time-Table-1024x848.png\" alt=\"\" width=\"700\" height=\"580\" \/><\/a><p id=\"caption-attachment-1861\" class=\"wp-caption-text\">Railroad Time Table Corrected.<\/p><\/div>\n<p><div id=\"attachment_1895\" style=\"width: 710px\" class=\"wp-caption alignleft\"><a href=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Commodity-List.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1895\" src=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Commodity-List-1024x290.png\" alt=\"\" width=\"700\" height=\"198\" class=\"size-large wp-image-1895\" \/><\/a><p id=\"caption-attachment-1895\" class=\"wp-caption-text\">Commodity List 1851<\/p><\/div><br \/>\n<strong>Use of do and &#8230; in long lists<\/strong> &#8211; When you see &#8220;do&#8221; in a long list, it is an abbreviation of &#8220;ditto&#8221; and is used along with a double quote to indicate repetition of a preceding entry in the list. Above, you can see the red underlined &#8220;do&#8221; showing sugar per pound.<br \/>\nI use an ellipsis (&#8230; ) to quickly show a column break, in this case between the description and the prices. It is normally pasted in and removes the need to replicate the longer series of dots used in the printed columns.<\/p>\n<p><strong>Classified Tags<\/strong> are often found at the last line of an ad. 
I often delete them from the texts as they are pretty much unsearchable and they are preserved in the images.<\/p>\n<div id=\"attachment_1824\" style=\"width: 710px\" class=\"wp-caption alignleft\"><a href=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/classified-tag.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1824\" class=\"size-large wp-image-1824\" src=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/classified-tag-1024x132.png\" alt=\"\" width=\"700\" height=\"90\" \/><\/a><p id=\"caption-attachment-1824\" class=\"wp-caption-text\">Tags in Classified Ads.<\/p><\/div>\n<p><strong>Classified Ads<\/strong> are often repeated from issue to issue in the same newspaper and the first one corrected can serve as a template to copy from. I do this by opening another window of the browser and navigating to the corrected issue, then finding the classified, going into CORRECTING MODE, and copying with <strong>CTRL-C<\/strong> and pasting with <strong>CTRL-V<\/strong> from one window to the next. (One nice thing about having a PREMIUM account is the tracking of RECENT ACTIVITY.)<br \/>\nYou can replicate errors in your corrected texts if you are not careful and you can get confused about which window is the &#8220;corrected&#8221; and which is the &#8220;<em>correctee<\/em>&#8221;. 
Using cut and paste between browser windows is highly productive, however, and if applied carefully, can save many key strokes.<\/p>\n<p><strong>Official Seals.<\/strong><\/p>\n<div id=\"attachment_1890\" style=\"width: 710px\" class=\"wp-caption alignleft\"><a href=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/02\/probate-seal-before.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1890\" src=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/02\/probate-seal-before-1024x191.png\" alt=\"\" width=\"700\" height=\"131\" class=\"size-large wp-image-1890\" \/><\/a><p id=\"caption-attachment-1890\" class=\"wp-caption-text\">Probate Seal Uncorrected &#8211; Click to Enlarge<\/p><\/div>\n<p><div id=\"attachment_1889\" style=\"width: 710px\" class=\"wp-caption alignleft\"><a href=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/02\/probate-seal-corrected.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1889\" src=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/02\/probate-seal-corrected-1024x153.png\" alt=\"\" width=\"700\" height=\"105\" class=\"size-large wp-image-1889\" \/><\/a><p id=\"caption-attachment-1889\" class=\"wp-caption-text\">Probate Seal Corrected &#8211; Click to Enlarge<\/p><\/div><br \/>\n In many court related notices and in some state decrees, an image of the official seal is used.  For most cases, I simply remove the system&#8217;s attempt to translate these lines, as shown in the above images. 
If you feel the seal contains some relevant text for searching, by all means add that to the correction.<\/p>\n<div id=\"attachment_1893\" style=\"width: 310px\" class=\"wp-caption alignleft\"><a href=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Court-Header-Corrected.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1893\" src=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Court-Header-Corrected-300x47.png\" alt=\"\" width=\"300\" height=\"47\" class=\"size-medium wp-image-1893\" \/><\/a><p id=\"caption-attachment-1893\" class=\"wp-caption-text\">Court Header Corrected &#8211; Click to Enlarge<\/p><\/div>\n<p>I also rewrite the opening headers as shown above; again, the objective is to keep the text searchable.<\/p>\n<p><strong>Using Voice Recognition.<\/strong><br \/>\nIf I encounter a page of text that is very badly garbled by the software, I&#8217;ll turn on <strong>voice recognition<\/strong> and dictate the texts. This works well for speeches, sermons, laws and announcements, but not so well for shorter news and advertising copy. You have to be careful here too, as voice recognition will often transcribe your spoken words incorrectly or lose the capitalization of the original text. For example, the two systems I tried had problems with THIRD vs 3rd and other numbers.<\/p>\n<p><strong>Summary<\/strong><br \/>\nI believe correcting the scanned newspaper texts in CDNC is a good thing to do; it makes the texts of the newspapers available to search engines and internet users in general. It also affords you a chance to relive history. 
I have found some fascinating stories correcting the collection, for example <a href=\"https:\/\/www.sergneri.net\/wordpress\/?cat=10\" target=\"_blank\" rel=\"noopener noreferrer\">here are some things I liked <\/a> and added to this blog so I wouldn&#8217;t forget them.<\/p>\n<p>Without corrections, as you will see below, I&#8217;d estimate up to 50% of the old newspapers, prior to about 1920, would be incomplete and thus unsearchable. Later newspapers used fonts more suitable to optical character recognition (OCR) and, being newer, they are in better shape for scanning.<\/p>\n<p>I welcome comments here on this blog, but I insist that they be politely written or they will not be approved and shared. I also hope that others who share this correcting habit and have better ideas or methods will add them here; I will attempt to accommodate you as best I can. Quite possibly, with more input, we might generate a style guide of sorts.<\/p>\n<p>For those who would like more background on the effort to digitize and publish newspaper archives, the CDNC site offers a PDF called <a href=\"https:\/\/cdnc.ucr.edu\/site\/files\/BestPracticesforCaliforniaNewspaperDigitization.pdf\">Digitizing California\u2019s Newspapers: A Guide and Best-Practices<\/a>. 
Check your downloads area after you click this link.<\/p>\n<p>And finally, the OCR process can produce some funny lines, it almost makes me want to leave them as is:<\/p>\n<p><a href=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Hats-and-Rats.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.sergneri.net\/wordpress\/wp-content\/uploads\/2018\/01\/Hats-and-Rats-1024x269.png\" alt=\"\" width=\"700\" height=\"184\" class=\"alignleft size-large wp-image-1922\" \/><\/a> \t\t<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction I&#8217;ve been correcting the texts in the newspaper articles on the CDNC for almost 4 years, I am now 12th in the Text Correctors Hall of Fame on CDNC with over 160,000 lines of text corrected. During that time, I&#8217;ve come up with some techniques I&#8217;d like to share. There are no real guidelines &#8230; <span class=\"more\"><a class=\"more-link\" href=\"https:\/\/www.sergneri.net\/wordpress\/index.php\/2018\/01\/29\/correcting-texts-for-the-cdnc-california-digital-newspaper-collection\/\">[Read 
more&#8230;]<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2,3,17],"tags":[328,379,490,590,1639,1716,1734],"class_list":{"0":"entry","1":"post","2":"publish","3":"author-sergneri","4":"has-excerpt","5":"post-1780","7":"format-standard","8":"category-california-history","9":"category-california-newspaper-archive","10":"category-thinking-about","11":"post_tag-california-digital-newspaper-collection","12":"post_tag-cdnc","13":"post_tag-correcting-texts","14":"post_tag-editing","15":"post_tag-style-guide","16":"post_tag-tips","17":"post_tag-tricks"},"_links":{"self":[{"href":"https:\/\/www.sergneri.net\/wordpress\/index.php\/wp-json\/wp\/v2\/posts\/1780","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.sergneri.net\/wordpress\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.sergneri.net\/wordpress\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.sergneri.net\/wordpress\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.sergneri.net\/wordpress\/index.php\/wp-json\/wp\/v2\/comments?post=1780"}],"version-history":[{"count":0,"href":"https:\/\/www.sergneri.net\/wordpress\/index.php\/wp-json\/wp\/v2\/posts\/1780\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.sergneri.net\/wordpress\/index.php\/wp-json\/wp\/v2\/media?parent=1780"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.sergneri.net\/wordpress\/index.php\/wp-json\/wp\/v2\/categories?post=1780"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.sergneri.net\/wordpress\/index.php\/wp-json\/wp\/v2\/tags?post=1780"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}