OCR (optical character recognition) is a technique for converting images of text into machine-readable text, so that it can be searched by word or phrase. The process attempts to replicate what a human does when reading: both the optical system and the brain's work of interpreting the shapes as letters and then words. It generally works by first breaking an image down into lines and then into individual cells, each containing a single character, and then attempting to match each cell image to a letter, number or punctuation mark.
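To make the "lines, then cells, then match" idea concrete, here is a toy sketch in Python. It is only an illustration under heavy assumptions: the image is a tiny black-and-white grid written as text ('#' for ink, '.' for blank), the three glyph templates are invented for the example, and real OCR engines use far more sophisticated segmentation and matching than this.

```python
# Toy OCR sketch: segment a binary image into lines and cells, then match
# each cell against known glyph templates. All names, glyphs and the sample
# image are hypothetical and exist only for illustration.

TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}

def split_runs(flags):
    """Return (start, end) index pairs for each unbroken run of True values."""
    runs, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(flags)))
    return runs

def segment(image):
    """Break the image into text lines, then each line into character cells."""
    lines = []
    row_has_ink = ["#" in row for row in image]
    for top, bottom in split_runs(row_has_ink):
        rows = image[top:bottom]
        col_has_ink = [any(row[x] == "#" for row in rows)
                       for x in range(len(rows[0]))]
        cells = [[row[left:right] for row in rows]
                 for left, right in split_runs(col_has_ink)]
        lines.append(cells)
    return lines

def score(cell, template):
    """Fraction of pixels that agree; 0.0 if the two shapes differ in size."""
    if len(cell) != len(template) or len(cell[0]) != len(template[0]):
        return 0.0
    total = len(cell) * len(cell[0])
    same = sum(c == t
               for cell_row, tmpl_row in zip(cell, template)
               for c, t in zip(cell_row, tmpl_row))
    return same / total

def recognise(image):
    """Return one string per text line, picking the best template per cell."""
    return ["".join(max(TEMPLATES, key=lambda ch: score(cell, TEMPLATES[ch]))
                    for cell in cells)
            for cells in segment(image)]

# Two lines of "print", separated by a blank row.
IMAGE = [
    "...........",
    "#...###.###",
    "#....#...#.",
    "#....#...#.",
    "#....#...#.",
    "###.###..#.",
    "...........",
    "###.###.#..",
    ".#...#..#..",
    ".#...#..#..",
    ".#...#..#..",
    ".#..###.###",
]

print(recognise(IMAGE))  # ['LIT', 'TIL']
```

Note how fragile this is: it relies on blank rows between lines and blank columns between characters, and on every cell being exactly the same size as its template. A smudge that joins two letters, or a change of font, breaks it immediately.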
The process works well with standard print fonts and good-quality images, but does not always cope well with fonts that change mid-document, poor-quality images or varying layouts (such as a newspaper with multiple columns).
An everyday example of machine reading that generally works well is the line of code at the bottom of your cheques and lodgement slips, printed in 1970s-style computer lettering. This is read automatically by a small scanner on the clerk's desk in the bank and saves retyping your account details. (Strictly speaking, that line is printed in magnetic ink and the system is known as MICR, a close cousin of OCR.) The style of text was deliberately chosen as being the easiest for the technology available at the time to read, and as a result it is very reliable.
Reliable OCR requires considerable 'intelligence', and both it and speech recognition are considered challenging problems in computing. The technology is improving all the time but will probably never quite match the skill of a human eye and brain at deciphering difficult-to-read images, because of the background knowledge and context a person can apply to the problem.
Edit: In smaller projects (e.g. a document of several pages, or a few pages from a book) an initial proof conversion is provided to the user to cross-check. Any sections which the OCR process considered suspect are highlighted for the operator to examine and correct if necessary. In addition, the usual spell-checking tools may be used to highlight any other issues. Due to the large scale of the Irish Times project and the labour-intensive nature of the work, I suspect that this Quality Assurance (QA) stage was probably limited by budget.
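As a rough illustration of how suspect sections might be highlighted for the operator, here is a minimal Python sketch. It assumes the OCR step hands back a confidence score with each word; the sample words, the scores, the 0.80 threshold and the [[ ]] markers are all made up for the example and do not come from any particular OCR package.

```python
# Hypothetical OCR output: (word, confidence) pairs. The scores and the
# default threshold are invented for illustration only.
def mark_suspect(words, threshold=0.80):
    """Wrap any word whose confidence is below the threshold in [[ ]] markers."""
    return " ".join(word if conf >= threshold else f"[[{word}]]"
                    for word, conf in words)

sample = [("The", 0.99), ("lrish", 0.41), ("Times", 0.97), ("arch1ve", 0.55)]
proof = mark_suspect(sample)
print(proof)  # The [[lrish]] Times [[arch1ve]]
```

The operator then only has to look at the marked words rather than re-read the whole page, which is where the labour saving comes from.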
Shane