ProLector

ProLector provides a variety of features designed to read even poor-quality documents (e.g. automatic splitting of ligatures and automatic joining of broken letters).

See a page of Gaelic-language text from ‘Cúl le muir agus Scéalta eile’, part of the Foclóir na Nua-Ghaeilge, Corpas na Gaeilge project. The page is digitised and output as a high-resolution (1200 dpi, dots per inch) uncompressed TIFF. The image is then cleaned and cropped so that it can be processed by our software engine, ProLector.

Pic of ProLector Book   Page in Book of ProLector Book

 ProLector - FontBase

 

This screenshot shows the page image, the fontbase pattern system (at 1,009 patterns) and the text output, with 7 pages read containing 11,379 characters.

A fontbase is created using the Centre’s Unicode Character Index, capturing patterns to enable conversion into the required symbols.  The fontbase is continually re-assessed to ensure no error has crept in.

 ProLector - Training Page

 

This screenshot shows ProLector in Training mode, with the analyst entering patterns into the fontbase. This pattern is a Gaelic g.

As you can see, ProLector has many functions: the analyst can set error rates, change font styles, remove dirt, etc.

 ProLector - Interactive Page  

This screenshot shows ProLector in Interactive (manual) mode, with the analyst capturing patterns not picked up by the designed fontbase. This pattern is a Gaelic Bh.

 ProLector - Interactive Page - aacute  

Another screenshot shows ProLector in manual mode, with the analyst capturing patterns not picked up by the designed fontbase. This pattern is the Unicode character á (aacute, U+00E1).

 ProLector - Raw Text   

This screenshot shows the raw capture in text format. The Centre has created its own way of encoding Unicode characters in the raw text: for example, &aa. stands for the Unicode character á (aacute, U+00E1).
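A minimal sketch of how such codes could be turned back into Unicode is given here in Python. Only the `&aa.` code comes from the text above; the `&ee.` entry, the `CODE_TABLE` and `decode` names, and the assumption that every code has the form `&`, lowercase letters, `.` are hypothetical and for illustration only.

```python
import re

# Code table: only "aa" is taken from the page; "ee" is a hypothetical entry.
CODE_TABLE = {
    "aa": "\u00e1",  # á
    "ee": "\u00e9",  # é (hypothetical)
}

# Assumed code shape: ampersand, one or more lowercase letters, full stop.
CODE_RE = re.compile(r"&([a-z]+)\.")


def decode(raw: str) -> str:
    """Replace known codes with their Unicode character; leave others as-is."""
    def replace(match: re.Match) -> str:
        return CODE_TABLE.get(match.group(1), match.group(0))
    return CODE_RE.sub(replace, raw)


print(decode("sl&aa.n"))  # prints "slán"
```

Unknown codes are left untouched in this sketch so that they remain visible for later checking rather than being silently dropped.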

 UTF8 Raw Text  

This screen shows the full capture of multiple pages in raw text format. 

 Visual Basic for Applications  

This screenshot shows a glimpse of the VBA (Visual Basic for Applications) code. The Centre creates a bespoke macro to convert the raw output into a Word file, which gives a representation of the printed page.

 MsWord  

This screenshot shows the converted text, with a colour-coded highlighting system to assist the analyst with quality control.

 QA - Quality Assurance   

This screen shows how the analyst post-processes both image and output, correcting where necessary.

Final formats will be delivered to the agreed service level specifications.

This particular project is with the Royal Irish Academy.

OCR (Optical Character Recognition)