OCR stands for Optical Character Recognition. With it, we can extract text and layout information from image files such as MDI and TIFF. When you scan a paper page into a computer, the result is just an image file, a photo of the page. The computer cannot understand the letters on the page, so you cannot search or edit the text. You use OCR software to convert the image into a text or word processor file so that you can do those things.
OCR can be performed with the Microsoft Office Document Imaging (MODI) object model. To do so, we need to use the MODI library in a development project, so let us first look at what the MODI object model is.
The MODI object model consists of the following objects:
Document object: represents an ordered collection of pages (images).
Image object: represents a single page of a document.
Layout object: represents the results of optical character recognition (OCR) on a page.
MiDocSearch object: exposes document search functionality.
Viewer control: an ActiveX control that displays the pages of a document.
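To make the relationship between these objects more concrete, here is a minimal VB.NET sketch that walks from a Document to its Image pages, and from each page to its Layout and the words it recognized. It assumes the MODI type library is referenced and that OCR has already been run on the document. The name DumpWords is made up for illustration and is not part of MODI.

```vbnet
' Hypothetical helper (not part of MODI): prints every recognized word, page by page.
' Assumes doc.OCR(...) has already completed successfully on this document.
Sub DumpWords(ByVal doc As MODI.Document)
    For i As Integer = 0 To doc.Images.Count - 1
        ' Image object: a single page of the document
        Dim pageImage As MODI.Image = CType(doc.Images(i), MODI.Image)
        ' Layout object: the OCR results for that page
        Dim pageLayout As MODI.Layout = pageImage.Layout
        Console.WriteLine("Page " & (i + 1) & ": " & pageLayout.NumWords & " words")
        For k As Integer = 0 To pageLayout.NumWords - 1
            Dim w As MODI.Word = CType(pageLayout.Words(k), MODI.Word)
            Console.WriteLine("  " & w.Text)
        Next
    Next
End Sub
```

The explicit CType conversions are there so the sketch should also compile with Option Strict On; with Option Strict Off they can be dropped.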
Example of extracting text from a TIFF file:
Function CreateOCRText() As String
    Dim strWordInfo As String = ""
    Dim docs As New MODI.Document
    docs.Create("C:\test.tif")
    If Analyse(docs) Then
        ' Append the recognized text of every page (image) in the document
        For j As Integer = 0 To docs.Images.Count - 1
            strWordInfo = strWordInfo & " " & docs.Images(j).Layout.Text
        Next
        ' Double any single quotes in the recognized text
        strWordInfo = strWordInfo.Replace("'", "''")
    End If
    ' Release the file (False = do not save changes)
    docs.Close(False)
    Return strWordInfo
End Function

' Runs OCR on the document; returns True on success, False otherwise
Function Analyse(ByVal Doc As MODI.Document) As Boolean
    If Doc Is Nothing Then
        Return False
    End If
    Try
        ' The MODI call for OCR:
        ' Doc.OCR(language, withAutoRotation, withStraightenImage)
        Doc.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, True, True)
        Return True
    Catch ex As Exception
        ' OCR failed, for example because no text was recognized
        Return False
    End Try
End Function
Note: The most important prerequisite for all of these tasks is to add a reference to the "Microsoft Office Document Imaging Type Library" to your project:
Microsoft Office 2003: add "Microsoft Office Document Imaging 11.0 Type Library"
Microsoft Office 2007: add "Microsoft Office Document Imaging 12.0 Type Library"