OCR through MODI: extracting text from an image file

OCR stands for Optical Character Recognition. It extracts text and layout information from image files such as those in MDI and TIFF format. When you scan a paper page into a computer, the result is just an image file, a photo of the page. The computer cannot read the letters on the page, so you use OCR software to convert the image into a text or word processor file that you can search and edit.

OCR can be performed with the Microsoft Office Document Imaging (MODI) object model. To use it, we need to reference the MODI library in the development project. So first, let us look at what the MODI object model is.

 

The MODI object model consists of the following objects:

    Document object: represents an ordered collection of pages (images).

    Image object: represents a single page of a document.

    Layout object: represents the results of optical character recognition (OCR) on a page.

    MiDocSearch object: exposes document search functionality.

    Viewer control: an ActiveX control that displays the pages of a document.

Example (VB.NET) of extracting text from a TIFF file:

  
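Function CreateOCRText() As String
        Dim strWordInfo As String = ""
        Dim docs As New MODI.Document
        docs.Create("C:\test.tif")
        ' Analyse reports whether OCR succeeded
        Dim success As Boolean = CBool(Analyse(docs))
        If success Then
            ' Concatenate the recognized text of every page
            For j As Integer = 0 To docs.Images.Count - 1
                Dim page As MODI.Image = DirectCast(docs.Images(j), MODI.Image)
                strWordInfo = strWordInfo & " " & page.Layout.Text
            Next
            ' Double any single quotes (presumably so the text can go into a SQL string literal)
            strWordInfo = strWordInfo.Replace("'", "''")
        End If
        ' Release the image file without saving changes
        docs.Close(False)
        Return strWordInfo
End Function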
 
Function Analyse(ByVal Doc As MODI.Document) As Boolean
        If Doc Is Nothing Then
            Return False
        End If
        Try
            ' The MODI call for OCR:
            ' Document.OCR(language, auto-rotate image, straighten image)
            Doc.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, True, True)
            Return True
        Catch ex As Exception
            ' OCR failed, or it ran but no text was recognized
            Return False
        End Try
End Function
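
The Layout object holds more than the page text. As a rough sketch only, the recognized words of each page could be listed one by one as shown here. The helper name ListWords is hypothetical, and the sketch assumes that the MODI interop exposes a Words collection on Layout and Text and RecognitionConfidence properties on each word, so check these against the type library you actually reference.

```vbnet
Imports System

Module WordListingSketch
    ' Hypothetical helper: prints every recognized word on every page.
    Sub ListWords(ByVal path As String)
        Dim doc As New MODI.Document
        doc.Create(path)
        Try
            doc.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, True, True)
            For i As Integer = 0 To doc.Images.Count - 1
                Dim page As MODI.Image = DirectCast(doc.Images(i), MODI.Image)
                ' Assumption: Layout.Words enumerates MODI.Word objects
                For Each w As MODI.Word In page.Layout.Words
                    Console.WriteLine("page {0}: {1} (confidence {2})", i + 1, w.Text, w.RecognitionConfidence)
                Next
            Next
        Catch ex As Exception
            ' MODI reports OCR problems as exceptions
            Console.WriteLine("OCR failed: " & ex.Message)
        Finally
            ' Release the image file without saving changes
            doc.Close(False)
        End Try
    End Sub

    Sub Main()
        ListWords("C:\test.tif") ' same sample path as the example above
    End Sub
End Module
```

Working word by word makes it possible to skip or flag words with a low confidence value instead of accepting the whole page text as it is.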

Note: The most important prerequisite for all of these tasks is to add a reference to the "Microsoft Office Document Imaging Type Library" to the project:

 Microsoft Office 2003: add "Microsoft Office Document Imaging 11.0 Type Library"
 Microsoft Office 2007: add "Microsoft Office Document Imaging 12.0 Type Library"

  • Paul

    Have you tried SmartOCR yet? It is a new software application which offers over 99.8 percent accuracy and has a very nice interface. http://smartocr.com