
Optical Character Recognition with C# in Classic Desktop Applications – Part #1, using Tesseract

Recently I’ve become interested in optical character recognition (OCR) – I’ve discussed this with some peers and their default reaction is that the software necessary to do this is very expensive. Certainly, there are commercial packages available to carry out this function, but I wanted to find out whether there were any lower-cost options which I could use in a .NET project.

After some investigation, I found three options:

  • Tesseract – a library with a .NET wrapper;
  • Windows.Media.Ocr – a library available for Windows Store Apps;
  • Project Oxford – OCR as a Service, a commercial product supplied by Microsoft which allows 5,000 transactions per month for free.

In this post, I’ll demonstrate how to use Tesseract – in two future posts, I’ll use the Windows.Media.Ocr library and Project Oxford to carry out OCR.

Tesseract – an OCR library with a .NET wrapper

Tesseract is an OCR library available for several operating systems, licensed under the Apache License 2.0. I’ll look at getting this working in C# under Windows.

In order to compare these three options, I needed a single baseline – an image with some text. I decided to take a screenshot of my previous blog post (sample_for_reading.png).

This image seemed useful because:

  1. The font face isn’t particularly unusual, so should be a reasonable test for automated character recognition.
  2. There are a few different font sizes, so I’ll be interested to see how the software copes with this.
  3. There are different font colours – the introduction at the top of the page is in a light grey font, so should be quite challenging for the software to read.

As usual, I’m providing simple code which just gets text from an image – this isn’t meant to be an example of SOLID code, or best practices.

Tesseract is quite simple to set up and use – these instructions were heavily influenced by content from Charles Weld’s GitHub site. I’ve tried not to copy things verbatim – this is a description of what I needed to do to get things working.

1. First open Visual Studio and create a new C# Console application named “TesseractSampleApplication”.

2. Next, open the Package Manager Console and install the Tesseract NuGet package using the command below:

Install-Package Tesseract 

This will add the necessary binary library to the project – Tesseract.dll. Also, there’ll be two folders added to the project, named “x86” and “x64”, containing other binaries.

3. You now need to add the English language files – these need to be in a project folder named “tessdata”. You can get these English language files from this location. The folder name can’t be changed or you’ll get an error.
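Because a missing or misnamed folder is an easy mistake to make at this step, here is a minimal sketch of a check you could run before creating the engine. It uses only System.IO. The class and method names (TessdataCheck, FindProblem) are my own for illustration, and it assumes the English data file uses the usual Tesseract name, eng.traineddata.

```csharp
using System;
using System.IO;

static class TessdataCheck
{
    // Returns null when <baseDirectory>\tessdata\<language>.traineddata exists,
    // otherwise a short message describing what is missing.
    // Hypothetical helper: not part of the Tesseract package.
    public static string FindProblem(string baseDirectory, string language)
    {
        var dataFolder = Path.Combine(baseDirectory, "tessdata");
        if (!Directory.Exists(dataFolder))
            return "Folder not found: " + dataFolder;

        // Assumption: language data files are named "<language>.traineddata".
        var languageFile = Path.Combine(dataFolder, language + ".traineddata");
        if (!File.Exists(languageFile))
            return "Language file not found: " + languageFile;

        return null;
    }

    static void Main()
    {
        // A relative path such as ".\tessdata" is resolved against the
        // current working directory, so that is what gets checked here.
        var problem = FindProblem(Directory.GetCurrentDirectory(), "eng");
        Console.WriteLine(problem ?? "tessdata looks OK");
    }
}
```

When running from Visual Studio the working directory is normally the build output folder rather than the project folder, so the tessdata files probably also need “Copy to Output Directory” set in their properties.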

4. As an optional step you can add configuration to the App.config file, which enables verbose logging. This helps a lot when things go wrong, and I got this code from this location.

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
    <startup>
        <supportedRuntime version="v4.0" sku=".NETFramework,Version=v4.6" />
    </startup>
  <system.diagnostics>
    <sources>
      <source name="Tesseract" switchValue="Verbose">
        <listeners>
          <clear />
          <add name="console" />
          <!-- Uncomment to log to file
                <add name="file" />
                -->
        </listeners>
      </source>
    </sources>
    <sharedListeners>
      <add name="console" type="System.Diagnostics.ConsoleTraceListener" />
 
      <!-- Uncomment to log to file
        <add name="file"
           type="System.Diagnostics.TextWriterTraceListener"
           initializeData="c:\log\tesseract.log" />
        -->
    </sharedListeners>
  </system.diagnostics>
</configuration>
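To show what the switchValue setting controls, here is a small stand-alone sketch of the same System.Diagnostics mechanism set up in code rather than in App.config. It does not touch the Tesseract library's own logging. The source name “DemoSource” and the class TraceLevelDemo are made up for this illustration, and the listener writes to a string instead of the console so the result is easy to inspect.

```csharp
#define TRACE // TraceSource.TraceEvent calls are compiled out unless TRACE is defined

using System;
using System.Diagnostics;
using System.IO;

static class TraceLevelDemo
{
    // Sends one Verbose and one Warning event through a trace source
    // configured at the given level, and returns whatever reached the listener.
    public static string Capture(SourceLevels level)
    {
        var output = new StringWriter();

        // "DemoSource" is a hypothetical name used only in this sketch.
        var source = new TraceSource("DemoSource", level);
        source.Listeners.Clear(); // same idea as <clear /> in the config above
        source.Listeners.Add(new TextWriterTraceListener(output));

        source.TraceEvent(TraceEventType.Verbose, 0, "verbose detail");
        source.TraceEvent(TraceEventType.Warning, 0, "warning message");
        source.Flush();

        return output.ToString();
    }

    static void Main()
    {
        Console.WriteLine("Level = Verbose:");
        Console.Write(Capture(SourceLevels.Verbose));

        Console.WriteLine("Level = Warning:");
        Console.Write(Capture(SourceLevels.Warning));
    }
}
```

At the Verbose level both events reach the listener; at the Warning level the verbose one is filtered out, which is why the config above produces so much output.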

5. Finally, the C# code – this very simple application just looks at the image I show above, and interprets text from it.

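namespace TesseractSampleApplication
{
    using System;
    using Tesseract;

    class Program
    {
        // Language code; Tesseract looks for the matching data file in the tessdata folder.
        const string EnglishLanguage = "eng";

        static void Main(string[] args)
        {
            // Path to the screenshot described above; change this to point at your own image.
            var blogPostImage = @"C:\Users\jeremy\Desktop\sample_for_reading.png";

            // The engine, the loaded image and the processed page are all disposable.
            using (var ocrEngine = new TesseractEngine(@".\tessdata", EnglishLanguage, EngineMode.Default))
            using (var imageWithText = Pix.LoadFromFile(blogPostImage))
            using (var page = ocrEngine.Process(imageWithText))
            {
                var text = page.GetText();
                Console.WriteLine(text);

                // Keep the console window open until Enter is pressed.
                Console.ReadLine();
            }
        }
    }
}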

Compile and run the above code – if you added the configuration code in step 4, you’ll see a large amount of logging text, and finally the text that Tesseract reads from the image.

I found that the text interpreted from the image was:

JEREMY LINDSAY

Building a 3d printer – Taz-5,
Part 8: Building the X-axis

Last time I attached the threaded rod and guide rails for the Zraxis. With these in
place, I’m now able to start building the Xraxis.

Afew notes on this post before I begin:

1.| ran outcfblackfilamentwhile buildingthis part,sol had to usethe
yellow filament l’ve been using for my other project.

2. This was one ofthe trickiest parts ofthe project so far. The Xraxis involves
a few pieces being bolted together, and I had issues with ABS parts
shrinking slightly , which meant that holes corresponding to each other
on different parts sometimes didn’t line up perfectly.

So a few comments are:

  1. Generally the recognition was very good. There were a few small things that went wrong:
    • “Z-axis” was interpreted as “Zraxis”, so the hyphen wasn’t seen correctly.
    • “I ran out of black filament while” was interpreted as “| ran outcfblackfilamentwhile” – the capital letter “I” was seen as a pipe character, “of” was read as “cf”, and there were issues with spacing.
  2. The black text was recognised – however the light grey text beside my name, the brown category words, and the date of the blog post were not interpreted at all.
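If you want a number rather than an impression, one common approach is to compare the recognised text against the expected text using edit distance. The sketch below is my own illustration and not part of Tesseract: the class OcrAccuracy and its methods are hypothetical names. It counts the single-character insertions, deletions and substitutions needed to turn one string into the other (Levenshtein distance), and divides by the expected length to give a rough character error rate.

```csharp
using System;

static class OcrAccuracy
{
    // Levenshtein distance: the minimum number of single-character insertions,
    // deletions or substitutions needed to turn 'expected' into 'actual'.
    public static int EditDistance(string expected, string actual)
    {
        // Only two rows of the dynamic-programming table are kept at a time.
        var previous = new int[actual.Length + 1];
        var current = new int[actual.Length + 1];

        for (var j = 0; j <= actual.Length; j++)
            previous[j] = j;

        for (var i = 1; i <= expected.Length; i++)
        {
            current[0] = i;
            for (var j = 1; j <= actual.Length; j++)
            {
                var substitution = previous[j - 1] + (expected[i - 1] == actual[j - 1] ? 0 : 1);
                var deletion = previous[j] + 1;
                var insertion = current[j - 1] + 1;
                current[j] = Math.Min(substitution, Math.Min(deletion, insertion));
            }

            // The row just computed becomes the "previous" row for the next pass.
            var swap = previous;
            previous = current;
            current = swap;
        }

        return previous[actual.Length];
    }

    // Edit distance as a fraction of the expected text's length.
    public static double CharacterErrorRate(string expected, string actual)
    {
        if (expected.Length == 0)
            return actual.Length == 0 ? 0.0 : 1.0;

        return (double)EditDistance(expected, actual) / expected.Length;
    }

    static void Main()
    {
        // The hyphen read as "r" is a single substitution.
        Console.WriteLine(EditDistance("Z-axis", "Zraxis")); // prints 1

        Console.WriteLine(CharacterErrorRate(
            "I ran out of black filament while",
            "| ran outcfblackfilamentwhile"));
    }
}
```

This only works where you already know what the text should say, so it’s more useful for comparing OCR options against a fixed sample image than for checking arbitrary input.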

Conclusion

Tesseract is a good open source option for optical character recognition in C# applications. It’s simple to get started with, and it interpreted text well from the sample tested. However, there were some small issues around spacing and occasional problems with character recognition.

Next time in this series, I’ll use the Windows.Media.Ocr library to interpret text from the same image.