A Quick and Dirty Guide to Tesseract OCR


Are you a note-taker, a journal writer, or someone who simply prefers handwriting to typing when recording your thoughts? You are not alone: many of us have dozens of notebooks in which we scribble random ideas for safekeeping. Maybe you would like to back up your notes and convert them to digital text?

It used to be that a job like this would take weeks, months, or even years! Now you can convert images of handwritten text straight to digital text, and more, with Tesseract OCR. Tesseract OCR, you ask? Keep reading to find out what this powerful software is all about.

What Is OCR?

First off, we need to explain what OCR is. OCR stands for optical character recognition (or optical character reader). Simply put, an OCR device or application converts images of written language into machine-encoded text.

OCR is rooted in the ability of a machine to recognize shapes in an image as characters and match them with the characters of a language. This is easier said than done, because machines do not interpret visual input easily.
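
To make the idea of matching character images against known characters concrete, here is a deliberately tiny sketch. It is a toy and not how Tesseract works internally: the 3x5 bitmaps, the TEMPLATES table, and the function names are all made up for illustration. Each known character is stored as a small bitmap, and an unknown glyph is assigned to whichever template differs from it in the fewest pixels.

```python
# Toy illustration only: real OCR engines such as Tesseract use far more
# sophisticated methods than this direct template comparison.

# Hypothetical 3x5 bitmaps ("#" = ink, "." = background) for three characters.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def pixel_distance(glyph, template):
    """Count the pixels at which two equally sized bitmaps differ."""
    return sum(
        g != t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph):
    """Return the template character whose bitmap differs least from glyph."""
    return min(TEMPLATES, key=lambda ch: pixel_distance(glyph, TEMPLATES[ch]))


# A noisy "T": one background pixel in the middle row has been flipped to ink.
noisy_t = ["###", ".#.", ".##", ".#.", ".#."]
print(recognize(noisy_t))
```

Even this toy hints at why the problem is hard: a single stray pixel already moves the glyph away from its template, and real scans add skew, varying fonts, and uneven lighting on top.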

What Is Tesseract OCR?

Tesseract is a software engine that performs optical character recognition (OCR). It was first developed by Hewlett-Packard at its labs in Bristol, England, and Greeley, Colorado, between 1985 and 1994. Apart from some migration of the code from C to C++, the original software didn’t change much until 2005.

In 2005, Hewlett-Packard and the University of Nevada, Las Vegas (UNLV) released Tesseract as open-source software under the Apache License. Recognizing its potential, Google sponsored its development in 2006 and continues to do so to this day.

Tesseract has come a long way since the first version in 1995. Versions up to and including version 2 accepted only TIFF images of simple one-column text. These early versions did not include layout analysis, which made Tesseract frustratingly impractical for most uses.

It wasn’t until version 3.00 that the engine supported hOCR positional information and page-layout analysis. Tesseract is well suited to use as a backend, and it can handle more complicated OCR tasks when paired with frontend document analysis software such as OCRopus.
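
To show what "positional information" means in practice, here is a small sketch that pulls words and their bounding boxes out of hOCR-style markup. hOCR is HTML in which each recognized word carries a title attribute such as "bbox x0 y0 x1 y1". The SAMPLE_HOCR string below is written by hand for illustration and is not real Tesseract output, and WordBoxParser and parse_bbox are hypothetical helper names.

```python
from html.parser import HTMLParser

# Hand-written sample in the hOCR style (an assumption, not real engine output).
SAMPLE_HOCR = """
<span class='ocr_line' title='bbox 36 92 210 116'>
  <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
  <span class='ocrx_word' title='bbox 110 92 210 116; x_wconf 91'>world</span>
</span>
"""


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            return tuple(int(value) for value in parts[1:5])
    return None


class WordBoxParser(HTMLParser):
    """Collect (text, bbox) pairs for every element of class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._pending_bbox = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("class") == "ocrx_word":
            # Words without a usable bbox leave this as None and are skipped.
            self._pending_bbox = parse_bbox(attributes.get("title") or "")

    def handle_data(self, data):
        text = data.strip()
        if self._pending_bbox is not None and text:
            self.words.append((text, self._pending_bbox))
            self._pending_bbox = None


parser = WordBoxParser()
parser.feed(SAMPLE_HOCR)
for text, bbox in parser.words:
    print(text, bbox)
```

With coordinates like these, a frontend can highlight words on the original scan or rebuild the layout of the page.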

Version 3.04, released in July of 2015, supported over 100 languages. The current stable version, Tesseract v4.1.1, is widely regarded as one of the most accurate OCR engines available.

Tesseract 101: How to Install and Run

The source code for the Tesseract OCR engine v4.1.1 is available for download on GitHub. You can either install Tesseract from the pre-built binary packages available on GitHub or build it from the source code (C++17 support required).

Tesseract has no built-in graphical interface; you run it from the command line or call it as a library from your own code. However, various projects provide a GUI for configuring the Tesseract engine. Once it is installed on your computer (Mac, Windows, or Linux), you will be able to run tesseract commands from your terminal. Here is the general form of a command that performs OCR:

tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles…]

For information on the many command-line options available, type the command: tesseract --help
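
If you would rather drive the command from a script, the usage line above maps naturally onto a list of arguments. The following is a minimal sketch, not an official wrapper: build_tesseract_command and run_ocr are hypothetical helper names, and notes.png is a made-up file name. Note that the long options are written with two plain hyphens (--oem, --psm).

```python
import shutil
import subprocess


def build_tesseract_command(image, output_base, lang=None, oem=None, psm=None):
    """Assemble the argument list for: tesseract imagename outputbase [options]."""
    command = ["tesseract", str(image), str(output_base)]
    if lang is not None:
        command += ["-l", lang]          # language, e.g. "eng"
    if oem is not None:
        command += ["--oem", str(oem)]   # OCR engine mode
    if psm is not None:
        command += ["--psm", str(psm)]   # page segmentation mode
    return command


def run_ocr(image, output_base, **options):
    """Run Tesseract if it is installed; by default it writes output_base.txt."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    subprocess.run(build_tesseract_command(image, output_base, **options), check=True)


# "notes.png" and "notes" are hypothetical names used only for illustration.
print(build_tesseract_command("notes.png", "notes", lang="eng", psm=6))
```

Calling run_ocr("notes.png", "notes", lang="eng") would then run the engine and leave the recognized text in notes.txt, assuming Tesseract and the English language data are installed.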

Tesseract OCR Basics Every Coder Should Know

Tesseract OCR is easy-to-use open-source software that can be called from many programming languages through wrapper libraries. Anyone considering a career in programming would do well to learn how to use it for OCR tasks in their own projects. For more of the latest technology news and tutorials, keep reading our blog.
