OCR (optical character recognition) is the electronic or mechanical conversion of images of printed, handwritten, or typed text into machine-encoded text.
Tesseract.js is an OCR library that reads the characters in an image and converts them to text that can be processed by JavaScript.
In this article, I will demonstrate how to set up a new JavaScript project that uses Tesseract.js. Then, I will show how to create an API that reads data from an image file and then displays that data. Finally, we will walk through a working example in which we read data from an image of a receipt and print it to the console. For this project, we will use the offline version of Tesseract: https://github.com/jeromewu/tesseract.js-offline.
Setting up a Tesseract project
Adding dependencies in package.json
The first thing to do is create a new JavaScript project. Once the project is created, create a package.json file and add these dependencies to it:
"dependencies": {"tesseract.js": "^2.0.0-beta.1","tesseract.js-core": "^2.0.0-beta.13"},
After adding the dependencies above, run npm install to install them in your project. These two packages are what allow you to run Tesseract from your project.
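For reference, a minimal package.json for this step could look like the following (the name and version fields here are just placeholders):

{
  "name": "tesseract-ocr-demo",
  "version": "1.0.0",
  "dependencies": {
    "tesseract.js": "^2.0.0-beta.1",
    "tesseract.js-core": "^2.0.0-beta.13"
  }
}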
Clone or download GitHub files
Once the dependencies are added and installed, you will need to obtain the offline version of Tesseract. This can be done by either cloning or downloading the project from here: https://github.com/jeromewu/tesseract.js-offline
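If you have Git installed, cloning the repository is a single command:

git clone https://github.com/jeromewu/tesseract.js-offline.git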
Language data files
Language data files are required to initialize Tesseract. The offline version of Tesseract uses a lang-data folder that contains two files: eng.traineddata.gz and tha.traineddata.gz. These are compressed trained-data files that Tesseract needs in order to recognize text (English and Thai, respectively) in an image.
Once you have added the lang-data folder inside of your project, you can start using Tesseract.
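To make the paths concrete, here is the layout that the code later in this article expects (the server and public folder names come from that snippet; the rest of your project can differ):

server/
  lang-data/
    eng.traineddata.gz
    tha.traineddata.gz
  api/
    camera.js
public/
  uploads/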
Creating an API that uses Tesseract
Now that we have properly set up our Tesseract project, we can proceed to create an API that reads text from an image. This API will be built with Express, which means you will also need to add the required Express dependencies and install them using npm.
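In this example, the extra packages are express itself and multer, which handles multipart file uploads in the snippet below; you can install both with npm:

npm install express multer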
In the code snippet below, I have created an endpoint that reads text from a receipt using Tesseract, prints the data from that receipt, and then sends the text back to the user as a response.
// Include dependencies
const express = require('express')
const router = express.Router()
const multer = require('multer')
const path = require('path')
const fs = require('fs')
const upload = multer()
// OCR
const { createWorker } = require('tesseract.js')

// Write the uploaded file to a static location so Tesseract can read it.
const saveFile = (file) =>
  new Promise((resolve, reject) =>
    fs.writeFile('./public/uploads/receipt.png', file.buffer, (err) =>
      err ? reject(new Error('An error occurred: ' + err.message)) : resolve({ uploaded: true })
    )
  )

// Upload endpoint.
router.post('/upload', upload.single('photo'), async (req, res, next) => {
  // Make sure the request actually contains a file.
  if (!req.file) return res.status(400).send('Cannot find file on request')
  try {
    // Create a local file with the uploaded content.
    await saveFile(req.file)
    const worker = createWorker({
      langPath: path.join('./server', 'lang-data'),
      logger: (m) => console.log(m),
    })
    await worker.load()
    await worker.loadLanguage('eng')
    await worker.initialize('eng')
    // Read the text from the uploaded file.
    const { data: { text } } = await worker.recognize(path.join('./public/uploads', 'receipt.png'))
    await worker.terminate()
    const data = `${text}`
    console.log(data, 'data') // Prints the data after Tesseract reads the file.
    if (!data || data.length <= 0) {
      return res.status(400).send('ERROR: Could not read receipt!')
    }
    res.send(data)
  } catch (err) {
    console.log(err)
    res.sendStatus(500)
  }
})
As you can see, I have imported the createWorker function from ‘tesseract.js’ and then created an endpoint called "upload" which receives an image file. When a user calls this endpoint, a Tesseract worker is initialized.
Then, the worker loads the trained data from the lang-data folder. Once it is initialized, it starts reading the text from the receipt that the user has uploaded.
Once the worker is done reading the data, I store it in a variable and print it with console.log(). Finally, I stop the worker by calling worker.terminate() and return the data in the response. If there is any issue reading the image file, the endpoint returns a 400 status code. Similarly, if there is an issue running the Tesseract worker, the endpoint returns a 500 status code.
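Assuming the router is mounted at the root of an Express app listening on port 3000 (both of these are assumptions about your setup), you could exercise the endpoint with curl:

curl -F "photo=@receipt.png" http://localhost:3000/upload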
If you want more detail on this endpoint, please check out my GitHub repo below :) The endpoint shown above lives in the /server/api/camera.js file.
Working example with a receipt
Let's use the receipt below to walk through how Tesseract reads the data from it and prints it to the terminal.
After uploading a receipt and waiting for the worker to finish, we will get the data from the picture printed to the console.
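Alongside the recognized text, the logger we passed to createWorker prints progress messages while the worker runs. In tesseract.js v2 these are plain objects, roughly like the following (the values here are illustrative, not from a real run):

{ workerId: 'Worker-0', status: 'loading language traineddata', progress: 1 }
{ workerId: 'Worker-0', status: 'recognizing text', progress: 0.57 }
{ workerId: 'Worker-0', status: 'recognizing text', progress: 1 }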
The data that Tesseract reads depends on the quality of the picture. In order to obtain better results, I recommend that you use high-quality images.
Thank you for reading through this post! I hope you have learned how to use Tesseract.js inside your JavaScript projects to perform OCR.