CLI tool to recognise handwritten text from answer sheets using Tesseract OCR.
The extracted text is then used to evaluate marks using NLP.
Installation:
Install the Tesseract OCR engine: https://github.com/tesseract-ocr/tesseract/wiki
Install the Python dependencies: pytesseract, pillow, pandas, numpy, matplotlib, lxml
Usage:
1) Clone the repository into your working directory.
2) Update the path to the Tesseract executable in main.py.
3) Add an image for testing to the images folder.
4) Run main.py imagename
It will return a HOCR file, which is very similar to XHTML.
5) Run file_conversion.py hocrfilename
It will convert the HOCR file into a dataframe and store the output in a pickle file and a JSON file.
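Step 2 asks for the path to the Tesseract executable. A minimal sketch of looking it up instead of hard-coding it, assuming the binary is named tesseract and is on PATH. find_tesseract and its fallback argument are hypothetical helpers, not part of this repository:

```python
import shutil

def find_tesseract(fallback=None):
    """Return the path of the tesseract executable, or fallback if it is not on PATH."""
    # shutil.which searches the directories listed in the PATH environment variable
    found = shutil.which("tesseract")
    return found if found is not None else fallback

# The fallback below is the Windows path used in main.py
path = find_tesseract(fallback=r"C:\Program Files (x86)\Tesseract-OCR\tesseract")
print("Using tesseract at:", path)
```

The returned value could then be assigned to pytesseract.tesseract_cmd in place of the hard-coded string.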
Phase 1: Demonstration of OCR on handwritten text and export to JSON
(Rendered Python notebook displayed as markdown using nbconvert)
Phase 2: Using nltk to create an NLP model to evaluate answers
Download all the packages using the nltk downloader:
import nltk
nltk.download()
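The Phase 2 model itself is not described here. Purely as an illustration, one simple way an extracted answer could be compared with a reference answer is keyword overlap. The sketch below uses plain Python rather than nltk so it runs without any downloads. score_answer, the stop-word list, the reference sentence, and the scoring rule are all assumptions, not the project's model:

```python
import re

# Small illustrative stop-word list (an assumption; nltk ships a fuller one)
STOP_WORDS = {"a", "an", "the", "in", "of", "is", "are", "and", "to"}

def keywords(text):
    """Lower-case the text, split it into alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}

def score_answer(answer, reference):
    """Return the fraction of reference keywords that appear in the answer."""
    ref = keywords(reference)
    if not ref:
        return 0.0
    return len(ref & keywords(answer)) / len(ref)

# Hypothetical reference answer and a cleaned-up student answer
reference = "A conductor moving in a magnetic field produces a voltage"
answer = "Conductor in magnetic field produce voltage"
print(score_answer(answer, reference))
```

Exact matching misses "produce" against "produces"; stemming or lemmatisation, which nltk provides, would be one way to close that gap.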
from pytesseract import pytesseract
import sys
import os
# Edit the path to the tesseract executable if your installation directory is different
pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract'
from datetime import datetime
def replaceMultiple(mainString, toBeReplaces, newString):
    # Replace every occurrence of each substring in toBeReplaces with newString
    for elem in toBeReplaces:
        if elem in mainString:
            mainString = mainString.replace(elem, newString)
    return mainString
mainStr = str(datetime.now())
file_name = replaceMultiple(mainStr, [':', '-', '.', ' '], "")
def generateFilename():
    # Build a filename from the current timestamp with the separators stripped
    mainStr = str(datetime.now())
    file_name = replaceMultiple(mainStr, [':', '-', '.', ' '], "")
    return file_name
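One caveat with str(datetime.now()): when the microsecond field happens to be zero, the string has no fractional part, so the resulting name is six digits shorter. A possible alternative, sketched here under the assumption that a fixed 20-digit name is wanted, is strftime. generate_filename_fixed is a hypothetical name:

```python
from datetime import datetime

def generate_filename_fixed(now=None):
    """Return a 20-digit timestamp: YYYYMMDDHHMMSS followed by 6 microsecond digits."""
    if now is None:
        now = datetime.now()
    # %f always emits six digits, even when the microsecond field is 0
    return now.strftime("%Y%m%d%H%M%S%f")

stamp = generate_filename_fixed(datetime(2018, 10, 21, 4, 23, 17, 89205))
print(stamp)  # 20181021042317089205
```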
from PIL import Image
from IPython.display import display
import matplotlib.pyplot as plt
im = Image.open("testfile1.jpg")
fig, ax = plt.subplots()
ax.imshow(im)
print("(width,height):"+str(im.size))
(width,height):(3000, 3115)
box=(250,180,2800,400)
cropped_image = im.crop(box)
display(cropped_image)
cropped_text= pytesseract.image_to_string(cropped_image, lang = 'eng')
print(cropped_text)
Conductor wn magnetic Field Produce voltage :
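The box passed to crop is (left, upper, right, lower) in pixels. A small sketch of clamping such a box so that it stays inside the image before cropping, written in plain Python so it runs without Pillow. clamp_box is a hypothetical helper:

```python
def clamp_box(box, size):
    """Clamp a (left, upper, right, lower) box to an image of (width, height)."""
    left, upper, right, lower = box
    width, height = size
    left = max(0, min(left, width))
    right = max(left, min(right, width))
    upper = max(0, min(upper, height))
    lower = max(upper, min(lower, height))
    return (left, upper, right, lower)

# The box used above already fits inside the 3000 x 3115 image
print(clamp_box((250, 180, 2800, 400), (3000, 3115)))
# A box that runs past the right edge is trimmed to the image width
print(clamp_box((250, 180, 3500, 400), (3000, 3115)))
```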
def createHOCR(imagepath):
    filename = generateFilename()
    pytesseract.run_tesseract(imagepath, filename, lang=None, extension='html', config="hocr")
    print("HOCR file generated: " + str(filename) + ".hocr")
createHOCR("testfile.jpg")
HOCR file generated: 20181021042317089205.hocr
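A HOCR file marks each recognised word with a span of class ocrx_word whose title attribute holds the bounding box and the confidence. The sketch below parses one such span with the standard library. The sample markup is hand-written for illustration, not taken from a generated file, and parse_word is a hypothetical helper:

```python
import re
import xml.etree.ElementTree as ET

# Hand-written sample span (assumed layout of a typical Tesseract HOCR word)
SAMPLE = ("<span class='ocrx_word' id='word_1_1' "
          "title='bbox 250 180 520 240; x_wconf 96'>Define</span>")

def parse_word(span_xml):
    """Return (text, bbox, confidence) for a single ocrx_word span."""
    elem = ET.fromstring(span_xml)
    title = elem.get("title", "")
    # bbox is four integers: left, top, right, bottom
    bbox = tuple(int(n) for n in re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title).groups())
    # x_wconf is the word confidence as an integer percentage
    conf = int(re.search(r"x_wconf (\d+)", title).group(1))
    return elem.text, bbox, conf

print(parse_word(SAMPLE))
```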
from lxml import etree
import pandas as pd
import os
import sys
import generate_filename as gf
def hocr_to_dataframe(fp):
    doc = etree.parse(fp)
    words = []
    wordConf = []
    for path in doc.xpath('//*'):
        # Only elements whose class attribute is ocrx_word hold a recognised word
        if 'ocrx_word' in path.values():
            # The title attribute carries the confidence as "x_wconf NN"
            conf = [x for x in path.values() if 'x_wconf' in x][0]
            wordConf.append(int(conf.split('x_wconf ')[1]))
            words.append(path.text)
    dfReturn = pd.DataFrame({'word': words,
                             'confidence': wordConf})
    return dfReturn
filename=generateFilename()
dataframe=hocr_to_dataframe("20181021041156998790.hocr")
dataframe
| | word | confidence |
|---|---|---|
| 0 | | 95 |
| 1 | | 95 |
| 2 | Q1. | 89 |
| 3 | Define | 96 |
| 4 | electromagnetic | 96 |
| 5 | induction. | 95 |
| 6 | Sane | 23 |
| 7 | \| | 90 |
| 8 | Conductor | 93 |
| 9 | mM | 42 |
| 10 | magnetic | 70 |
| 11 | Field | 63 |
| 12 | produce | 67 |
| 13 | voltage | 65 |
| 14 | ‘Seconaewctntmnstnn | 0 |
| 15 | esionainsnenaneenrenncconanniiti | 0 |
| 16 | Q2. | 89 |
| 17 | What | 96 |
| 18 | are | 96 |
| 19 | 3 | 96 |
| 20 | examples | 96 |
| 21 | of | 95 |
| 22 | transparent | 95 |
| 23 | objects? | 96 |
| 24 | (Professor | 96 |
| 25 | provides | 96 |
| 26 | 5 | 96 |
| 27 | as | 95 |
| 28 | input) | 90 |
| 29 | | 95 |
| 30 | Q3. | 92 |
| 31 | Complete | 96 |
| 32 | the | 96 |
| 33 | network | 95 |
| 34 | tree. | 96 |
| 35 | | 95 |
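Rows with very low confidence in the table above (for example rows 14 and 15, both at 0) are mostly noise. A minimal sketch of dropping them before any further processing, using plain tuples instead of the dataframe. filter_confident is a hypothetical helper and the threshold of 60 is an arbitrary assumption:

```python
# (word, confidence) pairs copied from a few rows of the table
rows = [("Define", 96), ("Sane", 23), ("Conductor", 93), ("mM", 42),
        ("magnetic", 70), ("esionainsnenaneenrenncconanniiti", 0)]

def filter_confident(pairs, threshold=60):
    """Keep only the words whose OCR confidence is at least threshold."""
    return [word for word, conf in pairs if conf >= threshold]

print(filter_confident(rows))
```

On the dataframe itself, the same filter can be written as dataframe[dataframe.confidence >= 60].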
dataframe.to_json(filename+".json",orient='columns')
print("JSON generated: "+filename+".JSON")
dataframe.to_pickle(filename+".pkl")
print("Pickle generated: "+filename+".pkl")
JSON generated: 20181021042319190731.JSON
Pickle generated: 20181021042319190731.pkl
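With orient='columns', to_json writes one object per column that maps each row index (as a string) to its value. A sketch of reading such a file back with only the standard library. The sample string stands in for the generated file and is hand-written, and load_words is a hypothetical helper:

```python
import json

# Hand-written sample in the orient='columns' layout
sample = ('{"word": {"0": "Q1.", "1": "Define", "2": "electromagnetic"}, '
          '"confidence": {"0": 89, "1": 96, "2": 96}}')

def load_words(json_text):
    """Return a list of (word, confidence) tuples in row order."""
    data = json.loads(json_text)
    # Row keys are strings, so sort them numerically rather than alphabetically
    keys = sorted(data["word"], key=int)
    return [(data["word"][k], data["confidence"][k]) for k in keys]

print(load_words(sample))
```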