
Configuring OCR in Alfresco

OCR (Optical Character Recognition) is the recognition of printed or handwritten text by a computer. It extracts the characters from images or scanned documents, which makes images that contain text searchable. OCR is a very useful feature for any ECM product. In this blog, we will see how to configure it in Alfresco Community Edition using Tesseract. We have tested this with Alfresco versions 5.1.f and 5.2.e; it should also work with other nearby versions.

Prerequisites: 

  1. Alfresco Community / Enterprise Edition installed and running
  2. Basic knowledge of Alfresco administration

     
Steps to Configure Tesseract:

1. Download and install Tesseract
Linux:

apt-get install tesseract-ocr


2. Stop the Alfresco Tomcat server

./alfresco.sh stop tomcat


3. Download the Linux or Windows context file and place it at

/tomcat/shared/classes/alfresco/extension/
 


4. Place ocr.bat (Windows) or ocr.sh (Linux) at /


a) ocr.bat (for Windows)

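REM %1 = source file, %2 = target file (as passed in by the context file)
REM create the working directory if it does not exist
if not exist C:\tmp mkdir C:\tmp
REM to see what happens
echo from %1 to %2 >> C:\tmp\ocrtransform.log
copy /Y "%~1" "C:\tmp\%~nx1"
echo target %~d2%~p2%~n2
REM call tesseract; it writes its output to the target path without extension, plus .txt
"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe" "C:\tmp\%~nx1" "%~d2%~p2%~n2" -l eng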


b) ocr.sh (for Linux)

#!/bin/sh
# save arguments to variables
SOURCE=$1
TARGET=$2
TMPDIR=/tmp/Tesseract
FILENAME=$(basename "$SOURCE")
OCRFILE=$FILENAME.tif
# Create temp directory if it doesn't exist
# (sudo here needs a passwordless sudo rule for the user running Alfresco)
sudo mkdir -p "$TMPDIR"
# to see what happens
#echo "from $SOURCE to $TARGET" >>/tmp/ocrtransform.log
sudo cp -f "$SOURCE" "$TMPDIR/$OCRFILE"
# call tesseract; it appends .txt to the output base, so strip the extension from $TARGET
sudo /usr/local/bin/tesseract "$TMPDIR/$OCRFILE" "${TARGET%.*}" -l eng
#sudo tesseract "$TMPDIR/$OCRFILE" "${TARGET%.*}" -l eng
sudo rm -f "$TMPDIR/$OCRFILE"
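
The parameter expansion that strips the extension from $TARGET in ocr.sh matters because tesseract appends .txt to the output base name it is given. A minimal sketch of that step in isolation; the `output_base` helper name and the example path are hypothetical and used only for illustration:

```shell
#!/bin/sh
# Hypothetical helper: strip the last extension from a target path.
# tesseract appends ".txt" to the output base it receives, so the
# extension has to be removed before the path is passed in.
output_base() {
    printf '%s\n' "${1%.*}"
}

# Example with a made-up target path: prints /tmp/alfresco/transform_1234
output_base "/tmp/alfresco/transform_1234.txt"
```

Without this step, a target of `transform_1234.txt` would end up as `transform_1234.txt.txt`, and Alfresco would not find the file it asked for.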


Note: Make sure that the path to the tesseract command is correct in the ocr.sh / ocr.bat file.
Linux:

/usr/local/bin or /usr/bin


Windows:

C:\Program Files (x86)\Tesseract-OCR\tesseract.exe
or C:\Program Files\Tesseract-OCR\tesseract.exe
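
On Linux, the location check from the note above could be scripted roughly as follows. This is only a sketch: the `find_tesseract` function name is an assumption, and it simply looks on PATH first and then in the two directories mentioned above (or in any directories passed as arguments).

```shell
#!/bin/sh
# Print the first tesseract executable found; return 1 if none is found.
# Checks PATH first, then the directories given as arguments
# (defaulting to the two locations mentioned in the note).
find_tesseract() {
    if command -v tesseract >/dev/null 2>&1; then
        command -v tesseract
        return 0
    fi
    [ "$#" -gt 0 ] || set -- /usr/local/bin /usr/bin
    for dir in "$@"; do
        if [ -x "$dir/tesseract" ]; then
            printf '%s\n' "$dir/tesseract"
            return 0
        fi
    done
    return 1
}

find_tesseract || echo "tesseract not found; check the path in ocr.sh" >&2
```

Whatever path this prints is the one that should appear in the tesseract line of ocr.sh.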


5. If the current user does not have read and execute permissions on ocr.sh, grant them.

chmod +rx /opt//ocr.sh
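
To confirm the permissions took effect, a small check like the one below can be run as the user that owns the Alfresco Tomcat process. The `check_script_perms` name is hypothetical; the function only tests whether the given file is readable and executable by whoever runs it.

```shell
#!/bin/sh
# Report whether the current user can read and execute a script.
# Run this as the user that owns the Alfresco Tomcat process.
check_script_perms() {
    script=$1
    if [ ! -f "$script" ]; then
        echo "missing: $script" >&2
        return 1
    fi
    if [ -r "$script" ] && [ -x "$script" ]; then
        echo "ok: $script"
    else
        echo "not readable or not executable: $script" >&2
        return 1
    fi
}
```

Call it with the same ocr.sh path that was passed to chmod above.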


6. Add the following properties to the alfresco-global.properties file located at

/tomcat/shared/classes/


Linux:

ocr.script=/opt//ocr.sh
ghostscript.exe=gs


Windows:

ocr.script=C:\\ocr.bat
ghostscript.exe=gs
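
Before restarting, it can be worth confirming that the ocr.script property is actually set and points to an executable file. The sketch below is an assumption-laden helper for Linux, not part of Alfresco: `get_prop` and `check_ocr_script` are hypothetical names, and the parser only understands simple key=value lines.

```shell
#!/bin/sh
# Print the value of a key from a Java-style .properties file.
# Handles simple "key=value" lines only (no line continuations);
# if the key appears more than once, the last occurrence wins.
get_prop() {
    file=$1
    key=$2
    # escape dots so "ocr.script" is matched literally
    pattern=$(printf '%s' "$key" | sed 's/[.]/\\./g')
    sed -n "s/^[[:space:]]*${pattern}[[:space:]]*=[[:space:]]*//p" "$file" | tail -n 1
}

# Check that ocr.script is set in the given properties file
# and that it points to an executable file.
check_ocr_script() {
    script=$(get_prop "$1" ocr.script)
    if [ -z "$script" ]; then
        echo "ocr.script is not set in $1" >&2
        return 1
    fi
    if [ ! -x "$script" ]; then
        echo "ocr.script is missing or not executable: $script" >&2
        return 1
    fi
    echo "ocr.script=$script"
}
```

Pass the full path of your alfresco-global.properties file to `check_ocr_script`; commented-out lines are ignored.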


7. Start the Tomcat server
Linux:

./alfresco.sh start tomcat


Windows:

Run C:\\tomcat\bin\startup.bat, or use manager-windows.exe.

Note: Existing files in Alfresco will not be OCRed; you have to upload new image files to test.


Important:

  1. Make sure you are passing the correct arguments in the context file (the entries differ between Windows and Linux).
  2. Check that your .bat or .sh script runs correctly on its own.
  3. Verify that tesseract creates a text file for an image file. To do so, go to the directory where tesseract is installed and run the following command:

     tesseract ./ ./ -l eng


If the text file is created with content in it, tesseract is working.
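
The same verification can be wrapped in a small function so it is easy to repeat for several test images. This is a sketch under stated assumptions: `verify_ocr` is a hypothetical name, and `TESSERACT_BIN` is an optional override introduced here for illustration (it defaults to whatever `tesseract` resolves to on PATH).

```shell
#!/bin/sh
# Run tesseract on one image and report whether a non-empty text file results.
# TESSERACT_BIN is an assumed override for this sketch; it defaults to
# whatever "tesseract" resolves to on PATH.
verify_ocr() {
    image=$1
    workdir=$(mktemp -d) || return 1
    # tesseract writes its result to "<output base>.txt"
    "${TESSERACT_BIN:-tesseract}" "$image" "$workdir/out" -l eng >/dev/null 2>&1
    if [ -s "$workdir/out.txt" ]; then
        echo "OCR produced text for $image"
        status=0
    else
        echo "no text produced for $image" >&2
        status=1
    fi
    rm -rf "$workdir"
    return "$status"
}
```

Call `verify_ocr` with the path of a scanned image that contains English text. A failure here points to the tesseract installation or language data rather than to the Alfresco configuration.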


If your content is still not searchable, leave a comment. We are happy to hear about your ECM challenges, as we love solving them.
