Want to make sure that your Office 365 environment is as secure as it can possibly be? Sign up for our “All Access Tour: Office 365 Security and Governance Features” today!
More and more data has moved to digital formats over the last decade. However, there are still plenty of cases where physical documents need to be used or preserved. The healthcare and financial industries in particular typically scan or fax large volumes of physical documents into TIFF or PDF formats. Unstructured content analysis is already a challenge, and these formats end up being an even harder nut to crack.
Optical character recognition (OCR) essentially enables users to extract text content from images of physical documents so that it ends up in an editable format. This can apply to pages of a book, scanned PDF files, and even handwritten content (though this functionality is more limited). Compliance Guardian’s OCR implementation is made possible largely by Google’s Tesseract library.
Beyond the exciting new capability to detect text in images, there are also some factors to keep in mind.
OCR requires a lot of computation and has a significant impact on CPUs. Consequently, the rate at which documents can be scanned will be significantly slower than before. For example, it might take five seconds to process a 300 dots per inch (DPI) scanned page (depending on CPU power).
Although OCR technology has advanced a great deal in recent years, it’s still far from perfect. It’s rare for OCR to yield 100% accurate results. The clearer the original image, the more accurate the result will be.
In the following section, we’ll expand on some common factors that can affect accuracy. To help improve accuracy, pre-processing is essential. Common approaches include converting an image to grayscale, increasing contrast, reducing noise, and more. In special cases, more complex pre-processing may be required (e.g. computer vision, contour recognition, rotate/crop/anchors).
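As an illustration, the common pre-processing steps mentioned above (grayscale conversion, contrast boosting, noise reduction) can be sketched with the Pillow imaging library. The `preprocess` function and its specific settings are our own assumptions for illustration, not Compliance Guardian’s actual implementation:

```python
from PIL import Image, ImageEnhance, ImageFilter

def preprocess(image: Image.Image) -> Image.Image:
    """Prepare a scanned page for OCR: grayscale, higher contrast, less noise."""
    gray = image.convert("L")                               # single-channel grayscale
    contrasted = ImageEnhance.Contrast(gray).enhance(2.0)   # boost contrast 2x
    denoised = contrasted.filter(ImageFilter.MedianFilter(size=3))  # remove speckle noise
    return denoised
```

The cleaned-up image would then be handed to an OCR engine such as Tesseract, which generally copes far better with the high-contrast grayscale result than with the raw scan.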
Typically, OCR works best for documents that are:
- Scanned with flatbed scanners
- Scanned at high resolution under good lighting conditions
- Scanned with high contrast
- Text-centric
- Using common fonts
- Well aligned
Documents created with a dedicated scanner or fax machine can meet most of these conditions, but not all documents can.
Below is some information about how common factors can affect OCR results.
Nature of the Image
Scanned documents yield much better accuracy than photos, because photos usually have less contrast, more noise, blurriness (e.g. out-of-focus edges or camera shake), distortion (the page isn’t flat), poor alignment, and so forth.
The same principle applies to images within scanned documents. Text-centric content will fare much better than scanned images of driver’s licenses and ID cards, for example.
From testing, we found that images with a resolution of 300 DPI generally produce better results. If the resolution is too low (under 100 DPI), even pre-processing that enlarges the image will only improve accuracy slightly; it still won’t match higher-resolution images. On the other hand, images with excessive resolution will take longer to process.
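The enlargement step mentioned above can be sketched in a few lines with Pillow. The 300 DPI target and the resampling filter are assumptions chosen for illustration:

```python
from PIL import Image

def upscale_to_dpi(image: Image.Image, current_dpi: int, target_dpi: int = 300) -> Image.Image:
    """Enlarge a low-resolution scan so its effective DPI reaches the target.

    Upscaling cannot recover detail that was never captured, but it gives
    the OCR engine larger glyphs to work with, which helps somewhat.
    """
    if current_dpi >= target_dpi:
        return image  # already at or above the target; leave it alone
    scale = target_dpi / current_dpi
    new_size = (round(image.width * scale), round(image.height * scale))
    return image.resize(new_size, Image.LANCZOS)
```

For example, a 100 DPI scan would be tripled in each dimension to reach an effective 300 DPI before OCR.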
Font size also plays a role alongside resolution. A larger font size may be fine at a low DPI, but a smaller font size will need a higher resolution to be recognized. For instance, a 10.5pt font performs fine in 300 DPI images, but for images at 200 DPI, a font size smaller than 12pt may not work well.
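The interaction between font size and resolution follows from simple arithmetic: one typographic point is 1/72 of an inch, so a glyph’s rendered height in pixels is roughly `points / 72 * dpi`. A quick pure-Python check (the pixel-height threshold at which accuracy suffers is our own rough characterization, not a Tesseract specification):

```python
def glyph_height_px(font_pt: float, dpi: int) -> float:
    """Approximate rendered glyph height in pixels: one point is 1/72 inch."""
    return font_pt / 72 * dpi

# 10.5pt at 300 DPI renders at roughly 44 px of glyph height,
# while the same text at 200 DPI is only about 29 px tall,
# which is where recognition starts to suffer.
print(round(glyph_height_px(10.5, 300)))  # 44
print(round(glyph_height_px(10.5, 200)))  # 29
print(round(glyph_height_px(12.0, 200)))  # 33
```

This is why bumping either the font size or the scan resolution compensates for the other.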
Font type is another factor. Google’s Tesseract library is pre-trained with the most common font types. If the font used in the document isn’t common, accuracy will be lower.
Handwriting results are typically poor for the same reason. It’s also worth noting that handwriting input and handwriting OCR are a bit different, in that handwriting input tracks pen movement while handwriting OCR only has the final image to work with.
The OCR engine works best on high-contrast images. Most well-scanned, text-centric documents satisfy this. For less-than-ideal cases, pre-processing can be used to boost the contrast.
Google’s Tesseract library has minimal tolerance when it comes to the alignment of scanned images. Based on our testing, the accuracy of a scan will drop if its alignment is more than 5 degrees off. Again, some complex pre-processing techniques can help overcome this.
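For example, once a skew angle has been estimated (the estimation itself is the hard part and is not shown here), correcting it is a single rotation. This Pillow sketch assumes the angle is already known:

```python
from PIL import Image

def deskew(image: Image.Image, skew_degrees: float) -> Image.Image:
    """Rotate a grayscale scan back to upright given its estimated skew angle.

    `expand=True` grows the canvas so page corners aren't clipped, and
    `fillcolor=255` pads the new corners with white, like paper background.
    """
    return image.rotate(-skew_degrees, resample=Image.BICUBIC,
                        expand=True, fillcolor=255)
```

Straightening the page this way before OCR keeps the text rows within the engine’s narrow alignment tolerance.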
Webinar: Compliant Migration with DocAve Migrator
There are many additional factors that can degrade accuracy, such as blurriness, pages that weren’t flat when scanned, and blemishes on the images.
How AvePoint Can Help
AvePoint’s Compliance Guardian product already has an extensive framework of technologies to help customers with deep content analysis. In our newest update, version 4.4, optical character recognition (OCR) for scanned documents hugely expands our technology stack.
With the help of OCR, Compliance Guardian allows users to analyze physical documents much more efficiently via several image enhancement techniques that considerably improve OCR results.
Compliance Guardian’s out-of-the-box optical character recognition (OCR) functionality is geared toward the more common scanned text document scenarios. We’re excited that users will finally be able to extract text content from images and scan for compliance violations on the platform.
That said, there are still challenges in optimizing OCR to work seamlessly for all use cases. For now, accuracy is still evaluated on a case-by-case basis. We’ll continue to work hard on accuracy and optimization improvements, so please stay tuned!
Want more great compliance-related content? Be sure to subscribe to our blog!