Instances of sensitive data being exfiltrated through image files (e.g., JPEGs, TIFFs, BMPs, PNGs) are on the increase, making Optical Character Recognition (OCR) an important consideration for organizations evaluating Data Loss Prevention (DLP) solutions.

The risks range from accidental over-sharing of scanned documents, such as invoices, credit card data, and healthcare insurance applications, to malicious attempts to photograph or screenshot sensitive information and confidential on-screen data. OCR is needed to extract text and potentially sensitive content from an image so that it can be analyzed by your DLP policy.

As we’ve already covered in an earlier blog post, there are two different approaches to managing OCR for DLP: server-based and endpoint-based. For many companies, a server-based approach is heavily compromised by limitations including no offline protection, latency, and limited protection across exit points on the endpoint. It’s essential to regularly save and back up DLP configurations to avoid data loss.

This post goes further and looks at the limitations of Microsoft Purview Data Loss Prevention, which leverages server-based OCR for DLP.

1. Inconsistency in file types supported

Image scanning is available for Microsoft Teams, Exchange, SharePoint, OneDrive, and Windows devices, with JPEG, JPG, PNG, BMP, TIFF, and PDF (image only) file formats supported.

However, there are variations in the file types that leverage OCR depending on the file location. For example, SharePoint and OneDrive only support JPEG, JPG, PNG, and BMP (no TIFF or PDF) files.
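
The variation described above can be sketched as a small lookup table, which may be useful when auditing which files a policy will actually scan. This is a minimal illustration, not Microsoft's API: the names `OCR_FORMATS` and `is_ocr_eligible` are hypothetical, and the entries for Exchange, Teams, and Windows assume those locations accept the full list given above, since only SharePoint and OneDrive are singled out as reduced.

```python
from pathlib import PurePath

# Formats listed as supported overall.
ALL_FORMATS = {"jpeg", "jpg", "png", "bmp", "tiff", "pdf"}
# SharePoint and OneDrive are described as excluding TIFF and PDF.
REDUCED_FORMATS = ALL_FORMATS - {"tiff", "pdf"}

# Assumption: locations not singled out accept the full list.
OCR_FORMATS = {
    "exchange": ALL_FORMATS,
    "teams": ALL_FORMATS,
    "windows": ALL_FORMATS,
    "sharepoint": REDUCED_FORMATS,
    "onedrive": REDUCED_FORMATS,
}


def is_ocr_eligible(filename: str, location: str) -> bool:
    """Return True if the file extension is in the OCR list for the location."""
    ext = PurePath(filename).suffix.lower().lstrip(".")
    # Unknown locations are treated as not covered by OCR.
    return ext in OCR_FORMATS.get(location.lower(), set())
```

For instance, `is_ocr_eligible("scan.tiff", "sharepoint")` returns `False` under these assumptions, while the same file checked against `"exchange"` returns `True`.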

2. You have to pay to use OCR for Microsoft Purview Data Loss Prevention

Because this is a server-based approach, Microsoft charges for OCR scanning. Your organization will need to subscribe to OCR through Microsoft Syntex pay-as-you-go billing before your DLP policy can be applied to images. Current pricing is $1 (USD) for every 1,000 images scanned, but be aware that each page of a PDF counts as a separate scan. The system uses exact data matching to ensure precision and has mechanisms to reduce false positives. Integration via an API might bring additional functionality. Currently, the system primarily supports English text detection.
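
To get a feel for how quickly PDF pages add up, the pricing above can be turned into a rough estimate. This is a back-of-the-envelope sketch only: `estimate_ocr_cost` is a hypothetical helper, the rate is the figure quoted in this post and may change, and it assumes charges scale linearly per scan rather than being rounded to blocks of 1,000.

```python
from decimal import Decimal
from typing import Iterable

# Rate quoted in this post; check current Microsoft pricing before relying on it.
PRICE_PER_1000_SCANS_USD = Decimal("1.00")


def estimate_ocr_cost(image_count: int, pdf_page_counts: Iterable[int]) -> Decimal:
    """Estimate OCR cost in USD, counting every PDF page as one scan."""
    scans = image_count + sum(pdf_page_counts)
    # Assumption: linear per-scan billing, rounded to the nearest cent.
    return (PRICE_PER_1000_SCANS_USD * scans / 1000).quantize(Decimal("0.01"))
```

Under these assumptions, 5,000 standalone images plus three 200-page scanned PDFs amount to 5,600 scans, or about $5.60.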

3. Inconsistency in file sizes supported

Image files must be no larger than 20 MB for Exchange and Teams. For SharePoint, OneDrive, and Windows endpoints, the maximum image file size is 50 MB.
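
A simple pre-check makes the difference between locations concrete. The sketch below is illustrative only: `MAX_OCR_SIZE_BYTES` and `within_ocr_size_limit` are hypothetical names, and it assumes "MB" means 1,048,576 bytes, which is not specified here.

```python
MB = 1024 * 1024  # assumption: "MB" is treated as 2**20 bytes

# Maximum image size per location, as stated above.
MAX_OCR_SIZE_BYTES = {
    "exchange": 20 * MB,
    "teams": 20 * MB,
    "sharepoint": 50 * MB,
    "onedrive": 50 * MB,
    "windows": 50 * MB,
}


def within_ocr_size_limit(size_bytes: int, location: str) -> bool:
    """Return True if a file of this size is small enough for OCR at the location."""
    limit = MAX_OCR_SIZE_BYTES.get(location.lower())
    # Unknown locations are treated as not covered.
    return limit is not None and size_bytes <= limit
```

A 30 MB scan would pass this check for SharePoint but fail it for Teams, so the same image could be inspected in one place and skipped in another.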

4. Daily limits can stop DLP policies from running

When OCR functionality is enabled for Windows devices, there is a default daily limit of 1024 MB of data that can be sent to the cloud for OCR scanning. OCR stops scanning images once this daily limit is reached, so admins should be prepared to troubleshoot when it happens and will need the right permissions to adjust the bandwidth limit.
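
The effect of the daily cap can be modeled with a small counter. This is a simplified local model, not how Purview is implemented: the `DailyOcrBudget` class is hypothetical, and the exact cutoff behavior (here, a file is skipped if it would push usage past the limit) and the reset schedule are assumptions.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: "MB" is treated as 2**20 bytes


@dataclass
class DailyOcrBudget:
    """Tracks how much data has been sent for OCR against a daily cap."""

    limit_bytes: int = 1024 * MB  # default daily limit quoted above
    used_bytes: int = 0

    def try_scan(self, size_bytes: int) -> bool:
        """Record the upload and return True if it fits in today's budget."""
        if self.used_bytes + size_bytes > self.limit_bytes:
            return False  # over the cap: the image is not scanned
        self.used_bytes += size_bytes
        return True

    def remaining_bytes(self) -> int:
        """Return how many bytes can still be sent today."""
        return self.limit_bytes - self.used_bytes
```

With the default cap, roughly twenty 50 MB images exhaust the budget, after which any further images that day would go unscanned in this model.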

5. Limited coverage of exit points

OCR for Microsoft Purview Data Loss Prevention is limited to scanning image files either stored within Microsoft repositories (such as SharePoint and OneDrive) or passing through Microsoft apps (such as Teams and Outlook). There may be many other exit points on the endpoint that will not be covered by OCR functionality, presenting an immediate risk, especially if high-resolution screenshots or photographs are used to capture data.

6. No offline capabilities

Because it’s a server-based approach, the endpoint needs an Internet connection for OCR to work. This means offline protection isn’t available. For example, a remote employee could go offline and exfiltrate sensitive data via removable storage or a printer without any image being scanned.

7. No OCR support on macOS endpoints

OCR for Microsoft Purview Data Loss Prevention can be extended to Windows devices. However, it cannot be applied to macOS endpoints. Anyone using Apple Mac hardware will be limited to OCR protection for files and sensitive content stored in the Microsoft Cloud (i.e., OneDrive and SharePoint).

About Endpoint Protector by CoSoSys

As the industry’s leading endpoint DLP and Device Control solution for Microsoft Windows, macOS, and Linux, Endpoint Protector offers significant advantages for organizations that want to benefit from best-in-class OCR capabilities, need their DLP policies enabled for offline information protection, or simply have privacy concerns over their sensitive content being passed to cloud services for scanning.

For more information on OCR capabilities within DLP, read our blog post discussing the pros and cons of endpoint vs server-based OCR scanning, the impact of latency on end-users, and the performance benefits of hardware-based processing to improve data security across your sensitive content and files.



