published on 06/14/2026 at 10:47 on LIGHTON

LightOn Expands OCR Model to Arabic with Targeted Training

LightOn has extended its document understanding model, LightOnOCR-2, to support Arabic. The adaptation was achieved through targeted fine-tuning on data produced by a synthetic data generation pipeline. The dataset included 12,000 synthetic pages with reference transcriptions, and the result highlights the model’s ability to handle the complexities of Arabic script.

Arabic OCR is challenging because the script is written right to left, its letters are cursive, and Arabic is underrepresented in datasets compared with Latin-script languages. The development aims to ease document processing for organizations in the Middle East by offering an enterprise-grade, open-source solution under the Apache 2.0 license.

Guides for the fine-tuning process are available on LightOn's Hugging Face space, making the method accessible to users and extending the model's potential applications. LightOnOCR-2 remains central to LightOn's self-service offering, LightOn Console, ensuring a consistent technological foundation.

R. H.

Copyright © 2026 FinanzWire, all reproduction and representation rights reserved.
Disclaimer: although drawn from the best sources, the information and analyses disseminated by FinanzWire are provided for informational purposes only and in no way constitute an incentive to take a position on the financial markets.

