Rossum, a London-based intelligent document processing (IDP) technology provider, said it has published the world’s largest research dataset aimed at improving AI models dedicated to business document information extraction. The company said such datasets are vital for improving models but are rare, because the documents they contain are often sensitive and protected.
Rossum claims the DocILE (Document Information Localization and Extraction) benchmark is the largest collection of business documents in the world and will result in faster and more accurate extraction.
“This is an important milestone because it advances IDP research as a whole, where everyone can now develop and test more advanced algorithms on a benchmark of challenging and highly practical tasks,” said Dr. Milan Šulc, head of Rossum’s AI Labs. “The new dataset will increase accuracy levels in document information extraction by accelerating research in areas such as novel machine learning architectures and training objectives. This will ultimately lead to global optimization of business communication and workflows, further increasing the amount of the time saved for our customers.”
The DocILE benchmark was the joint work of Rossum researchers and Czech Technical University in Prague, the University of La Rochelle, and the Autonomous University of Barcelona. It follows the peer-reviewed position paper “Business Document Information Extraction: Towards Practical Benchmarks,” presented by Rossum’s AI Labs at the recent CLEF 2022 conference.