|
Please use this identifier to cite or link to this item:
http://hdl.handle.net/10174/39387
|
Title: | Portuguese Archives Handwritten text recognition of passport requisitions |
Authors: | Melo, Dora; Pimenta Rodrigues, Irene; Ferreira, Lígia |
Keywords: | handwritten recognition; document annotation; artificial intelligence; data analysis |
Issue Date: | 11-Jul-2024 |
Publisher: | Universidade de Évora |
Citation: | Melo, D., Rodrigues, I. P., Ferreira, L. Portuguese Archives Handwritten text recognition of passport requisitions. In Anjos, A., Minhós, F., Carapau, F., Bezzeghoud, M., Correio, P., Oliveira, R. J., Abreu, S. (2024). Book of Abstracts: 2nd International Workshop on Mathematics and Physical Sciences, Universidade de Évora, Évora. |
Abstract: | The DigitArq platform is the Portuguese national archive system, which uses well-established
description standards, namely ISAD(G) (General International Standard Archival Description)
and ISAAR(CPF) (International Standard Archival Authority Record for Corporate
Bodies, Persons and Families), with a hierarchical structure adapted to the nature of
archival assets. One task of the EPISA project was the migration of the
DigitArq information into a linked open data model, CIDOC-CRM [5]. This task included
representing the textual description held in the ISAD(G) element ‘Scope and Content’ by
extracting the information from natural-language text. The dataset for handwritten
recognition has 1000 registers, each with a digital representation, a text description of the
digital content, and the semantic representation in CIDOC-CRM of the text description [6]. This
information enables the automatic evaluation of handwritten recognition and can be used
to improve the performance of handwritten recognition through the use of semantic
information. The handwritten data was selected from a set of registers in the Portuguese
National Archive that have a digital representation (a JPG file). The registers were chosen
from those that have a text transcription of the digital representation in the DigitArq platform.
Handwritten text recognition is an important task in computer vision that has received
considerable attention in recent years [1,2]. In our approach, the open-source document
processing platform ArkIndex [3,4] (https://teklia.com/our-solutions/arkindex/) is
used to automate a document recognition system adapted to the passport registers
with digital representation. Initially, a corpus of 100 registers was built and manually
annotated to represent the structure of the pages (pages, text zones, and text zone
transcriptions), and then used to produce an automatic transcription of the handwritten
text. The evaluation of the described approach shows promising results, which confirm that the
initial annotated corpus can be used to obtain a general tool for processing the passport
registers in DigitArq. |
URI: | http://hdl.handle.net/10174/38651 http://hdl.handle.net/10174/39387 |
Type: | lecture |
Appears in Collections: | INF - Comunicações - Em Congressos Científicos Internacionais
|