Authors: Yu Xie, Ziyue Wang, Jielei Zhang

Affiliation: Bilibili Inc.

Description: For the end-to-end MapText task, we used ViTAEv2 to extract global features within an encoder-decoder network architecture (DeepSolo). Data augmentation techniques such as cropping, scaling, and saturation and contrast adjustment were applied. The model was pre-trained on publicly available real datasets (TextOCR, TotalText, IC15, MLT2017), fine-tuned on the MapText dataset, and combined with post-processing methods.
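The augmentation steps named above (cropping, scaling, saturation and contrast adjustment) can be sketched in plain Python. This is an illustrative stand-in, not the authors' pipeline (which would more likely use a library such as torchvision); all function names here are hypothetical, and images are nested lists of (R, G, B) tuples.

```python
def crop(img, top, left, h, w):
    # Take an h x w window starting at (top, left).
    return [row[left:left + w] for row in img[top:top + h]]

def scale(img, new_h, new_w):
    # Nearest-neighbour resize to new_h x new_w.
    h, w = len(img), len(img[0])
    return [[img[int(y * h / new_h)][int(x * w / new_w)]
             for x in range(new_w)] for y in range(new_h)]

def _clamp(v):
    return min(255, max(0, round(v)))

def adjust_saturation(img, factor):
    # Blend each pixel with its grey value; factor=1 keeps the image,
    # factor=0 yields greyscale, factor>1 oversaturates.
    out = []
    for row in img:
        new_row = []
        for r, g, b in row:
            grey = 0.299 * r + 0.587 * g + 0.114 * b
            new_row.append(tuple(_clamp(grey + factor * (c - grey))
                                 for c in (r, g, b)))
        out.append(new_row)
    return out

def adjust_contrast(img, factor):
    # Stretch pixel values around the image's mean grey level.
    greys = [0.299 * r + 0.587 * g + 0.114 * b
             for row in img for (r, g, b) in row]
    mean = sum(greys) / len(greys)
    return [[tuple(_clamp(mean + factor * (c - mean)) for c in p)
             for p in row] for row in img]
```

In practice each transform would be applied with randomly sampled parameters (crop position, scale factor, jitter strength) per training image.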

Zhang, Q., Xu, Y., Zhang, J., & Tao, D. (2023). ViTAEv2: Vision transformer advanced by exploring inductive bias for image recognition and beyond. International Journal of Computer Vision, 131(5), 1141-1162.

Ye, M., Zhang, J., Zhao, S., Liu, J., Liu, T., Du, B., & Tao, D. (2023). DeepSolo: Let transformer decoder with explicit points solo for text spotting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 19348-19357).

Method: MapTest (2024-04-29)

Authors: Hongen Liu

Affiliation: Tianjin University

Method: MapTextSpotter (2024-04-29)

Authors: Jialiang Li, Canhui Xu, Cao Shi, Yucai Qu

Affiliation: Qingdao University of Science and Technology

Description: Unlike natural scene text, text on digitized historical maps is densely distributed, rotated and curved, with widely spaced characters. Text instances have multiple granularities that hierarchically represent structured geolocation context. To address these new challenges in map text spotting, we propose a novel unified network, MapTextSpotter, which jointly exploits the distinct characteristics of text detection and recognition. MapTextSpotter uses a single Transformer-based decoder with shared queries, designed spatially and semantically according to the text distribution in historical maps. Point queries and character queries are incorporated and interact during training, so that the model predicts Bezier control points of text instance curves and character classifications in parallel. Notably, densely distributed text instances often come in smaller fonts; we therefore extract multi-scale visual features, including high-resolution detailed convolutional features, which help capture text instances at multiple granularities. Furthermore, a large language model (LLM) equipped with prior knowledge is employed to enhance interaction with contextual information, replacing the lexicon-matching process and significantly boosting recognition precision. For widely spaced words surrounded by complicated text-like noisy distractors, and for word phrases divided across multiple lines, we expect the LLM to alleviate these problems and improve recognition performance by linking instances using prior knowledge.

Ranking Table

Date       | Method                                            | Quality | F-score | Tightness | Recall | Precision
2024-05-06 | MapText Detection and Recognition Strong Pipeline | 60.05%  | 71.32%  | 84.21%    | 67.18% | 76.00%
2024-04-29 | MapTest                                           | 52.32%  | 62.48%  | 83.75%    | 61.70% | 63.27%
2024-04-29 | MapTextSpotter                                    | 41.09%  | 49.61%  | 82.83%    | 46.64% | 52.99%
2024-03-26 | DS-LP                                             | 37.91%  | 52.25%  | 72.55%    | 54.87% | 49.88%
2024-03-26 | Baseline TESTR Checkpoint                         | 27.82%  | 32.90%  | 84.56%    | 31.76% | 34.11%
2024-05-04 | Recognition finetuned from TrOCR                  | 12.19%  | 15.54%  | 78.48%    | 13.42% | 18.46%
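The table's columns appear numerically consistent with a common reading of such metrics: F-score as the harmonic mean of precision and recall, and Quality as F-score scaled by Tightness. This is an inference from the numbers, not a definition stated on this page; the sketch below only checks that reading against the first row.

```python
def f_score(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

def quality(f, tightness):
    # Assumed composition: detection/recognition F-score
    # down-weighted by how tightly boxes fit the ground truth.
    return f * tightness
```

For the top row: f_score(0.7600, 0.6718) ≈ 0.7132 and 0.7132 × 0.8421 ≈ 0.6005, matching the reported 71.32% and 60.05% to rounding.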

Ranking Graphic
