Authors: Yu Xie, Ziyue Wang, Jielei Zhang

Affiliation: Bilibili Inc.

Description: For the end-to-end MapText task, we used ViTAEv2 to extract global features within an encoder-decoder network architecture (DeepSolo). Data augmentation techniques such as cropping, scaling, and saturation and contrast adjustment were applied. The model was pre-trained on publicly available real datasets (TextOCR, TotalText, IC15, MLT2017), fine-tuned on the MapText dataset, and combined with post-processing methods.
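The augmentation steps named above (cropping, scaling, saturation and contrast adjustment) can be sketched in plain Python. This is an illustrative stand-in, not the authors' pipeline (which would more likely use a library such as torchvision); all function names here are hypothetical, and images are nested lists of (R, G, B) tuples.

```python
def crop(img, top, left, h, w):
    # Take an h x w window starting at (top, left).
    return [row[left:left + w] for row in img[top:top + h]]

def scale(img, new_h, new_w):
    # Nearest-neighbour resize to new_h x new_w.
    h, w = len(img), len(img[0])
    return [[img[int(y * h / new_h)][int(x * w / new_w)]
             for x in range(new_w)] for y in range(new_h)]

def _clamp(v):
    return min(255, max(0, round(v)))

def adjust_saturation(img, factor):
    # Blend each pixel with its grey value; factor=1 keeps the image,
    # factor=0 yields greyscale, factor>1 oversaturates.
    out = []
    for row in img:
        new_row = []
        for r, g, b in row:
            grey = 0.299 * r + 0.587 * g + 0.114 * b
            new_row.append(tuple(_clamp(grey + factor * (c - grey))
                                 for c in (r, g, b)))
        out.append(new_row)
    return out

def adjust_contrast(img, factor):
    # Stretch pixel values around the image's mean grey level.
    greys = [0.299 * r + 0.587 * g + 0.114 * b
             for row in img for (r, g, b) in row]
    mean = sum(greys) / len(greys)
    return [[tuple(_clamp(mean + factor * (c - mean)) for c in p)
             for p in row] for row in img]
```

In practice each transform would be applied with randomly sampled parameters (crop position, scale factor, jitter strength) per training image.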

Zhang, Q., Xu, Y., Zhang, J., & Tao, D. (2023). ViTAEv2: Vision transformer advanced by exploring inductive bias for image recognition and beyond. International Journal of Computer Vision, 131(5), 1141-1162.

Ye, M., Zhang, J., Zhao, S., Liu, J., Liu, T., Du, B., & Tao, D. (2023). DeepSolo: Let transformer decoder with explicit points solo for text spotting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 19348-19357).

Method: MapTest (2024-04-29)

Authors: Hongen Liu

Affiliation: Tianjin University

Method: MapTextSpotter (2024-04-29)

Authors: Jialiang Li, Canhui Xu, Cao Shi, Yucai Qu

Affiliation: Qingdao University of Science and Technology

Description: Unlike natural scene text, text on digitized historical maps is densely distributed, rotated and curved, with widely spaced characters. Text instances have multiple granularities that hierarchically represent structured geolocation context. To address these new challenges in map text spotting, we propose a novel unified network, MapTextSpotter, which jointly exploits the distinct characteristics of text detection and recognition. MapTextSpotter uses a single Transformer-based decoder with shared queries, designed spatially and semantically according to the text distribution in historical maps. Point queries and character queries are incorporated and interact during training, so that the model predicts Bezier control points of text instance curves and character classifications in parallel. Notably, densely distributed text instances often come in smaller fonts; we therefore extract multi-scale visual features, including high-resolution detailed convolutional features, which help capture text instances at multiple granularities. Furthermore, a large language model (LLM) equipped with prior knowledge is employed to enhance interaction with contextual information, replacing the lexicon-matching process and significantly boosting recognition precision. For widely spaced words surrounded by complicated text-like noisy distractors, and for word phrases divided across multiple lines, we expect the LLM to alleviate these problems and improve recognition performance by linking instances using prior knowledge.

Ranking Table

Date       | Method                                            | Quality | F-score | Tightness | Recall | Precision
2024-05-06 | MapText Detection and Recognition Strong Pipeline | 60.05%  | 71.32%  | 84.21%    | 67.18% | 76.00%
2024-04-29 | MapTest                                           | 52.32%  | 62.48%  | 83.75%    | 61.70% | 63.27%
2024-04-29 | MapTextSpotter                                    | 41.09%  | 49.61%  | 82.83%    | 46.64% | 52.99%
2024-03-26 | DS-LP                                             | 37.91%  | 52.25%  | 72.55%    | 54.87% | 49.88%
2024-03-26 | Baseline TESTR Checkpoint                         | 27.82%  | 32.90%  | 84.56%    | 31.76% | 34.11%
2024-05-04 | Recognition finetuned from TrOCR                  | 12.19%  | 15.54%  | 78.48%    | 13.42% | 18.46%
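The table's columns appear numerically consistent with a common reading of such metrics: F-score as the harmonic mean of precision and recall, and Quality as F-score scaled by Tightness. This is an inference from the numbers, not a definition stated on this page; the sketch below only checks that reading against the first row.

```python
def f_score(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

def quality(f, tightness):
    # Assumed composition: detection/recognition F-score
    # down-weighted by how tightly boxes fit the ground truth.
    return f * tightness
```

For the top row: f_score(0.7600, 0.6718) ≈ 0.7132 and 0.7132 × 0.8421 ≈ 0.6005, matching the reported 71.32% and 60.05% to rounding.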

Ranking Graphic
