Yesian Rohn

Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods

Xingsong Ye, Yongkun Du, Jiaxin Zhang, Haojie Zhang, Chong Sun, Chen Li, Jing LYU, Zhineng Chen

ECCV, 2026

WordArt (artistic text) features highly customized fonts, textures, and layouts, making WordArt-oriented scene text recognition (WATER) far more challenging than general Scene Text Recognition (STR). To advance this task from both data and model perspectives, we construct WATER-S, a 2M synthetic dataset built via an upgraded rendering pipeline (SynthWordArt) and a Qwen3-VL + Z-Image generation pipeline for diverse coverage, and propose WATERec, which couples an arbitrary-shape visual encoder with an autoregressive decoder to break the fixed-template STR bottleneck.

What's Wrong with Synthetic Data for Scene Text Recognition? A Strong Synthetic Engine with Diverse Simulations and Self-Evolution

Xingsong Ye, Yongkun Du, Jiaxin Zhang, Chen Li, Jing LYU, Zhineng Chen

CVPR, 2026

arXiv/ code/ dataset

We systematically analyze mainstream rendering-based synthetic datasets and identify their key limitations: insufficient diversity in corpus, font, and layout, which restricts their realism in complex scenarios. To address these issues, we introduce UnionST, a strong data engine that synthesizes text covering a union of challenging samples and better aligns with the complexity observed in the wild. We then construct UnionST-S, a large-scale synthetic dataset with improved simulations in challenging scenarios. Furthermore, we develop a self-evolution learning (SEL) framework for effective real data annotation.

TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition

Xingsong Ye, Yongkun Du, Yunbo Tao, Zhineng Chen

ICCV, 2025

project page/ arXiv/ code/ demo

We introduce TextSSR, a novel framework for Synthesizing Scene Text Recognition data via a diffusion-based universal text region synthesis model. It ensures accuracy by generating text only within a specified image region, which is less complex than the entire image, and by leveraging rich glyph and position information. Furthermore, it uses neighboring text within the region as a prompt to capture real-world font styles and layout patterns, guiding the generated text to resemble actual scenes. Finally, thanks to its prompt-free nature and its capability for character-level synthesis, TextSSR scales well.

MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation

Haojie Zhang, Di Wu, Bingyan Liu, Linjie Zhong, Yuancheng Wei, Xingsong Ye, Nanqing Liu, Yaling Liang

ACM MM, 2026

project page/ arXiv/ code

We present MuSS, a large-scale dual-track dataset for multi-shot and subject-to-video (S2V) generation, curated from over 3,000 movies to support complex montage transitions and subject-centric storytelling. We introduce a progressive captioning pipeline that eliminates context conflicts and a cross-shot matching mechanism that removes the copy-paste shortcut in S2V. We further propose the Cinematic Narrative Benchmark with a novel Anti-Copy-Paste Variance (ACP-Var) metric to rigorously evaluate continuous narration and 3D structural consistency.

ICPR 2026 Competition on Low-Resolution License Plate Recognition

Rayson Laroca, Valfride Nascimento, Donggun Kim, et al., Xingsong Ye, Yongkun Du, Yuchen Su, Zhineng Chen, et al., David Menotti

ICPR, 2026

competition page/ arXiv

We treat low-resolution license plate recognition (LRLPR) as a robust scene text recognition problem. Rather than applying a dedicated super-resolution stage, we feed each low-resolution (LR) frame directly to an OCR model and aggregate the five per-frame predictions within a track using a character-level voting strategy that combines prediction frequency and per-character confidence scores.
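The aggregation step described above could be sketched roughly as follows. This is only a minimal illustration, not the competition entry's code: the function name `vote_track`, the input layout, the length-filtering step, and the choice to score each character by its summed confidence are all assumptions, since the description only says that the vote combines prediction frequency and per-character confidence.

```python
from collections import Counter, defaultdict


def vote_track(predictions):
    """Aggregate per-frame OCR outputs for one track (hypothetical helper).

    predictions: list of (text, confidences) pairs, one per frame, where
    confidences[i] is the model's confidence for text[i].
    """
    if not predictions:
        return ""

    # Assumption: keep only predictions with the most common length,
    # so that character positions line up across frames.
    lengths = Counter(len(text) for text, _ in predictions)
    target_len = lengths.most_common(1)[0][0]
    aligned = [(t, c) for t, c in predictions if len(t) == target_len]

    result = []
    for pos in range(target_len):
        scores = defaultdict(float)
        for text, confs in aligned:
            # Summing confidences rewards characters that are both
            # frequent and confidently predicted (assumed scoring rule).
            scores[text[pos]] += confs[pos]
        result.append(max(scores, key=scores.get))
    return "".join(result)


# Five frames from one track; values are made up for illustration.
frames = [
    ("ABC1234", [0.9] * 7),
    ("ABC1284", [0.9, 0.9, 0.9, 0.9, 0.9, 0.4, 0.9]),
    ("A8C1234", [0.9, 0.3, 0.9, 0.9, 0.9, 0.8, 0.9]),
    ("ABC123", [0.9] * 6),
    ("ABC1234", [0.8] * 7),
]
print(vote_track(frames))  # prints ABC1234
```

With this scoring rule, a single high-confidence character can outvote several low-confidence ones, which is one plausible way to let confidence temper a pure majority vote.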

Safe Inputs but Unsafe Output: Benchmarking Cross-modality Safety Alignment of Large Vision-Language Model

Siyin Wang, Xingsong Ye, Qinyuan Cheng, Junwen Duan, Shimin Li, Jinlan Fu, Xipeng Qiu, Xuanjing Huang

NAACL, 2025

project page/ arXiv/ code/ dataset

We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (SIUO) to evaluate cross-modality safety alignment. Specifically, it considers cases where each modality is safe on its own but the combination could lead to unsafe or unethical outputs.

Rethinking the Elementary Function Fusion for Single-Image Dehazing

Yesian Rohn

Course Project of DIP, 2024 / arXiv, 2024

arXiv/ code

We introduce CL2S, an image dehazing network that overcomes limitations of the DM2F baseline by using sine functions, as validated through systematic ablation experiments.

DuanzAI: Slang-Enhanced LLM with Prompt for Humor Understanding

Yesian Rohn

Xiyuan Project of FDUROP, 2023 - 2024 / arXiv, 2024

project page/ arXiv/ code

We enhance LLMs' understanding of Chinese slang using curated datasets and advanced techniques, introducing DuanzAI and its application in the ChatDAI chatbot.

Yesian Rohn / Xingsong Ye
