Yesian Rohn / Xingsong Ye

I am a Master's student in Computer Science at Fudan University, under the supervision of Prof. Zhineng Chen in the FVL Group. Before that, nothing had happened.

I currently focus on projects involving OCR, multimodal learning, and AIGC (where C means Comedy). Feel free to connect and collaborate with me to explore these fields and put them to work in exciting computer science projects!

News

Publications

WATER

Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods

Xingsong Ye, Yongkun Du, Jiaxin Zhang, Haojie Zhang, Chong Sun, Chen Li, Jing LYU, Zhineng Chen

ECCV, 2026

WordArt (artistic text) features highly customized fonts, textures, and layouts, making WordArt-oriented scene text recognition (WATER) far more challenging than general Scene Text Recognition (STR). To advance this task from both data and model perspectives, we construct WATER-S, a 2M synthetic dataset built via an upgraded rendering pipeline (SynthWordArt) and a Qwen3-VL + Z-Image generation pipeline for diverse coverage, and propose WATERec, which couples an arbitrary-shape visual encoder with an autoregressive decoder to break the fixed-template STR bottleneck.

UnionST

What's Wrong with Synthetic Data for Scene Text Recognition? A Strong Synthetic Engine with Diverse Simulations and Self-Evolution

Xingsong Ye, Yongkun Du, Jiaxin Zhang, Chen Li, Jing LYU, Zhineng Chen

CVPR, 2026

We systematically analyze mainstream rendering-based synthetic datasets and identify their key limitations: insufficient diversity in corpus, font, and layout, which restricts their realism in complex scenarios. To address these issues, we introduce UnionST, a strong data engine that synthesizes text covering a union of challenging samples and better aligns with the complexity observed in the wild. We then construct UnionST-S, a large-scale synthetic dataset with improved simulations in challenging scenarios. Furthermore, we develop a self-evolution learning (SEL) framework for effective real data annotation.

MuSS

MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation

Haojie Zhang, Di Wu, Bingyan Liu, Linjie Zhong, Yuancheng Wei, Xingsong Ye, Nanqing Liu, Yaling Liang

arXiv, 2026

We present MuSS, a large-scale dual-track dataset for multi-shot and subject-to-video (S2V) generation, curated from over 3,000 movies to support complex montage transitions and subject-centric storytelling. We introduce a progressive captioning pipeline that eliminates context conflicts and a cross-shot matching mechanism that removes the copy-paste shortcut in S2V. We further propose the Cinematic Narrative Benchmark with a novel Anti-Copy-Paste Variance (ACP-Var) metric to rigorously evaluate continuous narration and 3D structural consistency.

ICPR 2026 LRLPR

ICPR 2026 Competition on Low-Resolution License Plate Recognition

Rayson Laroca, Valfride Nascimento, Donggun Kim, et al., Xingsong Ye, Yongkun Du, Yuchen Su, Zhineng Chen, et al., David Menotti

ICPR, 2026

We treat low-resolution license plate recognition (LRLPR) as a robust scene text recognition problem. Rather than applying a dedicated super-resolution stage, we feed each LR frame directly to an OCR model and aggregate the five per-frame predictions within a track using a character-level voting strategy that combines prediction frequency with per-character confidence scores.
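The character-level voting idea can be illustrated with a minimal sketch. This is not the competition code: the function name `vote_plate`, the input format, the length-filtering step, and the choice to score each candidate character by its summed confidence (so that both how often it appears and how confident the model is contribute) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def vote_plate(predictions):
    """Aggregate per-frame OCR outputs into one string (hypothetical helper).

    predictions: list of (text, confidences) pairs, one per frame, where
    confidences[i] is the model's confidence for text[i].
    """
    if not predictions:
        return ""

    # Assumption: keep only predictions with the most common length,
    # so that character positions line up across frames.
    length_counts = Counter(len(text) for text, _ in predictions)
    target_len = length_counts.most_common(1)[0][0]
    aligned = [(t, c) for t, c in predictions if len(t) == target_len]

    result = []
    for pos in range(target_len):
        # Summing confidences rewards characters that are both
        # frequent and confidently predicted at this position.
        scores = defaultdict(float)
        for text, confs in aligned:
            scores[text[pos]] += confs[pos]
        result.append(max(scores, key=scores.get))
    return "".join(result)


# Five frames from one track; values are made up for illustration.
track = [
    ("ABC123", [0.9, 0.9, 0.9, 0.9, 0.9, 0.9]),
    ("ABC128", [0.9, 0.9, 0.9, 0.9, 0.9, 0.4]),
    ("A8C123", [0.9, 0.3, 0.9, 0.9, 0.9, 0.9]),
    ("ABC123", [0.8, 0.8, 0.8, 0.8, 0.8, 0.8]),
    ("ABC12", [0.9, 0.9, 0.9, 0.9, 0.9]),
]
print(vote_plate(track))
```

With a summed-confidence score, a single high-confidence character can outvote several low-confidence ones, which is one plausible way to combine frequency and confidence; other weightings are equally possible.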

TextSSR

TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition

Xingsong Ye, Yongkun Du, Yunbo Tao, Zhineng Chen

ICCV, 2025

We introduce TextSSR, a novel framework for Synthesizing Scene Text Recognition data via a diffusion-based universal text region synthesis model. It ensures accuracy by generating text only within a specified image region, which is less complex than the entire image, and by leveraging rich glyph and position information. Furthermore, we use neighboring text within the region as a prompt to capture real-world font styles and layout patterns, guiding the generated text to resemble actual scenes. Finally, thanks to its prompt-free nature and its capability for character-level synthesis, TextSSR scales well.

SIUO

Safe Inputs but Unsafe Output: Benchmarking Cross-modality Safety Alignment of Large Vision-Language Model

Siyin Wang, Xingsong Ye, Qinyuan Cheng, Junwen Duan, Shimin Li, Jinlan Fu, Xipeng Qiu, Xuanjing Huang

NAACL, 2025

We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (SIUO) to evaluate cross-modality safety alignment. Specifically, it considers cases where each modality is safe on its own but their combination could lead to unsafe or unethical outputs.

CL2S

Rethinking the Elementary Function Fusion for Single-Image Dehazing

Yesian Rohn

Course Project of DIP, 2024 / arXiv, 2024

We introduce CL2S, an image dehazing network that addresses limitations of the DM2F baseline by exploring sine functions in the elementary function fusion, as validated through systematic ablation experiments.

DuanzAI

DuanzAI: Slang-Enhanced LLM with Prompt for Humor Understanding

Yesian Rohn

Xiyuan Project of FDUROP, 2023 - 2024 / arXiv, 2024

We enhance LLMs' understanding of Chinese slang using curated datasets and advanced techniques, introducing DuanzAI and its application in the ChatDAI chatbot.

Achievements

Honor

Awards

Internships

Service

Teaching / TA

Referee / Reviewer