Publications
See Google Scholar for more
* denotes equal contributions
# denotes corresponding authors
2025
- [Under Review] How Do LLMs Handle Emotional Content in Video Game Localization? A Computational Linguistic Approach. Xiaojing Zhao, Emmanuele Chersoni, Chu-Ren Huang, and 1 more author. 2025.
- [Perspectives] Translating Vulgar Language in Video Game Localization: A Case Study of Black Myth: Wukong. Xiaojing Zhao, Emmanuele Chersoni, Chu-Ren Huang, and 1 more author. 2025.
This study examines the translation of vulgar language in the localization of the video game Black Myth: Wukong (Feng, 2024), using a corpus of original Chinese subtitles and their English translation. It investigates how the offensiveness of vulgar language varies after translation, the strategies employed, and their impact on the localization process. The findings reveal that the overall offensiveness of vulgar language in the source text was mitigated, suggesting an alignment with the domestication approach to address cultural sensitivities. Specifically, vulgar language related to religion, gender, and social morals tended to be mitigated through softening, omission, or implicitation. On the other hand, vulgar language that enhances character interactions was mostly preserved or intensified to emphasize the game’s themes, enrich character development, and drive the plot, thereby improving player immersion. These results highlight the delicate balance localization attempts to achieve between translation accuracy and cultural acceptability, contributing to the creation of a cross-cultural gaming experience.
- [LM4DH@RANLP] Can LLMs Help Sun Wukong in his Journey to the West? A Case Study of Language Models in Video Game Localization. Xiaojing Zhao, Han Xu, Huacheng Song, and 2 more authors. In LM4DH@RANLP, 2025.
Large language models (LLMs) have demonstrated increasing proficiency in general-purpose translation, yet their effectiveness in creative domains such as game localization remains underexplored. This study examines the role of LLMs in game localization in terms of both linguistic quality and sociocultural adequacy through a case study of the video game Black Myth: Wukong. Results indicate that LLMs demonstrate adequate competence in accuracy and fluency, achieving performance comparable to human translators. However, limitations remain in the literal translation of culture-specific terms and offensive language, so human oversight is required to ensure nuanced cultural authenticity and sensitivity. Insights from human evaluations also suggest that current automatic metrics and the Multidimensional Quality Metrics framework may be inadequate for evaluating creative translation. Finally, varying human preferences in localization create ambiguity that makes it difficult for LLMs to learn optimal translation strategies. The findings highlight the potential and shortcomings of LLMs as collaborative tools in game localization workflows. Data are available at https://github.com/zcocozz/wukong-localization.
- [ACL] Learning to Look at the Other Side: A Semantic Probing Study of Word Embeddings in LLMs with Enabled Bidirectional Attention. Zhaoxin Feng, Jianfei Ma, Emmanuele Chersoni, and 2 more authors. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul 2025.
Autoregressive Large Language Models (LLMs) demonstrate exceptional performance in language understanding and generation. However, their application to text embedding tasks has been relatively slow, as has the analysis of their semantic representations in probing tasks, due to the constraints of the unidirectional attention mechanism. This paper explores whether such constraints can be overcome by enabling bidirectional attention in LLMs. We tested different variants of the Llama architecture through additional training steps, progressively enabling bidirectional attention and unsupervised/supervised contrastive learning. Our results show that bidirectional attention improves the LLMs’ ability to represent subsequent context but weakens their utilization of preceding context, while contrastive learning training can help to maintain both abilities.
@inproceedings{feng-etal-2025-learning,
  title = {Learning to Look at the Other Side: A Semantic Probing Study of Word Embeddings in LLMs with Enabled Bidirectional Attention},
  author = {Feng, Zhaoxin and Ma, Jianfei and Chersoni, Emmanuele and Zhao, Xiaojing and Bao, Xiaoyi},
  editor = {Che, Wanxiang and Nabende, Joyce and Shutova, Ekaterina and Pilehvar, Mohammad Taher},
  booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month = jul,
  year = {2025},
  address = {Vienna, Austria},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2025.acl-long.1132/},
  doi = {10.18653/v1/2025.acl-long.1132},
  pages = {23226--23245},
  isbn = {979-8-89176-251-0},
}
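To make the change probed in this paper concrete, here is a minimal, self-contained Python sketch (not the authors' code) of the single intervention it studies: dropping the causal mask so every token can attend to subsequent context. The identity q/k/v projections and toy dimensions are illustrative assumptions.

import torch
import torch.nn.functional as F

def self_attention(x, causal=True):
    # Single-head self-attention over x of shape (seq_len, dim).
    # Identity q/k/v projections keep the sketch small; real models learn these.
    seq_len, dim = x.shape
    scores = (x @ x.T) / dim ** 0.5
    if causal:
        # Standard LLM setting: each token attends only to itself and earlier positions.
        future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    return F.softmax(scores, dim=-1) @ x

x = torch.randn(5, 8)
causal_out = self_attention(x, causal=True)   # unidirectional (standard autoregressive LLM)
bidir_out = self_attention(x, causal=False)   # bidirectional variant probed in the paper
# With the mask removed, the first token's representation now also reflects
# subsequent context, so the two outputs differ:
print(torch.allclose(causal_out[0], bidir_out[0]))  # False in general

In the paper this mask change is combined with additional training steps and contrastive learning; removing the mask at inference time, as above, only illustrates the altered information flow.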
2023
- [Preprint] ArguGPT: evaluating, understanding and identifying argumentative essays generated by GPT models. Yikang Liu, Ziyin Zhang, Wanyang Zhang, and 5 more authors. arXiv preprint arXiv:2304.07666, Jul 2023.
AI generated content (AIGC) presents a considerable challenge to educators around the world. Instructors need to be able to detect such text generated by large language models, either with the naked eye or with the help of tools. There is also a growing need to understand the lexical, syntactic and stylistic features of AIGC. To address these challenges in English language teaching, we first present ArguGPT, a balanced corpus of 4,038 argumentative essays generated by 7 GPT models in response to essay prompts from three sources: (1) in-class or homework exercises, (2) TOEFL and (3) GRE writing tasks. Machine-generated texts are paired with a roughly equal number of human-written essays at three score levels, matched in essay prompts. We then hire English instructors to distinguish machine essays from human ones. Results show that when first exposed to machine-generated essays, the instructors achieve an accuracy of only 61% in detecting them, rising to 67% after one round of minimal self-training. Next, we perform linguistic analyses of these essays, which show that machines produce sentences with more complex syntactic structures while human essays tend to be lexically more complex. Finally, we test existing AIGC detectors and build our own detectors using SVMs and RoBERTa. Results suggest that a RoBERTa model fine-tuned on the ArguGPT training set achieves above 90% accuracy in both essay- and sentence-level classification. To the best of our knowledge, this is the first comprehensive analysis of argumentative essays produced by generative large language models. Machine-authored essays in ArguGPT and our models will be made publicly available at https://github.com/huhailinguist/ArguGPT.
@article{liu2023argugpt,
  title = {ArguGPT: evaluating, understanding and identifying argumentative essays generated by GPT models},
  author = {Liu, Yikang and Zhang, Ziyin and Zhang, Wanyang and Yue, Shisen and Zhao, Xiaojing and Cheng, Xinyuan and Zhang, Yiwen and Hu, Hai},
  journal = {arXiv preprint arXiv:2304.07666},
  year = {2023},
}
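As a companion to the abstract above, a hedged sketch of a RoBERTa-based detector of the kind this paper describes. The CSV file names, column names ("text", "label" with 0 = human, 1 = machine), and hyperparameters are assumptions for illustration, not the paper's actual setup; the real corpus is at the GitHub link above.

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Assumed export of the corpus: CSVs with a "text" column and a binary "label".
data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=512),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="argugpt-detector",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data["train"],
    eval_dataset=data["test"],
)
trainer.train()
print(trainer.evaluate())  # eval loss on the held-out split; add compute_metrics for accuracy

Sentence-level classification, also reported in the paper, would follow the same recipe with essays split into sentences before tokenization.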