Published: 10:21, April 1, 2025
Guideline to develop AI-backed Chinese language database
By Zhao Yimeng

Digitalization of ancient texts promotes cultural heritage, Mandarin learning

China is accelerating the digitalization of ancient texts and boosting access to oracle bone script data, aiming to integrate cultural heritage with digital Chinese, officials said on Monday.

The Ministry of Education, the National Language Commission and the Cyberspace Administration of China issued a guideline to promote the digitalization of the Chinese language and characters. The focus is on developing national language resources and large-scale Chinese language models to support artificial intelligence.

The guideline aims to establish a national corpus and strategic language resources information database by 2027. By 2035, the country hopes it will have significantly expanded the presence of the Chinese language in global digital and generative AI scenarios.

Liu Peijun, head of the Department of Language Information Management at the Ministry of Education, said the guideline calls for the digitalization of linguistic and cultural heritage, while promoting the construction of a national digital language and script museum.

It emphasizes advancing key technologies for ancient text digitalization, enhancing the accessibility of oracle bone script data and launching a multilingual digital education program to facilitate Chinese language learning globally, Liu said at a news conference.

A key aspect of this initiative is the development of large-scale linguistic data resources. The guideline outlines a plan to build a national corpus with extensive Chinese language datasets to support AI applications.

Among the pilot projects, Beijing Normal University has launched a large-scale Classical Chinese language model, an AI-driven initiative that sets a new benchmark in the field, Liu said.

Kang Zhen, vice-president of BNU, said the university has developed a range of digital language databases, including a comprehensive holographic Chinese character database, a digital resource of the ancient Chinese dictionary Shuowen Jiezi, and repositories for ancient inscriptions and handwritten texts.

These resources have played a crucial role in linguistic research and cultural preservation, Kang added.

The university's AI Taiyan, a Classical Chinese large language model with 1.8 billion parameters, is designed for high-accuracy interpretation of ancient texts. It supports tasks such as explaining words and phrases and translating Classical Chinese into modern Chinese.

China is also spearheading the construction of a new national corpus to strengthen linguistic infrastructure in the AI era, said Wang Hui, deputy head of the Ministry of Education's Department of Language Application and Administration.

"Currently, most linguistic datasets remain limited to single-text formats and specific academic domains, lacking the scale and diversity required for AI applications," Wang said.

The department has begun planning for the corpus this year and seeks to launch two flagship databases: the Chinese civilization corpus for AI-assisted teaching and research, and the Chinese grand reading system corpus, Wang said.

zhaoyimeng@chinadaily.com.cn