Alibaba taps machine learning to digitise Chinese classics

Alibaba’s research arm Damo Academy has developed a machine learning model to convert scanned pages of ancient Chinese classics into text

Alibaba’s research arm Damo Academy has developed a machine learning model that uses optical character recognition (OCR) to turn characters in ancient Chinese classics into text.

Ancient Chinese characters are complex, which makes digitising them a challenge. Over the course of history, a single Chinese character might have had several variants and written forms. Digitising ancient Chinese books using OCR not only facilitates machine reading, but also enables wider access to ancient book collections.

The digitisation effort is part of a joint programme between Damo Academy, Alibaba Foundation, the library of the University of California, Berkeley, Sichuan University, the National Library of China and Zhejiang Library.

The joint programme aims to digitise and aggregate ancient Chinese books, and convert scanned images into text for open access. This way, libraries in China and abroad can work together to make their ancient Chinese books freely available to the world, Alibaba said.

“Alibaba will continue to invest in resources and cutting-edge technology to support such projects,” said Jeff Zhang, head of Alibaba Damo Academy. “Making ancient books available to the public is in line with our values and belief in ‘tech for change’.

“We believe that technology can play a critical role in preserving precious cultural relics and heritage, and we look forward to working with libraries in China and abroad to make this happen,” he added.

The first batch of Chinese classics in this joint effort came from the CV Starr East Asian Library of the University of California, Berkeley, one of the largest academic libraries with a rich collection of ancient Chinese books. The library provided scanned pages of the books as well as their metadata.

So far, 200,000 pages of ancient books have been digitised and are now on display. They include woodblock-printed books and manuscripts from the Song and Yuan dynasties, dating back more than 1,000 years. Other materials include digitised pages of an original volume of Siku Quanshu, The Complete Works of Chinese Classics from the Qing dynasty.

Building on this work, Damo Academy has also teamed up with scholars at Sichuan University to develop an artificial intelligence (AI) model that supports single-character indexing and automatic character grouping, drawing on machine learning techniques such as self-supervised learning and few-shot learning.
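Few-shot learning is relevant here because many character variants appear only a handful of times in the source material. Alibaba has not published the details of its model, so the following is only a minimal illustrative sketch of one common few-shot approach, nearest-prototype classification. The three-dimensional embedding vectors, the function names and the toy data are all hypothetical; in a real system the embeddings would come from a trained image encoder.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_prototypes(support):
    """Map each character label to the mean of its few example embeddings."""
    return {label: mean_vector(vecs) for label, vecs in support.items()}

def classify(embedding, prototypes):
    """Return the label whose prototype is most similar to the embedding."""
    return max(prototypes, key=lambda label: cosine(embedding, prototypes[label]))

# Hypothetical toy data: two labelled examples per character.
support = {
    "山": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "水": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
prototypes = build_prototypes(support)
predicted = classify([0.85, 0.15, 0.05], prototypes)
print(predicted)
```

With only two examples per character, an unseen glyph is assigned to whichever character's average embedding it most resembles, which is the basic idea behind grouping variant forms from very little labelled data.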

The model can now recognise 30,000 ancient Chinese characters with an accuracy of 97.5%, and it reads text about 30 times faster than a human.

Zhang said Alibaba would soon make its AI system for the machine reading of ancient Chinese books available to the public.

According to a report by Hong Kong’s South China Morning Post, computer vision currently makes up half of China’s AI market. From 2016 to 2018, Chinese computer vision firms raised $4.5bn from venture capital, the highest globally.
