A seminar series on "Intelligent Information Processing of Ancient Books," jointly sponsored by the Peking University Digital Humanities Research Center, the Peking University-ByteDance Digital Humanities Open Laboratory, and the Peking University Artificial Intelligence Research Institute, was recently held online.

  At the seminar, Wang Jun, director of the Digital Humanities Research Center of Peking University, did the math: China has about 200,000 kinds of ancient books, and from 1949 to 2019 nearly 38,000 kinds were restored, collated and published. At that pace, restoring and collating all the surviving ancient books could take three hundred years.

However, if artificial intelligence technology is used to assist the restoration and collation, the work could be completed in about 20 to 30 years.
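The three-hundred-year figure can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above and assumes, as a simplification not stated in the article, that the 200,000 total includes the 38,000 kinds already published and that work continues at the 1949-2019 average rate.

```python
# Figures quoted in the article.
TOTAL_KINDS = 200_000         # approximate number of kinds of ancient books
DONE_KINDS = 38_000           # restored, collated and published, 1949-2019
YEARS_ELAPSED = 2019 - 1949   # 70 years


def years_to_finish(total, done, years_elapsed):
    """Years needed for the remainder at the historical average rate.

    Assumption: `total` includes the items already done.
    """
    rate_per_year = done / years_elapsed
    remaining = total - done
    return remaining / rate_per_year


estimate = years_to_finish(TOTAL_KINDS, DONE_KINDS, YEARS_ELAPSED)
print(f"{estimate:.0f} years at the historical pace")

# Speed-up implied by the quoted 20-to-30-year target with AI assistance.
speedup_low = estimate / 30
speedup_high = estimate / 20
print(f"implied speed-up: {speedup_low:.1f}x to {speedup_high:.1f}x")
```

Under these assumptions the estimate comes to just under three hundred years, and the 20-to-30-year target corresponds to working roughly ten to fifteen times faster than the historical average.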

  "Using artificial intelligence technology to restore ancient books" mentioned by Wang Jun is not a distant scientific idea, it is becoming a vivid practice in reality.

Shortly after the first lecture in the "Intelligent Information Processing of Ancient Books" seminar series, ByteDance announced a donation to the Peking University Education Foundation to support the Peking University-ByteDance Digital Humanities Open Laboratory in developing an "ancient book digital platform." The platform will use intelligent technology to accelerate the digitization of Chinese ancient book resources, and is expected to complete the intelligent restoration and collation of 10,000 selected ancient books within three years.

The transformation of ancient book texts is becoming intelligent

  For a long time, the protection of ancient books relied mainly on preserving the originals, that is, ancient books were protected as "cultural relics".

Later came reproduction-based protection: photocopying, reprinting and image preservation, which allow ancient books to survive on paper or microfilm.

Many of the existing digitized ancient books are converted from microfilm, with low resolution and mostly black and white images.

  Even if all ancient books were photocopied and published digitally, they would remain "dead" and hard for people to use.

Yang Haizheng, a professor in the Department of Chinese at Peking University, gave a simple example: photocopied ancient books have no punctuation marks, which makes them very inconvenient to read.

The lack of searchable text also hinders retrieval. To look something up, a reader has to go through the original page by page, and it is hard to find the desired information quickly.

Therefore, in order to improve the utilization of traditional ancient books, the content of ancient books must be converted into digital texts.

In the past, this conversion mainly relied on manual entry by experts, and the time cost was extremely high.

  "The development of information technology, especially the emergence of artificial intelligence and big data technology, has brought revolutionary changes to the restoration and arrangement of ancient books." Wang Jun said that in recent years, many universities and scientific research institutions, including Peking University, have A lot of pioneering work has been carried out in the digitization of ancient books, and relatively mature technologies and experience have been accumulated in OCR (optical character recognition), AI sentence reading, entity recognition, etc.

Taking OCR applications as an example, scanning ancient books on paper with electronic devices, the content of the ancient books will be transcribed into the computer, and the corresponding digital documents will be generated, which is more than ten million times more efficient than manual input.

  It is understood that using artificial intelligence and big data technology, the Digital Humanities Center of Peking University has realized automatic sentence reading of ancient texts in the corpus of large-scale ancient texts from the pre-Qin to Ming and Qing Dynasties, with an average accuracy rate of 94%. It also realizes automatic identification of people's names, place names, era names, official names, and book titles, with an accuracy rate of nearly 98% in medieval historical materials.

  In these areas, Internet companies such as ByteDance have also accumulated considerable experience and technology.

For example, OCR technology is widely used for image and text recognition and subtitle translation on platforms such as Toutiao and Douyin, as well as for recognizing various cards, certificates and industry documents in commercial settings.

"These technologies can be gradually migrated toward the intelligent digitization of ancient books. In developing the ancient book digitization platform, we can complement Peking University's technical strengths and integrate the two effectively," said Li Hang, director of the ByteDance Artificial Intelligence Laboratory.

  According to Wang Jun, the "Ancient Books Digital Platform" will further improve the accuracy, intelligence and openness of ancient book sorting.

On the one hand, key texts can be refined to meet the accuracy that experts and scholars require of the data; on the other hand, with the text recognition and proofreading tools on the intelligent platform, scholars and ancient book enthusiasts can complete collation work online in one place, instead of editing in Word documents and passing files back and forth as before, which improves efficiency and makes public participation easier.

The use of ancient books is expected to be intelligent

  Wang Zhaopeng, a professor at the Sichuan University Chinese Culture Global Communication Big Data Center, believes that technological progress is making the restoration and collation of ancient books intelligent in two respects: one is the intelligent conversion of ancient book texts, and the other is the intelligent use of ancient books.

  Converting the content of paper ancient books into digital text is only the first step in the restoration and organization of ancient books.

On this basis, another problem to be solved is how to organize and categorize the contents of massive and obscure ancient books to form interactive, touchable and visualized digital humanities works for people to consult and use.

Otherwise, the ancient books entered into the computer will continue to "sleep".

  Drawing on artificial intelligence technology, China has built a number of platforms for the automated collation and visualization of ancient books.

For example, Wang Jun led the design and development of the "Knowledge Graph Visualization System of 'Song and Yuan Xue An'", which processed and analyzed the 2.4 million-word text of "Song and Yuan Xue An". The people, times, places, works and other entities involved were extracted and assembled into a knowledge graph.

However, many platforms are still not very intelligent. For example, a keyword search returns results that are isolated and unordered.

Wang Zhaopeng believes that a smarter platform for sorting and utilizing ancient books should evolve from version 1.0 to version 2.0. For example, content retrieval should be “following similarities”, and the retrieved content should be related to each other and organically classified by artificial intelligence.

  The "Ancient Books Digital Platform" jointly developed by Peking University and ByteDance is an attempt to improve the level of intelligence in the arrangement and utilization of ancient books.

"The technical core of our cooperation is to apply artificial intelligence and big data to a large number of ancient books and documents, to realize the automatic generation of knowledge maps of ancient texts and the intelligent sorting of ancient book content, so that ancient books can be retrieved and read in the form of text. And deep mining and utilization." Li Hang said that in the future, the "digital platform for ancient books" will not only be an intelligent sorting platform for ancient books, but also a digital reading tool for readers, which will provide free and open access services.

  Wang Jun predicts that with the application of artificial intelligence technology, the ancient historical and cultural knowledge contained in ancient books and documents will be continuously extracted and constructed into various knowledge bases, which will support Internet front-end applications in the form of knowledge maps. .

  Due to the advantages in the research and development and design of Internet products, the participation of social forces such as Internet companies will further ensure the service quality of the ancient book digital platform.

"We have excellent product managers, designers, and software engineers who can continuously optimize and innovate the product functions of the ancient book digital platform to provide a better user experience." said Tang Daxin, product general manager of the corporate social responsibility department of Beijing ByteDance. At present, the design team of Toutiao and the development and testing team of Douyin have joined the development of the "Ancient Books Digital Platform".

Interdisciplinary collaboration required

  With artificial intelligence technology now widely applied to the restoration and collation of ancient books, Yang Haizheng, as a teacher of classical literature, is often asked by students: "Do I need to learn artificial intelligence while studying classical literature?" Although Yang Haizheng is not sure of the answer, the fact is that combining artificial intelligence technology with ancient book restoration and collation will open up a new interdisciplinary field, and using artificial intelligence to restore and collate ancient books will certainly require more people with interdisciplinary skills.

  Wang Jun believes that under such circumstances, how to cultivate classical philology talents with both technical and academic abilities, and how to form a multidisciplinary curriculum system, are urgent problems to be solved in the related majors such as classical philology in colleges and universities.

  Furthermore, AI is not "super smart".

According to Jin Lianwen, a professor at the School of Electronics and Information, South China University of Technology, problems such as image enhancement and restoration of ancient books, and image layout analysis of ancient books with complex layouts need to be solved.

In the analysis and collation of ancient book content, the biggest technical difficulty at present is relation extraction: once artificial intelligence has recognized proper nouns such as personal names and place names in ancient books, how to go on to extract the relations between them, so as to lay the technical groundwork for automatically generating knowledge graphs of ancient history and culture.
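To make the gap between entity recognition and relation extraction concrete, here is a minimal sketch of a crude baseline: linking entities that appear in the same sentence. It is not the method used by any of the teams mentioned in the article, and the input data, the entity names and the function `cooccurrence_graph` are hypothetical placeholders. Co-occurrence only proposes candidate links; deciding what kind of relation each link represents is the hard part described above.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical output of an entity-recognition step: for each sentence, the
# entities found in it as (text, type) pairs. All names are placeholders.
tagged_sentences = [
    [("Person A", "PERSON"), ("Place X", "PLACE")],
    [("Person A", "PERSON"), ("Book T", "BOOK")],
    [("Person B", "PERSON"), ("Place X", "PLACE"), ("Person A", "PERSON")],
]


def cooccurrence_graph(sentences):
    """Count how often each pair of entities shares a sentence.

    Returns a nested mapping: graph[a][b] is the number of sentences in
    which entities a and b both appear. Edges are undirected and untyped.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for entities in sentences:
        unique = sorted(set(entities))  # ignore repeats within a sentence
        for a, b in combinations(unique, 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


graph = cooccurrence_graph(tagged_sentences)

# List the candidate links for one entity, with co-occurrence counts.
for neighbor, count in sorted(graph[("Person A", "PERSON")].items()):
    print(neighbor, count)
```

The resulting graph has edges but no edge labels. A knowledge graph in the sense discussed here would also need each edge typed (for instance, teacher of, born in, author of), which requires reading the sentence itself rather than counting co-occurrences.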

  Therefore, Yang Haizheng believes that in collating ancient books, humanities and social sciences scholars should take an active part and strengthen cooperation with technicians, making better use of machines rather than being led by them, so as to ensure the accuracy of the results.

  The development of artificial intelligence technology has brought about fundamental changes in the research methods and ideas of ancient book collation.

A consensus in the industry is that the use of artificial intelligence to promote the restoration and arrangement of ancient books requires cross-disciplinary, cross-environment, cross-cultural, and cross-regional cooperation.

As Wang Jun said, "The protection of ancient books requires the joint efforts of all sectors of society, and more ancient book collection institutions, research institutions and individuals who are enthusiastic about the cause of ancient books should be welcomed to join, so as to create an open 'digital platform for ancient books'".

(Reporter Han Yeting of this newspaper)