WYWEB: An NLP Evaluation Benchmark For Classical Chinese
Abstract
To comprehensively evaluate the overall performance of NLP
models in a given domain, many evaluation benchmarks have been
proposed, such as GLUE, SuperGLUE, and CLUE. The field of
natural language understanding has traditionally focused on
benchmarks for English, Chinese, and multilingual tasks;
however, little attention has been given to classical Chinese,
also known as "wen yan wen (文言文)", which has a rich history
spanning thousands of years and holds significant cultural and
academic value.
To promote progress in the NLP community, in this paper we
introduce the WYWEB evaluation benchmark, which consists of
nine NLP tasks in classical Chinese, covering sentence
classification, sequence labeling, reading comprehension, and
machine translation. We evaluate existing pre-trained language
models, all of which struggle with this benchmark. We also
introduce a number of supplementary datasets and additional
tools to facilitate further progress on classical Chinese NLU.
The GitHub repository and leaderboard of WYWEB will be released
as soon as possible.