
How to Use Chinese Corpora Without Misreading Frequency

Goal: after reading, you should be able to use Chinese corpora responsibly, understanding that a frequency figure depends on corpus composition, genre, date, region, tokenization, and search method.

Published March 2, 2026

Why this article matters

A corpus can make learners smarter, but only if they stop treating frequency as truth. “This word appears 10,000 times” means little until you know where the corpus comes from, how it was segmented, what genres it contains, what period it covers, and what exact form you searched. Corpus literacy is a serious skill.

Core concepts

Term	Meaning for learners	Risk
语料库	Corpus: searchable collection of language data	Not automatically balanced or current.
词频 / 字频	Word/character frequency	Depends on segmentation and source.
搭配	Collocation	More useful than raw word frequency for usage.
共现	Co-occurrence	May show association, not grammatical relation.
体裁 / 语域	Genre/register	News, fiction, forums, law, and speech differ sharply.
分词	Word segmentation	Chinese word boundaries are not always obvious.
样本	Sample	Bad sample, bad conclusion.

The article

Chinese corpora are powerful because they let learners inspect real examples. Instead of asking “Can I say this?” you can ask “Where does this expression appear, with what words around it, in what genre?” That is a better question. But corpus results can also mislead.

The first danger is corpus composition. A news-heavy corpus will overrepresent policy, institutions, formal reporting verbs, and official place names. A social-media corpus will overrepresent slang, fragments, emotional stance, and platform events. A literary corpus will overrepresent narrative description and older words. If you compare frequencies without knowing the source mix, you may think a word is common in everyday speech when it is actually common in news headlines.
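One way to make the source mix visible is to report a rate per million tokens for each genre instead of a single raw count. The sketch below assumes you already have hit counts and token totals per sub-corpus; the genre names and all numbers are invented for illustration and do not come from any real corpus.

```python
def per_million(hits: int, corpus_tokens: int) -> float:
    """Return hits per million tokens so sub-corpora of different sizes can be compared."""
    if corpus_tokens <= 0:
        raise ValueError("corpus_tokens must be positive")
    return hits * 1_000_000 / corpus_tokens

# Hypothetical counts for one word in three sub-corpora (invented numbers).
subcorpora = {
    "news": {"hits": 1200, "tokens": 40_000_000},
    "social": {"hits": 300, "tokens": 5_000_000},
    "fiction": {"hits": 90, "tokens": 3_000_000},
}

normalized = {
    genre: per_million(c["hits"], c["tokens"]) for genre, c in subcorpora.items()
}

for genre, rate in normalized.items():
    print(f"{genre}: {rate:.1f} per million tokens")
```

With these made-up numbers, news has the most raw hits, yet the word is relatively twice as frequent in the social-media sample. The raw total alone would hide that.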

The second danger is segmentation. Written Chinese has no spaces, so corpus tools must decide where word boundaries are. Should 研究生物 be read as 研究 / 生物 (“study biology”) or as 研究生 / 物 (“graduate student” followed by a stray 物)? Segmentation decisions like this affect counts, collocations, and examples. A string search can catch forms that a word search misses; a word search can avoid false hits that a string search includes.
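The difference between string search and word search can be sketched on a toy corpus. The three sentences and their segmentation below are hand-written for illustration; they are not the output of any real segmenter, and the function names are hypothetical.

```python
# Toy corpus: each sentence is stored already segmented into tokens.
segmented = [
    ["他", "在", "研究", "生物"],        # "He is studying biology"
    ["她", "是", "研究生"],              # "She is a graduate student"
    ["研究生", "宿舍", "很", "方便"],    # "The graduate dorm is convenient"
]

def string_search(query, sentences):
    """Match the raw character string, ignoring word boundaries."""
    return [s for s in sentences if query in "".join(s)]

def word_search(query, sentences):
    """Match only where the query is exactly one segmented token."""
    return [s for s in sentences if query in s]

string_hits = string_search("研究生", segmented)
word_hits = word_search("研究生", segmented)
print(len(string_hits), len(word_hits))
```

The string search returns all three sentences, including the false hit inside 研究 / 生物. The word search returns two, but only because the segmentation here happens to be right; with a different segmentation it could miss real hits instead.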

The third danger is form variation. Simplified/traditional variants, regional vocabulary, older written forms, abbreviations, and named entities can split results across several queries. Searching 信息 may not tell you about 資訊 in Taiwan materials. Searching 普通话 may not tell you about 國語 or 华语. Serious corpus work requires variant awareness.
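A simple habit is to expand every query into its known variants and count each form separately. The sketch below uses a hand-built variant table and a made-up sample text, both assumptions for illustration; regional synonyms are not one-to-one character conversions, so a real table has to be curated by hand.

```python
# Hypothetical variant table; extend it for your own research question.
VARIANTS = {
    "信息": ["信息", "資訊", "资讯"],
    "普通话": ["普通话", "普通話", "國語", "国语", "华语", "華語"],
}

def expand_query(term):
    """Return the listed variants of a term, or the term itself if none are listed."""
    return VARIANTS.get(term, [term])

def count_variants(term, text):
    """Count each variant form separately in a raw text string."""
    return {form: text.count(form) for form in expand_query(term)}

# Invented sample mixing traditional and simplified sentences.
sample = "這個網站提供旅遊資訊。该平台发布交通信息。資訊安全很重要。"
counts = count_variants("信息", sample)
print(counts)
```

Keeping the counts per form, instead of summing them, preserves the regional signal: in this sample, searching only 信息 would have found one of three relevant sentences.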

The fourth danger is raw count worship. A rare word is not always wrong. A common word is not always appropriate. Some words are rare because they belong to narrow domains; others are common but vague. Some phrases are frequent because of one repeated news formula, not because native speakers casually use them in daily life.
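One quick check against a repeated formula inflating a count is to look at how the hits are spread across sentences and sources. The following is a minimal sketch assuming each hit is available as a (source, sentence) pair; the source labels and sentences are invented.

```python
from collections import Counter

# Hypothetical hits for one phrase: (source id, sentence).
hits = [
    ("news_A", "有关部门表示将进一步研究"),
    ("news_A", "有关部门表示将进一步研究"),
    ("news_A", "有关部门表示将进一步研究"),
    ("news_B", "公司表示将进一步研究方案"),
    ("forum_C", "我还要进一步研究一下"),
]

def summarize(hits):
    """Compare the raw hit count with unique sentences and source spread."""
    per_source = Counter(source for source, _ in hits)
    top_source, top_count = per_source.most_common(1)[0]
    return {
        "raw_hits": len(hits),
        "unique_sentences": len({sentence for _, sentence in hits}),
        "sources": len(per_source),
        "top_source": top_source,
        "top_source_share": top_count / len(hits),
    }

summary = summarize(hits)
print(summary)
```

Here five raw hits shrink to three distinct sentences, and one source supplies 60% of them. That is a signal to read the examples before trusting the count.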

Worked example: 方便, 便利, 便捷

If you search these near-synonyms, raw frequency may not answer the learner's real question. The useful question is: what nouns and verbs does each word appear with?

Word	Typical domain	Example collocations
方便	everyday broad convenience; also social availability	方便的话, 交通方便, 使用方便
便利	infrastructure, policy, service conditions	提供便利, 交通便利, 生活便利
便捷	product/service speed and ease	操作便捷, 便捷服务, 支付便捷

The corpus is useful because it shows environment, not because it prints a magic answer.
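Collocation counting of this kind can be sketched in a few lines. The example assumes sentences that are already segmented; the nine toy sentences are hand-written to echo the table above and are not real corpus output, and the window size of one token is an arbitrary choice.

```python
from collections import Counter

# Hand-segmented toy sentences (illustrative only).
sentences = [
    ["这里", "交通", "方便"],
    ["方便", "的话", "请", "回复"],
    ["使用", "方便"],
    ["政府", "提供", "便利"],
    ["交通", "便利"],
    ["生活", "便利"],
    ["操作", "便捷"],
    ["支付", "便捷"],
    ["操作", "便捷", "高效"],
]

def collocates(node, sentences, window=1):
    """Count tokens that occur within `window` positions of each occurrence of `node`."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != node:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

for word in ("方便", "便利", "便捷"):
    print(word, collocates(word, sentences).most_common(3))
```

On a real corpus the interesting step comes after the counting: reading the top collocates side by side and asking which genres they came from.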

Learner traps and repairs

Trap	Why it misleads	Better habit
Comparing raw counts across different corpora	Different corpora have different source mixes.	Compare within one corpus first.
Searching only one form	Variants and abbreviations split evidence.	Search simplified/traditional and synonyms where relevant.
Treating rare as wrong	Domain terms can be rare but correct.	Ask: rare in which genre?
Ignoring bad segmentation	Counts may merge or split incorrectly.	Inspect concordance lines manually.
Mining examples without source	You lose register and credibility.	Save source, date, genre, and sentence.

Practice protocol

Choose one word pair, such as 问题/议题, 方便/便捷, or 认为/表示. Search in a corpus or a controlled source set. Record five examples from news, five from social media or conversation if available, and five from technical/official text. Write a register note before writing an English gloss.

Additional practice and repair

Corpus misuse diagnostics

Bad inference	What went wrong	Repair
Word A is more common than Word B, so Word A is better	Raw count ignores genre and context.	Compare collocations and example sentences within the same source.
This phrase has few hits, so it is wrong	It may be domain-specific, regional, new, or formal.	Ask “rare where?” before rejecting.
The corpus shows this usage, so I can use it anywhere	Every corpus example carries the register of its source.	Tag genre, date, region, and speaker/writer type.
A word search found zero results, so the word does not exist	Segmentation, simplified/traditional forms, or query form may be wrong.	Try string search, variants, and related forms.
Concordance lines prove meaning	KWIC context may be too short.	Open wider context before drawing a conclusion.
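The last row of the table can be illustrated with a small KWIC (keyword in context) function whose context width is adjustable. This is a sketch, not any corpus tool's actual display logic: the width is measured in characters, and the sample sentence is invented.

```python
def kwic(text, query, width=5):
    """Return (left, keyword, right) tuples with `width` characters of context per side."""
    lines = []
    start = text.find(query)
    while start != -1:
        end = start + len(query)
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        lines.append((left, query, right))
        start = text.find(query, start + 1)
    return lines

# Invented sentence: "He said the plan was convenient, but he was actually
# being sarcastic, because the queue takes three hours."
text = "他说这个方案很方便，但是他其实是在讽刺，因为排队要三个小时。"

narrow = kwic(text, "方便", width=4)
wide = kwic(text, "方便", width=30)
print(narrow)
print(wide)
```

The narrow line looks like a plain statement that something is convenient. Only the wider context shows 讽刺 and reveals that the speaker meant the opposite.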

Query log template

Field	What to record
Research question	Meaning, collocation, register, region, grammar, or frequency?
Query forms	Simplified, traditional, abbreviation, synonym, inflected frame if relevant.
Corpus/source	Name, date range, genre mix, region if known.
Search mode	String, word, lemma-like, regex, or manual source search.
Example quality	Clear, ambiguous, formulaic, duplicated, noisy, or irrelevant.
Conclusion confidence	High only after example inspection and genre comparison.
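If you keep the log in a script or notebook, the template could be mirrored as a small record type. The sketch below is one possible layout, not a standard format: the class name, field names, and the list of allowed search modes are assumptions taken from the table above, and the example entry is invented.

```python
from dataclasses import dataclass, asdict

# Allowed values mirror the "Search mode" row of the template.
SEARCH_MODES = {"string", "word", "lemma-like", "regex", "manual"}

@dataclass
class QueryLogEntry:
    research_question: str
    query_forms: list[str]
    corpus_source: str
    search_mode: str
    example_quality: str
    conclusion_confidence: str = "low"  # raise only after inspecting examples

    def __post_init__(self):
        if self.search_mode not in SEARCH_MODES:
            raise ValueError(f"unknown search mode: {self.search_mode}")

    def to_row(self) -> str:
        """Render the entry as one tab-separated line."""
        d = asdict(self)
        d["query_forms"] = " / ".join(self.query_forms)
        return "\t".join(str(v) for v in d.values())

entry = QueryLogEntry(
    research_question="register",
    query_forms=["信息", "資訊"],
    corpus_source="hypothetical news sub-corpus, 2010-2020",
    search_mode="string",
    example_quality="clear",
)
print(entry.to_row())
```

Defaulting the confidence to "low" makes the higher rating something you have to set deliberately, which matches the last row of the template.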

Before/after repair set

Weak corpus note	Strong corpus note
便捷 is common.	In the searched product/service texts, 便捷 frequently modifies 服务, 操作, 支付, and 流程; it feels more product-copy/formal than everyday 方便.
议题 is less common than 问题.	议题 appears in agenda, public debate, and policy contexts; frequency is lower because its domain is narrower.
This phrase is rare.	In this news-heavy corpus it is rare; check spoken/interview/social examples before judging acceptability.

The corpus worksheet should require users to save at least three concordance lines with source metadata before writing a usage note. Add warnings for duplicate boilerplate, named-entity noise, and one-source overrepresentation. Include a simplified/traditional variant reminder.
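Some of these worksheet rules can be checked automatically. The following is a minimal sketch under stated assumptions: each saved concordance line is a dict with the hypothetical keys `sentence`, `source`, `date`, and `genre`, and "overrepresentation" is taken to mean one source supplying more than half of the lines. Named-entity noise and variant reminders are not covered and still need a human reader.

```python
REQUIRED = ("sentence", "source", "date", "genre")

def check_worksheet(lines, min_lines=3, max_source_share=0.5):
    """Return a list of warnings for a set of saved concordance lines."""
    warnings = []
    complete = [ln for ln in lines if all(ln.get(k) for k in REQUIRED)]
    if len(complete) < min_lines:
        warnings.append(
            f"need at least {min_lines} lines with full metadata, found {len(complete)}"
        )
    sentences = [ln.get("sentence") for ln in lines]
    if len(set(sentences)) < len(sentences):
        warnings.append("duplicate sentences: possible boilerplate")
    sources = [ln["source"] for ln in lines if ln.get("source")]
    if sources:
        top = max(sources.count(s) for s in set(sources))
        if top / len(sources) > max_source_share:
            warnings.append("one source dominates the examples")
    return warnings

# Invented examples: one acceptable set and one problematic set.
good = [
    {"sentence": "操作便捷", "source": "site_A", "date": "2021", "genre": "product"},
    {"sentence": "支付便捷", "source": "site_B", "date": "2022", "genre": "product"},
    {"sentence": "交通方便", "source": "forum_C", "date": "2020", "genre": "forum"},
]
bad = [
    {"sentence": "操作便捷", "source": "site_A", "date": "2021", "genre": "product"},
    {"sentence": "操作便捷", "source": "site_A", "date": "", "genre": "product"},
    {"sentence": "支付便捷", "source": "site_A", "date": "2022", "genre": "product"},
]
print(check_worksheet(good))
print(check_worksheet(bad))
```

The first set passes with no warnings. The second triggers all three: one line lacks a date, one sentence is duplicated, and every line comes from the same source.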

Practice visualization

Build a corpus-query worksheet with fields for query form, corpus/source, genre, date, region, segmentation mode, raw frequency, top collocations, example quality, and caution note.

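Check claims about corpus tools against the documentation for the BCC corpus (Beijing Language and Culture University) and the CCL corpus (Peking University), the Unicode text segmentation standard (UAX #29), and ICU's dictionary-based boundary analysis. Be clear that corpus evidence is empirical but not self-interpreting.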
