How to Use Chinese Corpora Without Misreading Frequency
The reader can use Chinese corpora responsibly, understanding that frequency depends on corpus composition, genre, date, region, tokenization, and search method.
Why this article matters
A corpus can make learners smarter, but only if they stop treating frequency as truth. “This word appears 10,000 times” means little until you know where the corpus comes from, how it was segmented, what genres it contains, what period it covers, and what exact form you searched. Corpus literacy is a serious skill.
Core concepts
| Term | Meaning for learners | Risk |
|---|---|---|
| 语料库 | Corpus: searchable collection of language data | Not automatically balanced or current. |
| 词频 / 字频 | Word/character frequency | Depends on segmentation and source. |
| 搭配 | Collocation | More useful than raw word frequency for usage. |
| 共现 | Co-occurrence | May show association, not grammatical relation. |
| 体裁 / 语域 | Genre/register | News, fiction, forums, law, and speech differ sharply. |
| 分词 | Word segmentation | Chinese word boundaries are not always obvious. |
| 样本 | Sample | Bad sample, bad conclusion. |
The article
Chinese corpora are powerful because they let learners inspect real examples. Instead of asking “Can I say this?” you can ask “Where does this expression appear, with what words around it, in what genre?” That is a better question. But corpus results can also mislead.
The first danger is corpus composition. A news-heavy corpus will overrepresent policy, institutions, formal reporting verbs, and official place names. A social-media corpus will overrepresent slang, fragments, emotional stance, and platform events. A literary corpus will overrepresent narrative description and older words. If you compare frequencies without knowing the source mix, you may think a word is common in everyday speech when it is actually common in news headlines.
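One way to make the source mix visible is to convert raw hits into a rate per million tokens for each subcorpus before comparing anything. The sketch below is illustrative only: the subcorpus names, token totals, and hit counts are invented, and `per_million` and `genre_profile` are hypothetical helper names, not functions from any corpus tool.

```python
# All numbers are invented for illustration; they come from no real corpus.
subcorpora = {
    "news":         {"tokens": 5_000_000, "hits": 900},
    "social_media": {"tokens": 1_000_000, "hits": 40},
    "fiction":      {"tokens": 2_000_000, "hits": 60},
}

def per_million(hits: int, tokens: int) -> float:
    """Relative frequency: occurrences per one million tokens."""
    return hits * 1_000_000 / tokens

def genre_profile(data: dict) -> dict:
    """Per-million frequency for each subcorpus, keyed by subcorpus name."""
    return {name: per_million(d["hits"], d["tokens"]) for name, d in data.items()}

profile = genre_profile(subcorpora)

# Pooled rate: what a single corpus-wide count would report.
total_hits = sum(d["hits"] for d in subcorpora.values())
total_tokens = sum(d["tokens"] for d in subcorpora.values())
pooled = per_million(total_hits, total_tokens)

for name, rate in profile.items():
    print(f"{name:>12}: {rate:6.1f} per million")
print(f"{'pooled':>12}: {pooled:6.1f} per million")
```

With these made-up figures the pooled rate sits far above the social-media rate, because the news subcorpus is both the largest and the densest. A single corpus-wide number would hide that.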
The second danger is segmentation. Written Chinese has no spaces, so corpus tools must decide where word boundaries fall. Should 研究生物 be split as 研究 / 生物 (“study biology”) or as 研究生 / 物 (“graduate student” plus a stray 物)? A segmenter that prefers the longest dictionary match can pick the second, wrong reading. In real searches, segmentation decisions affect counts, collocations, and examples. A string search can catch forms that a word search misses; a word search can avoid false hits that a string search includes.
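The difference between string search and word search can be sketched with a toy segmenter. The code below uses forward maximum matching over a four-word dictionary. This is an assumption made for illustration, not the method of any particular corpus; real tools use much larger dictionaries and statistical models. The names `forward_max_match`, `VOCAB`, and `SENTENCES` are hypothetical.

```python
def forward_max_match(text: str, vocab: set[str], max_len: int = 4) -> list[str]:
    """Toy segmenter: at each position take the longest vocabulary match."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is the fallback when nothing longer matches.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

VOCAB = {"喜欢", "研究", "研究生", "生物"}  # hypothetical mini-dictionary
SENTENCES = ["他喜欢研究生物", "她是研究生"]  # invented example sentences

def string_count(query: str, sentences: list[str]) -> int:
    """Count raw substring occurrences, ignoring word boundaries."""
    return sum(s.count(query) for s in sentences)

def word_count(query: str, sentences: list[str]) -> int:
    """Count occurrences as whole tokens after segmentation."""
    return sum(forward_max_match(s, VOCAB).count(query) for s in sentences)

print(forward_max_match(SENTENCES[0], VOCAB))
for q in ("生物", "研究生", "研究"):
    print(q, "string:", string_count(q, SENTENCES), "word:", word_count(q, SENTENCES))
```

In this sketch the greedy segmenter splits the first sentence as 研究生 / 物, so a word search for 生物 finds nothing while a string search finds one hit. Both searches count 研究生 twice, although only the second sentence is actually about a graduate student.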
The third danger is form variation. Simplified/traditional variants, regional vocabulary, older character forms, abbreviations, and named entities can split the evidence for one concept across several query forms. Searching 信息 may not tell you about 資訊 in Taiwan materials. Searching 普通话 may not tell you about 國語 or 华语. Serious corpus work requires variant awareness.
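A minimal sketch of variant-aware searching follows, assuming a small hand-built table of related forms. The `VARIANTS` table, the `count_forms` helper, and the sample texts are all hypothetical; a real project would need a much fuller list checked against dictionaries for each region.

```python
# Hand-built, deliberately incomplete table of related forms (an assumption).
VARIANTS = {
    "信息": ["信息", "資訊", "资讯"],
    "普通话": ["普通话", "普通話", "国语", "國語", "华语", "華語"],
}

# Invented sample texts standing in for corpus lines.
TEXTS = [
    "这条信息很重要",
    "請提供更多資訊",
    "她的國語說得很好",
    "新加坡华语课程",
    "普通话水平测试",
]

def count_forms(query: str, texts: list[str]) -> dict[str, int]:
    """Count each known related form of the query separately."""
    forms = VARIANTS.get(query, [query])
    return {form: sum(t.count(form) for t in texts) for form in forms}

for query in ("信息", "普通话"):
    print(query, count_forms(query, TEXTS))
```

Keeping the counts per form, rather than summing them, preserves the regional signal: which form turned up tells you something about where the text came from.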
The fourth danger is raw count worship. A rare word is not always wrong. A common word is not always appropriate. Some words are rare because they belong to narrow domains; others are common but vague. Some phrases are frequent because of one repeated news formula, not because native speakers casually use them in daily life.
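A rough way to spot a count that is inflated by one repeated formula is to look at dispersion: how many distinct sources and distinct sentences lie behind the hits. The sketch below assumes the corpus lines are available as (source, sentence) pairs. The data is invented and `dispersion_report` is a hypothetical helper.

```python
from collections import Counter

def dispersion_report(phrase: str, docs: list[tuple[str, str]]) -> dict:
    """Summarize how widely a phrase is spread.

    docs is a list of (source_id, sentence) pairs. A hit is a sentence
    containing the phrase at least once.
    """
    hits = [(src, sent) for src, sent in docs if phrase in sent]
    sentence_counts = Counter(sent for _, sent in hits)
    top_share = max(sentence_counts.values()) / len(hits) if hits else 0.0
    return {
        "hits": len(hits),
        "sources": len({src for src, _ in hits}),
        "distinct_sentences": len(sentence_counts),
        "top_sentence_share": top_share,  # share of hits that are one identical sentence
    }

# Invented data: one wire-style sentence repeated four times by one source.
DOCS = [("news_a", "据悉，会议将于下周举行。")] * 4 + [
    ("news_b", "据悉，该项目已完成。"),
    ("forum_c", "今天天气很好。"),
]

print(dispersion_report("据悉", DOCS))
```

Here five hits collapse to two distinct sentences from two sources, and one sentence accounts for 80% of the hits. That looks like a formula, not broad everyday use.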
Worked example: 方便, 便利, 便捷
If you search these near-synonyms, raw frequency may not answer the learner's real question. The useful question is: what nouns and verbs does each word appear with?
| Word | Typical domain | Example collocations |
|---|---|---|
| 方便 | everyday broad convenience; also social availability | 方便的话, 交通方便, 使用方便 |
| 便利 | infrastructure, policy, service conditions | 提供便利, 交通便利, 生活便利 |
| 便捷 | product/service speed and ease | 操作便捷, 便捷服务, 支付便捷 |
The corpus is useful because it shows environment, not because it prints a magic answer.
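To see the environment of each word instead of its raw count, you can tally the words that fall within a small window around it. The sketch below assumes the sentences are already segmented into tokens. The sentences and their segmentation are invented for illustration, and `collocates` is a hypothetical helper.

```python
from collections import Counter

def collocates(node: str, sentences: list[list[str]], window: int = 2) -> Counter:
    """Count tokens appearing within `window` positions of the node word."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != node:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the node word itself
                    counts[tokens[j]] += 1
    return counts

# Invented, hand-segmented sentences (segmentation is an assumption).
SEGMENTED = [
    ["这里", "交通", "很", "方便"],
    ["你", "方便", "的话", "来", "一下"],
    ["政府", "提供", "便利"],
    ["交通", "便利"],
    ["操作", "便捷"],
    ["支付", "便捷", "安全"],
]

for word in ("方便", "便利", "便捷"):
    print(word, collocates(word, SEGMENTED).most_common(3))
```

On real data the window size and the quality of the segmentation both change the result, so treat the top collocates as a list of examples to go and read, not as a verdict.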
Learner traps and repairs
| Trap | Why it misleads | Better habit |
|---|---|---|
| Comparing raw counts across different corpora | Different corpora have different sizes and source mixes. | Compare within one corpus first; across corpora, compare rates, not raw counts. |
| Searching only one form | Variants and abbreviations split evidence. | Search simplified/traditional and synonyms where relevant. |
| Treating rare as wrong | Domain terms can be rare but correct. | Ask: rare in which genre? |
| Ignoring bad segmentation | Counts may merge or split incorrectly. | Inspect concordance lines manually. |
| Mining examples without source | You lose register and credibility. | Save source, date, genre, and sentence. |
Practice protocol
Choose one word pair: 问题/议题, 方便/便捷, 认为/表示. Search in a corpus or a controlled source set. Record five examples from news, five from social media or conversation if available, and five from technical/official text. Write a register note before writing an English gloss.
Additional practice and repair
Corpus misuse diagnostics
| Bad inference | What went wrong | Repair |
|---|---|---|
| Word A is more common than Word B, so Word A is better | Raw count ignores genre and context. | Compare collocations and example sentences within the same source. |
| This phrase has few hits, so it is wrong | It may be domain-specific, regional, new, or formal. | Ask “rare where?” before rejecting. |
| The corpus shows this usage, so I can use it anywhere | Each example carries the register of its source; it does not license the usage everywhere. | Tag genre, date, region, and speaker/writer type. |
| A word search found zero results, so the word is not used | Segmentation, simplified/traditional forms, or the query form may be wrong. | Try string search, variants, and related forms. |
| Concordance lines prove meaning | A KWIC (keyword-in-context) line may be too short to show the full meaning. | Open wider context before drawing a conclusion. |
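The last row of the table can be illustrated with a small keyword-in-context function whose window width is adjustable. This is a sketch over a plain string, not the output format of any real concordancer. The example sentence is invented and `kwic` is a hypothetical helper.

```python
def kwic(text: str, query: str, width: int = 5) -> list[tuple[str, str, str]]:
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    start = text.find(query)
    while start != -1:
        end = start + len(query)
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        lines.append((left, query, right))
        start = text.find(query, start + 1)
    return lines

# Invented sentence with two different uses of 方便.
TEXT = "他说这样做并不方便，但是如果你方便的话，可以明天再来。"

for width in (1, 3, 8):
    print(f"width={width}")
    for left, kw, right in kwic(TEXT, "方便", width):
        print(f"  {left} [{kw}] {right}")
```

At a width of one character the second hit shows only 你 [方便] 的, which does not reveal the politeness formula 方便的话. A wider window does.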
Query log template
| Field | What to record |
|---|---|
| Research question | Meaning, collocation, register, region, grammar, or frequency? |
| Query forms | Simplified, traditional, abbreviation, synonym, grammatical frame if relevant. |
| Corpus/source | Name, date range, genre mix, region if known. |
| Search mode | String, word, lemma-like, regex, or manual source search. |
| Example quality | Clear, ambiguous, formulaic, duplicated, noisy, or irrelevant. |
| Conclusion confidence | High only after example inspection and genre comparison. |
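The template above could be kept as a small record type, so that the confidence level is derived from what was actually inspected and is not typed in by hand. This is a minimal sketch. The class name `QueryLog`, its field names, and the thresholds of five examples and two genres are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class QueryLog:
    research_question: str
    query_forms: list[str]
    corpus: str
    search_mode: str
    example_quality: str = "uninspected"
    examples_inspected: int = 0
    genres_compared: list[str] = field(default_factory=list)

    def confidence(self) -> str:
        """High only after example inspection and genre comparison.

        The thresholds (5 examples, 2 genres) are arbitrary assumptions.
        """
        if self.examples_inspected >= 5 and len(self.genres_compared) >= 2:
            return "high"
        if self.examples_inspected > 0:
            return "medium"
        return "low"

log = QueryLog(
    research_question="register of 便捷 vs 方便",
    query_forms=["便捷", "方便"],
    corpus="hypothetical news + product-copy sample",
    search_mode="string",
)
print(log.confidence())  # nothing inspected yet

log.examples_inspected = 6
log.genres_compared = ["news", "product copy"]
log.example_quality = "clear"
print(log.confidence())
```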
Before/after repair set
| Weak corpus note | Strong corpus note |
|---|---|
| 便捷 is common. | In the searched product/service texts, 便捷 frequently modifies 服务, 操作, 支付, and 流程; it reads as more formal, product-copy language than everyday 方便. |
| 议题 is less common than 问题. | 议题 appears in agenda, public debate, and policy contexts; frequency is lower because its domain is narrower. |
| This phrase is rare. | In this news-heavy corpus it is rare; check spoken/interview/social examples before judging acceptability. |
Before writing a usage note, save at least three concordance lines with source metadata on your corpus worksheet. Flag duplicate boilerplate, named-entity noise, and overrepresentation of a single source, and remind yourself to check simplified/traditional variants.
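Part of that checklist can be automated. The sketch below covers only three of the checks: the minimum number of fully documented lines, duplicate sentences, and single-source evidence. Named-entity noise and variant coverage still need a human eye. The field names, the `worksheet_warnings` helper, and the sample lines are hypothetical.

```python
REQUIRED_FIELDS = ("sentence", "source", "date", "genre")

def worksheet_warnings(lines: list[dict], min_lines: int = 3) -> list[str]:
    """Return warnings for a list of saved concordance lines."""
    warnings = []
    # A line counts only if every required metadata field is non-empty.
    complete = [ln for ln in lines if all(ln.get(f) for f in REQUIRED_FIELDS)]
    if len(complete) < min_lines:
        warnings.append(
            f"need at least {min_lines} lines with full metadata, have {len(complete)}"
        )
    sentences = [ln["sentence"] for ln in lines if ln.get("sentence")]
    if len(set(sentences)) < len(sentences):
        warnings.append("duplicate sentences: possible boilerplate")
    sources = {ln["source"] for ln in lines if ln.get("source")}
    if len(lines) > 1 and len(sources) == 1:
        warnings.append("all lines come from one source")
    return warnings

# Invented worksheet entries.
SAVED = [
    {"sentence": "操作便捷，支付安全。", "source": "app_page", "date": "2023", "genre": "product copy"},
    {"sentence": "操作便捷，支付安全。", "source": "app_page", "date": "2023", "genre": "product copy"},
    {"sentence": "为市民提供便捷服务。", "source": "app_page", "date": "", "genre": "news"},
]

for w in worksheet_warnings(SAVED):
    print("WARNING:", w)
```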
Practice visualization
Build a corpus-query worksheet with fields for query form, corpus/source, genre, date, region, segmentation mode, raw frequency, top collocations, example quality, and caution note.
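Check any claim about a corpus tool against the documentation of the corpus itself (for example BCC or CCL). For segmentation behavior, check it against the Unicode text segmentation standard (UAX #29) and the dictionary-based boundary analysis in ICU. Corpus evidence is empirical, but it is not self-interpreting.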