Inkuntri
Chinese Research, tools & pedagogy

How to Use Chinese Corpora Without Misreading Frequency

This guide shows how to use Chinese corpora responsibly: frequency depends on corpus composition, genre, date, region, tokenization, and search method.

Published March 2, 2026 | Chinese

Why this article matters

A corpus can make learners smarter, but only if they stop treating frequency as truth. “This word appears 10,000 times” means little until you know where the corpus comes from, how it was segmented, what genres it contains, what period it covers, and what exact form you searched. Corpus literacy is a serious skill.

Core concepts

Term | Meaning for learners | Risk
语料库 | Corpus: searchable collection of language data | Not automatically balanced or current.
词频 / 字频 | Word/character frequency | Depends on segmentation and source.
搭配 | Collocation | More useful than raw word frequency for usage.
共现 | Co-occurrence | May show association, not grammatical relation.
体裁 / 语域 | Genre/register | News, fiction, forums, law, and speech differ sharply.
分词 | Word segmentation | Chinese word boundaries are not always obvious.
样本 | Sample | Bad sample, bad conclusion.

The article

Chinese corpora are powerful because they let learners inspect real examples. Instead of asking “Can I say this?” you can ask “Where does this expression appear, with what words around it, in what genre?” That is a better question. But corpus results can also mislead.

The first danger is corpus composition. A news-heavy corpus will overrepresent policy, institutions, formal reporting verbs, and official place names. A social-media corpus will overrepresent slang, fragments, emotional stance, and platform events. A literary corpus will overrepresent narrative description and older words. If you compare frequencies without knowing the source mix, you may think a word is common in everyday speech when it is actually common in news headlines.
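
The source-mix problem can be made concrete by normalizing counts per sub-corpus. The following is a minimal sketch, not a feature of any particular corpus tool: the sub-corpus sizes and hit counts are invented for illustration, and `per_million` and `frequency_by_genre` are hypothetical helper names.

```python
# Hypothetical token counts per sub-corpus (assumed numbers, not real corpus sizes).
SUBCORPUS_SIZES = {"news": 5_000_000, "social": 1_000_000, "fiction": 2_000_000}

# Hypothetical raw hit counts for one query form in each sub-corpus.
RAW_HITS = {"news": 10_000, "social": 300, "fiction": 400}


def per_million(hits: int, size: int) -> float:
    """Normalize a raw count to occurrences per million tokens."""
    if size <= 0:
        raise ValueError("sub-corpus size must be positive")
    return hits * 1_000_000 / size


def frequency_by_genre(raw_hits: dict, sizes: dict) -> dict:
    """Return {genre: per-million frequency} for genres present in both maps."""
    return {g: per_million(raw_hits[g], sizes[g]) for g in raw_hits if g in sizes}


if __name__ == "__main__":
    for genre, freq in frequency_by_genre(RAW_HITS, SUBCORPUS_SIZES).items():
        print(f"{genre}: {freq:.0f} per million")
    # news: 2000 per million
    # social: 300 per million
    # fiction: 200 per million
```

With these invented numbers, the pooled total of 10,700 hits would suggest a very common word, yet almost all of that comes from the news portion. Reporting one figure per genre keeps that visible.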

The second danger is segmentation. Written Chinese does not mark word boundaries with spaces, so corpus tools must decide where words begin and end. Should 研究生物 be read as 研究 / 生物 (“study biology”) or mis-segmented as 研究生 / 物 (“graduate student” plus a stray 物)? In real searches, segmentation decisions affect counts, collocations, and examples. A string search can catch forms that a word search misses; a word search can avoid false hits that a string search includes.
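
To see how a segmenter's strategy changes what a word search can find, here is a toy sketch of two greedy dictionary methods, forward and backward maximum matching. The four-word lexicon is an assumption made for this example. Real corpus tools typically use far larger dictionaries and statistical models, so treat this as an illustration of the problem rather than a description of any tool.

```python
# Toy lexicon: an assumption for illustration, not a real segmenter's dictionary.
TOY_DICT = {"研究", "研究生", "生物", "物"}
MAX_LEN = max(len(w) for w in TOY_DICT)


def forward_max_match(text, lexicon=TOY_DICT, max_len=MAX_LEN):
    """Greedy left-to-right: take the longest lexicon word starting at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:  # fall back to a single character
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, lexicon=TOY_DICT, max_len=MAX_LEN):
    """Greedy right-to-left: take the longest lexicon word ending at each position."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:  # fall back to a single character
                tokens.append(piece)
                j -= size
                break
    return tokens[::-1]


if __name__ == "__main__":
    text = "研究生物"
    fwd = forward_max_match(text)
    bwd = backward_max_match(text)
    print(fwd)                 # ['研究生', '物']
    print(bwd)                 # ['研究', '生物']
    print(text.count("生物"))  # string search: 1 hit
    print(fwd.count("生物"))   # word search on forward tokens: 0 hits
```

The same four characters produce a hit or a miss for 生物 depending only on the search mode and the segmentation behind it. That is why manual inspection of concordance lines matters.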

The third danger is form variation. Simplified/traditional variants, regional vocabulary, older character forms, abbreviations, and named entities can split results. Searching 信息 may not tell you about 資訊 in Taiwan materials. Searching 普通话 may not tell you about 國語 or 华语. Serious corpus work requires variant awareness.
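
One way to build variant awareness into a search habit is to query a group of related forms together and keep the per-form counts separate. The sketch below assumes a hand-made variant table (`VARIANT_GROUPS`) and three invented sample sentences. The grouped forms are related regional or script variants, not guaranteed exact synonyms, so the breakdown matters more than the sum.

```python
# Hypothetical variant groups, curated by hand for this example.
VARIANT_GROUPS = {
    "信息": ["信息", "資訊", "资讯"],
    "普通话": ["普通话", "普通話", "國語", "国语", "华语", "華語"],
}


def count_variants(texts, headword, groups=VARIANT_GROUPS):
    """Return {form: total string hits} for every listed variant of headword.

    If the headword has no entry in the table, only the headword itself is searched.
    """
    forms = groups.get(headword, [headword])
    return {form: sum(t.count(form) for t in texts) for form in forms}


if __name__ == "__main__":
    # Invented sample sentences, one per form.
    sample = ["这条信息很重要。", "請提供更多資訊。", "手机上的资讯太多了。"]
    result = count_variants(sample, "信息")
    print(result)                # {'信息': 1, '資訊': 1, '资讯': 1}
    print(sum(result.values()))  # 3
```

Searching only 信息 in this sample would report one hit and silently drop two thirds of the relevant evidence.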

The fourth danger is raw count worship. A rare word is not always wrong. A common word is not always appropriate. Some words are rare because they belong to narrow domains; others are common but vague. Some phrases are frequent because of one repeated news formula, not because native speakers casually use them in daily life.
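
A simple guard against raw count worship is to report how many sources a phrase appears in, alongside its total count. The sketch below is one possible way to do that with plain string counting. `range_report` is a hypothetical helper and the three mini-documents are invented.

```python
def range_report(docs, phrase):
    """Summarize how evenly a phrase is spread across sources.

    docs: dict mapping a source id to its text.
    Returns the total hits, the number of sources with at least one hit,
    and the share of all hits that come from the single busiest source.
    """
    per_source = {sid: text.count(phrase) for sid, text in docs.items()}
    total = sum(per_source.values())
    sources_with_hits = sum(1 for n in per_source.values() if n > 0)
    top = max(per_source.values(), default=0)
    top_share = top / total if total else 0.0
    return {
        "total": total,
        "sources_with_hits": sources_with_hits,
        "top_source_share": top_share,
    }


if __name__ == "__main__":
    # Invented documents: one repeats a news formula, the others do not use it.
    docs = {
        "news_a": "有关部门表示，有关部门表示，有关部门表示。",
        "news_b": "他表示同意。",
        "forum_c": "我觉得还行。",
    }
    print(range_report(docs, "有关部门表示"))
    # {'total': 3, 'sources_with_hits': 1, 'top_source_share': 1.0}
```

A total of three hits that all come from one source is weaker evidence of general use than three hits spread over three sources.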

Worked example: 方便, 便利, 便捷

If you search these near-synonyms, raw frequency may not answer the learner's real question. The useful question is: what nouns and verbs does each word appear with?

Word | Likely collocation pattern | Reading implication
方便 | 方便的话, 交通方便, 使用方便 | everyday broad convenience; also social availability
便利 | 提供便利, 交通便利, 生活便利 | infrastructure, policy, service conditions
便捷 | 操作便捷, 便捷服务, 支付便捷 | product/service speed and ease

The corpus is useful because it shows environment, not because it prints a magic answer.
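
Counting neighbors inside a small window is a common way to surface collocation candidates. The following sketch assumes the sentences are already segmented. The token lists are hand-made for this example and `collocates` is a hypothetical helper, not the output or API of a real corpus.

```python
from collections import Counter


def collocates(sentences, node, window=2):
    """Count tokens that occur within `window` positions of each occurrence of `node`.

    sentences: list of token lists (already segmented).
    """
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != node:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the node word itself
                    counts[tokens[j]] += 1
    return counts


if __name__ == "__main__":
    # Hand-segmented toy sentences built from the collocations in the table.
    sents = [
        ["这里", "交通", "方便"],
        ["方便", "的话", "请", "回复"],
        ["政府", "提供", "便利"],
        ["操作", "便捷"],
        ["使用", "方便"],
    ]
    print(collocates(sents, "方便", window=1))
    # Counter({'交通': 1, '的话': 1, '使用': 1})
```

Window-based counts show association only. They do not tell you whether a neighbor is a subject, object, or modifier, so reading the example sentences is still necessary.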

Learner traps and repairs

Trap | Why it misleads | Better habit
Comparing raw counts across different corpora | Different corpora have different source mixes. | Compare within one corpus first.
Searching only one form | Variants and abbreviations split evidence. | Search simplified/traditional and synonyms where relevant.
Treating rare as wrong | Domain terms can be rare but correct. | Ask: rare in which genre?
Ignoring bad segmentation | Counts may merge or split incorrectly. | Inspect concordance lines manually.
Mining examples without source | You lose register and credibility. | Save source, date, genre, and sentence.

Practice protocol

Choose one word pair: 问题/议题, 方便/便捷, 认为/表示. Search in a corpus or a controlled source set. Record five examples from news, five from social media or conversation if available, and five from technical/official text. Write a register note before writing an English gloss.

Additional practice and repair

Corpus misuse diagnostics

Bad inference | What went wrong | Repair
Word A is more common than Word B, so Word A is better | Raw count ignores genre and context. | Compare collocations and example sentences within the same source.
This phrase has few hits, so it is wrong | It may be domain-specific, regional, new, or formal. | Ask “rare where?” before rejecting.
The corpus shows this usage, so I can use it anywhere | Corpus examples preserve source register. | Tag genre, date, region, and speaker/writer type.
A word search found zero results, so the form is not used | Segmentation, simplified/traditional forms, or query form may be wrong. | Try string search, variants, and related forms.
Concordance lines prove meaning | KWIC context may be too short. | Open wider context before drawing a conclusion.
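
The last row is easy to try out. Here is a small keyword-in-context (KWIC) sketch with an adjustable context width, written as plain string search over an invented sentence. `kwic` is a hypothetical helper, not a function of any corpus tool.

```python
def kwic(text, query, width=5):
    """Return (left, match, right) tuples for each non-overlapping string hit.

    `width` is the number of characters of context kept on each side.
    """
    if not query:
        raise ValueError("query must not be empty")
    lines, start = [], 0
    while True:
        pos = text.find(query, start)
        if pos == -1:
            break
        end = pos + len(query)
        left = text[max(0, pos - width):pos]
        right = text[end:end + width]
        lines.append((left, query, right))
        start = end  # continue searching after this hit
    return lines


if __name__ == "__main__":
    text = "他说这个办法不方便，但是如果你方便的话可以试试。"
    for left, match, right in kwic(text, "方便", width=3):
        print(f"{left} [{match}] {right}")
    # 办法不 [方便] ，但是
    # 如果你 [方便] 的话可
```

With `width=0` both hits look identical. Widening the window shows that the first is negated (不方便) and the second belongs to the politeness frame 方便的话.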

Query log template

Field | What to record
Research question | Meaning, collocation, register, region, grammar, or frequency?
Query forms | Simplified, traditional, abbreviation, synonym, inflected frame if relevant.
Corpus/source | Name, date range, genre mix, region if known.
Search mode | String, word, lemma-like, regex, or manual source search.
Example quality | Clear, ambiguous, formulaic, duplicated, noisy, or irrelevant.
Conclusion confidence | High only after example inspection and genre comparison.
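
If you keep the log in code rather than on paper, the template maps naturally onto a small record type. This is one possible sketch. The class name, the field names, and the two boolean flags used to enforce the confidence rule are assumptions made for this example.

```python
from dataclasses import dataclass, asdict


@dataclass
class QueryLogEntry:
    """One row of the query log; field names mirror the template above."""
    research_question: str
    query_forms: list
    corpus_source: str
    search_mode: str
    example_quality: str = "unreviewed"
    conclusion_confidence: str = "low"
    # Hypothetical flags that enforce the template's confidence rule.
    examples_inspected: bool = False
    genres_compared: bool = False

    def set_confidence(self, level: str) -> None:
        """Allow 'high' only after example inspection and genre comparison."""
        if level == "high" and not (self.examples_inspected and self.genres_compared):
            raise ValueError("inspect examples and compare genres before claiming high confidence")
        self.conclusion_confidence = level


if __name__ == "__main__":
    entry = QueryLogEntry(
        research_question="register of 便捷 vs 方便",
        query_forms=["便捷", "方便"],
        corpus_source="hypothetical product-copy source set",
        search_mode="string",
    )
    entry.examples_inspected = True
    entry.genres_compared = True
    entry.set_confidence("high")
    print(asdict(entry)["conclusion_confidence"])  # high
```

Making the confidence level fail loudly keeps the “high only after inspection” rule from being skipped on a busy day.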

Before/after repair set

Weak corpus note | Strong corpus note
便捷 is common. | In the searched product/service texts, 便捷 frequently modifies 服务, 操作, 支付, and 流程; it feels more product-copy/formal than everyday 方便.
议题 is less common than 问题. | 议题 appears in agenda, public debate, and policy contexts; frequency is lower because its domain is narrower.
This phrase is rare. | In this news-heavy corpus it is rare; check spoken/interview/social examples before judging acceptability.

When you build a corpus worksheet, make it a rule to save at least three concordance lines with source metadata before writing a usage note. Add warning fields for duplicate boilerplate, named-entity noise, and one-source overrepresentation, and include a reminder to check simplified/traditional variants.
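
Some of these worksheet checks can be automated. The sketch below covers three of them: the minimum of three fully documented lines, exact duplicate sentences, and single-source evidence. The dictionary keys and the sample lines are assumptions for this example. Named-entity noise and variant coverage still need a human reader, so they are not checked here.

```python
from collections import Counter

MIN_LINES = 3
REQUIRED_KEYS = ("sentence", "source", "date", "genre")


def worksheet_warnings(lines):
    """Return a list of warning strings for a set of saved concordance lines.

    lines: list of dicts that should each carry the keys in REQUIRED_KEYS.
    """
    warnings = []

    # Only lines with every metadata field filled in count toward the minimum.
    complete = [ln for ln in lines if all(ln.get(k) for k in REQUIRED_KEYS)]
    if len(complete) < MIN_LINES:
        warnings.append(f"fewer than {MIN_LINES} lines with full metadata")

    # Exact repeats often signal boilerplate or syndicated text.
    counts = Counter(ln.get("sentence") for ln in lines if ln.get("sentence"))
    if any(n > 1 for n in counts.values()):
        warnings.append("duplicate sentence (possible boilerplate)")

    # Evidence drawn from a single source says little about general usage.
    sources = {ln.get("source") for ln in lines if ln.get("source")}
    if len(lines) > 1 and len(sources) == 1:
        warnings.append("all lines come from one source")

    return warnings


if __name__ == "__main__":
    # Invented lines: three saved hits, two identical, all from one site.
    saved = [
        {"sentence": "本平台提供便捷服务。", "source": "site_a", "date": "2024", "genre": "product copy"},
        {"sentence": "本平台提供便捷服务。", "source": "site_a", "date": "2024", "genre": "product copy"},
        {"sentence": "支付便捷，操作简单。", "source": "site_a", "date": "2024", "genre": "product copy"},
    ]
    for w in worksheet_warnings(saved):
        print(w)
    # duplicate sentence (possible boilerplate)
    # all lines come from one source
```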

Practice visualization

Build a corpus-query worksheet with fields for query form, corpus/source, genre, date, region, segmentation mode, raw frequency, top collocations, example quality, and caution note.

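Before trusting a claim about how a corpus tool counts or segments text, check it against the documentation of the corpus itself (for example BCC or CCL) and against published descriptions of segmentation, such as the Unicode text segmentation rules and ICU's dictionary-based boundary analysis. Corpus evidence is empirical, but it is not self-interpreting.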

Related reading