Designing a Mandarin Error Corpus From Your Own Mistakes
The reader can turn recurring Mandarin mistakes into a personal error corpus that supports diagnosis, targeted practice, and measurable improvement.
Most learners waste their mistakes.
They feel embarrassed, fix the sentence once, nod at the correction, and move on. Two weeks later, the same error returns in a new sentence with new vocabulary.
This is not a character flaw. It is a data problem.
A Mandarin mistake is not automatically useful. A correction is not automatically retained. Feedback is not automatically converted into skill.
To make mistakes useful, you need a personal error corpus.
A corpus is a collection of language data. A personal error corpus is a structured collection of your own learner language: what you tried to say, what was wrong or unnatural, how it was corrected, why the error happened, and what practice should follow.
The key word is structured. A random list of corrections is not an error corpus. A shame diary is not an error corpus. A folder of teacher comments you never revisit is not an error corpus.
A useful error corpus lets you answer questions like:
- Do I misuse 了 in the same way every month?
- Are my tone mistakes mostly third-tone chains or tone-pair timing?
- Do I overuse 是 because of English?
- Are my word-choice errors really collocation errors?
- Do I sound too casual in formal writing?
- Which mistakes are frequent enough to deserve drills?
- Which mistakes are rare and can safely wait?
This is how learners move from “I make many mistakes” to “I know which mistakes matter next.”
What belongs in a personal Mandarin error corpus
Each entry should capture enough information to diagnose the mistake later.
Minimum fields:
| Field | Purpose | Example |
|---|---|---|
| Date | Track recurrence over time | 2026-05-25 |
| Original sentence | Preserve what you actually produced | 我有去过北京 |
| Correction | Record the better version | 我去过北京 |
| Error type | Classify the problem | Grammar: aspect / English transfer |
| Context | Explain where it happened | Speaking lesson, travel story |
| Source of correction | Teacher, tutor, native speaker, self-check, corpus | Tutor |
| Cause hypothesis | Why you made it | Translated “I have been” too directly |
| Practice item | What to do next | 10 sentences with 过, no 有 |
| Severity | How much it affects communication | Medium |
| Recurrence | First time or repeated? | Repeated |
Better fields for serious learners:
| Field | Why it helps |
|---|---|
| Target structure | Lets you group errors by grammar point |
| Register | Helps identify formal/casual mismatch |
| L1 transfer | Shows English-shaped Mandarin patterns |
| Word/collocation family | Helps fix vocabulary networks, not just one word |
| Audio attached | Useful for pronunciation and tone errors |
| Corrective explanation | Captures the principle behind the fix |
| Next review date | Prevents the entry from becoming dead text |
| Status | Open, drilling, improved, resolved, monitor |
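The fields in the two tables above map naturally onto a record type. Here is a minimal sketch in Python. The class name `ErrorEntry`, the field names, and the defaults are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ErrorEntry:
    """One corpus entry. Field names are hypothetical, chosen to mirror the tables."""
    logged_on: date
    original: str            # what you actually produced
    correction: str          # the better version
    error_type: str          # e.g. "Grammar: aspect / English transfer"
    context: str             # where it happened
    source: str              # who or what supplied the correction
    cause_hypothesis: str    # why you think you made it
    practice_item: str       # what to do next
    severity: str = "Medium"
    recurrence_count: int = 1
    status: str = "open"     # open, drilling, improved, resolved, monitor
    next_review: Optional[date] = None

    @property
    def is_repeated(self) -> bool:
        # Assumption: "repeated" means the pattern has been seen more than once.
        return self.recurrence_count > 1


entry = ErrorEntry(
    logged_on=date(2026, 5, 25),
    original="我有去过北京",
    correction="我去过北京",
    error_type="Grammar: aspect / English transfer",
    context="Speaking lesson, travel story",
    source="Tutor",
    cause_hypothesis="Translated 'I have been' too directly",
    practice_item="10 sentences with 过, no 有",
    recurrence_count=2,
)
print(entry.is_repeated)  # True
```

A spreadsheet with the same columns works just as well. The point is that every entry has the same shape, so entries can be grouped and counted later.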
A corpus is only as good as its categories.
Error categories that work for Mandarin
Use broad categories first. You can refine later.
| Category | What it includes | Example |
|---|---|---|
| Pronunciation | initials, finals, tones, rhythm, intonation | shi/xi, an/ang, 2-3 tone pair |
| Character/form | wrong character, typo, simplification, handwriting | 的/地/得; 以/已 |
| Word choice | wrong word for intended meaning | 知道 vs 认识; 方便 vs 便利 |
| Collocation | words that do not naturally pair | 做决定 is good; 制造决定 is not |
| Grammar | particles, word order, aspect, comparison, 把/被 | 我有去过北京 |
| Discourse | topic flow, pronoun omission, connectors, paragraph logic | too many repeated subjects |
| Register | too formal, too casual, too textbook, too slangy | using 成语 in a simple email |
| Pragmatics | politeness, indirectness, social effect | a request sounds like an order |
| Domain language | legal, medical, financial, technical misuse | 责任 vs 义务 |
| Translationese | grammatical but English-shaped | 在昨天我去了学校 |
Do not make 50 categories at the start. That creates friction. Start with 8–10 categories and add subtypes only when patterns emerge.
Entry examples
Example 1: aspect error
Original: 我有去过北京。
Correction: 我去过北京。
Error type: Grammar (experiential 过 / English transfer)
Context: Speaking about travel experience
Cause hypothesis: Directly mapped English “I have been to Beijing” into Chinese with 有
Remediation: Drill experience sentences with 过:
- 我去过北京。
- 我没去过上海。
- 你吃过这个菜吗?
- 我看过这部电影。
- 他以前住过台北。
Rule note: Experiential 过 does not need 有. 没有 can negate experience: 我没有去过北京 is possible, but 有 is not required in the affirmative structure.
Example 2: word-order error
Original: 我在昨天去了学校。
Correction: 我昨天去了学校。 / 昨天我去了学校。
Error type: Grammar (time placement / translationese)
Context: Written diary practice
Cause hypothesis: Translated “on yesterday” or treated the time word like an English prepositional phrase
Remediation: Time-word placement drill:
- 我明天去。
- 他上周回国了。
- 会议下午三点开始。
- 下个月我们搬家。
Rule note: Mandarin time expressions often appear before the verb phrase without 在 unless the phrase is a true location/time frame structure such as 在三点以前, 在会议结束后.
Example 3: collocation error
Original: 我制造了一个决定。
Correction: 我做了一个决定。 / 我决定了。
Error type: Collocation (English “make a decision” transfer)
Context: Speaking practice
Cause hypothesis: Chose 制造 for “make” mechanically
Remediation: Build a “make” verb field:
- 做决定
- 做饭
- 做作业
- 制作视频
- 制造产品
- 造成损失
- 使人担心
- 令人满意
Rule note: English “make” splits into many Mandarin verbs. Learn by object type.
Example 4: register error
Original: 老师,您能不能速速回复?
Correction: 老师,您方便的时候能不能回复一下? / 老师,麻烦您方便时回复一下。
Error type: Register/pragmatics (overly literary, command-like request)
Context: Email to teacher
Cause hypothesis: Learned 速速 as “quickly” but did not know its command-like/literary flavor
Remediation: Request softening ladder:
- 回复。
- 请回复。
- 麻烦回复一下。
- 麻烦您方便的时候回复一下。
- 如果方便的话,麻烦您回复一下。
Rule note: Speed words can sound demanding. Politeness depends on burden, hierarchy, and phrasing, not just 您.
Example 5: pronunciation error
Original audio issue: Saying 买 mǎi with a falling contour, causing confusion with 卖 mài.
Correction target: 买 should have low/dipping third-tone behavior in isolation and appropriate sandhi in context. 卖 is a falling fourth tone.
Error type: Pronunciation (lexical tone / high-risk minimal pair)
Context: Ordering and shopping role-play
Cause hypothesis: Final pitch movement rushed; third tone produced too sharply downward
Remediation: Tone-pair drill:
- 我想买。
- 买一个。
- 买东西。
- 买还是卖?
- 他买了,不是卖了。
Rule note: Prioritize this because 买/卖 is a high-meaning-risk pair.
Severity: not all errors deserve equal attention
Learners often treat every correction as equally important. That is a mistake.
Use a severity scale.
| Severity | Meaning | Example | Action |
|---|---|---|---|
| 1 — Cosmetic | Slightly unnatural but understandable | minor wording preference | Note only |
| 2 — Register issue | Understandable but socially off | too blunt request | Add context note |
| 3 — Recurring grammar issue | Meaning usually recoverable, but repeated | wrong 了, word order | Drill |
| 4 — Meaning-changing | Listener may misunderstand | 买/卖, 不/没 in key context | Prioritize |
| 5 — High-stakes | Legal, medical, financial, safety, identity | wrong dosage, contract term | Avoid unsupervised use; get expert help |
A serious corpus prevents panic. It tells you what matters now.
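The scale above can be encoded so that every logged error gets an action automatically. This is a sketch only. The names `Severity`, `ACTIONS`, and `action_for` are hypothetical, and the action strings are copied from the table.

```python
from enum import IntEnum


class Severity(IntEnum):
    COSMETIC = 1
    REGISTER_ISSUE = 2
    RECURRING_GRAMMAR = 3
    MEANING_CHANGING = 4
    HIGH_STAKES = 5


# Action text taken from the severity table.
ACTIONS = {
    Severity.COSMETIC: "Note only",
    Severity.REGISTER_ISSUE: "Add context note",
    Severity.RECURRING_GRAMMAR: "Drill",
    Severity.MEANING_CHANGING: "Prioritize",
    Severity.HIGH_STAKES: "Avoid unsupervised use; get expert help",
}


def action_for(level: int) -> str:
    """Map a 1-5 severity level to its action; raises ValueError outside 1-5."""
    return ACTIONS[Severity(level)]


print(action_for(4))  # Prioritize
```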
Recurrence matters more than embarrassment
One embarrassing mistake is not automatically a priority. One boring mistake repeated 30 times is.
Add a recurrence count:
| Error pattern | Count | Last seen | Status |
|---|---|---|---|
| 有 + V过 | 7 | 2026-05-24 | active drill |
| 是 + adjective | 5 | 2026-05-22 | improving |
| an/ang confusion | 12 | 2026-05-25 | pronunciation focus |
| 只 placement | 4 | 2026-05-18 | monitor |
| overusing 成语 | 2 | 2026-05-10 | low priority |
When you review monthly, choose the top three recurring patterns. Not ten. Three.
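Picking the top three is a counting task. The sketch below assumes each logged error carries a pattern label, one string per occurrence, and uses the counts from the table above as sample data. The function name `top_patterns` is hypothetical.

```python
from collections import Counter


def top_patterns(pattern_log, n=3):
    """Return the n most frequent error patterns as (pattern, count) pairs."""
    return Counter(pattern_log).most_common(n)


# One label per logged occurrence, mirroring the recurrence table.
log = (
    ["有 + V过"] * 7
    + ["是 + adjective"] * 5
    + ["an/ang confusion"] * 12
    + ["只 placement"] * 4
    + ["overusing 成语"] * 2
)

for pattern, count in top_patterns(log):
    print(pattern, count)
# an/ang confusion 12
# 有 + V过 7
# 是 + adjective 5
```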
How to convert errors into drills
An error entry is not finished until it creates practice.
Grammar error → contrast set
Error: 我有去过北京。
Drill:
- 我去过北京。
- 我没去过北京。
- 你去过北京吗?
- 我去年去了北京。
- 我去北京了。
Goal: distinguish experience 过, completed trip 了, and situation update.
Word-choice error → semantic field
Error: 知道他 instead of 认识他.
Drill:
- 我认识他。
- 我知道这件事。
- 我了解这个情况。
- 我懂你的意思。
- 我明白了。
Goal: map English “know” into Mandarin verb choices.
Pronunciation error → minimal pair + sentence
Error: shi/xi confusion.
Drill:
- 是 / 西
- 事 / 细
- 老师 / 学习
- 这是西边,不是市中心。
- 我喜欢学习历史。
Goal: carry contrast into sentences.
Register error → rewrite ladder
Error: request sounds too direct.
Drill:
- 给我发一下。
- 麻烦发我一下。
- 方便的话,麻烦发我一下。
- 不好意思,能不能麻烦您方便时发我一下?
Goal: choose directness by relationship.
The monthly error review
Once a month, do not review every entry. Review patterns.
Monthly review template
- Total new errors logged:
- Top three recurring categories:
- Highest-severity error:
- Error I thought was fixed but returned:
- Error that is no longer appearing:
- One pronunciation target for next month:
- One grammar target for next month:
- One vocabulary/collocation target for next month:
- Output constraint for next month:
- Reading/listening input to support those targets:
Example:
- Top recurrence: 把 sentences missing result complement
- Drill: write 20 task-completion sentences with 把 + result complement
- Reading input: product manual instructions and workplace task messages
- Output constraint: every time I use 把, check whether the object has a clear outcome
This is how the corpus becomes a curriculum.
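Several lines of the review template are simple counts and can be filled in mechanically, which leaves your attention for choosing targets. This is a rough sketch. It assumes entries are dictionaries with hypothetical keys `date`, `category`, `severity` (1 to 5), and `original`.

```python
from collections import Counter
from datetime import date


def monthly_summary(entries, year, month):
    """Fill the countable lines of the review template for one month."""
    logged = [e for e in entries
              if e["date"].year == year and e["date"].month == month]
    categories = Counter(e["category"] for e in logged)
    worst = max(logged, key=lambda e: e["severity"], default=None)
    return {
        "total_new_errors": len(logged),
        "top_three_categories": [name for name, _ in categories.most_common(3)],
        "highest_severity_error": worst["original"] if worst else None,
    }


entries = [
    {"date": date(2026, 5, 3), "category": "Grammar", "severity": 3,
     "original": "我有去过北京"},
    {"date": date(2026, 5, 10), "category": "Pronunciation", "severity": 4,
     "original": "买 said with a falling contour"},
    {"date": date(2026, 5, 24), "category": "Grammar", "severity": 3,
     "original": "我在昨天去了学校"},
    {"date": date(2026, 4, 28), "category": "Collocation", "severity": 2,
     "original": "我制造了一个决定"},
]

summary = monthly_summary(entries, 2026, 5)
print(summary["total_new_errors"])      # 3
print(summary["top_three_categories"])  # ['Grammar', 'Pronunciation']
```

The judgment lines (targets, output constraint, supporting input) still have to be written by hand.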
What not to log
Do not log everything. You will burn out.
Skip or lightly note:
- One-off typos
- Words far above your current level
- Corrections you do not understand yet
- Stylistic preferences from one speaker unless confirmed
- Errors caused by exhaustion, not pattern
- Corrections in domains you do not plan to use soon
Log seriously:
- Repeated grammar mistakes
- High-frequency word-choice errors
- Tone errors that change meaning
- Politeness mistakes
- Register mismatches
- Mistakes in your target domain
- Errors that native speakers repeatedly correct
A corpus is a filter. It is not a trash can.
Common learner objections
“This sounds like too much work.”
It is too much work if you log everything. Log only patterns. Five good entries per week beat 50 dead corrections.
“I do not know how to classify errors.”
Start rough. Grammar, pronunciation, word choice, register, character. Add detail later.
“My teacher already corrects me.”
Correction is input. A corpus turns correction into a system.
“I feel bad looking at my mistakes.”
Then write the corpus like a researcher, not like a judge. The question is not “Why am I bad?” The question is “What pattern is the data showing?”
This approach follows learner-corpus research rather than the casual mistake diary. The HSK Dynamic Composition Corpus is a useful reference point: it collects and annotates Chinese learner writing, which makes recurring error patterns visible at scale. A personal learner corpus is smaller and informal, but the same principle applies: errors become more useful when they are categorized, contextualized, and reviewed systematically.
Remediation and upgrade layer
The upgraded thesis:
A personal error corpus is useful only when each error becomes a future decision rule, drill, or reading target.
If an error entry does not change future behavior, it is just archive clutter.
Remediation diagnosis: why personal error logs fail
| Failure mode | Symptom | Damage | Repair |
|---|---|---|---|
| Logging everything | The learner records every tiny correction | Review becomes impossible | Log only recurring, high-risk, or structurally revealing errors |
| No context | Entry says “wrong measure word” | The learner cannot reconstruct the mistake | Save the sentence, situation, and intended meaning |
| No cause hypothesis | Entry records correction only | Same error returns under pressure | Add “why I made this mistake” |
| No next drill | Corpus becomes a museum | No behavior change | Convert each major error into a micro-drill |
| Shame sorting | Learner ranks errors by embarrassment | Emotional noise replaces evidence | Rank by recurrence, severity, and communicative risk |
| Mixed model/error sentences | Incorrect original sits beside corrected sentence with no labels | Learner may review the wrong version | Use clear fields: Original, Correction, Model |
| Over-tagging | Entry has 12 tags | The tag system stops being usable | Use one primary error tag and optional secondary tag |
A useful error corpus is small at first. Fifty high-quality entries beat five hundred vague corrections.
Upgrade: the minimum viable error entry
A personal corpus entry should have these fields:
| Field | Example |
|---|---|
| Date | 2026-05-25 |
| Situation | Asked a tutor about weekend plans |
| Intended meaning | “I went to the museum last weekend.” |
| My original sentence | 我上周末去过博物馆。 |
| Correction/model | 我上周末去了博物馆。 |
| Error type | aspect: 过 vs 了 |
| Why it happened | I overused 过 for any past event |
| Rule of thumb | Use 过 for experience; use 了 for a completed specific event |
| Next drill | Write 10 sentences contrasting 去过 vs 去了 |
| Recurrence count | 3 |
| Severity | medium: meaning recoverable but aspect nuance wrong |
| Review date | one week later |
This format is intentionally heavier than a flashcard. It is not for every slip. It is for errors that reveal a pattern.
Primary error taxonomy for Mandarin learners
A stable taxonomy helps, as long as it does not become an academic monster.
| Code | Category | Typical examples | Repair direction |
|---|---|---|---|
| PRON-T | Tone | 买/卖, 2-3 pair, neutral tone | minimal pair, tone-pair recording |
| PRON-S | Segmental sound | x/sh, q/ch, an/ang, ü/u | mouth-position drill, listening test |
| CHAR | Character/form | 的/地/得, 形近字, wrong simplified/traditional form | visual contrast, source sentence |
| WORD | Word choice | 认识 vs 知道, 方便 vs 便利 | collocation and object-type drill |
| COLL | Collocation | 制造决定 for 做决定 | phrase-level mining |
| MW | Measure word/classifier | 一只书 | noun-classifier pairing |
| ASP | Aspect | 了, 过, 着, 在 | timeline and contrast sentences |
| WO | Word order | time placement, adverb scope, relative clauses | sentence reconstruction |
| COMP | Complements | 看懂, 写完, 买得到 | verb-complement drills |
| BA/BEI | 把/被 | missing result, wrong affected object | construction rewrite |
| REG | Register | 成语 overuse, official wording in casual speech | genre labeling |
| PRAG | Pragmatic/social | too blunt request, wrong address term | scenario rewrite |
| TRANS | Translationese | English-shaped sentence | natural Chinese paraphrase |
Each entry should have one primary code. If everything is tagged everything, the corpus stops producing patterns.
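The one-primary-code rule is easy to enforce at the moment of logging. A minimal sketch, using the codes from the table above and a hypothetical helper `tag_entry` that allows at most one secondary code:

```python
PRIMARY_CODES = {
    "PRON-T", "PRON-S", "CHAR", "WORD", "COLL", "MW", "ASP",
    "WO", "COMP", "BA/BEI", "REG", "PRAG", "TRANS",
}


def tag_entry(primary, secondary=None):
    """Return a tag record with exactly one primary and at most one secondary code."""
    if primary not in PRIMARY_CODES:
        raise ValueError(f"unknown primary code: {primary!r}")
    if secondary is not None:
        if secondary not in PRIMARY_CODES:
            raise ValueError(f"unknown secondary code: {secondary!r}")
        if secondary == primary:
            raise ValueError("secondary code must differ from primary")
    return {"primary": primary, "secondary": secondary}


print(tag_entry("REG", "PRAG"))  # {'primary': 'REG', 'secondary': 'PRAG'}
```

Rejecting unknown codes also keeps the taxonomy from growing by accident.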
Severity and recurrence matrix
Not all mistakes deserve equal study time. Add this matrix.
| Severity / Recurrence | Rare | Repeated |
|---|---|---|
| Low severity | Ignore or note lightly | Add to review if it annoys listeners/readers |
| Medium severity | Save if it reveals a pattern | Create a drill and review weekly |
| High severity | Fix quickly if domain-risky | Make it a top-three monthly target |
High severity includes errors that affect names, numbers, addresses, medical/legal terms, professional commitments, or social politeness. A rare typo in a low-stakes chat does not belong in the same queue as recurring tone errors in numbers.
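The matrix can be written as a lookup table. This sketch makes one assumption: an error counts as "repeated" once it has been seen more than once. The names `TRIAGE` and `triage` are hypothetical.

```python
# Cell text taken from the severity / recurrence matrix.
TRIAGE = {
    ("low", "rare"): "Ignore or note lightly",
    ("low", "repeated"): "Add to review if it annoys listeners/readers",
    ("medium", "rare"): "Save if it reveals a pattern",
    ("medium", "repeated"): "Create a drill and review weekly",
    ("high", "rare"): "Fix quickly if domain-risky",
    ("high", "repeated"): "Make it a top-three monthly target",
}


def triage(severity, recurrence_count):
    """Look up the recommended action for a severity label and a recurrence count."""
    # Assumption: anything seen more than once is "repeated".
    recurrence = "repeated" if recurrence_count > 1 else "rare"
    return TRIAGE[(severity.lower(), recurrence)]


print(triage("medium", 3))  # Create a drill and review weekly
```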
Before/after repair sets
Aspect error
Original:
我昨天看过这个电影。
Better:
我昨天看了这部电影。
If the intended meaning is experience at some unspecified time:
我看过这部电影。
Corpus note:
Error type: ASP. Cause: confusing past-time adverb with experiential 过. Drill: write pairs with 昨天/去年 vs 以前/曾经.
Word-choice error
Original:
我认识这个问题。
Better, depending on meaning:
我知道这个问题。
我了解这个问题。
我明白这个问题的意思。
Corpus note:
Error type: WORD. Cause: overextending 认识 from “know.” Drill: sort objects into people / facts / situation / meaning.
Register error
Original in a casual message to a friend:
请贵方于今日下午三点前回复。
Natural casual version:
你今天下午三点前能回我一下吗?
Formal business version:
请贵方于今日下午三点前回复。
Corpus note:
Error type: REG/PRAG. Cause: formal document language used in casual interpersonal context. Drill: rewrite same request for friend, teacher, client, and public notice.
Complement error
Original:
我听这个问题。
Possible repairs:
我听懂了这个问题。
我听清楚了这个问题。
我听到了这个问题。
Corpus note:
Error type: COMP. Cause: using 听 without specifying understand / hear clearly / hear occurrence. Drill: 听见, 听到, 听懂, 听清楚 contrast set.
Monthly review upgrade
The monthly review should be concrete and ruthless.
Step 1: Count recurrence.
Sort entries by primary error code. Do not choose a target because it feels embarrassing. Choose it because it keeps happening.
Step 2: Pick top three.
A good monthly target list might be:
- ASP: overusing 过 for past events.
- WORD: confusing 认识, 知道, 了解, 懂, 明白.
- PRON-S: failing an/ang in common words.
Step 3: Build one drill per target.
| Target | Drill |
|---|---|
| 过 vs 了 | 20 sentence contrasts with time adverbs |
| know verbs | object-sorting table and dialogue fill-in |
| an/ang | listen-record-compare with 15 frequent pairs |
Step 4: Add one reading/listening target.
For 过 vs 了, the learner should read short travel narratives and mark every event sentence. For know verbs, collect examples from interviews and explanations. For an/ang, use audio minimal pairs and real words.
Step 5: Retest under output pressure.
A corrected worksheet does not prove repair. The learner should retell a story, write a paragraph, or answer questions in real time while constrained to use the target pattern.
As a tool, the corpus should be a lightweight database with strong guardrails.
Required fields:
- original attempt;
- correction/model;
- context;
- intended meaning;
- primary error category;
- cause hypothesis;
- next drill;
- recurrence count;
- severity;
- review status.
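One possible guardrail is to refuse to save an entry until every required field is filled. A minimal sketch, assuming entries are dictionaries and using hypothetical key names derived from the list above:

```python
REQUIRED_FIELDS = (
    "original_attempt", "correction", "context", "intended_meaning",
    "primary_category", "cause_hypothesis", "next_drill",
    "recurrence_count", "severity", "review_status",
)


def missing_fields(entry):
    """Return the required fields that are absent or blank in an entry."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = entry.get(name)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing


draft = {
    "original_attempt": "我上周末去过博物馆。",
    "correction": "我上周末去了博物馆。",
    "context": "Asked a tutor about weekend plans",
    "intended_meaning": "I went to the museum last weekend.",
    "primary_category": "ASP",
    "next_drill": "",
    "recurrence_count": 3,
    "severity": "medium",
    "review_status": "open",
}
print(missing_fields(draft))  # ['cause_hypothesis', 'next_drill']
```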
Important UI rule: show the corrected/model sentence more prominently than the incorrect sentence. The original error should be visible for diagnosis, but not visually rehearsed as the target.
Useful filters:
- “show repeated errors only”;
- “show high-severity errors”;
- “show errors by source: tutor, writing, speech, ASR, reading misparse”;
- “show errors with no drill yet”;
- “show fossilized errors older than 60 days.”
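Each of these filters is a one-line predicate over the entries. The sketch below is illustrative: the `Entry` type and its field names are hypothetical, and "fossilized" is read here as an unresolved, repeated error first logged more than 60 days ago, which is an assumption rather than a definition from this article.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Entry:
    pattern: str
    first_seen: date
    recurrence_count: int
    severity: int                 # 1-5
    source: str                   # tutor, writing, speech, ASR, reading misparse
    next_drill: Optional[str]
    status: str = "open"


def repeated_only(entries):
    return [e for e in entries if e.recurrence_count > 1]


def high_severity(entries, threshold=4):
    return [e for e in entries if e.severity >= threshold]


def by_source(entries, source):
    return [e for e in entries if e.source == source]


def missing_drill(entries):
    return [e for e in entries if not e.next_drill]


def fossilized(entries, today, days=60):
    # Assumed reading of "fossilized": unresolved, repeated, and old.
    return [e for e in entries
            if e.status != "resolved"
            and e.recurrence_count > 1
            and (today - e.first_seen).days > days]


entries = [
    Entry("有 + V过", date(2026, 3, 1), 7, 3, "tutor", "10 sentences with 过"),
    Entry("买/卖 tone", date(2026, 5, 20), 1, 4, "speech", None),
    Entry("只 placement", date(2026, 5, 18), 4, 3, "writing", None),
]
today = date(2026, 5, 25)
print([e.pattern for e in fossilized(entries, today)])  # ['有 + V过']
print([e.pattern for e in missing_drill(entries)])      # ['买/卖 tone', '只 placement']
```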
Export modes:
| Export | Use |
|---|---|
| Drill sheet | structured practice from top recurring errors |
| Anki candidates | only corrected model sentences, never raw errors alone |
| Tutor report | concise list of recurring patterns to ask about |
| Monthly summary | error counts, improvements, next targets |
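The "never raw errors alone" rule for flashcard export can be enforced by never writing the original sentence at all. The sketch below produces tab-separated text with the intended meaning as the prompt and the model sentence as the answer. The key names and the column layout are assumptions, so check what your flashcard tool expects before importing.

```python
import csv
import io


def anki_candidates(entries):
    """Build tab-separated rows: intended meaning, model sentence, error code.

    The learner's original (incorrect) sentence is deliberately left out.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for entry in entries:
        if not entry.get("correction"):
            continue  # nothing safe to rehearse yet
        writer.writerow([entry["intended_meaning"],
                         entry["correction"],
                         entry["error_code"]])
    return buffer.getvalue()


entries = [
    {"intended_meaning": "I went to the museum last weekend.",
     "original": "我上周末去过博物馆。",
     "correction": "我上周末去了博物馆。",
     "error_code": "ASP"},
    {"intended_meaning": "I know this person.",
     "original": "我知道他。",
     "correction": "",
     "error_code": "WORD"},
]
print(anki_candidates(entries), end="")
```

The second entry is skipped because it has no model sentence yet, so the export never contains an uncorrected error.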
Privacy and psychological safety
A serious error corpus often contains personal speech, tutor corrections, workplace phrases, or private writing. Do not upload sensitive material into public tools without thinking. Remove names, contact details, employer information, health details, and private chat content unless the storage environment is appropriate.
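Part of that removal step can be automated before an entry leaves your own machine. The sketch below is deliberately naive. It masks email addresses, 11-digit numbers starting with 1 (an assumed shape for mainland China mobile numbers), and any names you list by hand. It will not catch employers, health details, or names you forgot to list, so it does not replace reading the entry yourself.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
# Assumption: an 11-digit run starting with 1 is a mobile number.
PHONE = re.compile(r"(?<!\d)1\d{10}(?!\d)")


def redact(text, names=()):
    """Mask emails, phone-like numbers, and explicitly listed names."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    for name in names:
        text = text.replace(name, "[NAME]")
    return text


sample = "王小明的电话是13812345678,邮箱是xiaoming@example.com。"
print(redact(sample, names=["王小明"]))
# [NAME]的电话是[PHONE],邮箱是[EMAIL]。
```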
Error tracking should not turn into self-surveillance either. The learner is not trying to prove that they are bad at Chinese. They are trying to make errors visible enough to retire them.
The HSK Dynamic Composition Corpus collects writing by foreign HSK test takers and supports error-oriented corpus work. Its public description notes earlier versions containing more than ten thousand compositions and millions of characters, with Version 2.0 adding search and error-related functions. A personal error corpus is far smaller and less formal, but it rests on the same methodological idea: errors become useful when they are categorized, contextualized, and reviewed for recurring patterns.