Inkuntri
Chinese Research, tools & pedagogy

Designing a Mandarin Error Corpus From Your Own Mistakes

Turn your recurring Mandarin mistakes into a personal error corpus that supports diagnosis, targeted practice, and measurable improvement.

Published April 9, 2026 Chinese


Most learners waste their mistakes.

They feel embarrassed, fix the sentence once, nod at the correction, and move on. Two weeks later, the same error returns in a new sentence with new vocabulary.

This is not a character flaw. It is a data problem.

A Mandarin mistake is not automatically useful. A correction is not automatically retained. Feedback is not automatically converted into skill.

To make mistakes useful, you need a personal error corpus.

A corpus is a collection of language data. A personal error corpus is a structured collection of your own learner language: what you tried to say, what was wrong or unnatural, how it was corrected, why the error happened, and what practice should follow.

The key word is structured. A random list of corrections is not an error corpus. A shame diary is not an error corpus. A folder of teacher comments you never revisit is not an error corpus.

A useful error corpus lets you answer questions like:

  • Do I misuse 了 in the same way every month?
  • Are my tone mistakes mostly third-tone chains or tone-pair timing?
  • Do I overuse 是 because of English?
  • Are my word-choice errors really collocation errors?
  • Do I sound too casual in formal writing?
  • Which mistakes are frequent enough to deserve drills?
  • Which mistakes are rare and can safely wait?

This is how learners move from “I make many mistakes” to “I know which mistakes matter next.”

What belongs in a personal Mandarin error corpus

Each entry should capture enough information to diagnose the mistake later.

Minimum fields:

Field | Purpose | Example
Date | Track recurrence over time | 2026-05-25
Original sentence | Preserve what you actually produced | 我有去过北京
Correction | Record the better version | 我去过北京
Error type | Classify the problem | Grammar: aspect / English transfer
Context | Explain where it happened | Speaking lesson, travel story
Source of correction | Teacher, tutor, native speaker, self-check, corpus | Tutor
Cause hypothesis | Why you made it | Translated “I have been” too directly
Practice item | What to do next | 10 sentences with 过, no 有
Severity | How much it affects communication | Medium
Recurrence | First time or repeated? | Repeated

Better fields for serious learners:

Field | Why it helps
Target structure | Lets you group errors by grammar point
Register | Helps identify formal/casual mismatch
L1 transfer | Shows English-shaped Mandarin patterns
Word/collocation family | Helps fix vocabulary networks, not just one word
Audio attached | Useful for pronunciation and tone errors
Corrective explanation | Captures the principle behind the fix
Next review date | Prevents the entry from becoming dead text
Status | Open, drilling, improved, resolved, monitor
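One way to keep entries structured is to give every entry the same shape. The Python sketch below models the minimum fields as required attributes and a few of the optional fields as defaults. The class name `ErrorEntry` and all attribute names are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ErrorEntry:
    """One corpus entry. Attribute names are illustrative assumptions."""
    # Minimum fields
    logged_on: date
    original: str           # what you actually produced
    correction: str         # the better version
    error_type: str
    context: str
    source: str             # teacher, tutor, native speaker, self-check, corpus
    cause_hypothesis: str
    practice_item: str
    severity: str
    repeated: bool
    # A few of the optional fields
    target_structure: Optional[str] = None
    next_review: Optional[date] = None
    status: str = "open"    # open, drilling, improved, resolved, monitor


entry = ErrorEntry(
    logged_on=date(2026, 5, 25),
    original="我有去过北京",
    correction="我去过北京",
    error_type="Grammar: aspect / English transfer",
    context="Speaking lesson, travel story",
    source="Tutor",
    cause_hypothesis="Translated 'I have been' too directly",
    practice_item="10 sentences with 过, no 有",
    severity="Medium",
    repeated=True,
    target_structure="experiential 过",
)
```

Because the minimum fields have no defaults, an entry cannot be created without a cause hypothesis or a practice item, which is the point of the structure.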

A corpus is only as good as its categories.

Error categories that work for Mandarin

Use broad categories first. You can refine later.

Category | What it includes | Example
Pronunciation | initials, finals, tones, rhythm, intonation | shi/xi, an/ang, 2-3 tone pair
Character/form | wrong character, typo, simplification, handwriting | 的/地/得; 以/已
Word choice | wrong word for intended meaning | 知道 vs 认识; 方便 vs 便利
Collocation | words that do not naturally pair | 做决定 is good; 制造决定 is not
Grammar | particles, word order, aspect, comparison, 把/被 | 我有去过北京
Discourse | topic flow, pronoun omission, connectors, paragraph logic | too many repeated subjects
Register | too formal, too casual, too textbook, too slangy | using 成语 in a simple email
Pragmatics | politeness, indirectness, social effect | a request sounds like an order
Domain language | legal, medical, financial, technical misuse | 责任 vs 义务
Translationese | grammatical but English-shaped | 在昨天我去了学校

Do not make 50 categories at the start. That creates friction. Start with 8–10 categories and add subtypes only when patterns emerge.

Entry examples

Example 1: aspect error

Original: 我有去过北京。
Correction: 我去过北京。
Error type: Grammar (experiential 过 / English transfer)
Context: Speaking about travel experience
Cause hypothesis: Directly mapped English “I have been to Beijing” into Chinese with 有
Remediation: Drill experience sentences with 过:

  • 我去过北京。
  • 我没去过上海。
  • 你吃过这个菜吗?
  • 我看过这部电影。
  • 他以前住过台北。

Rule note: In standard Mandarin, affirmative experiential 过 does not take 有. The negative uses 没 or 没有: 我没去过北京 and 我没有去过北京 are both fine.

Example 2: word-order error

Original: 我在昨天去了学校。
Correction: 我昨天去了学校。 / 昨天我去了学校。
Error type: Grammar (time placement / translationese)
Context: Written diary practice
Cause hypothesis: Translated “on yesterday” or treated time like an English prepositional phrase
Remediation: Time-word placement drill:

  • 我明天去。
  • 他上周回国了。
  • 会议下午三点开始。
  • 下个月我们搬家。

Rule note: Mandarin time expressions usually go before the verb phrase without 在. Keep 在 only when the phrase marks a bounded time frame, such as 在三点以前 or 在会议结束后.

Example 3: collocation error

Original: 我制造了一个决定。
Correction: 我做了一个决定。 / 我决定了。
Error type: Collocation (English “make a decision” transfer)
Context: Speaking practice
Cause hypothesis: Chose 制造 for “make” mechanically
Remediation: Build a “make” verb field:

  • 做决定
  • 做饭
  • 做作业
  • 制作视频
  • 制造产品
  • 造成损失
  • 使人担心
  • 令人满意

Rule note: English “make” splits into many Mandarin verbs. Learn by object type.

Example 4: register error

Original: 老师,您能不能速速回复?
Correction: 老师,您方便的时候能不能回复一下? / 老师,麻烦您方便时回复一下。
Error type: Register/pragmatics (overly literary, command-like request)
Context: Email to teacher
Cause hypothesis: Learned 速速 as “quickly” but did not know its command-like, literary flavor
Remediation: Request softening ladder:

  • 回复。
  • 请回复。
  • 麻烦回复一下。
  • 麻烦您方便的时候回复一下。
  • 如果方便的话,麻烦您回复一下。

Rule note: Speed words can sound demanding. Politeness depends on burden, hierarchy, and phrasing, not just 您.

Example 5: pronunciation error

Original audio issue: Saying 买 mǎi with a falling contour, causing confusion with 卖 mài.
Correction target: 买 should have low/dipping third-tone behavior in isolation and appropriate sandhi in context. 卖 is a falling fourth tone.
Error type: Pronunciation (lexical tone / high-risk minimal pair)
Context: Ordering and shopping role-play
Cause hypothesis: Final pitch movement rushed; third tone produced too sharply downward
Remediation: Tone-pair drill:

  • 我想买。
  • 买一个。
  • 买东西。
  • 买还是卖?
  • 他买了,不是卖了。

Rule note: Prioritize this because 买/卖 is a high-meaning-risk pair.

Severity: not all errors deserve equal attention

Learners often treat every correction as equally important. That is a mistake.

Use a severity scale.

Severity | Meaning | Example | Action
1: Cosmetic | Slightly unnatural but understandable | minor wording preference | Note only
2: Register issue | Understandable but socially off | too blunt request | Add context note
3: Recurring grammar issue | Meaning usually recoverable, but repeated | wrong 了, word order | Drill
4: Meaning-changing | Listener may misunderstand | 买/卖, 不/没 in key context | Prioritize
5: High-stakes | Legal, medical, financial, safety, identity | wrong dosage, contract term | Avoid unsupervised use; get expert help

A serious corpus prevents panic. It tells you what matters now.
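If the corpus lives in a spreadsheet or script, the severity scale can be stored as a number and looked up when needed. The sketch below is one possible encoding in Python; the names `Severity`, `ACTIONS`, and `action_for` are assumptions made for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    COSMETIC = 1
    REGISTER_ISSUE = 2
    RECURRING_GRAMMAR = 3
    MEANING_CHANGING = 4
    HIGH_STAKES = 5


# Suggested action for each level, copied from the severity scale.
ACTIONS = {
    Severity.COSMETIC: "Note only",
    Severity.REGISTER_ISSUE: "Add context note",
    Severity.RECURRING_GRAMMAR: "Drill",
    Severity.MEANING_CHANGING: "Prioritize",
    Severity.HIGH_STAKES: "Avoid unsupervised use; get expert help",
}


def action_for(level: int) -> str:
    """Return the suggested action; raises ValueError outside 1-5."""
    return ACTIONS[Severity(level)]
```

Storing the level as an integer keeps entries sortable, so the most serious errors can be listed first.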

Recurrence matters more than embarrassment

One embarrassing mistake is not automatically a priority. One boring mistake repeated 30 times is.

Add a recurrence count:

Error pattern | Count | Last seen | Status
有 + V过 | 7 | 2026-05-24 | active drill
是 + adjective | 5 | 2026-05-22 | improving
an/ang confusion | 12 | 2026-05-25 | pronunciation focus
只 placement | 4 | 2026-05-18 | monitor
overusing 成语 | 2 | 2026-05-10 | low priority

When you review monthly, choose the top three recurring patterns. Not ten. Three.
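Counting recurrence by hand works for a small corpus, but a few lines of code can do it once entries pile up. The Python sketch below assumes each sighting is stored as a (pattern, date) pair. The sample data and the function name `top_patterns` are made up for illustration.

```python
from collections import Counter
from datetime import date

# (error pattern, date seen) pairs; sample data for illustration only.
sightings = [
    ("an/ang confusion", date(2026, 5, 20)),
    ("有 + V过", date(2026, 5, 21)),
    ("an/ang confusion", date(2026, 5, 22)),
    ("是 + adjective", date(2026, 5, 22)),
    ("有 + V过", date(2026, 5, 24)),
    ("an/ang confusion", date(2026, 5, 25)),
    ("只 placement", date(2026, 5, 18)),
]


def top_patterns(sightings, n=3):
    """Return (pattern, count, last seen) for the n most frequent patterns."""
    counts = Counter(pattern for pattern, _ in sightings)
    last_seen = {}
    for pattern, seen in sightings:
        if pattern not in last_seen or seen > last_seen[pattern]:
            last_seen[pattern] = seen
    return [(p, c, last_seen[p]) for p, c in counts.most_common(n)]


top_three = top_patterns(sightings)
```

The default of `n=3` mirrors the rule above: three targets per month, not ten.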

How to convert errors into drills

An error entry is not finished until it creates practice.

Grammar error → contrast set

Error: 我有去过北京。 Drill:

  • 我去过北京。
  • 我没去过北京。
  • 你去过北京吗?
  • 我去年去了北京。
  • 我去北京了。

Goal: distinguish experience 过, completed trip 了, and situation update.

Word-choice error → semantic field

Error: 知道他 instead of 认识他. Drill:

  • 我认识他。
  • 我知道这件事。
  • 我了解这个情况。
  • 我懂你的意思。
  • 我明白了。

Goal: map English “know” into Mandarin verb choices.

Pronunciation error → minimal pair + sentence

Error: shi/xi confusion. Drill:

  • 是 / 西
  • 事 / 细
  • 老师 / 学习
  • 这是西边,不是市中心。
  • 我喜欢学习历史。

Goal: carry contrast into sentences.

Register error → rewrite ladder

Error: request sounds too direct. Drill:

  • 给我发一下。
  • 麻烦发我一下。
  • 方便的话,麻烦发我一下。
  • 不好意思,能不能麻烦您方便时发我一下?

Goal: choose directness by relationship.

The monthly error review

Once a month, do not review every entry. Review patterns.

Monthly review template

  1. Total new errors logged:
  2. Top three recurring categories:
  3. Highest-severity error:
  4. Error I thought was fixed but returned:
  5. Error that is no longer appearing:
  6. One pronunciation target for next month:
  7. One grammar target for next month:
  8. One vocabulary/collocation target for next month:
  9. Output constraint for next month:
  10. Reading/listening input to support those targets:

Example:

  • Top recurrence: 把 sentences missing result complement
  • Drill: write 20 task-completion sentences with 把 + result complement
  • Reading input: product manual instructions and workplace task messages
  • Output constraint: every time I use 把, check whether the object has a clear outcome

This is how the corpus becomes a curriculum.
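The first three items of the review template can be filled in mechanically. The Python sketch below assumes entries are dictionaries with `date`, `category`, `severity` (1 to 5), and `original` keys. Those key names and the function `monthly_summary` are assumptions, not part of any existing tool.

```python
from collections import Counter
from datetime import date


def monthly_summary(entries, year, month):
    """Fill template items 1-3 from a list of entry dicts."""
    this_month = [e for e in entries
                  if e["date"].year == year and e["date"].month == month]
    if not this_month:
        return {"total_new": 0, "top_categories": [], "highest_severity": None}
    categories = Counter(e["category"] for e in this_month)
    worst = max(this_month, key=lambda e: e["severity"])
    return {
        "total_new": len(this_month),
        "top_categories": [c for c, _ in categories.most_common(3)],
        "highest_severity": worst["original"],
    }


# Sample entries for illustration only.
entries = [
    {"date": date(2026, 5, 3), "category": "ASP", "severity": 3,
     "original": "我昨天看过这个电影。"},
    {"date": date(2026, 5, 10), "category": "ASP", "severity": 3,
     "original": "我上周末去过博物馆。"},
    {"date": date(2026, 5, 12), "category": "PRON-T", "severity": 4,
     "original": "买 said as 卖"},
    {"date": date(2026, 4, 28), "category": "WORD", "severity": 2,
     "original": "我认识这个问题。"},
]

summary = monthly_summary(entries, 2026, 5)
```

The remaining template items (targets, output constraint, reading input) are decisions, so they stay manual.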

What not to log

Do not log everything. You will burn out.

Skip or lightly note:

  • One-off typos
  • Words far above your current level
  • Corrections you do not understand yet
  • Stylistic preferences from one speaker unless confirmed
  • Errors caused by exhaustion, not pattern
  • Corrections in domains you do not plan to use soon

Log seriously:

  • Repeated grammar mistakes
  • High-frequency word-choice errors
  • Tone errors that change meaning
  • Politeness mistakes
  • Register mismatches
  • Mistakes in your target domain
  • Errors that native speakers repeatedly correct

A corpus is a filter. It is not a trash can.

Common learner objections

“This sounds like too much work.”

It is too much work if you log everything. Log only patterns. Five good entries per week beat 50 dead corrections.

“I do not know how to classify errors.”

Start rough. Grammar, pronunciation, word choice, register, character. Add detail later.

“My teacher already corrects me.”

Correction is input. A corpus turns correction into a system.

“I feel bad looking at my mistakes.”

Then write the corpus like a researcher, not like a judge. The question is not “Why am I bad?” The question is “What pattern is the data showing?”

Treat this as a small-scale version of learner-corpus research rather than a casual mistake diary. The HSK Dynamic Composition Corpus is a useful reference point because it collects and annotates Chinese learner writing, making recurring error patterns visible at scale. Personal learner corpora are smaller and informal, but the same principle applies: errors become more useful when they are categorized, contextualized, and reviewed systematically.

Remediation and upgrade layer

The upgraded thesis:

A personal error corpus is useful only when each error becomes a future decision rule, drill, or reading target.

If an error entry does not change future behavior, it is just archive clutter.

Remediation diagnosis: why personal error logs fail

Failure mode | Symptom | Damage | Repair
Logging everything | The learner records every tiny correction | Review becomes impossible | Log only recurring, high-risk, or structurally revealing errors
No context | Entry says “wrong measure word” | The learner cannot reconstruct the mistake | Save the sentence, situation, and intended meaning
No cause hypothesis | Entry records correction only | Same error returns under pressure | Add “why I made this mistake”
No next drill | Corpus becomes a museum | No behavior change | Convert each major error into a micro-drill
Shame sorting | Learner ranks errors by embarrassment | Emotional noise replaces evidence | Rank by recurrence, severity, and communicative risk
Mixed model/error sentences | Incorrect original sits beside corrected sentence with no labels | Learner may review the wrong version | Use clear fields: Original, Correction, Model
Over-tagging | Entry has 12 tags | The tag system stops being usable | Use one primary error tag and optional secondary tag

A useful error corpus is small at first. Fifty high-quality entries beat five hundred vague corrections.

Upgrade: the minimum viable error entry

A personal corpus entry should have these fields:

Field | Example
Date | 2026-05-25
Situation | Asked a tutor about weekend plans
Intended meaning | “I went to the museum last weekend.”
My original sentence | 我上周末去过博物馆。
Correction/model | 我上周末去了博物馆。
Error type | aspect: 过 vs 了
Why it happened | I overused 过 for any past event
Rule of thumb | Use 过 for experience; use 了 for completed specific event
Next drill | Write 10 sentences contrasting 去过 vs 去了
Recurrence count | 3
Severity | medium: meaning recoverable but aspect nuance wrong
Review date | one week later

This format is intentionally heavier than a flashcard. It is not for every slip. It is for errors that reveal a pattern.

Primary error taxonomy for Mandarin learners

The taxonomy should be stable, but not an academic monster.

Code | Category | Typical examples | Repair direction
PRON-T | Tone | 买/卖, 2-3 pair, neutral tone | minimal pair, tone-pair recording
PRON-S | Segmental sound | x/sh, q/ch, an/ang, ü/u | mouth-position drill, listening test
CHAR | Character/form | 的/地/得, 形近字, wrong simplified/traditional form | visual contrast, source sentence
WORD | Word choice | 认识 vs 知道, 方便 vs 便利 | collocation and object-type drill
COLL | Collocation | 做决定了一个计划 | phrase-level mining
MW | Measure word/classifier | 一只书 | noun-classifier pairing
ASP | Aspect | 了, 过, 着, 在 | timeline and contrast sentences
WO | Word order | time placement, adverb scope, relative clauses | sentence reconstruction
COMP | Complements | 看懂, 写完, 买得到 | verb-complement drills
BA/BEI | 把/被 | missing result, wrong affected object | construction rewrite
REG | Register | 成语 overuse, official wording in casual speech | genre labeling
PRAG | Pragmatic/social | too blunt request, wrong address term | scenario rewrite
TRANS | Translationese | English-shaped sentence | natural Chinese paraphrase

Each entry should have one primary code. If everything is tagged everything, the corpus stops producing patterns.
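The one-primary-code rule is easy to enforce if tags are checked when an entry is saved. The Python sketch below encodes the taxonomy codes as an enum and allows at most one optional secondary code. The names `Code` and `make_tags` are illustrative assumptions.

```python
from enum import Enum
from typing import Optional, Tuple


class Code(Enum):
    PRON_T = "PRON-T"
    PRON_S = "PRON-S"
    CHAR = "CHAR"
    WORD = "WORD"
    COLL = "COLL"
    MW = "MW"
    ASP = "ASP"
    WO = "WO"
    COMP = "COMP"
    BA_BEI = "BA/BEI"
    REG = "REG"
    PRAG = "PRAG"
    TRANS = "TRANS"


def make_tags(primary: str,
              secondary: Optional[str] = None) -> Tuple[Code, Optional[Code]]:
    """Validate one primary code and at most one secondary code.

    Unknown codes raise ValueError, as does repeating the primary code.
    """
    first = Code(primary)
    second = Code(secondary) if secondary is not None else None
    if second == first:
        raise ValueError("secondary code must differ from primary code")
    return first, second
```

For example, `make_tags("REG", "PRAG")` accepts the register entry that also has a pragmatic side, while a made-up code is rejected before it can pollute the counts.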

Severity and recurrence matrix

Not all mistakes deserve equal study time. Use this matrix to triage them.

Severity / Recurrence | Rare | Repeated
Low severity | Ignore or note lightly | Add to review if it annoys listeners/readers
Medium severity | Save if it reveals a pattern | Create a drill and review weekly
High severity | Fix quickly if domain-risky | Make it a top-three monthly target

High severity includes errors that affect names, numbers, addresses, medical/legal terms, professional commitments, or social politeness. A rare typo in a low-stakes chat does not belong in the same queue as recurring tone errors in numbers.
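The matrix is small enough to express as a lookup table. The Python sketch below keys it on a severity label and a repeated flag. The names `MATRIX` and `triage` are assumptions for illustration.

```python
# (severity, repeated) -> recommended handling, copied from the matrix.
MATRIX = {
    ("low", False): "Ignore or note lightly",
    ("low", True): "Add to review if it annoys listeners/readers",
    ("medium", False): "Save if it reveals a pattern",
    ("medium", True): "Create a drill and review weekly",
    ("high", False): "Fix quickly if domain-risky",
    ("high", True): "Make it a top-three monthly target",
}


def triage(severity: str, repeated: bool) -> str:
    """Look up the handling for a severity label and recurrence flag."""
    return MATRIX[(severity.lower(), repeated)]
```

A recurring tone error in numbers would be `triage("high", True)`, while a rare typo in a low-stakes chat would be `triage("low", False)`.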

Before/after repair sets

Aspect error

Original:

我昨天看过这个电影。

Better:

我昨天看了这部电影。

If the intended meaning is experience at some unspecified time:

我看过这部电影。

Corpus note:

Error type: ASP. Cause: confusing past-time adverb with experiential 过. Drill: write pairs with 昨天/去年 vs 以前/曾经.

Word-choice error

Original:

我认识这个问题。

Better, depending on meaning:

我知道这个问题。
我了解这个问题。
我明白这个问题的意思。

Corpus note:

Error type: WORD. Cause: overextending 认识 from “know.” Drill: sort objects into people / facts / situation / meaning.

Register error

Original in a casual message to a friend:

请贵方于今日下午三点前回复。

Natural casual version:

你今天下午三点前能回我一下吗?

Formal business version:

请贵方于今日下午三点前回复。

Corpus note:

Error type: REG/PRAG. Cause: formal document language used in casual interpersonal context. Drill: rewrite same request for friend, teacher, client, and public notice.

Complement error

Original:

我听这个问题。

Possible repairs:

我听懂了这个问题。
我听清楚了这个问题。
我听到了这个问题。

Corpus note:

Error type: COMP. Cause: using 听 without specifying understand / hear clearly / hear occurrence. Drill: 听见, 听到, 听懂, 听清楚 contrast set.

Monthly review upgrade

The monthly review should be concrete and ruthless.

Step 1: Count recurrence.

Sort entries by primary error code. Do not choose a target because it feels embarrassing. Choose it because it keeps happening.

Step 2: Pick top three.

A good monthly target list might be:

  1. ASP: overusing 过 for past events.
  2. WORD: confusing 认识, 知道, 了解, 懂, 明白.
  3. PRON-S: failing an/ang in common words.

Step 3: Build one drill per target.

Target | Drill
过 vs 了 | 20 sentence contrasts with time adverbs
know verbs | object-sorting table and dialogue fill-in
an/ang | listen-record-compare with 15 frequent pairs

Step 4: Add one reading/listening target.

For 过 vs 了, read short travel narratives and mark every event sentence. For know verbs, collect examples from interviews and explanations. For an/ang, use audio minimal pairs and real words.

Step 5: Retest under output pressure.

A corrected worksheet does not prove repair. The learner should retell a story, write a paragraph, or answer questions in real time while constrained to use the target pattern.

If you build the corpus as a tool, keep it a lightweight database with strong guardrails.

Required fields:

  • original attempt;
  • correction/model;
  • context;
  • intended meaning;
  • primary error category;
  • cause hypothesis;
  • next drill;
  • recurrence count;
  • severity;
  • review status.
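A simple guardrail is to refuse, or at least flag, entries that are missing a required field. The Python sketch below assumes entries are plain dictionaries. The key names in `REQUIRED` are hypothetical spellings of the fields listed above.

```python
# Hypothetical key names for the required fields listed above.
REQUIRED = [
    "original_attempt",
    "correction_model",
    "context",
    "intended_meaning",
    "primary_category",
    "cause_hypothesis",
    "next_drill",
    "recurrence_count",
    "severity",
    "review_status",
]


def missing_fields(entry: dict) -> list:
    """Return required keys that are absent, None, or an empty string."""
    return [name for name in REQUIRED if entry.get(name) in (None, "")]


# Sample entry for illustration: no drill or cause hypothesis yet.
draft = {
    "original_attempt": "我上周末去过博物馆。",
    "correction_model": "我上周末去了博物馆。",
    "context": "Asked a tutor about weekend plans",
    "intended_meaning": "I went to the museum last weekend.",
    "primary_category": "ASP",
    "next_drill": "",
    "recurrence_count": 0,
    "severity": "medium",
    "review_status": "open",
}

gaps = missing_fields(draft)
```

A recurrence count of 0 is treated as a valid value here, so a first sighting is not reported as missing.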

Important UI rule: show the corrected/model sentence more prominently than the incorrect sentence. The original error should be visible for diagnosis, but not visually rehearsed as the target.

Useful filters:

  • “show repeated errors only”;
  • “show high-severity errors”;
  • “show errors by source: tutor, writing, speech, ASR, reading misparse”;
  • “show errors with no drill yet”;
  • “show fossilized errors older than 60 days.”
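Most of these filters are one-line list comprehensions. The Python sketch below shows three of them over dictionary entries. The key names are assumptions, and "fossilized" is interpreted here as first seen more than 60 days ago and still not resolved, which is one reading of the filter rather than a fixed definition.

```python
from datetime import date, timedelta


def repeated_only(entries):
    """Entries seen more than once."""
    return [e for e in entries if e["recurrence_count"] > 1]


def without_drill(entries):
    """Entries that have no drill attached yet."""
    return [e for e in entries if not e.get("next_drill")]


def fossilized(entries, today, days=60):
    """Unresolved entries first seen more than `days` days ago (assumed reading)."""
    cutoff = today - timedelta(days=days)
    return [e for e in entries
            if e["first_seen"] < cutoff and e["status"] != "resolved"]


# Sample data for illustration only.
today = date(2026, 5, 25)
entries = [
    {"pattern": "有 + V过", "recurrence_count": 7,
     "next_drill": "10 sentences with 过",
     "first_seen": date(2026, 2, 1), "status": "drilling"},
    {"pattern": "只 placement", "recurrence_count": 1,
     "next_drill": "",
     "first_seen": date(2026, 5, 18), "status": "open"},
]

old_and_open = fossilized(entries, today)
```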

Export modes:

Export | Use
Drill sheet | structured practice from top recurring errors
Anki candidates | only corrected model sentences, never raw errors alone
Tutor report | concise list of recurring patterns to ask about
Monthly summary | error counts, improvements, next targets
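The rule for flashcard candidates (model sentences only, never raw errors) can be built into the export itself. The Python sketch below writes tab-separated text with the intended meaning on the front and the corrected model sentence on the back. The key names and the function `anki_candidates` are assumptions, and the two-column layout is one possible card design.

```python
import csv
import io


def anki_candidates(entries) -> str:
    """Return tab-separated rows: intended meaning, model sentence.

    The learner's original (incorrect) sentence is never written out.
    Entries without a model sentence are skipped.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for e in entries:
        model = e.get("correction_model")
        if not model:
            continue
        writer.writerow([e.get("intended_meaning", ""), model])
    return buffer.getvalue()


# Sample data for illustration only.
cards = anki_candidates([
    {"original_attempt": "我上周末去过博物馆。",
     "correction_model": "我上周末去了博物馆。",
     "intended_meaning": "I went to the museum last weekend."},
    {"original_attempt": "我听这个问题。", "correction_model": ""},
])
```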

Privacy and psychological safety

A serious error corpus often contains personal speech, tutor corrections, workplace phrases, or private writing. Do not upload sensitive material into public tools without thinking. Remove names, contact details, employer information, health details, and private chat content unless the storage environment is appropriate.

Do not let error tracking turn into self-surveillance either. You are not trying to prove that you are bad at Chinese. You are trying to make errors visible enough to retire them.
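Part of that cleanup can be done before a sentence is saved. The Python sketch below masks email addresses, phone-like digit runs, and any names you list explicitly. The patterns are deliberately crude assumptions: they will miss some details and over-match others (a date written with hyphens looks like a phone number to this pattern), so treat the output as a first pass, not a guarantee.

```python
import re

# Crude illustrative patterns, not a complete privacy filter.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
PHONE = re.compile(r"\+?\d[\d\s-]{6,}\d")


def redact(text: str, names=()) -> str:
    """Mask emails, phone-like numbers, and explicitly listed names."""
    text = EMAIL.sub("[email]", text)
    text = PHONE.sub("[phone]", text)
    for name in names:
        text = text.replace(name, "[name]")
    return text


cleaned = redact("请联系王小明 wang@example.com 或 138-0000-0000",
                 names=["王小明"])
```

Employer information, health details, and private chat content cannot be caught by patterns like these and still need a manual read.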

The HSK Dynamic Composition Corpus shows the same idea at scale: it collects writing by foreign HSK test takers and supports error-oriented corpus work. Its public description notes earlier versions containing more than ten thousand compositions and millions of characters, with Version 2.0 adding search and error-related functions. A personal error corpus is far smaller and less formal, but the method is the same: categorize errors, record their context, and review them for recurring patterns.
