|
| 1 | + |
| 2 | +--- |
| 3 | +title: "Building my own app for learning Chinese" |
| 4 | +type: tech |
| 5 | +weight: 6 |
| 6 | +--- |
| 7 | + |
| 8 | +[Just want to see the app?](#the-app) |
| 9 | + |
| 10 | +## My learning journey so far |
| 11 | + |
| 12 | +I've been studying Mandarin Chinese for a few years now, using a few pretty |
| 13 | +great apps. I started out learning ultra-basic phrases from my spouse, who |
| 14 | +bought me Pimsleur to listen to in the car; that was my foundation. The next |
| 15 | +app I used, which got me all the way through the basics, was [Hello |
| 16 | +Chinese](https://www.hellochinese.cc/), easily the best app I've used for |
| 17 | +learning. Besides gamified lessons and reviews, it included a reader with a |
| 18 | +solid library of short texts. While reading, if you didn't know a word, you could |
| 19 | +tap on it to hear it and see the English translation. There are other apps for |
| 20 | +these graded readers, like [DuChinese](https://duchinese.net/), but I had already |
| 21 | +paid for Hello Chinese, so I focused on that. |
| 22 | + |
| 23 | +Eventually, I finished all the lessons Hello Chinese had to offer, so I focused |
| 24 | +on reading. When you don't know a word, you have the option to save it for |
| 25 | +review... but this saves words into a separate review system from the lessons. |
| 26 | +Instead of the multiple-choice quiz and typed-response grammar review, you'd |
| 27 | +get flashcards. I really don't like flashcards. Lots of people love them, but I |
| 28 | +find the self-grading aspect annoying. What I disliked even more than the flashcards |
| 29 | +was the lack of the grammar review, where you'd see an English sentence and have to |
| 30 | +type it in Chinese. I'm pretty sure having _output_ as part of my daily review routine |
| 31 | +helped my retention a _lot_, especially when it comes to structuring |
| 32 | +sentences. After a while, I gave up on this app and decided to focus on |
| 33 | +Comprehensible Input. |
| 34 | + |
| 35 | +The term Comprehensible Input (CI) is used pretty frequently in language learning |
| 36 | +circles. It means input (audio, text, video) that you can _mostly_ understand. |
| 37 | +Some people say it's the "natural way that babies learn languages", which is |
| 38 | +probably mostly true. Your brain should soak up patterns over time, get used |
| 39 | +to recognizing words without thinking about it, and begin to naturally replicate |
| 40 | +pronunciation. I _don't_ agree with those who claim CI is the _only_ tool you need |
| 41 | +to reach fluency, especially when the target language is so completely different |
| 42 | +from your own native language. |
| 43 | + |
| 44 | +I can still remember the feeling of "Holy shit I understand exactly what he |
| 45 | +said!" the second time I tried listening to [Tea Time Chinese |
| 46 | +(茶歇中文)](https://teatimechinese.com/). My favorite CI resources for Chinese are: |
| 47 | + |
| 48 | +* [Tea Time Chinese (茶歇中文)](https://teatimechinese.com/) |
| 49 | +* [Lazy Chinese](https://www.lazychinese.com/) |
| 50 | +* [Hello Chinese (Graded readers)](https://www.hellochinese.cc/) |
| 51 | + |
| 52 | +Another form of Comprehensible Input is conversation practice with a native |
| 53 | +speaker. Websites like [italki](https://www.italki.com/) let you hire tutors |
| 54 | +from all over the world, for many different languages, at a reasonable price. |
| 55 | +Every teacher has a different approach, and I think I am very lucky to have |
| 56 | +found a teacher who just had conversations with me, sometimes with a topic in |
| 57 | +mind, sometimes just letting the conversation flow naturally. He would |
| 58 | +strategically introduce new words. When I was trying to say something too |
| 59 | +complex, he'd encourage me to express myself with the simpler language I was |
| 60 | +already comfortable with, and he'd use simple language to teach new words |
| 61 | +without using any English! |
| 62 | + |
| 63 | + |
| 64 | +Now, as I said before, CI is amazing, but until you're nearly fluent, I don't |
| 65 | +think it can be the only tool for learning. In fact, I recently started taking |
| 66 | +more structured online classes with a teacher from [GoEast |
| 67 | +Mandarin](https://goeastmandarin.com/) and they're going well. I'd consider the |
| 68 | +entire class period to be CI, since we use 90+% Chinese with just a bit of |
| 69 | +English when introducing new words. |
| 70 | + |
| 71 | +Doing a couple hours a week of classes and using a bit of Chinese at home with my |
| 72 | +spouse is _not_ going to get me from intermediate to fluent in a reasonable |
| 73 | +timeframe. Another tool besides CI that many language learning enthusiasts swear by |
| 74 | +is a Spaced Repetition System, or SRS. There are apps for this like [Anki](https://apps.ankiweb.net/), |
| 75 | +or you could simply organize physical flashcards systematically. The basic idea is |
| 76 | +that you do a daily review of whatever you're trying to memorize. When you are correct on the first try, |
| 77 | +you advance that item one level. Items at a higher level are reviewed less frequently, until you |
| 78 | +consider them "learned". |
| 79 | + |
| 80 | +I hated flashcards, and Anki is pretty much a flashcard app. What I wanted was something more akin |
| 81 | +to the Hello Chinese review system that was used with their in-app lessons. So I started building |
| 82 | +one myself and realized there were a lot of tools I could build myself on top of that basic SRS. |
| 83 | + |
| 84 | +## The app |
| 85 | + |
| 86 | +{{< gallery >}} |
| 87 | + <img src="pics/home.png" alt="homescreen of the app" style="height: 512px;"> |
| 88 | + <img src="pics/search.png" alt="search for words" style="height: 512px;"> |
| 89 | +{{< /gallery >}} |
| 90 | + |
| 91 | +What I have built so far has a few key features: |
| 92 | + |
| 93 | +* A dictionary interface built on top of the open-source [CC-CEDICT](https://www.mdbg.net/chinese/dictionary?page=cedict). |
| 94 | +* SRS for both Words and Sentences with stats and a streak-tracker. |
| 95 | +* Reader that provides a similar experience to DuChinese and Hello Chinese. |
| 96 | +* TTS all over the place. |
| 97 | + |
| 98 | +And I still have a lot to do: |
| 99 | +* A library of CI content to feed into the reader |
| 100 | +* Camera-based OCR to look up words you encounter IRL |
| 101 | +* Component/radical search for the dictionary |
| 102 | +* A floating widget that displays over other apps to look up words |
| 103 | + the way you would in my reader mode. |
| 104 | + |
| 105 | + |
| 106 | +### TTS |
| 107 | + |
| 108 | +Language learning is an audio-visual activity. With a language like Chinese, |
| 109 | +the written language is not directly tied to how a word or character sounds. |
| 110 | +Even with pinyin, the romanization system for Chinese pronunciation, it's important |
| 111 | +to actually hear the words or sentences you're studying. |
| 112 | + |
| 113 | +A unique challenge that pops up everywhere when building for Chinese is the ambiguity |
| 114 | +of the language. Word boundaries are not obvious. Sometimes a single character |
| 115 | +is a word, sometimes several characters are all one word. On top of this, a |
| 116 | +single character may be pronounced differently based on the surrounding |
| 117 | +context. |
| 118 | + |
| 119 | +If I sent the sentence “他们在那里呆了很长时间。” to a TTS system, there's a good chance the |
| 120 | +character 长 would be mispronounced. Is it 'cháng' or 'zhǎng'? Plenty of TTS |
| 121 | +services _do_ take context into account, but what if I'm looking at the two separate |
| 122 | +dictionary entries for 长, one per pronunciation, each with its own meaning? A lone character gives the service no context to work with. |
| 123 | + |
| 124 | +The only service I found that allows passing the pinyin along with the characters |
| 125 | +is Azure's TTS, whose SSML input has a `phoneme` element. |
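
To make the `phoneme` idea concrete, here is a minimal sketch of building such a request body. It assumes the SSML shape from Azure's documentation, where a zh-CN pronunciation is written as pinyin with tone numbers under `alphabet="sapi"`. The helper names (`escapeXml`, `buildSsml`) and the default voice are illustrative assumptions, not the app's actual code.

```dart
/// Escapes the characters that are special in XML text and attributes.
String escapeXml(String s) => s
    .replaceAll('&', '&amp;')
    .replaceAll('<', '&lt;')
    .replaceAll('>', '&gt;')
    .replaceAll('"', '&quot;');

/// Builds an SSML document that pins [text] to an explicit pronunciation.
///
/// [sapiPinyin] is assumed to be pinyin with tone numbers, for example
/// 'chang 2'. The voice name is only an example value.
String buildSsml(String text, String sapiPinyin,
    {String voice = 'zh-CN-XiaoxiaoNeural'}) {
  return '<speak version="1.0" '
      'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="zh-CN">'
      '<voice name="${escapeXml(voice)}">'
      '<phoneme alphabet="sapi" ph="${escapeXml(sapiPinyin)}">'
      '${escapeXml(text)}</phoneme>'
      '</voice></speak>';
}
```

With something like this, the two dictionary entries for 长 could each get their own audio: `buildSsml('长', 'chang 2')` for one and `buildSsml('长', 'zhang 3')` for the other.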
| 126 | + |
| 127 | +### Reader |
| 128 | + |
| 129 | +{{< gallery >}} |
| 130 | +<img src="pics/reader.png" alt="reader screen" class="grid-w50"> |
| 131 | +<img src="pics/reader-import.png" alt="reader screen" class="grid-w50"> |
| 132 | +{{< /gallery >}} |
| 133 | + |
| 134 | +Arguably the most critical tool besides the SRS is the reader. This is still
| 135 | +very much a work in progress. Currently you can import content from: |
| 136 | + |
| 137 | +* Text in your clipboard |
| 138 | +* URLs, with a best-effort to identify where the primary content is on a |
| 139 | + webpage. (There's special handling for the transcripts on |
| 140 | + https://teatimechinese.com!) |
| 141 | +* YouTube URLs! I have a custom backend that handles downloading the captions |
| 142 | + if they're available. (Lazy Chinese and Tea Time Chinese videos usually |
| 143 | + have them!) |
| 144 | + |
| 145 | +Major TODOs are: |
| 146 | + |
| 147 | +* Allow saving stories to be read later |
| 148 | +* An editor mode that allows writing and editing content directly. |
| 149 | +* A button that shows other potential words in the dictionary if my automated |
| 150 | + mappings are wrong. |
| 151 | +* Potentially, a curated library of content. I'm not sure how I'd source this |
| 152 | + legally and ethically. Unless... |
| 153 | +* Users or teachers could produce and share their own content, potentially |
| 154 | + with some kind of marketplace platform if it's not a free and open place to share. |
| 155 | + Content moderation is a real concern though. |
| 156 | + |
| 157 | + |
| 158 | +### Automated dictionary mappings |
| 159 | + |
| 160 | + |
| 161 | +As I mentioned in both the TTS and Reader sections, it can be difficult |
| 162 | +to tell which word and definition in the dictionary corresponds to a character |
| 163 | +in some text. There is so much ambiguity, and rule-based systems can only do |
| 164 | +an okay job of dealing with it. Some examples: |
| 165 | + |
| 166 | +> 我的学长流血了很长时间。 |
| 167 | +
|
| 168 | +It can be tricky to figure out whether 长 is 'cháng' or 'zhǎng'. |
| 169 | +The first instance is easy: 学长 is a word in the dictionary, so |
| 170 | +we can handle it by just looking for the longest sequence of characters that |
| 171 | +exists in the dictionary. |
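
The longest-sequence lookup can be sketched as a greedy forward maximum match. This is only an illustration: the function name, the `maxWordLength` cap and the tiny in-memory `Set` standing in for the dictionary are all assumptions, not the app's implementation.

```dart
/// Greedy forward maximum matching: at each position, take the longest
/// dictionary entry that starts there, falling back to a single character.
List<String> longestMatchSegment(String sentence, Set<String> dictionary,
    {int maxWordLength = 4}) {
  // Work on runes so characters outside the BMP are not split in half.
  final chars = sentence.runes.map((r) => String.fromCharCode(r)).toList();
  final words = <String>[];
  var i = 0;
  while (i < chars.length) {
    var matched = chars[i]; // fallback: a one-character word
    var end = i + 1;
    final limit =
        i + maxWordLength < chars.length ? i + maxWordLength : chars.length;
    // Try the longest candidate first and shrink until something matches.
    for (var j = limit; j > i + 1; j--) {
      final candidate = chars.sublist(i, j).join();
      if (dictionary.contains(candidate)) {
        matched = candidate;
        end = j;
        break;
      }
    }
    words.add(matched);
    i = end;
  }
  return words;
}
```

With a toy dictionary of `{'学长', '流血', '时间'}`, the sentence 我的学长流血了很长时间 comes out with 学长 kept together and the second 长 left on its own. The same greedy rule also glues 的 and 话 together whenever 的话 is in the dictionary, whether or not that is the right reading.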
| 172 | + |
| 173 | +> 如果你喜欢周杰伦的话,你应该听妈妈的话。 |
| 174 | +
|
| 175 | +In this case, 的话 has two possible meanings: |
| 176 | + |
| 177 | +1. 的话 - a conditional particle meaning "if the previous statement is true, then..." |
| 178 | +2. 的 - possessive particle; 话 - words |
| 179 | + |
| 180 | +The above sentence says "If you like Jay Chou, you should listen to your |
| 181 | +mother['s words]". (A reference to the song 《听妈妈的话》). Always taking the longest |
| 182 | +sequence gets the second 的话 wrong, because there 的 and 话 are separate words. This is a trivial |
| 183 | +example, but the general problem is still worth solving. |
| 184 | + |
| 185 | +To do this, I introduced a word segmentation model, from |
| 186 | +[ckiplab/ckip-transformers](https://github.com/ckiplab/ckip-transformers/tree/master). |
| 187 | +This model does a _decent_, but not perfect job at segmenting things. I ended up having |
| 188 | +to hand-tune the output confidences to bias it myself. |
| 189 | + |
| 190 | +```dart |
| 191 | +// each token gets two scores: B (it begins a new word) and |
| 192 | +// I (it is part of the same word as the preceding token) |
| 193 | +final bScore = tokenScores[i][0]; |
| 194 | +final iScore = tokenScores[i][1]; |
| 195 | +
|
| 196 | +// two-class softmax with a temperature: turns the raw scores into |
| 197 | +// probabilities, which are easier to interpret when tuning `threshold` |
| 198 | +final pB = exp(bScore / temperature) / |
| 199 | + (exp(bScore / temperature) + exp(iScore / temperature)); |
| 200 | +final pI = exp(iScore / temperature) / |
| 201 | + (exp(bScore / temperature) + exp(iScore / temperature)); |
| 202 | +final absDiff = (pB - pI).abs(); |
| 203 | +
|
| 204 | +// only include it if we're above `threshold` difference |
| 205 | +// in confidences for or against inclusion |
| 206 | +if (bScore > iScore && absDiff > threshold) { |
| 207 | + if (current.isNotEmpty) { |
| 208 | + tokens.add(current); |
| 209 | + } |
| 210 | + current = sentence[i - 1]; |
| 211 | +} else { |
| 212 | + current += sentence[i - 1]; |
| 213 | +} |
| 214 | +``` |
| 215 | + |
| 215 | +Now, this is getting closer, but there are _still_ a lot of cases that it gets wrong. |
| 216 | +Some of this is due to the segmentation model not being super accurate, and some of it |
| 217 | +is due to cases where segmentation doesn't help at all, because the ambiguous word is a single character either way. |
| 219 | + |
| 220 | +> 我长大了。 |
| 221 | +> 我等了很长时间。 (Yes, 很久 makes more sense, this is just an example.) |
| 222 | +
|
| 223 | +In both cases, 长 is a single character. The meaning changes along with the |
| 224 | +pronunciation, though. Unfortunately, neither ckip-transformers nor Hugging Face |
| 225 | +provides a solid model for mapping Chinese text to pinyin, so I decided |
| 226 | +to train my own. Using the ckip-transformers ALBERT models as a base, I |
| 227 | +trained my model on some datasets I found online. |
| 228 | + |
| 229 | +> Input: 我 不 知 道 这 是 不 是 爱 。 |
| 230 | +> |
| 231 | +> Expected Output (Pinyin): wǒ bù zhī dào zhè shì bù shì ài |
| 232 | +> |
| 233 | +> Predicted Output (Pinyin): wǒ bù zhī dào zhè shì bù shì fu |
| 234 | +
|
| 235 | +It took a while... and I had to learn a bit about modern ML techniques: |
| 236 | +how to tune the batch size, create a custom loss function, dynamically change the |
| 237 | +learning rate, and more. |
| 238 | + |
| 239 | +> Input: 我 不 知 道 这 是 不 是 爱 。 |
| 240 | +> |
| 241 | +> Expected Output (Pinyin): wǒ bù zhī dào zhè shì bù shì ài |
| 242 | +> |
| 243 | +> Predicted Output (Pinyin): wǒ bù zhī dào zhè shì bù shì ài |
| 244 | +
|
| 245 | +But eventually, the robot understood what 爱 (love) is. |
| 246 | + |
| 247 | +The final model has a character-level accuracy of **99.62%** and a |
| 248 | +sentence-level accuracy of **96.91%** (both of those numbers are when I |
| 249 | +ignore tone markers). |
| 250 | + |
| 251 | +I use the output of my pinyin model, the ckip-transformers wordseg model, and |
| 252 | +the length of each candidate word to generate scores for all the possible |
| 253 | +dictionary mappings inside a sentence, then pick the most confident prediction |
| 254 | +for the entire sentence. This isn't perfect, but I'm pretty sure it's better |
| 255 | +than anything else on the market. _And the whole thing runs with acceptable performance |
| 256 | +on my Pixel 8_. |
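
One way such a sentence-level choice could work is a small dynamic program over candidate dictionary entries. The sketch below is a guess at the general shape, not the app's real scoring: the `Candidate` class, the inputs `predictedPinyin` and `boundaryProb`, and the weights 1.0, 2.0 and 1.5 are all made up for illustration.

```dart
/// One dictionary entry proposed for characters [start, end) of a sentence.
class Candidate {
  final int start;
  final int end;
  final String pinyin; // space-separated syllables, e.g. 'de huà'
  const Candidate(this.start, this.end, this.pinyin);
}

/// Hypothetical scoring: the three weights below are arbitrary.
double scoreCandidate(
    Candidate c, List<String> predictedPinyin, List<double> boundaryProb) {
  final syllables = c.pinyin.split(' ');
  var pinyinMatches = 0;
  for (var k = 0; k < syllables.length && c.start + k < c.end; k++) {
    if (syllables[k] == predictedPinyin[c.start + k]) pinyinMatches++;
  }
  // Reward a predicted boundary at the word start, punish ones inside it.
  var segAgreement = boundaryProb[c.start];
  for (var k = c.start + 1; k < c.end; k++) {
    segAgreement -= boundaryProb[k];
  }
  // (length - 1) favours fewer, longer words; plain length would cancel out.
  final lengthBonus = (c.end - c.start - 1).toDouble();
  return 1.0 * lengthBonus + 2.0 * pinyinMatches + 1.5 * segAgreement;
}

/// Picks the non-overlapping candidates that cover all [length] characters
/// with the highest total score, using dynamic programming.
List<Candidate> bestMapping(
    int length, List<Candidate> candidates, double Function(Candidate) score) {
  // best[i] is the best total score for the first i characters.
  final best = List<double>.filled(length + 1, double.negativeInfinity);
  final back = List<Candidate?>.filled(length + 1, null);
  best[0] = 0;
  // Sorting by end ensures best[c.start] is final before it is read.
  final sorted = [...candidates]..sort((a, b) => a.end.compareTo(b.end));
  for (final c in sorted) {
    final total = best[c.start] + score(c);
    if (total > best[c.end]) {
      best[c.end] = total;
      back[c.end] = c;
    }
  }
  // Walk the back-pointers from the end of the sentence to the start.
  final result = <Candidate>[];
  for (var i = length; i > 0;) {
    final c = back[i];
    if (c == null) throw StateError('No candidate ends at position $i');
    result.add(c);
    i = c.start;
  }
  return result.reversed.toList();
}
```

Scoring whole-sentence coverage rather than each word in isolation is what lets 的话 lose to 的 + 话 when the segmentation signal says a word starts at 话, and win when it does not.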
| 257 | + |
| 258 | + |
| 259 | +### SRS |
| 260 | + |
| 261 | + |
| 262 | +<img src="pics/leitner.png" alt="leitner box diagram"> |
| 263 | +By <a href="//commons.wikimedia.org/wiki/User:Zirguezi" title="User:Zirguezi">Zirguezi</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="http://creativecommons.org/publicdomain/zero/1.0/deed.en" title="Creative Commons Zero, Public Domain Dedication">CC0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=20328125">Link</a> |
| 264 | + |
| 265 | + |
| 266 | +The SRS itself is a pretty basic [Leitner |
| 267 | +system](https://en.wikipedia.org/wiki/Leitner_system). There's actually very |
| 268 | +little interesting technical stuff going on here. The interface on top of it is |
| 269 | +what I find more useful than other systems, but it's still very, very simple. |
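
For readers who haven't met a Leitner system, a minimal sketch looks roughly like this. The box intervals, the class and function names, and the rule that a wrong answer sends an item back to the first box (the classic Leitner behaviour) are assumptions for illustration, not a description of the app's exact rules.

```dart
/// A single thing to review: a word or a sentence.
class ReviewItem {
  final String prompt;
  int box; // 0 is the first box, reviewed most often
  DateTime due;
  ReviewItem(this.prompt, {this.box = 0, required this.due});
}

/// Hypothetical review intervals in days, one entry per box.
const intervalsInDays = [1, 2, 4, 8, 16];

/// Moves [item] up one box on a correct answer, or back to the first box
/// on a wrong one, then schedules its next review.
void recordAnswer(ReviewItem item,
    {required bool correct, required DateTime now}) {
  if (correct) {
    if (item.box < intervalsInDays.length - 1) item.box++;
  } else {
    item.box = 0;
  }
  item.due = now.add(Duration(days: intervalsInDays[item.box]));
}

/// Everything whose due date is today or earlier.
List<ReviewItem> dueItems(List<ReviewItem> items, DateTime now) =>
    items.where((item) => !item.due.isAfter(now)).toList();
```

Items in higher boxes simply come due less often, which is all "spaced" really means here.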
| 270 | + |
| 271 | +{{< gallery >}} |
| 272 | + <img src="pics/sentence-write.png" alt="" class="grid-w33"> |
| 273 | + <img src="pics/sentence-gen.png" alt="" class="grid-w33"> |
| 274 | + <img src="pics/sentence-edit.png" alt="" class="grid-w33"> |
| 275 | +{{< /gallery >}} |
| 276 | + |
| 277 | +Items to review are added manually, either by searching in the dictionary |
| 278 | +or by typing out a sentence and its English translation. There is a feature to use AI |
| 279 | +to generate sentences of varying degrees of difficulty, but I'm not sure how authentic |
| 280 | +the output is at times. The automatic mappings can still fail, so there is a (WIP) UI |
| 281 | +that allows correcting the mapping's segmentation and picking the correct definition. |
| 282 | + |
| 283 | + |
| 284 | +{{< gallery >}} |
| 285 | + <img src="pics/quiz_mc_ji.png" alt="" class="grid-w33"> |
| 286 | + <img src="pics/quiz_mc_yanxi.png" alt="" class="grid-w33"> |
| 287 | +{{< /gallery >}} |
| 288 | + |
| 289 | +Multiple choice questions are the only alternative to flashcards. Going between |
| 290 | +English and Chinese in both directions has been helpful, in my experience. |
| 291 | + |
| 292 | +{{< gallery >}} |
| 293 | + <img src="pics/quiz_grammar.png" alt="" class="grid-w33"> |
| 294 | + <img src="pics/quiz_blank_guanjian.png" alt="" class="grid-w33"> |
| 295 | +{{< /gallery >}} |
| 296 | + |
| 297 | +Seeing words in context is also incredibly important, so when we have |
| 298 | +sentences that use a word, fill-in-the-blank questions are generated. |
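
Generating a fill-in-the-blank question can be as simple as the sketch below. `makeCloze` is a hypothetical helper, and it uses a naive substring search rather than the dictionary mappings, so it would blank out 的话 even in a sentence where 的 and 话 are separate words.

```dart
/// Replaces the first occurrence of [word] in [sentence] with a blank.
/// Returns null when the word is empty or the sentence does not contain it.
String? makeCloze(String sentence, String word, {String blank = '____'}) {
  if (word.isEmpty) return null;
  final index = sentence.indexOf(word);
  if (index < 0) return null;
  return sentence.replaceRange(index, index + word.length, blank);
}
```

For example, `makeCloze('我等了很长时间。', '时间')` yields `我等了很长____。`, and the hidden word becomes the expected answer.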
| 299 | + |
| 300 | +Saved sentences are fed directly into the sentence review. This activity |
| 301 | +is more time consuming than the multiple-choice word review, but output |
| 302 | +shouldn't be ignored as part of a review habit. |
| 303 | + |
| 304 | +{{< gallery >}} |
| 305 | + <img src="pics/tags_list.png" alt="tag on words" class="grid-w33"> |
| 306 | + <img src="pics/tags_review.png" alt="tag on words" class="grid-w33"> |
| 307 | +{{< /gallery >}} |
| 308 | + |
| 309 | +Any reviewable item can also be given custom tags. The use-case I had in mind |
| 310 | +was preparing for specific events. I play the Yu-Gi-Oh TCG, and I think it |
| 311 | +would be fun to one day play in a competition overseas in Taiwan or mainland |
| 312 | +China. Besides the daily review, you can also do a custom review and filter |
| 313 | +using these tags if you're studying a particular topic. |
| 314 | + |
| 315 | +## Conclusion |
| 316 | + |
| 317 | + |
| 318 | +{{< gallery >}} |
| 319 | +<img src="pics/stats.png" alt="tag on words" class="grid-w100"> |
| 320 | +{{< /gallery >}} |
| 321 | +The main idea of the app is "bring your own content". I've been dogfooding |
| 322 | +it for a little while now, and it has definitely sped up my vocab acquisition. |
| 323 | +I'm not sure whether the long term plan is to polish it and publish it as FOSS, |
| 324 | +or to make it a closed source side-hustle. |
| 325 | + |
| 326 | +In the meantime, I'll continue to iterate. It's a slow-going process, since I |
| 327 | +only have the nights and weekends when I'm not doing other hobbies like BJJ or |
| 328 | +Yu-Gi-Oh. I'm happy with the progress I've made so far. |