Commit ccc0720

chinese app
1 parent 96c770f commit ccc0720

27 files changed

+362
-25
lines changed

config/_default/languages.en.toml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
title = 'Dirt is Food'
1+
title = "Steven's blog"
22

33
[author]
44
name = "Steven Landow"

content/_index.md

Lines changed: 6 additions & 2 deletions
@@ -1,6 +1,10 @@
11
---
22
---
33

4-
{{< list limit=100 cardView=true title="Technical Posts" where="Type" value="tech">}}
54

6-
{{< list limit=100 cardView=true title="Art" where="Type" value="art">}}
5+
### This is where I post about my side projects so I can feel like I actually shipped something.
6+
7+
8+
{{< list limit=100 cardView=true title="Posts" where="Type" value="tech">}}
9+
10+
{{< list limit=100 cardView=true title="Doodles" where="Type" value="art">}}

content/art/_index.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
11
---
2-
title: Art
2+
title: Doodles
33
menu: main
44
---

content/tech/_index.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
11
---
2-
title: Techincal Posts
2+
title: Posts
33
menu: main
44
---
55

content/tech/buoyancy/index.md

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
11
---
22
title: Faking Buoyancy
33
type: tech
4+
weight: 2
45
---
56

67
{{<video "demo.webm" >}}

content/tech/chinese_app/_index.md

Lines changed: 328 additions & 0 deletions
@@ -0,0 +1,328 @@
1+
2+
---
3+
title: "Building my own app for learning Chinese"
4+
type: tech
5+
weight: 6
6+
---
7+
8+
[Just want to see the app?](#the-app)
9+
10+
## My learning journey so far
11+
12+
I've been studying Mandarin Chinese for a few years now, using a few pretty
13+
great apps. I started out just learning ultra-basic phrases from my spouse, who
14+
bought me Pimsleur to listen to in the car; that was my foundation. The next
15+
app I used, which got me all the way through the basics, was [Hello
16+
Chinese](https://www.hellochinese.cc/), easily the best app I've used for
17+
learning. Besides gamified lessons and reviews, it included a reader with a
18+
solid library of short texts. When reading, if you didn't know a word, you could
19+
tap on it to hear it and see the English translation. There are other apps for
20+
these graded readers, like [DuChinese](https://duchinese.net/), but I already
21+
paid for Hello Chinese so I focused on that.
22+
23+
Eventually, I finished all the lessons Hello Chinese had to offer, so I focused
24+
on reading. When you don't know a word, you have the option to save it for
25+
review... but this saves words into a separate review system from the lessons.
26+
Instead of the multiple choice quiz and typed-response grammar review, you'd
27+
get flashcards. I really don't like flashcards. Lots of people love them, but I
28+
find the self-grading aspect annoying. What I disliked more than flashcards was
29+
the lack of the grammar review, where you'd see an English sentence and have to type
30+
it in Chinese. I'm pretty sure _output_ being part of my daily review routine
31+
helped my retention a _lot_, especially when it comes to how to structure
32+
sentences. After a while, I gave up on this app and decided to focus on
33+
Comprehensible Input.
34+
35+
The term Comprehensible Input is used pretty frequently in language learning
36+
circles. This is input (audio, text, video) that you can _mostly_ understand.
37+
Some people say it's the "natural way that babies learn languages" which is
38+
probably mostly true. Your brain should soak up patterns over time, get used
39+
to recognizing words without thinking about it, and begin to naturally replicate
40+
pronunciation. I _don't_ agree with those who claim CI is the _only_ tool you need
41+
to reach fluency. Especially when the target language is so completely different
42+
from your own native language.
43+
44+
I can still remember the feeling of "Holy shit I understand exactly what he
45+
said!" the second time I tried listening to [Tea Time Chinese
46+
(茶歇中文)](https://teatimechinese.com/). My favorite CI resources for Chinese are:
47+
48+
* [Tea Time Chinese (茶歇中文)](https://teatimechinese.com/)
49+
* [Lazy Chinese](https://www.lazychinese.com/)
50+
* [Hello Chinese (Graded readers)](https://www.hellochinese.cc/)
51+
52+
Another form of Comprehensible Input is conversation practice with a native
53+
speaker. Websites like [Italki](https://www.italki.com/) allow you to pay a
54+
reasonable price for tutors from all over the world for different languages.
55+
Every teacher has a different approach, and I think I am very lucky to have
56+
found a teacher who just had conversations with me, sometimes with a topic in
57+
mind, sometimes just letting the conversation naturally flow. He would
58+
strategically introduce new words. When I was trying to say something too
59+
complex, he'd encourage me to use the simpler language that I was already
60+
comfortable with to express myself, and he'd use simple language to teach words
61+
without using any English!
62+
63+
64+
Now, as I said before, CI is amazing but until you're nearly fluent, I don't
65+
think it can be the only tool for learning. In fact, I recently started taking
66+
more structured online classes with a teacher from [GoEast
67+
Mandarin](https://goeastmandarin.com/) and they're going well. I'd consider the
68+
entire class period to be CI since we use 90+% Chinese, but with a bit of
69+
English when introducing new words.
70+
71+
Doing a couple hours a week of classes and using a bit of Chinese at home with my
72+
spouse is _not_ going to get me from intermediate to fluent in a reasonable
73+
timeframe. Another tool besides CI that many language learning enthusiasts swear by
74+
is a Spaced Repetition System or SRS. There are apps for this like [Anki](https://apps.ankiweb.net/)
75+
or you could simply systematically organize physical flashcards. The basic idea is
76+
that you review whatever you're trying to memorize daily. When you are correct on the first try,
77+
you advance that item one level. Items in a higher level are reviewed less frequently until you
78+
consider them "learned".
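
A minimal sketch of that level-based scheduling, not the app's actual code. The class and function names, the interval table, and the rule that a miss sends an item back to the first level are all assumptions made for illustration.

```dart
/// A reviewable item in a level-based SRS.
class ReviewItem {
  ReviewItem(this.prompt, {this.level = 0, DateTime? due})
      : due = due ?? DateTime.utc(1970);

  final String prompt;
  int level;
  DateTime due;
}

/// Hypothetical review intervals in days, indexed by level.
const intervalsInDays = [1, 2, 4, 8, 16];

/// An item that has advanced past the last level counts as "learned".
bool isLearned(ReviewItem item) => item.level >= intervalsInDays.length;

/// Applies one review result to [item] and schedules its next review.
void recordReview(
  ReviewItem item, {
  required bool correct,
  required DateTime now,
}) {
  if (correct) {
    item.level += 1;
  } else {
    // Assumption: a miss sends the item back to the first level.
    item.level = 0;
  }
  if (!isLearned(item)) {
    item.due = now.add(Duration(days: intervalsInDays[item.level]));
  }
}

/// Items that belong in the review session happening at [now].
List<ReviewItem> dueItems(Iterable<ReviewItem> items, DateTime now) =>
    items.where((item) => !isLearned(item) && !item.due.isAfter(now)).toList();
```

With this table, an item answered correctly on its first review moves to level 1 and comes back two days later.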
79+
80+
I hated flashcards, and Anki is pretty much a flashcard app. What I wanted was something more akin
81+
to the Hello Chinese review system that was used with their in-app lessons. So I started building
82+
one myself and realized there were a lot of tools I could build myself on top of that basic SRS.
83+
84+
## The app
85+
86+
{{< gallery >}}
87+
<img src="pics/home.png" alt="homescreen of the app" style="height: 512px;">
88+
<img src="pics/search.png" alt="search for words" style="height: 512px;">
89+
{{< /gallery >}}
90+
91+
What I have built so far has a few key features:
92+
93+
* A dictionary interface built on top of the open-source [CC-CEDICT](https://www.mdbg.net/chinese/dictionary?page=cedict).
94+
* SRS for both Words and Sentences with stats and a streak-tracker.
95+
* Reader that provides a similar experience to DuChinese and Hello Chinese.
96+
* TTS all over the place.
97+
98+
And I still have a lot to do:
99+
* A library of CI content to feed into the reader
100+
* Camera-based OCR to look up words you encounter IRL
101+
* Component/radical search for the dictionary
102+
* A floating widget that displays over other apps to look up words
103+
the way you would in my reader mode.
104+
105+
106+
### TTS
107+
108+
Language learning is an audio-visual activity. With a language like Chinese,
109+
the written language is not directly tied to how a word or character sounds.
110+
Even with pinyin, the romanization system for Chinese pronunciation, it's important
111+
to actually hear the words or sentences you're studying.
112+
113+
A unique challenge that pops up everywhere when building for Chinese is the ambiguity
114+
of the language. Word boundaries are not obvious. Sometimes a single character
115+
is a word, sometimes several characters are all one word. On top of this, a
116+
single character may be pronounced differently based on the surrounding
117+
context.
118+
119+
If I sent the sentence “他们在那里呆了很长时间。”, the character 长 would likely
120+
be mispronounced by a lot of TTS systems. Is it 'cháng' or 'zhǎng'? Plenty of TTS
121+
services _do_ take context into account, but what if I'm looking at the two separate
122+
dictionary entries for 长's pronunciations, each of which has its own meaning?
123+
124+
The only service I found that allows passing the pinyin along with the characters
125+
is Azure's TTS, which has a `phoneme` element.
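
For illustration, here is a minimal sketch of building SSML that pins a reading with `phoneme`. The `alphabet` and `ph` attributes come from the SSML spec. The `sapi` alphabet value, the numbered-pinyin format (`zhang 3`), and the voice name are assumptions to check against Azure's documentation, and the helper names are made up for this example.

```dart
/// Escapes the characters that are special in XML text and attributes.
String escapeXml(String text) => text
    .replaceAll('&', '&amp;')
    .replaceAll('<', '&lt;')
    .replaceAll('>', '&gt;')
    .replaceAll('"', '&quot;');

/// Wraps [word] in an SSML `phoneme` element carrying an explicit reading.
/// The default alphabet is an assumption, not a confirmed Azure setting.
String phoneme(String word, String reading, {String alphabet = 'sapi'}) =>
    '<phoneme alphabet="${escapeXml(alphabet)}" ph="${escapeXml(reading)}">'
    '${escapeXml(word)}</phoneme>';

/// Builds a complete SSML document around already-escaped [body] markup.
/// The default voice name is an assumption for this sketch.
String buildSsml(String body, {String voice = 'zh-CN-XiaoxiaoNeural'}) =>
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
    'xml:lang="zh-CN"><voice name="${escapeXml(voice)}">$body</voice></speak>';
```

Calling `buildSsml(phoneme('长', 'zhang 3'))` would then request the 'zhǎng' reading for a standalone dictionary entry, where there is no surrounding context for the TTS service to use.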
126+
127+
### Reader
128+
129+
{{< gallery >}}
130+
<img src="pics/reader.png" alt="reader screen" class="grid-w50">
131+
<img src="pics/reader-import.png" alt="reader import screen" class="grid-w50">
132+
{{< /gallery >}}
133+
134+
Arguably the most critical tool besides the SRS is the reader. This is still
135+
quite a work in progress. Currently you can import content from:
136+
137+
* Text in your clipboard
138+
* URLs, with a best-effort to identify where the primary content is on a
139+
webpage. (There's special handling for the transcripts on
140+
https://teatimechinese.com!)
141+
* YouTube URLs! I have a custom backend that handles downloading the captions
142+
if they're available. (Both Lazy Chinese and Tea Time Chinese videos usually
143+
have them!)
144+
145+
Major TODOs are:
146+
147+
* Allow saving stories to be read later
148+
* An editor mode that allows writing and editing content directly.
149+
* A button that shows other potential words in the dictionary if my automated
150+
mappings are wrong.
151+
* Potentially, a curated library of content. I'm not sure how I'd source this
152+
legally and ethically. Unless...
153+
* Users or teachers could produce and share their own content, potentially
154+
with some kind of marketplace platform if it's not a free and open place to share.
155+
Content moderation is a real concern though.
156+
157+
158+
### Automated dictionary mappings
159+
160+
161+
As I mentioned in both the TTS and Reader sections, it can be difficult
162+
to tell which word and definition in the dictionary corresponds to a character
163+
in some text. There is so much ambiguity, and rule-based systems can only do
164+
an okay job dealing with it. Some examples:
165+
166+
> 我的学长流血了很长时间。
167+
168+
It can be tricky to figure out whether 长 is 'cháng' or 'zhǎng'.
169+
The first instance is easy. 学长 is a word in the dictionary, so
170+
we can deal with that by just looking for the longest sequence that
171+
exists in the dictionary.
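
A rough sketch of that longest-sequence lookup, assuming the dictionary is available as a plain set of headwords. The function name, the `maxWordLength` cap, and the tiny word list in the usage note are made up for illustration.

```dart
import 'dart:math';

/// Greedy forward longest-match segmentation against a set of headwords.
List<String> longestMatchSegment(
  String sentence,
  Set<String> dictionary, {
  int maxWordLength = 4,
}) {
  // Work on runes so characters outside the BMP are not split in half.
  final chars =
      sentence.runes.map((rune) => String.fromCharCode(rune)).toList();
  final words = <String>[];
  var start = 0;
  while (start < chars.length) {
    // A single character is the fallback when nothing longer matches.
    var end = start + 1;
    final limit = min(chars.length, start + maxWordLength);
    // Try the longest candidate first and shrink until one is found.
    for (var candidateEnd = limit; candidateEnd > start + 1; candidateEnd--) {
      if (dictionary.contains(chars.sublist(start, candidateEnd).join())) {
        end = candidateEnd;
        break;
      }
    }
    words.add(chars.sublist(start, end).join());
    start = end;
  }
  return words;
}
```

Given a word list containing 学长, 流血, 时间 and 长时间, the sentence above comes out as 我 / 的 / 学长 / 流血 / 了 / 很 / 长时间 / 。, so both instances of 长 end up inside a multi-character word.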
172+
173+
> 如果你喜欢周杰伦的话,你应该听妈妈的话。
174+
175+
In this case, 的话 has two possible meanings:
176+
177+
1. 的话 - a conditional particle meaning "if the previous statement is true, then..."
178+
2. 的 - possessive particle; 话 - words
179+
180+
The above sentence says "If you like Jay Chou, you should listen to your
181+
mother['s words]". (A reference to the song 《听妈妈的话》). Using the longest
182+
sequence would always be wrong in the case where 的 and 话 are split. This is a trivial
183+
example, but the general problem is still worth solving.
184+
185+
To do this, I introduced a word segmentation model, from
186+
[ckiplab/ckip-transformers](https://github.com/ckiplab/ckip-transformers/tree/master).
187+
This model does a _decent_, but not perfect, job at segmenting things. I ended up having
188+
to hand-tune the output confidences to bias it myself.
189+
190+
```dart
191+
// each token gets a score on whether it should be part of
192+
// the same word as the preceding token
193+
final bScore = tokenScores[i][0];
194+
final iScore = tokenScores[i][1];
195+
196+
// do some funny math to figure out _how_ confident the model is
197+
// this makes the numbers a bit more interpretable for tuning threshold
198+
final pB = exp(bScore / temperature) /
199+
(exp(bScore / temperature) + exp(iScore / temperature));
200+
final pI = exp(iScore / temperature) /
201+
(exp(bScore / temperature) + exp(iScore / temperature));
202+
final absDiff = (pB - pI).abs();
203+
204+
// only include it if we're above `threshold` difference
205+
// in confidences for or against inclusion
206+
if (bScore > iScore && absDiff > threshold) {
207+
  if (current.isNotEmpty) {
208+
    tokens.add(current);
209+
  }
210+
  current = sentence[i - 1];
211+
} else {
212+
  current += sentence[i - 1];
213+
}
214+
```
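
A side note on the "funny math": a two-class softmax with temperature only depends on the difference between the two scores, so the same probabilities can be computed with a single `exp` call. The sketch below is an equivalent reformulation rather than the app's code, and the function names are hypothetical. Working from the difference also avoids the `infinity / infinity` case that shows up when a raw score is very large relative to `temperature`.

```dart
import 'dart:math';

/// Probability that a token starts a new word (the "B" class).
/// Algebraically equal to exp(b/T) / (exp(b/T) + exp(i/T)).
double startProbability(double bScore, double iScore, double temperature) =>
    1 / (1 + exp((iScore - bScore) / temperature));

/// |pB - pI| for the same softmax. Since pI = 1 - pB, this is |2 * pB - 1|.
double confidenceGap(double bScore, double iScore, double temperature) =>
    (2 * startProbability(bScore, iScore, temperature) - 1).abs();
```

Raising `temperature` pulls the gap toward 0 for the same pair of scores, which is what makes the threshold easier to tune by hand.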
215+
216+
Now, this is getting closer, but there are _still_ a lot of cases that it gets wrong.
217+
Some of this is due to the segmentation model not being super accurate, and some of it
218+
is due to cases where the word segmentation doesn't even make a difference.
219+
220+
> 我长大了。
221+
> 我等了很长时间。 (Yes, 很久 makes more sense, this is just an example.)
222+
223+
In both cases, 长 is a single character. The meaning changes along with the
224+
pronunciation, though. Unfortunately, neither ckip-transformers nor Hugging Face
225+
provides a solid model for doing Chinese text to pinyin mappings. So I decided
226+
to train my own model. Using the ckip-transformers ALBERT models as a base, I
227+
trained my model using some datasets I found online.
228+
229+
> Input: 我 不 知 道 这 是 不 是 爱 。
230+
>
231+
> Expected Output (Pinyin): wǒ bù zhī dào zhè shì bù shì ài
232+
>
233+
> Predicted Output (Pinyin): wǒ bù zhī dào zhè shì bù shì fu
234+
235+
It took a while... and I had to learn a bit about modern ML techniques,
236+
how to tune batch size, create a custom loss function, dynamically change the
237+
learning rate and more.
238+
239+
> Input: 我 不 知 道 这 是 不 是 爱 。
240+
>
241+
> Expected Output (Pinyin): wǒ bù zhī dào zhè shì bù shì ài
242+
>
243+
> Predicted Output (Pinyin): wǒ bù zhī dào zhè shì bù shì ài
244+
245+
But eventually, the robot understands what 爱 (love) is.
246+
247+
The final model has a character-level accuracy of **99.62%** and a
248+
sentence-level accuracy of **96.91%** (both of those numbers are when I
249+
ignore tone markers).
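
To make the two metrics concrete, here is one way they could be computed. This is my reading of "ignoring tone markers" (strip the tone diacritics before comparing), not the actual evaluation script. All names are hypothetical, and the sketch assumes precomposed tone-marked vowels and one prediction list per expected sentence.

```dart
/// Maps each precomposed tone-marked vowel to its bare form.
const toneless = {
  'ā': 'a', 'á': 'a', 'ǎ': 'a', 'à': 'a',
  'ē': 'e', 'é': 'e', 'ě': 'e', 'è': 'e',
  'ī': 'i', 'í': 'i', 'ǐ': 'i', 'ì': 'i',
  'ō': 'o', 'ó': 'o', 'ǒ': 'o', 'ò': 'o',
  'ū': 'u', 'ú': 'u', 'ǔ': 'u', 'ù': 'u',
  'ǖ': 'ü', 'ǘ': 'ü', 'ǚ': 'ü', 'ǜ': 'ü',
};

/// Removes tone diacritics from a single pinyin syllable.
String stripTones(String syllable) =>
    syllable.split('').map((ch) => toneless[ch] ?? ch).join();

class Accuracy {
  const Accuracy(this.characterLevel, this.sentenceLevel);
  final double characterLevel;
  final double sentenceLevel;
}

/// Compares predictions to references, one syllable list per sentence.
Accuracy score(List<List<String>> expected, List<List<String>> predicted) {
  var syllables = 0, correctSyllables = 0, correctSentences = 0;
  for (var s = 0; s < expected.length; s++) {
    var allCorrect = expected[s].length == predicted[s].length;
    for (var i = 0; i < expected[s].length; i++) {
      syllables++;
      final match = i < predicted[s].length &&
          stripTones(expected[s][i]) == stripTones(predicted[s][i]);
      if (match) {
        correctSyllables++;
      } else {
        allCorrect = false;
      }
    }
    // A sentence only counts when every syllable in it matched.
    if (allCorrect) correctSentences++;
  }
  return Accuracy(
    correctSyllables / syllables,
    correctSentences / expected.length,
  );
}
```

A single wrong syllable, like the early `fu` for 爱, costs one character but the whole sentence, which is why the sentence-level number is always the lower of the two.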
250+
251+
I use the output of my pinyin model, the ckip-transformers wordseg model and
252+
the length of a candidate word to generate scores for all the possible
253+
dictionary mappings inside a sentence to determine my most confident prediction
254+
for the entire sentence. This isn't perfect, but I'm pretty sure it's better
255+
than anything else on the market. _And the whole thing runs with acceptable performance
256+
on my Pixel 8_.
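
How the three signals get turned into one score is beyond what I've described here, so the sketch below takes a combined score per candidate as given and only shows the "best prediction for the entire sentence" step. It is a generic dynamic-programming approach with hypothetical names (`Candidate`, `bestMapping`) and made-up scores, not necessarily what the app does.

```dart
/// One possible dictionary mapping for the characters in [start, end).
class Candidate {
  const Candidate(this.start, this.end, this.entry, this.score);

  final int start;
  final int end; // exclusive, must be greater than start
  final String entry; // which dictionary entry this span maps to
  final double score; // hypothetical combined score, higher is better
}

/// Picks the non-overlapping candidates that cover the whole sentence
/// with the highest total score, or null when no full cover exists.
List<Candidate>? bestMapping(int sentenceLength, List<Candidate> candidates) {
  // best[i] is the best total score for the first i characters.
  final best =
      List<double>.filled(sentenceLength + 1, double.negativeInfinity);
  final chosen = List<Candidate?>.filled(sentenceLength + 1, null);
  best[0] = 0;

  // Visiting candidates by end position means best[start] is already
  // final by the time a candidate is considered.
  final ordered = [...candidates]..sort((a, b) => a.end.compareTo(b.end));
  for (final candidate in ordered) {
    final total = best[candidate.start] + candidate.score;
    if (total > best[candidate.end]) {
      best[candidate.end] = total;
      chosen[candidate.end] = candidate;
    }
  }

  if (best[sentenceLength] == double.negativeInfinity) return null;

  // Walk backwards from the end of the sentence to recover the path.
  final path = <Candidate>[];
  for (var i = sentenceLength; i > 0; i = chosen[i]!.start) {
    path.add(chosen[i]!);
  }
  return path.reversed.toList();
}
```

For 我长大了, candidates for 长 (cháng), 长 (zhǎng) and 长大 all compete for the same characters, and the highest-scoring full cover of the sentence wins.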
257+
258+
259+
### SRS
260+
261+
262+
<img src="pics/leitner.png" alt="leitner box diagram">
263+
By <a href="//commons.wikimedia.org/wiki/User:Zirguezi" title="User:Zirguezi">Zirguezi</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="http://creativecommons.org/publicdomain/zero/1.0/deed.en" title="Creative Commons Zero, Public Domain Dedication">CC0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=20328125">Link</a>
264+
265+
266+
The SRS itself is a pretty basic [Leitner
267+
system](https://en.wikipedia.org/wiki/Leitner_system). There's actually very
268+
little interesting technical stuff going on here. The interface on top of it is
269+
what I find more useful than others' systems; but it's still very, very simple.
270+
271+
{{< gallery >}}
272+
<img src="pics/sentence-write.png" alt="" class="grid-w33">
273+
<img src="pics/sentence-gen.png" alt="" class="grid-w33">
274+
<img src="pics/sentence-edit.png" alt="" class="grid-w33">
275+
{{< /gallery >}}
276+
277+
Items to review are added manually, by searching in the dictionary
278+
or typing out a sentence and its English translation. There is a feature to use AI
279+
to generate sentences of varying degrees of difficulty, but I'm not sure how authentic
280+
the output is at times. The automatic mappings can still fail, so there is a (WIP) UI
281+
that allows correcting the mapping's segmentation and picking the correct definition.
282+
283+
284+
{{< gallery >}}
285+
<img src="pics/quiz_mc_ji.png" alt="" class="grid-w33">
286+
<img src="pics/quiz_mc_yanxi.png" alt="" class="grid-w33">
287+
{{< /gallery >}}
288+
289+
Multiple-choice questions are my main alternative to flashcards. Going between
290+
English and Chinese in both directions has been helpful, in my experience.
291+
292+
{{< gallery >}}
293+
<img src="pics/quiz_grammar.png" alt="" class="grid-w33">
294+
<img src="pics/quiz_blank_guanjian.png" alt="" class="grid-w33">
295+
{{< /gallery >}}
296+
297+
Seeing words in context is also incredibly important, so when we have
298+
sentences that use a word, fill-in-the-blank questions are generated.
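
At its simplest, generating such a question is a string replacement. A minimal sketch, with a hypothetical function name and blank marker, that only blanks the first occurrence of the word:

```dart
/// Turns [sentence] into a fill-in-the-blank prompt for [word].
/// Returns null when the sentence does not contain the word.
String? makeBlank(String sentence, String word, {String blank = '____'}) {
  final index = sentence.indexOf(word);
  if (index < 0) return null;
  return sentence.replaceRange(index, index + word.length, blank);
}
```

For example, `makeBlank('我等了很长时间。', '时间')` yields `我等了很长____。`, with 时间 as the expected answer.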
299+
300+
Saved sentences are fed directly into the sentence review. This activity
301+
is more time consuming than the multiple-choice word review, but output
302+
shouldn't be ignored as part of a review habit.
303+
304+
{{< gallery >}}
305+
<img src="pics/tags_list.png" alt="tag on words" class="grid-w33">
306+
<img src="pics/tags_review.png" alt="tag on words" class="grid-w33">
307+
{{< /gallery >}}
308+
309+
Any reviewable item also can be given custom tags. The use-case I had in mind
310+
was preparing for specific events. I play the Yu-Gi-Oh TCG, and I think it
311+
would be fun to one day play in a competition overseas in Taiwan or mainland
312+
China. Besides the daily review, you can also do a custom review and filter
313+
using these tags if you're studying a particular topic.
314+
315+
## Conclusion
316+
317+
318+
{{< gallery >}}
319+
<img src="pics/stats.png" alt="stats screen" class="grid-w100">
320+
{{< /gallery >}}
321+
The main idea of the app is "bring your own content". I've been dogfooding
322+
it for a little while now, and it has definitely sped up my vocab acquisition.
323+
I'm not sure whether the long term plan is to polish it and publish it as FOSS,
324+
or to make it a closed source side-hustle.
325+
326+
In the meantime, I'll continue to iterate. It's a slow-going process and I
327+
just have the nights and weekends that I'm not doing other hobbies like BJJ or
328+
Yu-Gi-Oh. I'm happy with the progress I've made so far.