Type text and listen to its Waitau and Hakka pronunciations, passing on these local languages. Hear Waitau and Hakka from your words. Keep the languages alive.
This app is a static single-page application (SPA) built with TypeScript, React, Tailwind CSS and daisyUI.
To convert it into native Android and iOS applications, it is first made a progressive web application (PWA) powered by the Vite PWA plugin, then packaged with PWABuilder. For iOS, the output from PWABuilder is further compiled with Xcode.
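The PWA layer is wired up through the Vite PWA plugin (`vite-plugin-pwa`). The snippet below is only a minimal sketch of how such a configuration typically looks; the options shown and the use of `@vitejs/plugin-react` are assumptions, not taken from this repository's actual `vite.config.ts`.

```ts
// vite.config.ts — minimal illustrative sketch, not the repo's actual configuration
import { defineConfig } from "vite";
import react from "@vitejs/plugin-react"; // assumption: the standard React plugin is used
import { VitePWA } from "vite-plugin-pwa";

export default defineConfig({
  plugins: [
    react(),
    VitePWA({
      registerType: "autoUpdate", // assumption: update the service worker silently
      manifest: false,            // assumption: reuse the hand-written site.webmanifest
    }),
  ],
});
```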
- `public/assets/`: contains pre-generated icons and screenshots of different sizes for use as a PWA
- `site.webmanifest`: the web application manifest for use as a PWA
- `src/db/`: contains database initialisation & manipulation logic that downloads and saves model & audio data into an IndexedDB for offline usage.
- `src/res/`: contains both raw and processed Waitau & Hakka pronunciation data of Chinese characters and words, as well as the compilation script. See the Data Preprocessing section below.
- `src/inference/`: contains the code of a Web Worker for offline model inference, as well as the API of the worker. A brief description of `infer.ts` is given in the Models & Inference section below.
- `src/index.tsx` is the entry point of the app.
- `src/index.css` is a Tailwind CSS stylesheet containing repeatedly used styles that are not suitable to be inlined.
- `src/App.tsx` contains the outermost React component.
- The remaining files contain the definitions of other components, hooks, types and utility functions.
The app first segments the input into characters and non-characters (punctuation and symbols) in `src/parse.ts`. Characters are then converted into pronunciations using pronunciation data loaded into a trie data structure in `src/Resource.ts`. The input and the conversion result are then displayed as a `SentenceCard` component (`src/SentenceCard.ts`). The user can choose the desired pronunciation inside the card, and the audio is generated by feeding that pronunciation as the input to the text-to-speech model.
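To make the lookup step concrete, the sketch below shows one way a pronunciation trie with greedy longest-match segmentation could be structured. It is not the actual code in `src/Resource.ts`; the `PronunciationTrie` class and its method names are hypothetical.

```ts
// Hypothetical sketch of a pronunciation trie with greedy longest-match lookup.
// The real src/Resource.ts may organise its data differently.
interface TrieNode {
  children: Map<string, TrieNode>;
  prons: string[]; // pronunciations of the entry ending at this node, if any
}

class PronunciationTrie {
  private root: TrieNode = { children: new Map(), prons: [] };

  insert(word: string, pron: string): void {
    let node = this.root;
    for (const char of word) {
      let next = node.children.get(char);
      if (!next) {
        next = { children: new Map(), prons: [] };
        node.children.set(char, next);
      }
      node = next;
    }
    node.prons.push(pron);
  }

  // Longest dictionary entry starting at chars[start], or undefined if none matches.
  longestMatch(chars: string[], start: number): { length: number; prons: string[] } | undefined {
    let node: TrieNode = this.root;
    let best: { length: number; prons: string[] } | undefined;
    for (let i = start; i < chars.length; i++) {
      const next = node.children.get(chars[i]);
      if (!next) break;
      node = next;
      if (node.prons.length) best = { length: i - start + 1, prons: node.prons };
    }
    return best;
  }
}

// Usage: walk the input greedily, falling back to single characters when no word matches.
// const chars = [...input];                  // split into Unicode codepoints
// const match = trie.longestMatch(chars, 0);
```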
Developers can safely ignore the actual content of the `src/res/` folder and only need to keep the following in mind:
- The app only loads, and should only load, the processed outputs `chars.csv`, `waitau_words.csv` and `hakka_words.csv` (see the sketch after this list). Among them:
  - `chars.csv` contains four columns:
    - `char`: the Chinese character, consisting of a single Unicode codepoint. This column is not unique and may repeat if the character is a polyphone (has multiple pronunciations).
    - `waitau`, `hakka`: the Waitau and Hakka pronunciations of the character in HKILANG's own romanisation scheme, if any. You can refer to the website for the details of the romanisation scheme, but do not make further assumptions about the format in the code beyond `/^[a-zäöüæ]+[1-6]$/` for Waitau and `/^[a-z]+[1-6]$/` for Hakka.
    - `notes`: further explanation/clarification/disambiguation displayed underneath the character, if any.
  - `waitau_words.csv` and `hakka_words.csv` each contain two columns:
    - `char`: the Chinese characters, consisting of two or more Unicode codepoints. Again, this column is not unique and may repeat if the word can be pronounced in multiple ways.
    - `pron`: the Waitau or Hakka pronunciation of the word, with the same number of syllables as there are characters in the `char` column. Adjacent syllables are separated by an ASCII (ordinary) whitespace.
- The raw data, `dictionary.csv`, `WaitauWords.csv`, `HakkaWords.csv` and `public.csv`, as well as `compile.py`, should never be referenced in the code.
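As a quick illustration of how code might consume these files, here is a minimal sketch, assuming `chars.csv` has a header row and needs no quoted fields; the `CharRow` type and the `parseCharsCsv` helper are hypothetical, but the syllable regexes are the ones stated above.

```ts
// Hypothetical reader for the processed chars.csv; column names follow the description above.
// A real implementation would use a proper CSV parser that handles quoting.
interface CharRow {
  char: string;    // a single Unicode codepoint; rows repeat for polyphones
  waitau?: string; // Waitau pronunciation in HKILANG romanisation, if any
  hakka?: string;  // Hakka pronunciation in HKILANG romanisation, if any
  notes?: string;  // extra explanation shown underneath the character, if any
}

// The only format assumptions the code should make, as stated above:
const WAITAU_SYLLABLE = /^[a-zäöüæ]+[1-6]$/;
const HAKKA_SYLLABLE = /^[a-z]+[1-6]$/;

function parseCharsCsv(text: string): CharRow[] {
  const [, ...lines] = text.trim().split("\n"); // assumes a header row
  return lines.map(line => {
    const [char, waitau, hakka, notes] = line.split(",");
    if (waitau && !WAITAU_SYLLABLE.test(waitau)) throw new Error(`Unexpected Waitau syllable: ${waitau}`);
    if (hakka && !HAKKA_SYLLABLE.test(hakka)) throw new Error(`Unexpected Hakka syllable: ${hakka}`);
    return { char, waitau: waitau || undefined, hakka: hakka || undefined, notes: notes || undefined };
  });
}
```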
> [!NOTE]
> The outputs are precompiled and managed as part of the Git repository, so you need not generate them manually unless you modify the inputs or the compilation script described below.
`src/res/compile.py` gathers pronunciation data from the following sources as its inputs:

- `dictionary.csv`, `WaitauWords.csv`, `HakkaWords.csv`: pronunciation data from HKILANG's dictionary, surveyed and collected from villages in Hong Kong in an earlier project. These are the core sources.
- `public.csv`: the lexicon table from the TypeDuck Cantonese keyboard, which supplements relatively uncommon words in order to facilitate the automatic choice of pronunciation for polyphones (characters with multiple pronunciations). This is done inside the `generate` function in `compile.py` by looking up the Waitau/Hakka equivalent of the Cantonese pronunciation in the `dictionary.csv` table, after the Jyutping is converted into HKILANG's romanisation scheme by the `rom_map` function. Only entries with frequencies ≥ 10 that include at least one polyphone in the target language are included.
In addition to the words from `WaitauWords.csv`, `HakkaWords.csv` and `public.csv`, words are also extracted from collocations in the note column of `dictionary.csv`.
The compilation script cleanses and normalises the inputs, computes the extra words and writes the results to the three files described in the section above. All monosyllabic results, whether linguistically a word or not, are included in `chars.csv`, while the polysyllabic results are written to `waitau_words.csv` and `hakka_words.csv`. The supplementation step is sketched below.
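For illustration only, here is a rough TypeScript re-expression of the polyphone-supplementation idea described above (loosely mirroring the `generate` function). The actual script is `compile.py`, written in Python; the data shapes and helper names below are assumptions, not its real interface.

```ts
// Illustrative TypeScript re-expression of the polyphone-supplementation step in compile.py.
// The real script is Python; these data shapes and helper names are hypothetical.
type Language = "waitau" | "hakka";

interface PublicEntry {
  chars: string;      // the word in Chinese characters
  jyutping: string[]; // one Jyutping syllable per character
  frequency: number;
}

// dictionary.csv reduced to: character -> known romanised pronunciations per language
type Dictionary = Map<string, Record<Language, string[]>>;

const isPolyphone = (dict: Dictionary, char: string, language: Language): boolean =>
  (dict.get(char)?.[language].length ?? 0) > 1;

// Keep only frequent entries containing at least one polyphone, and accept an entry
// only if every syllable's converted Jyutping matches a known dictionary pronunciation.
function supplementWords(
  entries: PublicEntry[],
  dict: Dictionary,
  language: Language,
  romMap: (jyutping: string, language: Language) => string, // Jyutping -> HKILANG romanisation
): Map<string, string> {
  const words = new Map<string, string>();
  for (const entry of entries) {
    if (entry.frequency < 10) continue;
    const chars = [...entry.chars];
    if (!chars.some(char => isPolyphone(dict, char, language))) continue;
    const syllables = chars.map((char, i) => {
      const target = romMap(entry.jyutping[i], language);
      return dict.get(char)?.[language].includes(target) ? target : undefined;
    });
    if (syllables.every((s): s is string => s !== undefined)) {
      words.set(entry.chars, syllables.join(" "));
    }
  }
  return words;
}
```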
The app provides 3 different modes for generating audio from the input:
- Online inference: The app requests (`fetch`es) audio from the backend at the following URL:

  `https://Chaak2.pythonanywhere.com/TTS/${language}/${text}?voice=${voice}&speed=${speed}`

  where the parameters are:

  - `${language}`, which must be one of `waitau` or `hakka`;
  - `${text}`, which is the romanised text input, separated by spaces (`%20`) or `+`. There are 7 available punctuation marks: `.`, `,`, `!`, `?`, `…`, `'` and `-`. Separators are required both before and after punctuation marks. Percent-encoding is not mandatory except for the punctuation `?` (`%3F`). Currently, only transliterations in HKILANG's own romanisation system are accepted; Chinese characters are not yet supported by the API, so you will need to first convert Chinese text into pronunciation using this app's interface;
  - `${voice}`, which may be one of `male` and `female` (optional, defaults to `male`); and
  - `${speed}`, which may be any number between 0.5 and 2 (optional, defaults to 1).

  `${` and `}` indicate a parameter and should not be included as part of the URL. An example request is sketched after this list. The backend is deployed as a PythonAnywhere instance and its code is open-sourced at github.com/hkilang/TTS-API, a dead-code-eliminated reduction of Bert-VITS2. The pre-trained PyTorch machine learning models used for inference are published on its release page.
- Offline inference: The app performs inference within itself in the Web Worker, `src/inference/worker.ts`, using the same machine learning models as online inference, but exported to the ONNX format and available in the github.com/hkilang/TTS-models repo. Each model consists of several components, some of which are split into smaller chunks due to size limitations. In the app, each model is downloaded and stored into an IndexedDB per user request. The user must download the desired model manually before audio generation.

  The `infer` method in `src/inference/infer.ts` resembles the `SynthesizerTrn.infer()` method in `models.py` in the TTS-API repo. In the method, each model component is loaded from the IndexedDB, and its weights are released immediately after use to avoid out-of-memory errors on low-end devices with limited memory. A custom class, `NDArray`, is written for performing mathematical computations on the intermediate results inferred by the model components. A rough sketch of this per-component pattern is given after this list.
- Lightweight mode: The app concatenates the pre-generated audio files available in the github.com/hkilang/TTS-audios repo. These files are created as follows: for each character in the dictionary, an audio file is generated using the same model as offline inference. The generated files are then concatenated into a single file, `chars.bin`, and the corresponding pronunciation and start offset of each audio file are saved into an offset table, `chars.csv`. The same is done for each word in the dictionary, producing the files `words.bin` (split into smaller chunks due to size limitations) and `words.csv`. These 4 files form an audio pack.

  In the app, each audio pack is downloaded and stored into an IndexedDB per user request. The user must download the desired audio pack manually before audio generation. During audio generation, the app loads the audio components from the IndexedDB, uses the offset tables to locate the byte range of each phrase, slices those segments from the audio components and decodes them. All the decoded audio segments are then concatenated in order into a single audio buffer for playback. A sketch of this slice-and-decode step is given after this list.

  Although generation in this mode is fast thanks to its computational simplicity, its use is discouraged due to the poor quality of the results; it is intended only as a last resort for extremely low-end devices without an Internet connection, when even offline inference fails.
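For reference, a request to the online inference endpoint might look like the sketch below. Only the URL structure and parameters follow the description above; the function name is hypothetical and the syllables in the usage comment are placeholders, not a real Waitau or Hakka phrase.

```ts
// Sketch of calling the online TTS endpoint described above.
async function fetchTtsAudio(
  language: "waitau" | "hakka",
  syllables: string[],               // romanised syllables and punctuation marks, in order
  voice: "male" | "female" = "male",
  speed = 1,
): Promise<Blob> {
  // Join syllables with "+" and percent-encode "?" (the only mark that requires it).
  const text = syllables.join("+").replace(/\?/g, "%3F");
  const url = `https://Chaak2.pythonanywhere.com/TTS/${language}/${text}?voice=${voice}&speed=${speed}`;
  const response = await fetch(url);
  if (!response.ok) throw new Error(`TTS request failed: ${response.status}`);
  return response.blob();
}

// Usage with placeholder syllables:
// const audio = await fetchTtsAudio("waitau", ["syl1", "syl2", ",", "syl3"]);
```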
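The offline path can be pictured roughly as below, using onnxruntime-web. The component name, feeds and the IndexedDB loader are hypothetical; only the pattern of loading one component at a time and releasing it after use follows the description above.

```ts
// Rough sketch of running one model component at a time with onnxruntime-web.
// Component names, feeds and the IndexedDB helper are hypothetical.
import * as ort from "onnxruntime-web";

type LoadComponent = (name: string) => Promise<Uint8Array>; // e.g. reads stored chunks from IndexedDB

async function runComponent(
  loadComponent: LoadComponent,
  name: string,
  feeds: Record<string, ort.Tensor>,
) {
  const session = await ort.InferenceSession.create(await loadComponent(name));
  try {
    return await session.run(feeds);
  } finally {
    // Release the weights immediately so low-memory devices do not run out of memory.
    await session.release();
  }
}

// The full pipeline chains components, passing intermediate tensors along
// (the real infer() in src/inference/infer.ts mirrors SynthesizerTrn.infer() in TTS-API).
// const enc = await runComponent(loadComponentFromDb, "encoder", { phonemes });
```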
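Finally, here is a rough sketch of the lightweight-mode slice-and-decode step, assuming mono audio. The offset-table shape and the function name are assumptions for illustration; only the overall flow (look up byte ranges, slice the pack, decode, then concatenate) follows the description above.

```ts
// Illustrative slice-and-decode for lightweight mode; data shapes are hypothetical.
// `offsets` maps a pronunciation to its [start, end) byte range inside the audio pack.
async function synthesiseFromAudioPack(
  pack: ArrayBuffer,                       // e.g. chars.bin loaded from IndexedDB
  offsets: Map<string, [number, number]>,  // parsed from the offset table (e.g. chars.csv)
  prons: string[],                         // pronunciations of the phrase, in order
  ctx: AudioContext,
): Promise<AudioBuffer> {
  // Decode each referenced segment individually.
  const segments: AudioBuffer[] = [];
  for (const pron of prons) {
    const range = offsets.get(pron);
    if (!range) throw new Error(`No pre-generated audio for ${pron}`);
    const [start, end] = range;
    segments.push(await ctx.decodeAudioData(pack.slice(start, end)));
  }
  // Concatenate the decoded segments (assumed mono) into a single buffer for playback.
  const length = segments.reduce((sum, s) => sum + s.length, 0);
  const output = ctx.createBuffer(1, length, segments[0].sampleRate);
  let offset = 0;
  for (const segment of segments) {
    output.getChannelData(0).set(segment.getChannelData(0), offset);
    offset += segment.length;
  }
  return output;
}
```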