
香港圍頭話及客家話文字轉語音
Hong Kong Waitau & Hakka Text-to-Speech

輸入文字,聆聽圍頭話、客家話發音,傳承本土語言。
Hear Waitau and Hakka from your words. Keep the languages alive.

本儲存庫包含香港圍頭話及客家話文字轉語音朗讀器前端部分之原始碼。
This repository contains the source code of the front-end part of the Hong Kong Waitau & Hakka Text-to-Speech reader.

本程式由香港本土語言保育協會開發及提供。
This application is developed and made available by the Association for Conservation of Hong Kong Indigenous Languages (HKILANG).

簡介 Introduction

圍頭話客家話皆是香港的非物質文化遺產,然而這些本土語言傳承因城市化出現了斷層。圍村新一代接觸圍頭話、客家話的機會甚少,或「曉聽唔曉講」。
Waitau and Hakka are both recognised as intangible cultural heritage in Hong Kong. However, urbanisation has been disrupting the transmission of these indigenous languages. Younger generations in walled villages are rarely exposed to Waitau and Hakka, and many are what the elders call “曉聽唔曉講” — able to understand, but unable to speak.

使用本文字轉語音朗讀器,可以作為學習圍頭話、客家話的資源,亦可以成為與圍村長輩溝通的工具,延續家庭和社區的語言傳承。
This text-to-speech reader serves not only as a resource for learning Waitau and Hakka, but also as a communication tool for engaging with the elderly in walled villages, helping to preserve the linguistic heritage within families and communities.

Development

This app is a static single-page application (SPA) built with TypeScript, React, Tailwind CSS and daisyUI.

To ship it as native Android and iOS applications, it is first made a progressive web application (PWA) powered by the Vite PWA plugin, then packaged with PWABuilder. For iOS, the PWABuilder output is further compiled with Xcode.
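For reference, a minimal Vite configuration enabling the PWA plugin might look like the sketch below. This is only an illustration, not the repository's actual configuration; the real config may use different plugins and options.

```ts
// vite.config.ts — a minimal sketch, not the repository's actual configuration.
import { defineConfig } from "vite";
import react from "@vitejs/plugin-react"; // assumption: any React plugin works here
import { VitePWA } from "vite-plugin-pwa";

export default defineConfig({
  plugins: [
    react(),
    VitePWA({
      registerType: "autoUpdate", // refresh the service worker when a new build is deployed
      manifest: false,            // reuse a hand-written public/site.webmanifest instead of generating one
    }),
  ],
});
```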

Files Overview

  • public/
    • assets/: contains pre-generated icons and screenshots of different sizes for use as a PWA
    • site.webmanifest: web application manifest for use as a PWA
  • src/
    • db/: contains database initialisation & manipulation logic that downloads and saves model & audio data into an IndexedDB for offline usage.
    • res/: contains both raw and processed Waitau & Hakka pronunciation data of Chinese characters and words, as well as the compilation script. See the Data Preprocessing section below.
    • inference/: contains the code of a Web Worker for offline model inference as well as the API of the worker. A brief description of infer.ts is given in the Models & Inference section below.
    • index.tsx is the entry point of the app.
    • index.css is a Tailwind CSS stylesheet containing repeatedly used styles that are not suitable to be inlined.
    • App.tsx contains the outermost React component.
    • The remaining files contain the definitions of other components, hooks, types and utility functions.

Technical Overview of How the App Works

The app first segments the input into characters and non-characters (punctuation and symbols) in src/parse.ts. Characters are then converted into pronunciations using pronunciation data loaded into a trie data structure in src/Resource.ts. The input and the conversion result are then displayed as a SentenceCard component (src/SentenceCard.ts). The user can choose the desired pronunciation inside the card, and audio is generated by feeding the chosen pronunciation to the text-to-speech model as input.
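As an illustration, a pronunciation trie supporting greedy longest-match lookup could look like the sketch below. The entry shape and method names are hypothetical; the actual structure in src/Resource.ts may differ.

```ts
// A minimal pronunciation trie sketch. PronEntry and the method names are
// hypothetical, not the real shapes used in src/Resource.ts.
interface PronEntry {
  waitau?: string[]; // candidate Waitau pronunciations
  hakka?: string[];  // candidate Hakka pronunciations
}

class PronTrie {
  private children = new Map<string, PronTrie>();
  private entry?: PronEntry;

  insert(word: string, entry: PronEntry): void {
    let node: PronTrie = this;
    for (const char of word) {
      let next = node.children.get(char);
      if (!next) {
        next = new PronTrie();
        node.children.set(char, next);
      }
      node = next;
    }
    node.entry = entry;
  }

  // Greedy longest match starting at chars[start]; falls back to length 1 so
  // that unknown characters still advance the cursor during segmentation.
  longestMatch(chars: string[], start: number): { length: number; entry?: PronEntry } {
    let node: PronTrie = this;
    let length = 1;
    let entry: PronEntry | undefined;
    for (let i = start; i < chars.length; i++) {
      const next = node.children.get(chars[i]);
      if (!next) break;
      node = next;
      if (node.entry) {
        length = i - start + 1;
        entry = node.entry;
      }
    }
    return { length, entry };
  }
}
```

Segmentation can then walk Array.from(input) and repeatedly advance by the longest match found at the current position.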

Pronunciation Data

Developers can safely ignore the actual content of the src/res/ folder and only need to keep the following in mind:

  • The app only loads and should only load the processed outputs, chars.csv, waitau_words.csv and hakka_words.csv. Among them:
    • chars.csv contains four columns:
      • char: the Chinese character, consisting of a single Unicode code point. This column is not unique and may repeat if the character is a polyphone (has multiple pronunciations).
      • waitau, hakka: the Waitau and Hakka pronunciations of the character in HKILANG's own romanisation scheme, if any. You can refer to the website for the details of the romanisation scheme, but in the code do not make any further assumption about the format beyond /^[a-zäöüæ]+[1-6]$/ for Waitau and /^[a-z]+[1-6]$/ for Hakka.
      • notes: further explanation/clarification/disambiguation displayed underneath the character, if any
    • waitau_words.csv and hakka_words.csv each contains two columns:
      • char: the Chinese characters of the word, consisting of two or more Unicode code points. Again, this column is not unique and may repeat if the word can be pronounced in multiple ways.
      • pron: the Waitau or Hakka pronunciation of the word, with the same number of syllables as characters in the char column. Adjacent syllables are separated by an ASCII (ordinary) space.
  • The raw data, dictionary.csv, WaitauWords.csv, HakkaWords.csv and public.csv, as well as compile.py, should never be referenced in the code.

The outputs are precompiled and managed as part of the Git repository, so you need not generate them manually unless you have modified the inputs or the compilation script described below.
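For instance, a loader that reads chars.csv and enforces the format constraints above could be sketched as follows. The function name is illustrative rather than the app's actual loader, the header row is an assumption, and the comma split is naive (no quoted-field handling).

```ts
// Sketch of reading chars.csv and validating the romanisation format.
// Naive comma split; a real CSV parser should handle quoted fields.
const WAITAU_PRON = /^[a-zäöüæ]+[1-6]$/;
const HAKKA_PRON = /^[a-z]+[1-6]$/;

interface CharRow {
  char: string;
  waitau?: string;
  hakka?: string;
  notes?: string;
}

function parseCharsCsv(csv: string): CharRow[] {
  const [, ...lines] = csv.trim().split("\n"); // assumption: first line is a header row
  return lines.map(line => {
    const [char, waitau, hakka, notes] = line.split(",");
    if (waitau && !WAITAU_PRON.test(waitau)) throw new Error(`Unexpected Waitau syllable: ${waitau}`);
    if (hakka && !HAKKA_PRON.test(hakka)) throw new Error(`Unexpected Hakka syllable: ${hakka}`);
    return { char, waitau: waitau || undefined, hakka: hakka || undefined, notes: notes || undefined };
  });
}
```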

Data Preprocessing

Note

This section is intended for dictionary maintainers; developers of the app itself need not read it. See the section above for a description of the compilation outputs.

src/res/compile.py gathers pronunciation data from the following input sources:

  • dictionary.csv, WaitauWords.csv, HakkaWords.csv: Pronunciation data from HKILANG's dictionary, surveyed and collected from villages in Hong Kong in an earlier project. These are the core sources.
  • public.csv: Lexicon table from the TypeDuck Cantonese keyboard, which supplements relatively uncommon words in order to facilitate automatic choice of pronunciation for polyphones (characters with multiple pronunciations). This is done inside the generate function in compile.py by looking up the Waitau/Hakka equivalents of the Cantonese pronunciations in the dictionary.csv table, after the Jyutping has been converted into HKILANG's romanisation scheme by the rom_map function. Only entries with a frequency of at least 10 that include at least one polyphone in the target language are included.

In addition to the words from WaitauWords.csv, HakkaWords.csv and public.csv, words are also extracted from collocations in the note column of dictionary.csv.

The compilation script cleanses and normalises the inputs, derives extra words and writes the results to the three files described in the section above. All monosyllabic results, whether linguistically words or not, go into chars.csv, while the polysyllabic results are written to waitau_words.csv and hakka_words.csv.
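As a rough conceptual illustration of the public.csv supplement step (written in TypeScript here, although the actual logic lives in compile.py, in Python), the filtering could be sketched as below. Every name, type and helper function in this sketch is illustrative and does not mirror the script's real structure.

```ts
// Conceptual sketch only: the real implementation is the generate function in
// src/res/compile.py. Names, types and the lookup helpers are illustrative.
interface PublicEntry {
  word: string;       // the Chinese word
  jyutping: string[]; // one Jyutping syllable per character
  freq: number;       // frequency from the TypeDuck lexicon
}

function supplementWords(
  entries: PublicEntry[],
  isPolyphone: (char: string) => boolean,                 // polyphone in the target language
  lookUpEquivalent: (char: string, cantoneseRom: string) => string | undefined, // via dictionary.csv
  romMap: (jyutpingSyllable: string) => string,           // stands in for rom_map
): Map<string, string> {
  const result = new Map<string, string>();
  for (const { word, jyutping, freq } of entries) {
    const chars = [...word];
    // Keep only frequent entries that contain at least one polyphone.
    if (freq < 10 || !chars.some(isPolyphone)) continue;
    const syllables = chars.map((char, i) => lookUpEquivalent(char, romMap(jyutping[i])));
    if (syllables.every((s): s is string => s !== undefined)) {
      result.set(word, syllables.join(" "));
    }
  }
  return result;
}
```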

Audio Generation

The app provides 3 different modes for generating audio from the input:

  1. Online inference: The app requests (fetches) audio from the backend at the following URL:

    https://Chaak2.pythonanywhere.com/TTS/${language}/${text}?voice=${voice}&speed=${speed}

    where the parameters are:

    • ${language}, which must be one of waitau or hakka;
    • ${text}, which is the romanised input text, with syllables separated by spaces (%20) or +. There are 7 available punctuation marks: ., ,, !, ?, , ' and -. Separators are required both before and after punctuation marks. Percent-encoding is not mandatory except for the punctuation mark ? (%3F).
      Currently, only transliterations in HKILANG's own romanisation system are accepted. Chinese characters are not yet supported by the API, so you will need to first convert Chinese text into pronunciations using this app's interface.
    • ${voice}, which may be either male or female (optional, defaults to male); and
    • ${speed}, which may be any number between 0.5 and 2 (optional, defaults to 1).

    ${ and } indicate a parameter and should not be included as part of the URL.

    The backend is deployed as a PythonAnywhere instance, and its code is open-sourced at github.com/hkilang/TTS-API, a dead-code-eliminated reduction of Bert-VITS2. The pre-trained PyTorch machine-learning models used for inference are published on the release page. A sketch of a request to this endpoint is given at the end of this section.

  2. Offline inference: The app performs inference within itself in a Web Worker, src/inference/worker.ts, using the same machine-learning models as online inference, but exported to the ONNX format and available in the github.com/hkilang/TTS-models repo. Each model consists of several components, some of which are split into smaller chunks due to size limitations. In the app, each model is downloaded and stored in an IndexedDB upon user request; the user must download the desired model manually before audio generation.

    The infer method in src/inference/infer.ts resembles the SynthesizerTrn.infer() method in models.py in the TTS-API repo. In this method, each model component is loaded from the IndexedDB and its weights are released immediately after use, to avoid out-of-memory errors on low-end devices with limited memory. A custom class, NDArray, performs the mathematical computations on the intermediate results inferred by the model components.

  3. Lightweight mode: The app concatenates the pre-generated audio files available in the github.com/hkilang/TTS-audios repo. These files are created as follows: for each character in the dictionary, an audio file is generated using the same model as offline inference. The generated files are then concatenated into a single file, chars.bin, and the corresponding pronunciation and start offset of each audio file are saved into an offset table, chars.csv. The same is done for each word in the dictionary, producing words.bin (split into smaller chunks due to size limitations) and words.csv. These four files form an audio pack.

    In the app, each audio pack is downloaded and stored in an IndexedDB upon user request; the user must download the desired audio pack manually before audio generation. During audio generation, the app loads the audio components from the IndexedDB, uses the offset tables to locate the byte range of each phrase, slices those segments out of the audio components and decodes them. The decoded audio segments are then concatenated in order into a single audio buffer for playback.

    Although generation in this mode is fast thanks to its computational simplicity, it is discouraged because of the poor quality of the results it produces; it is intended only as a last resort for extremely low-end devices without an Internet connection, where even offline inference fails.
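For reference, a request to the online-inference endpoint (mode 1 above) might look like the following sketch. The function name is illustrative, and the assumption that the response body is an audio file directly decodable by the Web Audio API is not confirmed by the API documentation.

```ts
// Sketch of an online-inference request, following the URL format described
// in mode 1 above. Assumes the endpoint returns a decodable audio file.
async function fetchOnlineTTS(
  language: "waitau" | "hakka",
  syllables: string[],          // romanised syllables, e.g. "ngo5 dei6"-style tokens
  voice: "male" | "female" = "male",
  speed = 1,
): Promise<AudioBuffer> {
  // Join syllables with "+"; encodeURIComponent turns "?" into "%3F" as required.
  const text = syllables.map(encodeURIComponent).join("+");
  const url = `https://Chaak2.pythonanywhere.com/TTS/${language}/${text}?voice=${voice}&speed=${speed}`;
  const response = await fetch(url);
  if (!response.ok) throw new Error(`TTS request failed: ${response.status}`);
  // Decode the returned audio so it can be played through the Web Audio API.
  const audioContext = new AudioContext();
  return audioContext.decodeAudioData(await response.arrayBuffer());
}
```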