Description
Kalmyk parallel corpora may contain several Oirat translations that differ only slightly.
The task of this issue is a tool that compares such similar texts and combines the differences it finds.
The leftmost translation in the input parallel corpus is the Main one; all other translations are compared against it.
The differences may be:
- Removal: a word exists in the Main translation but is absent from the Twin translation (twin == non-main)
- Insertion: a word is absent from the Main translation but exists in the Twin translation
- Changing: "twins" from different translations may have minor differences. Twins are entire words
- Replacement: Removal(1)+Insertion(2) at once, either of unchanged words or of words with Changes(3)
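A minimal sketch of how Removals, Insertions and Replacements could be detected within one row, assuming Python and the standard `difflib.SequenceMatcher` run over whitespace-split words. The function name `classify_differences` and the whitespace tokenization are assumptions for illustration only; recognizing twins with Changes(3) inside a Replacement is not covered here.

```python
from difflib import SequenceMatcher

# Map difflib opcode tags to the difference kinds named in this issue.
KIND_BY_TAG = {"delete": "removal", "insert": "insertion", "replace": "replacement"}


def classify_differences(main_text, twin_text):
    """Compare one Twin piece of text with the Main one, word by word.

    Returns a list of (kind, main_words, twin_words) tuples.
    Tokenization by whitespace is an assumption of this sketch.
    """
    main_words = main_text.split()
    twin_words = twin_text.split()
    matcher = SequenceMatcher(a=main_words, b=twin_words, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical runs of words are not differences
        result.append((KIND_BY_TAG[tag], main_words[i1:i2], twin_words[j1:j2]))
    return result
```

For example, `classify_differences("a b c", "a x c")` yields a single replacement of `b` by `x`.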
Whole texts will be compared part by part. This requires that single sentences or parts of text be matched with each other in advance, manually or automatically. The prospective tool will therefore compare pieces of text within each row of the input corpus. Changes(3) will be combined with equal ones in other twins and listed together in the result report. As an option, the result report should be saved to an xlsx file.
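One possible way to combine equal Changes across twins, sketched in Python under some assumptions: a Change is identified by the pair (replaced letters, replacing letters) obtained after stripping the common prefix and suffix of a twin pair, and pairs where one side is empty (pure letter insertions or removals) are skipped. The names `changed_letters` and `combine_changes` are hypothetical, the sample words in the usage note are English placeholders rather than corpus data, and writing the groups to an xlsx file is left out of the sketch.

```python
from collections import defaultdict


def changed_letters(main_word, twin_word):
    """Strip the common prefix and suffix; return the differing middle parts."""
    limit = min(len(main_word), len(twin_word))
    prefix = 0
    while prefix < limit and main_word[prefix] == twin_word[prefix]:
        prefix += 1
    suffix = 0
    while (suffix < limit - prefix
           and main_word[-1 - suffix] == twin_word[-1 - suffix]):
        suffix += 1
    return (main_word[prefix:len(main_word) - suffix],
            twin_word[prefix:len(twin_word) - suffix])


def combine_changes(word_pairs):
    """Group (main_word, twin_word) pairs by the letters that were replaced."""
    groups = defaultdict(list)
    for main_word, twin_word in word_pairs:
        old, new = changed_letters(main_word, twin_word)
        if old and new:  # only letter replacements count as Changes
            groups[(old, new)].append((main_word, twin_word))
    return dict(groups)
```

With the placeholder pairs `("color", "colar")` and `("motor", "motar")`, both end up in the same group keyed by `("o", "a")`, which could become one row of the report.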
Some ideas for this tool:
In CorporaView, clicking the 'Compare' button will open a dialog window showing all translations from the input corpus. 'Translations' means all columns of the parallel corpus starting from the second one. Every column in that dialog window will have three colored buttons: Insertions, Removals and Replacements. For the Main translation these will be 'all Insertions', 'all Removals' and 'all Replacements' buttons instead. The buttons are toggles (Pressed/Unpressed) and can be pressed in any combination. An unpressed button is Gray; a pressed button is colored (Green, Red or Yellow respectively). The all-*** buttons of the Main translation will automatically press or unpress all the corresponding buttons of the Twin translations. Conversely, once, for example, all Green buttons of the Twin translations are pressed, the 'all Insertions' button of the Main translation will be pressed automatically.
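The button logic above can be sketched as a small state model, here in Python and independent of any GUI toolkit. The class name `CompareButtons` and its methods are hypothetical. One design choice assumed here: the state of an 'all ***' button is derived from the Twin buttons rather than stored separately, so the two directions of synchronization cannot drift apart.

```python
KINDS = ("insertions", "removals", "replacements")
COLORS = {"insertions": "green", "removals": "red", "replacements": "yellow"}


class CompareButtons:
    """Toggle state of the per-translation buttons in the Compare dialog."""

    def __init__(self, twin_names):
        # Every Twin translation starts with all three buttons unpressed.
        self.twins = {name: {kind: False for kind in KINDS} for name in twin_names}

    def all_pressed(self, kind):
        """State of the Main 'all <kind>' button: pressed if every Twin button is."""
        return bool(self.twins) and all(s[kind] for s in self.twins.values())

    def toggle_twin(self, name, kind):
        self.twins[name][kind] = not self.twins[name][kind]

    def toggle_all(self, kind):
        """Pressing or unpressing the Main button sets all Twin buttons of that kind."""
        new_state = not self.all_pressed(kind)
        for state in self.twins.values():
            state[kind] = new_state

    def color(self, name, kind):
        """Gray when unpressed, otherwise the color assigned to the kind."""
        return COLORS[kind] if self.twins[name][kind] else "gray"
```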
Pressing or unpressing these buttons shows or hides the highlighted words in the translations. The highlights have the corresponding color: Insertions - Green, Removals - Red, Replacements/Changes - Yellow (or Orange).
Changes(3) means the replacement of one or several letters within one word (between its twins). These letters must be adjacent in the word and make up the smaller part of it; otherwise the words are not twins and will be reported as a Removal and an Insertion. Words are twins if the Jaro-Winkler distance between them (1 minus the similarity) is less than 0.25. Removals/Insertions of separate letters within twins will not be reported; only letter Replacements are considered Changes(3) and will be combined. Changes(3) will be combined in the xlsx file but not in the dialog window.
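A sketch of the twin test in Python, assuming that the "Jaro-Winkler value" above is the distance, that is 1 minus the Jaro-Winkler similarity (a similarity below 0.25 would mean very dissimilar words). The similarity is implemented in its common form with a prefix scale of 0.1 and a prefix of at most 4 characters; the function names and the `threshold` parameter are assumptions of this sketch.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a common prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def are_twins(main_word, twin_word, threshold=0.25):
    """Twins if the Jaro-Winkler distance (1 - similarity) is below the threshold."""
    return 1 - jaro_winkler(main_word, twin_word) < threshold
```

The check that the changed letters are adjacent and form the smaller part of the word would be a separate step on top of this.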
When the mouse cursor hovers over any of these highlights, the related highlights stay shown and the other highlights are hidden. A tooltip with some information about the highlight appears in this case. After the cursor moves away, all the existing (enabled) highlights are shown again.