Issue with Slovenian characters (č/š/ž) causing word splitting
1. I'm encountering an issue when your tool crawls/ingests Slovenian text. Words containing č, š, ž are being split into separate tokens, which breaks the meaning and impacts our results.
I’m attaching a screenshot showing the problem with this sentence:
“Ližem čokolado in štejem žoge za košarko.”
The translated meaning of this sentence would be:
"I'm licking chocolate and counting balls for basketball."
Taking just the last word as an example: košarko (basketball) is one single word, but InfraNodus presents it as 3 closely connected concepts (ko, š, arko).
Screenshot: https://prnt.sc/UCkqJRWYVjVw
Could you please advise whether the tool can be configured to automatically transliterate these characters during crawling/normalization, for example:
- č → c
- š → s
- ž → z
(Option names might be: ASCII folding, diacritic stripping, transliteration, Unicode normalization.)
If this isn’t currently supported, can you suggest a recommended workaround or add this as an ingest-time preprocessing step?
The thing is that our language uses these characters a lot, and they are present on websites and in SERP.
If I manually transliterate these characters, the tool gets it right: https://prnt.sc/jI93qtyEXLng
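As a possible workaround, the manual replacement above could be done as a preprocessing step before the text reaches the tool. This is only a minimal Python sketch, assuming the text can be cleaned before it is imported; `fold_slovenian` and `strip_diacritics` are hypothetical helper names for illustration, not InfraNodus options.

```python
import unicodedata

# Explicit mapping for the three Slovenian letters, lower and upper case.
SLOVENIAN_MAP = str.maketrans({
    "č": "c", "š": "s", "ž": "z",
    "Č": "C", "Š": "S", "Ž": "Z",
})


def fold_slovenian(text: str) -> str:
    """Replace č/š/ž with c/s/z and leave all other characters untouched."""
    # Compose first, so "c" + combining caron is treated the same as "č".
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(SLOVENIAN_MAP)


def strip_diacritics(text: str) -> str:
    """Generic fallback: decompose (NFD) and drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_slovenian("Ližem čokolado in štejem žoge za košarko."))
# prints: Lizem cokolado in stejem zoge za kosarko.
```

The explicit mapping only touches the three letters listed above, while `strip_diacritics` removes accents from any language, which may be more than is wanted for mixed-language pages.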
2. I was also wondering whether I could auto-translate everything into a supported language (for example, on a SERP crawl using the tool's features), and how I could integrate DeepL or some other translator. Is there an example video?
Thank you.
Official comment
Hello, it's not really possible to adjust transliteration, but you could use the approach described here to avoid lemmatization: https://support.noduslabs.com/hc/en-us/articles/5218618256658-In-Your-Own-Language-Lemmatization-and-Stopwords-Removal