Issue with Slovenian characters (č/š/ž) causing word splitting

1. I'm encountering an issue when your tool crawls/ingests Slovenian text. Words containing č, š, ž are being split into separate tokens, which breaks the meaning and impacts our results.

I’m attaching a screenshot showing the problem with this sentence:

“Ližem čokolado in štejem žoge za košarko.”

The translated meaning of this sentence would be:
"I'm licking chocolate and counting balls for basketball."

If we take just the last word, košarko (basketball), InfraNodus presents 3 closely connected concepts (ko, š, arko) from a single word.
Screenshot: https://prnt.sc/UCkqJRWYVjVw

Could you please advise whether the tool can be configured to automatically transliterate these characters during crawling/normalization, for example:

č → c
š → s
ž → z

(Option names might be: ASCII folding, diacritic stripping, transliteration, Unicode normalization.)

If this isn’t currently supported, can you suggest a recommended workaround or add this as an ingest-time preprocessing step?

The thing is that our language uses these characters a lot, and they are present on websites and in SERP.

If I manually transliterate these characters, the tool gets it right: https://prnt.sc/jI93qtyEXLng
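
For reference, the manual replacement above can be sketched in Python as a preprocessing step applied to the text before it is given to the tool. This is only a minimal sketch of generic diacritic stripping, not an InfraNodus feature: the function name fold_diacritics is hypothetical, and it relies solely on Python's standard unicodedata module.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Strip combining marks so that č/š/ž become c/s/z (hypothetical helper)."""
    # NFD splits a precomposed letter such as "č" into "c" + combining caron.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


sentence = "Ližem čokolado in štejem žoge za košarko."
print(fold_diacritics(sentence))
# prints: Lizem cokolado in stejem zoge za kosarko.
```

Letters that have no Unicode decomposition (for example đ) pass through unchanged, so they would still need an explicit replacement table.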

2. I was also wondering whether I could auto-translate everything into a supported language (for example, during a SERP crawl using the tool's features), and how I could integrate DeepL or some other translator. Is there an example video?

Thank you.

Official comment

Dmitry Paranyushkin

December 13, 2025 12:56

Hello, it's not really possible to adjust transliteration, but you could use the approach described here to avoid lemmatization: https://support.noduslabs.com/hc/en-us/articles/5218618256658-In-Your-Own-Language-Lemmatization-and-Stopwords-Removal
