The following snippet gives inconsistent results:
from reynir_correct import tokenize

texts = ["Skúta", "300 ára gömul írsk skúta fundin við Suður-Noreg"]
for text in texts:
    tokens = tokenize(text, only_ci=True)
    # Print every token that has text, with its error code and description
    for tok in tokens:
        if tok.txt:
            print(f"{tok.txt:12} {tok.error_code:8} {tok.error_description}")
Output:
Skúta
300
ára
gömul
írsk
skúta U001 Óþekkt orð: 'skúta'
fundin
við
Suður-Noreg
The correctly spelled word skúta is flagged as unknown when it appears inside the sentence, but not when it is passed as a standalone word. Calling tokenize() without any options works as expected.
It's also not clear from the documentation what exactly the option only_ci does.
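To make the comparison above easier to repeat, here is a minimal sketch of a helper that collects the flagged tokens with and without extra options. The names flagged_words and compare_options are hypothetical and not part of the library; the only assumption is that tokens expose txt and error_code as in the snippet above. The tokenizer is passed in as an argument, so the helper itself does not import the library.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Flag = Tuple[str, str]


def flagged_words(
    tokenize_fn: Callable[..., Iterable[Any]], text: str, **options: Any
) -> List[Flag]:
    """Return (txt, error_code) for every token that carries an error code.

    Hypothetical helper: assumes tokens have `txt` and `error_code`
    attributes, as the tokens in the snippet above do.
    """
    flags: List[Flag] = []
    for tok in tokenize_fn(text, **options):
        code = getattr(tok, "error_code", "")
        if tok.txt and code:
            flags.append((tok.txt, code))
    return flags


def compare_options(
    tokenize_fn: Callable[..., Iterable[Any]],
    texts: Iterable[str],
    **options: Any,
) -> Dict[str, Tuple[List[Flag], List[Flag]]]:
    """Map each text to (flags with default options, flags with **options)."""
    return {
        text: (
            flagged_words(tokenize_fn, text),
            flagged_words(tokenize_fn, text, **options),
        )
        for text in texts
    }
```

With the library installed, compare_options(tokenize, texts, only_ci=True) should show for each text which words are flagged only when the option is set.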