Number of unique “root” words?

In the Polish sentences, many of the words seem to be variations based on declension, conjugation, etc.

Does that mean we have fewer than 20,000 words? If so, how many unique “root” words are we working with?

From what I’ve been told (for Russian), the number of words refers to unique forms, not to unique roots. So indeed, if you had 20,000 unique forms, you would have fewer than 20,000 unique roots.

Answering the question of how many unique root words one is working with is difficult not only in terms of developing an implementation, but in terms of agreeing on a standard. For instance, does a noun derived from a verb (or vice versa) count as unique?
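To make the forms-versus-roots distinction concrete, here is a minimal sketch in Python. The word list and the `LEMMA_OF` mapping are a hand-made toy example, not data from the actual collections, and `count_lemmas` is a hypothetical helper. It only illustrates how the answer changes depending on whether a noun derived from a verb counts as its own root.

```python
# Toy data: seven Polish word forms (hypothetical sample, not from the real collection).
FORMS = ["kot", "kota", "kotem", "pisać", "piszę", "pisze", "pisanie"]

# Hand-made lookup from each form to its dictionary form (lemma).
# A real pipeline would need a morphological analyzer; this table is an assumption.
LEMMA_OF = {
    "kot": "kot", "kota": "kot", "kotem": "kot",
    "pisać": "pisać", "piszę": "pisać", "pisze": "pisać",
    "pisanie": "pisanie",  # verbal noun derived from "pisać"
}


def count_lemmas(forms, lemma_of, merge_derived=None):
    """Count distinct lemmas; optionally fold derived lemmas into their source."""
    merge_derived = merge_derived or {}
    lemmas = set()
    for form in forms:
        lemma = lemma_of.get(form, form)          # unknown forms count as their own lemma
        lemma = merge_derived.get(lemma, lemma)   # apply the chosen counting standard
        lemmas.add(lemma)
    return len(lemmas)


unique_forms = len(set(FORMS))
strict = count_lemmas(FORMS, LEMMA_OF)                          # derived noun counted separately
merged = count_lemmas(FORMS, LEMMA_OF, {"pisanie": "pisać"})    # derived noun folded into the verb

print(unique_forms, strict, merged)
```

With this toy data, seven unique forms collapse to three roots under one standard and two under the other, which is the kind of disagreement that makes a single "root count" hard to pin down.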

I suspect that no attempt to count unique root words has been made. But perhaps I’m wrong.

Thanks for the feedback, Alan.

That makes a lot of sense, and I can see where the difficulty arises in counting.

Based on this article, it seems the linguistic term is “lemmas”, and around 5,000 is what a native English speaker uses daily (while understanding anywhere from 3-8x that number).

I’m hoping that, with the 30-40k clozes in the most common words collections, we’d hit somewhere near that 5,000 mark.


I wouldn’t count on that. Based on the alphabetical list of all Polish words, I see 4-7 lemmas in every 100 words at best. That would make 1,200-2,800 lemmas in 30-40k clozes, not even close to 5,000.
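The arithmetic behind the estimate above can be checked with a short Python sketch. It assumes, as the estimate does, that the 4-7 lemmas per 100 words rate holds across the whole collection and that each cloze corresponds to one distinct word form. The function name `lemma_range` is made up for illustration.

```python
def lemma_range(cloze_counts, lemmas_per_100):
    """Return (low, high) lemma estimates from ranges of cloze counts and lemma rates."""
    # Lowest estimate: fewest clozes at the lowest rate.
    low = min(cloze_counts) * min(lemmas_per_100) // 100
    # Highest estimate: most clozes at the highest rate.
    high = max(cloze_counts) * max(lemmas_per_100) // 100
    return low, high


low, high = lemma_range((30_000, 40_000), (4, 7))
print(low, high)  # 1200 2800
```

Even the optimistic end of that range is a little over half of the 5,000 target.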
