Thoughts on using sentences from Wikipedia?

mike · July 7, 2020, 10:19pm

Many less popular languages like Irish, Scottish Gaelic, Catalan, and Albanian, among others, don’t yet have many sentences available on Tatoeba.

One option we’re considering to help bolster these languages is to use sentences from Wikipedia. We’d aim to select sentences that fit certain criteria (not too short/long, not too many proper nouns / numbers, useful vocab, etc.), and likely use Google Translate for the initial translations.
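To make the selection criteria a bit more concrete, here is a minimal sketch of what such a filter might look like. The function name and every threshold are assumptions for illustration only, not how Clozemaster actually selects sentences, and the capitalization check is just a crude stand-in for proper-noun detection (it would not work for a language like German, where all nouns are capitalized).

```python
import re

def is_candidate(sentence, min_words=4, max_words=15, max_proper=2, max_numbers=1):
    """Return True if a sentence passes rough, hypothetical selection criteria."""
    words = sentence.split()
    # Reject sentences that are too short or too long.
    if not (min_words <= len(words) <= max_words):
        return False
    # Capitalized words after the first one: a crude proxy for proper nouns.
    proper = sum(1 for w in words[1:] if w[:1].isupper())
    # Tokens containing at least one digit.
    numbers = sum(1 for w in words if re.search(r"\d", w))
    return proper <= max_proper and numbers <= max_numbers


if __name__ == "__main__":
    for s in [
        "The cat sat on the warm windowsill today.",
        "Too short.",
        "In 1998 John Smith moved from Dublin to Galway with 3 dogs.",
    ]:
        print(is_candidate(s), "-", s)
```

The "useful vocab" criterion is left out here because it would need a frequency list for each language.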

Other sources could include Wikibooks and Project Gutenberg where possible.

Two questions for everyone here:

1. What do you think of using sentences from these sources / branching out beyond Tatoeba?
2. Given the ability to edit translations on Clozemaster, what do you think of using machine translations?

Curious to hear, of course, if you have any other thoughts on how we might improve support for these languages.

Adrianxu · July 7, 2020, 10:41pm

Perfect!

Depending on the language combination it could work.

Expugnator · July 8, 2020, 12:20am

I agree with using other sources, though the Wikipedia articles don’t necessarily match across languages.

As for machine translation, coming from the translation field, I’d say it leads to unnatural learning, even if the MT is only for our source (usually native or stronger) language.

kadrian · July 8, 2020, 5:06am

I think you should try to find bilingual / very knowledgeable people to check all machine translations.

You mentioned Project Gutenberg - I’ve often thought it would be fun to read a book with a cloze word missing from every sentence! I started a collection with sentences from “Madame Bovary” but found it too time-consuming to upload sentences to get very far. The sentences would need to play in order rather than randomly.

alanf_us · July 8, 2020, 9:52pm

I’m not crazy about any of these ideas:

1. Removing Tatoeba from the chain
2. Using Wikipedia as a source
3. Using machine translation as a starting point

Regarding 1:

I think sites should focus on what they do best. Clozemaster does a great job of taking sentence pairs from elsewhere (namely, Tatoeba), identifying and classifying cloze words, and integrating the sentences into its “gamifying” infrastructure. By contrast, for 14 years, Tatoeba’s mission has been to develop a high-quality, extensive, freely usable corpus of translated sentences. Clozemaster doesn’t have anything near the infrastructure to accomplish anything similar. But if it could, whatever sentence pairs it created should go back to Tatoeba, given the fact that Tatoeba’s existence is what makes Clozemaster possible. (Disclosure: I am an admin at Tatoeba, but I would feel the same way even if I were a regular contributor.)

Regarding 2:

Wikipedia has licensing requirements that seem to me stricter than Tatoeba’s. For instance, their documentation on reuse of text talks about the necessity of providing a hyperlink to the original page. I’m not sure that could be accomplished practically on a fine-grained level (per sentence, for instance).
Using Wikipedia text would require a lot of markup filtering (references, etc.).
Much Wikipedia content would be very different in many respects (style, vocabulary, sentence length) from current Tatoeba/Clozemaster text.
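To give a sense of the markup filtering involved, here is a rough sketch that strips a few common wikitext constructs with regular expressions. It is an illustration under simplifying assumptions (the function name is made up, and nested templates, tables, and many other constructs are not handled); real Wikipedia text is messier, and a dedicated wikitext parser would likely be a better choice in practice.

```python
import re

def strip_wiki_markup(text):
    """Remove a handful of common wikitext constructs (hypothetical helper)."""
    # Self-closing reference tags such as <ref name="a" />.
    text = re.sub(r"<ref[^>]*/>", "", text)
    # Paired <ref>...</ref> footnotes, including their content.
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Simple, non-nested templates such as {{citation needed}}.
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Internal links: keep the display text of [[target|text]] or [[target]].
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Bold and italic quote markers.
    text = re.sub(r"'{2,}", "", text)
    # Collapse leftover whitespace.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    raw = ("'''Dublin''' is the capital of [[Republic of Ireland|Ireland]]."
           "<ref>Some source</ref> It lies on the [[River Liffey]].{{citation needed}}")
    print(strip_wiki_markup(raw))
```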

Regarding 3:

Machine translation simply shifts the work from initial translation to correction. It only works if the quality of the machine translation is already very high and you have expert human translation checkers. If Tatoeba has a shortage of Irish, Scottish Gaelic, Catalan, and Albanian contributors/translators, what could Clozemaster offer that would attract more of them? I suppose money or in-kind benefits (free Pro membership?) would be a possibility, but I’m dubious that they would be a deciding factor.
The quality of machine translation depends on the availability of large amounts of data, which need to come from somewhere, such as Tatoeba. (Even if it comes from another source, it’s quite likely that data that is rare at Tatoeba is also rare elsewhere.) So, again, if Tatoeba has too little data to support Clozemaster for various language pairs, it’s likely that machine translation for those language pairs will also be of poor quality.

punk · July 9, 2020, 3:59am

One thing worth exploring, as @alanf_us mentioned in his 3rd point, is having someone with a good grasp of both target and source languages correct the machine-translated sentences.

A few more considerations:

As for the imported material, try to stick with recent material as much as possible, in order to avoid words and expressions that either are no longer in use, or that were spelled differently back then.
In the list of languages available, there should be a clear distinction between the current languages that are sourced exclusively from Tatoeba, and languages that also source from elsewhere. For example, a warning that the language is currently in “Beta” and that it contains machine-translated sentences, if that’s the case. This is so that the user can manage their expectations when selecting a language.
As more sentences are added into a language in Tatoeba, there might come a point when its number becomes large enough to make it worth being a 100% Tatoeba-sourced language. In that case, what should happen to sentences sourced from elsewhere? Should they be converted into cloze-collections, leaving the “Most Common Words” & “Fluency Fast Track” exclusive for Tatoeba sentences? Or should sentences from all different sources always coexist?

Overall, provided that the translations meet a decent standard, this is something I’d be interested in trying out, for the purpose of learning a language that has scarce resources available.

alanf_us · July 11, 2020, 3:06pm

To expand on my previous response, let’s say you wanted to build up the “Albanian from English” language pair. Here’s what I would suggest doing. Some of these steps can be done in parallel, some are alternatives to each other, and some of them you have undoubtedly already done to create your language pairs in the first place. Note that some of them require joining Tatoeba (membership is free).

Look through the native-level (five-star) speakers of Albanian at Tatoeba ( Members: Albanian - Tatoeba ). Figure out which ones are trustworthy. You can look at the comments on a given user’s sentences to help you figure that out. Also, figure out which ones are active. You can look at the user’s most recent activity to determine that.
Contact the trustworthy speakers by sending them messages through Tatoeba. Ask them if they would be willing to help you out by translating English sentences into Albanian, reviewing existing Albanian sentences, etc.
Find native-level speakers of Albanian elsewhere who would be interested in joining Tatoeba and contributing there.
Decide which English sentences that are not already translated into Albanian should be translated. Add these to a list.
Obtain English sentences from a source whose license permits it (such as noncopyrighted books from Project Gutenberg), process them, and have a native English speaker (such as you) add them to Tatoeba. They can then serve as the basis for translation not only to Albanian but also to other languages.

Mupfel · July 15, 2020, 9:43am

@mike, can you give us some more information on how and how often sentences are ‘imported’ from Tatoeba to Clozemaster?
For example, the pair Latin-English seems to have 29.421 sentences on Tatoeba, but only 6.920 on Clozemaster.

And what consequences does an edit of a translation on Clozemaster have? If I change a translation, does that change it for everybody? (Which could be a problem if I know but little.)
Or does it only change my wrong sentence? (But then every single user would have to change it, is that reasonable?)
Or are there moderators that check the correctness of the translation?

Like others wrote before me, I’m a bit sceptical of machine translations, as they depend on a large database of correct sentences as a starting point, which probably doesn’t exist for the small languages.

I like the idea of using Project Gutenberg (maybe in a special collection? ClozeBooks?), but I guess there also aren’t many books available in Gaelic, Navajo or Brezhoneg.

mike · July 16, 2020, 12:08pm

This is all great feedback - thanks @adrianxu @Expugnator @kadrian @alanf_us @punk and @Tarob! Very much appreciated.

It sounds like the consensus is that if we do pursue using sentences from other sources along with machine translations, they should be kept separate (separate collections, perhaps under a separate section, with a note like @punk mentioned) until we can get humans/translators to verify the quality. In this way we can try it out without muddling the existing collections using primarily Tatoeba sentences/translations.

From there we can work on figuring out if there’s a way to get more people involved with verifying the translation quality and translating content either on Tatoeba or that we can contribute to Tatoeba like @alanf_us described. We’ll have to give this some more thought. @alanf_us the process you outlined sounds right on, just a matter of the time/resources to execute on it.

@Tarob, sentences haven’t been imported very frequently from Tatoeba; however, we’re very close to changing that. We’re putting the finishing touches on a “pipeline” which should let us add more languages and language pairings, as well as update existing language pairings, much more easily - super excited about it.

Latin is actually one of the languages we’re looking to update first, likely within the next few weeks, which should max out the Random Collection at 20k sentences. We may also be able to add a Fast Track - we’re working on possibly coming up with a frequency list. We also recently added Toki Pona and Lojban.

Editing a translation on Clozemaster only changes it for you. We should probably add a note to make this clear - thanks for letting us know it isn’t clear at the moment.

ClozeBooks! I’ve also thought this would be a cool idea - will see what we can do!

Mupfel · July 18, 2020, 7:45pm

Sounds great!

I think frequent syncing will be very motivating for those users that use both Clozemaster and Tatoeba. If I see that my work on Tatoeba makes Clozemaster better, I’m motivated to do more.

And I’m really looking forward to ClozeBooks if it can be done.
My goal was to start reading books after I finished the Fluency Fast Track. The plan was to use a different service (like Readlang), but of course it would be way more pleasant if it were both in the same app. (I’m also not sure if I want to pay for two subscriptions.)

With ‘ClozeBooks’ it would be great if there was also a ‘level’ etc., like: this book can be read with a vocabulary of 5000 words, for this one you need 10.000, etc. Maybe it would also be good to have some children’s books, as they tend to be easier for starters.
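One way such a ‘level’ could be estimated, sketched here purely as an illustration: take a frequency-ranked word list and measure what share of a book’s words fall within the top N entries. The function name, the simple tokenization, and the tiny word list are all assumptions; a real version would need a proper frequency list and some handling of inflected forms.

```python
import re

def coverage(text, ranked_words, vocab_size):
    """Fraction of the text's word tokens found among the top `vocab_size` ranked words."""
    known = set(ranked_words[:vocab_size])
    # Lowercase and split into word tokens; no lemmatization is attempted.
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in known) / len(tokens)


if __name__ == "__main__":
    # Hypothetical frequency list, most common word first.
    ranked = ["the", "cat", "sat", "on", "mat", "quantum"]
    book = "The cat sat on the mat."
    for n in (3, 5):
        print(n, round(coverage(book, ranked, n), 2))
```

A book could then be labelled with the smallest vocabulary size at which coverage passes some chosen threshold.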

JioMc · October 21, 2020, 4:22pm

One other possible idea:

Many minority languages have centralised language boards which offer a vast range of materials, often free. For instance, the entirety of the Say Something In Kernewek course is available free online, and the Cornish Kesva has at least one of its Skeul An Yeth textbooks posted for free. For Māori, the government-funded online Māori dictionary has a translated sentence accompanying almost every single word. The whole Te Reo superstructure here in New Zealand is extremely well resourced, a central feature of the govt’s cultural platform.

I wonder if it might be an idea to approach such centralised language organisations - they get an extended platform for the propagation of the language which they don’t have to pay for or do anything about, and you get translated sentences.

Just a thought.

P.S. Also, most languages have freely available, easily searchable translated Bibles. Many of them have been released as apps with translated text side-by-side. It could be a very easy and fruitful way of enlarging or even creating a collection.

physes · June 25, 2021, 2:42pm

If there’s anything you can do to improve the Irish course, that would be great: a full 20k Fluency Fast Track as well as grammar sections. Currently there’s just too little material available. I think using Wikipedia would be a great idea; I was even attempting to create a private collection based on this idea.

Dcarl1 · June 25, 2021, 5:23pm

It all sounds good except the machine translation. I would avoid that. Too many pitfalls, and too high a likelihood of needing cleanup.