Saturday, August 10, 2019

Should we stop sentences with Tom and Mary?


A discussion was started on the Wall about the overwhelming number of sentences containing “Tom” and “Mary”. The initial proposal was to ask the community to stop creating new sentences with “Tom” and “Mary”.

Wall thread:


To the question “Should we stop sentences with Tom and Mary?”, the official decision is: no, we should not. Contributors may continue creating sentences with “Tom” and “Mary”. No action will be taken against them.

As a general rule, no action will be taken against a contributor based on the sole fact that they are creating new sentences with a name that has been overused.

We will still take measures to address the underlying issue of diversity in our corpus:
  • We will make it clear in our documentation that people are free to use names other than Tom and Mary.
  • We will add a short text to Tatoeba's contribution page to encourage people to keep the corpus diverse.
  • We will create guidelines on how to contribute diverse sentences. These guidelines will be published in the wiki.


We recognize that the content of our corpus has become unfulfilling for many of our users and we recognize that we need to make an effort to make it more diverse. However, after a thorough discussion with the community, I can conclude that attempting to make Tom and Mary "illegal" is not an adequate response to the problem.

In the same way that it has been said:
Whatever issues or inconveniences arise because of people using a more diverse set of names, we will solve them, but with another solution than enforcing wildcard names.
The same thing could be said from the other side:
Whatever issues or inconveniences arise because of people overusing "Tom" and "Mary", we will solve them, but with another solution than restricting these names.

Some of our contributors feel a certain attachment to Tom and Mary. These names have become a comfort zone for them, and if they do not wish to step out of that comfort zone, we should not force them to. Doing so would only create a sense of lost freedom, and under those conditions people can easily become uncooperative or even try to cause more problems as a sign of protest.

We can obviously have the same issue on the other side: people might leave or cause problems out of disappointment that Tom and Mary sentences will keep multiplying. But then we are just trading one bad situation for another, and there is no way to evaluate which one is really worse.

Restrictive measures may help us achieve our diversity goal faster, but such measures would be motivated by impatience. As long as Tatoeba welcomes people from all backgrounds and gives them the chance to express themselves in their most authentic ways, we will achieve this goal. The abundance and growth of Tom and Mary sentences do not rule out a diverse corpus. We will get there; that is inevitable. Whether it takes five years or fifty years, there is no rush.

Additional points

Tom and Mary, a.k.a. wildcards, started out as an idea to reduce redundancy in the corpus. It has been demonstrated that this idea is inefficient. If you have been creating sentences with wildcards in the belief that it helps prevent near-duplicate sentences, know that it can actually have the opposite effect. You may continue to create sentences with wildcards if you wish, but you cannot claim that it is for the sake of reducing redundancy. That claim is misinformation at this point, and it spreads an unnecessary fear of near-duplicates.

Near-duplicates are not a big deal. We need to make this clear. They are in fact necessary: they help to identify patterns. We encourage everyone to simply not worry about them and to focus on being creative instead. Avoiding near-duplicates will come naturally: the more creative your sentence is, the less likely it is to have a near-duplicate. This will also help with diversity.
