WMT is a conference on machine translation where teams submit their systems, which are then tested against a fixed parallel text. I looked carefully at the most recent WMT paper, WMT2019 - Findings of the 2019 Conference on Machine Translation. It is very accessible, and I was able to understand it with minimal effort. It is itself a summary of results, and worth reading on its own merit. I read it along with a number of the submissions, and came to the following conclusions.

1. Teams which did well on BLEU did well on human evaluation.

This suggests that, despite criticism, BLEU is a good metric for translation quality, at least for independent sentences. Later, I read articles suggesting that it is insufficient for judging how well a system handles context (the big problem with MT).
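To make the metric concrete, here is a minimal sketch of a sentence-level BLEU score. It is a simplification and not the scoring used at WMT: it assumes a single reference, whitespace tokenization, and no smoothing, and the function names `ngrams` and `sentence_bleu` are my own. Real evaluations score a whole corpus at once with standardized tokenization.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Simplified single-reference BLEU on whitespace tokens, no smoothing."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not hyp:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        # Clipped counts: an n-gram is credited at most as often
        # as it appears in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            # Without smoothing, any zero precision makes the score zero.
            return 0.0
        log_precision_sum += math.log(overlap / total)

    # The brevity penalty punishes hypotheses shorter than the reference.
    if len(hyp) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))

    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity_penalty * math.exp(log_precision_sum / max_n)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(sentence_bleu("the cat sat on the mat", reference))
    print(sentence_bleu("the cat sat on the mat today", reference))
```

Because the score only counts overlapping word sequences within one sentence, it has no way of seeing whether a translation is consistent with the sentences around it, which is the limitation noted above.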

2. They use Amazon Mechanical Turk to measure translation QC.

Smart. I should do this as well. However, I read in another paper (a criticism of the famous Microsoft paper on human parity) that nonprofessional translators are not as good at translation QC. Still, the general idea of leveraging crowdsourcing platforms (vs. hiring individuals, for example on Upwork) is a good one.

I recommend a reading of Neubig's submission [https://www.aclweb.org/anthology/W19-5368.pdf] for details on how this is implemented.

3. Transformer is king

"The Transformer architecture (Vaswani et al., 2017) dominates with more than 80% of submissions". The fundamental mechanism in the Transformer is attention. A brief reading of the paper suggests that the core reason the Transformer is so effective is that attention processes all positions of a sentence in parallel, rather than one token at a time as recurrent networks do, which makes training on larger datasets practical.

Certainly a detailed study of Transformer is in order.
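As a starting point for that study, here is a minimal sketch of scaled dot-product attention, the core operation described in Vaswani et al. (2017): softmax(QK^T / sqrt(d_k)) V. It is written in plain Python for readability, with function names of my own choosing. Real implementations use batched matrix multiplications, multiple heads, and learned projections, all of which are left out here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(dim)])
    return outputs


if __name__ == "__main__":
    q = [[1.0, 0.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[1.0, 2.0], [3.0, 4.0]]
    print(attention(q, k, v))
```

Note that each pass through the outer loop depends only on its own query, so all positions can be computed at the same time. That independence is what lets the operation be expressed as a few large matrix multiplications.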

4. There are language pairs where MT is tied with humans or superhuman

Specifically, Facebook-FAIR (which performed the best in all but one metric) is superhuman in English to German, and is tied in German to English and English to Russian. One wonders why exactly that is the case, especially with Russian. I imagine this is due at least in part to the efforts of the Yandex team, although it may simply be due to the larger population of Russia compared with other countries in Europe and, therefore, a larger available corpus.

5. Top performers did not exclusively use the WMT parallel texts

WMT calls systems trained only on the provided data "constrained". They make available a number of parallel texts for training, which was very exciting, until I realized that most top systems also use their own specialized texts. Naively, one would think this suggests that the WMT parallel corpus is at best a good introductory test.

There were a number of rather technical notes on trends, mostly concerned with the way the text was preprocessed. For example, they write that of the submissions which used tokenizers (90%+), 40% used Moses. Since this is too technical for me at the moment, I will need to review these findings when I have a stronger grasp of the subject.
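For orientation, tokenization just means splitting raw text into the units a system works with. The sketch below uses a single regular expression to separate words from punctuation. It is not the Moses tokenizer, which handles many more cases (abbreviations, language-specific rules, escaping), and the function name `simple_tokenize` is my own.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches any one character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


if __name__ == "__main__":
    print(simple_tokenize("Hello, world!"))
```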

I briefly read the Facebook-FAIR paper, but found it too specialized for me to understand in depth. Another paper to return to at a later point.