Introduction

Background: Math PhD. Over the past few years I started several small companies, the last of which reached moderate success. I hope to use some of the domain expertise from that last venture to devise a good product, either for internal or external use.

The topic here will be a general study of natural language processing (NLP), with special attention to machine translation (MT) tasks.

Concretely speaking, the goal is to do one of:

  1. Find a great company to join
  2. Build a valuable product
  3. Build a product that gets me to (1)
  4. Create a blog that gets me to (1)
  5. Create a blog which attracts a team which gets me to (2)

To this end I will publish my investigations into NLP. I briefly investigated NLP in 2016, when I had hoped to use it for some machine translation, but the state of the art at the time proved quite insufficient. I hope in 2019 to find or create a model that performs better.

Literature Search and Preliminary Findings

I did a basic literature search. Since I had identified Japanese-English (JP-EN) as an underserved MT task, I searched for publications on this topic. In the process, I found a tutorial that includes a JP-EN corpus at [https://github.com/neubig/nmt-tips]. It is dated by NLP standards, but it is quite good and comes from the excellent NeuLab at CMU. Following the citations on the NeuLab site [http://www.cs.cmu.edu/~neulab/], I learned of WMT.

WMT is a conference on machine translation where teams submit their systems, which are evaluated against a fixed parallel test set. I looked carefully at the most recent WMT overview paper, WMT2019 - Findings of the 2019 Conference on Machine Translation. It is very accessible, and I was able to understand it with minimal effort. It is itself a summary of results and worth reading on its own merits. I read this paper, as well as a number of the submissions to it.

In addition to WMT, I found, via the research page on NeuLab, another such shared task: the Findings of the First Shared Task on Machine Translation Robustness. This used the MTNT dataset from Michel and Neubig. Many of the same teams participated, and there are a number of accompanying papers. Of interest is that the organizers found BLEU to be representative of human judgment:

In terms of evaluation, we found an automatic metric (BLEU) to be roughly consistent with human judgment
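
For my own reference, here is roughly what that automatic metric looks like in practice: a minimal sketch of scoring a toy system with sacreBLEU (pip install sacrebleu), which, as far as I can tell, is the standard scorer for the WMT shared tasks. The sentences are placeholders I made up, not anything from the MTNT test set.

    # Toy BLEU computation with sacreBLEU; the sentences are invented examples.
    import sacrebleu

    hypotheses = [
        "The cat sat on the mat.",
        "He did not go to the station.",
    ]
    references = [
        "The cat sat on the mat.",
        "He didn't go to the station.",
    ]

    # corpus_bleu takes the system outputs and a list of reference streams.
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    print(f"BLEU = {bleu.score:.1f}")

The score is a corpus-level statistic, which is part of why it can track human judgment in aggregate while being unreliable on individual sentences.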

Mentioned in this article is the Microsoft paper claiming human parity, as well as a response to this claim. Both are quite interesting articles to read. It seems that there are numerous technicalities involved in the testing methodology. In particular, Toral et al. found that Microsoft may have been scammed by Chinese translators who used post-edited MT. I myself have been scammed in this way before, as have others, so I am not surprised. Of note is the following quote:

This provides qualitative evidence that non-experts may be more tolerant of translation errors than professional translators.

This has been my experience as well, and it is interesting for several reasons. The target audience for machine-translated documents is not professional translators, and I have found in my personal experience that translators' criticism can sometimes be harsher than the target audience's. Still, despite the importance of this lesson, I found that many later papers did not make a point of using professional translators (or at least, they did not remark on this).

From here, I decided to look into the Transformer, which led me to a wealth of information. At WMT, 80% of the submissions were based on the Transformer architecture, and it deserves its own treatment. Since it is such a popular topic, it also led me to a number of other resources:

https://github.com/THUNLP-MT/MT-Reading-List
https://github.com/lena-voita/good-translation-wrong-in-context
https://github.com/yandexdataschool/nlp_course

I read in detail the article on automatic post-editing there, which was very accessible. I microblogged it on Twitter as @eochad, upon finding that one of the authors was on Twitter. Through Twitter I found Sebastian Ruder:

http://newsletter.ruder.io/
https://nlpprogress.com

Finding Ruder's newsletter was a huge breakthrough, and it drowned me in information. I devoured the last year of his newsletter postings, which I highly recommend reading in detail, along with all of the articles they reference. Some of them are good and others are mediocre, but the newsletter (together with the articles on ruder.io) is, to the best of my knowledge, the best available resource on the state of the art.

Through it, I discovered BERT, as well as the numerous innovations that branched off of it. Here is my current understanding of the state of the art:

1. Transformer is (still) king

Considering the WMT results, as well as how active the BERT line of work has been, the Transformer continues to reign supreme. The newsletter links an entire collection of resources for studying Transformers.
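
To convince myself I understand the piece that is being reused everywhere, here is a minimal sketch of the scaled dot-product attention at the core of the Transformer, written in PyTorch (my assumption for all snippets here); the shapes and random inputs are purely illustrative and not taken from any WMT system.

    # Scaled dot-product attention, the core operation of the Transformer.
    import math
    import torch

    def scaled_dot_product_attention(q, k, v, mask=None):
        # q, k, v: (batch, seq_len, d_k)
        d_k = q.size(-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq, seq)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)  # attention distribution over positions
        return weights @ v                       # weighted sum of value vectors

    q = torch.randn(2, 5, 64)  # batch of 2, sequence length 5, dimension 64
    k = torch.randn(2, 5, 64)
    v = torch.randn(2, 5, 64)
    print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 5, 64])

The full architecture stacks many of these (multi-headed, with feed-forward layers, residual connections, and positional encodings), but this is the part that replaced recurrence.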

2. BERT, and transfer learning from it

Sebastian links [https://github.com/thunlp/PLMpapers]. Here is a good (layman's) explanation of BERT: [https://yashuseth.blog/2019/06/12/bert-explained-faqs-understand-bert-working/]. The original paper by Devlin et al. is reasonably accessible and worth reading, but I think one should view BERT not in isolation but together with the many extensions it enables:

https://github.com/thunlp/PLMpapers/raw/master/PLMfamily.jpg

BERT itself is available on the google-research GitHub, which includes a Google Colab notebook. This is a huge new branch of NLP that I have only minimally explored, and it will take me some time to fully grasp. Information about multilingual BERT is here: [https://github.com/google-research/bert/blob/master/multilingual.md].
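
As a first, very small experiment, here is a sketch of pulling contextual features out of multilingual BERT. I am assuming the Hugging Face transformers package here rather than the TensorFlow code in the google-research repo, and the Japanese sentence is just an arbitrary example.

    # Extract contextual embeddings from multilingual BERT.
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = BertModel.from_pretrained("bert-base-multilingual-cased")
    model.eval()

    # An arbitrary Japanese sentence; mBERT shares one vocabulary across 100+ languages.
    input_ids = tokenizer.encode("猫がマットの上に座った。", return_tensors="pt")

    with torch.no_grad():
        outputs = model(input_ids)

    # outputs[0] holds one contextual vector per subword token.
    hidden_states = outputs[0]
    print(hidden_states.shape)  # (1, number_of_subword_tokens, 768)

For MT the more interesting question is how to transfer these representations into an encoder-decoder system, which is exactly the branch I have only minimally explored.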

3. Much I still don't know

There is still quite a bit to explore. In terms of reading, I would like to at least skim the most recent papers from the top people in this field, though I have not yet identified all of them. In addition, I should write more software. After reading the paper on automatic post-editing, I tried to train a baseline model using the Jupyter notebooks that Lena made available in that repository, but Google Colab kept terminating my process, so I never quite finished.