Pangeanic has reached the first milestone of its Hybrid Neural Machine Translation Platform project.
This project, backed by the CDTI and the EU under its Operative Growth of Intelligence programme (case no. IDI-20170964), aims to use AI to create a neural machine translation program through the development of hybridization techniques.
Why neural machine translation?
Neural machine translation systems are currently a hot topic in the scientific community. In recent years, the number of publications on this topic has been growing.
These systems have important advantages: the context taken into account when translating is the whole sentence (classic statistical systems took a maximum of 7 words into account), and all the components of the system are trained at the same time in order to achieve better translation quality. In addition, the stored translation model occupies less memory than that of a classical statistical system. Large corporations such as Google (Wu et al., 2016) and Microsoft (Hassan et al., 2018) are interested in neural translation and claim that their neural machine translation results are beginning to approach human translation.
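The difference in context can be illustrated with a minimal sketch. The 7-word limit comes from the text above; the function names and the example sentence are hypothetical, and the fixed window is a simplification of how a statistical system actually uses its context.

```python
def window_context(tokens, position, max_words=7):
    """Words visible to a fixed-window model when it reaches `position`:
    at most `max_words` tokens ending at that position."""
    start = max(0, position - max_words + 1)
    return tokens[start:position + 1]


def sentence_context(tokens):
    """Words visible to a sentence-level model: the whole sentence."""
    return list(tokens)


# Hypothetical example: the subject "report" is far from its verb "approved".
sentence = ("the report that the committee published last year after "
            "long negotiations was finally approved").split()
last = len(sentence) - 1

window = window_context(sentence, last)
full = sentence_context(sentence)

print("fixed window :", window)
print("full sentence:", full)
```

In this example the subject lies outside the 7-word window by the time the verb is reached, whereas a sentence-level model can still condition on it.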
The architecture of neural systems is completely different from that of classical statistical translation systems. This means that all the functionalities available in classical statistical machine translation systems have to be re-investigated, because implementing them in a neural system is not obvious and requires further study.
Pangeanic’s stance on its study
The first milestones for Pangeanic’s Hybrid Neural Machine Translation Platform include:
- Redesigning the pre-processes and post-processes, which were originally designed for statistical systems, so that they operate correctly in neural systems.
- Selecting the appropriate toolkit for the project.
- Designing the architecture of the project; a standard model based on a bidirectional sequence-to-sequence recurrent neural network was chosen.
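To make the bidirectional idea in the last item concrete, here is a minimal sketch of a bidirectional recurrent encoder in plain Python. It is an illustration under stated assumptions, not the project's model or a toolkit API: it uses a simple Elman cell instead of the gated cells (LSTM or GRU) that practical systems typically use, the weights are random and untrained, the decoder is omitted, and every function name is hypothetical.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix (untrained, for illustration only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def rnn_step(x, h, w_in, w_rec):
    """One Elman RNN step: h_new = tanh(W_in x + W_rec h)."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(w_in[j], x))
            + sum(w * hi for w, hi in zip(w_rec[j], h))
        )
        for j in range(len(h))
    ]


def run_rnn(inputs, w_in, w_rec, hidden_size):
    """Run the cell over a sequence and return the state after each input."""
    h = [0.0] * hidden_size
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_in, w_rec)
        states.append(h)
    return states


def bidirectional_encode(embeddings, hidden_size=4, seed=0):
    """Encode a sequence left-to-right and right-to-left, then concatenate
    the two states at each position."""
    rng = random.Random(seed)
    emb_size = len(embeddings[0])
    fwd_in = make_matrix(hidden_size, emb_size, rng)
    fwd_rec = make_matrix(hidden_size, hidden_size, rng)
    bwd_in = make_matrix(hidden_size, emb_size, rng)
    bwd_rec = make_matrix(hidden_size, hidden_size, rng)
    forward = run_rnn(embeddings, fwd_in, fwd_rec, hidden_size)
    # Process the reversed sequence, then restore the original order.
    backward = run_rnn(embeddings[::-1], bwd_in, bwd_rec, hidden_size)[::-1]
    return [f + b for f, b in zip(forward, backward)]


# Three toy word embeddings of size 3 (hypothetical values).
toy_embeddings = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2]]
encoded = bidirectional_encode(toy_embeddings)
print(len(encoded), "positions,", len(encoded[0]), "values per position")
```

Each position ends up with a vector that summarises both the words before it and the words after it, which is what lets the decoder draw on the full sentence.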
Several toolkits were tested: Nematus, ModernMT, TensorFlow and OpenNMT. OpenNMT is open source, offers many functionalities, and its documentation is complete enough to make implementing new options easy. Furthermore, it is supported by Harvard and Systran, and a large community currently uses it. For these reasons, we decided on OpenNMT.
After deciding on OpenNMT, the first step was to run experiments to determine the best settings, architecture and number of parameters for the amount of data that we have.
Future publications related to the Hybrid Neural Machine Translation Platform
We are currently reviewing an article that presents the results of our investigation into the impact of tokenization on final translation quality, which was carried out in the first part of this project.
Additionally, we plan to write several articles on the results of our investigations and to submit them to relevant conferences and workshops to be held in 2019. Finally, we plan to prepare demonstrations of the system developed during the project and to present them at one of the most important conferences held next year.
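As a rough illustration of why tokenization matters, the sketch below applies two simple tokenization schemes to the same sentence and reverses each one as a post-process. It is a minimal sketch, not the tokenizers studied in the article: the regular expressions, the `<space>` marker and all function names are assumptions chosen for the example.

```python
import re


def tokenize_words(text):
    """Pre-process: split into words and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def detokenize_words(tokens):
    """Post-process: rejoin tokens and attach punctuation to the previous word."""
    text = " ".join(tokens)
    return re.sub(r"\s+([.,;:!?])", r"\1", text)


def tokenize_chars(text):
    """Alternative pre-process: one token per character, spaces made explicit."""
    return ["<space>" if ch == " " else ch for ch in text]


def detokenize_chars(tokens):
    """Post-process for the character scheme: restore spaces and join."""
    return "".join(" " if tok == "<space>" else tok for tok in tokens)


source = "Neural systems translate whole sentences, not fragments."
word_tokens = tokenize_words(source)
char_tokens = tokenize_chars(source)

print(len(word_tokens), "word-level tokens:", word_tokens)
print(len(char_tokens), "character-level tokens")
print(detokenize_words(word_tokens))
print(detokenize_chars(char_tokens))
```

The same sentence becomes a short sequence over a large vocabulary in one scheme and a long sequence over a tiny vocabulary in the other, so the translation model sees quite different inputs depending on the choice.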
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. (2016). Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Hassan, H., Aue, A., Chen, C., Chowdhary, V., Clark, J., Federmann, C., Huang, X., Junczys-Dowmunt, M., Lewis, W., Li, M., Liu, S., Liu, T.-Y., Luo, R., Menezes, A., Qin, T., Seide, F., Tan, X., Tian, F., Wu, L., Wu, S., Xia, Y., Zhang, D., Zhang, Z., Zhou, M. (2018). Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567.