Data Diversification in Different Domains

Abstract

We tackle the problem of data diversification from multiple angles. We first explore text diversification, examining how subsets of sentences drawn with different sampling techniques affect the resulting machine translation model. We show that some of these sampling techniques yield a notable increase in translation quality.

We then consider data diversification in a wider scope through the problem of submodular maximisation, which is one way to phrase a diversification problem mathematically. We develop a parallel submodular maximisation algorithm under a cardinality constraint and compare its performance against existing submodular maximisation algorithms. We show that our algorithm yields accuracies comparable to those of existing algorithms while requiring orders of magnitude less computation time and fewer function calls.
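
As a rough illustration of what submodular maximisation under a cardinality constraint means, the sketch below runs the classic sequential greedy baseline on a toy coverage function. It is not the parallel algorithm developed in this work. The names `coverage`, `greedy_max` and the toy `sentences` data are hypothetical and chosen only for this example.

```python
def coverage(sets_by_item):
    """Build f(S) = number of distinct elements covered by the items in S.

    Coverage functions are a standard example of a monotone submodular function.
    """
    def f(selected):
        covered = set()
        for item in selected:
            covered |= sets_by_item[item]
        return len(covered)
    return f


def greedy_max(f, ground_set, k):
    """Select at most k items, each time adding the largest marginal gain."""
    selected = []
    remaining = list(ground_set)
    for _ in range(k):
        base = f(selected)
        best_item, best_gain = None, 0
        for item in remaining:
            gain = f(selected + [item]) - base  # one function call per candidate
            if gain > best_gain:
                best_item, best_gain = item, gain
        if best_item is None:  # no remaining item adds value, so stop early
            break
        selected.append(best_item)
        remaining.remove(best_item)
    return selected


# Hypothetical toy data: each "sentence" covers a set of word ids.
sentences = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {1, 2},
    "d": {5},
}
f = coverage(sentences)
chosen = greedy_max(f, sentences, k=2)
print(chosen, f(chosen))
```

Each greedy round evaluates the function once per remaining candidate, so the sequential baseline needs on the order of n times k function calls for n items. This is the kind of cost that a parallel method would aim to reduce.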

Keywords

Diversification, Natural Language Processing, Submodular Maximisation