TY - JOUR
T1 - Density-Based K-Nearest Neighbor Active Learning for Improving Farsi-English Statistical Machine Translation System
TT - 
JF - ITRC
JO - ITRC
VL - 7
IS - 3
UR - http://ijict.itrc.ac.ir/article-1-94-en.html
Y1 - 2015
SP - 63
EP - 72
KW - Active Learning
KW - Statistical Machine Translation
KW - Farsi and English pair Languages
KW - Soft Decision Making
KW - Kernel Based Distance
KW - Density Based KNN
N2 - Labeled data are useful resources for many applications in fields such as image processing and natural language processing. Producing labeled data is a costly process. One efficient way to reduce the cost of annotating data is to manage the sampling process: it is better to query for essential samples than for a group of unnecessary ones. Active learning (AL) attempts to overcome the labeling bottleneck by sending queries for unlabeled instances to be labeled by an annotator. This technique has been applied to Natural Language Processing (NLP), especially to Statistical Machine Translation (SMT) tasks, which are the focus of this work. In SMT, parallel corpora are scarce resources, and AL is one way of addressing this problem. It attempts to reduce the cost of data annotation by sending queries only for translations of the most informative sentences, which are essential for system improvement. The contribution of our work is a new AL approach that selects sentences through a soft decision making process. In this algorithm, in addition to scoring sentences according to their information, the distribution of the space of unlabeled data is also considered. Each sentence (either labeled or unlabeled) is converted to a vector of feature scores. Each incoming sentence is then observed in the feature space and assigned two probabilities: the probability of being labeled and the probability of being unlabeled.
These probabilities are calculated according to the position of the new instance relative to its labeled and unlabeled neighbors. We have applied the proposed model to improve the training corpus of an SMT system, with the Farsi-English language pair selected for the baseline SMT system. We sample the sentences that can best improve the quality of our SMT system and send queries for their translations. In this way, the cost of building a parallel corpus is reduced. Finally, our experiments show significant improvements for sentence sampling by soft decision making in comparison to the random sentence selection strategy.
M3 - 
ER - 