Multi-type Obfuscation Corpus for Cross-Lingual Plagiarism Detection

Asghari, Habibollah

Habibollah Asghari

ICT Research Institute, ACECR, Tehran, Iran, habib.asghari@ictrc.ac.ir

Abstract:

In recent years, the wide availability of documents on the Internet has made plagiarism a serious issue in many fields of research. Moreover, machine translation systems facilitate the reuse of textual content across languages, so detecting plagiarism in cross-lingual cases, where the source and target languages differ, is now of great importance. Various methods for automatic detection of text reuse have been developed to help human experts investigate suspicious documents for plagiarism. Evaluating the performance of these plagiarism detection systems and algorithms requires plagiarism detection corpora. In this paper, we propose an English-Persian plagiarism detection corpus comprising different types of paraphrasing. The goal is to simulate what a human would do to conceal plagiarized passages after translating the text into the target language. The proposed corpus includes seven types of paraphrasing methods that cover (but are not limited to) all of the obfuscation types used in previous works, combined into one integrated cross-lingual plagiarism detection (CLPD) corpus. To evaluate the corpus, an extrinsic evaluation approach was applied by running a wide variety of plagiarism detection algorithms as downstream tasks on the proposed corpus. The results show that the performance of the algorithms decreases as obfuscation complexity increases.

Keywords: Cross-lingual plagiarism detection, Corpus construction, Obfuscation strategy, Translation obfuscation

Type of Study: Research | Subject: Information Technology

Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
