Gathering Definition Answers by Information Gain
Abstract.
A definition question is a kind of question whose answer is a complementary set of sentence fragments, called nuggets, which define the target term. Because developing general, flexible patterns with wide coverage for answering definition questions is not feasible, we propose a method that uses information gain to retrieve the most relevant information. To obtain the relevant sentences, we compared the output of two retrieval systems: JIRS and Lucene. One important feature that affects the performance of definition question answering systems is the length of the sentence fragments, so we applied a parser to the relevant sentences in order to obtain clauses. Finally, we observed that in most clauses only one part, before or after the target term, contains information that defines the term, so we analyzed the sentence fragments before (left) and after (right) the target term separately. We performed experiments with the question collections from the 2002 pilot evaluation of definition questions, the definition questions from TREC 2003, and the "other" questions from TREC 2004. The F-measures obtained are competitive with those of the systems that participated in the respective conferences. Moreover, the best results are obtained with the general-purpose system (Lucene) rather than with JIRS, which is intended to retrieve passages for factoid questions.
1 Introduction
Question Answering (QA) is a computer-based task that tries to improve on the output generated by Information Retrieval (IR) systems. A definition question [9] is a kind of question whose answer is a complementary set of sentence fragments called nuggets.
After identifying the correct target term (the term to define) and the context terms, we need to obtain useful, non-redundant definition nuggets. Currently, patterns are obtained manually as surface patterns [5, 6, 12]. Because these patterns can be very rigid, soft patterns [2] have been proposed as an alternative, and they can even be extracted automatically [5]. Once we have the patterns, we apply a matching process to extract the nuggets. Finally, we need a process that determines whether these nuggets are part of the definition; a common criterion is the frequency with which a nugget appears.
According to the state of the art, the highest F-measure in the 2002 pilot evaluation of definition questions [9] is 0.688, using the nugget set supplied by the author and taking β=5. For TREC 2003 [10], the best F-measure was 0.555, also with β=5, and for TREC 2004 [11] it was 0.460, this time with β=3.
In contrast to the traditional way of extracting nuggets, we propose a method that uses information gain to retrieve the most relevant information. First, we obtain passages from the AQUAINT Corpus using the retrieval system Lucene. Next, we extract the relevant sentences from the passages and parse them (using Link Grammar [4]) to obtain clauses. Then, from the clauses, we select four kinds of sentence fragments: noun phrases containing an appositive phrase, noun phrases containing two noun phrases separated by a comma, embedded clauses, and main or subordinate clauses without considering embedded clauses. Finally, the sentence fragments are separated into two kinds, i.e. the fragments to the left and to the right of the target term. We then assess the information gain of the sentence fragments to decide which are the most relevant, and consequently select them as part of the final answer.
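As a rough illustration of the quantity involved, the sketch below computes the textbook information gain of a single term with respect to a binary label over a set of sentence fragments. This is not the paper's formulation, which belongs to its Section 3 and is not reproduced in this excerpt. The function names, the whitespace tokenization, and the existence of per-fragment labels are all assumptions made for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(fragments, labels, term):
    """Textbook information gain of `term` over labeled fragments.

    IG = H(labels) - [P(term) * H(labels | term present)
                      + P(no term) * H(labels | term absent)]

    `fragments` are plain strings split on whitespace; `labels` are
    hypothetical class labels (for example 1 = defining, 0 = not).
    """
    with_term = [y for f, y in zip(fragments, labels) if term in f.split()]
    without_term = [y for f, y in zip(fragments, labels) if term not in f.split()]
    n = len(labels)
    conditional = (
        (len(with_term) / n) * entropy(with_term)
        + (len(without_term) / n) * entropy(without_term)
    )
    return entropy(labels) - conditional


# Toy data, invented for illustration only.
toy_fragments = ["a b", "a c", "d e", "d f"]
toy_labels = [1, 1, 0, 0]
gain_a = information_gain(toy_fragments, toy_labels, "a")
gain_missing = information_gain(toy_fragments, toy_labels, "z")
```

In this toy data the term "a" separates the two classes perfectly, so its gain equals the full label entropy, while a term that never occurs yields a gain of zero.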
For this task, we work with the questions from the 2002 pilot evaluation of definition questions [9], the definition questions from TREC 2003 [10], and the "other" questions from TREC 2004 [11]. In the first experiment, we test the output of two retrieval systems, JIRS and Lucene. In the second experiment, we test balanced and non-balanced sets of sentence fragments drawn from the right and left sets. Finally, we compare the F-measure obtained with our system, DefQuestions_IG, against the systems that participated in the TREC conferences.
The paper is organized as follows: Section 2 describes the process used to extract sentence fragments; Section 3 describes the approaches used and the method for retrieving only definition sentence fragments; Section 4 reports experimental results; finally, Section 5 presents some conclusions and directions for future work.
2 Sentence Fragments Extraction
One reason to extract sentence fragments is that we need to retrieve only the most important information from the relevant sentences. Another reason to extract short sentence fragments is related to the F-measure applied to definition systems in the TREC evaluation, which combines the recall and the precision of the system. Precision is based on length (in non-white-space characters), which is used as an approximation to nugget precision. The length-based measure starts from an initial allowance of 100 characters for each (vital or non-vital) nugget matched. Beyond that allowance, the value of the measure decreases as the length of the sentence fragment increases.
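The length-based scoring described above can be sketched as follows. The formulas reflect the TREC definition-question scoring as commonly described (recall over vital nuggets, an allowance of 100 non-white-space characters per matched nugget, and a weighted F-measure); the function name and parameters are hypothetical, and the nugget matching itself, which TREC assessors do by hand, is taken as given.

```python
def nugget_f_measure(vital_returned, vital_total, okay_returned, answer_text, beta=5.0):
    """Return (precision, recall, F) for one definition answer.

    vital_returned / okay_returned: matched vital and non-vital nuggets
    (assumed to come from a human or external matching step).
    vital_total: number of vital nuggets in the assessor's list.
    """
    recall = vital_returned / vital_total if vital_total else 0.0

    # 100 non-white-space characters allowed per matched nugget.
    allowance = 100 * (vital_returned + okay_returned)
    length = sum(1 for ch in answer_text if not ch.isspace())

    if length == 0:
        precision = 0.0  # empty answer: nothing to score
    elif length < allowance:
        precision = 1.0
    else:
        precision = 1.0 - (length - allowance) / length

    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return precision, recall, 0.0
    f_score = (beta ** 2 + 1) * precision * recall / denominator
    return precision, recall, f_score


# Example: 2 of 4 vital nuggets and 1 non-vital nugget matched
# by a 400-character answer (allowance is 300 characters).
p, r, f = nugget_f_measure(2, 4, 1, "x" * 400, beta=5.0)
```

With β=5 the measure weights recall far more heavily than precision, but an answer that exceeds its allowance is still penalized, which is why shorter fragments are preferable.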
After the experiments comparing the two retrieval systems (detailed later on), we decided to use Lucene as the main system for extracting candidate paragraphs from the AQUAINT Corpus of English News Text. From these candidate paragraphs, we extract the relevant sentences, i.e. the sentences that contain the target term. Then, to extract sentence fragments, we propose the following process:

1) Parse the sentences. Since we need to obtain information segments (phrases or clauses) from a sentence, the relevant sentences were parsed with Link Grammar [6]. We replace the target term with the label SCHTERM. As an example, we get the following sentence for the target term Carlos the Jackal: The man known as Carlos the Jackal has ended a hunger strike after 20 days at the request of a radical Palestinian leader, his lawyer said Monday.
Link Grammar then produces the following output, with the target term replaced as described above:
[S [S [NP [NP The man NP] [VP known [PP as [NP SCHTERM NP] PP] VP] NP] [VP has [VP ended [NP a hunger strike NP] [PP after [NP 20 days NP] PP] [PP at [NP [NP the request NP] [PP of [NP a radical Palestinian leader NP] PP] NP] PP] VP] VP] S] , [NP his lawyer NP] [VP said [NP Monday NP] . VP] S]
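A bracketed parse like the one above can be read into a tree and queried for clauses and phrases. The following is a minimal sketch, assuming whitespace-separated tokens in which `[X` opens a constituent and `X]` closes it; the helper names (`parse_constituents`, `leaves`, `constituents`, `split_around_target`) are hypothetical and are not part of Link Grammar or of the authors' system.

```python
PARSE = (
    "[S [S [NP [NP The man NP] [VP known [PP as [NP SCHTERM NP] PP] VP] NP] "
    "[VP has [VP ended [NP a hunger strike NP] [PP after [NP 20 days NP] PP] "
    "[PP at [NP [NP the request NP] [PP of [NP a radical Palestinian leader NP] PP] "
    "NP] PP] VP] VP] S] , [NP his lawyer NP] [VP said [NP Monday NP] . VP] S]"
)


def parse_constituents(bracketed):
    """Build a nested dict tree under an artificial ROOT node."""
    root = {"label": "ROOT", "children": []}
    stack = [root]
    for token in bracketed.split():
        if token.startswith("[") and len(token) > 1:
            node = {"label": token[1:], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif token.endswith("]") and len(token) > 1:
            if len(stack) == 1 or stack[-1]["label"] != token[:-1]:
                raise ValueError(f"unbalanced closing token: {token}")
            stack.pop()
        else:
            stack[-1]["children"].append(token)  # ordinary word or punctuation
    if len(stack) != 1:
        raise ValueError("unclosed constituent")
    return root


def leaves(node):
    """Words covered by a constituent, in sentence order."""
    words = []
    for child in node["children"]:
        words.extend(leaves(child) if isinstance(child, dict) else [child])
    return words


def constituents(node, label):
    """Yield every constituent with the given label, outermost first."""
    if node["label"] == label:
        yield node
    for child in node["children"]:
        if isinstance(child, dict):
            yield from constituents(child, label)


def split_around_target(words, target="SCHTERM"):
    """Split a word list into the parts left and right of the target label.

    Raises ValueError if the target label does not occur in `words`.
    """
    i = words.index(target)
    return words[:i], words[i + 1:]


tree = parse_constituents(PARSE)
noun_phrases = [" ".join(leaves(n)) for n in constituents(tree, "NP")]
embedded_clause = list(constituents(tree, "S"))[1]
left, right = split_around_target(leaves(embedded_clause))
```

Here `left` and `right` hold the words of the embedded clause on either side of SCHTERM, which corresponds to the left and right sentence fragments that the method scores separately.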
