Information about Benchmarking Patent citation extraction



Benchmarking patent citation extraction

Some of the corpora used to check the performance of Ddoc are made available here for benchmarking. When these corpora were created, around 2010, the text was taken from in-house databases at the EPO. Those original texts have since been replaced with the corresponding full-text members made available by the EPO through Espacenet/OPS. Below you can find links to benchmark corpora for testing your own text-mining algorithm for patent citations. The benchmark files are intended for use with the trec_eval tool for evaluating text-mining tools; see the link further down.

Corpus: 92 EP Gold

This is a manually checked corpus containing 92 EP applications published 2008-2011. It is believed to be completely correct, hence “Gold”. Some EP applications that arrived at the EPO via the PCT route have the PCT application text.

Corpus for 92 EP Gold Benchmark

92 EP Gold Benchmark

Corpus: 99 EP Direct

99 EP Direct is a corpus containing 99 EP applications, published 2010-2011 and filed directly at the EPO. Its citations were extracted using the “Cited documents” field in Espacenet, with some manual corrections. This is considered a “Silver” corpus since it has not been completely checked manually.

Corpus for 99 EP Direct Benchmark.zip

99 EP Direct Benchmark

Corpus: 92 EP Direct

92 EP Direct is a corpus containing 92 EP applications published 2007-2015. Its citations were extracted using the “Cited documents” field in Espacenet, with some manual corrections. This is also considered a “Silver” corpus since it has not been completely checked manually. Some applications that arrived at the EPO via the PCT route have the PCT application text.

Corpus for 92 EP Direct Benchmark

92 EP Direct Benchmark

The trec_eval evaluation tool

The trec_eval evaluation tool is a program written in C. It is the standard tool used by the TREC community for evaluating an ad hoc retrieval run, given the results file and a standard set of judged results.
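To make the two inputs concrete, here is a minimal Python sketch of how the output of a citation extractor could be written as a trec_eval results file and compared against judged results. The line layouts are the standard trec_eval ones (results: `query_id Q0 doc_id rank score run_tag`; judgments: `query_id iteration doc_id relevance`). Everything else is an assumption made for illustration: the function names, the run tag, and the idea that the query id is the publication number of the citing application and the doc id is the cited document number. Check the benchmark files themselves for the identifiers they actually use.

```python
from collections import defaultdict


def format_run(extracted, run_tag="my_extractor"):
    """Render {citing_id: [cited_id, ...]} as trec_eval results lines.

    Each line is: query_id Q0 doc_id rank score run_tag
    (run_tag and the id scheme are illustrative assumptions).
    """
    lines = []
    for query_id in sorted(extracted):
        citations = extracted[query_id]
        for rank, doc_id in enumerate(citations, start=1):
            # trec_eval orders results by score, so earlier citations
            # get higher scores.
            score = len(citations) - rank + 1
            lines.append(f"{query_id} Q0 {doc_id} {rank} {score} {run_tag}")
    return lines


def parse_qrels(lines):
    """Parse judged results: query_id iteration doc_id relevance."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        query_id, _iteration, doc_id, relevance = parts
        if int(relevance) > 0:
            relevant[query_id].add(doc_id)
    return relevant


def precision_recall(extracted, relevant):
    """Set-based precision and recall, pooled over all queries."""
    hits = found = expected = 0
    for query_id in set(extracted) | set(relevant):
        got = set(extracted.get(query_id, []))
        gold = relevant.get(query_id, set())
        hits += len(got & gold)
        found += len(got)
        expected += len(gold)
    precision = hits / found if found else 0.0
    recall = hits / expected if expected else 0.0
    return precision, recall
```

The lines returned by `format_run` can be written to a file and passed to the tool together with the benchmark judgments, in the form `trec_eval <judgments file> <results file>`. The `precision_recall` helper is only a quick sanity check and is not a substitute for the measures trec_eval reports.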


Link to the trec_eval evaluation tool

Link to the Text REtrieval Conference (TREC) page
