Abstract for Ddoc
Ddoc is a rule-based text-mining program originally designed to extract prior art patent references from patent applications. Today Ddoc is available as a free service on this site and can be used on any text to extract patent publication or application references.
For patent publication references, Ddoc uses word-by-word detection. During processing, a vector of indicators is populated: country code (cc), number (num) and kind code (kc). The retrieved citations are checked against a database and the publication date is added to the output. Ddoc can detect references from 173 different national or regional offices (i.e. all publications available in Espacenet/OPS and beyond). So-called bundles of documents (a single country code followed by a list of numbers) are detected for DE, EP, FR, GB, JP, US, WO and TW patent publication references.
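The word-by-word approach described above can be illustrated with a minimal sketch. It is not Ddoc's actual rule set: the country code list is a small illustrative subset, and the number and kind code patterns, the connector words and the function name `extract_publications` are all assumptions made for this example.

```python
import re

# Illustrative subset only (assumption); Ddoc itself covers 173 offices.
COUNTRY_CODES = {"DE", "EP", "FR", "GB", "JP", "US", "WO", "TW"}
CONNECTORS = {"and", "or"}                 # words allowed inside a bundle (assumption)
NUMBER = re.compile(r"\d[\d,./]*\d")       # digits with optional separators
KIND_CODE = re.compile(r"[A-Z]\d?")        # e.g. A, A1, B2


def extract_publications(text):
    """Scan the text word by word and fill cc/num/kc indicators."""
    refs = []
    cc = None           # country code currently in effect
    expect_kc = False   # True only directly after a number was read
    for raw in text.split():
        word = raw.strip(".,;:()")
        if word in COUNTRY_CODES:
            cc, expect_kc = word, False
        elif cc and NUMBER.fullmatch(word):
            # A further number under the same country code forms a bundle.
            refs.append({"cc": cc, "num": re.sub(r"\D", "", word), "kc": None})
            expect_kc = True
        elif expect_kc and KIND_CODE.fullmatch(word):
            refs[-1]["kc"] = word
            expect_kc = False
        elif cc and word.lower() in CONNECTORS:
            expect_kc = False   # stay inside the bundle
        else:
            cc, expect_kc = None, False   # any other word ends the reference
    return refs


sample = "See US 5,123,456, 6,234,567 and EP 1234567 A1 for details."
refs = extract_publications(sample)
```

In this sketch the two US numbers share one country code, which is how a bundle is picked up. A real detector needs more rules: here, for instance, a capitalised one-letter word directly after a number would be mistaken for a kind code, and fused forms such as a country code written without a space before the number are not handled.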
For patent application references, Ddoc uses regular expressions. Patent applications from 11 national or regional offices are detected: CN, DE, FR, GB, IT, JP, KR, TW and US, plus PCT and EP.
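As a rough illustration of regular-expression detection, the sketch below covers two of the eleven offices. The patterns are hypothetical and written for this example only; they are not the expressions Ddoc uses, and real application numbers appear in more variants than these two patterns accept.

```python
import re

# Hypothetical patterns (assumption), one per office, each with a "num" group.
APPLICATION_PATTERNS = {
    # e.g. PCT/US2004/012345 or the older PCT/US99/12345
    "PCT": re.compile(r"\b(?P<num>PCT/[A-Z]{2}(?:\d{4}|\d{2})/\d{5,6})\b"),
    # e.g. "Ser. No. 10/123,456" or "Application No. 10/123456"
    "US": re.compile(
        r"\b(?:Ser(?:ial)?|Appl(?:ication)?)\.?\s+No\.?\s+(?P<num>\d{2}/\d{3},?\d{3})\b",
        re.IGNORECASE,
    ),
}


def extract_applications(text):
    """Return (office, number) pairs in order of appearance in the text."""
    hits = []
    for office, pattern in APPLICATION_PATTERNS.items():
        for m in pattern.finditer(text):
            # Drop thousands separators so equal numbers compare equal.
            hits.append((m.start(), office, m.group("num").replace(",", "")))
    return [(office, num) for _, office, num in sorted(hits)]


sample = "Priority: U.S. application Ser. No. 10/123,456 and PCT/US2004/012345."
apps = extract_applications(sample)
```

Keeping one pattern per office makes it easy to add or tune an office without touching the others.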
The Open Patent Service (OPS) at the European Patent Office is used to check extracted numbers. For the most common citation countries, a local database is used to speed up the checking of patent numbers. The local database is 100-150 times faster than OPS and contains bibliographic information for around 75% of the documents present in Espacenet/OPS. Ddoc has been shown to retrieve at least 95% of the cited documents (recall) with a precision of 95%.
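The two-tier check described above (fast local database first, OPS only when needed) can be sketched as follows. The function name, the dictionary used as a stand-in for the local database and the `remote_lookup` callable are all assumptions for illustration; no real OPS client call is shown, and the date in the usage example is made up.

```python
def lookup_publication_date(cc, num, local_db, remote_lookup):
    """Return the publication date for a reference, or None if unknown.

    local_db      -- mapping {(cc, num): date}; stand-in for the local database
    remote_lookup -- callable (cc, num) -> date or None; stand-in for an OPS query
    """
    key = (cc, num)
    if key in local_db:
        return local_db[key]        # fast path, no network round trip
    return remote_lookup(cc, num)   # slow path, only on a local miss


# Usage with made-up data and a fake remote service that records its calls.
remote_calls = []

def fake_remote(cc, num):
    remote_calls.append((cc, num))
    return None                     # pretend the remote service has no record

local_db = {("EP", "1234567"): "2003-01-15"}   # invented entry for illustration
hit = lookup_publication_date("EP", "1234567", local_db, fake_remote)
miss = lookup_publication_date("US", "5123456", local_db, fake_remote)
```

Passing the remote lookup in as a callable keeps the sketch independent of any particular client library and makes the fallback easy to test.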
Article describing Ddoc in more detail
If you want to test your own extraction algorithm, please have a look at the Benchmarking subpage.
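When comparing an extraction algorithm against a reference set, recall and precision can be computed as in the minimal sketch below. The data format of the Benchmarking subpage is not described here, so the use of plain sets of normalised reference strings is an assumption, as are the function name and the sample values.

```python
def precision_recall(extracted, gold):
    """Compare extracted references with a gold standard.

    precision = correct extractions / all extractions
    recall    = correct extractions / all gold references
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up example: two of three extractions are correct, one gold item is missed.
extracted = {"EP1234567", "US5123456", "US9999999"}
gold = {"EP1234567", "US5123456", "JP2001123456"}
precision, recall = precision_recall(extracted, gold)
```

Both values depend on how references are normalised before comparison (for example, whether kind codes are kept), so the same normalisation should be applied to both sets.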