Matt Marx and Aaron Fuegi
We curate and characterize a complete set of citations from patents to scientific articles, including 16.8 million from the full text of USPTO and EPO patents. Combining hand-tuned heuristics and the GROBID machine-learning package, we achieve much higher performance than machine learning alone. Recall is evaluated against a set of 5,939 randomly sampled, cross-verified “known good” citations that the authors have never seen. At 99.4% precision, we achieve recall of 78% for the full test set and 88% for references specified without mistakes. We compare these “in-text” citations with those on the front page of patents. In-text citations are more diverse temporally, geographically, and topically; moreover, they are less self-referential and less likely to be copied from one patent to the next. In-text citations have dropped from two-thirds of all patent-to-article citations half a century ago to about one-third today. In replicating two articles that use only front-page citations, we show that failing to capture in-text citations leads to understating the role of academic science in commercial invention. All patent-to-article citations, the known-good test set, and the source code are available at http://relianceonscience.org.
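As an illustration of the two evaluation metrics named above, the following minimal sketch computes recall against a known-good set and precision over the extracted citations. It is not the authors' evaluation code: the function names, the representation of a citation as a (patent, article) pair, and the toy identifiers are all assumptions made for this example.

```python
def recall(extracted, known_good):
    """Share of known-good citations that the extractor recovered."""
    known = set(known_good)
    if not known:
        raise ValueError("known-good set is empty")
    return len(known & set(extracted)) / len(known)


def precision(extracted, correct):
    """Share of extracted citations that are judged correct."""
    found = set(extracted)
    if not found:
        raise ValueError("no citations were extracted")
    return len(found & set(correct)) / len(found)


# Hypothetical toy data: each citation is a (patent_id, article_id) pair.
known_good = {("P1", "A1"), ("P1", "A2"), ("P2", "A3"), ("P3", "A4")}

# The extractor misses ("P1", "A2") and emits one spurious link, ("P5", "A7").
extracted = {("P1", "A1"), ("P2", "A3"), ("P3", "A4"), ("P4", "A9"), ("P5", "A7")}

# Extracted citations judged correct by manual review. ("P4", "A9") is correct
# but is not part of the sampled known-good set.
correct = {("P1", "A1"), ("P2", "A3"), ("P3", "A4"), ("P4", "A9")}

print(f"recall    = {recall(extracted, known_good):.2f}")
print(f"precision = {precision(extracted, correct):.2f}")
```

Because the known-good set is only a sample, a correct extraction that falls outside it counts toward precision but does not affect recall, which is why the two metrics are computed against different reference sets in this sketch.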