Since 1998, volunteer scientists have been reading papers and manually populating a database of fossils called PaleoDB with information about the names and family trees of ancient creatures. PaleoDB was crowdsourcing before Jeff Howe named such efforts crowdsourcing. The data they've generated has resulted in more than 190 scientific publications on topics like changes in body mass over the course of evolution.

But now a group of researchers led by geobiologist Shanan Peters at the University of Wisconsin, Madison, and computer scientist Christopher Re at Stanford are trying to build a new, competing database by leveraging artificial intelligence.

"Paleobiologists have been spending 15 years… reading papers and typing them into a database," said Re. "So we thought, 'This is something a computer should do.'"


Earlier this year, they uploaded a paper to the scientific preprint website, arXiv, describing a new machine-learning system dubbed PaleoDeepDive that’s capable of automatically reading through journal articles, graphs and charts and classifying their information into a searchable database, with minimal human intervention. This week, the research, which is sponsored in part by Google, was published in the open-access journal PLOS ONE, with a few updates. Among them, the group now has access to Elsevier's stacks of publications. The scientific publishing giant, which in the past has been criticized for its exorbitant subscription fees, gave them access to roughly 10,000 downloads per week.

The PaleoDeepDive database is an improvement over the PaleoDB version, the researchers say, because it’s faster to update, more comprehensive and more amenable to fixing errors, all while being able to reproduce findings made possible by PaleoDB. For example, the distributions of body size among brachiopods (shelled marine critters) estimated by PaleoDeepDive and PaleoDB were similar. Scientists also found PaleoDeepDive's error rate was comparable to PaleoDB's.

"It took PDD a tiny fraction of the time to assemble what has taken humans years to assemble," said Peters in an email. "Humans should get on with thinking about, critically assessing, and generating new data, not compiling it manually from the literature."


With this new database-generating system, Re and his colleagues are attempting to build scientists something as nimble as Google’s Knowledge Graph or Facebook’s Entities Graph, but for ancient creatures instead of web pages and people.

The corporate databases are massive networks of concepts that are related in some way. The Graphs underpin how the two companies' signature services—search and Newsfeed, respectively—work. For instance, when you like the Fusion Facebook page, Facebook represents that new relationship as a connection between two nodes, you and Fusion. The more likes and comments and shares you make on or from the page, the stronger that relationship becomes. It’s simple network analysis.
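To make the idea concrete, here is a minimal sketch of a graph in which repeated interactions strengthen the link between two nodes. It is a toy illustration only, not how Facebook or Google actually store their graphs; the class name `InteractionGraph` and its methods are hypothetical.

```python
from collections import defaultdict


class InteractionGraph:
    """Toy undirected graph whose edge weights count interactions (hypothetical)."""

    def __init__(self):
        # Maps an unordered pair of nodes to an interaction count.
        self.weights = defaultdict(int)

    def interact(self, a, b, amount=1):
        # frozenset makes the key order-independent: (a, b) and (b, a) match.
        self.weights[frozenset((a, b))] += amount

    def strength(self, a, b):
        # .get avoids creating an entry for pairs that never interacted.
        return self.weights.get(frozenset((a, b)), 0)


graph = InteractionGraph()
graph.interact("you", "Fusion")  # a like
graph.interact("you", "Fusion")  # a comment
graph.interact("you", "Fusion")  # a share
print(graph.strength("you", "Fusion"))  # prints 3
```

Each like, comment or share simply bumps a counter on the edge, which is why this kind of bookkeeping is easy compared with reading prose.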

But dealing with scientific papers is more complicated because text written for human consumption requires more interpretation. Natural language processing, the science (or sometimes dark art) of deciphering language, is a much harder problem than counting likes. Imagine, for instance, that scientists published a new study about Confuciusornis, a type of paleo-bird typically found in ancient lakes in China, that merely mentions webbed feet. If PaleoDeepDive simply connected these two nodes, “webbed feet” and “Confuciusornis,” the system might conclude the creature’s feet were webbed, when that's not true.

So, a computer has to interpret the context of the language. On the open web, things like humor, sarcasm and puns make this especially difficult because these are subtle, and usually tied to a cultural context that's hard to translate for machines, at least currently. The language of science is more precise. It's predictable. That makes reading scientific articles less interesting, but easier to process with computers. The meaning can be extracted from the words more easily.


And that's part of the reason PaleoDeepDive seems to work pretty well. It has "read" thousands of scientific publications and extracted the key information scientists need to generate new knowledge. This is software that knows a lot about dinosaurs and other prehistoric creatures!

But the core of the system that underpins PaleoDeepDive is equally exciting: It's a different, more modern, type of database.


Most databases—PaleoDB included—function on the premise that everything in them is true. They're static. PaleoDeepDive is what scientists call a "statistical inference database," which means that the system assigns a kind of truth score to each nugget of information, and that value rises and falls based on the information fed into the database. Plus, because the system processes information asynchronously instead of sequentially, it can achieve a lower error rate. A bit of noise actually helps the system process and learn. Scientists don't fully understand why this works, but it does seem to. (Google and Microsoft are using similar approaches to build AI, some of it inspired by Re's work.)
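One way to picture a truth score that rises and falls is the sketch below. It is an assumption-laden illustration, not PaleoDeepDive's actual inference method: the `CandidateFact` class, the evidence weights and the 0.5 starting point are all made up for the example. Each piece of evidence nudges a running log-odds value up or down, and the score is that value squashed into a probability between 0 and 1.

```python
import math


class CandidateFact:
    """A claim with an adjustable truth score (hypothetical illustration)."""

    def __init__(self, statement, prior=0.5):
        self.statement = statement
        # Store the score as log-odds so evidence can simply be added.
        self.log_odds = math.log(prior / (1 - prior))

    def add_evidence(self, weight):
        # Positive weight supports the claim; negative weight undermines it.
        self.log_odds += weight

    @property
    def probability(self):
        # The logistic function maps log-odds back to a value in (0, 1).
        return 1 / (1 + math.exp(-self.log_odds))


fact = CandidateFact("Confuciusornis had webbed feet")
start = fact.probability

fact.add_evidence(1.0)   # assumed weight: the two terms appear in one paper
after_mention = fact.probability

fact.add_evidence(-3.0)  # assumed weight: the sentence says the trait is absent
after_context = fact.probability

print(start, after_mention, after_context)
```

A conventional database would store the claim or not store it. Here the claim stays in the system with a score that keeps moving as more papers are read.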

This all means that DeepDive is flexible and adaptable. The real promise, experts like Austin Hendy at the Los Angeles County Natural History Museum said, is that it can be applied to other fields of science as well, for example climate change, biodiversity, and drug repurposing. Point it at a virtual pile of papers and it returns a database of organized scientific information. That's what's really exciting and interesting.

“We’re never going to be able to read Nature articles better than a biologist, but we can read so many more articles that we have richer contextual facts,” says Re.


With the flexibility of the database, however, comes a special set of challenges. How confident does the machine have to be that something is probably true before it crosses over into the realm of fact? "What is a good cut-off threshold? Why did the system think this is likely true? Can I believe the high probability associated with this?" asked Trishul Chilimbi, a distributed systems engineer at Microsoft Research who recently used some of Re's work to build a computer vision system dubbed Adam. "Unlike an image recognition where system errors are often easily detectable (e.g., if the system says with high-confidence that a picture of a cat is a dog), there is no easy way here to identify system errors."
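Chilimbi's cut-off question can be explored with a small sketch like the one below, assuming a sample of machine-scored facts has been checked by a human expert. The function name and the sample numbers are invented for illustration and are not PaleoDeepDive results.

```python
def precision_at_threshold(scored, threshold):
    """Share of accepted facts that the human check confirmed.

    `scored` is a list of (probability, is_true) pairs. Returns None
    when no fact clears the threshold.
    """
    accepted = [is_true for prob, is_true in scored if prob >= threshold]
    if not accepted:
        return None
    return sum(accepted) / len(accepted)


# Made-up (machine probability, human verdict) pairs, for illustration only.
sample = [
    (0.95, True),
    (0.90, True),
    (0.80, False),
    (0.70, True),
    (0.55, False),
    (0.30, False),
]

strict = precision_at_threshold(sample, 0.9)   # accepts 2 facts
lenient = precision_at_threshold(sample, 0.5)  # accepts 5 facts
print(strict, lenient)
```

A stricter threshold admits fewer facts but a larger share of them are right, so the choice is a trade-off between coverage and reliability rather than a single correct number.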

And, while PaleoDeepDive's creators have published on its accuracy, some human experts wonder whether machines really are comparably good at classifying scientific texts. "Having worked on such data sets for nearly 30 years now, I can tell you with great conviction that I think curation is essential. Published paleontological literature is very hard to interpret accurately, especially with respect to taxonomy," said John Alroy, a paleobiologist at Macquarie University in Sydney, Australia, and the founder of PaleoDB. PaleoDB is "important because it presents highly curated data to a large audience. PaleoDeepDive presents uncurated data."


Alroy isn’t alone. There are broader questions about the accuracy of results powered by machine-learning approaches. After all, more data isn't always better. Garbage in, garbage out. For example, one limitation of PaleoDeepDive is that its engineers can only feed it open-access papers or articles they have permission to use, as in the case of Elsevier. Findings in high-impact journals like Science and Nature are off limits. PaleoDB, on the other hand, can incorporate these.

These are all considerations Re and his team will have to take into account. But scientists are hopeful. "Given the current state of technology, the application domain needs to be narrow and forgiving of errors," said Chilimbi. "While this may seem pessimistic, I think the future is bright for this as the technology improves."


Daniela Hernandez is a senior writer at Fusion. She likes science, robots, pugs, and coffee.