The triumph of generic databases
The computerization of commonsense knowledge goes back at least to Ross Quillian’s paper from the 1969 book Semantic Information Processing. Ross used methods that aren’t that different from what I use today, but he was able to store just a few hundred concepts in his computer.
The Cyc project, starting in the 1980s contained about 3 million facts. It was successful on it’s own terms, but it didn’t lead to the revolution in natural language processing that it promised. WordNet, from the same era, documents about 120,000 word senses, but like Cyc, hasn’t had a large engineering impact.
DBpedia and Freebase have become popular lately, I think because they’re a lot like traditional databases in character. For a person, place or creative work you’ve got the information necessary to make a ‘Pokemon card’ about the topic. With languages like SPARQL and MQL it’s possible to write queries you’d write in a relational database, so people have an idea what to do with it.
DBpedia and Freebase are much larger than the old commonsense databases. The English Dbpedia contains 4 million topics derived from Wikipedia pages and Freebase contains 24 million facts about 600 million topics. It’s hard to quantify it, but subjectively, people feel like Wikipedia contains most of the concepts that turn up when they are reading or thinking about things. Because the new generic databases are about concepts rather than words, they are inherently multilingual.
DBpedia Spotlight is the first of a breed of language processing products that use world knowledge instead of syntactic knowledge. Using a knowledge base created from DBpedia and Wikipedia, Spotlight gets accuracy comparable to commercial named entity recognition systems — although Spotlight uses simple methods and, so far, has made little of the effort a commercial system would to systematically improve accuracy.
The greatest difficulty with generic databases today is documentation. :BaseKB, derived from Freebase, has roughly 10,000 classes and 100,000 properties. Finding what you’re looking for, and even knowing if it is available, can be difficult.
Data quality is another problem, but it’s hard to characterize. The obvious problem is that you write queries and get wrong answers. If, for instance, you ask for the most important cities in the world, you might find that some big ones are missing but that some entities other than cities, such as countries, have been mistyped. If you look for the people with the latest birthdates, some may be in the future — some may be wrong, some may be anticipated, and others could be mistyped science fiction characters.
Quality problems can have an impact beyond their rate of incidence. Wikipedia, for instance, contains about 200,000 pages documenting the tree of biological classification. Wikipedians don’t have real tools for editing hierarchies, so any edit they make to the tree could break the structure of the tree as a whole. Even if tree-breaking errors occurred at a rate of 1-in-50,000 it would be impossible (or incorrect) to use algorithms on this tree that assume it is really a tree.
Quality problems of this sort can be found by systematic testing and evaluation — these can be repaired at the source or in a separate data transformation step.
intelligence without understanding
One answer to the difficulties of documentation that use algorithms that use predicates wholesale; one might filter or weight predicates, but not get two concerned about the exact meaning of any particular predicate. Systems like this can make subjective judgments about importance, offensiveness, and other attributes of topics. The best example of this is seevl.net a satisfying music recommendation engine developed with less than 10% of the expense and effort that goes into something like Last.FM or Pandora.
These approaches are largely unsupervised and work well because the quality (information density) of semantic data is higher than data about keywords, votes, and other inputs to social-IR system. Starting with a good knowledge base, the “cold start” problem in collaborative filtering can be largely obliterated.
In the RDF model we might make a statement like
:Robert_Redford :actedIn :Sneakers .
we’ve yet to have a satisfactory way to say things about this relationship. Was he the star of this movie? Did he play a minor part? Who added this fact? How must do we trust it?
These issues are particularly important for a data Wiki, like Freebase, because provenance information is crucial if multiple people are putting facts into a system.
These issues are also important in user interfaces because, if we know thousands of facts about a topic, we probably can’t present them all, and if we did, we can’t present them all equally. For an actor, for instance, we’d probably want to show the movies they starred in, not the movies they played a minor role in.
Google is raising the standard here: this article shows how the Google Knowledge Graph can convert the “boiling point” predicate into a web search and sometimes shows the specific source of a fact.
This points to a strategy for getting the missing information about statements — if we have a large corpus of documents (such as a web crawl) we search that corpus for sentences that discuss a fact. Presumably important facts get talked about more frequently than unimportant facts, and this can be used as a measure for importance.
I’ll point out that Common Crawl is a freely available web crawl hosted in Amazon S3 and that by finding links to Wikipedia and official web sites one can build a corpus that’s linked to Dbpedia and Freebase concepts.
Although Freebase and DBpedia know of many names for things, they lack detailed lexical data. For instance, if you’re synthesizing text, you need to know something about the labels you’re using so you can conjugate verbs, make plurals, and use the correct article. Things are relatively simple in English, but more complex in languages like German. Parts-of-speech tags on the labels would help with text analysis.
YAGO2 is an evolving effort to reconcile DBpedia’s concept database with WordNet; it’s one of the few large-scale reconciliations that has been statistically evaluated and was found to be 95% accurate. YAGO2′s effectiveness has been limited, however, by WordNet’s poor coverage compared with DBpedia. There’s a need for a lexical extension to DBpedia and Freebase that’s based on a much larger dictionary than WordNet.
relationship with machine learning
I was involved with the text classification with the Support Vector Machine around 2003 and expected to see the technology to become widely used. In our work we found we got excellent classification performance when we had large (10,000 document) training sets, but we got poor results with 50 document sets.
Text classification hasn’t been so commercially successful and I think that’s because few users will want to classify 10,000 documents to train a classifier, and even in large organizations, many classes of documents won’t be that large (when you get more documents, you slice them into more categories.)
I think people learn efficiently because they have a large amount of background knowledge that helps them make hypotheses that work better than chance. If we’re doing association mining between a million signals, there are a trillion possible associations. If we know that some relationships are more plausible than others, we can learn more quickly. With large numbers of named relationships, it’s possible that we can discover patterns and give them names, in contrast to the many unsupervised algorithms that discover real patterns but can’t explain what they are.
range of application
Although generic databases seem very general, people I talk to are often concerned that they won’t cover terms they need, and some cases they are right.
If a system is parsing notes taken in an oil field, for instance, the entities involved are specific to a company — Robert Redford and even Washington, DC just aren’t relevant.
Generic databases apply obviously to media and entertainment. Fundamentally, the media talks about things that are broadly known, and things that are broadly known are in Wikipedia. Freebase adds extensive metadata for music, books, tv and movies, so that’s all the better for entertainment.
It’s clear that there are many science and technology topics in Wikipedia in Freebase. Freebase has more than 10,000 chemical compounds and 38,000 genes. Wikipedia documents scientific and technological concepts in great detail, so there are certainly the foundations for a science and technology knowledge base there — evaluating how good of a foundation this is and how to improve it is a worthwhile topic.
web and enterprise search
schema.org, by embedding semantic data, makes radical web browsing and search. Semantic labels applied to words can be resolved to graph fragments in DBpedia, Freebase and similar databases. A system that constructs a search index based on concepts rather than words could be effective at answering many kinds of queries. The same knowledge base could be processed into something like the Open Directory, but better, because it would not be muddled and corrupted by human editors.
Concept-based search would be effective for the enterprise as well, where it’s necessary to find things when people use different (but equivalent) words in search than exist in the document. Concept-based search can easily be made strongly bilingual.
middle and upper ontologies
It is exciting today to see so many knowledge bases making steady improvements. Predicates in Freebase, for instance, are organized into domains (“music”) and further organized into 46 high level patterns such as “A created B”, “A administrates B” and so forth. This makes it possible to tune weighting functions in an intuitive, rule-based, manner.
Other ontologies, like UMBEL, are reconciled with generic databases but reflect a fundamentally different point of view. For instance, UMBEL documents human anatomy in much better detail than Freebase so it could play a powerful supplemental role.
generic databases, when brought into confrontation with document collections, can create knowledge bases that enable computers to engage with text in unprecedented depth. They’ll not only improve search, but also open a world of browsing and querying languages that will let users engage with data and documents in entirely new ways.