Archive for the 'Semantic Web' Category

Beyond Freebase and DBpedia


The triumph of generic databases

The computerization of commonsense knowledge goes back at least to Ross Quillian’s paper in the 1968 book Semantic Information Processing. Ross used methods that aren’t that different from what I use today, but he was able to store just a few hundred concepts in his computer.

The Cyc project, started in the 1980s, contains about 3 million facts. It was successful on its own terms, but it didn’t lead to the revolution in natural language processing that it promised. WordNet, from the same era, documents about 120,000 word senses but, like Cyc, hasn’t had a large engineering impact.

DBpedia and Freebase have become popular lately, I think because they’re a lot like traditional databases in character. For a person, place or creative work, you’ve got the information necessary to make a ‘Pokémon card’ about the topic. With query languages like SPARQL and MQL it’s possible to write the kinds of queries you’d write against a relational database, so people have an idea of what to do with them.
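
To give a flavor of that, here is a minimal sketch in Python of the kind of relational-style question you can ask the public DBpedia SPARQL endpoint; the example asks for people born in Berlin, and the endpoint URL and result handling assume the public Virtuoso service behaves the way it does today.

```python
import requests

# Ask DBpedia a question you might otherwise pose to a relational database:
# "give me ten people born in Berlin, with their English names."
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbr:  <http://dbpedia.org/resource/>
SELECT ?person ?name WHERE {
    ?person dbo:birthPlace dbr:Berlin ;
            rdfs:label ?name .
    FILTER (lang(?name) = "en")
}
LIMIT 10
"""

resp = requests.get(
    "https://dbpedia.org/sparql",  # DBpedia's public SPARQL endpoint
    params={"query": query, "format": "application/sparql-results+json"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["name"]["value"], "-", row["person"]["value"])
```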

DBpedia and Freebase are much larger than the old commonsense databases. The English DBpedia contains 4 million topics derived from Wikipedia pages, and Freebase contains roughly 600 million facts about 24 million topics. It’s hard to quantify, but subjectively, people feel like Wikipedia contains most of the concepts that turn up when they are reading or thinking about things. Because the new generic databases are about concepts rather than words, they are inherently multilingual.

DBpedia Spotlight is the first of a breed of language processing products that use world knowledge instead of syntactic knowledge. Using a knowledge base created from DBpedia and Wikipedia, Spotlight gets accuracy comparable to commercial named entity recognition systems, even though it uses simple methods and, so far, has put little of the effort into systematically improving accuracy that a commercial system would.
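
To show what that looks like in practice, here is a small sketch of calling a Spotlight annotation endpoint from Python; the endpoint URL, parameters and response field names match the public demo service as I understand it, and a self-hosted Spotlight instance may differ.

```python
import requests

# Send a sentence to a DBpedia Spotlight annotation endpoint and print the
# surface forms it links to DBpedia topics.
text = "Berlin is the capital of Germany and hosts the Berlinale film festival."

resp = requests.get(
    "https://api.dbpedia-spotlight.org/en/annotate",  # public demo endpoint
    params={"text": text, "confidence": 0.5},
    headers={"Accept": "application/json"},
)
for resource in resp.json().get("Resources", []):  # key is absent if nothing is found
    print(resource["@surfaceForm"], "->", resource["@URI"])
```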


Carpictures.cc, My First Web 3.0 Site

I haven’t blogged much in the last month because I’ve been busy. I’ve got a day job, and I’m a parent, and I’ve also been working on a new site: Creative Commons Car Pictures. What you see on the surface of “Car Pictures” might be familiar, but what’s under the surface is something you’ve probably never seen before:

[Screenshot: the Car Pictures front page]

A collection of car pictures under a Creative Commons license, it was built from a taxonomy constructed from DBpedia and Freebase. My editor and I then used a scalable process to clean up the taxonomy, find matching images and enrich image metadata. As you can imagine, this is a process that can be applied to other problem domains: we’ve tested some of the methodology on our site Animal Photos!, but this is the first site designed from start to finish based on the new technology.

Car Pictures was made possible by data sets available on the semantic web, and it will soon export data via the semantic web as well. We’re currently talking with people about exporting our content in a machine-readable linked data format. In our future plans, this will enable a form of semantic search that will revolutionize multimedia search: rather than searching for imprecise keywords, it will become possible to look up pictures, video and documents about named entities found in systems such as Freebase, Wikipedia, WordNet, Cyc and OpenCalais.
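
To make the linked data idea concrete, here is a rough sketch, using Python’s rdflib, of the sort of machine-readable record a single photo could carry; the URIs and property choices (FOAF, Dublin Core) are illustrative, not the format we’ve settled on.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS, FOAF, RDF

# Hypothetical identifiers: carpictures.cc does not publish these URIs (yet).
photo = URIRef("http://carpictures.cc/photo/1234")
topic = URIRef("http://dbpedia.org/resource/Ford_Mustang")

g = Graph()
g.add((photo, RDF.type, FOAF.Image))
g.add((photo, DCTERMS.title, Literal("1967 Ford Mustang at a car show")))
g.add((photo, DCTERMS.license, URIRef("http://creativecommons.org/licenses/by/2.0/")))
g.add((photo, FOAF.depicts, topic))  # tie the picture to a named entity

print(g.serialize(format="turtle"))
```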

In the next few days we’re going to watch carefully how Car Pictures interacts with users and with web crawlers. Then we’ll be adding content, community features, and a linked data interface. In the meantime, we’re planning to build something at least a hundred times bigger.

Quite literally, thousands of people have contributed to Car Pictures, but I’d like to particularly thank Dan Brickley, who’s been helpful in the process of interpreting DBpedia with the ARC2 RDF tools, and Kingsley Idehen, who’s really helped me sharpen my vision of what the semantic web can be.

Anyway, I’d love any feedback you can give me about Car Pictures and any help I can get in spreading the word about it. If you’re interested in making a similar site about some other subject, please contact me: it’s quite possible that we can help.

Putting Freebase in a Star Schema

What’s Freebase?

Freebase is an open database of things that exist in the world: things like people, places, songs and television shows. As of the January 2009 dump, Freebase contained about 241 million facts, and it’s growing all the time. You can browse it via the web and even edit it, much like Wikipedia. Freebase also has an API that lets programs add data and make queries using a language called MQL. Freebase is complementary to DBpedia and other sources of information. Although it takes a different approach to the semantic web than systems based on RDF standards, it interoperates with them via linked data.
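
To give a taste of MQL: a query is a JSON template in which you fill in what you know and leave blanks for what you want answered. The sketch below just builds and prints the classic “albums by The Police” query document from the MQL documentation; it doesn’t assume the query service is reachable.

```python
import json

# An MQL query is a JSON template: fill in what you know, and leave what you
# want answered as null or [] for the mqlread service to fill in.  This is
# the classic "albums by The Police" example from the MQL documentation.
query = {
    "type": "/music/artist",
    "name": "The Police",
    "album": [],  # an empty list asks for all of the artist's albums
}

print(json.dumps({"query": query}, indent=2))
```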

The January 2009 Freebase dump is about 500 MB in size. Inside the bzip2-compressed file, you’ll find something that’s similar in spirit to a Turtle RDF file, but in a simpler format that represents facts as collections of four values rather than just three.
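
Here is a sketch of how you might stream over the dump in Python; the filename is a placeholder, and I’m assuming each line splits on tabs into source, property, destination and value fields.

```python
import bz2
from collections import Counter

# Stream over the dump without unpacking it to disk, counting how often each
# property appears.  The filename is a placeholder; each line is assumed to
# split on tabs into source, property, destination and value fields.
property_counts = Counter()

with bz2.open("freebase-datadump-quadruples.tsv.bz2", "rt", encoding="utf-8") as dump:
    for line in dump:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip malformed lines
        source, prop, destination, value = fields
        property_counts[prop] += 1

for prop, count in property_counts.most_common(20):
    print(count, prop)
```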

Your Own Personal Freebase

To start exploring and extracting from Freebase, I wanted to load the database into a star schema in a MySQL database, an architecture similar to some RDF stores, such as ARC. The project took about a week of time on a modern x86 server with 4 cores and 4 GB of RAM and resulted in an 18 GB collection of database files and indexes.
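
To make “star schema” concrete, here is a sketch of the layout I have in mind: a narrow fact table of integer keys plus a dimension table that interns every distinct string. It uses sqlite3 so the sketch runs self-contained; the real load targeted MySQL, and the exact table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
cur = conn.cursor()

# Dimension table: every distinct string (topic ids, property names, values)
# is interned once and given an integer key.
cur.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, text TEXT UNIQUE)")

# Fact table: one row per quad, holding nothing but integer keys.
cur.execute("""
    CREATE TABLE facts (
        source      INTEGER,
        property    INTEGER,
        destination INTEGER,
        value       INTEGER
    )""")
cur.execute("CREATE INDEX facts_property ON facts (property)")

def intern(text):
    """Return the integer key for a string, inserting it if it is new."""
    cur.execute("INSERT OR IGNORE INTO objects (text) VALUES (?)", (text,))
    cur.execute("SELECT id FROM objects WHERE text = ?", (text,))
    return cur.fetchone()[0]

def add_fact(source, prop, destination, value):
    cur.execute("INSERT INTO facts VALUES (?, ?, ?, ?)",
                (intern(source), intern(prop), intern(destination), intern(value)))

# A made-up quad, just to exercise the loader.
add_fact("/guid/9202a8c04000641f8000000000f0f0f0",
         "/type/object/name", "/lang/en", "The Police")
conn.commit()
```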

This is sufficient for my immediate purposes, but future versions of Freebase promise to be much larger: this article examines ways to improve performance and scalability using parallelism as well as better data structures and algorithms.