Beyond Freebase and DBpedia


The triumph of generic databases

The computerization of commonsense knowledge goes back at least to Ross Quillian’s paper from the 1969 book Semantic Information Processing. Ross used methods that aren’t that different from what I use today, but he was able to store just a few hundred concepts in his computer.

The Cyc project, starting in the 1980s contained about 3 million facts. It was successful on it’s own terms, but it didn’t lead to the revolution in natural language processing that it promised. WordNet, from the same era, documents about 120,000 word senses, but like Cyc, hasn’t had a large engineering impact.

DBpedia and Freebase have become popular lately, I think because they’re a lot like traditional databases in character. For a person, place or creative work you’ve got the information necessary to make a ‘Pokemon card’ about the topic. With languages like SPARQL and MQL it’s possible to write queries you’d write in a relational database, so people have an idea what to do with it.

DBpedia and Freebase are much larger than the old commonsense databases. The English Dbpedia contains 4 million topics derived from Wikipedia pages and Freebase contains 24 million facts about 600 million topics. It’s hard to quantify it, but subjectively, people feel like Wikipedia contains most of the concepts that turn up when they are reading or thinking about things. Because the new generic databases are about concepts rather than words, they are inherently multilingual.

DBpedia Spotlight is the first of a breed of language processing products that use world knowledge instead of syntactic knowledge. Using a knowledge base created from DBpedia and Wikipedia, Spotlight gets accuracy comparable to commercial named entity recognition systems — although Spotlight uses simple methods and, so far, has made little of the effort a commercial system would to systematically improve accuracy.

Continue Reading »

What Is Ookaboo?

Ookaboo is a collection of images that are indexed by terms from the semantic web, or, the web of linked data. Ookaboo provides both a human interface and a semantic API. Ookaboo has two goals: (i) to dramatically improve the state of the art in image search for both humans and machines, and (ii) to construct a knowledge base about the world that people live in that can be used to help information systems better understand us.

Semantic Web, Linked Data

In the semantic web, we replace the imprecise words that we use everyday with precise terms defined by URLs. This is linked data because it creates a universal shared vocabulary.
Continue Reading »

Nested Functions: C# that looks like Javascript

Introduction

Anonymous functions and closures are a language feature that,  in many ways,  allow programmers to reshape the syntax of a language.  Although often associated with highly dynamic languages such as Lisp and TCL and moderately dynamic languages,  such as Javascript,  C# shows that closures retain their power in statically typed languages.  In this article,  I identify a language feature that I like from the Pascal language and ‘clone’ it using closures in C#.

Continue Reading »

Closures, Javascript And The Arrow Of Time

time-machine

Introduction

In most written media,  time progresses as you move down a page:  mainstream computing languages are no different.   Anonymous Closures are a language mechanism that,  effectively,  lets programmers create new control structures.  Although people associate this power with exotic dynamic languages such as FORTH,  Scheme and TCL,   closures are becoming a feature of  mainstream languages such as Javascript and PHP (and even static languages such as C#.)

Although this article talks about issues that you’ll encounter in languages such as C# and Scheme,  I’m going to focus on Javascript written on top of the popular JQuery library:  I do that because JQuery is a great toolkit that lets programmers and designers of all skill levels do a lot by writing very little code.  Because JQuery smooths away low-level details,  it lets us clearly illustrate little weirdnesses of its programming model. Although things often work “just right” on the small scale,  little strange things snowball in larger programs — a careful look atthe  little problems is a key step towards the avoidance and mitigation of problems in big RIA projects.

Continue Reading »

Getting a Good Track From a Garmin eTrex

Introduction

The Garmin eTrex series are popular handheld GPS units.  The eTrex H is a inexpensive unit with excellent sensitivity and accuracy,  and the eTrex Vista HCx is a favorite of OpenStreetMap contributors because it accepts a microSD card which can hold OSM maps.  I intended to use my Vista HCx to contribute to OSM and to georeference photographs,  but I was shocked to discover that eTrex units remove time information from saved tracks. This means that saved tracks aren’t useful if you want to georeference photographs with an application like GPicSync.  There’s a simple solution to this problem:  avoid using saved tracks,  and download the “Active Track” instead.

Saved Tracks

The eTrex units “dumb down” saved tracks by (i) reducing the number of points, (ii) removing time information,  and (iii) applying spatial filtering.  A saved track looks like this in Garmin MapSource:

badtrack

This track could be useful for mapping,  but lacking time information,  it can’t be used to reference events that occur at a particular moment in time.  Although the eTrex has a number of menu items for configuring the active track (to,  for instance,  increase the sample rate at which points are taken) there are no options that influence the information stored in saved tracks.

Continue Reading »

Carpictures.cc, My First Web 3.0 Site

I haven’t blogged much in the last month because I’ve been busy. I’ve got a day job, and I’m a parent, and I’ve also been working on a new site: Creative Commons Car Pictures.  What you see on the surface of “Car Pictures” might be familiar,  but what’s under the surface is something you’ve probably never seen before:

carpicturesfrontpage

A collection of car pictures under a Creative Commons license,  it was built from a taxonomy that was constructed from Dbpedia and Freebase.  My editor and I then used a scalable process to clean up the taxonomy,  find matching images and enrich image metadata.  As you can imagine,  this is a process that can be applied to other problem domains:  we’ve tested some of the methodology on our site Animal Photos! but this is the first site designed from start to finish based on the new technology.

Car Pictures was made possible by data sets available on the semantic web,  and it’s soon about to export data via the semantic web.  We’re currently talking with people about exporting our content in a machine-readable linked data format — in our future plans,  this will enable a form of semantic search that will revolutionize multimedia search:  rather than searching for inprecise keywords,  it will become possible to look up pictures,  video and documents about named entities found in systems such as Freebase,  Wikipedia,  WordNet,  Cyc and OpenCalais.

In the next few days we’re going to watch carefully how Car Pictures interacts with users and with web crawlers.  Then we’ll be adding content,  community features,  and a linked data interface.   In the meantime,  we’re planning to build something at least a hundred times bigger.

Quite literally, thousands of people have contributed to Car Pictures,  but I’d like to particularly thank Dan Brickley,  who’s been helpful in the process of interpreting Dbpedia with his ARC2 RDF tools,  and Kingsley Idehen,  who’s really helped me sharpen my vision of what the semantic web can be.

Anyway,  I’d love any feedback that you can give me about Car Pictures and any help I can get in spreading the word about it.  If you’re interested in making  similar site about some other subject,  please contact me:  it’s quite possible that we can help.

First-Class Functions and Logical Negation in C#

negation

Introduction

Languages such as LISP,  ML,  oCaml F# and Scala have supported first-class functions for a long time.  Functional programming features are gradually diffusing into mainstream languages such as C#,  Javascript and PHP.   In particular,  Lambda expressions,  implicit typing,  and delegate autoboxing have made  C# 3.0 an much more expressive language than it’s predecssors.

In this article,  I develop a simple function that acts on functions:  given a boolean function fF.Not(f) returns a new boolean function which is the logical negation of f.  (That is,  F.Not(f(x)) == !f(x)).   Although the end function is simple to use,  I had to learn a bit about the details of C# to ge tthe behavior  I wanted — this article records that experience.
Continue Reading »

Putting Freebase in a Star Schema

What’s Freebase?

cyclopedia
Freebase is a open database of things that exist in the world:  things like people,  places,  songs and television shows.   As of the January 2009 dump,  Freebase contained about 241 million facts,  and it’s growing all the time.  You can browse it via the web and even edit it,  much like Wikipedia.  Freebase also has an API that lets programs add data and make queries using a language called MQL.  Freebase is complementary to DBpedia and other sources of information.  Although it takes a different approach to the semantic web than systems based on RDF standards,  it interoperates with them via  linked data.

The January 2009 Freebase dump is about 500 MB in size.  Inside a bzip-compressed files,  you’ll find something that’s similar in spirit to a Turtle RDF file,  but is in a simpler format and represents facts as a collection of four values rather than just three.

Your Own Personal Freebase

To start exploring and extracting from Freebase,  I wanted to load the database into a star schema in a mysql database — an architecture similar to some RDF stores,  such as ARC.  The project took about a week of time on a modern x86 server with 4 cores and 4 GB of RAM and resulted in a 18 GB collection of database files and indexes.

This is sufficient for my immediate purposes,  but future versions of Freebase promise to be much larger:  this article examines the means that could be used to improve performance and scalability using parallelism as well as improved data structures and algorithms. Continue Reading »

Using Linq To Tell if the Elements of an IEnumerable Are Distinct

The Problem

I’ve got an IEnumerable<T> that contains a list of values:  I want to know if all of the values in that field are distinct.  The function should be easy to use a LINQ extension method and,  for bonus points,  simply expressed in LINQ itself

One Solution

First,  define an extension method

01   public static class IEnumerableExtensions {
02        public static bool AllDistinct<T>(this IEnumerable<T> input) {
03            var count = input.Count();
04            return count == input.Distinct().Count();
05        }
06    }

When you want to test an IEnumerable<T>,  just write

07 var isAPotentialPrimaryKey=CandidateColumn.AllDistinct();

Continue Reading »

Subverting XAML: How To Inherit From Silverlight User Controls

The Problem

Many Silverlighters use XAML to design the visual appearance of their applications.  A UserControl defined with XAML is a DependencyObject that has a complex lifecycle:  there’s typically a .xaml file,  a .xaml.cs file,  and a .xaml.g.cs file that is generated by visual studio. The .xaml.g.cs file is generated by Visual Studio,  and ensures that objects defined in the XAML file correspond to fields in the object (so they are seen in intellisense and available to your c# code.)  The XAML file is re-read at runtime,  and drives a process that instantiates the actual objects defined in the XAML file — a program can compile just fine,  but fail during initialization if the XAML file is invalid or if you break any of the assumptions of the system.

XAML is a pretty neat system because it’s not tied to WPF or WPF/E.  It can be used to initialize any kind of object:  for instance,  it can be used to design workflows in asynchronous server applications based on Windows Workflow Foundation.

One problem with XAML,  however,  is that you cannot write controls that inherit from a UserControl that defined in XAML.  Visual Studio might compile the classes for you,  but they will fail to initialize at at runtime.  This is serious because it makes it impossible to create subclasses that let you make small changes to the appearance or behavior of a control.

Continue Reading »