Archive for the 'Tools' Category

Beyond Freebase and DBpedia


The triumph of generic databases

The computerization of commonsense knowledge goes back at least to Ross Quillian’s paper from the 1969 book Semantic Information Processing. Ross used methods that aren’t that different from what I use today, but he was able to store just a few hundred concepts in his computer.

The Cyc project, starting in the 1980s contained about 3 million facts. It was successful on it’s own terms, but it didn’t lead to the revolution in natural language processing that it promised. WordNet, from the same era, documents about 120,000 word senses, but like Cyc, hasn’t had a large engineering impact.

DBpedia and Freebase have become popular lately, I think because they’re a lot like traditional databases in character. For a person, place or creative work you’ve got the information necessary to make a ‘Pokemon card’ about the topic. With languages like SPARQL and MQL it’s possible to write queries you’d write in a relational database, so people have an idea what to do with it.

DBpedia and Freebase are much larger than the old commonsense databases. The English Dbpedia contains 4 million topics derived from Wikipedia pages and Freebase contains 24 million facts about 600 million topics. It’s hard to quantify it, but subjectively, people feel like Wikipedia contains most of the concepts that turn up when they are reading or thinking about things. Because the new generic databases are about concepts rather than words, they are inherently multilingual.

DBpedia Spotlight is the first of a breed of language processing products that use world knowledge instead of syntactic knowledge. Using a knowledge base created from DBpedia and Wikipedia, Spotlight gets accuracy comparable to commercial named entity recognition systems — although Spotlight uses simple methods and, so far, has made little of the effort a commercial system would to systematically improve accuracy.

Continue Reading »

First-Class Functions and Logical Negation in C#

negation

Introduction

Languages such as LISP,  ML,  oCaml F# and Scala have supported first-class functions for a long time.  Functional programming features are gradually diffusing into mainstream languages such as C#,  Javascript and PHP.   In particular,  Lambda expressions,  implicit typing,  and delegate autoboxing have made  C# 3.0 an much more expressive language than it’s predecssors.

In this article,  I develop a simple function that acts on functions:  given a boolean function fF.Not(f) returns a new boolean function which is the logical negation of f.  (That is,  F.Not(f(x)) == !f(x)).   Although the end function is simple to use,  I had to learn a bit about the details of C# to ge tthe behavior  I wanted — this article records that experience.
Continue Reading »

Putting Freebase in a Star Schema

What’s Freebase?

cyclopedia
Freebase is a open database of things that exist in the world:  things like people,  places,  songs and television shows.   As of the January 2009 dump,  Freebase contained about 241 million facts,  and it’s growing all the time.  You can browse it via the web and even edit it,  much like Wikipedia.  Freebase also has an API that lets programs add data and make queries using a language called MQL.  Freebase is complementary to DBpedia and other sources of information.  Although it takes a different approach to the semantic web than systems based on RDF standards,  it interoperates with them via  linked data.

The January 2009 Freebase dump is about 500 MB in size.  Inside a bzip-compressed files,  you’ll find something that’s similar in spirit to a Turtle RDF file,  but is in a simpler format and represents facts as a collection of four values rather than just three.

Your Own Personal Freebase

To start exploring and extracting from Freebase,  I wanted to load the database into a star schema in a mysql database — an architecture similar to some RDF stores,  such as ARC.  The project took about a week of time on a modern x86 server with 4 cores and 4 GB of RAM and resulted in a 18 GB collection of database files and indexes.

This is sufficient for my immediate purposes,  but future versions of Freebase promise to be much larger:  this article examines the means that could be used to improve performance and scalability using parallelism as well as improved data structures and algorithms. Continue Reading »

Using Linq To Tell if the Elements of an IEnumerable Are Distinct

The Problem

I’ve got an IEnumerable<T> that contains a list of values:  I want to know if all of the values in that field are distinct.  The function should be easy to use a LINQ extension method and,  for bonus points,  simply expressed in LINQ itself

One Solution

First,  define an extension method

01   public static class IEnumerableExtensions {
02        public static bool AllDistinct<T>(this IEnumerable<T> input) {
03            var count = input.Count();
04            return count == input.Distinct().Count();
05        }
06    }

When you want to test an IEnumerable<T>,  just write

07 var isAPotentialPrimaryKey=CandidateColumn.AllDistinct();

Continue Reading »

Subverting XAML: How To Inherit From Silverlight User Controls

The Problem

Many Silverlighters use XAML to design the visual appearance of their applications.  A UserControl defined with XAML is a DependencyObject that has a complex lifecycle:  there’s typically a .xaml file,  a .xaml.cs file,  and a .xaml.g.cs file that is generated by visual studio. The .xaml.g.cs file is generated by Visual Studio,  and ensures that objects defined in the XAML file correspond to fields in the object (so they are seen in intellisense and available to your c# code.)  The XAML file is re-read at runtime,  and drives a process that instantiates the actual objects defined in the XAML file — a program can compile just fine,  but fail during initialization if the XAML file is invalid or if you break any of the assumptions of the system.

XAML is a pretty neat system because it’s not tied to WPF or WPF/E.  It can be used to initialize any kind of object:  for instance,  it can be used to design workflows in asynchronous server applications based on Windows Workflow Foundation.

One problem with XAML,  however,  is that you cannot write controls that inherit from a UserControl that defined in XAML.  Visual Studio might compile the classes for you,  but they will fail to initialize at at runtime.  This is serious because it makes it impossible to create subclasses that let you make small changes to the appearance or behavior of a control.

Continue Reading »

Manipulate HTML Forms With Silverlight 2

Remote Control

Introduction

Lately I’ve been working on a web application based on Silverlight 2.  The application uses a traditional web login system based on a cryptographically signed cookie.  In early development,  users logged in on an HTML page,  which would load a Silverlight application on successful login.  Users who didn’t have Silverlight installed would be asked to install it after logging in,  rather than before.

Although it’s (sometimes) possible to determine what plug-ins a user has installed using Javascript,  the methods are dependent on the specific browser and the plug-ins.  We went for a simple and effective method:  make the login form a Silverlight application,  so that users would be prompted to install Silverlight before logging in.

Our solutionn  was to make the Silverlight application a drop-in replacement for the original HTML form.  The Silverlight application controls a hidden HTML form:  when a user hits the “Log In” buttonin the Silverlight application,  the application inserts the appropriate information into the HTML form and submits it.  This article describes the technique in detail. Continue Reading »

require(), require_once() and Dynamic Autoloading in PHP

Introduction

I program in PHP a lot,  but I’ve avoided using autoloaders,  except when I’ve been working in frameworks,  such as symfony,   that include an autoloader.  Last month I started working on a system that’s designed to be part of a software product line:  many scripts,  for instance,  are going to need to deserialize objects that didn’t exist when the script was written:  autoloading went from a convenience to a necessity.

The majority of autoloaders use a fixed mapping between class names and PHP file names.  Although that’s fine if you obey a strict “one class,  one file” policy,  that’s a policy that I don’t follow 100% of the time.  An additional problem is that today’s PHP applications often reuse code from multiple frameworks and libraries that use different naming conventions:  often applications end up registering multiple autoloaders.  I was looking for an autoloader that “just works” with a minimum of convention and configuration — and I found that in a recent autoloader developed by A.J. Brown. Continue Reading »

What do you do when you’ve caught an exception?

Abort, Retry, Ignore

This article is a follow up to “Don’t Catch Exceptions“, which advocates that exceptions should (in general) be passed up to a “unit of work”, that is, a fairly coarse-grained activity which can reasonably be failed, retried or ignored. A unit of work could be:

  • an entire program, for a command-line script,
  • a single web request in a web application,
  • the delivery of an e-mail message
  • the handling of a single input record in a batch loading application,
  • rendering a single frame in a media player or a video game, or
  • an event handler in a GUI program

The code around the unit of work may look something like

[01] try {
[02]   DoUnitOfWork()
[03] } catch(Exception e) {
[04]    ... examine exception and decide what to do ...
[05] }

For the most part, the code inside DoUnitOfWork() and the functions it calls tries to throw exceptions upward rather than catch them.

To handle errors correctly, you need to answer a few questions, such as

  • Was this error caused by a corrupted application state?
  • Did this error cause the application state to be corrupted?
  • Was this error caused by invalid input?
  • What do we tell the user, the developers and the system administrator?
  • Could this operation succeed if it was retried?
  • Is there something else we could do?

Although it’s good to depend on existing exception hierarchies (at least you won’t introduce new problems), the way that exceptions are defined and thrown inside the work unit should help the code on line [04] make a decision about what to do — such practices are the subject of a future article, which subscribers to our RSS feed will be the first to read.

Continue Reading »

Converting A Synchronous Program Into An Asynchronous Program

Introduction

One of the challenges in writing programs in today’s RIA environments  (Javascript, Flex, Silverlight and GWT)  is expressing the flow of control between multiple asynchronous XHR calls.  A “one-click-one-XHR” policy is often best,  but you don’t always have control over your client-server protocols.  A program that’s simple to read as a synchronous program can become a tangle of subroutines when it’s broken up into a number of callback functions.  One answer is program translation:  to manually or automatically convert a synchronous program into an asynchronous program:  starting from the theoretical foundation,  this article talks about a few ways of doing that.

Thibaud Lopez Schneider sent me a link to an interesting paper he wrote, titled “Writing Effective Asynchronous XmlHttpRequests.” He presents an informal proof that you can take a program that uses synchronous function calls and common control structures such as if-else and do-while, and transform it a program that calls the functions asynchronously. In simple language, it gives a blueprint for implementing arbitrary control flow in code that uses asynchronous XmlHttpRequests.

In this article,  I work a simple example from Thibaud’s paper and talk about four software tools that automated the conversion of conventional control flows to asynchronous programming.  One tool,  the Windows Workflow Foundation, lets us compose long-running applications out of a collection of asynchronous Activity objects.  Another two tools are jwacs and Narrative Javascript,  open-source   translators that translated pseudo-blocking programs in a modified dialect of JavaScript into an asynchronous program in ordinary JavaScript that runs in your browser.

Continue Reading »

Stop Catching Exceptions!

Motivation

It’s clear that a lot of programmers are uncomfortable with exceptions [1] [2]; in the feedback of an article I wrote about casting, it seemed that many programmers saw the throwing of a NullReferenceException at a cast to be an incredible catastrophe.

In this article, I’ll share a philosophy that I hope will help programmers overcome the widespread fear of exceptions. It’s motivated by five goals:

  1. Do no harm
  2. To write as little error handling code as possible,
  3. To think about error handling as little as possible
  4. To handle errors correctly when possible,
  5. Otherwise errors should be handled sanely

To do that, I

  1. Use finally to stabilize program state when exceptions are thrown
  2. Catch and handle exceptions locally when the effects of the error are local and completely understood
  3. Wrap independent units of work in try-catch blocks to handle errors that have global impact

This isn’t the last word on error handling, but it avoids many of the pitfalls that people fall into with exceptions. By building upon this strategy, I believe it’s possible to develop an effective error handling strategy for most applications: future articles will build on this topic, so keep posted by subscribing to the Generation 5 RSS Feed.

Continue Reading »