The triumph of generic databases
The computerization of commonsense knowledge goes back at least to Ross Quillian’s paper from the 1969 book Semantic Information Processing. Ross used methods that aren’t that different from what I use today, but he was able to store just a few hundred concepts in his computer.
The Cyc project, starting in the 1980s contained about 3 million facts. It was successful on it’s own terms, but it didn’t lead to the revolution in natural language processing that it promised. WordNet, from the same era, documents about 120,000 word senses, but like Cyc, hasn’t had a large engineering impact.
DBpedia and Freebase have become popular lately, I think because they’re a lot like traditional databases in character. For a person, place or creative work you’ve got the information necessary to make a ‘Pokemon card’ about the topic. With languages like SPARQL and MQL it’s possible to write queries you’d write in a relational database, so people have an idea what to do with it.
DBpedia and Freebase are much larger than the old commonsense databases. The English Dbpedia contains 4 million topics derived from Wikipedia pages and Freebase contains 24 million facts about 600 million topics. It’s hard to quantify it, but subjectively, people feel like Wikipedia contains most of the concepts that turn up when they are reading or thinking about things. Because the new generic databases are about concepts rather than words, they are inherently multilingual.
DBpedia Spotlight is the first of a breed of language processing products that use world knowledge instead of syntactic knowledge. Using a knowledge base created from DBpedia and Wikipedia, Spotlight gets accuracy comparable to commercial named entity recognition systems — although Spotlight uses simple methods and, so far, has made little of the effort a commercial system would to systematically improve accuracy.
Continue Reading »
Paul Houle on June 19th 2012 in Freebase, Semantic Web
Introduction
Languages such as LISP, ML, oCaml F# and Scala have supported first-class functions for a long time. Functional programming features are gradually diffusing into mainstream languages such as C#, Javascript and PHP. In particular, Lambda expressions, implicit typing, and delegate autoboxing have made C# 3.0 an much more expressive language than it’s predecssors.
In this article, I develop a simple function that acts on functions: given a boolean function f, F.Not(f) returns a new boolean function which is the logical negation of f. (That is, F.Not(f(x)) == !f(x)). Although the end function is simple to use, I had to learn a bit about the details of C# to ge tthe behavior I wanted — this article records that experience.
Continue Reading »
Paul Houle on March 9th 2009 in Dot Net, Functional Programming, Linq
What’s Freebase?
Freebase is a open database of things that exist in the world: things like people, places, songs and television shows. As of the January 2009 dump, Freebase contained about 241 million facts, and it’s growing all the time. You can browse it via the web and even edit it, much like Wikipedia. Freebase also has an API that lets programs add data and make queries using a language called MQL. Freebase is complementary to DBpedia and other sources of information. Although it takes a different approach to the semantic web than systems based on RDF standards, it interoperates with them via linked data.
The January 2009 Freebase dump is about 500 MB in size. Inside a bzip-compressed files, you’ll find something that’s similar in spirit to a Turtle RDF file, but is in a simpler format and represents facts as a collection of four values rather than just three.
Your Own Personal Freebase
To start exploring and extracting from Freebase, I wanted to load the database into a star schema in a mysql database — an architecture similar to some RDF stores, such as ARC. The project took about a week of time on a modern x86 server with 4 cores and 4 GB of RAM and resulted in a 18 GB collection of database files and indexes.
This is sufficient for my immediate purposes, but future versions of Freebase promise to be much larger: this article examines the means that could be used to improve performance and scalability using parallelism as well as improved data structures and algorithms. Continue Reading »
Paul Houle on February 25th 2009 in Freebase, SQL, Semantic Web
The Problem
I’ve got an IEnumerable<T> that contains a list of values: I want to know if all of the values in that field are distinct. The function should be easy to use a LINQ extension method and, for bonus points, simply expressed in LINQ itself
One Solution
First, define an extension method
01 public static class IEnumerableExtensions {
02 public static bool AllDistinct<T>(this IEnumerable<T> input) {
03 var count = input.Count();
04 return count == input.Distinct().Count();
05 }
06 }
When you want to test an IEnumerable<T>, just write
07 var isAPotentialPrimaryKey=CandidateColumn.AllDistinct();
Paul Houle on February 13th 2009 in Dot Net, Linq
The Problem
Many Silverlighters use XAML to design the visual appearance of their applications. A UserControl defined with XAML is a DependencyObject that has a complex lifecycle: there’s typically a .xaml file, a .xaml.cs file, and a .xaml.g.cs file that is generated by visual studio. The .xaml.g.cs file is generated by Visual Studio, and ensures that objects defined in the XAML file correspond to fields in the object (so they are seen in intellisense and available to your c# code.) The XAML file is re-read at runtime, and drives a process that instantiates the actual objects defined in the XAML file — a program can compile just fine, but fail during initialization if the XAML file is invalid or if you break any of the assumptions of the system.
XAML is a pretty neat system because it’s not tied to WPF or WPF/E. It can be used to initialize any kind of object: for instance, it can be used to design workflows in asynchronous server applications based on Windows Workflow Foundation.
One problem with XAML, however, is that you cannot write controls that inherit from a UserControl that defined in XAML. Visual Studio might compile the classes for you, but they will fail to initialize at at runtime. This is serious because it makes it impossible to create subclasses that let you make small changes to the appearance or behavior of a control.
Continue Reading »
Paul Houle on February 10th 2009 in Dot Net, Silverlight
Introduction
Lately I’ve been working on a web application based on Silverlight 2. The application uses a traditional web login system based on a cryptographically signed cookie. In early development, users logged in on an HTML page, which would load a Silverlight application on successful login. Users who didn’t have Silverlight installed would be asked to install it after logging in, rather than before.
Although it’s (sometimes) possible to determine what plug-ins a user has installed using Javascript, the methods are dependent on the specific browser and the plug-ins. We went for a simple and effective method: make the login form a Silverlight application, so that users would be prompted to install Silverlight before logging in.
Our solutionn was to make the Silverlight application a drop-in replacement for the original HTML form. The Silverlight application controls a hidden HTML form: when a user hits the “Log In” buttonin the Silverlight application, the application inserts the appropriate information into the HTML form and submits it. This article describes the technique in detail. Continue Reading »
Paul Houle on January 21st 2009 in Silverlight
Introduction
I program in PHP a lot, but I’ve avoided using autoloaders, except when I’ve been working in frameworks, such as symfony, that include an autoloader. Last month I started working on a system that’s designed to be part of a software product line: many scripts, for instance, are going to need to deserialize objects that didn’t exist when the script was written: autoloading went from a convenience to a necessity.
The majority of autoloaders use a fixed mapping between class names and PHP file names. Although that’s fine if you obey a strict “one class, one file” policy, that’s a policy that I don’t follow 100% of the time. An additional problem is that today’s PHP applications often reuse code from multiple frameworks and libraries that use different naming conventions: often applications end up registering multiple autoloaders. I was looking for an autoloader that “just works” with a minimum of convention and configuration — and I found that in a recent autoloader developed by A.J. Brown. Continue Reading »
Paul Houle on January 9th 2009 in PHP
Abort, Retry, Ignore
This article is a follow up to “Don’t Catch Exceptions“, which advocates that exceptions should (in general) be passed up to a “unit of work”, that is, a fairly coarse-grained activity which can reasonably be failed, retried or ignored. A unit of work could be:
- an entire program, for a command-line script,
- a single web request in a web application,
- the delivery of an e-mail message
- the handling of a single input record in a batch loading application,
- rendering a single frame in a media player or a video game, or
- an event handler in a GUI program
The code around the unit of work may look something like
[01] try {
[02] DoUnitOfWork()
[03] } catch(Exception e) {
[04] ... examine exception and decide what to do ...
[05] }
For the most part, the code inside DoUnitOfWork() and the functions it calls tries to throw exceptions upward rather than catch them.
To handle errors correctly, you need to answer a few questions, such as
- Was this error caused by a corrupted application state?
- Did this error cause the application state to be corrupted?
- Was this error caused by invalid input?
- What do we tell the user, the developers and the system administrator?
- Could this operation succeed if it was retried?
- Is there something else we could do?
Although it’s good to depend on existing exception hierarchies (at least you won’t introduce new problems), the way that exceptions are defined and thrown inside the work unit should help the code on line [04] make a decision about what to do — such practices are the subject of a future article, which subscribers to our RSS feed will be the first to read.
Continue Reading »
Paul Houle on August 27th 2008 in Dot Net, Exceptions, Java, PHP, SQL
Introduction
One of the challenges in writing programs in today’s RIA environments (Javascript, Flex, Silverlight and GWT) is expressing the flow of control between multiple asynchronous XHR calls. A “one-click-one-XHR” policy is often best, but you don’t always have control over your client-server protocols. A program that’s simple to read as a synchronous program can become a tangle of subroutines when it’s broken up into a number of callback functions. One answer is program translation: to manually or automatically convert a synchronous program into an asynchronous program: starting from the theoretical foundation, this article talks about a few ways of doing that.
Thibaud Lopez Schneider sent me a link to an interesting paper he wrote, titled “Writing Effective Asynchronous XmlHttpRequests.” He presents an informal proof that you can take a program that uses synchronous function calls and common control structures such as if-else and do-while, and transform it a program that calls the functions asynchronously. In simple language, it gives a blueprint for implementing arbitrary control flow in code that uses asynchronous XmlHttpRequests.
In this article, I work a simple example from Thibaud’s paper and talk about four software tools that automated the conversion of conventional control flows to asynchronous programming. One tool, the Windows Workflow Foundation, lets us compose long-running applications out of a collection of asynchronous Activity objects. Another two tools are jwacs and Narrative Javascript, open-source translators that translated pseudo-blocking programs in a modified dialect of JavaScript into an asynchronous program in ordinary JavaScript that runs in your browser.
Continue Reading »
Motivation
It’s clear that a lot of programmers are uncomfortable with exceptions [1] [2]; in the feedback of an article I wrote about casting, it seemed that many programmers saw the throwing of a NullReferenceException at a cast to be an incredible catastrophe.
In this article, I’ll share a philosophy that I hope will help programmers overcome the widespread fear of exceptions. It’s motivated by five goals:
- Do no harm
- To write as little error handling code as possible,
- To think about error handling as little as possible
- To handle errors correctly when possible,
- Otherwise errors should be handled sanely
To do that, I
- Use finally to stabilize program state when exceptions are thrown
- Catch and handle exceptions locally when the effects of the error are local and completely understood
- Wrap independent units of work in try-catch blocks to handle errors that have global impact
This isn’t the last word on error handling, but it avoids many of the pitfalls that people fall into with exceptions. By building upon this strategy, I believe it’s possible to develop an effective error handling strategy for most applications: future articles will build on this topic, so keep posted by subscribing to the Generation 5 RSS Feed.
Paul Houle on July 31st 2008 in Dot Net, Exceptions, Java, PHP