Tuesday, March 1, 2011

On Search: Metadata
Tim Bray
In the Web’s early years, the overwhelming favorite among search engines was Yahoo. Today it’s Google. Neither has actually had better text-search technology than the competition; they won because they used metadata effectively to make their services more useful. In this ninth On Search episode, a survey of what metadata is, where it comes from, and how to use it.

Metadata is technically “information about information,” and you can start a fistfight in the bar at any XML or Content-Management conference about what’s data and what’s metadata. In the context of search, metadata is anything that you know about the documents you’re searching beyond the words they contain. With descriptive markup, it’s easy enough to store a document’s metadata right inside it (consider HTML’s <meta> tag).
Yahoo · Back when everyone searched at Yahoo, the usual result list looked quite a bit different. If I typed in “donkey,” before the pointers to Web pages there would be a few pointers to categories in the Yahoo taxonomy that contained the word “Donkey.”
This worked really well, because if the Yahoo editor had classified Diseases of the Horse Family or The Asses of the British Isles under a donkey-related category, I’d find them even though “donkey” wasn’t in the title.
In effect, Yahoo maintained one useful piece of metadata about each page in the engine: “What is this about?” This is a real value-add for the searcher.
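To make that concrete, here’s a minimal sketch (the documents, field names, and categories are invented for illustration) of how subject metadata widens recall: a query term can match a document’s assigned category even when it never appears in the title or body.

```python
# A toy corpus: each document carries "categories" metadata assigned by an
# editor, alongside its title and body text. All values are invented.
documents = [
    {"title": "Diseases of the Horse Family",
     "body": "Common ailments of working equines.",
     "categories": ["Animals > Mammals > Donkeys"]},
    {"title": "The Asses of the British Isles",
     "body": "A survey of feral populations.",
     "categories": ["Animals > Mammals > Donkeys"]},
    {"title": "Pack Animals of the Andes",
     "body": "Llamas carry the load here.",
     "categories": ["Animals > Mammals > Camelids"]},
]

def search(term):
    """Match a query term against the text OR the category metadata."""
    term = term.lower()
    for doc in documents:
        in_text = term in (doc["title"] + " " + doc["body"]).lower()
        in_categories = any(term in c.lower() for c in doc["categories"])
        if in_text or in_categories:
            yield doc["title"]

print(list(search("donkey")))
# ['Diseases of the Horse Family', 'The Asses of the British Isles']
```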
Google · Google, like Yahoo, maintains one key metadata field about each item it indexes: the well-known PageRank, essentially a measure of how many other pages point to it. They make use of it very simply, to order the result list with high PageRanks at the top.
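As a toy illustration of that use, assuming each hit carries a single popularity number as metadata (the URLs and scores below are made up), ordering the result list is just a sort:

```python
# Each hit carries one piece of metadata, a popularity score; ordering the
# result list is then just a descending sort. URLs and scores are made up.
hits = [
    {"url": "http://example.com/donkey-sanctuary", "popularity": 0.82},
    {"url": "http://example.com/my-donkey-photos", "popularity": 0.03},
    {"url": "http://example.com/donkey-breeds", "popularity": 0.41},
]

for hit in sorted(hits, key=lambda h: h["popularity"], reverse=True):
    print(f'{hit["popularity"]:.2f}  {hit["url"]}')
```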
Conclusions? · Google seized search leadership from Yahoo; can we conclude that it’s more important to know how popular something is than to know what it’s about? If you’d told me that ten years ago I would have had a hard time believing it, but the evidence seems pretty compelling. Note that Google actually does have some subject metadata via their integration with the Open Directory Project, but they don’t push it that hard, and the volunteer-staffed, highly-political, AOL-semi-orphan ODP is a fairly weak reed to lean on anyhow.

On the other hand, Google has always been way more focused on search than Yahoo has, and isn’t always trying to get in front of you with stock prices and news and weather and so on. More important, even if it turns out that popularity is the key thing for Internet search, the Internet is a very special place, and it’s quite unlikely that popularity is the killer metadatum for the whole universe of search applications.
I believe, though, in the other obvious conclusion: that the number-one way to make search work better is to bring some metadata to bear on the problem. This really shouldn’t be surprising: As I’ve discussed before, it’s really hard to make search engines act much smarter than they do today. So instead, let’s reinforce them with externally-supplied metadata.
Where Does Metadata Come From? · Those Yahoo and Google metadata offerings, while really quite different, have one important thing in common: both are expensive. Yahoo has for years employed a team of editors to sort websites into their subject hierarchy by hand. And Google’s immense rooms full of machines humming away computing PageRanks twenty-four hours a day are a legend in our industry.
In my experience, this is typical. Put another way: There is no cheap metadata. Of course, if we could use computers to compute the metadata like Google does, that would be immensely cheaper than having employees do it. And a lot of smart people have invested a lot of effort and money into the problem of deriving metadata from data, but it’s a hard one. (Still, we should be on the lookout for opportunities; more later).
Many people in the content-management and knowledge-management trades have noticed this, and concluded that the trick is to gather metadata upstream. Remember how Microsoft Word, out of the box, used to pop up a dialog every time you created a new document and encourage you to provide a little metadata? Most people immediately said “Make this go away!” and I don’t think Word has done this (by default) for years.
Historically, the difficulty of collecting metadata at source has generally been large enough to outweigh the (potentially huge) benefits from collecting it. But I for one am not ready to give up on this approach. There are, after all, domains where metadata is at the core of the business proposition, and the process works there. For example, the editorial staff who produce the Wall Street Journal add metadata as they go along, identifying people, companies, stock ticker symbols, and so on.
If You Collect Metadata By Hand · The most important lesson I’ve learned is this: don’t try to collect too much. You might, just might, get people, when they’re interacting with your intranet, to label their information by project and title; but ask for more than a couple of fields and people will just bypass the process.
This is harder than it looks. When you decide in principle that metadata should be collected, it will develop that many stakeholders have short-lists of the fields they need to make this worthwhile. You can easily end up with a “short” list of a dozen or more fields that constitute the “absolute minimum” that people think you must have. And if you adopt it, you’re dead, because except in special circumstances (e.g. the WSJ), people just will not take the time to do this.
Automatic Metadata · Obviously, there are some metadata items the computer will give you for free: a filename, created/modified dates, who created it, what kind of file (HTML, Excel, PowerPoint), how big it is. These can be handy for search applications and since they’re free, you should collect them and make them available.
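A sketch of what harvesting that free metadata might look like, using only Python’s standard library; the owner lookup assumes a Unix-like system, and the exact field names are my own choice:

```python
# Harvest the metadata the filesystem gives away for free: name, size,
# timestamps, a guess at the file type, and (on Unix) the owner.
import mimetypes
from datetime import datetime
from pathlib import Path

def file_metadata(path):
    p = Path(path)
    st = p.stat()
    meta = {
        "filename": p.name,
        "size_bytes": st.st_size,
        "modified": datetime.fromtimestamp(st.st_mtime).isoformat(),
        # st_ctime is creation time on Windows, inode-change time on Unix
        "changed": datetime.fromtimestamp(st.st_ctime).isoformat(),
        "type": mimetypes.guess_type(p.name)[0] or "unknown",
    }
    try:
        import pwd                       # Unix-only owner lookup
        meta["owner"] = pwd.getpwuid(st.st_uid).pw_name
    except (ImportError, KeyError):
        meta["owner"] = None
    return meta

print(file_metadata(__file__))           # try it on this very script
```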
The second category of machine-generated metadata is what “Autocategorization” software does. These are the companies like (in alphabetical order) Autonomy, Gavagai, Semio, Stratify, and Vivisimo; they all promise to take your raw data and either generate or fill-in a subject taxonomy telling you what it’s about.
Sometimes they work, sometimes they don’t, and sometimes it can be puzzling figuring out whether they’re going to work or not. But they are not an exception to the no-cheap-metadata rule; this is software that’s generally expensive to buy and expensive to deploy.
Don’t Neglect Your Logfiles · There’s one kind of automatic metadata that I think doesn’t get the respect it deserves: the contents of your logfiles. Here’s the most obvious example: unless you’ve been throwing away your internal Web server log files, you already know which are the most popular items on the Intranet. It wouldn’t be that hard to boil them down (occasionally, on a batch basis; this doesn’t need to be real-time) and develop your own internal “PopRank” based on what gets downloaded the most. It might not be as sexy as PageRank, but if I search the Intranet for material on expense policies, you can bet I’m going to find a lot, and if two or three stand out because they’re the ones everyone ends up reading, you might save a lot of people a lot of time.
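Here is one way such a batch boil-down might look, assuming logs in the Apache-style common format and a hypothetical access.log file name; the result is a simple hit count per URL that a search engine could consult as its “PopRank”:

```python
# Boil web-server logs down, in batch, to a per-URL hit count: a homegrown
# "PopRank". Assumes Apache-style common log format and a file named
# access.log; both are stand-ins for whatever your servers actually produce.
import re
from collections import Counter

LOG_LINE = re.compile(r'"(?:GET|POST) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

def pop_rank(logfile):
    counts = Counter()
    with open(logfile) as f:
        for line in f:
            m = LOG_LINE.search(line)
            if m and m.group("status") == "200":   # count successful fetches only
                counts[m.group("path")] += 1
    return counts

# Run occasionally as a batch job, not in real time.
for path, hits in pop_rank("access.log").most_common(10):
    print(f"{hits:6d}  {path}")
```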

Care, Feeding, and Using · Once you’ve got some metadata, since it’s expensive, you should take good care of it. This almost always means putting it in a relational database. As I mentioned above, debates over the meta-ness of data can get religious, but in practice, I’ve observed that while data itself (for example XML or video) often resists being forced into rows and columns, metadata usually lines up happily. Even ongoing has a little MySQL database sitting off to the side of all the XML-encoded entries, tracking a bunch of useful facts about them, including some (e.g. the title) that are replicated inside the data.
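One plausible shape for such a little database off to the side, with SQLite standing in here for MySQL and the column names and sample row invented for illustration:

```python
# A little metadata table off to the side of the documents themselves.
# SQLite stands in for MySQL; the columns and the sample row are invented.
import sqlite3

conn = sqlite3.connect("metadata.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        uri       TEXT PRIMARY KEY,   -- where the actual data lives
        title     TEXT,               -- may duplicate what's inside the data
        author    TEXT,
        created   TEXT,               -- ISO-8601 date
        mime_type TEXT,
        pop_rank  INTEGER DEFAULT 0   -- e.g. from the logfile job above
    )
""")
conn.execute(
    "INSERT OR REPLACE INTO documents VALUES (?, ?, ?, ?, ?, ?)",
    ("/docs/expense-policy.html", "Expense Policy",
     "J. Smith", "2003-05-12", "text/html", 0),
)
conn.commit()
```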

And of course you’ll want to put this goodness to work. One obvious way is to have a query screen, so that people can search for resources by author, date, title, and so on, not just brute-force full-text. But what you’d really like is to learn from Yahoo and Google, and have the metadata just there, silently helping; for example, using it to rank your results.
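Continuing the hypothetical SQLite table sketched above, a fielded query might filter on author and date while letting the popularity number silently order the results:

```python
# A fielded query against the metadata table sketched earlier: filter on
# author and date, and let the popularity column silently order the results.
import sqlite3

def fielded_search(author=None, since=None):
    conn = sqlite3.connect("metadata.db")
    sql, args = "SELECT uri, title FROM documents WHERE 1=1", []
    if author:
        sql += " AND author = ?"
        args.append(author)
    if since:
        sql += " AND created >= ?"
        args.append(since)
    sql += " ORDER BY pop_rank DESC"
    return conn.execute(sql, args).fetchall()

print(fielded_search(author="J. Smith", since="2003-01-01"))
```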
Another thing you could do is call up Antarctica; our Visual Net product takes metadata and gives search a Graphical User Interface, just like the one on your personal computer.
In the API · This means that if you’re going to design an API for a search engine (something I plan to do eventually in this series) you’re going to need to include entry-points not just for searching and adding words to the full-text index, but also for adding, maintaining, and using the metadata that drives the search.
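One possible shape for such an API, sketched as a Python interface; the method names and signatures are my own invention, not anything specified in this series:

```python
# One possible API surface: indexing text, maintaining metadata, and querying
# with both. Names and signatures are hypothetical.
from abc import ABC, abstractmethod
from typing import Optional

class SearchEngine(ABC):
    @abstractmethod
    def index_text(self, doc_id: str, text: str) -> None:
        """Add a document's words to the full-text index."""

    @abstractmethod
    def set_metadata(self, doc_id: str, **fields) -> None:
        """Attach or update metadata fields (title, author, pop_rank, ...)."""

    @abstractmethod
    def get_metadata(self, doc_id: str) -> dict:
        """Read back everything known about a document."""

    @abstractmethod
    def search(self, query: str, filters: Optional[dict] = None,
               order_by: Optional[str] = None) -> list[str]:
        """Full-text query, optionally filtered and ordered by metadata."""
```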
The Web and the Semantic Web · One of the Web’s distinguishing features is that there’s a big gaping hole where the metadata ought to be. The Web has resources, identified by URI, and you can ask for “representations,” which come with some metadata, but the metadata is about the representation, not the resource. This is probably a bit abstract for those who don’t wrestle professionally with Web Architecture, so an example’s in order: Suppose you read an online news story from your desktop computer at 9AM. You get a Web page with some metadata telling you that it’s in HTML and is in English and ISO-8859-1-encoded and can’t be cached and so on. Suppose, at noon, on the road, you hit the same story from the minibrowser in your cellphone. The server cleverly notices this is a small-screen device and sends the same information in WAP or simplified HTML or some such thing, with metadata saying what it is (which is completely different from the metadata you got with the PC browser version).
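You can see the representation-only nature of that metadata by fetching just the headers for a URL; everything that comes back (media type, language, cacheability) describes the particular representation that was sent, not the underlying resource. A small sketch with the standard library, using example.com as a stand-in for the news story:

```python
# Fetch only the headers for a URL: what comes back (media type, language,
# cacheability) describes this particular representation, not the resource.
# example.com stands in for the news story in the text above.
from urllib.request import Request, urlopen

req = Request("https://example.com/", method="HEAD")
with urlopen(req) as resp:
    for name in ("Content-Type", "Content-Language", "Cache-Control"):
        print(f"{name}: {resp.headers.get(name)}")
```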
So, given a URI, the Web has no built-in way to ask questions about it, for example “What is this about?” or “When does it expire?” or “Is this suitable for children?” or “Is this good?”
The Semantic Web project is trying to make the whole Web smarter and more machine-readable, and obviously this is never going to happen without metadata. So a lot of really smart people are working hard to develop good ways to encode, organize, and interchange metadata keyed by URIs. Of course, these people’s dreams aren’t about mere search, they’re about managing your schedule and your medical treatments and your shopping and your supply chain. All of which is fine; but if the Semantic Web ever takes off, there is going to be a whole lot more metadata available about a whole lot of stuff.
As a side-effect, I expect that all the search services of the world will become a lot richer, a lot smarter, and a lot more fun to use. But we’re not there yet.
A Word On Our Sponsor · This is a sponsored essay. It is brought to you by the local power company, who arranged a complete power failure in Antarctica’s offices this afternoon, so I took advantage of battery power to type this in. Power’s back, so it’s back to work we go.


