Accent Folding

A List Apart has been a steady source of thought-provoking inspiration over the years, not only from a website-building perspective, but also because much of what they publish crosses boundaries and affects other projects and interests in my life.

Their current article, Accent Folding, greatly impacts library data in general, and library catalogs in particular. It deals with the issue of Unicode and pattern recognition, namely how one creates search tools that allow for variations in how people type words containing accents, stress marks, and other non-ASCII characters. The most succinct example:

There is no excuse for your software to play dumb when the user types “cafe” instead of “café.”

The article presents methods of “normalizing” text to allow for proper matching, and should be read by anyone who gets to deal with library data for reports and searching aids.  If you know how to use regular expressions, you will likely be in for a treat.
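To give a feel for what “normalizing” can mean in practice, here is a minimal sketch in Python. It is my own illustration, not the method from the article: it assumes the common approach of decomposing each character with Unicode NFD normalization and then dropping the combining marks. The function names fold_accents and matches are hypothetical.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Return a lowercase, accent-free form of text for matching purposes."""
    # NFD splits a precomposed character such as "é" into a base
    # letter ("e") followed by a combining mark (U+0301).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more thorough lower() for caseless comparison.
    return stripped.casefold()


def matches(query: str, candidate: str) -> bool:
    """True if the folded query appears anywhere in the folded candidate."""
    return fold_accents(query) in fold_accents(candidate)


print(fold_accents("café"))
print(matches("cafe", "Le Café de Flore"))
```

Only the copy used for comparison is folded; the original text, accents and all, should still be what gets stored and displayed. Letters that have no decomposition, such as “ø”, pass through this sketch unchanged, so a real catalog would probably need an extra mapping table for those.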

The other example they present, this time to demonstrate the limitations of accent folding, uses Japanese to illustrate just how differently the same data can be presented:

These four sentences all say “Children like to watch television” in Japanese:

  • Kanji: 子供はテレビを見るのが好きです。
  • Hiragana: こども は てれび を みる の が すき です 。
  • Romaji: kodomo wa terebi o miru noga suki desu.
  • Cyrillic: кодомо ва тэрэби о миру нога суки дэсу.
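The limitation can be checked with a small Python sketch. Again, this is my own illustration under the same assumption as before (NFD decomposition, then dropping combining marks), and the name fold_accents is hypothetical. Folding leaves the four sentences as four different strings, so a search typed in one script still cannot find the others.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Decompose with NFD, drop combining marks, and case-fold."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# The four renderings quoted above.
sentences = {
    "Kanji": "子供はテレビを見るのが好きです。",
    "Hiragana": "こども は てれび を みる の が すき です 。",
    "Romaji": "kodomo wa terebi o miru noga suki desu.",
    "Cyrillic": "кодомо ва тэрэби о миру нога суки дэсу.",
}

folded = {script: fold_accents(text) for script, text in sentences.items()}

# Folding does not bring the scripts any closer together:
# all four folded strings are still distinct.
print(len(set(folded.values())))

# Worse, the voicing mark in a kana such as "が" is a combining mark
# in decomposed form, so this naive folding turns it into "か",
# which is a different syllable.
print(fold_accents("が"))
```

Matching across scripts calls for transliteration, which is a separate problem from accent folding.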

Even if you don’t end up applying this directly to your work, the information in this article will deepen your appreciation of the challenges contained within your data, and of how tough it can be to make it “just work” sometimes.
