Accent Folding

Posted on 28 February 2010 by Rick Mason

A List Apart has been a steady source of thought-provoking inspiration over the years, not only from a website-building perspective, but also because much of what they publish crosses boundaries and affects other projects and interests in my life.

Their current article, Accent Folding, has major implications for library data in general, and library catalogs in particular. It deals with Unicode and pattern matching, namely how one creates search tools that allow for variations in how words containing accents, stress marks, and other non-ASCII characters are typed. The most succinct example:

There is no excuse for your software to play dumb when the user types “cafe” instead of “café.”

The article presents methods of “normalizing” text to allow for proper matching, and should be read by anyone who gets to deal with library data for reports and searching aids. If you know how to use regular expressions, you will likely be in for a treat.
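To give a rough idea of what "normalizing" can look like, here is a minimal Python sketch. It is not the code from the A List Apart article; the function names `fold_accents` and `matches` are my own, and the approach (decompose the text, drop the combining marks, compare case-insensitively) is just one common way to do it, using Python's standard `unicodedata` module.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Return a lowercase, accent-free form of text, for comparison only.

    Hypothetical helper: the name and approach are illustrative.
    """
    # NFD splits "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every character with a non-zero combining class (the accents).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose what is left and fold case so "Cafe" matches "cafe".
    return unicodedata.normalize("NFC", stripped).casefold()


def matches(query: str, candidate: str) -> bool:
    """True if the folded query appears anywhere in the folded candidate."""
    return fold_accents(query) in fold_accents(candidate)


print(fold_accents("café"))
print(matches("cafe", "Café de Flore"))
```

The folded form should only be used as a search key. The original text, accents and all, is what you would still store and display.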

The other example they present, this time to demonstrate the limitations of accent folding, uses Japanese to illustrate just how differently the same data can be presented:

These four sentences all say “Children like to watch television” in Japanese:

Kanji: 子供はテレビを見るのが好きです。

Hiragana: こどもはてれびをみるのがすきです。

Romaji: kodomo wa terebi o miru noga suki desu.

Cyrillic: кодомо ва тэрэби о миру нога суки дэсу.
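A short Python sketch, under the same caveats as any illustration (the helper name `strip_marks` is my own, not something from the article), shows why simple mark-stripping does not help here. It leaves the romaji untouched, does not bring the scripts any closer together, and can even change the hiragana, because voiced kana such as が decompose into a base kana plus a combining mark.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Naive accent folding: decompose, then drop combining marks.

    Hypothetical helper for illustration only.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


hiragana = "こどもはてれびをみるのがすきです。"
romaji = "kodomo wa terebi o miru noga suki desu."

# Folding does not make the two scripts comparable.
same_after_folding = strip_marks(hiragana) == strip_marks(romaji)
print(same_after_folding)

# The voiced kana loses its voicing mark and becomes a different letter.
print(strip_marks("が"))
```

Matching across scripts needs transliteration or language-specific indexing, which is a much bigger job than stripping accents.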

Even if you don’t end up applying this directly to your work, the information in this article will deepen your appreciation of the challenges hidden in your data, and of how tough it can be to make it “just work” sometimes.

This entry was posted in ILS, Language, Libraries, Library 2.0, OPAC, Search, Web Design and tagged Character sets, ILS, Libraries, OPAC, Regular expression, Search, Typography, Unicode, Web Design.

Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Libology Blog
Established July 2006
ISSN: 1946-1852
by Rick Mason