Content Representation With A Twist

Friday, September 03, 2010

How to algorithmically determine valid synonyms?

Currently, I am working on a project unofficially called "read and let read". It addresses the point that reading is slow and that machines are far quicker at it than humans. The focus is on web feeds (RSS, Atom) and the question of whether a given feed is interesting to the user, that is, whether or not they should subscribe to it.

How can a machine tackle this? The idea is to look at the keywords the feed's postings are tagged with: how well do they match the user's interests?

Trying this myself, I took the tags from this blog's tag cloud as my interests and ran them against some of the more interesting feeds I read. The results are encouraging, but the actual match values are tiny: yes, the software detects matches, but they usually stay below 2%, often even below 1%.
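To make "match value" concrete, here is a minimal sketch of one way such a score could be computed. It is an assumption for illustration, not necessarily the measure my software uses: it takes the share of all tag occurrences in a feed that literally equal one of the user's interest keywords. The function name `match_ratio` and the sample data are made up.

```python
from collections import Counter

def match_ratio(interests, posting_tags):
    """Share of a feed's tag occurrences that literally match the user's interests.

    interests    -- iterable of keyword strings (hypothetical input format)
    posting_tags -- list of tag lists, one list per feed posting
    """
    wanted = {word.strip().lower() for word in interests}
    # Count every tag occurrence across all postings, case-insensitively.
    counts = Counter(tag.strip().lower() for tags in posting_tags for tag in tags)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    hits = sum(n for tag, n in counts.items() if tag in wanted)
    return hits / total

# Made-up sample data: three postings with their tags.
interests = ["semantic web", "synonyms", "rss"]
feed = [["gadgets", "phones"], ["rss", "readers"], ["laptops", "reviews", "phones"]]
print(f"{match_ratio(interests, feed):.1%}")
```

Because only literal matches count, a feed that writes "laptop" where my tags say "notebook" scores zero for that tag, which is one plausible reason for values this small.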

One issue here might be synonymy: there may be greater overlap between my interests and a feed's topics, but the two of us may speak different languages. Engadget might simply use different words for the same things, so the dumb, unknowing machine does not know there is a match. To fix this, I am now looking for an algorithm that determines synonyms for each word of a given set of keywords. (For the impatient among you: there is a resource named WordNet that can help here.)
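The simplest form of this idea is to expand every keyword with its synonyms before matching. The sketch below assumes a tiny hand-written synonym table, `SYNONYMS`, as a stand-in for a real resource such as WordNet; the table contents and the function name `expand` are hypothetical.

```python
# Hypothetical, hand-written synonym table standing in for a real thesaurus.
SYNONYMS = {
    "notebook": {"laptop"},
    "dog": {"canine", "trestle"},
}

def expand(keywords, synonyms=SYNONYMS):
    """Return the keywords plus every synonym listed for them, lowercased."""
    expanded = set()
    for word in keywords:
        key = word.strip().lower()
        expanded.add(key)
        # Words without an entry simply contribute themselves.
        expanded |= synonyms.get(key, set())
    return expanded

print(sorted(expand(["Notebook", "rss"])))
print(sorted(expand(["dog"])))
```

Note that this naive lookup returns every synonym of every meaning of a word, so "dog" brings along both "canine" and "trestle".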

Determining synonyms from a single given word will likely also bring up synonyms that belong to a different meaning of that word. Besides "canine", "trestle" is also a synonym for "dog", and the software would surely come up with that one too.

I looked into my old university books on this issue, but they all implicitly presumed that a human would be looking up the synonym. Here, however, it would be a machine, and it can neither detect the shift in meaning intuitively nor skip nonsensical synonyms.
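One way a machine might skip the nonsensical synonyms is to use the other keywords of the set as context. The sketch below is an assumption on my part, loosely in the spirit of the simplified Lesk approach to word sense disambiguation: each meaning of a word carries a few descriptive words, and the meaning that shares the most words with the context wins. The sense inventory `SENSES` and the function name `synonyms_in_context` are made up for illustration.

```python
# Hypothetical sense inventory: one entry per meaning of a word, each with
# its synonyms and a few descriptive ("gloss") words.
SENSES = {
    "dog": [
        {"synonyms": {"canine"}, "gloss": {"animal", "pet", "bark", "puppy"}},
        {"synonyms": {"trestle"}, "gloss": {"support", "frame", "workshop", "wood"}},
    ],
}

def synonyms_in_context(word, context, senses=SENSES):
    """Return the synonyms of the meaning whose gloss overlaps most with context.

    Returns an empty set when no meaning shares any word with the context,
    so that no synonym is added on a mere guess.
    """
    key = word.strip().lower()
    ctx = {c.strip().lower() for c in context} - {key}
    best, best_overlap = set(), 0
    for sense in senses.get(key, []):
        overlap = len(sense["gloss"] & ctx)
        if overlap > best_overlap:
            best, best_overlap = sense["synonyms"], overlap
    return set(best)

print(synonyms_in_context("dog", ["dog", "pet", "leash"]))
print(synonyms_in_context("dog", ["wood", "workshop"]))
print(synonyms_in_context("dog", ["rss", "atom"]))
```

With a tag set about pets, only "canine" survives; with a tag set about woodworking, only "trestle" does. If the context gives no hint at all, the sketch prefers returning nothing over guessing.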

Looking further, I found some postings about Google starting to apply synonyms to searches. So there is indeed some kind of algorithm around that determines synonyms based on a small set of given keywords. The remaining question: has that algorithm been published, and what does it look like?