Content Representation With A Twist

Showing posts with label introduction: thesaurus. Show all posts

Friday, June 22, 2007

An Issue about Getting a Replacement for a Heater Part by Using a Thesaurus

One of the oldest questions that led to MOM was an issue a former information science teacher of mine presented in one of his lectures: He sketched the case of someone who had a problem with their heater and had to find the replacement part by using a thesaurus. The task was for the guy to look up the matching word in the thesaurus so he could order the part by mail.
 

Simply put, a thesaurus is a vocabulary that aims to put its own words in order. It helps experts find the right words, much as the 'thesaurus' built into any major word processor does for lay people. The thesaurus relates the items it deals with to each other to bring them into a hierarchy: A mouse is a mammal, and a mammal is a vertebrate. The cat is a mammal too, so the cat node is placed next to the mouse node. A tiger is a cat too, as are the lion and the panther, so they are placed as children of the cat.

The core item the thesaurus deals with is a trinity of the item itself, the word for that item, and the thought or mental image of the item. This trinity is called a notion. The notion is an abstraction of an item and is therefore always immaterial. It focuses on the imagined part but never loses sight of the material item or the word. The thesaurus orders its words by looking at both the words and the real items.

In a first step, words synonymous with each other are collected into a set. The most familiar of those synonyms is picked and declared the descriptor of that set. If the descriptor is referred to, the whole set is implicitly meant -- or, thought-with.
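The first step can be sketched in a few lines of Python. This is only an illustration under assumptions: the synonym sets and the familiarity scores are made up, and "most familiar" is modeled here as a plain number per word, which the text does not prescribe.

```python
def pick_descriptor(synonyms, familiarity):
    """Return the most familiar synonym; ties are broken alphabetically."""
    return max(sorted(synonyms), key=lambda word: familiarity.get(word, 0))


def build_descriptor_index(synonym_sets, familiarity):
    """Map every word of every synonym set to the descriptor of its set."""
    index = {}
    for synonyms in synonym_sets:
        descriptor = pick_descriptor(synonyms, familiarity)
        for word in synonyms:
            index[word] = descriptor
    return index


# Hypothetical example data, not taken from any real thesaurus.
synonym_sets = [
    {"cat", "feline", "house cat"},
    {"spine", "backbone", "vertebral column"},
]
familiarity = {
    "cat": 95, "feline": 20, "house cat": 40,
    "spine": 80, "backbone": 60, "vertebral column": 10,
}

descriptor_of = build_descriptor_index(synonym_sets, familiarity)
print(descriptor_of["feline"])    # cat
print(descriptor_of["backbone"])  # spine
```

Referring to "cat" then stands for the whole set, since every member of the set resolves to the same descriptor.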

In a second step, the thesaurus goes ahead to order the items, identified by their descriptors. The most widely used relationship between descriptors is probably the is a relationship, just as demonstrated above: The cat is a mammal, the mammal is a vertebrate, and so on. Another common relationship applied by thesauri is the has a relationship. It states that the vertebrate has a spine, and a human has a head, has a torso, has a pair of arms and also has a pair of legs and feet. However, the is a relationship is used much more than the has a one. The reason might be that a network of has a relationships and descriptors would quickly become a rather dense graph that is hard to maintain.
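As a minimal sketch, the two relationships can be kept as two plain dictionaries over descriptors. The dictionary names and helper functions are hypothetical. That a term inherits the parts of its is a ancestors is an assumption of this sketch, not something the text above states.

```python
# is a: each descriptor points to its single parent descriptor.
IS_A = {
    "mouse": "mammal",
    "cat": "mammal",
    "tiger": "cat",
    "lion": "cat",
    "panther": "cat",
    "mammal": "vertebrate",
}

# has a: each descriptor points to the list of its parts.
HAS_A = {
    "vertebrate": ["spine"],
    "human": ["head", "torso", "pair of arms", "pair of legs and feet"],
}


def ancestors(term):
    """Follow the is a links upwards and return all broader terms in order."""
    chain = []
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain


def parts(term):
    """Own parts plus the parts of all is a ancestors (assumed inheritance)."""
    result = []
    for current in [term] + ancestors(term):
        result.extend(HAS_A.get(current, []))
    return result


print(ancestors("tiger"))  # ['cat', 'mammal', 'vertebrate']
print(parts("tiger"))      # ['spine']
```

The is a side stays a tree with one parent per term, while the has a side can fan out to many parts per term, which hints at why it grows dense faster.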

Aside from these two kinds of relationship, the creator of a thesaurus can set up any kind of relationship they can imagine. One example is the associative relationship, which links related notions that cannot be brought together by any other kind of relationship, such as cat and cat food. The available relationships vary from thesaurus to thesaurus, as their developers may have chosen different kinds of relationships to use.
 

For the case of the heater, we assumed a thesaurus effectively consisting of is a relationships only, since that seems to be the most common setup of a thesaurus.

A thesaurus consisting of is a relationships only helps an expert to quickly find the words they already know. A lay person, on the other hand, usually gets stuck in that professional jargon rather quickly, as they become unable to tell one notion from another. Thesauri traditionally don't aim at assisting lay people, so the definitions they provide for descriptors are barely more than a reminder -- as said, of something the thesaurus developers assume the user already knows. That is, if the thesaurus provides a definition text at all. During my studies I learned that thesauri resemble deserts of words, providing definitions as rarely as deserts have oases.

So, the answer to the heater question is: Restricted to a thesaurus, the guy won't find the replacement part for his heater.

And the amazing part, which my information science teacher pointed out too, was that going to the nearest heating supply shop would solve the problem within a minute -- it would suffice for the guy to describe the missing part by its look.
 

That miracle stuck with me. I came to wonder why not set up a kind of thesaurus that would prefer has a over is a relationships, and asked the teacher about that. I was pointed to the issue of how to put that into practice: How to manage that heavily wired graph?
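To make the idea of a thesaurus built mainly on has a relationships a bit more tangible, here is a small sketch. It is not how MOM works. The part names, the features and the ranking by plain overlap count are all invented for illustration.

```python
# Hypothetical has a data: each notion and the features it has.
HAS_A = {
    "thermostatic valve": {"rotary knob", "numbered scale", "threaded collar"},
    "bleed valve": {"square pin", "small outlet hole", "threaded collar"},
    "wall bracket": {"screw holes", "metal hook"},
}


def rank_by_description(observed, has_a=HAS_A):
    """Rank notions by how many of the observed features they have."""
    scored = []
    for notion, features in has_a.items():
        overlap = len(observed & features)
        if overlap:
            scored.append((overlap, notion))
    # Highest overlap first; alphabetical order breaks ties.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [notion for _, notion in scored]


# The lay person only describes what the part looks like.
print(rank_by_description({"rotary knob", "threaded collar"}))
# ['thermostatic valve', 'bleed valve']
```

Even this toy version shows where the maintenance problem comes from: every notion carries many has a links, and shared features such as the threaded collar wire the notions to each other.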
 

Well, that's a matter of coping with machines, so I went ahead, although I wasn't going to get any support from that teacher. On the other hand, I had been familiar with programming since 1988 -- so what? ... And over time, MOM evolved.
 
 
Update: While tagging all the rather old postings that were already in this blog before blogger.com offered post tagging, I noticed that I had presented another variant of the issue earlier, there related to a car replacement part.

      
Updates:
20070624: added reference to the car repair example

Sunday, March 05, 2006

Cleaning up the confusion about thesauri and classifications

To clean up the confusion mentioned earlier, I wrote a short introductory article on thesauri and classifications. It mainly relies on an excerpt taken from a book by a former information science teacher of mine. -- Here we go:

Introduction

The base assumption is that data, information, and knowledge need to be ordered. The data has to be ordered systematically. People who index databases -- so-called "indexers" -- use order systems to make content retrievable. Thus, a user can retrieve information from a database only if she/he knows the tools used during the creation of that very database.[1]


Methods of organizing content representation originate from the fields of library science and documentation.[2]


There are two dominant content representation methods in documentation: classification and thesaurus. "A classification is a structured representation of classes and of the notional relationships between the classes."[3]  Any class is represented by a notation, whereby the notation is independent of any natural language (cf. DIN 37205, 2).[4]  "Similarly, a thesaurus also is an organized compilation of terms, but in this case their natural language appellations are used (cf. DIN 1463/1, 2)."[5] 

Systems of Concepts

Business documentation creates an order. This order refers to notions (= concepts) and relations between these notions. "One may assume that this equals to organize business terminology [...] by a notional order."[6] 


Systems of concepts differentiate between two main kinds of relationship: associative and hierarchical. There are two variants of the latter: the abstract and the partitive variant.[7]

An abstract relationship means that a "child" term has the same features as its parent plus at least one additional feature that the parent term does not have.[8]  (The features are not stated anywhere; there is nothing but a parent term, a child term, and a relationship link between them that stands for exactly this statement. In fact, the relationship refers to the items, not to the notations: The item referred to by the child term has more features than the item referred to by the parent term, and vice versa.)
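The definition of the abstract relationship amounts to a proper-superset test on feature sets. The following sketch makes the features explicit purely for illustration. As noted above, a real thesaurus does not state them, and the feature names here are invented.

```python
# Hypothetical feature sets; a real thesaurus leaves these implicit.
FEATURES = {
    "vertebrate": {"has spine"},
    "mammal": {"has spine", "nurses young"},
    "cat": {"has spine", "nurses young", "retractable claws"},
}


def is_abstract_child(child, parent, features=FEATURES):
    """True if child has all features of parent plus at least one more."""
    # For Python sets, '>' tests for a proper superset.
    return features[child] > features[parent]


print(is_abstract_child("mammal", "vertebrate"))  # True
print(is_abstract_child("vertebrate", "mammal"))  # False
print(is_abstract_child("cat", "cat"))            # False
```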

Partitive relationship means that the items referred to by the child terms are part of the item referred to by the parent term.[9]

An associative relationship is a relationship between a pair of terms that cannot be related hierarchically to each other but are nevertheless related. A pair of antonyms cannot be related hierarchically to each other, so here the matching kind of relationship is the associative one.[10]


"A classification is a system of concepts"[11]  having notions as classes. The classes are labeled by notations[12],  which are language independent.

Main difference between Thesaurus and Classification

The main difference between thesaurus and classification is that a thesaurus selects natural language words instead of setting up cryptic notations.[13]  Thus, unlike a classification, a thesaurus cannot be used language independently. To avoid confusion originating from synonyms and homonyms, a thesaurus applies terminological control, i.e. it keeps homonyms distinct, and sets of synonyms referring to the same item constitute a class. The most common one of these synonyms gets picked and is treated as the descriptor of that very class, i.e. a handle for the class -- one could say "a natural language notation" or simply "a label". To ensure that any piece of information (which shall be referred to by at least one "word" of the thesaurus[14])  always gets referred to by the same "word", the non-descriptor synonyms refer to the descriptor of the class they belong to. Thus, any piece of information gets referred to by descriptors only, and retrieval attempts using non-descriptor synonyms get redirected to the matching descriptors. So, the pieces of information get tagged by descriptor synonyms (for short just "descriptors") and retrievals also use descriptors, so a match between tagging "words" and retrieval "words" becomes much more probable than if both descriptors and non-descriptors were allowed.[15] 
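The effect of terminological control on tagging and retrieval can be sketched as follows. The mapping USE, the function names and the documents are hypothetical, and handling of homonyms is left out to keep the sketch short.

```python
# Non-descriptor synonyms refer to the descriptor of their class.
USE = {"feline": "cat", "house cat": "cat", "backbone": "spine"}


def to_descriptor(word):
    """Redirect a non-descriptor to its descriptor; descriptors map to themselves."""
    return USE.get(word, word)


def index_documents(docs):
    """Tag every document with descriptors only."""
    index = {}
    for doc_id, words in docs.items():
        for word in words:
            index.setdefault(to_descriptor(word), set()).add(doc_id)
    return index


def search(index, word):
    """Redirect the query word to its descriptor before looking it up."""
    return sorted(index.get(to_descriptor(word), set()))


docs = {"doc1": ["feline"], "doc2": ["cat"], "doc3": ["backbone"]}
index = index_documents(docs)
print(search(index, "house cat"))  # ['doc1', 'doc2']
```

Although the indexers used "feline" and "cat" and the searcher typed "house cat", all three end up at the descriptor "cat", so both documents are found.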

Accuracy sacrificed for administrative convenience

Sometimes synonyms and terms that merely refer to similar items are not kept distinct but are simply added to a common class[16].  The reason for this is often to keep administrative effort low.



[1] cf. [Stock 2000], p. 59, par. 1
[2] cf. [Stock 2000], p. 59, par. 2, sentence 1
[3] [Stock 2000], p. 59, par. 3, sentence 2: "Eine Klassifikation ist eine strukturierte Darstellung von Klassen und der zwischen den Klassen bestehenden Begriffsbeziehungen [...]."
[4] cf. [Stock 2000], p. 59, par. 3, sentence 2
[5] cf. [Stock 2000], p. 59, par. 3
[6] cf. [Stock 2000], p. 59, par. 4 (incl headline)
[7] cf. [Stock 2000], p. 60, par. 1
[8] cf. [Stock 2000], p. 61, par. 2, sentences 1–2
[9] cf. [Stock 2000], p. 61, par. 2, sentences 4–5
[10] cf. [Stock 2000], p. 62, par. 1 (incl. bullet list)
[11] [Stock 2000], p. 63, par. 1, sentence 1: "Ein Klassifikationssystem ist ein Begriffssystem [...]."
[12] cf. [Stock 2000], p. 63, par. 1, sentences 1–2
[13] cf. [Stock 2000], p. 76, no. 3.3, par. 2, sentence 2
[14] cf. [Stock 2000], pp. 81–84
[15] cf. [Stock 2000], p. 77, par. 1 (including the example in between)
[16] cf. [Stock 2000], p. 77, par. 1, sentence 2

[Stock 2000]:
Stock, Wolfgang G.
Informationswirtschaft : Management externen Wissens
number of edition unknown
Muenchen, Wien, Oldenbourg, 2000
ISBN 3-486-24897-9


Updates: 20070624: Tagged the posting. Updated the posting style (layout) to my current style, such as more precise word picks, better grammar.