Content Representation With A Twist

Sunday, March 05, 2006

Cleaning up the confusion about thesauri and classifications

To clean up the confusion mentioned earlier, I wrote a short introductory article on thesauri and classifications. It mainly relies on an excerpt from a book by a former information science teacher of mine. -- Here we go:

Introduction

The base assumption is that data, information, and knowledge need to be ordered, and ordered systematically. People who index databases -- so-called "indexers" -- use order systems to make content retrievable. Thus, a user can retrieve information from a database only if she/he knows the tools that were used when that database was created.[1]


Methods of organizing content representation originate from the fields of library science and documentation.[2]


There are two dominant content representation methods in documentation: classification and thesaurus. "A classification is a structured representation of classes and of the notional relationships between the classes."[3] Each class is represented by a notation, and that notation is independent of any natural language (cf. DIN 32705, 2).[4] "Similarly, a thesaurus is also an organized compilation of terms, but in this case their natural language appellations are used (cf. DIN 1463/1, 2)."[5]

Systems of Concepts

Business documentation creates an order. This order refers to notions (= concepts) and to the relations between these notions. "One may assume that this amounts to organizing business terminology [...] by a notional order."[6]


Systems of concepts distinguish between two main kinds of relationship: associative and hierarchical. The latter has two variants: the abstract and the partitive one.[7]

An abstract relationship means that a "child" term has the same features as its parent plus at least one additional feature that the parent term does not have.[8] (The features themselves are not stated anywhere; there is nothing but a parent term, a child term, and a relationship link between them, and that link stands for exactly this: "the child has all the features of the parent, plus at least one more". In fact, the relationship holds between the items, not between the notations: the item referred to by the child term has more features than the item referred to by the parent term, and vice versa fewer.)

A partitive relationship means that the items referred to by the child terms are parts of the item referred to by the parent term.[9]

An associative relationship exists between a pair of terms that cannot be related hierarchically but are related to each other nevertheless. A pair of antonyms, for example, cannot be related hierarchically, so the matching kind of relationship is the associative one.[10]
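
The three kinds of relationship described above can be sketched as a small data structure. This is only an illustration, not something taken from [Stock 2000]: the class name ConceptSystem, its methods, and the sample terms ("vehicle", "car", "engine", "hot", "cold") are all assumptions made for this example. Note that, as stated above, no features are stored anywhere; only the terms and the typed links between them are.

```python
from collections import defaultdict
from enum import Enum


class Relation(Enum):
    ABSTRACT = "abstract"        # hierarchical: the child is a kind of the parent
    PARTITIVE = "partitive"      # hierarchical: the child is a part of the parent
    ASSOCIATIVE = "associative"  # non-hierarchical, e.g. a pair of antonyms


HIERARCHICAL = {Relation.ABSTRACT, Relation.PARTITIVE}


class ConceptSystem:
    """Hypothetical container for terms and the typed links between them."""

    def __init__(self):
        self._children = defaultdict(list)   # parent -> [(child, relation)]
        self._associated = defaultdict(set)  # term -> associated terms

    def relate(self, first, second, relation):
        """For hierarchical relations, `first` is the parent, `second` the child."""
        if relation in HIERARCHICAL:
            self._children[first].append((second, relation))
        else:
            # Associative links have no direction, so store them both ways.
            self._associated[first].add(second)
            self._associated[second].add(first)

    def children(self, parent, relation):
        """Direct children of `parent` linked by the given hierarchical relation."""
        return [c for c, r in self._children.get(parent, []) if r is relation]

    def associated(self, term):
        return sorted(self._associated.get(term, set()))

    def descendants(self, parent):
        """All terms below `parent`, following both hierarchical variants."""
        found, stack = [], [parent]
        while stack:
            current = stack.pop()
            for child, _ in self._children.get(current, []):
                if child not in found:
                    found.append(child)
                    stack.append(child)
        return found


system = ConceptSystem()
system.relate("vehicle", "car", Relation.ABSTRACT)
system.relate("car", "sports car", Relation.ABSTRACT)
system.relate("car", "engine", Relation.PARTITIVE)
system.relate("hot", "cold", Relation.ASSOCIATIVE)

print(system.children("car", Relation.ABSTRACT))
print(system.children("car", Relation.PARTITIVE))
print(system.associated("cold"))
```

Keeping the two hierarchical variants as separate relation types makes it possible to ask for "kinds of a car" and "parts of a car" separately, while the antonym pair stays outside the hierarchy altogether.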


"A classification is a system of concepts"[11] having notions as classes. The classes are labeled by notations,[12] which are language-independent.
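
To illustrate what "language-independent notation" means in practice, here is a minimal sketch. The notations ("A", "A1", "A2"), the labels, and the helper names are invented for this example and do not come from any real classification scheme. The hierarchy is expressed purely through notations; natural language labels are attached per language and can be swapped without touching the structure.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ClassEntry:
    notation: str                 # language-independent identifier of the class
    labels: Dict[str, str]        # language code -> human-readable label
    parent: Optional[str] = None  # notation of the parent class, if any


# Hypothetical mini classification, keyed by notation.
classification = {
    entry.notation: entry
    for entry in [
        ClassEntry("A", {"en": "vehicles", "de": "Fahrzeuge"}),
        ClassEntry("A1", {"en": "cars", "de": "Autos"}, parent="A"),
        ClassEntry("A2", {"en": "bicycles", "de": "Fahrräder"}, parent="A"),
    ]
}


def label(notation, language):
    """Look up the label of a class in the requested language."""
    return classification[notation].labels[language]


def path_to_root(notation):
    """Follow parent links from a class up to the top of the hierarchy."""
    path = []
    current = notation
    while current is not None:
        path.append(current)
        current = classification[current].parent
    return path


print(label("A1", "en"), "/", label("A1", "de"))
print(path_to_root("A2"))
```

An indexer working in German and a user searching in English would both end up at the same notation, which is the point of keeping the notation separate from any label.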

Main difference between Thesaurus and Classification

The main difference between a thesaurus and a classification is that a thesaurus uses natural language words instead of setting up cryptic notations.[13] Thus, unlike a classification, a thesaurus cannot be used independently of language. To avoid confusion arising from synonyms and homonyms, a thesaurus applies terminological control: it keeps homonyms distinct, and each set of synonyms referring to the same item constitutes a class. The most common of these synonyms is picked as the descriptor of that class, i.e. a handle for the class -- one could say "a natural language notation" or simply "a label". To ensure that any piece of information (which shall be referred to by at least one "word" of the thesaurus[14]) always gets referred to by the same "word", the non-descriptor synonyms point to the descriptor of the class they belong to. Thus, any piece of information gets referred to by descriptors only, and retrieval attempts using non-descriptor synonyms get redirected to the matching descriptors. Since the pieces of information are tagged with descriptors and retrievals also use descriptors, a match between tagging "words" and retrieval "words" becomes much more probable than if both descriptors and non-descriptors could be chosen freely.[15]
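
The redirection from non-descriptors to descriptors can be sketched in a few lines. Again, this is only an illustration under assumptions: the classes Thesaurus and Index, their methods, the sample terms, and the document ids are all made up for this example. Homonyms are kept distinct here by adding a qualifier in parentheses, which is one common way to do it.

```python
class Thesaurus:
    """Hypothetical thesaurus: every term points to the descriptor of its class."""

    def __init__(self):
        self._descriptor_of = {}  # lowercased term -> descriptor

    def add_class(self, descriptor, non_descriptors=()):
        for term in (descriptor, *non_descriptors):
            key = term.lower()
            existing = self._descriptor_of.get(key)
            if existing is not None and existing != descriptor:
                # The same word in two classes would be an unresolved homonym.
                raise ValueError(f"{term!r} already belongs to {existing!r}")
            self._descriptor_of[key] = descriptor

    def descriptor(self, term):
        """Redirect any known term to the descriptor of its class."""
        try:
            return self._descriptor_of[term.lower()]
        except KeyError:
            raise KeyError(f"{term!r} is not a thesaurus term") from None


class Index:
    """Tags documents with descriptors only and searches via descriptors only."""

    def __init__(self, thesaurus):
        self._thesaurus = thesaurus
        self._documents = {}  # descriptor -> set of document ids

    def tag(self, doc_id, term):
        descriptor = self._thesaurus.descriptor(term)
        self._documents.setdefault(descriptor, set()).add(doc_id)

    def search(self, term):
        descriptor = self._thesaurus.descriptor(term)
        return sorted(self._documents.get(descriptor, set()))


thesaurus = Thesaurus()
thesaurus.add_class("car", ["automobile", "motorcar"])
thesaurus.add_class("bank (finance)", ["credit institution"])
thesaurus.add_class("bank (river)", ["riverbank"])

index = Index(thesaurus)
index.tag("doc-1", "automobile")    # the indexer used a non-descriptor
index.tag("doc-2", "car")           # the indexer used the descriptor
index.tag("doc-3", "riverbank")

print(index.search("motorcar"))     # the searcher uses yet another synonym
print(index.search("credit institution"))
```

Because both tagging and searching go through the same redirection, the indexer and the searcher do not need to agree on a synonym in advance; they only need to use some term of the same class.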

Accuracy sacrificed to administrative convenience

Sometimes terms that merely refer to similar items are not kept distinct from the synonyms but are simply added to the same class.[16] The reason for this is often to keep the administrative expense low.



[1] cf. [Stock 2000], p. 59, par. 1
[2] cf. [Stock 2000], p. 59, par. 2, sentence 1
[3] [Stock 2000], p. 59, par. 3, sentence 2: "Eine Klassifikation ist eine strukturierte Darstellung von Klassen und der zwischen den Klassen bestehenden Begriffsbeziehungen [...]."
[4] cf. [Stock 2000], p. 59, par. 3, sentence 2
[5] cf. [Stock 2000], p. 59, par. 3
[6] cf. [Stock 2000], p. 59, par. 4 (incl. headline)
[7] cf. [Stock 2000], p. 60, par. 1
[8] cf. [Stock 2000], p. 61, par. 2, sentences 1–2
[9] cf. [Stock 2000], p. 61, par. 2, sentences 4–5
[10] cf. [Stock 2000], p. 62, par. 1 (incl. bullet list)
[11] [Stock 2000], p. 63, par. 1, sentence 1: "Ein Klassifikationssystem ist ein Begriffssystem [...]."
[12] cf. [Stock 2000], p. 63, par. 1, sentences 1–2
[13] cf. [Stock 2000], p. 76, no. 3.3, par. 2, sentence 2
[14] cf. [Stock 2000], pp. 81–84
[15] cf. [Stock 2000], p. 77, par. 1 (including the example in between)
[16] cf. [Stock 2000], p. 77, par. 1, sentence 2

[Stock 2000]:
Stock, Wolfgang G.
Informationswirtschaft : Management externen Wissens
number of edition unknown
Muenchen, Wien, Oldenbourg, 2000
ISBN 3-486-24897-9


Updates: 20070624: Tagged the posting. Updated the posting style (layout) to my current style, such as more precise word picks, better grammar.
