Content Representation With A Twist

Tuesday, April 04, 2006

previous description of this blog

Today I changed the description of this blog. The previous one was:
    The Model of Meaning is a knowledge representation approach that is meant to make it possible to skip the training phase of an artificial neural network. I started research in this field some time ago, and I am very interested in collaborating with researchers in related fields such as artificial intelligence, neurobiology, etc. Software applications of the findings of this project would also be very much appreciated.


Updates: 20070624: Tagged the posting. Updated the posting style (layout) to my current style, such as using blockquotes when appropriate, more precise word picks, better grammar.

Sunday, March 05, 2006

Cleaning up the confusion about thesauri and classifications

To clean up the confusion mentioned earlier, I wrote a short introductory article on thesauri and classifications. It mainly relies on an excerpt taken from a book by a former information science teacher of mine. Here we go:

Introduction

The basic assumption is that data, information, and knowledge need to be ordered, and ordered systematically. People who index databases, so-called "indexers", use ordering systems to make content retrievable. Thus, a user can retrieve information from a database only if she or he knows the tools that were used when that very database was created.[1]


Methods of organizing content representation originate from the fields of library science and documentation.[2]


There are two dominant content representation methods in documentation: classification and thesaurus. "A classification is a structured representation of classes and of the notional relationships between the classes."[3]  Any class is represented by a notation, whereby the notation is independent of any natural language (cf. DIN 37205, 2).[4]  "Similarly, a thesaurus also is an organized compilation of terms, but in this case their natural language appellations are used (cf. DIN 1463/1, 2)."[5] 

Systems of Concepts

Business documentation creates an order. This order refers to notions (= concepts) and relations between these notions. "One may assume that this amounts to organizing business terminology [...] in a notional order."[6]


Systems of concepts distinguish two main kinds of relationship: associative and hierarchical. The latter has two variants: the abstract and the partitive variant.[7]

An abstract relationship means that a "child" term has the same features as its parent plus at least one additional feature that the parent term does not have.[8]  (The features are not stated anywhere; there is nothing but a parent term, a child term, and a relationship link between them, and that link represents exactly this statement. In fact, the relationship refers to the items, not to the notations: the item referred to by the child notation has more features than the item referred to by the parent notation.)
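The abstract relationship can be illustrated with plain feature sets. This is only a sketch: a real classification does not state the features anywhere, so the feature sets and the function name `is_abstract_child` are made up for this illustration.

```python
def is_abstract_child(parent_features: set[str], child_features: set[str]) -> bool:
    """True if the child has every parent feature plus at least one more."""
    # "<" on sets is the proper-subset test: all parent features are present
    # in the child, and the child has at least one feature the parent lacks.
    return parent_features < child_features


# Hypothetical feature sets, chosen only for the example.
mammal = {"animal", "suckles its young"}
dog = {"animal", "suckles its young", "barks"}

print(is_abstract_child(mammal, dog))  # True
print(is_abstract_child(dog, mammal))  # False
print(is_abstract_child(dog, dog))     # False: no additional feature
```

The proper-subset test mirrors the "same features plus at least one additional feature" wording; two terms with identical features would not be parent and child.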

A partitive relationship means that the items referred to by the child terms are parts of the item referred to by the parent term.[9]

An associative relationship is a relationship between a pair of terms that cannot be related to each other hierarchically but are nevertheless related. A pair of antonyms, for example, cannot be related hierarchically, so the matching kind of relationship is the associative one.[10]


"A classification is a system of concepts"[11]  having notions as classes. The classes are labeled by notations[12],  which are language independent.

Main difference between Thesaurus and Classification

The main difference between a thesaurus and a classification is that a thesaurus uses natural language words instead of cryptic notations.[13]  Thus, unlike a classification, a thesaurus cannot be used independently of language. To avoid the confusion caused by synonyms and homonyms, a thesaurus applies terminological control: it keeps homonyms distinct, and the synonyms that refer to the same item form a class. The most common of these synonyms is picked as the descriptor of the class, i.e. a handle for the class; one could say "a natural language notation" or simply "a label". To ensure that any piece of information (which shall be referred to by at least one "word" of the thesaurus[14])  always gets referred to by the same "word", the non-descriptor synonyms point to the descriptor of the class they belong to. Thus, any piece of information is referred to by descriptors only, and retrieval attempts that use non-descriptor synonyms are redirected to the matching descriptors. Since the pieces of information are tagged with descriptors and retrieval also uses descriptors, a match between tagging "words" and retrieval "words" becomes much more probable than if both descriptors and non-descriptors could be chosen.[15]
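The terminological control described above can be sketched in a few lines. This is an illustration, not a method taken from [Stock 2000]: the synonym classes, the usage counts that decide which synonym is "most common", and the names `build_redirects` and `index_terms` are all assumptions.

```python
from collections import Counter


def build_redirects(synonym_classes, usage_counts):
    """Map every term of every synonym class to the descriptor of its class."""
    redirects = {}
    for terms in synonym_classes:
        # Assumption: "most common" means the highest usage count.
        descriptor = max(terms, key=lambda term: usage_counts[term])
        for term in terms:
            redirects[term] = descriptor
    return redirects


# Hypothetical data, for illustration only.
usage_counts = Counter({"car": 50, "automobile": 12, "motorcar": 2,
                        "bicycle": 30, "bike": 9})
synonym_classes = [["automobile", "car", "motorcar"], ["bike", "bicycle"]]
redirects = build_redirects(synonym_classes, usage_counts)


def index_terms(words):
    """Tag a piece of information with descriptors only."""
    return sorted({redirects[word] for word in words})


print(redirects["motorcar"])                # car
print(index_terms(["automobile", "bike"]))  # ['bicycle', 'car']
```

Because indexing and retrieval both pass through the same `redirects` table, a non-descriptor never ends up as a tag.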

Accuracy sacrificed to administrative convenience

Sometimes terms that refer to merely similar items are not kept distinct from the synonyms but are simply added to a common class.[16]  The reason is often to keep the administrative effort low.



[1] cf. [Stock 2000], p. 59, par. 1
[2] cf. [Stock 2000], p. 59, par. 2, sentence 1
[3] [Stock 2000], p. 59, par. 3, sentence 2: "Eine Klassifikation ist eine strukturierte Darstellung von Klassen und der zwischen den Klassen bestehenden Begriffsbeziehungen [...]."
[4] cf. [Stock 2000], p. 59, par. 3, sentence 2
[5] cf. [Stock 2000], p. 59, par. 3
[6] cf. [Stock 2000], p. 59, par. 4 (incl headline)
[7] cf. [Stock 2000], p. 60, par. 1
[8] cf. [Stock 2000], p. 61, par. 2, sentences 1–2
[9] cf. [Stock 2000], p. 61, par. 2, sentences 4–5
[10] cf. [Stock 2000], p. 62, par. 1 (incl. bullet list)
[11] [Stock 2000], p. 63, par. 1, sentence 1: "Ein Klassifikationssystem ist ein Begriffssystem [...]."
[12] cf. [Stock 2000], p. 63, par. 1, sentences 1–2
[13] cf. [Stock 2000], p. 76, no. 3.3, par. 2, sentence 2
[14] cf. [Stock 2000], pp. 81–84
[15] cf. [Stock 2000], p. 77, par. 1 (including the example in between)
[16] cf. [Stock 2000], p. 77, par. 1, sentence 2

[Stock 2000]:
Stock, Wolfgang G.
Informationswirtschaft : Management externen Wissens
number of edition unknown
Muenchen, Wien, Oldenbourg, 2000
ISBN 3-486-24897-9


Updates: 20070624: Tagged the posting. Updated the posting style (layout) to my current style, such as more precise word picks, better grammar.

Saturday, March 04, 2006

Main thesis of MOM

The main thesis of MOM is that items are represented in the mind, and can be represented as data, by the features of these very items.

Where to stop going deeper and deeper into details

The first objection to this assertion often is that it would inevitably require an arbitrary decision about the level of granularity at which to stop going into further detail: if a dog's features are a head, a neck, a tail, a torso, and four legs, then its head's features are, at least, that it has two eyes, two ears, a mouth, and a nose; the features of the mouth are, at least, some teeth, a tongue, saliva, etc. The teeth's features are enamel, root, ... and so on. So, where should one stop going into more and more detail? But if an external mind has to decide where to stop, the approach has to be wrong, since there is no external mind telling human minds where to stop.

Furthermore, if such a decision is made, i.e. if one stops going further into detail, then the representation cannot be complete. Thus, the represented entity does not match the real one, and the representation has to be considered invalid. On the other hand, if there is no stop, i.e. no details barrier, the representation would have to be as sophisticated as reality, the universe itself. In other words: it is impossible to represent things following this approach, since no one has the resources to represent the universe as a whole.


True. But on the other hand, everyone knows that humans are fallible, and most people know from first-hand experience that there is often one thing or another they did not know yet. That humans are able to learn was considered something special for a long time. Additionally, most people have seen that children who grew up with small dogs in their neighbourhood confuse a cat with the dogs they already know when they first encounter one: they call the cat by the common nickname for dogs ("Wau-Wau" in Germany).

Therefore I assume that the representation of an item initially consists of only as many details as are necessary to differentiate the item from another one: a dog from a human, but, since a cat has never been experienced, not a dog from a cat. Once a cat is experienced, more features get added to differentiate the cat from the dog as well. That is: I assume that there is in fact a details barrier, but I do not assume that it is imposed arbitrarily by an external mind. Instead, I assume the details barrier results from the attempt to save resources: every represented feature may cost resources, so any unnecessarily represented feature would cause unnecessary cost. Therefore only the necessary features are stored, which results in the details barrier.
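To make the idea of a resource-driven details barrier more tangible, here is a toy sketch. It is not the Model of Meaning itself and not an implementation taken from it: the `WORLD` table (a stand-in for re-observing reality), the feature lists, and the functions `learn` and `recognise` are assumptions made for this example.

```python
# Stand-in for reality: each item with its features, most conspicuous first.
WORLD = {
    "human": ["two legs", "speaks"],
    "dog": ["four legs", "tail", "barks"],
    "cat": ["four legs", "tail", "meows"],
}


def ensure_differs(stored, a, b):
    """Make sure a's stored features include one that item b does not have."""
    if any(f not in WORLD[b] for f in stored[a]):
        return  # already distinguishable, so no extra storage cost
    for f in WORLD[a]:
        if f not in WORLD[b]:
            stored[a].append(f)  # store one more feature, and only one
            return


def learn(stored, name):
    """Add an item, storing only the features needed to tell it from known items."""
    stored[name] = []
    for other in list(stored):
        if other != name:
            ensure_differs(stored, name, other)
            ensure_differs(stored, other, name)


def recognise(stored, observed):
    """Names of known items whose stored features all occur in the observation."""
    seen = set(observed)
    return [n for n, feats in stored.items() if feats and set(feats) <= seen]


memory = {}
learn(memory, "human")
learn(memory, "dog")
print(recognise(memory, WORLD["cat"]))  # ['dog']: the cat is taken for a dog
learn(memory, "cat")
print(recognise(memory, WORLD["cat"]))  # ['cat']
print(memory["dog"])                    # ['four legs', 'barks']
```

In this sketch the dog is first stored with a single feature, which is enough to tell it from a human; "barks" is only added once a cat has been experienced.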

This raises another question: is it enough to store only the necessary features of an item rather than all of them? I assume: yes, it is, since reality provides a backup. It might even be enough to represent only the features an item has, without storing how these features are arranged: if you treat a dog as head, neck, tail, torso, four legs, can bark, reality ensures that there is no dog that has its legs and tail attached to its head and its neck and torso to its tail. So, for a start, this unordered composition approach suffices. Deeper insight into the Model of Meaning offers ways to represent structure as well.


Updates: 20070624: Tagged the posting. Updated the posting style (layout) to my current style, such as using blockquotes when appropriate, more precise word picks, better grammar.

terms confused

In my previous posts [1] I attempted to describe the problem, approaching it from the information science point of view. I have been developing MOM since the very first semesters of my studies, so I had already dropped the usual information science approaches to organizing information by the time I was taught them. Hence I am not strong in these. Having to do research on thesauri for my diploma thesis was a problem, but to keep it from becoming a problem for the thesis' assessment I did intensive web research to get the thesaurus terms and their implications clear. That effort was rewarded with an A (i.e. excellent) rating for the thesis. But it also left gaps regarding the other kinds of term-organizing ontologies and terminologies.

For you readers of this blog, I attempted to guide you from any kind of term-organizing ontology to MOM; but in fact I am not well versed in the field of ontologies. So I probably confused the terms. I did research to get the terms clear, but all I found was that the terms themselves are not considered clearly defined yet. So my only hope for a clear approach would be to determine the currently most preferred set of definitions of the terms mentioned.

But since I am looking for a job, I do not have as much time as I would like for that kind of research. So I am dropping this approach for now. If you can define the terms involved, I would greatly appreciate it if you contributed your definition(s) in the comments of this post or of any of the previous ones!


[1] referred previous posts:


Updates: 20070624: Tagged the posting. Updated the posting style (layout) to my current style, such as using blockquotes when appropriate, more precise word picks, better grammar. Removed my workaround for backlinks blogger.com didn't support in earlier times. Now, backlinks are there, therefore the bypass can be dropped.


Updates: 20070624: Tagged the posting. Updated the posting style (layout) to my current style, such as using blockquotes when appropriate, more precise word picks, better grammar.

Saturday, February 25, 2006

Approach from Tagging

ontology

Ontology is a form of classification which may be thought of as "conceptual" relationships. Whereas in a taxonomy terms may be classified together as "fruit", in an ontology the conceptual relationship may be "fruit which are used in pies" or "growing fruit for sale". Ontological relationships are inherently self-describing. Ontologies are the backbone of the semantic web, as they provide multiple links to data and therefore can support search and insightful navigation. — Source

The problem I see: the meaning of "fruit which are used in pies" is not accessible to the machine. The machine depends on this kind of "knowledge" being given to it and cannot go beyond the limits of what it has been given. In short: the machine is unable to determine by itself whether or not a given item belongs to a "fruit which are used in pies" relationship if this is not stated anywhere. The same holds for the different kinds of thesaurus relationships ("broader term", "narrower term", etc.), which I originally criticised.


I had avoided touching the semantic web effort, since I feared it would take this kind of approach. Now, when I had to look up taxonomy, classification, terminology, etc. to avoid mixing them up, I came across the above statement, which says that the semantic web in fact takes this approach. Sad.

In my opinion, a machine needs a way to gain the competence to decide: knowledge to base the decision on, knowledge about knowledge, and knowledge about knowledge that enables it to verify its own knowledge. As long as there are parts of the machine's knowledge that the machine cannot access, it has no chance of becoming independent of external minds, e.g. humans.

I don't think that the best way to achieve that goal is to stack more and more complexity/abstraction layers on top of attempts that already do not work. Instead, I think it is necessary to find a way to make the most basic knowledge accessible to the machine, so that the machine does not have to depend on externally given knowledge. I think that once that basic goal is achieved, more complex kinds of content/knowledge can be built on top.


Updates: 20070624: Tagged the posting. Updated the posting style (layout) to my current style, such as using blockquotes when appropriate, more precise word picks, better grammar.

Friday, February 24, 2006

Usability problem in the widespread 'is a' approach

As introduced earlier, taxonomies such as thesauri and classifications organize terms in a net similar to a tree. Hierarchically, the terms are ordered mainly by "is a" relationships, seldom by "has a" ones. There are also relationships between terms that have "something" to do with each other but cannot validly be ordered hierarchically, like bird and bird cage. Some taxonomies enforce one parent term per node, i.e. a structure that can easily be identified as mainly a tree, with the exception of some cross-reference-like associative relationships. Other taxonomies do not enforce the one-parent-per-term rule, so that they may easily look more like a net than like a tree. The former are called "mono-hierarchical", the latter "poly-hierarchical".
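The difference between the two kinds of hierarchy comes down to how many parents a term may have. A minimal sketch, assuming a taxonomy is given as a mapping from each term to its list of parent (broader) terms; the function name `hierarchy_kind` and the sample terms are made up for the illustration:

```python
def hierarchy_kind(parents):
    """Classify a taxonomy by the number of parents its terms have.

    `parents` maps each term to the list of its broader (parent) terms.
    """
    if all(len(p) <= 1 for p in parents.values()):
        return "mono-hierarchical"  # at most one parent per term: a tree
    return "poly-hierarchical"      # some term has several parents: a net


# Hypothetical sample taxonomies.
tree = {"animal": [], "mammal": ["animal"], "dog": ["mammal"]}
net = {"animal": [], "mammal": ["animal"], "pet": ["animal"],
       "dog": ["mammal", "pet"]}

# Associative relationships sit outside the hierarchy and do not affect the check.
associative = [("bird", "bird cage")]

print(hierarchy_kind(tree))  # mono-hierarchical
print(hierarchy_kind(net))   # poly-hierarchical
```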

What's the problem?

Taxonomies do nothing more than relate terms to other terms. The task of defining the terms they leave to dictionaries.

This presupposes that whoever uses a taxonomy is already something of an expert in the field organized by the taxonomy. If you are looking for a term and do not know where to look it up, you are lost.

Say some part fell off your car, and since then the car doesn't move anymore. You don't know the term for that part, and the part is too heavy for you to just take it to the garage. Because of its primarily "is a" nature, a taxonomy is of no use to you here. The other option is to go to the garage without the part and attempt to describe it to a mechanic there. (So far, this follows what my university information science teacher taught me.)

So, the problem of a taxonomy is that it doesn't support the most straightforward approach to identifying an item: selecting the most conspicuous properties of the item that the taxonomy (or, more broadly, any knowledge store, e.g. a garage mechanic's memory) already knows about.


Updates: 20070624: Tagged the posting. Updated the posting style (layout) to my current style, such as using blockquotes when appropriate, more precise word picks, better grammar.

Thursday, February 23, 2006

From the Information Science local point of view (upgraded)

Information Science and its predecessor sciences, like documentation or library science, tackle one big problem in information: remaining able to retrieve pieces of information once they have been stored.

In former times the pieces of information were mainly physical objects, i.e. not computer-indexable. Books are an example.

The common approach from the information science point of view is to assign a set of keywords to each of the books: when you want to retrieve one of them later, you choose some of the keywords and look them up in a catalogue, which in turn refers to the books themselves.

To be able to handle this all, you need at least three kinds of storage:
  1. a storage for the books, e.g. a kind of library, organized in a way to stay able to at least locate the shelf a particular book is placed in
  2. a storage for the catalogue
  3. and, most important: a storage for the keywords.
The keywords themselves have to be stored somewhere. If you neglect this part of the task, one day you will apply one keyword to a book and another day another keyword, although both mean the same, i.e. are synonyms.

So, what results from that?

Assume you associated two books X and Y, very similar in content, with two synonymous but different keywords, A and B. One day you want to know something about a topic that is covered by both books X and Y, but you don't know that. You go directly to the catalogue. You pick a search keyword that happens to come to mind. Say B. You look up the appropriate catalogue card and find Y. That there is a closely related book X you never even become aware of. So you fetch the Y book, but the X book remains on the shelf. Possibly it would have been valuable to find X as well.


Therefore the keywords themselves get stored too.

The main goal of a keyword storage is similar to that of the other storages for pieces of information: to stay able to find the wanted contents, i.e. the keywords, and to find exactly the keywords wanted. "Keywords wanted" are those that might be applied to one or more books.

The appropriate tool for storing keywords or, more precisely, terms (as in "search term") is a terminology. It lists every word that was applied to at least one piece of information (e.g. a book) in the storage for pieces of information (e.g. a library). (Conversely, to keep the administrative workload small, it is advisable to choose only keywords already listed in the terminology when associating keywords with books.)

In a simple case, a terminology might be an alphabetical list of terms. A taxonomy is even better: for each of the terms it offers help with orientation. Usually there are broader and narrower variants of terms: a mammal is treated as broader than a dog or cat or cow or horse or anything else that is a mammal. (In fact, taxonomies refer to items but list the labels of the items. Taxonomies are closely related to ontologies.)

So, a taxonomy offers "is a" relationships to identify the location of a given term within the whole taxonomy. Less common than "is a" relationships are "has a" relationships, like the ones between a car and things like wheels, motor, front window, doors, etc. Both of these are called hierarchical relationships.

My diploma thesis was about the thesaurus kind of taxonomy, so I am currently not sure whether this applies to the classification kind as well: there is at least one more kind of relationship, the associative one. It relates terms to each other that do not belong to a hierarchical order but somehow have something to do with each other, like bird and bird cage. (Admittedly, they might be related using a "has a" relationship, but I never came across such an assignment.)

Synonyms in taxonomies get treated a special way

Synonyms in taxonomies get treated in a special way: they get collected into a single class. Each of the items of that class is synonymous (or treated as synonymous; compare the administrators' cheats mentioned in the glossary) with every other item of the class.

In classifications there are just classes related to each other, each class standing for the terms that belong to it. In thesauri there are no such classes. During thesaurus creation, sets of synonymous terms get identified. One "most significant/common" term gets chosen to represent all the other synonyms. This "most significant/common" one is called the "descriptor", while the others become non-descriptors.

Why all the fuss about the synonym details?

If you look up a synonym, you get redirected to the class or descriptor the synonym belongs to. None of the books/other pieces of information ever gets keyworded with a non-descriptor synonym. That solves the problem of searching for the X and Y books mentioned above one day by the search term A and another day by B: either A points to B, or B to A, or both to a third term, say C. And both books are associated with the "most significant/common" term, e.g. C. So, whether you choose A or B as your search term, you always find all the relevant pieces of information/books, e.g. X and Y.
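The X/Y example can be replayed with a small catalogue sketch. It reuses the names from the text (books X and Y, keywords A and B, descriptor C); the data structures and the functions `build_catalogue` and `search` are assumptions made for the illustration.

```python
from collections import defaultdict


def build_catalogue(assignments, redirect=None):
    """Build keyword -> books, optionally replacing synonyms by their descriptor."""
    catalogue = defaultdict(set)
    for book, keyword in assignments:
        key = redirect.get(keyword, keyword) if redirect else keyword
        catalogue[key].add(book)
    return catalogue


def search(catalogue, keyword, redirect=None):
    """Look up a search term, following the redirect to the descriptor if given."""
    key = redirect.get(keyword, keyword) if redirect else keyword
    return sorted(catalogue.get(key, set()))


# Book X was keyworded with A, book Y with the synonym B.
assignments = [("X", "A"), ("Y", "B")]

# Without a keyword storage: searching for B misses book X.
plain = build_catalogue(assignments)
print(search(plain, "B"))  # ['Y']

# With terminological control: A and B both point to the descriptor C.
redirect = {"A": "C", "B": "C"}
controlled = build_catalogue(assignments, redirect)
print(search(controlled, "A", redirect))  # ['X', 'Y']
print(search(controlled, "B", redirect))  # ['X', 'Y']
```

The same redirect table is applied when the books are keyworded and when they are searched for, which is what makes both search terms lead to both books.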


So much for the common part of terminologies.


But there is a usability problem in the widespread "is a" approach.

Glossary

  • to retrieve
    • to find again
  • class
    • a set of synonymous words
  • synonym
    • a word meaning the same as another word
    • sometimes there is a difference between them both, but the one applying them doesn't notice
    • terminology administrators sometimes think it is not worth the effort to keep "very similar" terms distinct, so they merge them into a single class by claiming the very words were synonyms
  • catalogue card
    • associations between keywords are stored on catalogue cards which, as a whole, make up the catalogue itself



Updates: 20070624: Tagged the posting. Updated the posting style (layout) to my current style, such as using blockquotes when appropriate, more precise word picks, better grammar.

Saturday, February 18, 2006

From the Information Science local point of view

Information Science and its predecessor sciences, like documentation or library science, tackle one big problem in information: remaining able to retrieve pieces of information once they have been stored.

In former times the pieces of information were mainly physical objects, i.e. not computer-indexable. Books are an example.

The common approach from the information science point of view is to associate each of the books with a set of keywords: when you want to retrieve one of them later, you choose some of the keywords and look them up in a catalogue, which in turn refers to the books themselves.

To be able to handle this all, you need at least three kinds of storage:
  1. a storage for the books, e.g. a kind of library, organized in a way to stay able to at least locate the shelf a particular book is placed in
  2. a storage for the catalogue
  3. and, most important: a storage for the keywords.
The keywords themselves have to be stored somewhere. If you neglect this part of the task, one day you will apply one keyword to a book and another day another keyword, although both mean the same, i.e. are synonyms.

So, what results from that?

Assume you associated two books X and Y, very similar in content, with two synonymous but different keywords, A and B. One day you want to know something about a topic that is covered by both books X and Y, but you don't know that. You go directly to the catalogue. You pick a search keyword that happens to come to mind. Say B. You look up the appropriate catalogue card and find Y. That there is a closely related book X you never even become aware of. So you fetch the Y book, but the X book remains on the shelf. Possibly it would have been valuable to find X as well.

Therefore the keywords themselves get stored too.


The main goal of a keyword storage is similar to that of the other storages for pieces of information: to stay able to find the wanted contents, i.e. the keywords, and to find exactly the keywords wanted. "Keywords wanted" are those that might be applied to one or more books.

The appropriate tool for storing keywords or, more precisely, terms (as in "search term") is a terminology. It lists every word that was applied to at least one piece of information (e.g. a book) in the storage for pieces of information (e.g. a library). (Conversely, to keep the administrative workload small, it is advisable to choose only keywords already listed in the terminology when associating keywords with books.)


In a simple case, a terminology might be an alphabetical list of terms. For each of the terms it offers help with orientation: usually there are broader and narrower variants of terms. A mammal is treated as broader than a dog or cat or cow or horse or anything else that is a mammal. (In fact, terminologies refer to items but list the labels of these.)

So, a terminology offers "is a" relationships to identify the location of a given term within the whole terminology. Less common than "is a" relationships are "has a" relationships, like the ones between a car and things like wheels, motor, front window, doors, etc. These kinds of relationships are called hierarchical relationships.

My diploma thesis was about the thesaurus kind of terminology, so I am currently not sure whether this applies to the classification kind as well: there is at least one more kind of relationship, the associative one. It relates terms to each other that do not belong to a hierarchical order but somehow have something to do with each other, like bird and bird cage. (Admittedly, they might be related using a "has a" relationship, but I never came across such an assignment.)

Synonyms in terminologies get treated a special way


Synonyms in terminologies get treated in a special way: they get collected into a single class. Each of the items of that class is synonymous (or treated as synonymous; compare the administrators' cheats mentioned in the glossary) with every other item of the class.

In the terminologies called classifications there are just classes related to each other, each class standing for the terms that belong to it. In the thesaurus kind of terminology there are no such classes. During thesaurus development, sets of synonymous terms get identified. One "most significant/common" term gets chosen to represent all the other synonyms. This "most significant/common" one is called the "descriptor", while the others become non-descriptors.

Why all the fuss about the synonym details?


If you look up a synonym, you get redirected to the class or descriptor the synonym belongs to. None of the books/other pieces of information ever gets keyworded with a non-descriptor synonym.

That solves the problem of searching for the X and Y books mentioned above one day by the search term A and another day by B: either A points to B, or B to A, or both to a third term, say C. And both books are associated with the "most significant/common" term, e.g. C. So, whether you choose A or B as your search term, you always find all the relevant pieces of information/books, e.g. X and Y.

So much for the common part of terminologies.

But there is a usability problem in the widespread "is a" approach.

Glossary


to retrieve - to find again

class - a set of synonymous words

catalogue card - associations between keywords are stored on catalogue cards which, as a whole, make up the catalogue itself

synonym - a word meaning the same as another word
  • sometimes there is a difference between them both, but the one applying them doesn't notice
  • taxonomy administrators sometimes think it is not worth the effort to keep "very similar" terms distinct, so they merge them into a single class by claiming the very words were synonyms



Updates: 20070624: Tagged the posting. Updated the posting style (layout) to my current style, such as using blockquotes when appropriate, more precise word picks, better grammar.

guide lines for this blog

I am going to guide you to the interesting points of the Model of Meaning. To that end I keep the introduction easy to comprehend but nevertheless complete, so that after reading it you will understand the model well enough to discuss it competently.



Updates: 20070624: Tagged the posting. Removed my workaround for backlinks blogger.com didn't support in earlier times. Now, backlinks are there, therefore the bypass can be dropped.

Friday, February 10, 2006

"ia: organizing notions" now removed

The prior blog "ia: organizing notions" has now been removed. All links there should be dead by now.



Updates: 20070624: Tagged the posting. Removed my workaround for backlinks blogger.com didn't support in earlier times. Now, backlinks are there, therefore the bypass can be dropped.
ia: organizing notions has been merged into this blog now



Updates: 20070624: Tagged the posting.

init

Just started this blog; I am now merging the prior blog, ia: organizing notions, into it. It was described in the following way:
    A blog about my approach to organizing notions/concepts/ideas. As far as I know, this is a topic of information architecture as well.
where "notions" was downright misleading: a wrong hint someone gave me without considering the whole matter I am working on.



Updates: 20070624: Tagged the posting. Updated the posting style (layout) to my current style, such as using blockquotes when appropriate, more precise word picks, better grammar.


[Merged from (the now removed) ia: organizing notions:] I've been working on this topic for a long time, so now I want to get it clear, reliably substantiated, and ready for discussion. I've made several attempts to create a single blog for it, like knowledge, knowledge.meta and find using notions (partially English, partially German).

I was not determined to actually publish it while still working on it, but I have been working on it for such a long time now that I am sure I cannot expect to finish it in the near future. Also, I no longer have as much time to work on it as I had when I was a student. I have taken the mental step of being willing to let a potential employer get its hands on it (well, in fact, if the employer is a cute search engine provider), so it no longer hurts to publish it while it is not yet finished.

The place for that shall be here. Well, I'm going to change the URL of this blog soon, but this is where the description of the model shall happen, and any discussion of it, too.



Updates: 20070624: Tagged the posting. Removed my workaround for backlinks blogger.com didn't support in earlier times. Now, backlinks are there, therefore the bypass can be dropped.