Content Representation With A Twist

Wednesday, February 21, 2007

About the Simple Set Core

The Simple Set Core project is about a set engine. The aim of this set engine is to recognize ("identify") items -- sets of features -- by only some of their features, to store these items and their features recursively as directed graphs, and to reorganize these graphs so that implicit items/features become visible while the graph as a whole becomes less dense and thus easier to handle.

Background

The set engine is part of a larger project, the Model of Meaning. Its approach is that notions ("meanings") consist of smaller notions (or of raw data like "light sensed"). Unlike common approaches, the Model of Meaning drops the familiar "is a" relationship between things: the assumption of the model is that, although a car is a kind of vehicle, the vehicle is a part of the car. -- At first glance this is hard to comprehend, but aren't you merely thinking of the car and the vehicle? Then neither is physical, so there is no physical problem of "cramming" a merely imagined vehicle into a car, which is itself just imagination.

Benefit For The Web

With the Model of Meaning in mind, the set engine alone could do the web a big service: if tags were related to other tags by "is part of" relationships, first, the tagging folks could stop mentioning implications; second, people searching for content would get the matching content even when looking only for low-level implications of that very content.
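
To make this concrete, here is a minimal sketch (in Python, not the project's actual code); the tag names and the part_of mapping are invented for illustration:

```python
# Expanding a tag query through "is part of" relationships.
# part_of[x] lists the tags that x is a part of (in the MOM sense).
part_of = {
    "engine": ["car"],
    "wheel": ["car", "bicycle"],
    "car": ["car transporter"],
}

def implications(tag):
    """Collect every tag reachable through 'is part of' edges."""
    seen, stack = set(), [tag]
    while stack:
        current = stack.pop()
        for parent in part_of.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# A search for "wheel" also matches content tagged "car", "bicycle",
# or "car transporter" without the author listing any of them.
print(implications("wheel"))
```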

Perspective

Also, if there is a way to recognize items by just a part of their features, tag graphs could be integrated with each other automatically, simply because the items mentioned in the graphs could themselves be recognized by their features; and dropping the tags, replacing them with notion identifiers ("IDs") to which the tags become attached, might overcome even the language barrier, once and for all.

      
Updates:
none so far

Now, MOM will be developed publicly

Just a few minutes ago, I queued the core set engine for approval as a GPL project at gna.org.

I'll file the long description of the SSC project soon.

      
Updates:
none so far

If word processors could know the words' meanings...

If complex content gets built out of atomic pieces (of content?), how should the complex content ever become clear?

Think of a word processor. The whole document is built out of words, and these out of letters. There is no machine-understandable content attached to the words; the words remain just strings of letters. The machine has no idea about the meaning of these strings of letters, nor of the sequences of words forming sentences and the whole document.

If at least the words had concepts attached, it might become much simpler for a machine to figure out the content of the whole document, especially since the words themselves are already somewhat related to each other by the underlying grammar.

To make at least the content of the words accessible, one approach could be to integrate a request-for-explanation mechanism into the spell checker framework: if the content of a typed-in word is unknown, the user is asked to explain it.
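
A rough sketch of that idea; the concept store, the function name, and the console prompt below are hypothetical placeholders, not an existing spell checker API:

```python
# If a typed-in word has no concept attached, ask the user to explain it.
known_concepts = {"car": "node:42", "wheel": "node:7"}

def check_word(word):
    """Return the concept attached to a word, asking the user if unknown."""
    key = word.lower()
    if key not in known_concepts:
        # A real word processor would open a dialog here instead.
        explanation = input(f"'{word}' is unknown -- please explain it: ")
        print(f"(Explanation noted for later parsing: {explanation})")
        known_concepts[key] = f"node:new:{key}"
    return known_concepts[key]
```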

      
Updates:
none so far

Tuesday, February 20, 2007

What is MOM?

MOM is the acronym for "Model of Meaning". It is founded on one basic assumption: that a notion always consists of simpler notions or, alternatively, of raw data -- like "light sensed".

Like ontologies, the MOM body of notions is represented as a graph, a heavily wired network, where the notions make up the nodes and the edges indicate which notion consists of which other notions. Unlike common ontologies, MOM avoids edges that inject knowledge unreachable to a system that gets the network as its knowledge base. For example, in thesauri there are edges like "this is an antonym of that" and "this is a broader term than that". And most of the thesaurus edges (called "relationships") are abstracting. So, for example, a dog node may be immediately attached to an animal node. But that implies the system cannot sense why the dog may be an animal; it fully depends on the humans who foster the network. The same goes for antonym or even more sophisticated relationships. -- As far as I know, most ontologies tend to make use of such knowledge-injecting edges.
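
For illustration, here is a tiny sketch (in Python) of such a network with nothing but "consists of" edges; the example notions are invented, not taken from a real MOM knowledge base:

```python
# The only kind of edge states which notion consists of which other notions.
consists_of = {
    "dog": ["four legs", "body", "head", "tail", "barks"],
    "cat": ["four legs", "body", "head", "tail", "meows"],
}

def parts(notion):
    """All notions a notion consists of, followed recursively."""
    result = set()
    for part in consists_of.get(notion, []):
        result.add(part)
        result |= parts(part)
    return result

# There is no "dog is an animal" edge; whatever dog and cat have in
# common has to be read off their shared parts.
print(parts("dog") & parts("cat"))
```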

Related to this, there is another big difference between MOM and common classifications/thesauri (also known as "controlled vocabularies"): MOM builds upon a slightly different definition of the term "notion".

Usually, notions are thought of as a triangle of term, item, and the thought of the item. MOM does not focus on the item, nor does it have any interest in the terms. (Different cultures have different terms for the same items, so why care?) Thus, the MOM nodes in fact don't represent notions but just thoughts of items. That makes a huge difference: relying on the traditional definition of notions, a classification or a thesaurus can validly define a motor vehicle to consist of, e.g., a set of wheels, an engine, and a body (amongst others), but it cannot validly define a van to consist of a motor vehicle and other parts, since a motor vehicle, simply, is not a part of a van. MOM, on the other hand, ignores the chance that a physical item might be attached to a notion. Hence, yes, part of a van may be a motor vehicle.

This insight allowed dropping the dominating kind of relationship of traditional controlled vocabularies, the abstraction relationship -- which MOM did drop, in fact. That leaves a single remaining kind of edge: the partial one. Questions?

      
Updates:
none so far

Sunday, February 18, 2007

To Keep One's Insights Secret is the Best Chance to Get Stuck.

I am working on a model to represent content in a manner free of words. There are two other main parts tightly related to it: to recognize content by a variable set of given features, and to reorder the stored content so that implicitly represented content becomes explicit while the already explicit part becomes more streamlined at the same time.

There is one main thing I avoided in my approach: black boxes -- things that are based on people's beliefs rather than on proven facts, things that are overly complex and hence can only be estimated, but not proven.

I avoided two common approaches: utilizing artificial intelligence and linguistics.

Representing The Content

While dealing with thesauri and classifications, I noticed that those ontologies force abstraction relationships between the notions they mention. Therefore, I thought about the alternative: to force the partial relationship instead. What would result if all the notions had to be represented through partial relationships? -- A heavily wired graph. -- Originally, thesauri and classifications were fostered manually. Therefore, it seemed clear to me that no one would like to foster a densely wired graph to keep track of notions.

Nevertheless, I continued the quest. There is software available nowadays, so why keep the chance out of mind?

Over time, I came to the insight that notions might mainly be constituted by sets of features. There -- I had "words" represented. More precisely: the items which are treated as the "templates" for the notions. ... I began to develop an algorithm to recognize items by their features -- varying sets of features -- effectively getting rid of the need for globally unique identifier numbers, "IDs" for short. I had a peer-to-peer network in mind which should automatically exchange item data sets. That demanded a way not to identify, but at least to recognize, items by their features.

Since items can be part of other items -- like a wheel can be part of a car, and a car part of a car transporter -- I switched from sets to graphs. -- To make clear that the edges used by my model are not simple graph edges, and also to differentiate them from classification/thesaurus relationships, I call them "connections".

Then I noticed similarities to neurological structures, mainly neurons, axons, and dendrites. I also noticed that the tools developed so far could not represent a simple "not", e.g. "a car that is not red", so I began to experiment with meaning-modifying edges. Keeping in mind that there is probably no magical higher instance providing the neurons with knowledge -- for example, knowledge that one item is the opposite of another -- I kept away from connections injecting new content/knowledge, in this case knowledge about how to interpret an antonym relationship. I strove for the smallest set of variations of connections.


However, even without content-modifying connections, the tag phenomenon common to the web could benefit greatly: the connections between the items make clear which item is an implication of which other(s). Applied to tagging, users would not need to mention the implications anymore: the implications would be available through the underlying ontology. (And the ontology could be enhanced by a central site, by individual users, or by a peer-to-peer network of users.)

Recognizing The Content

Having that peer-to-peer network in mind, I needed a way to identify items stored at untrusted remote hosts. I noticed that collecting sets of features which themselves would be considered items meant nothing but a definition of the items by their features. Different peers -- precisely: users of the peer software -- might define the same items by different features, which might leave only some of the locally known/set features matching those set remotely. -- However, most of the time that's enough to recognize an item by its features: some features point to a single item only. These features are valuable: they are specific to that very item. If one of these specific features is given, most probably the item that feature points to is meant.

But usually, every feature points to multiple items at once. Most probably, every item a feature points to is chosen reasonably, i.e. it is neither randomly chosen nor complete trash. Thus, a simple count might be enough: how many of the given features point to which items? How great is the quota of features pointing to a particular item, compared to the total number of incoming feature connections? -- The incoming feature connections I call stimulations.

There's one condition applied to the recognition: if a node gets a particular number of stimulations, e.g. two, that very node is considered "active" itself and hence stimulates its successor nodes as well. For a basic implementation of recognition, this approach is enough. A more sophisticated kind of recognition also considers nodes that are merely stimulated, but not activated.
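
A minimal sketch of this recognition step; the example graph, the feature names, and the threshold value are invented for illustration, not the Simple Set Core implementation:

```python
# Each given feature stimulates the items it points to; the quota of
# stimulations against an item's total incoming feature connections
# ranks the candidates, and a node reaching the threshold becomes
# "active" and stimulates its own successors in turn.
points_to = {
    "meows":     ["cat"],
    "barks":     ["dog"],
    "trunk":     ["elephant"],
    "four legs": ["cat", "dog", "elephant"],
    "head":      ["cat", "dog", "elephant"],
    "cat":       ["pet"],
    "dog":       ["pet"],
}

# Total incoming feature connections per node, needed for the quota.
incoming = {}
for targets in points_to.values():
    for node in targets:
        incoming[node] = incoming.get(node, 0) + 1

THRESHOLD = 2  # stimulations needed before a node becomes active itself

def recognize(features):
    """Rank candidate items by stimulation quota, propagating activation."""
    stimulations = {}
    frontier = list(features)
    active = set(features)
    while frontier:
        node = frontier.pop()
        for successor in points_to.get(node, []):
            stimulations[successor] = stimulations.get(successor, 0) + 1
            if stimulations[successor] >= THRESHOLD and successor not in active:
                active.add(successor)
                frontier.append(successor)
    return {node: count / incoming[node]
            for node, count in stimulations.items()}

# <cat> collects all three of its incoming features (quota 1.0), while
# <dog> and <elephant> reach only two of three; the activated nodes go
# on to stimulate <pet> as well.
print(recognize(["four legs", "head", "meows"]))
```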


However, having recognition at hand -- even at this most basic level -- would finally support the above approach of leaving the implications out of tagging: giving only a handful of features would be enough to determine the item meant. Also, applied to online search, the search engine could determine the x most probably meant items and include them in the search.

Beyond these, I see one big chance: currently, if a person gets hold of a physical item they do not know and cannot identify, they need to have it recognized by someone else -- usually the vendor, to whom the part has to be moved. Some parts are heavy, others cannot be moved easily. And you are hindered from using common tools to identify the object: search engines don't operate on visual appearance, and information science tools like thesauri and classifications fail completely, simply because they prefer abstraction relationships over partial ones.

Using software able to recognize items by features would overcome such issues completely, and independently of the kind of feature: it would not matter whether the feature is a word, i.e. a name or label, or a color, shape, taste, smell, or something else. And, unlike with relational databases, there would be no need for another table dimension for each feature -- just a unified method to attach new features to items.

Reorganizing The Content

Also directly related to that peer-to-peer network: peers exchanging item data sets -- e.g. nodes plus context (connections and attached neighbor nodes) -- could result in heavy wiring, superfluous chains of single-feature/single-item definitions, and lots of unidentified implicit item definitions. That needs to be avoided.

Since the chains can often just be cut down, I concentrated on the cases of unidentified implicit definitions. For simplicity, I imagined a set of features pointing to sets of items. Some of the features might point to the same items in common, e.g. the features <four legs>, <body>, <head>, and <tail> would in common point to the items <cat>, <dog>, and <elephant>. You might notice that <cat>, <dog>, and <elephant> are all animals, and also that all these animals feature a body, four legs, a head, and a tail. Thus, <animal> is one implication of this set of features and items. The implication is not mentioned explicitly, but it's there.

Consequently, the whole partial network could be replaced by another one, mentioning the <animal> node as well: <four legs>, <body>, <head>, and <tail> would become features of the new <animal> node, and <animal> itself would become a common feature of <cat>, <dog>, and <elephant>.

By that, the network would become more streamlined (since the number of connections needed would be reduced from number of features * number of items to only number of features + number of items), hence also more lightweight. Also, the implied items would become visible.
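
A small sketch of this reorganization step (in Python), using the <animal> example from above. The new node gets a placeholder ID, since it cannot be named automatically, and the common feature/item sets are passed in by hand here rather than detected; detecting them automatically amounts to finding sets of features that all point to the same set of items.

```python
# Replace features-times-items connections by an intermediate implied node.
points_to = {
    "four legs": ["cat", "dog", "elephant"],
    "body":      ["cat", "dog", "elephant"],
    "head":      ["cat", "dog", "elephant"],
    "tail":      ["cat", "dog", "elephant"],
    "meows":     ["cat"],
}

def factor_out(graph, features, items, new_node):
    """Turn (features * items) connections into (features + items) ones."""
    assert all(set(items) <= set(graph[f]) for f in features)
    for feature in features:
        # Drop the direct connections and point to the new node instead.
        graph[feature] = [i for i in graph[feature] if i not in items]
        graph[feature].append(new_node)
    graph[new_node] = list(items)
    return graph

common = ["four legs", "body", "head", "tail"]
factor_out(points_to, common, ["cat", "dog", "elephant"], "implied-node-1")
# 4 * 3 = 12 connections have become 4 + 3 = 7; "implied-node-1" is the
# unnamed node a human might later label <animal>.
print(points_to)
```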

While this approach makes the implications visible, it raises two new issues: one, the identified implications cannot be named without the help of a human -- at least not easily. (Recognition could do the honours, but I skip that here.) The second issue is that the newly introduced node ("<animal>") conflicts with recognition: for example, if there were another node, e.g. <meows>, directly pointing to the <cat> node, then after the reorganization a given set of only <meows> and <head> would not result in <cat> anymore, since each of the given features would, yes, stimulate its successor nodes, but not activate them. -- Actively pulling stimulations from predecessor nodes could be a solution, but I am not yet quite sure. As mentioned initially, this is a work in progress.


However, reorganization would automate the identification of implications. People could provide labels for the added implication nodes -- which leads to another effect of the model.

Overcoming the Language Barrier

I mentioned that I kept away from connections injecting knowledge unreachable to the software. That's not all. I strive to completely avoid any kind of unreachable content that would need some external mind/knowledge storage to provide an interpretation of such injected content/knowledge. Hence, I also avoided operating on the basis of labels for the items.

Instead, all the items and the features (whereby "features" is just another name for items located a level below the items initially considered) get identified by [locally only] unique IDs. The names of the items I'd store somewhere else, so that item names in multiple languages could point to the very same item. That helps in localizing the system, but it also opens the chance to overcome an issue dictionaries cannot manage: there are languages that do not provide an immediate translation for a given word -- because the underlying concept is different. The English term "to round out" and the German "abrunden" are such an example: in fact, the German variant takes an external point of view ("to round by taking something away"), while the English one obviously takes an internal one.
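
A small sketch of that separation (in Python); the IDs, words, and node assignments are invented examples:

```python
# Nodes carry only locally unique IDs; labels in any language point to a node.
labels = {
    ("en", "car"):          "node:0007",
    ("de", "Auto"):         "node:0007",  # same notion, two labels
    ("en", "to round out"): "node:1041",
    ("de", "abrunden"):     "node:1042",  # a nearby, not identical notion
}

def node_for(language, word):
    return labels.get((language, word))

def words_for(node_id):
    """Every label, in every language, attached to one node."""
    return [(lang, word) for (lang, word), nid in labels.items()
            if nid == node_id]

print(words_for("node:0007"))  # [('en', 'car'), ('de', 'Auto')]
```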

By sticking not to labels but to items/notions, the model offers a chance to label the appropriate, maybe even slightly mismatching, nodes: the need to label exactly the same -- but probably not exactly matching -- node is gone. -- In a word: I think that in different cultures many notions differ slightly from similar ones of other cultures, but each culture labels only its own notions, ignoring the slightly different points of view of other cultures. -- This notions/labeling issue I imagine as puddles: each culture has its own puddle for each of its notions. From the point of view of different languages, some puddles overlap greatly, maybe even totally, but some feature no intersection at all. Those are the terms having no counterpart in the other language.

In the long term, I consider the approach of building upon notions/items -- not upon words -- a real chance to overcome the language barrier.

Conclusion

Besides the option of dropping the custom of labeling different items with the same name (as thesauri tend to do to reduce fostering effort) and the possible long-term chance of overcoming the language barrier, I see three main benefits for the web, mainly for tagging and online search:
  1. Tagging could be reduced to the core tags; all implications could be left to the underlying ontology.
  2. Based on the same ontology, sets of search keywords could be examined to "identify" the items most probably meant. The same approach could be applied by the peer-to-peer network when exchanging item data sets.
  3. Finally, the reorganization would keep the ontology graph lightweight and therefore ease the fostering. Also, the auto-detection of implications would support users in keeping their tags clear and definite. That could reduce blur in tag choice and thus increase precision in search results.


      
Updates:
none so far

Saturday, February 17, 2007

Elements of Semantic Web Explained in an Environment of Buzz

In Minding the Planet, on February 13, 2007, Nova Spivack makes a lot of buzz about his upcoming venture, Radar Networks. Along the way, he explains some of the elements of the Semantic Web he considers to be key. Mostly, his posting commends the ability of ontologies to disambiguate homonyms, and notes that that disambiguation can be applied to documents by markup.

Some of the concepts the posting does not really make clear, like information, content, concept, meaning, "things", data. The posting looks less scientific than excited about the business of its author. There are lots of repetitions.

On the web side, the posting mentions the sites Powerset, TextDigger, Metaweb and the terms OWL and REST, which might be worth a look.


On the blogging side, Nova Spivack's posting can be tracked back, but only directly. Blogger does not support that feature, hence, currently, I am stuck with manual trackback tools like Adam Kalsey's Simpletrack or the Wizbang Standalone Trackback Pinger. But Nova Spivack's blog rejects the trackbacks generated by them with a captcha -- and neither of the tools shows that.

Although Nova Spivack's posting didn't boost me with concepts I didn't know yet, I'd be interested in learning about alternative manual trackback tools (either online or available for *nix) or any great but free alternatives to blogger.com. Can you provide me with a matching hint?

      
Updates:
none so far

Thursday, February 15, 2007

Social Software does not equal Social Networking

Radar Networks mentions:
(note that social software is not necessarily social networking -- that is subset of social software)
Quite true -- like a spam filter applied by a major freemail provider: participants can mark mails in their inboxes as spam while unknowingly marking the same mail (body) in other participants' inboxes as spam as well, or at least increasing the chance the mail will be considered spam once enough marks have piled up.
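
A sketch of that mechanism (in Python); the threshold and the hashing are illustrative assumptions, not how any actual provider does it:

```python
import hashlib

# Each participant's mark raises a counter on the mail body; past the
# threshold, the same body counts as spam for everyone.
spam_marks = {}
SPAM_THRESHOLD = 5

def mark_as_spam(mail_body):
    """One participant marks a mail; every inbox benefits from the count."""
    key = hashlib.sha1(mail_body.encode("utf-8")).hexdigest()
    spam_marks[key] = spam_marks.get(key, 0) + 1

def is_probably_spam(mail_body):
    key = hashlib.sha1(mail_body.encode("utf-8")).hexdigest()
    return spam_marks.get(key, 0) >= SPAM_THRESHOLD
```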

On the other hand, social networking includes making the participants known to each other -- in one way or another.

      
Updates:
none so far

How content providers could inform about the quality of their content

The W3C Content Label Incubator Group writes:
The diversity of material on the Web continues to grow to encompass audio, video, games and all manner of data services alongside traditional documents. The Content Label Incubator Group [WCL-XG] aims to foster ideas for how content providers can inform [...] that their content is of a certain type, fulfils certain criteria or meets given requirements.
(Highlighting is mine.)

If a content provider gets the chance to state whether its content fulfils certain criteria, I wonder how that should be trustworthy in any way. Every single unreflected opinion, provided by any anonymous poster, could simply be announced to fulfil every criterion.

I think a self-selected claim about what the content itself offers wouldn't work unless there's some kind of social control: either some kind of unwritten law is applied, or there's direct social control by known peers of the content provider, or indirect social control, as drafted by the popular rating systems known e.g. from eBay.

Alternatively, a (neutral?) piece of software could be developed which states what the content fulfils. Another question is whether a piece of content fulfils the same criteria for every perceiver, or whether the fulfilment depends on the person consuming the content. -- At least for the also mentioned "content providers can inform [...] that their content [...] meets given requirements", I'd argue that the peer actually needing to apply the information provided (by the content) can probably decide better than the content provider whether or not the content meets any given requirement. If that question is answerable once and for all, ever.

      
Updates:
none so far

Guest Book: Your comments are appreciated!

Your comments are welcome!

If you're about to comment on the blog as a whole rather than on a single posting, here's your place. Just make use of the comments section of this posting; consider it a substitute for a non-existent guest book application.

Thanks for leaving a note.

      
Updates:
none so far

time to blog

Ralf told me about the evolving tags vs. semantic web vs. social web vs. web 2.0 story. I noticed that I was already late. In 2000, I began to develop a system to represent content language-independently. I evolved it far, but I focused on analysis rather than implementation. Also, I dropped programming as a hobby in about 1997, and the (programming) language I used to prove this or that concept was not the kind for release. Plus, I am a C refugee.

Currently, I am bringing a sub-project to an end which reorders graphs of concepts ("notions") so that concepts not explicitly mentioned become visible (and accessible and reusable). I used Perl for that, but it might be largely unreadable. Nevertheless, if anyone is interested in it, I'd disclose it.

Although tags are a topic I have yet to get to, I decided to prepare an article on them. While doing so, I noticed that even more people have become interested in that field by now. Ralf's mention confirmed that -- and my impression of being late. Hence, now, this as a first step into the public.

      
Original posting:
dot dot dot: time to blog
Updates: none so far