Content Representation With A Twist

Showing posts with label reorganization. Show all posts

Tuesday, February 17, 2009

A hack makes reorganization work

Progressed further: one important step is done: reorganization now basically works. That's the prerequisite for introducing quality control to recognition.

Below, there are two renderings of basically the same net: One rendered before, the other after reorganization.

pre-reorg


post-reorg

The net is broken because this reorg is a hack, and the generator I hacked together for it is one too. Node (25) pointing to (22) and node (15) pointing to itself can be seen as indicators of the generator's brokenness.

However, what's important is that reorganization works.

Sunday, January 20, 2008

Build a new 'programming' language that does not instruct computers but tells them what to make sure of?

I was just reading the latest news about a security hole in Winamp, still having in mind how our programming trainees tend to assume things and base their programming on those assumptions -- instead of making reasonably sure. Having worked on knowledge representation for a rather long time, the Winamp issue made a thought pop into my mind:

As long as there are minds out there trying to make anyone's programs do things they were not intended for -- and that might be the case for a pretty long time -- programming might at first glance look like it has looked for decades: instruct the computer what to do and in which order to do it. But at second glance, with people in mind who try to abuse programs, and other people who make it easy for them by not making sure, all that programming, in my eyes, looks like it is converting into knowledge work rather than a lining-up of building blocks: the kind of knowledge work that makes sure things are the way we'd assume them to be. So the whole program might become some sort of building where each single building block is not only lined up but verified too. Then, in fact, the whole building consists of knowledge rather than of building blocks of assumptions.
 

The majority of my achievements in knowledge representation was to figure out two fundamental concepts, plus a minor but even more fundamental one: recognition asks "How to recognize items by a given subset of their features?", while reorganization asks how to reorganize a given knowledge representation graph to make it less matter/energy consuming while still representing the very same content. The minor one was how to store content in graphs at all. It's very basic but important nevertheless.

Quite a while ago, I wondered whether there might be a reason to base any kind of computer-instructing language on that effort. But I didn't see any such reason then, and I didn't make any further effort to find one.

However, coming today to the point of seeing secure programs as buildings of certainties, there in fact might be a reason to convert my efforts into a new computer-instructing language.

      
Updates:
none so far

Thursday, August 09, 2007

Benefits of detecting and replacing kind N networks: revealing implied content

Added documentation to the sub-framework for detecting replaceable partial networks, and rewrote parts of the initialization for a few classes. What I am actually talking about here is the detection of kind N networks.

the N network

Kind N networks, in MOM terminology, are networks of four or more nodes: two predecessors (cf. image: nodes A and B), two successors (nodes C and D), with each of the predecessors connected to each of the successors. That looks like a mixture of an "N" and an "X" character.

As the X often gets used to indicate something unknown, but nothing here is unknown, that kind of network got called the N network.

a pure kind N network

Add a pair of successor and predecessor nodes, and it still looks somewhat like that X-N mixture. As it features more than two base and two top points, we call it a pure kind N network: any MOM sub-network that consists of an equal number of predecessor and successor nodes (and wires all the predecessor nodes to all the successor nodes) is called a pure kind N network. Thus, the N network, of course, is also a pure kind N network.

Then, there is a chance a network features more successor nodes than predecessor ones.

the W network

If the number of predecessor nodes is >= 2, that kind of net gets called a kind W network, because of its shape. -- It gets called the W network if it sports exactly two predecessor and exactly three successor nodes.

a kind V fan

If a graph features only a single predecessor node, it's a kind V fan. Similar to the naming scheme for the W network, a kind V fan gets called the V fan if its shape matches the letter: if it features a single predecessor and exactly two successor nodes.

Put upside down, we get a kind M network, the M network, a kind A fan and the A fan, respectively.

Because kind W and kind M networks follow the pure kind N network approach of wiring each predecessor to each successor node, kind W, kind M and pure kind N networks all together get summarized under the generic "kind N network" label.
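As a rough sketch, the naming scheme above can be expressed as a classification over the wiring. The function and data layout below are made up for illustration -- plain (predecessor, successor) pairs stand in for MOM's actual classes:

```python
def classify(preds, succs, edges):
    """Classify a sub-network per the MOM naming scheme, given the defining
    property: every predecessor is wired to every successor."""
    if not all((p, s) in edges for p in preds for s in succs):
        return None  # not completely wired: no kind N network at all
    if len(preds) == 1:
        return "kind V fan"
    if len(succs) == 1:
        return "kind A fan"
    if len(preds) == len(succs):
        return "pure kind N network"
    return "kind W network" if len(succs) > len(preds) else "kind M network"

# The classic N network: nodes A and B wired to C and D.
edges = {("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")}
print(classify({"A", "B"}, {"C", "D"}, edges))  # pure kind N network
```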


So, the effort was to detect any kind N networks within a larger MOM network. Why? -- The complete wiring of each predecessor node to each successor node means that the content of each successor node equals the content of all predecessor nodes together (plus the content of any separate predecessor nodes the successors don't share with their neighbours of that kind N network).

detecting hidden content

The image aside shows it: nodes F and G share the content of B, C and D. A becomes part of F only, as does E for G. -- That sharing of common predecessor nodes implies two things:
  • First, as any MOM node represents the merged content of its predecessor nodes, we could replace the heavy wiring by adding a new node, linking all predecessor nodes of the kind N network to that newly added node, and linking it to all the successor nodes. That way the number of edges to administer could get reduced from a * b to only a + b.
  • Second, as might be obvious now, that wiring of all the predecessor nodes to each of the successor nodes was nothing but an implication -- a notion not made explicit. By adding the node, we make it explicit.
So, detecting kind N networks offers the chance to detect implicit content as well as to decrease the number of edges to administer. That was accomplished a few days ago.
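The replacement step can be sketched as follows. The function and the node name "H" are hypothetical, not MOM's real interface; edges are plain (from, to) pairs:

```python
def collapse_kind_n(preds, succs, edges, new_node):
    """Replace the complete predecessor-to-successor wiring by one new node,
    cutting the edge count from len(preds) * len(succs) to len(preds) + len(succs)."""
    kept = {(a, b) for (a, b) in edges if not (a in preds and b in succs)}
    kept |= {(p, new_node) for p in preds}   # all predecessors feed the new node
    kept |= {(new_node, s) for s in succs}   # the new node feeds all successors
    return kept

preds, succs = {"A", "B", "C"}, {"D", "E", "F", "G"}
edges = {(p, s) for p in preds for s in succs}
print(len(edges), len(collapse_kind_n(preds, succs, edges, "H")))  # 12 7
```

With 3 predecessors and 4 successors, the a * b = 12 edges shrink to a + b = 7.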

Now, part of it is rewritten and all of it documented. Next, that sub-framework for kind N net detection needs to get split into separate class files and put into a sub-directory or sub-directory hierarchy.

      
Updates:
none so far

Tuesday, July 31, 2007

Finished a first piece of reorganization

Just finished: one core part of reorganization -- finding large replaceable partial networks. I figure that might get me rid of those duplicate feed news items, as having this functionality available might enable me to sort news by topic. ... Which makes me ponder wrapping this bit into a Rails site. ;-)

      
Updates:
none so far

Friday, July 27, 2007

chance for a MOM application: get old news filtered from RSS feeds

Development on the MOM SSC framework, and especially the implementation of one core part of the reorganizer, has lagged because I am still after getting a job (and dealing with other issues). Apparently, that search distracts more than actually having a job would.

However, I defend the time to read my feeds. But there I found a problem: too much interesting news and too many repetitions of the same topic. I survived one Apple keynote period, and I endured the Vista market introduction. But when there was yet another iPhone hype, I began feeling nagged.

Now, as the iPhone wave gets prolonged by iPhone hacks, and as no one can hide from that Harry Potter hype, I really get annoyed. -- As the Model of Meaning provides the logic to detect similarities, I want a tool that spots old news and variants of already known news. Such as the latest iPhone hack or Potter p2p share.

Another way, besides looking up and dealing with the tags of feed entries, might be to take the words of any set of two or more articles and look for sets of words they share. A more brute-force (and less MOM-like) approach would be to take word neighbourhoods (word sequences) into consideration. -- On the other hand, the tool-to-be could use WordNet to include synonyms in the 'consideration' when looking for similarities between texts.
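The shared-word-set idea, in a minimal sketch -- plain whitespace tokenization is assumed, and the stemming and WordNet synonym lookup mentioned above are omitted:

```python
def shared_words(article_a, article_b):
    """Brute-force similarity: the set of words two articles have in common."""
    return set(article_a.lower().split()) & set(article_b.lower().split())

a = "Apple releases new iPhone firmware update"
b = "New iPhone firmware fixes security hole"
print(sorted(shared_words(a, b)))  # ['firmware', 'iphone', 'new']
```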

For that reason, I am now looking at how to get through with the aforementioned reorganizer core -- the one that actually detects similarities in order to save edges, i.e. storage -- logically in edges as well as "physically" in disk space.

      
Updates:
20070731: linked the word "lagged" to the last recent release posting

Tuesday, June 26, 2007

Reorganizing Tags -- For What Benefit?

With the core MOM reorganization obstacle in sight and reorganization about to be implemented, and having noticed a possible benefit of having only//just//at least a reorganizer at hand (i.e. without any recognizer) [aside from the benefit of then becoming able to develop a more sophisticated recognizer], I began thinking about whether there might be a chance to make some profit by providing the MOM reorganizer as a web service.
 

Still not knowing of any profitable such web service, I ended up looking up 'tagging' on Wikipedia. It might be worth a read; the same goes for the German variant of that very article [for those of you comfortable with that language].

      
Updates:
none so far

Thursday, June 21, 2007

Link payload to get an impression of user interests

Is it link payload? Or something like content, or a set of features, that link-clicking web users reveal about themselves?

Having a tool within reach that might mine immediately processable content from the web -- the reorganizer module of MOM -- I keep wondering how to actually mine the web.

Just this minute, I am skimming a news web site that, on its overview page, provides only the headlines of the articles. Not the least preview, not even a snippet of text hinting at what the linked article might deal with or where it might dig into depth. So a human can say: if you click on that link, you might be interested in the topic spotlighted by the headline. Or, since I know the sometimes crudely set-up headlines, there's a chance you clicked only to get an idea what the heck the article might deal with. There's also the chance you clicked a link accidentally, but let's skip that possibility for now.

What I noticed a minute earlier, skimming that headline list, was that converting the headline's words to nouns (e.g. by stemming) might suffice to tag the links. Given that people would click only links they are interested in, mirrored back, any such clicked link reveals the topics the user is interested in -- the tags peel off the link and adhere to the person who clicked it. In other words: by clicking the link, users tag themselves. -- Track what a user clicks over time, and you get not only a cloud of tags which you can link to that user; by actually linking them to the user and applying reorganization, it becomes simple to learn the interests of a user. Add counting -- no, not of the links, as you might do for plain web site statistics, but of the tags the users tag themselves with -- and you might get a rather specific profile of the user. -- Cover a broad cloud of topics, thus a broad cloud of tags, and your users' profiles become even sharper.
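That counting idea, in a minimal sketch. The 'convert headline words to nouns' step is crudely approximated here by lowercasing and length filtering; real stemming or noun extraction is omitted:

```python
from collections import Counter

def headline_tags(headline):
    # crude stand-in for stemming/noun extraction: lowercase, strip punctuation,
    # keep words longer than three characters
    return {w.strip(".,!?").lower() for w in headline.split() if len(w) > 3}

profile = Counter()  # the tags this user has 'tagged themselves with'
for clicked in ["iPhone hack released", "New iPhone firmware", "Harry Potter leak"]:
    profile.update(headline_tags(clicked))

print(profile["iphone"])  # 2 -- the user clicked two iPhone headlines
```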
 

And, in the back of my head, there's still Google's advertising system. If each page Google puts ads on has to be 'enriched' by a handful of tags, then by visiting that page, the users tag themselves with those tags. If Google manages to assign that set of tags to you individually, Google might have quite a good impression of your interests.

      
Updates:
none so far

Sunday, June 17, 2007

cluster your feed news: MOM reorganizer vs RSS feed news

The chance to reveal topics that several different sources deal with by applying a reorganizer also implies the chance to cluster RSS feeds by topic: instead of approaching that issue with traditional information science procedures, the tags of the fetched articles could alternatively get looked up (retrieve the original article, pick its tags) and thrown through a reorganizer.

That might make it easier to skip news from usually valuable feeds on topics completely out of interest.

      
Updates:
none so far

MOM's reorganization could reveal topics/theme complexes

Currently, I am preparing to implement MOM's reorganization capabilities.

yet implemented parts of MOM / parts currently under development

Today, some time amidst it, I noticed that MOM could provide a service already with only the reorganization functionality in place: based on popular tagging, MOM [actually, its reorganizer] could reveal topics that different sources (e.g. flickr photos or blog entries) deal with -- so far unrecognized. -- The background:

Problem:
I've got lots of papers, all tagged. They deal with several different topics, on and off over time. Far later papers might deal with topics similar to those of far earlier ones. -- Using the tags alone and sorting the papers by topic intellectually, I might have a rather hard time: there are too many distinct tags to keep track of.

Approach for solution:
Reorganization could be applied: it might detect clouds of tags that belong together and 'mark' them by pooling them into separate new -- but as yet unnamed -- 'tags' (= MOM nodes). Each new tag then points to every paper dealing with the topic the tag represents. -- That reduces the intellectual workload to finding appropriate names for the newly created, initially unnamed, tags. And, of course, to tagging all those papers beforehand.

Benefit:
That applies not only to my private issue of getting clear which topic I touched with MOM at which time, but also to any other unordered collection -- e.g. papers collected in preparation of a diploma thesis, any scientific work, maybe even a law library, any library, all literature ever written.

      
Updates:
none so far

Sunday, June 03, 2007

new release: documentation improved, dotty output added

New version is out! It mainly features improved documentation of all the files of the framework, but also introduces better MOM net output for the MOMedgesSet and MOMnet classes -- for the latter, even a simple dotty output method.

See here a sample MOM net created by MOMnet and rendered by dot: two-layer MOM nets often contain hidden, i.e. not explicit (i.e. implicit), content. Having a generator for them available constitutes the chance to develop a detector for such implicit content and to make it explicit. A mechanism that takes both of these steps is known as reorganization. -- Which might become implemented next.
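Such a dotty output method might boil down to something like this sketch. MOMnet's real method is not shown here; a plain set of (from, to) pairs stands in for the net:

```python
def to_dot(edges, name="MOMnet"):
    """Emit a Graphviz dot description for a net given as (from, to) pairs."""
    lines = ["digraph %s {" % name]
    lines += ['  "%s" -> "%s";' % (a, b) for a, b in sorted(edges)]
    lines.append("}")
    return "\n".join(lines)

print(to_dot({("four legs", "cat"), ("head", "cat")}))
```

Feed the result to `dot -Tpng` and you get a rendering like the one above.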
      
Updates:
none so far

Sunday, February 18, 2007

To Keep One's Insights Secret is the Best Chance to Get Stuck.

I am working on a model to represent content in a manner free of words. There are two other main parts tightly related to it: To recognize content by a variable set of given features. And, to reorder the stored content so that implicitly represented content becomes explicit while the already explicit part becomes more straight at the same time.

There is one main thing I avoided in my approach: black boxes. Things based on people's beliefs rather than on proven facts. Things that are overly complex, hence can only be estimated, not proven.

I avoided two common approaches: utilizing artificial intelligence and linguistics.

Representing The Content

On dealing with thesauri and classifications, I noticed that those ontologies force abstraction relationships between the notions they mention. Therefore, I thought about the alternative: forcing the partial (part-of) relationship. What would result if all the notions had to be represented via partial relationships? -- A heavily wired graph. -- Originally, thesauri and classifications were maintained manually. Therefore, it looked clear to me that no one would like to maintain a densely wired graph to keep track of notions.

Nevertheless, I continued the quest. Software is available nowadays, so why keep that chance out of mind?

Over time, I came to the insight that notions might mainly be constituted by sets of features. There -- I had "words" represented. More precisely: the items which are treated as the "templates" for the notions. ... I began to develop an algorithm to recognize items by their features -- varying sets of features -- effectively getting rid of the need for globally unique identifier numbers, "IDs" for short. I had a peer-to-peer network in mind which should automatically exchange item data sets. That demanded a way not to identify, but at least to recognize, items by their features.

Since items can be part of other items -- like a wheel can be part of a car, and a car part of a car transporter -- I switched from sets to graphs. -- To make clear that the edges used by my model are not simple graph edges, and also to differentiate them from classification/thesaurus relationships, I call them "connections".

Then I noticed similarities to neurological structures, mainly neurons, axons and dendrites. I also noticed that the tools developed so far could not represent a simple "not", e.g. "a not-red car", so I began to experiment with meaning-modifying edges. Keeping in mind that there is probably no magical higher instance providing the neurons with knowledge -- for example knowledge that one item is the opposite of another -- I kept away from connections injecting new content/knowledge; in this case, knowledge about how to interpret an antonym relationship. I strove for the smallest set of connection variants.


However, even without content-modifying connections, the tag phenomenon common to the web could benefit greatly: the connections between the items make clear which item is an implication of which other(s). Applied to tagging, users would not need to mention the implications anymore: the implications would be available through the underlying ontology. (And the ontology could be enhanced by a central site, by individual users, or by a peer-to-peer network of users.)

Recognizing The Content

Having that peer-to-peer network in mind, I needed a way to identify items stored at untrusted remote hosts. I noticed that collecting sets of features which themselves would be considered items meant nothing but a definition of the items by their features. Different peers -- precisely: users of the peer software -- might define the same items by different features, which might leave only some of the locally known features matching those set remotely. -- However, most of the time that's enough to recognize an item by its features: some features point to a single item only. These features are valuable: they are specific to that very item. If one of these specific features is given, most probably the item the feature points to is meant.

But usually, every feature points to multiple items at once. Most probably, every item a feature points to is chosen reasonably, i.e. it is neither randomly chosen nor complete trash. Thus, a simple count might be enough: how many given features point to which items? How great is the quota of features pointing to a particular item, compared to the total number of incoming feature connections? -- The incoming feature connections, I call stimulations.

There's one condition applied to recognition: if a node gets a particular number of stimulations, e.g. two, that very node is considered "active" itself, hence stimulating its successor nodes as well. For a basic implementation of recognition, this approach is enough. A more sophisticated kind of recognition also considers nodes that are stimulated but not activated.
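A minimal sketch of that stimulation/activation rule. The edge layout and the threshold handling are assumptions for illustration, not MOM's actual code:

```python
from collections import Counter

def recognize(given_features, edges, threshold=2):
    """Count stimulations per node; a node reaching the threshold becomes
    active and stimulates its own successors in turn."""
    stimulations = Counter()
    frontier = set(given_features)
    while frontier:
        newly_active = set()
        for node in frontier:
            for pred, succ in edges:
                if pred == node:
                    stimulations[succ] += 1
                    if stimulations[succ] == threshold:
                        newly_active.add(succ)
        frontier = newly_active  # only activated nodes stimulate further
    return stimulations

edges = {("meows", "cat"), ("head", "cat"), ("head", "dog")}
print(recognize({"meows", "head"}, edges))  # cat: 2 (active), dog: 1 (stimulated only)
```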


However, having recognition at hand -- even at this most basic level -- would finally support the above approach to tagging, leaving out the implications: giving only a handful of features would be enough to determine the item meant. Also, applied to online search, the search engine could determine the x most probably meant items and include them in the search.

Beyond these, I see one big chance: currently, if a person gets hold of a physical item they do not know and cannot identify, they need to have it recognized by someone else. Usually your vendor, to whom you have to move the part. Some parts are heavy; others cannot be moved easily. And you are hindered from using common tools to identify the object: search engines don't operate on visual appearance, and information science tools like thesauri and classifications fail completely, simply because they prefer abstraction relationships over partial ones.

Using software able to recognize items by features would overcome such issues completely, and independently of the kind of feature: it would not matter whether the feature is a word (i.e. name, label) or a color, shape, taste, smell or other. And, unlike relational databases, there would be no need for another table dimension per feature -- just a unified method to attach new features to items.

Reorganizing The Content

Also directly related to that peer-to-peer network in mind: peers exchanging item data sets -- e.g. nodes plus context (connections and attached neighbor nodes) -- could end up with heavy wiring, superfluous chains of single-feature-single-item definitions, and lots of unidentified implicit item definitions. That needs to be avoided.

Since the chains often can simply be cut down, I concentrated on the cases of unidentified implicit definitions. For simplicity, I imagined a set of features pointing to sets of items. Some of the features might point to the same items in common, e.g. the features <four legs>, <body>, <head>, and <tail> in common would point to the items <cat>, <dog>, and <elephant>. You might notice that <cat>, <dog>, and <elephant> are all animals, and all these animals feature a body, four legs, a head, and a tail. Thus, <animal> is one implication of this set of features and items. The implication is not mentioned explicitly, but it's there.

Consequently, the whole partial network could be replaced by another one, mentioning the <animal> node as well: <four legs>, <body>, <head>, and <tail> would become features of the new <animal> node, and <animal> itself would become common feature of <cat>, <dog>, and <elephant>.

By that, the network becomes more straight (since the number of connections needed reduces from number of features * number of items to only number of features + number of items), hence also more lightweight. Also, the implied items become visible.
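The edge arithmetic of the animal example, just to make it concrete:

```python
features = ["four legs", "body", "head", "tail"]
items = ["cat", "dog", "elephant"]

# before: every feature wired directly to every item
before = {(f, i) for f in features for i in items}
# after: features -> <animal> -> items
after = {(f, "animal") for f in features} | {("animal", i) for i in items}

print(len(before), len(after))  # 12 7
```

4 features * 3 items = 12 connections shrink to 4 + 3 = 7, and the implied <animal> node becomes an explicit part of the net.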

While this approach makes the implications visible, it opens two new doors: First, the identified implications cannot be named without the help of a human -- at least not easily. (Recognition could do the honours here, but I skip that.) Second, the newly introduced node ("<animal>") conflicts with recognition: if there was another node, e.g. <meows>, directly pointing to the <cat> node, then after the reorganization a given set of only <meows> and <head> would not result in <cat> anymore, since each of the given features would, yes, stimulate its successor nodes, but not activate them. -- Actively receiving stimulations from predecessor nodes could be a solution, but I am not yet quite sure. As mentioned initially, this is work in progress.


However, reorganization would automate the identification of implications. People could provide labels for the added implication nodes. -- Which leads to another effect of the model.

Overcome the Language Barrier

I mentioned that I kept away from connections injecting knowledge unreachable to the software. That's not all. I strive to completely avoid any kind of unreachable content, i.e. content needing an external mind/knowledge storage to provide its interpretation. Hence, I also avoided operating on the basis of labels for the items.

Instead, all the items and the features (whereby "features" is just another name for items located a level below the items initially considered) get identified by [only locally] unique IDs. The names of the items I'd store somewhere else, so that item names in multiple languages could point to the very same item. That helps in localizing the system, but it also opens the chance to overcome an issue dictionaries cannot manage: there are languages that do not provide an immediate translation for a given word -- because the underlying concept is different. The English term "to round out" and the German "abrunden" are such an example: the German variant takes an external point of view ("to round by taking something away"), while the English obviously takes an internal one.

Sticking not to labels but to items/notions, the model offers the chance to label the appropriate, maybe even slightly mismatching, nodes: the need to label exactly the same -- but probably not exactly matching -- node is gone. -- In a word: I think many notions in one culture differ slightly from similar ones in other cultures, but each culture labels only its own notions, ignoring the slightly different points of view of other cultures. -- This notions/labeling issue I imagine as puddles: each culture has its own puddle for each of its notions. From the point of view of different languages, some puddles match greatly, maybe even totally, but some feature no intersection at all. Those are the terms having no counterpart in the other language.

In the long term, I consider the approach of building upon notions/items -- rather than upon words -- a real chance to overcome the language barrier.

Conclusion

Besides the option of dropping the custom of labeling different items with the same name (as thesauri tend to do to reduce maintenance effort) and the possible long-term chance of overcoming the language barrier, I see three main benefits for the web, mainly for tagging and online search:
  1. Tagging could be reduced to the core tags; all implications could be left to the underlying ontology.
  2. Based on the same ontology, during search, sets of keywords could be examined to "identify" the items most probably meant. The same approach could be applied by the peer-to-peer network to exchange item data sets.
  3. Finally, reorganization would keep the ontology graph lightweight and therefore ease its maintenance. Also, the auto-detection of implications would support users in keeping their tags clear and definite. That could reduce blur in tag choice, thus increasing precision in search results.


      
Updates:
none so far