Content Representation With A Twist

Showing posts with label information visualization. Show all posts

Friday, July 13, 2007

Positive hits in the "content representation" search results

Correct hits for the Google search on the term "content representation" (as opposed to hits that contained "content <something other than whitespace, such as punctuation> representation"): I went through the results from the end (page 79) towards the start, since I presumed there would be more false hits the nearer I got to the tail end. But there were few false hits there.

The above results I picked from pages 79 and 78 only, and I already learned a lesson: it might make more sense to apply some kind of clustering here instead of walking through the list manually. Even the human check of whether there is anything between "content" and "representation", done to filter out false hits, can be performed by software.
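The check described above could be sketched as follows. This is only a minimal sketch under assumptions: the function name `is_exact_hit` and the sample snippets are made up for illustration, and it assumes the text of each hit is already available as a plain string.

```python
import re

# Matches "content representation" only when nothing but whitespace
# separates the two words (hypothetical pattern, case-insensitive).
EXACT_PHRASE = re.compile(r"\bcontent\s+representation\b", re.IGNORECASE)


def is_exact_hit(snippet: str) -> bool:
    """Return True if the snippet contains the exact phrase."""
    return EXACT_PHRASE.search(snippet) is not None


# Made-up sample snippets, not actual search results.
snippets = [
    "A survey of content representation in multimedia retrieval",
    "Web content: representation and delivery",
    "content-representation",
]

# Keep only the snippets where the two words are adjacent.
exact_hits = [s for s in snippets if is_exact_hit(s)]
```

Punctuation or a hyphen between the two words makes a snippet a false hit here; only whitespace is accepted.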

I'd like to learn the most frequently used terms (besides "content representation"), and, with the help of that clustering/visualization, I want to be able to ignore obvious false hits.

That calls for getting hands-on with the Google API.
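Counting the most frequent terms might look roughly like the sketch below. The function `top_terms`, the tiny stopword list, and the idea of working on plain-text snippets are all assumptions for illustration; a real run would need the actual page texts and a proper stopword list.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, far from complete).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "for", "to", "on", "with"}

# The search phrase itself is removed before counting.
PHRASE = re.compile(r"\bcontent\s+representation\b", re.IGNORECASE)


def top_terms(snippets, n=10):
    """Return the n most common terms across all snippets."""
    counts = Counter()
    for text in snippets:
        text = PHRASE.sub(" ", text)               # drop the search phrase
        words = re.findall(r"[a-z]+", text.lower())  # crude ASCII tokenizer
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)
```

The resulting (term, count) pairs could then feed a tag cloud or a clustering step.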

      
Updates:
none so far

wanted: tag cloud for the other pages mentioning the term "content representation"

I'd like to learn what all these 91,900 search results related to content representation might be about. (Incidentally, I wonder where I left the article pointing directly to that search result, from when it was still "only" 88,900.)

To learn that quickly, I first need to decide whether to look at the pages manually or "mechanically". Then I'd need to learn how to use the Google API to quickly get all the hits, which actually end at page 78, so in fact there are not 90 thousand plus search results but only a "small" number of 788 hits.

However, since I'd like to redo this search every now and then, and as I might like to run the search on sites like CiteSeer as well, it might be worth the effort to develop a small program that helps me determine the content of all the pages. A tag cloud, and toying around with precision and recall, might contribute a bit to the visualization. The size of each term in the cloud could visualize recall, while precision could be indicated by color encoding, e.g. blue .. green .. yellow .. orange .. red, as on maps, with high precision shown in red and low precision in blue.
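The size and color mapping described above could be sketched like this. It assumes that each term already comes with a recall and a precision value between 0 and 1; how those values would be computed is left open here. The function names and the point sizes are made up for illustration.

```python
# Color scale from low precision (blue) to high precision (red).
PALETTE = ["blue", "green", "yellow", "orange", "red"]


def color_for_precision(precision: float) -> str:
    """Map a precision in [0, 1] to one of five color buckets."""
    p = min(max(precision, 0.0), 1.0)  # clamp to [0, 1]
    index = min(int(p * len(PALETTE)), len(PALETTE) - 1)
    return PALETTE[index]


def font_size_for_recall(recall: float, min_pt: int = 10, max_pt: int = 40) -> int:
    """Map a recall in [0, 1] linearly to a font size in points."""
    r = min(max(recall, 0.0), 1.0)  # clamp to [0, 1]
    return round(min_pt + r * (max_pt - min_pt))


def style_terms(scores):
    """scores maps term -> (recall, precision); returns term -> (size, color)."""
    return {
        term: (font_size_for_recall(recall), color_for_precision(precision))
        for term, (recall, precision) in scores.items()
    }
```

A term with high recall but low precision would then show up large and blue, which might make obvious false hits easy to spot.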

There's a tag cloud generator available in Debian's share of Perl libraries. I have already modified it, and my version is available on request. However, I'd prefer to have a place on the web to put my version. Is there a repository out there for that library?

      
Updates:
none so far