A weblog by Tom Coates concerning future media, social software and the web of data

How do we find information in the Blogosphere?

Posted May 16, 2003 12:59 PM.

It has become almost a truism in critical examinations of the Blogosphere to talk about how - with the explosion in weblog numbers - it becomes difficult to find the best insights on any given subject. I first came into contact with the clear expression of this idea in an article called Scaling Clay Shirky but it's recently been pretty much everywhere...

I believe that there are some legitimate concerns in these sentiments, but I think fundamentally they miss the point - it's my opinion that replication of content online and a massive increase in the number of people posting about a specific issue do not constitute a problem for the blogosphere, but are instead one of its most significant advantages. In fact I'd go further and say that where there are problems, these can be resolved by simply speeding up the self-organising mechanisms that are implicit within the blogosphere, which is, I think, what sites like Daypop, Blogdex, Popdex and Technorati are currently doing, albeit in a reasonably primitive way. But I'm getting ahead of myself. Today I'm just going to talk about one question: how do we reach 100% information saturation on any given subject in the blogosphere without reading anywhere near 100% of the weblogs in it? Or to put it another way: with everyone posting lots, does the system help me find the good stuff?

  • Before I start though - here's a simplified, and easier to assimilate / read pdf version of what I'm about to say: scaling_clay_shirky.pdf [75k]

Let's start off by aggregating all the possible insights about a given subject from all the weblogs that specifically refer to it. This total aggregation will represent 100% of the information available on the subject in the blogosphere at a point in time.

If information was distributed evenly throughout webloggery and weblogs were read randomly then take-up of information would be linear and stable - in order to get 100% of the insights, you'd have to read 100% of the weblogs.

[Graph: linear gradient]

[In this first graph I've plotted on the left the amount of information that you've managed to assimilate versus (on the right) the percentage of the weblogs that you'd have to read in order to get that amount of information - in the very specific special case that information is distributed evenly and randomly. The features of this "special case" will gradually be removed over the rest of the article. Another point I should perhaps clarify is that I've tried to conceive of the bottom axis as also including the order in which one reads the weblogs - that should become clearer through the article...]
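As a rough illustration of this special case, here is a minimal Python sketch. It assumes a toy model that is not part of the original argument: each weblog is a set of insight labels, and `coverage_curve` is a hypothetical helper that reports the share of all distinct insights seen after each weblog read. With one unique insight per weblog the curve is a straight line.

```python
def coverage_curve(blogs, order):
    """Return the share of all distinct insights seen after each weblog read."""
    all_insights = set().union(*blogs)
    seen = set()
    curve = []
    for index in order:
        seen |= blogs[index]
        curve.append(len(seen) / len(all_insights))
    return curve


# Special case: ten weblogs, each holding one insight that no other weblog has.
blogs = [{n} for n in range(10)]
curve = coverage_curve(blogs, range(len(blogs)))
print(curve)
```

In this toy model the reading order makes no difference: every weblog adds exactly one tenth of the total, so half the insights cost half the weblogs and all of them cost all ten.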

However, we know it to be the case that information will not be distributed evenly throughout these weblogs. Many weblogs will contain limited information of any kind. Some will contain a lot. Many will contain replicated information that could easily be found on other sites.

[Graph: the curve reaches 100% earlier]

In this graph, ignore for the moment the dotted lines on the left. They represent nothing but the uncertainty of the beginning of the curve. This diagram takes into account that weblogs have different levels of insight within them, and that information is often replicated (either by active memetic spread or because the insights are simple and common). In the vast majority of cases then - even given that you're still reading weblogs in a totally arbitrary order - it's likely that you'll get extremely close to the 100% saturation point a significant way before you've read 100% of the available weblogs.
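The effect of uneven distribution plus replication can be sketched the same way. Everything numeric here is invented for illustration: 50 made-up weblogs drawing on a pool of 30 insights, most carrying one or two and a few carrying eight, with every post echoed once elsewhere to stand in for memetic spread. The weblogs are still read in a random order.

```python
import random


def coverage_curve(blogs, order):
    """Return the share of all distinct insights seen after each weblog read."""
    all_insights = set().union(*blogs)
    seen = set()
    curve = []
    for index in order:
        seen |= blogs[index]
        curve.append(len(seen) / len(all_insights))
    return curve


def blogs_needed(curve, target):
    """Smallest number of weblogs read at which coverage reaches `target`."""
    for count, covered in enumerate(curve, start=1):
        if covered >= target:
            return count
    return len(curve)


rng = random.Random(2003)  # arbitrary seed, only to make the run repeatable
pool = range(30)           # hypothetical pool of distinct insights

# Uneven distribution: most weblogs carry one or two insights, a few carry eight.
originals = [set(rng.sample(pool, rng.choice([1, 1, 1, 2, 2, 8]))) for _ in range(50)]
# Replication: each post is repeated once by some other weblog.
blogs = originals + [set(post) for post in originals]

order = list(range(len(blogs)))
rng.shuffle(order)  # still a totally arbitrary reading order

curve = coverage_curve(blogs, order)
needed = blogs_needed(curve, 1.0)
print(f"{needed} of {len(blogs)} weblogs read before nothing new turns up")
```

Because every post in this sketch has a twin, the last weblog read can never add anything new, so saturation always arrives before the end of the list. How much earlier depends on the invented numbers above and on the shuffle.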

In practice - again assuming that you were reading the weblogs in a random order - it would be impossible to gauge the particulars of the curve that led up to the near-as-dammit-to-100% information saturation point. A sample curve would probably be organised in a series of steps - with gradual accretion of insight being the norm, but with occasional massive leaps also occurring.

[Graph: the line becomes a series of progressive steps]

Now - all these models have been based upon the assumption that the order in which the weblogs are read will be random. In fact nothing could be further from the truth. Some weblogs are clearly more likely to be read - this is not necessarily purely based upon the value of their contributions, but it's not completely distinct from such valuations either. It would probably be fair to say that on average well-linked-to sites are more likely (albeit perhaps only incrementally) to contain insight than sites which are not linked to at all. Secondly, if someone does produce content of value and insight on any specific subject, then it is more likely to be linked to - which in turn increases the likelihood that an individual will visit the site in question.

Both of these criteria suggest that (in our attempts to reach the 100% insight threshold) we will be more likely to be initially directed to high-insight sites than low-insight sites. This changes our graph substantially.

[Graph: the curve starts strong and levels off close to 100%]

It seems likely, in other words, that even if there's only a limited tendency for sites with more insight to be read first, the information accretion would be remarkably steep initially and then level off dramatically close to the 100% saturation point.
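To make the difference concrete, here is a small hand-built Python example. The six weblogs, their insights and their inbound link counts are all invented, and it shows the extreme case in which readers visit sites strictly in order of links; a weaker tendency would presumably fall somewhere between the two curves.

```python
def coverage_curve(blogs, order):
    """Return the share of all distinct insights seen after each weblog read."""
    all_insights = set().union(*blogs.values())
    seen = set()
    curve = []
    for name in order:
        seen |= blogs[name]
        curve.append(len(seen) / len(all_insights))
    return curve


# Hypothetical weblogs and the insights (numbered 1 to 8) each one contains.
blogs = {
    "A": {1, 2, 3, 4, 5, 6},
    "B": {1, 2},
    "C": {3},
    "D": {7, 8},
    "E": {4},
    "F": {2, 5},
}
# Made-up inbound link counts, loosely tracking how much insight a site holds.
links = {"A": 40, "D": 12, "B": 3, "F": 2, "C": 1, "E": 0}

by_links = sorted(blogs, key=lambda name: links[name], reverse=True)
arbitrary = ["F", "E", "D", "C", "B", "A"]

print(coverage_curve(blogs, by_links))   # [0.75, 1.0, 1.0, 1.0, 1.0, 1.0]
print(coverage_curve(blogs, arbitrary))  # [0.25, 0.375, 0.625, 0.75, 0.875, 1.0]
```

Read in link order, two of the six weblogs are enough to see everything. Read in the arbitrary order, the reader has to get through all six.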

Hypothetical conclusions: For any given body of information on weblogs, no matter the rate of replication of information or the number of people who post exactly the same comments, close to 100% of the available insight can be reviewed by reading a disproportionately small number of sites - sites that will - as a rule - be among the first that readers stumble across through their normal browsing and research patterns.

Related Hypotheses perhaps worth exploring: (1) The larger the number of posts about a subject (and hence the more likely replication) the smaller the proportion of those sites that need to be read in order to have reviewed close to 100% of the available insight. (2) The size of the available insight will increase as the number of posts about a subject increases (although perhaps not in linear proportion).
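Hypothesis (1) can be given a back-of-envelope form under one strong assumption that is not in the argument above: every weblog independently mentions any given insight with the same probability p. After reading k weblogs at random, the chance of having met that insight is then 1 - (1 - p)^k, which does not depend on how many weblogs exist in total. The sketch below (function name and figures are illustrative only) turns that into a number of reads. It also holds the set of insights fixed, so it says nothing about hypothesis (2).

```python
import math


def reads_needed(p, target):
    """Weblogs to read so that 1 - (1 - p)**k reaches `target` (assumed model)."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))


k = reads_needed(p=0.1, target=0.95)  # illustrative values only
for total in (100, 1000):
    print(f"{k} of {total} weblogs = {k / total:.1%} of the total")
```

With these figures the answer is 29 weblogs whether 100 or 1,000 people have posted, so the proportion that needs reading falls from 29% to 2.9% as the number of posts grows.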

Comments

Please stay on-topic, informative and polite. I reserve the right to remove comments for whatever vague capricious reasons seem reasonable at the time.

An interesting line of argument - but I'm not sure that it follows directly that the quality of the insight can be related to the number of links in a cause-and-effect way. There's the possibility that once a site has reached a certain threshold of popularity, it becomes a link-well, and the number of new links becomes a function of the number of old links, and so on.

If everyone links to Plasticbag or Scripting News, is that a measure of the quality of insight contained therein (present company excepted :-) or just that these sites have effectively become blogging portals?

What effect do the blog mavens have on the overall distribution?

Posted by: Tim at May 16, 2003 1:22 PM


The problem is in the special case that you state right on the first graph - that information is distributed evenly and randomly.

Information isn't distributed evenly - what is useful or popular rises to the top, and obscures what is hidden beneath. Blogs amplify this by linking not on the validity of the source and the verity of the information, but out of personal interest and delight - and long may it be so. But this weighs more in the favour of 'popular' information rising to the top as opposed to 'useful'.

Therefore, if there is a problem with searching blogs for useful insight, it's that, as a collective of editors, blogs aren't very good at the job - certainly worse than the editors of encyclopedias anyway. But - that's not why I go to most blogs, in the same way I don't read Hello! for its excellent financial coverage.

Chris.

Posted by: Chrislunch at May 16, 2003 4:41 PM

I really like this analysis of how the size of the information pool needn't be an obstacle. I know you're talking more about how the information is inherently organised and therefore accessible, but its relevance rests on the premise that weblog readers are, on aggregate, information-seekers.

It is likely that much broader motivations such as community-membership or opinion confirmation may underlie their reading choices. I suspect there will be a strong element of the latter: people will go to weblogs containing information they already know in order to have their own opinions endorsed by people they respect.

I wonder if it would be possible to form some aggregate weblog consumption theory. If we could conceptualise how people actually pick their way around the blogosphere we could form some really exciting conclusions, and explain a lot of observed internet phenomena.

Posted by: Gareth at May 16, 2003 5:02 PM

I think that I want to partly echo what Chris is surmising. Blogs can be obscured from the searching process even if they have shown great insight, insight which would have been much linked to had that blog been an already popular blog. But why does a blog become popular? Surely it can be for all sorts of reasons: humour, technical knowledge, specific interest (e.g. social software) and so on. What happens if a relatively unknown blog is the one with the killer insight? Won't the popularity of other blogs - for all the wrong reasons in this case - obscure the insight offered by this unconnected one? Likewise I think the theory is sound if blogs are good at keeping themselves connected: if all the relevant feeds are listed on the aggregation sites and the blog portals know about them. If not, good insight can be missed.

Posted by: Nico at May 16, 2003 5:09 PM

I really like this analysis of how the size of the information pool needn't be an obstacle. I know you're primarily commenting on the inherently organised nature of the information in the blogosphere but there's another premise, namely that weblog readers are, on aggregate, information-seekers.

This doesn't necessarily hold. Much broader motivations such as community-membership or opinion confirmation may underlie reading choices. I suspect there will be a strong element of the latter: people will go to weblogs containing information they already know in order to have their own opinions endorsed by people they respect.

I wonder if it would be possible to form some "aggregate weblog consumption theory", starting from a broad concept of the motivation of a reader. If we could conceptualise how people actually pick their way around the blogosphere we could form some really exciting conclusions, and explain a lot of observed internet phenomena.

Posted by: Gareth at May 16, 2003 5:10 PM

I think I agree with Gareth. I'd add that the notion of "100% of insights" is a bit odd, too. In fact, it seems a bit pathological that anyone would *want* to know what *everyone* thought (or what all the thoughts were) about a given subject. It gets me thinking about the tyranny of technology, and the fact that I start to feel like I *ought* to exhaustively research what other people think. When, of course, that's a bad thing.

Posted by: Richard at May 16, 2003 5:31 PM

Firstly I think I have to clarify to Chris that I only start from the hypothetical thought experiment that information would be distributed evenly and randomly throughout weblogs. In fact each graph is designed to show what happens when you bring in aspects of the observable world - firstly that some information is heavily replicated and then that certain sites will contain more insight than others. The statement that people operate in webloggia because of social instincts and popularity is also true, but the question is not whether people link for social reasons, but whether on average insightful stuff is linked to more often than the non-insightful stuff. If it is, then it's always going to be nudged towards the front of the curve...

Posted by: Tom Coates at May 16, 2003 6:21 PM

Richard - the question I'm trying to answer here is a relatively simple one. It's a direct response to this quote from the iSociety piece: "A predictable pattern soon emerged. In no time at all there were far too many commentary posts for anyone to read them all. Compounding this is the fact that with so many posts appearing on small, poorly linked sites, many comments were repeated. And each person who posted in ignorance something already said elsewhere muddied the waters further." The point of this article is to demonstrate only that you don't need to read all the articles about any given subject because you'll find the best ones relatively quickly and that redundancy isn't a problem (redundant posts are less likely to get linked to, but the insights in them are correspondingly easy to find). That's all I'm trying to say really, that by a few simple processes, the weblogging system self-organises...

Posted by: Tom Coates at May 16, 2003 6:40 PM

Very interesting.

The curve is definitely true. Now that I use a newsreader, I read many more blogs than before. The amount of information I receive isn't proportional to the number of blogs, though. I find that the more blogs I read, the less "efficient" my input is due to blogs carrying stories that they heard about on other blogs.

Nothing wrong with that of course; this is how the blogosphere filters content, and I think it works great: the network of blogs and also the blog indices (Daypop, Blogdex, etc., see here). Publish, then filter (link).

This also relates to the idea of hubs and authorities. Authorities know stuff. Hubs link to people who know stuff. Both are important roles, and it is also possible to be both simultaneously.

Finding information in the blogosphere on topics is an interest of mine. I wrote a little about this on faganfinder. Also on my RSS page (link) I list resources for finding topical RSS feeds, some of them being cross-blog.

The Internet Topic Exchange (link) tries to aggregate topics by bloggers opting-in on a post-level. Waypath and Blogging Headline News both try to organize blog posts automatically. The former by how posts relate to each other, and the latter by organizing posts into topics.

Easy News Topics (ENT) (spec here) is an interesting new development on this front. K-collector (link) is making use of this new data and the future looks promising.

Posted by: Michael Fagan at May 17, 2003 2:05 AM

Ahh, but the context of the query for information is just as important as the relevancy of the information acquired.

Posted by: Taran at May 17, 2003 11:48 PM

I agree with what Gareth said earlier: in general, people aggregate and link to blogs which provide affirmation. In terms of an intellectual discourse this can lead to sterility; I think it's predictable in human nature to avoid certain content which may make people feel bored or uneasy. In one of the graphs (no. 3) the cumulative effect of linking to information that's been endorsed, or validated, allows one to avoid these 'dismissed' blogs. It may be worthwhile considering a more robust approach, using a non-linear cooperativity fit as the key off, statistical weighting of the data is allowed.

In a simple closed system under equilibrium conditions (a tenuous assumption but you have to start somewhere), consider the individual surfing for content as the target. S/he comes across content which they like; internally the blog has further linkage to other content, and the probability of s/he continuing to that content and creating a cycle of repetitive information gathering is positively coupled. Even if s/he doesn't follow through on the link, if they agree and like the content the information is indexed and affects their browsing experience.

There is also the negatively coupled effect: you see content, it's indexed as either bad, ill-informed or crazy and that weblog is consigned to a vacuum. There's an additional category here, involving discourse: you could come across a weblog which you've previously endorsed and which suggests a different site for information that is insoluble with your current mindset. Therein lies the challenge to assimilation of information; we could call it heterotropic linking. You then come up with another statistical model within the current model which classifies this effect, etc. etc.

It would be interesting to take virgin weblog surfers, across a broad demographic, and monitor their behaviour. Give them no edict except to find and surf weblogs and bookmark their favourite sites along the way. Weight this information with the cooperativity curve and their social or political affiliation and you might just get the predictable result that Gareth talked about. Now there's a number for it. ;)
Of course, the experiment is tainted because we're forcing web surfers who have no prior knowledge (by choice?) of weblogs to analyse this content. There are controls for these scenarios.

Posted by: Gummi at May 18, 2003 3:30 PM

So, which are the sites to read?

Posted by: wannaknow at May 22, 2003 9:11 AM

Like it. But one question remains unanswered. Does the vertical axis represent facts, perspectives, or truth? Your legend says information, which sounds like fact. But your text says 'insight', which sounds more like perspectives on truth. Alternatively, it could be that in the end there is 'truth'; a right answer which is either true in some a priori sense, or true because it is accepted as the dominant interpretation. My take is that your system works very nicely for facts - the facts of a story can be gathered quite quickly by linking around. It works less well for perspectives, because these are cumulative and change over time. It doesn't work at all for truth, because as someone else commented, the idea of 100% truth is rather odd. How 'bout that?

Posted by: James Crabtree at May 23, 2003 1:41 PM

I think I should make clear that I'm certainly not talking about truth. I'm talking very specifically - in this case at least - simply about unique pieces of information or opinion. I'm suggesting that if a dozen webloggers write about something, you only need to read (say) five of them. If you read any more than those five, you'd just be being told stuff you'd heard already.

Posted by: Tom Coates at May 23, 2003 3:17 PM


© 1999-2007 Tom Coates