
It’s Not Co-Citation… but it’s still awesome! (Or what’s really going on in the SERPs?)

Hello everybody!

Like most of you, I couldn’t help but be intrigued by Rand’s recent WBF entitled “Prediction: Anchor Text is Dying…And Will Be Replaced by Co-citation – Whiteboard Friday.”

Since Rand was really only scratching the surface of something that is potentially game-changing for SEO, I thought I’d look into it a bit more. By the time I got done swimming through a sea of information, I’d come up with some findings I thought you might be interested in.

Why I wrote this giant paper…

My Friday began like most others. I woke up, made some tea, and went about tinkering with a few projects I’ve been working on. This particular Friday I was working on some code for the upcoming launch of my Marketing Intelligence start-up, “Clever Marketing Tools.” In this instance, I was implementing a Content Extraction algorithm to identify the main content of a page from its surrounding markup. It does the job of tools like “Readability,” but without depending on predefined tag look-ups. I figured it would make my Sentiment Analysis Engine more accurate if the content it was analyzing was free of the usual nonsense like navigation, footers, and sidebars.
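For the curious, here’s a minimal sketch of the general idea behind tag-agnostic content extraction, using text density as the signal. This is a toy illustration, not the actual Clever Marketing Tools code; the class name and density threshold below are my own inventions:

```python
# Toy text-density extractor: long runs of text with little surrounding
# markup are probably main content; short, tag-heavy runs are probably
# navigation, footers, and sidebars. Not the actual CMT implementation.
from html.parser import HTMLParser

class DensityExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = []           # (text run, tags seen since last run)
        self._tags_since_text = 0

    def handle_starttag(self, tag, attrs):
        self._tags_since_text += 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.blocks.append((text, self._tags_since_text))
            self._tags_since_text = 0

def extract_main_content(html, min_density=5.0):
    parser = DensityExtractor()
    parser.feed(html)
    # Keep runs whose characters-per-tag ratio clears the threshold
    return " ".join(text for text, tags in parser.blocks
                    if len(text) / max(tags, 1) >= min_density)
```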

I got it working and I decided to celebrate with more tea. As I went back to my desk with the tea, I decided to check my email and I was surprised to find a message from my friend Mike King, pointing me towards Rand’s recent WBF.

Honestly, I don’t normally keep up with the happenings on Moz… but Mike always sends me great stuff to check out, so I decided to skim the transcript (which is an awesome feature, by the way). I don’t think I’d have bothered to check the link that day if it weren’t for that feature, as I was so engaged with my other project.

In retrospect, I’m quite glad I did make the time to check it out, as once I read the transcript, I couldn’t help but agree that the behavior Rand was highlighting was pretty interesting, and I knew it signified something big. However, I didn’t understand why Rand was attributing it to Co-Citation. I had my own suspicions about what was going on, but I didn’t want to call Rand crazy if it was just me not being up on recent developments.

I tend to work in a vacuum, so me not knowing what was up was a total possibility. Wanting to look into this more, I decided to head over to Google Scholar and grab the latest literature on Co-Citation Analysis.

Just in case you’re not familiar, Co-Citation Analysis is a tool used in Bibliometrics that’s also made its way into Information Retrieval, since there’s a lot of overlap between the fields and the problems they seek to solve. It’s a clustering method used to identify similar content through shared citations.
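As a quick illustration of the idea (with made-up data), the core computation is just counting how often two documents are cited together by third parties:

```python
# Toy co-citation analysis: docA and docB are considered similar in
# proportion to the number of citing pages that reference them both.
from collections import Counter
from itertools import combinations

citations = {                       # citing page -> pages it cites
    "page1": {"docA", "docB", "docC"},
    "page2": {"docA", "docB"},
    "page3": {"docA", "docB", "docD"},
}

cocited = Counter()
for cited in citations.values():
    for a, b in combinations(sorted(cited), 2):
        cocited[(a, b)] += 1

print(cocited.most_common(1))   # [(('docA', 'docB'), 3)] -> most similar pair
```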

A quick takeaway is that SEO professionals have actually known about and discussed Co-Citation Analysis for a while now; it was probably deployed by Google to create the results returned by the “related:” advanced operator.

Keeping that in mind, the more I read, the more I knew it probably wasn’t Co-Citation Analysis at work here… and that I definitely needed to develop my theory a bit more. So it was off to SEO By The Sea, since Bill Slawski is the Google Patent Guru (though thankfully there was no need to hire a Sherpa to make this expedition). As per usual, thanks for being so awesome, Bill. If there were an SEO Hall of Fame, you’d have my vote.

What I found there confirmed my suspicions about what the real cause of this new behavior probably was: Lexical Co-Occurrence/Collocation Analysis and Topical PageRank. In short, this behavior is totally based on content analysis for the purpose of deriving context, so that the most relevant resources rank.

Ultimately, I probably could have saved myself a lot of headache by just reading the comments first (like this one where Rand clarified what he means when he says “co-citation”), but then you wouldn’t have this awesome paper to read.

As I was about halfway done with this paper, Bill Slawski published his own response, but I felt the paper was still worth finishing. The Topical Authority scoring algorithm that came out of it is pretty awesome, but we’ll get back to that later.

So what is all the fuss about?

As I understand the subject, Rand discussed some real-world examples of sites ranking highly for queries where the factors we traditionally associate with grabbing a high ranking for a specific search query (authoritative links with relevant anchor text, the query contained in the HTML title, and reasonable query frequency within the document body) seemed to be totally absent.

Basically, there appeared to be no good reason for these sites to rank as high as they did for these queries, based on our usual understanding of SERP ranking signals. It seems as though Google is playing favorites and ranking big brands quite a bit higher than they deserve.

I know, I know… You’re probably thinking that this is no big deal and actually makes perfect sense based on what we already know about the Google algorithm. Brands are going to rank for all sorts of queries, because Google heavily favors links in their algorithm. Sure, they’ve refined the algorithm a little bit to add content and quality signals, but links are still the secret to ranking, and brands attract lots of nice juicy links… so they deserve to rank. It’s just good user experience for these brands to come up in the SERPs for queries they may only be slightly relevant to, right? After all, they’re big brands for a reason!

The simple truth is that it’s all too possible to game the algorithm when it’s a pure popularity contest… and as we’ve learned, there are times when PageRank-heavy algorithms have failed miserably at providing relevant SERPs. Owing to this fact, Google has made huge shifts away from measures of popularity, towards measures of quality.

 

(A classic example of what happens when algorithms go wrong, courtesy of Wikipedia)

 

For example, two of the most widely talked-about algorithm updates in recent history, Panda and Penguin, owe part of their controversy to their use of on-page quality signals. The scope of their true impact on the SERPs is still an ongoing matter of debate within the SEO community. The perceived scope of these algorithms was so broad that Google set up special contact forms for fielding requests from site owners who felt they had been unfairly targeted.

Panda was designed to combat low-quality sites using continually re-trained Quality Classifiers, built using Supervised Classification techniques. This algorithm is inherently focused on on-page factors connected to poor user experiences, derived from human reviewers rating individual web pages.

Penguin was a more hybrid approach designed to specifically target SERP spam. It looked at obvious attempts to over-optimize through keyword stuffing, link schemes (like the popular link wheel technique), or content spinning & scraping. It’s probably driven by a combination of Supervised Learning and raw statistical analysis, but I haven’t spent as much time looking into it myself.

This Supervised Learning approach is part of the reason that Google rolls out Panda & Penguin updates incrementally; they need to continually re-train the Classifiers used in the algorithm. Supervised techniques leave lots of room for continuous improvement, especially as Google expands their training set over time.

Creating training sets and re-training the classifier takes a bit of time, but the results are quite reliable. Each generation of the classifier is usually tested against a baseline data set, commonly known as “The Gold Standard.” Testing each generation against the same Gold Standard data helps ensure the integrity of the classifier, and that it’s at least as accurate as (if not more accurate than) its predecessors.
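For those who haven’t played with this stuff, here’s a bare-bones sketch of that train-and-test-against-the-Gold-Standard loop, using scikit-learn and a handful of made-up page snippets and labels:

```python
# Minimal supervised classification loop: train on labeled examples, then
# score each new generation against a fixed "Gold Standard" set. The
# documents and labels here are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

train_docs = [
    "in-depth guide with original research and examples",
    "buy cheap meds now click here free free free",
    "detailed product comparison with hands-on testing",
    "win prizes instantly click click click act now",
]
train_labels = ["quality", "spam", "quality", "spam"]

# The Gold Standard never changes between generations of the classifier,
# so each re-trained model can be compared fairly against its predecessors.
gold_docs = ["thorough tutorial with worked examples",
             "free pills click now buy buy buy"]
gold_labels = ["quality", "spam"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_docs, train_labels)
print(accuracy_score(gold_labels, clf.predict(gold_docs)))
```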

I’ve actually had the opportunity to play with Supervised Machine Learning quite a bit myself over the last few years. In my own work as an SEO consultant, I’ve used similar techniques to identify spam content, discern the query intent of keywords, or perform sentiment analysis of content using machine intelligence rather than human intelligence. The classifiers I’ve developed for those tasks actually form the basis of the soon-to-be-launched Clever Marketing Tools “Spamalyzer,” “Intentalyzer,” and “Sentilyzer” APIs. Soon everybody will have access to the technology we need to keep in step with Google.

Panda and the recent link-quality algorithm updates are making such tools a necessary part of the trade these days, as the potential for being impacted by negative signals is just as real as being benefited by positive ones. We’ve finally come full circle and entered an era where Negative SEO is a real possibility. Plus, I like efficiency, and tasks like manually sorting links to identify spam or spending tons of time tagging keywords as “Informational,” “Navigational,” and “Transactional” are better suited to machines than people… just ask Google!

And this matters because?

There are plenty of content-based signals that play into the ranking algorithm already, and Google is continuing to refine its understanding of what separates a quality page from spam, which includes catching link spam and other attempts at gaming the algorithm.

Due to this on-going emphasis on content quality, link quality, and other qualitative ranking factors within the Plex, pure link juice shouldn’t explain this behavior… especially since algorithm updates like Penguin included ways of detecting and ignoring low-quality links (like links from irrelevant and/or non-authoritative pages) and attempts at ranking through “Google-Bombing” via exact match anchor text links.

Some links just aren’t as juicy as they used to be, and now there are even types of links that can hurt your efforts to rank… so for these pages to be ranking the way they do, there should be some indication of their relevance to the topic.

So if it’s not just link juice, what’s really going on?

If I’m understanding Rand correctly, he is suggesting the ranking signal is based on Lexical Co-Occurrence (the nerdy Natural Language Processing term for what he describes as “Co-Citation”); specifically, it seems that Google is boosting brands with increased relevance on the basis of co-occurring mentions in content, even when that content doesn’t link to one of the brand’s pages!

Rand appears to be suggesting that SEO and PR are going to converge in a way that devalues traditional anchor signals in favor of content based co-occurrence signals. Under this model, to rank well, brands will have to focus on nurturing co-occurring mentions of their brand across the web on pages that are topically relevant to the terms for which they hope to rank.
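To make the proposed signal concrete, here’s a rough sketch (with an invented three-document corpus) of what a simple brand/phrase co-occurrence score might look like. A real system would use context windows, weighting, and entity recognition rather than naive substring matching:

```python
# Naive co-occurrence score: of the documents mentioning the target
# phrase, what fraction also mention the brand? Corpus is made up.
corpus = [
    "For backlink analysis, most SEOs reach for Open Site Explorer.",
    "Open Site Explorer makes backlink analysis painless.",
    "I prefer green tea over coffee.",
]

def cooccurrence(brand, phrase, docs):
    with_phrase = [d for d in docs if phrase.lower() in d.lower()]
    both = sum(1 for d in with_phrase if brand.lower() in d.lower())
    return both / len(with_phrase) if with_phrase else 0.0

print(cooccurrence("Open Site Explorer", "backlink analysis", corpus))  # 1.0
```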

If he’s correct, not only does it represent a new and powerful content based ranking factor, it signifies a major shift away from the primacy of traditional ranking factors towards a whole new class of signals derived from Natural Language Processing techniques. It also means ranking just got a lot harder for the little guy. Google isn’t just looking for Links, Content, or Quality features anymore, they’re looking for Context.

And this isn’t any old Context, this is Ontological Context; an understanding of what it means to have meaning. Or as Google describes the idea when discussing the Knowledge Graph, it’s about “Things, not Strings.”

To even be able to pull this off, Google’s systems have to be capable of distinguishing named entities such as brands from any old word on the page in a way simple text crawlers and relational databases just can’t do.

So is this even possible?

While landing firmly in the “easier said than done” column, it’s absolutely possible that a company with Google’s resources has developed a system capable of performing tasks such as Entity Disambiguation. Natural Language Processing, especially as it relates to Entity Disambiguation and the creation of Taxonomies for the purpose of building an Ontology, is a continually evolving field and an active area of research at many of the top universities and companies in the United States.

Natural Language Processing techniques already exist for Named Entity Recognition and building Taxonomies, and Google has a huge corpus providing more than enough data to learn from if they wanted to break new ground in the field.

It’s well within the realm of plausibility that they have the capabilities to do it… but is there any indication they’re actually doing it?

They employ some of the brightest minds in the world for the purpose of “organizing the world’s information and making it universally accessible and useful.”

Organizing information is pretty much the primary function of concepts such as Taxonomy and Ontology in Information Retrieval. As if this weren’t suggestive enough, Google has a few patent filings that are relevant to the field… here are a few of the most relevant ones, courtesy of SEO By the Sea:

 

Some of these techniques have definitely made their way into the SERPs, as these screenshots from a recent UMBC Ebiquity article entitled “Entity Disambiguation in Google Auto-Complete” demonstrate:

(Image Credit: UMBC Ebiquity – Entity Disambiguation in Google Auto-Complete)

 

In his discussion of Google’s “Reasonable Surfer” patent, Bill Slawski describes several content based features that may be used in connection with document ranking, especially by weighting or devaluing links based on contextual signals within the linking document.

http://www.seobythesea.com/2010/05/googles-reasonable-surfer-how-the-value-of-a-link-may-differ-based-upon-link-and-document-features-and-user-data/

These patents show a continued trend by Google to examine features within a document in an attempt to derive context well beyond simple term frequency in page content or incoming link anchor text, paving the way for technologies such as Entity Extraction and Co-Occurrence to potentially factor into document ranking algorithms in new and exciting ways.

If you need further proof, please allow me to introduce the Google Knowledge Graph… an Ontological Information Retrieval system.

http://googleblog.blogspot.com/2012/05/introducing-knowledge-graph-things-not.html

Not only is it possible for Google to be working with this technology… it seems Google has taken major strides towards implementing the necessary infrastructure to do it.

Okay… so is that what’s really happening here?

We’ve seen sufficient evidence to support the possibility of Rand’s theory of Lexical Co-Occurrence and Entity Disambiguation being used as a document ranking factor… but big theories demand big proof. After all, Patents, Interest, and Ability, an algorithm do not make… so let’s take a peek at the examples Rand cites and see what kind of evidence we can come up with.

Trying to reverse engineer new ranking signals on the basis of a few examples is naturally next to impossible, but let’s run with what we’ve got since at first glance they do appear to contradict what we normally understand about how to rank well.

We’re going to analyze some of the usual ranking factors and see if we can’t find a more plausible explanation for how these domains are ranking. I will be referring to data pulled from OSE (using a Free Tier account), and some relevant factors from the 2011 Ranking Factor Data. Specifically, I’ll be looking at factors from the Correlation & Survey Data section, as well as using the Broad Algorithm chart data as a guide. I’ll also be examining on-page evidence as revealed by the actual SERP for the query.

Based on the 2011 Ranking Factor Data, we can see that Page Level Link Metrics are believed to factor pretty highly in the Broad Algorithm, while Page Authority and the Number of Root Domains Linking with Partial Match Anchor Text were two of the highest correlated data points. Domain Level Link Authority is also believed to factor in pretty highly, and much like the Page Level Metrics, Anchor Text is among the highest correlating factors. These metrics, along with Domain Authority, will be our primary measures; Page Level Keyword Metrics are believed to be important, but they don’t seem to correlate as strongly with rankings.

This will be a quick and dirty assessment, rather than anything empirical, but it should give us an indication of whether or not we need to invest more time in performing more empirical research, while also setting up a loose framework for more detailed investigation, should it be required.

Analysis of query: Directory of Manufacturers

 

 

I chose this example to start with, since it really does seem to defy our understanding of traditional ranking factors. Of the sites ThomasNet is outranking, some have exact match titles, but there are no exact match domains present in the SERPs. Many of the pages currently ranking also seem to have exact matches within the content, though as the 2011 Ranking Factor Survey showed, that isn’t as strongly correlated with rankings as Link Metrics at the Page and Domain Level.

As it stands, there’s actually no logical reason ThomasNet couldn’t deserve that ranking… but it would have to earn it through Page/Domain Authority and the number of Partial Match Anchor Text links. So off to Open Site Explorer it is!

OSE reveals a Domain Authority of 88 and 12 unique domains offering 29 links containing “directory of” AND a stemmed variant of “manufacture.” The real oddball in this SERP is IndustrialQuickSearch.com, which ranks #4 for Directory of Manufacturers without the term in its link graph or content at all. Phrase-Based Indexing and Stemming seem to be at work in recognizing some content factors, since “Manufacturing” is also bolded in the SERP, and it’s totally possible that the “of” is being eliminated as a stop word.

If that’s the case, it’s a lot less weird since “directory manufacturers” isn’t really that different from “manufacturers directory.”

Based on the SERP Highlighting and the fact that “of” is more than likely considered a stop word, it’s also possible that “Manufacturing Directory” is a viable partial match anchor factoring heavily into the results shown.
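Here’s a toy demonstration of why that matters: once stop words are dropped and terms are stemmed, “directory of manufacturers” and “manufacturers directory” collapse to the same bag of terms. The stop list and crude suffix-stripping stemmer below are my own stand-ins, not Google’s:

```python
# Toy query normalizer: drop stop words, crudely stem, compare as sets.
STOP = {"a", "an", "and", "for", "of", "the"}

def crude_stem(word):
    # Very rough suffix stripping, a stand-in for a real stemmer
    for suffix in ("ers", "ing", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def normalize(query):
    return frozenset(crude_stem(w) for w in query.lower().split()
                     if w not in STOP)

print(normalize("directory of manufacturers") ==
      normalize("manufacturers directory"))   # True
print(normalize("manufacturing directory"))   # same stem: 'manufactur'
```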

To test the theory, I decided to do a quick search for the term “manufacturing directory,” which ended up showing another big surprise:

 

 

IndustrialQuickSearch.com is outranking not only ThomasNet, but also the Yahoo Directory for the query. A quick look at the Anchor Text information for IndustrialQuickSearch.com and ThomasNet for a stemmed variant of “manufacture” AND “directory” quickly sorts out the discrepancy.

ThomasNet only has 4 Unique Domains offering 7 total links for the query, while IndustrialQuickSearch.com has 201 total links coming from 12 unique domains. There’s a chance 185 of those links are being devalued as they all come from one domain, but that still leaves 11 unique domains and 16 links. This ends up being a pretty obvious difference between the two domains which makes this ranking a little less mysterious, especially when we consider the exact match in the title for IndustrialQuickSearch.com.

If the analysis were expanded to anchor texts simply containing both words (à la phrase-based indexing), rather than just looking for stemmed phrase matches, these numbers would definitely jump up. Going back to the “Directory of Manufacturers” query, “List” also appears to be highlighted as a synonym for “Directory” in the snippet for ThomasNet. There are anchor references to manufacturing lists in the ThomasNet OSE export, so this could also be contributing to the overall authority for the page.

 

 

Interestingly, it seems that “Manufacturing Directory” is being treated as a synonym of “Directory of Manufacturers,” contributing to the high rankings for both ThomasNet and IndustrialQuickSearch.com for the query. Further evidence of this possibility can be found if you make the query a phrase-based search for ‘directory of manufacturers.’

IndustrialQuickSearch.com disappears from the Top 10, while ThomasNet holds its position at #1, indicating that IndustrialQuickSearch.com is probably receiving a boost from “manufacturing directory” related anchor text under the broad match algorithm for “directory of manufacturers.”

 

Analysis of query: Cell Phone Ratings

 

 

I’m not going to spend a lot of time covering this one, since much like the above example with ThomasNet, Consumer Reports is an Authoritative Domain with plenty of relevant Anchor Text Links contributing to the high rank.

There is one thing definitely worth pointing out here, and that’s the effect Synonymy is having on the SERPs. Most of the top results returned are targeted more toward “Cell Phone Reviews” than “Cell Phone Ratings.” It seems “reviews” is being treated as a synonym for “ratings” as part of the retrieval and ranking process. This would suggest that when examining the anchor text profile, we should also take into account links with anchor text containing “cell phone reviews.”

However, “Cell Phone Ratings” is not a synonym for “cell phone reviews” according to Google, and if we jump over to that query, the SERPs paint quite a different picture. Consumer Reports ends up dropping from #4 to #10, suggesting that much like ThomasNet and IndustrialQuickSearch.com in the last example, Consumer Reports is receiving a boost for “cell phone ratings” on the basis of anchor text synonymy.

 

Analysis of query: Backlink Analysis

 


I wish I could say that I saved the best for last… but in truth, Open Site Explorer’s ranking for the query Backlink Analysis seems to be more of the same. The standard ranking signals seem to be in effect here; OSE is just an authoritative domain receiving plenty of contextual juice from a series of relevant anchor text links.

One neat thing we can draw from analyzing the SERP snippet is that an Image ALT attribute is being processed as relevant on-page content, even though that image is the anchor for a link pointing back to SEOMoz.org in the page header. This highly prominent placement of the query within the content may be contributing to the rank even though it’s just an ALT attribute.

If this ranking were really the result of lexical co-occurrence of brand mentions, we’d probably expect OSE to rank for Link Popularity & Link Popularity Tool like it does for Backlink Analysis & Backlink Analysis Tool, as the terms tend to co-occur in the anchor text a lot. However, in the SERPs for Link Popularity, OSE doesn’t show up in the Top 10, or even the Top 20; though an article from SEOMoz about MozRank does rank #3.

I think this is readily explained by the fact that Link Popularity is a much more ambiguous query with lots of room for query disambiguation. As a result it’s also a highly competitive query, having around three times the results and many more exact match domains. In fact, this SERP just shows better attempts at optimization across the board. Even the SEOMoz article has an exact match for the query in its TITLE tag.

 


Other queries you can check out that suggest OSE isn’t really getting an unfair boost would be “link checker” and “backlink checker.” OSE ranks just below Majestic (who has an exact match in the TITLE tag) on page 2 for “backlink checker” but can’t be found for “link checker.” OSE just has more backlink related anchor text providing a stronger contextual signal.

As for practical optimization tactics, things get more interesting when you change the query to “back link analysis.” Google appears to treat references to “backlink” as the same as “back link” but not the other way around. OSE maintains a #2 rank here, thanks in part to all the relevant links for “backlink analysis” that are being counted alongside the additional links containing “back link analysis” that aren’t being counted in the previous SERP.

 


SEOBook’s SERP is an example of wasted optimization efforts. The page misses an opportunity to have an exact title match for both “Backlink” & “Back Link” by not observing the natural language rules Google seems to be employing here. Keep in mind that the resource on SEOBook is now defunct and probably not being actively optimized… so it doesn’t necessarily speak to what Aaron was attempting to do at the time; it just forms a convenient example of things we should avoid doing in our own efforts in this brave new world of Contextual SEO.

This should be an important lesson to SEO professionals… pay attention to the SERP highlighting and query suggestions, and learn to play with variant forms and stemming to give yourself maximal coverage without losing flexibility in your link building and copywriting efforts.

What we’ve learned so far…

So far it seems like all of our examples can be explained by the power of authoritative domains with plenty of keyword-rich anchor text pointing their way. It seems like we were looking into a perceived oddity, just tilting at windmills as it were.

Based on that, I have to disagree with Rand, at least so far. There’s no evidence that these rankings are coming from an increased emphasis on a new mention-based ranking signal. While Google is definitely deploying Entity Disambiguation and Lexical Co-Occurrence based techniques, I don’t really think Google is doing things like actively associating Open Site Explorer with “backlink analysis” or “backlink analysis tool” on the basis of repeated in-content mentions shared between the ideas.

I believe it’s ranking highly on the basis of relevant anchor text links, much in the same way Google’s algorithm has always functioned, with one critical difference: I believe we’re seeing the effects of Penguin on the SERPs. It’s a combination of Topical PageRank via anchor text providing a boost to these already highly authoritative domains, along with the devaluation of irrelevant links from many other sites.

As we’ll come to see in a little while, Lexical Co-Occurrence is probably a big part of the Penguin algorithm, especially when trying to figure out what a “relevant” link is.

Topical PageRank & Penguin’s Effects on the SERPs…

In his own response to Rand, Not All Anchor Text is Equal and other Co-Citation Observations, Bill also demonstrates that the sites Rand lists could easily be ranking due to algorithmic factors we already know to be in place, and he lists some of the relevant patent literature. Of particular interest are the patents related to the Reasonable Surfer model and System and method for determining a composite score for categorized search results, since taken in combination they are reminiscent of research on Topical PageRank done by Stanford University.

 

(Slide 10 from Topic Sensitive PageRank – Taher H. Haveliwala, Stanford University)

Google’s patent, System and method for determining a composite score for categorized search results, much like the research done at Stanford, relies on the creation of category-based taxonomies to compute a Topical PageRank; however, I believe what we’re seeing is an expansion of Topical PageRank to include the Ontological Taxonomies and Entity Recognition capabilities evidenced by the Knowledge Graph.

By using Topical PageRank as a measure of a document’s authority for a given topic or concept, and correlating that score with features extracted from the page itself, it’s possible to improve the quality of the SERPs without losing coverage for ambiguous queries or because of poor optimization. Context provided by relevant inbound links, properly weighted, is essentially web-scale human classification of content. In short, Google is computing PageRank for things, not strings, and ranking resources according to their relevance to these various concepts, leading to increasingly relevant SERPs.
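Here’s a tiny sketch of the underlying mechanic, in the spirit of Haveliwala’s Topic-Sensitive PageRank: bias the random surfer’s teleport step toward a seed set of pages known to belong to a topic, and the resulting scores become per-topic authority. The graph and seed set below are invented:

```python
# Topic-sensitive PageRank via networkx's "personalization" vector: the
# random surfer teleports only to the topic's seed pages, so authority
# flows out along links from the topical neighborhood. Toy graph.
import networkx as nx

G = nx.DiGraph([
    ("seo-blog", "ose"), ("seo-blog", "majestic"),
    ("ose", "seo-blog"), ("recipes", "cooking-tips"),
])

# Teleport mass concentrated on the "SEO" topic's seed page(s)
seo_seed = {"seo-blog": 1.0, "ose": 0.0, "majestic": 0.0,
            "recipes": 0.0, "cooking-tips": 0.0}

topical_pr = nx.pagerank(G, alpha=0.85, personalization=seo_seed)
print(sorted(topical_pr.items(), key=lambda kv: -kv[1]))
# SEO pages dominate; the cooking pages get essentially no topical authority
```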

A necessary part of making this work well would be the ability to discern attempts at gaming the algorithm through unnatural or irrelevant links. That’s where Penguin comes in. Some features that might be examined to separate irrelevant links from more relevant ones are described in Ranking documents based on user behavior and/or feature data.

Some of the more interesting ones I found mentioned by Bill in his article Google’s Reasonable Surfer: How the Value of a Link May Differ Based upon Link and Document Features and User Data include:

  • The position of the link (measured, for example, in a HTML list, in running text, above or below the first screenful viewed on an 800 X 600 browser display, side (top, bottom, left, right) of document, in a footer, in a sidebar, etc.);
  • If the link is in a list, the position of the link in the list;
  • Number of words in anchor text of a link;
  • Actual words in the anchor text of a link;
  • How commercial the anchor text associated with a link might be;
  • The context of a few words before and/or after the link;
  • A topical cluster with which the anchor text of the link is associated;

How we can put it to work: Topical Authority Algorithms

So now that we have a theory that seems to fit the data, how can we go about measuring Topical PageRank for our own sites? It’s not like Google’s going to open up an API for the scores, and computing it is definitely a Big Data problem. For the sake of providing something fun to test, let me suggest that Topical Authority is a metric measured within the Knowledge Graph, leading to a weighted graph structure between documents and topics. Topical Authority is an adjustment of a Page Authority metric such as PageRank and can be passed by Links or measured as a document feature using a document’s own Page Authority.

With that in mind, I tried my hand at crafting a Topical Authority Scoring algorithm. So here goes:

Topical Authority = (PA × 0.85) × Prob(L)

Where:

PA = Page Authority
Prob(L) = Probability of L, where L is a Lexeme (unigram, bigram, etc.)

The probability of a Lexeme within the document can be calculated as follows:

Prob(L) = Freq(L) / NoL

Where:

NoL = Number of total lexemes within the document, e.g. 500 in a 500-word article
Freq(L) = Frequency of occurrences of the Lexeme within the document

 

I’m not a mathematician, so please forgive my non-standard notation. If someone out there can express this more properly, please feel free. Consider the Topical Authority Algorithm as released under a GPL License. Feel free to use it, just remember to pay it forward and keep the knowledge growing. It would be great if we could eventually create an open information retrieval algorithm resource by way of our SEO Experimentations.
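For anyone who’d rather read code than notation, here’s the formula translated directly into a few lines of Python (lexeme counting via naive whitespace tokenization; PA is whatever precomputed authority score you have on hand):

```python
# Direct translation of the Topical Authority formula above.
def prob_lexeme(lexeme, text):
    tokens = text.lower().split()
    target = lexeme.lower().split()
    n, k = len(tokens), len(target)            # n = NoL
    freq = sum(1 for i in range(n - k + 1)
               if tokens[i:i + k] == target)   # Freq(L)
    return freq / n if n else 0.0              # Prob(L) = Freq(L) / NoL

def topical_authority(page_authority, lexeme, text):
    return (page_authority * 0.85) * prob_lexeme(lexeme, text)

doc = "backlink analysis made easy: run your backlink analysis in seconds"
print(topical_authority(70, "backlink analysis", doc))  # 70 * 0.85 * 0.2 = 11.9
```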

This should work well using precomputed authority metrics such as PageRank, MozRank, etc., since those precomputed metrics will have already done all the iterative work, providing a relative measure of the page’s authority across the entire system. Basically, we can lean on existing Big Data to form a reasonable approximation of any given page’s relative Topical Authority… but we’ll still come up short when attempting to measure the passage of Topical Authority across the Link Graph.

If measuring a document’s topical authority, it’s enough just to use the formula as is. If you wanted to distribute that Topical Authority across links, you could do it simply by dividing the topical authority by the number of links on the page, or you could get really granular about it and weight links by their position in the DOM hierarchy, or according to the relevance of their anchor text for the topic.

One shortcut might be to pull all links with relevant anchor text, and compute their Topical Authority as passed to the page you’re analyzing. It’s still an approximation, but it’s a lot better than trying to initiate your own web scale crawl.
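Here’s one way that shortcut might look in code, weighting each link by a crude anchor-relevance score (simple token overlap with the topic; a real version could weight by DOM position instead, per the Reasonable Surfer features above):

```python
# Distribute a page's Topical Authority across its links, weighted by
# how well each anchor text overlaps the topic terms. Toy relevance measure.
def anchor_relevance(anchor, topic):
    a, t = set(anchor.lower().split()), set(topic.lower().split())
    return len(a & t) / len(t) if t else 0.0

def distribute(authority, links, topic):
    weights = {url: anchor_relevance(anchor, topic) for url, anchor in links}
    total = sum(weights.values()) or 1.0
    return {url: authority * w / total for url, w in weights.items()}

links = [("example.com/a", "backlink analysis tool"),
         ("example.com/b", "home"),
         ("example.com/c", "free backlink checker")]
print(distribute(11.9, links, "backlink analysis"))
# the fully matching anchor gets the lion's share; "home" passes nothing
```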

Conclusion:

We are looking at a new ranking signal, but it’s not quite the one Rand suggested. It’s Topical PageRank and the devaluing of irrelevant, spammy, or weak links by algorithms like Penguin. This may actually be confirmation of the scope of Penguin on the SERPs. We’re not seeing rankings based on something Google just noticed; we’re seeing rankings based on what Google is ignoring. Nice job catching this, Rand, and thanks for beating me to the punch about the co-occurrence thing, Bill. If you weren’t so awesome, I’d be bummed that you scooped me.

 

Why it’s not that exciting…

  1. The algorithm is still driven by words that are in the content or anchor text, meaning it’s just business as usual for Google. Links are still very important to ranking, they’ve just changed the way they look at links.
  2. Many of the secondary factors, such as Lexical Co-Occurrence and Synonymy have been in place in the algorithm for quite some time.
  3. A lot of the technology used to make it happen is still out of reach for the average SEO Professional. We don’t all have access to web scale data to try and compute the transfer of Topical Authority.
  4. We’re still in theory land, more testing is going to be needed to get to the bottom of this.

Why it is that exciting…

  1. It’s a continuation of the shift away from ranking signals that are easily spammed. Google’s algorithm is getting better, and we’ve learned that it’s Context that is truly King.
  2. It shows an increased ability for Google to disambiguate concepts within phrases and provide contextual results on the basis of similar or related terminology.
  3. It appears to be an example of the impact the Knowledge Graph may have on the SERPs in the future.
  4. Lexical Co-Occurrence is at work in the examples because they’re oddball queries in their own right. They co-occur a lot less in the web corpus than some of their variant forms… or they don’t receive a lot of mentions, so Google is estimating/approximating.

 

It’s actually good news for SEO Copywriters, since we know we can take advantage of stemming and phrase based indexing, and even synonymy when writing content. Gone are the days of stiff pieces written to support exact match mentions of the key phrase.

It also leads us into an off-page optimization methodology. Link building just got a little trickier since we really need links from content that is topically relevant or they won’t count. There’s even a good chance they’ll hurt us under new algorithms designed to combat attempts at link spam.

There’s also the added complexity of delicately balancing our anchor text in such a way that we have sufficient topical authority passed for the phrases we want to rank for, without crossing the threshold and triggering a penalty.

On that note, I’d be very interested in seeing someone replicate the Search Engine Watch study on Penguin, using Topical Authority as a measure instead of percentage of exact match anchors to see if there’s a point where excess Topical Authority can trigger a penalty.

On the other side of the coin, it did get easier to diversify our anchor text in ways that can still make an impact in our campaign. Once again we can take advantage of phrase based indexing, stemmed forms, and synonymy to give our topical authority a major boost, without essentially Google-Bombing ourselves out of the SERPs.

I think this definitely highlights the importance of monitoring the “Did You Mean” suggestions from Google for queries we’re seeking to optimize for. If you get a “Did you mean” for the query, chances are you’re targeting the wrong term. Use the original query as a variant anchor text and focus on optimizing for the query suggested by the “Did You Mean.”

And with that, I’m done rambling. Thanks for hanging in there so long. You should grab some tea or something, you’ve earned it.