
Understanding On Page SEO – How Google Reads Your Web Copy


How Does Google Know If Your Web Copy is Any Good?

 “it’s complicated”  

SEO as an industry sometimes feels like it’s all made up.  

Even to me, and I do it to pay my bills. I see in real time every single day how improved organic rankings help businesses earn more money, and even then, I sometimes feel like this is all a joke. One I’m not in on.  

It feels this way because no one really knows how it all works. We notice patterns, make educated guesses, and do our best to make our clients’ websites work for them, but we don’t really, and I mean really, know how search engines work. (If you want to get some of our educated guesses for free, click here!)

Not even Google themselves have a complete picture of their creation anymore. It’s too complicated and constantly evolving. If they do have a complete picture, they aren’t going to show it to us mere mortals anytime soon.  

– “How does Google know what good on-page content is?”

– “It’s complicated.”

– “F*ck off!”

“It’s complicated” is never the answer anyone wants. Think of all the times you have asked a question that got that response—were you satisfied with the answer?  

Of course you weren’t, and I’m not either.  

So, like an irritating toddler with endless curiosity and no respect for boundaries, I just kept asking why, why, why until I got an answer I was happy with.  

And I found it!  

The philosopher’s stone (or sorcerer’s, if you’re wrong)  

The holy grail  

The big secret to good on-page content is…  

*Drum roll*  

NATURAL LANGUAGE UNDERSTANDING  

Right, there you go. See you later. Thanks for reading. Get your free digital health check here.


No, but for real, I work in SEO. I have been using NLU to inform and improve my content against target keywords for the past year and have had so much success for my clients that I employ it as a standard process for every single on-page job I do, without fail. No, I’m not getting a commission from Big Tech to write this. However, I will absolutely sell out, so hit me up if you know anyone who needs me to shill for them.  

Effective copywriting is not an art; it’s a science.  

Here is a simple version of how it all works.  

Natural Language Understanding (NLU) is a part of natural language processing (NLP). The point of this is really to teach machines how to read and understand written words. If machines can do this, they can use that understanding to engage with us in more “human” ways (yes, this is a part of generative AI programming; well done, you!)  

NLU programmes tend to all drink from the same well and include or use:  

Entity Recognition, also known as Named Entity Recognition (NER), is a process where a computer programme finds the names of people, organisations, locations, and other specific information (like dates, amounts of money, and more) within a text. The goal is to pull out these specific pieces of data. These pieces of data are called “entities.”  

Input Text: You provide the system with a block of text. This could be anything from a news article to a tweet to a page of content that you can’t seem to rank and has already made you cry twice today.  

Finding Entities: The system scans the text and looks for words or groups of words that match patterns or characteristics typical of specific kinds of information. For example, it might recognise “Leeds” as a location or “21 Degrees Digital” as an organisation.  

Classification: Each entity is then categorised into predefined groups such as person, location, organisation, date, etc. This helps in understanding what each piece of recognised information represents.  

Output: The final output is the original text with labels attached to each identified entity, indicating what type of information each entity represents. In some tools, it also gives you a score against each keyword and entity, giving you a quantifiable metric of how well (if at all) your copy hits the keyword. Effective copywriting is a science, not an art. Good copy isn’t elusive, and it can be measured with numbers.
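If you want to poke at this yourself, here’s a minimal sketch using the open-source spaCy library (my choice for illustration, not necessarily what Google runs; the example sentence and model are just stand-ins):

```python
# A rough sketch of Named Entity Recognition (NER) using spaCy.
# Assumes the small English model has been installed first:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

# Made-up example sentence purely for the demo.
text = "In March 2024, 21 Degrees Digital published a guide to SEO for businesses in Leeds."
doc = nlp(text)

# Each recognised entity comes back with a label such as ORG, GPE (location) or DATE.
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Typical (model-dependent) output:
#   March 2024 -> DATE
#   21 Degrees Digital -> ORG
#   Leeds -> GPE
```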


“That’s fine, but how does it actually know what to mark as a keyword?” Great question! It’s so great, in fact, I’m going to make the next part of this (already way too long) blog answer just that.  

SPOILERS: It’s all math. Well, sort of…  

NLU tools start by looking at the intent behind a piece of copy. Those of you who have some SEO knowledge or work with those who do have probably been subjected to ramblings about “search intent.” This is similar. It’s basically the machine trying to find the point of the piece of copy.  

Intent Recognition  

Intent recognition is about determining what the writer or user wants to achieve when they input text into a system. This is particularly important in dialogue systems and chatbots. The process typically involves:  

Data Pre-processing: The input text is cleaned and normalised. This may involve converting all characters to lowercase, removing punctuation, and possibly stemming (crudely chopping prefixes and suffixes off words) and lemmatization.

This is a whole rabbit hole that could be an essay in itself. (Shoutout to Stack Overflow for this part, by the way.)  

Lemmatization (the short version) factors in the wider context of words; for a real example, see this answer from Stack Overflow user Sumit Pokhrel: (https://stackoverflow.com/questions/1787110/what-is-the-difference-between-lemmatization-vs-stemming#:~:text=Stemming%20just%20removes,go%20with%20Lemmatization.)  

The point is, NLU and, by extension, Google (I’m getting there, I promise) don’t read your text exactly as you write it.  
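If you want to see the difference for yourself, here’s a rough sketch using the NLTK library (my choice for the demo, not necessarily what Google uses; the words are arbitrary):

```python
# A rough illustration of stemming vs lemmatization with NLTK.
# Assumes the WordNet data has been downloaded: nltk.download("wordnet")
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "agencies", "ranking"]:
    print(word, "| stem:", stemmer.stem(word), "| lemma:", lemmatizer.lemmatize(word))

# Stemming just chops endings off ("studies" -> "studi"),
# while lemmatization maps words to a dictionary form ("studies" -> "study").
```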

Feature Extraction: Features are taken from the pre-processed text. This might involve simple bag-of-words models, where the presence of certain words implies certain intents, or more complex models that consider the sequence of words using techniques like n-grams.  
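To make “features” a little less abstract, here’s a tiny bag-of-words sketch using scikit-learn’s CountVectorizer (the two example documents are made up):

```python
# A minimal bag-of-words sketch with scikit-learn.
# ngram_range=(1, 2) tells the vectoriser to count bigrams as well as single words.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "leeds seo agency for local businesses",
    "seo agency offering content audits",
]

vectorizer = CountVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(docs)

# Each column is a word or bigram; each row counts how often it appears in that document.
print(vectorizer.get_feature_names_out())
print(matrix.toarray())
```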

How N-grams Work:

When analysing the text, an n-gram of size 1 is referred to as a “unigram,” size 2 is a “bigram,” size 3 is a “trigram,” and so on and so forth. These sequences help in capturing the context or dependency of words within the text, which is hard to do if you’re only focusing on one word at a time with no context. Humans don’t read each word one at a time with no understanding of the wider sentence; this is machines using numbers to try and achieve the same.

Example: For the sentence “The quick brown fox,” the n-grams would be:

Unigrams: [“The”], [“quick”], [“brown”], [“fox”]  

Bigrams: [“The quick”], [“quick brown”], [“brown fox”].  

Trigrams: [“The quick brown”], [“quick brown fox”].  
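If it helps to see the mechanics, this is roughly how those lists get produced (a plain-Python sketch, nothing Google-specific):

```python
# A small sketch of splitting a sentence into n-grams of a given size.
def ngrams(text, n):
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The quick brown fox"
print(ngrams(sentence, 1))  # ['The', 'quick', 'brown', 'fox']
print(ngrams(sentence, 2))  # ['The quick', 'quick brown', 'brown fox']
print(ngrams(sentence, 3))  # ['The quick brown', 'quick brown fox']
```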

 


Basically, n-grams are used by machines to help predict what the next word in a sentence’s sequence is likely to be. AI chatbots use n-grams taken from their training data (which is nearly everything ever written) to know that “you” is most likely to follow “Hi, how are…” (make sense?)
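Here’s a toy version of that prediction idea, using nothing more than a tiny made-up corpus and simple bigram counts. Real language models are vastly bigger and cleverer, but the “count what usually comes next” principle is the same:

```python
# A toy next-word predictor built from bigram counts.
from collections import Counter, defaultdict

corpus = "hi how are you . hi how are things . hi how are you".split()

following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

# The word most often seen after "are" in this tiny corpus:
print(following["are"].most_common(1))  # [('you', 2)]
```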

To make this a little easier to understand (and for my own sanity), let’s bring this back to SEO.  

When a programme uses n-grams to analyse text, the frequency and context in which words appear influence whether the machine identifies them as keywords.

Frequency and Context in N-gram Analysis  

Frequency: If an n-gram (which can be a single word or a combination of words) appears frequently in a text, it signals to the machine that this n-gram is potentially important within that specific document or set of documents. For instance, if the trigram “Leeds SEO Agency” appears repeatedly in a series of articles, the programme may identify “Leeds SEO Agency” as a key topic or keyword.

  • So, more words = higher n-gram = better keyword rankings?  
  • No. Not at all.  
  • That would likely be keyword stuffing, and n-grams are possibly (likely?) a small part of how Google detects keyword stuffing (there’s a rough counting sketch just below).
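Here’s that rough counting sketch. The copy, the target trigram, and the clean-up step are all made up for the demo; it’s an illustration of frequency counting, not Google’s actual method:

```python
# A rough sketch of counting how often a target trigram appears in a page's copy.
import re
from collections import Counter

copy = ("Looking for a Leeds SEO agency? Our Leeds SEO agency helps local "
        "businesses grow through better on-page content.")

# Very crude normalisation: lowercase and keep only word characters.
words = re.findall(r"[a-z0-9-]+", copy.lower())

trigrams = [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]
counts = Counter(trigrams)

print(counts["leeds seo agency"])  # 2 in this made-up snippet
```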

 

You see, n-grams are not reviewed in isolation; the context in which they appear also plays an important part. If an n-gram often occurs in significant positions (think headlines, meta descriptions, subheadings), it might be weighted more heavily. This context helps the programme figure out not just the frequency but also the importance of the n-gram in relation to the overall content.

Pattern recognition plays a part and uses n-grams too. The programme learns patterns. If it keeps seeing the same usage in relevant contexts, it will mark that pattern for the future. This capability enables it to predict and recognise important topics or keywords based on how words are grouped and where they appear in the text. (See the “Hi, how are you?” example.) A machine isn’t asking how you are; it just knows from odds and probabilities that it should use or find those words in the given context.

Statistical Measures: Tools often use statistical methods like TF-IDF (Term Frequency-Inverse Document Frequency), which weighs the frequency of an n-gram against how often it appears across all documents. If an n-gram is frequent in a particular document but rare in others, it’s likely to be considered a keyword for that document because it signifies a unique topic or focus area.

TF-IDF is a statistical measure used to show how important a word is to a document within a collection of documents, like a database or the entire web.

Term Frequency (TF)  

What it is: Term Frequency measures how often a term appears in a document. If a term appears many times in a document, its term frequency is high.

Why it matters: More frequent terms in a document are often more central to the subject of that document.  

Inverse Document Frequency (IDF)  

What it is: Inverse Document Frequency measures how common or rare a term is across all documents in a collection. If that collection is the whole internet, IDF could be a factor in how Google finds unique content, and I theorise it may also play a part in how Google finds content with authority. After all, your professional lived experience is only ever going to appear in your work, giving you unique or rare terms and n-grams in your copy.

Why it matters: If a term appears in very few documents, it’s likely to be more relevant and more distinctive within the few documents where it does appear.

Combining TF and IDF (TF-IDF)  

What it is: TF-IDF is TF multiplied by IDF. The resulting number measures how relevant a term is to a document, balanced against how common that term is across all documents.

Why it matters: You can use TF-IDF to find terms that are distinctive to a document rather than common everywhere, which helps in distinguishing which terms actually describe that document’s content.
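For the numerically inclined, here’s a back-of-the-envelope version of the calculation. The tf × log(N/df) weighting below is one common form, the numbers are invented, and real tools tweak the formula in all sorts of ways:

```python
# A back-of-the-envelope TF-IDF sketch using the common tf * log(N / df) form.
import math

def tf_idf(term_count_in_doc, words_in_doc, docs_containing_term, total_docs):
    tf = term_count_in_doc / words_in_doc               # how often the term appears in this document
    idf = math.log(total_docs / docs_containing_term)   # how rare the term is across all documents
    return tf * idf

# A rare term: 12 mentions in a 600-word article, found in only 2 of 100 articles.
print(round(tf_idf(12, 600, 2, 100), 4))   # comparatively high score
# A common term: 12 mentions in the same article, but found in 90 of 100 articles.
print(round(tf_idf(12, 600, 90, 100), 4))  # much lower score
```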

Example  

Imagine you have a collection of sports articles. One of them talks a lot about “fencing.” This example works because no one talks about fencing. (Which is a shame because it’s a great sport that’s really accessible, and who doesn’t want to hit people with swords?)  

Term Frequency: “Fencing” appears many times in one particular article.  

Inverse Document Frequency: “Fencing” is rare in the other sports articles, which mostly talk about football, cricket, rugby, and other ball-based nonsense.

TF-IDF Score: High for “fencing” in that specific article, showing it’s a key topic in that article but not a common term across all sports articles.
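And here’s the fencing example run through scikit-learn’s TfidfVectorizer, with four made-up one-line “articles” standing in for a real collection:

```python
# A minimal sketch of the fencing example with scikit-learn's TfidfVectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer

articles = [
    "fencing footwork and blade work at the fencing tournament fencing drills",
    "football match report goals and football league standings",
    "cricket innings wickets and cricket batting averages",
    "rugby tries scrums and rugby tackles",
]

vectorizer = TfidfVectorizer()
scores = vectorizer.fit_transform(articles)

vocab = vectorizer.get_feature_names_out()
fencing_article = scores[0].toarray()[0]

# "fencing" should come out with the highest weight in the first article,
# because it is frequent there and absent from the others.
top_terms = sorted(zip(vocab, fencing_article), key=lambda pair: -pair[1])[:3]
for term, score in top_terms:
    print(term, round(score, 3))
```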


In simple terms  

A machine (or, I guess, a person doing it the slow way) tries to find out what each article is about. TF-IDF helps identify which words are important in each article. It’s like trying to find a specialist agency in a big city.

If you needed a leading SEO agency in Leeds, for example, Google would use this method, which points out which terms are special for each document or website. It helps to sift out the noise of common words that don’t tell you much about the content’s unique topics.

I expect this is all just a small (tiny) part of Google’s algorithm and the factors that decide which websites are served up to you on search engines.

But it’s one we can understand and use to give our websites a fighting chance on search engine results pages.


In Conclusion (finally)

Machines like Googlebot and Generative AI use mathematical equations when reviewing copy to help find what is important. It’s all math. It’s only more complex and nuanced because of the unfathomable size of Google’s and Generative AI’s datasets. 

The bots can’t read in the way you and I can. They find patterns in words and terms and measure them against their datasets. They take rare terms and usage within context and treat them as important parts of the document or webpage.

Do you want better keyword rankings on Google?

Say something that no one else is saying, in a unique way.

Share your knowledge, in your way.  

By Dominic Ellis 


