A Data Model to Generate a Keyword List Based on Articles

Feb 12, 2021

Motivation

What problem do we want to solve?

About 83% of SEO respondents have experienced keyword cannibalization, which occurs when multiple pages of a site target the same keywords [1]. In many cases, the client or company had used the same key phrase for every article on their site before the respondents took over their SEO, and sometimes later revisions to an article cause keyword cannibalization. Based on this, the problem we want to solve is as follows:

How can we automatically generate a keyword list from an article?

Strategic Goal

The strategy to achieve this goal has two steps:

Extract keywords from an article
Generate a keyword list

Data Model

The data model is built on Natural Language Processing (NLP) techniques, namely keyword extraction and word embedding. The data sources are an article and a corpus that we can either build ourselves or obtain from another resource, such as a Google or Wikipedia corpus. The data model flows from data input, through data processing, to data output. The details are as follows:

A. Input Data

The input data come from two sources: the article itself, and either an open source (such as a Google or Wikipedia corpus) or a database we build ourselves. The first contains the article text, including its title and paragraphs. The second contains the list of words or phrases that forms the word embedding corpus; this data is also collected from many articles.

B. Processing Data

Data processing relies on two NLP techniques. First, we use TF-IDF to extract keywords from the article. Second, we use Doc2Vec to find keywords similar to them in the word embedding corpus. The details of these techniques are as follows:

TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a word-weighting scheme intended to reflect how important a word is to a document within a collection of documents. We can use the Python scikit-learn library to extract keywords from the article.

Doc2Vec

Doc2Vec is an NLP technique for representing documents as vectors. Documents with similar content end up close together in the vector space, so it can be used to find nearby sentences or phrases. We can use the Python Gensim library to retrieve phrases from the vocabulary, which are mapped to a multi-dimensional vector space.

C. Output Data

The output is a keyword list containing several keywords that are similar and related to the article. For example, if a client has an automotive article, the system would generate a keyword list related to that article, such as ‘futuristic transportation’, ‘modern automotive’, ‘expensive car’, etc. The point is to produce a list of keywords that are similar and related to the article.

Conclusion

The keyword generator can be used as a reference for organic SEO keywords and as a point of comparison with Google Keyword Planner or other keyword planning tools.
The limitation of this keyword generator is that it does not handle ranking; it only lists keywords without ranking them. Ranking could be handled if the system had additional resources and models, which is a topic for further development or research.
However, the idea of this keyword generator is promising: it can reduce cost and time, which makes SEO planning easier for content creators and SEO experts.

References

[1] https://blog.alexa.com/seo-problems-youre-probably-not-monitoring/#:~:text=There%20are%20three%20lesser%2Dknown,Google%20penalties%2C%20and%20keyword%20cannibalization.
[2] https://towardsdatascience.com/a-gentle-introduction-to-calculating-the-tf-idf-values-9e391f8a13e5
[3] https://towardsdatascience.com/fine-grained-analysis-of-sentence-embeddings-a3ff0a42cce5