Original, unique copy that cannot be found anywhere else on the web is highly valued by Google. For this reason, duplicate content is one of the biggest problems your website can face when you are aiming for good organic rankings. But what exactly is duplicate content? In this post we will explain what it is so you can easily identify it and learn how to avoid it.
Duplicate content is content that appears more than once, either within the same site or across two different sites (in the latter case we may also be talking about plagiarism or quoting). In other words, the same content (text, images, etc.) is found on two pages with completely different URLs. This is a recurring problem, especially among e-merchants, who easily fall into this trap because of the many faceted filters dedicated to the user experience. Duplicate content, also known as “DC”, can be subject to an algorithmic penalty.
One might legitimately think that applying a simple rule keeps them safe from duplicate content: “I don’t copy and paste my product sheets from another site or from another page of my own site, so I’m fine.” I wish it were that simple. Today, with the proliferation of huge menus and faceted filters, duplicate content is a sword of Damocles that often goes unnoticed. According to some estimates, 29% of the web is duplicated in this way. Here are the most common causes of duplicate content:
URL and crawl parameters are a recurring source of duplicate content. They can be a problem not only because of the parameters themselves, but also because of the order in which they appear in the URL.
For example, a page such as https://exemple.com/produits/femmes/robes/vert.html can also be served through parameterized variants of the same URL, each of which search engines treat as a separate page.
Additionally, users’ private sessions can create duplicate content. If a session identifier is automatically generated and appended to the URL as a parameter, it can produce duplicate content whenever that URL is shared elsewhere and then crawled by Google. Since it is very difficult to predict how URL parameters will behave, it is best to avoid them as much as possible. Moreover, parameterized URLs are often poorly indexed or poorly positioned in Google.
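As an illustration (the URLs below are hypothetical), the same dress page could end up reachable through several equivalent addresses:

```text
https://exemple.com/produits/femmes/robes/vert.html
https://exemple.com/produits/robes.html?categorie=femmes&couleur=vert
https://exemple.com/produits/robes.html?couleur=vert&categorie=femmes
https://exemple.com/produits/robes.html?couleur=vert&sessionid=a1b2c3
```

Each of these serves the same content, but crawlers see four distinct pages: one clean URL, two parameter orderings, and one session-ID variant.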
When migrating from HTTP to HTTPS, instances of duplicate content can increase significantly if certain checks are skipped. If a page remains available in both its HTTP and HTTPS versions, search engines see two nearly identical pages, and both can be penalized.
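A minimal sketch of the usual fix, assuming an Apache server with mod_rewrite enabled (placed in the .htaccess at the web root), is a site-wide 301 redirect from HTTP to HTTPS:

```apache
RewriteEngine On
# Redirect any request arriving over HTTP to its HTTPS equivalent
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
```

With this in place, only the HTTPS version remains reachable, so search engines no longer see two copies of each page.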
Product catalogs are one of the main causes of duplicate content. The recurring problem of all e-commerce sites is related to the product catalog feed. Some sites have tens of thousands of products to list, some of which differ only in color or size, and few sites have enough human resources to write a unique description for each product. Even if Google claims to apply a certain tolerance, in practice we see that not all of these pages are indexed or ranked.
Duplicate content affects the way search engines index your content: they have to choose which copy to use as the reference. Search engines also spend time crawling the same content multiple times (depending on how often your content is repeated), so some good content will likely be crawled less and rank lower. Because, again, search engines want to provide the best user experience, they will not serve multiple versions of the same content; each time, they will choose the version they consider best. If you do not solve the problem of duplicate content, you risk seeing your positions fall in search results, and thus losing traffic. Google can also remove some pages from search results entirely. In addition, if you run link-building campaigns or earn links naturally, having multiple entry points dilutes and scatters the value of those incoming links, whereas if all of them pointed to a single page, their weight in terms of popularity would be much greater. Duplicate content limits your content’s potential in terms of search engine visibility and negatively impacts your SEO traffic. Rest assured, though: there are solutions.
When plagiarism is suspected, there are tools to find out which sites may have copied your content. “Positeo”, “Plagiarism”, and “Copyscape” offer free versions; to detect duplicate content in bulk, you need the paid versions.
A crawler can also flag duplicate content internal to the site. Among the most powerful crawling tools are “Botify” and “Oncrawl”. There are also lighter tools for smaller sites, such as “Site Analyzer” or “Screaming Frog SEO Spider” in its free version.
These tools tell you the percentage of similarity between pages.
Fortunately, there are techniques to avoid duplicate content, although applying them is not always straightforward:
The best way to handle duplicate content is to use the rel=”canonical” attribute. It tells search engines which URL to consider the original, so that when bots find a duplicate page, they know to ignore it in favor of the canonical one.
The rel=”canonical” attribute is embedded directly in the page’s HTML header (the <head> section).
It takes the form of a <link> tag, and it must be added to each duplicate page. The original page should also carry a canonical URL, which this time points to itself (a self-referencing canonical).
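As an illustration (the URL is hypothetical), a duplicate page declares the original in its <head> like this:

```html
<head>
  <!-- On the duplicate page: point search engines to the original URL -->
  <link rel="canonical" href="https://exemple.com/produits/femmes/robes/vert.html" />
</head>
```

On the original page itself, the same tag is used with its own URL as the href, producing the self-referencing canonical mentioned above.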
Sometimes duplicate content is circumstantial. For example, a new product with a new reference may have content identical to the old version of that product. In this case, the canonical tag is not the best fit, because search engines will keep crawling the old version that has become outdated. A 301 redirect prevents the duplicate content while transferring the old page’s popularity to the new one.
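A minimal sketch, again assuming an Apache server (the product paths are hypothetical), of a 301 redirect from the old product page to its replacement:

```apache
# Permanently redirect the outdated product page to the new reference
Redirect 301 /produits/robe-verte-ref-100 /produits/robe-verte-ref-200
```

Visitors and bots requesting the old URL are sent permanently to the new one, so only the new page is crawled and indexed.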
This is the least “clean” solution. In principle, a well-designed site should not need to put pages in noindex, but technical limitations sometimes prevent best practices from being applied. The content=”noindex, follow” meta tag can be added manually on each page, which makes it possible to quickly correct duplicate content problems. It looks like this:
<meta name="robots" content="noindex, follow" />
This tag allows robots to crawl the page but prevents it from being indexed. By using it, we “remove” ourselves from search engines. It’s like telling them: “I know I have duplicate pages, but I promise I’m not doing it on purpose or trying to manipulate bots to get multiple identical pages into the SERPs.” A common mistake is to block these pages in the robots.txt file: for bots to see the noindex tag, they must be able to crawl the page.
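To illustrate the pitfall (the path is hypothetical): a robots.txt rule like the one below stops Googlebot from ever fetching the pages, so it never sees their noindex tags and may keep the URLs in its index.

```text
# robots.txt — do NOT do this for pages that carry a noindex tag:
User-agent: *
Disallow: /produits/doublons/
```

Leave such pages crawlable and let the meta robots tag do the de-indexing.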
Google Search Console lets you set the preferred domain for your site and specify how Googlebot should handle certain URL parameters. Depending on your site structure and the origin of your duplicate content, setting the preferred domain or managing parameters can be a helpful solution. However, this method only works with Google: it will not fix your problems on other search engines, and replicating these changes in the webmaster tools of every other engine can be very tedious. It is always better to fix the problem at the source, on the server side.
When building a clean architecture free of duplicate content, it is essential to keep your internal linking consistent: each internal link must point to the canonical URL, not to a duplicate page.
Duplicate content is a big problem for all sites, especially e-commerce sites. Even with experience, it is sometimes difficult to anticipate every possible source of duplicate content. This is why it is essential to invest in a good crawling tool that lets you constantly monitor the integrity of your site.
All rights reserved, Nofal S.E.O™ 2022