How to avoid getting blocked
Scanning and fetching data directly from store websites have several advantages:
A simple, clear process for importing products.
You can work with stores that have no API or data feeds.
Up-to-date and complete data.
But large-scale data extraction and product data parsing have their own challenges. One of them is that some websites can implement anti-bot mechanisms. Sometimes, your bot can be blocked if it sends too many requests per day/hour. Usually, a restriction is imposed on your hosting's IP address.
The most obvious reason for blocking bots is to prevent heavy automated traffic that could affect website performance. Please note that getting a product and updating a price each require a separate HTTP request to the target website.
That's why the general rule for stable use of External Importer is to send as few requests to the target websites as possible!
External Importer was developed to be a good bot that doesn't inconvenience other websites. For example, the plugin follows the rules in robots.txt and enforces request limits. To find these settings, go to External Importer > Settings > Extractor.
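To illustrate what following robots.txt means in practice, here is a minimal Python sketch built on the standard library's robots.txt parser. It is not the plugin's own code. The robots.txt content, the bot name ExampleImporterBot, and the store.example URLs are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example store (an assumption).
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Crawl-delay: 10
"""

# Hypothetical bot name, used only for this illustration.
USER_AGENT = "ExampleImporterBot"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A well-behaved bot checks every URL before requesting it.
product_allowed = parser.can_fetch(USER_AGENT, "https://store.example/product/123")
cart_allowed = parser.can_fetch(USER_AGENT, "https://store.example/cart/")

# Some sites also publish a suggested pause between requests, in seconds.
suggested_delay = parser.crawl_delay(USER_AGENT)

print(product_allowed, cart_allowed, suggested_delay)
```

In this sketch the product page is allowed, the cart page is not, and the site asks for a 10-second pause between requests.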
Please pay attention to the Daily limit. It's an important value. The plugin counts the requests sent to each domain during every 24-hour period and blocks automated queries that exceed this limit.
There's also an option to block new requests for 1 or 24 hours if the target website returns several errors in a row.
Only automated requests (price updates and auto import) are blocked. These limits aren't applied if you extract products by manually entering the URL on the Product Import page.
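The limits described above can be pictured with a small Python sketch. It is only an illustration of the idea and not the plugin's implementation. The class name DomainLimiter, its parameters, and the choice not to count manual requests toward the limit are assumptions made for the example.

```python
import time
from collections import defaultdict, deque
from typing import Optional

DAY_SECONDS = 24 * 60 * 60


class DomainLimiter:
    """Hypothetical per-domain request limiter (not the plugin's real code)."""

    def __init__(self, daily_limit: int, max_errors_in_row: int,
                 block_seconds: int = DAY_SECONDS) -> None:
        self.daily_limit = daily_limit
        self.max_errors_in_row = max_errors_in_row
        self.block_seconds = block_seconds  # e.g. 1 hour or 24 hours
        self._requests = defaultdict(deque)    # domain -> request timestamps
        self._error_streak = defaultdict(int)  # domain -> errors in a row
        self._blocked_until = {}               # domain -> unblock timestamp

    def allow(self, domain: str, now: Optional[float] = None,
              manual: bool = False) -> bool:
        """Return True if a request to this domain may be sent now."""
        now = time.time() if now is None else now
        if manual:
            # Assumption: manual imports bypass the limits and are not counted.
            return True
        if self._blocked_until.get(domain, 0) > now:
            return False
        window = self._requests[domain]
        # Forget requests that are more than 24 hours old.
        while window and window[0] <= now - DAY_SECONDS:
            window.popleft()
        if len(window) >= self.daily_limit:
            return False
        window.append(now)
        return True

    def record_result(self, domain: str, ok: bool,
                      now: Optional[float] = None) -> None:
        """Track errors in a row and block the domain when there are too many."""
        now = time.time() if now is None else now
        if ok:
            self._error_streak[domain] = 0
            return
        self._error_streak[domain] += 1
        if self._error_streak[domain] >= self.max_errors_in_row:
            self._blocked_until[domain] = now + self.block_seconds
            self._error_streak[domain] = 0
```

An automated job would call allow() before each request and record_result() after it. The limit is tracked separately for each domain, so reaching it for one store does not affect the others.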
So please, don't try to import too many products at a time from one source. It's better to split big tasks into several days.
We also recommend setting as long a pause as possible between requests for each product.
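As a rough illustration of pausing between requests, the following Python sketch spaces out a list of product URLs. The function name paced, the jitter parameter, and the example URLs are hypothetical. In the plugin itself, the pause is configured in the settings rather than in code.

```python
import random
import time
from typing import Callable, Iterable, Iterator


def paced(urls: Iterable[str], pause_seconds: float, jitter: float = 0.2,
          sleep: Callable[[float], None] = time.sleep) -> Iterator[str]:
    """Yield URLs one by one, waiting before every URL except the first.

    A little random jitter is added to each pause so that requests
    don't arrive at perfectly regular intervals.
    """
    for index, url in enumerate(urls):
        if index > 0:
            sleep(pause_seconds * (1 + random.uniform(0, jitter)))
        yield url


if __name__ == "__main__":
    # Hypothetical product URLs, used only for this illustration.
    products = [
        "https://store.example/product/1",
        "https://store.example/product/2",
        "https://store.example/product/3",
    ]
    for url in paced(products, pause_seconds=2):
        print("would fetch", url)
```

The longer the pause, the less the import looks like a burst of automated traffic, at the cost of a slower import.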
What else can you do to avoid getting blocked?
Don't update prices too often.
Your server's IP may be temporarily or permanently blocked on the target website's side for the following reasons:
You're sending too many requests.
You use a shared IP with a bad reputation or bad hosting neighbors.
The website has low bot tolerance and blocks bots globally.
The country where your server is located is blocked on the website.
You can get the following errors because of blocking:
403 - Forbidden
503 - Service Unavailable
429 - Too Many Requests
408 - Request Timeout
400 - Bad Request
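If you check responses in your own scripts or logs, the status codes listed above can be flagged with a few lines of Python. This is a minimal sketch. The helper names are hypothetical, and a flagged code only hints at blocking, since each of these errors can have other causes as well.

```python
from http import HTTPStatus

# Status codes from the list above that may indicate blocking.
POSSIBLE_BLOCK_CODES = {403, 503, 429, 408, 400}


def looks_like_block(status_code: int) -> bool:
    """Return True if the status code is one that blocking often produces."""
    return status_code in POSSIBLE_BLOCK_CODES


def describe(status_code: int) -> str:
    """Format a code as in the list above, for example '403 - Forbidden'."""
    return f"{status_code} - {HTTPStatus(status_code).phrase}"


if __name__ == "__main__":
    for code in (200, 403, 429):
        note = "possible block" if looks_like_block(code) else "no sign of blocking"
        print(describe(code), "->", note)
```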
What you can do:
Temporarily disable price updates, and don't send new requests. Temporary blocks are usually removed a day later.
Try negotiating with the website owners to whitelist your IP. Some advertisers might do you a favor as you promote their products and generate traffic.
Use a dedicated IP instead of a shared IP.
Change your hosting provider. We don't recommend using large-scale cloud hosting providers like Amazon Web Services or Google Cloud, as some websites block entire subnets of these services.
Use proxies.
Use built-in crawling services.