How to avoid getting blocked
Scanning and fetching data directly from store websites has several advantages:
- A simple, clear process for importing products.
- You can work with stores that have no API or data feeds.
- Up-to-date and complete data.
But large-scale data extraction and product data parsing come with their own challenges. One of them is that some websites implement anti-bot mechanisms. Your bot can be blocked if it sends too many requests per hour or day, and the restriction is usually imposed on your hosting's IP address.
The most obvious reason for blocking bots is to prevent heavy automated traffic that could affect website performance. Note that fetching a product and updating its price each require a separate HTTP request to the target website.
That's why the general rule for stable use of External Importer is to send as few requests to the target websites as possible!
External Importer was developed to be a good bot and not to inconvenience other websites. For example, the plugin follows the rules in robots.txt and enforces request limits. You can find these settings under External Importer > Settings > Extractor.
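To illustrate what robots.txt compliance means in general (this is not the plugin's internal code), here's a minimal Python sketch using the standard library's `urllib.robotparser`; the domain and rules are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a target store (a real crawler would fetch
# it from https://example-store.com/robots.txt).
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Crawl-delay: 10
"""

def make_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules so each URL can be checked before fetching."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp

rp = make_parser(ROBOTS_TXT)
print(rp.can_fetch("*", "https://example-store.com/product/123"))   # → True
print(rp.can_fetch("*", "https://example-store.com/cart/checkout")) # → False
print(rp.crawl_delay("*"))                                          # → 10
```

A well-behaved bot checks `can_fetch` before every request and honors the `Crawl-delay` directive when the site declares one.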
Pay particular attention to the Daily limit. The plugin counts all requests to each domain over every 24-hour period and blocks automated queries that exceed this limit.
There's also an option to block new requests for 1 or 24 hours if several consecutive errors are received from the target website.
Only automated requests (price updates and auto-import) are blocked. These limits don't apply when you extract products by manually entering a URL on the Product Import page.
So please don't import too many products at a time from one source; it's better to spread big tasks over several days.
We also recommend setting as long a pause as possible between requests for each product.
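Pausing between requests looks like this in a generic fetch loop (a sketch, not plugin code; `fetch` stands in for the real HTTP request, and the pause value is only an example):

```python
import time

def fetch(url: str) -> str:
    """Placeholder for the real HTTP request to the target store."""
    return f"GET {url}"

def throttled_fetch(urls, pause_seconds, sleep=time.sleep):
    """Fetch URLs one by one, pausing between consecutive requests.
    `sleep` is injectable so the pacing can be tested without waiting."""
    results = []
    for i, url in enumerate(urls):
        if i:  # pause between requests, not before the first one
            sleep(pause_seconds)
        results.append(fetch(url))
    return results

urls = ["https://example-store.com/product/1",
        "https://example-store.com/product/2"]
print(throttled_fetch(urls, pause_seconds=5))
```

The longer the pause, the more the traffic resembles a human visitor rather than a burst of automated queries.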
What else can you do to avoid getting blocked?
Don't update prices too often.
Your server's IP may be temporarily or permanently blocked by the target website for the following reasons:
- You're sending too many requests.
- You use a shared IP with a bad reputation or bad hosting neighbors.
- The website has low bot tolerance and blocks bots globally.
- The country where your server is located is blocked on the website.
You can get the following errors because of blocking:
- 403 - Forbidden
- 503 - Service Unavailable
- 429 - Too Many Requests
- 408 - Request Timeout
- 400 - Bad Request
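When one of these status codes comes back, the right reaction is to slow down rather than retry immediately. A minimal Python sketch of that idea, with exponential backoff (the `Blocked` exception and `fetch` callable are hypothetical stand-ins for your HTTP layer):

```python
import time

# Status codes from the list above that usually indicate blocking.
BLOCK_STATUSES = {400, 403, 408, 429, 503}

class Blocked(Exception):
    """Raised by the fetch layer when the site answers with a block status."""
    def __init__(self, status: int):
        self.status = status

def fetch_with_backoff(fetch, url, retries=3, base_delay=60, sleep=time.sleep):
    """Retry a blocked request, waiting exponentially longer each time
    (60s, 120s, 240s, ...) instead of immediately sending more traffic."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Blocked as err:
            if err.status not in BLOCK_STATUSES:
                raise  # not a blocking error; let it propagate
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"still blocked after {retries} attempts: {url}")
```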
What you can do:
- Temporarily disable price updates and don't send new requests. Temporary blocks are usually lifted within a day.
- Try negotiating with the website owners to whitelist your IP. Some advertisers might do you a favor as you promote their products and generate traffic.
- Use a dedicated IP instead of a shared IP.
- Change your hosting. We don't recommend large-scale cloud providers like Amazon Web Services or Google Cloud, as some websites block entire subnets belonging to these services.
- Use proxies.
- Use built-in crawling services (paid): Scraperapi, Proxycrawl, Scrapingdog.
Built-in third-party services (Scraperapi, Proxycrawl, Scrapingdog) can work as browser emulators and let you bypass IP restrictions, blocks, and captchas. These services can also extract data from dynamically generated sites with JS rendering (custom parsers may be required). They're paid services but offer free limits. Go to External Importer > Settings > Extractor to add your API keys and route requests for individual domains through these services.
You can also use free or paid proxies. Many free proxies can be found via search engines, but they usually aren't stable, so we recommend paid proxy services.
To add a proxy, go to External Importer > Settings > Extractor > Proxy list. Then add the domains whose requests should be sent via proxy to the Proxy whitelist domains field.
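For reference, routing requests through a proxy is a standard HTTP-client feature. A minimal Python sketch using the standard library (the proxy address is hypothetical; substitute one from your proxy provider):

```python
import urllib.request

# Hypothetical proxy address; use the credentials from your paid proxy service.
PROXY = "http://user:password@proxy.example.com:8080"

def opener_for(proxy_url: str) -> urllib.request.OpenerDirector:
    """Build an opener that routes HTTP and HTTPS traffic through one proxy,
    similar in spirit to the plugin's Proxy list setting."""
    handler = urllib.request.ProxyHandler({"http": proxy_url,
                                           "https": proxy_url})
    return urllib.request.build_opener(handler)

opener = opener_for(PROXY)
# opener.open("https://example-store.com/product/123")  # goes via the proxy
```

This way requests reach the target store from the proxy's IP instead of your server's, which is what lets a blocked or low-reputation hosting IP keep working.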