Burton Kelso, Tech Expert

How To Stop AI From Scraping Data From Your Website



AI is a wonderful tool, but it isn't perfect. Along with their internal databases, AI tools scrape information from websites all over the planet, including your personal and business websites, to get the data they need to answer your questions. You may not want AI to use your website and intellectual property to train its large language model. Here's what you need to know.


What are AI scrapers? Bots are crawling all over the Internet. The best-known are search engine bots, which collect data to index websites and determine where they rank on Google and other search engines. Spam bots are designed to harvest data such as your email address. Bots used with generative AI tools such as Copilot, Gemini, and ChatGPT are designed to collect your website's information and then use that data to train their AI to respond to your text and photorealistic image prompts. This is controversial because web creators aren't happy about the intellectual property and privacy violations that come with bots taking that information.


Should you block AI from scraping your website? On one hand, if you allow AI to scrape your website, you are helping train AI and make it better. On the other hand, you could see your website content show up on other websites without any credit to you. It's important to know that Google-Extended controls whether Google can use your content for its AI models such as Gemini. It does not block Google's Search Generative Experience (SGE), which is part of Google Search and relies on the regular Googlebot crawler. Google-Extended is also separate from search indexing, so blocking it poses no risk to your organic search rankings.


How to stop AI bots from scraping your website. There are several ways to try to stop AI bots from scraping your website. Some of these suggestions may require advanced knowledge of web design, so you might have to consult with your website creator to implement some of these tips.


Change the settings in your site builder tools. If you or your web designer created your website with Wix, Squarespace, GoDaddy, or another website builder tool, you can go into its settings and instruct your website to block AI scraping.


Using the robots.txt protocol. Major AI developers say their crawlers honor robots.txt, a file that lets you tell crawlers not to scrape data from your website. You can add a robots.txt file to your website containing the following lines.

User-agent: GPTBot
Disallow: /


This tells GPTBot, the crawler OpenAI uses to gather training data for ChatGPT, not to crawl any page on your website. To block only specific pages or subfolders, just change the / to the path you want to block.
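
If you want to check how a path-specific rule behaves before you publish it, here is a minimal Python sketch that uses the standard library's robots.txt parser. It assumes OpenAI's crawler token GPTBot, and the /private/ folder and example.com address are made up for illustration. Real crawlers may read edge cases differently from Python's parser, so treat this as a sanity check only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep GPTBot out of the /private/ folder only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if these robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder domain, not a real site to test against.
    for path in ("/", "/blog/post", "/private/report.html"):
        url = "https://example.com" + path
        print(path, is_allowed(ROBOTS_TXT, "GPTBot", url))
```

Only the path under /private/ should come back as not allowed for GPTBot. Bots that are not named in the file are unaffected.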


You have to add a separate entry for each AI bot. If you want to keep Google from using your website to train Gemini, use the following lines in your robots.txt file.

User-agent: Google-Extended
Disallow: /

Blocking other AI bots


If you'd rather keep your website information away from other brands of AI, then you may also want to consider Common Crawl. Common Crawl is one of the largest datasets used for AI training, and the models behind ChatGPT and other AI tools have drawn on it. Because of this, Common Crawl's crawler, CCBot, is the 2nd most blocked AI bot.


As with GPTBot and Google-Extended, you can prevent CCBot from scraping your content by using the robots.txt exclusion protocol. Add the lines below to your robots.txt file to stop its crawling activities:

User-agent: CCBot
Disallow: /
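
If you would rather not type each entry by hand, the rules can be generated and checked in one go. This is a rough Python sketch, not an official tool: the function names are my own, the bot list covers only the three tokens discussed in this post (GPTBot, Google-Extended, and CCBot), and example.com is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt tokens discussed in this post; extend the list as needed.
AI_USER_AGENTS = ["GPTBot", "Google-Extended", "CCBot"]


def build_robots_txt(user_agents, disallow="/"):
    """Build one User-agent/Disallow block per bot, separated by blank lines."""
    blocks = [f"User-agent: {ua}\nDisallow: {disallow}\n" for ua in user_agents]
    return "\n".join(blocks)


def blocked_agents(robots_txt, user_agents, url):
    """Return the bots from user_agents that these rules keep away from url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [ua for ua in user_agents if not parser.can_fetch(ua, url)]


if __name__ == "__main__":
    rules = build_robots_txt(AI_USER_AGENTS)
    print(rules)
    # Googlebot is included to confirm that normal search crawling stays allowed.
    print(blocked_agents(rules, AI_USER_AGENTS + ["Googlebot"], "https://example.com/page"))
```

The generated text is what you would save as robots.txt at the root of your website. The check at the end lists the three AI bots as blocked and leaves Googlebot alone.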


Blocking AI bots from larger companies is easy, but new, smaller bots are always popping up, which means that blocking them one by one via robots.txt isn't always the answer.


Use CAPTCHAs. You probably hate CAPTCHAs when you visit websites. Who wants to click on squares to pick out every picture with a fire hydrant? But they work. CAPTCHAs deter automated bots by requiring a human-like response or a computational proof before a visitor can access your website.

Web Application Firewall (WAF)


Install and configure a WAF to filter website traffic. A WAF, or Web Application Firewall, watches all the traffic coming to and from your website and blocks anything that looks dangerous or harmful. This helps keep the website safe from hackers and other bad guys who might try to cause trouble.
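
A real WAF does far more than this, but the basic filtering idea can be sketched in a few lines of Python. The example below is a hypothetical WSGI middleware (the class name, the token list, and the helper are my own) that turns away requests whose User-Agent header contains GPTBot or CCBot. Google-Extended is left out on purpose because it is a robots.txt token, not a name that appears in request headers. Keep in mind that a bot can lie about its User-Agent, so this only stops crawlers that identify themselves.

```python
# Lowercase tokens to look for in the User-Agent header (illustrative list).
BLOCKED_AGENT_TOKENS = ("gptbot", "ccbot")


class BlockAIBots:
    """WSGI middleware that answers 403 to self-identified AI crawlers."""

    def __init__(self, app, blocked_tokens=BLOCKED_AGENT_TOKENS):
        self.app = app
        self.blocked_tokens = tuple(t.lower() for t in blocked_tokens)

    def __call__(self, environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in agent for token in self.blocked_tokens):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Stand-in for your real website."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


def status_for(app, user_agent):
    """Call a WSGI app with a fake request and return its status line."""
    captured = []

    def start_response(status, headers):
        captured.append(status)

    app({"HTTP_USER_AGENT": user_agent}, start_response)
    return captured[0]


if __name__ == "__main__":
    site = BlockAIBots(demo_app)
    print(status_for(site, "Mozilla/5.0 (compatible; GPTBot/1.0)"))
    print(status_for(site, "Mozilla/5.0 Firefox/120.0"))
```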


Website IP Blocking. Website IP blocking is a method used to restrict access to your website or online service based on the IP address of the user trying to connect.
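
Here is a small Python sketch of the check behind IP blocking, using the standard ipaddress module. The blocked ranges below are reserved documentation addresses that I picked purely as placeholders, not real bot addresses, and in practice this check usually lives in your web server, firewall, or hosting control panel.

```python
import ipaddress

# Placeholder ranges from the reserved documentation blocks, not real bot IPs.
BLOCKED_NETWORKS = ["203.0.113.0/24", "198.51.100.7/32"]


def parse_networks(cidrs):
    """Turn CIDR strings such as '203.0.113.0/24' into network objects."""
    return [ipaddress.ip_network(cidr) for cidr in cidrs]


def is_blocked(client_ip, networks):
    """Return True if client_ip falls inside any blocked network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in networks)


if __name__ == "__main__":
    networks = parse_networks(BLOCKED_NETWORKS)
    for ip in ("203.0.113.42", "198.51.100.7", "192.0.2.1"):
        print(ip, "blocked" if is_blocked(ip, networks) else "allowed")
```

Blocking a whole range (the /24 entry) covers 256 addresses at once, while the /32 entry blocks a single address.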


Take Legal Measures. If you feel that AI crawling is infringing on any of your intellectual property rights, seeking legal advice may be the best course of action.


Hopefully, this will give you tips to help prevent AI chatbots from scraping data from your website. What are your thoughts on AI using your content? Are you happy to help these bots learn and become more useful, or do you feel that it is a threat to content producers? Let me know in the comments below. If you have any questions, please reach out. I'm always available.

Looking for More Useful Tech Tips? Our Tuesday Tech Tips Blog is released every Tuesday. If you like video tips, we LIVE STREAM new episodes of 'Computer and Tech Tips for Non-Tech People' every Wednesday at 1:00 pm CST on Facebook, Instagram, LinkedIn, and Twitter. Technology product reviews are posted every Thursday. You can view previous episodes on our YouTube channel.


Sign Up for Our Newsletter! Click this link to sign up and you will receive every tip directly in your inbox each week.


Want to ask me a tech question? Send it to burton@callintegralnow.com. I love technology. I've read all of the manuals and I'm serious about making technology fun and easy to use for everyone.


Need computer or technology help? If you need on-site or remote tech support for your Windows or Macintosh computers, laptops, Android or Apple smartphones, tablets, printers, routers, smart home devices, or anything else that connects to the Internet, please feel free to contact my team at Integral. Our team of friendly tech experts can help you with any IT needs you might have. Reach out to us at www.callintegralnow.com or by phone at 888.256.0829.


Please share this with your friends and family! If you found this post useful, would you mind helping me out by sharing it? Just click one of the handy social media sharing buttons below.


The above content is provided for information purposes only. All information included therein is subject to change without notice. I am not responsible for any direct or indirect damages arising from or related to the use of or reliance on the above content.



