Yup. The robots.txt file isn't only meant to block bots from the site outright; it's also meant to steer them away from resources that aren't interesting to human readers, even indirectly.
For example, MediaWiki installations are pretty clever in that by default, /w/ is blocked and /wiki/ is encouraged. Nobody wants technical pages and wiki histories in search results; they only want the current versions of the pages.
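If you want to poke at that behaviour yourself, here's a tiny Python sketch using the standard library's robots.txt parser. The hostname and the rules are made up for illustration, not copied from any particular wiki's live robots.txt:

```python
from urllib.robotparser import RobotFileParser

# Illustrative MediaWiki-style rule: block the technical /w/ paths,
# leave the reader-facing /wiki/ pages crawlable.
ROBOTS_TXT = """\
User-agent: *
Disallow: /w/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Current article pages under /wiki/ stay fetchable...
print(rp.can_fetch("*", "https://wiki.example.org/wiki/Main_Page"))  # True

# ...while edit forms, histories, and other index.php plumbing do not.
print(rp.can_fetch(
    "*",
    "https://wiki.example.org/w/index.php?title=Main_Page&action=history",
))  # False
```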
Fun tidbit: in the late 1990s, there was a real epidemic of spammers scraping web pages for email addresses. Some people developed wpoison.cgi, a script whose sole purpose was to generate garbage web pages full of bogus email addresses. Real search engines ignored these, thanks to robots.txt. Guess what the spam bots did?
Do the AI bros really want to go there? Are they asking for model collapse?
Reddit has a user data checkout feature (IIRC, look in the user settings or maybe the Reddit help pages to find it).
It’s a bit crap though.
It takes a long time to process, especially if you happened to post in the era when the Reddit data infrastructure was horribly terrible instead of merely ordinarily terrible, and apparently the worst cases involve some manual work on the staff's part.
Some data may be missing or truncated. It doesn't give you data from privated/banned subreddits (which was a fun thing to discover, because the last time I tried this the blackouts were on), and even for legitimate stuff, long comments/posts may be truncated. On top of that, I'm pretty sure the dumps just straight up didn't have all of my posts from several years ago, even though those were on public subreddits. So you need to make sure the checked-out data is sensible.
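For the "make sure it's sensible" part, the laziest check I know is to just count rows per file and compare against what your profile claims. This is only a sketch: it assumes the export unpacks into a folder of CSV files, and the reddit_export path is a placeholder for wherever you put the archive:

```python
import csv
import pathlib

# Placeholder path: point this at wherever you unpacked the checkout archive.
export_dir = pathlib.Path("reddit_export")

for csv_path in sorted(export_dir.glob("*.csv")):
    with csv_path.open(newline="", encoding="utf-8") as f:
        # Count data rows only; DictReader consumes the header line.
        count = sum(1 for _ in csv.DictReader(f))
    print(f"{csv_path.name}: {count} rows")
```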
In conjunction with the official dumps, I recommend a few other tools, especially since the dumps aren't really magnificently usable on their own. One tool I found personally invaluable is reddit-user-to-sqlite, which lets you import Reddit data dumps and whatever live user data is still available (I think it does this by scraping or something; I'm pretty sure it still worked after the API shutdown) into an SQLite database, and Datasette is a nice frontend for browsing the posts.
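If you want a quick peek at whatever database the import produced before firing up Datasette, plain sqlite3 from the Python standard library is enough. The filename reddit.db below is just a guess; use whatever you pointed the tool at, and note that it only lists tables and counts rows rather than assuming any particular schema:

```python
import sqlite3

con = sqlite3.connect("reddit.db")  # assumed filename, adjust to your setup

# Ask SQLite itself which tables the import created.
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

for table in tables:
    (count,) = con.execute(f'SELECT count(*) FROM "{table}"').fetchone()
    print(f"{table}: {count} rows")

con.close()
```

After that, `datasette reddit.db` serves the same file as a browsable web UI.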
As for scrubbing, there are tools for that which are supposed to work. I think.