I feel like every day I come across 15-20 "AI-powered tools" that “analyze” something, and none of them clearly state how they use your data. This one seems harmless enough: you put a profile in and it scrapes everything about that person, all their personal information, their location, every post they ever made… Nothing can possibly go wrong aggregating all that personal info, right? No idea where this data is sent, where it’s stored, or who it’s sold to. Kinda alarming.
A toy like that is easy to create and not that expensive to offer. It costs more to run than some plain JavaScript or CSS, but in the end it’s not that different.
I think people don’t really understand this whole scraping thing. For example, you can torrent all of Reddit up to the API change: all the comments, profiles, and usernames, including now-deleted stuff. There is a lot of outrage here over Reddit cracking down on these 3rd-party tools. It’s hard to see how that outrage over cracking down on 3rd-party tools fits with the outrage here over not cracking down on 3rd-party tools.
Anyway, if someone wants to archive all of Bluesky, they don’t need to offer some AI toy. They can just download the content via the API.
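To give an idea of how little effort that takes, here’s a rough Python sketch that pages through one account’s public posts. It assumes the public AppView host (public.api.bsky.app) and the app.bsky.feed.getAuthorFeed XRPC endpoint, so treat the details as approximate rather than gospel:

```python
# Rough sketch: pull a Bluesky account's public posts via the AppView XRPC API.
# Assumes the public.api.bsky.app host and the app.bsky.feed.getAuthorFeed endpoint;
# no authentication is needed for public data.
import requests

APPVIEW = "https://public.api.bsky.app/xrpc"

def fetch_author_feed(actor: str, limit: int = 100):
    """Yield every post in an account's feed, following the pagination cursor."""
    cursor = None
    while True:
        params = {"actor": actor, "limit": limit}
        if cursor:
            params["cursor"] = cursor
        resp = requests.get(f"{APPVIEW}/app.bsky.feed.getAuthorFeed",
                            params=params, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        for item in data.get("feed", []):
            yield item["post"]
        cursor = data.get("cursor")
        if not cursor:
            break

if __name__ == "__main__":
    # Any handle or DID works as the actor; "bsky.app" is just an example.
    for post in fetch_author_feed("bsky.app"):
        print(post["record"].get("text", ""))
```

Loop that over a list of handles (or sit on the firehose instead) and you have the whole “scraper” that these toys are built on.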
You can still torrent Reddit Pushshift data from after the API change, too. But yeah, I definitely agree otherwise: these are just cheap toys that less experienced developers create for their portfolios.
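For anyone curious what’s actually in those torrents: the dumps are zstandard-compressed newline-delimited JSON, so a few lines of Python are enough to walk every comment. The file name and window size below are illustrative assumptions, not a fixed spec:

```python
# Rough sketch for reading a torrented Pushshift dump: the archives are
# zstandard-compressed newline-delimited JSON, one comment/submission per line.
# The file name below is hypothetical; the large window size is an assumption
# to accommodate long-range-compressed dumps.
import json
import zstandard as zstd

def read_pushshift(path: str):
    with open(path, "rb") as fh:
        dctx = zstd.ZstdDecompressor(max_window_size=2**31)
        with dctx.stream_reader(fh) as reader:
            buffer = b""
            while chunk := reader.read(2**20):
                buffer += chunk
                *lines, buffer = buffer.split(b"\n")
                for line in lines:
                    if line:
                        yield json.loads(line)
            if buffer.strip():
                yield json.loads(buffer)

if __name__ == "__main__":
    # Example dump file name; substitute whatever archive you actually have.
    for comment in read_pushshift("RC_2023-01.zst"):
        print(comment.get("author"), comment.get("body", "")[:80])
```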
> A toy like that is easy to create and not that expensive to offer.
Right, and the developers of Bsky didn’t think to maybe block something that scrapes all that personal information?
Like Lemmy or Mastodon, Bluesky was built with federation in mind. While Bluesky isn’t fully federated yet, federated services are inherently very easy to scrape.
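How easy? Most Mastodon instances, for example, expose their public timeline over an unauthenticated REST endpoint, so a “scraper” is basically one HTTP call. A rough sketch (the instance URL is just an example):

```python
# Rough sketch of why federated services are easy to scrape: many Mastodon
# instances serve their public timeline over an unauthenticated REST endpoint.
# The instance URL is an example; swap in any server.
import requests

INSTANCE = "https://mastodon.social"  # example instance

def public_timeline(limit: int = 40):
    resp = requests.get(f"{INSTANCE}/api/v1/timelines/public",
                        params={"limit": limit}, timeout=30)
    resp.raise_for_status()
    return resp.json()  # list of status objects: author, content, timestamps, ...

if __name__ == "__main__":
    for status in public_timeline():
        print(status["account"]["acct"], status["created_at"])
```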
Maybe it’s time for people to understand that anything they post, vote on, comment, or like should be considered effectively public.
The only money to be made in the LLM craze is in data scraping, collection, filtering, collation, and dataset selling. When in a gold rush, don’t dig, sell shovels. And AI needs a shit ton of shovels.
The only ones making money are Nvidia, the third-party data center operators, and the data brokers. Everyone else running and using the models is losing money; even OpenAI, the biggest AI vendor, operates at a loss. Eventually the bubble will burst, and the data brokers will still have something to sell. In the meantime, the fastest way to increase model performance is to increase model size, which means more data is needed to train them.