AI agents are starting to eat SaaS

>>jnord+(OP)
Earlier this year I thought that rare proprietary knowledge and IP were a safe haven from AI, since LLMs can only scrape public data.

Then it dawned on me how many companies are deeply integrating Copilot into their everyday workflows. It's the perfect Trojan Horse.

>>Oarch+65
Even if they were doing this (I highly doubt it), so much would be lost to distillation that I'm not convinced much would actually get in, apart from perhaps internal codenames or whatever, which would be obvious.

>>matt-p+Pa
Well, perhaps this is naive of me, since I don't fully understand the training process. However, once all available training data has been exhausted, gains from synthetic data have been exhausted, and there is a large pool of publicly available AI-generated code, at what point is it 'smart' to scrape codebases you identify as high quality, clean them up to remove identifiers, and use that for training a smaller model?
