What you're describing is a choice. They chose which people to believe, with zero vetting.
> The point is that with ML training data, such a vast quantity is required that it's unreasonable to expect humans to be able to research and guarantee the legal provenance of it all.
I'm not sure what you're presenting here is actually true. A key part of ML training is the training part. Other domains require a pass/fail classification of the model's output (image identification, speech recognition, etc.), so why is source code any different? The idea that "it's too much data" is a cop-out, and an absurd one for a company sitting on ~$100B in cash reserves.
Your argument kind of demonstrates the underlying point here: They took the cheapest/easiest option and it's harmed the product.
> A crawler simply believes that licenses, which are legally binding statements, are made by actual owners, rather than being fraud. It does seem reasonable to address the issue with takedowns, however.
Yes, and to reiterate, they chose this method. They were not obligated to do this, they were not forced to pick this way of doing things, and given the complete lack of transparency it's a large leap of faith to assume that their training pipeline even looked at LICENSE files to determine which licenses were present.
For what it's worth, that doesn't seem to be what OpenAI did when they initially trained the model, per their paper[1]:
> Our training dataset was collected in May 2020 from 54 million public software repositories hosted on GitHub, containing 179 GB of unique Python files under 1 MB. We filtered out files which were likely auto-generated, had average line length greater than 100, had maximum line length greater than 1000, or contained a small percentage of alphanumeric characters. After filtering, our final dataset totaled 159 GB.
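For illustration, here's a rough sketch of what file-level filters like those might look like. The line-length thresholds come straight from the quote; the "likely auto-generated" check and the alphanumeric cutoff are my guesses, since the paper doesn't spell them out:

```python
# Sketch of the kind of file-level filtering described in the excerpt above.
# Thresholds for avg/max line length are from the paper; the auto-generation
# markers and the alphanumeric fraction cutoff are assumptions for illustration.

def looks_auto_generated(text: str) -> bool:
    # Assumption: flag files whose header contains common code-generator markers.
    markers = ("auto-generated", "autogenerated", "do not edit", "generated by")
    head = text[:2000].lower()
    return any(m in head for m in markers)

def keep_file(text: str) -> bool:
    lines = text.splitlines()
    if not lines:
        return False
    avg_len = sum(len(line) for line in lines) / len(lines)
    max_len = max(len(line) for line in lines)
    alnum_frac = sum(c.isalnum() for c in text) / max(len(text), 1)
    return (
        not looks_auto_generated(text)
        and avg_len <= 100
        and max_len <= 1000
        and alnum_frac >= 0.25  # "small percentage" threshold is a guess
    )
```

Note that nothing in that kind of filter says anything about licensing: it selects on file shape, not provenance, which is exactly the point.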
I have not seen anything concrete about any further training after that, largely because none of it is transparent.