zlacker

No more so than regurgitating an entire book. While it is technically possible for certain repos that are ubiquitous on the internet (and therefore so overrepresented in the training data that they are "regurgitated" verbatim, in whole), it is extremely unlikely and would only occur after deliberate prompting. The NYT suit against OpenAI shows (in discovery) that the NYT was only able to get partial results after deliberately prompting the model with portions of the text they were trying to force it to regurgitate.

So: yes, technically possible, but practically impossible by accident. Furthermore, when you make this argument you reveal that you don't understand how these models work. They do not simply compress all the data they were trained on into a tiny storable version. They are effectively large matrices of weights that get multiplied against the input to predict the most likely next token (read: roughly a few characters of text).

So the model does not "contain" code. It "contains" a way of doing calculations for predicting what text comes next.
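To make that concrete, here is a toy sketch of "weights in, next token out". Everything in it is an assumption for illustration: the five-token vocabulary, the hand-written weight values, and the function names are all made up, and a real model has billions of learned weights across many layers rather than one small table.

```python
import math

# Hypothetical toy vocabulary; real models use tens of thousands of tokens.
VOCAB = ["def", " add", "(", "a", ")"]

# Made-up weights: row i scores every candidate token that could follow token i.
# Note that this table holds only numbers, no text.
WEIGHTS = [
    [0.0, 3.0, 0.5, 0.1, 0.1],  # after "def"
    [0.1, 0.0, 3.0, 0.2, 0.1],  # after " add"
    [0.1, 0.1, 0.0, 3.0, 0.5],  # after "("
    [0.1, 0.1, 0.2, 0.0, 3.0],  # after "a"
    [0.5, 0.1, 0.1, 0.1, 0.0],  # after ")"
]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_probs(token):
    """Multiply a one-hot input vector by the weight matrix, then normalize."""
    one_hot = [1.0 if t == token else 0.0 for t in VOCAB]
    scores = [
        sum(one_hot[i] * WEIGHTS[i][j] for i in range(len(VOCAB)))
        for j in range(len(VOCAB))
    ]
    return softmax(scores)


def generate(start, steps):
    """Greedy decoding: always append the single most likely next token."""
    out = [start]
    for _ in range(steps):
        probs = next_token_probs(out[-1])
        out.append(VOCAB[probs.index(max(probs))])
    return "".join(out)


print(generate("def", 4))
```

The output is produced by arithmetic on the numbers, not by looking up stored text. The same sketch also shows the over-representation case: if the weights favor one continuation strongly enough, greedy decoding will keep reproducing that exact sequence.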

Finally, let's say that it is possible that the model does spit out not entire works, but a handful of lines of code that appear in some codebase.

This does not constitute copyright infringement, as a) the lines in question represent a tiny portion of the whole work (and copyright only protects against the reproduction of whole works or significant portions of a work), and b) there are a limited number of ways to accomplish a certain function, and it is not only possible but inevitable that two devs working independently could arrive at the same implementation. Therefore using an identical implementation (which is what this case would be) of a part of a work is no more illegal than the use of a certain chord progression or melodic phrasing or drum rhythm. Courts have ruled on this extensively.

replies(2): >>typpil+BC >>aspenm+5S

>>popalc+(OP)
It's also why some companies do clean room design.

>>popalc+(OP)
> No more so than regurgitating an entire book.

Like this?

Meta's Llama 3.1 can recall 42 percent of the first Harry Potter book - https://news.ycombinator.com/context?id=44972296 - 67 days ago (313 comments)

replies(1): >>popalc+RY3

>>aspenm+5S
Yes, that is one of those works that is over-represented in the training data, as I explained in the part of the comment you clearly did not comprehend.

replies(1): >>aspenm+2Z3

>>popalc+RY3
> you clearly did not comprehend

I comprehend it just fine; I was adding context for those who may not.