Just implies they tuned it for user experience.
I was expecting there to be some discovery showing they deliberately fine-tuned their model to output modifications if and only if the code had a certain license.
https://storage.courtlistener.com/recap/gov.uscourts.cand.40...
I’m not sure what would be acceptable output for a code generation tool if rewriting the examples isn’t ok and reimplementing something that performs the same function still isn’t ok. Are we automatically granting de-facto code patents on all published code now?
There are also sampling schemes, top_p and top_k, which can each individually help choose tokens that are less probable (but still among the most probable) yet more correct, and they are often used together for best effect.
And then there are decoding methods like beam search, where choosing the best beam may not mean choosing the most probable individual token at each step.
By default a simple greedy search is used, which always chooses the single highest-probability next token.
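To make those knobs concrete, here is a toy Python sketch of greedy decoding versus temperature/top-k/top-p sampling. This is not any vendor's actual implementation; the vocabulary and logits are invented purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["the", "a", "cat", "dog", "sat"]
    logits = np.array([2.0, 1.5, 1.0, 0.5, 0.1])   # hypothetical model scores for the next token

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def greedy(logits):
        # Greedy search: always take the single most probable token.
        return vocab[int(np.argmax(logits))]

    def sample(logits, temperature=1.0, top_k=None, top_p=None):
        probs = softmax(logits / temperature)        # temperature < 1 sharpens, > 1 flattens
        order = np.argsort(probs)[::-1]              # token indices, most probable first
        if top_k is not None:
            order = order[:top_k]                    # keep only the k most probable tokens
        if top_p is not None:
            cum = np.cumsum(probs[order])
            order = order[:int(np.searchsorted(cum, top_p)) + 1]   # smallest set covering top_p mass
        kept = probs[order] / probs[order].sum()     # renormalise over the surviving tokens
        return vocab[int(rng.choice(order, p=kept))]

    print(greedy(logits))                                        # deterministic
    print(sample(logits, temperature=0.8, top_k=3, top_p=0.95))  # varies run to run (seeded here)

Greedy always returns the same token for the same logits, while the sampled variant varies between runs, which is one reason two users can get slightly different completions from the same prompt.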
Are github users gamers? Really puts the "git" into "github" there.
Why would it be? If a function performs the data transform I need, you better believe I'm copy-pasting that sucker with a hyperlink to where I found it.
But then again, I'm not trying to win in court.
My employer (IMHO smartly) forbids the use of LLMs with company IP and on company laptops, etc. Many others, I'm sure, are doing the same, and many more will follow.
[Lawyer and developer Matthew Butterick announced last month that he'd teamed up with the Joseph Saveri Law Firm to investigate Copilot. They wanted to know if and how the software infringed upon the legal rights of coders by scraping and emitting their work without proper attribution under current open-source licenses.]
https://www.theregister.com/2022/11/07/in_brief_ai/
https://www.theregister.com/2022/10/19/github_copilot_copyri...
I understand why these might feel different to you, but textbooks and stack overflow are also proprietary, licensed pieces of work. I don’t see why there would be much of a legal distinction.
There are two worlds.
In one, every time someone publishes code with a license attached, they've taken a chunk out of the set of valid lines of software capable of being permissibly written without license encumbrance. This is the world the poster you are replying to is imagining we're headed toward, and this case does a fantastic job of setting up a test case/precedent for it.
The other world is one where everyone accepts that all programming code is math, and that copyrighting it is like erecting artificial barriers to facilitate information asymmetry, i.e. trying to own 2 + 2. In this second hypothetical world, we summarily reject IP as a thing.
The second world is the one I'd rather live in, as the first truly feels more and more like hell to me. However, given that the first one is the world we're in, I'd like to see the mental gymnastics employed to undermine Microsoft's original software philosophy.
EDIT: Voir dire will be a hoot. Any wagers on how many software people make it onto the jury if any?
I don't know how you would even write code "in your own style." As soon as you start altering it, the result is different: it's more or less efficient.
- Code is not intellectual property; I don't see this as easily defensible. It takes time, effort, and in some cases seriously heavy resources to come up with some of the tech that companies rely on. Should all private companies rescind copyright on literally everything their staff write?
- Intellectual property is a nonsense concept altogether; in this case, I don't think you're ever going to get your way in the court of public opinion.
Code that reduces to a conserved sequence of bytes, interchangeable with no functional variation,
or code that is such common knowledge it has become street graffiti, belongs in world 2.
Versus code that creates functionality not available by direct command, which is innovative and should be attributed. That sounds like what the 1st world should be.
https://storage.courtlistener.com/recap/gov.uscourts.cand.40...
https://storage.courtlistener.com/recap/gov.uscourts.cand.40...
https://storage.courtlistener.com/recap/gov.uscourts.cand.40...
Friendly people.
I've received emails like that too over the years. What hugely controversial thing do I do? I have a website where I sometimes write about $stuff and I post on HN. Keeping the basic info private is probably a good thing, especially if they're based in the US, because of "SWATting" etc., but beyond that the threat doesn't seem "credible" in the sense that someone is actually likely to show up at their door with a gun.
Since the first two are redacted, I wonder if they sent them with their real names.
Non-trivial aspects include names, comments, logging, error checking, structure, and the ordering of operations that aren't sequential.
If this were true of copyright, we would’ve run out of permissible novels a long time ago. There’s plenty to complain about with how software IP works, but copyright seems pretty sane. The alternative of protecting IP via trade secret is not a world I want to live in. That seems bad for open source.
https://en.wikipedia.org/wiki/Idea%E2%80%93expression_distin...
Being a bit hand-wavy with it: it's akin to torrenting music/movies. The torrented files are lossily compressed representations of the original waveform from the music producer. LimeWire, or Pirate Bay, or whatever, provide an interface to retrieve them (download or stream). The model weights are a form of lossy compression, and inference is like document retrieval.
One may say, “it’s like an employee working at company X, then going to work at company Y, they retain their knowledge and experience.” I would say it’s more like, employee going from X to Y, but retaining audio and video recordings of all interactions he had, notes, documents, and other proprietary info and bringing it to company Y.
There really are a lot of other scenarios that involve writing software to make software. It's not possible to list them all... the list changes while I type.
In simple terms:
mov ebx, eax   ; an obvious operation; no IP
mov eax, [eax] ; seems useless unless you know what dereferencing is; probably IP
This is, of course, just an example that doesn't consider granularities at the level of patents on a language, or macro directives.
The central idea of programming languages is that the grammar is very restrictive compared to natural languages. It's quite likely that, with the exception of variable names and whitespace, some function you wrote to implement a circular buffer is coincidentally identical to code that exists in Sony's or Lockheed Martin's codebases.
Plus there's the birthday problem: coincidences can happen way more than you expect. And even with prose, constraints like non-fiction can narrow things down quickly. If everyone on HN had to write a three-sentence summary of, say, how a bicycle works, there would probably be coincidentally identical summaries.
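To make the circular-buffer point concrete, here is about the most natural Python sketch of one (the class and method names are arbitrary). Write it independently a hundred times and the results will differ in little beyond identifiers and whitespace:

    class RingBuffer:
        """Fixed-capacity FIFO that overwrites the oldest entry when full."""

        def __init__(self, capacity):
            self.data = [None] * capacity
            self.capacity = capacity
            self.head = 0          # index of the oldest element
            self.size = 0

        def push(self, item):
            tail = (self.head + self.size) % self.capacity
            self.data[tail] = item
            if self.size < self.capacity:
                self.size += 1
            else:
                self.head = (self.head + 1) % self.capacity   # full: drop the oldest

        def pop(self):
            if self.size == 0:
                raise IndexError("buffer is empty")
            item = self.data[self.head]
            self.head = (self.head + 1) % self.capacity
            self.size -= 1
            return item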
Does your company allow you to outsource your work to people in a poorer nation for a fraction of the cost that you are paid? Why not? Perhaps you should vote with your feet and find a less Luddite employer.
Aside from obligatory syntactic bits, what is the most common line of code across all software ever developed?
It'll probably be C or Java. HTML doesn't count.
And it's probably something boring like:
i++;
The age of the AI bro is here. Having been in the space for a while as someone genuinely interested in the models, and working with them from time to time, I give a lot of eye rolls in meetings when these people start talking about the underlying tech.
Personally I find this whole LLM copyright debate quite funny. As someone who very much has skin in the game (my art was used to train Midjourney), and who runs in a circle of artists, it's interesting to see people's egos come into play here. The ones who are excited about these as tools are the ones who are openly inspired and want to inspire; the ones who claim copyright infringement come off as insecure, almost as if they're afraid that this idea of theirs will be the last great idea they ever have. There's already a separation happening in the art world between the people who are exploding in creative output and the people who are so defensive they cling to the old way of doing things.
If I had my way, I'd see copyright laws abolished completely. A complete free-for-all in innovation. And people who claim that without patents and copyright there's no incentive to make money seriously underestimate humans and their egos' drive to continually innovate.
EDIT: Apparently the lawyers are attending via Zoom.
It was ASM code I think, and their defense was that there was basically one way to write a function that does this.
I am certain that I can find code from Linux or gcc or emacs on Stack Overflow that is under a GPL license and not compatible with the CC license Stack Overflow uses... and yet it's there. What's more, people will copy that code into their own projects, ignoring the CC license too.
How is that really any different from using Copilot, if the original license and attribution are something to respect?
Note that I do think the original license is something to respect, which is why, for any code I write where the copyright actually matters (toy program for home? meh. Hobby project repo that I'll publish? yep. Employer's code for work? absolutely.), I either don't touch questionable sources or run a license check when using them.
The key thing is that I don't consider the use of Copilot to be any more controversial than copying from Stack Overflow - which has been done by countless programmers for a decade before Copilot existed and no one got up in arms about it then.
I wrote a paper during college that I should release some time about when /g/ threw an absolute shitfit over Linus going "so, I've been a kinda shit human being to people and I'm going to step back and get some help", going as far as to blame his daughter/"the woke mob"/multiple named core kernel contributors for killing their god.
At one point, I attended a GitHub event that wasn't directly sponsored by GitHub but encouraged a lot of GitHub users to show up. While there I met several people who, outside the venue, were talking animatedly about Terry Davis. Listening in on the conversation revealed that they more or less just approved of his extensive use of racist language and epithets.
I haven't checked, but I would suspect that Linus' recent "trans rights" by proxy post has caused at least one or two aneurysms in the /g/ user group.
Our managers get emails if we make calls to known LLMs, and there's guidance on locally running LLMs and using their output ("it's okay for small things maybe, but be careful"). Why?
Because legal's job is to protect the company from legal threats. Sometimes that means making some awkward choices, like handwringing over the use of GPL licensed software in publicly exposed example code (such as sample apps) purely because some aspects of the GPL haven't been tested in American courts, much less international ones.
So the use cases for LLMs there are mostly source-to-source transformative ("Turn this function and documentation into javadoc format please") or similar -- stuff where you can show that the LLM isn't introducing anything that might maybe possibly have any hint of externally licensed software.
Even if a programming grammar is more restrictive, there’s some length where things become almost certainly unique.
Specifically for GPT models, the temperature parameter is used to get outputs which are a bit more "creative" and less deterministic. https://help.promptitude.io/en/ai-providers/gpt-temperature
I could never countenance operating under these conditions.
> When evaluating pass@k, it is important to optimize sampling temperature for the particular value of k. In Figure 5, we plot pass@k against the number of samples k and the sampling temperature. We find that higher temperatures are optimal for larger k, because the resulting set of samples has higher diversity, and the metric rewards only whether the model generates any correct solution.
> In particular, for a 679M parameter model, the optimal temperature for pass@1 is T∗ = 0.2 and the optimal temperature for pass@100 is T∗ = 0.8. With these temperatures, we find that pass@1 and pass@100 scale smoothly as a function of model size (Figure 6).
So even with pass@1 (likelihood of getting the right answer in 1 attempt) you don't use T=0, so there will be slight variations in the output each time.
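For anyone who wants to reproduce those numbers, the pass@k values quoted above are computed with the unbiased estimator from the same paper: generate n samples per problem, count the c that pass the unit tests, and estimate the chance that at least one of k randomly drawn samples passes. A small Python sketch (the 200/17 figures below are made up for illustration):

    from math import comb

    def pass_at_k(n, c, k):
        # n: samples generated, c: samples that passed the tests, k: attempts allowed (k <= n)
        if n - c < k:                 # fewer failing samples than k: a passing sample is always drawn
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # e.g. 200 samples per problem, 17 of which pass
    print(pass_at_k(200, 17, 1))      # 0.085
    print(pass_at_k(200, 17, 100))    # very close to 1.0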
FWIW, humans certainly can infringe other peoples' copyrights and can do so even if they aren't actively intending to do so. There is some boundary across which you are no longer just learning something and you are now copying, and it isn't clear at all that these generative AI techniques are actively considering the latter the way a human is required to.
But, sure: if you are against the idea of copyright entirely then it is hard to consider the idea inconsistent, though I would think a world without copyright would be a particularly hard one for an artist to make money at all...
Surely you're not suggesting that there's no such thing as "original work". Producing it can carry very high capital and labour costs, and if that investment isn't protected from theft, the incentive to produce original work disappears.
>As someone who very much has skin in the game (my art was used to train Midjourney)
I don't know your specific situation, but there's obviously different scales of importance here. What if your art was your sole source of income, and people were reproducing it under their own name? or if you had a product where you poured millions into developing some novel IP/methods, and some employee brought it with them when they went to work at your competitors?
I basically use it as Stack Overflow on steroids. It is not even close to GPT-4 in terms of reproducing some original idea I could not find in a search engine.
Look at FizzBuzz. If you were to set strict requirements on performance (and allow for iterative testing), the results from different groups of people would be identical. They would reach the same conclusion because that's how code works; it's far more aligned with math than with creative writing.
So you cannot take an existing code solution and translate it into your own style. You are altering the program, its efficiency, and therefore the solution itself, even when you do something as small as changing a single variable name!
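For what it's worth, here is the textbook FizzBuzz as a Python sketch; given the spec, independently written solutions rarely differ in more than names and branch order:

    def fizzbuzz(n):
        # Multiples of 15 -> "FizzBuzz", of 3 -> "Fizz", of 5 -> "Buzz", otherwise the number itself.
        out = []
        for i in range(1, n + 1):
            if i % 15 == 0:
                out.append("FizzBuzz")
            elif i % 3 == 0:
                out.append("Fizz")
            elif i % 5 == 0:
                out.append("Buzz")
            else:
                out.append(str(i))
        return out

    print(fizzbuzz(15))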
> some employee brought it with them when they went to work at your competitors?
Other programming languages have copied lots of D features. We at the DLF don't mind at all. Though often they copy them and kinda miss the mark.
(Yes, we sometimes copy features from other languages, too, and try to improve on them.)
I'm intrigued. I'd like to see the subset of the list that are people who were in Republican politics.
An aside about this from a moderately longtime Nix user and very occasional Nixpkgs contributor:
I used to occasionally post about Nix on /g/ before virtually anyone there knew what it was just to gauge reactions, and boy were people shitty and dismissive about it. It was all hot takes, broad strokes, and very little curiosity about the technical details. And even though Nix is 'cool' on /g/ now, all of those things are still true about the way /g/ treats NixOS and other distros.
The interest that 90% of /g/ users have in Linux distros like NixOS is as a bullshit status symbol, a token in some consumerist identity game. That shallow, status-obsessed, needlessly edgy type of person is definitely more visible in the Nix(OS) community now than it was a few years ago, but they still stick out like a sore thumb against the backdrop of longtime Nix users and the culture they've evolved together.
For that reason, I strongly recommend engaging with the Nix community in community-owned channels, like discourse.nixos.org or the community Matrix channels, rather than message boards like 4chan or mainstream social media platforms like Reddit. If you do that, you'll find kinder, more knowledgeable people (and perhaps in some cases, kinder more knowledgeable personas for the same people).
If you're reading this and you've unfortunately encountered Nix 'evangelists' with those shitty attitudes online, please understand that those influences are external to the community, and as far as most participants in the community are concerned, quite unwelcome.
https://devclass.com/2022/10/17/github-copilot-under-fire-as...
It's a 25-or-so-line function that looks like a pedestrian implementation of a sparse matrix transpose algorithm. The author should have patented it to protect it, not copyrighted it.
But here we are talking about autocompleting code. I don't think programmers want the autocompleter to be creative. They want the exact same solution everyone uses, hopefully the right one, with only minor changes so that it matches their style and uses their own variable names. In my case, I am the programmer, I decide what to do; I just want my autocompleter to save me some keystrokes and some copy-pasting of boilerplate from the web, and the more it looks like existing code the better. I have enough work fixing my own bugs, thank you.
Speaking of bugs, how come everyone talks about code generation, which, I think, doesn't bring that much value? Sure, it saves a few keystrokes and some copy-pasting from Stack Overflow, but I don't feel like that is what programmers spend most of their time doing. Dealing with bugs is. There are the big bugs that have tickets and can take days to analyze and fix, but also the ones that are just a normal part of writing code, like simple typos that result in compiler errors. I think that machine learning could be of great help here.
Just a system that tells me "hey, look here, this is not what I expected to see" would be of great help. Unexpected doesn't mean there is a bug, but it is something worth paying attention to. I know it has been done, but few people seem to talk about it. Or maybe a classifier trained on bug-fix commits: if a piece of code looks like code that has been changed in a bug-fix commit, there is a good chance it is also a bug. Have it integrated into the IDE, highlighting the suspicious part as I type, just as modern IDEs highlight compilation errors in real time.
Lots of little things.
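As a toy illustration of the "flag what looks unexpected" idea above (nothing more than a sketch; a real tool would use a model trained on commit history rather than the hand-rolled bigram counts and made-up corpus here):

    import re
    from collections import Counter

    def tokens(line):
        return re.findall(r"\w+|[^\w\s]", line)

    def train(corpus_lines):
        counts = Counter()
        for line in corpus_lines:
            toks = ["<s>"] + tokens(line)
            counts.update(zip(toks, toks[1:]))
        return counts

    def surprise(line, counts):
        toks = ["<s>"] + tokens(line)
        pairs = list(zip(toks, toks[1:]))
        if not pairs:
            return 0.0
        # Higher score = more token bigrams that were rarely or never seen in the "known good" corpus.
        return sum(1.0 / (1 + counts[p]) for p in pairs) / len(pairs)

    corpus = [
        "for (int i = 0; i < n; i++) {",
        "if (ptr == NULL) return -1;",
        "free(buffer);",
    ]
    counts = train(corpus)
    for candidate in ["free(buffer);", "strcpy(buffer, user_input);"]:
        print(f"{surprise(candidate, counts):.2f}  {candidate}")

Unfamiliar lines score higher; as the comment says, "unexpected" doesn't mean "bug", just "worth a look".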
This concept has specific technical meaning -
https://www.nolo.com/legal-encyclopedia/fair-use-what-transf...
It seems obvious to me that to call model weights "lossy compression" is not only incorrect from a technical (software dev) point of view, but also from this legal perspective.
The weights serve a different purpose than the original works from which they are derived, and wouldn't/couldn't POSSIBLY exist were it not for the original work of the authors of the models.
It's bad practice to go around espousing strong and condemnatory opinions about topics you don't have a full grasp of. In this case, it's both the technical details and the legal system.
It makes you look like a fool and costs you your credibility amongst peers in future encounters.
Your argument regarding "transformative work" is what the discussion is all about. Let's see where it lands with case law.
FWIW, my stance is that copyrighted content should not be used for training unless permission has been requested.
>It's bad practice to go around espousing strong and condemnatory opinions about topics you don't have a full grasp of. In this case, it's both the technical details and the legal system.
>It makes you look like a fool and costs you your credibility amongst peers in future encounters.
Agree, thank you for the feedback. My analogy was quite exaggerated.
I'm certainly no expert on copyright law, but my understanding is that its purpose is to protect the financial interests of certain creators from the progress of technology (e.g. copy paste). I've heard arguments that removing copyright would lead to less creativity or reduced quantity or quality of work, but I'm personally a bit skeptical (probably for the same reasons as you - I think people have a natural desire to create). Even in terms of financials, I would speculate that an employment/patronage model would become more widespread.
I think there's something to be said about the benefits of having freely available knowledge, music, and art for common consumption. When I was a child in high school (or well, always lol), my parents couldn't afford a lot of material I needed or wanted for studying (especially for standardized testing, SAT and AP tests) and most of the books in my local library either did not exist or were outdated. But when I discovered that much of this information could be found online, it really changed my world and made success in life feel attainable to me. I consider myself quite wealthy now, but I don't think I would have been able to escape poverty if all this information was paywalled from me. Maybe others would argue the writers are not being compensated for their efforts, but if there are other people in the world in the same position as past me who could positively benefit from it, I think that's a better world to live in, personally.
Incidentally, the release of StableDiffusion has actually inspired me to draw a little. Not sure why, but I find it inspiring being able to iterate on a prompt and produce something of quality that I can try to replicate on my own. Even if I fail, I still have something to appreciate that maps fairly well to the concept in my head.
My hope is that these technologies might lead to a change in our financial system (I think UBI would be a good idea), but I suppose we'll see where everything ends up. I think there's likely going to be a lot of pain in the short-term (especially since there are those who don't want to adapt), but hopefully everyone will positively benefit in the long-term.
I disagree, the internet wouldn't be half as full of knowledge as it is if it weren't for the loudly ignorant giving the experts somebody to correct.
When someone teaches me, they don't own all my future creative output.
When someone teaches an AI, they do.
That's the principal difference between human learning and machine learning.
Yeah yeah, your side are the good guys and the other side is a bit dodgy.
The other day I saw some gym bro in the IG comment section trying to flex on people with "do you even know what backpropagation is?"
Today's MS isn't really the same, and they've clearly made their peace with Linux. But it remains true that the GPL is in some fundamental ways at odds with commercial exploitation of open source code. So any corporate entity is going to struggle with it, because at best it requires being very careful about distribution, or trying to negotiate or cut a deal with the licensor. At worst it can lead to legal problems and IP leakage from your own product.
So, I'm not claiming any conspiracy, or any intent to violate licenses deliberately. But it is in the convenient interests of companies like MS/OpenAI/GitHub to treat open source work as effectively public domain rather than as copyrighted, and to push the limits there.
The risk to an employer is of course the accidental introduction of such copylefted material into their code-base through copilot or similar tools.
I suspect two sources of disconnect with the broader Hacker News community, which doesn't seem to see the issue here:
a) Many of the folks on this forum work in the full-stack/web space, where fundamentally novel, patented, or conceptually difficult algorithms and data structures are rare. For them Copilot is an absolute blessing in helping to reduce the tedium of boilerplate. However, in the embedded systems, operating systems, compiler, game engine dev, database internals etc. world there are other aspects at work. In certain contexts, Copilot has been shown to reproduce complicated or difficult code taken from copyrighted or copylefted (or maybe even patented) sources without attribution. And apparently now with some explicit obfuscation.
To put it another way: it's unlikely that Copilot is going to violate licenses while helping you turn your value/model objects from one structure into another, or write a call into a SQL ORM. But it's quite possible that if I'm writing a DB join algorithm, or some complicated math in a rendering engine, or a compiler optimization phase, it could "crib notes" from a source under a restrictive license... because those things are absolutely in its training set and the LLM doesn't "know" about the licensing behind them.
b) Either misunderstanding of, or lack of knowledge of, or outright hostility to... copylefted or attribution licenses which require special handling.
Using Copilot is an automated process, and the source of the material used in learning is deeply obfuscated in the learning model.
That's why I make the analogy back to cryptocurrency mixers.