1. ctoth (OP) 2021-10-27 21:07:26
I don't know if this is true, but I would assume that the tokenizers they used for Codex use actual language parsers, which would drop invalid files like this and make this attack infeasible.

When I was playing around with the Fastai language-modeling courses a couple of years ago, I used Python's tokenize module to feed my model, and with excellent parser libraries like Lark[0] out there it wouldn't take long to build really high-quality parsers.
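For what it's worth, here's a minimal sketch (function names are mine, not from any actual Codex pipeline) of what that kind of parser-based filtering looks like with just the stdlib tokenize and ast modules:

```python
import ast
import io
import tokenize

def is_valid_python(source: str) -> bool:
    """Return True only if the source actually parses; a training
    pipeline built on a real parser could drop files failing this."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def token_strings(source: str) -> list[str]:
    """Lex source with the stdlib tokenize module, keeping only
    names, operators, and numbers -- one crude way to feed a model
    a token stream instead of raw text."""
    return [
        tok.string
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type in (tokenize.NAME, tokenize.OP, tokenize.NUMBER)
    ]

print(is_valid_python("def f(:"))  # False -- broken file, would be dropped
print(token_strings("def add(a, b):\n    return a + b\n"))
```

Whether OpenAI does anything like this is anyone's guess, of course.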

Of course I could be totally wrong and they might just be dumping pure text in, shudder.

[0]: https://github.com/lark-parser/lark
