zlacker

> You validate that strings are UTF-8 at the place where you care that they are UTF-8.

The problem with this, as with any lack of static typing, is that you now have to validate at _every_ place that cares, or to carefully track whether a value has already been validated, instead of validating once and letting the compiler check that it happened.

replies(1): >>klodol+L3

>>Kinran+(OP)
In practice, the validation generally happens when you convert to JSON or use an HTML template or something like that, so it’s not so many places.

Validation is nice but Rust’s principled approach leaves me high and dry sometimes. Maybe Rust will finish figuring out the OsString interface and at that point we can say Rust has “won” the conversation, but it’s not there yet, and it’s been years.

replies(1): >>stouse+l6

>>klodol+L3
> validation generally happens when

Except when it doesn’t and then you have to deal with fucking Cthulhu because everyone thought they could just make incorrect assumptions that aren’t actually enforced anywhere because “oh that never happens”.

That isn’t engineering. It’s programming by coincidence.

> Maybe Rust will finish figuring out the OsString interface

The entire reason OsString is painful to use is because those problems exist and are real. Golang drops them on the floor and forces you pick up the mess on the random day when an unlucky end user loses data. Rust forces you to confront them, as unfortunate as they are. It's painful once, and then the problem is solved for the indefinite future.

Rust also provides OsStrExt if you don’t care about portability, which greatly removes many of these issues.

I don’t know how that’s not ideal: mistakes are hard, but you can opt into better ergonomics if you don’t need the portability. If you end up needing portability later, the compiler will tell you that you can’t use the shortcuts you opted into.

replies(1): >>maxdam+Mz1

>>stouse+l6
Can you give an example of how Go's approach causes people to lose data? This was alluded to in the blog post but they didn't explain anything.

It seems like there's some confusion in the GGGGGP post, since Go works correctly even if the filename is not valid UTF-8 .. maybe that's why they haven't noticed any issues.

replies(1): >>xyzzyz+sO1

>>maxdam+Mz1
Imagine that you're writing a function that'll walk the directory to copy some files somewhere else, and then delete the directory. Unfortunately, you hit this

https://github.com/golang/go/issues/32334

oops, looks like some files are just inaccessible to you, and you cannot copy them.

Fortunately, when you try to delete the source directory, Go's standard library enters infinite loop, which saves your data.

https://github.com/golang/go/issues/59971

replies(2): >>maxdam+FU1 >>klodol+n02

>>xyzzyz+sO1
Ah, mentioning Windows filenames would have been useful.

I guess the issue isn't so much about whether strings are well-formed, but about whether the conversion (eg, from UTF-16 to UTF-8 at the filesystem boundary) raises an error or silently modifies the data to use replacement characters.

I do think that is the main fundamental mistake in Go's Unicode handling; it tends to use replacement characters automatically instead of signalling errors. Using replacement characters is at least conformant to Unicode but imo unless you know the text is not going to be used as an identifier (like a filename), conversion should instead just fail.

The other option is using some mechanism to preserve the errors instead of failing quietly (replacement) or failing loudly (raise/throw/panic/return err), and I believe that's what they're now doing for filenames on Windows, using WTF-8. I agree with this new approach, though would still have preferred they not use replacement characters automatically in various places (another one is the "json" module, which quietly corrupts your non-UTF-8 and non-UTF-16 data using replacement characters).

Probably worth noting that the WTF-8 approach works because strings are not validated; WTF-8 involves converting invalid UTF-16 data into invalid UTF-8 data such that the conversion is reversible. It would not be possible to encode invalid UTF-16 data into valid UTF-8 data without changing the meaning of valid Unicode strings.

>>xyzzyz+sO1
IMO the right thing to do here is even messier than Go’s approach, which is give people utf-16-ish strings on Windows.

replies(1): >>maxdam+F82

>>klodol+n02
They have effectively done this (since the linked issue was raised), by just converting Windows filenames to WTF-8.

I think this is sensible, because the fact that Windows still uses UTF-16 (or more precisely "Unicode 16-bit strings") in some places shouldn't need to complicate the API on other platforms that didn't make the UCS-2/UTF-16 mistake.

It's possible that the WTF-8 strings might not concatenate the way they do in UTF-16 or properly enforced WTF-8 (which has special behaviour on concatenation), but they'll still round-trip to the intended 16-bit string, even after concatenation.