zlacker

Go was designed by some old-school folks that maybe stuck a bit too hard to their principles, losing sight of the practical conveniences.

I'd say that it's entirely the other way around: they stuck to the practical convenience of solving the problem that they had in front of them, quickly, instead of analyzing the problem from the first principles, and solving the problem correctly (or using a solution that was Not Invented Here).

Go's filesystem API is the perfect example. You need to open files? Great, we'll create

  func Open(name string) (*File, error)

function, you can open files now, done. What if the file name is not valid UTF-8, though? Who cares, hasn't happened to me in the first 5 years I used Go.

replies(9): >>nasret+O >>koakum+11 >>ants_e+S2 >>jerf+Yc >>herbst+0g >>silver+Yq >>stouse+gF >>kragen+2k1 >>perryi+7w2

>>xyzzyz+(OP)
Note that Go strings can be invalid UTF-8; they dropped panicking on encountering an invalid UTF-8 string before 1.0, I think.

replies(1): >>xyzzyz+d1

>>xyzzyz+(OP)
> What if the file name is not valid UTF-8

Nothing? Neither Go nor the OS require file names to be UTF-8, I believe

replies(2): >>johnco+r4 >>zimpen+0w

>>nasret+O
This also epitomizes the issue. What's the point of having `string` type at all, if it doesn't allow you to make any extra assumptions about the contents beyond `[]byte`? The answer is that they planned to make conversion to `string` error out when it's invalid UTF-8, and then assume that `string`s are valid UTF-8, but then it caused problems elsewhere, so they dropped it for immediate practical convenience.

replies(6): >>assbut+85 >>ronces+e5 >>0x000x+D6 >>naikro+X6 >>tialar+Ig >>Ferret+LHa

>>xyzzyz+(OP)
> they stuck to the practical convenience of solving the problem that they had in front of them, quickly, instead of analyzing the problem from the first principles, and solving the problem correctly (or using a solution that was Not Invented Here).

I've said this before, but much of Go's design looks like it's imitating the C++ style at Google. When I see comments where people say they like something about Go, it's often an idiom that showed up first in the C++ macros or tooling.

I used to check this before I left Google, and I'm sure it's becoming less true over time. But to me it looks like the idea of Go was basically "what if we created a Python-like compiled language that was easier to onboard than C++ but which still had our C++ ergonomics?"

replies(1): >>shrubb+fq

>>koakum+11
Well, Windows is an odd beast when 8-bit file names are used. If done naively, you can’t express all valid filenames with even broken UTF-8 and non-valid-Unicode filenames cannot be encoded to UTF-8 without loss or some weird convention.

You can do something like WTF-8 (not a misspelling, alas) to make it bidirectional. Rust does this under the hood but doesn’t expose the internal representation.
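
To make the loss concrete, here is a minimal Go sketch. The helper name `naiveUTF16ToString` and the two sample names are made up for illustration; the only library call is `utf16.Decode`, which replaces every unpaired surrogate with U+FFFD. Two distinct ill-formed UTF-16 names end up as the same UTF-8 string, which is the collision WTF-8 is meant to avoid:

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// naiveUTF16ToString converts UTF-16 code units to a Go string the
// straightforward way. utf16.Decode replaces every unpaired surrogate
// with U+FFFD, so the conversion is lossy for ill-formed input.
func naiveUTF16ToString(units []uint16) string {
	return string(utf16.Decode(units))
}

func main() {
	// Two different (hypothetical) file names, each with a lone surrogate.
	a := []uint16{'a', 0xD800, 'b'}
	b := []uint16{'a', 0xDC00, 'b'}

	sa, sb := naiveUTF16ToString(a), naiveUTF16ToString(b)
	fmt.Printf("%q %q same=%v\n", sa, sb, sa == sb)
}
```

This only illustrates the naive path; it says nothing about how Go or Rust actually convert Windows paths internally.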

replies(2): >>andyfe+Ns >>jstimp+Ut

>>xyzzyz+d1
string is just an immutable []byte. It's actually one of my favorite things about Go that strings can contain invalid utf-8, so you don't end up with the Rust mess of String vs OSString vs PathBuf vs Vec<u8>. It's all just string

replies(1): >>zozbot+Wb

>>xyzzyz+d1
I've always thought the point of the string type was for indexing. One index of a string is always one character, but characters are sometimes composed of multiple bytes.

replies(2): >>birn55+3a >>crazyg+Hc

>>xyzzyz+d1
Why not use utf8.ValidString in the places it is needed? Why burden one of the most basic data types with highly specific format checks?

It's far better to get some � when working with messy data instead of applications refusing to work and erroring out left and right.
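
A small sketch of what that looks like in practice, assuming you only want to validate at the point where it matters. The helper name `sanitize` is hypothetical; `utf8.ValidString` and `strings.ToValidUTF8` are the standard library calls:

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitize returns s unchanged when it is valid UTF-8; otherwise each run
// of invalid bytes is replaced with a single U+FFFD.
func sanitize(s string) string {
	if utf8.ValidString(s) {
		return s
	}
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	// Converting []byte to string never validates.
	messy := string([]byte{'o', 'k', 0xff, 0xfe, '!'})
	fmt.Println(utf8.ValidString(messy)) // validity is checked only on request
	fmt.Printf("%q\n", sanitize(messy))
}
```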

replies(1): >>const_+pL

>>xyzzyz+d1
I think maybe you've forgotten about the rune type. Rune does make assumptions.

[]rune is for sequences of Unicode code points. rune is an alias for int32. string, I think, is an alias for []byte.

replies(1): >>TheDon+rJ

>>ronces+e5
You can't do that in a performant way and going that route can lead to problems, because characters (= graphemes in the language of Unicode) don't always behave as developers assume.

>>assbut+85
Rust &str and String are specifically intended for UTF-8 valid text. If you're working with arbitrary byte sequences, that's what &[u8] and Vec<u8> are for in Rust. It's not a "mess", it's just different from what Golang does.

replies(2): >>gf000+te >>maxdam+yv

>>ronces+e5
Yup. But to be clear, in Unicode a string will index code points, not characters. E.g. a single emoji can be made of multiple code points, as well as certain characters in certain languages. The Unicode name for a character like this is a "grapheme", and grapheme splitting is so complicated it generally belongs in a dedicated Unicode library, not a general-purpose string object.

>>xyzzyz+(OP)
While the general question about string encoding is fine, unfortunately in a general-purpose and cross-platform language, a file interface that enforces Unicode correctness is actively broken, in that there are files out in the world it will be unable to interact with. If your language is enforcing that, and it doesn't have a fallback to a bag of bytes, it is broken, you just haven't encountered it. Go is correct on this specific API. I'm not celebrating that fact here, nor do I expect the Go designers are either, but it's still correct.

replies(1): >>klodol+hr

>>zozbot+Wb
If anything that will make Rust programs likely to be correct under any strange text input, while Go might just handle the happy path of ASCII inputs.

Stuff like this matters a great deal on the standard library level.

>>xyzzyz+(OP)
Much more egregious is the fact that the API allows returning both an error and a valid file handle. That may be documented to not happen. But look at the Read method instead. It will return both errors and a length you need to handle at the same time.

replies(1): >>nasret+Nu

>>xyzzyz+d1
Rust apparently got relatively close to not having &str as a primitive type and instead only providing a library alias to &[u8] when Rust 1.0 shipped.

Score another for Rust's Safety Culture. It would be convenient to just have &str as an alias for &[u8], but if that mistake had been allowed, all the safety checking that Rust now does centrally would have to be owned by every single user forever. Instead of a few dozen checks overseen by experts there'd be myriad sprinkled across every project and always ready to bite you.

replies(3): >>adastr+SO >>inferi+501 >>stevek+p01

>>ants_e+S2
Didn’t Go come out of a language that was written for Plan9, thus pre-dating Rob Pike’s work at Google?

replies(3): >>ants_e+yI >>kragen+qq1 >>pjmlp+H03

>>xyzzyz+(OP)
> What if the file name is not valid UTF-8, though

They could support passing the filename as `string | []byte`. But wait, Go does not even have union types.

replies(1): >>lblume+Zs

>>jerf+Yc
This is one of those things that kind of bugs me about, say, OsStr / OsString in Rust. In theory, it’s a very nice, principled approach to strings (must be UTF-8) and filenames (arbitrary bytes, almost, on Linux & Mac). In practice, the ergonomics around OsStr are horrible. They are missing most of the API that normal strings have… it seems like manipulating them is an afterthought, and it was assumed that people would treat them as opaque (which is wrong).

Go’s more chaotic approach to allow strings to have non-Unicode contents is IMO more ergonomic. You validate that strings are UTF-8 at the place where you care that they are UTF-8. (So I’m agreeing.)

replies(3): >>Kinran+AF >>ducker+cG >>pas+472

>>johnco+r4
I believe the same is true on linux, which only cares about 0x2f bytes (i.e. /)

replies(3): >>matt_k+3C >>orthox+1D >>johnco+P62

>>silver+Yq
But []byte, or a wrapper like Path, is enough, if strings are easily convertible into it. Rust does it that way via the AsRef<T> trait.

>>johnco+r4
What do you mean by "when 8-bit filenames are used"? Do you mean the -A APIs, like CreateFileA()? Those do not take UTF-8, mind you -- unless you are using a relatively recent version of Windows that allows you to run your process with a UTF-8 codepage.

In general, Windows filenames are Unicode and you can always express those filenames by using the -W APIs (like CreateFileW()).

replies(2): >>af78+pw >>johnco+682

>>herbst+0g
The Read() method is certainly an exception rather than a rule. The common convention is to return nil value upon encountering an error unless there's real value in returning both, e.g. for a partial read that failed in the end but produced some non-empty result nevertheless. It's a rare occasion, yes, but if you absolutely have to handle this case you can. Otherwise you typically ignore the result if err!=nil. It's a mess, true, but real world is also quite messy unfortunately, and Go acknowledges that
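
For the partial-read case, a minimal sketch of a read loop that consumes the `n` bytes before looking at `err`. The helpers `readAll` and `readFrom` are made up for this example; `iotest.DataErrReader` from the standard library is used only to force the final chunk to arrive together with `io.EOF`:

```go
package main

import (
	"fmt"
	"io"
	"strings"
	"testing/iotest"
)

// readAll drains r, consuming the n bytes from every Read call before
// looking at err, which is what the io.Reader documentation asks for.
func readAll(r io.Reader) (string, error) {
	var sb strings.Builder
	buf := make([]byte, 4)
	for {
		n, err := r.Read(buf)
		sb.Write(buf[:n]) // use the data first, even when err != nil
		if err == io.EOF {
			return sb.String(), nil
		}
		if err != nil {
			return sb.String(), err
		}
	}
}

// readFrom wraps the input in iotest.DataErrReader so that the last
// chunk of data is returned in the same call as io.EOF.
func readFrom(s string) string {
	out, _ := readAll(iotest.DataErrReader(strings.NewReader(s)))
	return out
}

func main() {
	fmt.Println(readFrom("hello, world"))
}
```

If the loop checked `err` before using `buf[:n]`, the last four bytes would be dropped with this reader.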

replies(1): >>stouse+uT

>>zozbot+Wb
It's never been clear to me where such a type is actually useful. In what cases do you really need to restrict it to valid UTF-8?

You should always be able to iterate the code points of a string, whether or not it's valid Unicode. The iterator can either silently replace any errors with replacement characters, or denote the errors by returning eg, `Result<char, Utf8Error>`, depending on the use case.

All languages that have tried restricting Unicode afaik have ended up adding workarounds for the fact that real world "text" sometimes has encoding errors and it's often better to just preserve the errors instead of corrupting the data through replacement characters, or just refusing to accept some inputs and crashing the program.

In Rust there's bstr/ByteStr (currently being added to std), awkward having to decide which string type to use.

In Python there's PEP-383/"surrogateescape", which works because Python strings are not guaranteed valid (they're potentially ill-formed UTF-32 sequences, with a range restriction). Awkward figuring out when to actually use it.

In Raku there's UTF8-C8, which is probably the weirdest workaround of all (left as an exercise for the reader to try to understand .. oh, and it also interferes with valid Unicode that's not normalized, because that's another stupid restriction).

Meanwhile the Unicode standard itself specifies Unicode strings as being sequences of code units [0][1], so Go is one of the few modern languages that actually implements Unicode (8-bit) strings. Note that at least two out of the three inventors of Go also basically invented UTF-8.

[0] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...

> Unicode string: A code unit sequence containing code units of a particular Unicode encoding form.

[1] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...

> Unicode strings need not contain well-formed code unit sequences under all conditions. This is equivalent to saying that a particular Unicode string need not be in a Unicode encoding form.

replies(3): >>xyzzyz+cH >>empath+kI >>amluto+0qt

>>koakum+11
> Nothing?

It breaks. Which is weird because you can create a string which isn't valid UTF-8 (eg "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98") and print it out with no trouble; you just can't pass it to e.g. `os.Create` or `os.Open`.

(Bash and a variety of other utils will also complain about it not being valid UTF-8; neovim won't save a file under that name; etc.)

replies(2): >>yencab+9G >>kragen+Gl1

>>jstimp+Ut
I think it depends on the underlying filesystem. Unicode (UTF-16) is first-class on NTFS. But Windows still supports FAT, I guess, where multiple 8-bit encodings are possible: the so-called "OEM" code pages (437, 850 etc.) or "ANSI" code pages (1250, 1251 etc.). I haven't checked how recent Windows versions cope with FAT file names that cannot be represented as Unicode.

>>andyfe+Ns
And 0x00.

>>andyfe+Ns
And 0x00, if I remember correctly.

>>xyzzyz+(OP)
[flagged]

replies(3): >>blibbl+jP >>jen20+kQ >>0x696C+731

>>klodol+hr
> You validate that strings are UTF-8 at the place where you care that they are UTF-8.

The problem with this, as with any lack of static typing, is that you now have to validate at _every_ place that cares, or to carefully track whether a value has already been validated, instead of validating once and letting the compiler check that it happened.
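
Go can approximate "validate once, let the compiler track it" with a wrapper type. This is only a sketch: `Text`, `NewText`, `ErrInvalidUTF8`, and `RuneLen` are hypothetical names, and the guarantee holds only if other packages cannot reach the unexported field:

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// Text is a hypothetical wrapper whose only constructor validates, so any
// function that accepts a Text can rely on it holding well-formed UTF-8.
type Text struct{ s string }

// ErrInvalidUTF8 is returned by NewText for ill-formed input.
var ErrInvalidUTF8 = errors.New("invalid UTF-8")

// NewText is the single place where validation happens.
func NewText(s string) (Text, error) {
	if !utf8.ValidString(s) {
		return Text{}, ErrInvalidUTF8
	}
	return Text{s: s}, nil
}

// String returns the validated contents.
func (t Text) String() string { return t.s }

// RuneLen needs no validity check of its own: the type carries the proof.
func RuneLen(t Text) int { return utf8.RuneCountInString(t.s) }

func main() {
	if txt, err := NewText("h\u00e9llo"); err == nil {
		fmt.Println(RuneLen(txt))
	}
	_, err := NewText("\xff")
	fmt.Println(err)
}
```

The zero value `Text{}` is the empty string, which is valid UTF-8, so the invariant survives Go's zero-value rule in this particular case.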

replies(1): >>klodol+lJ

>>zimpen+0w
That sounds like your kernel refusing to create that file, nothing to do with Go.

  $ cat main.go
  package main

  import (
   "log"
   "os"
  )

  func main() {
   f, err := os.Create("\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98")
   if err != nil {
    log.Fatalf("create: %v", err)
   }
   _ = f
  }
  $ go run .
  $ ls -1
  ''$'\275\262''='$'\274'' ⌘'
  go.mod
  main.go

replies(3): >>comman+3r1 >>zimpen+QB1 >>kragen+Sn3

>>klodol+hr
The big problem isn't invalid UTF-8 but invalid UTF-16 (on Windows et al). AIUI Go had nasty bugs around this (https://github.com/golang/go/issues/59971) until it recently adopted WTF-8, an encoding that was actually invented for Rust's OsStr.

WTF-8 has some inconvenient properties. Concatenating two strings requires special handling. Rust's opaque types can patch over this but I bet Go's WTF-8 handling exposes some unintuitive behavior.

There is a desire to add a normal string API to OsStr but the details aren't settled. For example: should it be possible to split an OsStr on an OsStr needle? This can be implemented but it'd require switching to OMG-WTF-8 (https://rust-lang.github.io/rfcs/2295-os-str-pattern.html), an encoding with even more special cases. (I've thrown my own hat into this ring with OsStr::slice_encoded_bytes().)

The current state is pretty sad yeah. If you're OK with losing portability you can use the OsStrExt extension traits.

replies(1): >>klodol+MI

>>maxdam+yv
The way Rust handles this is perfectly fine. The String type promises its contents are valid UTF-8. When you create one from an array of bytes, you have three options: 1) String::from_utf8, which forces you to handle the invalid UTF-8 error, 2) String::from_utf8_lossy, which replaces invalid byte sequences with the U+FFFD replacement character, and 3) String::from_utf8_unchecked, which skips the validity check and is explicitly marked as unsafe.

replies(1): >>maxdam+aO

>>maxdam+yv
> It's never been clear to me where such a type is actually useful. In what cases do you really need to restrict it to valid UTF-8?

Because 99.999% of the time you want it to be valid and would like an error if it isn't? If you want to work with invalid UTF-8, that should be a deliberate choice.

replies(1): >>maxdam+PO

>>shrubb+fq
not that I recall but I may not be recalling correctly.

But certainly, anyone will bring their previous experience to the project, so there must be some Plan9 influence in there somewhere

replies(1): >>kragen+wq1

>>ducker+cG
Yeah, I avoided talking about Windows which isn’t UTF-16 but “int16 string” the same way Unix filenames are int8 strings.

IMO the differences with Windows are such that I’m much more unhappy with WTF-8. There’s a lot that sucks about C++ but at least I can do something like

  #if _WIN32
  using pathchar = wchar_t;
  constexpr pathchar sep = L'\\';
  #else
  using pathchar = char;
  constexpr pathchar sep = '/';
  #endif
  using pathstring = std::basic_string<pathchar>;

Mind you this sucks for a lot of reasons, one big reason being that you’re directly exposed to the differences between path representations on different operating systems. Despite all the ways that this (above) sucks, I still generally prefer it over the approaches of Go or Rust.

>>Kinran+AF
In practice, the validation generally happens when you convert to JSON or use an HTML template or something like that, so it’s not so many places.

Validation is nice but Rust’s principled approach leaves me high and dry sometimes. Maybe Rust will finish figuring out the OsString interface and at that point we can say Rust has “won” the conversation, but it’s not there yet, and it’s been years.

replies(1): >>stouse+VL

>>naikro+X6
`string` is not an alias for []byte.

Consider:

    for i, chr := range string([]byte{226, 150, 136, 226, 150, 136}) {
      fmt.Printf("%d = %v\n", i, chr)
      // note, s[i] != chr
    }

How many times does that loop over 6 bytes iterate? The answer is it iterates twice, with i=0 and i=3.

There's also quite a few standard APIs that behave weirdly if a string is not valid utf-8, which wouldn't be the case if it was just a bag of bytes.
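
One example of such an API, sketched under the assumption that the standard `encoding/json` package is used: `json.Marshal` coerces strings to valid UTF-8, so a round trip silently changes the data. The helper `roundTrip` is made up for the demonstration:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// roundTrip encodes s as JSON and decodes it again.
func roundTrip(s string) (string, error) {
	data, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	var out string
	if err := json.Unmarshal(data, &out); err != nil {
		return "", err
	}
	return out, nil
}

func main() {
	in := "a\xffb" // not valid UTF-8
	out, err := roundTrip(in)
	fmt.Printf("%q -> %q (err=%v, unchanged=%v)\n", in, out, err, in == out)
}
```

No error is reported; the invalid byte simply comes back as U+FFFD.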

replies(1): >>naikro+LJ7

>>0x000x+D6
IMO utf8 isn't a highly specific format, it's universal for text. Every ascii string you'd write in C or C++ or whatever is already utf8.

So that means that for 99% of scenarios, the difference between char[] and a proper utf8 string is none. They have the same data representation and memory layout.

The problem comes in when people start using string like they use string in PHP. They just use it to store random bytes or other binary data.

This makes no sense with the string type. String is text, but now we don't have text. That's a problem.

We should use byte[] or something for this instead of string. That's an abuse of string. I don't think requiring strings to be text is too constraining - that's what a string is!

replies(2): >>adastr+eP >>kragen+Jn1

>>klodol+lJ
> validation generally happens when

Except when it doesn’t and then you have to deal with fucking Cthulhu because everyone thought they could just make incorrect assumptions that aren’t actually enforced anywhere because “oh that never happens”.

That isn’t engineering. It’s programming by coincidence.

> Maybe Rust will finish figuring out the OsString interface

The entire reason OsString is painful to use is because those problems exist and are real. Golang drops them on the floor and forces you to pick up the mess on the random day when an unlucky end user loses data. Rust forces you to confront them, as unfortunate as they are. It's painful once, and then the problem is solved for the indefinite future.

Rust also provides OsStrExt if you don’t care about portability, which greatly removes many of these issues.

I don’t know how that’s not ideal: mistakes are hard, but you can opt into better ergonomics if you don’t need the portability. If you end up needing portability later, the compiler will tell you that you can’t use the shortcuts you opted into.

replies(1): >>maxdam+mf2

>>xyzzyz+cH
But there's no option to just construct the string with the invalid bytes. 3) is not for this purpose; it is for when you already know that it is valid.

If you use 3) to create a &str/String from invalid bytes, you can't safely use that string as the standard library is unfortunately designed around the assumption that only valid UTF-8 is stored.

https://doc.rust-lang.org/std/primitive.str.html#invariant

> Constructing a non-UTF-8 string slice is not immediate undefined behavior, but any function called on a string slice may assume that it is valid UTF-8, which means that a non-UTF-8 string slice can lead to undefined behavior down the road.

replies(3): >>adastr+GP >>gf000+DY >>xyzzyz+VP1

>>empath+kI
Do you want grep to crash when your text file turned out to have a partially written character in it? 99.999% seems very high, and you haven't given an actual use case for the restriction.

replies(2): >>gf000+1Z >>empath+rZ

>>tialar+Ig
. (early morning brain fart -- I need my coffee)

replies(1): >>tialar+zQ

>>const_+pL
Not all text is UTF-8, and there are real world contexts (e.g. Windows) where this matters a lot.

replies(1): >>const_+8V

>>stouse+gF
> Golang makes it easy to do the dumb, wrong, incorrect thing that looks like it works 99.7% of the time. How can that be wrong? It works in almost all cases!

my favorite example of this was the go authors refusing to add monotonic time into the standard library because they confidently misunderstood its necessity

(presumably because clocks at google don't ever step)

then after some huge outages (due to leap seconds) they finally added it

now the libraries are a complete mess because the original clock/time abstractions weren't built with the concept of multiple clocks

and every go program written is littered with terrible bugs due to use of the wrong clock

https://github.com/golang/go/issues/12914 (https://github.com/golang/go/issues/12914#issuecomment-15075... might qualify for the worst comment ever)
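
For context, a short sketch of how this looks since Go 1.9, when `time.Now` started carrying a monotonic reading alongside the wall clock. The helpers `hasMonotonic` and `demo` are hypothetical; they rely on the documented `m=±<value>` suffix that `Time.String` prints when a monotonic reading is present:

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// hasMonotonic reports whether t carries a monotonic clock reading, using
// the documented "m=" suffix that Time.String adds in that case.
func hasMonotonic(t time.Time) bool {
	return strings.Contains(t.String(), " m=")
}

// demo returns whether time.Now() has a monotonic reading and whether it
// is still there after Round(0), the documented way to strip it.
func demo() (withReading, stripped bool) {
	now := time.Now()
	return hasMonotonic(now), hasMonotonic(now.Round(0))
}

func main() {
	start := time.Now() // wall clock plus monotonic reading
	time.Sleep(10 * time.Millisecond)

	// Since and Sub use the monotonic readings when both values have one,
	// so the elapsed time is not affected by wall-clock steps.
	fmt.Println(time.Since(start) >= 10*time.Millisecond)

	fmt.Println(demo())
}
```

Values that come from parsing, `time.Unix`, or `Round(0)` have no monotonic reading, so subtracting those still uses the wall clock.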

replies(1): >>0cf861+LS1

>>maxdam+aO
I don’t understand this complaint. (3) sounds like exactly what you are asking for. And yes, doing unsafe thing is unsafe.

replies(1): >>maxdam+Hd2

>>stouse+gF
I can count on fewer hands the number of times I've been bitten by such things in over 10 years of professional Go vs bitten just in the last three weeks by half-assed Java.

replies(2): >>gf000+kV >>stouse+iW

>>adastr+SO
So it's true that technically the primitive type is str, and indeed it's even possible to make a &mut str though it's quite rare that you'd want to mutably borrow the string slice.

However, no, &str is not "an alias for &&String" and I can't quite imagine how you'd think that. String doesn't exist in Rust's core; it's from alloc and thus wouldn't be available if you don't have an allocator.

replies(1): >>zozbot+AS

>>tialar+zQ
str is not really a "primitive type", it only exists abstractly as an argument to type constructors - treating the & operator as a "type constructor" for that purpose, but including Box<>, Rc<>, Arc<> etc. So you can have Box<str> or Arc<str> in addition to &str or perhaps &mut str, but not really 'str' in isolation.

>>nasret+Nu
Go doesn't acknowledge that. It punts.

Most of the time if there's a result, there's no error. If there's an error, there's no result. But don't forget to check every time! And make sure you don't make a mistake when you're checking and accidentally use the value anyway, because even though it's technically meaningless it's still nominally a meaningful value since zero values are supposed to be meaningful.

Oh and make sure to double-check the docs, because the language can't let you know about the cases where both returns are meaningful.

The real world is messy. And golang doesn't give you advance warning on where the messes are, makes no effort to prevent you from stumbling into them, and stands next to you constantly criticizing you while you clean them up by yourself. "You aren't using that variable any more, clean that up too." "There's no new variables now, so use `err =` instead of `err :=`."

>>adastr+eP
Yes, Windows text is broken in its own special way.

We can try to shove it into objects that work on other text but this won't work in edge cases.

Like if I take text on Linux and try to write a Windows file with that text, it's broken. And vice versa.

Go lets you do the broken thing. In Rust, or even using libraries in most languages, you can't. You have to specifically convert between them.

That's what I mean when I say "storing random binary data as text". Sure, Windows' almost-UTF-16 abomination is kind of text, but not really. It's its own thing. That requires a different type of string OR converting it to a normal string.

replies(1): >>adastr+uW

>>jen20+kQ
There is a lot to say about Java, but the libraries (both standard lib and popular third-party ones) are goddamn battle-hardened, so I have a hard time believing your claim.

replies(3): >>jen20+TV >>p2deta+dr1 >>tom_m+vs2

>>gf000+kV
You can believe what you like, of course, but "battle tested" does not mean "isn't easy to abuse".

>>jen20+kQ
Is golang better than Java? Sure, fine, maybe. I'm not a Java expert so I don't have a dog in the race.

Should and could golang have been so much better than it is? Would golang have been better if Pike and co. had considered use-cases outside of Google, or looked outward for inspiration even just a little? Unambiguously yes, and none of the changes would have needed it to sacrifice its priorities of language simplicity, compilation speed, etc.

It is absolutely okay to feel that go is a better language than some of its predecessors while at the same time being utterly frustrated at the very low-hanging, comparatively obvious, missed opportunities for it to have been drastically better.

>>const_+8V
Even on Linux, you can't have '/' in a filename, or ':' on macOS. And this is without getting into issues related to the null byte in strings. Having a separate Path object that represents a filename or path + filename makes sense, because on every platform there are idiosyncratic requirements surrounding paths.

It may be legacy cruft downstream of poorly thought out design decisions at the system/OS level, but we're stuck with it. And a language that doesn't provide the tooling necessary to muddle through this mess safely isn't a serious platform to build on, IMHO.

There is room for languages that explicitly make the tradeoff of being easy to use (e.g. a unified string type) at the cost of not handling many real world edge cases correctly. But these should not be used for serious things like backup systems where edge cases result in lost data. Go is making the tradeoff for language simplicity, while being marketed and positioned as a serious language for writing serious programs, which it is not.

replies(1): >>const_+W81

>>maxdam+aO
How could any library function work with completely random bytes? Like, how would it iterate over code points? It may want to assume utf8's standard rules and e.g. know that after this byte prefix, the next byte is also part of the same code point (excuse me if I'm using wrong terminology), but now you need complex error handling at every single line, which would be unnecessary if you just made your type represent only valid instances.

Again, this is the same simplistic vs. just-the-right abstraction trade-off: it just smears the complexity over a much larger surface area.

If you have a byte array that is not utf-8 encoded, then just... use a byte array.

replies(1): >>kragen+gq1

>>maxdam+PO
Crash? No. But I can safely handle the error where it happens, because the language actually helps me with this situation by returning a proper Result type. So I have to explicitly check which "variant" I have, instead of forgetting to call the validate function in case of go.

>>maxdam+PO
Rust doesn't crash when it gets an error unless you tell it to. You make a choice how to handle the error because you have to or it won't compile. If you don't care about losing information when reading a file, you can use the lossy function that gracefully handles invalid bytes.

>>tialar+Ig
Even so you end up with paper cuts like len which returns the number of bytes.

replies(1): >>toast0+161

>>tialar+Ig
It wouldn't have been an alias, it would have been struct Str([u8]). Nothing would have been different about the safety story.

https://github.com/rust-lang/rfcs/issues/2692

replies(1): >>stouse+vf1

>>stouse+gF
[flagged]

replies(1): >>jack_h+FG1

>>inferi+501
The problem with string length is there's probably at least four concepts that could conceivably be called length, and few people are happy when none of them are len.

Off the top of my head, in order of likely difficulty to calculate: byte length, number of code points, number of graphemes/characters, height/width to display.

Maybe it would be best for Str not to have len at all. It could have bytes, code_points, graphemes. And every use would be precise.
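
As a rough illustration in Go, here are the first two of those lengths for the same string. The wrapper names `byteLen` and `codePointLen` are made up; Go's standard library has no grapheme segmentation, so that count is left out:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteLen is what Go's built-in len reports for a string.
func byteLen(s string) int { return len(s) }

// codePointLen counts code points (runes); each invalid byte counts as one.
func codePointLen(s string) int { return utf8.RuneCountInString(s) }

func main() {
	// 'e' followed by U+0301 COMBINING ACUTE ACCENT:
	// one grapheme for the reader, two code points, three bytes.
	s := "e\u0301"
	fmt.Println(byteLen(s), codePointLen(s))
}
```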

replies(3): >>inferi+d81 >>stouse+bh1 >>branko+az1

>>toast0+161
Problems arise when you try to take a slice of a string and end up picking an index (perhaps based on length) that would split a code point. String/str offers an abstraction over Unicode scalars (code points) via the chars iterator, but it all feels a bit messy to have the byte based abstraction more or less be the default.

FWIW the docs indicate that working with grapheme clusters will never end up in the standard library.
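
The same hazard exists in Go, where slicing is purely byte-based and never fails. A minimal sketch of a boundary-aware cut, assuming the input is valid UTF-8 (the name `truncate` is hypothetical; `utf8.RuneStart` is the standard library call):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncate cuts s to at most max bytes without splitting a code point:
// it backs up from max until it lands on the first byte of a rune.
// It assumes s is valid UTF-8.
func truncate(s string, max int) string {
	if max >= len(s) {
		return s
	}
	if max < 0 {
		max = 0
	}
	for max > 0 && !utf8.RuneStart(s[max]) {
		max--
	}
	return s[:max]
}

func main() {
	s := "na\u00efve" // U+00EF occupies bytes 2 and 3

	// A raw byte slice cuts the two-byte sequence in half.
	fmt.Println(utf8.ValidString(s[:3]))

	fmt.Println(truncate(s, 3))
}
```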

replies(2): >>toast0+5d1 >>xyzzyz+3O1

>>adastr+uW
> Even on Linux, you can't have '/' in a filename, or ':' on macOS

Yes this is why all competent libraries don't actually use string for path. They have their own path data type because it's actually a different data type.

Again, you can do the Go thing and just use the broken string, but that's dumb and you shouldn't. They should look at C++ std::filesystem, it's actually quite good in this regard.

> And a language that doesn't provide the tooling necessary to muddle through this mess safely isn't a serious platform to build on, IMHO.

I agree, even PHP does a better job at this than Go, which is really saying something.

> Go is making the tradeoff for language simplicity, while being marketed and positioned as a serious language for writing serious programs, which it is not.

I would agree.

replies(1): >>astran+rh1

>>inferi+d81
> but it all feels a bit messy to have the byte based abstraction more or less be the default.

I mean, really neither should be the default. You should have to pick chars or bytes on use, but I don't think that's palatable; most languages have chosen one or the other as the preferred form. Or some have the joy of being forward thinking in the 90s and built around UCS-2 and later extended to UTF-16, so you've got 16-bit 'characters' with some code points that are two characters. Of course, dealing with operating systems means dealing with whatever they have as well as what the language prefers (or, as discussed elsewhere in this thread, pretending it doesn't exist to make easy things easier and hard things harder)

>>stevek+p01
I love this kind of historical knowledge. Thanks for sharing it!

>>toast0+161
> The problem with string length is there's probably at least four concepts that could conceivably be called length.

The answer here isn't to throw up your hands, pick one, and other cases be damned. It's to expose them all and let the engineer choose. To not beat the dead horse of Rust, I'll point out that Ruby gets this right too.

    * String#length                   # count Unicode code points
    * String#bytes#length             # count bytes
    * String#grapheme_clusters#length # count grapheme clusters

Similarly, each of those "views" lets you slice, index, etc. across those concepts naturally. Golang's string is the worst of them all. They're nominally UTF-8, but nothing actually enforces it. But really they're just buckets of bytes, unless you send them to APIs that silently require them to be UTF-8 and drop them on the floor or misbehave if they're not.

Height/width to display is font-dependent, so can't just be on a "string" but needs an object with additional context.

>>const_+W81
> Yes this is why all competent libraries don't actually use string for path. They have their own path data type because it's actually a different data type.

What is different about it? I don't see any constraints here relevant to having a different type. Note that this thread has already confused the issue, because they said filename and you said path. A path can contain /, it just happens to mean something.

If you want a better abstraction to locations of files on disk, then you shouldn't use paths at all, since they break if the file gets moved.

replies(1): >>const_+SC1

>>xyzzyz+(OP)
If the filename is not valid UTF-8, Golang can still open the file without a problem, as long as your filesystem doesn't attempt to be clever. Linux ext4fs and Go both consider filenames to be binary strings except that they cannot contain NULs.

This is one of the minor errors in the post.

>>zimpen+0w
It sounds like you found a bug in your filesystem, not in Golang's API, because you totally can pass that string to those functions and open the file successfully.

>>const_+pL
The approach you are advocating is the approach that was abandoned, for good reasons, in the Unix filesystem in the 70s and in Perl in the 80s.

One of the great advances of Unix was that you don't need separate handling for binary data and text data; they are stored in the same kind of file and can be contained in the same kinds of strings (except, sadly, in C). Occasionally you need to do some kind of text-specific processing where you care, but the rest of the time you can keep all your code 8-bit clean so that it can handle any data safely.

Languages that have adopted the approach you advocate, such as Python, frequently have bugs like exception tracebacks they can't print (because stdout is set to ASCII) or filenames they can't open when they're passed in on the command line (because they aren't valid UTF-8).

replies(1): >>kragen+3l2

>>gf000+DY
There are a lot of operations that are valid and well-defined on binary strings, such as sorting them, hashing them, writing them to files, measuring their lengths, indexing a trie with them, splitting them on delimiter bytes or substrings, concatenating them, substring-searching them, posting them to ZMQ as messages, subscribing to them as ZMQ prefixes, using them as keys or values in LevelDB, and so on. For binary strings that don't contain null bytes, we can add passing them as command-line arguments and using them as filenames.

The entire point of UTF-8 (designed, by the way, by the group that designed Go) is to encode Unicode in such a way that these byte string operations perform the corresponding Unicode operations, precisely so that you don't have to care whether your string is Unicode or just plain ASCII, so you don't need any error handling, except for the rare case where you want to do something related to the text that the string semantically represents. The only operation that doesn't really map is measuring the length.
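
To make that concrete, here is a small Go sketch (the helper names are mine, not from any library). It sorts the same words once by raw bytes and once by decoded code points and gets the same order, searches for a substring without decoding anything, and then shows the one operation that does not carry over, length. The code-point sort is only meaningful for valid UTF-8 input.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode/utf8"
)

// sortedBytewise returns a copy of words sorted by raw byte comparison,
// which is what Go's built-in string ordering does.
func sortedBytewise(words []string) []string {
	out := append([]string(nil), words...)
	sort.Strings(out)
	return out
}

// sortedByCodePoint returns a copy of words sorted by comparing decoded
// code points one at a time.
func sortedByCodePoint(words []string) []string {
	out := append([]string(nil), words...)
	sort.Slice(out, func(i, j int) bool {
		a, b := []rune(out[i]), []rune(out[j])
		for k := 0; k < len(a) && k < len(b); k++ {
			if a[k] != b[k] {
				return a[k] < b[k]
			}
		}
		return len(a) < len(b)
	})
	return out
}

// byteAndRuneLen returns the length of s in bytes and in code points.
func byteAndRuneLen(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	words := []string{"zebra", "\u00e9clair", "\u65e5\u672c", "apple", "\U0001F600"}
	fmt.Println(sortedBytewise(words))
	fmt.Println(sortedByCodePoint(words)) // same order as the line above

	// Byte-wise substring search; the result is a byte offset.
	fmt.Println(strings.Index("na\u00efve caf\u00e9", "caf\u00e9")) // 7

	fmt.Println(byteAndRuneLen("caf\u00e9")) // 5 4
}
```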

replies(2): >>gf000+6r1 >>xyzzyz+ut1

>>shrubb+fq
Yes, Golang is superficially almost identical to Pike's Newsqueak.

replies(1): >>pjmlp+I03

>>ants_e+yI
They were literally using the Plan9 C compiler and linker.

replies(1): >>ants_e+aR1

>>yencab+9G
I'm confused: is Go restricted to UTF-8-only filenames? It can read/write arbitrary byte sequences (which is what string can hold), which should be sufficient for dealing with other encodings.

replies(1): >>yencab+3s1

>>kragen+gq1
Then [u8] can surely implement those functions.

>>gf000+kV
They might very well be, because time-handling in Java almost always sucked. In the beginning there was java.util.Date and it was very poorly designed. Sun tried to fix that with java.util.Calendar. That worked for a while but it was still cumbersome, Calendar.getInstance() anyone? After that someone sat down and wrote Joda-Time, which was really really cool and IMO the basis of JSR-310 and the new java.time API. So you're kind of right, but it only took them 15 years to make it right.

replies(1): >>gf000+sF1

>>comman+3r1
Go is not restricted, since strings are only conventionally utf-8 but not restricted to that.

replies(1): >>comman+T82

>>kragen+gq1
> There are a lot of operations that are valid and well-defined on binary strings, such as (...), and so on.

Every single thing you listed here is supported by the &[u8] type. That's the point: if you want to operate on data without assuming it's valid UTF-8, you just use &[u8] (or the allocating Vec<u8>), and the standard library offers what you'd typically want, except for the functions that assume that the string is valid UTF-8 (like e.g. iterating over code points). If you want that, you need to convert your &[u8] to &str, and the process of conversion forces you to check for conversion errors.

replies(2): >>kragen+Px1 >>maxdam+y32

>>xyzzyz+ut1
That's semantically okay, but giving &str such a short name creates a dangerous temptation to use it for things such as filenames, stdio, and command-line arguments, where that process of conversion introduces errors into code that would otherwise work reliably for any non-null-containing string, as it does in Go. If it were called something like ValidatedUnicodeTextSlice it would probably be fine.

replies(3): >>adastr+7K1 >>xyzzyz+aP1 >>amluto+uqt

>>toast0+161
You could also have the number of code UNITS, which is the route C# took.

>>yencab+9G
> That sounds like your kernel refusing to create that file

Yes, that was my assumption when bash et al also had problems with it.

>>astran+rh1
A string can contain characters a path cannot, depending on the operating system. So only some strings are valid paths.

Typically the way you do this is you have the constructor for path do the validation or you use a static path::fromString() function.

Also, paths breaking when a file is moved is sometimes the correct behavior. For example, something like openFile() or moveFile() requires paths. A path can also be a relative location.
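
A rough Go sketch of the validating-constructor idea, for illustration only. `Filename` and `ParseFilename` are hypothetical names, and the rules are the ones a typical Unix kernel applies to a single path component; other operating systems would need different checks. Note that nothing here requires UTF-8.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Filename is a hypothetical type for a single path component. The intent is
// that values are only ever produced by ParseFilename.
type Filename struct{ raw string }

// ParseFilename applies Unix-style rules for one component: non-empty,
// no NUL byte, no '/'. It deliberately does not check for valid UTF-8.
func ParseFilename(s string) (Filename, error) {
	switch {
	case s == "":
		return Filename{}, errors.New("empty filename")
	case strings.IndexByte(s, 0) >= 0:
		return Filename{}, errors.New("filename contains NUL")
	case strings.IndexByte(s, '/') >= 0:
		return Filename{}, errors.New("filename contains '/'")
	}
	return Filename{raw: s}, nil
}

// String returns the underlying bytes unchanged.
func (f Filename) String() string { return f.raw }

func main() {
	for _, s := range []string{"notes.txt", "\x80", "a/b", "bad\x00name", ""} {
		_, err := ParseFilename(s)
		fmt.Printf("%q -> %v\n", s, err)
	}
}
```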

replies(1): >>astran+lS1

>>p2deta+dr1
At the time of Date's "reign", was there any other language with a better library? And Calendar is not a replacement for Date, so it's a bit out of the picture.

Joda-Time is an excellent library, and indeed it was basically the basis for Java's time API, and for pretty much any modern language's time API. But given the history, Java basically always had the best time library available at the time.

replies(1): >>p2deta+Mr3

>>0x696C+731
It’s not about making zero mistakes, it’s about learning from previous languages which made mistakes and not repeating them. I decided against using Go pretty early on because I recognized just how many mistakes they were repeating that would end up haunting maintainers.

replies(1): >>0x696C+PQ1

>>kragen+Px1
I'd agree if it was &[bytes] or whatever. But &[u8] isn't that much different from &str.

replies(1): >>kragen+iO1

>>inferi+d81
You can easily treat `&str` as bytes, just call `.as_bytes()`, and you get `&[u8]`, no questions asked. The reason why you don't want to treat &str as just bytes by default is that it's almost always a wrong thing to do. Moreover, it's the worst kind of a wrong thing, because it actually works correctly 99% of the time, so you might not even realize you have a bug until much too late.

If your API takes &str, and tries to do byte-based indexing, it should almost certainly be taking &[u8] instead.

replies(1): >>inferi+Ib2

>>adastr+7K1
Isn't &[u8] what you should be using for command-line arguments and filenames and whatnot? In that case you'd want its name to be short, like &[u8], rather than long like &[bytes] or &[raw_uncut_byte] or something.

replies(1): >>adastr+p32

>>kragen+Px1
It's actually extremely hard to introduce problems like that, precisely because Rust's standard library is very well designed. Can you give an example scenario where it would be a problem?

replies(1): >>kragen+r82

>>maxdam+aO
> If you use 3) to create a &str/String from invalid bytes, you can't safely use that string as the standard library is unfortunately designed around the assumption that only valid UTF-8 is stored.

Yes, and that's a good thing. It allows any code that receives a &str/String to assume that the input is valid UTF-8. The alternative would be that every single time you write a function that takes a string as an argument, you have to analyze your code, consider what would happen if the argument was not valid UTF-8, and handle that appropriately. You'd also have to redo the whole analysis every time you modify the function. That's a horrible waste of time: it's much better to:

1) Convert things to String early, and assume validity later, and

2) Make functions that explicitly don't care about validity take &[u8] instead.

This is, of course, exactly what Rust does: I am not aware of a single thing that &str allows you to do that you cannot do with &[u8], except things that do require you to assume it's valid UTF-8.

replies(1): >>maxdam+bW1

>>jack_h+FG1
[flagged]

replies(1): >>achier+jk2

>>kragen+wq1
Yes I'm aware

replies(1): >>kragen+xd2

>>const_+SC1
> A string can contain characters a path cannot, depending on the operating system. So only some strings are valid paths.

Can it? If you want to open a file with invalid UTF8 in the name, then the path has to contain that.

And a path can contain the path separator - it's the filename that can't contain it.

> For example something like openFile() or moveFile() requires paths.

macOS has something called bookmark URLs that can contain things like inode numbers or addresses of network mounts. Apps use it to remember how to find recently opened files even if you've reorganized your disk or the mount has dropped off.

IIRC it does resolve to a path so it can use open() eventually, but you could imagine an alternative. Well, security issues aside.

replies(1): >>adastr+o42

>>blibbl+jP
This issue is probably my favorite Goism. Real issue identified and the feedback is, “You shouldn’t run hardware that way. Run servers like Google does without time jumping.” Similar with the original stance to code versioning. Just run a monorepo!

>>xyzzyz+VP1
> This is, of course, exactly what Rust does: I am not aware of a single thing that &str allows you to do that you cannot do with &[u8], except things that do require you to assume it's valid UTF-8.

Doesn't this demonstrate my point? If you can do everything with &[u8], what's the point in validating UTF-8? It's just a less universal string type, and your program wastes CPU cycles doing unnecessary validation.

replies(1): >>matt_k+u03

>>kragen+iO1
OsStr/OsString is what you would use in those circumstances. Path/PathBuf specifically for filenames or paths, which I think uses OsStr/OsString internally. I've never looked at OsStr's internals but I wouldn't be surprised if it is a wrapper around &[u8].

Note that &[u8] would allow things like null bytes, and maybe other edge cases.

replies(1): >>kragen+E82

>>xyzzyz+ut1
The problem is that there are so many functions that unnecessarily take `&str` rather than `&[u8]` because the expectation is that textual things should use `&str`.

So you naturally write another one of these functions that takes a `&str` so that it can pass to another function that only accepts `&str`.

Fundamentally no one actually requires validation (ie, walking over the string an extra time up front), we're just making it part of the contract because something else has made it part of the contract.

replies(1): >>kragen+ya2

>>astran+lS1
Rust allows null bytes in str. Most (all?) OSes don't allow null bytes in filenames.

>>andyfe+Ns
Windows paths are not necessarily well-formed UTF-16 (UCS-2 by some people’s definition) down to the filesystem level. If they were always well formed, you could convert to a single byte representation by straightforward Unicode re-encoding. But since they aren’t - there are choices that need to be made about what to do with malformed UTF-16 if you want to round trip them to single byte strings such that they match UTF-8 encoding if they are well formed.

In Linux, they’re 8-bit almost-arbitrary strings like you noted, and usually UTF-8. So they always have a convenient 8-bit encoding (I.e. leave them alone). If you hated yourself and wanted to convert them to UTF-16, however, you’d have the same problem Windows does but in reverse.

>>klodol+hr
It's completely in-line with Rust's approach. Concentrate on the hard stuff that lifts every boat. Like the type system, language features, and keep the standard library very small, and maybe import/adopt very successful packages. (Like once_cell. But since removing things from std is considered a forever no-no, it seems path handling has to be solved by crates. Eg. https://github.com/chipsenkbeil/typed-path )

>>jstimp+Ut
Windows filenames in the W APIs are 16-bit (which the A APIs essentially wrap with conversions to the active old-school codepage), and are normally well formed UTF-16. But they aren’t required to be - NTFS itself only cares about 0x0000 and 0x005C (backslash) I believe, and all layers of the stack accept invalid UTF-16 surrogates. Don’t get me started on the normal Win32 path processing (Unicode normalization, “COM” is still a special file, etc.), some of which can be bypassed with the “\\?\” prefix when in NTFS.

The upshot is that since the values aren’t always UTF-16, there’s no canonical way to convert them to single byte strings such that valid UTF-16 gets turned into valid UTF-8 but the rest can still be roundtripped. That’s what bastardized encodings like WTF-8 solve. The Rust Path API is the best take on this I’ve seen that doesn’t choke on bad Unicode.

>>xyzzyz+aP1
Well, for example, the extremely exotic scenario of passing command-line arguments to a program on little-known operating systems like Linux and FreeBSD; https://doc.rust-lang.org/book/ch12-01-accepting-command-lin... recommends:

  use std::env;

  fn main() {
      let args: Vec<String> = env::args().collect();
      ...
  }

When I run this code, a literal example from the official manual, with this filename I have here, it panics:

    $ ./main $'\200'
    thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: "\x80"', library/std/src/env.rs:805:51
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

($'\200' is bash's notation for a single byte with the value 128. We'll see it below in the strace output.)

So, literally any program anyone writes in Rust will crash if you attempt to pass it that filename, if it uses the manual's recommended way to accept command-line arguments. It might work fine for a long time, in all kinds of tests, and then blow up in production when a wild file appears with a filename that fails to be valid Unicode.

This C program I just wrote handles it fine:

  #include <unistd.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>

  char buf[4096];

  void
  err(char *s)
  {
    perror(s);
    exit(-1);
  }

  int
  main(int argc, char **argv)
  {
    int input, output;
    if ((input = open(argv[1], O_RDONLY)) < 0) err(argv[1]);
    if ((output = open(argv[2], O_WRONLY | O_CREAT, 0666)) < 0) err(argv[2]);
    for (;;) {
      ssize_t size = read(input, buf, sizeof buf);
      if (size < 0) err("read");
      if (size == 0) return 0;
      ssize_t size2 = write(output, buf, (size_t)size);
      if (size2 != size) err("write");
    }
  }

(I probably should have used O_TRUNC.)

Here you can see that it does successfully copy that file:

    $ cat baz
    cat: baz: No such file or directory
    $ strace -s4096 ./cp $'\200' baz
    execve("./cp", ["./cp", "\200", "baz"], 0x7ffd7ab60058 /* 50 vars */) = 0
    brk(NULL)                               = 0xd3ec000
    brk(0xd3ecd00)                          = 0xd3ecd00
    arch_prctl(ARCH_SET_FS, 0xd3ec380)      = 0
    set_tid_address(0xd3ec650)              = 4153012
    set_robust_list(0xd3ec660, 24)          = 0
    rseq(0xd3ecca0, 0x20, 0, 0x53053053)    = 0
    prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=9788*1024, rlim_max=RLIM64_INFINITY}) = 0
    readlink("/proc/self/exe", ".../cp", 4096) = 22
    getrandom("\xcf\x1f\xb7\xd3\xdb\x4c\xc7\x2c", 8, GRND_NONBLOCK) = 8
    brk(NULL)                               = 0xd3ecd00
    brk(0xd40dd00)                          = 0xd40dd00
    brk(0xd40e000)                          = 0xd40e000
    mprotect(0x4a2000, 16384, PROT_READ)    = 0
    openat(AT_FDCWD, "\200", O_RDONLY)      = 3
    openat(AT_FDCWD, "baz", O_WRONLY|O_CREAT, 0666) = 4
    read(3, "foo\n", 4096)                  = 4
    write(4, "foo\n", 4)                    = 4
    read(3, "", 4096)                       = 0
    exit_group(0)                           = ?
    +++ exited with 0 +++
    $ cat baz
    foo

The Rust manual page linked above explains why they think introducing this bug by default into all your programs is a good idea, and how to avoid it:

> Note that std::env::args will panic if any argument contains invalid Unicode. If your program needs to accept arguments containing invalid Unicode, use std::env::args_os instead. That function returns an iterator that produces OsString values instead of String values. We’ve chosen to use std::env::args here for simplicity because OsString values differ per platform and are more complex to work with than String values.

I don't know what's "complex" about OsString, but for the time being I'll take the manual's word for it.

So, Rust's approach evidently makes it extremely hard not to introduce problems like that, even in the simplest programs.

Go's approach doesn't have that problem; this program works just as well as the C program, without the Rust footgun:

  package main

  import (
          "io"
          "log"
          "os"
  )

  func main() {
          src, err := os.Open(os.Args[1])
          if err != nil {
                  log.Fatalf("open source: %v", err)
          }

          dst, err := os.OpenFile(os.Args[2], os.O_CREATE|os.O_WRONLY, 0666)
          if err != nil {
                  log.Fatalf("create dest: %v", err)
          }

          if _, err := io.Copy(dst, src); err != nil {
                  log.Fatalf("copy: %v", err)
          }
  }

(O_CREATE makes me laugh. I guess Ken did get to spell "creat" with an "e" after all!)

This program generates a much less clean strace, so I am not going to include it.

You might wonder how such a filename could arise other than as a deliberate attack. The most common scenario is when the filenames are encoded in a non-Unicode encoding like Shift-JIS or Latin-1, followed by disk corruption, but the deliberate attack scenario is nothing to sneeze at either. You don't want attackers to be able to create filenames your tools can't see, or turn to stone if they examine, like Medusa.

Note that the log message on error also includes the ill-formed Unicode filename:

  $ ./cp $'\201' baz
  2025/08/22 21:53:49 open source: open ζ: no such file or directory

But it didn't say ζ. It actually emitted a byte with value 129, making the error message ill-formed UTF-8. This is obviously potentially dangerous, depending on where that logfile goes, because it can include arbitrary terminal escape sequences. But note that Rust's UTF-8 validation won't protect you from that, or from things like this:

  $ ./cp $'\n2025/08/22 21:59:59 oh no' baz
  2025/08/22 21:59:09 open source: open 
  2025/08/22 21:59:59 oh no: no such file or directory
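
One common mitigation for both of the log lines above, in either language, is to quote untrusted names before they reach the log. A minimal Go sketch (`quoteForLog` is a made-up wrapper; `strconv.Quote` and the `%q` verb are the real standard-library pieces):

```go
package main

import (
	"fmt"
	"strconv"
)

// quoteForLog renders an untrusted name as a Go-style quoted string, so that
// control characters and ill-formed bytes show up as escapes in the log line
// instead of being emitted raw.
func quoteForLog(name string) string {
	return strconv.Quote(name)
}

func main() {
	fmt.Println("open source: open", quoteForLog("\x81"))
	fmt.Println("open source: open", quoteForLog("\n2025/08/22 21:59:59 oh no"))

	// fmt's %q verb applies the same quoting, here to a terminal escape.
	fmt.Printf("open source: open %q\n", "\x1b[2J")
}
```

With this, the forged second log line stays on one physical line, and the stray byte is printed as an escape rather than as itself.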

I'm not bagging on Rust. There are a lot of good things about Rust. But its string handling is not one of them.

replies(1): >>anarki+4P2

>>adastr+p32
You can't get null bytes from a command-line argument. And going by >>44991638 it's common to not use OsString when accepting command-line arguments, because std::env::args yields Strings, which means that probably most Rust programs that accept filenames on the command line have this bug.

replies(1): >>adastr+4d2

>>yencab+3s1
Then I am having a hard time understanding the issue in the post; it seems pretty vague. Is there any idea what specific issue is happening? Is it how they've used Go, or does Go have an inherent implementation issue? Specifically these lines:

If you stuff random binary data into a string, Go just steams along, as described in this post.

Over the decades I have lost data to tools skipping non-UTF-8 filenames. I should not be blamed for having files that were named before UTF-8 existed.

replies(3): >>comex+Xe2 >>yencab+dl2 >>kragen+1o3

>>maxdam+y32
It's much worse than that—in many cases, such as passing a filename to a program on the Linux command line, correct behavior requires not validating, so erroring out when validation fails introduces bugs. I've explained this in more detail in >>44991638 .

>>xyzzyz+3O1

  If your API takes &str, and tries to do byte-based indexing, it should
  almost certainly be taking &[u8] instead.

Str is indexed by bytes. That's the issue.

replies(1): >>xyzzyz+d94

>>kragen+E82
Rust String can contain null bytes! Rust uses explicit string lengths. Agreed, though, that most OSes wouldn't be able to pass null bytes in arguments.

replies(1): >>kragen+Od2

>>ants_e+aR1
Literally building the project out of the Plan 9 source code is very far from "bring[ing] their previous experience to the project, (...) some Plan9 influence in there somewhere"

replies(1): >>ants_e+7k2

>>adastr+GP
> I don’t understand this complaint. (3) sounds like exactly what you are asking for. And yes, doing unsafe thing is unsafe

You're meant to use `unsafe` as a way of limiting the scope of reasoning about safety.

Once you construct a `&str` using `from_utf8_unchecked`, you can't safely pass it to any other function without looking at its code and reasoning about whether it's still safe.

Also see the actual documentation: https://doc.rust-lang.org/std/primitive.str.html#method.from...

> Safety: The bytes passed in must be valid UTF-8.

>>adastr+4d2
Right, but it can't contain invalid UTF-8, which is valid in both command-line parameters and in filenames on Linux, FreeBSD, and other normal Unixes. See my link above for a demonstration of how this causes bugs in Rust programs.

>>comman+T82
Yeah, the complaint is pretty bizarre, or at least unclear.

>>stouse+VL
Can you give an example of how Go's approach causes people to lose data? This was alluded to in the blog post but they didn't explain anything.

It seems like there's some confusion in the GGGGGP post, since Go works correctly even if the filename is not valid UTF-8 .. maybe that's why they haven't noticed any issues.

replies(1): >>xyzzyz+2u2

>>kragen+xd2
It's a C compiler. Is your point that Go is influenced by C? ...

replies(2): >>kragen+Rk2 >>tom_m+Ks2

>>0x696C+PQ1
This isn't an issue of intelligence, and GP didn't imply that it was.

replies(1): >>0x696C+5W2

>>ants_e+7k2
I think you should upgrade to a less badly quantized neural network model.

replies(1): >>ants_e+bm2

>>kragen+Jn1
As I demonstrated in >>44991638 , it's easy to run into this problem in, for example, Rust.

>>comman+T82
Let me translate: "I have decided to not like something so now I associate miscellaneous previous negative experiences with it"

>>kragen+Rk2
I don't see why you've been continually replying so impolitely. I've tried to give you the benefit of the doubt, but I see I've just wasted my time.

replies(1): >>kragen+2o2

>>ants_e+bm2
Certainly isn't what it looks like to me.

replies(1): >>ants_e+Xo2

>>kragen+2o2
okay well. good luck getting angry at people on the internet or whatever else you do

>>gf000+kV
ROFL really?

>>ants_e+7k2
They started there, but it is now compiled by Go itself.

>>maxdam+mf2
Imagine that you're writing a function that'll walk the directory to copy some files somewhere else, and then delete the directory. Unfortunately, you hit this

https://github.com/golang/go/issues/32334

oops, looks like some files are just inaccessible to you, and you cannot copy them.

Fortunately, when you try to delete the source directory, Go's standard library enters an infinite loop, which saves your data.

https://github.com/golang/go/issues/59971

replies(2): >>maxdam+fA2 >>klodol+XF2

>>xyzzyz+(OP)
> What if the file name is not valid UTF-8, though?

Then make it valid UTF-8. If you try to solve the long tail of issues in a commonly used function of the library, it's going to cause a lot of pain. This approach is better. If someone has a weird problem like file names with invalid characters, they can solve it themselves, even publish a package. Why complicate 100% of uses for solving 0.01% of issues?

replies(1): >>nomel+hw2

>>perryi+7w2
> Then make it valid UTF-8.

I think you misunderstand. How do you do that for a file that exists on disk that's trying to be read? Rename it for them? They may not like that.

>>xyzzyz+2u2
Ah, mentioning Windows filenames would have been useful.

I guess the issue isn't so much about whether strings are well-formed, but about whether the conversion (eg, from UTF-16 to UTF-8 at the filesystem boundary) raises an error or silently modifies the data to use replacement characters.

I do think that is the main fundamental mistake in Go's Unicode handling; it tends to use replacement characters automatically instead of signalling errors. Using replacement characters is at least conformant to Unicode but imo unless you know the text is not going to be used as an identifier (like a filename), conversion should instead just fail.

The other option is using some mechanism to preserve the errors instead of failing quietly (replacement) or failing loudly (raise/throw/panic/return err), and I believe that's what they're now doing for filenames on Windows, using WTF-8. I agree with this new approach, though would still have preferred they not use replacement characters automatically in various places (another one is the "json" module, which quietly corrupts your non-UTF-8 and non-UTF-16 data using replacement characters).
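
The JSON case is easy to reproduce with the standard `encoding/json` package. In this sketch (`jsonRoundTrip` is just a helper name for the example), a string containing one ill-formed byte goes through Marshal and Unmarshal with no error, and comes back changed:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jsonRoundTrip marshals s as a JSON string and unmarshals it again.
func jsonRoundTrip(s string) (string, error) {
	enc, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	var out string
	if err := json.Unmarshal(enc, &out); err != nil {
		return "", err
	}
	return out, nil
}

func main() {
	in := "a\x80b" // the middle byte is not valid UTF-8
	out, err := jsonRoundTrip(in)
	// No error is reported, but the bad byte has become U+FFFD.
	fmt.Printf("%q -> %q (err=%v, same=%t)\n", in, out, err, in == out)
}
```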

Probably worth noting that the WTF-8 approach works because strings are not validated; WTF-8 involves converting invalid UTF-16 data into invalid UTF-8 data such that the conversion is reversible. It would not be possible to encode invalid UTF-16 data into valid UTF-8 data without changing the meaning of valid Unicode strings.
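
For the curious, the reversible-conversion idea can be sketched in a few lines of Go. This is my own illustration of the principle, not the WTF-8 specification and not the code Go or Rust actually use; `encodeLossless` and `decodeLossless` are made-up names. Well-formed UTF-16 becomes ordinary UTF-8, and an unpaired surrogate becomes the three-byte pattern UTF-8 would use for it if surrogates were allowed.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// encodeLossless maps 16-bit units to bytes without losing unpaired surrogates.
func encodeLossless(units []uint16) []byte {
	var out []byte
	for i := 0; i < len(units); i++ {
		u := rune(units[i])
		if !utf16.IsSurrogate(u) {
			out = utf8.AppendRune(out, u)
			continue
		}
		if i+1 < len(units) {
			if r := utf16.DecodeRune(u, rune(units[i+1])); r != utf8.RuneError {
				out = utf8.AppendRune(out, r) // valid pair: one 4-byte sequence
				i++
				continue
			}
		}
		// Unpaired surrogate: emit ill-formed but reversible bytes (ED A0..BF xx).
		out = append(out, 0xE0|byte(u>>12), 0x80|(byte(u>>6)&0x3F), 0x80|(byte(u)&0x3F))
	}
	return out
}

// decodeLossless reverses encodeLossless.
func decodeLossless(b []byte) []uint16 {
	var out []uint16
	for len(b) > 0 {
		// Recognize the surrogate pattern before asking the strict decoder.
		if len(b) >= 3 && b[0] == 0xED && b[1]&0xE0 == 0xA0 && b[2]&0xC0 == 0x80 {
			u := uint16(b[0]&0x0F)<<12 | uint16(b[1]&0x3F)<<6 | uint16(b[2]&0x3F)
			out = append(out, u)
			b = b[3:]
			continue
		}
		r, size := utf8.DecodeRune(b)
		out = append(out, utf16.Encode([]rune{r})...)
		b = b[size:]
	}
	return out
}

func main() {
	// 'A', a lone high surrogate, then a valid surrogate pair (U+1F600).
	units := []uint16{0x0041, 0xD800, 0xD83D, 0xDE00}
	enc := encodeLossless(units)
	fmt.Printf("% x (valid UTF-8: %t)\n", enc, utf8.Valid(enc))
	fmt.Printf("%04x\n", decodeLossless(enc))
}
```

The encoded bytes fail UTF-8 validation, which is exactly why this only works in a language whose string type does not validate.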

>>xyzzyz+2u2
IMO the right thing to do here is even messier than Go’s approach, which is give people utf-16-ish strings on Windows.

replies(1): >>maxdam+fO2

>>klodol+XF2
They have effectively done this (since the linked issue was raised), by just converting Windows filenames to WTF-8.

I think this is sensible, because the fact that Windows still uses UTF-16 (or more precisely "Unicode 16-bit strings") in some places shouldn't need to complicate the API on other platforms that didn't make the UCS-2/UTF-16 mistake.

It's possible that the WTF-8 strings might not concatenate the way they do in UTF-16 or properly enforced WTF-8 (which has special behaviour on concatenation), but they'll still round-trip to the intended 16-bit string, even after concatenation.

>>kragen+r82
There might be potential improvements, like using OsString by default for `env::args()` but I would pick Rust's string handling over Go’s or C's any day.

replies(1): >>kragen+In3

>>achier+jk2
They very much did imply that.

>>maxdam+bW1
> except things that do require you to assume it's valid UTF-8

That's the point.

replies(1): >>maxdam+Ko3

>>shrubb+fq
Kind of. More precisely Limbo, which was written for Inferno and took into consideration what made Alef's design for Plan 9 a failure, such as not having garbage collection.

>>kragen+qq1
More like Limbo and Alef.

replies(1): >>kragen+Of3

>>pjmlp+I03
Agreed.

>>anarki+4P2
It's reasonable to argue that C's string handling is as bad as Rust's, or worse.

>>yencab+9G
I've posted a longer explanation in >>44991638 . I'm interested to hear which kernel and which filesystem zimpenfish is using that has this problem.

replies(1): >>yencab+ws3

>>comman+T82
The post is wrong on this point, although it's mostly correct otherwise. Just steaming along when you have random binary data in a string, as Golang does, is how you avoid losing data to tools that skip non-UTF-8 filenames, or crash on them.

>>matt_k+u03
But no one has demonstrated an actual operation that requires valid UTF-8. The reasoning is always circular: "I require valid UTF-8 because someone else requires valid UTF-8".

Eventually there should be an underlying operation which can only work on valid UTF-8, but that doesn't exist. UTF-8 was designed such that invalid data can be detected and handled, without affecting the meaning of valid subsequences in the same string.

replies(1): >>amluto+trt

>>gf000+sF1
I’m sorry but I do not agree at all.

That “reign” continued forever if you count until java.time got introduced, and no, Calendar was not much better in the meantime. Python already had datetime in 2002 or 2003, and VB6 was miles ahead back when Java had just util.Date.

>>kragen+Sn3
I believe macOS forces UTF-8 filenames and normalizes them to something near-but-not-quite Unicode NFD.

Windows doing something similar wouldn't surprise me at all. I believe NTFS internally stores filenames as UTF-16, so enforcing UTF-8 at the API boundary sounds likely.

replies(1): >>kragen+sx3

>>yencab+ws3
That sounds right. Fortunately, it's not my problem that they're using a buggy piece of shit for an OS.

>>inferi+Ib2
As a matter of fact, you cannot do

  let s = "asd";
  println!("{}", s[0]);

You will get a compiler error telling you that you cannot index into &str.

replies(1): >>inferi+2j4

>>xyzzyz+d94
Right, you have to give it a usize range. And that will index by bytes. This:

  fn main() {
      let s = "12345";
      println!("{}", &s[0..1]);
  }

compiles and prints out "1".

This:

  fn main() {
      let s = "\u{1234}2345";
      println!("{}", &s[0..1]);
  }

compiles and panics with the following error:

  byte index 1 is not a char boundary; it is inside 'ሴ' (bytes 0..3) of `ሴ2345`

To get the nth char (scalar codepoint):

  fn main() {
      let s = "\u{1234}2345";
      println!("{}", s.chars().nth(1).unwrap());
  }

To get a substring:

  fn main() {
      let s = "\u{1234}2345";
      println!("{}", s.chars().skip(0).take(1).collect::<String>());
  }

To actually get the bytes you'd have to call #as_bytes which works with scalar and range indices, e.g.:

  fn main() {
      let s = "\u{1234}2345";
      println!("{:02X?}", &s.as_bytes()[0..1]);
      println!("{:02X}", &s.as_bytes()[0]);
  }

IMO it's less intuitive than it should be but still less bad than e.g. Go's two types of nil because it will fail in a visible manner.

replies(1): >>xyzzyz+sH4

>>inferi+2j4
It's actually somewhat hard to hit that panic in a realistic scenario. This is because you are unlikely to be using slice indices that are not on a character boundary. Where would you even get them from? All the standard library functions will return byte indices on a character boundary. For example, if you try to do something like slice the string between first occurrence of character 'a', and of character 'z', you'll do something like

  let start = s.find('a')?;
  let end = s.find('z')?;
  let sub = &s[start..end];

and it will never panic, because find will never return something that's not on a char boundary.

replies(1): >>inferi+xJ4

>>xyzzyz+sH4

  Where would you even get them from?

In my case it was in parsing text where a numeric value had a two character prefix but a string value did not. So I was matching on 0..2 (actually 0..2.min(string.len()) which doubly highlights the indexing issue) which blew up occasionally depending on the string values. There are perhaps smarter ways to do this (e.g. splitn on a space, regex, giant if-else statement, etc, etc) but this seemed at first glance to be the most efficient way because it all fit neatly into a match statement.

The inverse was also a problem: laying out text with a monospace font knowing that every character took up the same number of pixels along the x-axis (e.g. no odd emoji or whatever else). Gotta make sure to call #len on #chars instead of the string itself as some of the text (Windows-1250 encoded) got converted into multi-byte Unicode codepoints.

>>TheDon+rJ
Go programmers (and `range`) assume that string is always valid UTF-8 but there is no guarantee by the language that a string is valid UTF-8. The string itself is still a []byte. `range` sees the `string` type and has special handling for strings that it does not have when it ranges over []byte. Recall that aliased types are not viewed as the same type at any time.

A couple quotes from the Go Blog by Rob Pike:

> It’s important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes.

> Besides the axiomatic detail that Go source code is UTF-8, there’s really only one way that Go treats UTF-8 specially, and that is when using a for range loop on a string.

Both from https://go.dev/blog/strings

If you want UTF-8 in a guaranteed way, use the functions available in unicode/utf8 for that. Using `string` is not sufficient unless you make sure you only put UTF-8 into those strings.

If you put valid UTF-8 into a string, you can be sure that the string holds valid UTF-8, but if someone else puts data into a string, and you assume that it is valid UTF-8, you may have a problem because of that assumption.
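
A short sketch of what that looks like in practice (helper names are mine): the string keeps its bytes untouched, `range` decodes as it goes and reports U+FFFD for the bad byte, and `unicode/utf8` is where you go if you actually want a guarantee.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runesSeenByRange collects the runes a for-range loop yields for s.
func runesSeenByRange(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

// checkedString converts b to a string only if it is valid UTF-8.
func checkedString(b []byte) (string, bool) {
	if !utf8.Valid(b) {
		return "", false
	}
	return string(b), true
}

func main() {
	s := "a\x80b" // the middle byte is not valid UTF-8

	fmt.Println(len(s))                     // 3: length is in bytes
	fmt.Printf("% x\n", s)                  // 61 80 62: bytes are stored untouched
	fmt.Printf("%q\n", runesSeenByRange(s)) // range yields U+FFFD for the bad byte
	fmt.Println(utf8.ValidString(s))        // false

	_, ok := checkedString([]byte(s))
	fmt.Println(ok) // false
}
```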

>>xyzzyz+d1
> if it doesn't allow you to make any extra assumptions about the contents beyond `[]byte`?

It does though? Strings are internable, comparable, can be keys, etc.
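
For example, a `string` works as a map key and with `==`, neither of which a `[]byte` supports directly, and none of it depends on the contents being UTF-8. A tiny sketch (`countNames` is just an illustrative helper):

```go
package main

import "fmt"

// countNames tallies names using them directly as map keys. Keys are compared
// byte for byte, so ill-formed UTF-8 is as good a key as anything else.
func countNames(names []string) map[string]int {
	counts := make(map[string]int)
	for _, n := range names {
		counts[n]++
	}
	return counts
}

func main() {
	counts := countNames([]string{"\x80", "caf\u00e9", "\x80"})
	fmt.Println(counts["\x80"], counts["caf\u00e9"]) // 2 1

	a, b := string([]byte{0x80}), "\x80"
	fmt.Println(a == b) // true
}
```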

>>maxdam+yv
> It's never been clear to me where such a type is actually useful. In what cases do you really need to restrict it to valid UTF-8?

At the protocol (or disk, etc) boundary. If I write code that consumes bytes that are intended to be UTF-8, I need to make a choice about what to do if they aren’t UTF-8 somewhere. A strict UTF-8 string forces me to make that choice in a considered location. In a language where a “string” is just bytes, I can forget, or two pieces of code can disagree on what the contract is. And bugs result.

Check out MySQL for a fun example of getting this wildly, impressively wrong. At least a Rust or a type-checked Python 3 wrapper around some MySQL code enforces a degree of correctness, which is much better than having your transaction fail to commit, or commit incorrectly, way down the stack when you get bytes you didn’t expect.

(MySQL can still reject strictly valid UTF-8 data for utterly pathetic historical reasons if you configure it incorrectly.)

>>kragen+Px1
> command-line arguments

Command-line arguments on Windows are their own special disaster.

>>maxdam+Ko3
> UTF-8 was designed such that invalid data can be detected and handled, without affecting the meaning of valid subsequences in the same string.

But there is not a canonical response to invalid data. So literally every operation that might need to make a choice of what to do when presented with invalid data should either (a) accept a parameter asking what to do on error and potentially fail or (b) take a parameter type that forces errors to be handled in advance.
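
Option (a) might look something like this in Go. It is only a sketch: `Policy`, `Decode`, and `ErrInvalidUTF8` are hypothetical names, while `utf8.Valid` and `strings.ToValidUTF8` are real standard-library functions. The caller has to say what should happen to ill-formed input instead of the library quietly picking one answer.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

// Policy is a hypothetical knob for how ill-formed input is handled.
type Policy int

const (
	Reject  Policy = iota // return an error on ill-formed input
	Replace               // substitute U+FFFD for each ill-formed run
	Keep                  // pass the bytes through unchanged
)

// ErrInvalidUTF8 is returned by Decode under the Reject policy.
var ErrInvalidUTF8 = errors.New("input is not valid UTF-8")

// Decode turns raw bytes into a string, handling ill-formed input as asked.
func Decode(b []byte, p Policy) (string, error) {
	if utf8.Valid(b) {
		return string(b), nil
	}
	switch p {
	case Replace:
		return strings.ToValidUTF8(string(b), "\uFFFD"), nil
	case Keep:
		return string(b), nil
	default:
		return "", ErrInvalidUTF8
	}
}

func main() {
	raw := []byte("a\x80b")
	for _, p := range []Policy{Reject, Replace, Keep} {
		s, err := Decode(raw, p)
		fmt.Printf("%q %v\n", s, err)
	}
}
```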