It's far better to get some � when working with messy data than to have applications refuse to work and error out left and right.
[]rune is for sequences of Unicode code points. rune is an alias for int32. string, I think, is essentially a read-only []byte.
Stuff like this matters a great deal on the standard library level.
Score another for Rust's Safety Culture. It would be convenient to just have &str as an alias for &[u8], but if that mistake had been allowed, all the safety checking that Rust now does centrally would have to be owned by every single user forever. Instead of a few dozen checks overseen by experts, there'd be myriad checks sprinkled across every project, always ready to bite you.
You should always be able to iterate the code points of a string, whether or not it's valid Unicode. The iterator can either silently replace any errors with replacement characters, or denote the errors by returning e.g. `Result<char, Utf8Error>`, depending on the use case.
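A minimal sketch of the lossy variant in Rust, assuming Rust 1.79+ where `utf8_chunks` on byte slices is stable (my illustration, not anything standard):

fn code_points_lossy(bytes: &[u8]) -> impl Iterator<Item = char> + '_ {
    // Each chunk is a maximal valid prefix plus a (possibly empty) run of
    // invalid bytes; the invalid part becomes a single U+FFFD.
    bytes.utf8_chunks().flat_map(|chunk| {
        chunk
            .valid()
            .chars()
            .chain((!chunk.invalid().is_empty()).then_some(char::REPLACEMENT_CHARACTER))
    })
}

fn main() {
    let s: String = code_points_lossy(b"ab\xFFcd").collect();
    assert_eq!(s, "ab\u{FFFD}cd");
}

The `Result<char, Utf8Error>`-style variant would instead hand the invalid bytes back to the caller rather than substituting replacement characters.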
All languages that have tried restricting strings to valid Unicode have, afaik, ended up adding workarounds for the fact that real-world "text" sometimes has encoding errors. It's often better to preserve the errors than to corrupt the data with replacement characters, or to refuse some inputs and crash the program.
In Rust there's bstr/ByteStr (currently being added to std); it's awkward having to decide which string type to use.
In Python there's PEP-383/"surrogateescape", which works because Python strings are not guaranteed valid (they're potentially ill-formed UTF-32 sequences, with a range restriction). Awkward figuring out when to actually use it.
In Raku there's UTF8-C8, which is probably the weirdest workaround of all (left as an exercise for the reader to try to understand .. oh, and it also interferes with valid Unicode that's not normalized, because that's another stupid restriction).
Meanwhile the Unicode standard itself specifies Unicode strings as being sequences of code units [0][1], so Go is one of the few modern languages that actually implements Unicode (8-bit) strings. Note that at least two out of the three inventors of Go also basically invented UTF-8.
[0] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...
> Unicode string: A code unit sequence containing code units of a particular Unicode encoding form.
[1] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...
> Unicode strings need not contain well-formed code unit sequences under all conditions. This is equivalent to saying that a particular Unicode string need not be in a Unicode encoding form.
Because 99.999% of the time you want it to be valid and would like an error if it isn't? If you want to work with invalid UTF-8, that should be a deliberate choice.
Consider:
s := string([]byte{226, 150, 136, 226, 150, 136})
for i, chr := range s {
    fmt.Printf("%d = %v\n", i, chr)
    // note: s[i] (a byte) != chr (a rune)
}
How many times does that loop over 6 bytes iterate? The answer is that it iterates twice, with i=0 and i=3. There are also quite a few standard APIs that behave weirdly if a string is not valid UTF-8, which wouldn't be the case if it were just a bag of bytes.
So that means that for 99% of scenarios, there is no difference between char[] and a proper UTF-8 string. They have the same data representation and memory layout.
The problem comes in when people start using string like they use string in PHP. They just use it to store random bytes or other binary data.
This makes no sense with the string type. String is text, but now we don't have text. That's a problem.
We should use byte[] or something for this instead of string. That's an abuse of string. I don't think requiring strings to be text is too constraining - that's what a string is!
If you use 3) to create a &str/String from invalid bytes, you can't safely use that string as the standard library is unfortunately designed around the assumption that only valid UTF-8 is stored.
https://doc.rust-lang.org/std/primitive.str.html#invariant
> Constructing a non-UTF-8 string slice is not immediate undefined behavior, but any function called on a string slice may assume that it is valid UTF-8, which means that a non-UTF-8 string slice can lead to undefined behavior down the road.
However, no, &str is not "an alias for &&String", and I can't quite imagine how you'd think that. String doesn't exist in Rust's core; it's from alloc and thus wouldn't be available if you don't have an allocator.
We can try to shove it into objects that work on other text but this won't work in edge cases.
Like if I take text on Linux and try to write a Windows file with that text, it's broken. And vice versa.
Go lets you do the broken thing. In Rust, or even using libraries in most languages, you can't. You have to explicitly convert between them.
That's what I mean when I say "storing random binary data as text". Sure, Windows' almost-UTF-16 abomination is kind of text, but not really. It's its own thing. That requires a different type of string OR converting it to a normal string.
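For illustration (a small Rust sketch of that conversion step, my example): turning potentially ill-formed UTF-16 into a "normal" string is exactly where the decision gets forced:

fn main() {
    let ok: &[u16] = &[0x0068, 0x0069];  // "hi"
    let bad: &[u16] = &[0x0068, 0xD800]; // 'h' followed by a lone surrogate

    assert_eq!(String::from_utf16(ok).unwrap(), "hi");
    assert!(String::from_utf16(bad).is_err());               // caller must decide
    assert_eq!(String::from_utf16_lossy(bad), "h\u{FFFD}");  // or accept lossiness
}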
It may be legacy cruft downstream of poorly thought-out design decisions at the system/OS level, but we're stuck with it. And a language that doesn't provide the tooling necessary to muddle through this mess safely isn't a serious platform to build on, IMHO.
There is room for languages that explicitly make the tradeoff of being easy to use (e.g. a unified string type) at the cost of not handling many real world edge cases correctly. But these should not be used for serious things like backup systems where edge cases result in lost data. Go is making the tradeoff for language simplicity, while being marketed and positioned as a serious language for writing serious programs, which it is not.
Again, this is the same "simplistic" vs. "just the right abstraction" issue; this just smudges the complexity over a much larger surface area.
If you have a byte array that is not utf-8 encoded, then just... use a byte array.
Off the top of my head, in order of likely difficulty to calculate: byte length, number of code points, number of graphemes/characters, height/width to display.
Maybe it would be best for Str not to have len at all. It could have bytes, code_points, graphemes. And every use would be precise.
FWIW the docs indicate that working with grapheme clusters will never end up in the standard library.
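A rough sketch of what those three views look like in Rust today (my example; grapheme clusters need an external crate such as unicode-segmentation, since they aren't in std):

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = "he\u{0301}llo"; // 'e' followed by a combining acute accent

    println!("bytes:       {}", s.len());                   // 7
    println!("code points: {}", s.chars().count());         // 6
    println!("graphemes:   {}", s.graphemes(true).count()); // 5
}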
Yes, this is why competent libraries don't actually use string for paths. They have their own path data type because it's actually a different kind of data.
Again, you can do the Go thing and just use the broken string, but that's dumb and you shouldn't. They should look at C++ std::filesystem; it's actually quite good in this regard.
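For comparison, here's the same idea sketched with Rust's std::path (my example; the from_bytes call is Unix-specific): the path type carries names that aren't valid UTF-8 without pretending otherwise.

use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt;
use std::path::PathBuf;

fn main() {
    let mut p = PathBuf::from("/var/data");
    p.push(OsStr::from_bytes(b"caf\xE9.txt")); // Latin-1 'é': not valid UTF-8

    assert_eq!(p.extension(), Some(OsStr::new("txt"))); // path operations still work
    assert!(p.to_str().is_none());                      // refuses to pretend it's UTF-8
    println!("{}", p.display());                        // lossy rendering, for humans only
}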
> And a language that doesn't provide the tooling necessary to muddle through this mess safely isn't a serious platform to build on, IMHO.
I agree, even PHP does a better job at this than Go, which is really saying something.
> Go is making the tradeoff for language simplicity, while being marketed and positioned as a serious language for writing serious programs, which it is not.
I would agree.
I mean, really neither should be the default. You should have to pick chars or bytes on use, but I don't think that's palatable; most languages have chosen one or the other as the preferred form. Some have the joy of having been forward-thinking in the 90s and built around UCS-2, later extended to UTF-16, so you've got 16-bit 'characters' with some code points that are two of them. Of course, dealing with operating systems means dealing with whatever they have as well as what the language prefers (or, as discussed elsewhere in this thread, pretending it doesn't exist to make easy things easier and hard things harder).
The answer here isn't to throw up your hands, pick one, and other cases be damned. It's to expose them all and let the engineer choose. To not beat the dead horse of Rust, I'll point that Ruby gets this right too.
* String#length # count code points (characters)
* String#bytes#length # count bytes
* String#grapheme_clusters#length # count grapheme clusters
Similarly, each of those "views" lets you slice, index, etc. across those concepts naturally. Golang's string is the worst of them all. Strings are nominally UTF-8, but nothing actually enforces it; really they're just buckets of bytes, unless you send them to APIs that silently require them to be UTF-8 and drop them on the floor or misbehave if they're not.

Height/width to display is font-dependent, so it can't just be a property of a "string"; it needs an object with additional context.
What is different about it? I don't see any constraints here relevant to having a different type. Note that this thread has already confused the issue, because they said filename and you said path. A path can contain /, it just happens to mean something.
If you want a better abstraction to locations of files on disk, then you shouldn't use paths at all, since they break if the file gets moved.
One of the great advances of Unix was that you don't need separate handling for binary data and text data; they are stored in the same kind of file and can be contained in the same kinds of strings (except, sadly, in C). Occasionally you need to do some kind of text-specific processing where you care, but the rest of the time you can keep all your code 8-bit clean so that it can handle any data safely.
Languages that have adopted the approach you advocate, such as Python, frequently have bugs like exception tracebacks they can't print (because stdout is set to ASCII) or filenames they can't open when they're passed in on the command line (because they aren't valid UTF-8).
The entire point of UTF-8 (designed, by the way, by the group that designed Go) is to encode Unicode in such a way that these byte string operations perform the corresponding Unicode operations, precisely so that you don't have to care whether your string is Unicode or just plain ASCII, so you don't need any error handling, except for the rare case where you want to do something related to the text that the string semantically represents. The only operation that doesn't really map is measuring the length.
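A small sketch of that property (my example, not the parent's): because UTF-8 is self-synchronizing, a naive byte-level substring search agrees with the string-level one, with no decoding step:

fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}

fn main() {
    let text = "naïve café";
    // Byte-level search over the raw UTF-8...
    let byte_pos = find_bytes(text.as_bytes(), "café".as_bytes());
    // ...finds the same offset as the &str-level search.
    assert_eq!(byte_pos, text.find("café"));
    assert_eq!(byte_pos, Some(7));
}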
Every single thing you listed here is supported by the &[u8] type. That's the point: if you want to operate on data without assuming it's valid UTF-8, you just use &[u8] (or the allocating Vec<u8>), and the standard library offers what you'd typically want, except for the functions that assume that the string is valid UTF-8 (like e.g. iterating over code points). If you want that, you need to convert your &[u8] to &str, and the process of conversion forces you to check for conversion errors.
Typically the way you do this is you have the constructor for path do the validation or you use a static path::fromString() function.
Also paths breaking when a file is moved is correct behavior sometimes. For example something like openFile() or moveFile() requires paths. Also path can be relative location.
If your API takes &str, and tries to do byte-based indexing, it should almost certainly be taking &[u8] instead.
Yes, and that's a good thing. It allows all code that gets &str/String to assume that the input is valid UTF-8. The alternative would be that every single time you write a function that takes a string as an argument, you have to analyze your code, consider what would happen if the argument was not valid UTF-8, and handle that appropriately. You'd also have to redo the whole analysis every time you modify the function. That's a horrible waste of time: it's much better to:
1) Convert things to String early, and assume validity later, and
2) Make functions that explicitly don't care about validity take &[u8] instead.
This is, of course, exactly what Rust does: I am not aware of a single thing that &str allows you to do that you cannot do with &[u8], except things that do require you to assume it's valid UTF-8.
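As a minimal sketch of that split (my own illustration): the bytes stay &[u8] until the one place where "is this actually UTF-8?" has to be answered:

fn main() {
    let good: &[u8] = b"caf\xC3\xA9"; // valid UTF-8 for "café"
    let bad: &[u8] = b"caf\xE9";      // Latin-1 'é', not valid UTF-8

    // Byte-oriented operations work on either input.
    assert!(good.starts_with(b"caf") && bad.starts_with(b"caf"));

    // Conversion to &str is where the error must be handled.
    match std::str::from_utf8(bad) {
        Ok(s) => println!("text: {s}"),
        Err(e) => println!("not UTF-8: {e}"),
    }
}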
Can it? If you want to open a file with invalid UTF8 in the name, then the path has to contain that.
And a path can contain the path separator - it's the filename that can't contain it.
> For example something like openFile() or moveFile() requires paths.
macOS has something called bookmark URLs that can contain things like inode numbers or addresses of network mounts. Apps use it to remember how to find recently opened files even if you've reorganized your disk or the mount has dropped off.
IIRC it does resolve to a path so it can use open() eventually, but you could imagine an alternative. Well, security issues aside.
Doesn't this demonstrate my point? If you can do everything with &[u8], what's the point in validating UTF-8? It's just a less universal string type, and your program wastes CPU cycles doing unnecessary validation.
Note that &[u8] would allow things like null bytes, and maybe other edge cases.
So you naturally write another one of these functions that takes a `&str` so that it can pass to another function that only accepts `&str`.
Fundamentally no one actually requires validation (i.e., walking over the string an extra time up front); we're just making it part of the contract because something else has made it part of the contract.
use std::env;

fn main() {
    let args: Vec<String> = env::args().collect();
    ...
}
When I run this code, a literal example from the official manual, with this filename I have here, it panics:

$ ./main $'\200'
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: "\x80"', library/std/src/env.rs:805:51
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
($'\200' is bash's notation for a single byte with the value 128. We'll see it below in the strace output.)

So, literally any program anyone writes in Rust will crash if you attempt to pass it that filename, if it uses the manual's recommended way to accept command-line arguments. It might work fine for a long time, in all kinds of tests, and then blow up in production when a wild file appears with a filename that fails to be valid Unicode.
This C program I just wrote handles it fine:
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

char buf[4096];

void
err(char *s)
{
    perror(s);
    exit(-1);
}

int
main(int argc, char **argv)
{
    int input, output;
    if ((input = open(argv[1], O_RDONLY)) < 0) err(argv[1]);
    if ((output = open(argv[2], O_WRONLY | O_CREAT, 0666)) < 0) err(argv[2]);
    for (;;) {
        ssize_t size = read(input, buf, sizeof buf);
        if (size < 0) err("read");
        if (size == 0) return 0;
        ssize_t size2 = write(output, buf, (size_t)size);
        if (size2 != size) err("write");
    }
}
(I probably should have used O_TRUNC.)

Here you can see that it does successfully copy that file:
$ cat baz
cat: baz: No such file or directory
$ strace -s4096 ./cp $'\200' baz
execve("./cp", ["./cp", "\200", "baz"], 0x7ffd7ab60058 /* 50 vars */) = 0
brk(NULL) = 0xd3ec000
brk(0xd3ecd00) = 0xd3ecd00
arch_prctl(ARCH_SET_FS, 0xd3ec380) = 0
set_tid_address(0xd3ec650) = 4153012
set_robust_list(0xd3ec660, 24) = 0
rseq(0xd3ecca0, 0x20, 0, 0x53053053) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=9788*1024, rlim_max=RLIM64_INFINITY}) = 0
readlink("/proc/self/exe", ".../cp", 4096) = 22
getrandom("\xcf\x1f\xb7\xd3\xdb\x4c\xc7\x2c", 8, GRND_NONBLOCK) = 8
brk(NULL) = 0xd3ecd00
brk(0xd40dd00) = 0xd40dd00
brk(0xd40e000) = 0xd40e000
mprotect(0x4a2000, 16384, PROT_READ) = 0
openat(AT_FDCWD, "\200", O_RDONLY) = 3
openat(AT_FDCWD, "baz", O_WRONLY|O_CREAT, 0666) = 4
read(3, "foo\n", 4096) = 4
write(4, "foo\n", 4) = 4
read(3, "", 4096) = 0
exit_group(0) = ?
+++ exited with 0 +++
$ cat baz
foo
The Rust manual page linked above explains why they think introducing this bug by default into all your programs is a good idea, and how to avoid it:

> Note that std::env::args will panic if any argument contains invalid Unicode. If your program needs to accept arguments containing invalid Unicode, use std::env::args_os instead. That function returns an iterator that produces OsString values instead of String values. We’ve chosen to use std::env::args here for simplicity because OsString values differ per platform and are more complex to work with than String values.
I don't know what's "complex" about OsString, but for the time being I'll take the manual's word for it.
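For what it's worth, here's a rough sketch of what an `args_os` version of the copy program might look like (my code, not the manual's); `OsString` need not be valid UTF-8, so the $'\200' argument passes through to the filesystem untouched:

use std::env;
use std::fs::File;
use std::io;

fn main() -> io::Result<()> {
    // OsString arguments: no UTF-8 validation, no panic on $'\200'.
    let args: Vec<_> = env::args_os().collect();
    // (Assumes both arguments are present, like the C version does.)
    let mut src = File::open(&args[1])?;
    let mut dst = File::create(&args[2])?;
    io::copy(&mut src, &mut dst)?;
    Ok(())
}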
So, Rust's approach evidently makes it extremely hard not to introduce problems like that, even in the simplest programs.
Go's approach doesn't have that problem; this program works just as well as the C program, without the Rust footgun:
package main

import (
    "io"
    "log"
    "os"
)

func main() {
    src, err := os.Open(os.Args[1])
    if err != nil {
        log.Fatalf("open source: %v", err)
    }
    dst, err := os.OpenFile(os.Args[2], os.O_CREATE|os.O_WRONLY, 0666)
    if err != nil {
        log.Fatalf("create dest: %v", err)
    }
    if _, err := io.Copy(dst, src); err != nil {
        log.Fatalf("copy: %v", err)
    }
}
(O_CREATE makes me laugh. I guess Ken did get to spell "creat" with an "e" after all!)

This program generates a much less clean strace, so I am not going to include it.
You might wonder how such a filename could arise other than as a deliberate attack. The most common scenario is when the filenames are encoded in a non-Unicode encoding like Shift-JIS or Latin-1, followed by disk corruption, but the deliberate attack scenario is nothing to sneeze at either. You don't want attackers to be able to create filenames your tools can't see, or turn to stone if they examine, like Medusa.
Note that the log message on error also includes the ill-formed Unicode filename:
$ ./cp $'\201' baz
2025/08/22 21:53:49 open source: open ζ: no such file or directory
But it didn't say ζ. It actually emitted a byte with value 129, making the error message ill-formed UTF-8. This is obviously potentially dangerous, depending on where that logfile goes, because it can include arbitrary terminal escape sequences. But note that Rust's UTF-8 validation won't protect you from that, or from things like this:

$ ./cp $'\n2025/08/22 21:59:59 oh no' baz
2025/08/22 21:59:09 open source: open
2025/08/22 21:59:59 oh no: no such file or directory
I'm not bagging on Rust. There are a lot of good things about Rust. But its string handling is not one of them.

> If your API takes &str, and tries to do byte-based indexing, it should almost certainly be taking &[u8] instead.

Str is indexed by bytes. That's the issue.

You're meant to use `unsafe` as a way of limiting the scope of reasoning about safety.
Once you construct a `&str` using `from_utf8_unchecked`, you can't safely pass it to any other function without looking at its code and reasoning about whether it's still safe.
Also see the actual documentation: https://doc.rust-lang.org/std/primitive.str.html#method.from...
> Safety: The bytes passed in must be valid UTF-8.
Eventually there should be an underlying operation which can only work on valid UTF-8, but that doesn't exist. UTF-8 was designed such that invalid data can be detected and handled, without affecting the meaning of valid subsequences in the same string.
let s = "asd";
println!("{}", s[0]);

You will get a compiler error telling you that you cannot index into &str. This:

fn main() {
let s = "12345";
println!("{}", &s[0..1]);
}
compiles and prints out "1". This:
fn main() {
let s = "\u{1234}2345";
println!("{}", &s[0..1]);
}
compiles and panics with the following error:

byte index 1 is not a char boundary; it is inside 'ሴ' (bytes 0..3) of `ሴ2345`
To get the nth char (scalar codepoint):

fn main() {
let s = "\u{1234}2345";
println!("{}", s.chars().nth(1).unwrap());
}
To get a substring:

fn main() {
let s = "\u{1234}2345";
println!("{}", s.chars().skip(0).take(1).collect::<String>());
}
To actually get the bytes, you'd have to call #as_bytes, which works with scalar and range indices, e.g.:

fn main() {
let s = "\u{1234}2345";
println!("{:02X?}", &s.as_bytes()[0..1]);
println!("{:02X}", &s.as_bytes()[0]);
}
IMO it's less intuitive than it should be, but still less bad than e.g. Go's two types of nil, because it will fail in a visible manner.

let start = s.find('a')?;
let end = s.find('z')?;
let sub = &s[start..end];
and it will never panic, because find will never return something that's not on a char boundary.

Where would you even get them from?
In my case it was in parsing text where a numeric value had a two-character prefix but a string value did not. So I was matching on 0..2 (actually 0..2.min(string.len()), which doubly highlights the indexing issue), which blew up occasionally depending on the string values. There are perhaps smarter ways to do this (e.g. splitn on a space, regex, giant if-else statement, etc.), but this seemed at first glance to be the most efficient way because it all fit neatly into a match statement.

The inverse was also a problem: laying out text with a monospace font, knowing that every character took up the same number of pixels along the x-axis (e.g. no odd emoji or whatever else). Gotta make sure to call #len on #chars instead of the string itself, as some of the text (Windows-1250 encoded) got converted into multi-byte Unicode codepoints.
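A sketch of one way around that specific panic (my example, under the same assumptions): find the byte offset of the third char, if any, and slice on that boundary instead of the fixed byte range 0..2:

fn prefix_two_chars(s: &str) -> &str {
    match s.char_indices().nth(2) {
        Some((idx, _)) => &s[..idx], // idx is always a char boundary
        None => s,                   // fewer than three chars: take the whole string
    }
}

fn main() {
    assert_eq!(prefix_two_chars("0xFF"), "0x");
    assert_eq!(prefix_two_chars("ሴ2345"), "ሴ2"); // multi-byte first char
    assert_eq!(prefix_two_chars("a"), "a");
}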
A couple quotes from the Go Blog by Rob Pike:
> It’s important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes.
> Besides the axiomatic detail that Go source code is UTF-8, there’s really only one way that Go treats UTF-8 specially, and that is when using a for range loop on a string.
Both from https://go.dev/blog/strings
If you want UTF-8 in a guaranteed way, use the functions available in unicode/utf8 for that. Using `string` is not sufficient unless you make sure you only put UTF-8 into those strings.
If you put valid UTF-8 into a string, you can be sure that the string holds valid UTF-8, but if someone else puts data into a string, and you assume that it is valid UTF-8, you may have a problem because of that assumption.
It does though? Strings are internable, comparable, can be keys, etc.
At the protocol (or disk, etc.) boundary. If I write code that consumes bytes that are intended to be UTF-8, I need to make a choice about what to do if they aren't UTF-8 somewhere. A strict UTF-8 string forces me to make that choice in a considered location. In a language where a "string" is just bytes, I can forget, or two pieces of code can disagree on what the contract is. And bugs result.
Check out MySQL for a fun example of getting this wildly, impressively wrong. At least a Rust or a type-checked Python 3 wrapper around some MySQL code enforces a degree of correctness, which is much better than having your transaction fail to commit, or commit incorrectly way down the stack, when you get bytes you didn't expect.
(MySQL can still reject strictly valid UTF-8 data for utterly pathetic historical reasons if you configure it incorrectly.)
Command-line arguments on Windows are their own special disaster.
But there is not a canonical response to invalid data. So literally every operation that might need to make a choice of what to do when presented with invalid data should either (a) accept a parameter specifying what to do on error and potentially fail, or (b) take a parameter type that forces errors to be handled in advance.