zlacker

Rust apparently got relatively close to not having &str as a primitive type and instead only providing a library alias to &[u8] when Rust 1.0 shipped.

Score another for Rust's Safety Culture. It would be convenient to just have &str as an alias for &[u8] but if that mistake had been allowed all the safety checking that Rust now does centrally has to be owned by every single user forever. Instead of a few dozen checks overseen by experts there'd be myriad sprinkled across every project and always ready to bite you.

replies(3): >>adastr+ay >>inferi+nJ >>stevek+HJ

>>tialar+(OP)
. (early morning brain fart -- I need my coffee)

replies(1): >>tialar+Rz

>>adastr+ay
So it's true that technically the primitive type is str, and indeed it's even possible to make a &mut str though it's quite rare that you'd want to mutably borrow the string slice.

However no &str is not "an alias for &&String" and I can't quite imagine how you'd think that. String doesn't exist in Rust's core, it's from alloc and thus wouldn't be available if you don't have an allocator.

replies(1): >>zozbot+SB

>>tialar+Rz
str is not really a "primitive type", it only exists abstractly as an argument to type constructors - treating the & operator as a "type constructor" for that purpose, but including Box<>, Rc<>, Arc<> etc. So you can have Box<str> or Arc<str> in addition to &str or perhaps &mut str, but not really 'str' in isolation.

>>tialar+(OP)
Even so you end up with paper cuts like len which returns the number of bytes.

replies(1): >>toast0+jP

>>tialar+(OP)
It wouldn't have been an alias, it would have been struct Str([u8]). Nothing would have been different about the safety story.

https://github.com/rust-lang/rfcs/issues/2692

replies(1): >>stouse+NY

>>inferi+nJ
The problem with string length is there's probably at least four concepts that could conceivably be called length, and few people are happy when none of them are len.

Of the top of my head, in order of likely difficulty to calculate: byte length, number of code points, number of grapheme/characters, height/width to display.

Maybe it would be best for Str not to have len at all. It could have bytes, code_points, graphemes. And every use would be precise.

replies(3): >>inferi+vR >>stouse+t01 >>branko+si1

>>toast0+jP
Problems arise when you try to take a slice of a string and end up picking an index (perhaps based on length) that would split a code point. String/str offers an abstraction over Unicode scalars (code points) via the chars iterator, but it all feels a bit messy to have the byte based abstraction more or less be the default.

FWIW the docs indicate that working with grapheme clusters will never end up in the standard library.

replies(2): >>toast0+nW >>xyzzyz+lx1

>>inferi+vR
> but it all feels a bit messy to have the byte based abstraction more or less be the default.

I mean, really neither should be the default. You should have to pick chars or bytes on use, but I don't think that's palatable; most languages have chosen one or the other as the preferred form. Or some have the joy of being forward thinking in the 90s and built around UCS-2 and later extended to UTF-16, so you've got 16-bit 'characters' with some code points that are two characters. Of course, dealing with operating systems means dealing with whatever they have as well as what the language prefers (or, as discussed elsewhere in this thread, pretending it doesn't exist to make easy things easier and hard things harder)

>>stevek+HJ
I love this kind of historical knowledge. Thanks for sharing it!

>>toast0+jP
> The problem with string length is there's probably at least four concepts that could conceivably be called length.

The answer here isn't to throw up your hands, pick one, and other cases be damned. It's to expose them all and let the engineer choose. To not beat the dead horse of Rust, I'll point that Ruby gets this right too.

    * String#length                   # count Unicode code units
    * String#bytes#length             # count bytes
    * String#grapheme_clusters#length # count grapheme clusters

Similarly, each of those "views" lets you slice, index, etc. across those concepts naturally. Golang's string is the worst of them all. They're nominally UTF-8, but nothing actually enforces it. But really they're just buckets of bytes, unless you send them to APIs that silently require them to be UTF-8 and drop them on the floor or misbehave if they're not.

Height/width to display is font-dependent, so can't just be on a "string" but needs an object with additional context.

>>toast0+jP
You could also have the number of code UNITS, which is the route C# took.

>>inferi+vR
You can easily treat `&str` as bytes, just call `.as_bytes()`, and you get `&[u8]`, no questions asked. The reason why you don't want to treat &str as just bytes by default is that it's almost always a wrong thing to do. Moreover, it's the worst kind of a wrong thing, because it actually works correctly 99% of the time, so you might not even realize you have a bug until much too late.

If your API takes &str, and tries to do byte-based indexing, it should almost certainly be taking &[u8] instead.

replies(1): >>inferi+0V1

>>xyzzyz+lx1

  If your API takes &str, and tries to do byte-based indexing, it should
  almost certainly be taking &[u8] instead.

Str is indexed by bytes. That's the issue.

replies(1): >>xyzzyz+vS3

>>inferi+0V1
As a matter of fact, you cannot do

  let s = “asd”;
  println!(“{}”, s[0]);

You will get a compiler error telling you that you cannot index into &str.

replies(1): >>inferi+k24

>>xyzzyz+vS3
Right, you have to give it a usize range. And that will index by bytes. This:

  fn main() {
      let s = "12345";
      println!("{}", &s[0..1]);
  }

compiles and prints out "1".

This:

  fn main() {
      let s = "\u{1234}2345";
      println!("{}", &s[0..1]);
  }

compiles and panics with the following error:

  byte index 1 is not a char boundary; it is inside 'ሴ' (bytes 0..3) of `ሴ2345`

To get the nth char (scalar codepoint):

  fn main() {
      let s = "\u{1234}2345";
      println!("{}", s.chars().nth(1).unwrap());
  }

To get a substring:

  fn main() {
      let s = "\u{1234}2345";
      println!("{}", s.chars().skip(0).take(1).collect::<String>());
  }

To actually get the bytes you'd have to call #as_bytes which works with scalar and range indices, e.g.:

  fn main() {
      let s = "\u{1234}2345";
      println!("{:02X?}", &s.as_bytes()[0..1]);
      println!("{:02X}", &s.as_bytes()[0]);
  }

IMO it's less intuitive than it should be but still less bad than e.g. Go's two types of nil because it will fail in a visible manner.

replies(1): >>xyzzyz+Kq4

>>inferi+k24
It's actually somewhat hard to hit that panic in a realistic scenario. This is because you are unlikely to be using slice indices that are not on a character boundary. Where would you even get them from? All the standard library functions will return byte indices on a character boundary. For example, if you try to do something like slice the string between first occurrence of character 'a', and of character 'z', you'll do something like

  let start = s.find('a')?;
  let end = s.find('z')?;
  let sub = &s[start..end];

and it will never panic, because find will never return something that's not on a char boundary.

replies(1): >>inferi+Ps4

>>xyzzyz+Kq4

  Where would you even get them from?

In my case it was in parsing text where a numeric value had a two character prefix but a string value did not. So I was matching on 0..2 (actually 0..2.min(string.len()) which doubly highlights the indexing issue) which blew up occasionally depending on the string values. There are perhaps smarter ways to do this (e.g. splitn on a space, regex, giant if-else statement, etc, etc) but this seemed at first glance to be the most efficient way because it all fit neatly into a match statement.

The inverse was also a problem: laying out text with a monospace font knowing that every character took up the same number of pixels along the x-axis (e.g. no odd emoji or whatever else). Gotta make sure to call #len on #chars instead of the string itself as some of the text (Windows-1250 encoded) got converted into multi-byte Unicode codepoints.