Friday, April 22, 2022

Go vs Rust - Strings

String

Go

In Go, strings differ from those in languages such as Java, C++, and Python. A Go string is a sequence of variable-width characters, where each character is represented by one or more bytes using UTF-8 encoding. In other words, a string is an immutable, read-only slice of arbitrary bytes (including bytes with value zero), and those bytes usually, though not necessarily, hold Unicode text encoded as UTF-8.

Because of UTF-8 encoding, a Go string can hold text that mixes any of the world's languages, without the confusion and limitations of code pages.

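    package main

    import (
        "fmt"
        "strings"
        "unicode/utf8"
    )

    func main() {
        // Create a byte slice and build a string from it
        myslice1 := []byte{0x47, 0x65, 0x65, 0x6b, 0x73}
        mystring1 := string(myslice1) // "Geeks"

        mystr := "Welcome to GeeksforGeeks ??????"
        // len() returns the length in bytes
        length1 := len(mystr)
        // utf8.RuneCountInString() returns the number of code points (runes);
        // the two lengths differ once the string holds multi-byte characters
        length2 := utf8.RuneCountInString(mystr)

        // strings.Trim strips the listed characters from both ends
        str1 := "@@Hello$$" // sample input (assumed value)
        res1 := strings.Trim(str1, "@$")
        fmt.Println(mystring1, length1, length2, res1)
    }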

Rust

There are two types of strings in Rust: String and &str.

A String is stored as a vector of bytes (Vec<u8>), but guaranteed to always be a valid UTF-8 sequence. String is heap allocated, growable and not null terminated.

&str is a slice (&[u8]) that always points to a valid UTF-8 sequence, and can be used to view into a String, just like &[T] is a view into Vec<T>.

let pangram: &'static str = "the quick brown fox jumps over the lazy dog";

let s1 = String::from("Hello, ");
let s2 = String::from("world!");
let s3 = s1 + &s2; // Note that s1 has been moved here and can no longer be used
print!("{} ", s3);

let mut s = String::from("lo");
s.push('l');
let s2 = "bar";
s.push_str(&s2);
print!("{} ", s);
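
The relationship between String, &str, and the underlying bytes can be sketched as follows. The helper names round_trip and prefix are made up for this illustration; only as_bytes, String::from_utf8, and range slicing come from the standard library.

```rust
// Hypothetical helper: round-trips a string through its raw bytes.
fn round_trip(s: &str) -> String {
    // as_bytes() exposes the UTF-8 bytes that back the string.
    let bytes: Vec<u8> = s.as_bytes().to_vec();
    // from_utf8 validates the bytes, so a String never holds invalid UTF-8.
    String::from_utf8(bytes).expect("bytes came from a valid &str")
}

// Hypothetical helper: borrows the first `n` bytes of `s` as a &str view.
// Panics if `n` does not fall on a character boundary.
fn prefix(s: &str, n: usize) -> &str {
    &s[..n]
}

fn main() {
    let owned = String::from("Hello, world!");
    // &String coerces to &str, so the view borrows from the heap buffer.
    let view: &str = prefix(&owned, 5);
    println!("{} / {}", view, round_trip(&owned));

    // Invalid UTF-8 is rejected rather than stored.
    let bad = String::from_utf8(vec![0xff, 0xfe]);
    println!("invalid bytes accepted? {}", bad.is_ok());
}
```

The view does not copy any data; it only points into the String it was borrowed from.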

String Slices as Parameters

A more experienced Rustacean would write the following signature, because it allows us to use the same function on both String and &str values:

fn first_word(s: &str) -> &str {

The concepts of ownership, borrowing, and slices are what ensure memory safety in Rust programs at compile time. The Rust language gives you control over your memory usage in the same way as other systems programming languages, but having the owner of data automatically clean up that data when the owner goes out of scope means you don't have to write and debug extra code to get this control.

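fn first_word(s: &String) -> &str {
    // Scan the bytes for the first space
    let bytes = s.as_bytes();
    for (i, &item) in bytes.iter().enumerate() {
        if item == b' ' {
            return &s[0..i];
        }
    }
    // No space found: the whole string is one word
    &s[..]
}
fn main() {
    let mut s = String::from("hello world");
    let word = first_word(&s); // immutable borrow of s starts here
    s.clear(); // Error! mutable borrow while word is still in use
    println!("the first word is: {}", word);
}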
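
As a minimal sketch of the &str signature in use, the version below keeps the same body but accepts a string slice. The variable names in main are illustrative only.

```rust
// Same logic as first_word, but taking &str so that it accepts string
// literals, slices, and (through deref coercion) &String.
fn first_word(s: &str) -> &str {
    let bytes = s.as_bytes();
    for (i, &item) in bytes.iter().enumerate() {
        if item == b' ' {
            return &s[0..i];
        }
    }
    &s[..]
}

fn main() {
    let owned = String::from("hello world");
    let literal = "goodbye moon";

    let a = first_word(&owned);      // &String coerces to &str
    let b = first_word(&owned[6..]); // a slice of the String
    let c = first_word(literal);     // a string literal is already a &str
    println!("{} {} {}", a, b, c);
}
```

Slicing at the index of the space is safe here because a space is a single-byte ASCII character, so the index always lies on a character boundary.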

Bytes and Scalar Values and Grapheme Clusters

If we look at the Hindi word “नमस्ते” written in the Devanagari script, it is ultimately stored as a Vec of u8 values that looks like this:

[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164,
224, 165, 135]

That’s 18 bytes and is how computers ultimately store this data. If we look at them as Unicode scalar values, which are what Rust’s char type is, those bytes look like this:

['न', 'म', 'स', '्', 'त', 'े']

There are six char values here, but the fourth and sixth are not letters: they’re diacritics that don’t make sense on their own. Finally, if we look at them as grapheme clusters, we’d get what a person would call the four letters that make up the Hindi word:

["न", "म", "स्", "ते"]

If we need to perform operations on individual Unicode scalar values, the best way to do so is to use the chars method.

for c in "नमस्ते".chars() {
    println!("{}", c);
}
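
The byte and scalar-value views of the same word can be compared directly. This is a small sketch; the sizes helper is a name invented for the example. Grapheme clusters are left out because the standard library does not provide them, so they would need an external crate.

```rust
// Hypothetical helper: returns (byte count, scalar value count) for a string.
fn sizes(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    let word = "नमस्ते";
    let (bytes, scalars) = sizes(word);
    println!("{} bytes, {} chars", bytes, scalars);

    // char_indices pairs each scalar value with its starting byte offset.
    for (offset, c) in word.char_indices() {
        println!("byte {:>2}: U+{:04X}", offset, c as u32);
    }
}
```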

Indexing into a string is often a bad idea because it’s not clear what the return type of the string indexing operation should be: a byte value, a character, a grapheme cluster, or a string slice. Therefore, Rust asks you to be more specific if you really need to use indices to create string slices.

let hello = "Здравствуйте";
let s = &hello[0..4];

s will be a &str that contains the first four bytes of the string. Each of these Cyrillic characters takes two bytes in UTF-8, which means s will be Зд.
What would happen if we used &hello[0..1]? The answer: Rust will panic at runtime, because byte 1 falls in the middle of a character. You should therefore use ranges to create string slices with caution, because doing so can crash your program.
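
One way to avoid the panic, shown here as a sketch rather than the only approach, is the standard str::get method, which returns None when a bound is not on a character boundary. The safe_slice wrapper is a hypothetical name used for illustration.

```rust
// Hypothetical helper: like &s[start..end], but returns None instead of
// panicking when a bound falls inside a multi-byte character.
fn safe_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

fn main() {
    let hello = "Здравствуйте";
    println!("{:?}", safe_slice(hello, 0, 4));
    println!("{:?}", safe_slice(hello, 0, 1));
    // is_char_boundary reports whether a byte offset is safe to slice at.
    println!("{}", hello.is_char_boundary(1));

    // Taking the first two characters rather than the first N bytes:
    let first_two: String = hello.chars().take(2).collect();
    println!("{}", first_two);
}
```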

Escape character

Go

Escape character    Description
\\                  Backslash (\)
\000                Character with the given 3-digit, 8-bit octal value (a single byte inside a string literal)
\'                  Single quote ('). It is only allowed inside rune literals
\"                  Double quote ("). It is only allowed inside interpreted string literals
\a                  ASCII bell (BEL)
\b                  ASCII backspace (BS)
\f                  ASCII formfeed (FF)
\n                  ASCII linefeed (LF)
\r                  ASCII carriage return (CR)
\t                  ASCII tab (TAB)
\uhhhh              Unicode character with the given 4-digit, 16-bit hex code point
\Uhhhhhhhh          Unicode character with the given 8-digit, 32-bit hex code point
\v                  ASCII vertical tab (VT)
\xhh                Character with the given 2-digit, 8-bit hex value (a single byte inside a string literal)

Rust 

All number literals allow _ as a visual separator: 1_234.0E+18f64

\x41 7-bit character code (exactly 2 digits, up to 0x7F)
\n Newline
\r Carriage return
\t Tab
\\ Backslash
\0 Null

\u{7FFF} 24-bit Unicode character code (up to 6 digits)

\' Single quote
\" Double quote
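
The Rust escapes listed above can be checked with a short sketch. The code_points helper is a made-up name for this example; it simply lists the numeric value of each character in a string.

```rust
// Hypothetical helper: lists the code points of a string as u32 values.
fn code_points(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect()
}

fn main() {
    // \x41 is a 7-bit code; \u{...} takes up to six hex digits.
    println!("{:?}", code_points("\x41\u{7FFF}"));
    // Control and quoting escapes each produce a single character.
    println!("{:?}", code_points("\n\r\t\\\0\'\""));
}
```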

