burntsushi's Hacker News comments

The initial version of ripgrep was absolutely a rapid prototype.

I do rapid prototyping all the time.

I'm not saying Rust is good for game dev, but the idea that Rust cannot be used for rapid prototyping in any context is a myth.


Indeed, I personally find Rust to be very nice for rapid prototyping. Incremental recompilation is usually a second or two even in my giant projects (mold helps with the linking step, but that's not really a Rust thing anyway). I'm also very curious how Cranelift will change things in the future; it would be nice to hot swap function implementations on the fly, at least.

Are there any particular techniques or styles that stand out to you as useful when prototyping in Rust?

`clone()` and `unwrap()` and `todo!()` without fear. Just let it loose.

For me, prototyping is about finding shortcuts to demonstrate something that is unknown to you. The idea is that shortcuts represent things you know how to do, but that would take work to do properly and aren't necessary for demonstrating the thing that is unknown. `clone()` and `unwrap()` are just Rust-specific examples of that.
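A hedged sketch of what that style can look like in practice (the leaderboard example and its function names are invented for illustration): `clone()` and `unwrap()` everywhere, with `todo!()` marking design questions deferred for later.

```rust
// Prototype-style Rust: lean on unwrap(), clone() and todo!() to defer
// error handling and ownership decisions until the idea is proven out.
fn parse_scores(input: &str) -> Vec<(String, u32)> {
    input
        .lines()
        .map(|line| {
            let mut parts = line.split(':');
            // unwrap() without fear: malformed input just panics for now.
            let name = parts.next().unwrap().trim().to_string();
            let score: u32 = parts.next().unwrap().trim().parse().unwrap();
            (name, score)
        })
        .collect()
}

fn best(scores: &[(String, u32)]) -> (String, u32) {
    if scores.is_empty() {
        // todo!() keeps the compiler happy while we defer the design question.
        todo!("decide what an empty leaderboard should return");
    }
    // clone() instead of wrestling with lifetimes; fix later if it matters.
    scores.iter().max_by_key(|(_, s)| *s).unwrap().clone()
}

fn main() {
    let scores = parse_scores("alice: 3\nbob: 7");
    let (name, score) = best(&scores);
    println!("{name} {score}");
}
```

None of this would pass review in production code, but all of it compiles and runs, which is the point of a prototype.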


Go's strings aren't poor design. The only difference between a Go string and a Rust &str/String is that the latter is required to be valid UTF-8. In Go, a string is only conventionally valid UTF-8. It is permitted to contain invalid UTF-8. This is a feature, not a bug, because it more closely represents the reality of data encoded in a file on Unix. Of course, this feature comes with a trade-off, because Rust's guarantee that &str/String is valid UTF-8 is also a feature and not a bug.

I wrote more about this here: https://blog.burntsushi.net/bstr/#motivation-based-on-concep...

I mention gecko as an example repository that contains data that isn't valid UTF-8. But it isn't unique. The cpython repository does too. When you make your string type have the invariant that it must be valid UTF-8, you're giving up something when it comes to writing tools that process the contents of arbitrary files.
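A small sketch of that trade-off in Rust terms (the byte sequence here is invented, but mirrors the "mostly UTF-8 with a stray bad byte" files described above): the `String` invariant rejects such data at the boundary, so a tool processing arbitrary file content has to work on `&[u8]` instead.

```rust
// Rust's String enforces valid UTF-8 at construction time, so arbitrary
// file bytes must stay in a Vec<u8>/&[u8] until validated.
fn main() {
    // Mostly ASCII, but with one stray 0xFF byte, as real files sometimes have.
    let bytes: Vec<u8> = b"hello \xFF world".to_vec();

    // The String invariant rejects the data outright...
    assert!(String::from_utf8(bytes.clone()).is_err());

    // ...so a tool that must handle arbitrary content operates on &[u8]
    // and only converts the pieces it knows are valid.
    let valid_prefix = &bytes[..6];
    assert_eq!(std::str::from_utf8(valid_prefix).unwrap(), "hello ");
}
```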


Go strings aren't necessarily text. Rust strings are text, as long as you consider things like emoji or Egyptian hieroglyphics to be text. Lots of confusion has come from the imprecise meaning of "string": whether it refers to arbitrary byte sequences, restricted byte sequences (e.g. not containing 0x00), arbitrary sequences of characters in some encoding, restricted sequences of characters in some encoding, or something else is often unclear. And when it's a restricted sequence, what those restrictions are is often unclear too.

You sometimes need a way to operate on entirely arbitrary sequences of bytes. This is mostly easy: it's been a long time since non-octet bytes were relevant in most situations, so the vast majority of the time you can just assume they're all octets.

You sometimes need a way to operate on arbitrary text. This inherently requires knowing how that text is encoded, but as long as you know that, it's mostly easy.

You sometimes need a way to operate on text-like things that aren't necessarily text, like the output of old CLI programs that used the BEL character to alert the user to events. Or POSIX filenames. Or text where you don't know the encoding. This is where the bugs lie, where we make unchecked assumptions about the data that turn out to be invalid.


You didn't really respond directly to anything I said, nor anything I said in the blog I linked (that I also wrote). You also seem to be speaking to me as if I'm some spring chicken. I'm not. I'm on the Rust libs-api team and I'm in favor of the &str/String API design (including its UTF-8 requirement). I wrote ripgrep. I've spent 10 years working on regex engines. I understand text encodings and the design space of string data types. I've implemented string data types. It might help to understand things a little better by perusing the bstr crate API[1]. Notice that it doesn't require valid UTF-8, yet assumes by convention that the string is UTF-8. And this assumption provides a path to implementing things like "iterate over all grapheme clusters" with sensible semantics when invalid UTF-8 is seen.

You'll notice that I didn't say "Go's string design is good and we should all use it." I argued that Go's string design is not poor and explained why. In particular, I described trade offs and a particular pragmatic case in which relaxing the UTF-8 requirement makes for a more seamless experience when dealing with arbitrary file content.

> but as long as you know that it's mostly easy. [..snip..] Or text where you don't know the encoding.

You don't know. That was my whole point! I gave real-world concrete examples of popular things (Mozilla and CPython repositories) that contain text files that aren't entirely valid UTF-8. They are only mostly valid UTF-8. If I instead treated them as malformed and refused to process them in my command line utilities or libraries, I would get instant bug reports.

> Go strings aren't necessarily text.

I would generally consider this to be an incorrect statement. The more precise statement is that Go strings may contain invalid UTF-8. But the operations defined on strings treat strings as text. For example, if you iterate over the codepoints in a Go string, you'll get U+FFFD for bytes that are invalid UTF-8. By your own reasoning, U+FFFD must be considered text because it can also appear in a Rust &str/String. Despite the fact that a Go string and a []byte can represent arbitrary sequences of bytes, a Go string is not the same thing as a []byte. Aside from mutability and growability, the operations on them (both those provided as a library and those provided by the language definition itself) are what distinguish them. They are what make a `string` text, even when it contains invalid UTF-8.
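A rough Rust analogue of that Go behavior (the exact bytes are invented for illustration): `String::from_utf8_lossy` substitutes U+FFFD for invalid bytes, much as Go's range-over-string yields `utf8.RuneError` for them.

```rust
// Iterating "codepoints" over bytes containing invalid UTF-8: the lossy
// decoder yields U+FFFD, which is itself a perfectly valid char.
fn main() {
    let bytes = b"a\xFFb";
    let text = String::from_utf8_lossy(bytes);
    let codepoints: Vec<char> = text.chars().collect();
    assert_eq!(codepoints, vec!['a', '\u{FFFD}', 'b']);
}
```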

There are deep trade offs here, but the UTF-8-is-required does have downsides that UTF-8-by-convention does not have. And of course, vice versa.

[1]: https://docs.rs/bstr


Sorry, I was trying to expand on your points, not contradict any of them!

Apparently I completely misinterpreted. My apologies. Thanks for clarifying.

No. It's a problem because you can only have one version of any given package in your dependency tree. You can't have `foo 2` and `foo 3` in your dependency tree. Without that limitation, there is a relief valve of sorts where you can incur two different semver-incompatible releases of the same package in your dependency tree in exchange for a working build. The hope is that it would be a transitory state until all of your dependencies migrate.

Rust, for example, has precisely this same problem, except that it is limited to public dependencies. For example, if `serde 2` were ever to be published, then there would likely be a period of immense pain where, effectively, everyone needs to migrate all at once. Even though `serde 1` and `serde 2` can both appear in the same dependency tree (unlike in Python), because it is a public dependency, everyone needs to be using the same version of the library or else the `Serialize` trait from `serde 1` will be considered distinct from the `Serialize` trait (or whatever) in `serde 2`.
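A hedged, self-contained simulation of that trait-identity problem: two modules stand in for the two semver-incompatible crates (a real `serde 2` does not exist; the names here are invented to mirror the scenario).

```rust
// Two modules play the role of `serde 1` and `serde 2`: same trait name,
// but the compiler treats them as entirely distinct traits.
mod serde1 {
    pub trait Serialize {
        fn serialize(&self) -> String;
    }
}

mod serde2 {
    pub trait Serialize {
        fn serialize(&self) -> String;
    }
}

struct Point;

// Point only implements the "old" trait...
impl serde1::Serialize for Point {
    fn serialize(&self) -> String {
        "point-v1".to_string()
    }
}

fn takes_v1<T: serde1::Serialize>(value: &T) -> String {
    value.serialize()
}

// ...so an API bound on the "new" trait cannot accept it:
// `takes_v2(&Point)` would fail to compile until Point migrates.
#[allow(dead_code)]
fn takes_v2<T: serde2::Serialize>(value: &T) -> String {
    value.serialize()
}

fn main() {
    assert_eq!(takes_v1(&Point), "point-v1");
}
```

Both "versions" coexist in the build, but any API that names the trait publicly forces everyone onto the same one.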

But if I, say, published a `regex 2.0.0` tomorrow, then folks could migrate at their leisure. The only downside is that you'd have `regex 1` and `regex 2` in your dependency tree. Potentially for a long time until everyone migrated over. But your build would still work because it is uncommon for `regex` to be a public dependency.

(Rust does have the semver trick[1] available to it as another relief valve of sorts.)

This problem is definitely not because of missing interfaces or whatever.

[1]: https://github.com/dtolnay/semver-trick


> It's a problem because you can only have one version of any given package in your dependency tree. You can't have `foo 2` and `foo 3` in your dependency tree.

That does seem to be the fundamental problem with the Python model of dependency management.

If your dependencies have transitive dependencies of their own but your dependency model is a tree and everything is clearly namespaced/versioned, you might end up with multiple versions of the same package installed, but at least they won’t conflict.

If your dependency model is flat but each dependency bakes in its own transitive dependencies so they’re hidden from the rest of the system, for example via static linking, again you might end up with multiple versions of the same package (or some part of it) installed, but again they won’t conflict.

But if your dependency model is flat and each dependency can require specific versions of its transitive dependencies to be installed as peers, you fundamentally can’t avoid the potential for unresolvable conflicts.

A pragmatic improvement in the third case is, as others have suggested, to replace the SemVer-following mypackage 1.x.y and mypackage 2.x.y with separate top-level packages mypackage1 x.y and mypackage2 x.y. Now you have reintroduced namespaces and you can install mypackage1 and mypackage2 together as peers without conflict. Moreover, if increasing x and y faithfully represent minor and point releases, always using the latest versions of mypackage1 and mypackage2 should normally satisfy any other packages that depend on them, however many there are.
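For comparison, Cargo already supports essentially this without requiring the publisher to rename the package, via dependency renaming in the manifest. A sketch (the `regex 2` release is hypothetical):

```toml
# Hypothetical Cargo manifest: dependency renaming lets two major versions
# of the same package coexist under distinct names, much like publishing
# mypackage1 and mypackage2 as separate top-level packages.
[dependencies]
regex1 = { package = "regex", version = "1" }
regex2 = { package = "regex", version = "2" }  # hypothetical future release
```

Code then refers to `regex1::...` and `regex2::...` as if they were different packages.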

Of course it doesn’t always work like that in practice. However, at least the problem is now reduced to manually adjusting versions to resolve conflicts where a package didn’t match its versions to its behaviour properly and/or Hyrum’s Law is relevant, which is probably much less work than before.


As the article explains, this is precisely why the social expectations around Python package versioning are very different from those around JS package versioning (i.e., you can't just break things willy-nilly even in major releases and cite semver as justification).

That aside, note the obvious problems here for any language that uses nominal typing - like, say, Python. Since types from dependencies can often surface in one's public API, having a tree of dependencies means that many libraries will end up referring to different (and thus ipso facto incompatible) versions of the same type.


> social expectations around Python package versioning are very different from JS package version

If anything, I’d say in my experience the Python community tends to be more willing to make big changes. After all, Python itself famously did so with the 2 to 3 transition, and to some extent we’re seeing a second round of big changes even now as optional typing spreads through the ecosystem.

Admittedly, the difference could also be because so few packages in JS world seem to last long enough for multiple major versions to become an issue. The Python ecosystem seems more willing to settle on a small number of de facto standard libraries for common tasks.

> Since types from dependencies can often surface in one's public API, having a tree of dependencies means that many libraries will end up referring to different (and thus ipso facto incompatible) versions of the same type.

Leaving aside the questionable practice of exposing details of internal dependencies directly through one’s own public interface, I don’t see how this is any different to any other potential naming conflict. Whatever dependency model you pick, you’re always going to have the possibility that two dependencies use the same name as part of their interface, and in Python you’re always going to have to disambiguate explicitly if you want to import both in the same place. However, once you’ve done so, there is no longer any naming clash to leak through your own interface either.


> After all, Python itself famously did so with the 2 to 3 transition

That transition has been so traumatic for the whole ecosystem that, if anything, it became an object lesson in why you don't do stuff like that. "Never again" is the current position of the PSF wrt any hypothetical future Python 3 -> 4 transition.

Major Python libraries pretty much never just remove things over the course of a single major release. Things get officially announced first, then deprecated for at least one release cycle but often longer (which is communicated via DeprecationWarning etc), then finally retired.

> Leaving aside the questionable practice of exposing details of internal dependencies directly through one’s own public interface

Not all dependencies are internal. If library A exposes type X, and library B exposes type Y that by design extends X (so that instances of Y can be passed anywhere X is expected), that is very intentionally public.

Now imagine that library C exposes type Z that also by design extends X. If B and C each get their own copy of A, then there are two identical types X that are not type-compatible.

Now suppose we have the app that depends on both B and C. Its author wants to write a generic function F that accepts an instance of X (or a subtype) and does something with it. How do they write a type signature for F such that it can accept both Y and Z?


> Major Python libraries pretty much never just remove things over the course of a single major release. Things get officially announced first, then deprecated for at least one release cycle but often longer (which is communicated via DeprecationWarning etc), then finally retired.

I’m not sure that’s a realistic generalisation. To pick a few concrete examples, there were some breaking changes in SQLAlchemy 2, Pydantic 2, and as an interesting example of the “rename the package instead of bumping the major version” idea mentioned elsewhere, from Psycopg2 to Psycopg (3). I think it’s fair to say all of those are significant packages within the Python ecosystem.

> Not all dependencies are internal. If library A exposes type X, and library B exposes type Y that by design extends X […] Now imagine that library C exposes type Z that also by design extends X

Yes, you can create some awkward situations with shared bases in Python, and you could split all of the relevant types into different libraries, and this isn’t a situation that Python’s object model (or those of many other OO languages) handles very gracefully.

Could you please clarify the main point you’d like to make here? The shared base/polymorphism complications seem to apply generally with Python’s object model, unless you have a set of external dependencies that are designed to share a common base type from a common transitive dependency and support code that is polymorphic as if each name refers to a single, consistent type and yet the packages in question are not maintained and released in sync.

That seems like quite an unusual scenario. Even if it happens, it seems like the most that can safely be assumed by code importing from B and C — unless B and C explicitly depend on exactly the same version of A — is that Y extends (X from A v1.2.3) while Z extends (X from A v1.2.4). If B and C aren’t explicitly managed together, I’m not convinced it’s reasonable for code using them both to assume the base types that happen to share the same name X that they extend and expose through their respective interfaces are really the same type.


https://github.com/BurntSushi/rebar

For regex, you can't really distill it down to one single fastest algorithm.

It's somewhat similar even for substring search. But certainly, the fastest algorithms are going to be the ones that make use of SIMD in some way.


Unrelated note: I had always thought "nonplussed" was basically a synonym for something like "bewildering confusion." But the way you used it in this context suggested the exact opposite. It turns out that "nonplussed" has also come to mean "unperturbed": https://en.wiktionary.org/wiki/nonplussed

Quite confusing, because the two different meanings are nearly opposite to one another.

See also: https://www.merriam-webster.com/grammar/nonplussed


> Quite confusing, because the two different meanings are nearly opposite to one another.

It's pretty easy to see where the innovative sense came from: "plussed" doesn't mean anything, but "non" is clearly negative. So when you encounter the word, you can tell that it describes (a) a reaction in which (b) something doesn't happen. So everyone independently guesses that it means failing to have much of a reaction, and when everyone thinks a word means something, then it does mean that thing.

You see the same thing happen with "inflammable", where everyone is aware that "in" means "not" and "flame" means "fire". (Except that in the original sense, in is an intensifying prefix rather than a negative prefix. This doesn't occur in many other English words, although "inflammation" and "inflamed" aren't rare. Maybe "infatuate".)


That's pretty much what my second link is about. :-)


Wow, now I need to go figure out what source material I misinterpreted to thoughtlessly use this word incorrectly. Thanks!


> Possibly the most natural representation of a UTC instant would be an integer, because fundamentally UTC is a count of milliseconds

It's not! It's a subtle point, but because of leap seconds, the correct representation of UTC is the tuple (year, month, day, hour, minute, second, ...), with the `...` being filled in with your desired precision.

From: http://www.madore.org/~david/computers/unix-leap-seconds.htm...

> Unlike TAI and UT1, the UTC time scale should not be considered as a pure real number (or seconds count): instead, it should be viewed as a broken-down time (year-month-day-hour-minute-second-fraction) in which the number of seconds ranges from 0 to 60 inclusive (there can be 61 or 59 seconds in a minute); during a positive leap second the number of seconds takes the value 60 (while a negative leap second would skip the value 59, but this has never occurred).

...

> If we attempt to condense UTC to a single number (say, the number of seconds since 1970-01-01T00:00:00 or since 1900-01-01T00:00:00, or the number of 86400s-days since 1858-11-17T00:00:00, or something of the sort), we encounter the problem that the same value can refer to two different instants since the clock has been set back one second (negative leap seconds, of course, would cause no such difficulty).

Most datetime libraries get this wrong. I know because I'm working on a new one that specifically doesn't get this wrong.
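A minimal sketch (not any real datetime library's design) of the point above: broken-down time can represent `23:59:60`, while a naive seconds count collapses it onto the next midnight.

```rust
// UTC modeled as broken-down time: second ranges 0..=60 because a
// positive leap second takes the value 60.
#[derive(Debug, PartialEq)]
struct Utc {
    year: i32,
    month: u8,
    day: u8,
    hour: u8,
    minute: u8,
    second: u8, // 0..=60; 60 occurs only during a positive leap second
}

// Naive conversion that assumes every day has exactly 86400 seconds.
fn naive_seconds_of_day(t: &Utc) -> u32 {
    3600 * t.hour as u32 + 60 * t.minute as u32 + t.second as u32
}

fn main() {
    // 1972-06-30T23:59:60Z was a real positive leap second.
    let leap = Utc { year: 1972, month: 6, day: 30, hour: 23, minute: 59, second: 60 };
    let midnight = Utc { year: 1972, month: 7, day: 1, hour: 0, minute: 0, second: 0 };

    // Distinct instants as broken-down time...
    assert_ne!(leap, midnight);
    // ...but the naive count pushes the leap second a full 86400 seconds
    // into the day, i.e. onto the next day's 00:00:00. The tuple keeps
    // the two instants apart; a single integer cannot.
    assert_eq!(naive_seconds_of_day(&leap), 86_400);
    assert_eq!(naive_seconds_of_day(&midnight), 0);
}
```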


We could technically do # of minutes since 1970-01-01T00:00:00 (or whatever) & seconds though, no? (int32 minutes, double seconds) would get us a pretty big range in a pretty compact format.


There are lots of equivalent representations. The main point here is that it's misleading to think of UTC as just a timestamp from some epoch. It needs something richer than that.


ripgrep author here.

Better Unicode support in the regex engine. More flexible ignore rules (you aren't just limited to what `.gitignore` says, you can also use `.ignore` and `.rgignore`). Automatic support for searching UTF-16 files. No special flags required to search outside of git repositories or even across multiple git repositories in one search. Preprocessors via the `--pre` flag that let you transform data before searching it (e.g., running `pdftotext` on `*.pdf` files). And maybe some other things.

`git grep` on the other hand has `--and/--or/--not` and `--show-function` that ripgrep doesn't have (yet).
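The `--pre` flag takes a command that ripgrep invokes with the file path as its argument, searching whatever the command prints to stdout. A minimal wrapper sketch (the script name is invented, and it assumes poppler's `pdftotext` is installed):

```sh
#!/bin/sh
# Hypothetical pre-rg.sh: ripgrep passes the file path as $1 and
# searches this script's stdout.
case "$1" in
    *.pdf) pdftotext "$1" - ;;  # convert PDF to text on stdout
    *) cat "$1" ;;              # pass everything else through unchanged
esac
```

Invoked as something like `rg --pre ./pre-rg.sh --pre-glob '*.pdf' pattern`, where `--pre-glob` limits the preprocessing cost to matching files.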


Yeah my story is that I paid ~$10 for my invite on ebay. Wild.


Yes! The regex crate does this and it saves quite a bit of memory.


It is the kind of low-hanging optimization that I used to think was greatly oversold. And, to be fair, for many programs I would still wager it is. If you are chasing a ton of pointers, though, it is definitely worth considering how many more you can hold in memory with smaller sizes.

I think I remember a discussion a while back lamenting that we just tacitly accepted wide pointers. If anyone has a link going over that, I'd be delighted to read it again. I'm 90% sure I did not understand it when I first saw it. :D
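The arithmetic behind the "wide pointers" point can be checked directly (the sizes below assume a 64-bit target):

```rust
// On a 64-bit target a pointer is 8 bytes, while a u32 index into a Vec
// is 4: half the size per link in a pointer-chasing data structure.
fn main() {
    assert_eq!(std::mem::size_of::<&u64>(), 8);  // wide pointer
    assert_eq!(std::mem::size_of::<u32>(), 4);   // index into a Vec

    // Niche optimization keeps Option<Box<T>> pointer-sized...
    assert_eq!(std::mem::size_of::<Option<Box<u64>>>(), 8);
    // ...but an optional u32 index doubles to 8, unless you reserve a
    // sentinel like u32::MAX to mean "none".
    assert_eq!(std::mem::size_of::<Option<u32>>(), 8);
}
```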


> you’re just laundering the unsafe pointer arithmetic behind array indexing

Perhaps true in a very narrow sense, but you could say the same thing about all of Rust. "It's just laundering unsafe stuff behind a safe interface." And indeed, the ability to encapsulate unsafe internals inside a safe interface is one of the primary selling points of the language. It is also one of the key characteristics that differentiate it from languages that do not currently have this ability, such as C and C++. Whether you think this is an actual advantage or not is I suppose up to you, but I certainly think it is. And I think your use of the word "just" is papering over a lot of stuff.

For a more concrete code-level comparison with C, I did the legwork to translate a C program into a number of different Rust programs by varying some constraints. One of those Rust programs does indeed use indices instead of pointers. The README talks about the trade offs. See: https://github.com/BurntSushi/rsc-regexp/


> languages that do not currently have this ability

That seems to be a very common position, and one that's super weird to me. C and particularly C++ absolutely have that ability with library support if you know what you are doing.

The only material difference, from my point of view, is that the default behavior of the language is different.

I will fully grant that the path of least resistance being dangerous is a huge issue in C/C++, and one that Rust addresses, but extending that all the way to saying that the language lacks the ability is really excessive.


There's a big cultural problem and there are several big technical problems.

Rust has a safety culture, and C++ does not. In Rust's safety culture it was obvious that std::mem::uninitialized (an unsafe function) should be deprecated because it's more dangerous than it appears: it's actually hard to use correctly. That's why today we have the MaybeUninit type. In C++ it was apparently equally obvious that std::span, a brand new type in C++20, should not have a safe index operation.

Technically, the safe/unsafe distinction being at the language level makes it hard to fake. You can say your C++ only uses your safe abstractions, but the language itself doesn't care, so without inspecting every part of it to check, you're never more than one slip away from catastrophe.

Most importantly in this context, at the language level Rust is committed to this safety distinction. If you write code where Rust's compiler can't see why it's OK, the compiler rejects your program. C++ requires that a conforming compiler must instead accept programs unless it can show why they're wrong. These are two possible ways to cut the Gordian knot of Rice's Theorem, but they have very different consequences.


You can't encapsulate safety in C or C++. There's no `unsafe` keyword like Rust (or like Modula 3). If you have to say "if you know what you're doing," then you haven't encapsulated anything. There really is a categorical difference here. It's not excessive at all. It's the entire point.

Now if I were to say something like, "Rust's safety means that you can never have UB anywhere ever and CVEs will never happen for anything if you use Rust." Then yes, that's excessive. But to say that Rust can encapsulate `unsafe` and C and C++ cannot? I don't see how that's excessive. It's describing one of the most obvious differences between the programming languages.

You can restrict yourself to particular subsets of C (I'm thinking about MISRA) or C++, but these usually come with even more significant trade offs than Rust. And I'm not aware of any such subset that provides the ability to encapsulate safety in a way that lets folks not using that subset benefit from it in a way that is impossible to misuse (as a matter of an API guarantee).


To add some historical context: ESPOL/NEWP for the Burroughs B5000 in 1961 were among the first systems programming languages with unsafe; many others followed.

The Burroughs B5000 had an additional feature for executables using unsafe code that we only have nowadays on managed runtimes like Java and the CLR: binaries with unsafe code were tainted and required someone with admin access to enable them for execution.

Regarding C and C++: Visual Studio, CLion, and clang-tidy are the best we have in terms of tooling for the general public supporting the Core Guidelines (including lifetime checks), and they are still relatively basic in what they can actually validate.


AIUI, Modula-3 provided an ability to actually encapsulate unsafety, in that the concept was elevated to the level of interfaces. Did any language prior to Modula-3 have that capability?

I think that's fundamentally different from (although related to) just having an `unsafe` keyword. To take something I know well, Go has an `unsafe` package that acts as a sort of unsafe keyword. If we ignore data races, you can say that a Go program can't violate memory safety if there is no use of the `unsafe` package.

The problem though is that you can't really build new `unsafe` abstractions. You can't write a function that might cause UB on some inputs in a way that requires the caller to write `unsafe`. (You can do this by convention of course, e.g., by putting `Unsafe` in the name of the function.)

In Rust, `unsafe` doesn't just give you the ability to, e.g., dereference raw pointers. It also is required in order to call other `unsafe` functions. You get the benefit of composition so that you can build arbitrary abstractions around `unsafe` with the compiler's support.

My understanding is that Modula-3 supported this style of encapsulation (which is what I was talking about in this thread). What languages prior to Modula-3 supported it, or was Modula-3 the first?
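That composition can be shown in a few lines (the function names below are invented for illustration): an `unsafe fn` pushes its precondition onto the caller, and a safe wrapper discharges it with a runtime check so downstream users never write `unsafe` at all.

```rust
// An unsafe fn: calling it requires an `unsafe` block, propagating the
// obligation up to whoever can actually justify it.
unsafe fn first_byte_unchecked(bytes: &[u8]) -> u8 {
    // Caller must guarantee `bytes` is non-empty, or this is UB.
    unsafe { *bytes.get_unchecked(0) }
}

// A safe interface built on top: the bounds check makes the precondition
// impossible to violate, so the unsafety is fully encapsulated.
fn first_byte(bytes: &[u8]) -> Option<u8> {
    if bytes.is_empty() {
        None
    } else {
        // SAFETY: we just checked that `bytes` is non-empty.
        Some(unsafe { first_byte_unchecked(bytes) })
    }
}

fn main() {
    assert_eq!(first_byte(b"abc"), Some(b'a'));
    assert_eq!(first_byte(b""), None);
}
```

This is the pattern that, by my understanding, Go's `unsafe` package cannot express: there is no way to force callers of your own function to acknowledge unsafety in the type system.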


Starting with that 1961 example, ESPOL/NEWP.

Since Unisys still sells Burroughs, nowadays ClearPath MCP, you can get the latest NEWP manual here, section 8.

https://public.support.unisys.com/framework/publicterms.aspx...

Followed by Mesa/Cedar (CHECKED, TRUSTED, UNCHECKED), Modula-2 (IMPORT SYSTEM), the languages of Oberon linage (which follow up on the IMPORT SYSTEM approach), Ada (using Unchecked),....

In the languages that use the IMPORT SYSTEM approach, the compiler can mark the module as unsafe, and anything that might depend on it.

Some of the Modula-3 folks worked previously on Cedar at Xerox, by the way.

Mesa - http://www.bitsavers.org/pdf/xerox/mesa/5.0_1979/documentati...

Cedar - http://www.bitsavers.org/pdf/xerox/parc/cedar/Cedar_7.0/09_C...


Very interesting. Thank you.


> C and particularly C++ absolutely have that ability with library support if you know what you are doing.

I won't speak to C++, as it's a very different language now since the last time I used it. I've been writing C for more than 20 years, and I still make mistakes. And there's nothing keeping me from accidentally doing something unsafe outside my unsafe abstraction, aside from my own perfection at never making mistakes (yeah, right).

Rust requires you to be explicit about the unsafe things you do. And, realistically, even when I'm building a safe interface on top of necessarily-unsafe code, the unsafe portions aren't even that large compared to the entirety of the abstraction. That makes things much easier to audit, and the compiler tells me which sections of code I need to pay more attention to.

To me, this is lacking the ability. "If you know what you are doing" is a laughable constraint. Even people who theoretically do (and I suspect programming ability is a lot like people's self-reported skill at driving a car) still make mistakes sometimes.


I think you're missing the point - abusing a vector in this way is not a safe interface and is instead a good way to introduce several memory safety bugs (for example, UAFs) that the normal rust memory model explicitly prevents.


> abusing a vector in this way is not a safe interface and is instead a good way to introduce several memory safety bugs (for example, UAFs)

No, as long as you don't use the unsafe keyword, an out-of-bounds vector access won't lead to use-after-free or other memory safety bugs.


Out-of-bounds access is not required for the pseudo-UAF we're talking about here. Deleting a node in the middle of a linked list will leave a "hole" in the backing Vec. You cannot shift the next elements down to fill the hole because that will invalidate all handles to them. If the backing Vec holds the nodes directly, as TFA's implementation does, then there is no way to mark the hole as a hole. So any bug where a different node's handle accidentally ends up accessing this hole instead will lead to that code observing a "freed" node.

One workaround is to make the backing Vec hold Option of Node instead so that deleting a node can set the hole to None, in which case the bug I described above has the opportunity to unwrap() and panic instead of silent UAF. Though you'll also need additional tracking for those holes so that you can fill them with new nodes later, at which point you'd be better off using a proper freelist / slab instead of a Vec anyway (as TFA also mentions).
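A hedged sketch of that workaround (a toy slab, not anyone's production implementation): deleted slots become `None` and go on a free list, so a stale handle hits a hole rather than silently observing a recycled node.

```rust
// Minimal slab: Vec<Option<T>> plus a free list of hole indices.
struct Slab<T> {
    slots: Vec<Option<T>>,
    free: Vec<usize>, // indices of holes available for reuse
}

impl<T> Slab<T> {
    fn new() -> Self {
        Slab { slots: Vec::new(), free: Vec::new() }
    }

    fn insert(&mut self, value: T) -> usize {
        match self.free.pop() {
            // Reuse a hole if one exists...
            Some(i) => {
                self.slots[i] = Some(value);
                i
            }
            // ...otherwise grow the backing Vec.
            None => {
                self.slots.push(Some(value));
                self.slots.len() - 1
            }
        }
    }

    fn remove(&mut self, i: usize) -> Option<T> {
        let value = self.slots[i].take();
        if value.is_some() {
            self.free.push(i); // record the hole for later reuse
        }
        value
    }

    fn get(&self, i: usize) -> Option<&T> {
        self.slots.get(i).and_then(|slot| slot.as_ref())
    }
}

fn main() {
    let mut slab = Slab::new();
    let a = slab.insert("a");
    let b = slab.insert("b");
    slab.remove(a);
    // A stale handle now sees a hole, not another node's data...
    assert_eq!(slab.get(a), None);
    assert_eq!(slab.get(b), Some(&"b"));
    // ...until the slot is deliberately reused.
    let c = slab.insert("c");
    assert_eq!(c, a);
}
```

Note the "stale handle after reuse" bug is still possible once a slot is recycled; generational indices are the usual next step, but even this version turns silent misuse into a `None`/panic.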


Well, we're not talking about "pseudo-UAF", we're talking about actual-UAF and actual-memory-safety.

You use scare quotes around "freed" for a reason: the data has not actually been freed.

The bug you're talking about is a logic error. It could be a bad bug, depending on circumstances, but there's no memory safety issue here.


>You use scare quotes around "freed" for a reason: the data has not actually been freed.

Who said it hasn't? I would assume such a node to have been given to `std::ptr::drop_in_place`. Not doing that would be a leak until the list as a whole was dropped.


I don't mean a literal UAF, but a "use array index after free" because you're using indexes (which only have bounds checking) as heap pointers.

Rust's borrow checker doesn't account for when you re-implement parts of memory management as array indexes.


Safe Rust is a Turing-complete language. That implies it is possible to build an emulator for any other programming language in safe Rust. That emulated language, and programs running on it, can have memory-safety issues, even if the emulator itself cannot be corrupted.

When writing such an emulator, the natural way to set up the memory is to use an array, and the natural way to implement pointers is as indices into that array. In other words, this pattern is partway toward creating an emulator with internal safety issues.

However guaranteeing that any corruption will be contained to the array is certainly a lot better than nothing.


Rust's safety mechanisms don't exist just to prevent bugs from escalating into security issues; they exist to prevent the whole class of bugs related to reference handling from being present in the first place. That's supposed to mean programs that work more consistently. Waving off panicking programs with "well, at least it isn't a security concern" is missing the forest for the trees.


This is generally true, but AFAIK there are still some (fairly complex) ways to write memory-unsafe code in safe Rust: https://faultlore.com/blah/everyone-poops/


How is reusing a freed index not a UAF? If I roll my own allocator I can still get UAFs even though the memory accessed is not yet free'd.


Because that's not what "UAF" means. Also not what "freed" means.

To have a UAF, there has to be memory that is actually freed, and you have to attempt to access that memory. No memory is freed here (in the OP's implementation). Even if it was, at worst you'd get a panic for trying to access past the end of the Vec.

None of that is a UAF or a memory safety issue. It's just a logic bug.


Nope, but "handle confusion" safety perhaps?


No, that's incorrect. While yes, it's true that you can point two different nodes at the same array index, or mismanage your array indices in a variety of ways, that is a logic error. It will indeed make your program behave incorrectly (and depending on what it's doing, that may have security implications), but there is no memory safety issue. No one is using anything after freeing it; if you try to access an index that is past the end of the Vec, it will panic. Panicking, while undesirable, is memory-safe.

You may think that's a difference without distinction, but "memory safety" and "use after free" have specific definitions, and this ain't them.


You later clarified that by "memory safety bugs" you don't actually mean "memory safety bugs," but rather "use array index after free." But that isn't a memory safety bug. (It might be a denial of service bug or a logic bug, but because of bounds checks, it isn't a memory safety bug.) So no, I'm afraid I haven't missed the point at all.

Could you please read the link I shared? There's all sorts of nuance in the README. And there is absolutely no pretending in my comment or in the link I shared that using indices instead of pointers has zero downsides.


It is a memory safety bug - by using this "indexes as pointers" methodology it is possible to write code where two different owners simultaneously believe they are the sole owner of an object.

Writing that using normal pointers is impossible in safe Rust (barring compiler bugs, which do exist but are rare).


You don't get two owners that way. You get something slightly less powerful than owners, which still has all its preconditions satisfied: you can check that it is in bounds, and if so you will find an element of the expected type. You cannot corrupt the underlying allocator's data structures, you cannot resize the array when there are outstanding pointers to its elements, and you cannot violate the type system.

Rust's goal was never to exclude all shared mutability. (Otherwise why support things like locks?) Rather, it excludes only the kinds of shared mutability that open the door to undefined behavior. The point of all these sorts of "workarounds" is that, because there is no longer a single unrestricted kind of shared mutability, you now get to pick which version of restricted mutability you want: reference counting vs bounds checking vs ghostcell's higher-rank lifetimes vs whatever else.


> two different owners simultaneously believe they are the sole owner of an object.

Not in the sense of ownership that matters in Rust. The Vec owns the data. A node with an index in it does not own that data. It merely refers to it.

The key here is when you answer the question, "what happens when I screw up the indices?" And the answer is logic errors or a panic. Neither of those are memory safety issues.


Can you share a program where this results in UB without using `unsafe`?

Please also consider what I was responding to:

> you’re just laundering the unsafe pointer arithmetic behind array indexing


The definition of memory safety is not "code that does not result in UB".


So just to be clear here, the progression is:

"memory safety bugs" -> "for example, UAFs" -> "I don't mean a literal UAF" -> "use array index after free" -> 'memory safety is not "code that does not result in UB"'

I mean, you can define "memory safety" to be whatever you want it to be, but the definition everyone else uses (including Rust) is absolutely connected with undefined behavior. More than that, the entire context of this thread assumes that definition. Rust certainly does. And if you are going to use a different definition than everyone else, at least have the courtesy to provide it.

If people used your definition, then it would be wrong to, for example, say that "Java is a memory safe programming language." But that is, as far as I know, widely regarded to be a true statement.

This sort of disagreement is profoundly irritating, because I made it exceptionally clear what I meant from the get-go. All you had to do was respond and say, "oh, it sounds like we are just using different definitions of the term 'memory safety.' if we use your definition, I agree with what you said."


It is the definition of memory safety that Rust uses.

It would be easier to discuss whatever non-UB failure modes you have in mind, in the context of Rust, if you used a different term.


Maybe not, but it's also not whatever unconventional definition you've come up with here.

