Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FYI: Purported UB around generic functions like OPENSSL_LH_HASHFUNC #22896

Open
jwalch opened this issue Nov 30, 2023 · 33 comments
Open

FYI: Purported UB around generic functions like OPENSSL_LH_HASHFUNC #22896

jwalch opened this issue Nov 30, 2023 · 33 comments
Assignees
Labels
branch: master Merge to master branch branch: 3.0 Merge to openssl-3.0 branch branch: 3.1 Merge to openssl-3.1 branch: 3.2 Merge to openssl-3.2 triaged: bug The issue/pr is/fixes a bug

Comments

@jwalch
Copy link
Contributor

jwalch commented Nov 30, 2023

We started seeing a complaint from LLVM17 sanitizer via some application nominally using OpenSSL:

crypto/lhash/lhash.c:299:12: runtime error: call to function err_string_data_hash through pointer to incorrect function type 'unsigned long (*)(const void *)'
[...]/crypto/err/err.c:184: note: err_string_data_hash defined here
    #0 0x7fa569e3a434 in getrn [...]/crypto/lhash/lhash.c:299:12
    #1 0x7fa569e39a46 in OPENSSL_LH_insert [...]/crypto/lhash/lhash.c:119:10
    #2 0x7fa569d866ee in err_load_strings [...]/crypto/err/err.c:280:15
[...]

Strangely there were not other reports, despite the fact that obviously this idiom is used for various LHASH and in STACK also.

Unsurprisingly this seems to be impacting many other C-based projects whose authors are not especially thrilled either:

systemd/systemd#29972
python/cpython#111178
php/php-src@ca22505

There seems also to even be some conversation in the original LLVM thread on whether or not this is truly UB: https://reviews.llvm.org/D148827

@jwalch jwalch added the issue: bug report The issue was opened to report a bug label Nov 30, 2023
@nhorman
Copy link
Contributor

nhorman commented Nov 30, 2023

That definately looks like a false positive to me. Specifically to judge from the specific error:

i suppose that llvm is warning because the typecast from void * to ERR_STRING_DATA might have some alignment issue, but I can't see how it ever would, given that inserts to the hash table are all of type ERR_STRING_DATA already, and so should be properly aligned.

I would be willing to be that if you modified err_string_data_hash as follows:

diff --git a/crypto/err/err.c b/crypto/err/err.c
index b95182d702..4484db8970 100644
--- a/crypto/err/err.c
+++ b/crypto/err/err.c
@@ -168,8 +168,9 @@ static unsigned long get_error_values(ERR_GET_ACTION g,
                                       int *flags);
 
 #ifndef OPENSSL_NO_ERR
-static unsigned long err_string_data_hash(const ERR_STRING_DATA *a)
+static unsigned long err_string_data_hash(const void *d)
 {
+    const ERR_STRING_DATA *a = (const ERR_STRING_DATA *)d;
     unsigned long ret, l;
 
     l = a->error;

The explicit cast would quiet the error, which would perhaps then serve as an argument to llvm that the check is being overly agressive (as an implicit cast should be identical to the explicit one)

@davidben
Copy link
Contributor

davidben commented Nov 30, 2023

This isn't about alignment and is a true positive. A function with type void (T*) cannot be called through a function pointer of type void (*)(void *). You can do the cast, but when you call it, it needs to be cast back to the original type.

I don't think the proposed fix is (necessarily) right either. This is actually a pretty thorny consequence of how OpenSSL's STACK_OF(T) and LHASH_OF(T) types do type erasure. The question is whether functions like lh_TYPE_new, sk_TYPE_pop_free, and sk_TYPE_set_cmp_func are expected to take void*- or T*-based function pointers. You have to pick one or the other.

For STACK_OF(T), both the docs and long-standing practice force us to pick the T* version. The T* version is also more type-safe. LHASH_OF(T) is interesting because the docs actually imply a void* pointer. But looking at the code, I think this may just be a doc bug. The helper macro does define type-specific function pointers.

Now, how do you get it cast to the right type if you pick the T* one and do type erasure? It's pretty thorny but you basically need to pass in type-aware thunks. Here's what I did for BoringSSL. It's pretty gross, but fixes the UB.

As you go to do that, you'll quickly discover that the standard library qsort and bsearch functions are actually practically impossible to use to implement OpenSSL's API. (They're missing a void* userdata or context parameter to pass in a closure.) bsearch_r and qsort_r fix this, but are a portability nightmare. We ultimately just reimplemented both in the library to avoid UB.

@thesamesam
Copy link
Contributor

See #22375 too (cc @kroeckx).

@mattcaswell mattcaswell added branch: master Merge to master branch triaged: bug The issue/pr is/fixes a bug branch: 3.0 Merge to openssl-3.0 branch branch: 3.1 Merge to openssl-3.1 branch: 3.2 Merge to openssl-3.2 and removed issue: bug report The issue was opened to report a bug labels Dec 1, 2023
@nhorman
Copy link
Contributor

nhorman commented Dec 2, 2023

So, I'll start this by admitting that I didn't/don't fully understand the issue here, but I did some reading, and I think I have a slightly better handle on it now, though I'm still struggling to see why this is a problem. For the benefit of future readers, I wanted to get my thoughts down here.

The facts (and fact-adjacent info)
llvm's ub sanitizer is complaining about undefined behavior. Particularly, it is complaining that this code in our hash table library:

static OPENSSL_LH_NODE **getrn(OPENSSL_LHASH *lh,
                               const void *data, unsigned long *rhash)
{   
    OPENSSL_LH_NODE **ret, *n1; 
    unsigned long hash, nn;
    OPENSSL_LH_COMPFUNC cf;

    hash = (*(lh->hash)) (data);    <===== Here
    *rhash = hash;     
... 

Is making a call through a function pointer where the prototype of lh->hash which is:

 unsigned long (*OPENSSL_LH_HASHFUNC) (const void *);

Doesn't match the signature of the function it's pointing to, which in this case is:

static unsigned long err_string_data_hash(const ERR_STRING_DATA *a)

Additional info/Questions

C89 seems pretty vague on this issue. It indicates in A.6.5.7 that pointers can be cast to objects, and vice versa, and makes no mention of how arguments are handled when that is done. 3.3.16 notes that assignments (which is further clarified in c99) that:

In simple assignment ( = ), the value of the right operand is converted to the type of the assignment expression and replaces the value stored in the object designated by the left operand

If we assume that parameters to function calls are handled as assignments, then the above should be considered safe in C89, as the pointer to void is converted to a pointer to ERR_STRING_DATA, and everything is ok.

C99 has significantly more to say about this, in section 6.5.2.2:

7 If the expression that denotes the called function has a type that does include a prototype,
the arguments are implicitly converted, as if by assignment, to the types of the
corresponding parameters, taking the type of each parameter to be the unqualified version
of its declared type. The ellipsis notation in a function prototype declarator causes
argument type conversion to stop after the last declared parameter. The default argument
promotions are performed on trailing arguments.

Which I think matches what we're doing here as lh->hash has a prototype defined in the OPENSSL_LHASH structure (the type is OPENSSL_HASH_COMPFUNC), and so the arguments are implicitly converted, which is what we want.

So far, so good, the section on assignments (6.5.16) says, relevant to this operation, that an assignment has to conform to at least one of the following constraints:

1. the left operand has qualified or unqualified arithmetic type and the right has
arithmetic type (**We don't conform here, but it's not applicable because we're working with pointer types)**

2. the left operand has a qualified or unqualified version of a structure or union type
compatible with the type of the right (**We don't conform, but this isn't applicable as we're dealing with pointer types**)

3. both operands are pointers to qualified or unqualified versions of compatible types,
and the type pointed to by the left has all the qualifiers of the type pointed to by the
right (**This is applicable and I think we conform**)

4. one operand is a pointer to an object or incomplete type and the other is a pointer to a
qualified or unqualified version of void, and the type pointed to by the left has all 
the qualifiers of the type pointed to by the right (**this is applicable, and I think we conform**)

5. the left operand is a pointer and the right is a null pointer constant (**Not applicable**)

Speaking to items (3) and (4), if we assume that The assignment operator constraints apply to the type conversions, as suggested by 6.5.2.2, taking the const ERR_STRING_DATA pointer in the called function to be the left operand of the assignment expression, and the passed void * data as the right operand in the expression, they are both pointers to qualified versions of compatible types (at least it seems to me that const void * and const ERR_STRING_DATA * should be compatible), and they both meet the conditions of having all the same qualifiers. As such it seems this code meets the constraints (3) and (4), and so the call behavior should not be considered undefined behavior.

Given the above, I'm struggling to understand how this isn't a conformant operation. Can someone elaborate?

Or is llvm really triggering on the fact that the getrn function calls the hash function like this:

hash = (*(lh->hash)) (data);

Rather than this:

hash = lh->hash(data);

which @levitte and I noted in #22375 is just deeply wierd. Section 6.5.3.2 of C99 expressly notes that indirection of a function pointer results in the function designator, and so is fine, but I could see how ubsanitizer might trip up on such an operation, if its instrumentation recorded that as a different type.

@mspncp
Copy link
Contributor

mspncp commented Dec 2, 2023

At least your last question should be easy to answer if @jwalch could be so kind to change this line and rerun the test.

@nhorman
Copy link
Contributor

nhorman commented Dec 2, 2023

I just ran it here, under a ubsan config, and saw no errors (though thats not definitive in any way), as I wasn't seeing it before

@thesamesam
Copy link
Contributor

Note that you need >= Clang 17 for this.

@nhorman
Copy link
Contributor

nhorman commented Dec 2, 2023

oh, thats right, I was probably configured for gcc. Let me try again

@nhorman
Copy link
Contributor

nhorman commented Dec 2, 2023

yeah, when configure to use clang, and enable ubsan, it throws a bunch of complaints, about every call through lh->hash, and converting that wierd dereference doesn't help.

@t-j-h
Copy link
Member

t-j-h commented Dec 2, 2023

For historical context, that "weird dereference" was to deal with a compiler bug that crashed for various versions of Borland C on windows unless you used the explicit pointer dereference approach.

Keep in mind that lhash predates SSLeay - that code had been to a lot of unusual places long before it went into SSLeay.

@ArsenArsen
Copy link

So, I'll start this by admitting that I didn't/don't fully understand the issue here, but I did some reading, and I think I have a slightly better handle on it now, though I'm still struggling to see why this is a problem. For the benefit of future readers, I wanted to get my thoughts down here.

The facts (and fact-adjacent info) llvm's ub sanitizer is complaining about undefined behavior. Particularly, it is complaining that this code in our hash table library:

static OPENSSL_LH_NODE **getrn(OPENSSL_LHASH *lh,
                               const void *data, unsigned long *rhash)
{   
    OPENSSL_LH_NODE **ret, *n1; 
    unsigned long hash, nn;
    OPENSSL_LH_COMPFUNC cf;

    hash = (*(lh->hash)) (data);    <===== Here
    *rhash = hash;     
... 

Is making a call through a function pointer where the prototype of lh->hash which is:

 unsigned long (*OPENSSL_LH_HASHFUNC) (const void *);

Doesn't match the signature of the function it's pointing to, which in this case is:

static unsigned long err_string_data_hash(const ERR_STRING_DATA *a)

Additional info/Questions

C89 seems pretty vague on this issue. It indicates in A.6.5.7 that pointers can be cast to objects, and vice versa, and makes no mention of how arguments are handled when that is done. 3.3.16 notes that assignments (which is further clarified in c99) that:

In simple assignment ( = ), the value of the right operand is converted to the type of the assignment expression and replaces the value stored in the object designated by the left operand

If we assume that parameters to function calls are handled as assignments, then the above should be considered safe in C89, as the pointer to void is converted to a pointer to ERR_STRING_DATA, and everything is ok.

C99 has significantly more to say about this, in section 6.5.2.2:

7 If the expression that denotes the called function has a type that does include a prototype,
the arguments are implicitly converted, as if by assignment, to the types of the
corresponding parameters, taking the type of each parameter to be the unqualified version
of its declared type. The ellipsis notation in a function prototype declarator causes
argument type conversion to stop after the last declared parameter. The default argument
promotions are performed on trailing arguments.

Which I think matches what we're doing here as lh->hash has a prototype defined in the OPENSSL_LHASH structure (the type is OPENSSL_HASH_COMPFUNC), and so the arguments are implicitly converted, which is what we want.

So far, so good, the section on assignments (6.5.16) says, relevant to this operation, that an assignment has to conform to at least one of the following constraints:

1. the left operand has qualified or unqualified arithmetic type and the right has
arithmetic type (**We don't conform here, but it's not applicable because we're working with pointer types)**

2. the left operand has a qualified or unqualified version of a structure or union type
compatible with the type of the right (**We don't conform, but this isn't applicable as we're dealing with pointer types**)

3. both operands are pointers to qualified or unqualified versions of compatible types,
and the type pointed to by the left has all the qualifiers of the type pointed to by the
right (**This is applicable and I think we conform**)

4. one operand is a pointer to an object or incomplete type and the other is a pointer to a
qualified or unqualified version of void, and the type pointed to by the left has all 
the qualifiers of the type pointed to by the right (**this is applicable, and I think we conform**)

5. the left operand is a pointer and the right is a null pointer constant (**Not applicable**)

Speaking to items (3) and (4), if we assume that The assignment operator constraints apply to the type conversions, as suggested by 6.5.2.2, taking the const ERR_STRING_DATA pointer in the called function to be the left operand of the assignment expression, and the passed void * data as the right operand in the expression, they are both pointers to qualified versions of compatible types (at least it seems to me that const void * and const ERR_STRING_DATA * should be compatible), and they both meet the conditions of having all the same qualifiers. As such it seems this code meets the constraints (3) and (4), and so the call behavior should not be considered undefined behavior.

Given the above, I'm struggling to understand how this isn't a conformant operation. Can someone elaborate?

the paragraphs you refer to are not relevant as the 'prototype' in question refers to the prototype in the function pointer type, which is still mismatched with the real function type, and hence, the wrong conversions happen at call time

Or is llvm really triggering on the fact that the getrn function calls the hash function like this:

hash = (*(lh->hash)) (data);

Rather than this:

hash = lh->hash(data);

which @levitte and I noted in #22375 is just deeply wierd. Section 6.5.3.2 of C99 expressly notes that indirection of a function pointer results in the function designator, and so is fine, but I could see how ubsanitizer might trip up on such an operation, if its instrumentation recorded that as a different type.

it's not. see:

/tmp$ echo 'static void f(unsigned) {} void (*foo)(unsigned) = f;' > a.c; echo 'int main(void) { extern void (*foo)(int); (*foo)(1); foo(1); }' > b.c; clang -O3 -fsanitize=function a.c b.c && ./a.out 
a.c:1:23: warning: omitting the parameter name in a function definition is a C2x extension [-Wc2x-extensions]
    1 | static void f(unsigned) {} void (*foo)(unsigned) = f;
      |                       ^
1 warning generated.
b.c:1:43: runtime error: call to function f through pointer to incorrect function type 'void (*)(int)'
a.c: note: f defined here
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior b.c:1:43 in 
b.c:1:54: runtime error: call to function f through pointer to incorrect function type 'void (*)(int)'
a.c: note: f defined here
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior b.c:1:54 in 
/tmp$

@nhorman
Copy link
Contributor

nhorman commented Dec 2, 2023

thank you @t-j-h , I couldn't for the life of me figure out what on earth that was doing

@nhorman
Copy link
Contributor

nhorman commented Dec 2, 2023

@ArsenArsen
I'm not sure I follow your comment:

the paragraphs you refer to are not relevant as the 'prototype' in question refers to the prototype in the function pointer type, which is still mismatched with the real function type, and hence, the wrong conversions happen at call time

lh->hash is defined as OPENSSL_LH_HASHFUNC, which is of the following typedef:

typedef unsigned long (*OPENSSL_LH_HASHFUNC) (const void *);

which I grant you doesn't match the definition of err_string_data_hash, or many other instances of lhash usage (it replaces const void * with its specific comparison data type.

But the type conversion rules I outlined above should still apply, no?

Or are you saying that, because lh->hash has the prototype outlined above, and the compiler doesn't expressly know the prototype of the function that it points to, no conversion is undertaken? In other words, if there exists some compiler/platform in the wild that expressly needs to do some level of conversion to the pointer (whatever that might be), then that conversion won't happen there, causing potential breakage. I can't imagine what conversion that compiler would be doing, but I could understand it as an error in the philosophical sense, that would trigger the ubsan warning, that perhaps never manifests as a real runtime error, at least not in any of the platorms openssl current supports.

@thesamesam
Copy link
Contributor

That latter case can happen with CFI. UBsan also regularly warns about things which compilers don't yet enforce/optimise on.

@ArsenArsen
Copy link

Or are you saying that, because lh->hash has the prototype outlined above, and the compiler doesn't expressly know the prototype of the function that it points to, no conversion is undertaken?

indeed

I can't imagine what conversion that compiler would be doing, but I could understand it as an error in the philosophical sense, that would trigger the ubsan warning, that perhaps never manifests as a real runtime error, at least not in any of the platorms openssl current supports.

such errors perhaps won't manifest as a missing conversion (but not every arch has the same representation for all pointers), but i can see this kind of error failing CFI in the future as compilers get better at detecting bugs for the user.

@nhorman
Copy link
Contributor

nhorman commented Dec 2, 2023

@thesamesam , ok, I think I'm starting to get it now. CFI, as a run time validation tool for, among other things, indirect pointer calls, might modify some data in the function pointer for validation purposes, and if the type conversion isn't made properly, CFI breaks. Even if the conversion is effectively a no-op for the purposes of our run time operation in isloation, other tools that rely on those conversions happening may break, and thats whats ubsan is warning about here.

Ok, thank you for the education. It would seem to me then, that the most reasonable solution (though its ugly), would be to modify DEFINE_LHASH_OF_EX such that the the definition of lh_##type##_new accept function pointers whos arguments are all void *, and perform the explicit cast to the proper type in the respective _hash function implementation.

Note: This is also applicable to the compare function that gets passed to lh_##type##_new

@nhorman nhorman self-assigned this Dec 2, 2023
nhorman added a commit to nhorman/openssl that referenced this issue Dec 2, 2023
ubsan on clang17 has started warning about the following undefined
behavior:

crypto/lhash/lhash.c:299:12: runtime error: call to function err_string_data_hash through pointer to incorrect function type 'unsigned long (*)(const void *)'
[...]/crypto/err/err.c:184: note: err_string_data_hash defined here
    #0 0x7fa569e3a434 in getrn [...]/crypto/lhash/lhash.c:299:12
    openssl#1 0x7fa569e39a46 in OPENSSL_LH_insert [...]/crypto/lhash/lhash.c:119:10
    openssl#2 0x7fa569d866ee in err_load_strings [...]/crypto/err/err.c:280:15
[...]

The issue occurs because, the generic hash functions (OPENSSL_LH_*) will
occasionaly call back to the type specific registered functions for hash
generation/comparison/free/etc, using functions of the (example)
prototype:

[return value] <hash|cmp|free> (void *, [void *], ...)

While the functions implementing hash|cmp|free|etc are defined as
[return value] <fnname> (TYPE *, [TYPE *], ...)

The compiler, not knowing the type signature of the function pointed to
by the implementation, performs no type conversion on the function
arguments

While this is nominally not an issue for general run time usage (a
pointer to void can be used as a pointer to a specific type, as long as
the pointer respects the constraints of the C language specification),
it can be problematic for ancilliary tools that expect the compiler to
detect when a type conversion occurs (even if it is a no-op, CFI being
an expemplar case).

As such, ubsan warns us about this issue to avoid breakages in such
conditions.

There could be several fixes for it, such as including some per-type
type thunking code, to make the conversion prior to the callback, but
such solutions are very involved and require a significant amount of
code.

A more general solution (which is provided here), is to convert all the
call backs and function callback type definition to only pass void
pointers, and make the needed conversion in the callbacks themselves.

Fixes openssl#22896
@nhorman
Copy link
Contributor

nhorman commented Dec 2, 2023

ugh, it also applies to the free function, and the doall functions. yuk.

However, i've got a start on what might be a proposed fix here:
#22915

It doesn't fix everything (the free and doall function macros still need to be fixed up, the docs also need adjustment), and this probably will need simmilar work on other data strucure types (PRIORITY_LIST, SA, etc). But if anyone wants to play with the above draft pr to see if its going in the right direction, please feel free

@nhorman
Copy link
Contributor

nhorman commented Dec 3, 2023

Note to self: This could also be implemented with _Generic to maintain compile time type checking, but that requires the availability of the _Generic keyword, which is only available in C11

nhorman added a commit to nhorman/openssl that referenced this issue Dec 3, 2023
ubsan on clang17 has started warning about the following undefined
behavior:

crypto/lhash/lhash.c:299:12: runtime error: call to function err_string_data_hash through pointer to incorrect function type 'unsigned long (*)(const void *)'
[...]/crypto/err/err.c:184: note: err_string_data_hash defined here
    #0 0x7fa569e3a434 in getrn [...]/crypto/lhash/lhash.c:299:12
    openssl#1 0x7fa569e39a46 in OPENSSL_LH_insert [...]/crypto/lhash/lhash.c:119:10
    openssl#2 0x7fa569d866ee in err_load_strings [...]/crypto/err/err.c:280:15
[...]

The issue occurs because, the generic hash functions (OPENSSL_LH_*) will
occasionaly call back to the type specific registered functions for hash
generation/comparison/free/etc, using functions of the (example)
prototype:

[return value] <hash|cmp|free> (void *, [void *], ...)

While the functions implementing hash|cmp|free|etc are defined as
[return value] <fnname> (TYPE *, [TYPE *], ...)

The compiler, not knowing the type signature of the function pointed to
by the implementation, performs no type conversion on the function
arguments

While this is nominally not an issue for general run time usage (a
pointer to void can be used as a pointer to a specific type, as long as
the pointer respects the constraints of the C language specification),
it can be problematic for ancilliary tools that expect the compiler to
detect when a type conversion occurs (even if it is a no-op, CFI being
an expemplar case).

As such, ubsan warns us about this issue to avoid breakages in such
conditions.

This is an alternative fix for the issue, implemented using, in effect,
thunking macros.  For each hash type, an additional set of wrapper
funtions is created (currently for compare and hash, but more will be
added for free/doall/etc).  The corresponding thunking macros for each
type cases the actuall corresponding callback to a function pointer of
the proper type, and then calls that with the parameters appropriately
cast, avoiding the ubsan warning

This approach is adventageous as it maintains a level of type safety,
but comes at the cost of having to implement several additional
functions per hash table type.

Fixes openssl#22896
@nhorman
Copy link
Contributor

nhorman commented Dec 4, 2023

@davidben Thank you for the references. I suppose this is all moot at this point, since @thesamesam and @ArsenArsen have pointed out that CFI breaks without doing this thunking, and so regardless of anything else here, we likely need to fix this

That said, for the purpose of education, regarding your references
regarding 6.3.2.3, we make this conversion (using the intial post as reference), when we call lh_ERR_STRING_DATA_new. Specifically we convert err_string_data_hash, which is of type

unsigned long (*)(const ERR_STRING_DATA *)

to type

unsigned long (*)(const void *)

Based on that and section 6.5.2.2 clang's ubsan is asserting that those two types are not compatible. So the question in my mind is "Are those two types in fact, incompatible"?

Looking at 6.2.7 to determine the rules for compatibility:

Two types have compatible type if their types are the same. Additional rules for
determining whether two types are compatible are described in 6.7.2 for type specifiers,
in 6.7.3 for type qualifiers, and in 6.7.5 for declarators.46)

(footnote 46 included above notes "Two types need not be identical to be compatible"

The remainder of the paragraph discusses the compatibility rules for structures, unions and ennumerated types. No mention is made for pointers to types there that I can see. Strictly speaking the two function pointer types are not the same, so I can see why ubsan would warn about it, but it still seems to me that void * should be compatible with ERR_STRING_DATA *.

To support that argument, I'm looking at the section on coversions (6.3.2.3):

Several operators convert operand values from one type to another automatically. This
subclause specifies the result required from such an implicit conversion, as well as those
that result from a cast operation (an explicit conversion). The list in 6.3.1.8 summarizes
the conversions performed by most ordinary operators; it is supplemented as required by
the discussion of each operator in 6.5.
...

A pointer to an object or incomplete type may be converted to a pointer to a different
object or incomplete type. If the resulting pointer is not correctly aligned57) for the
pointed-to type, the behavior is undefined.

That, to me seems to say that, If the above conversion of function pointer types occurs (which it has), when called in getrn, the compiler should implicitly convert the argument void * to ERR_STRING_DATA *, and that conversion is compatible, as long as the alignment of the void * agrees with the required alignment of ERR_STRING_DATA *, which it always should as long as we only ever store ERR_STRING_DATA types in the hash table (which we do)

Again, since CFI breaks without this thunking, that seems to me to be sufficient to fix this issue, I'm just trying to understand if this is truly undefined behavior. because the type of the target function pointer is defined in a compliation unit external to the core lhash code, I can see (as discussed above), how the compiler may not be able to determine the conversion to make, triggering the ubsan warning, but it seems like they should be, in the philisophical sense, compatible, as the pointer type conversion, could, in theory be implicitly made.

@davidben
Copy link
Contributor

davidben commented Dec 4, 2023

You've got causality backwards. CFI works the way it does because C's rules imply you cannot call such functions. Although, certainly CFI is more effective the more finely you can partition the space of functions and calls, since the point of CFI-like mitigations is to constrain the control-flow graph, as much as possible, to the "correct" one.

regarding 6.3.2.3, we make this conversion (using the intial post as reference), when we call lh_ERR_STRING_DATA_new. [...] Based on that and section 6.5.2.2 clang's ubsan is asserting that those two types are not compatible. So the question in my mind is "Are those two types in fact, incompatible"?

Again, that is not what is going on. The conversion is fine. It is that you have to cast the pointer back when you call the function. That is the point at which Clang is flagging the issue. Please read the text I cited again.

And, no, they are not compatible. Compatible does not mean convertible. Compatible means, colloquially, the same type. That is, in fact, what it says right in the spec. "Two types have compatible type if their types are the same." The reason for the footnote and all the complicated rules is just because C has lots of ways to say "the same" type, and it gets very fussy. E.g. anonymous structs cause a bit of a mess.

@davidben
Copy link
Contributor

davidben commented Dec 4, 2023

cppreference defines this notion in a hopefully more digestible way. Language specs can be hard to read if you're not used to them:
https://en.cppreference.com/w/c/language/type#Compatible_types

I think you've just taken the footnote and extrapolated an equivalence between "compatible" and "convertible" where there is none.

@nhorman
Copy link
Contributor

nhorman commented Dec 4, 2023

@davidben ok, I think I get what thats saying, thank you. If I'm reading it right what the cpp reference is indicating is that function pointer types are compatible if the return type is the same, and if the parameter counts match, and each type in the parameter list is 'compatible', where compatibility, for the purpose of this discussion implies, from that document, for each parameter:

they are pointer types and are pointing to compatible types

because void * points to a void object, and ERR_STRING_DATA * points to an ERR_STRING_DATA object, those types are not the same, and therefore not compatible.

Counter-intuitive, given that void * are so commonly cast to other data types, but I understand what the standard is saying now. Lots of projects are going to struggle with this.

@davidben
Copy link
Contributor

davidben commented Dec 4, 2023

Counter-intuitive, given that void * are so commonly cast to other data types, but I understand what the standard is saying now.

Right, formally speaking, a pointer of type void * and a pointer of int * are different. It's just that it's that the conversion is defined and happens implicitly. Just like how C will implicitly convert uint32_t{0xffffffff} to uint16_t{0xffff} even though these are decidedly not the same type or value.

Lots of projects are going to struggle with this.

Yeah, it turns out there is a pretty wide gap between C, the language as described in the spec and implemented by compilers, and C, the language that most C programmers think they are writing. 😞 The strict aliasing rule is similarly very harsh. (Type-punning via pointer casts is practically never allowed.)

Unfortunately, this gap translates to code getting miscompiled, so we have to live with it until newer versions of the standard fix things up. (Though I wouldn't expect this one to change. Changing it would potentially even weaken CFI's protections.)

@ArsenArsen
Copy link

i fear this is unlikely to be fixed by or in C, as doing this 'right' requires significant polymorphism facilities, far beyond what _Generic offers

@nhorman nhorman added the hold: need otc decision The OTC needs to make a decision label Dec 5, 2023
@nhorman
Copy link
Contributor

nhorman commented Dec 5, 2023

OTC: need to discuss options to fix this. Current options are:

  1. Add thunking functions to cast function pointers and types appropriately ( large change set, increases code size, maintains type safety)

  2. Revert callbacks to be generic, and do type convesion in callbacks (large change set, but smaller than (1), easier to implement, eliminates type safety of various callbacks)

@davidben
Copy link
Contributor

davidben commented Dec 5, 2023

Revert callbacks to be generic, and do type convesion in callbacks (large change set, but smaller than (1), easier to implement, eliminates type safety of various callbacks)

Is this option to make all the callbacks be void*-typed? I do not think this is viable. These callbacks are exposed publicly, and callers have relied on the callbacks being typed for a very, very long time. It would be unreasonable to pull the rug out from under the whole OpenSSL ecosystem and say that they're void*-typed now.

That is, the C requirement to match the type signature when you call the function means that these two APIs are incompatible:

void sk_FOO_pop_free1(STACK_OF(FOO) *sk, void (*free_func)(FOO *));
void sk_FOO_pop_free2(STACK_OF(FOO) *sk, void (*free_func)(void *));

If you told the caller that the API is sk_FOO_pop_free1, they will write the function with a FOO* parameter. If you then change it to sk_FOO_pop_free2, you have broken their code.

For function pointers which are not exposed publicly, you can change the type signature all you like. But changing any public ones is an API break.

@nhorman
Copy link
Contributor

nhorman commented Dec 5, 2023

@davidben my original thought was to do that where needed, but yes, it quickly became obvious that doing so would have several negative repercussions. I'm less worried about ABI breakage there, as this is going to be a future versions only fix, where that is feasible. However it also strips all typechecking which is bad. But this is an option to potentially consider for the OTC

#22917 is closer to the approach you initially suggested

It passes around type thunking code to the major affected data structures in a per-type fashion that is reponsible for casting the generic function callback to its appropriate typed callback.

Its ugly, and not 100% covering of all the ubsan issues I'm seeing, and its also creating some code compilation issues (its pushing some symbols outside of the allowable reloc range for m68k and some other older cpus), but this generally seems like the way to go.

In either case, I think the OTC needs to make a decision on the appropriate way to go.

@kroeckx
Copy link
Member

kroeckx commented Dec 5, 2023 via email

@nhorman
Copy link
Contributor

nhorman commented Dec 6, 2023

For reference, there are several open issues on this same problem being reported with guard:/cf enabled on windows:
#22944
#22554
#22387
#16147
#1592

@thesamesam
Copy link
Contributor

Reasons this is undefined behavior is that on some architectures void * might be a different size then other pointer types. There are also ABIs where depending on the type the pointer is passed in a different register.

(This isn't the only reason but it's covered extensively above.)

nhorman added a commit to nhorman/openssl that referenced this issue Dec 13, 2023
ubsan on clang17 has started warning about the following undefined
behavior:

crypto/lhash/lhash.c:299:12: runtime error: call to function err_string_data_hash through pointer to incorrect function type 'unsigned long (*)(const void *)'
[...]/crypto/err/err.c:184: note: err_string_data_hash defined here
    #0 0x7fa569e3a434 in getrn [...]/crypto/lhash/lhash.c:299:12
    openssl#1 0x7fa569e39a46 in OPENSSL_LH_insert [...]/crypto/lhash/lhash.c:119:10
    openssl#2 0x7fa569d866ee in err_load_strings [...]/crypto/err/err.c:280:15
[...]

The issue occurs because, the generic hash functions (OPENSSL_LH_*) will
occasionaly call back to the type specific registered functions for hash
generation/comparison/free/etc, using functions of the (example)
prototype:

[return value] <hash|cmp|free> (void *, [void *], ...)

While the functions implementing hash|cmp|free|etc are defined as
[return value] <fnname> (TYPE *, [TYPE *], ...)

The compiler, not knowing the type signature of the function pointed to
by the implementation, performs no type conversion on the function
arguments

While this is nominally not an issue for general run time usage (a
pointer to void can be used as a pointer to a specific type, as long as
the pointer respects the constraints of the C language specification),
it can be problematic for ancilliary tools that expect the compiler to
detect when a type conversion occurs (even if it is a no-op, CFI being
an expemplar case).

As such, ubsan warns us about this issue to avoid breakages in such
conditions.

This is an alternative fix for the issue, implemented using, in effect,
thunking macros.  For each hash type, an additional set of wrapper
funtions is created (currently for compare and hash, but more will be
added for free/doall/etc).  The corresponding thunking macros for each
type cases the actuall corresponding callback to a function pointer of
the proper type, and then calls that with the parameters appropriately
cast, avoiding the ubsan warning

This approach is adventageous as it maintains a level of type safety,
but comes at the cost of having to implement several additional
functions per hash table type.

Fixes openssl#22896
nhorman added a commit to nhorman/openssl that referenced this issue Jan 3, 2024
ubsan on clang17 has started warning about the following undefined
behavior:

crypto/lhash/lhash.c:299:12: runtime error: call to function err_string_data_hash through pointer to incorrect function type 'unsigned long (*)(const void *)'
[...]/crypto/err/err.c:184: note: err_string_data_hash defined here
    #0 0x7fa569e3a434 in getrn [...]/crypto/lhash/lhash.c:299:12
    openssl#1 0x7fa569e39a46 in OPENSSL_LH_insert [...]/crypto/lhash/lhash.c:119:10
    openssl#2 0x7fa569d866ee in err_load_strings [...]/crypto/err/err.c:280:15
[...]

The issue occurs because, the generic hash functions (OPENSSL_LH_*) will
occasionaly call back to the type specific registered functions for hash
generation/comparison/free/etc, using functions of the (example)
prototype:

[return value] <hash|cmp|free> (void *, [void *], ...)

While the functions implementing hash|cmp|free|etc are defined as
[return value] <fnname> (TYPE *, [TYPE *], ...)

The compiler, not knowing the type signature of the function pointed to
by the implementation, performs no type conversion on the function
arguments

While this is nominally not an issue for general run time usage (a
pointer to void can be used as a pointer to a specific type, as long as
the pointer respects the constraints of the C language specification),
it can be problematic for ancilliary tools that expect the compiler to
detect when a type conversion occurs (even if it is a no-op, CFI being
an expemplar case).

As such, ubsan warns us about this issue to avoid breakages in such
conditions.

This is an potential fix for the issue, implemented using, in effect,
thunking macros.  For each hash type, an additional set of wrapper
funtions is created (currently for compare and hash, but more will be
added for free/doall/etc).  The corresponding thunking macros for each
type cases the actuall corresponding callback to a function pointer of
the proper type, and then calls that with the parameters appropriately
cast, avoiding the ubsan warning

This approach is adventageous as it maintains a level of type safety,
but comes at the cost of having to implement several additional
functions per hash table type.

Related to openssl#22896
@nhorman nhorman linked a pull request Jan 3, 2024 that will close this issue
nhorman added a commit to nhorman/openssl that referenced this issue Jan 4, 2024
Introduce hash thunking functions to do proper casting

ubsan on clang17 has started warning about the following undefined
behavior:

crypto/lhash/lhash.c:299:12: runtime error: call to function err_string_data_hash through pointer to incorrect function type 'unsigned long (*)(const void *)'
[...]/crypto/err/err.c:184: note: err_string_data_hash defined here
    #0 0x7fa569e3a434 in getrn [...]/crypto/lhash/lhash.c:299:12
    openssl#1 0x7fa569e39a46 in OPENSSL_LH_insert [...]/crypto/lhash/lhash.c:119:10
    openssl#2 0x7fa569d866ee in err_load_strings [...]/crypto/err/err.c:280:15
[...]

The issue occurs because, the generic hash functions (OPENSSL_LH_*) will
occasionaly call back to the type specific registered functions for hash
generation/comparison/free/etc, using functions of the (example)
prototype:

[return value] <hash|cmp|free> (void *, [void *], ...)

While the functions implementing hash|cmp|free|etc are defined as
[return value] <fnname> (TYPE *, [TYPE *], ...)

The compiler, not knowing the type signature of the function pointed to
by the implementation, performs no type conversion on the function
arguments

While the C language specification allows for pointers to data of one
type to be converted to pointers of another type, it does not
allow for pointers to functions with one signature to be called
while pointing to functions of another signature.  Compilers often allow
this behavior, but strictly speaking it results in undefined behavior

As such, ubsan warns us about this issue

This is an potential fix for the issue, implemented using, in effect,
thunking macros.  For each hash type, an additional set of wrapper
funtions is created (currently for compare and hash, but more will be
added for free/doall/etc).  The corresponding thunking macros for each
type cases the actuall corresponding callback to a function pointer of
the proper type, and then calls that with the parameters appropriately
cast, avoiding the ubsan warning

This approach is adventageous as it maintains a level of type safety,
but comes at the cost of having to implement several additional
functions per hash table type.

Related to openssl#22896
@t8m t8m removed the hold: need otc decision The OTC needs to make a decision label Jan 9, 2024
@t8m
Copy link
Member

t8m commented Jan 9, 2024

OTC: We should attempt to fix it via thunking but we must keep the ABI stable.

nhorman added a commit to nhorman/openssl that referenced this issue Jan 15, 2024
ubsan on clang17 has started warning about the following undefined
behavior:

crypto/lhash/lhash.c:299:12: runtime error: call to function err_string_data_hash through pointer to incorrect function type 'unsigned long (*)(const void *)'
[...]/crypto/err/err.c:184: note: err_string_data_hash defined here
    #0 0x7fa569e3a434 in getrn [...]/crypto/lhash/lhash.c:299:12
    openssl#1 0x7fa569e39a46 in OPENSSL_LH_insert [...]/crypto/lhash/lhash.c:119:10
    openssl#2 0x7fa569d866ee in err_load_strings [...]/crypto/err/err.c:280:15
[...]

The issue occurs because, the generic hash functions (OPENSSL_LH_*) will
occasionaly call back to the type specific registered functions for hash
generation/comparison/free/etc, using functions of the (example)
prototype:

[return value] <hash|cmp|free> (void *, [void *], ...)

While the functions implementing hash|cmp|free|etc are defined as
[return value] <fnname> (TYPE *, [TYPE *], ...)

The compiler, not knowing the type signature of the function pointed to
by the implementation, performs no type conversion on the function
arguments

While this is nominally not an issue for general run time usage (a
pointer to void can be used as a pointer to a specific type, as long as
the pointer respects the constraints of the C language specification),
it can be problematic for ancilliary tools that expect the compiler to
detect when a type conversion occurs (even if it is a no-op, CFI being
an expemplar case).

As such, ubsan warns us about this issue to avoid breakages in such
conditions.

This is an potential fix for the issue, implemented using, in effect,
thunking macros.  For each hash type, an additional set of wrapper
funtions is created (currently for compare and hash, but more will be
added for free/doall/etc).  The corresponding thunking macros for each
type cases the actuall corresponding callback to a function pointer of
the proper type, and then calls that with the parameters appropriately
cast, avoiding the ubsan warning

This approach is adventageous as it maintains a level of type safety,
but comes at the cost of having to implement several additional
functions per hash table type.

Related to openssl#22896
nhorman added a commit to nhorman/openssl that referenced this issue Jan 15, 2024
Introduce hash thunking functions to do proper casting

ubsan on clang17 has started warning about the following undefined
behavior:

crypto/lhash/lhash.c:299:12: runtime error: call to function err_string_data_hash through pointer to incorrect function type 'unsigned long (*)(const void *)'
[...]/crypto/err/err.c:184: note: err_string_data_hash defined here
    #0 0x7fa569e3a434 in getrn [...]/crypto/lhash/lhash.c:299:12
    openssl#1 0x7fa569e39a46 in OPENSSL_LH_insert [...]/crypto/lhash/lhash.c:119:10
    openssl#2 0x7fa569d866ee in err_load_strings [...]/crypto/err/err.c:280:15
[...]

The issue occurs because, the generic hash functions (OPENSSL_LH_*) will
occasionaly call back to the type specific registered functions for hash
generation/comparison/free/etc, using functions of the (example)
prototype:

[return value] <hash|cmp|free> (void *, [void *], ...)

While the functions implementing hash|cmp|free|etc are defined as
[return value] <fnname> (TYPE *, [TYPE *], ...)

The compiler, not knowing the type signature of the function pointed to
by the implementation, performs no type conversion on the function
arguments

While the C language specification allows for pointers to data of one
type to be converted to pointers of another type, it does not
allow for pointers to functions with one signature to be called
while pointing to functions of another signature.  Compilers often allow
this behavior, but strictly speaking it results in undefined behavior

As such, ubsan warns us about this issue

This is an potential fix for the issue, implemented using, in effect,
thunking macros.  For each hash type, an additional set of wrapper
funtions is created (currently for compare and hash, but more will be
added for free/doall/etc).  The corresponding thunking macros for each
type cases the actuall corresponding callback to a function pointer of
the proper type, and then calls that with the parameters appropriately
cast, avoiding the ubsan warning

This approach is adventageous as it maintains a level of type safety,
but comes at the cost of having to implement several additional
functions per hash table type.

Related to openssl#22896
openssl-machine pushed a commit that referenced this issue Jan 17, 2024
ubsan on clang17 has started warning about the following undefined
behavior:

crypto/lhash/lhash.c:299:12: runtime error: call to function err_string_data_hash through pointer to incorrect function type 'unsigned long (*)(const void *)'
[...]/crypto/err/err.c:184: note: err_string_data_hash defined here
    #0 0x7fa569e3a434 in getrn [...]/crypto/lhash/lhash.c:299:12
    #1 0x7fa569e39a46 in OPENSSL_LH_insert [...]/crypto/lhash/lhash.c:119:10
    #2 0x7fa569d866ee in err_load_strings [...]/crypto/err/err.c:280:15
[...]

The issue occurs because, the generic hash functions (OPENSSL_LH_*) will
occasionaly call back to the type specific registered functions for hash
generation/comparison/free/etc, using functions of the (example)
prototype:

[return value] <hash|cmp|free> (void *, [void *], ...)

While the functions implementing hash|cmp|free|etc are defined as
[return value] <fnname> (TYPE *, [TYPE *], ...)

The compiler, not knowing the type signature of the function pointed to
by the implementation, performs no type conversion on the function
arguments

While the C language specification allows for pointers to data of one
type to be converted to pointers of another type, it does not
allow for pointers to functions with one signature to be called
while pointing to functions of another signature.  Compilers often allow
this behavior, but strictly speaking it results in undefined behavior

As such, ubsan warns us about this issue

This is an potential fix for the issue, implemented using, in effect,
thunking macros.  For each hash type, an additional set of wrapper
funtions is created (currently for compare and hash, but more will be
added for free/doall/etc).  The corresponding thunking macros for each
type cases the actuall corresponding callback to a function pointer of
the proper type, and then calls that with the parameters appropriately
cast, avoiding the ubsan warning

This approach is adventageous as it maintains a level of type safety,
but comes at the cost of having to implement several additional
functions per hash table type.

Related to #22896

Reviewed-by: Sasa Nedvedicky <[email protected]>
Reviewed-by: Tomas Mraz <[email protected]>
Reviewed-by: Matt Caswell <[email protected]>
(Merged from #23192)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch: master Merge to master branch branch: 3.0 Merge to openssl-3.0 branch branch: 3.1 Merge to openssl-3.1 branch: 3.2 Merge to openssl-3.2 triaged: bug The issue/pr is/fixes a bug
Projects
Status: In progress
10 participants