Decompiling DOS MZ EXE to C #1234

Closed
palmerj opened this issue Dec 30, 2022 · 6 comments


@palmerj

palmerj commented Dec 30, 2022

I tried Reko to decompile a DOS MZ EXE to C. After loading and analysing the EXE I was able to export the code to C. It didn't come in one C file, which was weird. So I tried merging the C files together. After looking at the code and trying to compile it with gcc I ran into all sorts of issues. First, many types were not defined:

./TEST.h:105462:2: error: unknown type name 'real64'
        real64 rFFFFFFFF;       // FFFFFFFF
        ^
./TEST.h:105463:2: error: unknown type name 'Eq_1544'
        Eq_1544 t0001;  // 1
        ^
./TEST.h:105467:2: error: unknown type name 'Eq_10'
        Eq_10 t12CD0;   // 12CD0
        ^
./TEST.h:105468:2: error: unknown type name 'Eq_10'
        Eq_10 t12DC0;   // 12DC0
        ^
./TEST.h:105472:1: error: unknown type name 'real64'
real64 g_rFFFFFFFF; // FFFFFFFF
^
./TEST.h:105484:2: error: unknown type name 'Eq_21747'
        Eq_21747 a082B[];       // 82B
        ^
./TEST.h:105493:2: error: expected member name or ';' after declaration specifiers
./TEST.h:105492:19: error: expected ';' at end of declaration list
        Eq_21765 a0F7A[];       // F7A
                         ^
                         ;
./TEST.h:105494:2: error: type name requires a specifier or qualifier
        <anonymous> t1B1C;      // 1B1C
        ^
./TEST.h:105494:2: error: expected member name or ';' after declaration specifiers
./TEST.h:105493:20: error: expected ';' at end of declaration list
        <anonymous> t1AA0;      // 1AA0
                          ^
                          ;
./TEST.h:105504:2: error: unknown type name 'int16'
        int16 u0;
        ^
./TEST.h:105505:2: error: unknown type name 'uint8'
        uint8 u1;
        ^
./TEST.h:105506:2: error: unknown type name 'real64'
        real64 u2;
        ^
./TEST.h:105507:2: error: unknown type name 'byte'

Second, I see many function calls in the files that are not defined, e.g.

SLICE, SEQ, Test etc.

Also many dos_* function calls, including msdos_unknown_2143. Of course I can try to port the msdos_* to POSIX, but msdos_unknown_2143 is very unknown!

Lastly, I see many, many occurrences of <type-error>, <code>, <invalid> and <anonymous>.

Could someone kindly please explain the process I should go through to try and get clean code to compile? I'm starting to think I will need to manually reverse-engineer the whole program, which is 15k lines of code from a 53 kB EXE. Maybe Ghidra would be a better tool for this?

@maximilien-noal

maximilien-noal commented Jan 4, 2023

Ghidra has some limitations with real mode, but it is a good option overall.

However, Ghidra is not focused on DOS or real-mode support.

@palmerj
Author

palmerj commented Jan 4, 2023

Thank you.

Ghidra has some limitations with real mode

Do you know what these limitations are? I'm playing with it now and potentially see issues with pointers not being converted correctly.

@maximilien-noal

IIRC, mainly detecting the real end of functions. Most of the time it works. Sometimes it doesn't.

And also this kind of issue (instruction disassembly in real mode sometimes not working):
NationalSecurityAgency/ghidra#4832

@uxmal
Owner

uxmal commented Jan 5, 2023

First off, thanks for posting the issue. At the present time, no decompiler that I'm aware of generates compilable code except for the simplest binary files. Decompilation is much harder than compilation, as it tries to reconstruct information that is destroyed during compilation. User assistance is critical; user-provided type information in particular vastly improves the output.

Reko is maintained by a small number of contributors working in their spare time, which forces us into an "implement on demand" model. We depend on a cooperative process of users asking for features and providing binaries that exhibit problems, and the Reko contributors providing those features "just in time". So if you're finding features missing in Reko, please continue reporting them so we can evaluate and provide implementations.

I tried Reko to decompile a DOS MZ EXE to C. After loading and analysing the EXE I was able to export the code to C. It didn't come in one C file, which was weird.

By default, Reko generates a separate C file for each segment it decompiles. If segments are larger than 64 KiB, it will break them up into 64 KiB chunks. This is done to avoid nightmarishly large output files when decompiling large binaries. You can change this behavior by loading a file, selecting Edit > Properties, and in the General tab, selecting Single file per program.

So I tried merging the C files together. After looking at the code and trying to compile it with gcc I ran into all sorts of issues. First, many types were not defined:

The real32 and real64 and various intxx types are used because the sizes of the C basic types int, float, and double are platform dependent. If you're intent on compiling the output, there are several approaches you could use. For instance, you could create a wrapper C file, build.c, that includes a header of type definitions followed by the Reko output files:

#include "basic_types.h"
#include "myexe_seg0800.c"
#include "myexe_seg0C13.c"

Inside of basic_types.h you create typedefs that are appropriate for your platform:

typedef short int int16;
typedef int int32;
typedef float real32;
// etc...

A case could be made for using the <stdint.h> types int32_t, int16_t etc. I'll open a separate issue to track that.
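
As a rough sketch of what such a basic_types.h could look like when built on <stdint.h>: the list of names below is an assumption pieced together from the compiler errors quoted above (int16, uint8, byte, real64) plus a few same-pattern names (word16, word32), not a confirmed list of everything Reko emits. It also assumes the host's float and double are 4 and 8 bytes wide, which the static assertions check at compile time.

```c
/* basic_types.h: hypothetical mapping of Reko type names to host types. */
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>   /* fixed-width integer types */

typedef int8_t   int8;
typedef uint8_t  uint8;
typedef uint8_t  byte;     /* assumed to be an unsigned 8-bit value */
typedef int16_t  int16;
typedef uint16_t uint16;
typedef uint16_t word16;   /* assumed: 16 bits of unknown signedness */
typedef int32_t  int32;
typedef uint32_t uint32;
typedef uint32_t word32;   /* assumed: 32 bits of unknown signedness */
typedef float    real32;
typedef double   real64;

/* Fail the build early if the floating-point widths do not match. */
static_assert(sizeof(real32) == 4, "real32 must be 4 bytes");
static_assert(sizeof(real64) == 8, "real64 must be 8 bytes");
```

Using the fixed-width types means the typedefs do not have to be revisited when the output is compiled on a platform where short or int has a different size.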

The various unresolved Eq_xxx types are an unaddressed issue. They are likely caused by types being referenced before they are declared/defined. I'll open a separate issue to track that.

The <anonymous> types are caused by failures in the type analysis. These in turn are caused by failures in the data flow analysis, see below.

Second, I see many function calls in the files that are not defined, e.g. SLICE, SEQ, Test etc.

These are remnants of the Reko intermediate language that have not been cleaned up. This is very likely due to a failure in the data flow analysis phase, but without seeing the binary it is impossible to determine why. Consider making the binary available so I can debug it.
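
If you need the merged output to compile before the underlying analysis problem is fixed, stand-in helpers are one possible stopgap. The sketch below assumes that SLICE extracts a narrower value from a wider one starting at a given bit offset, and that SEQ concatenates a high part and a low part; both meanings are assumptions based on the names, and the helper names slice16 and seq32 are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed meaning of SLICE(value, word16, bit_offset): the 16 bits of
   value that start at bit_offset. bit_offset must be less than 32. */
static uint16_t slice16(uint32_t value, unsigned bit_offset)
{
    return (uint16_t)(value >> bit_offset);
}

/* Assumed meaning of SEQ(hi, lo): hi forms the upper 16 bits and lo the
   lower 16 bits of a 32-bit result. */
static uint32_t seq32(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Splitting a value and joining the halves gives back the original. */
static void slice_seq_example(void)
{
    uint32_t v = 0x12345678u;
    assert(seq32(slice16(v, 16), slice16(v, 0)) == v);
}
```

Helpers like these only make the code compile; they do not recover whatever the data flow analysis failed to simplify.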

Also many dos_* function calls, including msdos_unknown_2143. Of course I can try to port the msdos_* to POSIX, but msdos_unknown_2143 is very unknown!

This is another case of the "implement on demand" process I described above. int 21h service 0x43 has several sub-services, selected by the value in the AL register:

  • 00h - Get file attributes
  • 01h - Chmod
  • 02h - Get compressed file size
  • 03h - Get access rights
    ... and quite a few more.

Reko has only implemented the first two -- but it's easy to add support for the others. However, even after implementing those sub-services, Reko needs to know the value in AL at decompilation time. In some cases, AL is set right before the int 21h call, in which case all is well. In other cases, AL may be passed into the procedure where the int 21h is called. In that case it's not possible to decide which sub-service is being requested without more knowledge.
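
To illustrate the decision described above, here is a small sketch of picking a name for an int 21h, AH = 43h call. It is not Reko's implementation: the function int21_43_name and the two service names it returns are hypothetical, and only msdos_unknown_2143 is taken from the report. The point is that both an unimplemented sub-service and an AL value that is not known as a constant end up with the same fallback name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Chooses a name for an int 21h call with AH = 43h.
   al_known is nonzero when a constant value for AL was found at the call
   site; al is only meaningful in that case. */
static const char *int21_43_name(int al_known, uint8_t al)
{
    if (!al_known)
        return "msdos_unknown_2143";   /* AL comes from a caller */
    switch (al) {
    case 0x00: return "msdos_get_file_attributes";  /* hypothetical name */
    case 0x01: return "msdos_set_file_attributes";  /* hypothetical name */
    default:   return "msdos_unknown_2143";         /* not implemented */
    }
}

/* A call site that sets AL = 00h right before the interrupt resolves. */
static void int21_43_example(void)
{
    assert(strcmp(int21_43_name(1, 0x00), "msdos_get_file_attributes") == 0);
}
```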

Consider providing the names of the missing dos__* calls you mentioned, in a separate issue. We need these reports of missing features in order to implement them.

Could someone kindly please explain the process I should go through to try and get clean code to compile.

The best way is to provide Reko with type information. You can specify the type of global memory variables (select a chunk of memory and select Set type), and the parameters and return types of procedures (select a procedure and Edit signature). This will improve Reko's output dramatically. Reach out to the discord channel if you want more help with this.

Thanks again for taking the time to write this issue. I will close this issue, and open other more specific and actionable issues.

@uxmal uxmal closed this as completed Jan 5, 2023
@palmerj
Author

palmerj commented Jan 5, 2023

Thank you very much for your detailed reply. It's much appreciated.

Here is the EXE for your further analysis:

Consider providing the names of the missing dos__* calls you mentioned, in a separate issue. We need these reports of missing features in order to implement them.

Sure:

msdos_create_truncate_file
msdos_unknown_2143
msdos_read_file
msdos_open_file
msdos_close_file
msdos_set_file_position
msdos_allocate_memory_block
msdos_ioctl_get_device_info
msdos_write_file
msdos_delete_file
msdos_terminate
msdos_set_interrupt_vector
msdos_get_interrupt_vector

The best way is to provide Reko with type information. You can specify the type of global memory variables (select a chunk of memory and select Set type), and the parameters and return types of procedures (select a procedure and Edit signature). This will improve Reko's output dramatically. Reach out to the discord channel if you want more help with this.

OK, thanks. I will start looking deeper into this. I guess I will need a debugger to work this out, given the information at hand.

To add to the on demand list:

  • It would also be nice if standard functions (e.g. memset, strdup, fopen etc.) could be automatically detected based on byte sequence signatures. I believe DCC can do this with its sig files, but I can't get that to work on my EXE; it crashes.
  • String analysis could also be better from what I can tell. It's not picking up many of the real strings at all. And when it does, it's not inlining the strings into the decompiled output. I guess the segment and memory models of the DOS MZ format need to be taken into account for this to be robust.
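
For context on why the memory model matters here: in real mode a segment:offset pair maps to a linear address as segment * 16 + offset, so the same string can be reached through many different pairs, and an offset on its own says nothing without the segment register value that goes with it. A minimal sketch of that arithmetic follows; the function name linear_address is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Real-mode address translation: the segment is shifted left by 4 bits
   (multiplied by 16) and the offset is added. */
static uint32_t linear_address(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* Two different segment:offset pairs that name the same byte. */
static void linear_address_example(void)
{
    assert(linear_address(0x0800, 0x0010) == linear_address(0x0801, 0x0000));
}
```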

@palmerj
Author

palmerj commented Jan 5, 2023

Just some quick experiences with setting the sigs and data types. It would be much better to be able to do this via a shortcut or a right-click context menu in the C decompiler view. Also, it seems that after I set a function sig, I need to re-analyse the project, which is a little unexpected.

Many thanks.
