add some EBCDIC encodings #112

Open
wants to merge 2 commits into
base: master


Mithgol
Contributor

@Mithgol Mithgol commented Nov 6, 2015

Fixes #111 partially.

EBCDIC 037 mapping has been taken from https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP037.TXT and automatically converted from 0xXXXX to \uXXXX format for JavaScript.

EBCDIC 1140 is said to be different only at code point 9F (I have manually retyped that difference).
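For illustration, the conversion described above could look roughly like the sketch below. It assumes the mapping file consists of lines shaped like `0xXX<TAB>0xXXXX<TAB>#NAME` with `#` starting a comment; the function names `parseMappingTable` and `withOverride` are made up for this example and are not part of this pull request. The euro sign (U+20AC) at 0x9F is my understanding of the 1140 difference, and the sample lines are a tiny hand-written excerpt, not the full table.

```typescript
// Parse mapping lines of the assumed form "0xXX<TAB>0xXXXX<TAB>#NAME" into a
// 256-character string whose index is the EBCDIC byte value.
function parseMappingTable(text: string): string {
    // U+FFFD marks bytes for which no mapping line was seen.
    const chars: string[] = new Array<string>(256).fill("\uFFFD");
    for (const rawLine of text.split(/\r?\n/)) {
        const line = rawLine.split("#")[0].trim(); // drop comments
        if (line === "") continue;
        const fields = line.split(/\s+/);
        if (fields.length < 2) continue; // byte listed without a mapping
        const byte = parseInt(fields[0], 16);
        const codePoint = parseInt(fields[1], 16);
        if (Number.isNaN(byte) || Number.isNaN(codePoint)) continue;
        if (byte < 0 || byte > 255) continue;
        chars[byte] = String.fromCharCode(codePoint);
    }
    return chars.join("");
}

// Derive a variant table that differs from the base table at a single byte.
function withOverride(table: string, byte: number, replacement: string): string {
    return table.slice(0, byte) + replacement + table.slice(byte + 1);
}

// Hand-written sample in the assumed file format (not the full CP037 table).
const sample = [
    "# comment line",
    "0x15\t0x0085\t#NEXT LINE",
    "0x9F\t0x00A4\t#CURRENCY SIGN",
    "0xC1\t0x0041\t#LATIN CAPITAL LETTER A",
].join("\n");

const cp037Sample = parseMappingTable(sample);
// Assumed 1140 difference: euro sign instead of the currency sign at 0x9F.
const cp1140Sample = withOverride(cp037Sample, 0x9f, "\u20AC");
```

Deriving the variant from the base table in code, rather than retyping it, keeps the two tables from drifting apart.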

Note: this pull request does not contain tests because I am not sure what they should look like.

@Mithgol Mithgol changed the title from "add EBCDIC 037 and EBCDIC 1140 encodings" to "add some EBCDIC encodings" Nov 6, 2015
@Mithgol
Contributor Author

Mithgol commented Nov 6, 2015

EBCDIC 500 mapping has been taken from https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP500.TXT and automatically converted from 0xXXXX to \uXXXX format for JavaScript.

EBCDIC 1148 is said to be different only at code point 9F (I have manually retyped that difference).

@Mithgol Mithgol mentioned this pull request Nov 6, 2015
@devin122

There are some problems here with some of the control mappings. They arise because EBCDIC has a Carriage Return, a New Line, and a Line Feed. In these mappings, EBCDIC control characters that do not translate have been given arbitrary Unicode values starting at 0x80. This includes the NL character (0x20 in EBCDIC), which is assigned U+0080. On the systems I've touched, the EBCDIC NL character is used in place of the LF character for marking EOL.

@Mithgol
Contributor Author

Mithgol commented Apr 16, 2017

Currently Wikipedia says that EBCDIC NL is 0x15 in EBCDIC 500 (and in its variation EBCDIC 1148) and in EBCDIC 037 (and in its variation EBCDIC 1140).

These four are mapped (by the Microsoft mappings, mentioned above) to U+0085 (officially said to be “NEXT LINE” or “NEL”) which seems correct to me.

@devin122

I'm not really sure how many programs handle U+0085 properly. The other side is that, when converting in the other direction, LF being the usual line terminator means it gets converted to EBCDIC LF (0x25). I need to double-check, but the EBCDIC machines I've had access to do not like this at all. They want NL line endings.
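To make the asymmetry concrete, here is a small sketch with a hand-picked partial table. The 0x15 and 0x25 entries follow the values quoted in this thread; the CR, letter, and substitute (0x3F) bytes are my assumptions about EBCDIC 037, and `decode`/`encode` are hypothetical helpers, not iconv-lite functions.

```typescript
// Partial byte -> Unicode table; only the entries needed for the demonstration.
const decodeMap = new Map<number, string>([
    [0x0d, "\r"],     // EBCDIC CR (assumed position)
    [0x15, "\u0085"], // EBCDIC NL -> U+0085 NEXT LINE
    [0x25, "\n"],     // EBCDIC LF -> U+000A LINE FEED
    [0xc1, "A"],
    [0xc2, "B"],
]);

// Reverse table for encoding, built from the same pairs.
const encodeMap = new Map<string, number>();
decodeMap.forEach((ch, byte) => encodeMap.set(ch, byte));

function decode(bytes: number[]): string {
    // Unknown bytes become U+FFFD.
    return bytes.map((b) => decodeMap.get(b) ?? "\uFFFD").join("");
}

function encode(text: string): number[] {
    // Unknown characters become 0x3F (assumed EBCDIC substitute byte).
    return text.split("").map((ch) => encodeMap.get(ch) ?? 0x3f);
}

// A mainframe record separator (NL, 0x15) arrives as U+0085, not as "\n".
const fromMainframe = decode([0xc1, 0x15, 0xc2]);

// A typical Unicode line ending ("\n") leaves as LF (0x25), not as NL (0x15).
const toMainframe = encode("A\nB");
```

With a plain 1:1 table, text that uses `"\n"` never produces the 0x15 byte, which is the behaviour described above.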

@RovoMe

RovoMe commented Jun 30, 2020

As I also have to add support for various EBCDIC encodings, I did a little research on this matter and found an implementation by IBM, which they open-sourced.

Here ConversionMaps is used to map between encodings and code pages (or, more formally, CCSIDs). In ConvTable this mapping is then used to load the respective converter, e.g. ConvTable1140 to map between Unicode and EBCDIC (CCSID 1140 is the Euro update of CCSID 037 according to "Code pages with Latin-1 character sets" in the Wikipedia entry). Skimming through their codebase, a good number of such mappings are available, which might be helpful in adding support for those encodings to iconv-lite.

Using a slightly more complex EBCDIC sample taken from this page, I was able, after some back-and-forth conversions and modifying my local sbcs-data.js file, to validate the ebcdic.txt sample file against the ascii.txt file with a test like this:

    import * as assert from "assert";
    import * as fs from "fs";
    import * as iconv from "iconv-lite";

    it("Read EBCDIC from stream", (done) => {
        // Strip all line breaks from the expected text before comparing.
        const expected: string = fs.readFileSync("./test/ascii.txt", "latin1").replace(/[\r\n]/g, "");

        // https://querysurge.zendesk.com/hc/en-us/articles/215029906-QuerySurge-and-Mainframe-Data-EBCDIC-Files
        // the EBCDIC file is UTF-8 encoded, so we'll need to specify this in the call. For the output
        // ASCII file, we'll use the ISO-8859-1 encoding. The record length for the sample file is 67
        // bytes
        const stream: NodeJS.ReadableStream =
            fs.createReadStream("./test/ebcdic.txt")
                .pipe(iconv.decodeStream("utf8"))
                .pipe(iconv.encodeStream("iso88591"))
                .pipe(iconv.decodeStream("ebcdic037"))
                // .pipe(iconv.decodeStream("ebcdic1140"))
                // .pipe(iconv.decodeStream("ebcdic500"))
                // .pipe(iconv.decodeStream("ebcdic1148"))
        ;

        // Collect the decoded string chunks and compare only once the stream has ended;
        // the done callback keeps the test alive until the assertion has actually run.
        const chunks: string[] = [];
        stream.on("data", (chunk: string) => chunks.push(chunk));
        stream.on("error", done);
        stream.on("end", () => {
            assert.strictEqual(chunks.join(""), expected);
            done();
        });
    });

This sample test works with CCSIDs 037, 277, 280, 284, 285, 297, 500 and 1047, but fails for e.g. 273.

BTW, one can check EBCDIC files in IntelliJ quite easily just by changing the file encoding from the default UTF-8 to e.g. IBM01140 or similar ones. Unfortunately, I need such support in Visual Studio Code, which seems to rely on jschardet and iconv-lite to probe and convert between encodings.

HTH

@ashtuchkin
Owner

Thanks for the research @RovoMe! Any specific action items you would like to add here, or is it mostly additional info?

I always try to generate the encodings directly from authoritative sources, e.g. see in https://github.com/ashtuchkin/iconv-lite/blob/master/generation/gen-dbcs.js we download corresponding tables from unicode.org or encoding.spec.whatwg.org.

To support EBCDIC, ideally I'd want something like gen-ebcdic.js that downloads the tables from unicode.org and transforms them to iconv-lite format. Java sources do not work great for that purpose, unfortunately.
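As a rough idea of the "transform" half of such a generator, the sketch below renders an already-parsed 256-character table as source text. It assumes the `{ "type": "_sbcs", "chars": "..." }` entry shape used in sbcs-data.js; `toEscapedLiteral` and `renderSbcsEntry` are hypothetical names, the encoding name in the usage lines is made up, and downloading the tables is left out entirely.

```typescript
// Render a table as a double-quoted string literal made of \uXXXX escapes.
function toEscapedLiteral(table: string): string {
    let out = "";
    for (let i = 0; i < table.length; i++) {
        const hex = table.charCodeAt(i).toString(16).toUpperCase();
        out += "\\u" + ("0000" + hex).slice(-4); // left-pad to four hex digits
    }
    return '"' + out + '"';
}

// Render one entry in the assumed sbcs-data.js shape; the table must cover
// all 256 byte values because EBCDIC does not share its lower half with ASCII.
function renderSbcsEntry(name: string, table: string): string {
    if (table.length !== 256) {
        throw new Error(`expected 256 characters, got ${table.length}`);
    }
    return `"${name}": { "type": "_sbcs", "chars": ${toEscapedLiteral(table)} },`;
}

// Usage with a placeholder identity table (byte i -> U+00ii), not a real code page.
let identityTable = "";
for (let i = 0; i < 256; i++) identityTable += String.fromCharCode(i);
const demoEntry = renderSbcsEntry("ebcdic-demo", identityTable);
```

Escaping every character keeps the generated file pure ASCII, so control characters in the table cannot be mangled by editors or diffs.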

Also I think the NL concern by @devin122 is valid (see https://en.wikipedia.org/wiki/Newline#Representation). We might want to address it by 1) encoding/decoding without changes by default, which would keep a 1:1 representation of all latin1 characters, and 2) adding a codec option like EBCDICNLConversion: '\n', which would enable conversion of the NL char to the corresponding char(s). This conversion can probably be a separate PR.
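A minimal sketch of how such an option might behave, written as standalone string helpers rather than as a real codec. The option name `nlConversion` and the functions `afterDecode` and `beforeEncode` are hypothetical; nothing like this exists in iconv-lite today, and it assumes NL has already been decoded to U+0085 as in the mappings discussed here.

```typescript
// Hypothetical option, modelled on the proposed EBCDICNLConversion.
interface NlOptions {
    nlConversion?: string; // e.g. "\n"; leave undefined to keep the 1:1 mapping
}

const NEL = "\u0085"; // what EBCDIC NL decodes to in these mappings

// Applied to text after decoding: turn NEL into the configured line ending.
function afterDecode(text: string, options: NlOptions = {}): string {
    if (options.nlConversion === undefined) return text;
    return text.split(NEL).join(options.nlConversion);
}

// Applied to text before encoding: turn the configured line ending into NEL.
function beforeEncode(text: string, options: NlOptions = {}): string {
    if (options.nlConversion === undefined || options.nlConversion === "") return text;
    return text.split(options.nlConversion).join(NEL);
}

const unchanged = afterDecode("A\u0085B");
const converted = afterDecode("A\u0085B", { nlConversion: "\n" });
const roundTrip = beforeEncode(converted, { nlConversion: "\n" });
```

Keeping the default a no-op preserves the lossless mapping, while the opt-in path loses the distinction between NL and LF by design.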

Finally, FYI, we do work on integrating iconv-lite into VS Code, but it hasn't happened yet.

@Fish1

Fish1 commented Aug 11, 2021

I would like this please. Thanks!

@Dman247

Dman247 commented Aug 11, 2021

Agreed, it would be very helpful to have the capability of opening encodings like CP037.

@GitMensch

Is there any chance this PR is going forward?

@GitMensch

vscode depends on this issue: microsoft/vscode#49891 is "the big and old" one; duplicates include at least microsoft/vscode#147064 and microsoft/vscode#179693.

@ashtuchkin Can you take a look at integrating this and publish a new version?
