support .format for bytes #48232

Closed
benjaminp opened this issue Sep 27, 2008 · 96 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement

Comments

@benjaminp
Copy link
Contributor

BPO 3982
Nosy @loewis, @warsaw, @brettcannon, @terryjreedy, @gpshead, @ncoghlan, @pitrou, @vstinner, @ericvsmith, @tiran, @benjaminp, @glyph, @ezio-melotti, @florentx, @vadmium, @serhiy-storchaka
Files
  • byte_format.py: Imitate str.format with bytes function
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2014-07-26.07:00:02.695>
    created_at = <Date 2008-09-27.15:50:40.624>
    labels = ['interpreter-core', 'type-feature']
    title = 'support .format for bytes'
    updated_at = <Date 2016-06-10.22:00:08.951>
    user = 'https://github.com/benjaminp'

    bugs.python.org fields:

    activity = <Date 2016-06-10.22:00:08.951>
    actor = 'ncoghlan'
    assignee = 'none'
    closed = True
    closed_date = <Date 2014-07-26.07:00:02.695>
    closer = 'ncoghlan'
    components = ['Interpreter Core']
    creation = <Date 2008-09-27.15:50:40.624>
    creator = 'benjamin.peterson'
    dependencies = []
    files = ['32009']
    hgrepos = []
    issue_num = 3982
    keywords = []
    message_count = 95.0
    messages = ['73931', '73935', '73936', '73937', '73938', '73939', '74019', '74021', '74022', '74050', '84121', '84123', '90421', '90423', '90425', '90428', '127210', '130215', '130253', '130284', '163369', '163379', '171791', '171795', '171796', '171799', '171800', '171801', '171803', '171804', '171806', '171815', '171816', '171821', '171824', '180414', '180415', '180416', '180419', '180420', '180423', '180426', '180427', '180430', '180431', '180432', '180433', '180436', '180437', '180439', '180441', '180442', '180445', '180446', '180447', '180448', '180449', '180452', '180453', '180454', '180466', '180489', '180490', '180491', '180492', '180493', '180500', '198112', '199181', '199199', '199203', '199204', '199206', '199207', '199251', '199253', '199254', '199258', '199260', '199264', '199265', '199266', '199267', '199268', '199270', '199271', '199432', '199438', '223976', '223979', '224022', '224023', '266568', '268157', '268160']
    nosy_count = 26.0
    nosy_names = ['loewis', 'barry', 'brett.cannon', 'terry.reedy', 'gregory.p.smith', 'exarkun', 'ncoghlan', 'pitrou', 'vstinner', 'eric.smith', 'christian.heimes', 'benjamin.peterson', 'glyph', 'ezio.melotti', 'durin42', 'Arfrever', 'arjennienhuis', 'flox', 'ecir.hana', 'uau', 'tshepang', 'underrun', 'martin.panter', 'serhiy.storchaka', '[email protected]', 'stendec']
    pr_nums = []
    priority = 'normal'
    resolution = 'wont fix'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue3982'
    versions = ['Python 3.5']

    @benjaminp
    Copy link
    Contributor Author

    I'm just working on porting some networking code from 2.x to 3.x and it
    heavily uses string formatting. Since bytes don't support any kind of
    formatting, it's becoming tedious and inelegant to do it with "+". Can
    .format be supported in bytes?

    [I understand format is implemented with stringlib so shouldn't it be
    fairly easy to implement?]

    @benjaminp benjaminp added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement labels Sep 27, 2008
    @ericvsmith
    Copy link
    Member

    Yes, it would be easy to add. Maybe bring this up on python-dev (or
    python-3000) to get consensus?

    Are we in feature freeze for 3.0?

    @benjaminp
    Copy link
    Contributor Author

    On Sat, Sep 27, 2008 at 12:33 PM, Eric Smith <[email protected]> wrote:

    Eric Smith <[email protected]> added the comment:

    Yes, it would be easy to add. Maybe bring this up on python-dev (or
    python-3000) to get consensus?

    Yes, that will have to be done.

    Are we in feature freeze for 3.0?

    Unfortunately, yes.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Sep 27, 2008

    I'm skeptical. What networking code specifically are you using, and what
    specifically does it use string formatting for?

    @benjaminp
    Copy link
    Contributor Author

    On Sat, Sep 27, 2008 at 12:35 PM, Martin v. Löwis
    <[email protected]> wrote:

    Martin v. Löwis <[email protected]> added the comment:

    I'm skeptical. What networking code specifically are you using, and what
    specifically does it use string formatting for?

    I'm working on the tests for ftplib. [1] The dummy server uses string
    formatting to build responses.

    [1] https://svn.python.org/view/python/trunk/Lib/test/test_ftplib.py?view=markup

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Sep 27, 2008

    I'm working on the tests for ftplib. [1] The dummy server uses string
    formatting to build responses.

    I see. I propose to add a method push_string, defined as

      def push_string(self, s):
          self.push(s.encode("ascii"))

    In FTP, the responses are, by definition, ASCII-encoded strings.
    The proper way to generate them is to make a string, then encode it.

    @vstinner
    Copy link
    Member

    I don't think that b'...'.format() is a good idea. Programmers will
    continue to mix characters and bytes since .format() targets
    characters.

    @ericvsmith
    Copy link
    Member

    I don't think that b'...'.format() is a good idea. Programmers
    will continue to mix characters and bytes since .format() targets
    characters.

    b''.format() would return bytes, not a string. This is also how it works
    in 2.6.

    I'm also not sold on implementing it, although it would be easy and I
    can see a few uses for it. I think Martin's suggestion of encoding back
    to ascii might be the best thing to do (that is, don't implement
    b''.format()).

    @vstinner
    Copy link
    Member

    I think Martin's suggestion of encoding back to ascii might be
    the best thing to do

    As I understand, you would like to use bytes as characters, like
    b'{code} {message}'.format(code=100, message='OK'). So why not use
    explicit conversion to ASCII? ftp='{code} {message}'.format(code=100,
    message='OK').encode('ASCII').

    If you need to work on bytes, it means that you will use the full
    range 0..255 whereas ASCII rejects bytes in 128..255.
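
A minimal sketch of the format-then-encode approach described above, assuming a text-only protocol such as FTP replies. The helper name `ftp_reply` is hypothetical and is not part of ftplib or its tests. The second call illustrates the limitation mentioned here: anything outside 0..127 cannot pass through the ASCII codec.

```python
def ftp_reply(code, message):
    """Build a reply line as text, then encode it to ASCII bytes."""
    line = "{code} {message}\r\n".format(code=code, message=message)
    return line.encode("ascii")


reply = ftp_reply(100, "OK")

# A non-ASCII character cannot be encoded, so this approach only
# suits protocols whose payload is pure ASCII text.
try:
    ftp_reply(100, "caf\u00e9")
    non_ascii_ok = True
except UnicodeEncodeError:
    non_ascii_ok = False
```

This works when every interpolated value is text; it does not help when an argument is already an arbitrary byte string.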

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Sep 29, 2008

    > I think Martin's suggestion of encoding back to ascii might be
    > the best thing to do

    As I understand, you would like to use bytes as characters, like
    b'{code} {message}'.format(code=100, message='OK'). So why not use
    explicit conversion to ASCII? ftp='{code} {message}'.format(code=100,
    message='OK').encode('ASCII').

    That's indeed exactly what I had proposed - only that you shouldn't
    repeat the .encode('ascii') all over the place, but instead wrap that
    into a function (which I proposed to call push_string, along with the
    existing .push function).

    @vstinner
    Copy link
    Member

    loewis> That's indeed exactly what I had proposed
    loewis> - only that you shouldn't repeat the .encode('ascii')
    loewis> all over the place, (...)

    If you can only use bytes 0..127, it cannot be used for binary protocols
    and so I don't think that it's really useful. If your protocol is
    ASCII text, use explicit conversion to ASCII.

    I'm also not a fan of functions having different result types
    (format->bytes or str, it depends...).

    @ericvsmith
    Copy link
    Member

    I'm also not a fan of functions having different result types
    (format->bytes or str, it depends...).

    In 3.x, str.format() and bytes.format() would be two different methods
    on two different objects. I don't think there's any expectation that
    they have the same return type. There's no such expectation for
    str.strip() and bytes.strip() either.

    Similarly, in 2.6, str.format() has a different return type than
    unicode.format().

    Now the builtin format() function is another issue. In 2.6 the return
    type does depend on the types of the arguments. In 3.x, I'd suggest
    leaving it as unicode and you won't be allowed to pass in bytes.

    @arjennienhuis
    Copy link
    Mannequin

    arjennienhuis mannequin commented Jul 11, 2009

    There are many binary formats that use ASCII numbers.

    'HTTP chunking' uses ASCII mixed with binary (octets).

    With 2.6 you could write:

    def chunk(block):
        return b'{0:x}\r\n{1}\r\n'.format(len(block), block)

    With 3.0 you'd have to write this:

    def chunk(block):
        return format(len(block), 'x').encode('ascii') + b'\r\n' + block + b'\r\n'

    You cannot convert to ascii at the end of the pipeline as there are
    bytes > 127 in the data blocks.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Jul 11, 2009

    def chunk(block):
        return format(len(block), 'x').encode('ascii') + b'\r\n' + block + b'\r\n'

    You cannot convert to ascii at the end of the pipeline as there are
    bytes > 127 in the data blocks.

    I wouldn't write it in such a complicated way. Instead, use

    def chunk(block):
       return hex(len(block)).encode('ascii') + b'\r\n' + block + b'\r\n'

    This doesn't need any format call, and describes adequately how the
    protocol works: send an ASCII-encoded hex length, send CRLF, send
    the block, then send another CRLF. Of course, I would probably write
    that into the socket right away, rather than copying it into a different
    bytes object first.

    @arjennienhuis
    Copy link
    Mannequin

    arjennienhuis mannequin commented Jul 11, 2009

    def chunk(block):
      return hex(len(block)).encode('ascii') + b'\r\n' + block + b'\r\n'

    hex(10) returns '0xa' instead of 'a'.

    This doesn't need any format call, and describes adequately how the
    protocol works: send an ASCII-encoded hex length, send CRLF, send
    the block, then send another CRLF. Of course, I would probably write
    that into the socket right away, rather than copying it into a different
    bytes object first.

    The point is that you need to convert to ascii for each int that you send.
    You cannot just wrap the socket with an encoding. This makes porting
    difficult.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Jul 11, 2009

    hex(10) returns '0xa' instead of 'a'.

    Ah, right. So I would still use

    '{0:x}'.format(100).encode("ascii")

    rather than the builtin format function. Actually, I would
    probably use

    ('%x' % len(bytes)).encode("ascii")
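
Put together, the suggestion above might look like the following sketch. It assumes the chunk framing described earlier in this thread (hex length, CRLF, payload, CRLF); only the length is formatted as text and encoded, while the payload stays untouched.

```python
def chunk(block):
    """Frame one HTTP chunk: hex length, CRLF, payload, CRLF."""
    # '%x' gives bare hex digits ('a'), whereas hex() would give '0xa'.
    size = ("%x" % len(block)).encode("ascii")
    return size + b"\r\n" + block + b"\r\n"


# 12 bytes of payload, including values above 127 that could never
# be produced by encoding text to ASCII.
payload = bytes([0xFF, 0x00, 0x80]) * 4
framed = chunk(payload)
```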

    The point is that you need to convert to ascii for each int that you send.
    You cannot just wrap the socket with an encoding. This makes porting
    difficult.

    This I don't understand. What porting becomes more difficult?
    From 2.x to 3.x? Why do you have any .format calls in your code that you
    want to port - .format was only added in 2.6, so if you want to support
    2.x, you surely are not using .format, are you?

    @uau
    Copy link
    Mannequin

    uau mannequin commented Jan 27, 2011

    This kind of formatting is needed quite often when working on network protocols or file formats, and I think the replies here fail to address important issues. In general you can't encode after formatting, as that doesn't work with binary data, and often it's not appropriate for the low-level routines doing the formatting to know what charset the data is in even if it is text (so it should be fed in already encoded as bytes). The replies from Martin v. Löwis seem to argue that you could use methods other than formatting; that would work almost as well as an argument to remove formatting support from text strings, and IMO cases where formatting is the best option are common.

    Here's an example (based on real use but simplified):

    template = b"""
    stuff here
    header1: {}
    header2: {}
    more stuff
    """
    
    def lowlevel_send(s, b1, b2):  # s socket, b1 and b2 bytes
        s.send(template.format(b1, b2))

    To clarify the requirements a bit, the issue is not so much about having a .format method on byte string objects (that's just the most natural-looking way of solving it); the core requirement is to have a formatting operator that can take byte strings as *arguments* and produce byte string *output* where the arguments can be placed unchanged.

    @terryjreedy
    Copy link
    Member

    For future reference, struct.pack, not mentioned here, is a binary bytes formatting function. It can mix ascii bytes with binary octets. It works the same in Python 2 and 3.

    str.format does two things: convert objects to strings according to the contents of field specifiers; interpolate the resulting strings into a template string according to the locations of the field specifiers. If the desired bytes represent encoded text, then encoding computed text is the obvious Py3 solution.

    For some mixed ascii-binary uses, struct.pack is not as elegant as a bytes.format might be. But I think such a method should use struct format codes within field specifiers to convert objects into binary bytes rather than text.
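
For readers unfamiliar with struct, here is a small sketch of mixing an ASCII tag with binary integers. The record layout (a 4-byte tag, a 16-bit version, a 32-bit length) is made up for illustration and does not correspond to any real protocol.

```python
import struct

# ">" selects big-endian byte order with standard sizes and no padding.
# 4s = four raw bytes, H = unsigned 16-bit int, I = unsigned 32-bit int.
HEADER = struct.Struct(">4sHI")

packed = HEADER.pack(b"DATA", 1, 258)

# Unpacking reverses the conversion and returns a tuple of fields.
tag, version, length = HEADER.unpack(packed)
```

Every field here has a fixed width, which is why this fits fixed-layout headers better than free-form text templates.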

    @arjennienhuis
    Copy link
    Mannequin

    arjennienhuis mannequin commented Mar 7, 2011

    struct.pack does not work with variable length data. Something like:

    b'{0:x}\r\n{1}\r\n'.format(len(block), block)

    or

    b'%x\r\n%s\r\n' % (len(block), block)

    is not possible with struct.pack

    @terryjreedy
    Copy link
    Member

    You are right, I misinterpreted the meaning of 's' without a count (and opened bpo-11436 to clarify). However, for the fairly common case where a variable-length binary block is preceded by a 4 byte *binary* count, one can do something which is not too bad:

    >>> block = b'lsfjdlksaj'
    >>> n=len(block)
    >>> struct.pack('I%ds'%n, n, block)
    b'\n\x00\x00\x00lsfjdlksaj'

    If leading blanks are acceptable for your example with count as ascii hex digits, one can do something that I admit is worse:

    >>> struct.pack('10s%ds2s'%n, ('%8x\r\n'%n).encode(), block, b'\r\n')
    b'       a\r\nlsfjdlksaj\r\n'

    Of course, for either of these in isolation, I would probably only use .pack for the binary conversion and otherwise use '+' or b''.join(...).

    @uau
    Copy link
    Mannequin

    uau mannequin commented Jun 21, 2012

    I've hit this limitation a couple more times, and none of the proposed workarounds are adequate. Working with protocols and file formats that use human-readable markup is significantly clumsier than it was with Python 2 (using either the % operator, which also lost its support for byte strings in Python 3, or .format()).

    This bug report was closed by its original creator, after early posts where IMO nobody made as good a case for the feature as they could have. Is it possible to reopen this bug or is it necessary to file a new one?

    Is there any clear argument AGAINST having .format() for bytes, other than work needed to implement it? Some posts mention "mixing characters and bytes", but I see no reason why this would be much of a real practical concern if it's a method on bytes objects producing bytes output.

    @terryjreedy
    Copy link
    Member

    If you want to discuss this issue further, I think you should post to the python-ideas list with concrete examples.

    @exarkun
    Copy link
    Mannequin

    exarkun mannequin commented Oct 2, 2012

    Since Benjamin originally requested this feature, and then decided that he could accomplish his desired goal (ftplib porting, as far as I can tell) without it, I think that the "rejected" status is actually incorrect. I think that Benjamin just wanted to indicate that he no longer needed the feature. This doesn't mean that no one else will need the feature, and as it turns out the comments seem to reveal that other people do need the feature (also, I need the feature).

    So, adjusting the ticket metadata to reflect that this is a valid feature request just waiting for someone to implement it, not a rejected idea that is not welcome in Python.

    @exarkun exarkun mannequin reopened this Oct 2, 2012
    @tiran
    Copy link
    Member

    tiran commented Oct 2, 2012

    The proposal sounds like a good idea to me.

    Benjamin, what needs to be done to implement the feature?

    @serhiy-storchaka
    Copy link
    Member

    Formatting is a very complicated part of Python (especially after Victor's optimizations). I think no one wants to maintain this code for a long time. The price of maintaining exceeds the potential very limited benefits from the use.

    @ericvsmith
    Copy link
    Member

    I was just logging in to make this point, but Serhiy beat me to it. When I wrote several years ago that this was "easy", it was before the (awesome) PEP-393 work. I suspect, but have not verified, that having a bytes version of this code would now require an implementation that shared very little with the str version.

    So I think Martin's advice to just encode to ascii is the best course of action.

    @pitrou
    Copy link
    Member

    pitrou commented Oct 8, 2013

    I'd like to put a nudge towards supporting the __mod__ interface on bytes -
    for Mercurial this is the single biggest impediment to even getting our
    testrunner working, much less starting the porting process.

    Given a spec hasn't been written (bytes.__mod__ can't support the same things as str.__mod__), and nobody seems to step up to write it, I'd say this is unlikely to appear in 3.4.

    @durin42
    Copy link
    Mannequin

    durin42 mannequin commented Oct 8, 2013

    Is there any chance we could just have it work for bytes, ints, and floats? That'd solve the immediate need, and it'd be obviously correct how to have those behave.

    Punting this to 3.5 basically means we'll have to either wait for 3.5, or do something awful like use cffi to grab sprintf to port Mercurial.

    @ericvsmith
    Copy link
    Member

    If you could write up a concrete proposal, including which format specifiers would be supported, that would be helpful.

    Would it be extensible with something like __bformat__?

    There's really quite a bit of work to be done to specify how this would work.

    @ericvsmith
    Copy link
    Member

    Also, with the PEP-393 changes, the implementation will be much more difficult. Sharing code with str (unicode) will likely be impossible, or require much refactoring of the existing code.

    @pitrou
    Copy link
    Member

    pitrou commented Oct 8, 2013

    Is there any chance we could just have it work for bytes, ints, and
    floats? That'd solve the immediate need, and it'd be obviously
    correct how to have those behave.

    You mean "%s" and "%d"?

    Punting this to 3.5 basically means we'll have to either wait for
    3.5, or do something awful like use cffi to grab sprintf to port
    Mercurial.

    Or write a pure Python implementation.

    @durin42
    Copy link
    Mannequin

    durin42 mannequin commented Oct 8, 2013

    On Tue, Oct 8, 2013 at 11:08 AM, Antoine Pitrou <[email protected]>wrote:

    > Is there any chance we could just have it work for bytes, ints, and
    > floats? That'd solve the immediate need, and it'd be obviously
    > correct how to have those behave.

    You mean "%s" and "%d"?

    Basically, yes.

    > Punting this to 3.5 basically means we'll have to either wait for
    > 3.5, or do something awful like use cffi to grab sprintf to port
    > Mercurial.

    Or write a pure Python implementation.

    Hah. Probably too slow for anything beyond a proof of concept, no?

    @glyph
    Copy link
    Mannequin

    glyph mannequin commented Oct 8, 2013

    On Oct 8, 2013, at 8:10 AM, Augie Fackler <[email protected]> wrote:

    Hah. Probably too slow for anything beyond a proof of concept, no?

    It should perform acceptably on PyPy ;-).

    @pitrou
    Copy link
    Member

    pitrou commented Oct 8, 2013

    > > Punting this to 3.5 basically means we'll have to either wait for
    > > 3.5, or do something awful like use cffi to grab sprintf to port
    > > Mercurial.
    >
    > Or write a pure Python implementation.

    Hah. Probably too slow for anything beyond a proof of concept, no?

    If it's only for the Mercurial test suite, that shouldn't be a problem?

    @durin42
    Copy link
    Mannequin

    durin42 mannequin commented Oct 8, 2013

    On Tue, Oct 8, 2013 at 5:11 PM, Antoine Pitrou <[email protected]>wrote:

    Antoine Pitrou added the comment:

    > > > Punting this to 3.5 basically means we'll have to either wait for
    > > > 3.5, or do something awful like use cffi to grab sprintf to port
    > > > Mercurial.
    > >
    > > Or write a pure Python implementation.
    >
    > Hah. Probably too slow for anything beyond a proof of concept, no?

    If it's only for the Mercurial test suite, that shouldn't be a problem?

    It's not just the testsuite though: we do this _all over_ hg itself. For
    example, status needs to do something like this:

    sys.stdout.write('%(state)s %(path)s\n' % {'state': 'M', 'path':
    'some/filesystem/path'})

    except we don't know the encoding of the filesystem path (Hi unix!) so we
    have to treat the whole thing as opaque bytes. It's even more fun for
    'log', because then it's got localized strings in it as well.

    @vstinner
    Copy link
    Member

    vstinner commented Oct 8, 2013

    2013/10/8 Augie Fackler <[email protected]>:

    sys.stdout.write('%(state)s %(path)s\n' % {'state': 'M', 'path':
    'some/filesystem/path'})

    except we don't know the encoding of the filesystem path (Hi unix!) so we
    have to treat the whole thing as opaque bytes.

    You are doing it wrong. In Python 3, you "should" store filenames as
    Unicode (str type). If Python fails to decode a filename, undecodable
    bytes are stored as surrogate characters (see the PEP-383).

    The Unicode type became natural in Python 3, as byte string (old "str"
    type) was natural in Python 2.

    sys.stdout.write() expects a Unicode string, not a byte string.

    Does it mean that Mercurial is moving to Python 3? Cool :-)

    @ericvsmith
    Copy link
    Member

    I've lost track what we were talking about. I thought we were trying to support b'<something>'.format() in 3.4, for a restricted set of arguments.

    I don't see how a third-party package is going to help, if the goal is to allow 3.4 to be source compatible with 2.7. And the recent example uses %-formatting, which is not the subject of this ticket.

    What proposal is actually on the table here?

    @glyph
    Copy link
    Mannequin

    glyph mannequin commented Oct 8, 2013

    On Oct 8, 2013, at 2:35 PM, Eric V. Smith wrote:

    What proposal is actually on the table here?

    Sorry Eric, you're right, there is too much discussion here. This issue ought to be about .format, like the title says. There should be a separate ticket for %-formatting, since it seems to be an almost wholly unrelated task. While I'm sympathetic to Mercurial's issues, they're somewhat different from Twisted's, in that we're willing to adopt the "one new way" to do things in order to achieve compatibility whereas that would be too hard for Mercurial.

    @durin42
    Copy link
    Mannequin

    durin42 mannequin commented Oct 8, 2013

    On Oct 8, 2013, at 5:24 PM, STINNER Victor <[email protected]> wrote:

    STINNER Victor added the comment:

    2013/10/8 Augie Fackler <[email protected]>:
    > sys.stdout.write('%(state)s %(path)s\n' % {'state': 'M', 'path':
    > 'some/filesystem/path'})
    >
    > except we don't know the encoding of the filesystem path (Hi unix!) so we
    > have to treat the whole thing as opaque bytes.

    You are doing it wrong. In Python 3, you "should" store filenames as
    Unicode (str type). If Python fails to decode a filename, undecodable
    bytes are stored as surrogate characters (see the PEP-383).

    No, I'm not. In Mercurial, all end-user data is OPAQUE BYTES, and must remain that way. We're not able to change either our on-disk data format OR our stdout format, even to support a newer version of Python. I don't know the encoding of the filename's bytes, but I _must_ faithfully reproduce them exactly as they are or I'll break tools like make(1) and patch(1). Similarly, if a file goes from ISO-8859-1 to UTF-8, I have to emit a diff that has some ISO bytes and some UTF bytes - it's not in *any* valid encoding. Changing that is a showstopper regression.

    The Unicode type became natural in Python 3, as byte string (old "str"
    type) was natural in Python 2.

    sys.stdout.write() expects a Unicode string, not a byte string.

    Ouch. Is there any way to write things to stderr and stdout without decoding and hopelessly breaking user data?

    Does it mean that Mercurial is moving to Python 3? Cool :-)

    Not likely, honestly. I tackle this when I've got some spare cycles and my ability to handle pain is high. As it stands, I have the test-runner barely working, but it's making wrong assumptions to get there. The best estimate is that it's a year of work to upgrade to Python 3.



    @durin42
    Copy link
    Mannequin

    durin42 mannequin commented Oct 8, 2013

    On Oct 8, 2013, at 6:19 PM, Glyph Lefkowitz <[email protected]> wrote:

    Glyph Lefkowitz added the comment:

    On Oct 8, 2013, at 2:35 PM, Eric V. Smith wrote:

    > What proposal is actually on the table here?

    Sorry Eric, you're right, there is too much discussion here. This issue ought to be about .format, like the title says. There should be a separate ticket for %-formatting, since it seems to be an almost wholly unrelated task. While I'm sympathetic to Mercurial's issues, they're somewhat different from Twisted's, in that we're willing to adopt the "one new way" to do things in order to achieve compatibility whereas that would be too hard for Mercurial.

    Yeah, my bad too. I suppose I should add a new bug for %-formatting on bytes objects?

    Note that for hg, we can't drop Python 2.6 or so (we'll only drop *2.4* if we can do 2.6 and some 3.x from a single source tree) for a while, due to supporting the system interpreter on a variety of LTS platforms.

    @terryjreedy
    Copy link
    Member

    Augie, to understand what Victor meant, I suggest reading
    https://www.python.org/dev/peps/pep-0383/
    One point of the pep is round-trip filenames without loss on all systems, which is just what you say you need.

    @durin42
    Copy link
    Mannequin

    durin42 mannequin commented Oct 8, 2013

    On Oct 8, 2013, at 6:28 PM, "Terry J. Reedy" <[email protected]> wrote:

    https://www.python.org/dev/peps/pep-0383/
    One point of the pep is round-trip filenames without loss on all systems, which is just what you say you need.

    At a quick skim, likely not good enough, because https://en.wikipedia.org/wiki/Shift_JIS isn't completely ASCII-compatible, and we've got a fair number of users on weird Shift-JIS using platforms.

    @glyph
    Copy link
    Mannequin

    glyph mannequin commented Oct 8, 2013

    On Oct 8, 2013, at 3:19 PM, Augie Fackler wrote:

    No, I'm not. In Mercurial, all end-user data is OPAQUE BYTES, and must remain that way.

    The PEP-383 technique for handling file names is completely capable of round-tripping exact bytes, given one encoding for both input and output. You can still handle file names this way internally in Mercurial and not risk disturbing any observable output. You do not need to change that in order to do what Victor suggests.
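
The round trip can be illustrated with the `surrogateescape` error handler that PEP 383 introduced. This is only a sketch: the path is invented and UTF-8 is an assumed encoding. The point is that the same encoding and error handler are used in both directions.

```python
# A path containing bytes that are not valid UTF-8.
raw = b"some/\xff\xfe/path"

# Undecodable bytes become lone surrogates U+DCFF and U+DCFE
# instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same codec and handler restores the exact bytes.
restored = text.encode("utf-8", "surrogateescape")
```

On POSIX systems, `os.fsdecode` and `os.fsencode` apply the same handler using the filesystem encoding.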

    We should get together in some other forum and discuss file-name handling though, since you can't actually round-trip "opaque bytes" through a *filesystem* and not disturb your output.

    Ouch. Is there any way to write things to stderr and stdout without decoding and hopelessly breaking user data?

    You can use sys.stdout.buffer.write.

    @terryjreedy
    Copy link
    Member

    Here is a proof of concept Python function, with a minimal test. It is similar to how str.format could be coded in Python, with re.split and ''.join, except that it does not allow anything before : in the format specification. By default (no format spec given), it copies bytes objects without change. If a format specification *is* given, it does not restrict the object, as this code simply uses builtin format sandwiched between decode and encode.
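
The attached file is not reproduced on this page. The following is an independent sketch of the behaviour described in this comment, not the attached code: the name `bytes_format`, its signature, and the regular expression are all assumptions. It accepts only `{}` and `{:spec}` fields and does not handle escaped braces, field names, or nested fields.

```python
import re

# Matches "{}" or "{:spec}"; nothing is allowed before the colon.
_FIELD = re.compile(rb"\{(:[^{}]*)?\}")


def bytes_format(template, args):
    """Interpolate args into a bytes template, one field per argument."""
    parts = []
    pos = 0
    for index, match in enumerate(_FIELD.finditer(template)):
        parts.append(template[pos:match.start()])
        value = args[index]          # IndexError if too few arguments
        spec = match.group(1)
        if spec is None:
            # No format spec: the value must already be bytes and is
            # copied unchanged (b"".join rejects anything else).
            parts.append(value)
        else:
            # With a spec: use the builtin format() on the text side,
            # sandwiched between decode and encode.
            text = format(value, spec[1:].decode("ascii"))
            parts.append(text.encode("ascii"))
        pos = match.end()
    parts.append(template[pos:])
    return b"".join(parts)


framed = bytes_format(b"{:x}\r\n{}\r\n", [10, b"\xff\x00"])
```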

    @ezio-melotti
    Copy link
    Member

    You can use sys.stdout.buffer.write.

    Note that there's no guarantee that sys.stdout.buffer exists, e.g. if sys.stdout has been replaced with a StringIO.
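
One defensive way to handle both cases is sketched below. The helper name `write_bytes` is hypothetical, and the UTF-8 fallback with `surrogateescape` is an assumption made for the example, not a recommendation from this thread.

```python
import io
import sys


def write_bytes(data, stream=None):
    """Write raw bytes to a text stream, using .buffer when it exists."""
    stream = sys.stdout if stream is None else stream
    buffer = getattr(stream, "buffer", None)
    if buffer is not None:
        stream.flush()               # keep text and binary output ordered
        buffer.write(data)
        buffer.flush()
    else:
        # No binary layer (for example io.StringIO): fall back to text.
        stream.write(data.decode("utf-8", "surrogateescape"))


# A text wrapper over a binary sink has a .buffer attribute.
binary_sink = io.BytesIO()
wrapped = io.TextIOWrapper(binary_sink, encoding="utf-8")
write_bytes(b"M some/\xff/path\n", wrapped)

# io.StringIO has no .buffer, so the fallback branch is used.
text_sink = io.StringIO()
write_bytes(b"M some/path\n", text_sink)
```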

    @glyph
    Copy link
    Mannequin

    glyph mannequin commented Oct 11, 2013

    Tempting as it is to reply to the comment about 'buffer' not existing, we're way off topic here. Let's please keep further comments on this bug to issues about a 'format' method on the 'bytes' object.

    @underrun
    Copy link
    Mannequin

    underrun mannequin commented Jul 25, 2014

    First off, +1 for this feature. It's not just for twisted, but anyone doing anything with binary data (storage, compression, encryption and networking for me) with python since 2.6 will very likely have been using .format for building messages. I know I have and obviously others have been doing so as well.

    The advantages of .format to me are:

    • compatible with 2.6 (porting and single code base support easier)
    • ease of composition (the format language makes it easy to build complex data structures out of bytes)
    • readability (named fields make complex formats obvious)
    • consistency (manipulating a block of bytes or characters can be done in a similar way)

    Specific comments on the patch supplied by terry.reedy:

    • it doesn't support named fields
    • it doesn't handle padding
    • it doesn't handle nested formats (like '{0:{1}>{2}}'.format(data,pad_char,pad_width))
    • formatting byte strings with a width embeds the repr of the byte string ( bf(b'{:>10}', [b'test']) == b" b'test'" )

    Really this isn't a good way to solve the problem.

    Has a PEP been created for this? If not how can I help make that happen?

    Including this in 3.5 would be so helpful for us low level systems programmers out there who have lots of code using .format for binary interfaces in python 2.6/2.7 already.

    Also, not to add to derailment, but if we're adding a .format for python3 bytes it would be great if .format could pad with the null byte ('\0') which it currently converts to spaces internally (which is strange). Since this unexpected conversion is bad (so padding with null doesn't happen in python2) it's more like a bug fix... actually - maybe that's a separate bug to file on the current .format for text...

    @underrun
    Copy link
    Mannequin

    underrun mannequin commented Jul 25, 2014

    sorry, terry's patch does handle padding - just with the caveats i listed later. i should have removed that bullet.

    @terryjreedy
    Copy link
    Member

    https://legacy.python.org/dev/peps/pep-0461/
    adds % formatting for bytes and bytearray.

    Nick, I have the impression that there was a decision to not add bytes.format. Correct? If so, this issue should be closed. If not, what, if anything, has been decided?

    @ncoghlan
    Copy link
    Contributor

    Right, bytes.format was considered as part of the PEP-461 discussions, and rejected as an operation that only made sense in the text domain: https://www.python.org/dev/peps/pep-0461/#proposed-variations

    With PEP-461 accepted, and PEP-460 withdrawn, that means we won't be adding bytes.format and bytearray.format.

    bpo-20284 covers the implementation of PEP-461.
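
For reference, the examples raised earlier in this thread can be written with PEP 461 %-formatting as in the sketch below. It requires Python 3.5 or later, and the values are illustrative. Note that mapping keys must themselves be bytes.

```python
def chunk(block):
    """Frame one HTTP chunk using bytes %-formatting (Python 3.5+)."""
    # %x formats the int as ASCII hex digits; %b inserts bytes unchanged.
    return b"%x\r\n%b\r\n" % (len(block), block)


framed = chunk(b"\xff" * 10)

# Named fields work too, with bytes keys and opaque bytes values.
status = b"%(state)s %(path)s\n" % {
    b"state": b"M",
    b"path": b"some/\xff/path",
}
```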

    @gpshead
    Copy link
    Member

    gpshead commented May 28, 2016

    This came up in the language summit today when discussing twisted. .format() is still not supported on bytes though % is in 3.5.

    realistically it sounded like twisted needs to support python 3.4 for many years so they can't rely on bytes having a .format() method that also works on 2.7 anyways... but assuming .format() is only useful for text may still have been an oversight. (i'll have to go re-read PEP-460 and 461 and discussion before commenting further)

    @underrun
    Copy link
    Mannequin

    underrun mannequin commented Jun 10, 2016

    Gregory - I'm glad that you're willing to consider this again. It still is a constant issue for me, and .format with variable width fields in binary protocols is so the right tool for the job. If there is anything I can do to help get this added to 3.6 let me know. The forward/backward compatibility issue is secondary to me to the flexibility gained from having .format available for bytes.

    Also padding with null bytes that don't get converted would be awesome.

    @ncoghlan
    Copy link
    Contributor

    The core problem with the idea of adding bytes.format to Python 3 is that the real power of str.format actually lies in the extensible __format__ protocol and the associated format() builtin, as those rely heavily on text-specific assumptions.

    I interpreted Amber's comments at the language summit as referring more to our changing tune regarding mod formatting from:

    • mod formatting is deprecated, use brace formatting instead; to
    • they're both fully supported, neither is deprecated; to
    • use brace formatting for text data, mod formatting for binary data

    Folks that followed our original "stop using mod formatting" guidance thus needed to change course when it became our recommended technique for formatting binary data.

    Since we now know format() and __format__ aren't suitable for binary data (PEP-461 originally included it, and it got dropped as we kept finding awkward corner cases), that means any new binary formatting proposal needs to explain:

    • how it compares to existing serialisation techniques (mod-formatting, the struct module, text-formatting+encoding, etc)
    • why it needs to be a builtin method or function rather than a new serialisation module

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @sosi-deadeye
    Copy link

    I could have sworn that bytes.format had been implemented. When I needed it once, I came to the realization that this method never existed in Python 3.0, but it did in Python 2.7.

    I also remember that bytes.format triggered an error if the input was of data type str.

    Who else has this false memory?
    Is this the Mandela effect?
