Add Support for ABNF #3311

Merged
merged 1 commit into from
Nov 2, 2016

Conversation

sanssecours
Contributor

Hi,

this commit adds support for Augmented Backus–Naur form, a metalanguage used to specify language grammars. The language does not seem to be that popular on GitHub, but it is important enough to be used in popular repositories. An RFC document describing the current version of the language is available here.

If I should change anything in this pull request, then please just comment below. Thank you.

Kind regards,
René

@Alhadis Alhadis self-assigned this Nov 1, 2016
@Alhadis
Collaborator

Alhadis commented Nov 1, 2016

👋 Hey @sanssecours, thanks for the pull requests. I'll address them both in this post, since they're so closely related.

I'd say there're enough results across GitHub for both languages for them to warrant addition:

| Language | Files | Repos | Users |
|----------|-------|-------|-------|
| ABNF     | 2,308 | 828   | 706   |
| EBNF     | 2,011 | 997   | 886   |

I wouldn't mind hearing @pchaigno and @arfon's thoughts too, though. :)

Also, a few other things:

  • You'll need to change the language-type of each entry to data, not programming. It's a notation for describing a language, as we both know, and not a real language per se.
  • You might find this grammar to be useful; it provides highlighting for both ABNF and EBNF.

@arfon
Contributor

arfon commented Nov 1, 2016

I wouldn't mind hearing @pchaigno and @arfon's thoughts too, though. :)

Agreed. These look like they meet the required usage guidelines.

  • You'll need to change the language-type of each entry to data, not programming. It's a notation for describing a language, as we both know, and not a real language per se.
  • You might find this grammar to be useful; it provides highlighting for both ABNF and EBNF.

👍

@Alhadis
Collaborator

Alhadis commented Nov 1, 2016

Great! That settles it then. =)

One more thing: you'll also need to show where you acquired those samples, as we require them to be released under a permissive license. Specifically, one of these.

Lastly, I'd also ask you to remove the color fields for each language. As they fall under the data category, they won't be listed in language-statistic bars in repositories... which means the colours won't ever be seen by users.

@sanssecours
Contributor Author

  • You'll need to change the language-type of each entry to data, not programming. It's a notation for describing a language, as we both know, and not a real language per se.

Thanks for the information. I just changed the group. May I ask why ANTLR files are listed under programming then? As far as I know, .g4 files also describe grammars. Here is an example of an ANTLR file/grammar from the official website (Expr.g4):

grammar Expr;       
prog:   (expr NEWLINE)* ;
expr:   expr ('*'|'/') expr
    |   expr ('+'|'-') expr
    |   INT
    |   '(' expr ')'
    ;
NEWLINE : [\r\n]+ ;
INT     : [0-9]+ ;

You might find this grammar to be useful; it provides highlighting for both ABNF and EBNF.

Thanks for the info. I think I will incorporate the core rules in my version of the ABNF TextMate grammar too. The EBNF grammar seems to be a direct translation of Arne's grammar :o) so we should be covered there.

One more thing: you'll also need to show where you acquired those samples, as we require them to be released under a permissive license. Specifically, one of these.

All of the examples are from the English Wikipedia pages for ABNF and EBNF. I added a link to the respective articles at the top of each sample file. The files are covered under CC BY-SA 4.0, which sounds permissive to me, although I think I also need to provide a link to the license and the names of the authors. Finding the authors of the code sounds like a complicated task, since so many people contribute to Wikipedia articles.

Should I just search for files that are covered by one of the licenses you linked to instead? If you think this option is more appropriate, is a comment that contains a link to the original source inside the files enough?

@Alhadis
Collaborator

Alhadis commented Nov 1, 2016

May I ask why ANTLR files are listed under programming then?

I'm not sure, as I'm unfamiliar with ANTLR; @larsbrinkhoff might know more about it than I do. But if it's really just another *BNF-like notation, its classification should definitely be changed. =)

All of the examples are from the English Wikipedia pages for ABNF and EBNF.

@pchaigno Do you know if the CC BY-SA license is appropriate? I've forgotten what the stance on Wikipedia-sourced samples is.

If you think this option is more appropriate, is a comment that contains a link to the original source inside the files enough?

No, I'm afraid not. The samples aren't for human readers: they're for the classifier. Linguist uses them as a basis for Bayesian classification, helping it identify languages based on the frequency of keywords and what-have-you. The bigger the reference material, the better.
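
To illustrate why sample size and token frequency matter, here is a minimal sketch of token-frequency Bayesian classification. It is not Linguist's implementation: the tiny `SAMPLES` strings, the tokenizer, add-one smoothing, and equal priors are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical miniature training samples; real samples are whole files.
SAMPLES = {
    "ABNF": ['rule = 1*DIGIT / ALPHA ; comment', 'name = %x41-5A / "a"'],
    "EBNF": ['rule = digit , { digit } ;', 'name = "a" | "b" ;'],
}

def tokenize(text):
    # Runs of word characters, or single punctuation characters.
    return re.findall(r"\w+|[^\w\s]", text)

def train(samples):
    # Count how often each token occurs in each language's samples.
    counts = {
        lang: Counter(tok for text in texts for tok in tokenize(text))
        for lang, texts in samples.items()
    }
    vocab = set().union(*counts.values())
    return counts, vocab

def classify(text, counts, vocab):
    scores = {}
    for lang, freq in counts.items():
        total = sum(freq.values())
        # Sum of log-probabilities with add-one smoothing; equal priors assumed.
        scores[lang] = sum(
            math.log((freq[tok] + 1) / (total + len(vocab)))
            for tok in tokenize(text)
        )
    return max(scores, key=scores.get)

counts, vocab = train(SAMPLES)
print(classify("x = 2*DIGIT / ALPHA", counts, vocab))
```

Tokens such as `*` and `/` appear only in the ABNF samples here, so they pull the score toward ABNF; with more reference material, such frequency differences become more reliable.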

Should I just search for files that are covered by one of the licenses you linked to instead?

You could, or you could write your own. Sometimes the latter is easier, particularly for data formats. :)

@sanssecours
Contributor Author

No, I'm afraid not. The samples aren't for human readers: they're for the classifier. Linguist uses them as a basis for Bayesian classification, helping it identify languages based on the frequency of keywords and what-have-you. The bigger the reference material, the better.

Sorry for the confusing description above 😀. I meant, is it enough to just specify the source at the top of a sample file? For example, if I include the file toml.abnf in the sample directory should I add the comment

; Source: https://github.com/toml-lang/toml/blob/abnf/toml.abnf

at the top of the file? Where exactly should I add the information about the source/license of a sample file?

You could, or you could write your own. Sometimes the latter is easier, particularly for data formats. :)

You are right. Writing a grammar that is valid according to the syntax of EBNF/ABNF is not that hard. Writing a grammar that also makes sense semantically is not trivial though, especially if it should include a lot of the features of the metalanguage.

@Alhadis
Collaborator

Alhadis commented Nov 1, 2016

Sorry for the confusing description above 😀. I meant, is it enough to just specify the source at the top of a sample file?

Ah, I see what you mean. No, it doesn't work that way. The material will need to be explicitly released under a permissive license.

Where exactly should I add the information about the source/license of a sample file?

Here will do. =) It doesn't have to be in the codebase; just as long as we have a record of where/when it was acquired.

Writing a grammar that also makes sense semantically is not trivial though, especially if it should include a lot of the features of the metalanguage

Well there's not a lot that can go wrong with *BNF, thankfully. ;)

@pchaigno
Contributor

pchaigno commented Nov 1, 2016

You'll need to change the language-type of each entry to data, not programming. It's a notation for describing a language, as we both know, and not a real language per se.

@Alhadis In the end, programming is just a label. IMHO, what matters is what it means for a language to be a programming language on github.com. In this case, it means it will be in the statistics bar and a project can be tagged as an ABNF project.

@sanssecours Would you expect to see git repositories with only ABNF files or are they always included with some other files (as some kind of documentation/specification for instance)?

@sanssecours
Contributor Author

Here will do. =) It doesn't have to be in the codebase; just as long as we have a record of where/when it was acquired.

That sounds easy enough. Thank you for the information.

@sanssecours Would you expect to see git repositories with only ABNF files or are they always included with some other files (as some kind of documentation/specification for instance)?

I would expect that most repositories use ABNF/EBNF very sparsely, maybe as a source of documentation for some programming/configuration language. There are however repositories, like this one, where the main file seems to be an ABNF grammar.

@larsbrinkhoff
Contributor

May I ask why ANTLR files are listed under programming then?

My impression is that ANTLR is similar to Yacc. They include both grammar rules and program code.

@sanssecours
Contributor Author

I replaced the sample files from Wikipedia with samples from GitHub. I added the source and the license at the top of the files. I hope that is okay.

The table below shows the sources and licenses of the files too.

| Files | Source | License |
|-------|--------|---------|
| toml.abnf | https://github.com/toml-lang/toml | MIT |
| tranquil.abnf | https://github.com/fjolnir/Tranquil | BSD 3 Clause License |
| grammar.ebnf | https://github.com/sunjay/lion | MIT |
| material.ebnf, object.ebnf, types.ebnf | https://github.com/io7m/jsom0 | ISC |

If there is anything left for me to do, then please just comment below. Thank you.

Collaborator

@Alhadis Alhadis left a comment

Just a few touch-ups needed, and then I think we're good to go. =)

extensions:
- ".abnf"
interpreters:
- ex_abnf
Collaborator

This field is for languages that can be identified from interpreter directives. =)

For instance, a file with #!/usr/bin/env node in its first line can be identified as JavaScript by listing node as an interpreter.

Since that doesn't apply here, it's safe to leave this out.
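
As a rough sketch of what interpreter-based detection involves (not Linguist's actual code), the function below extracts the interpreter name from a shebang line and looks it up in a table. The `INTERPRETERS` mapping and the function name are hypothetical, chosen only for illustration.

```python
# Hypothetical mapping from interpreter name to language.
INTERPRETERS = {"node": "JavaScript", "python3": "Python", "ruby": "Ruby"}

def language_from_shebang(first_line):
    """Return the language named by a shebang line, or None."""
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].split()
    if not parts:
        return None
    # Take the basename of the program path, e.g. /usr/bin/env -> env.
    name = parts[0].rsplit("/", 1)[-1]
    # "env" only launches another program, so look at its first argument.
    if name == "env" and len(parts) > 1:
        name = parts[1].rsplit("/", 1)[-1]
    return INTERPRETERS.get(name)

print(language_from_shebang("#!/usr/bin/env node"))
```

An ABNF file has no interpreter to run it, so a lookup like this never applies to it.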

Contributor Author

Thank you for the explanation. This should be fixed now.

interpreters:
- ex_abnf
tm_scope: source.abnf
group: ABNF
Collaborator

The group field here is redundant; we only use this to let one language affect a parent language's usage statistics. For example, GAS falls under the category of Assembly, so we want to have it count in the usage stats of Assembly languages across GitHub.

The name is slightly misleading; a more accurate name for this field might be parent_language.

In short, it's logistically impossible for anything to be its own parent. ;)
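
The roll-up described here can be sketched as follows. This is an illustration under assumptions, not Linguist's code: the `LANGUAGES` table and the `rolled_up_stats` helper are hypothetical, and only the `group` key mirrors the field under discussion.

```python
from collections import defaultdict

# Hypothetical language entries; only "group" mirrors the real field.
LANGUAGES = {
    "Assembly": {},
    "GAS": {"group": "Assembly"},
    "ABNF": {},  # no group: the language simply counts as itself
}

def rolled_up_stats(byte_counts, languages=LANGUAGES):
    """Credit each language's bytes to its group, or to itself if it has none."""
    totals = defaultdict(int)
    for lang, size in byte_counts.items():
        parent = languages.get(lang, {}).get("group", lang)
        totals[parent] += size
    return dict(totals)

print(rolled_up_stats({"GAS": 100, "Assembly": 50, "ABNF": 10}))
```

Because a missing `group` already falls back to the language's own name, setting `group: ABNF` on ABNF itself would change nothing.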

Contributor Author

Thank you for the comment. I just removed the group fields.

@@ -0,0 +1,61 @@
; Source: https://github.com/fjolnir/Tranquil
; License: BSD 3 Clause License
; Modified for standard compatibility by René Schwaiger
Contributor

What did you need to modify here? We generally try to keep the files as-is (with the eventual mistakes, etc.) as they are used for tests and training.

Contributor Author

The original file uses single quotes, which are not part of the standard as far as I can tell.

Collaborator

There was no need to have done that. This is the sort of in-the-wild discrepancy that's both natural and expected; and also part of the reason we need samples in the first place.

Contributor Author

This is the sort of in-the-wild discrepancy that's both natural and expected; and also part of the reason we need samples in the first place.

I do not think that it makes that much sense to add incorrect files. Otherwise we could just add the following “ABNF code” as (part of) a sample file 😀:

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud
exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute
irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia
deserunt mollit anim id est laborum.

The ABNF TextMate grammar will also (correctly) mark parts of the original tranquil.abnf grammar as incorrect:

[Image: tranquil]

Contributor

I do not think that it makes that much sense to add incorrect files.

It does when it is what people use in-the-wild. If you think this file is not representative of the usage of ABNF on github.com you can replace it ;)

Collaborator

Here's the relevant rule which I believe is applying the error highlighting:

[a-zA-Z][a-zA-Z0-9\-]*|(?<invalid>\S)

So it's basically saying: match a name that starts with an ASCII letter and continues with ASCII letters, digits, or dashes; any other non-whitespace character is captured as invalid and highlighted as an error.
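
The effect of that rule can be tried out in isolation with a small sketch. Two assumptions apply: Python's `re` spells named groups `(?P<name>...)` instead of `(?<name>...)`, and the real grammar applies the rule only in a particular context, whereas this helper (a hypothetical name) scans a whole string.

```python
import re

# The rule quoted above, with the named group written in Python's syntax.
RULE = re.compile(r"[a-zA-Z][a-zA-Z0-9\-]*|(?P<invalid>\S)")

def invalid_characters(text):
    """Return every character the rule would capture as invalid."""
    return [
        m.group("invalid")
        for m in RULE.finditer(text)
        if m.group("invalid")
    ]

print(invalid_characters("rule-name"))  # nothing is flagged
print(invalid_characters("expr 'x'"))   # both single quotes are flagged
```

Under this rule, a single quote can never be part of a valid match, which is consistent with the single-quoted strings in the original file being highlighted as errors.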

Contributor Author

May I add a comment that tranquil.abnf is a non-standard-compliant file?

Collaborator

Why?

Collaborator

Recall that these files are for the classifier: 99% of the time, they remain unread by human beings.

Contributor Author

Why?

Because it is a non-standard-compliant file 😀. I would not want someone to think that it represents a correct ABNF grammar.

However, I just removed tranquil.abnf. I hope that is a compromise that works for everyone.

Augmented Backus–Naur form ([ABNF][]) is a metalanguage used to specify
language grammars.

[ABNF]: https://en.wikipedia.org/wiki/Augmented_Backus–Naur_form
@Alhadis
Collaborator

Alhadis commented Nov 2, 2016

Looks good enough to me then, I guess. @arfon?

@arfon arfon merged commit 4172390 into github-linguist:master Nov 2, 2016
@arfon
Contributor

arfon commented Nov 2, 2016

👍 looks great. Thanks @sanssecours.

This will be live in the next release of Linguist (later this week)

@pchaigno
Contributor

pchaigno commented Nov 2, 2016

Thanks for your patience @sanssecours!

@Crissov Crissov mentioned this pull request Mar 15, 2017
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024