Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to specify the HightlightJS language #391

Closed
iHiD opened this issue Jun 4, 2021 · 18 comments
Closed

How to specify the HightlightJS language #391

iHiD opened this issue Jun 4, 2021 · 18 comments

Comments

@iHiD
Copy link

iHiD commented Jun 4, 2021

Hello. Thanks for this great library.

At the moment, it seems highlightJS is guessing the language based on the the file extension (I think Im reading that correctly here). That often results in incorrect matchings. If we know the language of the file that the diff is for, we'd like to specify it manually. Is there a way to do that? If not, could an option be added please? If you'd like us to add the option, could you point us in the direction you'd like us to do it please? And if its already an option and I've missed it, my apologies :)

@joshgoebel
Copy link

Perhaps relevant:

const language = file.getAttribute('data-lang');

@iHiD
Copy link
Author

iHiD commented Jun 4, 2021

@joshgoebel Thanks. I believe html2diff writes that itself from reading the diff and extracting the extension. My current fallback is to get that attribute and manually change it, then reparse highlightjs on the file, but that's not ideal obviously.

@iHiD
Copy link
Author

iHiD commented Jun 4, 2021

It's also determining different highlightJS classes for different LOCs

Screenshot 2021-06-04 at 13 28 13

Screenshot 2021-06-04 at 13 28 22

Screenshot 2021-06-04 at 13 28 05

@joshgoebel
Copy link

My current fallback is to get that attribute and manually change it, then reparse highlightjs on the file, but that's not ideal obviously.

Yeah, but It's not terrible either for small amounts of data. :-)

It's also determining different highlightJS classes for different LOCs

Did you mean language classes? I think so based on your snaps. I would definitely call that a bug. Of course a REAL diff diff could contain multiple languages (multiple files in one diff) but I'm not sure if that's relevant here. Definitely randomly changing language from line to line isn't desirable. Auto-detect is not intended for use against individual lines and the noise from doing that would be MUCH higher than the signal.

@iHiD
Copy link
Author

iHiD commented Jun 4, 2021

Yes. Different language classes. The data-lang is set to exs, which isn't a supported HighlightJS language (although maybe we should add it as a synonym for Elixir?), so therefore highlightJS is guessing the other language classes I presume. This would be fixed if we could manually set the data-lang by passing it into the original d2h method call.

@joshgoebel
Copy link

joshgoebel commented Jun 4, 2021

The data-lang is set to exs, which isn't a supported HighlightJS language (although maybe we should add it as a synonym for Elixir?),

Sure, we often do have extensions as aliases. Though only the canonical ones... which might have implications for languages Exercism uses other than Elixir with many different runtimes that use different extensions.

so therefore highlightJS is guessing the other language classes I presume.

yes, if getLanguage fails (can't find the language) it guesses, I believe I saw that in the code.

This would be fixed if we could manually set the data-lang by passing it into the original d2h method call.

Yeah, sounds about right.

@rtfpessoa
Copy link
Owner

Hi @iHiD,

Unfortunately at the moment it is not possible to specify languages.

The code is basically passing the extensions to highlight.js and getting the language from it.
If this does not work we could probably contribute the extension to highlight.js but from the answer from josh in #391 (comment) I am not sure if it makes sense for them. Although it is not clear to me the negative side.
The other option could be to allow diff2html to receive an alias mapping from file extension to language name.

I would be up to receive a contribution with the second option if we cannot do the first.

What do you think? Do you have any other ideas?

@joshgoebel
Copy link

joshgoebel commented Jun 4, 2021

Ok lets circle back.

it seems highlightJS is guessing the language based on the the file extension (I think Im reading that correctly here). That often results in incorrect matchings.

We don't guess the language based on the file extension, we have a discrete mapping table... file extensions are directly aliased to languages. I suppose you could say then that's it's diff2html "guessing" based on the file extension - but that is what almost all text editors do, and it's most often a great heuristic... are there cases where that alone wouldn't work for Exercism if we actually made sure all the extensions were properly mapped to languages?

IE: Are there cases where the extension is going to be vague or misleading or missing?


The issue of random highlighting per line is a different issue in my mind. If the language wasn't known it would be preferable to just not highlighting at all, which is what I suggested in the other issue I opened.

@iHiD
Copy link
Author

iHiD commented Jun 4, 2021

@rtfpessoa Hey 🙂 Thanks for the reply.

So my preferred fix would be to add a config option to Diff2Html, which takes a highlightjs language and uses that. If that's option isn't passed in then the language-selection falls back to the current behaviour (based on the file extension). That would mean the current behaviour wasn't impacted for existing implementations, and we'd be adding a new feature for situations like ours where we know what the correct highlightjs language is.

Are there cases where that alone wouldn't work for Exercism if we actually made sure all the extensions were properly mapped to languages?

@joshgoebel You probably know this better than me, but I imagine we have situations where multiple languages share extensions. I'm pretty sure we ran into that in v2. But equally we can make a really solid guess at which language that extension is for as we have loads of context here.

I'm not sure which "we" hat you're wearing here - ie where the mapping would be. In highlightjs - sure that'd be brilliant. In Exercism - fine too as long as we have a mechanism to pass that in. In diff2html, yep - that makes sense too, although it would probably be better from d2h to be getting that upstream from highlightjs.

So if you were happy to do it upstream in highlightjs, that would work best. I'd still have some concern about clashes, but it's less of a concern. FYI: This was the list we compiled for v2 (for Prism but 🤷): https://github.com/exercism/v2-website/blob/main/config/initializers/prism.rb#L187-L315

@joshgoebel
Copy link

joshgoebel commented Jun 4, 2021

but I imagine we have situations where multiple languages share extensions.

That is indeed a problem. I think that's only true with one of our 184 languages now, but perhaps we're missing some extensions as well...

I'm not sure which "we" hat you're wearing here - ie where the mapping would be.

Well, as I think I mentioned elsewhere Highlight.js traditionally includes canonical extensions. For example we should include exs for Elixir in Highlight.js. We do not include vendor specific extensions such as say like 5 different implementations of Pascal, all with choosing to use their own file extensions. That isn't our [Highlight.js] problem to solve.

That said it's trivial to register additional aliases with the Highlight.js API if Exercism had to maintain a small additional list that we wouldn't upstream to core.


The larger issue with using purely extension here is the case where extension is ambiguous. Could you come up with an exact example to share? If that happens in the real world I think that would be a reasonable argument that the current behavior of using only file extension isn't enough for all use cases and that the library should consider allowing one to pass the correct language (when file extension isn't sufficient).

@rtfpessoa
Copy link
Owner

rtfpessoa commented Jun 4, 2021

So my preferred fix would be to add a config option to Diff2Html, which takes a highlightjs language and uses that.

I think that solution is going to be to hackish. Diffs tend go have multiple files that usually spread across multiple languages, if we just accept a fallback language seems like it is going to be misused really easy and not have a nice outcome.

That is also the reason why I would prefer to receive an extension mapping, since that way it would support the use case correctly.

I imagine we have situations where multiple languages share extensions.

That sounds weird conceptually (although I would imagine it to happen in real world 😞), I cannot remember any case that I would know.

I'm not sure which "we" hat you're wearing here

I mean we in the sense from the diff2html side (either me or you) could contribute it.

where the mapping would be

Currently diff2html uses the getLanguage function and it already seems to have the concept of aliases, maybe it could just be a matter of adding one more.

Taking into consideration the @joshgoebel answer seems like it would have to be handled mostly in diff2html side as configuration.
I still think the option I suggested is probably the most interesting.

EDIT: Updated after seeing previous message

@rtfpessoa
Copy link
Owner

That is indeed a problem. I think that's only true with one of our 184 languages now, but perhaps we're missing some extensions as well...

@joshgoebel just out of curiosity can you share which one it is?

@joshgoebel
Copy link

I honestly don't recall. You can find the full list at:

https://github.com/highlightjs/highlight.js/blob/main/SUPPORTED_LANGUAGES.md

That's why I was asking to see if @iHiD has a more concrete real-world example.

@joshgoebel
Copy link

joshgoebel commented Jun 4, 2021

Diffs tend go have multiple files that usually spread across multiple languages

Yeah, really you'd have to supporting passing in a map, not a single language:

{ 
  "booger.asm" => "mips_assembly" 
}

There are also ambiguous extensions:

Then you have categories also like assembly... .s, .S, .asm seems to be the trend, but is that x86 assembly? ARM assembly? mips? And "subtypes" like Arduino, which has the same extensions as CPP.

  • Is cpp Arduino code or C++ code?
  • Is asm MIPS, x86, ARM, Atmega?
  • Is rb Ruby or Ruby on Rails code?

So even when the file extension gets you close, sometimes it doesn't get you all the way.

@rtfpessoa
Copy link
Owner

That makes sense, thanks for the examples.

Then definitely diff2html would need to receive at least a suffix that could be either whole filenames or just extensions for it to correctly pass languages to highlight.js.

@iHiD
Copy link
Author

iHiD commented Jun 4, 2021

That said it's trivial to register additional aliases with the Highlight.js API if Exercism had to maintain a small additional list that we wouldn't upstream to core.

This would solve my problem. Can you point me to how best to do that please Josh (maybe on Exercism's Slack so as not to take this issue on a tangent too much?). Thanks for your help 🙂

I think I'm done here with that as a solution, so happy to close this. I also still like the idea of specifying a language, and the map approach would work great for that as well, but I'd rather take the additional mappings, and if there's some that "miss", just leave them unhighlighted (as per the other issue), or clash (then so be it - it'll be rare). As this hasn't come up as an issue before, it's probably low value enough that it doesn't huge matter (Exercism is unusual in having a lot of different files/types so we get edge-cases). Thanks for your time @rtfpessoa and building out such a useful library. It'll be featuring in a few places in Exercism v3, including this one that we built out today:

Screenshot 2021-06-04 at 12 42 46

@iHiD iHiD closed this as completed Jun 4, 2021
@rtfpessoa
Copy link
Owner

This would solve my problem. Can you point me to how best to do that please Josh (maybe on Exercism's Slack so as not to take this issue on a tangent too much?)

I would also be interested in the final solution since it might be interesting to use in diff2html.

Maybe it would be to invoke registerAliases?

Thanks for your time @rtfpessoa and building out such a useful library. It'll be featuring in a few places in Exercism v3, including this one that we built out today

Great that it is being useful for you 🙏

@joshgoebel
Copy link

joshgoebel commented Jun 5, 2021

It's just registerAliases: https://highlightjs.readthedocs.io/en/latest/api.html#registeraliases

For example to "fix" Elixir outside of core:

hljs.registerAliases(["exs"], { languageName: "elixir"});

Creates an alias from exs => elixir. So now exs is effectively elixir for purposes of which language is used for highlighting. You could easily imagine an object literal or array of extension to language mappings and you just loop over them all and register them.

it might be interesting to use in diff2html.

I'm not sure if a 3rd party library should register aliases without informing the user, but perhaps if it was documented that you were doing that... at the very least you could mention in your readme how easy it is to do and show an example. The issue is that someone might use the library in other places and this changes the behavior of the library GLOBALLY. There is no way to define scoped aliases (though it also isn't super hard to keep the mapping table outside Highlight.js either).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants