Remove gsl/nmatrix dependencies from Gemfile and build workflow #87

Open
0xdevalias opened this issue Jul 26, 2020 · 3 comments

@0xdevalias
Owner

0xdevalias commented Jul 26, 2020

This should be done after #86 is merged.

Even after we removed Jekyll's --lsi option, the site build still only seemed to take ~7sec.

So either --lsi isn't working, or it just doesn't have much impact on our site build time.

In light of this, I'm thinking we can leave the --lsi option enabled for now, but can probably remove the gsl/nmatrix optimisations we had added. We should do that in a follow-up PR, though.

This also most likely renders #83 irrelevant.

Originally posted by @0xdevalias in #86 (comment)

@0xdevalias
Owner Author

👋 Hi,

I stumbled onto this thread from jekyll/classifier-reborn#193.

A few notes that you might find helpful:

  • You're not noticing any difference in build times with the --lsi option because your site (as it is today in this repo) doesn't use related posts (so the --lsi option does nothing). To use LSI, you need to call site.related_posts somewhere in a Liquid template. For example, you might add something like the following to _layouts/post.html:
    {% for post in site.related_posts limit:3 %}
      <p>{{ post.title }}</p>
    {% endfor %}
  • If you call site.related_posts without passing the --lsi option, it simply returns the most recent posts.
  • If you are using site.related_posts and you pass the --lsi option, you'll see "Populating LSI..." in your jekyll build --lsi output. The build will be slow unless you have both the gsl gem and the native gsl library installed. I haven't experimented with nmatrix or narray at all, but simply using the gsl gem results in a ~500x speed increase for my use.
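
To illustrate the non-LSI fallback mentioned in the second bullet, here is a rough Ruby sketch of a "most recent posts, excluding the current one" selection. This is not Jekyll's actual code: the Post struct, the recent_posts_fallback name, and the limit parameter are all hypothetical and only there to show the idea.

```ruby
require "date"

# Hypothetical stand-in for a Jekyll post; only the fields this sketch needs.
Post = Struct.new(:title, :date)

# Sketch of a "recent posts" fallback: newest first, skipping the post
# currently being rendered. The default limit of 10 is an assumption.
def recent_posts_fallback(posts, current, limit: 10)
  posts
    .sort_by(&:date)                       # oldest -> newest
    .reverse                               # newest -> oldest
    .reject { |post| post.equal?(current) } # drop the current post itself
    .first(limit)
end

posts = [
  Post.new("First",  Date.new(2020, 1, 1)),
  Post.new("Second", Date.new(2020, 2, 1)),
  Post.new("Third",  Date.new(2020, 3, 1))
]

# "Related" posts for the second post, with no similarity calculation at all.
related = recent_posts_fallback(posts, posts[1], limit: 2)
puts related.map(&:title).inspect
```

Nothing here looks at post content, which would explain why the list feels arbitrary without --lsi and why no gsl-style speedup applies to it.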

Hope that helps. I appreciated some of your comments on some of the libraries so I thought I'd share some notes with you!

Originally posted by @mkasberg in #83 (comment)

@0xdevalias
Owner Author

0xdevalias commented Jun 20, 2024

classifier-reborn has supported an alternative to gsl since v2.3.0, which might be a good option to switch to here:

The referenced issue links from the Gemfile:

Originally posted by @0xdevalias in #20 (comment)

The following posts by @mkasberg are also worth reading/considering before going too deep with this:

Having ChatGPT explain the differences between using LSI and embeddings for this purpose:

Latent Semantic Indexing (LSI)

  • Method: Uses Singular Value Decomposition (SVD) on term-document matrices.
  • Representation: Lower-dimensional space capturing latent semantic structures.
  • Applications: Information retrieval, document clustering, text summarization.
  • Advantages: Handles synonymy, less computationally intensive.
  • Limitations: Limited in capturing complex linguistic phenomena, performance depends on the corpus.

Embeddings (e.g., OpenAI embeddings)

  • Method: Uses deep learning models like transformers.
  • Representation: Dense vectors capturing semantic meaning, context, and relationships.
  • Applications: Sentiment analysis, text classification, named entity recognition, question answering.
  • Advantages: Captures complex linguistic phenomena, state-of-the-art performance, versatile.
  • Limitations: Computationally intensive, requires significant resources, may need fine-tuning.

Summary

  • LSI is simpler and effective for basic tasks but less nuanced.
  • Embeddings provide richer, context-aware representations and superior performance on a wide range of tasks but require more computational power.
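
Whichever representation is used, "related posts" usually comes down to comparing one vector per post. As a minimal sketch of that last step, assuming the vectors already exist (from LSI or from an embedding model), posts can be ranked by cosine similarity. The cosine_similarity and rank_related names and the toy vectors are hypothetical, not part of Jekyll or classifier-reborn.

```ruby
# Cosine similarity between two equal-length numeric vectors.
# Returns 0.0 if either vector has zero length, to avoid dividing by zero.
def cosine_similarity(a, b)
  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Rank candidate posts (name => vector) from most to least similar to the query.
def rank_related(query, candidates)
  candidates
    .sort_by { |_name, vector| -cosine_similarity(query, vector) }
    .map(&:first)
end

# Toy 2-dimensional vectors; real LSI or embedding vectors would be much longer.
vectors = {
  "post-a" => [1.0, 0.0],
  "post-b" => [0.0, 1.0],
  "post-c" => [1.0, 1.0]
}

ranking = rank_related([1.0, 0.1], vectors)
puts ranking.inspect
```

The ranking step is cheap either way; the cost difference between the two approaches is in producing the vectors in the first place.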
