
Request: Nougat OCR Integration #3294

Open
OhadRubin opened this issue Sep 21, 2023 · 8 comments
Labels
help wanted Extra attention is needed model Model specific

Comments

@OhadRubin


I suggest adding Nougat OCR into llama.cpp to enable the processing of scientific PDF documents.
This can act as a first step towards adding multimodal models to this project!

Implementation:
It seems that Nougat is built from standard transformer components (a Swin Transformer encoder and a BART-family decoder), so most of the work would be figuring out how to add the image processing.

Let me know what you think!
P.S.: Love this repo! I hope to add my own retrieval-pretrained transformer at some point to this repo.
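For context, Nougat ships in Hugging Face transformers as a VisionEncoderDecoderModel, which makes the architecture easy to inspect. Below is a minimal usage sketch, assuming the facebook/nougat-small checkpoint and a recent transformers release; it is just a reference for what the pipeline computes, not something llama.cpp would call directly:

```python
from PIL import Image
from transformers import NougatProcessor, VisionEncoderDecoderModel

processor = NougatProcessor.from_pretrained("facebook/nougat-small")
model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-small")

# One rasterized PDF page; "page.png" is a placeholder path.
image = Image.open("page.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# model.encoder is the Swin transformer, model.decoder the BART-family decoder.
outputs = model.generate(pixel_values, max_new_tokens=2048)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```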


@goerch
Collaborator

goerch commented Sep 21, 2023

As soon as @ggerganov tackles multi-modal (not sure, maybe he already did) I'm interested. For now: not in project scope, methinks.

@ggerganov ggerganov added model Model specific help wanted Extra attention is needed labels Sep 28, 2023
@ggerganov
Owner

I recently learned about this model and I am very interested in adding support for it.
Not sure if llama.cpp would be the best place to do so.

It's likely to remain low prio for the near future, but if there is a community effort, I'll be happy to support it

@kairan77

kairan77 commented Oct 24, 2023

Impressive results with English papers and ebooks. Some preliminary findings on the Nougat project:

  • for each PDF page, the workflow = (PDF processing -> image preprocessing -> Swin transformer -> loop of < 2k iterations [mBART transformer -> custom lm_head] -> post-processing); of the total runtime on an Intel CPU:
  • the mBART transformer takes ~80%
  • the custom lm_head takes ~15%; each iteration computes a (50000x1000) by (1000x1) matrix product
  • the Swin transformer takes ~4%
  • everything else < 1%
  • the final post-processing can be dropped, since the model is good enough that I found no difference in the results with or without it
  • all image preprocessing can also be dropped, since it is much easier to handle on the client side without all the bulky library imports in Python
  • the SMALL Nougat model decomposes perfectly into independent Swin and mBART models, but the base Nougat model cannot be split without resorting to torch.save on the full mBART model
  • the Swin model degrades severely when quantized to 8-bit, and since it takes << 1 s per page, it is not worth poking at any further
  • quantizing the mBART model to 8-bit with PyTorch dynamic quantization (see the sketch after this list) degrades the small model by a tiny amount, whereas with the base model I saw no difference in output over 100+ pages of test runs; runtime on CPU was reduced by only 15% when running in a Python environment, but by roughly 3x when the Python code is compiled into an executable
  • the original Nougat model takes just shy of 60 seconds for a completely filled A4 page on a 10th-gen i5 CPU
  • mBART runtime per iteration is around 40-50 ms on a 10th-gen i5 CPU (including the lm_head operation)
  • when the model is converted to ONNX, runtime gets worse by about 3-5x, because the history (KV-cache) optimizations in the original transformer model are not available in the exported ONNX model
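For reference, the 8-bit dynamic quantization mentioned above is a one-liner in PyTorch. A minimal sketch, assuming `decoder` is the mBART module already split out of Nougat (only the Linear layers get int8 weights; activations stay fp32):

```python
import torch

# Quantize the weights of every nn.Linear in the decoder to int8;
# dequantization happens on the fly at inference time.
quantized_decoder = torch.quantization.quantize_dynamic(
    decoder,
    {torch.nn.Linear},
    dtype=torch.qint8,
)
```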

First question:
@ggerganov I have pruned away everything apart from the inner loop (mBART + lm_head), so that the exposed API takes the tensor output of the Swin model and yields token ids without post-processing (a sketch of that loop follows below). If this part were rewritten in C++ to run on a 5- or 6-bit quantized model, then based on the numbers above, do you think the inner-loop runtime could be halved? What is your best guess for the speed gain, if any?
Second question:
Any pointers or code skeletons you can provide to get this going?
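To make the question concrete, here is a rough PyTorch sketch of the inner loop as described: greedy decoding that consumes the Swin encoder's output tensor and yields raw token ids. The names `decoder`, `lm_head`, `bos_id`, and `eos_id` are placeholders for the split-out modules and special tokens, not actual Nougat APIs; a C++ port would mirror this structure:

```python
import torch

def greedy_decode(decoder, lm_head, encoder_hidden_states,
                  bos_id, eos_id, max_steps=2000):
    """Sketch of the mBART + lm_head inner loop, no post-processing."""
    tokens = torch.tensor([[bos_id]])
    past_key_values = None  # the KV cache the ONNX export lost
    for _ in range(max_steps):
        out = decoder(input_ids=tokens[:, -1:],
                      encoder_hidden_states=encoder_hidden_states,
                      past_key_values=past_key_values,
                      use_cache=True)
        past_key_values = out.past_key_values
        # lm_head applies the (50000 x 1000) projection to the
        # 1000-dim hidden state, i.e. the matrix product noted above.
        logits = lm_head(out.last_hidden_state[:, -1, :])
        next_id = logits.argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_id], dim=-1)
        if next_id.item() == eos_id:
            break
    return tokens[0].tolist()
```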

@kairan77

btw, the encoding layers of both the small and base Nougat models use exactly the same Swin model; the two models differ only in the underlying decoding layers [mBART].

also imho, everything before and after the inner loop is not worth rewriting in C, since it takes essentially no time to run.
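If anyone wants to verify the shared-encoder claim, a quick sketch (assuming both checkpoints load as VisionEncoderDecoderModel):

```python
import torch
from transformers import VisionEncoderDecoderModel

small = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-small")
base = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base")

# Compare the Swin encoder weights of the two checkpoints tensor by tensor.
sd_small = small.encoder.state_dict()
sd_base = base.encoder.state_dict()
identical = sd_small.keys() == sd_base.keys() and all(
    torch.equal(sd_small[k], sd_base[k]) for k in sd_small)
print("encoders identical:", identical)
```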

@OhadRubin
Author

Any updates?

@jpvelsamy

It would be great to have the OCR integrated into the mix. Any updates on this would be awesome.

@OriginalGoku

So I assume this is still not implemented?
