Speech Integrated Visual Language Model
conda install -c conda-forge mpi4py mpich pip install colorama
pip install -r assets/requirements/requirements.txt
#Rest of packages pip install -r assets/requirements/requirements_custom.txt
#Custom Operator for deformabalbe vision encoder cd modeling/vision/encoder/ops && sh make.sh && cd ../../../../
#Run Demo pip install gradio pip install timm pip install nltk pip install Pillow==9.5.0 pip install transformers pip install kornia
pip install cog
Setup LLaVA using eval mode, to trigger the SEEM model for grounding
Will ask LLaVA for objects in the image, then use Named Entity Relationship for gathering thing classes (or thing phrases) for instance segmenation.