Visionsarathi

Overview
Visionsarathi is a project dedicated to improving the visual experience for individuals with visual impairments. Leveraging machine learning and natural language processing, the goal is to generate efficient and coherent natural language descriptions of captured images. The project integrates seamlessly with image recognition, offering users a unified and user-friendly interface for interacting with the system.
Features
Efficient Image Descriptions: Utilize advanced NLP techniques to generate clear and concise natural language descriptions for captured images.
Unified User Experience: Integrate the NLP system seamlessly with image recognition to provide a unified and coherent experience for users.
User-Friendly Interface: Develop an accessible interface for visually impaired users, allowing them to easily prompt the system for detailed image descriptions.
Project Structure
The project structure is organized to ensure modularity and ease of understanding. Key components include:
image_processing: Contains code for image recognition and processing.
nlp_module: Implements natural language processing for generating image descriptions.
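For illustration only, the layout implied above might look like this (only the two package names come from this document; everything else is hypothetical):

```
visionsarathi/
├── image_processing/   # image recognition and preprocessing
└── nlp_module/         # natural language description generation
```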
Project Report: Image Captioning and OCR with Text Optimization
Overview
The project aims to develop a system that receives images over a socket connection, processes them with an image captioning model fine-tuned on the COCO dataset, performs Optical Character Recognition (OCR) with EasyOCR, and optimizes the combined text with a large language model served by Ollama. The optimized text is then sent back to the client.
Technologies Used
Python: Programming language used for development.
PyTorch: Deep learning library used for the image captioning model.
BLIP (Bootstrapping Language-Image Pre-training): Model architecture used for image captioning.
EasyOCR: Library used for Optical Character Recognition.
Ollama: Runtime used to serve the large language model that optimizes the text.
PIL (Python Imaging Library): Library used for image processing.
Socket programming: Used for communication between the server and client.
struct: Used for packing and unpacking data for transmission over the network.

Implementation Details
Image Receiving and Processing
A server socket listens for client connections. Images are received in chunks and saved to disk. Each received image is then passed to the captioning and OCR stages (see the socket sketch after this report).

Image Captioning
The saved image is preprocessed and fed into the BLIP model, which generates a caption for it.

Optical Character Recognition (OCR)
EasyOCR performs OCR on the saved image, and the detected text is extracted.

Text Optimization
The image caption and the OCR text are combined into a single string, which is sent to the Ollama-served model for optimization. The optimized text is then returned for delivery to the client.

Client Communication
The processed text is sent back to the client over the socket connection. The client receives the optimized text and can use it as needed.

Error Handling
Error handling is implemented throughout the code to deal gracefully with exceptions in file operations, socket connections, and model loading.

Future Improvements
Implement authentication and data validation for enhanced security.
Optimize code for better performance, especially in image processing and model inference.
Enhance modularity by breaking functions into smaller components for better maintainability.
Add logging and monitoring for easier debugging and system-health checks.

Conclusion
The project demonstrates the integration of image captioning, OCR, and text optimization using state-of-the-art deep learning models and libraries. With further refinement, it can be deployed in real-world applications such as helping visually impaired individuals understand the content of images.
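As an illustration of the length-prefixed transfer described above, here is a minimal sketch of the server side, assuming a 4-byte big-endian size header packed with struct; the host, port, and framing details are assumptions, not taken from this codebase.

```python
import socket
import struct

HOST, PORT = "0.0.0.0", 5000  # hypothetical bind address and port


def recv_exact(conn: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from the socket, raising if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(min(4096, n - len(buf)))
        if not chunk:
            raise ConnectionError("client closed the connection mid-transfer")
        buf += chunk
    return buf


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.bind((HOST, PORT))
    server.listen(1)
    conn, addr = server.accept()
    with conn:
        # 4-byte big-endian length prefix, then the raw image bytes in chunks.
        (size,) = struct.unpack(">I", recv_exact(conn, 4))
        with open("received.jpg", "wb") as f:
            f.write(recv_exact(conn, size))
        # ... caption, OCR, and optimize here, then reply the same way ...
        reply = "optimized description".encode("utf-8")
        conn.sendall(struct.pack(">I", len(reply)) + reply)
```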
Detailed Technical Report
Introduction
The image captioning and text optimization project aims to create a system that generates descriptive captions for images and optimizes them for clarity and conciseness. The system also performs Optical Character Recognition (OCR) to extract text from images and combines the OCR results with the generated captions.
Technologies Used
Python: Primary programming language for the project.
PyTorch: Deep learning library used for training and inference of the image captioning model.
BLIP (Bootstrapping Language-Image Pre-training): Model architecture used for image captioning, fine-tuned on the COCO dataset.
EasyOCR: Python library for performing Optical Character Recognition (OCR) on images.
Ollama: Runtime that serves the large language model used for text optimization, ensuring the generated captions are concise and grammatically correct.
PIL (Python Imaging Library): Library used for image processing tasks such as resizing and converting images.
Socket programming: Used to establish communication between the server and client for image and text transfer.
Implementation Details
Image Reception and Processing
A server socket listens for incoming client connections. Images sent by clients are received in chunks and saved to disk. Each received image is preprocessed and passed through the captioning and OCR pipelines (a combined sketch follows this section).

Image Captioning
The saved image is preprocessed with transformations such as resizing and normalization, then fed into the BLIP model, which generates a descriptive caption.

Optical Character Recognition (OCR)
The EasyOCR library performs OCR on the saved image. The OCR results provide additional textual information about the image content.

Text Optimization
The generated caption and the OCR text are combined into a single string and passed to the model served by Ollama for optimization, which makes the text grammatically correct, concise, and easy to understand.

Client Communication
The optimized text is sent back to the client through the established socket connection. Clients can use it as needed for further processing or display.
Error Handling
The code includes error-handling mechanisms to deal gracefully with exceptions in file operations, socket connections, and model loading. Error messages are logged and displayed to aid debugging and troubleshooting.
Future Improvements
Implement authentication and encryption to secure communication between the server and clients.
Improve performance with asynchronous processing and parallelization.
Encapsulate functionality into reusable modules for better maintainability and extensibility.
Strengthen error handling and logging for comprehensive error reporting and system monitoring.
Conclusion
The image captioning and text optimization project demonstrates the integration of deep learning models and libraries for generating descriptive captions and optimizing text for clarity and conciseness. With further refinement, the system can be deployed in applications ranging from assistive technologies for visually impaired individuals to content creation tools for social media and marketing platforms.
Hardware Notes
When GPIO 19 is pulled low, set GPIO 13 high. Save the audio output as audio.wav, overwriting any existing file. When GPIO 6 is pulled low, replay the audio. One way to wire this up is sketched below.
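On a Raspberry Pi with the RPi.GPIO library, the notes above could be implemented as follows. BCM pin numbering, buttons that pull the pins low when pressed, aplay for playback, and driving GPIO 13 low again after processing are all assumptions, since the notes do not specify them.

```python
import subprocess
import time

import RPi.GPIO as GPIO

CAPTURE_BTN = 19          # pulled low to trigger a capture
STATUS_PIN = 13           # driven high while GPIO 19 is active
REPLAY_BTN = 6            # pulled low to replay the last audio
AUDIO_FILE = "audio.wav"  # overwritten on each new capture

GPIO.setmode(GPIO.BCM)
GPIO.setup([CAPTURE_BTN, REPLAY_BTN], GPIO.IN, pull_up_down=GPIO.PUD_UP)
GPIO.setup(STATUS_PIN, GPIO.OUT, initial=GPIO.LOW)

try:
    while True:
        if GPIO.input(CAPTURE_BTN) == GPIO.LOW:
            GPIO.output(STATUS_PIN, GPIO.HIGH)
            # ... capture an image, run the pipeline, synthesize speech to
            # AUDIO_FILE (replacing any existing file), then play it ...
            subprocess.run(["aplay", AUDIO_FILE], check=False)
            GPIO.output(STATUS_PIN, GPIO.LOW)  # assumption: reset after handling
        if GPIO.input(REPLAY_BTN) == GPIO.LOW:
            subprocess.run(["aplay", AUDIO_FILE], check=False)
        time.sleep(0.05)  # crude debounce and CPU relief
finally:
    GPIO.cleanup()
```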