Electrical Engineering and Systems Science
Showing new listings for Monday, 4 November 2024
- [1] arXiv:2411.00023 [pdf, html, other]
-
Title: Device-Directed Speech Detection for Follow-up Conversations Using Large Language Models
Authors: Oggi Rudovic, Pranay Dighe, Yi Su, Vineet Garg, Sameer Dharur, Xiaochuan Niu, Ahmed H. Abdelaziz, Saurabah Adya, Ahmed Tewfik
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Follow-up conversations with virtual assistants (VAs) enable a user to seamlessly interact with a VA without the need to repeatedly invoke it using a keyword (after the first query). Therefore, accurate Device-directed Speech Detection (DDSD) from the follow-up queries is critical for enabling a naturalistic user experience. To this end, we explore Large Language Models (LLMs) and model the first query when making inferences about the follow-ups (based on the ASR-decoded text), either by prompting a pretrained LLM or by adapting a binary classifier on top of the LLM. In doing so, we also exploit the ASR uncertainty when designing the LLM prompts. We show on a real-world dataset of follow-up conversations that this approach yields large gains (20-40% reduction in false alarms at 10% fixed false rejects) due to the joint modeling of the previous speech context and ASR uncertainty, compared to when follow-ups are modeled alone.
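The headline metric in this abstract (false alarms at a fixed 10% false-reject rate) can be made concrete with a short sketch. The function names, the score convention (a higher score means "more likely device-directed"), and the toy scores are illustrative assumptions, not taken from the paper.

```python
import math


def false_alarm_rate_at_fixed_frr(directed_scores, other_scores, target_frr=0.10):
    """Return (threshold, FRR, FAR) with FRR held at or below target_frr.

    Assumption: an utterance is accepted as device-directed when its
    score is >= threshold.
    """
    ranked = sorted(directed_scores)
    n = len(ranked)
    # Number of device-directed utterances we are allowed to reject.
    allowed_rejects = min(math.floor(target_frr * n + 1e-9), n - 1)
    threshold = ranked[allowed_rejects]
    # False rejects: device-directed utterances scored below the threshold.
    frr = sum(s < threshold for s in directed_scores) / n
    # False alarms: non-directed utterances scored at or above the threshold.
    far = sum(s >= threshold for s in other_scores) / len(other_scores)
    return threshold, frr, far


def relative_reduction(far_baseline, far_new):
    """Relative drop in false alarms, e.g. 0.4 means a 40% reduction."""
    return (far_baseline - far_new) / far_baseline


# Toy scores (hypothetical): ten device-directed and four other utterances.
directed = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
others = [0.05, 0.15, 0.25, 0.9]
threshold, frr, far = false_alarm_rate_at_fixed_frr(directed, others)
```

Comparing two systems then amounts to computing `far` for each at the same `target_frr` and passing both values to `relative_reduction`.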
- [2] arXiv:2411.00143 [pdf, html, other]
-
Title: Enhancing Brain Source Reconstruction through Physics-Informed 3D Neural Networks
Comments: Under Review in IEEE Transactions on Medical Imaging
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Reconstructing brain sources is a fundamental challenge in neuroscience, crucial for understanding brain function and dysfunction. Electroencephalography (EEG) signals have a high temporal resolution. However, identifying the correct spatial location of brain sources from these signals remains difficult due to the ill-posed structure of the problem. Traditional methods predominantly rely on manually crafted priors, missing the flexibility of data-driven learning, while recent deep learning approaches focus on end-to-end learning, typically using the physical information of the forward model only for generating training data. We propose the novel hybrid method 3D-PIUNet for EEG source localization that effectively integrates the strengths of traditional and deep learning techniques. 3D-PIUNet starts from an initial physics-informed estimate by using the pseudo inverse to map from measurements to source space. Secondly, by viewing the brain as a 3D volume, we use a 3D convolutional U-Net to capture spatial dependencies and refine the solution according to the learned data prior. Training the model relies on simulated pseudo-realistic brain source data, covering different source distributions. Trained on this data, our model significantly improves spatial accuracy, demonstrating superior performance over both traditional and end-to-end data-driven methods. Additionally, we validate our findings with real EEG data from a visual task, where 3D-PIUNet successfully identifies the visual cortex and reconstructs the expected temporal behavior, thereby showcasing its practical applicability.
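As a rough illustration of the first step described in this abstract (mapping measurements to source space with the pseudo-inverse, then viewing the result as a 3D volume), here is a minimal NumPy sketch. The random lead-field matrix, the toy 2x2x2 source grid, and the sensor count are assumptions made for the example; the paper's actual forward model and grid are not given in this listing.

```python
import numpy as np

rng = np.random.default_rng(0)

grid_shape = (2, 2, 2)                 # hypothetical toy source grid
n_sources = int(np.prod(grid_shape))   # 8 candidate source locations
n_sensors = 4                          # hypothetical electrode count

# Stand-in forward model: maps source amplitudes to sensor measurements.
leadfield = rng.standard_normal((n_sensors, n_sources))

# Simulate one active source and the resulting (noise-free) measurements.
true_sources = np.zeros(n_sources)
true_sources[3] = 1.0
measurements = leadfield @ true_sources

# Physics-informed initial estimate: the Moore-Penrose pseudo-inverse gives
# the minimum-norm source vector that reproduces the measurements.
initial_estimate = np.linalg.pinv(leadfield) @ measurements

# Reshape to a 3D volume, the form a 3D convolutional network would consume.
volume = initial_estimate.reshape(grid_shape)
```

Because there are fewer sensors than sources, the estimate explains the data but is spatially smeared, which is the gap a learned refinement stage is meant to close.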
- [3] arXiv:2411.00145 [pdf, html, other]
-
Title: CRB Optimization using a Parametric Scattering Model for Extended Targets in ISAC Systems
Comments: 5 pages, 3 figures
Subjects: Signal Processing (eess.SP)
This paper presents a novel parametric scattering model (PSM) for sensing extended targets in integrated sensing and communication (ISAC) systems. The PSM addresses the limitations of traditional models by efficiently capturing the target's angular characteristics through a compact set of key parameters, including the central angle and angular spread, enabling efficient optimization. Based on the PSM, we first derive the Cramér-Rao Bound (CRB) for parameter estimation and then propose a beamforming design algorithm to minimize the CRB while meeting both communication signal-to-interference-plus-noise ratio (SINR) and power constraints. By integrating the PSM into the beamforming optimization process, the proposed framework achieves superior CRB performance while balancing the tradeoff between sensing accuracy and communication quality. Simulation results demonstrate that the PSM-based approach consistently outperforms traditional unstructured and discrete scattering models, particularly in resource-limited scenarios, highlighting its practical applicability and scalability.
- [4] arXiv:2411.00159 [pdf, other]
-
Title: Optimizing Energy Management and Sizing of Photovoltaic Batteries for a Household in Granada, Spain: A Novel Approach Considering Time Resolution
Comments: Journal Batteries
Subjects: Systems and Control (eess.SY)
HEMS optimization for RTPVs with BESS: impact on costs and temporal resolution
- [5] arXiv:2411.00184 [pdf, other]
-
Title: A Novel Acoustic Wearable for Assessment of Tendon Health and Loading Condition
Authors: Amirhossein Yazdkhasti, Hendrik De Klerk, Andreea Renata Lucaciu, Rana Moeinzad, Hamid Ghaednia, Joseph H. Schwab
Comments: 21 pages, 9 figures
Subjects: Signal Processing (eess.SP); Tissues and Organs (q-bio.TO)
Current methods of assessing tendon health, such as clinical examination, imaging techniques, and implanted pressure sensors, are often subjective, insufficiently accurate, or extremely expensive, or are limited to relatively large damage such as a partial or gross tear of the tendon; they cannot accurately assess and monitor smaller damage such as micro-tears or strains. This study proposes an acoustic-based wearable capable of estimating tendon load and predicting damage severity in both deep and superficial tendons. Our device consists of an array of acoustic transducers positioned around the targeted body area in the form of a cuff. One of the transducers generates an acoustic wave, which is capable of penetrating deep into the body. As these waves propagate through different tissues, they are influenced by the mechanical and geometrical properties of each tissue. The rest of the transducers are used to measure the propagated waves. The results suggest that the proposed wearable offers a promising alternative to existing superficial tendon monitoring wearable devices by improving the domain of reach. The proposed wearable shows robust performance in estimating the force applied to the tendon. It can also be used effectively to compare the health condition of two tendons and predict the type of damage.
- [6] arXiv:2411.00223 [pdf, html, other]
-
Title: Learning Optimal Interaction Weights in Multi-Agents Systems
Comments: This work is under review at 2024 American Control Conference
Journal-ref: 2024 American Control Conference
Subjects: Systems and Control (eess.SY)
This paper presents a spatio-temporal inverse optimal control framework for understanding interactions in multi-agent systems (MAS). We employ a graph representation approach and model the dynamics of interactions between agents as state-dependent edge weights in a consensus algorithm, incorporating both spatial and temporal dynamics. Our method learns these edge weights from trajectory observations, such as provided by expert demonstrations, which allows us to capture the complexity of nonlinear, distributed interaction behaviors. We derive necessary and sufficient conditions for the optimality of these interaction weights, explaining how the network topology affects MAS coordination. The proposed method is demonstrated on a multi-agent formation control problem, where we show its effectiveness in recovering the interaction weights and coordination patterns from sample trajectory data.
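The forward model this abstract refers to, a consensus algorithm whose edge weights depend on the agents' states, can be sketched in a few lines. This only simulates the dynamics; it does not implement the paper's inverse optimal control step that learns the weights from trajectories. The exponential weight function, the scalar states, the step size, and the complete three-agent graph are all assumptions chosen for illustration.

```python
import math


def edge_weight(x_i, x_j):
    # Hypothetical state-dependent weight: closer agents interact more strongly.
    return math.exp(-abs(x_i - x_j))


def consensus_step(states, neighbors, dt=0.1):
    """One Euler step of x_i' = sum_j w_ij(x_i, x_j) * (x_j - x_i)."""
    new_states = []
    for i, x_i in enumerate(states):
        pull = sum(
            edge_weight(x_i, states[j]) * (states[j] - x_i) for j in neighbors[i]
        )
        new_states.append(x_i + dt * pull)
    return new_states


# Three agents with scalar states on a complete graph (toy example).
states = [0.0, 1.0, 4.0]
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
initial_mean = sum(states) / len(states)

trajectory = [states]
for _ in range(200):
    trajectory.append(consensus_step(trajectory[-1], neighbors))
final_states = trajectory[-1]
```

Because the assumed weights are symmetric, the average state is conserved while the spread between agents shrinks. A trajectory like `trajectory` is the kind of observation from which interaction weights would be recovered.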
- [7] arXiv:2411.00224 [pdf, other]
-
Title: A New Switched Reluctance Motor with Embedded Permanent Magnets for Transportation Electrification
Subjects: Systems and Control (eess.SY)
A new three-phase hybrid-excited multi-tooth switched reluctance motor with embedded permanent magnets is proposed, capable of achieving higher torque density for transportation electrification applications. Operating principles and design considerations are discussed. A magnetic equivalent circuit is developed. Finite element method is employed in the field analysis. The advantages of the proposed topology over existing designs for switched reluctance motors and flux switching motors are presented. Finally, the optimized design is prototyped to experimentally confirm the design and simulation results.
- [8] arXiv:2411.00254 [pdf, html, other]
-
Title: A Novel Breast Ultrasound Image Augmentation Method Using Advanced Neural Style Transfer: An Efficient and Explainable Approach
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Clinical diagnosis of breast malignancy (BM) remains a challenging problem. Deep learning (DL) models continue to offer important solutions for early BM diagnosis, but they tend to overfit because of the limited volume of breast ultrasound (BUS) image data. Further, large BUS datasets are difficult to manage due to privacy and legal concerns. Hence, image augmentation is a necessary and challenging step to improve the performance of the DL models. However, current DL-based augmentation models are inadequate and operate as a black box, resulting in a lack of information and justification about their suitability and efficacy. Additionally, pre- and post-augmentation steps need high-performance computational resources and time to produce the augmented images and evaluate model performance. Thus, this study aims to develop a novel efficient augmentation approach for BUS images with advanced neural style transfer (NST) and Explainable AI (XAI), harnessing GPU-based parallel infrastructure. We scale and distribute the training of the augmentation model across 8 GPUs using the Horovod framework on a DGX cluster, achieving a 5.09x speedup while maintaining the model's accuracy. The proposed model is evaluated on 800 (348 benign and 452 malignant) BUS images, and its performance is compared with other progressive techniques using different quantitative analyses. The results indicate that the proposed approach can successfully augment the BUS images with 92.47% accuracy.
- [9] arXiv:2411.00258 [pdf, html, other]
-
Title: Parameter Estimation on Homogeneous Spaces
Comments: Supplementary document is available
Subjects: Signal Processing (eess.SP)
The Fisher Information Metric (FIM) and the associated Cramér-Rao Bound (CRB) are fundamental tools in statistical signal processing, which inform the efficient design of experiments and algorithms for estimating the underlying parameters. In this article, we investigate these concepts for the case where the parameters lie on a homogeneous space. Unlike the existing Fisher-Rao theory for general Riemannian manifolds, our focus is to leverage the group-theoretic structure of homogeneous spaces, which is often much easier to work with than their Riemannian structure. The FIM is characterized by identifying the homogeneous space with a coset space, the group-theoretic CRB and its corollaries are presented, and its relationship to the Riemannian CRB is clarified. The application of our theory is illustrated using two examples from engineering: (i) estimation of the pose of a robot and (ii) sensor network localization. In particular, these examples demonstrate that homogeneous spaces provide a natural framework for studying statistical models that are invariant with respect to a group of symmetries.
- [10] arXiv:2411.00318 [pdf, html, other]
-
Title: Cyclic Reformulation Based System Identification for Periodically Time-varying Systems
Subjects: Systems and Control (eess.SY)
This paper addresses system identification for linear periodically time-varying plants in the discrete-time setting. A system identification algorithm for linear, periodically time-varying plants is introduced based on a cyclic reformulation and a state coordinate transformation of the cycled system. By using our system identification algorithm, a high-accuracy model of the periodically time-varying plant can be obtained without using specific periodic input signals. The effectiveness of the proposed algorithm is demonstrated with numerical examples.
- [11] arXiv:2411.00326 [pdf, html, other]
-
Title: SpineFM: Leveraging Foundation Models for Automatic Spine X-ray Segmentation
Comments: 4 pages, 3 figures, submitted to ISBI 2025
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
This paper introduces SpineFM, a novel pipeline that achieves state-of-the-art performance in the automatic segmentation and identification of vertebral bodies in cervical and lumbar spine radiographs. SpineFM leverages the regular geometry of the spine, employing a novel inductive process to sequentially infer the location of each vertebra along the spinal column. Vertebrae are segmented using Medical-SAM-Adaptor, a robust foundation model that diverges from commonly used CNN-based models. We achieved outstanding results on two publicly available spine X-ray datasets, with successful identification of 97.8% and 99.6% of annotated vertebrae, respectively. For these identified vertebrae, our segmentation reached an average Dice of 0.942 and 0.921, surpassing previous state-of-the-art methods.
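For readers unfamiliar with the Dice score quoted in this abstract, the following sketch computes it for two binary masks. The flattened toy masks and the convention for two empty masks are assumptions made for the example; this is not the authors' evaluation code.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two equal-length binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    if total == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * intersection / total


# Flattened toy masks (hypothetical): 1 marks a vertebra pixel.
predicted = [1, 1, 0, 0, 1, 0]
reference = [1, 0, 0, 0, 1, 1]
score = dice_coefficient(predicted, reference)  # 2 * 2 / (3 + 3)
```

A score of 1.0 means the predicted and annotated regions coincide exactly, and 0.0 means they do not overlap at all.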
- [12] arXiv:2411.00334 [pdf, html, other]
-
Title: Power Source Allocation for RIS-aided Integrating Sensing, Communication, and Power Transfer Systems Based on NOMA
Subjects: Signal Processing (eess.SP)
This paper proposes a novel communication system framework based on a reconfigurable intelligent surface (RIS)-aided integrated sensing, communication, and power transmission (ISCPT) system. The RIS is used to improve transmission efficiency and sensing accuracy. In addition, non-orthogonal multiple access (NOMA) technology is incorporated to boost the spectrum utilization efficiency of RIS-aided ISCPT systems. We consider the power minimization problem of the RIS-aided ISCPT-NOMA system. Power minimization is achieved by jointly optimizing the RIS phase shift, decoding order, power splitting (PS) factor, and transmit beamforming while satisfying quality of service (QoS), radar target sensing accuracy, and energy harvesting constraints. Since the objective function and constraints in the optimization problem are non-convex, the problem is NP-hard. To solve the non-convex problem, this paper proposes a block coordinate descent (BCD) algorithm. Specifically, the non-convex problem is divided into four sub-problems: the transmit beamforming, RIS phase shift, decoding order, and PS factor optimization sub-problems. We employ semidefinite relaxation (SDR) and successive convex approximation (SCA) techniques to address the transmit beamforming optimization sub-problem. Subsequently, we leverage the alternating direction method of multipliers (ADMM) algorithm to solve the RIS phase shift optimization sub-problem. For the decoding order optimization, we provide a closed-form expression. For the PS factor optimization sub-problem, the SCA algorithm is proposed. Simulation results illustrate the effectiveness of our proposed algorithm and highlight the balanced performance achieved across sensing, communication, and power transfer.
- [13] arXiv:2411.00337 [pdf, html, other]
-
Title: Coherent Hierarchical Probabilistic Forecasting of Electric Vehicle Charging Demand
Comments: Paper accepted for IEEE Transactions on Industrial Applications. Personal use of this material is permitted. Permission from Elsevier must be obtained for all other uses
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
The growing penetration of electric vehicles (EVs) significantly changes typical load curves in smart grids. With the development of fast charging technology, the volatility of EV charging demand is increasing, which requires additional flexibility for real-time power balance. The forecasting of EV charging demand involves probabilistic modeling of high dimensional time series dynamics across diverse electric vehicle charging stations (EVCSs). This paper studies the forecasting problem of multiple EVCS in a hierarchical probabilistic manner. For each charging station, a deep learning model based on a partial input convex neural network (PICNN) is trained to predict the day-ahead charging demand's conditional distribution, preventing the common quantile crossing problem in traditional quantile regression models. Then, differentiable convex optimization layers (DCLs) are used to reconcile the scenarios sampled from the distributions to yield coherent scenarios that satisfy the hierarchical constraint. It learns a better weight matrix for adjusting the forecasting results of different targets in a machine-learning approach compared to traditional optimization-based hierarchical reconciling methods. Numerical experiments based on real-world EV charging data are conducted to demonstrate the efficacy of the proposed method.
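The "quantile crossing problem" mentioned in this abstract arises when separately fitted quantile regressors produce, for example, a 90% quantile below the 50% quantile. The sketch below shows the standard pinball (quantile) loss and a simple crossing check. It illustrates the problem only, not the PICNN model or the reconciliation layers; the function names and sample numbers are hypothetical.

```python
def pinball_loss(y_true, q_pred, tau):
    """Quantile loss for one observation at quantile level tau in (0, 1).

    Under-prediction is penalized by tau, over-prediction by (1 - tau).
    """
    diff = y_true - q_pred
    return max(tau * diff, (tau - 1.0) * diff)


def has_quantile_crossing(quantile_preds):
    """True if predictions, ordered by increasing quantile level, ever decrease."""
    return any(lo > hi for lo, hi in zip(quantile_preds, quantile_preds[1:]))


# Hypothetical day-ahead demand predictions (kW) at levels 0.1, 0.5, 0.9.
crossed = [5.0, 7.0, 6.5]    # the 0.9 quantile falls below the 0.5 quantile
coherent = [5.0, 6.5, 7.0]   # monotone, so it describes a valid distribution
```

Minimizing the pinball loss independently per level gives no guarantee of monotonicity, which is why approaches that predict the whole conditional distribution at once avoid the issue by construction.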
- [14] arXiv:2411.00338 [pdf, html, other]
-
Title: Computational Imaging Through Atmospheric Turbulence
Subjects: Image and Video Processing (eess.IV)
Since the seminal work of Andrey Kolmogorov in the early 1940s, imaging through atmospheric turbulence has grown from a pure scientific pursuit to an important subject across a multitude of civilian, space-mission, and national security applications. Fueled by the recent advancement of deep learning, the field is further experiencing a new wave of momentum. However, for these deep learning methods to perform well, new efforts are needed to build faster and more accurate computational models while at the same time maximizing the performance of image reconstruction.
The book is written primarily for image processing engineers, computer vision scientists, and engineering students who are interested in the field of atmospheric turbulence, statistical optics, and image processing. The book can be used as a graduate text, or for advanced topic classes for undergraduates.
- [15] arXiv:2411.00416 [pdf, html, other]
-
Title: Edge centrality and the total variation of graph distributional signals
Subjects: Signal Processing (eess.SP)
This short note is a supplement to [1], in which the total variation of graph distributional signals is introduced and studied. We introduce a different formulation of total variation and relate it to the notion of edge centrality. The relation provides a different perspective of total variation and may facilitate its computation.
- [16] arXiv:2411.00417 [pdf, html, other]
-
Title: Closed-Loop Stability of a Lyapunov-Based Switching Attitude Controller for Energy-Efficient Torque-Input-Selection During Flight
Comments: 2024 IEEE International Conference on Robotics and Biomimetics (ROBIO)
Subjects: Systems and Control (eess.SY); Robotics (cs.RO)
We present a new Lyapunov-based switching attitude controller for energy-efficient real-time selection of the torque input to an uncrewed aerial vehicle (UAV) during flight. The proposed method, using quaternions to describe the attitude of the controlled UAV, interchanges the stability properties of the two fixed points (one locally asymptotically stable and the other unstable) of the resulting closed-loop (CL) switching dynamics of the system. In this approach, the switching events are triggered by the value of a compound energy-based function. To analyze and ensure the stability of the CL switching dynamics, we use classical nonlinear Lyapunov techniques, in combination with switching-systems theory. For this purpose, we introduce a new compound Lyapunov function (LF) that not only enables us to derive the conditions for CL asymptotic and exponential stability, but also provides us with an estimate of the CL system's region of attraction. This new estimate is considerably larger than those previously reported for systems of the type considered in this paper. To test and demonstrate the functionality, suitability, and performance of the proposed method, we present and discuss experimental data obtained using a 31-g quadrotor during the execution of high-speed yaw-tracking maneuvers. Also, we provide empirical evidence indicating that all the initial conditions chosen for these maneuvers, as estimated, lie inside the system's region of attraction. Last, experimental data obtained through these flight tests show that the proposed switching controller reduces the control effort by about 53%, on average, with respect to that corresponding to a commonly used benchmark control scheme, when executing a particular type of high-speed yaw-tracking maneuver.
- [17] arXiv:2411.00433 [pdf, html, other]
-
Title: Joint Beamforming for Multi-target Detection and Multi-user Communication in ISAC Systems
Comments: 5 pages, 4 figures, submitted to IEEE journal
Subjects: Signal Processing (eess.SP)
Detecting weak targets is one of the main challenges for integrated sensing and communication (ISAC) systems. Sensing and communication suffer from a performance trade-off in ISAC systems. As the communication demand increases, sensing ability, especially weak target detection performance, will inevitably reduce. Traditional approaches fail to address this issue. In this paper, we develop a joint beamforming scheme and formulate it as a max-min problem to maximize the detection probability of the weakest target under the constraint of the signal-to-interference-plus-noise ratio (SINR) of multi-user communication. An alternating optimization (AO) algorithm is developed for solving the complicated non-convex problem to obtain the joint beamformer. The proposed scheme can direct the transmit energy toward the multiple targets properly to ensure robust multi-target detection performance. Numerical results show that the proposed beamforming scheme can effectively increase the detection probability of the weakest target compared to baseline approaches while ensuring communication performance.
- [18] arXiv:2411.00496 [pdf, html, other]
-
Title: Fundamental Trade-offs in Quantized Hybrid Radar Fusion: A CRB-Rate Perspective
Subjects: Signal Processing (eess.SP)
While recent advancements have highlighted the role of low-resolution analog-to-digital converters (ADCs) in integrated sensing and communication (ISAC) systems, the specific impact of ADC resolution on hybrid radar fusion (HRF) remains relatively unexplored. The uplink (UL) paths in HRF, comprising both direct and reflected signals within the same frequency band, pose unique challenges, particularly given that the reflected signal is often significantly weaker than the direct path, making HRF systems sensitive to ADC resolution. To investigate the influence of quantization and ADC resolution on HRF, we employ the quantized Cramér-Rao bound (CRB) as a metric for sensing accuracy. This work derives the quantized CRB specifically for HRF systems and the quantized communication rate. We extend our analysis to obtain lower bounds on the Fisher Information Matrix (FIM) and UL communication rates, which we use to characterize quantized HRF systems. Using these derived bounds, we analyze quantized HRF systems through the lens of CRB-rate boundaries. We obtain the CRB-rate boundary through two optimization problems, where each solution point represents a trade-off boundary between the sensing accuracy and the communication rate. Extensive simulations illustrate the influence of ADC resolution, dynamic range (DR), and various system parameters on the CRB-rate boundary of HRF systems. These results offer critical insights into the design of efficient and high-performance HRF systems.
- [19] arXiv:2411.00506 [pdf, html, other]
-
Title: Weighted Null Space Fitting (WNSF): A Link between The Prediction Error Method and Subspace Identification
Subjects: Systems and Control (eess.SY)
Subspace identification method (SIM) has been proven to be very useful and numerically robust for estimating state-space models. However, it is in general not believed to be as accurate as the prediction error method (PEM). Conversely, PEM, although more accurate, comes with non-convex optimization problems and requires local non-linear optimization algorithms and good initialization points. This contribution proposes a weighted null space fitting (WNSF) method to identify a state-space model, combining some advantages of the two mainstream approaches aforementioned. It starts with the estimate of a non-parametric model using least-squares, and then the reduction to a state-space model in the observer canonical form is a multi-step least-squares procedure where each step consists of the solution of a quadratic optimization problem. Unlike SIM, which focuses on the range space of the extended observability matrix, WNSF estimates its null space, avoiding the need for singular value decomposition. Moreover, the statistically optimal weighting for the null space fitting problem is derived. It is conjectured that WNSF is asymptotically efficient, which is supported by a simulation study.
- [20] arXiv:2411.00527 [pdf, html, other]
-
Title: MAROON: A Framework for the Joint Characterization of Near-Field High-Resolution Radar and Optical Depth Imaging Techniques
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Utilizing the complementary strengths of wavelength-specific range or depth sensors is crucial for robust computer-assisted tasks such as autonomous driving. Despite this, there is still little research done at the intersection of optical depth sensors and radars operating at close range, where the target is decimeters away from the sensors. Together with a growing interest in high-resolution imaging radars operating in the near field, the question arises of how these sensors behave in comparison to their traditional optical counterparts.
In this work, we take on the unique challenge of jointly characterizing depth imagers from both the optical and radio-frequency domains using a multimodal spatial calibration. We collect data from four depth imagers: three optical sensors of varying operation principle and an imaging radar. We provide a comprehensive evaluation of their depth measurements with respect to distinct object materials, geometries, and object-to-sensor distances. Specifically, we reveal scattering effects of partially transmissive materials and investigate the response of radio-frequency signals. All object measurements will be made public in the form of a multimodal dataset, called MAROON.
- [21] arXiv:2411.00547 [pdf, other]
-
Title: Demystifying the use of Compression in Virtual Production
Authors: Anil Kokaram, Vibhoothi Vibhoothi, Julien Zouein, François Pitié, Christopher Nash, James Bentley, Philip Coulam-Jones
Comments: SMPTE Media Summit Paper on use of Compression in Virtual Production from TCD and Disguise
Subjects: Image and Video Processing (eess.IV); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
Virtual Production (VP) technologies have continued to improve the flexibility of on-set filming and enhance the live concert experience. The core technology of VP relies on high-resolution, high-brightness LED panels to playback/render video content. There are a number of technical challenges to effective deployment e.g. image tile synchronisation across the panels, cross panel colour balancing and compensating for colour fluctuations due to changes in camera angles. Given the complexity and potential quality degradation, the industry prefers "pristine" or lossless compressed source material for displays, which requires significant storage and bandwidth. Modern lossy compression standards like AV1 or H.265 could maintain the same quality at significantly lower bitrates and resource demands. There is yet no agreed methodology for assessing the impact of these standards on quality when the VP scene is recorded in-camera. We present a methodology to assess this impact by comparing lossless and lossy compressed footage displayed through VP screens and recorded in-camera. We assess the quality impact of HAP/NotchLC/Daniel2 and AV1/HEVC/H.264 compression bitrates from 2 Mb/s to 2000 Mb/s with various GOP sizes. Several perceptual quality metrics are then used to automatically evaluate in-camera picture quality, referencing the original uncompressed source content through the LED wall. Our results show that we can achieve the same quality with hybrid codecs as with intermediate encoders at orders of magnitude less bitrate and storage requirements.
- [22] arXiv:2411.00579 [pdf, html, other]
-
Title: Constraint-Driven Multi-USV Coverage Path Generation for Aquatic Environmental Monitoring
Subjects: Systems and Control (eess.SY)
In this article, we address aquatic environmental monitoring using a fleet of unmanned surface vehicles (USVs). Specifically, we develop an online path generator that provides either circular or elliptic paths based on real-time feedback so that the USVs efficiently sample sensor data over a given aquatic environment. To this end, we begin by formulating a novel online path generation problem for a group of Dubins vehicles in the form of cost minimization based on the formulation of persistent coverage control. We then transform the cost minimization into a constraint-based specification so that a prescribed performance level is certified. An online coverage path generator is then designed based on the so-called constraint-based control in order to meet the performance certificate together with additional constraints inherent in the parameters that specify the paths. It is also shown that the present constraint-based approach allows one to drastically reduce the computational complexity stemming from combinations of binary variables corresponding to the turning directions of the USVs. The present coverage path generator is finally demonstrated through simulations and experiments on an original testbed of multiple USVs.
- [23] arXiv:2411.00594 [pdf, other]
-
Title: Deep learning-based auto-contouring of organs/structures-at-risk for pediatric upper abdominal radiotherapy
Authors: Mianyong Ding, Matteo Maspero, Annemieke S Littooij, Martine van Grotel, Raquel Davila Fajardo, Max M van Noesel, Marry M van den Heuvel-Eibrink, Geert O Janssens
Comments: 23 pages, 5 figures, 1 table. Submitted to Radiotherapy and Oncology (2024-11-01)
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Purpose: This study aimed to develop a computed tomography (CT)-based multi-organ segmentation model for delineating organs-at-risk (OARs) in pediatric upper abdominal tumors and evaluate its robustness across multiple datasets. Materials and methods: In-house postoperative CTs from pediatric patients with renal tumors and neuroblastoma (n=189) and a public dataset (n=189) with CTs covering thoracoabdominal regions were used. Seventeen OARs were delineated: nine by clinicians (Type 1) and eight using TotalSegmentator (Type 2). Auto-segmentation models were trained using in-house (Model-PMC-UMCU) and a combined dataset of public data (Model-Combined). Performance was assessed with Dice Similarity Coefficient (DSC), 95% Hausdorff Distance (HD95), and mean surface distance (MSD). Two clinicians rated clinical acceptability on a 5-point Likert scale across 15 patient contours. Model robustness was evaluated against sex, age, intravenous contrast, and tumor type. Results: Model-PMC-UMCU achieved mean DSC values above 0.95 for five of nine OARs, while spleen and heart ranged between 0.90 and 0.95. The stomach-bowel and pancreas exhibited DSC values below 0.90. Model-Combined demonstrated improved robustness across both datasets. Clinical evaluation revealed good usability, with both clinicians rating six of nine Type 1 OARs above four and six of eight Type 2 OARs above three. Significant performance differences were only found across age groups in both datasets, specifically in the left lung and pancreas. The 0-2 age group showed the lowest performance. Conclusion: A multi-organ segmentation model was developed, showcasing enhanced robustness when trained on combined datasets. This model is suitable for various OARs and can be applied to multiple datasets in clinical settings.
- [24] arXiv:2411.00605 [pdf, other]
-
Title: pcaGAN: Improving Posterior-Sampling cGANs via Principal Component Regularization
Comments: To appear at NeurIPS 2024
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
In ill-posed imaging inverse problems, there can exist many hypotheses that fit both the observed measurements and prior knowledge of the true image. Rather than returning just one hypothesis of that image, posterior samplers aim to explore the full solution space by generating many probable hypotheses, which can later be used to quantify uncertainty or construct recoveries that appropriately navigate the perception/distortion trade-off. In this work, we propose a fast and accurate posterior-sampling conditional generative adversarial network (cGAN) that, through a novel form of regularization, aims for correctness in the posterior mean as well as the trace and K principal components of the posterior covariance matrix. Numerical experiments demonstrate that our method outperforms contemporary cGANs and diffusion models in imaging inverse problems like denoising, large-scale inpainting, and accelerated MRI recovery. The code for our model can be found here: this https URL.
- [25] arXiv:2411.00609 [pdf, html, other]
-
Title: Tumor Location-weighted MRI-Report Contrastive Learning: A Framework for Improving the Explainability of Pediatric Brain Tumor Diagnosis
Sara Ketabi, Matthias W. Wagner, Cynthia Hawkins, Uri Tabori, Birgit Betina Ertl-Wagner, Farzad Khalvati
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Despite the promising performance of convolutional neural networks (CNNs) in brain tumor diagnosis from magnetic resonance imaging (MRI), their integration into the clinical workflow has been limited. This is mainly because the features contributing to a model's prediction are unclear to radiologists and hence appear clinically irrelevant, i.e., the models lack explainability. As invaluable sources of radiologists' knowledge and expertise, radiology reports can be integrated with MRI in a contrastive learning (CL) framework, enabling learning from image-report associations, to improve CNN explainability. In this work, we train a multimodal CL architecture on 3D brain MRI scans and radiology reports to learn informative MRI representations. Furthermore, we integrate tumor location, salient to several brain tumor analysis tasks, into this framework to improve its generalizability. We then apply the learnt image representations to improve explainability and performance of genetic marker classification of pediatric Low-grade Glioma, the most prevalent brain tumor in children, as a downstream task. Our results indicate a Dice score of 31.1% between the model's attention maps and manual tumor segmentation (as an explainability measure) with test classification performance of 87.7%, significantly outperforming the baselines. These enhancements can build trust in our model among radiologists, facilitating its integration into clinical practices for more efficient tumor diagnosis.
- [26] arXiv:2411.00617 [pdf, html, other]
-
Title: A Graph Attention-Guided Diffusion Model for Liver Vessel Segmentation
Comments: This work has been submitted to the IEEE for possible publication
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Improving connectivity and completeness are the most challenging aspects of small liver vessel segmentation. It is difficult for existing methods to obtain segmented liver vessel trees that simultaneously have continuous geometry and detail in small vessels. We propose a diffusion model-based method with multi-scale graph attention guidance to break through this bottleneck in liver vessel segmentation. Experiments show that the proposed method outperforms the other state-of-the-art methods used in this study on two public datasets, 3D-ircadb-01 and LiVS. Dice coefficient and Sensitivity are improved by at least 11.67% and 24.21% on the 3D-ircadb-01 dataset, and by at least 3.21% and 9.11% on the LiVS dataset. Connectivity is also quantitatively evaluated in this study and our method performs best. The proposed method is reliable for small liver vessel segmentation.
- [27] arXiv:2411.00656 [pdf, html, other]
-
Title: Identification of Analytic Nonlinear Dynamical Systems with Non-asymptotic Guarantees
Comments: NeurIPS 2024
Subjects: Systems and Control (eess.SY)
This paper focuses on the system identification of an important class of nonlinear systems: linearly parameterized nonlinear systems, which enjoy wide applications in robotics and other mechanical systems. We consider two system identification methods: least-squares estimation (LSE), which is a point estimation method; and set-membership estimation (SME), which estimates an uncertainty set that contains the true parameters. We provide non-asymptotic convergence rates for LSE and SME under i.i.d. control inputs and control policies with i.i.d. random perturbations, both of which are considered as non-active-exploration inputs. Compared with the counter-example based on piecewise-affine systems in the literature, the success of non-active exploration in our setting relies on a key assumption on the system dynamics: we require the system functions to be real-analytic. Our results, together with the piecewise-affine counter-example, reveal the importance of differentiability in nonlinear system identification through non-active exploration. Lastly, we numerically compare our theoretical bounds with the empirical performance of LSE and SME on a pendulum example and a quadrotor example.
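To make the setting concrete, the sketch below runs least-squares estimation on a toy linearly parameterized nonlinear system driven by i.i.d. inputs. The scalar dynamics x_next = θ1·x + θ2·sin(x) + u + w, the feature map [x, sin(x)], and the function names `simulate` and `least_squares` are assumptions chosen for illustration; they are not the paper's pendulum or quadrotor examples, and the sketch does not reproduce its convergence analysis.

```python
import math
import random
from typing import List, Tuple

Sample = Tuple[float, float, float]  # (x, u, x_next)


def simulate(theta: Tuple[float, float], steps: int, noise_std: float,
             seed: int = 0) -> List[Sample]:
    """Roll out x_next = theta1*x + theta2*sin(x) + u + w with i.i.d. inputs u."""
    rng = random.Random(seed)
    x = 0.0
    data = []
    for _ in range(steps):
        u = rng.uniform(-1.0, 1.0)     # i.i.d. control input, no active exploration
        w = rng.gauss(0.0, noise_std)  # process noise
        x_next = theta[0] * x + theta[1] * math.sin(x) + u + w
        data.append((x, u, x_next))
        x = x_next
    return data


def least_squares(data: List[Sample]) -> Tuple[float, float]:
    """Solve the 2x2 normal equations for the features phi(x) = [x, sin(x)]."""
    g11 = g12 = g22 = b1 = b2 = 0.0
    for x, u, x_next in data:
        f1, f2 = x, math.sin(x)
        y = x_next - u                 # the input enters with a known gain of 1
        g11 += f1 * f1
        g12 += f1 * f2
        g22 += f2 * f2
        b1 += f1 * y
        b2 += f2 * y
    det = g11 * g22 - g12 * g12
    if abs(det) < 1e-12:
        raise ValueError("features are not sufficiently excited")
    return ((g22 * b1 - g12 * b2) / det, (g11 * b2 - g12 * b1) / det)


if __name__ == "__main__":
    true_theta = (0.5, 0.3)
    for steps in (50, 500, 5000):
        estimate = least_squares(simulate(true_theta, steps, noise_std=0.05))
        print(steps, estimate)
```

The parameters enter linearly even though the dynamics are nonlinear in the state, which is why ordinary least squares applies; printing the estimate for increasing trajectory lengths gives a rough empirical view of how the error shrinks with more data.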
- [28] arXiv:2411.00664 [pdf, other]
-
Title: Optimizing Contextual Speech Recognition Using Vector Quantization for Efficient Retrieval
Comments: 14 pages, 7 figures, submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Neural contextual biasing allows speech recognition models to leverage contextually relevant information, leading to improved transcription accuracy. However, the biasing mechanism is typically based on a cross-attention module between the audio and a catalogue of biasing entries, which means computational complexity can pose severe practical limitations on the size of the biasing catalogue and consequently on accuracy improvements. This work proposes an approximation to cross-attention scoring based on vector quantization and enables compute- and memory-efficient use of large biasing catalogues. We propose to use this technique jointly with a retrieval based contextual biasing approach. First, we use an efficient quantized retrieval module to shortlist biasing entries by grounding them on audio. Then we use retrieved entries for biasing. Since the proposed approach is agnostic to the biasing method, we investigate using full cross-attention, LLM prompting, and a combination of the two. We show that retrieval based shortlisting allows the system to efficiently leverage biasing catalogues of several thousands of entries, resulting in up to 71% relative error rate reduction in personal entity recognition. At the same time, the proposed approximation algorithm reduces compute time by 20% and memory usage by 85-95%, for lists of up to one million entries, when compared to standard dot-product cross-attention.
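The general idea of quantized shortlisting can be sketched as follows: each catalogue entry is replaced by the index of its nearest codeword, the query is scored once per codeword, and each entry's approximate score becomes a table lookup. This is a generic illustration under assumed names (`nearest_code`, `shortlist`, `retrieve`) with a hand-picked codebook; the abstract does not give the paper's actual quantizer, scoring function, or biasing model, so none of those are reproduced here.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def nearest_code(entry: Vector, codebook: List[Vector]) -> int:
    """Index of the codeword closest to `entry` in squared Euclidean distance."""
    def sq_dist(c: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(entry, c))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def shortlist(query: Vector, codebook: List[Vector], codes: List[int],
              k: int) -> List[int]:
    """Rank entries by the score of their codeword and keep the top k."""
    code_scores = [dot(query, c) for c in codebook]  # one dot product per codeword
    approx = [code_scores[code] for code in codes]   # one table lookup per entry
    order = sorted(range(len(codes)), key=lambda i: -approx[i])
    return order[:k]


def retrieve(query: Vector, entries: List[Vector], codebook: List[Vector],
             codes: List[int], k: int) -> int:
    """Shortlist with quantized scores, then rescore only the shortlist exactly."""
    candidates = shortlist(query, codebook, codes, k)
    return max(candidates, key=lambda i: dot(query, entries[i]))


if __name__ == "__main__":
    # Hypothetical 2-D embeddings standing in for biasing entries.
    codebook = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
    entries = [(0.9, 0.1), (1.1, -0.1), (0.1, 0.9), (-0.8, 0.2), (0.0, 1.2)]
    codes = [nearest_code(e, codebook) for e in entries]  # computed once, offline
    print(retrieve((1.0, 0.2), entries, codebook, codes, k=2))
```

The per-query cost of the shortlisting step scales with the codebook size plus a lookup per entry, rather than with a full dot product per entry, which is the kind of saving the abstract attributes to the approximation.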
- [29] arXiv:2411.00668 [pdf, html, other]
-
Title: Model Predictive Contouring Control with Barrier and Lyapunov Functions for Stable Path-Following in UAV systems
Bryan S. Guevara, Viviana Moya, Luis F. Recalde, David Pozo-Espin, Daniel C. Gandolfo, Juan M. Toibero
Comments: Submitted to IEEE Access for review
Subjects: Systems and Control (eess.SY)
In this study, we propose a novel method that integrates Nonlinear Model Predictive Contour Control (NMPCC) with an Exponentially Stabilizing Control Lyapunov Function (ES-CLF) and Exponential Higher-Order Control Barrier Functions to achieve stable path-following and obstacle avoidance in UAV systems. This framework enables unmanned aerial vehicles (UAVs) to safely navigate around both static and dynamic obstacles while strictly adhering to desired paths. The quaternion-based formulation ensures precise orientation and attitude control, while a robust optimization solver enforces the constraints imposed by the Control Lyapunov Function (CLF) and Control Barrier Functions (CBF), ensuring reliable real-time performance. The method was validated in a Model-in-the-Loop (MiL) environment, demonstrating effective path tracking and obstacle avoidance. The results highlight the framework's ability to minimize both orthogonal and tangential errors, ensuring stability and safety in complex environments.
- [30] arXiv:2411.00703 [pdf, html, other]
-
Title: Set-Theoretic Direct Data-driven Predictive Control
Subjects: Systems and Control (eess.SY)
Designing the terminal ingredients of direct data-driven predictive control presents challenges due to its reliance on an implicit, non-minimal input-output data-driven representation. By considering the class of constrained LTI systems with unknown time delays, we propose a set-theoretic direct data-driven predictive controller that does not require a terminal cost to provide closed-loop guarantees. In particular, first, starting from input/output data series, we propose a sample-based method to build N-step input output backward reachable sets. Then, we leverage the constructed family of backward reachable sets to derive a data-driven control law. The proposed method guarantees finite-time convergence and recursive feasibility, independent of objective function tuning. It requires neither explicit state estimation nor an explicit prediction model, relying solely on input-output measurements; therefore, unmodeled dynamics can be avoided. Finally, a numerical example highlights the effectiveness of the proposed method in stabilizing the system, whereas direct data-driven predictive control without terminal ingredients fails under the same conditions.
- [31] arXiv:2411.00726 [pdf, html, other]
-
Title: Cross-Fundus Transformer for Multi-modal Diabetic Retinopathy Grading with Cataract
Comments: 10 pages, 4 figures
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Diabetic retinopathy (DR) is a leading cause of blindness worldwide and a common complication of diabetes. As two different imaging tools for DR grading, color fundus photography (CFP) and infrared fundus photography (IFP) are highly-correlated and complementary in clinical applications. To the best of our knowledge, this is the first study that explores a novel multi-modal deep learning framework to fuse the information from CFP and IFP towards more accurate DR grading. Specifically, we construct a dual-stream architecture Cross-Fundus Transformer (CFT) to fuse the ViT-based features of two fundus image modalities. In particular, a meticulously engineered Cross-Fundus Attention (CFA) module is introduced to capture the correspondence between CFP and IFP images. Moreover, we adopt both the single-modality and multi-modality supervisions to maximize the overall performance for DR grading. Extensive experiments on a clinical dataset consisting of 1,713 pairs of multi-modal fundus images demonstrate the superiority of our proposed method. Our code will be released for public access.
- [32] arXiv:2411.00749 [pdf, html, other]
-
Title: PathoGen-X: A Cross-Modal Genomic Feature Trans-Align Network for Enhanced Survival Prediction from Histopathology Images
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Genomics (q-bio.GN); Tissues and Organs (q-bio.TO)
Accurate survival prediction is essential for personalized cancer treatment. However, genomic data - often a more powerful predictor than pathology data - is costly and inaccessible. We present the cross-modal genomic feature translation and alignment network for enhanced survival prediction from histopathology images (PathoGen-X). It is a deep learning framework that leverages both genomic and imaging data during training, relying solely on imaging data at testing. PathoGen-X employs transformer-based networks to align and translate image features into the genomic feature space, enhancing weaker imaging signals with stronger genomic signals. Unlike other methods, PathoGen-X translates and aligns features without projecting them to a shared latent space and requires fewer paired samples. Evaluated on TCGA-BRCA, TCGA-LUAD, and TCGA-GBM datasets, PathoGen-X demonstrates strong survival prediction performance, emphasizing the potential of enriched imaging models for accessible cancer prognosis.
- [33] arXiv:2411.00772 [pdf, html, other]
-
Title: SANN-PSZ: Spatially Adaptive Neural Network for Head-Tracked Personal Sound Zones
Comments: This work has been submitted to the IEEE for possible publication
Subjects: Audio and Speech Processing (eess.AS)
A deep learning framework for dynamically rendering personal sound zones (PSZs) with head tracking is presented, utilizing a spatially adaptive neural network (SANN) that inputs listeners' head coordinates and outputs PSZ filter coefficients. The SANN model is trained using either simulated acoustic transfer functions (ATFs) with data augmentation for robustness in uncertain environments or a mix of simulated and measured ATFs for customization under known conditions. It is found that augmenting room reflections in the training data can more effectively improve the model robustness than augmenting the system imperfections, and that adding constraints such as filter compactness to the loss function does not significantly affect the model's performance. Comparisons of the best-performing model with traditional filter design methods show that, when no measured ATFs are available, the model yields equal or higher isolation in an actual room environment with fewer filter artifacts. Furthermore, the model achieves significant data compression (100x) and computational efficiency (10x) compared to the traditional methods, making it suitable for real-time rendering of PSZs that adapt to the listeners' head movements.
New submissions (showing 33 of 33 entries)
- [34] arXiv:2411.00078 (cross-list from cs.CV) [pdf, html, other]
-
Title: How Good Are We? Evaluating Cell AI Foundation Models in Kidney Pathology with Human-in-the-Loop Enrichment
Junlin Guo, Siqi Lu, Can Cui, Ruining Deng, Tianyuan Yao, Zhewen Tao, Yizhe Lin, Marilyn Lionts, Quan Liu, Juming Xiong, Yu Wang, Shilin Zhao, Catie Chang, Mitchell Wilkes, Mengmeng Yin, Haichun Yang, Yuankai Huo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Training AI foundation models has emerged as a promising large-scale learning approach for addressing real-world healthcare challenges, including digital pathology. While many of these models have been developed for tasks like disease diagnosis and tissue quantification using extensive and diverse training datasets, their readiness for deployment on some of the arguably simplest tasks, such as nuclei segmentation within a single organ (e.g., the kidney), remains uncertain. This paper seeks to answer this key question, "How good are we?", by thoroughly evaluating the performance of recent cell foundation models on a curated multi-center, multi-disease, and multi-species external testing dataset. Additionally, we tackle a more challenging question, "How can we improve?", by developing and assessing human-in-the-loop data enrichment strategies aimed at enhancing model performance while minimizing the reliance on pixel-level human annotation. To address the first question, we curated a multi-center, multi-disease, and multi-species dataset consisting of 2,542 kidney whole slide images (WSIs). Three state-of-the-art (SOTA) cell foundation models (Cellpose, StarDist, and CellViT) were selected for evaluation. To tackle the second question, we explored data enrichment algorithms by distilling predictions from the different foundation models with a human-in-the-loop framework, aiming to further enhance foundation model performance with minimal human effort. Our experimental results showed that all three foundation models improved over their baselines after fine-tuning with enriched data. Interestingly, the baseline model with the highest F1 score does not yield the best segmentation outcomes after fine-tuning. This study establishes a benchmark for the development and deployment of cell vision foundation models tailored for real-world data applications.
- [35] arXiv:2411.00107 (cross-list from cs.RO) [pdf, html, other]
-
Title: First, Learn What You Don't Know: Active Information Gathering for Driving at the Limits of Handling
Alexander Davydov, Franck Djeumou, Marcus Greiff, Makoto Suminaka, Michael Thompson, John Subosits, Thomas Lew
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Combining data-driven models that adapt online and model predictive control (MPC) has enabled effective control of nonlinear systems. However, when deployed on unstable systems, online adaptation may not be fast enough to ensure reliable simultaneous learning and control. For example, controllers on a vehicle executing highly dynamic maneuvers may push the tires to their friction limits, destabilizing the vehicle and allowing modeling errors to quickly compound and cause a loss of control. In this work, we present a Bayesian meta-learning MPC framework. We propose an expressive vehicle dynamics model that leverages Bayesian last-layer meta-learning to enable rapid online adaptation. The model's uncertainty estimates are used to guide informative data collection and quickly improve the model prior to deployment. Experiments on a Toyota Supra show that (i) the framework enables reliable control in dynamic drifting maneuvers, (ii) online adaptation alone may not suffice for zero-shot control of a vehicle at the edge of stability, and (iii) active data collection helps achieve reliable performance.
- [36] arXiv:2411.00121 (cross-list from cs.SD) [pdf, html, other]
-
Title: I Can Hear You: Selective Robust Training for Deepfake Audio Detection
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Recent advances in AI-generated voices have intensified the challenge of detecting deepfake audio, posing risks for scams and the spread of disinformation. To tackle this issue, we establish the largest public voice dataset to date, named DeepFakeVox-HQ, comprising 1.3 million samples, including 270,000 high-quality deepfake samples from 14 diverse sources. Despite previously reported high accuracy, existing deepfake voice detectors struggle with our diversely collected dataset, and their detection success rates drop even further under realistic corruptions and adversarial attacks. We conduct a holistic investigation into factors that enhance model robustness and show that incorporating a diversified set of voice augmentations is beneficial. Moreover, we find that the best detection models often rely on high-frequency features, which are imperceptible to humans and can be easily manipulated by an attacker. To address this, we propose the F-SAT: Frequency-Selective Adversarial Training method focusing on high-frequency components. Empirical results demonstrate that using our training dataset boosts baseline model performance (without robust training) by 33%, and our robust training further improves accuracy by 7.7% on clean samples and by 29.3% on corrupted and attacked samples, over the state-of-the-art RawNet3 model.
- [37] arXiv:2411.00153 (cross-list from cs.SD) [pdf, html, other]
-
Title: Angular Distance Distribution Loss for Audio Classification
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Classification is a pivotal task in deep learning not only because of its intrinsic importance, but also for providing embeddings with desirable properties in other tasks. To optimize these properties, a wide variety of loss functions have been proposed that attempt to minimize the intra-class distance and maximize the inter-class distance in the embeddings space. In this paper we argue that, in addition to these two, eliminating hierarchies within and among classes are two other desirable properties for classification embeddings. Furthermore, we propose the Angular Distance Distribution (ADD) Loss, which aims to enhance the four previous properties jointly. For this purpose, it imposes conditions on the first and second order statistical moments of the angular distance between embeddings. Finally, we perform experiments showing that our loss function improves all four properties and, consequently, performs better than other loss functions in audio classification tasks.
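The quantities the abstract refers to, the first and second order moments of the angular distance between embeddings, can be computed as in the sketch below. It only gathers the mean and variance of intra-class and inter-class angles; how the ADD Loss turns these statistics into a training objective is not stated in the abstract, so that part is deliberately left out. The function names and the use of the population variance are assumptions.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def angular_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Angle in radians between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cosine)


def moments(values: List[float]) -> Tuple[float, float]:
    """Mean and population variance of a non-empty list."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def angular_statistics(embeddings: List[Sequence[float]],
                       labels: List[int]) -> Dict[str, Tuple[float, float]]:
    """Moments of pairwise angles, split into same-class and different-class pairs."""
    intra, inter = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        d = angular_distance(embeddings[i], embeddings[j])
        (intra if labels[i] == labels[j] else inter).append(d)
    return {"intra": moments(intra), "inter": moments(inter)}


if __name__ == "__main__":
    embeddings = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
    labels = [0, 0, 1, 1]
    print(angular_statistics(embeddings, labels))
```

A small intra-class mean with a large inter-class mean corresponds to the two classical properties named in the abstract; the variances give a handle on how evenly the distances are spread.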
- [38] arXiv:2411.00178 (cross-list from cs.CV) [pdf, other]
-
Title: Clinical Evaluation of Medical Image Synthesis: A Case Study in Wireless Capsule Endoscopy
Panagiota Gatoula, Dimitrios E. Diamantis, Anastasios Koulaouzidis, Cristina Carretero, Stefania Chetcuti-Zammit, Pablo Cortegoso Valdivia, Begoña González-Suárez, Alessandro Mussetto, John Plevris, Alexander Robertson, Bruno Rosa, Ervin Toth, Dimitris K. Iakovidis
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Sharing retrospectively acquired data is essential for both clinical research and training. Synthetic Data Generation (SDG), using Artificial Intelligence (AI) models, can overcome privacy barriers in sharing clinical data, enabling advancements in medical diagnostics. This study focuses on the clinical evaluation of medical SDG, with a proof-of-concept investigation on diagnosing Inflammatory Bowel Disease (IBD) using Wireless Capsule Endoscopy (WCE) images. The paper contributes by a) presenting a protocol for the systematic evaluation of synthetic images by medical experts and b) applying it to assess TIDE-II, a novel variational autoencoder-based model for high-resolution WCE image synthesis, with a comprehensive qualitative evaluation conducted by 10 international WCE specialists, focusing on image quality, diversity, realism, and clinical decision-making. The results show that TIDE-II generates clinically relevant WCE images, helping to address data scarcity and enhance diagnostic tools. The proposed protocol serves as a reference for future research on medical image-generation techniques.
- [39] arXiv:2411.00195 (cross-list from cs.SD) [pdf, html, other]
-
Title: Machine Learning Framework for Audio-Based Content Evaluation using MFCC, Chroma, Spectral Contrast, and Temporal Feature Engineering
Comments: 6 pages, 6 figures
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
This study presents a machine learning framework for assessing similarity between audio content and predicting sentiment score. We construct a dataset containing audio samples from music covers on YouTube along with the audio of the original song, and sentiment scores derived from user comments, serving as proxy labels for content quality. Our approach involves extensive pre-processing, segmenting audio signals into 30-second windows, and extracting high-dimensional feature representations through Mel-Frequency Cepstral Coefficients (MFCC), Chroma, Spectral Contrast, and Temporal characteristics. Leveraging these features, we train regression models to predict sentiment scores on a 0-100 scale, achieving root mean square error (RMSE) values of 3.420, 5.482, 2.783, and 4.212, respectively. Improvements over a baseline model based on absolute difference metrics are observed. These results demonstrate the potential of machine learning to capture sentiment and similarity in audio, offering an adaptable framework for AI applications in media analysis.
- [40] arXiv:2411.00198 (cross-list from cs.LG) [pdf, html, other]
-
Title: Kernel Operator-Theoretic Bayesian Filter for Nonlinear Dynamical Systems
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
Motivated by the surge of interest in Koopman operator theory, we propose a machine-learning alternative based on a functional Bayesian perspective for operator-theoretic modeling of unknown, data-driven, nonlinear dynamical systems. This formulation is carried out directly in an infinite-dimensional space of linear operators or Hilbert space with universal approximation property. The theory of reproducing kernel Hilbert space (RKHS) allows the lifting of nonlinear dynamics to a potentially infinite-dimensional space via linear embeddings, where a general nonlinear function is represented as a set of linear functions or operators in the functional space. This allows us to apply classical linear Bayesian methods such as the Kalman filter directly in the Hilbert space, yielding nonlinear solutions in the original input space. This kernel perspective on the Koopman operator offers two compelling advantages. First, the Hilbert space can be constructed deterministically, agnostic to the nonlinear dynamics. The Gaussian kernel is universal, approximating uniformly an arbitrary continuous target function over any compact domain. Second, the Bayesian filter is an adaptive, linear minimum-variance algorithm, allowing the system to update the Koopman operator and continuously track the changes across an extended period of time, ideally suited for modern data-driven applications such as real-time machine learning using streaming data. In this paper, we present several practical implementations to obtain a finite-dimensional approximation of the functional Bayesian filter (FBF). Due to the rapid decay of the Gaussian kernel, excellent approximation is obtained with a small dimension. We demonstrate that this practical approach can obtain accurate results and outperform finite-dimensional Koopman decomposition.
- [41] arXiv:2411.00209 (cross-list from cs.CV) [pdf, html, other]
-
Title: Semantic Knowledge Distillation for Onboard Satellite Earth Observation Image Classification
Thanh-Dung Le, Vu Nguyen Ha, Ti Ti Nguyen, Geoffrey Eappen, Prabhu Thiruvasagam, Hong-fu Chou, Duc-Dung Tran, Luis M. Garces-Socarras, Jorge L. Gonzalez-Rios, Juan Carlos Merlano-Duncan, Symeon Chatzinotas
Comments: Under revisions
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
This study presents an innovative dynamic weighting knowledge distillation (KD) framework tailored for efficient Earth observation (EO) image classification (IC) in resource-constrained settings. Utilizing EfficientViT and MobileViT as teacher models, this framework enables lightweight student models, particularly ResNet8 and ResNet16, to surpass 90% in accuracy, precision, and recall, adhering to the stringent confidence thresholds necessary for reliable classification tasks. Unlike conventional KD methods that rely on static weight distribution, our adaptive weighting mechanism responds to each teacher model's confidence, allowing student models to prioritize more credible sources of knowledge dynamically. Remarkably, ResNet8 delivers substantial efficiency gains, achieving a 97.5% reduction in parameters, a 96.7% decrease in FLOPs, an 86.2% cut in power consumption, and a 63.5% increase in inference speed over MobileViT. This significant optimization of complexity and resource demands establishes ResNet8 as an optimal candidate for EO tasks, combining robust performance with feasibility in deployment. The confidence-based, adaptable KD approach underscores the potential of dynamic distillation strategies to yield high-performing, resource-efficient models tailored for satellite-based EO applications. The reproducible code is accessible on our GitHub repository.
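One simple way to let a student "prioritize more credible sources" is to weight each teacher's softened prediction by that teacher's confidence on the current sample. The sketch below uses the maximum class probability as the confidence score and normalizes the weights to sum to one; this rule, the temperature value, and all function names are assumptions made for illustration, since the abstract does not specify the paper's weighting formula.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    scaled = [z / temperature for z in logits]
    m = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def teacher_weights(teacher_probs: List[List[float]]) -> List[float]:
    """Assumed rule: weight is proportional to each teacher's max class probability."""
    confidences = [max(p) for p in teacher_probs]
    total = sum(confidences)
    return [c / total for c in confidences]


def blended_target(teacher_logits: List[Sequence[float]],
                   temperature: float = 2.0) -> Tuple[List[float], List[float]]:
    """Per-sample teacher weights and the confidence-weighted soft target."""
    probs = [softmax(l, temperature) for l in teacher_logits]
    weights = teacher_weights(probs)
    n_classes = len(probs[0])
    target = [sum(w * p[k] for w, p in zip(weights, probs)) for k in range(n_classes)]
    return weights, target


def distillation_loss(student_logits: Sequence[float], target: Sequence[float],
                      temperature: float = 2.0) -> float:
    """Cross-entropy between the blended target and the student's softened output."""
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(target, student))


if __name__ == "__main__":
    # Two hypothetical teachers: one confident, one undecided.
    weights, target = blended_target([[4.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
    print(weights, target, distillation_loss([2.0, 0.5, 0.1], target))
```

With a static scheme the two teachers would always contribute equally; here the confident teacher dominates the target for this sample, and the balance can shift on the next one.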
- [42] arXiv:2411.00274 (cross-list from cs.CV) [pdf, html, other]
-
Title: Adaptive Residual Transformation for Enhanced Feature-Based OOD Detection in SAR Imagery
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Recent advances in deep learning architectures have enabled efficient and accurate classification of pre-trained targets in Synthetic Aperture Radar (SAR) images. Nevertheless, the presence of unknown targets in real battlefield scenarios is unavoidable, resulting in misclassification and reducing the accuracy of the classifier. Over the past decades, various feature-based out-of-distribution (OOD) approaches have been developed to address this issue, yet defining the decision boundary between known and unknown targets remains challenging. Additionally, unlike optical images, detecting unknown targets in SAR imagery is further complicated by high speckle noise, the presence of clutter, and the inherent similarities in back-scattered microwave signals. In this work, we propose transforming feature-based OOD detection into a class-localized feature-residual-based approach, demonstrating that this method can improve stability across varying unknown targets' distribution conditions. Transforming feature-based OOD detection into a residual-based framework offers a more robust reference space for distinguishing between in-distribution (ID) and OOD data, particularly within the unique characteristics of SAR imagery. This adaptive residual transformation method standardizes feature-based inputs into distributional representations, enhancing OOD detection in noisy, low-information images. Our approach demonstrates promising performance in real-world SAR scenarios, effectively adapting to the high levels of noise and clutter inherent in these environments. These findings highlight the practical relevance of residual-based OOD detection for SAR applications and suggest a foundation for further advancements in unknown target detection in complex, operational settings.
- [43] arXiv:2411.00275 (cross-list from cs.SD) [pdf, other]
-
Title: Improving Musical Instrument Classification with Advanced Machine Learning Techniques
Comments: 43 pages, 35 figures, 14 tables
Subjects: Sound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Musical instrument classification, a key area in Music Information Retrieval, has gained considerable interest due to its applications in education, digital music production, and consumer media. Recent advances in machine learning, specifically deep learning, have enhanced the capability to identify and classify musical instruments from audio signals. This study applies various machine learning methods, including Naive Bayes, Support Vector Machines, Random Forests, Boosting techniques like AdaBoost and XGBoost, as well as deep learning models such as Convolutional Neural Networks and Artificial Neural Networks. The effectiveness of these methods is evaluated on the NSynth dataset, a large repository of annotated musical sounds. By comparing these approaches, the analysis aims to showcase the advantages and limitations of each method, providing guidance for developing more accurate and efficient classification systems. Additionally, hybrid model testing and discussion are included. This research aims to support further studies in instrument classification by proposing new approaches and future research directions.
- [44] arXiv:2411.00281 (cross-list from cs.CV) [pdf, html, other]
-
Title: Detection and tracking of gas plumes in LWIR hyperspectral video sequence data
Journal-ref: SPIE Defense, Security, and Sensing, 2013, Baltimore, Proceedings Volume 8743, Algorithms and Technologies for Multispectral, Hyperspectral, and Ultraspectral Imagery XIX; 87430J (2013)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Automated detection of chemical plumes presents a segmentation challenge. The segmentation problem for gas plumes is difficult due to the diffusive nature of the cloud. The advantage of considering hyperspectral images in the gas plume detection problem over the conventional RGB imagery is the presence of non-visual data, allowing for a richer representation of information. In this paper we present an effective method of visualizing hyperspectral video sequences containing chemical plumes and investigate the effectiveness of segmentation techniques on these post-processed videos. Our approach uses a combination of dimension reduction and histogram equalization to prepare the hyperspectral videos for segmentation. First, Principal Components Analysis (PCA) is used to reduce the dimension of the entire video sequence. This is done by projecting each pixel onto the first few Principal Components resulting in a type of spectral filter. Next, a Midway method for histogram equalization is used. These methods redistribute the intensity values in order to reduce flicker between frames. This properly prepares these high-dimensional video sequences for more traditional segmentation techniques. We compare the ability of various clustering techniques to properly segment the chemical plume. These include K-means, spectral clustering, and the Ginzburg-Landau functional.
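The PCA step described in this abstract can be illustrated with a minimal sketch. It assumes the video is stored as a NumPy array of shape (frames, height, width, bands); the function name `pca_project`, the array layout, and the synthetic input are illustrative assumptions, not the authors' code, and the Midway histogram equalization step is not shown.

```python
import numpy as np


def pca_project(video, n_components=3):
    """Project every pixel spectrum onto the leading principal components.

    video: array of shape (frames, height, width, bands) (assumed layout).
    Returns an array of shape (frames, height, width, n_components).
    """
    frames, height, width, bands = video.shape
    # Treat each pixel of each frame as one sample, so a single basis
    # is shared by the entire video sequence.
    pixels = video.reshape(-1, bands).astype(np.float64)
    centered = pixels - pixels.mean(axis=0)
    # Covariance between spectral bands (bands x bands).
    covariance = centered.T @ centered / (centered.shape[0] - 1)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    projected = centered @ components
    return projected.reshape(frames, height, width, n_components)


rng = np.random.default_rng(0)
# Synthetic stand-in for a hyperspectral sequence, not real LWIR data.
synthetic_video = rng.normal(size=(4, 8, 8, 16))
reduced = pca_project(synthetic_video, n_components=3)
print(reduced.shape)
```

Computing one basis over all frames, rather than one per frame, keeps the projected channels comparable from frame to frame, which matters if the goal is to limit flicker before segmentation.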
- [45] arXiv:2411.00321 (cross-list from cs.SD) [pdf, html, other]
-
Title: MACE: Leveraging Audio for Evaluating Audio Captioning Systems
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
The Automated Audio Captioning (AAC) task aims to describe an audio signal using natural language. To evaluate machine-generated captions, the metrics should take into account audio events, acoustic scenes, paralinguistics, signal characteristics, and other audio information. Traditional AAC evaluation relies on natural language generation metrics like ROUGE and BLEU, image captioning metrics such as SPICE and CIDEr, or Sentence-BERT embedding similarity. However, these metrics only compare generated captions to human references, overlooking the audio signal itself. In this work, we propose MACE (Multimodal Audio-Caption Evaluation), a novel metric that integrates both audio and reference captions for comprehensive audio caption evaluation. MACE draws on information from the audio itself as well as from the predicted and reference captions, and weights it with a fluency penalty. Our experiments demonstrate MACE's superior performance in predicting human quality judgments compared to traditional metrics. Specifically, MACE achieves a 3.28% and 4.36% relative accuracy improvement over the FENSE metric on the AudioCaps-Eval and Clotho-Eval datasets respectively. Moreover, it significantly outperforms all the previous metrics on the audio captioning evaluation task. The metric is open-sourced at this https URL
- [46] arXiv:2411.00335 (cross-list from cs.CV) [pdf, html, other]
-
Title: NCST: Neural-based Color Style Transfer for Video Retouching
Comments: 10 pages, 8 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Image and Video Processing (eess.IV)
Video color style transfer aims to transform the color style of an original video by using a reference style image. Most existing methods employ neural networks, which come with challenges like opaque transfer processes and limited user control over the outcomes. Typically, users cannot fine-tune the resulting images or videos. To tackle this issue, we introduce a method that predicts specific parameters for color style transfer using two images. Initially, we train a neural network to learn the corresponding color adjustment parameters. When applying style transfer to a video, we fine-tune the network with key frames from the video and the chosen style image, generating precise transformation parameters. These are then applied to convert the color style of both images and videos. Our experimental results demonstrate that our algorithm surpasses current methods in color style transfer quality. Moreover, each parameter in our method has a specific, interpretable meaning, enabling users to understand the color style transfer process and allowing them to perform manual fine-tuning if desired.
- [47] arXiv:2411.00357 (cross-list from cs.RO) [pdf, other]
-
Title: An Improved Rapidly Exploring Random Tree Algorithm for Path Planning in Configuration Spaces with Narrow Channels
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Rapidly-exploring Random Tree (RRT) algorithms have been applied successfully to challenging robot motion planning and under-actuated nonlinear control problems. However, a fundamental limitation of the RRT approach is its slow convergence in configuration spaces with narrow channels, because the probability of generating test points inside narrow channels is small. This paper presents an improved RRT algorithm that takes advantage of narrow channels between the initial and goal states to find shorter paths by improving the exploration of narrow regions in the configuration space. The proposed algorithm detects the presence of a narrow channel by checking for collision of neighborhood points with the infeasible set and attempts to add points within narrow channels with a predetermined bias. This approach is compared with the classical RRT and its variants on a variety of benchmark planning problems. Simulation results indicate that the algorithm presented in this paper computes a significantly shorter path in spaces with narrow channels.
- [48] arXiv:2411.00359 (cross-list from cs.LG) [pdf, html, other]
-
Title: Constrained Diffusion Implicit Models
Subjects: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
This paper describes an efficient algorithm for solving noisy linear inverse problems using pretrained diffusion models. Extending the paradigm of denoising diffusion implicit models (DDIM), we propose constrained diffusion implicit models (CDIM) that modify the diffusion updates to enforce a constraint upon the final output. For noiseless inverse problems, CDIM exactly satisfies the constraints; in the noisy case, we generalize CDIM to satisfy an exact constraint on the residual distribution of the noise. Experiments across a variety of tasks and metrics show strong performance of CDIM, with analogous inference acceleration to unconstrained DDIM: 10 to 50 times faster than previous conditional diffusion methods. We demonstrate the versatility of our approach on many problems including super-resolution, denoising, inpainting, deblurring, and 3D point cloud reconstruction.
- [49] arXiv:2411.00373 (cross-list from cs.IT) [pdf, html, other]
-
Title: Discrete RIS Enhanced Space Shift Keying MIMO System via Reflecting Beamforming Optimization
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
In this paper, a discrete reconfigurable intelligent surface (RIS)-assisted spatial shift keying (SSK) multiple-input multiple-output (MIMO) scheme is investigated, in which a direct link between the transmitter and the receiver is considered. To improve the reliability of the RIS-SSK-MIMO scheme, we formulate an objective function based on minimizing the average bit error probability (ABEP). Since the reflecting phase shift of RIS is discrete, it is difficult to address this problem directly. To this end, we optimize the RIS phase shift to maximize the Euclidean distance between the minimum constellations by applying the successive convex approximation (SCA) and penalty-alternating optimization method. Simulation results verify the superiority of the proposed RIS-SSK-MIMO scheme and demonstrate the impact of the number of RIS elements, the number of phase quantization bits, and the number of receive and transmit antennas in terms of reliability.
- [50] arXiv:2411.00374 (cross-list from cs.IT) [pdf, html, other]
-
Title: Power-Measurement-Based Channel Autocorrelation Estimation for IRS-Assisted Wideband Communications
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
The passive and frequency-flat reflection of an intelligent reflecting surface (IRS), as well as the high-dimensional IRS-reflected channels, have posed significant challenges for efficient IRS channel estimation, especially in wideband communication systems with significant multi-path channel delay spread. To address these challenges, we propose a novel neural network (NN)-empowered framework for IRS channel autocorrelation matrix estimation in wideband orthogonal frequency division multiplexing (OFDM) systems. This framework relies only on the easily accessible reference signal received power (RSRP) measurements at users in existing wideband communication systems, without requiring additional pilot transmission. Based on the estimates of the channel autocorrelation matrix, the passive reflection of the IRS is optimized to maximize the average user received signal-to-noise ratio (SNR) over all subcarriers in the OFDM system. Numerical results verify that the proposed algorithm significantly outperforms existing power-measurement-based IRS reflection designs in wideband channels.
- [51] arXiv:2411.00397 (cross-list from cs.NI) [pdf, html, other]
-
Title: Distributed Computation Offloading for Energy Provision Minimization in WP-MEC Networks with Multiple HAPs
Comments: submitted to the IEEE Trans
Subjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
This paper investigates a wireless powered mobile edge computing (WP-MEC) network with multiple hybrid access points (HAPs) in a dynamic environment, where wireless devices (WDs) harvest energy from radio frequency (RF) signals of HAPs, and then compute their computation data locally (i.e., local computing mode) or offload it to the chosen HAPs (i.e., edge computing mode). In order to pursue a green computing design, we formulate an optimization problem that minimizes the long-term energy provision of the WP-MEC network subject to the energy, computing delay and computation data demand constraints. The transmit power of HAPs, the duration of the wireless power transfer (WPT) phase, the offloading decisions of WDs, the time allocation for offloading and the CPU frequency for local computing are jointly optimized adapting to the time-varying generated computation data and wireless channels of WDs. To efficiently address the formulated non-convex mixed integer programming (MIP) problem in a distributed manner, we propose a Two-stage Multi-Agent deep reinforcement learning-based Distributed computation Offloading (TMADO) framework, which consists of a high-level agent and multiple low-level agents. The high-level agent residing in all HAPs optimizes the transmit power of HAPs and the duration of the WPT phase, while each low-level agent residing in each WD optimizes its offloading decision, time allocation for offloading and CPU frequency for local computing. Simulation results show the superiority of the proposed TMADO framework in terms of the energy provision minimization.
- [52] arXiv:2411.00413 (cross-list from cs.RO) [pdf, html, other]
-
Title: Multi-Uncertainty Aware Autonomous Cooperative Planning
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Autonomous cooperative planning (ACP) is a promising technique to improve the efficiency and safety of multi-vehicle interactions for future intelligent transportation systems. However, realizing robust ACP is a challenge due to the aggregation of perception, motion, and communication uncertainties. This paper proposes a novel multi-uncertainty aware ACP (MUACP) framework that simultaneously accounts for multiple types of uncertainties via regularized cooperative model predictive control (RC-MPC). The regularizers and constraints for perception, motion, and communication are constructed according to the confidence levels, weather conditions, and outage probabilities, respectively. The effectiveness of the proposed method is evaluated in the Car Learning to Act (CARLA) simulation platform. Results demonstrate that the proposed MUACP efficiently performs cooperative formation in real time and outperforms other benchmark approaches in various scenarios under imperfect knowledge of the environment.
- [53] arXiv:2411.00426 (cross-list from cs.LG) [pdf, other]
-
Title: A KAN-based Interpretable Framework for Process-Informed Prediction of Global Warming Potential
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Accurate prediction of Global Warming Potential (GWP) is essential for assessing the environmental impact of chemical processes and materials. Traditional GWP prediction models rely predominantly on molecular structure, overlooking critical process-related information. In this study, we present an integrative GWP prediction model that combines molecular descriptors (MACCS keys and Mordred descriptors) with process information (process title, description, and location) to improve predictive accuracy and interpretability. Using a deep neural network (DNN) model, we achieved an R-squared of 86% on test data with Mordred descriptors, process location, and description information, a 25-percentage-point improvement over the previous benchmark of 61%; XAI analysis further highlighted the significant role of process title embeddings in enhancing model predictions. To enhance interpretability, we employed a Kolmogorov-Arnold Network (KAN) to derive a symbolic formula for GWP prediction that captures key molecular and process features. This provides a transparent, interpretable alternative to black-box models and enables users to gain insights into the molecular and process factors influencing GWP. Error analysis showed that the model performs reliably in densely populated data ranges, with increased uncertainty for higher GWP values. This analysis allows users to manage prediction uncertainty effectively, supporting data-driven decision-making in chemical and process design. Our results suggest that integrating both molecular and process-level information in GWP prediction models yields substantial gains in accuracy and interpretability, offering a valuable tool for sustainability assessments. Future work may extend this approach to additional environmental impact categories and refine the model to further enhance its predictive reliability.
- [54] arXiv:2411.00461 (cross-list from cs.LG) [pdf, html, other]
-
Title: A Multi-Granularity Supervised Contrastive Framework for Remaining Useful Life Prediction of Aero-engines
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Accurate remaining useful life (RUL) predictions are critical to the safe operation of aero-engines. Currently, RUL prediction is mainly treated as a regression task with mean square error as the only loss function, and little research addresses the structure of the feature space, even though structuring the feature space has shown excellent performance in a large number of studies. This paper develops a multi-granularity supervised contrastive (MGSC) framework from the plain intuition that samples with the same RUL label should be aligned in the feature space, and addresses the problems of excessively large minibatch size and unbalanced samples in the implementation. RUL prediction with MGSC is implemented using the proposed multi-phase training strategy. This paper also demonstrates a simple and scalable basic network structure and validates the proposed MGSC strategy on the C-MAPSS dataset using a convolutional long short-term memory network as a baseline, which effectively improves the accuracy of RUL prediction.
- [55] arXiv:2411.00464 (cross-list from cs.SD) [pdf, html, other]
-
Title: MDCTCodec: A Lightweight MDCT-based Neural Audio Codec towards High Sampling Rate and Low Bitrate Scenarios
Comments: Accepted by 2024 IEEE Spoken Language Technology Workshop (SLT2024)
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
In this paper, we propose MDCTCodec, an efficient lightweight end-to-end neural audio codec based on the modified discrete cosine transform (MDCT). The encoder takes the MDCT spectrum of audio as input, encoding it into a continuous latent code which is then discretized by a residual vector quantizer (RVQ). Subsequently, the decoder decodes the MDCT spectrum from the quantized latent code and reconstructs audio via inverse MDCT. During the training phase, a novel multi-resolution MDCT-based discriminator (MR-MDCTD) is adopted to discriminate the natural or decoded MDCT spectrum for adversarial training. Experimental results confirm that, in scenarios with high sampling rates and low bitrates, the MDCTCodec exhibited high decoded audio quality, improved training and generation efficiency, and compact model size compared to baseline codecs. Specifically, the MDCTCodec achieved a ViSQOL score of 4.18 at a sampling rate of 48 kHz and a bitrate of 6 kbps on the public VCTK corpus.
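The residual vector quantizer (RVQ) mentioned in this abstract can be illustrated with a generic sketch. The greedy nearest-codeword encoder below is the textbook form of RVQ and is not taken from the MDCTCodec implementation; the codebooks are random stand-ins (a trained codec would learn them), and the names `rvq_encode` and `rvq_decode` are hypothetical.

```python
import numpy as np


def rvq_encode(vector, codebooks):
    """Quantize stage by stage: each stage encodes the residual left by the previous one."""
    residual = np.asarray(vector, dtype=np.float64).copy()
    indices = []
    for codebook in codebooks:
        # Pick the codeword closest to the current residual.
        distances = np.linalg.norm(codebook - residual, axis=1)
        best = int(np.argmin(distances))
        indices.append(best)
        residual = residual - codebook[best]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct the vector as the sum of the selected codewords."""
    return sum(codebook[i] for i, codebook in zip(indices, codebooks))


rng = np.random.default_rng(0)
# Assumption: 4 stages, 64 codewords each, 8-dimensional latents,
# with random codebooks whose scale shrinks at each stage.
codebooks = [rng.normal(scale=1.0 / (2 ** stage), size=(64, 8)) for stage in range(4)]
latent = rng.normal(size=8)
codes = rvq_encode(latent, codebooks)
reconstruction = rvq_decode(codes, codebooks)
print(codes, np.linalg.norm(latent - reconstruction))
```

Only the integer indices in `codes` need to be transmitted, so the bitrate is set by the number of stages and the codebook size rather than by the latent dimension.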
- [56] arXiv:2411.00469 (cross-list from cs.SD) [pdf, html, other]
-
Title: MIRFLEX: Music Information Retrieval Feature Library for Extraction
Comments: 2 pages, 4 tables, submitted to Extended Abstracts for the Late-Breaking Demo Session of the 25th Int. Society for Music Information Retrieval Conf., San Francisco, United States, 2024
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)
This paper introduces an extendable modular system that compiles a range of music feature extraction models to aid music information retrieval research. The features include musical elements like key, downbeats, and genre, as well as audio characteristics like instrument recognition, vocals/instrumental classification, and vocals gender detection. The integrated models are either state-of-the-art or the latest open-source models available. The features can be extracted as latent or post-processed labels, enabling integration into music applications such as generative music, recommendation, and playlist generation. The modular design allows easy integration of newly developed systems, making it a good benchmarking and comparison tool. This versatile toolkit supports the research community in developing innovative solutions by providing concrete musical features.
- [57] arXiv:2411.00477 (cross-list from cs.SD) [pdf, other]
-
Title: Multi Modal Information Fusion of Acoustic and Linguistic Data for Decoding Dairy Cow Vocalizations in Animal Welfare Assessment
Comments: 31 pages, 22 figures, 2 tables
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Quantitative Methods (q-bio.QM)
Understanding animal vocalizations through multi-source data fusion is crucial for assessing emotional states and enhancing animal welfare in precision livestock farming. This study aims to decode dairy cow contact calls by employing multi-modal data fusion techniques, integrating transcription, semantic analysis, contextual and emotional assessment, and acoustic feature extraction. We utilized a Natural Language Processing model to transcribe audio recordings of cow vocalizations into written form. By fusing multiple acoustic features (frequency, duration, and intensity) with transcribed textual data, we developed a comprehensive representation of cow vocalizations. Utilizing data fusion within a custom-developed ontology, we categorized vocalizations into high-frequency calls associated with distress or arousal, and low-frequency calls linked to contentment or calmness. Analyzing the fused multi-dimensional data, we identified anxiety-related features indicative of emotional distress, including specific frequency measurements and sound spectrum results. Assessing the sentiment and acoustic features of vocalizations from 20 individual cows allowed us to determine differences in calling patterns and emotional states. Employing advanced machine learning algorithms (Random Forest, Support Vector Machine, and Recurrent Neural Networks), we effectively processed and fused multi-source data to classify cow vocalizations. These models were optimized to handle computational demands and data quality challenges inherent in practical farm environments. Our findings demonstrate the effectiveness of multi-source data fusion and intelligent processing techniques in animal welfare monitoring. This study represents a significant advancement in animal welfare assessment, highlighting the role of innovative fusion technologies in understanding and improving the emotional wellbeing of dairy cows.
- [58] arXiv:2411.00499 (cross-list from cs.CV) [pdf, html, other]
-
Title: Cross-modal semantic segmentation for indoor environmental perception using single-chip millimeter-wave radar raw data
Comments: 5291 words, 17 pages, 11 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Signal Processing (eess.SP)
In the context of firefighting and rescue operations, a cross-modal semantic segmentation model based on a single-chip millimeter-wave (mmWave) radar for indoor environmental perception is proposed and discussed. To efficiently obtain high-quality labels, an automatic label generation method utilizing LiDAR point clouds and occupancy grid maps is introduced. The proposed segmentation model is based on U-Net. A spatial attention module is incorporated, which enhances the performance of the model. The results demonstrate that cross-modal semantic segmentation provides a more intuitive and accurate representation of indoor environments. Unlike traditional methods, the model's segmentation performance is minimally affected by azimuth. Although performance declines with increasing distance, this can be mitigated by a well-designed model. Additionally, it was found that using raw ADC data as input is ineffective; compared to RA tensors, RD tensors are more suitable for the proposed model.
- [59] arXiv:2411.00543 (cross-list from cs.CV) [pdf, html, other]
-
Title: 3D Equivariant Pose Regression via Direct Wigner-D Harmonics Prediction
Comments: Accepted to NeurIPS 2024, Project webpage at this http URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
Determining the 3D orientations of an object in an image, known as single-image pose estimation, is a crucial task in 3D vision applications. Existing methods typically learn 3D rotations parametrized in the spatial domain using Euler angles or quaternions, but these representations often introduce discontinuities and singularities. SO(3)-equivariant networks enable the structured capture of pose patterns with data-efficient learning, but the parametrizations in spatial domain are incompatible with their architecture, particularly spherical CNNs, which operate in the frequency domain to enhance computational efficiency. To overcome these issues, we propose a frequency-domain approach that directly predicts Wigner-D coefficients for 3D rotation regression, aligning with the operations of spherical CNNs. Our SO(3)-equivariant pose harmonics predictor overcomes the limitations of spatial parameterizations, ensuring consistent pose estimation under arbitrary rotations. Trained with a frequency-domain regression loss, our method achieves state-of-the-art results on benchmarks such as ModelNet10-SO(3) and PASCAL3D+, with significant improvements in accuracy, robustness, and data efficiency.
- [60] arXiv:2411.00555 (cross-list from math.OC) [pdf, html, other]
-
Title: An exact column generation algorithm for load balancing in capacity sharing networks
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
Capacity sharing networks are typical heterogeneous communication networks widely applied in the information and communications technology (ICT) field. In such networks, resources like bandwidth, spectrum, computation and storage are shared among various communication services. Meanwhile, network congestion is always a prominent challenge. Handling network congestion essentially requires solving the load balancing problem of the network. In this paper, for capacity sharing networks, we formulate their load balancing problem as a maximum multi-commodity flow problem. For such a problem, which is always a large-scale linear program, the column generation algorithm is a commonly used and crucial solution method. In each iteration, this algorithm involves solving a linear programming subproblem and determining whether to terminate or generate a new column for inclusion in the subproblem. This iterative procedure of solving and checking continues throughout the algorithm. Nevertheless, since the checking subproblem is NP-hard, its solution significantly impacts the overall efficiency of the algorithm. In this paper, we innovatively convert the checking subproblem into a single-constrained shortest path (SCSP) subproblem. By exactly solving the SCSP subproblem, we can obtain the optimal solution to the checking subproblem with the same or less computing time. Experimental results demonstrate that our algorithm achieves computational efficiency comparable to heuristic algorithms while outperforming other state-of-the-art algorithms by at least an order of magnitude.
- [61] arXiv:2411.00560 (cross-list from cs.CV) [pdf, html, other]
-
Title: Topology and Intersection-Union Constrained Loss Function for Multi-Region Anatomical Segmentation in Ocular Images
Comments: 5 pages, 4 figures, International Symposium on Biomedical Imaging 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Ocular Myasthenia Gravis (OMG) is a rare and challenging disease to detect in its early stages, but symptoms often first appear in the eye muscles, such as drooping eyelids and double vision. Ocular images can be used for early diagnosis by segmenting different regions, such as the sclera, iris, and pupil, which allows for the calculation of area ratios to support accurate medical assessments. However, no publicly available dataset and tools currently exist for this purpose. To address this, we propose a new topology and intersection-union constrained loss function (TIU loss) that improves performance using small training datasets. We conducted experiments on a public dataset consisting of 55 subjects and 2,197 images. Our proposed method outperformed two widely used loss functions across three deep learning networks, achieving a mean Dice score of 83.12% [82.47%, 83.81%] with a 95% bootstrap confidence interval. In a low-percentage training scenario (10% of the training data), our approach showed an 8.32% improvement in Dice score compared to the baseline. Additionally, we evaluated the method in a clinical setting with 47 subjects and 501 images, achieving a Dice score of 64.44% [63.22%, 65.62%]. We did observe some bias when applying the model in clinical settings. These results demonstrate that the proposed method is accurate, and our code along with the trained model is publicly available.
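The Dice score used to report these results can be computed as in the sketch below. This is the standard evaluation metric only, not the proposed TIU loss; the class ids 1, 2, 3 for sclera, iris, and pupil, the tiny label maps, and the empty-mask convention are assumptions made purely for illustration.

```python
import numpy as np


def dice_score(prediction, target):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(prediction, dtype=bool)
    targ = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, targ).sum()
    total = pred.sum() + targ.sum()
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


def mean_dice(pred_labels, true_labels, class_ids):
    """Average the per-class Dice scores over the given class ids."""
    scores = [dice_score(pred_labels == c, true_labels == c) for c in class_ids]
    return float(np.mean(scores))


# Hypothetical 3x3 label maps: 0 = background, 1 = sclera, 2 = iris, 3 = pupil.
true_labels = np.array([[0, 1, 1],
                        [0, 2, 2],
                        [0, 0, 3]])
pred_labels = np.array([[0, 1, 1],
                        [0, 2, 0],
                        [0, 0, 3]])
print(mean_dice(pred_labels, true_labels, class_ids=[1, 2, 3]))
```

In this toy case the sclera and pupil are matched exactly and half of the iris is missed, so the per-class scores are 1, 2/3, and 1.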
- [62] arXiv:2411.00570 (cross-list from cs.MA) [pdf, other]
-
Title: Incentive-based Platoon Formation: Optimizing the Personal Benefit for Drivers
Subjects: Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Platooning or cooperative adaptive cruise control (CACC) has been investigated for decades, but debate about its lasting impact is still ongoing. Even though platooning benefits and platoon formation are rather well understood for trucks, this is less clear for passenger cars, which have a higher heterogeneity in trips and drivers' preferences. Most importantly, it remains unclear how to form platoons of passenger cars in order to optimize the personal benefit for the individual driver. To this end, in this paper, we propose a novel platoon formation algorithm that optimizes the personal benefit for drivers of individual passenger cars. For computing vehicle-to-platoon assignments, the algorithm utilizes a new metric that we propose to evaluate the personal benefits of various driving systems, including platooning. By combining fuel and travel time costs into a single monetary value, drivers can estimate overall trip costs according to a personal monetary value for time spent. This provides an intuitive way for drivers to understand and compare the benefits of driving systems like human driving, adaptive cruise control (ACC), and, of course, platooning. Unlike previous similarity-based methods, our proposed algorithm forms platoons only when beneficial for the driver, rather than for the sake of platooning only. Results of a large-scale simulation study demonstrate that our proposed algorithm outperforms normal ACC as well as previous similarity-based platooning approaches by balancing fuel savings and travel time, independent of traffic and drivers' time cost.
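The idea of folding fuel and travel time into one monetary value can be sketched as follows. The abstract does not give the exact form of the metric, so the linear cost, the names `TripEstimate`, `trip_cost`, and `best_option`, and every number below are hypothetical assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class TripEstimate:
    fuel_liters: float        # estimated fuel consumed on the trip
    travel_time_hours: float  # estimated trip duration


def trip_cost(estimate, fuel_price_per_liter, value_of_time_per_hour):
    """Single monetary value: fuel cost plus the driver's personal cost of time."""
    fuel_cost = estimate.fuel_liters * fuel_price_per_liter
    time_cost = estimate.travel_time_hours * value_of_time_per_hour
    return fuel_cost + time_cost


def best_option(options, fuel_price_per_liter, value_of_time_per_hour):
    """Return the name of the driving system with the lowest trip cost."""
    return min(
        options,
        key=lambda name: trip_cost(
            options[name], fuel_price_per_liter, value_of_time_per_hour
        ),
    )


# Made-up estimates: platooning saves fuel here but takes longer.
options = {
    "human": TripEstimate(fuel_liters=6.0, travel_time_hours=1.00),
    "acc": TripEstimate(fuel_liters=5.8, travel_time_hours=1.05),
    "platooning": TripEstimate(fuel_liters=5.0, travel_time_hours=1.20),
}
print(best_option(options, fuel_price_per_liter=2.0, value_of_time_per_hour=5.0))
print(best_option(options, fuel_price_per_liter=2.0, value_of_time_per_hour=30.0))
```

With these made-up numbers, a driver who values time at 5 per hour is best served by platooning, while one who values it at 30 per hour is better off driving manually, which mirrors the point that platoons should form only when they benefit the individual driver.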
- [63] arXiv:2411.00578 (cross-list from cs.CV) [pdf, html, other]
-
Title: Federated Voxel Scene Graph for Intracranial Hemorrhage
Antoine P. Sanner, Jonathan Stieber, Nils F. Grauhan, Suam Kim, Marc A. Brockmann, Ahmed E. Othman, Anirban Mukhopadhyay
Subjects: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Image and Video Processing (eess.IV)
Intracranial Hemorrhage is a potentially lethal condition whose manifestation is vastly diverse and shifts across clinical centers worldwide. Deep-learning-based solutions are starting to model complex relations between brain structures, but still struggle to generalize. While gathering more diverse data is the most natural approach, privacy regulations often limit the sharing of medical data. We propose the first application of Federated Scene Graph Generation. We show that our models can leverage the increased training data diversity. For Scene Graph Generation, they can recall up to 20% more clinically relevant relations across datasets compared to models trained on a single centralized dataset. Learning structured data representation in a federated setting can open the way to the development of new methods that can leverage this finer information to regularize across clients more effectively.
- [64] arXiv:2411.00697 (cross-list from physics.optics) [pdf, other]
-
Title: All-Optical Excitable Spiking Laser Neuron in InP Generic Integration Technology
Comments: 21 pages, 13 figures
Subjects: Optics (physics.optics); Signal Processing (eess.SP); Applied Physics (physics.app-ph)
Brain-inspired, neuromorphic devices implemented in integrated photonic hardware have attracted significant interest recently as part of efforts towards novel non-von Neumann computing paradigms that make use of the low loss, high-speed and parallel operations in optics. An all-optical spiking laser neuron fabricated on the indium-phosphide generic integration technology platform may be a practical alternative to other semi-integrated photonic and electronic-based spiking neuron implementations. Owing to the large number of predefined building blocks, a plethora of applications have benefitted already from the generic integration process. This technology platform has now been utilised for the first time to demonstrate an all-optical spiking laser neuron. This paper presents and discusses the design and measurement of the ultra-fast and rich spiking dynamics in these devices. We show that under external pulse injection and operated slightly below the lasing threshold, the laser neuron exhibits an excitable mode, in addition to a self-spiking mode far above the threshold when no pulse is injected. In the excitable mode, the required injected pulse energy is much lower than that of the generated excited response, meeting an important requirement for neuron cascadability. In addition, we investigate excitability at different injection wavelengths below the lasing wavelength, as well as the ultra-fast temporal properties of the spiking response. All of the discussed characteristics point to the laser neuron being an important candidate for scaling up to future fully-connected, multi-wavelength all-optical photonic spiking neural networks in indium-phosphide generic integration technology.
- [65] arXiv:2411.00705 (cross-list from cs.CV) [pdf, html, other]
-
Title: ReMatching Dynamic Reconstruction Flow
Comments: Our project website is at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Reconstructing dynamic scenes from image inputs is a fundamental computer vision task with many downstream applications. Despite recent advancements, existing approaches still struggle to achieve high-quality reconstructions from unseen viewpoints and timestamps. This work introduces the ReMatching framework, designed to improve generalization quality by incorporating deformation priors into dynamic reconstruction models. Our approach advocates for velocity-field-based priors, for which we suggest a matching procedure that can seamlessly supplement existing dynamic reconstruction pipelines. The framework is highly adaptable and can be applied to various dynamic representations. Moreover, it supports integrating multiple types of model priors and enables combining simpler ones to create more complex classes. Our evaluations on popular benchmarks involving both synthetic and real-world dynamic scenes demonstrate a clear improvement in reconstruction accuracy of current state-of-the-art models.
- [66] arXiv:2411.00731 (cross-list from q-bio.QM) [pdf, html, other]
-
Title: Nightbeat: Heart Rate Estimation From a Wrist-Worn Accelerometer During Sleep
Comments: 8 pages, 5 figures
Subjects: Quantitative Methods (q-bio.QM); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
Today's fitness bands and smartwatches typically track heart rates (HR) using optical sensors. Large behavioral studies such as the UK Biobank use activity trackers without such optical sensors and thus lack HR data, which could reveal valuable health trends for the wider population. In this paper, we present the first dataset of wrist-worn accelerometer recordings and electrocardiogram references in uncontrolled at-home settings to investigate the recent promise of IMU-only HR estimation via ballistocardiograms. Our recordings are from 42 patients during the night, totaling 310 hours. We also introduce a frequency-based method to extract HR via curve tracing from IMU recordings while rejecting motion artifacts. Using our dataset, we analyze existing baselines and show that our method achieves a mean absolute error of 0.88 bpm -- 76% better than previous approaches. Our results validate the potential of IMU-only HR estimation as a key indicator of cardiac activity in existing longitudinal studies to discover novel health insights. Our dataset, Nightbeat-DB, and our source code are available on GitHub: this https URL.
- [67] arXiv:2411.00774 (cross-list from cs.SD) [pdf, html, other]
-
Title: Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
Comments: Project Page: this https URL
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
The rapid development of large language models has brought many new smart applications; in particular, the excellent multimodal human-computer interaction in GPT-4o has brought an impressive experience to users. Against this background, researchers have recently proposed many multimodal LLMs that can achieve speech-to-speech dialogue. In this paper, we propose a speech-text multimodal LLM architecture called Freeze-Omni. Our main contribution is that the speech input and output modalities can be connected to the LLM while keeping the LLM frozen throughout the training process. We designed 3-stage training strategies for modeling both speech input and output, enabling Freeze-Omni to obtain speech-to-speech dialogue ability using text-speech paired data (such as ASR and TTS data) and only 60,000 multi-round text Q&A data on 8 GPUs. Moreover, we can effectively ensure that the intelligence of Freeze-Omni in the speech modality is at the same level as that in the text modality of its backbone LLM, while the end-to-end latency of the spoken response remains low. In addition, we designed a method to achieve duplex dialogue ability through multi-task training, giving Freeze-Omni a more natural dialogue style with users. Freeze-Omni mainly offers researchers a way to build multimodal LLMs on top of a frozen LLM, avoiding the catastrophic forgetting that can result from limited data and training resources.
Cross submissions (showing 34 of 34 entries)
- [68] arXiv:2307.15615 (replaced) [pdf, html, other]
-
Title: A survey on deep learning in medical image registration: new technologies, uncertainty, evaluation metrics, and beyond
Junyu Chen, Yihao Liu, Shuwen Wei, Zhangxing Bian, Shalini Subramanian, Aaron Carass, Jerry L. Prince, Yong Du
Comments: Accepted to Medical Image Analysis ((c) MedIA). A list of open-sourced code from the papers reviewed has been organized and is available at this https URL
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Deep learning technologies have dramatically reshaped the field of medical image registration over the past decade. The initial developments, such as regression-based and U-Net-based networks, established the foundation for deep learning in image registration. Subsequent progress has been made in various aspects of deep learning-based registration, including similarity measures, deformation regularizations, network architectures, and uncertainty estimation. These advancements have not only enriched the field of image registration but have also facilitated its application in a wide range of tasks, including atlas construction, multi-atlas segmentation, motion estimation, and 2D-3D registration. In this paper, we present a comprehensive overview of the most recent advancements in deep learning-based image registration. We begin with a concise introduction to the core concepts of deep learning-based image registration. Then, we delve into innovative network architectures, loss functions specific to registration, and methods for estimating registration uncertainty. Additionally, this paper explores appropriate evaluation metrics for assessing the performance of deep learning models in registration tasks. Finally, we highlight the practical applications of these novel techniques in medical imaging and discuss the future prospects of deep learning-based image registration.
- [69] arXiv:2308.05591 (replaced) [pdf, html, other]
-
Title: Optimizing Cache Content Placement in Integrated Terrestrial and Non-terrestrial Networks
Comments: This work is expanded on our paper presented at IEEE Globecom 2023: F. Wang, G. Geraci and T. Q. S. Quek, "Optimizing Cache Content Placement in Integrated Terrestrial and Non-terrestrial Networks," GLOBECOM 2023 - 2023 IEEE Global Communications Conference, Kuala Lumpur, Malaysia, 2023, pp. 6609-6614
Subjects: Systems and Control (eess.SY); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Non-terrestrial networks (NTN) have emerged as a transformative solution to bridge the digital divide and deliver essential services to remote and underserved areas. In this context, low Earth orbit (LEO) satellite constellations offer remarkable potential for efficient cache content broadcast in remote regions, thereby extending the reach of digital services. Despite wide coverage, the varying NTN transmission capabilities must be carefully aligned with each content placement to maximize broadcast efficiency. In this paper, we introduce a novel approach to optimize wireless edge content placement using NTN, positioning NTN as a complement to TN for achieving optimal content broadcasting. Specifically, we dynamically select content for placement via NTN links. This selection is based on popularity and suitability for delivery through NTN, while considering the orbital motion of LEO satellites. Our system-level case studies, based on a practical LEO constellation, demonstrate the significant improvement in placement speed compared to existing methods, which neglect network mobility. We also demonstrate that NTN links significantly outperform standalone wireless TN solutions, particularly in the early stages of content delivery. This advantage is amplified when there is a higher correlation of content popularity across geographical regions.
- [70] arXiv:2402.13629 (replaced) [pdf, other]
-
Title: Adversarial Purification and Fine-tuning for Robust UDC Image Restoration
Comments: Failure to meet expectations
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
This study delves into the enhancement of Under-Display Camera (UDC) image restoration models, focusing on their robustness against adversarial attacks. Despite its innovative approach to seamless display integration, UDC technology faces unique image degradation challenges exacerbated by the susceptibility to adversarial perturbations. Our research initially conducts an in-depth robustness evaluation of deep-learning-based UDC image restoration models by employing several white-box and black-box attacking methods. This evaluation is pivotal in understanding the vulnerabilities of current UDC image restoration techniques. Following the assessment, we introduce a defense framework integrating adversarial purification with subsequent fine-tuning processes. First, our approach employs diffusion-based adversarial purification, effectively neutralizing adversarial perturbations. Then, we apply the fine-tuning methodologies to refine the image restoration models further, ensuring that the quality and fidelity of the restored images are maintained. The effectiveness of our proposed approach is validated through extensive experiments, showing marked improvements in resilience against typical adversarial attacks.
- [71] arXiv:2405.11459 (replaced) [pdf, html, other]
-
Title: Du-IN: Discrete units-guided mask modeling for decoding speech from Intracranial Neural signals
Hui Zheng, Hai-Teng Wang, Wei-Bang Jiang, Zhong-Tao Chen, Li He, Pei-Yang Lin, Peng-Hu Wei, Guo-Guang Zhao, Yun-Zhe Liu
Subjects: Signal Processing (eess.SP); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
Invasive brain-computer interfaces with Electrocorticography (ECoG) have shown promise for high-performance speech decoding in medical applications, but less damaging methods like intracranial stereo-electroencephalography (sEEG) remain underexplored. With rapid advances in representation learning, leveraging abundant recordings to enhance speech decoding is increasingly attractive. However, popular methods often pre-train temporal models based on brain-level tokens, overlooking that brain activities in different regions are highly desynchronized during tasks. Alternatively, they pre-train spatial-temporal models based on channel-level tokens but fail to evaluate them on challenging tasks like speech decoding, which requires intricate processing in specific language-related areas. To address this issue, we collected a well-annotated Chinese word-reading sEEG dataset targeting language-related brain networks from 12 subjects. Using this benchmark, we developed the Du-IN model, which extracts contextual embeddings based on region-level tokens through discrete codex-guided mask modeling. Our model achieves state-of-the-art performance on the 61-word classification task, surpassing all baselines. Model comparisons and ablation studies reveal that our design choices, including (i) temporal modeling based on region-level tokens by utilizing 1D depthwise convolution to fuse channels in the ventral sensorimotor cortex (vSMC) and superior temporal gyrus (STG) and (ii) self-supervision through discrete codex-guided mask modeling, significantly contribute to this performance. Overall, our approach -- inspired by neuroscience findings and capitalizing on region-level representations from specific brain regions -- is suitable for invasive brain modeling and represents a promising neuro-inspired AI approach in brain-computer interfaces.
- [72] arXiv:2405.15607 (replaced) [pdf, html, other]
-
Title: Channel Estimation and Reconstruction in Fluid Antenna System: Oversampling is Essential
Comments: 13 pages, 16 figures - including subfigures. Accepted by IEEE TWC
Subjects: Signal Processing (eess.SP)
Fluid antenna system (FAS) has recently surfaced as a promising technology for the upcoming sixth generation (6G) wireless networks. Unlike traditional antenna system (TAS) with fixed antenna location, FAS introduces a flexible component in which the radiating element can switch its position within a predefined space. This capability allows FAS to achieve additional diversity and multiplexing gains. Nevertheless, to fully reap the benefits of FAS, obtaining channel state information (CSI) over the predefined space is crucial. In this paper, we study the system with a transmitter equipped with a traditional fixed antenna and a receiver with a fluid antenna by considering an electromagnetic-compliant channel model. We address the challenges of channel estimation and reconstruction using Nyquist sampling and maximum likelihood estimation (MLE) methods. Our analysis reveals a fundamental tradeoff between the accuracy of the reconstructed channel and the number of estimated channels, indicating that half-wavelength sampling is insufficient for perfect reconstruction and that oversampling is essential to enhance accuracy. Despite its advantages, oversampling can introduce practical challenges. Consequently, we propose a suboptimal sampling distance that facilitates efficient channel reconstruction. In addition, we employ the MLE method to bound the channel estimation error by $\epsilon$, with a specific confidence interval (CI). Our findings enable us to determine the minimum number of estimated channels and the total number of pilot symbols required for efficient channel reconstruction in a given space. Lastly, we investigate the rate performance of FAS and TAS and demonstrate that FAS with imperfect CSI can outperform TAS with perfect CSI. In contrast to existing works, we also show that there is an optimal fluid antenna size that maximizes the achievable rate.
- [73] arXiv:2406.09277 (replaced) [pdf, html, other]
-
Title: End-to-end streaming model for low-latency speech anonymization
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
Speaker anonymization aims to conceal cues to speaker identity while preserving linguistic content. Current machine learning based approaches require substantial computational resources, hindering real-time streaming applications. To address these concerns, we propose a streaming model that achieves speaker anonymization with low latency. The system is trained in an end-to-end autoencoder fashion using a lightweight content encoder that extracts HuBERT-like information, a pretrained speaker encoder that extracts speaker identity, and a variance encoder that injects pitch and energy information. These three disentangled representations are fed to a decoder that re-synthesizes the speech signal. We present evaluation results from two implementations of our system, a full model that achieves a latency of 230ms, and a lite version (0.1x in size) that further reduces latency to 66ms while maintaining state-of-the-art performance in naturalness, intelligibility, and privacy preservation.
- [74] arXiv:2407.04034 (replaced) [pdf, html, other]
-
Title: Optimizing a-DCF for Spoofing-Robust Speaker Verification
Subjects: Audio and Speech Processing (eess.AS)
Automatic speaker verification (ASV) systems are vulnerable to spoofing attacks. We propose a spoofing-robust ASV system optimized directly for the recently introduced architecture-agnostic detection cost function (a-DCF), which allows targeting a desired trade-off between the contradicting aims of user convenience and robustness to spoofing. We combine a-DCF and binary cross-entropy (BCE) with a novel straightforward threshold optimization technique. Our results with an embedding fusion system on ASVspoof2019 data demonstrate relative improvement of $13\%$ over a system trained using BCE only (from minimum a-DCF of $0.1445$ to $0.1254$). Using an alternative non-linear score fusion approach provides relative improvement of $43\%$ (from minimum a-DCF of $0.0508$ to $0.0289$).
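The two relative improvements quoted above follow directly from the reported minimum a-DCF values. As a quick arithmetic check (the function name is illustrative and not taken from the paper), the relative reduction of a cost metric can be computed as follows:

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative reduction of a cost metric, as a fraction of the baseline."""
    return (baseline - improved) / baseline


# Minimum a-DCF values as reported in the abstract.
embedding_fusion = relative_improvement(0.1445, 0.1254)
score_fusion = relative_improvement(0.0508, 0.0289)

print(f"embedding fusion: {embedding_fusion:.1%}")  # 13.2%
print(f"score fusion:     {score_fusion:.1%}")      # 43.1%
```

Both values round to the 13% and 43% figures stated in the abstract.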
- [75] arXiv:2407.08236 (replaced) [pdf, html, other]
-
Title: HRRPGraphNet: Make HRRPs to Be Graphs for Efficient Target Recognition
Comments: 3 pages, 3 figures. Accepted by IET Electronics Letters
Subjects: Signal Processing (eess.SP)
High Resolution Range Profiles (HRRP) have become a key area of focus in the domain of Radar Automatic Target Recognition (RATR). Despite the success of deep learning based HRRP recognition, these methods need a large number of training samples to perform well, which could be a severe challenge under non-cooperative circumstances. Currently, deep learning based models treat HRRP as sequences, which may lead them to ignore the internal relationships among range cells. This letter introduces HRRPGraphNet, whose pivotal innovation is the transformation of HRRP data into a novel graph structure, utilizing a range cell amplitude-based node vector and a range-relative adjacency matrix. This graph-based approach facilitates both local feature extraction via one-dimensional convolution layers and global feature extraction through a graph convolution layer and an attention module. Experiments on the aircraft electromagnetic simulation dataset confirmed HRRPGraphNet's superior accuracy and robustness, particularly in limited training sample environments, underscoring the potential of graph-driven innovations in HRRP-based RATR.
- [76] arXiv:2407.19235 (replaced) [pdf, html, other]
-
Title: B-ISAC: Backscatter Integrated Sensing and Communication for IoE Applications
Comments: 15 pages, 12 figures, submitted to IEEE Journal, This paper is the Journal version of the following paper: arXiv:2409.02797
Subjects: Signal Processing (eess.SP); Systems and Control (eess.SY)
The integration of backscatter communication (BackCom) technology with integrated sensing and communication (ISAC) technology not only enhances the system sensing performance, but also enables low-power information transmission. This is expected to provide a new paradigm for communication and sensing in internet of everything (IoE) applications. In this paper, we propose a novel cognitive wireless system called backscatter-ISAC (B-ISAC) and develop a joint beamforming framework for different stages (task modes). This system can achieve cognitive spectrum sharing between legacy communication, backscatter communication and sensing functions. We derive communication performance metrics of the system in terms of the signal-to-interference-plus-noise ratio (SINR) and communication rate, and derive sensing performance metrics of the system in terms of probability of detection, error of linear least squares (LS) estimation, and the error of linear minimum mean square error (LMMSE) estimation. The proposed joint beamforming framework consists of three stages: tag detection, tag estimation, and communication enhancement. We develop corresponding joint beamforming schemes aimed at enhancing the performance objectives of their respective stages by solving complex non-convex optimization problems. Extensive simulation results demonstrate the effectiveness of the proposed joint beamforming schemes. The proposed B-ISAC system has broad application prospect in next generation IoE scenarios.
- [77] arXiv:2408.08112 (replaced) [pdf, html, other]
-
Title: On the Spectral Efficiency of Movable and Rotary Antenna Arrays under Rician Fading
Comments: 11 pages, 11 figures. Manuscript submitted to IEEE Open Journal of the Communications Society. arXiv admin note: text overlap with arXiv:2406.19078
Subjects: Signal Processing (eess.SP)
Most works evaluating the performance of Multi-User Multiple-Input Multiple-Output (MU-MIMO) systems consider Access Points (APs) with fixed antennas, that is, without any movement capability. Recently, the idea of APs with antenna arrays that are able to move has gained traction among the research community. Many works evaluate the communications performance of Movable Antenna Arrays (MAAs) that can move on the horizontal plane. However, they require a very bulky, complex and expensive movement system. In this work, we propose a simpler and cheaper alternative: the utilization of Rotary Antenna Arrays (RAAs), i.e., antenna arrays that can rotate. We also analyze the performance of a system in which the array is able to both move and rotate. The movements and/or rotations of the array are computed in order to maximize the mean per-user achievable spectral efficiency, based on estimates of the locations of the active devices and using particle swarm optimization. We adopt a spatially correlated Rician fading channel model, and evaluate the resulting optimized performance of the different setups in terms of mean per-user achievable spectral efficiencies. Our numerical results show that both the optimal rotations and movements of the arrays can provide substantial performance gains when the line-of-sight components of the channel vectors are strong. Moreover, the simpler RAAs can outperform the MAAs when their movement area is constrained.
- [78] arXiv:2409.02424 (replaced) [pdf, html, other]
-
Title: Enhancing Information Freshness: An AoI Optimized Markov Decision Process Dedicated In the Underwater Task
Journal-ref: AAAI (Student) 2025
Subjects: Systems and Control (eess.SY)
Ocean exploration utilizing autonomous underwater vehicles (AUVs) via reinforcement learning (RL) has emerged as a significant research focus. However, underwater tasks have mostly failed due to the observation delay caused by acoustic communication in the Internet of underwater things. In this study, we present an AoI optimized Markov decision process (AoI-MDP) to improve the performance of underwater tasks. Specifically, AoI-MDP models observation delay as signal delay through statistical signal processing, and includes this delay as a new component in the state space. Additionally, we introduce wait time in the action space, and integrate AoI with reward functions to achieve joint optimization of information freshness and decision-making for AUVs leveraging RL for training. Finally, we apply this approach to the multi-AUV data collection task scenario as an example. Simulation results highlight the feasibility of AoI-MDP, which effectively minimizes AoI while showcasing superior performance in the task. To accelerate relevant research in this field, we have made the simulation codes available as open-source.
- [79] arXiv:2409.16016 (replaced) [pdf, html, other]
-
Title: VascX Models: Model Ensembles for Retinal Vascular Analysis from Color Fundus Images
Jose Vargas Quiros, Bart Liefers, Karin van Garderen, Jeroen Vermeulen, Eyened Reading Center, Sinergia Consortium, Caroline Klaver
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Tissues and Organs (q-bio.TO)
We introduce VascX models, a comprehensive set of model ensembles for analyzing retinal vasculature from color fundus images (CFIs). Annotated CFIs were aggregated from public datasets. Additional CFIs, mainly from the population-based Rotterdam Study, were annotated by graders for arteries and veins at pixel level, resulting in a dataset diverse in patient demographics and imaging conditions. VascX models demonstrated superior segmentation performance across datasets, image quality levels, and anatomic regions when compared to existing, publicly available models, likely due to the increased size and variety of our training set. Important improvements were observed in artery-vein and disc segmentation performance, particularly in segmentations of these structures on CFIs of intermediate quality, common in large cohorts and clinical datasets. Importantly, these improvements translated into significantly more accurate vascular features when we compared features extracted from VascX segmentation masks with features extracted from segmentation masks generated by previous models. With VascX models we provide a robust, ready-to-use set of model ensembles and inference code aimed at simplifying the implementation and enhancing the quality of automated retinal vasculature analyses. The precise vessel parameters generated by the model can serve as starting points for the identification of disease patterns in and outside of the eye.
- [80] arXiv:2410.12602 (replaced) [pdf, other]
-
Title: Design of Fiber-Longitudinal Optical Power Monitor
Comments: 11 pages, 13 figures, accepted version for Journal of Lightwave Technology
Journal-ref: Journal of Lightwave Technology, 2024
Subjects: Signal Processing (eess.SP); Optics (physics.optics)
This paper presents analytical results on the accuracy of fiber-longitudinal optical power monitoring (LPM) at arbitrary positions. To quantify the accuracy, the position-wise variance and power-profile SNR of LPM are defined and analyzed, yielding formulas for these metrics. Using these metrics, we show that various designs and performance predictions of LPM for a given link and estimation conditions are possible in a unified manner. Specifically, the required SNR to detect a given loss event is first presented. Based on this relation, the design parameters of LPM, such as the sample size and optical power required to detect the loss, are explicitly determined. The performance such as the detectable limit of loss events at individual positions and maximum dynamic range are also specified. These results can be used as a basis for establishing a design principle of LPM.
- [81] arXiv:2410.17539 (replaced) [pdf, html, other]
-
Title: Urban Outdoor Propagation Measurements and Channel Models at 6.75 GHz FR1(C) and 16.95 GHz FR3 Upper Mid-Band Spectrum for 5G and 6G
Dipankar Shakya, Mingjun Ying, Theodore S. Rappaport, Peijie Ma, Idris Al-Wazani, Yanze Wu, Yanbo Wang, Doru Calin, Hitesh Poddar, Ahmad Bazzi, Marwa Chafii, Yunchou Xing, Amitava Ghosh
Comments: 6 pages, 4 figures, 6 tables
Subjects: Signal Processing (eess.SP)
Global allocations in the upper mid-band spectrum (4-24 GHz) necessitate a comprehensive exploration of the propagation behavior to meet the promise of coverage and capacity. This paper presents an extensive Urban Microcell (UMi) outdoor propagation measurement campaign at 6.75 GHz and 16.95 GHz conducted in Downtown Brooklyn, USA, using a 1 GHz bandwidth sliding correlation channel sounder over 40-880 m propagation distance, encompassing 6 Line of Sight (LOS) and 14 Non-Line of Sight (NLOS) locations. Analysis of the path loss (PL) reveals lower directional and omnidirectional PL exponents compared to mmWave and sub-THz frequencies in the UMi environment, using the close-in PL model with a 1 m reference distance. Additionally, a decreasing trend in root mean square (RMS) delay spread (DS) and angular spread (AS) with increasing frequency was observed. The mean NLOS RMS DS and RMS AS values are consistently lower than 3GPP model predictions. Point data for all measured statistics at each TX-RX location are provided to supplement the models and results. The spatio-temporal statistics evaluated here offer valuable insights for the design of next-generation wireless systems and networks.
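For readers unfamiliar with the close-in (CI) path loss model mentioned above, the following is a minimal sketch of its standard textbook form: free-space path loss at the 1 m reference distance plus a distance-dependent term scaled by the path loss exponent. The function names and the exponent value used in the usage lines are illustrative assumptions; they are not values reported in the paper, and shadow fading is omitted.

```python
import math

SPEED_OF_LIGHT_M_S = 299_792_458.0


def free_space_path_loss_db(freq_hz: float, dist_m: float = 1.0) -> float:
    """Friis free-space path loss in dB at the given frequency and distance."""
    return 20.0 * math.log10(4.0 * math.pi * dist_m * freq_hz / SPEED_OF_LIGHT_M_S)


def close_in_path_loss_db(freq_hz: float, dist_m: float, ple: float,
                          ref_dist_m: float = 1.0) -> float:
    """Mean CI path loss in dB: FSPL at the reference distance plus
    10 * ple * log10(d / d0). Shadow fading is not modeled here."""
    return (free_space_path_loss_db(freq_hz, ref_dist_m)
            + 10.0 * ple * math.log10(dist_m / ref_dist_m))


# Hypothetical exponent, chosen only to exercise the formula.
EXAMPLE_PLE = 2.5
for freq_hz in (6.75e9, 16.95e9):
    pl_db = close_in_path_loss_db(freq_hz, dist_m=100.0, ple=EXAMPLE_PLE)
    print(f"{freq_hz / 1e9:.2f} GHz, 100 m: {pl_db:.1f} dB")
```

With a path loss exponent of exactly 2, the CI model reduces to free-space path loss at every distance, which makes the exponent easy to interpret across frequency bands.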
- [82] arXiv:2410.20196 (replaced) [pdf, html, other]
-
Title: Age of Information-Oriented Probabilistic Link Scheduling for Device-to-Device Networks
Comments: 8 pages, 7 figures, accepted by IEEE WiOpt24
Subjects: Signal Processing (eess.SP)
This paper focuses on optimizing the long-term average age of information (AoI) in device-to-device (D2D) networks through age-aware link scheduling. The problem is naturally formulated as a Markov decision process (MDP). However, finding the optimal policy for the formulated MDP in its original form is challenging due to the intertwined AoI dynamics of all D2D links. To address this, we propose an age-aware stationary randomized policy that determines the probability of scheduling each link in each time slot based on the AoI of all links and the statistical channel state information among all transceivers. By employing the Lyapunov optimization framework, our policy aims to minimize the Lyapunov drift in every time slot. Nonetheless, this per-slot minimization problem is nonconvex due to cross-link interference in D2D networks, posing significant challenges for real-time decision-making. After analyzing the permutation equivariance property of the optimal solutions to the per-slot problem, we apply a message passing neural network (MPNN), a type of graph neural network that also exhibits permutation equivariance, to optimize the per-slot problem in an unsupervised learning manner. Simulation results demonstrate the superior performance of the proposed age-aware stationary randomized policy over baselines and validate the scalability of our method.
- [83] arXiv:2410.22283 (replaced) [pdf, html, other]
-
Title: Leveraging Recurrent Neural Networks for Predicting Motor Movements from Primate Motor Cortex Neural Recordings
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
This paper presents an efficient deep learning solution for decoding motor movements from neural recordings in non-human primates. An Autoencoder Gated Recurrent Unit (AEGRU) model was adopted as the model architecture for this task. The autoencoder is only used during the training stage to achieve better generalization. Together with the preprocessing techniques, our model achieved an $R^2$ score of 0.71, surpassing the baseline models in Neurobench, and is ranked first for $R^2$ in the IEEE BioCAS 2024 Grand Challenge on Neural Decoding. Model pruning is also applied, leading to a 41.4% reduction in multiply-accumulate (MAC) operations with little change in the $R^2$ score compared to the unpruned model.
- [84] arXiv:2410.22530 (replaced) [pdf, html, other]
-
Title: Adaptive Aggregation Weights for Federated Segmentation of Pancreas MRI
Hongyi Pan, Gorkem Durak, Zheyuan Zhang, Yavuz Taktak, Elif Keles, Halil Ertugrul Aktas, Alpay Medetalibeyoglu, Yury Velichko, Concetto Spampinato, Ivo Schoots, Marco J. Bruno, Rajesh N. Keswani, Pallavi Tiwari, Candice Bolan, Tamas Gonda, Michael G. Goggins, Michael B. Wallace, Ziyue Xu, Ulas Bagci
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Federated learning (FL) enables collaborative model training across institutions without sharing sensitive data, making it an attractive solution for medical imaging tasks. However, traditional FL methods, such as Federated Averaging (FedAvg), face difficulties in generalizing across domains due to variations in imaging protocols and patient demographics across institutions. This challenge is particularly evident in pancreas MRI segmentation, where anatomical variability and imaging artifacts significantly impact performance. In this paper, we conduct a comprehensive evaluation of FL algorithms for pancreas MRI segmentation and introduce a novel approach that incorporates adaptive aggregation weights. By dynamically adjusting the contribution of each client during model aggregation, our method accounts for domain-specific differences and improves generalization across heterogeneous datasets. Experimental results demonstrate that our approach enhances segmentation accuracy and reduces the impact of domain shift compared to conventional FL methods while maintaining privacy-preserving capabilities. Significant performance improvements are observed across multiple hospitals (centers).
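To illustrate what adjusting per-client contributions during aggregation can look like, here is a minimal sketch of weighted parameter averaging. Passing sample counts as weights gives the usual FedAvg rule. The abstract does not say how the adaptive weights are computed, so the inverse-loss weighting below is purely a placeholder assumption, and all names are hypothetical.

```python
def aggregate(client_params: list[list[float]], weights: list[float]) -> list[float]:
    """Weighted average of flattened client parameter vectors.

    Weights are normalized to sum to 1, so only their ratios matter.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    normalized = [w / total for w in weights]
    n_params = len(client_params[0])
    return [
        sum(w * params[i] for w, params in zip(normalized, client_params))
        for i in range(n_params)
    ]


def inverse_loss_weights(val_losses: list[float], eps: float = 1e-8) -> list[float]:
    """Placeholder adaptive rule (an assumption, not the paper's method):
    clients with lower validation loss receive larger weights."""
    return [1.0 / (loss + eps) for loss in val_losses]


# Two hypothetical clients with two parameters each.
clients = [[1.0, 2.0], [3.0, 4.0]]
fedavg_model = aggregate(clients, weights=[100, 300])  # weights = sample counts
adaptive_model = aggregate(clients, weights=inverse_loss_weights([0.2, 0.4]))
print(fedavg_model)
print(adaptive_model)
```

Keeping the aggregation step separate from the weighting rule makes it easy to swap in a different rule without touching the averaging code.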
- [85] arXiv:2410.23207 (replaced) [pdf, html, other]
-
Title: Enhancing Autonomous Driving Safety Analysis with Generative AI: A Comparative Study on Automated Hazard and Risk Assessment
Subjects: Systems and Control (eess.SY)
The advent of autonomous driving technology has accentuated the need for comprehensive hazard analysis and risk assessment (HARA) to ensure the safety and reliability of vehicular systems. Traditional HARA processes, while meticulous, are inherently time-consuming and subject to human error, necessitating a transformative approach to fortify safety engineering. This paper presents an integrative application of generative artificial intelligence (AI) as a means to enhance HARA in autonomous driving safety analysis. Generative AI, renowned for its predictive modeling and data generation capabilities, is leveraged to automate the labor-intensive elements of HARA, thus expediting the process and augmenting the thoroughness of the safety analyses. Through empirical research, the study contrasts conventional HARA practices conducted by safety experts with those supplemented by generative AI tools. The benchmark comparisons focus on critical metrics such as analysis time, error rates, and scope of risk identification. By employing generative AI, the research demonstrates a significant upturn in efficiency, evidenced by reduced timeframes and expanded analytical coverage. The AI-augmented processes also deliver enhanced brainstorming support, stimulating creative problem-solving and identifying previously unrecognized risk factors.
- [86] arXiv:2410.23378 (replaced) [pdf, html, other]
-
Title: Power Modeling in mm-Wave and Terahertz CMOS Transmitters for Wireless Network-on-Chip
Subjects: Signal Processing (eess.SP)
Wireless Network-on-Chip (WNoC) systems, which interconnect chips using wireless links, face significant challenges in area and power consumption. To tackle these constraints, behavioral models (BMs) are crucial for assessing system performance under various conditions and optimizing parameters like data throughput and power consumption. Building transceivers (TRXs) physically is costly and time-consuming, making modeling a more practical approach. This paper develops a power consumption model for the sub-blocks of a WNoC transmitter (TX) at the chip level. By integrating these BMs with MATLAB, we aim to create a power model for TXs in WNoC architectures, optimized for CMOS technology operating at millimeter-wave and terahertz frequencies.
- [87] arXiv:2206.03861 (replaced) [pdf, html, other]
-
Title: Decentralized Online Regularized Learning Over Random Time-Varying Graphs
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
We study the decentralized online regularized linear regression algorithm over random time-varying graphs. At each time step, every node runs an online estimation algorithm consisting of an innovation term processing its own new measurement, a consensus term taking a weighted sum of estimations of its own and its neighbors with additive and multiplicative communication noises, and a regularization term preventing over-fitting. It is not required that the regression matrices and graphs satisfy special statistical assumptions such as mutual independence, spatio-temporal independence or stationarity. We develop the nonnegative supermartingale inequality of the estimation error, and prove that the estimations of all nodes converge to the unknown true parameter vector almost surely if the algorithm gains, graphs and regression matrices jointly satisfy the sample path spatio-temporal persistence of excitation condition. In particular, this condition holds by choosing appropriate algorithm gains if the graphs are uniformly conditionally jointly connected and conditionally balanced, and the regression models of all nodes are uniformly conditionally spatio-temporally jointly observable, under which the algorithm converges in mean square and almost surely. In addition, we prove that the regret upper bound is $O(T^{1-\tau}\ln T)$, where $\tau\in (0.5,1)$ is a constant depending on the algorithm gains.
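The three-term structure of the per-node update (innovation, consensus, regularization) can be sketched as follows. This is a simplified illustration under stated assumptions: each node takes a scalar measurement `y_i ≈ h_i · theta`, the consensus term is written in the common difference form, communication noise is omitted, and the gains `a`, `b`, `lam` and all names are hypothetical rather than taken from the paper.

```python
def step(estimates, regressors, measurements, weights, a, b, lam):
    """One synchronous update of every node's estimate.

    estimates[i]    current estimate of node i (list of floats)
    regressors[i]   regression vector h_i of node i
    measurements[i] scalar measurement y_i of node i
    weights[i][j]   graph weight node i places on node j (0 if not a neighbor)
    a, b, lam       innovation, consensus and regularization gains (assumed)
    """
    updated_all = []
    for i, x in enumerate(estimates):
        h = regressors[i]
        # Prediction error of node i on its own new measurement.
        residual = measurements[i] - sum(hd * xd for hd, xd in zip(h, x))
        updated = []
        for d in range(len(x)):
            innovation = a * h[d] * residual
            consensus = b * sum(
                weights[i][j] * (estimates[j][d] - x[d])
                for j in range(len(estimates))
            )
            regularization = -lam * x[d]
            updated.append(x[d] + innovation + consensus + regularization)
        updated_all.append(updated)
    return updated_all

if __name__ == "__main__":
    # Node 1 has a zero regressor, so it learns only through the consensus term.
    estimates = [[0.0], [1.0]]
    print(step(estimates, [[1.0], [0.0]], [2.0, 0.0], [[0, 1], [1, 0]], 0.5, 0.5, 0.0))
```

In the example, node 1 cannot observe the parameter from its own data, which is the situation the joint (rather than per-node) observability condition in the abstract is meant to cover.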
- [88] arXiv:2308.02324 (replaced) [pdf, html, other]
-
Title: Robust mmWave/sub-THz multi-connectivity using minimal coordination and coarse synchronization
Comments: Accepted version
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
This study investigates simpler alternatives to coherent joint transmission for supporting robust connectivity against signal blockage in mmWave/sub-THz access networks. By taking an information-theoretic viewpoint, we demonstrate analytically that with a careful design, full macrodiversity gains and significant SNR gains can be achieved through canonical receivers and minimal coordination and synchronization requirements at the infrastructure side. Our proposed scheme extends non-coherent joint transmission by employing a special form of diversity to counteract artificially induced deep fades that would otherwise make this technique often compare unfavorably against standard transmitter selection schemes. Additionally, the inclusion of an Alamouti-like space-time coding layer is shown to recover a significant fraction of the optimal performance. Our conclusions are based on a statistical single-user multi-point intermittent block fading channel model that, although simplified, enables rigorous ergodic and outage rate analysis, while also considering timing offsets due to imperfect delay compensation. In addition, we validate our theoretical approach by means of deterministic ray-tracing simulations that capture the essential features of next generation mmWave/sub-THz communications.
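For readers unfamiliar with the space-time coding layer mentioned above, the sketch below shows the textbook two-transmitter, one-receiver Alamouti code over a noiseless flat channel. It is not the paper's "Alamouti-like" multi-point scheme, and it ignores blockage and timing offsets; the function names and channel values are illustrative assumptions.

```python
def alamouti_encode(s1, s2):
    """Symbols sent by (transmitter 1, transmitter 2) in two consecutive slots."""
    slot1 = (s1, s2)
    slot2 = (-s2.conjugate(), s1.conjugate())
    return slot1, slot2

def receive(slots, h1, h2):
    """Noiseless single-antenna receiver; h1 and h2 are constant over both slots."""
    return [h1 * a + h2 * b for a, b in slots]

def alamouti_decode(r1, r2, h1, h2):
    """Linear combining that separates the two symbols again."""
    gain = abs(h1) ** 2 + abs(h2) ** 2
    s1_hat = (h1.conjugate() * r1 + h2 * r2.conjugate()) / gain
    s2_hat = (h2.conjugate() * r1 - h1 * r2.conjugate()) / gain
    return s1_hat, s2_hat

if __name__ == "__main__":
    h1, h2 = complex(0.8, 0.3), complex(-0.2, 0.9)  # illustrative channel gains
    s1, s2 = complex(1, 1), complex(-1, 1)          # two QPSK-like symbols
    r1, r2 = receive(alamouti_encode(s1, s2), h1, h2)
    print(alamouti_decode(r1, r2, h1, h2))
```

Each decoded symbol is scaled by `|h1|^2 + |h2|^2` before normalization, so it is lost only if both links fade together. The receiver needs channel estimates, but the transmitters do not have to align their phases.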
- [89] arXiv:2311.12056 (replaced) [pdf, html, other]
-
Title: Kuro Siwo: 33 billion $m^2$ under the water. A global multi-temporal satellite dataset for rapid flood mapping
Authors: Nikolaos Ioannis Bountos, Maria Sdraka, Angelos Zavras, Ilektra Karasante, Andreas Karavias, Themistocles Herekakis, Angeliki Thanasou, Dimitrios Michail, Ioannis Papoutsis
Comments: Accepted at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Global floods, exacerbated by climate change, pose severe threats to human life, infrastructure, and the environment. Recent catastrophic events in Pakistan and New Zealand underscore the urgent need for precise flood mapping to guide restoration efforts, understand vulnerabilities, and prepare for future occurrences. While Synthetic Aperture Radar (SAR) remote sensing offers day-and-night, all-weather imaging capabilities, its application in deep learning for flood segmentation is limited by the lack of large annotated datasets. To address this, we introduce Kuro Siwo, a manually annotated multi-temporal dataset, spanning 43 flood events globally. Our dataset maps more than 338 billion $m^2$ of land, with 33 billion designated as either flooded areas or permanent water bodies. Kuro Siwo includes a highly processed product optimized for flood mapping based on SAR Ground Range Detected, and a primal SAR Single Look Complex product with minimal preprocessing, designed to promote research on the exploitation of both the phase and amplitude information and to offer maximum flexibility for downstream task preprocessing. To leverage advances in large scale self-supervised pretraining methods for remote sensing data, we augment Kuro Siwo with a large unlabeled set of SAR samples. Finally, we provide an extensive benchmark, namely BlackBench, offering strong baselines for a diverse set of flood events from Europe, America, Africa, Asia and Australia.
- [90] arXiv:2402.02827 (replaced) [pdf, html, other]
-
Title: PowerGraph: A power grid benchmark dataset for graph neural networks
Comments: 21 pages, 8 figures, conference paper
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Power grids are critical infrastructures of paramount importance to modern society and, therefore, engineered to operate under diverse conditions and failures. The ongoing energy transition poses new challenges for the decision-makers and system operators. Therefore, developing grid analysis algorithms is important for supporting reliable operations. These key tools include power flow analysis and system security analysis, both needed for effective operational and strategic planning. The literature review shows a growing trend of machine learning (ML) models that perform these analyses effectively. In particular, Graph Neural Networks (GNNs) stand out in such applications because of the graph-based structure of power grids. However, there is a lack of publicly available graph datasets for training and benchmarking ML models in electrical power grid applications. First, we present PowerGraph, which comprises GNN-tailored datasets for i) power flows, ii) optimal power flows, and iii) cascading failure analyses of power grids. Second, we provide ground-truth explanations for the cascading failure analysis. Finally, we perform a complete benchmarking of GNN methods for node-level and graph-level tasks and explainability. Overall, PowerGraph is a multifaceted GNN dataset for diverse tasks that includes power flow and fault scenarios with real-world explanations, providing a valuable resource for developing improved GNN models for node-level, graph-level tasks and explainability methods in power system modeling. The dataset is available at this https URL and the code at this https URL.
- [91] arXiv:2406.00976 (replaced) [pdf, html, other]
-
Title: Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer
Comments: Accept in ACL2024-main
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Although speech language models have made significant progress recently, they face considerable challenges in modeling the long acoustic sequences of neural audio codecs. In this paper, we introduce Generative Pre-trained Speech Transformer (GPST), a hierarchical transformer designed for efficient speech language modeling. GPST quantizes audio waveforms into two distinct types of discrete speech representations and integrates them within a hierarchical transformer architecture, allowing for a unified one-stage generation process and enhancing Hi-Res audio generation capabilities. By training on large speech corpora in an end-to-end unsupervised manner, GPST can generate syntactically consistent speech with diverse speaker identities. Given a brief 3-second prompt, GPST can produce natural and coherent personalized speech, demonstrating in-context learning abilities. Moreover, our approach can be easily extended to cross-lingual speech generation by incorporating multi-lingual semantic tokens and universal acoustic tokens. Experimental results indicate that GPST significantly outperforms the existing speech language models in terms of word error rate, speech quality, and speaker similarity. The code is available at this https URL.
- [92] arXiv:2409.02428 (replaced) [pdf, html, other]
-
Title: Large Language Models as Efficient Reward Function Searchers for Custom-Environment Multi-Objective Reinforcement Learning
Journal-ref: AAAI (Student) 2025
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Systems and Control (eess.SY)
Achieving the effective design and improvement of reward functions in reinforcement learning (RL) tasks with complex custom environments and multiple requirements presents considerable challenges. In this paper, we propose ERFSL, an efficient reward function searcher using LLMs, which enables LLMs to be effective white-box searchers and highlights their advanced semantic understanding capabilities. Specifically, we generate reward components for each numerically explicit user requirement and employ a reward critic to identify the correct code form. Then, LLMs assign weights to the reward components to balance their values and iteratively adjust the weights without ambiguous or redundant adjustments by flexibly adopting directional mutation and crossover strategies, similar to genetic algorithms, based on the context provided by the training log analyzer. We applied the framework to an underwater data collection RL task without direct human feedback or reward examples (zero-shot learning). The reward critic successfully corrects the reward code with only one feedback instance for each requirement, effectively preventing unrectifiable errors. The initialization of weights enables the acquisition of different reward functions within the Pareto solution set without the need for weight search. Even in cases where a weight is 500 times off, on average, only 5.2 iterations are needed to meet user requirements. ERFSL also works well with most prompts utilizing GPT-4o mini, as we decompose the weight searching process to reduce the requirement for numerical and long-context understanding capabilities.
- [93] arXiv:2409.12306 (replaced) [pdf, other]
-
Title: Measuring Sound Symbolism in Audio-visual Models
Comments: Errors in the introduction part that might potentially affect the integrity of the paper. Withdraw at the point. Will replace with an updated version in the future
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Audio-visual pre-trained models have gained substantial attention recently and demonstrated superior performance on various audio-visual tasks. This study investigates whether pre-trained audio-visual models demonstrate non-arbitrary associations between sounds and visual representations, known as sound symbolism, which is also observed in humans. We developed a specialized dataset with synthesized images and audio samples and assessed these models using a non-parametric approach in a zero-shot setting. Our findings reveal a significant correlation between the models' outputs and established patterns of sound symbolism, particularly in models trained on speech data. These results suggest that such models can capture sound-meaning connections akin to human language processing, providing insights into both cognitive architectures and machine learning strategies.
- [94] arXiv:2409.12470 (replaced) [pdf, html, other]
-
Title: HSIGene: A Foundation Model For Hyperspectral Image Generation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Hyperspectral images (HSIs) play a vital role in various fields such as agriculture and environmental monitoring. However, due to the expensive acquisition cost, the number of hyperspectral images is limited, degrading the performance of downstream tasks. Although some recent studies have attempted to employ diffusion models to synthesize HSIs, they still struggle with the scarcity of HSIs, affecting the reliability and diversity of the generated images. Some studies propose to incorporate multi-modal data to enhance spatial diversity, but the spectral fidelity cannot be ensured. In addition, existing HSI synthesis models are typically uncontrollable or only support single-condition control, limiting their ability to generate accurate and reliable HSIs. To alleviate these issues, we propose HSIGene, a novel HSI generation foundation model which is based on latent diffusion and supports multi-condition control, allowing for more precise and reliable HSI generation. To enhance the spatial diversity of the training data while preserving spectral fidelity, we propose a new data augmentation method based on spatial super-resolution, in which HSIs are upscaled first, so that abundant training patches can be obtained by cropping the high-resolution HSIs. In addition, to improve the perceptual quality of the augmented data, we introduce a novel two-stage HSI super-resolution framework, which first applies RGB bands super-resolution and then utilizes our proposed Rectangular Guided Attention Network (RGAN) for guided HSI super-resolution. Experiments demonstrate that the proposed model is capable of generating a vast quantity of realistic HSIs for downstream tasks such as denoising and super-resolution. The code and models are available at this https URL.
- [95] arXiv:2409.13698 (replaced) [pdf, html, other]
-
Title: Lightweight Transducer Based on Frame-Level Criterion
Comments: Accepted by Interspeech 2024, code repository: this https URL
Journal-ref: Proc. Interspeech 2024, 247-251 (2024)
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
The transducer model trained with a sequence-level criterion requires a lot of memory due to the generation of the large probability matrix. We propose a lightweight transducer model based on a frame-level criterion, which uses the results of the CTC forced alignment algorithm to determine the label for each frame. Then the encoder output can be combined with the decoder output at the corresponding time, rather than adding each element output by the encoder to each element output by the decoder as in the transducer. This significantly reduces memory and computation requirements. To address the problem of imbalanced classification caused by excessive blanks in the label, we decouple the blank and non-blank probabilities and truncate the gradient of the blank classifier to the main network. Experiments on AISHELL-1 demonstrate that this enables the lightweight transducer to achieve similar results to the transducer. Additionally, we use richer information to predict the probability of blank, achieving superior results to the transducer.
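The memory argument can be made concrete with a small sketch. It assumes, for illustration only, that the forced alignment has been reduced to one label per frame with each token marked on exactly one frame and blank (`0`) elsewhere; the function names and the sizes used are hypothetical and not taken from the paper.

```python
BLANK = 0  # assumed blank label id

def decoder_index_per_frame(frame_labels):
    """For each frame, count the non-blank labels emitted before it.

    That count selects the single decoder output paired with the frame,
    instead of pairing the frame with every decoder position.
    """
    indices = []
    emitted = 0
    for label in frame_labels:
        indices.append(emitted)
        if label != BLANK:
            emitted += 1
    return indices

def joint_sizes(num_frames, num_labels, vocab_size):
    """Number of logits in the full transducer lattice vs. the frame-level variant."""
    full_lattice = num_frames * (num_labels + 1) * vocab_size
    frame_level = num_frames * vocab_size
    return full_lattice, frame_level

if __name__ == "__main__":
    print(decoder_index_per_frame([0, 5, 0, 0, 7, 0]))
    full, light = joint_sizes(num_frames=100, num_labels=20, vocab_size=4000)
    print(full, light, full // light)
```

Under these assumptions the saving factor equals the number of decoder positions (`num_labels + 1`), so it grows with the length of the transcript.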
- [96] arXiv:2409.15711 (replaced) [pdf, html, other]
-
Title: Adversarial Federated Consensus Learning for Surface Defect Classification Under Data Heterogeneity in IIoT
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
The challenge of data scarcity hinders the application of deep learning in industrial surface defect classification (SDC), as it is difficult to collect and centralize sufficient training data from various entities in Industrial Internet of Things (IIoT) due to privacy concerns. Federated learning (FL) provides a solution by enabling collaborative global model training across clients while maintaining privacy. However, performance may suffer due to data heterogeneity, that is, discrepancies in data distributions among clients. In this paper, we propose a novel personalized FL (PFL) approach, named Adversarial Federated Consensus Learning (AFedCL), for the challenge of data heterogeneity across different clients in SDC. First, we develop a dynamic consensus construction strategy to mitigate the performance degradation caused by data heterogeneity. Through adversarial training, local models from different clients utilize the global model as a bridge to achieve distribution alignment, alleviating the problem of global knowledge forgetting. Complementing this strategy, we propose a consensus-aware aggregation mechanism. It assigns aggregation weights to different clients based on their efficacy in global knowledge learning, thereby enhancing the global model's generalization capabilities. Finally, we design an adaptive feature fusion module to further enhance global knowledge utilization efficiency. Personalized fusion weights are gradually adjusted for each client to optimally balance global and local features. Compared with state-of-the-art FL methods like FedALA, the proposed AFedCL method achieves an accuracy increase of up to 5.67% on three SDC datasets.
- [97] arXiv:2410.07376 (replaced) [pdf, html, other]
-
Title: Optimal Attitude Control of Large Flexible Space Structures with Distributed Momentum Actuators
Comments: 10 pages, 9 figures
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Systems and Control (eess.SY)
Recent spacecraft mission concepts propose larger payloads that have lighter, less rigid structures. For large lightweight structures, the natural frequencies of their vibration modes may fall within the attitude controller bandwidth, threatening the stability and settling time of the controller and compromising performance. This work tackles this issue by proposing an attitude control design paradigm of distributing momentum actuators throughout the structure to have more control authority over vibration modes. The issue of jitter disturbances introduced by these actuators is addressed by expanding the bandwidth of the attitude controller to suppress excess vibrations. Numerical simulation results show that, at the expense of more control action, a distributed configuration can achieve lower settling times and reduce structural deformation compared to a more standard centralized configuration.
- [98] arXiv:2410.07801 (replaced) [pdf, html, other]
-
Title: LucidGrasp: Robotic Framework for Autonomous Manipulation of Laboratory Equipment with Different Degrees of Transparency via 6D Pose Estimation
Comments: Accepted to the 2024 IEEE International Conference on Robotics and Biomimetics (IEEE ROBIO 2024), 6 pages, 8 figures
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE); Systems and Control (eess.SY)
Many modern robotic systems operate autonomously; however, they often lack the ability to accurately analyze the environment and adapt to changing external conditions, while teleoperation systems often require special operator skills. In the field of laboratory automation, the number of automated processes is growing, but such systems are usually developed to perform specific tasks. In addition, many of the objects used in this field are transparent, making it difficult to analyze them using visual channels. The contributions of this work include the development of a robotic framework with autonomous mode for manipulating liquid-filled objects with different degrees of transparency in complex pose combinations. The conducted experiments demonstrated the robustness of the designed visual perception system in accurately estimating object poses for autonomous manipulation, and confirmed the performance of the algorithms in dexterous operations such as liquid dispensing. The proposed robotic framework can be applied to laboratory automation, since it makes it possible to perform non-trivial manipulation tasks that require high accuracy and repeatability by analyzing the poses of objects with varying degrees of transparency and liquid levels.
- [99] arXiv:2410.19207 (replaced) [pdf, html, other]
-
Title: Equitable Federated Learning with Activation Clustering
Comments: 28 pages
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Federated learning is a prominent distributed learning paradigm that incorporates collaboration among diverse clients, promotes data locality, and thus ensures privacy. These clients have their own technological, cultural, and other biases in the process of data generation. However, the present standard often ignores this bias/heterogeneity, perpetuating bias against certain groups rather than mitigating it. In response to this concern, we propose an equitable clustering-based framework where the clients are categorized/clustered based on how similar they are to each other. We propose a unique way to construct the similarity matrix that uses activation vectors. Furthermore, we propose a client weighting mechanism to ensure that each cluster receives equal importance and establish an $O(1/\sqrt{K})$ rate of convergence to reach an $\epsilon$-stationary solution. We assess the effectiveness of our proposed strategy against common baselines, demonstrating its efficacy in terms of reducing the bias existing amongst various client clusters and consequently ameliorating algorithmic bias against specific groups.
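One simple way to give every cluster equal importance is to split a unit weight evenly across clusters and then evenly across the clients inside each cluster. The abstract does not spell out the weighting formula, so the rule and the function name below are illustrative assumptions rather than the authors' mechanism.

```python
from collections import Counter

def cluster_balanced_weights(cluster_ids):
    """Per-client weights such that every cluster has the same total weight.

    cluster_ids[i] is the cluster assigned to client i (assumed to be given,
    for example by clustering an activation-based similarity matrix).
    """
    sizes = Counter(cluster_ids)
    num_clusters = len(sizes)
    return [1.0 / (num_clusters * sizes[c]) for c in cluster_ids]

if __name__ == "__main__":
    # Three clients fall in cluster 0, a single client in cluster 1.
    print(cluster_balanced_weights([0, 0, 0, 1]))
```

Under this rule the lone client in the small cluster carries as much total weight as the three clients of the large cluster combined, whereas uniform per-client averaging would give the large cluster three quarters of the weight.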
- [100] arXiv:2410.20359 (replaced) [pdf, html, other]
-
Title: Conditional GAN for Enhancing Diffusion Models in Efficient and Authentic Global Gesture Generation from Audios
Comments: Accepted by WACV 2025 (Round 1)
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Audio and Speech Processing (eess.AS)
Audio-driven simultaneous gesture generation is vital for human-computer communication, AI games, and film production. While previous research has shown promise, there are still limitations. Methods based on VAEs are accompanied by issues of local jitter and global instability, whereas methods based on diffusion models are hampered by low generation efficiency. This is because the denoising process of DDPM in the latter relies on the assumption that the noise added at each step is sampled from a unimodal distribution, and the noise values are small. DDIM borrows the idea from the Euler method for solving differential equations, disrupts the Markov chain process, and increases the noise step size to reduce the number of denoising steps, thereby accelerating generation. However, simply increasing the step size during the step-by-step denoising process causes the results to gradually deviate from the original data distribution, leading to a significant drop in the quality of the generated actions and the emergence of unnatural artifacts. In this paper, we break the assumptions of DDPM and achieve breakthrough progress in denoising speed and fidelity. Specifically, we introduce a conditional GAN to capture audio control signals and implicitly match the multimodal denoising distribution between the diffusion and denoising steps within the same sampling step, aiming to sample larger noise values and apply fewer denoising steps for high-speed generation.
- [101] arXiv:2410.20595 (replaced) [pdf, html, other]
-
Title: A Framework for Real-Time Volcano-Seismic Event Recognition Based on Multi-Station Seismograms and Semantic Segmentation Models
Comments: 10 pages, 9 figures. This is a pre-print, it is currently under review for publication
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
In volcano monitoring, effective recognition of seismic events is essential for understanding volcanic activity and raising timely warning alerts. Traditional methods rely on manual analysis, which can be subjective and labor-intensive. Furthermore, current automatic approaches often tackle detection and classification separately, mostly rely on single-station information and generally require tailored preprocessing and representations to perform predictions. These limitations often hinder their application to real-time monitoring and utilization across different volcano conditions. This study introduces a novel approach that utilizes Semantic Segmentation models to automate seismic event recognition by applying a straightforward transformation of multi-channel 1D signals into 2D representations, enabling their use as images. Our framework employs a data-driven, end-to-end design that integrates multi-station seismic data with minimal preprocessing, performing both detection and classification simultaneously for five seismic event classes. We evaluated four state-of-the-art segmentation models (UNet, UNet++, DeepLabV3+ and SwinUNet) on approximately 25,000 seismic events recorded at four different Chilean volcanoes: Nevados del Chillán Volcanic Complex, Laguna del Maule, Villarrica and Puyehue-Cordón Caulle. Among these models, the UNet architecture was identified as the most effective model, achieving mean F1 and Intersection over Union (IoU) scores of up to 0.91 and 0.88, respectively, and demonstrating superior noise robustness and model flexibility to unseen volcano datasets.
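As a rough sketch of the two ingredients named above, the code below stacks station channels as image rows and scores a predicted mask with IoU and F1. The abstract does not specify the exact 1D-to-2D transformation, so row stacking with per-channel min-max scaling is an assumption, as are the function names and the toy data.

```python
def to_image(channels):
    """Stack 1D station signals as rows of a 2D array scaled to [0, 1].

    Row stacking and per-channel min-max scaling are illustrative assumptions.
    """
    length = min(len(c) for c in channels)  # trim to the shortest channel
    image = []
    for channel in channels:
        row = channel[:length]
        low, high = min(row), max(row)
        span = high - low
        image.append([(v - low) / span if span else 0.0 for v in row])
    return image

def iou_and_f1(predicted, target):
    """IoU and F1 for one class, given flat binary masks of equal length."""
    tp = sum(1 for p, t in zip(predicted, target) if p and t)
    fp = sum(1 for p, t in zip(predicted, target) if p and not t)
    fn = sum(1 for p, t in zip(predicted, target) if not p and t)
    union = tp + fp + fn
    iou = tp / union if union else 1.0
    f1 = 2 * tp / (2 * tp + fp + fn) if union else 1.0
    return iou, f1

if __name__ == "__main__":
    print(to_image([[0.0, 1.0, 2.0], [5.0, 5.0, 5.0, 9.0]]))
    print(iou_and_f1([1, 1, 0, 0], [1, 0, 1, 0]))
```

In a segmentation setting the masks would mark, per time sample, whether an event of a given class is present, so detection and classification are scored together.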
- [102] arXiv:2410.23634 (replaced) [pdf, html, other]
-
Title: Tiny Learning-Based MPC for Multirotors: Solver-Aware Learning for Efficient Embedded Predictive Control
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Tiny aerial robots show promise for applications like environmental monitoring and search-and-rescue but face challenges in control due to their limited computing power and complex dynamics. Model Predictive Control (MPC) can achieve agile trajectory tracking and handle constraints. Although current learning-based MPC methods, such as Gaussian Process (GP) MPC, improve control performance by learning residual dynamics, they are computationally demanding, limiting their onboard application on tiny robots. This paper introduces Tiny Learning-Based Model Predictive Control (LB MPC), a novel framework for resource-constrained micro multirotor platforms. By exploiting multirotor dynamics' structure and developing an efficient solver, our approach enables high-rate control at 100 Hz on a Crazyflie 2.1 with a Teensy 4.0 microcontroller. We demonstrate a 23% average improvement in tracking performance over existing embedded MPC methods, achieving the first onboard implementation of learning-based MPC on a tiny multirotor (53 g).