EventGeM
Our EventGeM pipeline provides initial place predictions using global features from a vision transformer backbone with generalized mean (GeM) pooling, followed by local feature re-ranking using RANSAC on keypoint descriptors.
Dynamic vision sensors, also known as event cameras, are rapidly rising in popularity for robotic and computer vision tasks due to their sparse activation and high temporal resolution. Event cameras have been used in robotic navigation and localization tasks where accurate positioning needs to occur on small and frequent time scales, or when energy concerns are paramount.
In this work, we present EventGeM, a state-of-the-art global-to-local feature fusion pipeline for event-based Visual Place Recognition (VPR). We use a pre-trained vision transformer (ViT-S/16) backbone to obtain global feature patch embeddings from event histogram images for initial match predictions. Local feature keypoints are then detected using a pre-trained MaxViT backbone for 2D-homography-based re-ranking with RANSAC. For additional re-ranking refinement, we then use a pre-trained vision foundation model for depth estimation to compare structural similarity between references and queries.
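The global stage described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the EventGeM implementation: the two-channel per-polarity count image, the `(x, y, t, polarity)` event layout, the GeM exponent `p = 3`, and the function names `event_histogram`, `gem_pool`, and `rank_references` are all hypothetical choices made for the example. The backbone itself is omitted; `patch_tokens` stands in for whatever patch embeddings it would produce.

```python
import numpy as np

def event_histogram(events: np.ndarray, height: int, width: int) -> np.ndarray:
    """Accumulate events into a 2-channel count image, one channel per polarity.

    events: (N, 4) array of (x, y, t, polarity) rows, polarity in {0, 1}.
    This layout is an assumption made for illustration.
    """
    hist = np.zeros((2, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    pol = events[:, 3].astype(int)
    # np.add.at is unbuffered, so repeated hits on one pixel are all counted.
    np.add.at(hist, (pol, y, x), 1.0)
    return hist

def gem_pool(patch_tokens: np.ndarray, p: float = 3.0, eps: float = 1e-6) -> np.ndarray:
    """Generalized mean pooling over the patch axis, then L2 normalization.

    patch_tokens: (num_patches, dim). p = 1 gives average pooling and
    large p approaches max pooling.
    """
    clamped = np.clip(patch_tokens, eps, None)  # GeM needs positive inputs
    pooled = np.mean(clamped ** p, axis=0) ** (1.0 / p)
    return pooled / np.linalg.norm(pooled)

def rank_references(query_desc: np.ndarray, ref_descs: np.ndarray) -> np.ndarray:
    """Return reference indices sorted by descending cosine similarity.

    Descriptors are assumed to be unit length, as returned by gem_pool.
    """
    sims = ref_descs @ query_desc
    return np.argsort(-sims)
```

In a setup like this, the reference descriptors would typically be computed once offline, so that each query costs one backbone pass plus a single matrix-vector product.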
Our method achieves state-of-the-art localization performance when compared to the best currently available event-based place recognition method across several benchmark datasets and lighting conditions, all whilst being fully capable of running in real time when deployed across a variety of compute architectures. We demonstrate the capability of EventGeM in a real-world deployment on a robotic platform for online localization using event streams directly from an event camera.
Matches can be further refined by an additional re-ranking step based on depth estimation. Computing a structural similarity index measure (SSIM) between the depth maps of queries and references improves localization performance beyond keypoint RANSAC alone.
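To make the depth-based refinement concrete, the sketch below scores candidate references by the structural similarity of their depth maps to the query's depth map. It assumes depth maps have already been estimated and scaled to `[0, 1]`, and it uses a simplified single-window (global) SSIM rather than the usual sliding-window form; the exact SSIM variant used by EventGeM is not stated here. The names `global_ssim` and `rerank_by_depth` are hypothetical.

```python
import numpy as np

def global_ssim(a: np.ndarray, b: np.ndarray, data_range: float = 1.0) -> float:
    """Single-window SSIM computed over the whole image.

    Simplification: standard SSIM averages this quantity over local windows.
    """
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    c1 = (0.01 * data_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizes the contrast/structure term
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(num / den)

def rerank_by_depth(query_depth, candidate_depths, candidate_ids):
    """Order candidate ids by descending depth-map SSIM against the query."""
    scores = [global_ssim(query_depth, d) for d in candidate_depths]
    order = np.argsort(scores)[::-1]
    return [candidate_ids[i] for i in order]
```

Because depth estimation adds a model pass per image, a step like this would normally be applied only to the handful of candidates that survive the earlier stages.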
We evaluate EventGeM on multiple benchmark datasets, showing strong performance across diverse lighting conditions and environments.
EventGeM performs VPR in real time whilst performing substantially better than current baseline methods. By combining our initial global place prediction with keypoint re-ranking, we balance localization performance against runtime efficiency.
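One way to see how global prediction and keypoint re-ranking trade accuracy against runtime is to verify only the top-k global candidates geometrically. The sketch below is an illustration under assumptions, not the EventGeM code: it takes already-matched keypoint coordinates as input (the MaxViT keypoint detector and descriptor matching are omitted), fits homographies with a plain direct linear transform, and uses hypothetical names and parameter values (`ransac_inliers`, `rerank_top_k`, `iters`, `thresh`, `k`).

```python
import numpy as np

def fit_homography(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Direct linear transform from at least 4 point pairs, arrays of shape (N, 2)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=np.float64))
    return vt[-1].reshape(3, 3)

def project(h: np.ndarray, pts: np.ndarray) -> np.ndarray:
    """Apply homography h to (N, 2) points."""
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ h.T
    with np.errstate(divide="ignore", invalid="ignore"):
        return homog[:, :2] / homog[:, 2:3]

def ransac_inliers(src, dst, iters=200, thresh=3.0, seed=0) -> int:
    """Best inlier count over random 4-point homography hypotheses."""
    if len(src) < 4:
        return 0
    rng = np.random.default_rng(seed)
    best = 0
    for _ in range(iters):
        idx = rng.choice(len(src), size=4, replace=False)
        h = fit_homography(src[idx], dst[idx])
        err = np.linalg.norm(project(h, src) - dst, axis=1)
        best = max(best, int(np.sum(err < thresh)))  # NaN errors count as outliers
    return best

def rerank_top_k(global_order, inlier_fn, k=10):
    """Re-order only the first k global candidates by inlier count.

    inlier_fn maps a reference id to its RANSAC inlier count for the query.
    """
    head = sorted(global_order[:k], key=inlier_fn, reverse=True)
    return list(head) + list(global_order[k:])
```

Restricting verification to the first `k` candidates keeps the per-query cost roughly constant regardless of how many reference places there are, which is the usual motivation for this kind of two-stage design.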
Using a Jetson AGX Orin and a DAVIS346 event camera, we demonstrate EventGeM's real-time localization capabilities in a real-world robotic deployment.
@misc{hines2026eventgem,
title={EventGeM: Global-to-Local Feature Matching for Event-Based Visual Place Recognition},
author={Adam D. Hines and Gokul B. Nair and Nicolás Marticorena and Michael Milford and Tobias Fischer},
year={2026},
eprint={2603.05807},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.05807},
}