ICCV 2023 Proceedings

ResQ Residual Quantization for Video Perception
CDUL CLIP-Driven Unsupervised Learning for Multi-Label Image Classification
SATR Zero-Shot Semantic Segmentation of 3D Shapes
GePSAn Generative Procedure Step Anticipation in Cooking Videos
CLIPTER Looking at the Bigger Picture in Scene Text Recognition
A-STAR Test-time Attention Segregation and Retention for Text-to-image Synthesis
Ordered Atomic Activity for Fine-grained Interactive Traffic Scenario Understanding
Person Re-Identification without Identification via Event anonymization
SSDA Secure Source-Free Domain Adaptation
Sample-wise Label Confidence Incorporation for Learning with Noisy Labels
Story Visualization by Online Text Augmentation with Context Memory
Continual Learning for Personalized Co-speech Gesture Generation
Data-Free Class-Incremental Hand Gesture Recognition
Efficient Controllable Multi-Task Architectures
Yes we CANN Constrained Approximate Nearest Neighbors for Local Feature-Bas
Self-Supervised Object Detection from Egocentric Videos
Iterative Superquadric Recomposition of 3D Objects from Multiple Views
StyleDomain Efficient and Lightweight Parameterizations of StyleGAN for One-shot an
HMD-NeMo Online 3D Avatar Motion Generation From Sparse Observations
Zenseact Open Dataset A Large-Scale and Diverse Multimodal Dataset fo
Task Agnostic Restoration of Natural Video Dynamics
VidStyleODE Disentangled Video Editing via StyleGAN and NeuralODEs
Learning Human-Human Interactions in Images from Weak Textual Supervision
Rapid Adaptation in Online Continual Learning Are We Evaluating It
3D Instance Segmentation via Enhanced Spatial and Semantic Supervision
XiNet Efficient Neural Networks for tinyML
BEVBert Multimodal Map Pre-training for Language-guided Navigation
MiniROAD Minimal RNN Framework for Online Action Detection
Towards Content-based Pixel Retrieval in Revisited Oxford and Paris
Long-range Multimodal Pretraining for Movie Understanding
Viewing Graph Solvability in Practice
LIST Learning Implicitly from Spatial Transformers for Single-View 3D Reconstruction
MixBag Bag-Level Data Augmentation for Learning from Label Proportions
uSplit Image Decomposition for Fluorescence Microscopy
SINC Spatial Composition of 3D Human Motions for Simultaneous Action
DarSwin Distortion Aware Radial Swin Transform
Unified Out-Of-Distribution Detection A Model-Specific Perspective
ADAPT Efficient Multi-Agent Trajectory Prediction with Adaptation
Make-An-Animation Large-Scale Text-conditional 3D Human Motion Generation
Markov Game Video Augmentation for Action Segmentation
Adaptive Spiral Layers for Efficient 3D Representation Learning on Meshes
Luminance-aware Color Transform for Multiple Exposure Correction
EigenTrajectory Low-Rank Descriptors for Multi-Modal Trajectory Forecasting
PNI Industrial Anomaly Detection using Position and Neighborhood Information
CC3D Layout-Conditioned Generation of Compositional 3D Scenes
How Much Temporal Long-Term Context is Needed for Action Segmentation
Cross-Domain Product Representation Learning for Rich-Content E-Commerce
Dynamic PlenOctree for Adaptive Sampling Refinement in Explicit NeRF
Unified Data-Free Compression Pruning and Quantization without Fine-Tuning
HRS-Bench Holistic Reliable and Scalable Benchmark for Text-to-Image Models
Towards Improved Input Masking for Convolutional Neural Networks
Multimodal Garment Designer Human-Centric Latent Diffusion Models for Fashion Imag
Zero-Shot Composed Image Retrieval with Textual Inversion
CleanCLIP Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning
Uncertainty-aware State Space Transformer for Egocentric 3D Hand Trajectory Forecasting
DEDRIFT Robust Similarity Search under Content Drift
Efficient Discovery and Effective Evaluation of Visual Perceptual Similarity A
Visual Explanations via Iterated Integrated Attributions
BeLFusion Latent Diffusion for Behavior-Driven Human Motion Prediction
With a Little Help from Your Own Past Prototypical Memory
Localizing Moments in Long Video Via Multimodal Guidance
Zip-NeRF Anti-Aliased Grid-Based Neural Radiance Fields
Active Stereo Without Pattern Projector
SatlasPretrain A Large-Scale Dataset for Remote Sensing Image Understanding
Inspecting the Geographical Representativeness of Images from Text-to-Image Models
XMem Production-level Video Segmentation From Few Annotated Frames
A Game of Bundle Adjustment - Learning Efficient Convergence
MapFormer Boosting Change Detection by Using Pre-change Information
EigenPlaces Training Viewpoint Robust Models for Visual Place Recognition
Vision Transformer Adapters for Generalizable Multitask Learning
Self-Supervised Burst Super-Resolution
Detecting Objects with Context-Likelihood Graphs and Graph Refinement
Breaking Common Sense WHOOPS A Vision-and-Language Benchmark of Synthetic an
VL-Match Enhancing Vision-Language Pretraining with Token-Level and Instance-Level Matching
VADER Video Alignment Differencing and Retrieval
Mesh2Tex Generating Mesh Textures from Image Queries
Beyond the Pixel a Photometrically Calibrated HDR Dataset for Luminanc
Distilling from Similar Tasks for Transfer Learning on a Budget
HyperReenact One-Shot Reenactment via Jointly Learning to Refine and Retarget
IDiff-Face Synthetic-based Face Recognition through Fizzy Identity-Conditioned Diffusion Model
Plausible Uncertainties for Human Pose Regression
Compatibility of Fundamental Matrices for Complete Viewing Graphs
A Multidimensional Analysis of Social Biases in Vision Transformers
Contrastive Model Adaptation for Cross-Condition Robustness in Semantic Segmentation
Preface A Data-driven Volumetric Prior for Few-shot Ultra High-resolution Fac
FS-DETR Few-Shot DEtection TRansformer with Prompting and without Re-Training
ReGen A good Generative Zero-Shot Video Classifier Should be Rew
V-FUSE Volumetric Depth Map Fusion with Long-Range Constraints
UniverSeg Universal Medical Image Segmentation
Towards Building More Robust Models with Frequency Bias
Building a Winning Team Selecting Source Model Ensembles using
Active Self-Supervised Learning A Few Low-Cost Relationships Are All You
CLNeRF Continual Learning Meets NeRF
Consistent Depth Prediction for Transparent Object Reconstruction from RGB-D Cam
DiffDreamer Towards Consistent Unsupervised Single-view Scene Extrapolation with Conditional Diffusion
Doppelgangers Learning to Disambiguate Images of Similar Structures
EfficientViT Lightweight Multi-Scale Attention for High-Resolution Dense Prediction
IIEU Rethinking Neural Feature Activation from Decision-Making
MixReorg Cross-Modal Mixed Patch Reorganization is a Good Mask Learn
ObjectFusion Multi-modal 3D Object Detection with Object-Centric Fusion
Rehearsal-Free Domain Continual Face Anti-Spoofing Generalize More and Forget Less
Retinexformer One-stage Retinex-based Transformer for Low-light Image Enhancement
Robust Object Modeling for Visual Tracking
Exploiting Proximity-Aware Tasks for Embodied Social Navigation
Improving Online Lane Graph Extraction by Object-Lane Clustering
Anomaly Detection Under Distribution Shift
Attention Where It Matters Rethinking Visual Document Understanding with Selectiv
E2E-LOAD End-to-End Long-form Online Action Detection
Efficient-VQGAN Towards High-Resolution Image Generation with Efficient Vision Transformers
Improving Transformer-based Image Matching by Cascaded Capturing Spatially Informative Keypoints
Knowledge-Aware Federated Active Learning with Non-IID Data
MasaCtrl Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis an
Multi-Modal Continual Test-Time Adaptation for 3D Semantic Segmentation
Multi-Modal Gated Mixture of Local-to-Global Experts for Dynamic Image Fusion
OmniZoomer Learning to Move and Zoom in on Sphere at
Re-mine Learn and Reason Exploring the Cross-modal Semantic Correlations fo
SceneRF Self-Supervised Monocular 3D Scene Reconstruction with Radiance Fields
Strip-MLP Efficient Token Interaction for Vision MLP
TexFusion Synthesizing 3D Textures with Text-Guided Image Diffusion Models
Going Beyond Nouns With Vision Language Models Using Synthetic
A Simple Recipe to Meta-Learn Forward and Backward Trans
Pix2Video Video Editing using Image Diffusion
Global Adaptation Meets Local Generalization Unsupervised Domain Adaptation for 3D
HiFace High-Fidelity 3D Face Reconstruction by Learning Static and Dynamic
StableVideo Text-driven Consistency-aware Diffusion Video Editing
DETRDistill A Universal Knowledge Distillation Framework for DETR-families
HairNeRF Geometry-Aware Image Synthesis for Hairstyle Trans
Neural Radiance Field with LiDAR maps
Revisiting Vision Transformer from the View of Path Ensemble
Generative Novel View Synthesis with 3D-Aware Diffusion Models
Hashing Neural Video Decomposition with Multiplicative Residuals in Space-Time
ReLeaPS Reinforcement Learning-based Illumination Planning for Generalized Photometric Stereo
SpinCam High-Speed Imaging via a Rotating Point-Spread Function
Shape Analysis of Euclidean Curves under Frenet-Serret Framework
PASTA Proportional Amplitude Spectrum Training Augmentation for Syn-to-Real Domain Generalization
Towards Realistic Evaluation of Industrial Continual Learning Scenarios with an
Quality Diversity for Visual Pre-Training
3DMiner Discovering Shapes from Large-Scale Unannotated Image Datasets
Adversarial Bayesian Augmentation for Single-Source Domain Generalization
ChartReader A Unified Framework for Chart Derendering and Comprehension without
Contrastive Continuity on Augmentation Stability Rehearsal for Continual Self-Supervised Learning
DNA-Rendering A Diverse Neural Actor Repository for High-Fidelity Human-Centric Rendering
DVGaze Dual-View Gaze Estimation
Forecast-MAE Self-supervised Pre-training for Motion Forecasting with Masked Autoencoders
Frequency Guidance Matters in Few-Shot Learning
General Image-to-Image Translation with One-Shot Image Guidance
HandR2N2 Iterative 3D Hand Pose Estimation Using a Residual Recurrent
LISTER Neighbor Decoding for Length-Insensitive Scene Text Recognition
LU-NeRF Scene and Pose Estimation by Synchronizing Local Unposed NeRFs
MixSpeech Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech
Multi-Scale Bidirectional Recurrent Network with Hybrid Correlation for Point Clou
PRIOR Prototype Representation Joint Learning from Medical Images and Reports
ReST A Reconfigurable Spatial-Temporal Graph Model for Multi-Camera Multi-Object Tracking
Score Priors Guided Deep Variational Inference for Unsupervised Real-World Singl
Tracking Anything with Decoupled Video Segmentation
Activate and Reject Towards Safe Domain Generalization under Category Shift
AdaMV-MoE Adaptive Multi-Task Vision Mixture-of-Experts
AdvDiffuser Natural Adversarial Example Synthesis with Diffusion Models
AGG-Net Attention Guided Gated-Convolutional Network for Depth Image Completion
An Adaptive Model Ensemble Adversarial Attack for Boosting Adversarial Transferability
AREA Adaptive Reweighting via Effective Area for Long-Tailed Classification
Atmospheric Transmission and Thermal Inertia Induced Blind Road Segmentation with
A Generalist Framework for Panoptic Segmentation of Images and Videos
A Retrospect to Multi-prompt Learning across Vision and Language
Be Everywhere - Hear Everything BEE Audio Scene Reconstruction by
BoMD Bag of Multi-label Descriptors for Noisy Chest X-ray Classification
Building Vision Transformers with Hierarchy Aware Feature Aggregation
CancerUniT Towards a Single Unified Model for Effective Detection Segmentation
Category-aware Allocation Transformer for Weakly Supervised Object Localization
CuNeRF Cube-Based Neural Radiance Field for Zero-Shot Medical Image Arbitrary-Scal
Deep Multiview Clustering by Contrasting Cluster Assignments
DiffRate Differentiable Compression Rate for Efficient Vision Transformers
DiffusionDet Diffusion Model for Object Detection
Domain Generalization via Rationale Invariance
DReg-NeRF Deep Registration for Neural Radiance Fields
Dual Aggregation Transformer for Image Super-Resolution
Dynamic Residual Classifier for Class Incremental Learning
Editable Image Geometric Abstraction via Neural Primitive Assembly
Efficient Deep Space Filling Curve
Efficient Video Action Detection with Token Dropout and Context Refinement
Exploring Open-Vocabulary Semantic Segmentation from CLIP Vision Encoder Distillation Only
Fan-Beam Binarization Difference Projection FB-BDP A Novel Local Object Descripto
Fantasia3D Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation
FBLNet FeedBack Loop Network for Driver Attention Prediction
FocalFormer3D Focusing on Hard Instance for 3D Object Detection
FPR False Positive Rectification for Weakly Supervised Semantic Segmentation
FRAug Tackling Federated Learning with Non-IID Features via Representation Augmentation
Generating Dynamic Kernels via Transformers for Lane Detection
GridPull Towards Scalability in Learning Implicit Representations from 3D Point
Group DETR Fast DETR Training with Group-Wise One-to-Many Assignment
HumanMAC Masked Motion Completion for Human Motion Prediction
Joint Implicit Neural Representation for High-fidelity and Compact Vector Fonts
Learning Continuous Exposure Value Representations for Single-Image HDR Reconstruction
Learning from Noisy Data for Semi-Supervised 3D Object Detection
MHEntropy Entropy Meets Multiple Hypotheses for Pose and Shape Recovery
Mimic3D Thriving 3D-Aware GANs via 3D-to-2D Imitation
MoTIF Learning Motion Trajectories with Local Implicit Neural Functions fo
Multi-view Self-supervised Disentanglement for General Image Denoising
NeuRBF A Neural Fields Representation with Adaptive Radial Basis Functions
Omnidirectional Information Gathering for Knowledge Transfer-Based Audio-Visual Navigation
Open-vocabulary Panoptic Segmentation with Embedding Modulation
Overcoming Forgetting Catastrophe in Quantization-Aware Training
PointDC Unsupervised Semantic Segmentation of 3D Point Clouds via Cross-Modal
Ray Conditioning Trading Photo-consistency for Photo-realism in Multi-view Image Generation
Rethinking Point Cloud Registration as Masking and Reconstruction
Revisiting Domain-Adaptive 3D Object Detection by Reliable Diverse and Class-balanc
SHIFT3D Synthesizing Hard Inputs For Tricking 3D Detectors
SINC Self-Supervised In-Context Learning for Vision-Language Tasks
Single-Stage Diffusion NeRF A Unified Approach to 3D Generation an
SIRA-PCR Sim-to-Real Adaptation for 3D Point Cloud Registration
Size Does Matter Size-aware Virtual Try-on via Clothing-oriented Transformation Try-on
SMMix Self-Motivated Image Mixing for Vision Transformers
Snow Removal in Video A New Dataset and A Novel
Sound Localization from Motion Jointly Learning Sound Direction and Cam
Sparse Sampling Transformer with Uncertainty-Driven Ranking for Unified Removal o
SVQNet Sparse Voxel-Adjacent Query Network for 4D Spatio-Temporal LiDAR Semantic
Tem-Adapter Adapting Image-Text Pretraining for Video Question Answ
Text2Tex Text-driven Texture Synthesis via Diffusion Models
The Devil is in the Crack Orientation A New Perspectiv
Towards Unifying Medical Vision-and-Language Pre-Training via Soft Prompts
Traj-MAE Masked Autoencoders for Trajectory Prediction
TrajectoryFormer 3D Object Tracking Transformer with Predictive Trajectory Hypotheses
TransIFF An Instance-Level Feature Fusion Framework for Vehicle-Infrastructure Cooperative 3D
TransTIC Transferring Transformer-based Image Compression from Human Perception to Machin
UniT3D A Unified Transformer for 3D Dense Captioning and Visual
VeRi3D Generative Vertex-based Radiance Fields for 3D Controllable Human Imag
Video Action Recognition with Attentive Semantic Units
VQA Therapy Exploring Answer Differences by Visually Grounding Answers
WDiscOOD Out-of-Distribution Detection via Whitened Linear Discriminant Analysis
Weakly-supervised 3D Pose Transfer with Keypoints
Workie-Talkie Accelerating Federated Learning by Overlapping Computing and Communications vi
Parametric Information Maximization for Generalized Category Discovery
Muscles in Action
Better May Not Be Fairer A Study on Subgroup Discrepancy
Spacetime Surface Regularization for Neural Dynamic Scene Reconstruction
DiffV2S Diffusion-Based Video-to-Speech Synthesis with Vision-Guided Speaker Embedding
Environment Agnostic Representation for Visual Reinforcement Learning
Exploring Positional Characteristics of Dual-Pixel Data for Camera Autofocus
ORC Network Group-based Knowledge Distillation using Online Role Change
R-Pred Two-Stage Motion Prediction Via Tube-Query Attention-Based Trajectory Refinement
TEMPO Efficient Multi-View Pose Estimation Tracking and Forecasting
Diffusion-SDF Conditional Generative Modeling of Signed Distance Functions
AdVerb Visually Guided Audio Dereverberation
Democratising 2D Sketch to 3D Shape Retrieval Through Pivoting
Complementary Domain Adaptation and Generalization for Unsupervised Continual Domain Shift
DALL-Eval Probing the Reasoning Skills and Social Biases of Text-to-Imag
Distribution-Aware Prompt Tuning for Vision-Language Models
Label-Free Event-based Object Recognition via Joint Learning with Image Reconstruction
Local or Global Selective Knowledge Assimilation for Federated Learning with
Non-Coaxial Event-Guided Motion Deblurring with Spatial Alignment
PromptStyler Prompt-driven Style Generation for Source-free Domain Generalization
Image-Free Classifier Injection for Zero-Shot Classification
LAN-HDR Luminance-based Alignment Network for High Dynamic Range Video Reconstruction
Shortcut-V2V Compression Framework for Video-to-Video Translation Based on Temporal Redundancy
MixPath A Unified Approach for One-shot Neural Architecture Search
Rethinking Fast Fourier Convolution in Image Inpainting
A2Q Accumulator-Aware Quantization with Guaranteed Overflow Avoidance
To Adapt or Not to Adapt Real-Time Adaptation for Semantic
Enhancing NeRF akin to Enhancing LLMs Generalizable NeRF Transformer with
Unaligned 2D to 3D Translation with Conditional Vector-Quantized Code Diffusion
Learning Depth Estimation for Transparent and Mirror Surfaces
Zero-Shot Spatial Layout Conditioning for Text-to-Image Diffusion Models
Moment Detection in Long Tutorial Videos
Focal Network for Image Restoration
Grounded Entity-Landmark Adaptive Pre-Training for Vision-and-Language Navigation
Learning Hierarchical Features with Joint Latent Space Energy-Based Prior
P2C Self-Supervised Point Cloud Completion from Single Partial Clouds
SportsMOT A Large Multi-Object Tracking Dataset in Multiple Sports Scenes
Test-time Personalizable Forecasting of 3D Human Poses
Cloth2Body Generating 3D Human Body Mesh from 2D Clothing
Indoor Depth Recovery Based on Deep Unfolding with Non-Local Prior
X-VoE Measuring eXplanatory Violation of Expectation in Physical Events
Multi-body Depth and Camera Pose Estimation from Multiple Views
AutoSynth Learning to Generate 3D Training Data for Object Point
Search for or Navigate to Dual Adaptive Thinking for Object
TransFace Calibrating Transformer Training for Face Recognition from a Data-Centric
EverLight Indoor-Outdoor Editable HDR Lighting Estimation
Efficient Video Prediction via Sparsely Conditioned Flow Matching
LIMITR Leveraging Local Information for Medical Image-Text Representation
Vision Grid Transformer for Document Layout Analysis
Robust Frame-to-Frame Camera Rotation Estimation in Crowded Scenes
PoseFix Correcting 3D Human Poses with Natural Language
A Large-scale Study of Spatiotemporal Representation Learning with a New
Explicit Motion Disentangling for Efficient Optical Flow Estimation
GrowCLIP Data-Aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-Training
Identity-Consistent Aggregation for Video Object Detection
Learning Neural Eigenfunctions for Unsupervised Semantic Segmentation
NeRF-LOAM Neural Implicit Representation for Large-Scale Incremental LiDAR Odometry an
PIRNet Privacy-Preserving Image Restoration Network via Wavelet Lifting
Prompt Switch Efficient CLIP Adaptation for Text-Video Retrieval
Towards Inadequately Pre-trained Models in Transfer Learning
Bayesian Prompt Learning for Image-Language Model Generalization
Sample4Geo Hard Negative Sampling For Cross-View Geo-Localisation
Cross-modal Latent Space Alignment for Image to Avatar Translation
Strata-NeRF Neural Radiance Fields for Stratified Scenes
General Planar Motion from a Pair of 3D Correspondences
3DMOTFormer Graph Transformer for Online 3D Multi-Object Tracking
MeViS A Large-scale Benchmark for Video Segmentation with Motion Expressions
Minimal Solutions to Generalized Three-View Relative Pose Problem
MOSE A New Dataset for Video Object Segmentation in Complex
PivotNet Vectorized Pivot Learning for End-to-end HD Map Construction
Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation
Unsupervised Manifold Linearizing and Clustering
VertexSerum Poisoning Graph Neural Networks for Link Inference
SFHarmony Source Free Domain Adaptation for Distributed Neuroimaging Analysis
U-RED Unsupervised 3D Shape Retrieval and Deformation for Partial Point
Lip2Vec Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual
TAPIR Tracking Any Point with Per-Frame Initialization and Temporal Refinement
Foreground-Background Separation through Concept Distillation from Generative Image Foundation Models
AG3D Learning to Generate 3D Avatars from 2D Image Collections
Boosting Long-tailed Object Detection via Step-wise Learning on Smooth-tail Data
Collaborative Propagation on Multiple Instance Graphs for 3D Instance Segmentation
Cross-view Topology Based Consistent and Complementary Information for Deep Multi-view
CVSformer Cross-View Synthesis Transformer for Semantic Scene Completion
Dual Learning with Dynamic Knowledge Distillation for Partially Relevant Video
EMQ Evolving Training-free Proxies for Automated Mixed Precision Quantization
Heterogeneous Forgetting Compensation for Class-Incremental Learning
iVS-Net Learning Human View Synthesis from Internet Videos
Knowledge Restore and Transfer for Multi-Label Class-Incremental Learning
Large-Scale Land Cover Mapping with Fine-Grained Classes via Class-Aware Semi-Supervis
Multi-Scale Residual Low-Pass Filter Network for Image Deblurring
One-bit Flip is All You Need When Bit-flip Attack Meets
Preserving Tumor Volumes for Unsupervised Medical Image Registration
Prompt Tuning Inversion for Text-driven Image Editing Using Diffusion Models
Shape Anchor Guided Holistic Indoor Scene Understanding
Sparse Instance Conditioned Multimodal Trajectory Prediction
Identity-Seeking Self-Supervised Representation Learning for Generalizable Person Re-Identification
TORE Token Reduction for Efficient Human Mesh Recovery with Transform
Reducing Training Time in Cross-Silo Federated Learning Using Multigraph Topology
Rosetta Neurons Mining the Common Units in a Model Zoo
One-Shot Recognition of Any Material Anywhere Using Contrastive Learning with
SkeleTR Towards Skeleton-based Action Recognition in the Wild
Towards Saner Deep Image Registration
Towards Semi-supervised Learning with Non-random Missing Labels
A Low-Shot Object Counting Network With Iterative Prototype Adaptation
SAFE Machine Unlearning With Shard Graphs
Eventful Transformers Leveraging Temporal Redundancy in Vision Transformers
Multi-View Active Fine-Grained Visual Recognition
σ-Adaptive Decoupled Prototype for Few-Shot Object Detection
Semi-Supervised Learning via Weight-Aware Distillation under Class Distribution Mismatch
HyperDiffusion Generating Implicit Neural Fields with Weight-Space Diffusion
Physically-Plausible Illumination Distribution Estimation
Structure and Content-Guided Video Synthesis with Diffusion Models
All4One Symbiotic Neighbour Contrastive Learning via Self-Attention and Redundancy Reduction
Diffusion in Style
Reinforce Data Multiply Impact Improved Model Accuracy and Robustness with
PODA Prompt-driven Zero-shot Domain Adaptation
FastRecon Few-shot Industrial Anomaly Detection via Fast Feature Reconstruction
GIFD A Generative Gradient Inversion Method with Feature Domain Optimization
Locating Noise is Halfway Denoising for Semi-Supervised Segmentation
Robust Heterogeneous Federated Learning under Data Corruption
SQAD Automatic Smartphone Camera Quality Assessment and Benchmarking
Tracing the Origin of Adversarial Attack for Forensic Investigation an
UATVR Uncertainty-Adaptive Text-Video Retrieval
Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object
Visible-Infrared Person Re-Identification via Semantic Alignment and Affinity Inference
Flexible Visual Recognition by Evidential Modeling of Confusion and Ignorance
Motion-Guided Masking for Spatiotemporal Representation Learning
Occ2Net Robust Image Matching Based on 3D Occupancy Estimation fo
Once Detected Never Lost Surpassing Human Performance in Offline LiDAR
RCA-NOC Relative Contrastive Alignment for Novel Object Captioning
Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric
Simulating Fluids in Real-World Still Images
SSB Simple but Strong Baseline for Boosting Performance of Open-Set
Taxonomy Adaptive Cross-Domain Adaptation in Medical Imaging via Optimization Trajectory
Unpaired Multi-domain Attribute Translation of 3D Facial Shapes with
Unsupervised Open-Vocabulary Object Localization in Videos
Transferable Decoding with Visual Entities for Zero-Shot Image Captioning
3D Motion Magnification Visualizing Subtle Motions from Time-Varying Radiance Fields
Clustering based Point Cloud Representation Learning for 3D Analysis
CVRecon Rethinking 3D Geometric Feature Learning For Neural Reconstruction
DiffPose SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation
Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning
Generalizing Neural Human Fitting to Unseen Poses With Articulated SE
Hierarchical Contrastive Learning for Pattern-Generalizable Image Corruption Detection
Score-Based Diffusion Models as Principled Priors for Inverse Imaging
Semantically Structured Image Compression via Irregular Group-Based Decoupling
SimFIR A Simple Framework for Fisheye Image Rectification with Self-supervis
Towards Instance-adaptive Inference for Federated Learning
ViM Vision Middleware for Unified Downstream Transferring
The Stable Signature Rooting Watermarks in Latent Diffusion Models
TeD-SPAD Temporal Distinctiveness for Self-Supervised Privacy-Preservation for Video Anomaly Detection
Multimodal Motion Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection
Distribution-Aligned Diffusion for Human Mesh Recovery
Jumping through Local Minima Quantization in the Loss Landscape o
NLOS-NeuS Non-line-of-sight Neural Implicit Surface
ASAG Building Strong One-Decoder-Layer Sparse Detectors via Adaptive Sparse Ancho
Dancing in the Dark A Benchmark towards General Low-light Video
Deformer Dynamic Fusion Transformer for Robust Hand Pose Estimation
GPGait Generalized Pose-based Gait Recognition
Towards High-Quality Specular Highlight Removal by Leveraging Large-Scale Synthetic Data
TripLe Revisiting Pretrained Model Reuse and Progressive Learning for Efficient
UnitedHuman Harnessing Multi-Source Data for High-Resolution Human Generation
VAPCNet Viewpoint-Aware 3D Point Cloud Completion
Erasing Concepts from Diffusion Models
Improving Unsupervised Visual Program Inference with Code Rewriting Families
Towards Models that Can See and R
Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation
Towards Robust Model Watermark via Reducing Parametric Vulnerability
Adaptive Positional Encoding for Bundle-Adjusting Neural Radiance Fields
Adaptive Testing of Computer Vision Models
A 5-Point Minimal Solver for Event Camera Relative Motion Estimation
A Unified Continual Learning Framework with General Parameter-Efficient Tuning
Coarse-to-Fine Amodal Segmentation with Shape Prior
Controllable Visual-Tactile Synthesis
CSDA Learning Category-Scale Joint Feature for Domain Adaptive Object Detection
DIFFGUARD Semantic Mismatch-Guided Out-of-Distribution Detection Using Pre-Trained Diffusion Models
DQS3D Densely-matched Quantization-aware Semi-supervised 3D Detection
Human-Inspired Facial Sketch Synthesis with Dynamic Adaptation
Masked Diffusion Transformer is a Strong Image Synthesizer
MeMOTR Long-Term Memory-Augmented Transformer for Multi-Object Tracking
SIGMA Scale-Invariant Global Sparse Shape Matching
Strivec Sparse Tri-Vector Radiance Fields
Structural Alignment for Network Pruning through Partial Regularization
Towards Better Robustness against Common Corruptions for Unsupervised Domain Adaptation
Tuning Pre-trained Model via Moment Probing
Robust Monocular Depth Estimation under Challenging Conditions
Segmenting Known Objects and Unseen Unknowns without Prior Knowledg
Tree-Structured Shading Decomposition
Audiovisual Masked Autoencoders
Advancing Example Exploitation Can Alleviate Critical Challenges in Adversarial Training
CLR Channel-wise Lightweight Reprogramming for Continual Learning
Expressive Text-to-Image Generation with Rich Text
MetaBEV Solving Sensor Failures for 3D Detection and Map Segmentation
Preserve Your Own Correlation A Noise Prior for Video Diffusion
Ref-NeuS Ambiguity-Reduced Neural Implicit Surface Learning for Multi-View Reconstruction with
Weakly-Supervised Action Segmentation and Unseen Error Detection in Anomalous Instructional
zPROBE Zero Peek Robustness Checks for Federated Learning
Handwritten and Printed Text Segmentation A Signature Case Study
ETran Energy-Based Transferability Estimation
SHACIRA Scalable HAsh-grid Compression for Implicit Neural Representations
SiLK Simple Learned Keypoints
Humans in 4D Reconstructing and Tracking Humans with Transformers
Who Are You Referring To Coreference Resolution In Image Narrations
Box-based Refinement for Weakly Supervised and Unsupervised Localization Tasks
ARNOLD A Benchmark for Language-Grounded Task Learning with Continuous States
TM2D Bimodality Driven 3D Dance Generation via Music-Text Integration
ToonTalker Cross-Domain Face Reenactment
SYENet A Simple Yet Effective Network for Multiple Low-Level Vision
Semantify Simplifying the Control of 3D Morphable Models Using CLIP
CrossLoc3D Aerial-Ground Cross-Source 3D Place Recognition
PIDRo Parallel Isomeric Attention with Dynamic Routing for Text-Video Retrieval
Revisit PCA-based Technique for Out-of-Distribution Detection
Self-Supervised Character-to-Character Distillation for Text Recognition
DeLiRa Self-Supervised Depth Light and Radiance Fields
Towards Zero-Shot Scale-Aware Monocular Depth Estimation
Enhancing Sample Utilization through Sample Adaptive Augmentation in Semi-Supervised Learning
Audio-Visual Deception Detection DOLOS Dataset and Parameter-Efficient Crossmodal Learning
Automatic Network Pruning via Hilbert-Schmidt Independence Criterion Lasso under Information
Boundary-Aware Divide and Conquer A Diffusion-Based Solution for Unsupervised Shadow
Controllable Guide-Space for Generalizable Face Forgery Detection
DomainDrop Suppressing Domain-Sensitive Channels for Domain Generalization
EGC Image Generation and Classification via a Diffusion Energy-Based Model
Forward Flow for Novel View Synthesis of Dynamic Scenes
From Sky to the Ground A Large-scale Benchmark and Simpl
FSAR Federated Skeleton-based Action Recognition with Adaptive Topology Structure an
Membrane Potential Batch Normalization for Spiking Neural Networks
Physics-Augmented Autoencoder for 3D Skeleton-Based Gait Recognition
PolicyCleanse Backdoor Detection and Mitigation for Competitive Reinforcement Learning
RMP-Loss Regularizing Membrane Potential Distribution for Spiking Neural Networks
Robustifying Token Attention for Vision Transformers
Task-aware Adaptive Learning for Cross-domain Few-shot Learning
Template-guided Hierarchical Feature Restoration for Anomaly Detection
ViewRefer Grasp the Multi-view Knowledge for 3D Visual Grounding
Visual Traffic Knowledge Graph Generation from Scene Images
ASIC Aligning Sparse in-the-wild Image Collections
CLIPTrans Transferring Visual Knowledge with Pre-trained Models for Multimodal Machin
Eulerian Single-Photon Vision
Generalized Sum Pooling for Metric Learning
SPACE Speech-driven Portrait Animation with Controllable Expression
FACET Fairness in Computer Vision Evaluation Benchmark
Learned Compressive Representations for Single-Photon 3D Imaging
Class-relation Knowledge Distillation for Novel Class Discovery
Few-shot Continual Infomax Learning
Generalizable Neural Fields as Partially Observed Neural Processes
I Can't Believe There's No Images Learning Visual Tasks Using
Remembering Normality Memory-guided Knowledge Distillation for Unsupervised Anomaly Detection
Two Birds One Stone A Unified Framework for Joint Learning
Deep Geometry-Aware Camera Self-Calibration from Video
Pseudo Flow Consistency for Self-Supervised 6D Object Pose Estimation
Fast Globally Optimal Surface Normal Estimation from an Affine Correspondence
ClusT3 Information Invariant Test-Time Training
Efficient Diffusion Training via Min-SNR Weighting Strategy
AutoAD II The Sequel - Who When and What in
CHAMPAGNE Learning Real-world Conversation from Large-Scale Web Videos
CHORUS Learning Canonicalized 3D Human-Object Spatial Relations from Unboun
Controllable Person Image Synthesis with Pose-Constrained Latent Diffusion
Dynamic Perceiver for Efficient Visual Recognition
E2VPT An Effective and Efficient Approach for Visual Prompt Tuning
FLatten Transformer Vision Transformer using Focused Linear Attention
Global Knowledge Calibration for Fast Open-Vocabulary Segmentation
HTML Hybrid Temporal-scale Multimodal Learning Framework for Referring Video Object
Neglected Free Lunch - Learning Image Classifiers Using Annotation Byproducts
Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network
Self-Supervised Monocular Depth Estimation by Direction-aware Cumulative Convolution Network
STEERER Resolving Scale Variations for Counting and Localization via Selectiv
SVDiff Compact Parameter Space for Diffusion Fine-Tuning
Towards Attack-tolerant Federated Learning via Critical Parameter Analysis
Vision HGNN An Image is More than a Graph o
Class-Aware Patch Embedding Adaptation for Few-Shot Image Classification
Instruct-NeRF2NeRF Editing 3D Scenes with Instructions
BaRe-ESA A Riemannian Framework for Unregistered Human Body Shapes
FeatEnHancer Enhancing Hierarchical Features for Object Detection and Beyond Un
Will Large-scale Generative Models Corrupt Future Datasets
Point-TTA Test-Time Adaptation for Point Cloud Registration Using Multitask Meta-Auxiliary
EgoTV Egocentric Task Verification from Natural Language Task Descriptions
Video OWL-ViT Temporally-consistent Open-world Localization in Video
Chasing Clouds Differentiable Volumetric Rasterisation of Point Clouds as
A Fast Unified System for 3D Object Detection and Tracking
Understanding Hessian Alignment for Domain Generalization
Energy-based Self-Training and Normalization for Unsupervised Domain Adaptation
Delta Denoising Score
FunnyBirds A Synthetic Vision Dataset for a Part-Based Analysis o
Bidirectional Alignment for Domain Adaptive Detection with Transformers
BiViT Extremely Compressed Binary Vision Transformers
Candidate-aware Selective Disambiguation Based On Normalized Entropy for Instance-dependent Partial-label
Degradation-Resistant Unfolding Network for Heterogeneous Image Fusion
GlobalMapper Arbitrary-Shaped Urban Layout Generation
ICL-D3IE In-Context Learning with Diverse Demonstrations Updating for Document Information
OrthoPlanes A Novel Representation for Better 3D-Awareness of GANs
Pyramid Dual Domain Injection Network for Pan-sharpening
Sensitivity-Aware Visual Parameter-Efficient Fine-Tuning
Shift from Texture-bias to Shape-bias Edge Deformation-based Augmentation for Robust
Speech4Mesh Speech-Assisted Monocular 3D Facial Reconstruction for Speech-Driven 3D Facial
Thinking Image Color Aesthetics Assessment Models Datasets and Benchmarks
TopoSeg Topology-Aware Nuclear Instance Segmentation
Towards Deeply Unified Depth-aware Panoptic Segmentation with Bi-directional Guidance Learning
Unsupervised Prompt Tuning for Text-Driven Object Detection
REAP A Large-Scale Realistic Adversarial Patch Benchmark
Normalizing Flows for Human Pose Anomaly Detection
Text2Room Extracting Textured 3D Meshes from 2D Text-to-Image Models
DiffPose Multi-hypothesis Human Pose Estimation using Diffusion Models
AesPA-Net Aesthetic Pattern-Aware Style Transfer Networks
Attention Discriminant Sampling for Point Clouds
Hyperbolic Audio-visual Zero-shot Learning
Implicit Identity Representation Conditioned Memory Compensation Network for Talking H
Improving Sample Quality of Diffusion Models Using Self-Attention Guidance
Learning Navigational Visual Representations with Semantic Map Supervision
LVOS A Benchmark for Long-term Video Object Segmentation
On the Robustness of Normalizing Flows for Inverse Problems in
Out-of-Distribution Detection for Monocular Depth Estimation
Subclass-balancing Contrastive Learning for Long-tailed Recognition
When to Learn What Model-Adaptive Data Augmentation Curriculum
Class-incremental Continual Learning for Instance Segmentation with Image-level Weak Supervision
360VOT A New Benchmark Dataset for Omnidirectional Visual Object Tracking
Adaptive Frequency Filters As Efficient Global Token Mixers
Adaptive Nonlinear Latent Transformation for Conditional Face Editing
Affine-Consistent Transformer for Multi-Class Cell Nuclei Detection
A Sentence Speaks a Thousand Images Domain Generalization through Distilling
CLIP2Point Transfer CLIP to Point Cloud Classification with Image-Depth Pre-Training
ConSlide Asynchronous Hierarchical Interaction Transformer with Breakup-Reorganize Rehearsal for Continual
Counting Crowds in Bad Weather
Delving into Motion-Aware Matching for Monocular 3D Object Tracking
DiffDis Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability
ESTextSpotter Towards Better Scene Text Spotting with Explicit Synergy in
Evaluation and Improvement of Interpretability for Self-Explainable Part-Prototype Networks
FULLER Unified Multi-modality Multi-task 3D Perception via Multi-level Gradient Calibration
GameFormer Game-theoretic Modeling and Learning of Transformer-based Interactive Prediction an
iDAG Invariant DAG Searching for Domain Generalization
Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting
Interactive Class-Agnostic Object Counting
InterFormer Real-time Interactive Image Segmentation
Learning Shape Primitives via Implicit Convexity Regularization
MGMAE Motion Guided Masking for Video Masked Autoencoding
Multi-Metrics Adaptively Identifies Backdoors in Federated Learning
Neural LiDAR Fields for Novel View Synthesis
One-shot Implicit Animatable Avatars with Model-based Priors
PADDLES Phase-Amplitude Spectrum Disentangled Early Stopping for Learning with Noisy
PHRIT Parametric Hand Representation with Implicit Template
Pixel-Wise Contrastive Distillation
Ponder Point Cloud Pre-training via Neural Rendering
Prototypical Kernel Learning and Open-set Foreground Perception for Generalized Few-shot
Reconstructing Groups of People with Hypergraph Relational Reasoning
SAFARI Versatile and Efficient Evaluations for Robustness of Interpretability
Simoun Synergizing Interactive Motion-appearance Understanding for Vision-based Reinforcement Learning
Skill Transformer A Monolithic Policy for Mobile Manipulation
Understanding Self-attention Mechanism via Dynamical System Perspective
Video Task Decathlon Unifying Image and Video Tasks in Autonomous
Weakly Supervised Learning of Semantic Correspondence through Cascaded Online Correspondenc
What can Discriminator do Towards Box-free Ownership Verification of Generativ
Efficient LiDAR Point Cloud Oversegmentation Network
Focus on Your Target A Dual Teacher-Student Framework for Domain-Adaptiv
Beyond One-to-One Rethinking the Referring Image Segmentation
DandelionNet Domain Composition with Instance Adaptive Classification for Domain Generalization
DRAW Defending Camera-shooted RAW Against Image Manipulation
Explore and Tell Embodied Visual Captioning in 3D Environments
Federated Learning Over Images Vertical Decompositions and Pre-Trained Backbones A
Multiscale Representation for Real-Time Anti-Aliasing Neural Rendering
Open-domain Visual Entity Recognition Towards Recognizing Millions of Wikipedia Entities
Phasic Content Fusing Diffusion Model with Directional Distribution Consistency fo
PlankAssembly Robust 3D Reconstruction from Three Orthographic Views with Learnt
PromptCap Prompt-Guided Image Captioning for VQA with GPT-
Pseudo-label Alignment for Semi-supervised Instance Segmentation
SHERF Generalizable Human NeRF from a Single Image
Single Image Reflection Separation via Component Synergy
TIFA Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering
Tri-MipRF Tri-Mip Representation for Efficient Anti-Aliasing Neural Radiance Fields
Unsupervised Feature Representation Learning for Domain-generalized Cross-domain Image Retrieval
VL-PET Vision-and-Language Parameter-Efficient Tuning via Granularity Control
FaceCLIPNeRF Text-driven 3D Face Manipulation using Deformable Neural Radiance Fields
UpCycling Semi-supervised 3D Object Detection without Sharing Raw-level Unlabeled Scenes
Scratching Visual Transformers Back with Uniform Attention
Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment
RANA Relightable Articulated Neural Avatars
NeO 360 Neural Fields for Sparse View Synthesis of Outdoo
RED-PSM Regularization by Denoising of Partially Separable Models for Dynamic
Hidden Biases of End-to-End Driving Models
Efficiently Robustify Pre-Trained Models
UMFuse Unified Multi View Fusion for Human Editing Applications
Physics-Driven Turbulence Image Restoration with Stochastic Refinement
Dynamic Mesh Recovery from Partial Point Cloud Sequence
Knowing Where to Focus Event-aware Transformer for Video Grounding
Self-supervised Image Denoising with Downsampled Invariance Loss and Conditional Blind-Spot
BlindHarmony Blind Harmonization for MR Images via Flow Model
The Power of Sound TPoS Audio Reactive Video Generation with
A Unified Framework for Robustness on Diverse Sampling Errors
Beyond Single Path Integrated Gradients for Reliable Input Attribution vi
Improving Diversity in Zero-Shot GAN Adaptation with Semantic Variations
Anatomical Invariance Modeling and Semantic Alignment for Self-supervised Learning in
AvatarCraft Transforming Text into Neural Human Avatars with Parameterized Sh
BUS Efficient and Effective Vision-Language Pre-Training with Bottom-Up Patch Summarization
Center-Based Decoupled Point-cloud Registration for 6D Object Pose Estimation
Coordinate Quantized Neural Implicit Representations for Multi-view Reconstruction
Diffuse3D Wide-Angle 3D Photography via Bilateral Diffusion
Domain Generalization via Balancing Training Difficulty and Model Capability
Efficient Decision-based Black-box Patch Attacks on Video Recognition
EMR-MSF Self-Supervised Recurrent Monocular Scene Flow Exploiting Ego-Motion Rigidity
Full-Body Articulated Human-Object Interaction
MEFLUT Unsupervised 1D Lookup Tables for Multi-exposure Image Fusion
Optimizing the Placement of Roadside LiDARs for Autonomous Driving
Personalized Image Generation for Color Vision Deficiency Population
Probabilistic Triangulation for Uncalibrated Multi-View 3D Human Pose Estimation
Revisiting Scene Text Recognition A Data Perspective
Scenimefy Learning to Craft Anime Scene via Semi-Supervised Image-to-Image Translation
Structure-Aware Surface Reconstruction via Primitive Assembly
Supervised Homography Learning with Realistic Dataset Generation
Text2Performer Text-Driven Human Video Generation
VAD Vectorized Scene Representation for Efficient Autonomous Driving
Video Action Segmentation via Contextually Refined Temporal Keypoints
AffordPose A Large-Scale Dataset of Hand-Object Interactions with Affordance-Driven Han
Unsupervised Domain Adaptation for Training Event-Based Networks Using Contrastive Learning
CoSign Exploring Co-occurrence Signals in Skeleton-based Continuous Sign Language Recognition
Semi-supervised Semantics-guided Adversarial Training for Robust Trajectory Prediction
DriveAdapter Breaking the Coupling Barrier of Perception and Planning in
Revisiting the Parameter Efficiency of Adapters from the Perspective o
Order-preserving Consistency Regularization for Domain Adaptation and Generalization
Uncertainty Guided Adaptive Warping for Robust and Efficient Stereo Matching
DiffusionRet Generative Text-Video Retrieval with Diffusion Model
Explaining Adversarial Robustness of Neural Networks from Clustering Effect Perspective
Growing a Brain with Sparsity-Inducing Generation for Continual Learning
Lighting Every Darkness in Two Pairs A Calibration-Free Pipeline fo
Recursive Video Lane Detection
Anchor Structure Regularization Induced Multi-view Subspace Clustering via Enhanced Tenso
Benchmarking and Analyzing Robust Point Cloud Recognition Bag of Tricks
Continual Segment Towards a Single Unified and Non-forgetting Continual Segmentation
DDP Diffusion Model for Dense Visual Prediction
Rethinking Video Frame Interpolation from Shutter Mode Induced Degradation
Single Image Deblurring with Row-dependent Blur Magnitude
Uncertainty-guided Learning for Improving Image Manipulation Detection
3D-Aware Generative Model for Improved Side-View Image Synthesis
MARS Model-agnostic Biased Object Removal without Additional Supervision for Weakly-Supervis
Panoramas from Photons
CAFA Class-Aware Feature Alignment for Test-Time Adaptation
Generating Instance-level Prompts for Rehearsal-free Continual Learning
DG-Recon Depth-Guided Neural 3D Scene Reconstruction
HumanSD A Native Skeleton-Guided Diffusion Model for Human Image Generation
MIMO-NeRF Fast Neural Rendering with Multi-input Multi-output Neural Radiance Fields
Alleviating Catastrophic Forgetting of Incremental Object Detection via Within-Class an
A Soft Nearest-Neighbor Framework for Continual Semi-Supervised Learning
DDColor Towards Photo-Realistic Image Colorization via Dual Decoders
Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking
Noise-Aware Learning from Web-Crawled Image-Text Data for Image Captioning
Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models
Essential Matrix Estimation using Convex Relaxations in Orthogonal Space
HoloFusion Towards Photo-realistic 3D Generative Modeling
DreamPose Fashion Video Synthesis with Stable Diffusion
Guided Motion Diffusion for Controllable Human Motion Synthesis
EMDB The Electromagnetic Database of Global 3D Human Pose an
LERF Language Embedded Radiance Fields
Text2Video-Zero Text-to-Image Diffusion Models are Zero-Shot Video Generators
FishNet A Large-scale Dataset and Benchmark for Fish Recognition Detection
Introducing Language Guidance in Prompt-based Continual Learning
Tiled Multiplane Images for Practical 3D Photography
Self-regulating Prompts Foundational Model Adaptation without Forgetting
Ego-Humans An Ego-Centric 3D Multi-Human Benchmark
Sentence Attention Blocks for Answer Grounding
Unsupervised Facial Performance Editing via Vector-Quantized StyleGAN Representations
PreSTU Pre-Training for Scene-Text Understanding
3D-aware Blending with Generative NeRFs
Adaptive Superpixel for Active Learning in Semantic Segmentation
Breaking Temporal Consistency Generating Video Universal Adversarial Perturbations Using Imag
Calibrating Panoramic Depth Estimation for Practical Localization and Mapping
Chupa Carving 3D Clothed Humans from Skinned Shape Priors using
Context-Aware Planning and Environment-Aware Memory for Instruction Following Embodied Agents
Contrastive Feature Masking Open-Vocabulary Vision Transform
CRN Camera Radar Net for Accurate Robust Efficient 3D Perception
Cross-Modal Learning with 3D Deformable Attention for Action Recognition
Dense Text-to-Image Generation with Attention Modulation
EP2P-Loc End-to-End 3D Point to 2D Pixel Localization for Large-Scal
Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning
Joint Demosaicing and Deghosting of Time-Varying Exposures for Single-Shot HDR
LDL Line Distance Functions for Panoramic Localization
Learning Point Cloud Completion without Complete Point Clouds A Pose-Aw
Lip Reading for Low-resource Languages by Learning and Combining General
Misalign Contrast then Distill Rethinking Misalignments in Language-Image Pre-training
NCHO Unsupervised Learning for Neural 3D Composition of Humans an
PODIA-3D Domain Adaptation of 3D Generative Model Across Large Domain
Predict to Detect Prediction-guided 3D Object Detection using Sequential Images
ProtoFL Unsupervised Federated Learning via Prototypical Distillation
Proxy Anchor-based Unsupervised Learning for Continuous Generalized Category Discovery
SCOB Universal Text Understanding via Character-wise Supervised Contrastive Learning with
Self-Feedback DETR for Temporal Action Detection
Semantic-Aware Implicit Template Learning via Part Deformation Consistency
Shatter and Gather Learning Referring Image Segmentation with Text Supervision
Texture Learning Domain Randomization for Domain Generalized Segmentation
Convolutional Networks with Oriented 1D Kernels
Segment Anything
StyleLipSync Style-based Personalized Lip-sync Video Generation
DISeR Designing Imaging Systems with Reinforcement Learning
Towards Viewpoint Robustness in Bird's Eye View Segmentation
LoCUS Learning Multiscale 3D-consistent Features from Posed Images
Computational 3D Imaging with Position Sensors
Disposable Transfer Learning for Selective Source Task Unlearning
Priority-Centric Human Motion Generation in Discrete Latent Space
Rethinking Range View Representation for LiDAR Segmentation
Robo3D Towards Robust and Reliable 3D Perception against Corruptions
Enhancing Modality-Agnostic Representations via Meta-Learning for Brain Tumor Segmentation
PG-RCNN Semantic Surface Point Generation for 3D Object Detection
SALAD Part-Level Latent Diffusion for 3D Shape Generation and Manipulation
Guiding Image Captioning Models Toward More Specific Captions
ENTL Embodied Navigation Trajectory Learn
Continuously Masked Transformer for Image Inpainting
Open-vocabulary Video Question Answering A New Benchmark for Evaluating th
Practical Membership Inference Attacks Against Large-Scale Multi-Modal Models A Pilot
Navigating to Objects Specified by Images
Tetra-NeRF Representing Neural Radiance Fields Using Tetrah
Ablating Concepts in Text-to-Image Diffusion Models
Generative Multiplane Neural Radiance for 3D-Aware Image Generation
RefEgo Referring Expression Comprehension Dataset from First-Person Perception of Ego4D
TiDAL Learning Training Dynamics for Active Learning
COOL-CHIC Coordinate-based Low Complexity Hierarchical Image Codec
Hybrid Spectral Denoising Transformer with Guided Attention
Mask-Attention-Free Transformer for 3D Instance Segmentation
PADCLIP Pseudo-labeling with Adaptive Debiasing in CLIP for Unsupervised Domain
XVO Generalized Visual Odometry via Cross-Modal Self-Training
The Making and Breaking of Camouflage
Efficient Converted Spiking Neural Network for 3D and 2D Classification
Masked Autoencoders Are Stronger Knowledge Distillers
UniKD Universal Knowledge Distillation for Mimicking Homogeneous or Heterogeneous Object
SeeABLE Soft Discrepancies and Bounded Contrastive Learning for Exposing Deepfakes
Adaptive Similarity Bootstrapping for Self-Distillation Based Representation Learning
Bayesian Optimization Meets Self-Distillation
Camera-Driven Representation Learning for Unsupervised Domain Adaptive Person Re-identification
DetermiNet A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using
Efficient Unified Demosaicing for Bayer and Non-Bayer Patterned Image Sensors
ExBluRF Efficient Radiance Fields for Extreme Motion Blurred Images
Few-Shot Common Action Localization via Cross-Attentional Fusion of Context an
Generating Realistic Images from In-the-wild Sounds
Hierarchically Decomposed Graph Convolutional Networks for Skeleton-Based Action Recognition
Human Part-wise 3D Motion Context Learning for Sign Language Recognition
ICE-NeRF Interactive Color Editing of NeRFs via Decomposition-Aware Weight Optimization
Improving 3D Imaging with Pre-Trained Perpendicular 2D Diffusion Models
INSTA-BNN Binary Neural Network with INSTAnce-aware Threshold
Latent-OFER Detect Mask and Reconstruct with Latent Vectors for Occlu
Lecture Presentations Multimodal Dataset Towards Understanding Multimodality in Educational Videos
Leveraging Spatio-Temporal Dependency for Skeleton-Based Action Recognition
Locomotion-Action-Manipulation Synthesizing Human-Scene Interactions in Complex 3D Environments
Mitigating Adversarial Vulnerability through Causal Parameter Estimation by Adversarial Doubl
Neural Collage Transfer Artistic Reconstruction via Material Manipulation
Online Continual Learning on Hierarchical Label Expansion
Read-only Prompt Optimization for Vision-Language Few-shot Learning
Robust Evaluation of Diffusion-Based Adversarial Purification
Semantic-Aware Dynamic Parameter for Video Inpainting Transform
SlaBins Fisheye Depth Estimation using Slanted Bins on Road Environments
Text-Conditioned Sampling Framework for Text-to-Image Generation with Masked Generative Models
Towards Open-Set Test-Time Adaptation Utilizing the Wisdom of Crowds in
Unsupervised Accuracy Estimation of Deep Visual Models using Domain-Adaptive Adversarial
Weakly Supervised Referring Image Segmentation with Intra-Chunk and Inter-Chunk Consistency
Decomposition-Based Variational Network for Multi-Contrast MRI Super-Resolution and Reconstruction
Efficient Adaptive Human-Object Interaction Detection with Concept-guided Memory
Dynamic Hyperbolic Attention Network for Fine Hand-object Reconstruction
DLT Conditioned layout generation with Joint Discrete-Continuous Diffusion Layout Transform
EPiC Ensemble of Partial Point Clouds for Robust Classification
WALDO Future Video Synthesis Using Object Layer Decomposition and Parametric
Quality-Agnostic Deepfake Detection with Intra-model Collaborative Learning
Benchmarking Algorithmic Bias in Face Recognition An Experimental Approach Using
Coherent Event Guided Low-Light Video Enhancement
ENVIDR Implicit Differentiable Renderer with Neural Environment Lighting
Inducing Neural Collapse to a Fixed Hierarchy-Aware Frame for Reducing
Iterative Prompt Learning for Unsupervised Backlit Image Enhancement
Learning Non-Local Spatial-Angular Correlation for Light Field Image Super-Resolution
Logic-induced Diagnostic Reasoning for Semi-supervised Semantic Segmentation
MAAL Multimodality-Aware Autoencoder-Based Affordance Learning for 3D Articulated Objects
MPI-Flow Learning Realistic Optical Flow with Multiplane Images
Semantic Attention Flow Fields for Monocular Dynamic Scene Decomposition
Simple Baselines for Interactive Video Retrieval with Questions and Answers
CheckerPose Progressive Dense Keypoint Localization for Object Pose Estimation with
WaterMask Instance Segmentation for Underwater Imagery
DocTr Document Transformer for Structured Information Extraction in Documents
RecRecNet Rectangling Rectified Wide-Angle Images by Thin-Plate Spline Model an
Segmentation of Tubular Structures Using Iterative Training with Tailored Samples
LightGlue Local Feature Matching at Light S
Algebraically Rigorous Quaternion Framework for the Neural Network Pose Estimation
A Parse-Then-Place Approach for Generating Graphic Layouts from Textual Descriptions
DETR Does Not Need Multi-Scale or Locality Design
Exploring Group Video Captioning with Efficient Relational Approximation
Graph Matching with Bi-level Noisy Correspondence
Hyperbolic Chamfer Distance for Point Cloud Completion
InfiniCity Infinite-Scale City Synthesis
Learning Vision-and-Language Navigation from YouTube Videos
Leveraging Intrinsic Properties for Non-Rigid Garment Alignment
MAtch eXpand and Improve Unsupervised Finetuning for Zero-Shot Action Recognition
MHCN A Hyperbolic Neural Network Model for Multi-view Hierarchical Clustering
MMST-ViT Climate Change-aware Crop Yield Prediction via Multi-Modal Spatial-Temporal Vision
OmnimatteRF Robust Omnimatte with 3D Background Modeling
PourIt Weakly-Supervised Liquid Perception from a Single Image for Visual
Preparing the Future for Continual Semantic Segmentation
RealGraph A Multiview Dataset for 4D Real-world Context Graph Generation
Scale-Aware Modulation Meet Transform
Self-supervised Pre-training for Mirror Detection
SMAUG Sparse Masked Autoencoder for Efficient Video-Language Pre-Training
UniVTG Towards Unified Video-Language Temporal Grounding
Unsupervised Image Denoising in Real-World Scenarios via Self-Collaboration Parallel Generativ
VI-Net Boosting Category-level 6D Object Pose Estimation via Learning Decoupl
AerialVLN Vision-and-Language Navigation for UAVs
Augmented Box Replay Overcoming Foreground Shift for Incremental Object Detection
Beating Backdoor Attack at Its Own Game
Beyond Image Borders Learning Feature Extrapolation for Unbounded Image Composition
Bird's-Eye-View Scene Graph for Vision-Language Navigation
Boosting Semantic Segmentation from the Perspective of Explicit Class Embeddings
CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection
Collaborative Tracking Learning for Frame-Rate-Insensitive Multi-Object Tracking
Confidence-aware Pseudo-label Learning for Weakly Supervised Visual Grounding
ContactGen Generative Contact Modeling for Grasp Generation
CPCM Contextual Point Cloud Modeling for Weakly-supervised Point Cloud Semantic
DeFormer Integrating Transformers with Deformable Models for 3D Shape Abstraction
Density-invariant Features for Distant Point Cloud Registration
Detection Transformer with Stable Matching
Diffusion Action Segmentation
DOLCE A Model-Based Probabilistic Diffusion Framework for Limited-Angle CT Reconstruction
DREAM Efficient Dataset Distillation by Representative Matching
Enhancing Generalization of Universal Adversarial Perturbation through Gradient Aggregation
Few-Shot Dataset Distillation via Translative Pre-Training
Few-Shot Physically-Aware Articulated Mesh Generation via Hierarchical Deformation
FSI Frequency and Spatial Interactive Learning for Image Restoration in
Geometrized Transformer for Self-Supervised Homography Estimation
GeoMIM Towards Better 3D Knowledge Transfer via Masked Image Modeling
Group Pose A Simple Baseline for End-to-End Multi-Person Pose Estimation
HOSNeRF Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video
Improving Generalization in Visual Reinforcement Learning via Conflict-aware Gradient Agreement
Improving Pixel-based MIM by Reducing Wasted Modeling Capability
Instance Neural Radiance Field
Integrally Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection
IST-Net Prior-Free Category-Level Pose Estimation with Implicit Space Transformation
Landscape Learning for Neural Network Inversion
LeaF Learning Frames for 4D Point Cloud Sequence Understanding
Learning Clothing and Pose Invariant 3D Shape Representation for Long-Term
Learning Cross-Representation Affinity Consistency for Sparsely Supervised Biomedical Instance Segmentation
Learning Image-Adaptive Codebooks for Class-Agnostic Image Restoration
Learning to Identify Critical States for Reinforcement Learning from Videos
Learning to Upsample by Learning to Sample
Linear-Covariance Loss for End-to-End Learning of 6D Pose Estimation
LoTE-Animal A Long Time-span Dataset for Endangered Animal Behavior Understanding
Low-Light Image Enhancement with Multi-Stage Residue Quantization and Brightness-Aware Attention
MODA Mapping-Once Audio-driven Portrait Animation with Dual Attentions
Model Calibration in Dense Classification with Adaptive Label Perturbation
Monocular 3D Object Detection with Bounding Box Denoising in 3D
Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation
Multi-interactive Feature Learning and a Full-time Multi-modality Benchmark for Imag
Multi-Modal Neural Radiance Field for Monocular Dense SLAM with
MUter Machine Unlearning on Adversarially Trained Models
MV-DeepSDF Implicit Modeling with Multi-Sweep Point Clouds for 3D Vehicl
Objects Do Not Disappear Video Object Detection by Single-Frame Object
Parallel Attention Interaction Network for Few-Shot Skeleton-Based Action Recognition
PARIS Part-level Reconstruction and Motion Analysis for Articulated Objects
Partition Speeds Up Learning Implicit Neural Representations Based on Exponential-Increas
Periodically Exchange Teacher-Student for Source-Free Object Detection
PETRv2 A Unified Framework for 3D Perception from Multi-Camera Images
PlanarTrack A Large-scale Challenging Benchmark for Planar Object Tracking
Point-Query Quadtree for Crowd Counting Localization and Mo
Real-Time Neural Rasterization for Large Scenes
Reconstructed Convolution Module Based Look-Up Tables for Efficient Image Super-Resolution
Referring Image Segmentation Using Text Supervision
RegFormer An Efficient Projection-Aware Transformer Network for Large-Scale Point Clou
Residual Pattern Learning for Pixel-Wise Out-of-Distribution Detection in Semantic Segmentation
Revisiting Foreground and Background Separation in Weakly-supervised Temporal Action Localization
Seeing Beyond the Patch Scale-Adaptive Semantic Segmentation of High-resolution Remot
SimpleClick Interactive Image Segmentation with Simple Vision Transformers
SKiT a Fast Key Information Video Transformer for Online Surgical
SparseBEV High-Performance Sparse 3D Object Detection from Multi-Camera Videos
Tangent Model Composition for Ensembling and Continual Fine-tuning
Text-Driven Generative Domain Adaptation with Spectral Consistency Regularization
The Devil is in the Upsampling Architectural Decisions Made Simpl
TMA Temporal Motion Aggregation for Event-based Optical Flow
Towards Unsupervised Domain Generalization for Face Anti-Spoofing
TRM-UAP Enhancing the Transferability of Data-Free Universal Adversarial Perturbation vi
Uncertainty-aware Unsupervised Multi-Object Tracking
UniSeg A Unified Multi-Modal LiDAR Segmentation Network and the OpenPCSeg
Unsupervised Compositional Concepts Discovery with Text-to-Image Generative Models
When Epipolar Constraint Meets Non-Local Operators in Multi-View Stereo
Zero-1-to-3 Zero-shot One Image to 3D Object
2D3D-MATR 2D-3D Matching Transformer for Detection-Free Registration Between Images an
Adaptive and Background-Aware Vision Transformer for Real-Time UAV Tracking
AlignDet Aligning Pre-training and Fine-tuning in Object Detection
Among Us Adversarially Robust Collaborative Perception by Consensus
An Embarrassingly Simple Backdoor Attack on Self-supervised Learning
AutoDiffusion Training-Free Optimization of Time Steps and Architectures for Automat
Automated Knowledge Distillation via Monte Carlo Tree Search
BEV-DG Cross-Modal Learning under Bird's-Eye View for Domain Generalization o
Beyond Object Recognition A New Benchmark towards Object Concept Learning
Boosting Multi-modal Model Performance with Adaptive Gradient Modulation
Calibrating Uncertainty for Semi-Supervised Crowd Counting
CFCG Semi-Supervised Semantic Segmentation via Cross-Fusion and Contour Guidance Supervision
CHORD Category-level Hand-held Object Reconstruction via Shape Deformation
CiteTracker Correlating Image and Text for Visual Tracking
ClimateNeRF Extreme Weather Synthesis in Neural Radiance Fiel
Collecting The Puzzle Pieces Disentangled Self-Driven Human Pose Transfer by
Compositional Feature Augmentation for Unbiased Scene Graph Generation
Contactless Pulse Estimation Leveraging Pseudo Labels and Self-Supervision
Coordinate Transformer Achieving Single-stage Multi-person Mesh Recovery from Videos
CORE Co-planarity Regularized Monocular Geometry Estimation with Weak Supervision
Cross Contrasting Feature Perturbation for Domain Generalization
D3G Exploring Gaussian Prior for Temporal Sentence Grounding with Glanc
DDIT Semantic Scene Completion via Deformable Deep Implicit Templates
DenseShift Towards Accurate and Efficient Low-Bit Power-of-Two Quantization
DFA3D 3D Deformable Attention For 2D-to-3D Feature Lifting
Differentiable Transportation Pruning
Discovering Spatio-Temporal Rationales for Video Question Answering
Distilled Reverse Attention Network for Open-world Compositional Zero-Shot Learning
Distilling DETR with Visual-Linguistic Knowledge for Open-Vocabulary Object Detection
Distilling Large Vision-Language Model with Out-of-Distribution Generalizability
Diverse Cotraining Makes Strong Semi-Supervised Segmento
DLGSANet Lightweight Dynamic Local and Global Self-Attention Networks for Imag
Do DALL-E and Flamingo Understand Each Oth
DPM-OT A New Diffusion Probabilistic Model Based on Optimal Transport
DreamTeacher Pretraining Image Backbones with Deep Generative Models
E3Sym Leveraging E3 Invariance for Unsupervised 3D Planar Reflective Symmetry
Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis
End-to-end 3D Tracking with Decoupled Queries
Exploring Model Transferability through the Lens of Potential Energy
Exploring the Benefits of Visual Prompting in Differential Privacy
Extensible and Efficient Proxy for Neural Architecture Search
Fast Neural Scene Flow
FB-BEV BEV Representation from Forward-Backward View Transformations
Feature Modulation Transformer Cross-Refinement of Global Representation via High-Frequency Prio
FineDance A Fine-grained Choreography Dataset for 3D Full Body Danc
Foreground and Text-lines Aware Document Image Rectification
G2L Semantically Aligned and Uniform Video Grounding via Geodesic an
GPA-3D Geometry-aware Prototype Alignment for Unsupervised Domain Adaptive 3D Object
Gradient-based Sampling for Class Imbalanced Semi-supervised Object Detection
Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models
Heterogeneous Diversity Driven Active Learning for Multi-Object Tracking
Hierarchical Visual Categories Modeling A Joint Representation Learning and Density
High-Resolution Document Shadow Removal via A Large-Scale Real-World Dataset an
I-ViT Integer-only Quantization for Efficient Vision Transformer Inferenc
IntentQA Context-aware Video Intent Reasoning
Inverse Compositional Learning for Weakly-supervised Relation Grounding
IOMatch Simplifying Open-Set Semi-Supervised Learning with Joint Inliers and Outliers
JOTR 3D Joint Contrastive Learning with Transformers for Occluded Human
Knowledge-Spreader Learning Semi-Supervised Facial Action Dynamics by Consistifying Knowledge Granularity
Knowledge Proxy Intervention for Deconfounded Video Question Answering
Large Selective Kernel Network for Remote Sensing Object Detection
Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limit
Learning Fine-Grained Features for Pixel-Wise Video Correspondences
Learning Robust Representations with Information Bottleneck and Memory Network fo
Learning to Distill Global Representation for Sparse-View CT
Leveraging Inpainting for Single-Image Shadow Removal
LogicSeg Parsing Visual Semantics with Neural Logic Learning and Reasoning
MatrixCity A Large-scale City Dataset for City-scale Neural Rendering an
MemorySeg Online LiDAR Semantic Segmentation with a Latent Memory
Mitigating and Evaluating Static Bias of Action Representations in th
Monte Carlo Linear Clustering with Single-Point Supervision is Enough fo
Multi-Frequency Representation Enhancement with Privilege Information for Video Super-Resolution
Multi-granularity Interaction Simulation for Unsupervised Interactive Segmentation
MUVA A New Large-Scale Benchmark for Multi-View Amodal Instance Segmentation
NeRF-MS Neural Radiance Fields with Multi-Sequenc
NerfAcc Efficient Sampling Accelerates NeRFs
NeTO Neural Reconstruction of Transparent Objects with Self-Occlusion Aware Refraction-Tracing
Neural Characteristic Function Learning for Conditional Image Generation
Novel Scenes Classes Towards Adaptive Open-set Object Detection
No Fear of Classifier Biases Neural Collapse Inspired Federated Learning
On the Robustness of Open-World Test-Time Training Self-Training with Dynamic
Open-vocabulary Object Segmentation with Diffusion Models
OxfordTVG-HIC Can Machine Make Humorous Captions from Images
Partition-And-Debias Agnostic Biases Mitigation via a Mixture of Biases-Specific Experts
PatchCT Aligning Patch Set and Label Set with Conditional Transport
Pixel Adaptive Deep Unfolding Transformer for Hyperspectral Image Reconstruction
Pluralistic Aging Diffusion Autoenco
Point2Mask Point-supervised Panoptic Segmentation via Optimal Transport
Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval
PVT A Simple End-to-End Latency-Aware Visual Tracking Framework
Q-Diffusion Quantizing Diffusion Models
ReactioNet Learning High-Order Facial Behavior from Universal Stimulus-Reaction by Dyadic
RenderIH A Large-Scale Synthetic Dataset for 3D Interacting Hand Pos
RepQ-ViT Scale Reparameterization for Post-Training Quantization of Vision Transformers
Representation Disparity-aware Distillation for 3D Object Detection
Rethinking Multi-Contrast MRI Super-Resolution Rectangle-Window Cross-Attention Transformer and Arbitrary-Scale Upsampling
Rethinking Vision Transformers for MobileNet Size and S
RFD-ECNet Extreme Underwater Image Compression with Reference to Feature Dictionary
RICO Regularizing the Unobservable for Indoor Compositional Reconstruction
Robust Referring Video Object Segmentation with Cyclic Structural Consensus
Semi-Supervised Semantic Segmentation under Label Noise via Diverse Learning Groups
Sequential Texts Driven Cohesive Motions Synthesis with Natural Transitions
Skip-Plan Procedure Planning in Instructional Videos via Condensed Action Spac
StegaNeRF Embedding Invisible Information within Neural Radiance Fields
STPrivacy Spatio-Temporal Privacy-Preserving Action Recognition
TCOVIS Temporally Consistent Online Video Instance Segmentation
The Euclidean Space is Evil Hyperbolic Attribute Editing for Few-shot
Tube-Link A Flexible Cross Tube Framework for Universal Video Segmentation
UHDNeRF Ultra-High-Definition Neural Radiance Fields
UniFormerV2 Unlocking the Potential of Image ViTs for Video Understanding
Unify Align and Refine Multi-Level Semantic Alignment for Radiology Report
Unleashing the Potential of Spiking Neural Networks with Dynamic Confidenc
Unmasked Teacher Towards Training-Efficient Video Foundation Models
Variational Degeneration to Structural Refinement A Unified Framework for Superimpos
Virtual Try-On with Pose-Garment Keypoints Guided Inpainting
Your Diffusion Model is Secretly a Zero-Shot Classifi
Cross-modal Scalable Hierarchical Clustering in Hyperbolic spac
Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models
ATT3D Amortized Text-to-3D Object Synthesis
ELFNet Evidential Local-global Fusion for Stereo Matching
Robust e-NeRF NeRF from Sparse Noisy Events under Non-Uniform
3D VR Sketch Guided 3D Shape Prototyping and Exploration
BEVPlace Learning LiDAR-based Place Recognition using Birds Eye View Images
CopyRNeRF Protecting the CopyRight of Neural Radiance Fields
GAFlow Incorporating Gaussian Attention into Optical Flow
Harvard Glaucoma Detection and Progression A Multimodal Multitask Dataset an
KECOR Kernel Coding Rate Maximization for Active 3D Object Detection
LATR 3D Lane Detection from Monocular Images with Transform
Learning Optical Flow from Event Camera with Rendered Dataset
Learning Versatile 3D Shape Generation with Improved Auto-regressive Models
LexLIP Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Sparse Retrieval
On the Effectiveness of Spectral Discriminators for Perceptual Quality Improvement
Perpetual Humanoid Control for Real-time Simulated Avatars
PGFed Personalize Each Clients Global Objective for Federated Learning
Similarity Min-Max Zero-Shot Day-Night Domain Adaptation
A Large-Scale Outdoor Multi-Modal Dataset and Benchmark for Novel View
Hard No-Box Adversarial Attack on Skeleton-Based Human Action Recognition with
Holistic Geometric Feature Learning for Structured Reconstruction
Label-Noise Learning with Intrinsically Long-Tailed Dat
Query Refinement Transformer for 3D Instance Segmentation
Removing Anomalies as Noises for Industrial Defect Localization
Scene-Aware Feature Matching
See More and Know More Zero-shot Point Cloud Segmentation vi
Set-level Guidance Attack Boosting Adversarial Transferability of Vision-Language Pre-training Models
TF-ICON Diffusion-Based Training-Free Cross-Domain Image Composition
Translating Images to Road Network A Non-Autoregressive Sequence-to-Sequence Approach
Urban Radiance Field Representation with Deformable Neural Mesh Primitives
Anchor-Intermediate Detector Decoupling and Coupling Bounding Boxes for Accurate Object
Aperture Diffraction for Compact Snapshot Spectral Imaging
Learning a Room with the Occ-SDF Hybrid Signed Distance Function
Measuring Asymmetric Gradient Discrepancy in Parallel Continual Learning
Fast Inference and Update of Probabilistic Density Estimation on Trajectory
How to Choose your Best Allies for a Transferable Attack
EgoLoc Revisiting 3D Object Localization from Egocentric Videos with Visual
Neural Microfacet Fields for Inverse Rendering
NAPA-VQ Neighborhood-Aware Prototype Augmentation with Vector Quantization for Continual Learning
TrackFlow Multi-Object tracking with Normalizing Flows
CAD-Estate Large-scale CAD Model Annotation in RGB Videos
Towards Zero Domain Gap A Comprehensive Study of Realistic LiDAR
SurfsUP Learning Fluid Simulation for Novel Surfaces
Chordal Averaging on Flag Manifolds and Its Applications
COCO-O A Benchmark for Object Detectors under Natural Distribution Shifts
Masked Motion Predictors are Strong 3D Action Representation Learners
Multimodal Variational Auto-encoder based Audio-Visual Segmentation
VoroMesh Learning Watertight Surface Meshes with Voronoi Diagrams
Multi-Object Navigation with Dynamically Learned Neural Implicit Representations
Learning to Ground Instructional Articles in Videos through Narrations
A Benchmark for Chinese-English Scene Text Image Super-Resolution
Borrowing Knowledge From Pre-trained Language Model A New Data-efficient Visual
Deformable Neural Radiance Fields using RGB and Event Cameras
DetZero Rethinking Offboard 3D Object Detection with Long-term Sequential Point
Enhanced Soft Label for Semi-Supervised Semantic Segmentation
Fine-grained Unsupervised Domain Adaptation for Gait Recognition
GaFET Learning Geometry-aware Facial Expression Translation from In-The-Wild Images
Invariant Feature Regularization for Fair Face Recognition
Order-Prompted Tag Sequence Generation for Video Tagging
Rethinking Safe Semi-supervised Learning Transferring the Open-set Problem to A
Synchronize Feature Extracting and Matching A Single Branch Framework fo
Towards Fair and Comprehensive Comparisons for Image-Based 3D Object Detection
Tracking by Natural Language Specification with Long Short-term Context Decoupling
Transferable Adversarial Attack for Both Vision Transformers and Convolutional Networks
WaveIPT Joint Attention and Flow Alignment in the Wavelet domain
X-Mesh Towards Fast and Accurate Text-driven 3D Stylization via Dynamic
Inter-Realization Channels Unsupervised Anomaly Detection Beyond One-Class Classification
A Theory of Topological Derivatives for Inverse Rendering of Geometry
Gender Artifacts in Visual Datasets
Towards Geospatial Foundation Models via Continual Pretraining
Towards Memory- and Time-Efficient Backpropagation for Training Spiking Neural Networks
Tracking without Label Unsupervised Multiple Object Tracking via Contrastive Similarity
Encyclopedic VQA Visual Questions About Detailed Properties of Fine-Grained Categories
A Skeletonization Algorithm for Gradient-Based Optimization
M2T Masking Transformers Twice for Faster Decoding
Efficient Neural Supersampling on a Novel Gaming Dataset
Identification of Systematic Errors of Image Classifiers on Rare Subgroups
CauSSL Causality-inspired Semi-supervised Learning for Medical Image Segmentation
DDS2M Self-Supervised Denoising Diffusion Spatio-Spectral Model for Hyperspectral Image Restoration
Spectrum-guided Multi-granularity Referring Video Object Segmentation
Domain Generalization Guided by Gradient Signal to Noise Ratio o
SKED Sketch-guided Text-based 3D Editing
Environment-Invariant Curriculum Relation Learning for Fine-Grained Scene Graph Generation
Geometric Viewpoint Learning with Hyper-Rays and Harmonics Encoding
Reference-guided Controllable Inpainting of Neural Radiance Fields
MATE Masked Autoencoders are Online 3D Test-Time Learners
Privacy-Preserving Face Recognition Using Random Frequency Components
Dark Side Augmentation Generating Diverse Night Examples for Metric Learning
Verbs in Action Improving Verb Understanding in Video-Language Models
MSI Maximize Support-Set Information for Few-Shot Segmentation
Online Class Incremental Learning on Stochastic Blurry Task Boundary vi
CROSSFIRE Camera Relocalization On Self-Supervised Features from an Implicit Representation
MolGrapher Graph-based Visual Recognition of Chemical Structures
PATMAT Person Aware Tuning of Mask-Aware Transformer for Face Inpainting
Class-Incremental Grouping Network for Continual Audio-Visual Learning
SIDGAN High-Resolution Dubbed Video Generation via Shift-Invariant Learning
LiveHand Real-time and Photorealistic Neural Hand Rendering
Multi-label Affordance Mapping from Egocentric Vision
ActorsNeRF Animatable Few-shot Human Rendering with Generalizable NeRFs
DiffTAD Temporal Action Detection with Proposal Denoising Diffusion
Mining bias-target Alignment from Voronoi Cells
Steered Diffusion A Generalized Framework for Plug-and-Play Conditional Image Synthesis
Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving
DeePoint Visual Pointing Recognition and Direction Estimation
Pre-training Vision Transformers with Very Limited Synthesized Images
Representation Uncertainty in Self-Supervised Learning as Variational Inferenc
Minimal Solutions to Uncalibrated Two-view Geometry with Known Epipoles
Interaction-aware Joint Attention Estimation Using People Attributes
DiffFacto Controllable Part-Based 3D Point Cloud Generation with Cross Diffusion
CO-PILOT Dynamic Top-Down Point Cloud with Conditional Neighborhood Aggregation fo
Cyclic Test-Time Adaptation on Monocular Video for 3D Human Mesh
Unmasking Anomalies in Road-Scene Segmentation
Multi-Directional Subspace Editing in Style-Spac
RbA Segmenting Unknown Regions Rejected by All
Spurious Features Everywhere - Large-Scale Detection of Harmful Spurious Features
GaPro Box-Supervised 3D Point Cloud Instance Segmentation Using Gaussian Processes
Improved Knowledge Transfer for Semi-Supervised Domain Adaptation via Trico Training
Can Language Models Learn to Listen
Parallax-Tolerant Unsupervised Deep Image Stitching
PARTNER Level up the Polar Representation for LiDAR 3D Object
RLSAC Reinforcement Learning Enhanced Sample Consensus for End-to-End Robust Estimation
All in Tokens Unifying Output Space of Visual Tasks vi
Deep Image Harmonization with Globally Guided Feature Transformation and Relation
Deep Image Harmonization with Learnable Augmentation
Fine-grained Visible Watermark Removal
NIR-assisted Video Enhancement via Unpaired 24-hour Dat
On the Audio-visual Synchronization for Lip-to-Speech Synthesis
Deep Incubation Training Large Models by Divide-and-Conquering
Part-Aware Transformer for Generalizable Person Re-identification
RankMixup Ranking-Based Mixup Training for Network Calibration
Simple and Effective Out-of-Distribution Detection via Cosine-based Softmax Loss
PRANC Pseudo RAndom Networks for Compacting Deep Models
Neural Implicit Surface Evolution
Audio-Visual Glance Network for Efficient Video Recognition
Time-to-Contact Map by Joint Estimation of Up-to-Scale Inverse Depth an
Chaotic World A Large and Challenging Benchmark for Human Behavio
Robust One-Shot Face Video Re-enactment using Hybrid Latent Spaces o
Editing Implicit Assumptions in Text-to-Image Diffusion Models
Black Box Few-Shot Adaptation for Vision-Language Models
RSFNet A White-Box Image Retouching Approach using Region-Specific Color Filters
Troubleshooting Ethnic Quality Bias with Curriculum Domain Adaptation for Fac
Conceptual and Hierarchical Latent Space Decomposition for Face Editing
Teaching CLIP to Count to Ten
NeSS-ST Detecting Good and Stable Keypoints with a Neural Stability
Domain Adaptive Few-Shot Open-Set Learning
FashionNTM Multi-turn Fashion Image Retrieval via Cascaded Memory
A Complete Recipe for Diffusion Generative Models
Locally Stylized Neural Radiance Fields
First Session Adaptation A Strong Replay-Free Baseline for Class-Incremental Learning
Adaptive Template Transformer for Mitochondria Segmentation in Electron Microscopy Images
Aria Digital Twin A New Benchmark Dataset for Egocentric 3D
COPILOT Human-Environment Collision Prediction and Localization from Egocentric Videos
Effective Real Image Editing with Accelerated Iterative Diffusion Inversion
Few Shot Font Generation Via Transferring Similarity Guided Global Styl
Privacy Preserving Localization via Coordinate Permutations
Random Sub-Samples Generation for Self-Supervised Real Image Denoising
Scanning Only Once An End-to-end Framework for Fast Temporal Grounding
TransHuman A Transformer-based Human Representation for Generalizable Neural Human Rendering
Relightify Relightable 3D Faces from a Single Image via Diffusion
Taming Contrast Maximization for Learning Sequential Low-latency Event-based Optical Flow
MotionDeltaCNN Sparse CNN Inference of Frame Differences in Moving Cam
ACLS Adaptive and Conditional Label Smoothing for Network Calibration
COMPASS High-Efficiency Deep Image Compression with Arbitrary-scale Spatial Scalability
Content-Aware Local GAN for Photo-Realistic Super-Resolution
Label Shift Adapter for Test-Time Adaptation under Covariate and Label
Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in
Nearest Neighbor Guidance for Out-of-Distribution Detection
PC-Adapter Topology-Aware Adapter for Efficient Domain Adaption on Point Clouds
Probabilistic Precision and Recall Towards Reliable Evaluation of Generative Models
SeiT Storage-Efficient Vision Training with Tokens Using 1 of Pixel
Towards Robust and Smooth 3D Multi-Person Pose Estimation from Monocul
Understanding the Feature Norm for Out-of-Distribution Detection
Localizing Object-Level Shape Variations with Text-to-Image Diffusion Models
Pretrained Language Models as Visual Planners for Human Assistanc
Multi-weather Image Restoration via Domain Translation
GlueStick Robust Image Matching by Sticking Points and Lines Togeth
Vanishing Point Estimation in Uncalibrated Images with Prior Gravity Direction
Scalable Diffusion Models with Transformers
Clusterformer Cluster-based Transformer for 3D Object Detection in Point Clouds
Space-time Prompting for Video Class-incremental Learning
AutoReP Automatic ReLU Replacement for Fast Private Network Inferenc
CAME Contrastive Automated Model Evaluation
DELFlow Dense Efficient Learning of Scene Flow for Large-Scale Point
Diffusion-based Image Translation with Label Guidance for Domain Adaptive Semantic
EmoTalk Speech-Driven Emotional Disentanglement for 3D Face Animation
GET Group Event Transformer for Event-Based Vision
Source-free Domain Adaptive Human Pose Estimation
USAGE A Unified Seed Area Generation Paradigm for Weakly Supervis
TMR Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis
Audio-Visual Class-Incremental Learning
Lens Parameter Estimation for Realistic Depth of Field Modeling
BANSAC A Dynamic BAyesian Network for Adaptive SAmple Consensus
A step towards understanding why classification helps regression
LDP-Feat Image Features with Local Differential Privacy
Hierarchical Generation of Human-Object Interactions with Diffusion Probabilistic Models
What Can a Cook in Italy Teach a Mechanic in
LD-ZNet A Latent Diffusion Approach for Text-Based Image Segmentation
Event-based Temporally Dense Optical Flow Estimation with Sequential Learning
DiFaReli Diffusion Face Relighting
Surface Normal Clustering for Implicit Representation of Manhattan Scenes
Learn TAROT with MENTOR A Meta-Learned Self-Supervised Approach for Trajectory
EgoVLPv2 Egocentric Video-Language Pre-training with Fusion in the Backbon
What Does a Platypus Look Like Generating Customized Prompts fo
Dynamic Point Fields
Inverse Problem Regularization with Hierarchical Variational Autoencoders
Keep It SimPool Who Said Supervised Transformers Suffer from Attention
Using a Waffle Iron for Automotive Point Cloud Semantic Segmentation
Adaptive Rotated Convolution for Rotated Object Detection
Breaking The Limits of Text-conditioned 3D Motion Synthesis with Elaborativ
Decouple Before Interact Multi-Modal Prompt Learning for Continual Visual Question
LEA2 A Lightweight Ensemble Adversarial Attack via Non-overlapping Vulnerable Frequency
Sat2Density Faithful Density Learning from Satellite-Ground Image Pairs
Semantics Meets Temporal Correspondence Self-supervised Object-centric Learning in Videos
Stable Cluster Discrimination for Deep Clustering
Understanding 3D Object Interaction from a Single Imag
Dynamic Mesh-Aware Radiance Fields
March in Chat Interactive Prompting for Remote Embodied Referring Expression
Multi-view Spectral Polarization Propagation for Video Glass Segmentation
VLN-PETL Parameter-Efficient Transfer Learning for Vision-and-Language Navigation
Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning
GlueGen Plug and Play Multi-modal Encoders for X-to-image Generation
SupFusion Supervised LiDAR-Camera Fusion for 3D Object Detection
UniFusion Unified Multi-View Fusion Transformer for Spatial-Temporal Representation in Birds-Eye-View
Gram-based Attentive Neural Ordinary Differential Equations Network for Video Nystagmography
MB-TaylorFormer Multi-Branch Efficient Transformer Expanded by Taylor Formula for Imag
Scratch Each Others Back Incomplete Multi-Modal Brain Tumor Segmentation vi
Dynamic Snake Convolution Based on Topological Geometric Constraints for Tubul
E2NeRF Event Enhanced Neural Radiance Fields from Blurry Images
FateZero Fusing Attentions for Zero-shot Text-based Video Editing
High Quality Entity Segmentation
Deep Video Demoireing via Compact Invertible Dyadic Decomposition
Fingerprinting Deep Image Restoration Models
Semantic Information in Contrastive Learning
Single Image Defocus Deblurring via Implicit Neural Inverse Kernels
Boosting Whole Slide Image Classification from the Perspectives of Distribution
Novel-View Synthesis and Pose Estimation for Hand-Object Interaction from Spars
Towards Nonlinear-Motion-Aware and Occlusion-Robust Rolling Shutter Correction
Boosting Positive Segments for Weakly-Supervised Audio-Visual Video Parsing
Multimodal Distillation for Egocentric Action Recognition
DreamBooth3D Subject-Driven Text-to-3D Generation
ScatterNeRF Seeing Through Fog with Physically-Based Inverse Neural Rendering
MOST Multiple Object Localization with Self-Supervised Transformers for Object Discovery
Perceptual Grouping in Contrastive Vision-Language Models
DynaMITe Dynamic Query Bootstrapping for Multi-object Interactive Segmentation Transform
Studying How to Efficiently and Effectively Guide Models with Explanations
SEMPART Self-supervised Multi-resolution Partitioning of Image Semantics
Prior-guided Source-free Domain Adaptation for Human Pose Estimation
Scale-MAE A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning
L-DAWA Layer-wise Divergence Aware Weight Aggregation in Federated Self-Supervised Visual
Decoupled Iterative Refinement Framework for Interacting Hands Reconstruction from
GeoUDF Surface Reconstruction from 3D Point Clouds via Geometry-guided Distanc
Hierarchical Prior Mining for Non-local Multi-View Stereo
Multiscale Structure Guided Diffusion for Image Deblurring
Reinforced Disentanglement for Face Swapping without Skip Connection
SG-Former Self-guided Transformer with Evolving Token Reallocation
UGC Unified GAN Compression for Efficient Image-to-Image Translation
Zero-guidance Segmentation Using Zero Segment Labels
CGBA Curvature-aware Geometric Black-box Attack
Efficient 3D Semantic Segmentation with Superpoint Transform
LightDepth Single-View Depth Self-Supervision from Illumination Declin
End2End Multi-View Feature Matching with Differentiable Pose Optimization
Re-ReND Real-Time Rendering of NeRFs across Devices
Waffling Around for Performance Visual Classification with Random Words an
Exemplar-Free Continual Transformer with Convolutions
Test Time Adaptation for Blind Image Quality Assessment
Tracking by 3D Model Estimation of Unknown Objects in Videos
Towards Viewpoint-Invariant Visual Recognition via Adversarial Training
Theoretical and Numerical Analysis of 3D Reconstruction Using Point an
ICICLE Interpretable Class Incremental Continual Learning
Gramian Attention Heads are Strong yet Efficient Vision Learners
MEGA Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation
Multi-Object Discovery by Low-Dimensional Object Motion
EDAPS Enhanced Domain-Adaptive Panoptic Segmentation
Learning Adaptive Neighborhoods for Graph Neural Networks
Chop Learn Recognizing and Generating Object-State Compositions
DataDAM Efficient Dataset Distillation with Attention Matching
Time Does Tell Self-Supervised Time-Tuning of Dense Image Representations
Walking Your LiDOG A Journey Through Multiple Domains for LiDAR
CDFSL-V Cross-Domain Few-Shot Learning for Videos
Spatio-Temporal Crop Aggregation for Video Representation Learning
You Never Get a Second Chance To Make a Goo
Domain Generalization of 3D Semantic Segmentation in Autonomous Driving
Point-SLAM Dense Neural Point Cloud-based SLAM
S-TREK Sequential Translation and Rotation Equivariant Keypoints for Local Featu
Domain-Specificity Inducing Transformers for Source-Free Domain Adaptation
Curvature-Aware Training for Coordinate Networks
VQ3D Learning a 3D-Aware Generative Model on ImageNet
MI-GAN A Simple Baseline for Image Inpainting on Mobile Devices
SGAligner 3D Scene Alignment with Scene Graphs
Self-supervised Monocular Depth Estimation Lets Talk About The Weath
GACE Geometry Aware Confidence Enhancement for Black-Box 3D Object Detectors
Distracting Downpour Adversarial Weather Attacks for Motion Estimation
Probabilistic Modeling of Inter- and Intra-observer Variability in Medical Imag
R3D3 Dense 3D Reconstruction of Dynamic Scenes from Multiple Cameras
OmniLabel A Challenging Benchmark for Language-Based Object Detection
Discriminative Class Tokens for Text-to-Image Diffusion Models
MotionLM Multi-Agent Motion Forecasting as Language Modeling
DARTH Holistic Test-time Adaptation for Multiple Object Tracking
Vox-E Text-Guided Voxel Editing of 3D Objects
Sound Source Localization is All about Cross-Modal Alignment
FlipNeRF Flipped Reflection Rays for Few-shot Novel View Synthesis
Graphics2RAW Mapping Computer Graphics Images to Sensor RAW Images
LFS-GAN Lifelong Few-Shot Image Generation
How to Boost Face Recognition with StyleGAN
LiDAR-UDA Self-ensembling Through Time for Unsupervised LiDAR Domain Adaptation
Template Inversion Attack against Face Recognition Systems using 3D Fac
STEPs Self-Supervised Key Step Extraction and Localization from Unlabeled Procedural
TiDy-PSFs Computational Imaging with Time-Averaged Dynamic Point-Spread-Functions
SwiftFormer Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications
Neural Fields for Structured Lighting
Causal-DFQ Causality Guided Data-Free Network Quantization
Self-supervised Learning to Bring Dual Reversed Rolling Shutter Images Aliv
Diffusion-Based 3D Human Pose Estimation with Multi-Hypothesis Aggregation
Action Sensitivity Learning for Temporal Action Localization
Building Bridge Across the Time Disruption and Restoration of Murals
Data-free Knowledge Distillation for Fine-grained Visual Categorization
Global Features are All You Need for Image Retrieval an
HiVLP Hierarchical Interactive Video-Language Pre-Training
LNPL-MIL Learning from Noisy Pseudo Labels for Promoting Multiple Instanc
NDDepth Normal-Distance Assisted Monocular Depth Estimation
Towards Multi-Layered 3D Garments Animation
Transparent Shape from a Single View Polarization Imag
Unified Pre-Training with Pseudo Texts for Text-To-Image Person Re-Identification
Replay Multi-modal Multi-view Acted Videos for Casual Holography
The Perils of Learning From Unlabeled Data Backdoor Attacks on
AdaptGuard Defending Against Universal Attacks for Model Adaptation
Point Contrastive Prediction with Semantic Clustering for Self-Supervised Learning on
Accurate and Fast Compressed Video Captioning
CLIP-Cluster CLIP-Guided Attribute Hallucination for Face Clustering
Dec-Adapter Exploring Efficient Decoder-Side Adapter for Bridging Screen Content an
FerKD Surgical Label Adaptation for Efficient Distillation
Learning Global-aware Kernel for Image Harmonization
Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Clou
RPG-Palm Realistic Pseudo-data Generation for Palmprint Recognition
SegRCDB Semantic Segmentation via Formula-Driven Supervised Learning
Anomaly Detection using Score-based Perturbation Resilienc
BallGAN 3D-aware Image Synthesis with a Spherical Backgroun
BlendFace Re-designing Identity Encoders for Face-Swapping
3D Distillation Improving Self-Supervised Monocular Depth Estimation on Reflective Surfaces
Boosting 3-DoF Ground-to-Satellite Camera Localization Accuracy via Geometry-Guided Cross-View Transform
Deep Multitask Learning with Progressive Parameter Sharing
Dual Pseudo-Labels Interactive Self-Training for Semi-Supervised Visible-Infrared Person Re-Identification
EdaDet Open-Vocabulary Object Detection Using Early Dense Alignment
FreeCOS Self-Supervised Learning from Fractals and Unlabeled Images for Curvilin
LoGoPrompt Synthetic Text Images Can Be Good Visual Prompts fo
Lossy and Lossless L2 Post-training Model Size Compression
PhaseMP Robust 3D Pose Estimation via Phase-conditioned Human Motion Prio
PlaneRecTR Unified Query Learning for 3D Plane Recovery from
Prototype Reminiscence and Augmented Asymmetric Knowledge Aggregation for Non-Exemplar Class-Incremental
Trajectory Unified Transformer for Pedestrian Trajectory Prediction
VideoFlow Exploiting Temporal Cues for Multi-frame Optical Flow Estimation
Video Anomaly Detection via Sequentially Learning Multiple Pretext Tasks
Efficient Computation Sharing for Multi-Task Visual Scene Understanding
What does CLIP know about a red circle Visual prompt
DPF-Net Combining Explicit Shape Priors in Deformable Primitive Field fo
eP-ALM Efficient Perceptual Augmentation of Language Models
Conditional 360-degree Image Synthesis for Immersive Indoor Scene Decoration
3DPPE 3D Point Positional Encoding for Transformer-based Multi-Camera 3D Object
Adaptive Image Anonymization in the Context of Image Classification with
In-Style Bridging Text and Uncurated Videos with Style Transfer fo
Learning by Sorting Self-supervised Learning with Group Ordering Constraints
MosaiQ Quantum Generative Adversarial Networks for Image Generation on NISQ
SUMMIT Source-Free Adaptation of Uni-Modal Models to Multi-Modal Targets
Learning to Transform for Generalizable Instance-wise Invarianc
Benchmarking Low-Shot Robustness to Natural Distribution Shifts
Learning to Learn How to Continuously Teach Humans and Machines
Scene Graph Contrastive Learning for Embodied Navigation
The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining
Deep Geometrized Cartoon Line Inbetweening
Neural Haircut Prior-Guided Strand-Based Hair Reconstruction
VLSlice Interactive Vision-and-Language Slice Discovery
Blending-NeRF Text-Driven Localized Editing in Neural Radiance Fields
Conditional Cross Attention Network for Multi-Space Embedding without Entanglement in
Emotional Listener Portrait Neural Listener Head Generation with Emotion
Feature Proliferation -- the Cancer in StyleGAN and its Treatments
GraphAlign Enhancing Accurate Feature Alignment by Graph matching for Multi-Modal
Householder Projector for Unsupervised Latent Semantics Discovery
LLM-Planner Few-Shot Grounded Planning for Embodied Agents with Large Language
ModelGiF Gradient Fields for Model Functional Distance
Semantics-Consistent Feature Search for Self-Supervised Visual Representation Learning
Total-Recon Deformable Scene Reconstruction for Embodied View Synthesis
Under-Display Camera Image Restoration with Scattering Effect
Unsupervised Object Localization with Representer Point Selection
Mastering Spatial Graph Prediction of Road Networks
Kick Back Relax Learning to Reconstruct the World by
Corrupting Neuron Explanations of Deep Visual Features
FLIP Cross-domain Face Anti-spoofing with Language Guidance
Leaping Into Memories Space-Time Deep Feature Synthesis
FineRecon Depth-aware Feed-forward Network for Detailed 3D Reconstruction
LivePose Online 3D Reconstruction from Monocular Video with Dynamic Camera
SAGA Spectral Adversarial Geometric Attack on 3D Meshes
Agile Modeling From Concept to Classifier in Minutes
Rickrolling the Artist Injecting Backdoors into Text Encoders for Text-to-Image
Vision Relation Transformer for Unbiased Scene Graph Generation
Exploring the Sim2Real Gap Using Digital Twins
SoDaCam Software-defined Cameras via Single-Photon Imaging
Adaptive Illumination Mapping for Shadow Detection in Raw Images
Alignment Before Aggregation Trajectory Memory Retrieval Network for Video Object
Communication-Efficient Vertical Federated Learning with Limited Overlapping Samples
Contrastive Pseudo Learning for Open-World DeepFake Attribution
DIME-FM DIstilling Multimodal and Efficient Foundation Models
Dual Meta-Learning with Longitudinally Consistent Regularization for One-Shot Brain Tissue
FedPerfix Towards Partial Model Personalization of Vision Transformers in Federated
Going Denser with Open-Vocabulary Part Segmentation
Local Context-Aware Active Domain Adaptation
MAPConNet Self-supervised 3D Pose Transfer with Mesh and Point Contrastive
MixSynthFormer A Transformer Encoder-like Structure with Mixed Synthetic Self-attention for
Neural-PBIR Reconstruction of Shape Material and Illumination
Neural Reconstruction of Relightable Human Model from Monocular Video
SAFL-Net Semantic-Agnostic Feature Learning Network with Auxiliary Plugins for Image
Spatially-Adaptive Feature Modulation for Efficient Image Super-Resolution
Spatially and Spectrally Consistent Deep Functional Maps
Spatio-temporal Prompting Network for Robust Video Feature Extraction
Unleashing the Power of Gradient Signal-to-Noise Ratio for Zero-Shot NAS
ViperGPT Visual Inference via Python Execution for Reasoning
SparseDet Improving Sparsely Annotated Object Detection with Pseudo-positive Mining
ACTIVE Towards Highly Transferable 3D Physical Camouflage for Universal and
TIJO Trigger Inversion with Joint Optimization for Defending Multimodal Backdoor
Smoothness Similarity Regularization for Few-Shot GAN Adaptation
Adversarial Finetuning with Latent Representation Constraint to Mitigate Accuracy-Robustness Tradeoff
CaPhy Capturing Physical Properties for Animatable Human Avatars
Deep Directly-Trained Spiking Neural Networks for Object Detection
Hiding Visual Information via Obfuscating Adversarial Perturbations
Name Your Colour For the Task Artificially Discover Colour Naming
NPC Neural Point Characters from Video
Unsupervised Video Object Segmentation with Online Adversarial Self-Tuning
DINAR Diffusion Inpainting of Neural Textures for One-Shot Human Avatars
Preserving Modality Structure Improves Multi-Modal Learning
Viewset Diffusion 0-Image-Conditioned 3D Generative Models from 2D Data
ChildPlay A New Benchmark for Understanding Children's Gaze Behaviour
Global Perception Based Autoregressive Neural Processes
3D Segmentation of Humans in Point Clouds with Synthetic Data
Role-Aware Interaction Generation from Textual Description
CoTDet Affordance Knowledge Prompting for Task Driven Object Detection
DDG-Net Discriminability-Driven Graph Network for Weakly-supervised Temporal Action Localization
Delicate Textured Mesh Recovery from NeRF via Adaptive Surface Refinement
Distribution Shift Matters for Knowledge Distillation with Webly Collected Images
Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation
ElasticViT Conflict-aware Supernet Training for Deploying Fast Vision Transformer on
Make-It-3D High-fidelity 3D Creation from A Single Image with Diffusion
Multiple Instance Learning Framework with Masked Hard Instance Mining for
ProtoTransfer Cross-Modal Prototype Transfer for Point Cloud Segmentation
Scene Matters Model-based Deep Video Compression
SwinLSTM Improving Spatiotemporal Prediction Accuracy using Swin Transformer and LSTM
Temporal Collection and Distribution for Referring Video Object Segmentation
When Prompt-based Incremental Learning Does Not Meet Strong Pretraining
Social Diffusion Long-term Multiple Human Motion Anticipation
DS-Fusion Artistic Typography via Discriminated and Stylized Diffusion
EMMN Emotional Motion Memory Network for Audio-driven Emotional Talking Face
3DHacker Spectrum-based Decision Boundary Generation for Hard-label 3D Point Cloud
AdaNIC Towards Practical Neural Image Compression via Dynamic Transform Routing
Local and Global Logit Adjustments for Long-Tailed Learning
Enhanced Meta Label Correction for Coping with Label Corruption
Examining Autoexposure for Challenging Scenes
Alignment-free HDR Deghosting with Semantics Consistent Transform
StageInteractor Query-based Object Detector with Cross-stage Interaction
Tangent Sampson Error Fast Approximate Two-view Reprojection Error for Central
Imitator Personalized Speech-driven 3D Facial Animation
Tubelet-Contrastive Self-Supervision for Video-Efficient Generalization
Beyond Skin Tone A Multidimensional Measure of Apparent Skin Color
DPS-Net Deep Polarimetric Stereo Depth Estimation
Instance and Category Supervision are Alternate Learners for Continual Learning
MonoNeRF Learning a Generalizable Dynamic Radiance Field from Monocular Videos
Non-Semantics Suppressed Mask Learning for Unsupervised Video Semantic Compression
Prototypes-oriented Transductive Few-shot Learning with Conditional Transport
ShapeScaffolder Structure-Aware 3D Shape Generation from Text
Scene as Occupancy
Object-aware Gaze Target Detection
Linear Spaces of Meanings Compositional Structures in Vision-Language Models
Persistent-Transient Duality A Multi-Mechanism Approach for Modeling Human-Object Interaction
DECO Dense Estimation of 3D Human-Scene Contact In The Wild
Divide&Classify Fine-Grained Classification for City-Wide Visual Geo-Localization
Spectral Graphormer Spectral Graph-Based Transformer for Egocentric Two-Hand Reconstruction using
Learning Data-Driven Vector-Quantized Degradation Model for Animation Video Super-Resolution
Agglomerative Transformer for Human-Object Interaction Detection
FemtoDet An Object Detection Baseline for Energy Versus Performance Tradeoffs
ImGeoNet Image-induced Geometry-aware Voxel Representation for Multi-view 3D Object Detection
Implicit Temporal Modeling with Learnable Alignment for Video Recognition
MULLER Multilayer Laplacian Resizer for Vision
Self-supervised Cross-view Representation Reconstruction for Change Captioning
GECCO Geometrically-Conditioned Point Diffusion Models
SuS-X Training-Free Name-Only Transfer of Vision-Language Models
ProbVLM Probabilistic Adapter for Frozen Vision-Language Models
When Do Curricula Work in Federated Learning
PDiscoNet Semantically consistent part discovery for fine-grained recognition
Document Understanding Dataset and Evaluation DUDE
Anti-DreamBooth Protecting Users from Personalized Text-to-image Synthesis
Prototype-based Dataset Comparison
Poincare ResNet
Self-supervised Monocular Underwater Depth Recovery Image Restoration and a Real-s
ViLLA Fine-Grained Vision-Language Representation Learning from Real-World Data
FastViT A Fast Hybrid Vision Transformer Using Structural Reparameterization
Convex Decomposition of Indoor Scenes
P1AC Revisiting Absolute Pose From a Single Affine Correspondence
CLIPascene Scene Sketching with Different Types and Levels of Abstraction
MST-compression Compressing and Accelerating Binary Neural Networks with Minimum Spanning
End-to-End Diffusion Latent Optimization Improves Classifier Guidance
3D Human Mesh Recovery with Sequentially Global Rotation Estimation
3D Semantic Subspace Traverser Empowering 3D Generative Model with Shape
ALWOD Active Learning for Weakly-Supervised Object Detection
Batch-based Model Registration for Fast 3D Sherd Reconstruction
Building3D A Urban-Scale Dataset and Benchmarks for Learning Roof Structures
CBA Improving Online Continual Learning via Continual Bias Adaptor
CDAC Cross-domain Attention Consistency in Transformer for Domain Adaptive Semantic
CLIPN for Zero-Shot OOD Detection Teaching CLIP to Say No
CORE Cooperative Reconstruction for Multi-Agent Perception
Counterfactual-based Saliency Map Towards Visual Contrastive Explanations for Neural Networks
Creative Birds Self-Supervised Single-View 3D Style Transfer
Deep Active Contours for Real-time 6-DoF Object Tracking
Deep Equilibrium Object Detection
Deep Optics for Video Snapshot Compressive Imaging
DiLiGenT-Pi Photometric Stereo for Planar Surfaces with Rich Details -
DIRE for Diffusion-Generated Image Detection
DistillBEV Boosting Multi-Camera 3D Object Detection with Cross-Modal Knowledge Distillation
Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual
Distribution-Consistent Modal Recovering for Incomplete Multimodal Learning
Does Physical Adversarial Example Really Matter to Autonomous Driving Towards
Domain Specified Optimization for Deployment Authorization
DREAMWALKER Mental Planning for Continuous Vision-Language Navigation
DyGait Exploiting Dynamic Representations for High-performance Gait Recognition
EfficientTrain Exploring Generalized Curriculum Learning for Training Visual Backbones
Ego-Only Egocentric Action Detection without Exocentric Transferring
Equivariant Similarity for Vision-Language Foundation Models
Evaluating Data Attribution for Text-to-Image Models
Event-Guided Procedure Planning from Instructional Videos with Text Supervision
Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection
ExposureDiffusion Learning to Expose for Low-light Image Enhancement
Fg-T2M Fine-Grained Text-Driven Human Motion Generation via Diffusion Model
Generalizable Decision Boundaries Dualistic Meta-Learning for Open Set Domain Generalization
Get the Best of Both Worlds Improving Accuracy and Transferability
GlowGAN Unsupervised Learning of HDR Images from LDR Images in
GridMM Grid Memory Map for Vision-and-Language Navigation
Guiding Local Feature Matching with Surface Curvature
Hierarchical Spatio-Temporal Representation Learning for Gait Recognition
HoloAssist an Egocentric Human Interaction Dataset for Interactive AI Assistants
Homography Guided Temporal Fusion for Road Line and Marking Segmentation
How Far Pre-trained Models Are from Neural Collapse on the
IHNet Iterative Hierarchical Network Guided by High-Resolution Estimated Information for
Improved Visual Fine-tuning with Natural Language Supervision
Improving Zero-Shot Generalization for CLIP with Synthesized Prompts
Informative Data Mining for One-Shot Cross-Domain Semantic Segmentation
Iterative Soft Shrinkage Learning for Efficient Image Super-Resolution
Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video
Learning Human Dynamics in Autonomous Driving Scenarios
Learning Long-Range Information with Dual-Scale Transformers for Indoor Scene Completion
Learning Support and Trivial Prototypes for Interpretable Image Classification
Learning Unified Decompositional and Compositional NeRF for Editable Novel View
Lighting up NeRF via Unsupervised Decomposition and Enhancement
LoLep Single-View View Synthesis with Locally-Learned Planes and Self-Attention Occlusion
Low-Light Image Enhancement with Illumination-Aware Gamma Correction and Complete Image
LRRU Long-short Range Recurrent Updating Networks for Depth Completion
Manipulate by Seeing Creating Manipulation Controllers from Pre-Trained Representations
Masked Spiking Transformer
Memory-and-Anticipation Transformer for Online Action Understanding
Mixed Neural Voxels for Fast Multi-view Video Synthesis
NEMTO Neural Environment Matting for Novel View and Relighting Synthesis
Neural Video Depth Stabilizer
NeuS2 Fast Learning of Neural Implicit Surfaces for Multi-view Reconstruction
Noise2Info Noisy Image to Information of Noise for Self-Supervised Image
Not All Steps are Created Equal Selective Diffusion Distillation for
Not Every Side Is Equal Localization Uncertainty Estimation for Semi-Supervised
Object as Query Lifting Any 2D Object Detector to 3D
Open-Vocabulary Object Detection With an Open Corpus
OpenOccupancy A Large Scale Benchmark for Surrounding Semantic Occupancy Perception
OPERA Omni-Supervised Representation Learning with Hierarchical Supervisions
Ord2Seq Regarding Ordinal Regression as Label Sequence Prediction
Overwriting Pretrained Bias with Finetuning Data
PoseDiffusion Solving Pose Estimation via Diffusion-aided Bundle Adjustment
Query6DoF Learning Sparse Queries as Implicit Shape Prior for Category-Level
Random Boxes Are Open-world Object Detectors
ReFit Recurrent Fitting Network for 3D Human Recovery
Regularized Primitive Graph Learning for Unified Vector Mapping
RFLA A Stealthy Reflected Light Adversarial Attack in the Physical
ROME Robustifying Memory-Efficient NAS via Topology Disentanglement and Gradient Accumulation
Root Pose Decomposition Towards Generic Non-rigid 3D Reconstruction with Monocular
Saliency Regularization for Self-Training with Partial Annotations
Sample-adaptive Augmentation for Point Cloud Recognition Against Real-world Corruptions
Scaling Data Generation in Vision-and-Language Navigation
Seal-3D Interactive Pixel-Level Editing for Neural Radiance Fields
SegGPT Towards Segmenting Everything in Context
Self-similarity Driven Scale-invariant Learning for Weakly Supervised Person Search
SpaceEvo Hardware-Friendly Search Space Design for Efficient INT8 Inference
Space Engage Collaborative Space Supervision for Contrastive-Based Semi-Supervised Semantic Segmentation
SparseNeRF Distilling Depth Ranking for Few-shot Novel View Synthesis
SSF Accelerating Training of Spiking Neural Networks with Stabilized Spiking
Structure Invariant Transformation for better Adversarial Transferability
StyleDiffusion Controllable Disentangled Style Transfer via Diffusion Models
StyleInV A Temporal Style Modulated Inversion Network for Unconditional Video
Take-A-Photo 3D-to-2D Generative Pre-training of Point Cloud Models
Too Large Data Reduction for Vision-Language Pre-Training
Towards Open-Vocabulary Video Instance Segmentation
Tracking Everything Everywhere All at Once
Treating Pseudo-labels Generation as Image Matting for Weakly Supervised Semantic
UMC A Unified Bandwidth-efficient and Multi-resolution based Collaborative Perception Framework
Unified Coarse-to-Fine Alignment for Video-Text Retrieval
Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation for Anomaly Detection
UniTR A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation
Unsupervised Video Deraining with An Event Camera
V3Det Vast Vocabulary Visual Detection Dataset
View Consistent Purification for Accurate Cross-View Localization
ViLTA Enhancing Vision-Language Pre-training through Textual Augmentation
VQA-GNN Reasoning with Multimodal Knowledge via Graph Neural Networks for
Weakly-Supervised Action Localization by Hierarchically-Structured Latent Attention Modeling
What do neural networks learn in image classification A frequency
Why do networks have inhibitory/negative connections
Zolly Zoom Focal Length Correctly for Perspective-Distorted Human Mesh Reconstruction
Enhancing Privacy Preservation in Federated Learning via Learning Rate Perturbation
RPEFlow Multimodal Fusion of RGB-PointCloud-Event for Joint Optical Flow and
SOCS Semantically-Aware Object Coordinate Space for Category-Level 6D Object Pose
UniDexGrasp++ Improving Dexterous Grasping Policy Learning via Geometry-Aware Curriculum and
Nerfbusters Removing Ghostly Artifacts from Casually Captured NeRFs
Video-FocalNets Spatio-Temporal Focal Modulation for Video Action Recognition
CroCo v2 Improved Cross-view Completion Pre-training for Stereo Matching and
Adaptive Reordering Sampler with Neurally Guided MAGSAC
Clutter Detection and Removal in 3D Scenes with View-Consistent Inpainting
DCPB Deformable Convolution Based on the Poincare Ball for Top-view
Diffusion Models as Masked Autoencoders
Disentangle then Parse Night-time Semantic Segmentation with Illumination Disentanglement
ELITE Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image
Generalized Differentiable RANSAC
HairCLIPv2 Unifying Hair Editing via Proxy Feature Blending
Improving CLIP Fine-tuning Performance
Improving Continuous Sign Language Recognition with Cross-Lingual Signs
Is Imitation All You Need Generalized Decision-Making with Dual-Phase Training
Multimodal High-order Relation Transformer for Scene Boundary Detection
Online Prototype Learning for Online Continual Learning
Passive Ultra-Wideband Single-Photon Imaging
SurroundOcc Multi-camera 3D Occupancy Prediction for Autonomous Driving
Temporal-Coded Spiking Neural Networks with Dynamic Firing Threshold Learning with
Towards Real-World Burst Image Super-Resolution Benchmark and Method
Unified Adversarial Patch for Cross-Modal Attacks in the Physical World
Affective Image Filter Reflecting Emotions from Text to Images
Joint Metrics Matter A Better Standard for Trajectory Forecasting
Divide and Conquer a Two-Step Method for High Quality Face
Ordinal Label Distribution Learning
Pairwise Similarity Learning is SimPLE
Parametric Classification for Generalized Category Discovery A Baseline Study
SimNP Learning Self-Similarity Priors Between Neural Points
SAFE Sensitivity-Aware Features for Out-of-Distribution Object Detection
Unsupervised Learning of Object-Centric Embeddings for Cell Instance Segmentation in
AccFlow Backward Accumulation for Long-Range Optical Flow
Advancing Referring Expression Segmentation Beyond Single Image
A Latent Space of Stochastic Diffusion Models for Zero-Shot Image
Betrayed by Captions Joint Caption Grounding and Generation for Open
Bold but Cautious Unlocking the Potential of Personalized Federated Learning
Computation and Data Efficient Backdoor Attacks
Deep Feature Deblurring Diffusion for Detecting Out-of-Distribution Objects
DiffuMask Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using
Efficient View Synthesis with Neural Radiance Distribution Field
Estimator Meets Equilibrium Perspective A Rectified Straight Through Estimator for
Exploring Transformers for Open-world Instance Segmentation
Exploring Video Quality Assessment on User Generated Contents from Aesthetic
Face Clustering via Graph Convolutional Networks with Confidence Edges
Factorized Inverse Path Tracing for Efficient and Accurate Material-Lighting Estimation
Grounded Image Text Matching with Mismatched Relation Reasoning
Hallucination Improves the Performance of Unsupervised Visual Representation Learning
Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image
HSR-Diff Hyperspectral Image Super-Resolution via Conditional Diffusion Models
Human Preference Score Better Aligning Text-to-Image Models with Human Preference
Improving Representation Learning for Histopathologic Images with Cluster Constraints
LA-Net Landmark-Aware Learning for Reliable Facial Expression Recognition under Label
Label-Efficient Online Continual Object Detection in Streaming Video
Learning Concordant Attention via Target-aware Alignment for Visible-Infrared Person Re-identification
Learning Foresightful Dense Visual Affordance for Deformable Object Manipulation
Leveraging SE3 Equivariance for Learning 3D Geometric Shape Assembly
LPFF A Portrait Dataset for Face Generators Across Large Poses
MedKLIP Medical Knowledge Enhanced Language-Image Pre-Training for X-ray Diagnosis
MetaGCD Learning to Continually Learn in Generalized Category Discovery
Meta OOD Learning For Continuously Adaptive OOD Detection
MixCycle Mixup Assisted Semi-Supervised 3D Single Object Tracking with Cycle
ObjectSDF++ Improved Object-Compositional Neural Implicit Surfaces
OnlineRefer A Simple Online Baseline for Referring Video Object Segmentation
Randomized Quantization A Generic Augmentation for Data Agnostic Self-supervised Learning
S-VolSDF Sparse Multi-View Stereo Regularization of Neural Implicit Surfaces
Scalable Video Object Segmentation with Simplified Framework
Segment Every Reference Object in Spatial and Temporal Spaces
Sketch and Text Guided Diffusion Model for Colored Point Cloud
Source-free Depth for Object Pop-out
Spatial-Aware Token for Weakly Supervised Object Localization
Spatial Self-Distillation for Object Detection with Inaccurate Bounding Boxes
Speech2Lip High-fidelity Speech to Lip Generation by Learning from
TinyCLIP CLIP Distillation via Affinity Mimicking and Weight Inheritance
Towards Universal LiDAR-Based 3D Object Detection by Multi-Domain Knowledge Transfer
Tune-A-Video One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
What Can Simple Arithmetic Operations Do for Temporal Modeling
Why Is Prompt Tuning for Vision-Language Models Robust to Noisy
AssetField Assets Mining and Reconfiguration in Ground Feature Plane Representation
3D-aware Image Generation using 2D Diffusion Models
Denoising Diffusion Autoencoders are Unified Self-supervised Learners
Generative Action Description Prompts for Skeleton-based Action Recognition
GRAM-HD 3D-Consistent Image Generation at High Resolution with Generative Radiance
HM-ViT Hetero-Modal Vehicle-to-Vehicle Cooperative Perception with Vision Transformer
Rendering Humans from Object-Occluded Monocular Videos
Retro-FPN Retrospective Feature Pyramid Network for Point Cloud Semantic Segmentation
ADNet Lane Shape Prediction via Anchor Decomposition
Automatic Animation of Hair Blowing in Still Portrait Photos
Token-Label Alignment for Vision Transformers
CASSPR Cross Attention Single Scan Place Recognition
CMDA Cross-Modality Domain Adaptation for Nighttime Semantic Segmentation
CoIn Contrastive Instance Feature Mining for Outdoor 3D Object Detection
Combating Noisy Labels with Sample Selection by Mining High-Discrepancy Examples
DiffIR Efficient Diffusion Model for Image Restoration
Few-Shot Video Classification via Representation Fusion and Promotion Learning
Holistic Label Correction for Noisy Multi-Label Classification
Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Localization
Personalized Semantics Excitation for Federated Image Classification
Window-Based Early-Exit Cascades for Uncertainty Estimation When Deep Ensembles
BoxDiff Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion
CO-Net Learning Multiple Point Cloud Tasks at Once with A
DiffFit Unlocking Transferability of Large Diffusion Models via Simple Parameter-efficient
GAIT Generating Aesthetic Indoor Tours with Deep Reinforcement Learning
HollowNeRF Pruning Hashgrid-Based NeRFs with Trainable Collision Mitigation
Most Important Person-Guided Dual-Branch Cross-Patch Attention for Group Affect Recognition
MV-Map Offboard HD-Map Generation with Multi-view Consistency
NaviNeRF NeRF-based 3D Representation Disentanglement by Latent Semantic Navigation
Nonrigid Object Contact Estimation With Regional Unwrapping Transform
OFVL-MS Once for Visual Localization across Multiple Indoor Scenes
Pixel-Aligned Recurrent Queries for Multi-View 3D Object Detection
S3IM Stochastic Structural SIMilarity and Its Unreasonable Effectiveness for Neural
SparseFusion Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection
Boosting Few-shot Action Recognition with Graph-guided Hybrid Matching
HDG-ODE A Hierarchical Continuous-Time Model for Human Pose Forecasting
CL-MVSNet Unsupervised Multi-View Stereo with Dual-Level Contrastive Learning
Confidence-based Visual Dispersal for Few-shot Unsupervised Domain Adaptation
Get3DHuman Lifting StyleGAN-Human into a 3D Generative Model Using Pixel-Aligned
Open Set Video HOI detection from Action-Centric Chain-of-Look Prompting
Narrator Towards Natural Control of Human-Scene Interaction Generation via Relationship
NSF Neural Surface Fields for Human Modeling from Monocular Depth
Variational Causal Inference Network for Explanatory Visual Question Answering
ActFormer A GAN-based Transformer towards General Action-Conditioned 3D Human Motion
Animal3D A Comprehensive Dataset of 3D Animal Pose and Shape
Augmenting and Aligning Snippets for Few-Shot Video Domain Adaptation
Auxiliary Tasks Benefit 3D Skeleton-based Human Motion Prediction
Backpropagation Path Search On Adversarial Transferability
Bridging Vision and Language Encoders Parameter-Efficient Tuning for Referring Image
C2F2NeUS Cascade Cost Frustum Fusion for High Fidelity and Generalizable
CiT Curation in Training for Effective Vision-Language Data
ClothPose A Real-world Benchmark for Visual Analysis of Garment Pose
DeepChange A Long-Term Person Re-Identification Benchmark with Clothes Change
Deformable Model-Driven Neural Rendering for High-Fidelity 3D Reconstruction of Human
Downscaled Representation Matters Improving Image Rescaling with Collaborative Downscaled Images
Efficient Joint Optimization of Layer-Adaptive Weight Pruning in Deep Neural
EgoPCA A New Framework for Egocentric Hand-Object Interaction Understanding
EQ-Net Elastic Quantization Neural Networks
Fast and Accurate Transferability Measurement by Evaluating Intra-class Feature Variance
FDViT Improve the Hierarchical Architecture of Vision Transformer
FrozenRecon Pose-free 3D Scene Reconstruction with Frozen Depth Models
Generalized Few-Shot Point Cloud Segmentation via Geometric Words
Hierarchical Point-based Active Learning for Semi-supervised Point Cloud Semantic Segmentation
Human-centric Scene Understanding for 3D Large-scale Scenarios
Integrating Boxes and Masks A Multi-Object Framework for Unified Visual
InterDiff Generating 3D Human-Object Interactions with Physics-Informed Diffusion
Joint-Relation Transformer for Multi-Person Motion Prediction
Learning Image Harmonization in the Linear Color Space
MasQCLIP for Open-Vocabulary Universal Image Segmentation
MBPTrack Improving 3D Point Cloud Tracking with Memory Networks and
MonoNeRD NeRF-like Representations for Monocular 3D Object Detection
Multi-Task Learning with Knowledge Distillation for Dense Prediction
Multimodal Optimal Transport-based Co-Attention Transformer with Global Structure Consistency for
NeRF-Det Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection
ParCNetV2 Oversized Kernel with Enhanced Attention
ReNeRF Relightable Neural Radiance Fields with Nearfield Lighting
RIGID Recurrent GAN Inversion and Editing of Real Face Videos
Self-Calibrated Cross Attention Network for Few-Shot Segmentation
StylerDALLE Language-Guided Style Transfer Using a Vector-Quantized Tokenizer of
TALL Thumbnail Layout for Deepfake Video Detection
Versatile Diffusion Text Images and Variations All in One Diffusion
WaveNeRF Wavelet-based Generalizable Neural Radiance Fields
FCCNs Fully Complex-valued Convolutional Networks using Complex-valued Color Model and
2D-3D Interlaced Transformer for Point Cloud Segmentation with Scene-Level Supervision
3DHumanGAN 3D-Aware Human Image Generation with 3D Pose Mapping
AIDE A Vision-Driven Multi-View Multi-Modal Multi-Tasking Dataset for Assistive Driving
ALIP Adaptive Language-Image Pre-Training with Synthetic Caption
ASM Adaptive Skinning Model for High-Quality 3D Face Modeling
Attentive Mask CLIP
Beyond the Limitation of Monocular 3D Detector via Knowledge Distillation
BoxSnake Polygonal Instance Segmentation with Box Supervision
Bridging Cross-task Protocol Inconsistency for Distillation in Dense Object Detection
Computationally-Efficient Neural Image Compression with Shallow Decoders
Concept-wise Fine-tuning Matters in Preventing Negative Transfer
Cross-Ray Neural Radiance Fields for Novel-View Synthesis from Unconstrained Image
Cross-view Semantic Alignment for Livestreaming Product Recognition
D-IF Uncertainty-aware Human Digitization via Implicit Distribution Field
Data Augmented Flatness-aware Gradient Projection for Continual Learning
Designing Phase Masks for Under-Display Cameras
Diffusion Model as Representation Learner
Efficient Model Personalization in Federated Learning via Client-Specific Prompt Generation
EmoSet A Large-scale Visual Emotion Dataset with Rich Attributes
Enhancing Adversarial Robustness in Low-Label Regime via Adaptively Weighted Regularization
Event Camera Data Pre-training
FedPD Federated Open Set Recognition with Parameter Disentanglement
Foreground-Background Distribution Modeling Transformer for Visual Object Tracking
From Knowledge Distillation to Self-Knowledge Distillation A Unified Approach with
GEDepth Ground Embedding for Monocular Depth Estimation
Generating Visual Scenes from Touch
GraphEcho Graph-Driven Unsupervised Domain Adaptation for Echocardiogram Video Segmentation
Grounding 3D Object Affordance from 2D Interactions in Images
HSE Hybrid Species Embedding for Deep Metric Learning
Implicit Neural Representation for Cooperative Low-light Image Enhancement
Innovating Real Fisheye Image Correction with Dual Diffusion Architecture
Label-Guided Knowledge Distillation for Continual Semantic Segmentation on 2D Images
LAC - Latent Action Composition for Skeleton-based Action Segmentation
Large-Scale Person Detection and Localization Using Overhead Fisheye Cameras
LAW-Diffusion Complex Scene Generation by Diffusion with Layouts
Learning Trajectory-Word Alignments for Video-Language Tasks
Long-Range Grouping Transformer for Multi-View 3D Reconstruction
MRM Masked Relation Modeling for Medical Image Pre-Training with Genetics
Multi-Label Knowledge Distillation
Neural Interactive Keypoint Detection
One-Shot Generative Domain Adaptation
Out-of-Domain GAN Inversion via Invertibility Decomposition for Photo-Realistic Human Face
PanFlowNet A Flow-Based Deep Network for Pan-Sharpening
Parametric Depth Based Feature Representation Learning for Object Detection and
PPR Physically Plausible Reconstruction from Monocular Videos
Prototypical Mixing and Retrieval-Based Refinement for Label Noise-Resistant Image Retrieval
SEFD Learning to Distill Complex Pose and Occlusion
Self-Ordering Point Clouds
Semi-supervised Speech-driven 3D Facial Animation via Cross-modal Encoding
Shrinking Class Space for Enhanced Certainty in Semi-Supervised Learning
SILT Shadow-Aware Iterative Label Tuning for Learning to Detect Shadows
Spatio-Temporal Domain Awareness for Multi-Agent Collaborative Perception
Stable and Causal Inference for Discriminative Self-supervised Deep Visual Representations
StyleGANEX StyleGAN-Based Manipulation Beyond Cropped Aligned Faces
SynBody Synthetic Dataset with Layered Human Models for 3D Human
Towards Grand Unified Representation Learning for Unsupervised Visible-Infrared Person Re-Identification
UrbanGIRAFFE Representing Urban Scenes as Compositional Generative Neural Feature Fields
Video Adverse-Weather-Component Suppression Network via Weather Messenger and Adversarial Backpropagation
Zero-Shot Contrastive Loss for Text-Guided Diffusion Image Style Transfer
Zero-Shot Point Cloud Segmentation by Semantic-Visual Aware Synthesis
Active Neural Mapping
Cross Modal Transformer Towards Fast and Robust 3D Object Detection
Deep Homography Mixture for Single Image Rolling Shutter Correction
Feature Prediction Diffusion Model for Video Anomaly Detection
Implicit Autoencoder for Point-Cloud Self-Supervised Representation Learning
INT2 Interactive Trajectory Prediction at Intersections
Learning Concise and Descriptive Attributes for Visual Recognition
Learning with Diversity Self-Expanded Equalization for Better Generalized Deep Metric
SkeletonMAE Graph-based Masked Autoencoder for Skeleton Sequence Pre-training
UCF Uncovering Common Features for Generalizable Deepfake Detection
UnLoc A Unified Framework for Video Localization Tasks
Focus the Discrepancy Intra- and Inter-Correlation Learning for Image Anomaly
Generalized Lightness Adaptation with Channel Selective Normalization
Inherent Redundancy in Spiking Neural Networks
NDC-Scene Boost Monocular 3D Semantic Scene Completion in Normalized Devic
Sign Language Translation with Iterative Prototy
Sparse Point Guided 3D Lane Detection
Towards Understanding the Generalization of Deepfake Detectors from a Game-Theoretical
MAMo Leveraging Memory and Attention for Monocular Video Depth Estimation
TextManiA Enriching Visual Feature by Text-driven Manifold Augmentation
FACTS First Amplify Correlations and Then Slice to Discover Bias
Rapid Network Adaptation Learning to Adapt Neural Networks Using Test-Tim
ScanNet++ A High-Fidelity Dataset of 3D Indoor Scenes
Adverse Weather Removal with Codebook Priors
Bootstrap Motion Forecasting With Self-Consistent Constraints
Cascade-DETR Delving into High-Quality Universal Object Detection
Constraining Depth Map Geometry for Multi-View Stereo A Dual-Depth Approach
Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction Clips
Efficient Transformer-based 3D Object Detection with Dynamic Token Halting
FeatureNeRF Learning Generalizable NeRFs by Distilling Foundation Models
HiTeA Hierarchical Temporal-Aware Video-Language Pre-training
IntrinsicNeRF Learning Intrinsic Neural Radiance Fields for Editable Novel View
Neural Deformable Models for 3D Bi-Ventricular Heart Shape Reconstruction an
Recovering a Molecule's 3D Dynamics from Liquid-phase Electron Microscopy Movies
Self-Evolved Dynamic Expansion Model for Task-Free Continual Learning
TaskExpert Dynamically Assembling Multi-Task Representations with Memorial Mixture-of-Experts
Wasserstein Expansible Variational Autoencoder for Discriminative and Generative Continual Learning
Diverse Inpainting and Editing with GAN Inversion
CTVIS Consistent Training for Online Video Instance Segmentation
PARF Primitive-Aware Radiance Fusion for Indoor Scene Novel View Synthesis
CrossMatch Source-Free Domain Adaptive Semantic Segmentation via Cross-Modal Consistency Training
Cyclic-Bootstrap Labeling for Weakly Supervised Object Detection
Geometry-guided Feature Learning and Fusion for Indoor Scene Reconstruction
MetaF2N Blind Image Super-Resolution by Learning Efficient Model Adaptation from
Metric3D Towards Zero-shot Metric 3D Prediction from A Single Imag
Canonical Factors for Hybrid Neural Fields
Diff-Retinex Rethinking Low-light Image Enhancement with A Generative Diffusion Model
Invariant Training 2D-3D Joint Hard Samples for Few-Shot Point Clou
SCANet Scene Complexity Aware Network for Weakly-Supervised Video Moment Retrieval
Video Object Segmentation-aware Video Frame Interpolation
DynamicISP Dynamically Controlled Image Signal Processor for Image Recognition
Co-Evolution of Pose and Mesh for 3D Human Body Estimation
Towards Universal Image Embeddings A Large-Scale Dataset and Challenge fo
4D Myocardium Reconstruction with Decoupled Motion and Shape Model
Isomer Isomerous Transformer for Zero-shot Video Object Segmentation
Late Stopping Avoiding Confidently Learning from Mislabeled Examples
Make Encoder Great Again in 3D GAN Inversion through Geometry
PhysDiff Physics-Guided Human Motion Diffusion Model
PointMBF A Multi-scale Bidirectional Fusion Network for Unsupervised RGB-D Point
RLIPv2 Fast Scaling of Relational Language-Image Pre-Training
SemARFlow Injecting Semantics into Unsupervised Optical Flow Estimation for Autonomous
Small Object Detection via Coarse-to-fine Proposal Generation and Imitation Learning
HybridAugment++ Unified Frequency Spectra Perturbations for Model Robustness
Achievement-Based Training Progress Balancing for Multi-Task Learning
Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation
EGformer Equirectangular Geometry-biased Transformer for 360 Depth Estimation
SPANet Frequency-balancing Token Mixer using Spectral Pooling Aggregation Modulation
Aggregating Feature Point Cloud for Depth Completion
Bidirectionally Deformable Motion Modulation For Video-based Human Pose Trans
Both Diverse and Realism Matter Physical Attribute and Style Alignment
Chinese Text Recognition with A Pre-Trained CLIP-Like Model Through Image-IDS
Enhancing Non-line-of-sight Imaging via Learnable Inverse Kernel and Attention Mechanisms
FreeDoM Training-Free Energy-Guided Conditional Diffusion Model
GLA-GCN Global-local Adaptive Graph Convolutional Network for 3D Human Pos
HAL3D Hierarchical Active Learning for Fine-Grained 3D Part Labeling
ICD-Face Intra-class Compactness Distillation for Face Recognition
LaPE Layer-adaptive Position Embedding for Vision Transformers with Independent Lay
Long-Term Photometric Consistent Novel View Synthesis with Diffusion Models
Modality Unifying Network for Visible-Infrared Person Re-Identification
Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors
Texture Generation on 3D Meshes with Point-UV Diffusion
Towards High-Fidelity Text-Guided 3D Face Generation and Manipulation Using only
Video State-Changing Object Segmentation
Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an
Boosting Novel Category Discovery Over Domains with Soft Contrastive Learning
The Unreasonable Effectiveness of Large Language-Vision Models for Source-Free Video
Stochastic Segmentation with Conditional Categorical Diffusion Models
Global Balanced Experts for Federated Long-Tailed Learning
MPCViT Searching for Accurate and Efficient MPC-Friendly Vision Transformer with
Parameterized Cost Volume for Stereo Matching
HopFIR Hop-wise GraphFormer with Intragroup Joint Refinement for 3D Human
Masked Autoencoders are Efficient Class Incremental Learners
PEANUT Predicting and Navigating to Unseen Targets
Sigmoid Loss for Language Image Pre-Training
SLAN Self-Locator Aided Network for Vision-Language Understanding
SOAR Scene-debiasing Open-set Action Recognition
Stabilizing Visual Reinforcement Learning via Asymmetric Interactive Cooperation
Towards Generic Image Manipulation Detection with Weakly-Supervised Self-Consistency Learning
3D-Aware Neural Body Fitting for Occlusion Robust 3D Human Pos
Accurate 3D Face Reconstruction with Facial Component Tokens
Adding Conditional Control to Text-to-Image Diffusion Models
A Dynamic Dual-Processing Object Detection Framework Inspired by the Brains
A Simple Framework for Open-Vocabulary Segmentation and Detection
A Simple Vision Transformer for Weakly Semi-supervised 3D Object Detection
Black-Box Unsupervised Domain Adaptation with Bi-Directional Atkinson-Shiffrin Memory
Body Knowledge and Uncertainty Modeling for Monocular 3D Human Body
Boosting Single Image Super-Resolution via Partial Channel Shifting
C2ST Cross-Modal Contextualized Sequence Transduction for Continuous Sign Language Recognition
CoinSeg Contrast Inter- and Intra- Class Representations for Incremental Segmentation
Continual Zero-Shot Learning through Semantically Guided Generative Random Walks
Decoupled DETR Spatially Disentangling Localization and Classification for Improved End-to-En
DeformToon3D Deformable Neural Radiance Fields for 3D Toonification
DETA Denoised Task Adaptation for Few-Shot Learning
DiffCloth Diffusion Based Garment Synthesis and Manipulation via Structural Cross-modal
DMNet Delaunay Meshing Network for 3D Shape Representation
DomainAdaptor A Novel Approach to Test-time Adaptation
DVIS Decoupled Video Instance Segmentation Framework
ESSAformer Efficient Transformer for Hyperspectral Image Super-resolution
Exploring Predicate Visual Context in Detecting of Human-Object Interactions
Exploring Temporal Concurrency for Video-Language Representation Learning
Fcaformer Forward Cross Attention in Hybrid Vision Transform
Flatness-Aware Minimization for Domain Generalization
Foreground Object Search by Distilling Composite Image Featu
Generalizing Event-Based Motion Deblurring in Real-World Scenarios
Generative Gradient Inversion via Over-Parameterized Networks in Federated Learning
GETAvatar Generative Textured Meshes for Animatable Human Avatars
GeT Generative Target Structure Debiasing for Domain Adaptation
GO-SLAM Global Optimization for Consistent 3D Instant Reconstruction
GPFL Simultaneously Learning Global and Personalized Feature Information for Personaliz
Helping Hands An Object-Aware Ego-Centric Video Recognition Model
ITI-GEN Inclusive Text-to-Image Generation
LayoutDiffusion Improving Graphic Layout Generation by Discrete Diffusion Probabilistic Models
Learning in Imperfect Environment Multi-Label Classification with Long-Tailed Distribution an
Learning Neural Implicit Surfaces with Object-Aware Radiance Fields
Learning Rain Location Prior for Nighttime Deraining
Learning Spatial-context-aware Global Visual Feature Representation for Instance Image Retrieval
LiDAR-Camera Panoptic Segmentation via Geometry-Consistent and Semantic-Aware Alignment
Lightweight Image Super-Resolution with Superpixel Token Interaction
LMR A Large-Scale Multi-Reference Dataset for Reference-Based Super-Resolution
MAGI Multi-Annotated Explanation-Guided Learning
MAP Towards Balanced Generalization of IID and OOD through Model-Agnostic
Meta-ZSDETR Zero-shot DETR with Meta-learning
Minimum Latency Deep Online Video Stabilization
MonoDETR Depth-guided Transformer for Monocular 3D Object Detection
MoreauGrad Sparse and Robust Interpretation of Neural Networks via Moreau
Multi-Event Video-Text Retrieval
Multi3DRefer Grounding Text Description to Multiple 3D Objects
Multiple Planar Object Tracking
NeILF++ Inter-Reflectable Light Fields for Geometry and Material Estimation
NeMF Inverse Volume Rendering with Neural Microflake Fiel
OccFormer Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction
OCHID-Fi Occlusion-Robust Hand Pose Estimation in 3D via RF-Vision
Perceptual Artifacts Localization for Image Synthesis Tasks
Pose-Free Neural Radiance Fields via Implicit Pose Regularization
Probabilistic Human Mesh Recovery in 3D Scenes from Egocentric Views
QD-BEV Quantization-aware View-guided Distillation for Multi-view 3D Object Detection
RankMatch Fostering Confidence and Consistency in Learning with Noisy Labels
Reconciling Object-Level and Global-Level Objectives for Long-Tail Detection
ReMoDiffuse Retrieval-Augmented Motion Diffusion Model
Rethinking Mobile Block for Efficient Attention-based Models
Rethinking the Role of Pre-Trained Networks in Source-Free Domain Adaptation
Robust Geometry-Preserving Depth Estimation Using Differentiable Rendering
Robust Mixture-of-Expert Training for Convolutional Neural Networks
SA-BEV Generating Semantic-Aware Bird's-Eye-View Feature for Multi-view 3D Object Detection
SAL-ViT Towards Latency Efficient Private Inference on ViT using Selectiv
Self-supervised Learning of Implicit Shape Representation with Dense Correspondence fo
ShiftNAS Improving One-shot NAS via Probability Shift
Single Depth-image 3D Reflection Symmetry and Shape Prediction
SLCA Slow Learner with Classifier Alignment for Continual Learning on
Surface Extraction from Neural Unsigned Distance Fields
TARGET Federated Class-Continual Learning via Exemplar-Free Distillation
Tiny Updater Towards Efficient Neural Network-Driven Software Updating
Towards Effective Instance Discrimination Contrastive Loss for Unsupervised Domain Adaptation
Towards Fairness-aware Adversarial Network Pruning
Towards General Low-Light Raw Noise Synthesis and Modeling
Toward Multi-Granularity Decision-Making Explicit Visual Reasoning with Hierarchical Knowledg
Toward Unsupervised Realistic Visual Question Answering
TrajPAC Towards Robustness Verification of Pedestrian Trajectory Prediction Models
Uni-3D A Universal Model for Panoptic 3D Scene Reconstruction
Unsupervised Surface Anomaly Detection with Diffusion Probabilistic Model
Weakly-Supervised Text-Driven Contrastive Learning for Facial Behavior Understanding
When Noisy Labels Meet Long Tail Dilemmas A Representation Calibration
NeRFrac Neural Radiance Fields through Refractive Surfac
Ada3D Exploiting the Spatial Redundancy with Adaptive Inference fo
Bring Clipart to Li
Class Prior-Free Positive-Unlabeled Learning with Taylor Variational Loss for Hyperspectral
Cumulative Spatial Knowledge Distillation for Vision Transformers
DDFM Denoising Diffusion Model for Multi-Modality Image Fusion
Divide and Conquer 3D Point Cloud Instance Segmentation With Point-Wis
DOT A Distillation-Oriented Train
Fast Adversarial Training with Smooth Convergenc
Fast Full-frame Video Stabilization with Iterative Optimization
Fully Attentional Networks with Self-emerging Token Labeling
GasMono Geometry-Aided Self-Supervised Monocular Depth Estimation for Indoor Scenes
Generative Prompt Model for Weakly Supervised Object Localization
Human from Blur Human Pose Tracking from Blurry Images
Incremental Generalized Category Discovery
Learning Pseudo-Relations for Cross-domain Semantic Segmentation
Learning Semi-supervised Gaussian Mixture Models for Generalized Category Discovery
Learning Symmetry-Aware Geometry Correspondences for 6D Object Pose Estimation
MagicFusion Boosting Text-to-Image Generation Performance by Fusing Diffusion Models
Masked Retraining Teacher-Student Framework for Domain Adaptive Object Detection
MDCS More Diverse Experts with Consistency Self-distillation for Long-tailed Recognition
Movement Enhancement toward Multi-Scale Video Feature Representation for Temporal Action
MVPSNet Fast Generalizable Multi-view Photometric Stereo
Object-Centric Multiple Object Tracking
RecursiveDet End-to-End Region-Based Recursive Object Detection
Spherical Space Feature Decomposition for Guided Depth Map Super-Resolution
Synthesizing Diverse Human Motions in 3D Indoor Scenes
TextPSG Panoptic Scene Graph Generation from Textual Descriptions
Towards Authentic Face Restoration with Iterative Diffusion Models and Beyon
Unified Visual Relationship Detection with Vision and Language Models
Unleashing Text-to-Image Diffusion Models for Visual Perception
Instance-aware Dynamic Prompt Tuning for Pre-trained Point Cloud Models
CIRI Curricular Inactivation for Residue-aware One-shot Video Inpainting
COOP Decoupling and Coupling of Whole-Body Grasping Pose Generation
Distributed Bundle Adjustment with Block-Based Sparse Matrix Compression for Su
Empowering Low-Light Image Enhancer through Customized Learnable Priors
HaMuCo Hand Pose Estimation via Multiview Collaborative Self-Supervised Learning
Less is More Focus Attention for Efficient DETR
Look at the Neighbor Distortion-aware Unsupervised Domain Adaptation for Panoramic
MRN Multiplexed Routing Network for Incremental Multilingual Text Recognition
Multi-task View Synthesis with Neural Radiance Fields
Online Clustered Codebook
PointOdyssey A Large-Scale Synthetic Dataset for Long-Term Point Tracking
Preventing Zero-Shot Transfer Degradation in Continual Learning of Vision-Language Models
Realistic Full-Body Tracking from Sparse Observations via Joint-Level Modeling
Regularized Mask Tuning Uncovering Hidden Knowledge in Pre-Trained Vision-Language Models
Scalable Multi-Temporal Remote Sensing Change Data Generation via Simulating Stochastic
SimMatchV2 Semi-Supervised Learning with Graph Consistency
Unfolding Framework with Prior of Convolution-Transformer Mixture and Uncertainty Estimation
LivelySpeaker Towards Semantic-Aware Co-Speech Gesture Generation
3D Implicit Transporter for Temporally Consistent Keypoint Discovery
AttT2M Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism
Contrastive Learning Relies More on Spatial Inductive Bias Than Supervis
Improving Equivariance in State-of-the-Art Supervised Depth and Normal Predictors
MMVP Motion-Matrix-Based Video Prediction
3D Neural Embedding Likelihood Probabilistic Inverse Graphics for Robust 6D
BT2 Backward-compatible Training with Basis Transformation
ClothesNet An Information-Rich 3D Garment Model Repository with Simulated Clothes
Communication-efficient Federated Learning with Single-Step Synthetic Features Compressor for Fast
Cross-Modal Translation and Alignment for Survival Analysis
Dataset Quantization
Deep Fusion Transformer Network with Weighted Vector-Wise Keypoints Voting fo
Downstream-agnostic Adversarial Examples
DR-Tune Improving Fine-tuning of Pretrained Visual Models by Distribution Regularization
FF Attack Adversarial Attack against Multiple Object Trackers by Inducing
Gloss-Free Sign Language Translation Improving from Visual-Language Pretraining
HiLo Exploiting High Low Frequency Relations for Unbiased Panoptic Scen
Homeomorphism Alignment for Unsupervised Domain Adaptation
ImbSAM A Closer Look at Sharpness-Aware Minimization in Class-Imbalanced Recognition
Improving Lens Flare Removal with General-Purpose Pipeline and Multiple Light
Learned Image Reasoning Prior Penetrates Deep Unfolding Network for Panchromatic
Learning a More Continuous Zero Level Set in Unsigned Distanc
Learning Correction Filter via Degradation-Adaptive Regression for Blind Single Imag
MatrixVT Efficient Multi-Camera to BEV Transformation for 3D Perception
MSRA-SR Image Super-resolution Transformer with Multi-scale Shared Representation Acquisition
Pre-Training-Free Image Manipulation Localization through Non-Mutually Exclusive Contrastive Learning
ProPainter Improving Propagation and Transformer for Video Inpainting
Rethinking Pose Estimation in Crowds Overcoming the Detection Information Bottleneck
SAMPLING Scene-adaptive Hierarchical Multiplane Images Representation for Novel View Synthesis
SparseMAE Sparse Training Meets Masked Autoencoders
SRFormer Permuted Self-Attention for Single Image Super-Resolution
Two-in-One Depth Bridging the Gap Between Monocular and Binocular Self-Supervis
UniFace Unified Cross-Entropy Loss for Deep Face Recognition
Unsupervised Domain Adaptive Detection with Network Stability Analysis
XNet Wavelet-Based Low and High Frequency Fusion Networks for Fully-
MAS Towards Resource-Efficient Federated Multiple-Task Learning
Video Background Music Generation Dataset Method and Evaluation
3D-VisTA Pre-trained Transformer for 3D Vision and Text Alignment
4D Panoptic Segmentation as Invariant and Equivariant Field Prediction
All-to-Key Attention for Arbitrary Style Trans
A Good Student is Cooperative and Reliable CNN-Transformer Collaborative Learning
BiFF Bi-level Future Fusion with Polyline-based Coordinate for Interactive Trajectory
Boosting Adversarial Transferability via Gradient Relevance Attack
Coarse-to-Fine Learning Compact Discriminative Representation for Single-Stage Image Retrieval
Cross-Modal Orthogonal High-Rank Augmentation for RGB-Event Transformer-Trackers
CTP Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology
EgoObjects A Large-Scale Egocentric Dataset for Fine-Grained Object Understanding
Enhancing Fine-Tuning Based Backdoor Defense with Sharpness-Aware Minimization
Exploring Temporal Frequency Spectrum in Deep Video Deblurring
Frequency-aware GAN for Adversarial Manipulation Generation
H3WB Human3.6M 3D WholeBody Dataset and Benchmark
Improving Generalization of Adversarial Training via Robust Critical Fine-Tuning
Learning Gabor Texture Features for Fine-Grained Recognition
LinkGAN Linking GAN Latents to Pixels for Controllable Image Synthesis
MapPrior Bird's-Eye View Map Layout Estimation with Generative Models
Modeling the Relative Visual Tempo for Self-supervised Skeleton-based Action Recognition
MotionBERT A Unified Perspective on Learning Human Motion Representations
Multi-Label Self-Supervised Learning with Scene Images
Not All Features Matter Enhancing Few-shot CLIP with Adaptive Prio
PointCLIP V2 Prompting CLIP and GPT for Powerful 3D Open-worl
Prompt-aligned Gradient for Prompt Tuning
Rethinking Data Distillation Do Not Overlook Calibration
Scene-Aware Label Graph Learning for Multi-Label Image Classification
SegPrompt Boosting Open-World Segmentation via Category-Level Prompt Learning
Self-Organizing Pathway Expansion for Non-Exemplar Class-Incremental Learning
SVDFormer Complementing Point Cloud via Self-view Augmentation and Self-structure Dual-generato
The Victim and The Beneficiary Exploiting a Poisoned Model to
UMIFormer Mining the Correlations between Similar Tokens for Multi-View 3D
Universal Domain Adaptation via Compressive Attention Matching
Unsupervised Self-Driving Attention Prediction via Uncertainty Mining and Knowledge Embedding
SC3K Self-supervised and Coherent 3D Keypoints Estimation from Rotated Noisy
DETRs with Collaborative Hybrid Assignments Training
Temporal Enhanced Training of Multi-view 3D Object Detector via Historical
RePolyWorld - A Graph Neural Network for Polygonal Scene Parsing
Adaptive Calibrator Ensemble Navigating Test Set Difficulty in Out-of-Distribution Scenarios
Discrepant and Multi-Instance Proxies for Unsupervised Person Re-Identification
Iterative Denoiser and Noise Estimator for Self-Supervised Image Denoising
RawHDR High Dynamic Range Image Reconstruction from a Single Raw
From Chaos Comes Order Ordering Event Representations for Object Recognition
DG3D Generating High Quality 3D Textured Shapes by Learning to
Reconstructing Interacting Hands with Interaction Prior from Monocular Images
LaRS A Diverse Panoptic Maritime Obstacle Detection Dataset and Benchmark
