Urban Vision AI Enters a New Era with K2K®: City-Scale Video Curation Complete for VLM Fine-Tuning Using NVIDIA DGX Cloud
A breakthrough in visual AI at city scale has just been realized.
K2K® is one of the early adopters of the newly announced NVIDIA Omniverse Blueprint for Smart City AI, which it is using to optimize city operations with AI agents and digital twins. The blueprint provides the complete software stack needed to efficiently develop AI agents and digital twins at scale.
K2K® has also successfully launched the curation of a vast urban video dataset to fine-tune next-generation Vision Language Models (VLMs) optimized for the NVIDIA AI Blueprint for video search and summarization (VSS) workloads. This milestone combines cutting-edge data curation infrastructure with powerful AI training services on NVIDIA DGX Cloud.
The curated data covers city life across a full set of diverse conditions: traffic density, lighting variations, weather events, and pedestrian flow, reflecting the rich complexity of real-world urban dynamics. The initiative lays the groundwork for intelligent systems capable of perceiving, reasoning, and interacting with the physical environment in real time.
From Raw Video to Real Intelligence
The core challenge in building high-performance vision AI agents lies not only in computing power, but in the quality and scale of AI training data available. This project addresses that by curating city-scale video footage using advanced pipelines designed for regulatory compliance, privacy-preserving transformations, and multimodal annotation.
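As a rough illustration of how such a curation pipeline can be staged, the sketch below chains a compliance check, a privacy-preserving transformation, and an annotation step over clip metadata. It is a minimal sketch, not K2K®'s actual pipeline: the Clip fields, the stage functions, and the curate helper are all hypothetical names introduced for this example, and each stage is a placeholder for the real processing.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Clip:
    """Hypothetical metadata record for one video clip (illustration only)."""
    clip_id: str
    cleared_for_use: bool            # assumed flag set by a compliance review
    faces_blurred: bool = False      # assumed marker for the privacy transform
    tags: Tuple[str, ...] = ()       # assumed slot for annotation labels


# A stage returns the (possibly updated) clip, or None to drop it.
Stage = Callable[[Clip], Optional[Clip]]


def compliance_filter(clip: Clip) -> Optional[Clip]:
    # Drop clips that have not been cleared; keep the rest unchanged.
    return clip if clip.cleared_for_use else None


def anonymize(clip: Clip) -> Optional[Clip]:
    # Placeholder for a privacy-preserving transformation such as face blurring.
    return replace(clip, faces_blurred=True)


def annotate(clip: Clip) -> Optional[Clip]:
    # Placeholder for multimodal annotation; here it only appends a marker tag.
    return replace(clip, tags=clip.tags + ("annotated",))


def curate(clips: Iterable[Clip], stages: List[Stage]) -> List[Clip]:
    """Run every clip through the stages in order, dropping rejected clips."""
    curated: List[Clip] = []
    for clip in clips:
        current: Optional[Clip] = clip
        for stage in stages:
            current = stage(current)
            if current is None:
                break  # a stage rejected the clip; skip the remaining stages
        if current is not None:
            curated.append(current)
    return curated
```

Placing the compliance check first means rejected footage never reaches the transformation or annotation stages, which is one reasonable ordering under these assumptions.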
Once curated, the video data is processed in DGX Cloud using fine-tuning microservices optimized for NVIDIA AI Blueprints. This stack accelerates the development of highly interactive, perceptive visual AI agents capable of powering a new class of smart city services, from real-time traffic summarization to multimodal urban analytics built on K2K®’s neural network architectures.
Training Vision AI Agents for the Real World
This work enables a new generation of VLMs fine-tuned specifically for dense, live video environments. These models are built for low-latency video understanding and retrieval, which are key to real-time applications such as live camera search, scene summarization, and generative event reporting.
Aligned with the NVIDIA AI Blueprint for Video Search and Summarization (VSS), the project demonstrates how purpose-trained VLMs can serve as foundational components for digital agents operating across complex, dynamic physical spaces. Whether deployed at intersections, transportation hubs, or public venues, these agents are capable of answering questions, generating summaries, and surfacing critical information directly from video feeds.
DGX Cloud: A Platform for Scalable Urban AI
The initiative showcases the synergy of high-performance AI infrastructure and modular data engineering. DGX Cloud streamlined model fine-tuning with scalability and speed, eliminating the traditional bottlenecks of compute-bound training cycles. NVIDIA provides the backbone for compliant, high-throughput video ingestion, enrichment, and labeling, accelerating the path from raw footage to high-value AI-ready data.
A Foundation for the Next Phase of Urban Intelligence
This effort is the first of several strategic steps aimed at operationalizing sovereign AI for cities: neural network architectures that are not only powerful and accurate, but also auditable, explainable, and aligned with public values.
The NVIDIA Omniverse Blueprint for Smart City AI provides a framework that focuses on training AI to perceive and respond in complex scenarios, testing the AI in a digital twin, and deploying the AI through AI agents. Next, K2K® plans to explore the use of digital twins for simulating and testing AI agents before deployment.
The fine-tuned VLM emerging from this pipeline is designed to support a broad range of use cases, from traffic flow optimization to public safety insights, infrastructure resilience, and emergency response.
By abstracting visual reasoning into adaptable services, this initiative empowers both public and private stakeholders to unlock the full potential of their video assets in full compliance, responsibly, and at scale.
