H100 is a topic tracked in our intelligence system with 5 linked articles.
Kog claims that real-time LLM inference on standard datacenter GPUs can reach about 3,000 tokens/s per request on a 2B model by co-designing a monokernel runtime, GPU code, and a Laneformer architecture, and that the approach scales toward frontier MoEs as memory bandwidth grows.
A multi-article TechCrunch digest highlights mega AI funding, open-social/content interoperability, emerging AI-token futures markets, startup milestone metrics, and notable cybersecurity/regulatory risk signals.
GPU matmul throughput is driven more by power constraints and input data patterns than by theoretical peak compute: zero-valued inputs can sustain higher FLOPS because fewer transistors switch. CUTLASS shows gains over cuBLAS in profiler benchmarks, but real-world results depend on the framework, and power-limited performance lands far below marketed peaks.
A technical blog post shows a 16% throughput improvement and a ~11% reduction in end-to-end latency for multimodal inference in SGLang, achieved by caching CUDA IPC pool handles in a Python dict to reduce host-side overhead.
Anthropic inks a compute deal with SpaceXAI to access Colossus 1’s ~220,000 Nvidia GPUs and ~300 MW capacity in Memphis, as SpaceXAI eyes an IPO and orbital compute ambitions, all amid regulatory and environmental scrutiny and large cloud-spend implications.
Mistral launches Devstral 2 (123B) and Devstral Small 2 (24B) with open-source licenses, strong SWE-bench benchmarks, cost-efficiency, and a new Vibe CLI, plus detailed deployment and pricing information.