Makora · Neocloud · generated Jun 27, 2026 · 1h

Makora analysis

Thesis

Makora is a performance-engineering organization focused on automated GPU kernel generation and inference optimization. Its public surface spans four categories: (1) an AI-driven kernel generation system (MakoraGenerate) that produces optimized GPU kernels targeting NVIDIA H100/B200, AMD MI300X, and Tenstorrent hardware P1W1; (2) a lightweight multi-vendor GPU querying utility (gpuq) supporting CUDA and HIP runtimes P2; (3) a Mixture-of-Experts kernel project (flash-moe) aimed at overlapping expert computation with inter-GPU communication on AMD MI300X P4; and (4) an inference serving business whose endpoints achieved five first-place positions on Artificial Analysis third-party benchmarks across DeepSeek V4, Qwen3.6, and Llama 3.3 model families W3. The organization frames its work through an AI-Driven Research for Systems (ADRS) lens, emphasizing agentic search for optimization algorithms W5, and has publicly discussed novel inference algorithms including sequential Monte Carlo speculative decoding W2. The evidence signals a dual-track strategy: building agent-based tooling for kernel generation while operating competitive inference infrastructure that validates those tools in production.

Signal desks

Hiring

No open job listings or formal hiring signals are present in this evidence pack. The only named personnel appear in an inference benchmark announcement crediting the performance engineering team: Noushin Azami, Tripp Lyons, Yahya Emara, Paweł Kopeć, Wojciech Paluch, Kajetan Kruczkowski, Essam Wisam, and Cătălin M. W3. Code contributors across releases include @vaenyr (gpuq, makora) and @1y33 (makora) P6P7P9P10P11P12. No role descriptions, locations, or team structures can be inferred beyond the existence of a "performance engineering team" W3.

Forks

No fork activity is evidenced in this pack. All six Makora repositories are original (not forks): kernels P1, gpuq P2, aiagent_playground P3, flash-moe P4, mako-generate-agent-playground P5, and makora P13. The flash-moe implementation is conceptually derived from arXiv paper 2506.04667 but is not a fork P4.

Releases

  • gpuq has a sustained release cadence: v1.3.0 E16, v1.4.2 E14, v1.5.0 E13, v1.5.1 (mock device naming) P7, v1.5.2 (AMD ROCm 7 nameless device fix) P6, v1.5.3 (typo fix) P10, v1.5.4 (empty VISIBLE_DEVICES fix) P9, and v1.5.5 (CUDA 13 support) P12. The rapid Feb 2026 burst of four patch releases (v1.5.2–v1.5.5) within the same day indicates an active compatibility sprint P6P9P10P12.
  • makora CLI shipped v1.0.3 (initial release, Feb 2026) P8E11 and v1.0.4 (colored typing, package update, Mar 2026) P11E1. The CLI provides subcommands for generate, jobs, kernels, check, profile, evaluate, and expert-generate P13.
  • kernels repo was last pushed May 2026 P1 and flash-moe was last pushed Jan 2026 with status "last day of active development (28.01.2026)" P4.

Talking

  • Kernel generation performance: A GTC talk (May 2026) framed MakoraGenerate as producing CUDA kernels that beat hand-tuned code, with discussion of fine-tuning and specializing models as a lower-cost alternative to large foundation models W1.
  • Inference benchmarks: A LinkedIn post (Jun 2026) announced five first-place positions on Artificial Analysis benchmarks, with 14 total submissions, naming the performance engineering team W3.
  • Speculative decoding: A SemiAnalysis feature (Jun 2026) detailed Makora's sequential Monte Carlo speculative decoding algorithm, which keeps multiple draft tokens alive in parallel instead of rewinding on mismatches W2.
  • Agent memory management: An ADRS blog post (Jun 2026) described Makora's approach to GPU kernel generation agents, arguing that agent memory must act like a strict cache rather than an unbounded notebook to avoid context noise W5.
  • Research coverage: Hugging Face Daily Papers included Makora's GPU kernel generation work, noting results outperforming Torch with speedups of 4.8× and 21.8× W4.
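The "strict cache rather than an unbounded notebook" idea from the agent memory item above can be illustrated with a small bounded store. This is a minimal sketch under stated assumptions, not Makora's implementation: the source does not say which eviction policy the agents use, so least-recently-used eviction is swapped in here as one plausible choice, and the class name, method names, and note contents are all hypothetical.

```python
from collections import OrderedDict


class BoundedAgentMemory:
    """Hypothetical fixed-capacity note store with LRU eviction."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # Insertion order doubles as recency order: oldest entry first.
        self._notes = OrderedDict()

    def remember(self, key, note):
        """Store or overwrite a note, evicting the least recently used."""
        if key in self._notes:
            self._notes.move_to_end(key)
        self._notes[key] = note
        while len(self._notes) > self.capacity:
            self._notes.popitem(last=False)  # drop the oldest entry

    def recall(self, key):
        """Return a note (or None) and mark it as recently used."""
        note = self._notes.get(key)
        if note is not None:
            self._notes.move_to_end(key)
        return note

    def context(self):
        """Notes that would be placed in the agent's prompt, oldest first."""
        return list(self._notes.values())


# Illustrative notes only; they are not taken from the source.
memory = BoundedAgentMemory(capacity=2)
memory.remember("tile_size", "128x128 tiles spilled registers")
memory.remember("vector_loads", "float4 loads improved bandwidth")
memory.recall("tile_size")  # refreshes this note's recency
memory.remember("unroll", "unroll factor 8 regressed")
print(memory.context())
```

With a capacity of two, the third note forces out whichever entry was touched least recently, so the prompt context stays a fixed size no matter how many optimization iterations the agent runs.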

Shipping

Makora ships through three primary artifact channels:

1. Python packages: makora CLI distributed via PyPI (pip install makora) with login-gated access to the MakoraGenerate API P13. The CLI exposes kernel generation, benchmarking, profiling, and evaluation workflows as subcommands P13.
2. Open-source GPU utilities: gpuq is MIT-licensed and installable as a lightweight Python library with zero build-time dependencies, supporting CUDA and HIP runtimes simultaneously P2. The kernels repository is Apache-2.0 licensed with auto-generated kernels for H100, B200, MI300X, and Tenstorrent targets P1.
3. Inference endpoints: A hosted inference service at app.makora.com W3 serving models including DeepSeek V4 Pro/Flash, Qwen3.6 (35B, 27B), and Llama 3.3 70B, validated through third-party benchmarks W3.

The flash-moe project reached proof-of-concept status for Qwen3 MoE on MI300X with vLLM integration by January 2026 but was marked as concluded P4. The aiagent_playground and mako-generate-agent-playground repos appear to be internal tooling or demonstration projects with minimal public traction (0–1 stars) P3P5.

Research themes

Evidence points to three active research directions:

  • Automated kernel generation via agents: MakoraGenerate uses LLM-based agents to produce optimized GPU kernels, with published results showing 4.8–21.8× speedups over Torch baselines W4. The system targets multiple hardware backends (NVIDIA H100, B200, AMD MI300X, Tenstorrent) P1. Research attention is focused on memory management for the generation agent itself, treating context as a cache to avoid noise in iterative optimization W5.
  • Speculative decoding algorithms: Sequential Monte Carlo speculative decoding maintains N parallel draft hypotheses instead of rewinding on mismatches, targeting inference latency reduction W2. This is positioned as a novel inference algorithm distinct from standard speculative decoding W2.
  • Mixture-of-Experts kernel optimization: The flash-moe project explored overlapping gate computation, expert computation, and inter-GPU communication in a single async kernel for the decode phase, using ROCSHMEM for device-to-device communication on AMD MI300X P4. The project was scoped as a proof-of-concept for Qwen3 MoE and concluded in January 2026 P4.
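The speculative decoding item above is described only at a high level (several draft hypotheses kept alive instead of rewinding on a mismatch). As a rough illustration of what a sequential Monte Carlo loop over token sequences looks like in general, here is a textbook-style toy: each particle is extended with a token from a cheap draft distribution, reweighted by the target-to-draft probability ratio, and the population is resampled when the effective sample size drops. This is not Makora's algorithm. The vocabulary, both probability functions, and the resampling threshold are hypothetical stand-ins chosen for the sketch.

```python
import random

VOCAB = ["a", "b", "c"]  # hypothetical three-token vocabulary


def draft_probs(prefix):
    """Hypothetical cheap draft model: uniform over the vocabulary."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}


def target_probs(prefix):
    """Hypothetical target model: prefers repeating the last token."""
    if not prefix:
        return {"a": 0.6, "b": 0.3, "c": 0.1}
    last = prefix[-1]
    return {tok: (0.8 if tok == last else 0.1) for tok in VOCAB}


def effective_sample_size(weights):
    """Standard ESS estimate: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def smc_draft(num_particles, num_steps, rng):
    """Grow num_particles token sequences in parallel with reweighting."""
    particles = [[] for _ in range(num_particles)]
    weights = [1.0] * num_particles
    for _ in range(num_steps):
        for i, prefix in enumerate(particles):
            q = draft_probs(prefix)
            p = target_probs(prefix)
            tok = rng.choices(VOCAB, weights=[q[t] for t in VOCAB])[0]
            particles[i] = prefix + [tok]
            # Importance weight update: target probability over draft.
            weights[i] *= p[tok] / q[tok]
        # Resample when weight mass concentrates on a few particles.
        if effective_sample_size(weights) < num_particles / 2:
            chosen = rng.choices(
                range(num_particles), weights=weights, k=num_particles
            )
            particles = [list(particles[j]) for j in chosen]
            weights = [1.0] * num_particles  # reset to uniform weights
    return particles, weights


rng = random.Random(0)
particles, weights = smc_draft(num_particles=8, num_steps=5, rng=rng)
best = max(range(len(particles)), key=lambda i: weights[i])
print("".join(particles[best]), weights[best])
```

The point of contrast with rewind-style speculative decoding is that a low-probability draft token does not discard work: the particle carrying it is down-weighted and may be replaced at the next resampling step, while the other particles continue.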

Hiring & scaling

No formal hiring evidence exists in this pack. The organization's public scaling signals are instead product-driven:

  • The Feb 2026 burst of gpuq releases (v1.5.2–v1.5.5, all on the same day) addressing AMD ROCm 7 bugs and adding CUDA 13 support suggests active compatibility engineering to maintain multi-vendor coverage P6P9P10P12.
  • The makora CLI launch (v1.0.3, Feb 2026) and follow-up (v1.0.4, Mar 2026) indicate a productization push for the kernel generation service, moving from playground scripts P5 to a packaged CLI with authentication, job management, and hardware profiling P13.
  • The inference benchmark campaign with 14 submissions across multiple model families W3 suggests dedicated performance engineering capacity, though team size and hiring plans cannot be estimated from available evidence.
  • Only two contributors (@vaenyr, @1y33) appear across all release activity P6P7P9P10P11P12, and the benchmark announcement names eight individuals W3. Together these give a lower-bound signal on team composition but no growth trajectory.

Category implications

Infrastructure: Makora's multi-vendor GPU strategy — spanning NVIDIA (CUDA 13 P12, H100, B200 P1), AMD (ROCm 7, HIP, MI300X P2P4P6), and Tenstorrent P1 — implies investment in hardware-portable optimization tooling rather than single-ecosystem lock-in. The gpuq library's zero-dependency design and soft runtime requirements P2 suggest infrastructure meant to run broadly across heterogeneous clusters and CI environments. This multi-vendor posture has strategic implications for organizations managing mixed GPU fleets or evaluating hardware alternatives.

Product: The makora CLI represents a commercialization path for the kernel generation research: an API-gated service with token-based authentication and remote hardware profiling/evaluation capabilities P13. The progression from shell-script playground P5 to packaged CLI P13 to publicly benchmarked inference endpoints W3 suggests a product funnel from developer tooling to managed inference services, with kernel generation quality serving as the shared technical moat.

Research: Makora's ADRS framing W5 positions agent-driven systems optimization as a research paradigm, not just a product feature. The sequential Monte Carlo speculative decoding work W2 extends this beyond kernel generation into inference algorithms. The combination of automated kernel generation research W1W4 with production inference benchmarking W3 suggests a research-to-production pipeline where algorithmic advances can be validated in competitive third-party benchmarks.

GTM: The inference benchmark results, which claim #1 positions against GPU providers and draw a specific comparison to Groq and SambaNova for the Llama workload W3, serve as GTM validation for the inference product. The free trial offer at app.makora.com W3 indicates a self-serve adoption model. The GTC talk W1 and SemiAnalysis feature W2 target technical credibility with the developer and researcher audience rather than broad enterprise marketing.

Hiring: Without formal job listings, the hiring implication is inferred from capability signals. The need for contributors spanning CUDA, ROCm/HIP, and Tenstorrent backends P1P2, combined with kernel-level C++ development P4 and LLM-based agent systems W5, suggests a team requiring deep compiler/kernel expertise alongside ML systems engineering. The single-contributor pattern on most releases P6P7P9P10P12 may indicate either a lean team or underinvestment in open-source tooling relative to the proprietary inference service.

Traction highlights

  • Benchmarks: Five #1 positions on Artificial Analysis across DeepSeek V4 Pro, DeepSeek V4 Flash, Qwen3.6 35B, Qwen3.6 27B, and Llama 3.3 70B, with 14 total submissions W3.
  • Kernel performance: 4.8× and 21.8× speedups over Torch baselines reported in Hugging Face Daily Papers coverage W4.
  • Research attention: Featured in SemiAnalysis for sequential Monte Carlo speculative decoding W2 and in the ADRS blog series for agentic memory management W5. GTC talk on kernel generation that beats hand-tuned code W1.
  • Repository metrics: Modest open-source traction — kernels (13 stars, 2 forks) P1, gpuq (13 stars) P2, makora CLI (8 stars) P13, flash-moe (1 star) P4, aiagent_playground (1 star) P3, mako-generate-agent-playground (0 stars) P5.
  • Release velocity: gpuq has seen releases from at least v1.3.0 (Jun 2025) through v1.5.5 (Feb 2026) E7E16, with active compatibility maintenance for CUDA 13 and ROCm 7 P6P12.