A new GPU-sharing platform called sllm.cloud is trying to solve a problem every developer faces: running big models costs a fortune. Their pitch is simple: join a cohort, split the cost of a GPU node with other developers, and get unlimited tokens. The service offers access to hefty models like Llama 4 Scout (109B), Qwen 3.5 (122B), GLM 5 (754B), Kimi K2.5 (1T), and DeepSeek variants. Pricing ranges from $10 for one month to $40 for three months, with throughput options between 15 and 35 tokens per second.

The billing model is unusual. You save a card via Stripe, but you're only charged when your cohort fills up and becomes active. That sounds great until you spot the catch: if nobody else joins your cohort, you're waiting indefinitely. The bigger question is what happens when the cohort actually runs. Hacker News commenters quickly flagged the classic "noisy neighbor" problem. If someone in your cohort decides to run a heavy 24/7 job, your time to first token (TTFT) and overall throughput could tank. vLLM's continuous batching helps, but there are physical limits to shared VRAM and compute that no scheduler can magically fix.
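The arithmetic behind the noisy-neighbor worry is easy to sketch. Assume, as a simplification, that the node's aggregate decode throughput gets split roughly evenly across concurrent requests under continuous batching (real vLLM scheduling is more nuanced, and the 35 tok/s figure is the platform's advertised top tier, not a measured number):

```python
AGGREGATE_TPS = 35.0  # advertised top-tier tokens/sec for the node (assumption)

def per_user_tps(active_requests: int, aggregate_tps: float = AGGREGATE_TPS) -> float:
    """Naive model: decode throughput splits evenly across active requests."""
    if active_requests <= 0:
        raise ValueError("need at least one active request")
    return aggregate_tps / active_requests

# Alone on the node, you see the full rate:
print(per_user_tps(1))   # 35.0
# One neighbor running a 24/7 job halves it:
print(per_user_tps(2))   # 17.5
# Four steady heavy jobs leave everyone under 9 tok/s:
print(per_user_tps(4))   # 8.75
```

Even this toy model shows why "unlimited tokens" and "shared node" pull in opposite directions: one always-on tenant permanently halves everyone else's ceiling.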

There's also the fairness question. Users who need occasional large queries compete for the same resources as those running steady smaller requests. Which usage pattern should the system prioritize? And how do you even measure fairness in a shared GPU environment? These aren't theoretical concerns. They're the difference between a responsive API and one that times out. Runfra has already moved toward a credit-per-task model on idle GPUs specifically to avoid these contention issues. AWS, meanwhile, offers predictable access at higher prices. sllm is betting developers will accept uncertainty in exchange for lower costs. Maybe they will. But sharing infrastructure with strangers always sounds better on paper than it feels when someone else's workload slows yours to a crawl.
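Fairness in a shared node is at least measurable. A standard metric for this kind of question is Jain's fairness index, which maps a set of per-tenant throughputs to a score between 1/n (one tenant gets everything) and 1.0 (perfectly equal shares). A minimal sketch, with illustrative throughput numbers that are assumptions, not measurements from sllm:

```python
def jains_index(throughputs):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in (0, 1]."""
    n = len(throughputs)
    total = sum(throughputs)
    sum_sq = sum(x * x for x in throughputs)
    if sum_sq == 0:
        return 1.0  # no one is getting anything; trivially "fair"
    return (total * total) / (n * sum_sq)

# Equal shares score a perfect 1.0:
print(jains_index([8.75, 8.75, 8.75, 8.75]))          # 1.0
# One tenant hogging the node drags the index down:
print(round(jains_index([30.0, 2.0, 2.0, 1.0]), 2))   # 0.34
```

A scheduler could use a metric like this to decide when to throttle a heavy tenant, but that's exactly the kind of policy sllm hasn't publicly specified.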