A Diabettech study tested whether AI can count carbohydrates. The short answer: not reliably enough to trust with insulin dosing. Researchers sent 13 food photos to four models (GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Pro, Gemini 3.1 Pro Preview) across 26,904 queries: same photos, same prompts, lowest randomness settings. Claude showed the least run-to-run variation, at a 2.4% median; Gemini 2.5 Pro showed the most, at 11%.

The paella photo is where it gets scary. Gemini 2.5 Pro's estimates for that single image ranged from 55g to 484g of carbohydrates, a spread that translates to a 42.9-unit swing in insulin dose. Potentially fatal. And in practice you take one photo and get one number, with no way to know whether it's an outlier.

Then there's the "precisely wrong" problem. Claude estimated a 40g cheese sandwich at 28g across all 510 queries. Consistent, yes, but consistently wrong, underdosing by about 1.2 units of insulin every time.

The models misidentified foods too. Claude called a Bakewell tart a "Linzer torte" in every single query; GPT-5.4 called it a "jam tart" or "cake bar." Wrong names led to wrong estimates.

Confidence scores made things worse. Claude's confidence showed no positive correlation with accuracy; if anything, higher confidence meant lower accuracy.

These tools sit in a regulatory gray zone. FDA-approved diabetes devices require "locked" algorithms, and even robot companions like PARO go through rigorous testing. General-purpose AI models face no such scrutiny, yet people use them for dosing decisions anyway. One number, no second opinion.
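The arithmetic behind these swings is simple bolus math. A minimal sketch, assuming the standard insulin-to-carb ratio formula with a 1:10 ratio (10g of carbs per unit of insulin), which matches the figures above; the ratio and function names are illustrative, not from the study:

```python
def bolus_units(carbs_g: float, icr_g_per_unit: float = 10.0) -> float:
    """Meal bolus: grams of carbohydrate divided by the insulin-to-carb ratio."""
    return round(carbs_g / icr_g_per_unit, 1)

# Gemini 2.5 Pro's paella estimates for one photo spanned 55g to 484g:
swing = round(bolus_units(484) - bolus_units(55), 1)   # 42.9-unit dose swing

# Claude's "precisely wrong" cheese sandwich: 28g estimated vs 40g actual:
underdose = round(bolus_units(40) - bolus_units(28), 1)  # 1.2 units short, every time
```

The same formula shows why consistency alone isn't safety: a biased estimate produces the same wrong dose on every query.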