A Diabettech study tested whether AI can count carbohydrates. Short answer: not reliably enough to trust with insulin dosing. Researchers sent 13 food photos to four models (GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Pro, Gemini 3.1 Pro Preview) across 26,904 queries: same photos, same prompts, lowest randomness settings. Claude showed the least run-to-run variation at a 2.4% median; Gemini 2.5 Pro was worst at 11%.

The paella photo is where it gets scary. Gemini 2.5 Pro's estimates for that single image ranged from 55g to 484g of carbohydrates, a 42.9-unit insulin swing. Potentially fatal. You take a photo, you get one number, and you have no way to know whether it's an outlier.

Then there's the "precisely wrong" problem. Claude estimated a 40g cheese sandwich at 28g across all 510 queries. Consistent, yes. Also consistently wrong, underdosing by about 1.2 units of insulin every time.

Models misidentified foods too. Claude called a Bakewell tart a "Linzer torte" in every single query; GPT-5.4 called it a "jam tart" or "cake bar." Wrong names, wrong estimates.

Confidence scores made things worse. Claude's confidence had zero correlation with its accuracy; higher confidence actually meant lower accuracy.

These tools sit in a regulatory gray zone. FDA-approved diabetes devices require "locked" algorithms, and even robot companions like PARO go through rigorous testing. People use these chatbots for dosing decisions anyway. One number, no second opinion.
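The dose numbers above follow from standard carb-counting arithmetic: grams of carbohydrate divided by an insulin-to-carb ratio (ICR). A minimal sketch, assuming a hypothetical ICR of 10 g per unit (real ratios are individual and prescribed per patient):

```python
# Illustration of how carb-estimate spread turns into insulin dose error.
# The ICR of 10 g/unit is an assumption for round numbers only.
ICR_G_PER_UNIT = 10

def bolus_units(carbs_g: float, icr: float = ICR_G_PER_UNIT) -> float:
    """Simple carb-counting bolus: grams of carbohydrate / ICR."""
    return carbs_g / icr

# Gemini 2.5 Pro's paella estimates ranged from 55g to 484g:
swing = bolus_units(484) - bolus_units(55)
print(f"dose swing: {swing:.1f} units")      # dose swing: 42.9 units

# Claude's cheese sandwich: 28g estimated vs ~40g actual:
underdose = bolus_units(40) - bolus_units(28)
print(f"underdose: {underdose:.1f} units")   # underdose: 1.2 units
```

With a larger ICR the absolute error shrinks, but the paella spread stays dangerous at any realistic ratio.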
27,000 Food Photos Later, AI Still Can't Count Carbs Reliably
A study submitted 13 food photographs to four AI models (GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Pro, Gemini 3.1 Pro Preview) for carbohydrate estimation, 26,904 queries in total. Results showed significant run-to-run inconsistency: Claude had the lowest variation (2.4% median) while Gemini 2.5 Pro had the highest (11% median). The variations were large enough to cause dangerous insulin dosing errors; in Gemini 2.5 Pro's worst case, estimates for a single photo implied a 42.9-unit insulin swing. Model confidence scores showed zero or negative correlation with actual accuracy.
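The headline variation figures can be made concrete. The study's exact statistic isn't specified here; a common choice for percent variation across repeated estimates is the coefficient of variation (standard deviation over mean, as a percent). A sketch with invented estimate data:

```python
# Coefficient of variation as one plausible "percent variation" metric.
# The estimate lists below are invented for illustration, not study data.
from statistics import mean, stdev

def percent_variation(estimates: list[float]) -> float:
    """CV as a percent: 100 * sample std dev / mean."""
    return 100 * stdev(estimates) / mean(estimates)

tight = [48, 49, 50, 50, 51]   # a consistent model's repeated estimates
loose = [30, 45, 60, 75, 90]   # an inconsistent model's estimates

print(f"tight: {percent_variation(tight):.1f}%")  # tight: 2.3%
print(f"loose: {percent_variation(loose):.1f}%")  # loose: 39.5%
```

A single query tells you nothing about which of these regimes you are in, which is the core problem the study identifies.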