Apple's CoreML framework gives developers access to the Neural Engine for one thing: inference. Training is officially off-limits, funneled instead toward cloud services or GPU-based tooling. NeuralForge, released on GitHub by developer Khaeldur, takes a different view. It builds on maderix/ANE — a separate research project that reverse-engineers Apple's private `_ANEClient` and `_ANECompiler` APIs — to unlock backpropagation directly on ANE hardware, a capability Apple has never exposed through official channels. The result is a SwiftUI-wrapped fine-tuning tool where model training stays entirely on the device.

The architectural foundation is where NeuralForge gets interesting. Training doesn't run purely on ANE — the forward and backward passes are split between ANE and CPU via Apple's Accelerate framework, with element-wise operations largely falling back to CPU because the reverse-engineered ANE access can't yet handle them natively. LoRA (Low-Rank Adaptation) is what makes this hybrid approach viable for actual LLMs: by training only a small set of adapter weights rather than full model parameters, NeuralForge keeps memory pressure manageable on hardware that was never intended for this workload. INT8 quantization (W8A8) delivers a 1.88x throughput improvement on M4, and an MLX Metal GPU backend offers an alternative compute path for models the ANE pipeline can't handle. Export targets — GGUF, CoreML, llama2c — plug directly into downstream tools like llama.cpp.
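The LoRA arithmetic behind that memory saving is worth seeing concretely. The sketch below is illustrative only, not NeuralForge's actual code: it shows the standard LoRA formulation, where a frozen weight matrix W is augmented by two small trainable adapters A and B of rank r, so the trainable parameter count drops from d_in x d_out to r x (d_in + d_out). The function and variable names are invented for this example.

```python
# Illustrative LoRA sketch (not NeuralForge's implementation): train only
# the low-rank adapters A (r x d_in) and B (d_out x r); W stays frozen.

def matmul(X, Y):
    """Naive matrix multiply over row-major lists, for illustration."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(W, A, B, x, alpha=16, r=2):
    """y = (W + (alpha / r) * B @ A) @ x, without materializing the sum."""
    scale = alpha / r
    base = matmul(W, x)                 # frozen base-model path
    adapter = matmul(B, matmul(A, x))   # low-rank trainable path
    return [[b[0] + scale * a[0]] for b, a in zip(base, adapter)]

# Parameter count for one 512x512 layer at rank 2:
d_in, d_out, r = 512, 512, 2
full_params = d_in * d_out           # 262,144 weights to train without LoRA
lora_params = r * (d_in + d_out)     # 2,048 weights with LoRA (~0.8%)
```

That roughly 100x reduction in trainable (and gradient-carrying) state is what keeps the backward pass within reach of hardware that was never provisioned for training.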

The benchmarks are functional rather than fast. The 110M-parameter Stories110M model runs at 91ms per training step; Qwen3-0.6B at 412ms. ANE utilization sits at roughly 5–9% of the M4's theoretical 15.8 TFLOPS FP16 peak — the maderix/ANE author states this plainly rather than papering over it. For production-scale training of large models, this isn't the tool. For fine-tuning a small model locally without touching a cloud API, the numbers are at least usable.
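A quick back-of-envelope calculation puts those utilization figures in perspective. The 15.8 TFLOPS peak and the 5–9% range come from the article; the effective-throughput and step-rate numbers below are simple arithmetic derived from them.

```python
# Effective throughput implied by 5-9% utilization of a 15.8 TFLOPS peak.
peak_tflops = 15.8
effective_low = peak_tflops * 0.05    # ~0.79 TFLOPS actually sustained
effective_high = peak_tflops * 0.09   # ~1.42 TFLOPS

# Step rates implied by the quoted per-step latencies.
stories110m_steps_per_sec = 1000 / 91    # ~11 steps/s for Stories110M
qwen3_steps_per_sec = 1000 / 412         # ~2.4 steps/s for Qwen3-0.6B
```

In other words, the hardware is running at well under a tenth of its nominal capability, yet a small model still completes on the order of ten optimizer steps per second — slow by datacenter standards, workable for a local fine-tune.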

The utilization ceiling is primarily a function of how much of the ANE's instruction set the reverse-engineering effort has successfully mapped. The more element-wise operations that can migrate off CPU and onto ANE, the higher that 5–9% figure can climb. maderix/ANE is the upstream dependency that matters: broader ANE instruction coverage there translates directly into better performance in NeuralForge. With 356 unit tests and end-to-end XCUITest coverage already in place, the app is built to absorb those improvements as they arrive — the question is how far the reverse-engineering can get before Apple's next silicon generation resets the work.
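The dependency between upstream instruction coverage and downstream performance can be pictured as a simple per-op dispatch. This is a hypothetical sketch, not NeuralForge's actual dispatcher: the op names and the coverage set are invented, but the structure — route an op to ANE only if the reverse-engineered coverage supports it, otherwise fall back to CPU — matches the hybrid split the article describes.

```python
# Hypothetical backend dispatch: as reverse-engineering maps more of the
# ANE instruction set, ops migrate out of the CPU-fallback bucket.

ANE_SUPPORTED = {"matmul", "conv2d"}   # invented coverage set for illustration

def dispatch(graph):
    """Assign each op a backend; uncovered (e.g. element-wise) ops go to CPU."""
    return [(op, "ane" if op in ANE_SUPPORTED else "cpu") for op in graph]

plan = dispatch(["matmul", "gelu", "add", "matmul"])
# gelu and add land on CPU here; adding them to ANE_SUPPORTED upstream
# would move them to ANE with no change to the dispatcher itself.
```

Under this framing, every new instruction maderix/ANE decodes shrinks the CPU bucket for free, which is why the app's test coverage matters: it lets those upstream gains be absorbed without re-validating the whole pipeline by hand.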