Google Inside: Apple’s AI strategy for iPhone is a brave move


What’s strategy?

It isn’t just about saying NO (as a lot of pundits will tell you) – it’s largely about knowing your strengths, sticking to them, and saying NO to everything else, no matter the opportunity cost.

Apple is in talks with Google to use Google’s Gemini AI models to power some new features coming to the iPhone software this year (via).

The news isn’t confirmed yet, but if it is true, I think this is a brave decision and Apple deserves a shout-out for being true to its mission.

Search has never been core to Apple, and we have all seen how badly companies get trolled when things go wrong with their GenAI implementations.

LLMs, beyond a point, are a biatch. They are good to have from an infrastructure point of view, but if you are a big brand opening up your LLM implementation to the public, your brand and your reputation are at stake.

Google is going through this right now.

Apple doesn’t want to trade its brand and get trolled for something it isn’t good at.

Apple will instead focus on running LLMs inside your phone, using flash memory as a viable way to store them*. Apple will use your data to hyper-personalize your experience, and in all probability it won’t go visibly wrong or hallucinate in embarrassing ways. And Apple will leave everything else to vendors like Google, who will take the reputation hit if things do go wrong.

*Notes from Apple’s paper on running LLMs in flash storage

The GPU in your pocket will have a local GPT-3.5-class model by next year that can be shared among other applications.

Efficient LLM Inference on Limited DRAM
The paper details innovative approaches for executing large language models on devices constrained by DRAM capacity. It introduces a novel method where model parameters are stored in flash memory and selectively loaded into DRAM during inference, optimizing memory usage and computational efficiency.
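To make the idea concrete, here is a minimal Python sketch of that flash-to-DRAM flow – not the paper’s code, and the file name, layer sizes, and the way active neurons are picked are all illustrative assumptions. The weight matrix stays on flash (memory-mapped), and only the rows needed for the current token are copied into DRAM:

```python
import numpy as np

# Illustrative sizes; a real FFN layer in a 7B-parameter model is far larger.
HIDDEN, FFN = 1024, 4096

# Pretend this raw file sitting on flash holds the up-projection weights.
np.random.randn(FFN, HIDDEN).astype(np.float16).tofile("w_up.bin")

# Memory-map the weights: nothing is copied into DRAM yet.
w_up_flash = np.memmap("w_up.bin", dtype=np.float16, mode="r",
                       shape=(FFN, HIDDEN))

def load_active_rows(rows: np.ndarray) -> np.ndarray:
    """Read only the rows for neurons predicted to be active into DRAM."""
    return np.asarray(w_up_flash[rows])  # flash reads happen here, for these rows only

# Suppose a small predictor says ~5% of FFN neurons fire for this token
# (ReLU sparsity); we fetch just those rows instead of the whole matrix.
active = np.sort(np.random.choice(FFN, size=FFN // 20, replace=False))
w_active = load_active_rows(active)

x = np.random.randn(HIDDEN).astype(np.float16)
partial_up = w_active @ x   # output of the active neurons only
print(partial_up.shape)     # (204,)
```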

Flash Memory for LLM Storage
The study delves into utilizing flash memory as a viable solution for storing LLMs. It addresses the inherent limitations of DRAM in terms of capacity and proposes hardware-aware strategies to enhance inference performance while dealing with the slower access speeds of flash memory.

Windowing and Row-Column Bundling Techniques
The paper proposes ‘windowing’ and ‘row-column bundling’ as key techniques. ‘Windowing’ aims to reduce the amount of data transferred by reusing previously activated neurons, while ‘row-column bundling’ seeks to optimize the size of data chunks read from flash memory, enhancing efficiency [Read the paper].
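A rough sketch of how the two techniques fit together – again my own illustration of the concepts, not the paper’s implementation; the window size, bundle layout, and activity predictor are assumed. Each neuron’s up-projection row and down-projection column are stored side by side (bundling) so one sequential flash read fetches both, and neurons active in the last few tokens stay resident in DRAM (windowing) so only newly activated neurons are read:

```python
import numpy as np
from collections import deque

HIDDEN, FFN, WINDOW = 1024, 4096, 5   # WINDOW = last k tokens (assumed value)

# Row-column bundling: keep the i-th up-projection row and the i-th
# down-projection column adjacent on flash, so one read serves neuron i.
bundles = np.random.randn(FFN, 2 * HIDDEN).astype(np.float16)  # [up_row | down_col]
bundles.tofile("bundles.bin")
flash = np.memmap("bundles.bin", dtype=np.float16, mode="r", shape=(FFN, 2 * HIDDEN))

dram_cache: dict[int, np.ndarray] = {}                 # neuron id -> bundle held in DRAM
recent_active: deque[set[int]] = deque(maxlen=WINDOW)  # sliding window of active sets

def step(predicted_active: set[int]) -> None:
    """One token step: load only neurons not already cached by the window."""
    recent_active.append(predicted_active)
    needed = set().union(*recent_active)

    # Windowing: neurons active in the last WINDOW tokens are already in DRAM,
    # so only the newly activated ones trigger flash reads.
    to_load = [i for i in predicted_active if i not in dram_cache]
    for i in to_load:
        dram_cache[i] = np.asarray(flash[i])    # one bundled read per new neuron

    # Evict neurons that fell out of the window to keep DRAM use bounded.
    for i in list(dram_cache):
        if i not in needed:
            del dram_cache[i]
    print(f"loaded {len(to_load)} new neurons, {len(dram_cache)} resident")

# Simulate a few tokens with ~5% of neurons active each step.
for _ in range(3):
    step(set(np.random.choice(FFN, size=FFN // 20, replace=False)))
```

Because consecutive tokens tend to activate overlapping sets of neurons, the window means most of what a step needs is already in DRAM, and bundling means the remaining reads are larger and sequential – exactly the access pattern flash is good at.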
