
A Deep Dive
1. Overview
This project was a pioneering internal R&D initiative for a global leader in mobile computing, focused on enabling fast, offline, and privacy-preserving LLM inference directly on handheld devices.
The core objective was to develop a lightweight SDK that would allow application developers to easily integrate GenAI-powered features—such as natural language querying of product manuals—without relying on cloud infrastructure.
The initiative had four key goals:
- Prove the feasibility of running meaningful GenAI use cases directly on edge devices, without compromising performance or usability.
- Achieve fast and efficient on-device inference, including support for Retrieval-Augmented Generation (RAG) to ensure relevant and accurate responses from local data sources.
- Design a developer-friendly API that abstracts away the complexity of working with LLMs, allowing non-ML specialists to build intelligent applications quickly and effectively.
- Maintain a high quality of output that is competitive with cloud-based LLMs, while using significantly less compute and memory by optimizing for on-device resource constraints.
Key Outcomes
- Built a highly optimized SDK for edge LLM applications, leveraging Qualcomm’s GenAI stack.
- Achieved up to 30x faster inference compared to baseline edge implementations.
- Developed an offline RAG pipeline that matches the quality of cloud-based alternatives.
- Enabled real-time, internet-free GenAI features on consumer-grade devices.
2. Client
A global mobile computing company exploring how to push the boundaries of GenAI deployment on their devices.
- Industry: Manufacturing and Technology
- Market Position: Global technology company with 10,000 employees
- Achievements & Context:
  - Long-standing expertise in ML/AI at scale
  - Recognized for applying frontier AI tech in production settings
  - Internal experimentation aimed at driving innovation forward
3. Challenge
Business Challenge
Edge-based LLM solutions were unproven and immature, yet critical for enabling cost-efficient, low-latency, and offline AI capabilities in consumer devices. The team needed to validate whether a usable, high-quality GenAI solution could be built without relying on cloud infrastructure.
Technology Challenge
- Early-stage and fragmented ecosystem for on-device LLMs.
- Small models performed poorly; larger models exceeded typical device constraints.
- Lack of reliable retrieval mechanisms using edge-compatible embedding models.
- Manual-heavy workflow due to missing annotated datasets and evaluation tools.
- Existing solutions had unacceptable latency or limited customization.
4. Solution
We approached the challenge as a full-stack GenAI engineering problem, optimizing both model performance and the developer experience.
Model & Inference Optimization
- Benchmarked open-source frameworks (MLC, llama.cpp)
- Transitioned to Qualcomm’s libGenie + QNN, enabling hardware-accelerated inference
- Built an optimized C++ inference engine with Java/Kotlin bindings (see the sketch after this list)
- Reduced prefill stage latency by up to 30x on mid-range Android devices
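The bindings layer illustrates how the native engine is exposed to app code. Below is a minimal sketch, assuming a JNI surface over the C++ engine; the class and method names (NativeLlmEngine, nativeGenerate, the llm_engine library) are hypothetical, not the actual SDK API.

```kotlin
// Hypothetical JNI surface over the native C++ inference engine.
// All names here are illustrative, not the actual SDK API.
class NativeLlmEngine(modelPath: String) : AutoCloseable {

    // Opaque handle to the engine instance created on the native side.
    private val handle: Long = nativeCreate(modelPath)

    // Streams generated tokens back to the caller as they are produced.
    fun generate(prompt: String, onToken: (String) -> Unit) {
        nativeGenerate(handle, prompt, onToken)
    }

    override fun close() = nativeDestroy(handle)

    private external fun nativeCreate(modelPath: String): Long
    private external fun nativeGenerate(handle: Long, prompt: String, onToken: (String) -> Unit)
    private external fun nativeDestroy(handle: Long)

    companion object {
        init {
            // Loads the compiled C++ engine, which links against libGenie/QNN.
            System.loadLibrary("llm_engine")
        }
    }
}
```

Keeping the handle opaque confines model- and hardware-specific logic to the native side, so the Kotlin layer can stay stable as the engine evolves.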
Retrieval Pipeline Innovation
- Evaluated and selected on-device embedding models
- Developed custom document indexing (by chunks, summaries, and LLM-generated Q&A pairs; sketched below)
- Achieved higher query–document alignment, boosting retrieval precision
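To make the multi-view indexing idea concrete, here is a simplified in-memory sketch: each source chunk is indexed under several embeddings (raw text, summary, and generated Q&A views), and every view resolves back to its source chunk at query time. The embed function, the brute-force cosine search, and all names are placeholder assumptions; the actual pipeline used on-device embedding models with Faiss.

```kotlin
import kotlin.math.sqrt

// Simplified multi-view index: one chunk yields several query-friendly
// entries (raw text, summary, Q&A), all pointing back to the same chunk.
class MultiViewIndex(private val embed: (String) -> FloatArray) {

    private class Entry(val embedding: FloatArray, val sourceChunk: String)

    private val entries = mutableListOf<Entry>()

    fun add(chunk: String, summary: String, qaPairs: List<String>) {
        (listOf(chunk, summary) + qaPairs).forEach { view ->
            entries += Entry(embed(view), chunk)
        }
    }

    fun search(query: String, k: Int = 3): List<String> {
        val q = embed(query)
        return entries.sortedByDescending { cosine(q, it.embedding) }
            .map { it.sourceChunk }
            .distinct() // several views may hit the same chunk
            .take(k)
    }

    private fun cosine(a: FloatArray, b: FloatArray): Float {
        var dot = 0f; var na = 0f; var nb = 0f
        for (i in a.indices) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i] }
        return dot / (sqrt(na) * sqrt(nb))
    }
}
```

Because summaries and Q&A pairs are phrased closer to how users actually ask questions, matching against them tends to align queries with the right chunks better than matching against raw chunk text alone.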
Evaluation & Quality Monitoring
- Created evaluation sets using OpenAI and Gemini APIs
- Applied LLM-as-a-judge scoring to simulate end-user relevance (example after this list)
- Manually verified outputs for critical use cases
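As a minimal sketch of the judge loop, the function below assumes a judge callback that wraps one of the cloud APIs (OpenAI or Gemini); the prompt wording and the 1-5 scale are illustrative, not the exact rubric used.

```kotlin
// Illustrative LLM-as-a-judge scoring; prompt and scale are assumptions.
data class EvalCase(val question: String, val reference: String)

fun scoreAnswer(case: EvalCase, answer: String, judge: (String) -> String): Int {
    val prompt = """
        You are grading an on-device assistant's answer for relevance and accuracy.
        Question: ${case.question}
        Reference answer: ${case.reference}
        Candidate answer: $answer
        Reply with a single integer score from 1 (poor) to 5 (excellent).
    """.trimIndent()
    // Treat an unparseable judge response as the lowest score.
    return judge(prompt).trim().toIntOrNull() ?: 1
}
```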
Data Processing & SDK Abstraction
- PDF digitization via Docling to convert manuals to markdown
- Designed a lightweight, application-friendly SDK interface (illustrated below)
- Ensured internal and external teams could adopt it without LLM expertise
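To convey the intended developer experience, the sketch below shows what an application-facing surface could look like: apps deal in documents and questions, never in models, tokens, or RAG internals. All names (OnDeviceAssistant, addDocument, ask) are hypothetical.

```kotlin
// Hypothetical application-facing SDK surface: no model handling,
// no tokenization, no retrieval plumbing visible to the app developer.
interface OnDeviceAssistant {
    // Index a local document (e.g. a converted product manual) for offline retrieval.
    fun addDocument(name: String, markdown: String)

    // Ask a question; retrieval and generation run entirely on-device.
    fun ask(question: String, onToken: (String) -> Unit)
}

fun demo(assistant: OnDeviceAssistant) {
    assistant.addDocument("washer-manual", loadMarkdown("manuals/washer.md"))
    assistant.ask("How do I run a quick-wash cycle?") { token -> print(token) }
}

// Placeholder for however the app loads its Docling-converted manuals.
fun loadMarkdown(path: String): String = TODO("app-specific loading")
```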
Tech Stack
- Languages: C++, Python, Java, Kotlin
- Frameworks & Tools: libGenie, QNN, Faiss, Docling
- LLMs: Meta LLaMA (1B–8B variants)
- Cloud APIs for Eval: OpenAI / Gemini
5. Process
Key Development Stages
- Feasibility & Benchmarking: Compared inference speed and quality across open-source and commercial edge AI runtimes.
- SDK Prototyping & Optimization: Engineered a modular SDK to abstract model handling, indexing, and RAG orchestration.
- Data & Retrieval Tuning: Introduced advanced chunking and summary-indexing techniques to improve answer relevance.
- Evaluation Pipeline: Automated dataset generation, model response scoring, and quality monitoring with human-in-the-loop reviews.
Team Involved
- GenAI Engineer (Optimization & SDK architecture)
- AI Engineer (Retrieval & Evaluation)
- Full-Stack Developer (Android integration)
- Data Engineer (Data pipelines & format conversion)
6. Outcome
Quantitative Results
- 30x reduction in LLM prefill latency on edge devices
- Enabled real-time on-device inference for GenAI assistants
- Significant retrieval quality improvement through hybrid chunking + summary indexing
- Built an SDK enabling integration in hours, not weeks
Qualitative Results
- Full offline mode improves UX in low-connectivity environments
- On-device processing enhances user data privacy
- Enables new GenAI features in embedded apps without cloud dependence
Lessons Learned
- On-device AI is viable: careful model selection and hardware-specific optimization make it practical
- Evaluation is essential: combined automated and manual review loops maintained output quality
- SDK abstraction is key to making GenAI usable by non-ML developers
7. Summary
Final Thoughts
This initiative validated that generative AI can run efficiently on edge devices, delivering output quality competitive with cloud-based models while improving speed, cost, and privacy. Our custom SDK serves as a foundation for embedding LLM-powered features into mobile and embedded apps, unlocking new product possibilities.