Most “AI apps” are thin wrappers around cloud APIs. They stop working without internet. They send every conversation to a remote server. They cost money per query. And they are at the mercy of an API provider’s uptime, pricing changes, and content policies.
We are going to build something different: a fully offline AI assistant app that runs entirely on the user’s phone. No internet connection needed. No API keys. No server costs. No data leaving the device. It works on an airplane, in a basement, in a country with restricted internet access — anywhere.
The stack: Llamafu for on-device LLM inference and Flutter for cross-platform UI. By the end of this tutorial, you will have a working Android and iOS app with a chat interface powered by a language model running directly on the phone’s processor.
What Is Llamafu?
Llamafu is a lightweight, cross-platform inference library built on llama.cpp. It provides native bindings for mobile platforms (Android, iOS) and desktop (Windows, macOS, Linux), with a simple C API that can be called from any language with FFI support — including Dart via Flutter’s FFI.
Key features for mobile development:
- Optimized for ARM processors (NEON SIMD on Android, Metal acceleration on iOS)
- Memory-efficient inference suitable for phones with 4-8GB RAM
- GGUF model format support
- Streaming token generation
- Small binary footprint (~5MB)
Prerequisites
- Flutter 3.27+ installed and configured
- Android Studio (for Android builds) or Xcode (for iOS builds)
- A physical device for testing (emulators work but are very slow for AI inference)
- Basic Flutter/Dart knowledge
- About 2 hours
Target devices: Any phone from 2022 or later should work. Models run fastest on:
- Android: Snapdragon 8 Gen 2+, Dimensity 9000+, Tensor G3+
- iOS: A16 Bionic+ (iPhone 14 Pro and later)
Step 1: Project Setup
flutter create --org net.localllm offline_assistant
cd offline_assistant
Add dependencies to pubspec.yaml:
dependencies:
flutter:
sdk: flutter
ffi: ^2.1.0
path_provider: ^2.1.2
path: ^1.9.0
permission_handler: ^11.3.0
file_picker: ^8.0.0
share_plus: ^9.0.0
dev_dependencies:
flutter_test:
sdk: flutter
ffigen: ^11.0.0
Step 2: Integrate Llamafu Native Libraries
Download Pre-built Libraries
Llamafu provides pre-built native libraries. Download them from the Llamafu releases page:
# Create native library directories (arm64 only, matching the abiFilters set in Step 8)
mkdir -p android/app/src/main/jniLibs/arm64-v8a
mkdir -p ios/Frameworks
# Download and extract (replace with actual release URL)
curl -L https://github.com/local-llm-net/llamafu/releases/latest/download/llamafu-android-arm64.so \
-o android/app/src/main/jniLibs/arm64-v8a/libllamafu.so
curl -L https://github.com/local-llm-net/llamafu/releases/latest/download/llamafu-ios.xcframework.zip \
-o ios/llamafu.xcframework.zip
cd ios && unzip llamafu.xcframework.zip && rm llamafu.xcframework.zip && cd ..
Then open ios/Runner.xcworkspace in Xcode and add llamafu.xcframework to the Runner target under General > Frameworks, Libraries, and Embedded Content, set to Embed & Sign. This is required for the symbols to be visible via DynamicLibrary.process().
Dart FFI Bindings
Create the Dart bindings to call the native Llamafu functions:
// lib/llamafu_bindings.dart
import 'dart:ffi';
import 'dart:io';
import 'package:ffi/ffi.dart';
/// Native function signatures
typedef LlamafuInitNative = Pointer<Void> Function(Pointer<Utf8> modelPath, Int32 contextSize);
typedef LlamafuInit = Pointer<Void> Function(Pointer<Utf8> modelPath, int contextSize);
typedef LlamafuGenerateNative = Pointer<Utf8> Function(
Pointer<Void> ctx,
Pointer<Utf8> prompt,
Int32 maxTokens,
Float temperature,
);
typedef LlamafuGenerate = Pointer<Utf8> Function(
Pointer<Void> ctx,
Pointer<Utf8> prompt,
int maxTokens,
double temperature,
);
typedef LlamafuFreeNative = Void Function(Pointer<Void> ctx);
typedef LlamafuFreeDart = void Function(Pointer<Void> ctx);
class LlamafuBindings {
late final DynamicLibrary _lib;
late final LlamafuInit init;
late final LlamafuGenerate generate;
late final LlamafuFreeDart free;
LlamafuBindings() {
if (Platform.isAndroid) {
_lib = DynamicLibrary.open('libllamafu.so');
} else if (Platform.isIOS) {
_lib = DynamicLibrary.process();
} else {
throw UnsupportedError('Platform not supported');
}
init = _lib
.lookupFunction<LlamafuInitNative, LlamafuInit>('llamafu_init');
generate = _lib
.lookupFunction<LlamafuGenerateNative, LlamafuGenerate>('llamafu_generate');
free = _lib
.lookupFunction<LlamafuFreeNative, LlamafuFreeDart>('llamafu_free');
}
}
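Before wiring up any UI, it is worth a quick smoke test on a device to confirm the library and symbols actually load. This is a throwaway sketch you can drop into a temporary main.dart:

```dart
import 'llamafu_bindings.dart';

void main() {
  // The constructor throws an ArgumentError if the library or any of the
  // three symbols cannot be resolved, so reaching the print means the
  // FFI setup works end to end.
  LlamafuBindings();
  print('Llamafu symbols resolved.');
}
```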
Step 3: The Inference Service
Wrap the FFI bindings in a clean Dart service:
// lib/services/inference_service.dart
import 'dart:ffi';
import 'dart:isolate';
import 'package:ffi/ffi.dart';
import '../llamafu_bindings.dart';
class InferenceService {
final LlamafuBindings _bindings = LlamafuBindings();
Pointer<Void>? _context;
bool _isLoaded = false;
bool get isLoaded => _isLoaded;
/// Load a GGUF model from the given file path.
Future<void> loadModel(String modelPath, {int contextSize = 2048}) async {
// Run model loading in an isolate to avoid blocking the UI
final address = await Isolate.run(() {
final bindings = LlamafuBindings();
final pathPtr = modelPath.toNativeUtf8();
final ctx = bindings.init(pathPtr, contextSize);
malloc.free(pathPtr); // toNativeUtf8 allocates with malloc by default
return ctx.address;
});
if (address == 0) {
// Assumes llamafu_init returns null on failure (bad path, out of memory)
throw StateError('Failed to load model: $modelPath');
}
_context = Pointer<Void>.fromAddress(address);
_isLoaded = true;
}
/// Generate a response to the given prompt.
/// Returns the generated text.
Future<String> generate(
String prompt, {
int maxTokens = 512,
double temperature = 0.7,
}) async {
if (!_isLoaded || _context == null) {
throw StateError('Model not loaded. Call loadModel() first.');
}
// Run generation in an isolate so the UI stays responsive
final ctxAddress = _context!.address;
final result = await Isolate.run(() {
final bindings = LlamafuBindings();
final ctx = Pointer<Void>.fromAddress(ctxAddress);
final promptPtr = prompt.toNativeUtf8();
final resultPtr = bindings.generate(ctx, promptPtr, maxTokens, temperature);
malloc.free(promptPtr);
if (resultPtr == nullptr) {
throw StateError('Generation failed (null result from llamafu_generate)');
}
// Assumes the native library owns and reuses the returned buffer; if
// Llamafu expects the caller to free it, add the matching free call here.
return resultPtr.toDartString();
});
return result;
}
/// Format a conversation into a ChatML prompt string. ChatML suits
/// Qwen- and Phi-style models; Llama 3.2 and Gemma expect different
/// chat templates, so match this format to the model you load.
String formatPrompt(List<ChatMessage> messages) {
final buffer = StringBuffer();
buffer.writeln('<|im_start|>system');
buffer.writeln(
'You are a helpful offline assistant. You are running directly on '
'the user\'s phone with no internet connection. Be concise and '
'helpful. If you do not know something, say so honestly.'
);
buffer.writeln('<|im_end|>');
for (final msg in messages) {
final role = msg.isUser ? 'user' : 'assistant';
buffer.writeln('<|im_start|>$role');
buffer.writeln(msg.content);
buffer.writeln('<|im_end|>');
}
buffer.writeln('<|im_start|>assistant');
return buffer.toString();
}
void dispose() {
if (_context != null) {
_bindings.free(_context!);
_context = null;
_isLoaded = false;
}
}
}
class ChatMessage {
final String content;
final bool isUser;
final DateTime timestamp;
ChatMessage({
required this.content,
required this.isUser,
DateTime? timestamp,
}) : timestamp = timestamp ?? DateTime.now();
}
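Putting the service together looks like this (a sketch; the model path is a placeholder for a real GGUF file already on the device):

```dart
import 'services/inference_service.dart';

// Sketch: load a model, run one exchange, and clean up.
Future<void> main() async {
  final service = InferenceService();
  await service.loadModel('/path/to/model.gguf', contextSize: 2048);

  final prompt = service.formatPrompt([
    ChatMessage(content: 'Explain GGUF in one sentence.', isUser: true),
  ]);
  final reply = await service.generate(prompt, maxTokens: 128);
  print(reply);

  service.dispose();
}
```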
Step 4: The Chat UI
// lib/screens/chat_screen.dart
import 'package:flutter/material.dart';
import '../services/inference_service.dart';
class ChatScreen extends StatefulWidget {
final InferenceService inferenceService;
const ChatScreen({super.key, required this.inferenceService});
@override
State<ChatScreen> createState() => _ChatScreenState();
}
class _ChatScreenState extends State<ChatScreen> {
final List<ChatMessage> _messages = [];
final TextEditingController _inputController = TextEditingController();
final ScrollController _scrollController = ScrollController();
bool _isGenerating = false;
Future<void> _sendMessage() async {
final text = _inputController.text.trim();
if (text.isEmpty || _isGenerating) return;
_inputController.clear();
setState(() {
_messages.add(ChatMessage(content: text, isUser: true));
_isGenerating = true;
});
_scrollToBottom();
try {
final prompt = widget.inferenceService.formatPrompt(_messages);
final response = await widget.inferenceService.generate(
prompt,
maxTokens: 512,
temperature: 0.7,
);
setState(() {
_messages.add(ChatMessage(content: response.trim(), isUser: false));
_isGenerating = false;
});
} catch (e) {
setState(() {
_messages.add(ChatMessage(
content: 'Error generating response: $e',
isUser: false,
));
_isGenerating = false;
});
}
_scrollToBottom();
}
void _scrollToBottom() {
WidgetsBinding.instance.addPostFrameCallback((_) {
if (_scrollController.hasClients) {
_scrollController.animateTo(
_scrollController.position.maxScrollExtent,
duration: const Duration(milliseconds: 300),
curve: Curves.easeOut,
);
}
});
}
@override
Widget build(BuildContext context) {
return Scaffold(
appBar: AppBar(
title: const Text('Offline Assistant'),
actions: [
// Indicator showing the model is running locally
Padding(
padding: const EdgeInsets.only(right: 16),
child: Row(
children: [
Icon(
Icons.circle,
size: 10,
color: widget.inferenceService.isLoaded
? Colors.green
: Colors.red,
),
const SizedBox(width: 6),
Text(
widget.inferenceService.isLoaded ? 'Model Loaded' : 'No Model',
style: const TextStyle(fontSize: 12),
),
],
),
),
],
),
body: Column(
children: [
// Chat messages
Expanded(
child: _messages.isEmpty
? const Center(
child: Column(
mainAxisAlignment: MainAxisAlignment.center,
children: [
Icon(Icons.offline_bolt, size: 64, color: Colors.grey),
SizedBox(height: 16),
Text(
'Fully Offline AI Assistant',
style: TextStyle(
fontSize: 20,
fontWeight: FontWeight.bold,
),
),
SizedBox(height: 8),
Padding(
padding: EdgeInsets.symmetric(horizontal: 32),
child: Text(
'Everything runs on your device.\n'
'No internet. No data collection.\n'
'Your conversations stay private.',
textAlign: TextAlign.center,
style: TextStyle(color: Colors.grey),
),
),
],
),
)
: ListView.builder(
controller: _scrollController,
padding: const EdgeInsets.all(16),
itemCount: _messages.length,
itemBuilder: (context, index) {
return _MessageBubble(message: _messages[index]);
},
),
),
// Generating indicator
if (_isGenerating)
const Padding(
padding: EdgeInsets.all(8),
child: Row(
children: [
SizedBox(width: 16),
SizedBox(
width: 16,
height: 16,
child: CircularProgressIndicator(strokeWidth: 2),
),
SizedBox(width: 8),
Text('Thinking...', style: TextStyle(color: Colors.grey)),
],
),
),
// Input area
Container(
padding: const EdgeInsets.all(8),
decoration: BoxDecoration(
color: Theme.of(context).cardColor,
boxShadow: [
BoxShadow(
color: Colors.black.withValues(alpha: 0.1),
blurRadius: 4,
offset: const Offset(0, -2),
),
],
),
child: SafeArea(
child: Row(
children: [
Expanded(
child: TextField(
controller: _inputController,
decoration: const InputDecoration(
hintText: 'Ask anything (offline)...',
border: OutlineInputBorder(),
contentPadding: EdgeInsets.symmetric(
horizontal: 16,
vertical: 12,
),
),
maxLines: 3,
minLines: 1,
textInputAction: TextInputAction.send,
onSubmitted: (_) => _sendMessage(),
),
),
const SizedBox(width: 8),
IconButton(
onPressed: _isGenerating ? null : _sendMessage,
icon: const Icon(Icons.send),
style: IconButton.styleFrom(
backgroundColor: Theme.of(context).primaryColor,
foregroundColor: Colors.white,
),
),
],
),
),
),
],
),
);
}
}
class _MessageBubble extends StatelessWidget {
final ChatMessage message;
const _MessageBubble({required this.message});
@override
Widget build(BuildContext context) {
return Align(
alignment: message.isUser ? Alignment.centerRight : Alignment.centerLeft,
child: Container(
margin: const EdgeInsets.only(bottom: 12),
padding: const EdgeInsets.symmetric(horizontal: 16, vertical: 12),
constraints: BoxConstraints(
maxWidth: MediaQuery.of(context).size.width * 0.78,
),
decoration: BoxDecoration(
color: message.isUser
? Theme.of(context).primaryColor
: Theme.of(context).cardColor,
borderRadius: BorderRadius.circular(16),
border: message.isUser
? null
: Border.all(color: Colors.grey.shade300),
),
child: Text(
message.content,
style: TextStyle(
color: message.isUser ? Colors.white : null,
),
),
),
);
}
}
Step 5: Model Management
Users need a way to load models onto their device. Here is a model management screen:
// lib/screens/model_manager_screen.dart
import 'dart:io';
import 'package:flutter/material.dart';
import 'package:path_provider/path_provider.dart';
import 'package:file_picker/file_picker.dart';
import 'package:path/path.dart' as p;
import '../services/inference_service.dart';
class ModelManagerScreen extends StatefulWidget {
final InferenceService inferenceService;
final VoidCallback onModelLoaded;
const ModelManagerScreen({
super.key,
required this.inferenceService,
required this.onModelLoaded,
});
@override
State<ModelManagerScreen> createState() => _ModelManagerScreenState();
}
class _ModelManagerScreenState extends State<ModelManagerScreen> {
List<File> _availableModels = [];
bool _isLoading = false;
String? _loadingStatus;
@override
void initState() {
super.initState();
_scanForModels();
}
Future<void> _scanForModels() async {
final appDir = await getApplicationDocumentsDirectory();
final modelsDir = Directory(p.join(appDir.path, 'models'));
if (!await modelsDir.exists()) {
await modelsDir.create(recursive: true);
}
final files = await modelsDir.list().toList();
setState(() {
_availableModels = files
.whereType<File>()
.where((f) => f.path.endsWith('.gguf'))
.toList();
});
}
Future<void> _importModel() async {
final result = await FilePicker.platform.pickFiles(
type: FileType.any,
allowMultiple: false,
);
if (result != null && result.files.isNotEmpty) {
final sourcePath = result.files.single.path!;
if (!sourcePath.toLowerCase().endsWith('.gguf')) {
_showError('Please select a GGUF model file.');
return;
}
setState(() {
_isLoading = true;
_loadingStatus = 'Copying model file...';
});
final appDir = await getApplicationDocumentsDirectory();
final destPath = p.join(
appDir.path,
'models',
p.basename(sourcePath),
);
await File(sourcePath).copy(destPath);
setState(() {
_isLoading = false;
_loadingStatus = null;
});
await _scanForModels();
}
}
Future<void> _loadModel(File modelFile) async {
setState(() {
_isLoading = true;
_loadingStatus = 'Loading model (this may take 30-60 seconds)...';
});
try {
await widget.inferenceService.loadModel(
modelFile.path,
contextSize: 2048,
);
setState(() {
_isLoading = false;
_loadingStatus = null;
});
widget.onModelLoaded();
if (mounted) {
ScaffoldMessenger.of(context).showSnackBar(
const SnackBar(content: Text('Model loaded successfully!')),
);
}
} catch (e) {
setState(() {
_isLoading = false;
_loadingStatus = null;
});
_showError('Failed to load model: $e');
}
}
void _showError(String message) {
ScaffoldMessenger.of(context).showSnackBar(
SnackBar(content: Text(message), backgroundColor: Colors.red),
);
}
@override
Widget build(BuildContext context) {
return Scaffold(
appBar: AppBar(title: const Text('Model Manager')),
body: _isLoading
? Center(
child: Column(
mainAxisAlignment: MainAxisAlignment.center,
children: [
const CircularProgressIndicator(),
const SizedBox(height: 16),
Text(_loadingStatus ?? 'Loading...'),
],
),
)
: Column(
children: [
// Instructions
const Padding(
padding: EdgeInsets.all(16),
child: Text(
'Import a GGUF model file to use with the offline '
'assistant. Recommended: Phi-4 Mini 3.8B Q4_K_M '
'(~2 GB) for most phones.',
style: TextStyle(color: Colors.grey),
),
),
// Import button
Padding(
padding: const EdgeInsets.symmetric(horizontal: 16),
child: ElevatedButton.icon(
onPressed: _importModel,
icon: const Icon(Icons.add),
label: const Text('Import GGUF Model'),
style: ElevatedButton.styleFrom(
minimumSize: const Size(double.infinity, 48),
),
),
),
const Divider(height: 32),
// Available models
Expanded(
child: _availableModels.isEmpty
? const Center(
child: Text('No models imported yet.'),
)
: ListView.builder(
itemCount: _availableModels.length,
itemBuilder: (context, index) {
final model = _availableModels[index];
final name = p.basename(model.path);
final sizeGB =
model.lengthSync() / (1024 * 1024 * 1024);
return ListTile(
leading: const Icon(Icons.psychology),
title: Text(name),
subtitle: Text('${sizeGB.toStringAsFixed(1)} GB'),
trailing: ElevatedButton(
onPressed: () => _loadModel(model),
child: const Text('Load'),
),
);
},
),
),
],
),
);
}
}
Step 6: Main App Entry Point
// lib/main.dart
import 'package:flutter/material.dart';
import 'services/inference_service.dart';
import 'screens/chat_screen.dart';
import 'screens/model_manager_screen.dart';
void main() {
runApp(const OfflineAssistantApp());
}
class OfflineAssistantApp extends StatelessWidget {
const OfflineAssistantApp({super.key});
@override
Widget build(BuildContext context) {
return MaterialApp(
title: 'Offline Assistant',
theme: ThemeData(
colorScheme: ColorScheme.fromSeed(seedColor: Colors.indigo),
useMaterial3: true,
),
darkTheme: ThemeData(
colorScheme: ColorScheme.fromSeed(
seedColor: Colors.indigo,
brightness: Brightness.dark,
),
useMaterial3: true,
),
home: const HomeScreen(),
);
}
}
class HomeScreen extends StatefulWidget {
const HomeScreen({super.key});
@override
State<HomeScreen> createState() => _HomeScreenState();
}
class _HomeScreenState extends State<HomeScreen> {
final InferenceService _inferenceService = InferenceService();
int _currentIndex = 0;
@override
void dispose() {
_inferenceService.dispose();
super.dispose();
}
@override
Widget build(BuildContext context) {
final screens = [
ChatScreen(inferenceService: _inferenceService),
ModelManagerScreen(
inferenceService: _inferenceService,
onModelLoaded: () => setState(() => _currentIndex = 0),
),
];
return Scaffold(
body: screens[_currentIndex],
bottomNavigationBar: NavigationBar(
selectedIndex: _currentIndex,
onDestinationSelected: (index) =>
setState(() => _currentIndex = index),
destinations: const [
NavigationDestination(
icon: Icon(Icons.chat_bubble_outline),
selectedIcon: Icon(Icons.chat_bubble),
label: 'Chat',
),
NavigationDestination(
icon: Icon(Icons.settings_outlined),
selectedIcon: Icon(Icons.settings),
label: 'Models',
),
],
),
);
}
}
Step 7: Choosing the Right Model
For mobile, model selection is critical. You need to balance quality with the phone’s limited resources.
| Model | Size | RAM Needed | Speed (Snapdragon 8 Gen 3) | Quality |
|---|---|---|---|---|
| Phi-4 Mini 3.8B Q4_K_M | 2.0 GB | 3 GB | ~20 tok/s | Good for simple tasks |
| Qwen 3 1.7B Q5_K_M | 1.3 GB | 2 GB | ~28 tok/s | Best for low-end phones |
| Llama 3.2 3B Q4_K_M | 1.8 GB | 2.5 GB | ~22 tok/s | Solid all-rounder |
| Phi-4 14B Q3_K_M | 6.5 GB | 8 GB | ~6 tok/s | High quality, flagship only |
| Gemma 3 4B Q4_K_M | 2.5 GB | 3.5 GB | ~18 tok/s | Strong reasoning |
Recommendation: Start with Phi-4 Mini 3.8B Q4_K_M. It is small enough to run on mid-range phones, fast enough for interactive chat, and smart enough for most casual tasks.
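Before loading, you can sanity-check the fit with a rule of thumb: a GGUF model needs roughly its own file size in RAM, plus headroom for the KV cache and runtime. The 512 MB margin below is an assumption, not a measured figure:

```dart
import 'dart:io';

/// Heuristic RAM check: model file size plus a fixed overhead margin
/// must fit within the RAM the OS reports as available.
bool modelLikelyFits(File modelFile, int availableRamBytes) {
  const overheadBytes = 512 * 1024 * 1024; // assumed KV cache + runtime margin
  return modelFile.lengthSync() + overheadBytes < availableRamBytes;
}
```

Pair this with the platform memory queries described under Performance Optimization Tips to decide which model tier to offer.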
Step 8: Build and Deploy
# Android
flutter build apk --release
# iOS
flutter build ios --release
Android-Specific Notes
Add to android/app/build.gradle:
android {
// Ensure native libraries are included
sourceSets {
main {
jniLibs.srcDirs = ['src/main/jniLibs']
}
}
// Recommended: increase app heap size for model loading
defaultConfig {
ndk {
abiFilters 'arm64-v8a' // Most modern phones are arm64
}
}
}
iOS-Specific Notes
In ios/Runner/Info.plist, enable file sharing so users can drop GGUF files into the app's Documents folder via the Files app:
<key>UIFileSharingEnabled</key>
<true/>
<key>LSSupportsOpeningDocumentsInPlace</key>
<true/>
Performance Optimization Tips
- Use a small context window. 2048 tokens is plenty for casual conversation and uses much less memory than 4096 or 8192.
- Limit max generation tokens. Mobile users do not want to wait 60 seconds for a 1,000-token response. Set maxTokens to 256-512.
- Warm up the model. The first generation after loading is slower. Send a short “hello” prompt immediately after loading.
- Monitor memory. On Android, use ActivityManager.getMemoryInfo() to check available RAM before loading larger models. On iOS, use os_proc_available_memory().
- Offer model quality tiers. Let users choose between “Fast” (smaller model, lower quality) and “Quality” (larger model, slower) modes based on their phone.
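The warm-up tip takes only a couple of lines with the InferenceService from Step 3 (a sketch; call it right after loadModel succeeds):

```dart
/// Run one tiny throwaway generation so the first real request is fast.
Future<void> warmUpModel(InferenceService service) async {
  // A short prompt and a small token budget keep the warm-up cheap.
  await service.generate('Hello', maxTokens: 8, temperature: 0.0);
}
```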
What You Can Build With This
The offline AI app is a foundation. Here are practical applications:
- Travel translator — Works without roaming data
- Private journaling assistant — Helps structure thoughts, never uploads anything
- Field work assistant — Answer technical questions in areas with no connectivity
- Study companion — Quiz generation and explanation without internet
- Accessibility tool — Text simplification and summarization for people with reading difficulties
- Emergency information — Medical and safety Q&A when networks are down
Limitations to Be Honest About
- Model quality is limited by phone hardware. A 3B model on a phone is not going to match a 70B model on a server. Set user expectations accordingly.
- First load is slow. Loading a model from storage can take anywhere from a few seconds to a minute, depending on model size and storage speed. Consider loading at app startup.
- Battery impact. AI inference is computationally intensive. A long conversation will drain the battery noticeably faster.
- Storage requirements. Even small models are 1-2 GB. Users need free storage space.
Despite these limitations, having an AI assistant that works with zero internet is genuinely useful in ways that cloud-dependent apps cannot replicate.
The complete source code for this tutorial is available on GitHub. Questions? Join our community.