Building an Offline AI Phone App with Llamafu and Flutter

A step-by-step tutorial for building a fully offline AI assistant app for Android and iOS using Llamafu for on-device inference and Flutter for the UI. No internet required.

Most “AI apps” are thin wrappers around cloud APIs. They stop working without internet. They send every conversation to a remote server. They cost money per query. And they are at the mercy of an API provider’s uptime, pricing changes, and content policies.

We are going to build something different: a fully offline AI assistant app that runs entirely on the user’s phone. No internet connection needed. No API keys. No server costs. No data leaving the device. It works on an airplane, in a basement, in a country with restricted internet access — anywhere.

The stack: Llamafu for on-device LLM inference and Flutter for cross-platform UI. By the end of this tutorial, you will have a working Android and iOS app with a chat interface powered by a language model running directly on the phone’s processor.

What Is Llamafu?

Llamafu is a lightweight, cross-platform inference library built on llama.cpp. It provides native bindings for mobile platforms (Android, iOS) and desktop (Windows, macOS, Linux), with a simple C API that can be called from any language with FFI support — including Dart via Flutter’s FFI.

Key features for mobile development:

  • Optimized for ARM processors (NEON SIMD on Android, Metal acceleration on iOS)
  • Memory-efficient inference suitable for phones with 4-8GB RAM
  • GGUF model format support
  • Streaming token generation
  • Small binary footprint (~5MB)
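
This tutorial does not assume a published header file; the C API below is inferred from the FFI bindings we write in Step 2, so treat the exact names, signatures, and ownership rules as assumptions and check them against the Llamafu release you download:

```c
/* Hypothetical llamafu.h -- inferred from the Dart FFI typedefs in Step 2. */
#include <stdint.h>

/* Load a GGUF model; returns an opaque context, or NULL on failure (assumed). */
void *llamafu_init(const char *model_path, int32_t context_size);

/* Run generation; returns a NUL-terminated UTF-8 string. Ownership of the
   returned buffer is assumed to stay with the library -- verify this. */
char *llamafu_generate(void *ctx, const char *prompt,
                       int32_t max_tokens, float temperature);

/* Release the context and its model memory. */
void llamafu_free(void *ctx);
```

Everything else in this tutorial builds on these three calls.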

Prerequisites

  • Flutter 3.27+ installed and configured (the chat UI uses Color.withValues, added in Flutter 3.27)
  • Android Studio (for Android builds) or Xcode (for iOS builds)
  • A physical device for testing (emulators work but are very slow for AI inference)
  • Basic Flutter/Dart knowledge
  • About 2 hours

Target devices: Any phone from 2022 or later should work. Models run fastest on:

  • Android: Snapdragon 8 Gen 2+, Dimensity 9000+, Tensor G3+
  • iOS: A16 Bionic+ (iPhone 14 Pro and later)

Step 1: Project Setup

flutter create --org net.localllm offline_assistant
cd offline_assistant

Add dependencies to pubspec.yaml:

dependencies:
  flutter:
    sdk: flutter
  ffi: ^2.1.0
  path_provider: ^2.1.2
  path: ^1.9.0
  permission_handler: ^11.3.0
  file_picker: ^8.0.0
  share_plus: ^9.0.0

dev_dependencies:
  flutter_test:
    sdk: flutter
  ffigen: ^11.0.0  # optional; we write the bindings by hand in Step 2

Step 2: Integrate Llamafu Native Libraries

Download Pre-built Libraries

Llamafu provides pre-built native libraries. Download them from the Llamafu releases page:

# Create native library directories (arm64 only, matching the abiFilters in Step 8)
mkdir -p android/app/src/main/jniLibs/arm64-v8a
mkdir -p ios/Frameworks

# Download and extract (replace with actual release URL)
curl -L https://github.com/local-llm-net/llamafu/releases/latest/download/llamafu-android-arm64.so \
  -o android/app/src/main/jniLibs/arm64-v8a/libllamafu.so

curl -L https://github.com/local-llm-net/llamafu/releases/latest/download/llamafu-ios.xcframework.zip \
  -o ios/llamafu.xcframework.zip
cd ios && unzip llamafu.xcframework.zip && rm llamafu.xcframework.zip && cd ..
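
Before building, it is worth confirming that the downloaded binary matches the ABI directory it sits in; a mismatched architecture fails only at runtime with a cryptic UnsatisfiedLinkError. On a Unix-like host:

```shell
# Should report something like: "ELF 64-bit LSB shared object, ARM aarch64"
file android/app/src/main/jniLibs/arm64-v8a/libllamafu.so
```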

Dart FFI Bindings

Create the Dart bindings to call the native Llamafu functions:

// lib/llamafu_bindings.dart
import 'dart:ffi';
import 'dart:io';
import 'package:ffi/ffi.dart';

/// Native function signatures
typedef LlamafuInitNative = Pointer<Void> Function(Pointer<Utf8> modelPath, Int32 contextSize);
typedef LlamafuInit = Pointer<Void> Function(Pointer<Utf8> modelPath, int contextSize);

typedef LlamafuGenerateNative = Pointer<Utf8> Function(
  Pointer<Void> ctx,
  Pointer<Utf8> prompt,
  Int32 maxTokens,
  Float temperature,
);
typedef LlamafuGenerate = Pointer<Utf8> Function(
  Pointer<Void> ctx,
  Pointer<Utf8> prompt,
  int maxTokens,
  double temperature,
);

typedef LlamafuFreeNative = Void Function(Pointer<Void> ctx);
typedef LlamafuFree = void Function(Pointer<Void> ctx);

class LlamafuBindings {
  late final DynamicLibrary _lib;
  late final LlamafuInit init;
  late final LlamafuGenerate generate;
  late final LlamafuFree free;

  LlamafuBindings() {
    if (Platform.isAndroid) {
      _lib = DynamicLibrary.open('libllamafu.so');
    } else if (Platform.isIOS) {
      // The xcframework is linked into the app binary, so symbols are
      // resolved from the process itself.
      _lib = DynamicLibrary.process();
    } else {
      throw UnsupportedError('Platform not supported');
    }

    init = _lib
        .lookupFunction<LlamafuInitNative, LlamafuInit>('llamafu_init');
    generate = _lib
        .lookupFunction<LlamafuGenerateNative, LlamafuGenerate>('llamafu_generate');
    free = _lib
        .lookupFunction<LlamafuFreeNative, LlamafuFree>('llamafu_free');
  }
}

Step 3: The Inference Service

Wrap the FFI bindings in a clean Dart service:

// lib/services/inference_service.dart
import 'dart:ffi';
import 'dart:isolate';
import 'package:ffi/ffi.dart';
import '../llamafu_bindings.dart';

class InferenceService {
  final LlamafuBindings _bindings = LlamafuBindings();
  Pointer<Void>? _context;
  bool _isLoaded = false;

  bool get isLoaded => _isLoaded;

  /// Load a GGUF model from the given file path.
  Future<void> loadModel(String modelPath, {int contextSize = 2048}) async {
    // Run model loading in an isolate to avoid blocking the UI
    final address = await Isolate.run(() {
      final bindings = LlamafuBindings();
      final pathPtr = modelPath.toNativeUtf8();
      final ctx = bindings.init(pathPtr, contextSize);
      malloc.free(pathPtr); // toNativeUtf8 allocates with malloc
      return ctx.address;
    });
    if (address == 0) {
      // Assumes llamafu_init returns NULL on failure
      throw StateError('Failed to load model: $modelPath');
    }
    _context = Pointer<Void>.fromAddress(address);
    _isLoaded = true;
  }

  /// Generate a response to the given prompt.
  /// Returns the generated text.
  Future<String> generate(
    String prompt, {
    int maxTokens = 512,
    double temperature = 0.7,
  }) async {
    if (!_isLoaded || _context == null) {
      throw StateError('Model not loaded. Call loadModel() first.');
    }

    // Run generation in an isolate so the UI stays responsive
    final ctxAddress = _context!.address;
    final result = await Isolate.run(() {
      final bindings = LlamafuBindings();
      final ctx = Pointer<Void>.fromAddress(ctxAddress);
      final promptPtr = prompt.toNativeUtf8();

      final resultPtr = bindings.generate(
        ctx,
        promptPtr,
        maxTokens,
        temperature,
      );

      malloc.free(promptPtr); // toNativeUtf8 allocates with malloc
      return resultPtr.toDartString();
    });

    return result;
  }

  /// Format a conversation into a prompt string.
  String formatPrompt(List<ChatMessage> messages) {
    final buffer = StringBuffer();

    buffer.writeln('<|im_start|>system');
    buffer.writeln(
      'You are a helpful offline assistant. You are running directly on '
      'the user\'s phone with no internet connection. Be concise and '
      'helpful. If you do not know something, say so honestly.'
    );
    buffer.writeln('<|im_end|>');

    for (final msg in messages) {
      final role = msg.isUser ? 'user' : 'assistant';
      buffer.writeln('<|im_start|>$role');
      buffer.writeln(msg.content);
      buffer.writeln('<|im_end|>');
    }

    buffer.writeln('<|im_start|>assistant');
    return buffer.toString();
  }

  void dispose() {
    if (_context != null) {
      _bindings.free(_context!);
      _context = null;
      _isLoaded = false;
    }
  }
}

class ChatMessage {
  final String content;
  final bool isUser;
  final DateTime timestamp;

  ChatMessage({
    required this.content,
    required this.isUser,
    DateTime? timestamp,
  }) : timestamp = timestamp ?? DateTime.now();
}
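
With the service in place, here is a minimal end-to-end sketch. The model path is a placeholder; on a real device you would use a file imported via the model manager in Step 5:

```dart
// Hypothetical usage of InferenceService; assumes a GGUF model file is
// already on the device (the path below is a placeholder).
Future<void> demo() async {
  final service = InferenceService();
  await service.loadModel('/path/to/models/model.gguf', contextSize: 2048);

  final prompt = service.formatPrompt([
    ChatMessage(content: 'What is the capital of France?', isUser: true),
  ]);
  final reply = await service.generate(prompt, maxTokens: 128);
  print(reply.trim());

  service.dispose(); // release the native context when done
}
```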

Step 4: The Chat UI

// lib/screens/chat_screen.dart
import 'package:flutter/material.dart';
import '../services/inference_service.dart';

class ChatScreen extends StatefulWidget {
  final InferenceService inferenceService;

  const ChatScreen({super.key, required this.inferenceService});

  @override
  State<ChatScreen> createState() => _ChatScreenState();
}

class _ChatScreenState extends State<ChatScreen> {
  final List<ChatMessage> _messages = [];
  final TextEditingController _inputController = TextEditingController();
  final ScrollController _scrollController = ScrollController();
  bool _isGenerating = false;

  Future<void> _sendMessage() async {
    final text = _inputController.text.trim();
    if (text.isEmpty || _isGenerating) return;

    _inputController.clear();

    setState(() {
      _messages.add(ChatMessage(content: text, isUser: true));
      _isGenerating = true;
    });

    _scrollToBottom();

    try {
      final prompt = widget.inferenceService.formatPrompt(_messages);
      final response = await widget.inferenceService.generate(
        prompt,
        maxTokens: 512,
        temperature: 0.7,
      );

      setState(() {
        _messages.add(ChatMessage(content: response.trim(), isUser: false));
        _isGenerating = false;
      });
    } catch (e) {
      setState(() {
        _messages.add(ChatMessage(
          content: 'Error generating response: $e',
          isUser: false,
        ));
        _isGenerating = false;
      });
    }

    _scrollToBottom();
  }

  void _scrollToBottom() {
    WidgetsBinding.instance.addPostFrameCallback((_) {
      if (_scrollController.hasClients) {
        _scrollController.animateTo(
          _scrollController.position.maxScrollExtent,
          duration: const Duration(milliseconds: 300),
          curve: Curves.easeOut,
        );
      }
    });
  }

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      appBar: AppBar(
        title: const Text('Offline Assistant'),
        actions: [
          // Indicator showing the model is running locally
          Padding(
            padding: const EdgeInsets.only(right: 16),
            child: Row(
              children: [
                Icon(
                  Icons.circle,
                  size: 10,
                  color: widget.inferenceService.isLoaded
                      ? Colors.green
                      : Colors.red,
                ),
                const SizedBox(width: 6),
                Text(
                  widget.inferenceService.isLoaded ? 'Model Loaded' : 'No Model',
                  style: const TextStyle(fontSize: 12),
                ),
              ],
            ),
          ),
        ],
      ),
      body: Column(
        children: [
          // Chat messages
          Expanded(
            child: _messages.isEmpty
                ? const Center(
                    child: Column(
                      mainAxisAlignment: MainAxisAlignment.center,
                      children: [
                        Icon(Icons.offline_bolt, size: 64, color: Colors.grey),
                        SizedBox(height: 16),
                        Text(
                          'Fully Offline AI Assistant',
                          style: TextStyle(
                            fontSize: 20,
                            fontWeight: FontWeight.bold,
                          ),
                        ),
                        SizedBox(height: 8),
                        Padding(
                          padding: EdgeInsets.symmetric(horizontal: 32),
                          child: Text(
                            'Everything runs on your device.\n'
                            'No internet. No data collection.\n'
                            'Your conversations stay private.',
                            textAlign: TextAlign.center,
                            style: TextStyle(color: Colors.grey),
                          ),
                        ),
                      ],
                    ),
                  )
                : ListView.builder(
                    controller: _scrollController,
                    padding: const EdgeInsets.all(16),
                    itemCount: _messages.length,
                    itemBuilder: (context, index) {
                      return _MessageBubble(message: _messages[index]);
                    },
                  ),
          ),

          // Generating indicator
          if (_isGenerating)
            const Padding(
              padding: EdgeInsets.all(8),
              child: Row(
                children: [
                  SizedBox(width: 16),
                  SizedBox(
                    width: 16,
                    height: 16,
                    child: CircularProgressIndicator(strokeWidth: 2),
                  ),
                  SizedBox(width: 8),
                  Text('Thinking...', style: TextStyle(color: Colors.grey)),
                ],
              ),
            ),

          // Input area
          Container(
            padding: const EdgeInsets.all(8),
            decoration: BoxDecoration(
              color: Theme.of(context).cardColor,
              boxShadow: [
                BoxShadow(
                  color: Colors.black.withValues(alpha: 0.1),
                  blurRadius: 4,
                  offset: const Offset(0, -2),
                ),
              ],
            ),
            child: SafeArea(
              child: Row(
                children: [
                  Expanded(
                    child: TextField(
                      controller: _inputController,
                      decoration: const InputDecoration(
                        hintText: 'Ask anything (offline)...',
                        border: OutlineInputBorder(),
                        contentPadding: EdgeInsets.symmetric(
                          horizontal: 16,
                          vertical: 12,
                        ),
                      ),
                      maxLines: 3,
                      minLines: 1,
                      textInputAction: TextInputAction.send,
                      onSubmitted: (_) => _sendMessage(),
                    ),
                  ),
                  const SizedBox(width: 8),
                  IconButton(
                    onPressed: _isGenerating ? null : _sendMessage,
                    icon: const Icon(Icons.send),
                    style: IconButton.styleFrom(
                      backgroundColor: Theme.of(context).primaryColor,
                      foregroundColor: Colors.white,
                    ),
                  ),
                ],
              ),
            ),
          ),
        ],
      ),
    );
  }
}

class _MessageBubble extends StatelessWidget {
  final ChatMessage message;

  const _MessageBubble({required this.message});

  @override
  Widget build(BuildContext context) {
    return Align(
      alignment: message.isUser ? Alignment.centerRight : Alignment.centerLeft,
      child: Container(
        margin: const EdgeInsets.only(bottom: 12),
        padding: const EdgeInsets.symmetric(horizontal: 16, vertical: 12),
        constraints: BoxConstraints(
          maxWidth: MediaQuery.of(context).size.width * 0.78,
        ),
        decoration: BoxDecoration(
          color: message.isUser
              ? Theme.of(context).primaryColor
              : Theme.of(context).cardColor,
          borderRadius: BorderRadius.circular(16),
          border: message.isUser
              ? null
              : Border.all(color: Colors.grey.shade300),
        ),
        child: Text(
          message.content,
          style: TextStyle(
            color: message.isUser ? Colors.white : null,
          ),
        ),
      ),
    );
  }
}

Step 5: Model Management

Users need a way to load models onto their device. Here is a model management screen:

// lib/screens/model_manager_screen.dart
import 'dart:io';
import 'package:flutter/material.dart';
import 'package:path_provider/path_provider.dart';
import 'package:file_picker/file_picker.dart';
import 'package:path/path.dart' as p;
import '../services/inference_service.dart';

class ModelManagerScreen extends StatefulWidget {
  final InferenceService inferenceService;
  final VoidCallback onModelLoaded;

  const ModelManagerScreen({
    super.key,
    required this.inferenceService,
    required this.onModelLoaded,
  });

  @override
  State<ModelManagerScreen> createState() => _ModelManagerScreenState();
}

class _ModelManagerScreenState extends State<ModelManagerScreen> {
  List<File> _availableModels = [];
  bool _isLoading = false;
  String? _loadingStatus;

  @override
  void initState() {
    super.initState();
    _scanForModels();
  }

  Future<void> _scanForModels() async {
    final appDir = await getApplicationDocumentsDirectory();
    final modelsDir = Directory(p.join(appDir.path, 'models'));
    if (!await modelsDir.exists()) {
      await modelsDir.create(recursive: true);
    }

    final files = await modelsDir.list().toList();
    setState(() {
      _availableModels = files
          .whereType<File>()
          .where((f) => f.path.endsWith('.gguf'))
          .toList();
    });
  }

  Future<void> _importModel() async {
    final result = await FilePicker.platform.pickFiles(
      type: FileType.any,
      allowMultiple: false,
    );

    if (result != null && result.files.isNotEmpty) {
      final sourcePath = result.files.single.path!;
      if (!sourcePath.toLowerCase().endsWith('.gguf')) {
        _showError('Please select a GGUF model file.');
        return;
      }

      setState(() {
        _isLoading = true;
        _loadingStatus = 'Copying model file...';
      });

      final appDir = await getApplicationDocumentsDirectory();
      final destPath = p.join(
        appDir.path,
        'models',
        p.basename(sourcePath),
      );

      await File(sourcePath).copy(destPath);

      setState(() {
        _isLoading = false;
        _loadingStatus = null;
      });

      await _scanForModels();
    }
  }

  Future<void> _loadModel(File modelFile) async {
    setState(() {
      _isLoading = true;
      _loadingStatus = 'Loading model (this may take 30-60 seconds)...';
    });

    try {
      await widget.inferenceService.loadModel(
        modelFile.path,
        contextSize: 2048,
      );

      setState(() {
        _isLoading = false;
        _loadingStatus = null;
      });

      widget.onModelLoaded();

      if (mounted) {
        ScaffoldMessenger.of(context).showSnackBar(
          const SnackBar(content: Text('Model loaded successfully!')),
        );
      }
    } catch (e) {
      setState(() {
        _isLoading = false;
        _loadingStatus = null;
      });
      _showError('Failed to load model: $e');
    }
  }

  void _showError(String message) {
    if (!mounted) return;
    ScaffoldMessenger.of(context).showSnackBar(
      SnackBar(content: Text(message), backgroundColor: Colors.red),
    );
  }

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      appBar: AppBar(title: const Text('Model Manager')),
      body: _isLoading
          ? Center(
              child: Column(
                mainAxisAlignment: MainAxisAlignment.center,
                children: [
                  const CircularProgressIndicator(),
                  const SizedBox(height: 16),
                  Text(_loadingStatus ?? 'Loading...'),
                ],
              ),
            )
          : Column(
              children: [
                // Instructions
                const Padding(
                  padding: EdgeInsets.all(16),
                  child: Text(
                    'Import a GGUF model file to use with the offline '
                    'assistant. Recommended: Phi-4 Mini 3.8B Q4_K_M '
                    '(~2 GB) for most phones.',
                    style: TextStyle(color: Colors.grey),
                  ),
                ),

                // Import button
                Padding(
                  padding: const EdgeInsets.symmetric(horizontal: 16),
                  child: ElevatedButton.icon(
                    onPressed: _importModel,
                    icon: const Icon(Icons.add),
                    label: const Text('Import GGUF Model'),
                    style: ElevatedButton.styleFrom(
                      minimumSize: const Size(double.infinity, 48),
                    ),
                  ),
                ),

                const Divider(height: 32),

                // Available models
                Expanded(
                  child: _availableModels.isEmpty
                      ? const Center(
                          child: Text('No models imported yet.'),
                        )
                      : ListView.builder(
                          itemCount: _availableModels.length,
                          itemBuilder: (context, index) {
                            final model = _availableModels[index];
                            final name = p.basename(model.path);
                            final sizeGB =
                                model.lengthSync() / (1024 * 1024 * 1024);

                            return ListTile(
                              leading: const Icon(Icons.psychology),
                              title: Text(name),
                              subtitle: Text('${sizeGB.toStringAsFixed(1)} GB'),
                              trailing: ElevatedButton(
                                onPressed: () => _loadModel(model),
                                child: const Text('Load'),
                              ),
                            );
                          },
                        ),
                ),
              ],
            ),
    );
  }
}

Step 6: Main App Entry Point

// lib/main.dart
import 'package:flutter/material.dart';
import 'services/inference_service.dart';
import 'screens/chat_screen.dart';
import 'screens/model_manager_screen.dart';

void main() {
  runApp(const OfflineAssistantApp());
}

class OfflineAssistantApp extends StatelessWidget {
  const OfflineAssistantApp({super.key});

  @override
  Widget build(BuildContext context) {
    return MaterialApp(
      title: 'Offline Assistant',
      theme: ThemeData(
        colorScheme: ColorScheme.fromSeed(seedColor: Colors.indigo),
        useMaterial3: true,
      ),
      darkTheme: ThemeData(
        colorScheme: ColorScheme.fromSeed(
          seedColor: Colors.indigo,
          brightness: Brightness.dark,
        ),
        useMaterial3: true,
      ),
      home: const HomeScreen(),
    );
  }
}

class HomeScreen extends StatefulWidget {
  const HomeScreen({super.key});

  @override
  State<HomeScreen> createState() => _HomeScreenState();
}

class _HomeScreenState extends State<HomeScreen> {
  final InferenceService _inferenceService = InferenceService();
  int _currentIndex = 0;

  @override
  void dispose() {
    _inferenceService.dispose();
    super.dispose();
  }

  @override
  Widget build(BuildContext context) {
    final screens = [
      ChatScreen(inferenceService: _inferenceService),
      ModelManagerScreen(
        inferenceService: _inferenceService,
        onModelLoaded: () => setState(() => _currentIndex = 0),
      ),
    ];

    return Scaffold(
      body: screens[_currentIndex],
      bottomNavigationBar: NavigationBar(
        selectedIndex: _currentIndex,
        onDestinationSelected: (index) =>
            setState(() => _currentIndex = index),
        destinations: const [
          NavigationDestination(
            icon: Icon(Icons.chat_bubble_outline),
            selectedIcon: Icon(Icons.chat_bubble),
            label: 'Chat',
          ),
          NavigationDestination(
            icon: Icon(Icons.settings_outlined),
            selectedIcon: Icon(Icons.settings),
            label: 'Models',
          ),
        ],
      ),
    );
  }
}

Step 7: Choosing the Right Model

For mobile, model selection is critical. You need to balance quality with the phone’s limited resources.

Model                     Size     RAM Needed   Speed (Snapdragon 8 Gen 3)   Quality
Phi-4 Mini 3.8B Q4_K_M    2.0 GB   3 GB         ~20 tok/s                    Good for simple tasks
Qwen 3 1.7B Q5_K_M        1.3 GB   2 GB         ~28 tok/s                    Best for low-end phones
Llama 3.2 3B Q4_K_M       1.8 GB   2.5 GB       ~22 tok/s                    Solid all-rounder
Phi-4 14B Q3_K_M          6.5 GB   8 GB         ~6 tok/s                     High quality, flagship only
Gemma 3 4B Q4_K_M         2.5 GB   3.5 GB       ~18 tok/s                    Strong reasoning

Recommendation: Start with Phi-4 Mini 3.8B Q4_K_M. It is small enough to run on mid-range phones, fast enough for interactive chat, and smart enough for most casual tasks.
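
A rough rule of thumb behind the “RAM Needed” column (an approximation, not a Llamafu guarantee): the model file is mapped into memory more or less in full, plus several hundred megabytes for the KV cache and runtime. A quick pre-load check might look like:

```dart
/// Rough heuristic: model size plus overhead must fit in the RAM the OS
/// will actually grant the app. The overhead figure is a guess; measure
/// on your target devices.
bool modelLikelyFits({required int modelBytes, required int availableBytes}) {
  const overheadBytes = 768 * 1024 * 1024; // KV cache + runtime, approximate
  return modelBytes + overheadBytes <= availableBytes;
}
```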

Step 8: Build and Deploy

# Android
flutter build apk --release

# iOS
flutter build ios --release

Android-Specific Notes

Add to android/app/build.gradle:

android {
    // Ensure native libraries are included
    sourceSets {
        main {
            jniLibs.srcDirs = ['src/main/jniLibs']
        }
    }

    defaultConfig {
        ndk {
            // Ship only the ABI we actually bundle; most modern phones are arm64
            abiFilters 'arm64-v8a'
        }
    }
}

iOS-Specific Notes

In ios/Runner/Info.plist, enable file sharing so users can copy GGUF files into the app’s Documents folder via the Files app or Finder:

<key>UIFileSharingEnabled</key>
<true/>
<key>LSSupportsOpeningDocumentsInPlace</key>
<true/>

Performance Optimization Tips

  1. Use a small context window. 2048 tokens is plenty for casual conversation and uses much less memory than 4096 or 8192.

  2. Limit max generation tokens. Mobile users do not want to wait 60 seconds for a 1,000-token response. Set maxTokens to 256-512.

  3. Warm up the model. The first generation after loading is slower. Send a short “hello” prompt immediately after loading to warm up the model.

  4. Monitor memory. On Android, use ActivityManager.getMemoryInfo() to check available RAM before loading larger models. On iOS, use os_proc_available_memory().

  5. Offer model quality tiers. Let users choose between “Fast” (smaller model, lower quality) and “Quality” (larger model, slower) modes based on their phone.
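
Tip 3 takes only a few lines with the InferenceService from Step 3: a short, low-temperature generation right after loading primes the model so the user’s first real query is not the slow one.

```dart
/// Best-effort warm-up, to be called after loadModel() succeeds.
Future<void> warmUpModel(InferenceService service) async {
  try {
    await service.generate('Hello', maxTokens: 8, temperature: 0.0);
  } catch (_) {
    // Warm-up is opportunistic; ignore failures.
  }
}
```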

What You Can Build With This

The offline AI app is a foundation. Here are practical applications:

  • Travel translator — Works without roaming data
  • Private journaling assistant — Helps structure thoughts, never uploads anything
  • Field work assistant — Answer technical questions in areas with no connectivity
  • Study companion — Quiz generation and explanation without internet
  • Accessibility tool — Text simplification and summarization for people with reading difficulties
  • Emergency information — Medical and safety Q&A when networks are down

Limitations to Be Honest About

  • Model quality is limited by phone hardware. A 3B model on a phone is not going to match a 70B model on a server. Set user expectations accordingly.
  • First load is slow. Loading a multi-gigabyte model from storage takes 30-60 seconds on most phones. Consider loading at app startup.
  • Battery impact. AI inference is computationally intensive. A long conversation will drain the battery noticeably faster.
  • Storage requirements. Even small models are 1-2 GB. Users need free storage space.

Despite these limitations, having an AI assistant that works with zero internet is genuinely useful in ways that cloud-dependent apps cannot replicate.

The complete source code for this tutorial is available on GitHub. Questions? Join our community.