Local Image Generation: Stable Diffusion, FLUX, and ComfyUI Guide

Generate images locally with Stable Diffusion, FLUX, and ComfyUI. Covers setup, ControlNet, LoRAs, VRAM management, prompt engineering, and workflow optimization.

Local image generation lets you create AI art, product mockups, design concepts, and visual content entirely on your own hardware. No cloud services, no content filters, no per-image fees. This guide covers the two major model families (Stable Diffusion and FLUX), the most powerful interface (ComfyUI), and essential techniques like ControlNet, LoRAs, and prompt engineering that transform basic generation into a professional workflow.

Understanding the Models

Stable Diffusion Family

SD 1.5 (2022): The original. Lightweight, massive ecosystem of fine-tunes and LoRAs.

  • Resolution: 512x512 native
  • VRAM: 4+ GB
  • Ecosystem: Thousands of community models and LoRAs

SDXL (2023): Significant quality upgrade. Better composition, lighting, and detail.

  • Resolution: 1024x1024 native
  • VRAM: 6-8+ GB
  • Ecosystem: Growing rapidly, many SDXL-specific models

SD 3.5 (2024): Latest from Stability AI. Improved text rendering and coherence.

  • Resolution: 1024x1024 native
  • VRAM: 8-12+ GB
  • Ecosystem: Still developing

FLUX Family

FLUX.1 Schnell (Fast): Optimized for speed. 4 steps instead of 20+.

  • Resolution: 1024x1024
  • VRAM: 8-12 GB
  • Speed: 3-8 seconds on good hardware

FLUX.1 Dev: Higher quality, more steps.

  • Resolution: Up to 2048x2048
  • VRAM: 12-24 GB
  • Speed: 15-30 seconds
  • License: Non-commercial

FLUX.1 Pro: API-only commercial version.

Quick Comparison

Feature            SD 1.5   SDXL        FLUX.1 Dev
Quality            Good     Very good   Excellent
VRAM               4 GB     8 GB        12-24 GB
Speed              Fast     Medium      Slower
Text in images     Poor     Fair        Good
Prompt adherence   Fair     Good        Excellent
LoRA ecosystem     Huge     Large       Growing
ControlNet         Yes      Yes         Yes
License            Open     Open        Non-commercial
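The comparison reduces to a quick rule of thumb based on available VRAM. The helper below is illustrative (the function name is my own); the thresholds simply mirror the VRAM column of the table:

```python
def recommend_model(vram_gb: float) -> str:
    """Rough model recommendation based on the VRAM figures in the table above."""
    if vram_gb >= 12:
        return "FLUX.1 Dev"  # best quality and prompt adherence, needs 12-24 GB
    if vram_gb >= 8:
        return "SDXL"        # strong quality at 1024x1024 native on 8 GB
    return "SD 1.5"          # lightweight fallback for 4 GB cards

print(recommend_model(24))  # FLUX.1 Dev
print(recommend_model(8))   # SDXL
print(recommend_model(4))   # SD 1.5
```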

Setting Up ComfyUI

ComfyUI is a node-based interface for image generation. It is more flexible than Automatic1111's WebUI, offering full control over the generation pipeline through visual workflows.

Installation

# Clone ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install PyTorch (NVIDIA GPU)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# Install PyTorch (Apple Silicon)
pip install torch torchvision torchaudio

# Install PyTorch (CPU only)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# Install ComfyUI dependencies
pip install -r requirements.txt

# Start ComfyUI
python main.py
# Open http://127.0.0.1:8188 in your browser

Docker Installation

# NVIDIA GPU
docker run -d \
  --gpus all \
  -p 8188:8188 \
  -v comfyui_data:/app/output \
  -v comfyui_models:/app/models \
  --name comfyui \
  ghcr.io/ai-dock/comfyui:latest

ComfyUI Manager (Essential)

ComfyUI Manager adds a GUI for installing custom nodes, models, and extensions:

cd ComfyUI/custom_nodes
git clone https://github.com/ltdrdata/ComfyUI-Manager.git

# Restart ComfyUI
# Click "Manager" in the menu bar to access

Downloading Models

Stable Diffusion Models

Place checkpoint files in ComfyUI/models/checkpoints/:

cd ComfyUI/models/checkpoints

# SDXL Base (6.9 GB)
wget https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors

# Popular community models (from CivitAI or Hugging Face):
# - Juggernaut XL (photorealistic)
# - DreamShaper XL (versatile)
# - RealVisXL (photorealistic)
# - Pony Diffusion XL (stylized)

FLUX Models

cd ComfyUI/models

# FLUX.1 Dev (23.8 GB for full model)
# Use the fp8 quantized version to save VRAM:
cd unet
wget https://huggingface.co/Comfy-Org/flux1-dev/resolve/main/flux1-dev-fp8.safetensors

# FLUX.1 Schnell (fast version)
wget https://huggingface.co/Comfy-Org/flux1-schnell/resolve/main/flux1-schnell-fp8.safetensors

# FLUX text encoder (T5 XXL, required)
cd ../clip
wget https://huggingface.co/Comfy-Org/flux1-dev/resolve/main/t5xxl_fp8_e4m3fn.safetensors
wget https://huggingface.co/Comfy-Org/flux1-dev/resolve/main/clip_l.safetensors

# FLUX VAE
cd ../vae
wget https://huggingface.co/Comfy-Org/flux1-dev/resolve/main/ae.safetensors

VAE Models

cd ComfyUI/models/vae

# SDXL VAE (for better color accuracy)
wget https://huggingface.co/stabilityai/sdxl-vae/resolve/main/sdxl_vae.safetensors

Basic Image Generation Workflows

SDXL Workflow in ComfyUI

The default ComfyUI workflow generates images with SDXL. Here’s what each node does:

[Load Checkpoint] → loads the SDXL model

[CLIP Text Encode (Positive)] → your prompt

[CLIP Text Encode (Negative)] → what to avoid

[KSampler] → the denoising/generation process

[VAE Decode] → converts latent to image

[Save Image] → saves to output folder

Key parameters in KSampler:

  • Steps: 20-30 for SDXL (higher = more detail, slower)
  • CFG (Classifier-Free Guidance): 7-8 for SDXL (higher = more prompt adherence)
  • Sampler: euler or dpmpp_2m for speed, dpmpp_2m_sde for quality
  • Scheduler: karras (recommended)
  • Denoise: 1.0 for text-to-image, 0.3-0.7 for image-to-image
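When you export a workflow in API format (Save (API Format) in ComfyUI), these parameters appear as inputs on a KSampler node. The sketch below shows what that JSON fragment looks like as a Python dict; the node ids ("3", "4", etc.) are arbitrary per workflow, while the input field names match KSampler's actual API fields:

```python
# A KSampler node as it appears in a ComfyUI API-format workflow.
# Link values are [source node id, output index]; ids here are examples.
ksampler = {
    "3": {
        "class_type": "KSampler",
        "inputs": {
            "seed": 42,
            "steps": 25,                 # 20-30 for SDXL
            "cfg": 7.5,                  # 7-8 for SDXL
            "sampler_name": "dpmpp_2m",  # or "euler" for speed
            "scheduler": "karras",
            "denoise": 1.0,              # 1.0 = pure text-to-image
            "model": ["4", 0],           # from Load Checkpoint
            "positive": ["6", 0],        # from positive CLIP Text Encode
            "negative": ["7", 0],        # from negative CLIP Text Encode
            "latent_image": ["5", 0],    # from Empty Latent Image
        },
    }
}
print(ksampler["3"]["inputs"]["cfg"])  # 7.5
```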

FLUX Workflow

FLUX uses a different pipeline than Stable Diffusion:

[Load Diffusion Model] → FLUX unet
[Load CLIP] → T5 encoder + CLIP-L
[Load VAE] → FLUX VAE

[CLIP Text Encode] → prompt (no negative prompt needed for FLUX)

[KSampler] → steps: 20-28, CFG: 1.0 (FLUX does not use classifier-free guidance; its guidance strength is set separately)

[VAE Decode] → [Save Image]

FLUX-specific settings:

  • Steps: 4 for Schnell, 20-28 for Dev
  • CFG: 1.0 (FLUX uses guidance scale differently)
  • Sampler: euler for Schnell, euler or dpmpp_2m for Dev
  • Scheduler: simple for Schnell, normal for Dev
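The per-model settings above condense into a small lookup table. The dict structure below is just for illustration (the names are my own), but the values come straight from the bullets in this guide:

```python
# Sampler presets per model family, matching the settings listed above.
SAMPLER_PRESETS = {
    "sdxl":         {"steps": 25, "cfg": 7.5, "sampler": "dpmpp_2m", "scheduler": "karras"},
    "flux-schnell": {"steps": 4,  "cfg": 1.0, "sampler": "euler",    "scheduler": "simple"},
    "flux-dev":     {"steps": 24, "cfg": 1.0, "sampler": "euler",    "scheduler": "normal"},
}

def preset(model: str) -> dict:
    """Look up recommended sampler settings for a model family."""
    return SAMPLER_PRESETS[model]

print(preset("flux-schnell")["steps"])  # 4
```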

Prompt Engineering for Image Generation

SDXL Prompt Structure

# Positive prompt
A professional photograph of a mountain landscape at golden hour,
dramatic clouds, snow-capped peaks, alpine meadow with wildflowers,
crystal clear lake reflection, 8K resolution, photorealistic,
shot on Canon EOS R5, 24mm wide angle lens, f/11

# Negative prompt
blurry, low quality, distorted, deformed, ugly, bad anatomy,
watermark, text, signature, jpeg artifacts, low resolution

FLUX Prompt Style

FLUX responds better to natural language descriptions:

A cozy coffee shop on a rainy afternoon. Through the window,
you can see people with umbrellas on a wet cobblestone street.
Inside, warm lighting illuminates wooden tables, stacked books,
and a steaming cup of latte with leaf art. The atmosphere is
warm and inviting, with a slight film grain quality.

FLUX typically does not need negative prompts.

Prompt Tips

Technique           Example                                        Effect
Quality modifiers   "masterpiece, 8K, detailed"                    Higher overall quality
Camera terms        "shot on Canon, 85mm, f/1.8, bokeh"            Photographic style
Lighting            "golden hour, dramatic lighting, rim light"    Mood and atmosphere
Style reference     "in the style of Studio Ghibli"                Artistic direction
Composition         "rule of thirds, leading lines, symmetrical"   Better framing
Specificity         "red 1967 Ford Mustang" vs "a car"             More accurate results
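If you generate prompts programmatically (for example when batching through the API later in this guide), these techniques amount to joining a specific subject with modifier phrases. A toy helper, with names of my own choosing:

```python
def build_prompt(subject: str, *modifiers: str) -> str:
    """Join a subject with optional modifier phrases into one comma-separated prompt."""
    parts = [subject] + [m for m in modifiers if m]
    return ", ".join(parts)

prompt = build_prompt(
    "red 1967 Ford Mustang on a coastal road",  # specificity
    "shot on Canon, 85mm, f/1.8, bokeh",        # camera terms
    "golden hour, dramatic lighting",           # lighting
    "rule of thirds",                           # composition
)
print(prompt)
```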

ControlNet

ControlNet lets you guide image generation using reference images for pose, edges, depth, or other structural information.

Installing ControlNet

# Install ControlNet nodes for ComfyUI
cd ComfyUI/custom_nodes
git clone https://github.com/Fannovel16/comfyui_controlnet_aux.git
pip install -r comfyui_controlnet_aux/requirements.txt

# Download ControlNet models
cd ComfyUI/models/controlnet

# For SDXL
wget https://huggingface.co/diffusers/controlnet-canny-sdxl-1.0/resolve/main/diffusion_pytorch_model.safetensors \
  -O sdxl-controlnet-canny.safetensors

# For FLUX
wget https://huggingface.co/InstantX/FLUX.1-dev-Controlnet-Canny/resolve/main/diffusion_pytorch_model.safetensors \
  -O flux-controlnet-canny.safetensors

ControlNet Types

Type         Input              Use Case
Canny        Edge detection     Preserve structure/outlines
Depth        Depth map          Maintain 3D composition
OpenPose     Pose skeleton      Control human poses
Scribble     Hand-drawn lines   Quick sketches to images
Tile         Image tiles        Upscaling with detail
IP-Adapter   Reference image    Style/subject transfer
Inpainting   Masked region      Edit parts of images
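To build intuition for what a Canny preprocessor hands to ControlNet: it converts the reference photo into a black-and-white edge map. Real Canny adds Gaussian smoothing, non-maximum suppression, and hysteresis thresholding; the deliberately simplified sketch below just thresholds the local gradient magnitude, enough to show the shape of the control image:

```python
def edge_map(img, threshold=50):
    """Toy edge detector: mark pixels where the local gradient magnitude
    exceeds a threshold. Not real Canny -- only an intuition for what the
    ControlNet control image (white edges on black) looks like."""
    h, w = len(img), len(img[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h - 1):
        for x in range(w - 1):
            dx = img[y][x + 1] - img[y][x]  # horizontal intensity change
            dy = img[y + 1][x] - img[y][x]  # vertical intensity change
            if abs(dx) + abs(dy) > threshold:
                edges[y][x] = 255  # white edge pixel
    return edges

# A tiny grayscale image with a sharp vertical boundary:
img = [[0, 0, 200, 200]] * 4
print(edge_map(img))  # edges appear along the 0 -> 200 boundary
```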

ControlNet Workflow

[Load Image] → [Canny Edge Detector] → canny_image
[Load ControlNet Model] → controlnet

[Apply ControlNet] ← (conditioning from CLIP, controlnet, canny_image)

[KSampler] → (uses controlled conditioning)

[VAE Decode] → [Save Image]

LoRAs (Low-Rank Adaptations)

LoRAs are small add-on models that modify the base model’s output for specific styles, characters, or concepts.

Using LoRAs

# Place LoRA files in:
ComfyUI/models/loras/

# Popular LoRA sources:
# - CivitAI (civitai.com)
# - Hugging Face

In ComfyUI, add a “Load LoRA” node between the checkpoint and the CLIP encoders:

[Load Checkpoint] → [Load LoRA] → [CLIP Text Encode]

                  lora_file: my_style.safetensors
                  strength_model: 0.7
                  strength_clip: 0.7

LoRA strength: 0.5-0.8 for subtle effect, 0.8-1.0 for strong effect. Too high can distort the image.

Stacking Multiple LoRAs

[Load Checkpoint] → [Load LoRA 1] → [Load LoRA 2] → [CLIP Encode]
                     style.safetensors   character.safetensors
                     strength: 0.6       strength: 0.8
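In API-format workflow JSON, the same chain is two LoraLoader nodes, each taking model and clip from the previous node. The class_type and input names below are ComfyUI's; the node ids and .safetensors file names are placeholder examples:

```python
# Two chained LoraLoader nodes in ComfyUI API-workflow form.
# Node ids and lora file names are examples; input names are LoraLoader's.
loras = {
    "10": {
        "class_type": "LoraLoader",
        "inputs": {
            "lora_name": "style.safetensors",
            "strength_model": 0.6,
            "strength_clip": 0.6,
            "model": ["4", 0],  # model output of the checkpoint loader
            "clip": ["4", 1],   # clip output of the checkpoint loader
        },
    },
    "11": {
        "class_type": "LoraLoader",
        "inputs": {
            "lora_name": "character.safetensors",
            "strength_model": 0.8,
            "strength_clip": 0.8,
            "model": ["10", 0],  # chained from the first LoRA
            "clip": ["10", 1],
        },
    },
}
```

Downstream nodes (the CLIP encoders and KSampler) then connect to node "11" instead of the checkpoint loader.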

VRAM Management

Reducing VRAM Usage

# Start ComfyUI with optimizations

# Low VRAM mode (offloads to CPU)
python main.py --lowvram

# Very low VRAM mode (aggressive offloading)
python main.py --novram

# Use fp16 (half precision)
python main.py --force-fp16

# Apple Silicon
python main.py --force-fp16

# Specify GPU
python main.py --cuda-device 0

VRAM Requirements by Configuration

Task                       VRAM Needed   Notes
SD 1.5, 512x512            4 GB          Basic generation
SD 1.5 + ControlNet        6 GB          With guidance
SDXL, 1024x1024            8 GB          Base generation
SDXL + ControlNet + LoRA   10-12 GB      Full pipeline
FLUX Schnell (fp8)         10 GB         Fast mode
FLUX Dev (fp8)             12 GB         Quality mode
FLUX Dev (fp16)            24 GB         Maximum quality
FLUX + ControlNet          16-24 GB      With guidance

Tiled Generation for High Resolution

Generate images larger than your VRAM allows using tiling:

# In ComfyUI, use the "Tiled KSampler" node
# This generates the image in overlapping tiles
# Allows 2048x2048+ on 8 GB VRAM
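The idea behind tiling is simple coordinate math: split the canvas into fixed-size windows that overlap so seams can be blended, and process one window at a time to bound peak VRAM. A sketch of that math (the function name is my own, not a ComfyUI API):

```python
def tile_boxes(width, height, tile=512, overlap=64):
    """Return (x, y, w, h) boxes covering an image with overlapping tiles,
    the way tiled samplers split work to keep peak VRAM bounded."""
    boxes = []
    step = tile - overlap  # advance less than a full tile so edges overlap
    for y in range(0, max(height - overlap, 1), step):
        for x in range(0, max(width - overlap, 1), step):
            boxes.append((x, y, min(tile, width - x), min(tile, height - y)))
    return boxes

# 1024x1024 with 512px tiles and 64px overlap -> 3x3 grid of 9 tiles
print(len(tile_boxes(1024, 1024)))  # 9
```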

Image-to-Image and Inpainting

Image-to-Image

Transform an existing image using a prompt:

[Load Image] → [VAE Encode] → latent

[KSampler] ← latent + conditioning
    denoise: 0.5  (lower = closer to original)

[VAE Decode] → [Save Image]

Denoise strength:

  • 0.2-0.3: Subtle changes (color correction, minor style)
  • 0.4-0.6: Moderate changes (style transfer, detail additions)
  • 0.7-0.9: Major changes (significant transformation)
  • 1.0: Full regeneration; the input latent is fully noised, so the original image has little to no influence

Inpainting

Edit specific regions of an image:

[Load Image] → image
[Load Mask] → mask (white = generate, black = keep)

[Set Latent Noise Mask] ← (latent from VAE Encode, mask)

[KSampler] ← conditioned latent
    denoise: 0.8

[VAE Decode] → [Save Image]

Upscaling

AI Upscaling Models

# Download upscale models to ComfyUI/models/upscale_models/
cd ComfyUI/models/upscale_models

# RealESRGAN x4 (general purpose)
wget https://github.com/xinntao/Real-ESRGAN/releases/download/v0.1.0/RealESRGAN_x4plus.pth

# 4x-UltraSharp (popular, high quality)
# Download from CivitAI or Hugging Face

Upscale Workflow

[Load Image] → [Upscale Image (using Model)] → 4x larger image

            [Load Upscale Model]

# For even better results, combine with img2img:
[Upscaled Image] → [VAE Encode] → [KSampler (denoise: 0.3)] → [VAE Decode]

Batch Generation and Automation

ComfyUI API

ComfyUI has a REST API for automation:

import json
import urllib.request

def queue_prompt(workflow):
    """Send a workflow to ComfyUI for processing."""
    data = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(
        "http://127.0.0.1:8188/prompt",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    return json.loads(urllib.request.urlopen(req).read())

# Load a saved workflow JSON from ComfyUI
with open("workflow_api.json") as f:
    workflow = json.load(f)

# Modify the prompt
workflow["6"]["inputs"]["text"] = "A beautiful sunset over the ocean"

# Queue it
result = queue_prompt(workflow)
print(f"Queued: {result}")

Batch Processing

prompts = [
    "A serene mountain lake at dawn",
    "A bustling city street at night, neon lights",
    "A quiet library with warm lighting and old books",
    "An alien landscape with two moons",
]

for i, prompt in enumerate(prompts):
    workflow["6"]["inputs"]["text"] = prompt
    workflow["3"]["inputs"]["seed"] = 42 + i  # Different seed each
    queue_prompt(workflow)
    print(f"Queued image {i+1}: {prompt[:50]}...")
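The loop above mutates one shared workflow dict, which works because queue_prompt serializes it immediately. If you instead want to prepare all jobs first (or queue them from multiple threads), it is safer to give each job its own deep copy. A sketch, reusing the node ids "6" and "3" from the example above (yours may differ):

```python
import copy

def make_batch(base_workflow, prompts, base_seed=42,
               text_node="6", sampler_node="3"):
    """Return one independent workflow per prompt, each with its own seed.
    deepcopy keeps edits to one job from leaking into the others."""
    jobs = []
    for i, prompt in enumerate(prompts):
        wf = copy.deepcopy(base_workflow)
        wf[text_node]["inputs"]["text"] = prompt
        wf[sampler_node]["inputs"]["seed"] = base_seed + i
        jobs.append(wf)
    return jobs
```

Each returned workflow can then be passed to queue_prompt independently.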

Alternative Interfaces

Automatic1111 (Stable Diffusion WebUI)

The older but still popular interface:

git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui
cd stable-diffusion-webui
./webui.sh  # Linux/macOS
# or webui-user.bat on Windows

Fooocus (Simplest)

Midjourney-like simplicity for local generation:

git clone https://github.com/lllyasviel/Fooocus
cd Fooocus
python -m venv venv
source venv/bin/activate
pip install -r requirements_versions.txt
python entry_with_update.py

DiffusionBee (macOS Native)

Download from diffusionbee.com for a native macOS experience.

Next Steps

  • Build AI workflows: Combine image generation with LLMs for automated content creation
  • Set up RAG: Local RAG Chatbot for text-based applications
  • Deploy for teams: Docker guide for multi-user access
  • Fine-tune models: Fine-Tuning guide to create custom models

Frequently Asked Questions

What is the difference between Stable Diffusion and FLUX?

Stable Diffusion (by Stability AI) is the established standard with a massive ecosystem of models, LoRAs, and tools. FLUX (by Black Forest Labs, founded by former Stability AI researchers behind the original Stable Diffusion) represents the next generation with better prompt adherence, more coherent compositions, and improved text rendering in images. FLUX models are typically larger and need more VRAM. SD is better for users with limited hardware or who need the extensive LoRA/ControlNet ecosystem. FLUX is better for users with 12+ GB VRAM who want higher quality output.

How much VRAM do I need for image generation?

SDXL requires 6-8 GB VRAM for basic generation and 10-12 GB with ControlNet. FLUX.1 Dev needs 10-12 GB minimum (24 GB recommended). SD 1.5 runs on 4 GB VRAM. You can use CPU offloading to run larger models on less VRAM, but generation will be much slower. Apple Silicon Macs can use unified memory, allowing SDXL on 16 GB and FLUX on 32 GB configurations.

How long does it take to generate an image locally?

On an RTX 4090 (24 GB): SD 1.5 takes 2-5 seconds, SDXL takes 10-20 seconds, FLUX Dev takes 15-30 seconds per image at 1024x1024. On an RTX 3060 (12 GB): SD 1.5 takes 5-10 seconds, SDXL takes 20-40 seconds. On Apple M3 Max: SDXL takes 30-60 seconds, FLUX takes 45-90 seconds. CPU-only generation takes minutes per image and is not practical for regular use.