Self-Hosting AI Image Generation: Why I Stopped Paying for Midjourney

Not bad art either. We are talking cinematic landscapes, fantasy characters, surrealist compositions, photorealistic portraits — all produced locally on a nine-year-old GPU. This is what self-hosting AI image generation looks like in 2026, and if you are already comfortable running your own infrastructure, it is a natural next step.

Why not just use Midjourney or Adobe Firefly?

The same reason I do not host my email at Gmail.

Every cloud AI image service is a dependency. The pricing changes. The terms of service change. The content policies tighten without notice. Images you generate today are, depending on the platform, used to train future models. Some services watermark outputs. Others restrict commercial use unless you are on the expensive tier.

Midjourney runs $10 to $60 USD per month. Adobe Firefly is bundled into a Creative Cloud subscription that costs significantly more. DALL-E charges per image. The costs add up quickly if you are generating regularly for a website, a business, or just for the enjoyment of it.

Running Stable Diffusion locally costs nothing after the initial setup. The models are free. The generations are free. The outputs are yours.

The setup

I use Stability Matrix as the frontend launcher. It handles Python environments, model management, and launching cleanly — think of it as a control panel that sits on top of the actual generation engine. Download it from GitHub and it handles nearly everything, including pointing multiple packages at a shared models folder so you don't store duplicates on disk.
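To see how much space a shared models folder could save, you can group model files by content hash. This is a minimal sketch and not part of Stability Matrix; the function names and the extension list are assumptions for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Assumed extensions for model files; adjust to what you actually store.
MODEL_EXTENSIONS = {".safetensors", ".ckpt", ".pt"}


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_models(roots: list[Path]) -> list[list[Path]]:
    """Group model files under the given folders that have identical content."""
    by_size = defaultdict(list)
    seen = set()
    for root in roots:
        for path in root.rglob("*"):
            if not path.is_file() or path.suffix.lower() not in MODEL_EXTENSIONS:
                continue
            real = path.resolve()  # symlinks to one file count once
            if real in seen:
                continue
            seen.add(real)
            by_size[real.stat().st_size].append(real)

    by_hash = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot be a duplicate, so skip hashing
        for path in same_size:
            by_hash[file_digest(path)].append(path)
    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

Each returned group is a set of files with the same bytes, so all but one copy in a group is redundant. Hashing multi-gigabyte checkpoints takes a while, which is why files are compared by size first.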

The actual workhorse is Automatic1111 Stable Diffusion WebUI — a browser-based interface that has been the community standard for years. Thousands of models, extensions, and ready-made configurations exist for it. You launch it through Stability Matrix, open localhost:7860 in a browser, and you have a full image generation studio running locally.

Hardware-wise, you want a dedicated NVIDIA GPU with at least 4GB VRAM. My GTX 1060 6GB handles SD 1.5 models comfortably with the --medvram flag, which splits the model into parts and keeps only the part currently in use in VRAM, trading some speed for a smaller memory footprint. If you are running SDXL or FLUX models you will want 8–12GB.
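If you are not sure how much VRAM your card reports, nvidia-smi can tell you. The sketch below wraps that query and maps the result onto the rough tiers mentioned above. The tier boundaries are my reading of those numbers, not official requirements, and the function names are made up for illustration.

```python
import subprocess


def parse_vram_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one integer (MiB) per non-empty line of nvidia-smi output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def query_vram_mib() -> list[int]:
    """Ask nvidia-smi for the total memory of each GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_vram_mib(result.stdout)


def vram_advice(vram_mib: int) -> str:
    """Map total VRAM onto the rough tiers from the text (assumed boundaries)."""
    gib = round(vram_mib / 1024)
    if gib < 4:
        return "below the 4GB minimum"
    if gib < 8:
        return "SD 1.5 range; --medvram may help"
    return "8GB or more; SDXL and FLUX become practical"
```

On a machine with the NVIDIA driver installed, `[vram_advice(v) for v in query_vram_mib()]` gives one line of advice per GPU.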

Generation settings

Getting good results is largely a settings problem before it's a prompt problem. These are the defaults I settled on after testing, tuned for SD 1.5 models on 6GB VRAM:

Setting	Value
Sampler	DPM++ 2M — Karras schedule
Steps	28
CFG Scale	5.5
Resolution	512 × 768
VAE	vae-ft-mse-840000-ema-pruned
CLIP Skip	2
Hires Fix	Enabled — 2× upscale
Upscaler	4x-UltraSharp
Hires Steps	15
Denoising Strength	0.45

Why these numbers? DPM++ 2M Karras is the sampler I found most reliable for photorealistic and detailed work; it converges cleanly in 25–30 steps. CFG 5.5 is lower than the A1111 default of 7, and the lower value produces more natural-looking results; higher CFG values push the model harder toward the prompt but introduce artefacts. The VAE is critical: without a proper SD 1.5 VAE, colours are washed out and skin tones look grey. CLIP skip 2 is standard for most SD 1.5 community models.

The hires fix workflow generates at 512×768 first (fast, low memory), then upscales to 1024×1536 using the 4x-UltraSharp model. At denoising 0.45 it adds detail without reinterpreting the composition. On a GTX 1060 a single image with hires fix takes around 90 seconds.
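The same settings can be sent to the WebUI's HTTP API instead of being clicked in by hand. This is a sketch under a few assumptions: the WebUI was launched with the --api flag, it listens on localhost:7860, the VAE file has a .safetensors extension, and your version accepts a separate scheduler field (older releases expect the combined sampler name "DPM++ 2M Karras" instead). The function names are mine.

```python
import json
import urllib.request

# Assumed endpoint; only available when the WebUI is started with --api.
API_URL = "http://localhost:7860/sdapi/v1/txt2img"


def hires_output_size(width: int, height: int, scale: float) -> tuple[int, int]:
    """Final resolution after the hires fix upscale."""
    return int(width * scale), int(height * scale)


def build_txt2img_payload(prompt: str, negative_prompt: str = "") -> dict:
    """Build a request body that mirrors the settings table above."""
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "sampler_name": "DPM++ 2M",
        "scheduler": "Karras",  # assumption: newer API; see the note above
        "steps": 28,
        "cfg_scale": 5.5,
        "width": 512,
        "height": 768,
        "enable_hr": True,
        "hr_scale": 2,
        "hr_upscaler": "4x-UltraSharp",
        "hr_second_pass_steps": 15,
        "denoising_strength": 0.45,
        "override_settings": {
            # Assumed file name; use whatever your VAE file is actually called.
            "sd_vae": "vae-ft-mse-840000-ema-pruned.safetensors",
            "CLIP_stop_at_last_layers": 2,
        },
    }


def generate(payload: dict, url: str = API_URL) -> list[str]:
    """POST the payload and return the images as base64-encoded strings."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.loads(response.read())["images"]
```

With the server running, `generate(build_txt2img_payload("a lighthouse at dusk"))` returns a list of base64 strings that you can decode and write to disk. `hires_output_size(512, 768, 2)` confirms the 1024×1536 final size.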

Negative prompts and embeddings

The negative prompt is just as important as the positive one. You are telling the model what to actively avoid. My standard negative:

BadDream, UnrealisticDream, (worst quality, low quality:1.4), (malformed hands:1.4), (poorly drawn hands:1.2), blurry, extra limbs, cloned face, disfigured, ugly, watermark, text, signature, cartoon, anime, painting, CGI

BadDream and UnrealisticDream are textual inversion embeddings — small files that encode entire patterns of unwanted output into a single token. They are far more effective than typing out individual negative terms because they capture complex multi-dimensional failure modes the model learned during training. Both are free downloads from CivitAI and sit in the embeddings/ folder.
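One thing to watch: as far as I know, if an embedding file is missing, its name is simply read as ordinary prompt text and nothing warns you. A quick check of the embeddings/ folder avoids that. This is a minimal sketch; the extension list and the exact-name comparison are assumptions.

```python
from pathlib import Path

# Assumed file extensions for textual inversion embeddings.
EMBEDDING_EXTENSIONS = {".pt", ".safetensors", ".bin"}


def installed_embeddings(embeddings_dir: Path) -> set[str]:
    """Return embedding names (file names without extension) found in a folder."""
    return {
        path.stem
        for path in embeddings_dir.iterdir()
        if path.is_file() and path.suffix.lower() in EMBEDDING_EXTENSIONS
    }


def missing_embeddings(required: list[str], embeddings_dir: Path) -> list[str]:
    """List required embedding names that have no matching file."""
    present = installed_embeddings(embeddings_dir)
    return [name for name in required if name not in present]
```

Calling `missing_embeddings(["BadDream", "UnrealisticDream"], Path("embeddings"))` should return an empty list once both files are in place.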

The numbers in brackets are attention weights. (malformed hands:1.4) means "pay 40% more attention to avoiding this". Hands are notoriously difficult for diffusion models and worth weighting heavily.
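If you keep the negative prompt in a script, it helps to build the weighted terms rather than hand-typing brackets. The helper names below are made up for illustration; the output format is the (term:weight) syntax described above.

```python
def weighted(term: str, weight: float) -> str:
    """Format a term in the (term:weight) attention syntax."""
    if weight == 1.0:
        return term  # weight 1.0 is the default, so no brackets are needed
    return f"({term}:{weight:g})"


def build_negative_prompt(
    embeddings: list[str],
    weighted_terms: list[tuple[str, float]],
    plain_terms: list[str],
) -> str:
    """Join embeddings, weighted terms, and plain terms into one prompt string."""
    parts = list(embeddings)
    parts += [weighted(term, weight) for term, weight in weighted_terms]
    parts += plain_terms
    return ", ".join(parts)


negative = build_negative_prompt(
    embeddings=["BadDream", "UnrealisticDream"],
    weighted_terms=[
        ("worst quality, low quality", 1.4),
        ("malformed hands", 1.4),
        ("poorly drawn hands", 1.2),
    ],
    plain_terms=[
        "blurry", "extra limbs", "cloned face", "disfigured", "ugly",
        "watermark", "text", "signature", "cartoon", "anime", "painting", "CGI",
    ],
)
```

Here `negative` reproduces the standard negative prompt above character for character, and changing a weight becomes a one-number edit.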

ADetailer — automatic face fixing

At 512×768, faces often come out slightly soft or inconsistent, especially when the figure is small in the frame. ADetailer is an A1111 extension that runs a second pass automatically: it detects faces using a YOLO object detection model, crops and re-generates just that region at higher detail, then composites it back. The result is sharper, more consistent faces without any manual inpainting.

Configuration is minimal — install the extension, enable it in the ADetailer accordion, set the model to face_yolov8n.pt, and leave the defaults. Denoising strength of 0.4 works well — enough to add detail without changing the face entirely.
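If you drive the WebUI through its HTTP API (launched with --api), ADetailer can be attached to a txt2img request body as an always-on script. Treat this as a sketch: the field names ad_model and ad_denoising_strength are as I recall them from the ADetailer documentation, and some versions expect extra leading arguments in the args list, so check against the version you have installed. The helper name is hypothetical.

```python
def with_adetailer(
    payload: dict,
    model: str = "face_yolov8n.pt",
    denoising_strength: float = 0.4,
) -> dict:
    """Return a copy of a txt2img request body with an ADetailer pass attached."""
    updated = dict(payload)
    scripts = dict(updated.get("alwayson_scripts", {}))
    scripts["ADetailer"] = {
        "args": [
            {
                # Assumed field names; verify against your ADetailer version.
                "ad_model": model,
                "ad_denoising_strength": denoising_strength,
            }
        ]
    }
    updated["alwayson_scripts"] = scripts
    return updated
```

The function copies the payload rather than editing it in place, so the same base settings can be reused with and without the face pass.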

The models

Stable Diffusion 1.5 is the foundation — fast, modest VRAM requirements, and by far the largest ecosystem of community-trained variants. Think of it like the base Linux kernel that everyone builds their own distribution on top of. All of the models below are free to download from CivitAI or HuggingFace and sit alongside each other in Stability Matrix's shared models folder.

Realistic Vision v6 — photorealistic portraits and people, film photography aesthetic. The HyperVAE version has a baked-in VAE so no external VAE file is needed.
Dreamshaper 8 — versatile all-rounder, handles illustration, concept art, and photorealism. The best single model if you only want one.
AbsoluteReality v1.8 — photorealism with better ethnic diversity than most models; good for marketing imagery where you want varied faces.
Deliberate v6 — detailed and painterly, excellent for fantasy scenes and editorial work. Handles complex compositions well.
Dreamlike Photoreal 2.0 — very clean, modern photography look. My first choice for website hero images and lifestyle shots.
ICBINP Mid 2024 — one of the most convincingly photographic SD 1.5 models available. Stands for "I Can't Believe It's Not Photography".
toonyou Beta 6 — friendly cartoon style, consistent character anatomy. Best choice for children's content, illustrations, and anything that needs a safe, approachable look.

Writing good prompts

A few principles that consistently improve results:

Be specific about style. "A portrait" gets you something generic. "A portrait, 35mm film photo, soft natural lighting, bokeh background, Vogue editorial style" gives the model clear direction.

Use quality anchors. Terms like photorealistic, RAW photo, 8k, (masterpiece:1.2) signal high-quality output. As in the negative prompt, the bracketed number is a weight: (masterpiece:1.2) means "pay 20% more attention to this".

Match the model to the task. A photorealism model will fight you if you ask for anime. A cartoon model will struggle with documentary-style photography. Use the right tool for the subject.

Keep ethnicity descriptors in portrait prompts. Most SD 1.5 photorealism models have a bias toward East Asian faces as a default. Adding caucasian, european, dark skin, or whatever you actually want is necessary for consistent results.
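These principles amount to assembling a prompt from a few separate lists. A small helper makes that explicit and keeps the pieces reusable between images. The function and parameter names are hypothetical, and the example terms are taken from the text above.

```python
from collections.abc import Sequence


def compose_prompt(
    subject: str,
    descriptors: Sequence[str] = (),
    style: Sequence[str] = (),
    quality: Sequence[str] = (),
) -> str:
    """Join subject, descriptors, style terms, and quality anchors with commas."""
    parts = [subject, *descriptors, *style, *quality]
    return ", ".join(part.strip() for part in parts if part.strip())


portrait = compose_prompt(
    "A portrait",
    descriptors=["european"],
    style=[
        "35mm film photo",
        "soft natural lighting",
        "bokeh background",
        "Vogue editorial style",
    ],
    quality=["RAW photo", "(masterpiece:1.2)"],
)
```

Keeping style and quality terms in their own lists means you can swap the subject while holding the look constant, which is handy when generating a matching set of images.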

The images

Below are the thirty prompts I ran locally using the settings above. Every image came off the GTX 1060 6GB, with no cloud, no API, and no subscription. Generation time per image including hires fix was around 90 seconds.

Epic & Cinematic

A lone warrior standing on a cliff at sunrise, vast mountains in the background, cinematic lighting, ultra-realistic, 8K

A futuristic cyberpunk city at night, neon lights reflecting on wet streets, flying cars, Blade Runner style

A dragon flying over a burning medieval city, dramatic clouds, epic fantasy art

A space explorer discovering an ancient alien temple on Mars, glowing symbols, cinematic atmosphere

A massive wave frozen mid-crash with a surfer inside, hyper-detailed, slow-motion effect

Nature & Dreamscapes

A forest where trees glow with bioluminescent light, magical fog, fantasy realism

A giant floating island with waterfalls pouring into the sky, surreal dreamscape

A deer made of constellations standing under a starry sky, cosmic art style

A desert with glass sand dunes reflecting the sunset, hyper-realistic

A waterfall flowing upward into the clouds, surreal, vibrant colors

Surrealism & Symbolism

A human head dissolving into a swarm of butterflies, symbolism of transformation

Time melting like liquid clocks in a modern city, Salvador Dalí-inspired

A person walking through doors floating in empty space, each door a different universe

A heart made of fire and ice colliding, representing emotional contrast

A brain growing like a tree with glowing knowledge fruits

Technology & Sci-Fi

A humanoid robot sitting alone in the rain, emotional expression, cinematic lighting

A futuristic astronaut floating through a neon galaxy, vaporwave aesthetic

A cybernetic samurai with glowing armor in a dystopian Tokyo

AI controlling a vast digital universe, glowing data streams everywhere

A futuristic city floating above Earth, ultra-detailed sci-fi landscape

Artistic Styles

A portrait of a woman in the style of Van Gogh, swirling brush strokes, vivid colors

A bustling street market painted in watercolor, soft tones, dreamy atmosphere

A lion illustrated in geometric low-poly style, vibrant colors

A surreal portrait with Picasso-inspired abstract face distortion

A fantasy castle in Studio Ghibli style, warm lighting, cozy mood

Fantasy & Mythology

A phoenix rising from ashes in slow motion, fire particles, ultra-detailed

A mermaid sitting on a rock during a storm, dramatic ocean waves

A wizard casting a glowing spell in a dark enchanted forest

A giant ancient tree with a hidden glowing city inside its trunk

A knight facing a shadow monster made of smoke and darkness
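At roughly 90 seconds per image, it is worth knowing how long a full batch will tie up the GPU. The arithmetic is simple, sketched here with the figure quoted above as the default; the function name is made up for illustration.

```python
from datetime import timedelta

SECONDS_PER_IMAGE = 90  # approximate figure quoted above, hires fix included


def batch_duration(image_count: int, seconds_per_image: float = SECONDS_PER_IMAGE) -> timedelta:
    """Estimate total generation time for a batch run one image at a time."""
    return timedelta(seconds=image_count * seconds_per_image)


print(batch_duration(30))  # 0:45:00
```

So the thirty prompts above come to about 45 minutes of generation on this card, assuming one image per prompt and no retries.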

Getting started

Stability Matrix is the easiest entry point. Download it, install Automatic1111 through its package manager, and download a model or two from CivitAI. You will be generating images within an hour of starting.

You will need:

A dedicated NVIDIA GPU — 4GB VRAM minimum, 6GB+ recommended for SD 1.5 with hires fix
Around 20GB of free disk space for a useful model collection
GNU/Linux, Windows, or macOS — Stability Matrix runs on all three
CUDA drivers installed and up to date
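A couple of these requirements can be checked from a script before you download anything. This is a rough sketch: it only confirms that nvidia-smi is on the PATH (a reasonable sign the NVIDIA driver is installed, not proof that CUDA works) and that the target drive has the roughly 20GB mentioned above. The names are hypothetical.

```python
import shutil

REQUIRED_FREE_GB = 20  # rough figure from the list above


def free_disk_gb(path: str = ".") -> float:
    """Free space on the drive containing path, in gigabytes (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def preflight(path: str = ".", required_gb: float = REQUIRED_FREE_GB) -> dict[str, bool]:
    """Run basic checks and report each as pass (True) or fail (False)."""
    return {
        "nvidia-smi found": shutil.which("nvidia-smi") is not None,
        "enough disk space": free_disk_gb(path) >= required_gb,
    }


for check, passed in preflight().items():
    print(f"{check}: {'ok' if passed else 'missing'}")
```

Pass the folder where you plan to keep models as `path`, since that is the drive that needs the space.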

The first generation feels genuinely surprising. You type a sentence and get an image back — one that exists nowhere else, made on your own hardware, owned entirely by you.

That is the point of self-hosting, applied to AI.