Troubleshooting Guide

Use this guide to diagnose and fix common deployment issues. Start with the symptom you're seeing, then follow the steps.


How to Check Logs

Most issues can be diagnosed from the pod logs:

  1. Go to your Dashboard
  2. Click the deployment card or the menu
  3. Click View Logs
  4. Look for lines with ERROR or WARNING

If logs show "Container starting up..." with a spinner, the pod hasn't finished pulling the Docker image yet. This can take 5–15 minutes for large images.
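If you download the logs as text, the ERROR/WARNING scan in step 4 can be automated with a few lines of Python. The sample log below is illustrative, not real ModelPilot output:

```python
def find_problem_lines(log_text):
    """Return log lines that contain ERROR or WARNING markers."""
    return [line for line in log_text.splitlines()
            if "ERROR" in line or "WARNING" in line]

sample_log = """INFO  Container starting up...
WARNING Model cache miss, downloading from HuggingFace
ERROR CUDA out of memory
INFO  Retrying..."""

for line in find_problem_lines(sample_log):
    print(line)
```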

⏳ Deployment Stuck on "Starting"

Deployments go through several phases. Here's how long each should take:

Model Type         Normal      Extended     Possibly Stuck   Likely Failed
Image / Text       0–15 min    15–30 min    30–60 min        60+ min
Video              0–30 min    30–60 min    60–120 min       120+ min
Custom Workflow    0–30 min    30–50 min    50–90 min        90+ min

Why it's slow: First-time deployments download models from HuggingFace (5–20 GB). Video models take longest. Subsequent deploys of the same model are faster because models are cached in the Docker image.

💡 Tip: If your deployment reaches the "Likely Failed" threshold, stop it to save costs and try again. A fresh pod may land on a node with better network connectivity.
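If you script your own monitoring, the table above can be encoded as a small helper. This is a sketch; the thresholds are copied from the table and the function name is hypothetical:

```python
# Upper bounds (in minutes) for the normal / extended / possibly-stuck
# phases, per model type, taken from the timing table above.
THRESHOLDS = {
    "image/text": (15, 30, 60),
    "video": (30, 60, 120),
    "custom workflow": (30, 50, 90),
}

def classify_startup(model_type, elapsed_min):
    """Classify how long a deployment has been stuck on 'Starting'."""
    normal, extended, stuck = THRESHOLDS[model_type]
    if elapsed_min <= normal:
        return "normal"
    if elapsed_min <= extended:
        return "extended"
    if elapsed_min <= stuck:
        return "possibly stuck"
    return "likely failed"
```

A result of "likely failed" is the point where stopping the pod and redeploying is usually cheaper than waiting.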

❌ Deployment Failed

Check your logs for these common errors:

GPU Memory Exhausted (CUDA OOM)

Log message: CUDA out of memory or OutOfMemoryError

The model needs more VRAM than your GPU has.

  • Upgrade to a GPU with more VRAM (L4 24GB → A6000 48GB → A100 80GB)
  • Use a smaller model variant (e.g., 8B instead of 32B)
  • Reduce image resolution or batch size in your workflow
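As a rough rule of thumb, fp16 weights take about 2 bytes per parameter, plus headroom for activations and the CUDA context. A ballpark estimator (the 20% overhead factor is an assumption, not an exact figure):

```python
def estimate_vram_gb(params_billions, bytes_per_param=2, overhead=1.2):
    """Rough VRAM needed to hold model weights in fp16 (2 bytes/param),
    with ~20% headroom for activations and CUDA context.
    A ballpark, not an exact requirement."""
    return params_billions * bytes_per_param * overhead

# A 32B model needs roughly 32 * 2 * 1.2 ≈ 77 GB: too big for a 24 GB L4,
# borderline on a 48 GB A6000, comfortable on an 80 GB A100.
```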

Model Not Found

Log message: Model not found or No such file .safetensors

  • Check the model name is spelled correctly
  • For workflows: verify model filenames match what ComfyUI expects
  • For gated models: add your HuggingFace token in settings

Missing Python Package

Log message: ModuleNotFoundError or No module named

  • A custom node requires a Python package not in the Docker image
  • Open ComfyUI Manager → Install Missing Packages → Restart
  • Or redeploy to trigger a fresh custom node installation
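To see which imports are missing before redeploying, you can probe them from any Python shell on the pod using the standard library. The package names below are placeholders:

```python
import importlib.util

def missing_packages(names):
    """Return the subset of top-level package names that can't be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Example: check a couple of names a custom node might need.
print(missing_packages(["json", "some_missing_custom_node_dep"]))
```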

GPU Driver Error

Log message: CUDA error or cuDNN error

  • Restart the pod to reset GPU state
  • Try a different GPU type
  • This is usually a transient hardware issue on the RunPod node

Process Killed (OOMKilled)

Log message: Killed or OOMKilled

  • The system ran out of RAM (system memory, not GPU VRAM)
  • Switch to an instance type with more system RAM
  • Reduce the complexity of your workflow

Disk Full

Log message: No space left on device

  • The container disk (200 GB default) is full — usually from too many models
  • Delete unused generated files from the ComfyUI output folder
  • Note: CPU instances have only 20 GB disk
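You can confirm disk pressure from a Python shell on the pod with the standard library:

```python
import shutil

def disk_usage_report(path="/"):
    """Report disk space in GB for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    gb = 1024 ** 3
    return {"total_gb": usage.total / gb,
            "used_gb": usage.used / gb,
            "free_gb": usage.free / gb}

report = disk_usage_report(".")
print(f"{report['free_gb']:.1f} GB free of {report['total_gb']:.1f} GB")
```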

Gated Model (Access Denied)

Log message: gated model or access token required

  • Some HuggingFace models require you to accept a license first
  • Go to the model page on huggingface.co and accept the terms
  • Add your HuggingFace token to your ModelPilot account settings

🔒 GPU Unavailable

Error: "The selected instance type is temporarily unavailable"

This means RunPod's data centers have no machines with your selected GPU available right now. This is not a ModelPilot issue — it's GPU supply and demand.

  • Try a different GPU: L4 ($0.51/hr), RTX 4090 ($0.77/hr), A6000 ($0.64/hr), and A100 ($1.81/hr) have different availability
  • Wait and retry: GPU capacity changes every few minutes as other users finish
  • Try off-peak hours: Availability is typically better outside US business hours
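The wait-and-retry advice can be automated with jittered exponential backoff. This is a sketch: `try_deploy` stands in for whatever deploy call you use and is not a ModelPilot API:

```python
import random
import time

def deploy_with_retry(try_deploy, attempts=5, base_delay=60):
    """Retry a deployment with jittered exponential backoff.

    `try_deploy` is any callable returning True on success and False
    when the selected GPU type is temporarily unavailable.
    """
    for attempt in range(attempts):
        if try_deploy():
            return True
        # Capacity changes every few minutes, so back off and try again.
        delay = base_delay * (2 ** attempt) + random.uniform(0, 10)
        time.sleep(delay)
    return False
```

Jitter spreads out retries so that many clients hitting the same shortage don't all retry at the same instant.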

🐌 Model Running But Slow

  • Wrong GPU for the model: Large models (32B+ parameters) on a 24 GB GPU will swap to system memory. Upgrade to A6000 (48 GB) or A100 (80 GB).
  • First request is slow: The model loads into GPU memory on the first request. Subsequent requests are much faster.
  • High resolution / large batch: Reduce image dimensions or batch size in your ComfyUI workflow.

🔌 API Errors

Status   Meaning                      Fix
401      Invalid or missing API key   Check your API key in Dashboard → API Keys
402      Insufficient credits         Add credits at Billing
404      Deployment not found         Check the pod ID; the deployment may have been deleted
429      Rate limited                 Wait 60 seconds and retry. API limit is 100 requests/min
503      Pod not ready                The model is still loading. Wait for status to show "Running"
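A client can branch on these status codes mechanically. A minimal sketch (the suggested wait times come from the table for 429 and are an assumption for 503, not an API guarantee):

```python
def handle_api_status(status):
    """Map an API status code to a suggested action, per the table above."""
    retryable = {429: 60, 503: 30}  # status -> suggested wait in seconds
    if status in retryable:
        return ("retry", retryable[status])
    fatal = {
        401: "check your API key",
        402: "add credits",
        404: "verify the pod ID",
    }
    if status in fatal:
        return ("fix", fatal[status])
    return ("ok", None) if 200 <= status < 300 else ("unknown", None)
```

Treat 429 and 503 as transient (retry with a delay) and 401/402/404 as errors that retrying will never fix.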

Still Need Help?

If your issue isn't covered here: