Dirt-Simple Remote Inference for LLMs
tl;dr:
I’d like an easy way to do batch jobs on remote GPUs. I don’t want to have to wait around to manually terminate instances when the jobs are done.
What? Why?
Like everyone lately, I could sure use some more VRAM. My volume of ML/AI workloads doesn’t yet warrant investing in local hardware. A100s in the cloud are under 3¢/minute – an order of magnitude cheaper than a long-distance phone call in my birth year – and RTX 3090s are actually free, if you format the prices with %0.2f, so let’s use ‘em.
Why RunPod?
There’s a lot of providers offering GPUs in the cloud at the moment, but RunPod is one of a small number[1] of them that offers pre-paid bill-capped service. According to their FAQ, they’ll just stop service if you run out of pre-paid credit.
That’s perfect for doing quick hacky stuff like this, since you can’t accidentally rack up hundreds or thousands of dollars of bills like many people have done on AWS and GCP.[2]
Why NOT RunPod?
I’d prefer a full VPS, but RunPod is a container-based service. This might be preferable if that’s your workflow anyway. It can, however, be mildly annoying if you just want to leverage a prebuilt image and naively expect full-featured SSH.[3]
Their containers do not, by default, get a public IP address. You can request a public IP, but if you do, it looks to me like you’re pulling from a smaller pool of machines, so you might be more likely to encounter availability issues.
You might think you don’t need a public IP, but unless your traffic is exclusively HTTP, you probably do. They have proxying for HTTP, but they won’t forward TCP ports for you unless the host supports a public IP.
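For plain HTTP, though, the proxy is easy to use: each exposed HTTP port gets a hostname of the form {pod_id}-{port}.proxy.runpod.net, which is exactly the URL the code below constructs. A minimal sketch, assuming a hypothetical pod ID and Ollama on its default port:
import requests

POD_ID = "abc123xyz"   # hypothetical; substitute the ID of a running pod
OLLAMA_PORT = 11434

# RunPod's proxy terminates TLS and forwards to the container's HTTP port.
url = f"https://{POD_ID}-{OLLAMA_PORT}.proxy.runpod.net/api/tags"
print(requests.get(url, timeout=30).json())   # lists the models the pod has pulled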
I also encountered at least one issue that feels like a bug. When calling runpod.create_pod() with a template, the template’s storage configuration is not respected. All attempted launches will fail somewhat cryptically unless you directly supply a storage configuration to create_pod(). I was able to find the workaround rather quickly because I happened to think to check their Discord server, but that’s not necessarily the first place one might think to look.
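The workaround, as in the full code below, is just to pass the storage sizes directly to create_pod(); a minimal sketch (the name, GPU, and sizes here are placeholders):
import runpod

pod = runpod.create_pod(
    "ollama-batch",                    # placeholder name
    "ollama/ollama",                   # image
    "NVIDIA GeForce RTX 3090",         # GPU id from runpod.get_gpus()
    volume_mount_path="/root/.ollama",
    ports="11434/http",
    container_disk_in_gb=128,          # supply these explicitly, even if the
    volume_in_gb=128,                  # template already configures storage
)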
The vibe I get is that they’re new at this, but improving. I might hesitate to use RunPod in production, but on balance I like the service.
You said this would be easy?
Yep! Following the happy path and just running a simple prebuilt Ollama container with a GPU is pretty easy.
If you want to use my code below (also available as a Gist), you can just get an API key and use it like:
with runpod_ollama_client("tinyllama-test", "NVIDIA GeForce RTX 3080",
load_model="tinyllama") as client:
prompt = "Tell your favorite joke?"
response = client.generate(model='tinyllama', prompt=prompt)
    print(response.response)
Performance
When using a small model that loads quickly, the service goes live within a couple minutes. For bigger models, factor in the time to download. Most of the pods I’ve been given have purported to have pretty decent bandwidth, but then it’s a matter of how much bandwidth the server hosting the models wants to give you.[4] Anyway, for me, for batch use, this is perfectly acceptable.
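For a rough sense of what “factor in the time to download” means, the back-of-envelope math is just model size over effective bandwidth; the numbers below are illustrative guesses, not measurements:
# Rough, illustrative numbers only.
model_gb = 75          # e.g. a 70B model at q8 is on the order of 70-80 GB
bandwidth_gbps = 1.0   # whatever the pod and the model host actually give you

download_minutes = (model_gb * 8) / bandwidth_gbps / 60
print(f"~{download_minutes:.0f} minutes to pull the model")   # ~10 minutes here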
Security
There is none. That HTTP endpoint is publicly accessible with no authentication.
Ollama doesn’t have anything built in. Even if it did, it’s a fast-moving project that hasn’t hit v1.0 yet – so, would you trust it?
Sure, you’re getting a secure tunnel to the proxy, I guess. Between there and your pod, I think you’re trusting RunPod (and, possibly, a third-party host!) not to snoop.
Practically speaking, I’m not very concerned. My workloads are not sensitive. For me, I think the most likely (but still unlikely) worst case scenario is that someone somehow guesses one of my pod IDs and piggybacks some free inference.
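To make that concrete: anyone who knows (or guesses) a pod ID can hit the generate endpoint directly, with no credentials. A sketch with a made-up pod ID:
import requests

# Made-up pod ID; the point is that nothing else is required.
url = "https://abc123xyz-11434.proxy.runpod.net/api/generate"
payload = {"model": "tinyllama", "prompt": "free inference, please", "stream": False}
print(requests.post(url, json=payload, timeout=120).json()["response"])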
Thoughts for the Future
RunPod can also do serverless inference. They have some semi-pre-configured stuff using vLLM, but it looked slightly more involved than just spinning up an ollama pod. The pods also have the advantage that I can SSH in and troubleshoot any issues which may come up while I’m getting started. For now, my workloads are batch-processing things anyway, so a serverless approach doesn’t do me many favors.
If I have workloads that are less batchy, I may invest some energy in figuring out a serverless solution. That said, unless you’re paying to keep an instance hot, the latency seems likely to be pretty ugly, and if you do keep an instance hot, it doesn’t sound so serverless after all.
The Code
import os
import sys
import time
import contextlib
import httpx  # ollama uses httpx under the hood; its errors surface in the retries below
import tqdm
import runpod
import ollama
runpod.api_key = os.environ.get('RUNPOD_API_KEY')
PORT = 11434
class LaunchFailureException(Exception):
"""Raised when a pod or instance fails to launch properly"""
def __init__(self, message):
self.message = message
super().__init__(self.message)
def with_retry(f, should_retry=lambda _: True,
max_attempts=5, delay=2, backoff=False):
"""
should_retry is a function that accepts the exception raised by f and
returns a bool
"""
for attempt in range(max_attempts):
try:
return f()
except Exception as e:
if attempt == max_attempts - 1:
raise e
if not should_retry(e):
raise e
print(
f"Attempt {attempt + 1} failed. "
f"Retrying in {delay} seconds..."
)
time.sleep(delay)
if backoff:
delay *= 2
def wait_for_pod_reported_ready(pod, ready_timeout_seconds=450):
print("Waiting for pod: ", end="")
start_t = time.time()
while (time.time() - start_t) < ready_timeout_seconds:
pod_info = runpod.get_pod(pod['id'])
        if not pod_info:
            print("-", end="")
            time.sleep(1)
            continue
if pod_info['runtime']:
print("\n")
assert 'ports' in pod_info['runtime']
return pod_info
print(".", end="")
sys.stdout.flush()
time.sleep(1)
raise LaunchFailureException(
message=f"Pod failed to become ready. \n Pod ID: [{pod['id']}]"
)
def pull_with_progress_bar(client, model):
current_digest = ''
progress_bars = {}
# Note that streaming mode is necessary, otherwise the request gets timed
# out, I think because of the proxying through Cloudflare (100 seconds).
# See doc here:
# https://docs.runpod.io/pods/configuration/expose-ports
for response in client.pull(model, stream=True):
digest = response.digest or ''
if not digest:
print(response.status)
continue
if digest != current_digest and current_digest in progress_bars:
progress_bars[current_digest].close()
if digest not in progress_bars and response.total:
progress_bars[digest] = tqdm.tqdm(
total=response.total,
desc=f"Pulling {digest[7:19]}",
unit='B',
unit_scale=True
)
if response.completed and digest in progress_bars:
bar = progress_bars[digest]
bar.update(response.completed - bar.n)
current_digest = digest
def load_model_on_server(client, model):
def load_model_on_server_inner():
print(f"Loading model...")
pull_with_progress_bar(client, model)
with_retry(
load_model_on_server_inner,
should_retry=lambda e: (
isinstance(e, ollama._types.ResponseError)
or
isinstance(e, httpx.RemoteProtocolError)
),
backoff=True
)
def wait_for_ollama_generate_ready(client, model):
"""
Sometimes /api/generate seems to not be ready immediately after the pull.
"""
def wait_for_ollama_generate_ready_inner():
print("Waiting for /api/generate...")
response = client.generate(model=model, prompt="check check 123")
assert response.response
with_retry(
wait_for_ollama_generate_ready_inner,
should_retry=lambda e: (
isinstance(e, ollama._types.ResponseError)
or
isinstance(e, httpx.RemoteProtocolError)
)
)
@contextlib.contextmanager
def runpod_ollama_client(name, podspec="NVIDIA GeForce RTX 3090",
load_model=None, storage_gb=128):
def create_pod():
return \
runpod.create_pod(
name,
"ollama/ollama",
podspec, # use the ids from `runpod.get_gpus()`
volume_mount_path="/root/.ollama",
ports=f"{PORT}/http",
container_disk_in_gb=storage_gb, # required even with template
volume_in_gb=storage_gb # required even with template
)
pod = \
with_retry(
create_pod,
should_retry=lambda e: (
isinstance(e, runpod.error.QueryError)
and
(
"does not have the resources to deploy your pod" in str(e)
or
"no longer any instances available" in str(e)
)
)
)
try:
pod_info = wait_for_pod_reported_ready(pod)
ollama_host_url = f"https://{pod_info['id']}-{PORT}.proxy.runpod.net"
client = ollama.Client(host=ollama_host_url)
# You need to load a model before you can get completions.
if load_model is None:
print('Proceeding without loading a model.')
print('Probably this is a mistake.')
else:
load_model_on_server(client, load_model)
wait_for_ollama_generate_ready(client, load_model)
print("Verified /api/generate is ready!")
yield client
except LaunchFailureException as lfe:
print("Looks like a bad machine.")
raise lfe
finally:
runpod.stop_pod(pod['id'])
runpod.terminate_pod(pod['id'])
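# Safety net for the "don't leave instances running" goal: if a run dies before
# the finally block above executes, pods can be left behind. Run this to sweep
# up every pod on the account.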
def kill_all_pods():
pods = runpod.get_pods()
print(f"Found [{len(pods)}] pods to kill.")
killed = 0
for pod in pods:
pod_id = pod["id"]
try:
print(f"Stopping pod {pod_id}...")
runpod.stop_pod(pod_id)
print(f"Terminating pod {pod_id}...")
runpod.terminate_pod(pod_id)
killed += 1
print(f"Successfully terminated pod {pod_id}")
except Exception as e:
print(f"Error terminating pod {pod_id}: {str(e)}")
print(f"Terminated [{killed}] pods.")
def tell_me_a_joke():
with runpod_ollama_client("tinyllama-test", "NVIDIA GeForce RTX 3080",
load_model="tinyllama") as client:
prompt = "Tell your favorite joke?"
response = client.generate(model='tinyllama', prompt=prompt)
print(response.response)
def tell_me_a_more_expensive_joke__llama3_3_70b_q8():
model = "llama3.3:70b-instruct-q8_0"
with runpod_ollama_client("llama3_3__70b__q8_test", "NVIDIA A100 80GB PCIe",
load_model=model, storage_gb=256) as client:
prompt = """
Tell me your favorite joke, but talk like a pirate.
To be clear, the joke should not be even remotely nautical.
No pirate jokes whatsoever.
Tell me a normal, non-pirate joke.
Just, you know, tell it like as if you *also* happened to be a pirate.
"""
response = client.generate(model=model, prompt=prompt)
print(response.response)
if __name__ == "__main__":
    tell_me_a_joke()

1. Vast.ai is the only other one I found, but my search was not exhaustive.
2. Some of these examples are more sympathetic than others, but you get the picture. Of course, AWS becomes less risky the more time you spend being cautious, but I don't enjoy caution for caution's sake. Anyway, who wants to give money to Jeff when we could give it to literally anyone else?
3. Some of their prebuilt templates do use an image that runs SSHd, but not all. Of the rest, whether or not you can use the trick they suggest here depends on whether the Docker image involved uses 'entrypoint' or 'cmd'. You can override a 'cmd', but they don't support overriding an 'entrypoint'. What's provided by default in all cases is something which looks like SSH access, but isn't. They are actually letting you connect to the underlying Docker host, and have set your login shell to docker exec -it $YOUR_CONTAINER /bin/bash. Considering the full context, that's great, since you can get into any container without having to run your own SSHd. It does mean, however, that you are getting only an interactive console. There's no secure tunneling, no SFTP, no SCP, and, in fact, you can't even do something like ssh $host apt install $package. If you want to force the issue, you could do echo "apt install $package" | ssh -tt $host, but if you start trying to pipe any non-trivial scripting through the interactive console, it will eventually make you sad. SSH can't even forward you an exit code in this case, so you're getting into 'pexpect' territory...
4. Considering my fairly infrequent use, I don't feel bad about hitting the servers hosting the models. If you were going to do something similar but more frequently, it might be more responsible to prebuild an image with the model already downloaded, so you're hitting your container registry rather than the public model servers. This likely won't change runtime performance significantly, since the Docker host is pulling roughly the same number of bytes down the same connection, wherever they're coming from.