Working with multi-gigabyte LLM and diffusion model files presents a practical challenge: local storage is fast but limited, while network storage is capacious but slow. This is especially true for compact workstations like a Mac Mini or a laptop with a small SSD, where there is rarely room for more than a few large models.
What if you could get the best of both worlds—the speed of local storage for active models and the limitless capacity of a network share? Instead of copying files back and forth, we can use a transparent caching layer built right into the Linux kernel.
The Bottleneck: Network Latency vs. Model Size
The standard approach of mounting a network drive (NFS or Samba/CIFS) containing your model repository solves the storage problem but introduces a performance penalty. Loading a 10GB model over a network, even a fast one, can cause significant delays. This slows down experimentation, hinders rapid model switching, and creates a frustrating development cycle.
The solution isn't to fight the network but to smartly use local storage as a massive read-cache for the remote filesystem. Enter FS-Cache.
FS-Cache: The Hidden Gem for Accelerating Network Filesystems
Linux has long included a powerful, filesystem-agnostic caching layer called FS-Cache. Originally built to accelerate network filesystems such as NFS and AFS, it turns out to be an excellent fit for AI workloads. The concept is simple:
- The first time a model file is read from the network, it is silently stored in a designated cache on a local disk (ideally an SSD).
- Every subsequent read for that file is served directly from the local cache at drive speeds, bypassing the network entirely.
This means the second time you load "llama2-7b.Q4.gguf," it is read at local SSD speed instead of crossing the network.
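As a quick illustration (assuming the share is already mounted with caching enabled as described below, and that a model exists at this hypothetical path), timing the same read twice shows the effect:
time cat /mnt/models/llama2-7b.Q4.gguf > /dev/null   # first read: pulled over the network while the local cache is populated
time cat /mnt/models/llama2-7b.Q4.gguf > /dev/null   # second read: served from the local cache (or RAM, if the file still fits in the page cache)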
Implementation: A Two-Step Setup
The beauty of this system is that it requires no changes to your applications. PyTorch, TensorFlow, or any other tool that reads files will transparently benefit.
Step 1: Configure the cachefilesd Daemon
The kernel provides the caching engine, but you need a userspace manager for the cache directory. This is handled by cachefilesd.
- Install it: sudo apt install cachefilesd (on Debian/Ubuntu).
- Edit /etc/cachefilesd.conf to point dir at a directory on your fast local drive (e.g., dir /var/cache/fscache), and make sure that drive has enough space for your active set of models; a sample configuration is sketched after this list.
- Start and enable the daemon: sudo systemctl enable --now cachefilesd.
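A minimal /etc/cachefilesd.conf might look like the sketch below. The dir and tag lines are the essentials; the brun/bcull/bstop thresholds (and their frun/fcull/fstop counterparts for file counts) control when cachefilesd starts reclaiming old cache entries, and the percentages shown mirror the defaults typically shipped with the package, so adjust them to your disk:
# Sketch of /etc/cachefilesd.conf -- adjust the path and thresholds to your setup
dir /var/cache/fscache   # cache location on the fast local drive
tag mycache              # arbitrary label for this cache
brun 10%                 # culling stops once free space rises above 10%
bcull 7%                 # culling starts when free space falls below 7%
bstop 3%                 # new cache writes pause when free space falls below 3%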
Step 2: Mount the Network Share with the fsc Flag
Now, mount your network share containing the models with the special fsc (filesystem cache) option.
For a CIFS/Samba share:
sudo mount -t cifs //ai-server/models /mnt/models -o username=user,password=pass,fsc
For an NFS share:
sudo mount -t nfs ai-server:/models /mnt/models -o fsc
That's it. Any file read from /mnt/models is now eligible for caching.
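To make the mount survive reboots and to confirm that caching is actually engaged, something along these lines works; the server name, share path, and credentials file below are placeholders, and the stats file only exists if your kernel exposes FS-Cache statistics:
# Hypothetical /etc/fstab entries -- keep the one that matches your share type
//ai-server/models  /mnt/models  cifs  credentials=/etc/samba/models.cred,fsc,_netdev  0  0
ai-server:/models   /mnt/models  nfs   fsc,_netdev  0  0
# Verify the fsc option took effect and watch the cache counters grow
mount | grep fsc
cat /proc/fs/fscache/stats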
Pro Tip: Pre-Warming the Cache for Instant Results
While the cache populates naturally during use, you can pre-load specific models to eliminate the first-load penalty entirely. This is perfect for preparing a model before a demo or a critical training run.
Simply read the file through the mount point:
cat /mnt/models/llama2-7b.Q4.gguf > /dev/null
This command will pull the entire model file through the kernel's FS-Cache layer, populating the local disk cache. The next time your Python script opens that file, it will be read from the local SSD at full speed.
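To warm several models in one go, a small shell loop over the (hypothetical) mount point does the same for every file:
# Pre-warm every GGUF file under the mount point
for f in /mnt/models/*.gguf; do
    echo "Warming $f"
    cat "$f" > /dev/null
done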
Conclusion
Using FS-Cache transforms your workflow. It allows a small, fast local disk to act as a high-speed front-end to a vast, centralized model repository on a network server. This kind of caching isn't just for media libraries; it's a pragmatic and powerful solution for managing the growing size of AI artifacts, making it easier to scale your development environment without upgrading every machine's storage.