Running Vicuna-13B in Google Cloud
Vicuna-13B is an LLM chatbot based on the LLaMa model. Its authors claim it achieves 90% of the quality of ChatGPT in a “fun and non-scientific” evaluation.
You can rent some cloud hardware and experiment with Vicuna-13B yourself! Running CPU-only is slow (a couple of tokens per second), but fast enough for you to get an idea of what to expect.
Set up your Cloud Instance
Create a cloud VM with
- 150 GB of disk space
- 64 GB of CPU memory
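If you use the gcloud CLI, a VM like this can be provisioned in one command. A sketch, where the instance name vicuna-test and the zone us-central1-a are placeholders you'd substitute with your own:

```shell
# Sketch: create a VM with 16 vCPUs / 64 GB RAM and a 150 GB boot disk.
# Instance name and zone below are placeholders, not from the original post.
gcloud compute instances create vicuna-test \
    --machine-type=e2-standard-16 \
    --boot-disk-size=150GB \
    --zone=us-central1-a

# Stop the instance when you're not using it, so you only pay for the disk:
gcloud compute instances stop vicuna-test --zone=us-central1-a
```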
I used a Google Compute Engine e2-standard-16, which costs around $0.70/hour, so it may not be something you want to leave running. You can stop the instance when you’re not using it.
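To see why you don’t want to leave it running, a quick back-of-the-envelope on the hourly rate:

```shell
# Rough cost of leaving an e2-standard-16 running 24/7 for a 30-day month
# at ~$0.70/hour (rate from the text above; actual billing will differ).
awk 'BEGIN { printf "$%.0f/month\n", 0.70 * 24 * 30 }'
```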
When everything was done, my VM had 132GB of disk space used.
Ordinarily I wouldn’t recommend setting up python like this, but since we’re just experimenting:
apt-get install python3-pip
Acquire the LLaMa-13B model
For licensing reasons, Vicuna-13B is distributed as a delta on top of the LLaMa model, so the first step is to acquire the LLaMa weights. The official way is to request them from Meta by filling out this Google Docs form.
You can also use leaked weights from a torrent with the following magnet link:
<HASH>
Or, someone has made the leaked weights available on IPFS, which you can access through a helpful mirror:
I couldn’t figure out how to get a torrent client working on Google’s VMs (perhaps a firewall issue), so I ended up using aria2c to download the LLaMa weights from the IPFS mirror above.
apt-get install aria2
mkdir -p $HOME/llama/13B
cd $HOME/llama/13B
aria2c https://ipfs.io/ipfs/QmPCfCEERStStjg4kfj3cmCUu1TP7pVQbxdFMwnhpuJtxk/consolidated.00.pth
aria2c https://ipfs.io/ipfs/QmPCfCEERStStjg4kfj3cmCUu1TP7pVQbxdFMwnhpuJtxk/consolidated.01.pth
aria2c https://ipfs.io/ipfs/QmPCfCEERStStjg4kfj3cmCUu1TP7pVQbxdFMwnhpuJtxk/checklist.chk
aria2c https://ipfs.io/ipfs/QmPCfCEERStStjg4kfj3cmCUu1TP7pVQbxdFMwnhpuJtxk/params.json
aria2c https://ipfs.io/ipfs/Qmb9y5GCkTG7ZzbBWMu2BXwMkzyCKcUjtEKPpgdZ7GEFKm/tokenizer.model
The consolidated.*.pth files are the model weights.
checklist.chk has the md5 sums for the files, which you should check after they’re downloaded.
params.json seems to have some metadata.
tokenizer.model is needed to convert the weights to HuggingFace format.
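Verifying the checksums is a one-liner with md5sum, assuming checklist.chk is in the standard md5sum format (hash, whitespace, filename per line). For the real downloads you would run `md5sum -c checklist.chk` inside $HOME/llama/13B; the self-contained demo below fabricates a throwaway file and checklist just to show the mechanics:

```shell
# Real thing:  cd $HOME/llama/13B && md5sum -c checklist.chk
# Demo of the same check with a throwaway file in a temp directory:
cd "$(mktemp -d)"
echo "hello" > demo.bin
md5sum demo.bin > checklist.chk   # record the checksum
md5sum -c checklist.chk           # prints "demo.bin: OK" if the file is intact
```

Every line should report "OK"; a "FAILED" line means that file needs to be re-downloaded.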
Convert weights to HuggingFace Format
I used rev d2ffc3fc4 of the script.
apt-get install wget
wget https://raw.githubusercontent.com/huggingface/transformers/d2ffc3fc48430f629c38c36fa8f308b045d1f715/src/transformers/models/llama/convert_llama_weights_to_hf.py
pip install torch transformers accelerate sentencepiece protobuf==3.20
python3 convert_llama_weights_to_hf.py --input_dir $HOME/llama --output_dir $HOME/llama-hf --model_size 13B
These are the package versions that worked for me (note protobuf==3.20 in the pip install command).
I got an error about regenerating protobuf functions if I used protobuf > 3.20.
Apply the vicuna deltas
FastChat has done the work of getting a little chat interface set up. We’ll use their package to download the deltas and apply them as well.
pip install fschat
python3 -m fastchat.model.apply_delta \
    --base-model-path $HOME/llama-hf \
    --target-model-path $HOME/vicuna-13b \
    --delta-path lmsys/vicuna-13b-delta-v1.1
Finally, start the chat. This will open up a little ChatGPT-style interface in your terminal:
python3 -m fastchat.serve.cli --device cpu --model-path $HOME/vicuna-13b