This tutorial is a work in progress. For now it's more accurate to call it the notes I made while writing a bash script to do this, so caveat lector: there are probably mistakes. The script is also really slow, and I'm already planning to rewrite it in Rust or Go.
change log
v 1.2
added a second json path to the metadata
wrote the json to a temporary file
added debug mode
The specification for the .safetensors format can be found on Hugging Face. The first eight bytes of the file are an unsigned 64-bit little-endian integer that gives the length, in bytes, of the JSON-formatted header that follows.
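To make the layout concrete, here is a minimal sketch that builds a tiny stand-in file and reads the length prefix back. The sample header and the names sample_file, header_length and extracted_header are made up for illustration, and it assumes od, tail and head -c are available (head -c is common but not strictly POSIX). It combines the bytes one at a time, so it does not depend on the byte order of the machine it runs on.

```shell
#!/bin/bash
# Build a tiny stand-in for a .safetensors file (hypothetical sample data):
# an 8-byte little-endian length followed by a JSON header of that length.
sample_file="$(mktemp)"
header='{"__metadata__":{"format":"pt"}}'   # 32 bytes long
# 32 is octal 040, so the length field is the byte 040 followed by seven zero bytes.
printf '\040\000\000\000\000\000\000\000%s' "$header" > "$sample_file"

# Read the 8 length bytes as unsigned decimals and combine them,
# least significant byte first (little-endian).
header_length=0
multiplier=1
for byte in $(od -An -tu1 -N8 "$sample_file"); do
  header_length=$(( header_length + byte * multiplier ))
  multiplier=$(( multiplier * 256 ))
done
echo "header length: $header_length"

# Skip the 8 length bytes, then keep exactly header_length bytes.
extracted_header="$(tail -c +9 "$sample_file" | head -c "$header_length")"
echo "$extracted_header"

rm -f "$sample_file"
```

Shell arithmetic is signed 64-bit, so this would misbehave for a length of 2^63 or more, which should not matter for any real header.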
That JSON document* may contain a string at some path that itself contains another JSON document. In the wild I've found two paths that contain tag metadata, so I check both of them before giving up. To do that, it's easiest to write the JSON to a temporary file, which I create using mktemp.
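Files created with mktemp are not removed automatically, so it is worth pairing mktemp with a cleanup step. A minimal sketch of one way to do that with an EXIT trap, run inside a command substitution so the trap fires before the check; scratch_file and scratch_path are hypothetical names used only here:

```shell
#!/bin/bash
# The command substitution runs in a subshell, so its EXIT trap fires
# as soon as the substitution finishes.
scratch_path="$(
  scratch_file="$(mktemp)"
  # Delete the temporary file however the subshell ends.
  trap 'rm -f "$scratch_file"' EXIT
  echo '{"example": true}' > "$scratch_file"
  # Report the path so the caller can check that it is gone afterwards.
  echo "$scratch_file"
)"

if [ -e "$scratch_path" ]; then
  echo "temporary file still exists"
else
  echo "temporary file was cleaned up"
fi
```

In a real script the trap would sit right after the mktemp call at the top level, so the file is removed on every exit path, including the early exit 1 branches.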
#!/bin/bash
debug_mode=0
while getopts ":d" opt; do
  case $opt in
    d)
      debug_mode=1
      ;;
    \?)
      echo "Invalid option: -$OPTARG" >&2
      exit 1
      ;;
  esac
done
if [ "$debug_mode" -eq 1 ]; then
  echo "DEBUG MODE ON"
fi
# Remove options from positional parameters
shift $((OPTIND - 1))
# Validate parameters
if [ -z "$1" ]; then
  echo "Usage: $0 [-d] <path_to_safetensors_file>" >&2
  exit 1
fi
input_file="$1"
temp_file="$(mktemp)"
# Remove the temporary file when the script exits
trap 'rm -f "$temp_file"' EXIT
# Read the first 8 bytes to get the length of the JSON header (little-endian u64, so we have to reverse the byte order)
json_length_hex=$(dd if="$input_file" bs=1 count=8 2>/dev/null | xxd -p | tac -rs .. | tr -d '\n')
# Convert the hexadecimal representation into decimal
json_length=$(( 16#$json_length_hex ))
# Extract the JSON header using the length obtained and write it to the temporary file
dd if="$input_file" bs=1 skip=8 count="$json_length" 2>/dev/null > "$temp_file"
# The tag data is a JSON document stored inside a string in the top-level JSON document,
# so it goes through jq twice: once to pull out the raw string, once to parse it
if grep -q 'ss_tag_frequency' "$temp_file"; then
  jq -r '.__metadata__.ss_tag_frequency' "$temp_file" | jq .
elif grep -q 'tag_frequency' "$temp_file"; then
  jq -r '.__metadata__.ss_datasets' "$temp_file" | jq '.[0].tag_frequency'
else
  if [ "$debug_mode" -eq 1 ]; then
    cat "$temp_file"
    echo ""
  fi
  echo "No pattern found."
fi
*In the files I've checked while writing this tutorial. This is a work in progress, but I'll update it as I learn more. When I talk about the JSON, I'm only talking about what I've seen in the wild. I'm not working from any spec, and I suspect that files will actually vary greatly once I start looking at more of them.
**Maybe when I check a Lora that used a baseline dataset, I'll see two?
