How to extract training captions from .safetensors files on the command line using bash (Linux, macOS)

This tutorial is a work in progress. For now it is better described as the notes I made while writing a bash script to do this, so caveat lector: there are probably mistakes. The script is also really slow, and I'm already planning to rewrite it in Rust or Go.

change log

v 1.2

added a second json path to the metadata

wrote the json to a temporary file

added debug mode

The specification for the .safetensors format can be found on Hugging Face. The first eight bytes of the file are an unsigned 64-bit little-endian integer that gives the length, in bytes, of the JSON-formatted metadata that follows.
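
As a worked example of the byte order, here is a minimal sketch of decoding that length prefix in plain bash. The function name `le_hex_to_dec` is made up for illustration and is not part of the script in this tutorial. It reverses the bytes with a loop instead of `tac`, which is a GNU coreutils tool and may not be present on a stock macOS install.

```shell
#!/bin/bash

# Hypothetical helper (not part of the tutorial's script): takes 16 hex
# digits in file order (little-endian) and prints the value in decimal.
le_hex_to_dec() {
  local hex="$1" swapped="" i
  # Walk the string one byte (two hex digits) at a time, prepending each
  # byte so the last byte in the file becomes the most significant one.
  for (( i = 0; i < ${#hex}; i += 2 )); do
    swapped="${hex:i:2}${swapped}"
  done
  # The 16# prefix makes bash arithmetic read the digits as base 16.
  echo $(( 16#$swapped ))
}

# The bytes 78 56 34 12 00 00 00 00 encode 0x12345678.
le_hex_to_dec "7856341200000000"   # prints 305419896
```

To feed it the real prefix, something like `le_hex_to_dec "$(xxd -p -l 8 model.safetensors)"` should work, where `model.safetensors` is a placeholder file name.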

That JSON document* may contain, at some path, a string that itself holds another JSON document. In the wild I've found two paths that contain tag metadata, so I check both of them before giving up. To make that easier, I write the JSON to a temporary file, which I create using mktemp.
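
To make the "JSON inside a string" idea concrete, here is a small sketch using an invented header. The dataset name and tag counts are made up, and `decode_tags` is a hypothetical name. It uses jq's `fromjson` builtin to parse the inner string in a single jq call, as an alternative to piping through jq twice.

```shell
#!/bin/bash

# Invented header for illustration only; real files may look different.
# The value of ss_tag_frequency is a *string* that contains JSON.
sample_header='{"__metadata__":{"ss_tag_frequency":"{\"my_dataset\":{\"cat\":12,\"outdoors\":3}}"}}'

# Hypothetical helper: reads a header on stdin and prints the decoded tags.
decode_tags() {
  # fromjson parses a JSON-encoded string into a real JSON value.
  jq -c '.__metadata__.ss_tag_frequency | fromjson'
}

printf '%s\n' "$sample_header" | decode_tags
# prints {"my_dataset":{"cat":12,"outdoors":3}}
```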

#!/bin/bash

debug_mode=0

while getopts ":d" opt; do
  case $opt in
    d)
      debug_mode=1
      ;;
    \?)
      echo "Invalid option: -$OPTARG" >&2
      exit 1
      ;;
  esac
done

if [ $debug_mode -eq 1 ]; then
   echo "DEBUG MODE ON"
fi

# Remove options from positional parameters
shift $((OPTIND -1))

# Validate parameters
if [ -z "$1" ]; then
  echo "Usage: $0 [-d] <path_to_safetensors_file>" >&2
  exit 1
fi

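input_file="$1"
if [ ! -r "$input_file" ]; then
  echo "Cannot read file: $input_file" >&2
  exit 1
fi

# Scratch file for the JSON header; removed automatically when the script exits
temp_file="$(mktemp)"
trap 'rm -f "$temp_file"' EXIT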

# Read the first 8 bytes to get the length of the JSON header (little-endian u64, so we have to flip it)
json_length_hex=$(dd if="$input_file" bs=1 count=8 2>/dev/null | xxd -p | tac -rs .. | tr -d '\n')
# Convert the hexadecimal representation into decimal
json_length=$(( 16#$json_length_hex ))

# Extract the JSON header using the length obtained above. The tags are stored as a
# JSON string inside the top-level JSON document, so the output goes through jq twice.
dd if="$input_file" bs=1 skip=8 count="$json_length" 2>/dev/null > "$temp_file"
if grep -q 'ss_tag_frequency' "$temp_file"; then
    jq -r '.__metadata__.ss_tag_frequency' "$temp_file" | jq .
elif grep -q 'tag_frequency' "$temp_file"; then
    jq -r '.__metadata__.ss_datasets' "$temp_file" | jq '.[0].tag_frequency'
else
    if [ $debug_mode -eq 1 ]; then
        cat "$temp_file"
        echo ""
    fi
    echo "No pattern found."
fi
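
The script above reads the header with dd and bs=1, which copies one byte per read and write call and is probably a large part of why it is slow. Below is a hedged sketch of the same extraction using `head -c` and `tail -c` instead. The names `read_header` and `make_demo_file` are hypothetical, and the demo file is a fake that only mimics the layout (length prefix, JSON header, trailing bytes). I have not timed this against real model files.

```shell
#!/bin/bash

# Hypothetical helper: print only the JSON header of a .safetensors-style file.
read_header() {
  local file="$1" len_hex swapped="" i len
  # First 8 bytes as hex digits, e.g. "2000000000000000"
  len_hex=$(head -c 8 "$file" | od -An -tx1 | tr -d ' \n')
  # Reverse the byte order (little-endian to most-significant-first)
  for (( i = 0; i < ${#len_hex}; i += 2 )); do
    swapped="${len_hex:i:2}${swapped}"
  done
  len=$(( 16#$swapped ))
  # tail -c +9 starts at byte 9 (just past the length prefix);
  # head -c stops after exactly $len bytes of header.
  tail -c +9 "$file" | head -c "$len"
}

# Hypothetical helper: write a tiny fake file to try the function on.
# The single-byte length trick only works for headers under 256 bytes.
make_demo_file() {
  local out="$1" header='{"__metadata__":{"note":"demo"}}'
  {
    printf "$(printf '\\%03o' "${#header}")\0\0\0\0\0\0\0"  # length as LE u64
    printf '%s' "$header"
    printf 'FAKE-TENSOR-BYTES'                               # stand-in for tensor data
  } > "$out"
}

demo_file="$(mktemp)"
make_demo_file "$demo_file"
read_header "$demo_file"   # prints {"__metadata__":{"note":"demo"}}
echo ""
rm -f "$demo_file"
```

The output of `read_header` can be piped into the same jq filters used in the script, so no temporary file is strictly needed with this approach.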

*In the files I've checked while writing this tutorial. This is a work in progress, but I'll update it as I learn more. When I talk about the JSON, I'm only talking about what I've seen in the wild. I'm not working from any spec, and I suspect that files will actually vary greatly once I start looking at more of them.

**Maybe when I check a LoRA that used a baseline dataset, I'll see two?