Metadata Backfill

Backfill missing labels, years, and albums across a Rekordbox collection using cached enrichment data and file metadata.

Agent prompt

Paste into your agent to start:

Find tracks missing labels and years in my collection and fill them from enrichment data.

Constraints

backfill_labels is enrichment-dependent. The tool pulls labels from Discogs, MusicBrainz, Bandcamp, and Beatport enrichment caches — it does not check file tags or the web. Tracks without cached enrichment are skipped (unless auto_enrich=true). For tracks still unlabeled after the tool runs, the agent researches labels manually (step 1c).
Years use a priority cascade. backfill_years tries multiple sources in order: file tags → folder path (YYYY) → Discogs enrichment → Beatport enrichment → MusicBrainz enrichment → Bandcamp enrichment. For tracks still missing years after the tool runs, the agent researches release dates (step 2c).
No auto-tagging without review (labels). The backfill tools stage non-conflicting values automatically. Label conflicts require human approval before staging. Year conflicts are resolved automatically using the lowest year (see “Lowest year wins”).
XML export only. All changes flow through update_tracks → preview_changes → write_xml.
Self-released tracks are skipped (labels only). Discogs “Not On Label” entries are filtered out — they carry no useful signal.
Year 0 = unset. Rekordbox uses 0 for tracks with no year. Rekordbox often fails to import year tags from WAV files, so year=0 does not mean the file lacks a year tag. Check the file’s own tags before treating the year as missing.
Lowest year wins. Year is used as a signal for genre classification — it represents the track’s production era, not its release or reissue date. When a conflict arises between current year and enrichment year, always use the lowest (earliest) year unless it is obviously wrong (e.g., a nonsensical date like 1900 for a modern electronic track). Do not present year conflicts for interactive review — resolve them automatically. Web research for year gaps should search for release dates (not production dates specifically), since release date is the best publicly available proxy.
Year gaps are always researched. Tracks left with year=0 after all automated sources are exhausted must be researched by the agent via web search, store lookups, and model knowledge. The tools handle the bulk; the agent fills every remaining gap it can.
Never ask whether to research — just do it. After a backfill tool reports remaining gaps, immediately begin researching them. Do not summarize and wait for instructions, do not ask “should I also research?”, and do not present research as a separate decision. The automated backfill is the easy warm-up; researching the remaining gaps is the core work. Asking whether to proceed is asking whether to finish the job — the answer is always yes.
Pause only where marked. This workflow has exactly two pause points: conflict-resolution prompts (where user input is needed) and the final export confirmation. Everything else — including the transition from backfill to research — is automatic. Do not stop, summarize, or ask for direction at unmarked transitions.
Albums are best-effort. backfill_albums fills empty albums from file tags, folder paths, and enrichment caches. Unlike labels and years, album gaps do not require exhaustive research — fill what the tools and a light web search can find. Empty album is valid for singles, loose tracks, and DJ edits. Album names where the value equals the track title or artist name are noise and are skipped automatically.

Prerequisites

Check enrichment cache coverage across all providers for unlabeled tracks:

cache_coverage(has_label=false)

Review the coverage section for each provider’s searched_percent and has_result counts. Hydrate providers with low coverage before backfilling — the more cached enrichment data available, the more labels and years the backfill tools can fill.

Hydration priority

Discogs (best for label data, genre/style):

enrich_tracks(has_label=false, skip_cached=true, providers=["discogs"], max_tracks=50)

Bandcamp (strongest for underground/independent electronic music — high hit rate for labels and years):

enrich_tracks(has_label=false, skip_cached=true, providers=["bandcamp"], max_tracks=50)

Beatport (label data extracted from search results since v0.15; older cache entries lack labels):

enrich_tracks(has_label=false, skip_cached=true, providers=["beatport"], max_tracks=50)

If Beatport searched_percent is 100% but labels are still missing, existing cache entries predate label extraction. Re-enrich with force_refresh=true to populate them.

MusicBrainz is hydrated automatically by lookup_musicbrainz during year research (step 2c) or via auto_enrich.

Repeat batches until providers reach 100% searched. Report progress between batches.
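The batch loop above can be sketched as follows. This is only an illustration: hydrate_until_searched, the two callables, and the max_batches safety cap are hypothetical names standing in for the enrich_tracks and cache_coverage tool calls, and the fake functions simulate a provider rather than calling one.

```python
from typing import Callable


def hydrate_until_searched(
    provider: str,
    enrich_batch: Callable[[str], None],
    searched_percent: Callable[[str], float],
    max_batches: int = 200,
) -> int:
    """Run enrichment batches until the provider reports 100% searched.

    Returns the number of batches run. max_batches is a safety cap
    (an assumption of this sketch) so a stalled provider cannot loop forever.
    """
    batches = 0
    while searched_percent(provider) < 100.0 and batches < max_batches:
        enrich_batch(provider)
        batches += 1
        # Report progress between batches, as the workflow asks.
        print(f"{provider}: batch {batches} done, "
              f"{searched_percent(provider):.0f}% searched")
    return batches


# Stand-ins for the real tools: 120 uncached tracks, 50 per batch.
state = {"searched": 0, "total": 120}


def fake_enrich(provider: str) -> None:
    state["searched"] = min(state["total"], state["searched"] + 50)


def fake_percent(provider: str) -> float:
    return 100.0 * state["searched"] / state["total"]


batches_run = hydrate_until_searched("discogs", fake_enrich, fake_percent)
```

With 120 simulated tracks and 50 per batch, the loop stops after the third batch.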

Alternative: skip manual hydration and use auto_enrich=true on the backfill tools (see steps 1–3). This fetches Bandcamp data for uncached tracks before backfilling (backfill_years also fetches MusicBrainz data). It is slower for large collections but requires no manual batching.

Steps

1. Fill labels

1a. Backfill from enrichment

backfill_labels()

Or, to automatically fetch Bandcamp data for uncached tracks:

backfill_labels(auto_enrich=true)

The tool scans the entire collection (excluding samples), looks up each track’s enrichment across all four providers (Discogs, MusicBrainz, Bandcamp, Beatport), and:

Fills empty labels from enrichment (auto-staged)
Skips tracks where current label matches enrichment
Reports conflicts where current label differs from enrichment (not staged)
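The three outcomes above, plus the “Not On Label” filter from the constraints, can be sketched as a small decision function. This is a hypothetical illustration, not the tool's code: label_action and its return values are invented names, and the case-insensitive comparison and prefix match on "Not On Label" are assumptions.

```python
from typing import Optional


def label_action(current: str, enrichment: Optional[str]) -> str:
    """Classify one track as 'stage', 'skip', 'conflict', or 'no_enrichment'."""
    candidate = (enrichment or "").strip()
    # Discogs marks self-released entries as "Not On Label"; treat as no data.
    # Prefix matching is an assumption made for this sketch.
    if not candidate or candidate.casefold().startswith("not on label"):
        return "no_enrichment"
    existing = current.strip()
    if not existing:
        return "stage"      # empty label: fill from enrichment (auto-staged)
    if existing.casefold() == candidate.casefold():
        return "skip"       # current label already matches enrichment
    return "conflict"       # labels differ: report, do not stage
```

For example, label_action("", "Mule Musiq") stages the label, while label_action("Giegling", "Delsin") is reported as a conflict for step 1b.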

The summary includes no_enrichment_by_provider showing which providers are missing data for the remaining unlabeled tracks. If no_bandcamp is high, hydrate Bandcamp and re-run.

Report the summary: “Backfill complete: N labels staged, N already correct, N conflicts, N without enrichment.” Then immediately proceed to conflict resolution.

1b. Resolve conflicts

Deduplicate conflicts before presenting — a track must appear in exactly one group.

Group conflicts by pattern and present each group as a table with columns: #, Artist — Title, Current, Enrichment, Rec (recommendation). Follow each table with a clear bulk-action prompt.

Group A: Artist-name-as-label → actual label

When the current label is the artist’s name and enrichment has an actual label, recommend the enrichment label — it provides a stronger signal than the artist name.

Group A: Artist-name-as-label → actual label (N tracks) — recommend: use enrichment

| #  | Artist — Title              | Current       | Enrichment | Rec            |
|----|-----------------------------|---------------|------------|----------------|
| 1  | Fantastic Man — Antiboudi   | Fantastic Man | Mule Musiq | use enrichment |
| 2  | Fantastic Man — Diaspora    | Fantastic Man | Mule Musiq | use enrichment |

→ “Approve all Group A, reject all, or specify numbers to change (e.g. ‘approve all except 2’).”

Group B: Label name variations

When both labels refer to the same entity with different formatting (e.g. “Palette” vs “Palette Recordings”, “трип” vs “trip recordings”), recommend keeping the current label — it was set intentionally.

Group B: Label name variations (N tracks) — recommend: keep current

| #  | Artist — Title             | Current | Enrichment         | Rec          |
|----|----------------------------|---------|--------------------|--------------|
| 3  | Dauwd — Kindlinn           | Palette | Palette Recordings | keep current |
| 4  | Lone — Abraxas             | трип    | trip recordings    | keep current |

→ “Skip all Group B (keep current labels), or specify numbers to change.”

Group C: Wrong enrichment

When enrichment returned nonsense — artist names as labels, gibberish, or clearly incorrect data — recommend keeping the current label.

Group C: Wrong enrichment (N tracks) — recommend: keep current

| #  | Artist — Title                    | Current            | Enrichment    | Rec          |
|----|-----------------------------------|--------------------|---------------|--------------|
| 5  | Beat Movement — X                 | WarinD Records     | BEAT MOVEMENT | keep current |
| 6  | Chaka Demus and Pliers — Murder   | Soul Jazz Records  | Alex Di Ciò   | keep current |

→ “Skip all Group C, or specify numbers to change.”

Group D: Genuine disagreements

When the enrichment label is a different entity entirely (e.g. different pressing, compilation, or wrong match), present for individual review with no default recommendation.

Group D: Genuine disagreements (N tracks) — review individually

| #  | Artist — Title                 | Current          | Enrichment         | Rec    |
|----|--------------------------------|------------------|--------------------|--------|
| 7  | Rick Wade — Authentideep       | Harmonie Park    | Unknown Season     | review |
| 8  | Vril — Haus (Rework)           | Giegling         | Delsin             | review |

→ “For each, reply ‘keep’ or ‘use enrichment’, or skip. You can also bulk-skip: ‘skip all Group D’.”
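A first-pass sort of conflicts into Groups A to D, with deduplication so each track lands in exactly one group, might look like the sketch below. Everything here is an assumption for illustration: the track_id, artist, current, and enrichment field names, the suffix list, and the string heuristics. Cases such as transliterated names (“трип” vs “trip recordings”) or gibberish enrichment still need the agent's judgment and would fall through to Group D here.

```python
import re
from typing import Dict, List


def norm(text: str) -> str:
    """Case-fold and collapse whitespace for loose comparison."""
    return " ".join(text.casefold().split())


def strip_suffix(label: str) -> str:
    # Illustrative suffix list only; real label names vary far more.
    return re.sub(r"\s+(recordings|records)$", "", norm(label))


def classify(conflict: dict) -> str:
    artist = norm(conflict["artist"])
    current = norm(conflict["current"])
    enrichment = norm(conflict["enrichment"])
    if current == artist:
        return "A"   # artist name used as label: recommend enrichment
    if strip_suffix(current) == strip_suffix(enrichment):
        return "B"   # same label, different formatting: keep current
    if enrichment == artist:
        return "C"   # enrichment returned the artist name: keep current
    return "D"       # different entity: review individually


def group_conflicts(conflicts: List[dict]) -> Dict[str, List[dict]]:
    groups: Dict[str, List[dict]] = {"A": [], "B": [], "C": [], "D": []}
    seen = set()
    for conflict in conflicts:
        if conflict["track_id"] in seen:
            continue  # deduplicate: one group per track
        seen.add(conflict["track_id"])
        groups[classify(conflict)].append(conflict)
    return groups


conflicts = [
    {"track_id": 1, "artist": "Fantastic Man", "current": "Fantastic Man", "enrichment": "Mule Musiq"},
    {"track_id": 3, "artist": "Dauwd", "current": "Palette", "enrichment": "Palette Recordings"},
    {"track_id": 5, "artist": "Beat Movement", "current": "WarinD Records", "enrichment": "BEAT MOVEMENT"},
    {"track_id": 7, "artist": "Rick Wade", "current": "Harmonie Park", "enrichment": "Unknown Season"},
    {"track_id": 7, "artist": "Rick Wade", "current": "Harmonie Park", "enrichment": "Unknown Season"},
]
groups = group_conflicts(conflicts)
```

The checks run in a fixed order so that a conflict matching more than one pattern is still assigned to a single group.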

Stage approved changes via update_tracks. Then immediately proceed to research remaining gaps — do not ask.

1c. Research remaining gaps

After steps 1a–1b, tracks may still lack labels. The backfill_labels output includes a research_queue section listing the count and top artists by frequency. write_xml will refuse to export until this step is done — it checks whether backfill_labels reported unlabeled tracks and blocks export unless skip_label_gate=true is passed.

Start by fetching the first batch of unlabeled tracks:

search_tracks(has_label=false, limit=50)

Then for each batch:

Group by artist for efficient lookup — tracks from the same artist often share a label.
Prioritize by artist frequency — artists with many unlabeled tracks first, for maximum coverage per lookup.
Research labels using web search, store lookups (lookup_beatport, lookup_discogs, lookup_bandcamp), label catalogs, and model knowledge.
Present findings grouped by confidence:
- High confidence (exact release found on store/catalog): present for bulk approval.
- Uncertain (multiple candidates or ambiguous match): present individually with source context.
Stage approved labels via update_tracks.
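The grouping and prioritization in the list above can be sketched as follows. The function name and the artist/title fields are hypothetical; the real search_tracks output may be shaped differently.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def research_order(tracks: List[dict]) -> List[Tuple[str, List[dict]]]:
    """Group tracks by artist, most frequent artist first (ties by name)."""
    by_artist: Dict[str, List[dict]] = defaultdict(list)
    for track in tracks:
        by_artist[track["artist"]].append(track)
    return sorted(by_artist.items(), key=lambda item: (-len(item[1]), item[0]))


# Placeholder batch; artist and title values are made up for illustration.
batch = [
    {"artist": "Artist A", "title": "Track 1"},
    {"artist": "Artist B", "title": "Track 2"},
    {"artist": "Artist B", "title": "Track 3"},
]
order = research_order(batch)
```

Here "Artist B" comes first because one lookup may resolve two tracks at once.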

Skip tracks where no label can be determined — self-released tracks, private edits, or truly obscure releases may have no public label.

2. Fill years

2a. Backfill from sources

backfill_years()

Or, to automatically fetch Bandcamp and MusicBrainz data for uncached year-zero tracks:

backfill_years(auto_enrich=true)

The tool scans the entire collection (excluding samples) and tries six sources in priority order for each track with year=0:

File tags — reads year from the audio file’s metadata (ID3v2, Vorbis Comment, RIFF INFO). Most reliable since the user tagged it.
Folder path — extracts year from a (YYYY) suffix in the parent directory name (e.g. Album (2019)/).
Discogs enrichment — falls back to the cached Discogs release year.
Beatport enrichment — falls back to the cached Beatport publish_date/release_date.
MusicBrainz enrichment — falls back to the cached MusicBrainz first-release-date.
Bandcamp enrichment — falls back to the cached Bandcamp release_date. Particularly effective for underground/independent electronic music.

The first source to produce a valid year (1900–2099) wins. Non-conflicting fills are auto-staged. For tracks that already have a non-zero year, the tool checks Discogs enrichment only — if the Discogs year differs from the current year, the track is reported as a conflict. Other providers (Beatport, MusicBrainz, Bandcamp) are not compared against existing years.
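The "first valid source wins" rule can be sketched like this. It is a minimal illustration, not the tool's implementation: first_valid_year, the source names, and the track dictionary with pre-fetched candidate years are all hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

VALID_YEARS = range(1900, 2100)  # valid range stated above: 1900 to 2099

# Each source is a (name, lookup) pair; lookup returns a year or None.
YearSource = Tuple[str, Callable[[dict], Optional[int]]]


def first_valid_year(
    track: dict, sources: Sequence[YearSource]
) -> Optional[Tuple[str, int]]:
    """Return (source_name, year) from the first source with a valid year."""
    for name, lookup in sources:
        year = lookup(track)
        if year is not None and year in VALID_YEARS:
            return name, year
    return None


# Sources in the documented priority order.
sources = [
    ("file_tag", lambda t: t.get("file_tag")),
    ("folder_path", lambda t: t.get("folder_path")),
    ("discogs", lambda t: t.get("discogs")),
    ("beatport", lambda t: t.get("beatport")),
    ("musicbrainz", lambda t: t.get("musicbrainz")),
    ("bandcamp", lambda t: t.get("bandcamp")),
]

# Hypothetical track: unusable file tag, no folder year, two cached years.
track = {"file_tag": 0, "folder_path": None, "discogs": 1997, "beatport": 2012}
result = first_valid_year(track, sources)
```

In this example the file tag (0) and folder path (None) are rejected, so the Discogs year is used even though a Beatport year is also cached.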

The summary includes remaining_uncached_providers showing which providers are missing data for remaining year-zero tracks. If providers show gaps, hydrate them and re-run (or use auto_enrich=true).

Note: Beatport date extraction requires enrichment cache entries created after this feature was added. For tracks with older cache entries, re-enrich with enrich_tracks(providers=["beatport"], force_refresh=true) to populate the release_date field.

Report the summary: “Year backfill complete: N years staged (N file tags, N folder paths, N Discogs, N Beatport, N MusicBrainz, N Bandcamp), N already correct, N conflicts, N without any source.” Then immediately proceed to resolve conflicts.

2b. Resolve conflicts

Year conflicts are resolved automatically — do not present them for interactive review.

For each conflict, stage the lowest (earliest) year from either the current value or the Discogs enrichment value via update_tracks. The earliest year best approximates the track’s production era, which is the signal used for genre classification.

Exception — obviously wrong years: If the lowest year is clearly nonsensical (e.g., 1900 for a modern electronic track, or a year that predates the genre by decades), flag it to the user instead of auto-staging. This should be rare.
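The resolution rule and its exception can be sketched as below. The function name, the returned action strings, and especially the min_plausible cutoff are assumptions for illustration; the workflow leaves "obviously wrong" to the agent's judgment rather than a fixed threshold.

```python
from typing import Optional, Tuple


def resolve_year_conflict(
    current: int, discogs: int, min_plausible: int = 1950
) -> Tuple[Optional[int], str]:
    """Pick the earliest year, or flag it when it looks nonsensical.

    min_plausible is a hypothetical stand-in for "obviously wrong";
    a real check would consider the genre and artist.
    """
    earliest = min(current, discogs)
    if earliest < min_plausible:
        return None, "flag_for_review"    # do not auto-stage
    if earliest == current:
        return current, "already_earliest"
    return earliest, "stage"              # stage via update_tracks
```

For example, resolve_year_conflict(2015, 1998) stages 1998, while resolve_year_conflict(2015, 1900) is flagged instead of staged.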

Report the summary: “Year conflicts resolved: N auto-staged (used earliest year), N already had the earliest year, N flagged for review.” Then immediately proceed to research remaining gaps — do not ask.

2c. Research remaining gaps

After backfill_years, some tracks will still have year=0 with no source data. The tool reports these as remaining_year_zero. The agent must research every one of them — this is not optional. Do not skip this step or proceed to export until all year-zero tracks have been researched.

If auto_enrich=true was used, Bandcamp and MusicBrainz have already been fetched for these tracks. Skip to the web research sub-step below.

Batch enrich remaining year-zero tracks

Use enrich_tracks with the year_zero filter to batch-enrich remaining tracks via Bandcamp:

enrich_tracks(year_zero=true, skip_cached=true, providers=["bandcamp"], max_tracks=50)

Repeat until all year-zero tracks are covered.

Re-run backfill

After enrichment, re-run the backfill to incorporate newly cached data:

backfill_years()

This picks up years from the fresh Bandcamp cache entries, along with any MusicBrainz entries cached by lookup_musicbrainz or auto_enrich.

Web research for remaining gaps

For tracks still without years after all enrichment sources are exhausted:

Group by artist for efficient lookup.
Search for release dates — use web search, Discogs links, store lookups (lookup_beatport, lookup_discogs, lookup_bandcamp), label catalogs, and model knowledge to find the original release year.
Present findings grouped by confidence:
- High confidence (exact release found): present for bulk approval.
- Uncertain (multiple candidates or approximate): present individually with context.
Stage approved years via update_tracks.

Only mark a track as unresolvable after genuine research effort — anonymous private edits or truly untraceable tracks exist, but they are the exception. Exhaust web search, store lookups, and artist discographies before giving up on any track.

3. Backfill albums

backfill_albums()

Or, to automatically fetch Bandcamp data for uncached tracks:

backfill_albums(auto_enrich=true)

The tool scans tracks with empty album and tries four sources in order:

File tags — reads album from the audio file’s metadata.
Folder path — extracts album name from the parent directory if it has a (YYYY) suffix (e.g. Album Name (2019)/). Edition qualifiers like (Deluxe Edition) and (Original Motion Picture Soundtrack) are stripped automatically.
Bandcamp enrichment — uses the album field from cached Bandcamp data.
Discogs enrichment — uses the title field (release name) from cached Discogs data.
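The folder-path source can be sketched with a regular expression. This is an assumption-laden illustration: the patterns, the function name, and the two-entry qualifier list (taken from the examples above) are hypothetical, and the tool's actual matching rules are not documented here.

```python
import re
from pathlib import PurePosixPath
from typing import Optional, Tuple

# Parent directory must end in a (YYYY) suffix, e.g. "Album Name (2019)".
YEAR_SUFFIX = re.compile(r"^(?P<album>.+?)\s*\((?P<year>(?:19|20)\d{2})\)$")
# Illustrative qualifier list only, based on the two examples in the text.
QUALIFIER = re.compile(
    r"\s*\((?:Deluxe Edition|Original Motion Picture Soundtrack)\)",
    re.IGNORECASE,
)


def album_from_folder(path: str) -> Optional[Tuple[str, int]]:
    """Return (album, year) parsed from the parent directory, or None."""
    parent = PurePosixPath(path).parent.name
    match = YEAR_SUFFIX.match(parent)
    if match is None:
        return None
    album = QUALIFIER.sub("", match.group("album")).strip()
    if not album:
        return None
    return album, int(match.group("year"))
```

For example, album_from_folder("/music/Album Name (2019)/01 Track.wav") yields ("Album Name", 2019), and a folder without a (YYYY) suffix yields None.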

Noise is filtered automatically: albums that match the track title or artist name are skipped.
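A minimal sketch of that noise check, assuming a case-insensitive, whitespace-normalized comparison (the tool's exact matching rules are not stated, and is_noise_album is a hypothetical name):

```python
def is_noise_album(album: str, title: str, artist: str) -> bool:
    """True when the album candidate is empty or just repeats title/artist."""

    def norm(text: str) -> str:
        return " ".join(text.casefold().split())

    candidate = norm(album)
    return candidate == "" or candidate in (norm(title), norm(artist))
```

For example, an album candidate of "Abraxas" for a track titled "Abraxas" would be skipped, while an unrelated album name would be kept.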

No conflict resolution is needed — the tool only fills empty albums.

Report the summary: “Album backfill complete: N albums staged (N file tags, N folder paths, N Bandcamp, N Discogs), N already set, N without any source.”

The agent may optionally web-research remaining gaps, but this is not required. Empty album is valid for singles, loose tracks, and DJ edits.

4. Export

Pre-export gate (enforced by tool). write_xml checks whether backfill_labels reported unlabeled tracks. If so, it returns an error unless skip_label_gate=true is passed.

Before calling write_xml(skip_label_gate=true), confirm:

Step 1c (label research): all unlabeled tracks have been researched. Remaining gaps are genuinely unresolvable (private edits, anonymous tracks), not just unattempted.
Step 2c (year research): all year-zero tracks have been researched. Same standard.

If either step was skipped, go back and complete it before proceeding.
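The confirmation above can be expressed as a simple pre-flight check. This is a hypothetical helper for the agent's own bookkeeping, not part of write_xml; the function name and the two flags are assumptions.

```python
from typing import List


def export_blockers(labels_researched: bool, years_researched: bool) -> List[str]:
    """List the research steps that must be completed before export."""
    blockers = []
    if not labels_researched:
        blockers.append("step 1c: label research not completed")
    if not years_researched:
        blockers.append("step 2c: year research not completed")
    return blockers
```

An empty list means both research steps are done and the export can move on to the preview; any entry means going back to that step first.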

preview_changes()

Ask user: “Export these changes to XML?”

write_xml(skip_label_gate=true)

Report output path, then walk the user through the Rekordbox import:

Add XML to Rekordbox — Open Preferences → Advanced → rekordbox xml → Imported Library → Browse → select the exported XML file.
Open the XML view — In the sidebar, click the “Display rekordbox xml” icon. The imported tracks appear under “All Tracks”.
Import into collection — Select all tracks (Cmd+A), right-click → Import To Collection. When prompted “Do you want to load information in the tag of the library being imported?”, click Yes (tick “Don’t ask me again” for bulk imports).