Why two catalogues?
ESA's Gaia mission (launched 2013, Data Release 3 in 2022) surveyed roughly 1.8 billion sources with sub-milliarcsecond astrometry — an unprecedented map of the Milky Way. But Gaia's CCDs saturate on the brightest stars. For sources brighter than about G ~ 3–6, the astrometric solutions are often incomplete, degraded, or missing entirely. These are exactly the stars that people recognise when they look up: Sirius, Vega, Betelgeuse, the Southern Cross.
Hipparcos (ESA, 1989–1993; revised by van Leeuwen 2007) was purpose-built for bright-star astrometry. Its catalogue of around 118,000 stars covers the naked-eye sky reliably, including many objects where Gaia struggles. The trade-off is that Hipparcos is shallow — it has nothing useful to say about the vast majority of the galaxy.
For a complete 3D star map you need both: Gaia for depth and completeness, Hipparcos for the bright-star anchors. But roughly 100,000 stars appear in both catalogues, and the pipeline must produce exactly one canonical row per physical star.
Why deduplication is hard
No shared identifier
Gaia and Hipparcos use independent numbering systems. There is no column in either catalogue that says "this is the same star." The only link is ESA's hipparcos2_best_neighbour crossmatch table, a positional match published as part of Gaia DR3. The pipeline downloads it via a TAP query against the Gaia Archive and converts it into a local lookup that maps Gaia source IDs to Hipparcos numbers, along with an angular_distance and a number_of_neighbours count.
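As a rough sketch of that lookup, the snippet below builds a Gaia-to-Hipparcos map from rows that have already been downloaded. The ADQL string assumes the table is published as gaiadr3.hipparcos2_best_neighbour and that the Hipparcos number sits in a column named original_ext_source_id; both should be checked against the Gaia Archive schema. CrossmatchEntry and build_lookup are hypothetical names, and the sample rows are made up.

```python
from typing import Dict, Iterable, Mapping, NamedTuple

# ADQL for the TAP query. Table and column names are assumptions to verify
# against the Gaia Archive schema before use.
CROSSMATCH_ADQL = """
SELECT source_id, original_ext_source_id, angular_distance, number_of_neighbours
FROM gaiadr3.hipparcos2_best_neighbour
"""


class CrossmatchEntry(NamedTuple):
    hip: int                   # Hipparcos number
    angular_distance: float    # separation reported by the crossmatch
    number_of_neighbours: int  # candidate count reported by the crossmatch


def build_lookup(rows: Iterable[Mapping]) -> Dict[int, CrossmatchEntry]:
    """Map Gaia source_id -> crossmatch entry from already-downloaded rows."""
    lookup = {}
    for row in rows:
        lookup[int(row["source_id"])] = CrossmatchEntry(
            hip=int(row["original_ext_source_id"]),
            angular_distance=float(row["angular_distance"]),
            number_of_neighbours=int(row["number_of_neighbours"]),
        )
    return lookup


# Illustrative rows with made-up values, not real catalogue data.
rows = [
    {"source_id": 1001, "original_ext_source_id": 11,
     "angular_distance": 0.02, "number_of_neighbours": 1},
    {"source_id": 1002, "original_ext_source_id": 12,
     "angular_distance": 0.15, "number_of_neighbours": 2},
]
lookup = build_lookup(rows)
print(lookup[1002].hip)  # → 12
```

Keying the dictionary by Gaia source ID suits a merger that streams Gaia rows and asks, for each one, whether a Hipparcos partner exists.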
Scale
The full Gaia catalogue has around 1.5 billion rows. You cannot load two copies to compare them — the merger must stream Gaia batch-by-batch while holding only the small Hipparcos table (~118K rows) and the crossmatch table in memory. This constraint shapes the entire architecture.
Binaries and multiples
Hipparcos often saw a binary system as a single point of light, while Gaia resolves the individual components. The crossmatch may link one Hipparcos entry to multiple Gaia sources. Meanwhile, Hipparcos solution type codes distinguish standard five-parameter single-star fits from more complex acceleration solutions, orbital solutions, and component solutions — the latter being unreliable for position and parallax. These problems concentrate exactly on the brightest, most recognisable stars.
Quality scoring and winner selection
When a Gaia row and a Hipparcos row are linked by the crossmatch and no manual override applies, the merger runs a decision tree to pick a winner.
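If number_of_neighbours > 1 in the crossmatch entry, the Hipparcos match is ambiguous: Gaia has resolved what Hipparcos saw as one source into multiple candidates. Gaia wins automatically.
Otherwise the two quality scores are compared, and the margin Hipparcos must clear depends on apparent magnitude:
| Apparent magnitude | Margin required for Hipparcos to win |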
|---|---|
| G < 3.5 (very bright) | Hipparcos must be strictly better (1.0×) |
| 3.5 ≤ G < 6 (bright) | Hipparcos score must be below 60% of Gaia's |
| G ≥ 6 (normal) | Hipparcos score must be below 50% of Gaia's |
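The decision tree can be sketched in a few lines. The thresholds come from the table above; the function names required_margin and pick_winner, and the reading of "strictly better" as a strict less-than comparison, are assumptions for illustration and not the pipeline's actual code.

```python
def required_margin(g_mag: float) -> float:
    """Factor applied to Gaia's score; Hipparcos must score below the product."""
    if g_mag < 3.5:
        return 1.0  # very bright: Hipparcos only needs to be strictly better
    if g_mag < 6.0:
        return 0.6  # bright
    return 0.5      # normal


def pick_winner(gaia_score: float, hip_score: float,
                g_mag: float, number_of_neighbours: int) -> str:
    """Return "gaia" or "hip" for one crossmatched pair (lower score is better)."""
    if number_of_neighbours > 1:
        return "gaia"  # ambiguous crossmatch: Gaia wins automatically
    if hip_score < required_margin(g_mag) * gaia_score:
        return "hip"
    return "gaia"


print(pick_winner(0.10, 0.09, g_mag=2.0, number_of_neighbours=1))  # → hip
print(pick_winner(0.10, 0.09, g_mag=5.0, number_of_neighbours=1))  # → gaia
```

The same pair of scores flips from Hipparcos to Gaia once the star is faint enough that Gaia's detectors are no longer near saturation, which is the point of the magnitude-dependent margin.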
The quality metric is fractional parallax error, σ/π, computed identically for both catalogues: parallax_error / parallax for Hipparcos, and parallax_error / max(parallax, ε) for Gaia DR3 (or the equivalent Bailer-Jones interval width). Lower is better, and the values are directly comparable across catalogues.
For Gaia rows that fall back from the primary parallax measurement through successive distance tiers, sentinel quality values (10, 20, 50) are used instead of real fractional errors. These sentinels sort correctly (a photometric distance estimate always loses to a reliable parallax) without requiring special-case logic in the winner selection.
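A minimal sketch of the Gaia scoring rule follows. Only the sentinel values 10, 20 and 50 and the max(parallax, ε) guard come from the text; the tier names, the value chosen for ε, and the function name astrometry_quality are placeholders, since the text does not say which tier maps to which sentinel.

```python
EPSILON = 1e-6  # assumed floor; the text only calls it ε

# Hypothetical tier names. Only the values 10, 20, 50 come from the text.
TIER_SENTINELS = {"fallback_1": 10.0, "fallback_2": 20.0, "fallback_3": 50.0}


def astrometry_quality(parallax, parallax_error, tier="parallax"):
    """Fractional parallax error, or a sentinel for fallback distance tiers."""
    if tier != "parallax":
        return TIER_SENTINELS[tier]
    # The floor keeps zero or negative parallaxes from dividing by zero or
    # producing a negative score; they get a very large score instead.
    return parallax_error / max(parallax, EPSILON)


good = astrometry_quality(10.0, 0.5)               # 5% fractional error
fallback = astrometry_quality(None, None, "fallback_1")
print(good < fallback)  # → True
```

Because every sentinel is far larger than any usable fractional error, an ordinary numeric comparison already ranks a fallback tier below a real parallax.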
Manual overrides
Some stars cannot be handled by automated scoring, and automated scoring can be wrong. The pipeline includes a YAML-driven override system that takes precedence over everything else. Each override targets a star by its catalogue identity and can do one of three things: add a star that appears in neither catalogue, replace the payload for a star that does exist, or drop a star from the output entirely.
When an override targets one member of a matched Gaia–Hipparcos pair, the pipeline automatically resolves the whole pair via the crossmatch table — the partner is suppressed without the override author needing to supply its ID.
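The pair resolution could look something like this. The override field names (action, source, source_id) and the function suppressed_keys are assumptions, since the YAML schema is not shown here; the overrides are given as already-parsed dictionaries, and the IDs are made up.

```python
def suppressed_keys(overrides, gaia_to_hip):
    """Return every (source, source_id) catalogue row an override takes over."""
    # Invert the crossmatch so a Hipparcos target can find its Gaia partner(s).
    hip_to_gaia = {}
    for gaia_id, hip in gaia_to_hip.items():
        hip_to_gaia.setdefault(hip, []).append(gaia_id)

    suppressed = set()
    for ov in overrides:
        if ov["action"] == "add":
            continue  # additions do not target an existing catalogue row
        source, source_id = ov["source"], ov["source_id"]
        suppressed.add((source, source_id))
        if source == "gaia" and source_id in gaia_to_hip:
            suppressed.add(("hip", gaia_to_hip[source_id]))
        elif source == "hip":
            for gaia_id in hip_to_gaia.get(source_id, []):
                suppressed.add(("gaia", gaia_id))
    return suppressed


# Made-up IDs for illustration only.
gaia_to_hip = {5001: 71, 5002: 72}
overrides = [
    {"action": "replace", "source": "hip", "source_id": 71},
    {"action": "drop", "source": "gaia", "source_id": 5002},
    {"action": "add", "source": "manual", "source_id": "sun"},
]
print(sorted(suppressed_keys(overrides, gaia_to_hip)))
# → [('gaia', 5001), ('gaia', 5002), ('hip', 71), ('hip', 72)]
```

Each override names only one catalogue row, yet both members of the pair end up in the suppressed set.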
The Sun
add
The Sun does not appear in either catalogue. It is inserted at the ICRS origin with IAU 2015 nominal solar parameters: Teff = 5772 K, MV = 4.83.
Alpha Centauri A & B
replace
The two components have independent catalogue parallaxes, but they are a gravitationally bound binary. Both are replaced with the system-level parallax from Kervella et al. (2016/2017), keeping the components at a consistent and more accurate shared distance.
Proxima Centauri
replace
Proxima is absent from the hipparcos2_best_neighbour crossmatch table,
almost certainly because its extreme proper motion (~3.8 arcsec/yr) caused the
positional match to fail. The override provides Gaia DR3 astrometry sourced via SIMBAD
and flags it as a variable flare star.
Sirius B
replace
The white dwarf companion of Sirius has weak catalogue photometry due to the glare from Sirius A. The override substitutes HST-resolved values from Barstow et al. (2005).
Procyon B
add
This white dwarf companion is absent from Gaia DR3 entirely. Resolved photometry and temperature come from Provencal et al. (2002, HST/STIS).
Canonical identity
Every output row is identified by a compound key: (source, source_id).
The rules are fixed and designed to make human-facing names work correctly:
- Matched pairs (Gaia + Hipparcos linked by the crossmatch) always use source = "hip", regardless of which catalogue's astrometry won. Hipparcos numbers are preferred because named-star designations (HD numbers, Bayer/Flamsteed letters, proper names) are indexed by Hipparcos number in the identifiers sidecar.
- Gaia-only (no crossmatch partner): source = "gaia".
- Hipparcos-only (no partner in the loaded data): source = "hip".
- Manual additions: source = "manual", with a string ID (e.g. "sun") that cannot collide with numeric catalogue IDs.
Cross-catalogue identifiers (for example, the Gaia DR3 source ID of a Hipparcos star) are stored in a separate identifiers sidecar keyed by the same compound key, so they are not duplicated on every row of the dense output table.
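The identity rules reduce to a short function. This is a sketch under the assumption that each output row knows which of its catalogue IDs are present; canonical_key and its parameter names are hypothetical, and the numeric IDs are made up.

```python
def canonical_key(gaia_id=None, hip=None, manual_id=None):
    """Return the (source, source_id) compound key for one output row."""
    if manual_id is not None:
        return ("manual", str(manual_id))  # string ID, cannot collide with numbers
    if hip is not None:
        # Matched pairs and Hipparcos-only rows both use the Hipparcos number,
        # regardless of which catalogue's astrometry won.
        return ("hip", hip)
    if gaia_id is not None:
        return ("gaia", gaia_id)
    raise ValueError("row has no catalogue identity")


print(canonical_key(gaia_id=5001, hip=71))  # → ('hip', 71)
print(canonical_key(gaia_id=5003))          # → ('gaia', 5003)
print(canonical_key(manual_id="sun"))       # → ('manual', 'sun')
```

The winning catalogue is not an input: identity depends only on which IDs exist, so a later change to the scoring rules does not rename any star.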
Streaming merge architecture
The merger processes roughly 1.5 billion Gaia rows without loading the full catalogue into memory. The Hipparcos table (~118K rows), the crossmatch table, and the override list are loaded once and held in memory. Gaia is streamed one batch file at a time.
For each batch, rows that are either crossmatch-linked or override-targeted are separated out and processed individually. Unmatched Gaia rows are written directly to their HEALPix output directories without further inspection. After all Gaia batches are processed, any remaining unmatched Hipparcos rows are flushed, and manual add overrides are emitted.
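The control flow can be sketched as a generator. This is a simplification under stated assumptions: overrides and the HEALPix write step are left out, rows are plain dictionaries, resolve_pair stands in for the winner selection, and a Hipparcos entry linked to several Gaia sources would be resolved once per link. All names and sample rows are hypothetical.

```python
def stream_merge(gaia_batches, hip_rows, gaia_to_hip, resolve_pair):
    """Yield merged rows while holding only one Gaia batch in memory."""
    seen_hip = set()
    for batch in gaia_batches:  # batches are consumed one at a time
        for row in batch:
            hip = gaia_to_hip.get(row["source_id"])
            if hip is None or hip not in hip_rows:
                yield row  # unmatched Gaia row passes straight through
                continue
            seen_hip.add(hip)
            yield resolve_pair(row, hip_rows[hip])
    # After the last batch, flush Hipparcos rows that never found a partner.
    for hip, row in hip_rows.items():
        if hip not in seen_hip:
            yield row


# Made-up rows for illustration only.
gaia_batches = [
    [{"source_id": 1, "cat": "gaia"}, {"source_id": 2, "cat": "gaia"}],
    [{"source_id": 3, "cat": "gaia"}],
]
hip_rows = {11: {"hip": 11, "cat": "hip"}, 12: {"hip": 12, "cat": "hip"}}
gaia_to_hip = {2: 11}

merged = list(stream_merge(gaia_batches, hip_rows, gaia_to_hip,
                           lambda gaia_row, hip_row: hip_row))
print([row["cat"] for row in merged])  # → ['gaia', 'hip', 'gaia', 'hip']
```

Passing gaia_batches as a lazy iterator (for example, a generator that reads one file per step) keeps peak memory at the size of a single batch plus the small in-memory tables.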
The output is partitioned by HEALPix pixel — one directory per sky tile, each containing one or more Parquet files. This layout means downstream tools can process one sky region at a time without reading the whole dataset.
What the output contains
Each row in the merged Parquet table carries:
| Column | Description |
|---|---|
| source, source_id | Canonical identity |
| x_icrs_pc, y_icrs_pc, z_icrs_pc | Sun-centred Cartesian coordinates (parsecs, ICRS J2016.0) |
| ra_deg, dec_deg, r_pc | Spherical coordinates and distance |
| mag_abs | Absolute magnitude |
| teff | Effective temperature (K) |
| quality_flags | Packed uint16: distance source, Teff source, photometry source, status bits |
| astrometry_quality | Fractional parallax error or tier sentinel value |
| photometry_quality | GSP-Phot MG confidence interval width, or 0 if unavailable |
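The Cartesian and spherical columns are related by the standard spherical-to-Cartesian conversion. The function below is a sketch of that relationship and not necessarily how the pipeline computes it; the name icrs_cartesian_pc is hypothetical.

```python
import math


def icrs_cartesian_pc(ra_deg, dec_deg, r_pc):
    """Convert right ascension, declination (degrees) and distance (pc) to x, y, z."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    x = r_pc * math.cos(dec) * math.cos(ra)
    y = r_pc * math.cos(dec) * math.sin(ra)
    z = r_pc * math.sin(dec)
    return x, y, z


print(icrs_cartesian_pc(0.0, 0.0, 1.0))  # → (1.0, 0.0, 0.0)
```

With this convention the x axis points towards RA = 0°, Dec = 0°, and the z axis towards the north celestial pole.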
A separate merge decisions sidecar records the winner, quality scores, and diagnostic columns for every matched pair — bounded in size by the crossmatch table, not the full catalogue.