Gaia and Hipparcos: merging two star catalogues

Why two catalogues?

ESA's Gaia mission (launched 2013, Data Release 3 in 2022) surveyed roughly 1.8 billion sources with sub-milliarcsecond astrometry — an unprecedented map of the Milky Way. But the very brightest objects can require special handling in Gaia: their astrometric solutions may be incomplete, degraded, or missing entirely. These are often exactly the stars that people recognise when they look up: Sirius, Vega, Betelgeuse, the Southern Cross.

Hipparcos (ESA, 1989–1993; revised by van Leeuwen 2007) was purpose-built for bright-star astrometry. Its catalogue of around 118,000 stars gives the pipeline a compact, high-quality bright-star backbone. The trade-off is that Hipparcos is shallow — it has nothing useful to say about the vast majority of the galaxy.

For a usable 3D star map you need both: Gaia for depth and scale, Hipparcos for the bright-star anchors. But roughly 100,000 stars appear in both catalogues, and the pipeline must produce exactly one canonical row per physical star.

Why deduplication is hard

No shared identifier

Gaia and Hipparcos use independent numbering systems. There is no column in either catalogue that says "this is the same star." The only link is ESA's hipparcos2_best_neighbour crossmatch table — a positional match published as part of Gaia DR3. The pipeline downloads it via a TAP query against the Gaia Archive and converts it into a local lookup that maps Gaia source IDs to Hipparcos numbers, along with an angular_distance and a number_of_neighbours count.
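As a rough illustration, the local lookup could be built along these lines. This is a sketch under assumptions, not the pipeline's actual code: the ADQL column names (`source_id`, `original_ext_source_id`) are assumed from the Gaia DR3 archive schema and should be checked against it, the TAP fetch itself is left out, and `CrossmatchEntry`, `build_crossmatch_lookup` and `gaia_ids_by_hip` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Mapping

# ADQL for the TAP query. Column names are assumed from the Gaia DR3
# archive schema; verify them against the archive before relying on this.
CROSSMATCH_ADQL = """
SELECT source_id, original_ext_source_id, angular_distance, number_of_neighbours
FROM gaiadr3.hipparcos2_best_neighbour
"""


@dataclass(frozen=True)
class CrossmatchEntry:
    """One crossmatch row, keyed elsewhere by Gaia source ID (hypothetical type)."""
    hip: int
    angular_distance: float
    number_of_neighbours: int


def build_crossmatch_lookup(rows: Iterable[Mapping]) -> Dict[int, CrossmatchEntry]:
    """Map Gaia source ID -> crossmatch entry from already-downloaded rows."""
    lookup: Dict[int, CrossmatchEntry] = {}
    for row in rows:
        lookup[int(row["source_id"])] = CrossmatchEntry(
            hip=int(row["original_ext_source_id"]),
            angular_distance=float(row["angular_distance"]),
            number_of_neighbours=int(row["number_of_neighbours"]),
        )
    return lookup


def gaia_ids_by_hip(lookup: Mapping[int, CrossmatchEntry]) -> Dict[int, List[int]]:
    """Reverse index: one Hipparcos number may be linked to several Gaia sources."""
    by_hip: Dict[int, List[int]] = {}
    for gaia_id, entry in lookup.items():
        by_hip.setdefault(entry.hip, []).append(gaia_id)
    return by_hip
```

Keying the lookup by Gaia source ID suits a design in which Gaia is the streamed side: each incoming row needs a single dictionary probe.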

Scale

The full Gaia catalogue has around 1.5 billion rows. Loading it whole and joining it against Hipparcos in memory is not an option, so the merger streams Gaia batch by batch while holding only the small Hipparcos table (~118K rows) and the crossmatch table in memory. This constraint shapes the entire architecture.

Binaries and multiples

Hipparcos often saw a binary system as a single point of light, while Gaia resolves the individual components, so the crossmatch may link one Hipparcos entry to multiple Gaia sources. Meanwhile, Hipparcos solution type codes distinguish standard five-parameter single-star fits from more complex acceleration, orbital, and component solutions; the last of these are unreliable for position and parallax. These problems concentrate on exactly the brightest, most recognisable stars.

Quality scoring and winner selection

When a Gaia row and a Hipparcos row are linked by the crossmatch and no manual override applies, the merger runs a decision tree to pick a winner.

Neighbour veto. If number_of_neighbours > 1 in the crossmatch entry, the Hipparcos match is ambiguous: Gaia has resolved what Hipparcos saw as one source into multiple candidates. Gaia wins automatically.

Multiplicity veto. If the Hipparcos solution type is anything other than a standard five-parameter single-star model, the Hipparcos astrometry is suspect. Gaia wins automatically.

Bright-star gate. For the remaining pairs, Hipparcos wins only if its quality score (lower is better) is below a fraction of Gaia's score, and that fraction depends on apparent brightness. The gate is loosest for the brightest stars because Gaia is increasingly unreliable at the very bright end, so a small Hipparcos advantage there is more meaningful than it would be for a normal star.

Apparent magnitude	Margin required for Hipparcos to win
G < 3.5 (very bright)	Hipparcos score must be strictly below Gaia's (1.0×)
3.5 ≤ G < 6 (bright)	Hipparcos score must be below 60% of Gaia's
G ≥ 6 (normal)	Hipparcos score must be below 50% of Gaia's

Tie-break. If Hipparcos does not clear its margin, Gaia wins.
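The decision tree and margin table above can be sketched as a single function. This is an illustration only: `MatchedPair`, `required_margin` and `pick_winner` are hypothetical names, the boolean `hip_is_standard_5p` stands in for whatever check the pipeline makes on the Hipparcos solution type, and manual overrides are assumed to have been handled before this point.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MatchedPair:
    """Inputs to the decision tree for one crossmatched pair (hypothetical type)."""
    gaia_quality: float        # fractional parallax error or tier sentinel
    hip_quality: float         # fractional parallax error
    g_mag: float               # apparent G magnitude
    number_of_neighbours: int  # from the crossmatch entry
    hip_is_standard_5p: bool   # True for a standard five-parameter single-star fit


def required_margin(g_mag: float) -> float:
    """Fraction of Gaia's score that the Hipparcos score must fall below."""
    if g_mag < 3.5:
        return 1.0
    if g_mag < 6.0:
        return 0.6
    return 0.5


def pick_winner(pair: MatchedPair) -> Tuple[str, str]:
    """Return (winner, reason) following the veto -> gate -> tie-break order."""
    if pair.number_of_neighbours > 1:
        return "gaia", "neighbour_veto"
    if not pair.hip_is_standard_5p:
        return "gaia", "multiplicity_veto"
    # Strict inequality: equal scores fall through to the tie-break.
    if pair.hip_quality < required_margin(pair.g_mag) * pair.gaia_quality:
        return "hip", "bright_star_gate"
    return "gaia", "tie_break"
```

Returning the reason alongside the winner keeps each decision auditable, which is the kind of information a decisions log would want to record.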

The quality metric is fractional parallax error, σ/π, computed on the same basis for both catalogues: parallax_error / parallax for Hipparcos, and parallax_error / max(parallax, ε) for Gaia DR3, where ε is a small positive floor that guards against zero or negative parallaxes (or the equivalent Bailer-Jones interval width). Lower is better, and the values are directly comparable across catalogues.
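A minimal sketch of the two formulas as stated. The value of ε is not given in the text, so `EPSILON_MAS` below is a placeholder assumption, and the function names are hypothetical. The Bailer-Jones variant is left out because its exact form is not specified.

```python
EPSILON_MAS = 1e-6  # placeholder floor in milliarcseconds; the real value is not given


def hip_fractional_error(parallax_mas: float, parallax_error_mas: float) -> float:
    """Hipparcos quality score, written exactly as in the text (no floor applied)."""
    return parallax_error_mas / parallax_mas


def gaia_fractional_error(parallax_mas: float, parallax_error_mas: float,
                          eps: float = EPSILON_MAS) -> float:
    """Gaia quality score; the floor keeps zero or negative parallaxes finite and positive."""
    return parallax_error_mas / max(parallax_mas, eps)
```

With a floor this small, a negative or near-zero Gaia parallax produces a very large score rather than a division error or a misleading negative value.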

For Gaia rows that fall back from the primary parallax measurement through successive distance tiers, sentinel quality values (10, 20, 50) are used instead of real fractional errors. These sentinels sort correctly (a photometric distance estimate always loses to a reliable parallax) without requiring special-case logic in the winner selection.
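The sentinel idea can be sketched as follows. Only the values 10, 20 and 50 come from the text; the tier names and the order in which tiers map to values are assumptions, as is the premise that real fractional errors for rows kept in the parallax tier stay well below 10.

```python
from typing import Optional

# Hypothetical tier names; only the sentinel values themselves come from the text.
TIER_SENTINELS = {
    "fallback_1": 10.0,
    "fallback_2": 20.0,
    "fallback_3": 50.0,
}


def astrometry_quality(tier: str, fractional_error: Optional[float] = None) -> float:
    """Return a single sortable score: a real sigma/pi, or a tier sentinel."""
    if tier == "parallax":
        if fractional_error is None:
            raise ValueError("the parallax tier needs a fractional error")
        return fractional_error
    return TIER_SENTINELS[tier]


# Because every score lives on one axis, an ordinary sort ranks the candidates.
candidates = [
    ("photometric estimate", astrometry_quality("fallback_3")),
    ("good parallax", astrometry_quality("parallax", 0.02)),
    ("first fallback", astrometry_quality("fallback_1")),
]
ranked = sorted(candidates, key=lambda item: item[1])
```

The same comparison used for real fractional errors then works unchanged on fallback rows.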

Manual overrides

Some stars cannot be handled by automated scoring, and automated scoring can be wrong. The pipeline includes a YAML-driven override system that takes precedence over everything else. Each override targets a star by its catalogue identity and can do one of three things: add a star that appears in neither catalogue, replace the payload for a star that does exist, or drop a star from the output entirely.

When an override targets one member of a matched Gaia–Hipparcos pair, the pipeline automatically resolves the whole pair via the crossmatch table — the partner is suppressed without the override author needing to supply its ID.
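One way the pair resolution might work is to expand each override target into the full set of catalogue keys it suppresses. This is a sketch under assumptions: `suppressed_keys` is a hypothetical helper, overrides are represented as plain (source, source_id) tuples rather than the pipeline's YAML schema, and the crossmatch is reduced to a Gaia-to-Hipparcos mapping.

```python
from typing import Dict, Iterable, List, Set, Tuple

Key = Tuple[str, int]  # (source, source_id), with source in {"gaia", "hip"}


def suppressed_keys(override_targets: Iterable[Key],
                    gaia_to_hip: Dict[int, int]) -> Set[Key]:
    """Catalogue rows to withhold from automated merging for the given overrides."""
    # Reverse index so a Hipparcos target can find its Gaia partner(s).
    hip_to_gaia: Dict[int, List[int]] = {}
    for gaia_id, hip in gaia_to_hip.items():
        hip_to_gaia.setdefault(hip, []).append(gaia_id)

    suppressed: Set[Key] = set()
    for source, source_id in override_targets:
        suppressed.add((source, source_id))
        if source == "gaia" and source_id in gaia_to_hip:
            suppressed.add(("hip", gaia_to_hip[source_id]))
        elif source == "hip":
            for gaia_id in hip_to_gaia.get(source_id, []):
                suppressed.add(("gaia", gaia_id))
    return suppressed
```

The override author names either member of the pair, and the partner is found from the crossmatch rather than from the override file.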

The Sun

add

The Sun does not appear in either catalogue. It is inserted at the ICRS origin with the IAU 2015 nominal solar effective temperature, T_eff = 5772 K, and an absolute visual magnitude of M_V = 4.83.

Alpha Centauri A & B

replace

The two components have independent catalogue parallaxes, but they are a gravitationally bound binary. Both are replaced with the system-level parallax from Kervella et al. (2016/2017), keeping the components at a consistent and more accurate shared distance.

Proxima Centauri

replace

Proxima is absent from the hipparcos2_best_neighbour crossmatch table, almost certainly because its extreme proper motion (~3.8 arcsec/yr) caused the positional match to fail. The override provides Gaia DR3 astrometry sourced via SIMBAD and flags it as a variable flare star.
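To give a sense of scale for the proper-motion explanation, the arithmetic below estimates how far Proxima moves between the two catalogue reference epochs (J1991.25 for Hipparcos, J2016.0 for Gaia DR3). It shows only the size of the shift. It does not model how ESA's crossmatch treats epoch differences, and `displacement_arcsec` is a hypothetical helper.

```python
HIPPARCOS_EPOCH = 1991.25  # Julian year of the Hipparcos catalogue positions
GAIA_DR3_EPOCH = 2016.0    # Julian year of the Gaia DR3 positions


def displacement_arcsec(pm_arcsec_per_yr: float,
                        epoch_from: float = HIPPARCOS_EPOCH,
                        epoch_to: float = GAIA_DR3_EPOCH) -> float:
    """Linear sky displacement accumulated between two epochs."""
    return pm_arcsec_per_yr * (epoch_to - epoch_from)


# Using the ~3.8 arcsec/yr figure quoted above: 24.75 years of motion,
# which comes to roughly 94 arcsec.
proxima_shift = displacement_arcsec(3.8)
```

A shift of this size is enormous compared with a typical positional match radius, which makes the explanation given above plausible.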

Sirius B

replace

The white dwarf companion of Sirius has weak catalogue photometry due to the glare from Sirius A. The override substitutes HST-resolved values from Barstow et al. (2005).

Procyon B

add

This white dwarf companion is absent from Gaia DR3 entirely. Resolved photometry and temperature come from Provencal et al. (2002, HST/STIS).

Canonical identity

Every output row is identified by a compound key: (source, source_id). The rules are fixed and designed to make human-facing names work correctly:

Matched pairs (Gaia + Hipparcos linked by the crossmatch) always use source = "hip", regardless of which catalogue's astrometry won. Hipparcos numbers are preferred because named-star designations — HD numbers, Bayer/Flamsteed letters, proper names — are indexed by Hipparcos number in the identifiers sidecar.
Gaia-only (no crossmatch partner): source = "gaia".
Hipparcos-only (no partner in the loaded data): source = "hip".
Manual additions: source = "manual", with a string ID (e.g. "sun") that cannot collide with numeric catalogue IDs.

Cross-catalogue identifiers (for example, the Gaia DR3 source ID of a Hipparcos star) are stored in a separate identifiers sidecar keyed by the same compound key — they are not duplicated on every row of the dense output table.
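The identity rules above reduce to a short precedence check. The sketch below uses a hypothetical `canonical_key` function and assumes the caller already knows which identifiers apply to the row.

```python
from typing import Optional, Tuple, Union

CanonicalKey = Tuple[str, Union[int, str]]


def canonical_key(gaia_id: Optional[int] = None,
                  hip: Optional[int] = None,
                  manual_id: Optional[str] = None) -> CanonicalKey:
    """Return the (source, source_id) compound key for one output row."""
    if manual_id is not None:
        return ("manual", manual_id)  # string ID, cannot collide with numeric IDs
    if hip is not None:
        return ("hip", hip)           # matched pair or Hipparcos-only row
    if gaia_id is not None:
        return ("gaia", gaia_id)      # Gaia-only row
    raise ValueError("at least one identifier is required")
```

Note that the key for a matched pair does not depend on which catalogue's astrometry won, so it needs no winner argument.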

Streaming merge architecture

The merger processes roughly 1.5 billion Gaia rows without loading the full catalogue into memory. The Hipparcos table (~118K rows), the crossmatch table, and the override list are loaded once and held in memory. Gaia is streamed one batch file at a time.

For each batch, rows that are either crossmatch-linked or override-targeted are separated out and processed individually. Unmatched Gaia rows are written directly to their HEALPix output directories without further inspection. After all Gaia batches are processed, any remaining unmatched Hipparcos rows are flushed, and manual add overrides are emitted.
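The control flow described above could look roughly like this generator. It is a simplified sketch, not the real merger: rows are plain dictionaries, overrides and HEALPix routing are omitted, each Hipparcos entry is assumed to pair with at most one Gaia row, and `merge_stream` and `resolve_pair` are hypothetical names.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Set

Row = Dict[str, object]


def merge_stream(gaia_batches: Iterable[List[Row]],
                 hip_by_id: Dict[int, Row],
                 gaia_to_hip: Dict[int, int],
                 resolve_pair: Callable[[Row, Row], Row]) -> Iterator[Row]:
    """Yield merged rows while holding only one Gaia batch in memory at a time."""
    seen_hip: Set[int] = set()
    for batch in gaia_batches:
        for row in batch:
            hip = gaia_to_hip.get(row["source_id"])
            partner = hip_by_id.get(hip) if hip is not None else None
            if partner is None:
                # Fast path: no loaded partner, so the row passes straight through.
                yield {**row, "source": "gaia"}
                continue
            seen_hip.add(hip)
            yield resolve_pair(row, partner)
    # After the last batch, flush Hipparcos rows that never found a partner.
    for hip, hip_row in hip_by_id.items():
        if hip not in seen_hip:
            yield {**hip_row, "source": "hip", "source_id": hip}
```

Because the function is a generator, a caller can write each yielded row to its output partition immediately, so memory use is bounded by the batch size plus the small in-memory tables.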

The output is partitioned by HEALPix pixel — one directory per sky tile, each containing one or more Parquet files. This layout means downstream tools can process one sky region at a time without reading the whole dataset.

What the output contains

Each row in the merged Parquet table carries:

Column	Description
`source`, `source_id`	Canonical identity
`x_icrs_pc`, `y_icrs_pc`, `z_icrs_pc`	Sun-centred Cartesian coordinates (parsecs; ICRS, epoch J2016.0)
`ra_deg`, `dec_deg`, `r_pc`	Spherical coordinates and distance
`mag_abs`	Absolute magnitude
`teff`	Effective temperature (K)
`quality_flags`	Packed uint16: distance source, Teff source, photometry source, status bits
`astrometry_quality`	Fractional parallax error or tier sentinel value
`photometry_quality`	GSP-Phot M_G confidence interval width, or 0 if unavailable
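The Cartesian and spherical columns in the table above are related by the standard spherical-to-Cartesian conversion. The sketch below shows that relationship, with x pointing towards RA = 0°, Dec = 0° and z towards the north celestial pole. The function name is hypothetical, and how the pipeline derives `r_pc` in the first place is not covered.

```python
import math
from typing import Tuple


def icrs_cartesian_pc(ra_deg: float, dec_deg: float, r_pc: float) -> Tuple[float, float, float]:
    """Convert right ascension, declination and distance to (x, y, z) in parsecs."""
    ra = math.radians(ra_deg)
    dec = math.radians(dec_deg)
    x = r_pc * math.cos(dec) * math.cos(ra)
    y = r_pc * math.cos(dec) * math.sin(ra)
    z = r_pc * math.sin(dec)
    return x, y, z
```

Storing both forms is redundant but convenient: the Cartesian columns suit 3D rendering and neighbour searches, while the spherical ones suit sky-position queries.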

A separate merge decisions sidecar records the winner, quality scores, and diagnostic columns for every matched pair — bounded in size by the crossmatch table, not the full catalogue.
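Returning to the `quality_flags` column listed in the table: a packed uint16 can carry several small fields in one integer. The text names the fields (distance source, Teff source, photometry source, status bits) but not their widths or order, so the four-bits-each layout below is purely hypothetical, as are the function names.

```python
from typing import Tuple

# Hypothetical layout: four 4-bit fields, lowest bits first.
FIELD_BITS = 4
FIELD_MASK = (1 << FIELD_BITS) - 1  # 0xF


def pack_flags(distance_source: int, teff_source: int,
               photometry_source: int, status: int) -> int:
    """Pack four small integers into one 16-bit value."""
    fields = (distance_source, teff_source, photometry_source, status)
    packed = 0
    for position, value in enumerate(fields):
        if not 0 <= value <= FIELD_MASK:
            raise ValueError("each field must fit in 4 bits")
        packed |= value << (position * FIELD_BITS)
    return packed


def unpack_flags(packed: int) -> Tuple[int, int, int, int]:
    """Recover (distance_source, teff_source, photometry_source, status)."""
    d, t, p, s = ((packed >> (i * FIELD_BITS)) & FIELD_MASK for i in range(4))
    return d, t, p, s
```

Packing keeps provenance on every row at a cost of two bytes, which matters for a table with over a billion rows.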