How to Handle Missing Age Values with Nearest-Neighbor Imputation and Fix Leftover NaNs (Technical Tutorial)

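This article explains why pandas' `interpolate(method='nearest')` fails to fill every missing age value in the Titanic test set, and offers several more reliable alternatives with working code.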

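Pandas' Series.interpolate(method='nearest') is not a k-nearest-neighbors regression over multidimensional features. It is a one-dimensional interpolation method (pandas delegates it to scipy.interpolate.interp1d): for each missing entry it looks along the index (or the default integer positions) for the closest non-null value and copies that value. The key point is that it ignores similarity in other features (such as pclass, sex, and fare) and computes no distance in feature space. In essence, it "finds the nearest number above or below by row number".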

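So the two leftover NaNs in your test set (indices 416 and 417) most likely have this cause: both rows belong to a run of missing age values at the very start or end of the column, where there is no valid value on one side. Nearest interpolation only fills gaps that lie between valid observations; it does not extrapolate. For example: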

# Simulate the scenario: runs of consecutive NaNs in the age column
import pandas as pd
import numpy as np

ages = pd.Series([25, np.nan, np.nan, np.nan, 32, np.nan, 28])
print(ages.interpolate(method='nearest'))
# Values: [25.0, 25.0, 25.0, 32.0, 32.0, 32.0, 28.0] → every interior gap is filled

# But when the missing run is at the start or end of the Series:
ages_edge = pd.Series([np.nan, np.nan, 25, 32, np.nan, np.nan])
print(ages_edge.interpolate(method='nearest'))
# Values: [nan, nan, 25.0, 32.0, nan, nan] → leading and trailing NaNs stay unfilled (they lie outside the range of valid values)

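This is most likely what happened in your case: indices 416 and 417 fall in a run of missing age values at the edge of the column (for instance, the last rows of the test set, with no valid age after them), so nearest interpolation leaves them as NaN instead of extrapolating.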
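
As a quick diagnostic, the sketch below lists the NaN positions that lie outside the range of valid values, which are the ones nearest interpolation cannot reach. The helper name `edge_nan_labels` is hypothetical, and the code assumes a unique index. It also shows one possible stopgap, forward- and backward-filling the edges; this simply copies the closest valid value by row position, so it is no smarter than the interpolation itself.

```python
import numpy as np
import pandas as pd

def edge_nan_labels(s: pd.Series) -> list:
    """Index labels of NaNs before the first or after the last valid value.

    Hypothetical helper for illustration; assumes the index is unique.
    """
    first, last = s.first_valid_index(), s.last_valid_index()
    if first is None:  # every value is NaN
        return s.index.tolist()
    positions = np.arange(len(s))
    outside = (positions < s.index.get_loc(first)) | (positions > s.index.get_loc(last))
    return s.index[outside].tolist()

ages_edge = pd.Series([np.nan, np.nan, 25, 32, np.nan, np.nan])
edge_labels = edge_nan_labels(ages_edge)
print(edge_labels)  # [0, 1, 4, 5]

# Stopgap: carry the last valid value forward, then the first valid value backward.
# In practice this would be chained after interpolate(method='nearest').
filled = ages_edge.ffill().bfill()
print(filled.tolist())  # [25.0, 25.0, 25.0, 32.0, 32.0, 32.0]
```

If the labels reported for the real test set include 416 and 417, the edge explanation above is confirmed.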

✅ A sound solution should be based on multivariate similarity. The following three approaches are recommended:

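1. Use sklearn.impute.KNNImputer (recommended)
This performs true k-nearest-neighbor imputation based on distances between feature vectors, so several columns jointly inform each estimate: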

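from sklearn.impute import KNNImputer
import pandas as pd

# Build the feature matrix for imputation (non-numeric columns such as 'cabin' and 'embarked' are left out)
features = ['pclass', 'sex', 'sibsp', 'parch', 'fare', 'age']  # 'age' must be included so it can be imputed
X_test = titanic_Test[features].copy()
# KNNImputer accepts only numeric input; this assumes 'sex' is stored as 'male'/'female'
if not pd.api.types.is_numeric_dtype(X_test['sex']):
    X_test['sex'] = X_test['sex'].map({'male': 0, 'female': 1})

# Note: ideally fit the imputer on the training set, then transform the test set:
# imputer.fit(X_train); X_test_imputed = imputer.transform(X_test)
imputer = KNNImputer(n_neighbors=5)
# If only the test set is available, fit and transform it directly
titanic_Test['age'] = imputer.fit_transform(X_test)[:, features.index('age')]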
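
The fit-on-train, transform-on-test pattern mentioned in the comments can be sketched as follows. This is a minimal illustration on made-up toy frames (`train`, `test`, and their values are assumptions, not the real Titanic data), with 'sex' already encoded as 0/1.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

cols = ['pclass', 'sex', 'fare', 'age']  # 'sex' already encoded as 0/1

# Toy stand-ins for the training and test sets (made-up values)
train = pd.DataFrame({
    'pclass': [1, 1, 3, 3, 2, 2],
    'sex':    [0, 1, 0, 1, 0, 1],
    'fare':   [80.0, 90.0, 8.0, 7.5, 20.0, 26.0],
    'age':    [45.0, 38.0, 22.0, np.nan, 30.0, 28.0],
})
test = pd.DataFrame({
    'pclass': [1, 3],
    'sex':    [0, 1],
    'fare':   [85.0, 8.0],
    'age':    [np.nan, np.nan],
})

imputer = KNNImputer(n_neighbors=2)
imputer.fit(train[cols])  # donor rows come from the training set only

# transform() returns a NumPy array; wrap it to keep column names and the test index
test_imputed = pd.DataFrame(imputer.transform(test[cols]), columns=cols, index=test.index)
test['age'] = test_imputed['age']
print(test)
```

Because the distance is Euclidean over the observed columns, a wide-ranging column such as fare tends to dominate; standardizing the features before imputation may be worth considering.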

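2. Group-mean imputation (domain-driven)
This exploits strong domain structure in the Titanic data: age distributions differ markedly across passenger class and sex: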

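# Fill within each pclass + sex group (more robust than a single global mean)
titanic_Test['age'] = titanic_Test.groupby(['pclass', 'sex'])['age'].transform(
    lambda x: x.fillna(x.mean())
)
# For any NaN still left (e.g. a group whose ages are all missing), fall back to the global mean
# Assign the result back instead of using inplace=True on a column selection
titanic_Test['age'] = titanic_Test['age'].fillna(titanic_Test['age'].mean())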
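
To see why the fallback step matters, here is a small worked example on a made-up frame (`df` and its values are illustrative assumptions) in which one group has no known age at all.

```python
import numpy as np
import pandas as pd

# Made-up toy data: the (2, 'male') group has no known age
df = pd.DataFrame({
    'pclass': [1, 1, 1, 3, 3, 2],
    'sex':    ['male', 'male', 'male', 'female', 'female', 'male'],
    'age':    [40.0, 50.0, np.nan, 20.0, np.nan, np.nan],
})

# Per-row group mean; a group with no known age yields NaN here
group_mean = df.groupby(['pclass', 'sex'])['age'].transform('mean')
df['age'] = df['age'].fillna(group_mean)

# The (2, 'male') row is still missing after the group step
remaining = int(df['age'].isna().sum())
print(remaining)  # 1

# Fallback: global mean of the ages known so far
df['age'] = df['age'].fillna(df['age'].mean())
print(df['age'].tolist())  # [40.0, 50.0, 45.0, 20.0, 20.0, 35.0]
```

Note that the fallback mean is computed after the group fill, so it includes the group-filled values, matching the order of operations in the snippet above.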

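3. Random-forest regression imputation (potentially higher accuracy)
Treat age as the target variable and the remaining numeric features as inputs, then train a lightweight model to predict it: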

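from sklearn.ensemble import RandomForestRegressor
import pandas as pd

predictors = ['pclass', 'sex', 'sibsp', 'parch', 'fare']  # inputs only; 'age' is the target

def prepare(df):
    # Numeric predictor matrix: encode 'sex' if needed and fill missing fares with the median
    X = df[predictors].copy()
    if not pd.api.types.is_numeric_dtype(X['sex']):
        X['sex'] = X['sex'].map({'male': 0, 'female': 1})  # assumes 'male'/'female' labels
    X['fare'] = X['fare'].fillna(X['fare'].median())
    return X

# Prepare the training data from rows with a known age (ideally the full training set)
known = titanic_Train['age'].notna()
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(prepare(titanic_Train)[known], titanic_Train.loc[known, 'age'])

# Predict the missing ages in the test set
mask_missing = titanic_Test['age'].isna()
if mask_missing.any():
    titanic_Test.loc[mask_missing, 'age'] = rf.predict(prepare(titanic_Test)[mask_missing])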

⚠️ Important notes: