"""诊断 NaN 来源 - pytest 版本""" import numpy as np import polars as pl import pytest from src.factors import FactorEngine from src.training import ( FactorManager, NullFiller, Winsorizer, StandardScaler, ) from src.training.components.filters import STFilter from src.training.core.stock_pool_manager import StockPoolManager from src.experiment.common import ( SELECTED_FACTORS, FACTOR_DEFINITIONS, LABEL_NAME, LABEL_FACTOR, stock_pool_filter, STOCK_FILTER_REQUIRED_COLUMNS, ) # 只使用少量因子加速测试 EXCLUDED_FACTORS = [ "GTJA_alpha001", "GTJA_alpha002", "GTJA_alpha003", "GTJA_alpha004", "GTJA_alpha005", "GTJA_alpha006", "GTJA_alpha007", "GTJA_alpha008", "GTJA_alpha009", "GTJA_alpha010", "GTJA_alpha011", "GTJA_alpha012", "GTJA_alpha013", "GTJA_alpha014", "GTJA_alpha015", ] TEST_DATE_RANGE = { "train": ("20200101", "20200331"), # 缩小范围加速测试 "val": ("20200401", "20200430"), "test": ("20200501", "20200531"), } def test_diagnose_nan_source(): """诊断 NaN 来源""" print("\n" + "=" * 80) print("NaN 来源诊断") print("=" * 80) engine = FactorEngine() factor_manager = FactorManager( selected_factors=SELECTED_FACTORS, factor_definitions=FACTOR_DEFINITIONS, label_factor=LABEL_FACTOR, excluded_factors=EXCLUDED_FACTORS, ) # Step 1: 注册因子并计算原始数据 print("\n[Step 1] 注册因子并计算原始数据...") feature_cols = factor_manager.register_to_engine(engine, verbose=False) print(f" 特征数: {len(feature_cols)}") all_start = min( TEST_DATE_RANGE["train"][0], TEST_DATE_RANGE["val"][0], TEST_DATE_RANGE["test"][0], ) all_end = max( TEST_DATE_RANGE["train"][1], TEST_DATE_RANGE["val"][1], TEST_DATE_RANGE["test"][1], ) raw_data = engine.compute( factor_names=feature_cols + [LABEL_NAME], start_date=all_start, end_date=all_end, ) print(f" 原始数据形状: {raw_data.shape}") # 检查原始数据中的 NaN print("\n[Step 2] 原始数据 NaN 统计...") nan_counts = {} for col in feature_cols[:20]: # 只检查前20个特征 nan_count = raw_data[col].null_count() if nan_count > 0: nan_counts[col] = nan_count print(f" 含 NaN 的特征数 (前20个): {len(nan_counts)}") for col, count in list(nan_counts.items())[:10]: pct = count / len(raw_data) * 100 print(f" {col}: {count} ({pct:.1f}%)") # Step 3: 应用过滤器 print("\n[Step 3] 应用过滤器...") st_filter = STFilter(data_router=engine.router) filtered_data = st_filter.filter(raw_data) print(f" 过滤后数据形状: {filtered_data.shape}") # 检查过滤后的 NaN nan_after_filter = sum(filtered_data[col].null_count() for col in feature_cols[:20]) print(f" 前20个特征总 NaN 数: {nan_after_filter}") # Step 4: 应用股票池筛选 print("\n[Step 4] 应用股票池筛选...") pool_manager = StockPoolManager( filter_func=stock_pool_filter, required_columns=STOCK_FILTER_REQUIRED_COLUMNS, data_router=engine.router, ) pool_data = pool_manager.filter_and_select_daily(filtered_data) print(f" 筛选后数据形状: {pool_data.shape}") # 检查筛选后的 NaN nan_after_pool = sum(pool_data[col].null_count() for col in feature_cols[:20]) print(f" 前20个特征总 NaN 数: {nan_after_pool}") # Step 5: 划分数据 print("\n[Step 5] 划分训练集...") train_mask = (pool_data["trade_date"] >= TEST_DATE_RANGE["train"][0]) & ( pool_data["trade_date"] <= TEST_DATE_RANGE["train"][1] ) train_df = pool_data.filter(train_mask) print(f" 训练集形状: {train_df.shape}") # 检查训练集的 NaN nan_train_before = sum(train_df[col].null_count() for col in feature_cols[:20]) print(f" 前20个特征总 NaN 数: {nan_train_before}") # Step 6: 依次应用 processors 并检查每一步的 NaN print("\n[Step 6] 依次应用 processors...") # 6.1 NullFiller print("\n [6.1] NullFiller (by_date=True, strategy=mean)...") null_filler = NullFiller(feature_cols=feature_cols, strategy="mean", by_date=True) after_null = null_filler.fit_transform(train_df) nan_after_null = sum(after_null[col].null_count() for col in feature_cols[:20]) print(f" 处理后前20个特征总 NaN 数: {nan_after_null}") # 检查具体哪些列还有 NaN if nan_after_null > 0: print(" 仍有 NaN 的列:") for col in feature_cols[:20]: count = after_null[col].null_count() if count > 0: print(f" {col}: {count}") # 6.2 Winsorizer print("\n [6.2] Winsorizer (by_date=False)...") winsorizer = Winsorizer( feature_cols=feature_cols, lower=0.01, upper=0.99, by_date=False ) after_winsor = winsorizer.fit_transform(after_null) nan_after_winsor = sum(after_winsor[col].null_count() for col in feature_cols[:20]) print(f" 处理后前20个特征总 NaN 数: {nan_after_winsor}") # 6.3 StandardScaler print("\n [6.3] StandardScaler...") scaler = StandardScaler(feature_cols=feature_cols) after_scaler = scaler.fit_transform(after_winsor) nan_after_scaler = sum(after_scaler[col].null_count() for col in feature_cols[:20]) print(f" 处理后前20个特征总 NaN 数: {nan_after_scaler}") # 检查具体哪些列还有 NaN if nan_after_scaler > 0: print(" 仍有 NaN 的列:") for col in feature_cols[:20]: count = after_scaler[col].null_count() if count > 0: # 检查这列在训练时的统计量 has_mean = col in scaler.mean_ has_std = col in scaler.std_ mean_val = scaler.mean_.get(col, "N/A") std_val = scaler.std_.get(col, "N/A") print(f" {col}: {count}, mean={has_mean}, std={has_std}") # Step 6.4: 检查 StandardScaler 之后、select 之前的所有列 print("\n [6.4] 检查 StandardScaler 后的所有列...") all_nan_counts = {} for col in feature_cols: count = after_scaler[col].null_count() if count > 0: all_nan_counts[col] = count print(f" 所有特征列中含 NaN 的列数: {len(all_nan_counts)}") # 检查这些列是否在 feature_cols 的前20个中 nan_cols_in_first_20 = [c for c in all_nan_counts.keys() if c in feature_cols[:20]] nan_cols_not_in_first_20 = [ c for c in all_nan_counts.keys() if c not in feature_cols[:20] ] print(f" 在前20个中的: {len(nan_cols_in_first_20)}") print(f" 不在前20个中的: {len(nan_cols_not_in_first_20)}") if nan_cols_not_in_first_20: print(f" 例如: {nan_cols_not_in_first_20[:10]}") # 检查 StandardScaler 是否学到了这些列的统计量 print("\n [6.5] 检查 StandardScaler 学到的统计量...") missing_stats_cols = [c for c in all_nan_counts.keys() if c not in scaler.mean_] print(f" 未学到 mean 的列数: {len(missing_stats_cols)}") if missing_stats_cols: print(f" 例如: {missing_stats_cols[:10]}") # 检查这些列的数据类型 for col in missing_stats_cols[:3]: dtype = after_scaler[col].dtype print(f" {col}: dtype={dtype}") # Step 7: 提取 X 并检查 print("\n[Step 7] 提取特征矩阵 X...") X = after_scaler.select(feature_cols) # 关键检查:对比 after_scaler 和 X 中的列 print("\n [7.1] 对比 after_scaler 和 X 中的列...") for col in feature_cols[:20]: null_in_raw = after_scaler[col].null_count() null_in_x = X[col].null_count() if null_in_raw != null_in_x: print(f" {col}: after_scaler={null_in_raw}, X={null_in_x}") X_np = X.to_numpy() print(f" X 形状: {X_np.shape}") print(f" X 中 NaN 总数: {np.isnan(X_np).sum()}") # 检查哪些特征列有 NaN nan_by_col = [] for i, col in enumerate(feature_cols): col_nan = np.isnan(X_np[:, i]).sum() if col_nan > 0: nan_by_col.append((col, col_nan)) print(f" 含 NaN 的特征列数: {len(nan_by_col)}") for col, count in nan_by_col[:10]: print(f" {col}: {count}") # 检查这些列在 after_scaler 中的数据类型 print("\n [Step 8] 检查含 NaN 列的数据类型...") for col, count in nan_by_col[:5]: dtype = after_scaler[col].dtype null_count = after_scaler[col].null_count() print(f" {col}: dtype={dtype}, null_count={null_count}") # 检查这些列是否是布尔类型 boolean_cols = [ col for col in feature_cols if after_scaler[col].dtype == pl.Boolean ] print(f"\n Boolean 类型的特征列数: {len(boolean_cols)}") print(f" 例如: {boolean_cols[:10]}") # 检查这些布尔列是否有 null boolean_with_null = [ col for col in boolean_cols if after_scaler[col].null_count() > 0 ] print(f"\n 含 null 的 Boolean 列数: {len(boolean_with_null)}") # Step 9: 检查是否有不在 feature_cols 中的列有 NaN print("\n [Step 9] 检查非特征列的 NaN...") non_feature_cols = [c for c in after_scaler.columns if c not in feature_cols] non_feature_nan = {} for col in non_feature_cols[:10]: count = after_scaler[col].null_count() if count > 0: non_feature_nan[col] = count print(f" 非特征列中含 NaN 的列数: {len(non_feature_nan)}") for col, count in list(non_feature_nan.items())[:5]: print(f" {col}: {count}") print("\n" + "=" * 80) print("诊断完成") print("=" * 80) # 断言用于pytest assert True