- 新增 FactorManager 组件:统一管理多种来源因子 - 新增 DataPipeline 组件:完整数据处理流程(注册、过滤、划分、预处理) - 新增 Task 策略组件:BaseTask 抽象基类、RegressionTask、RankTask - 新增 ResultAnalyzer 组件:特征重要性分析和结果组装 - 新增 TrainerV2:作为纯调度引擎协调各组件 - 支持回归和排序学习两种训练模式 - 采用组合模式解耦训练流程,消除代码重复
2380 lines
68 KiB
Markdown
2380 lines
68 KiB
Markdown
# Trainer 模块化重构实现计划
|
||
|
||
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
|
||
|
||
**Goal:** 重构 Trainer 实现模块化训练流程,完整保留所有现有功能,消除 regression.py 和 learn_to_rank.py 的代码重复
|
||
|
||
**Architecture:** 采用组合模式(Composition over Inheritance),将训练流程解耦为 FactorManager(因子管理)、DataPipeline(数据流程)、Task(任务策略)、ResultAnalyzer(结果分析)四个独立组件,Trainer 作为纯调度引擎协调各组件
|
||
|
||
**Tech Stack:** Python 3.10+, Polars, LightGBM, Pydantic
|
||
|
||
---
|
||
|
||
## 前置检查
|
||
|
||
**读取参考文件以了解当前实现:**
|
||
- @src/experiment/common.py - 当前配置和共用函数
|
||
- @src/experiment/regression.py - 回归训练流程(640行)
|
||
- @src/experiment/learn_to_rank.py - 排序学习流程(876行)
|
||
- @src/training/core/trainer.py - 当前 Trainer 实现
|
||
- @src/training/components/models/lightgbm.py - LightGBM 回归模型
|
||
- @src/training/components/models/lightgbm_lambdarank.py - LambdaRank 排序模型
|
||
- @src/training/components/base.py - 基础抽象类
|
||
|
||
---
|
||
|
||
## Task 1: 创建 docs/plans 目录并保存计划
|
||
|
||
**Files:**
|
||
- Create: `docs/plans/2026-03-23-trainer-refactor-plan.md`
|
||
|
||
**Step 1: 创建目录并复制计划文件**
|
||
|
||
```bash
|
||
mkdir -p docs/plans
|
||
cp .plannotator/plans/trainer-v3-2026-03-23-approved.md docs/plans/2026-03-23-trainer-refactor-plan.md
|
||
```
|
||
|
||
**Step 2: Commit**
|
||
|
||
```bash
|
||
git add docs/plans/
|
||
git commit -m "docs: add trainer refactoring implementation plan"
|
||
```
|
||
|
||
---
|
||
|
||
## Task 2: 重构 common.py - 添加统一配置结构
|
||
|
||
**Files:**
|
||
- Modify: `src/experiment/common.py` - 在文件末尾添加新的配置结构
|
||
|
||
**Step 1: 在 common.py 末尾添加 TRAINING_CONFIG 和辅助函数**
|
||
|
||
```python
|
||
# ============================================================
|
||
# 新增:统一配置结构(用于模块化 Trainer)
|
||
# ============================================================
|
||
|
||
from typing import Dict, List, Tuple, Any, Callable, Optional
|
||
from dataclasses import dataclass, field
|
||
|
||
|
||
@dataclass
|
||
class TrainingConfig:
|
||
"""训练配置数据结构"""
|
||
|
||
# 因子配置
|
||
selected_factors: List[str]
|
||
factor_definitions: Dict[str, str]
|
||
label_factor: Dict[str, str]
|
||
excluded_factors: List[str]
|
||
|
||
# 数据配置
|
||
stock_pool_filter: Callable
|
||
stock_pool_required_columns: List[str]
|
||
st_filter_enabled: bool = True
|
||
|
||
# 日期范围
|
||
train_start: str
|
||
train_end: str
|
||
val_start: str
|
||
val_end: str
|
||
test_start: str
|
||
test_end: str
|
||
|
||
# 输出配置
|
||
output_dir: str
|
||
save_predictions: bool
|
||
save_model: bool
|
||
top_n: int
|
||
|
||
@property
|
||
def date_range(self) -> Dict[str, Tuple[str, str]]:
|
||
"""获取日期范围字典"""
|
||
return {
|
||
"train": (self.train_start, self.train_end),
|
||
"val": (self.val_start, self.val_end),
|
||
"test": (self.test_start, self.test_end),
|
||
}
|
||
|
||
|
||
@dataclass
|
||
class ModelConfig:
|
||
"""模型配置基类"""
|
||
model_params: Dict[str, Any]
|
||
label_name: str
|
||
|
||
|
||
@dataclass
|
||
class RegressionModelConfig(ModelConfig):
|
||
"""回归模型配置"""
|
||
pass
|
||
|
||
|
||
@dataclass
|
||
class RankModelConfig(ModelConfig):
|
||
"""排序学习模型配置"""
|
||
n_quantiles: int = 20
|
||
|
||
|
||
# 创建统一配置实例
|
||
def create_training_config() -> TrainingConfig:
|
||
"""创建训练配置"""
|
||
return TrainingConfig(
|
||
selected_factors=SELECTED_FACTORS,
|
||
factor_definitions=FACTOR_DEFINITIONS,
|
||
label_factor=LABEL_FACTOR,
|
||
excluded_factors=EXCLUDED_FACTORS,
|
||
stock_pool_filter=stock_pool_filter,
|
||
stock_pool_required_columns=STOCK_FILTER_REQUIRED_COLUMNS,
|
||
st_filter_enabled=True,
|
||
train_start=TRAIN_START,
|
||
train_end=TRAIN_END,
|
||
val_start=VAL_START,
|
||
val_end=VAL_END,
|
||
test_start=TEST_START,
|
||
test_end=TEST_END,
|
||
output_dir=OUTPUT_DIR,
|
||
save_predictions=SAVE_PREDICTIONS,
|
||
save_model=SAVE_MODEL,
|
||
top_n=TOP_N,
|
||
)
|
||
|
||
|
||
def create_regression_config() -> RegressionModelConfig:
|
||
"""创建回归模型配置"""
|
||
return RegressionModelConfig(
|
||
model_params=MODEL_PARAMS_REGRESSION,
|
||
label_name="future_return_5",
|
||
)
|
||
|
||
|
||
def create_rank_config() -> RankModelConfig:
|
||
"""创建排序学习模型配置"""
|
||
return RankModelConfig(
|
||
model_params=MODEL_PARAMS_RANK,
|
||
label_name="future_return_5",
|
||
n_quantiles=20,
|
||
)
|
||
|
||
|
||
# 保持向后兼容的导入
|
||
__all__ = [
|
||
# 原有导出
|
||
"SELECTED_FACTORS",
|
||
"FACTOR_DEFINITIONS",
|
||
"LABEL_FACTOR",
|
||
"EXCLUDED_FACTORS",
|
||
"register_factors",
|
||
"prepare_data",
|
||
"stock_pool_filter",
|
||
"STOCK_FILTER_REQUIRED_COLUMNS",
|
||
"TRAIN_START",
|
||
"TRAIN_END",
|
||
"VAL_START",
|
||
"VAL_END",
|
||
"TEST_START",
|
||
"TEST_END",
|
||
"OUTPUT_DIR",
|
||
"SAVE_PREDICTIONS",
|
||
"SAVE_MODEL",
|
||
"TOP_N",
|
||
"get_model_save_path",
|
||
"save_model_with_factors",
|
||
"get_label_factor",
|
||
# 新增导出
|
||
"TrainingConfig",
|
||
"ModelConfig",
|
||
"RegressionModelConfig",
|
||
"RankModelConfig",
|
||
"create_training_config",
|
||
"create_regression_config",
|
||
"create_rank_config",
|
||
]
|
||
```
|
||
|
||
**Step 2: 在 common.py 的 MODEL_PARAMS 定义后添加回归和排序的参数分开定义**
|
||
|
||
找到 MODEL_PARAMS 定义的位置(约第400行左右),将其重命名为 MODEL_PARAMS_REGRESSION,然后添加排序学习的参数:
|
||
|
||
```python
|
||
# 回归模型参数
|
||
MODEL_PARAMS_REGRESSION = {
|
||
# ... 原有的 MODEL_PARAMS 内容 ...
|
||
}
|
||
|
||
# 排序学习模型参数
|
||
MODEL_PARAMS_RANK = {
|
||
"objective": "lambdarank",
|
||
"metric": "ndcg",
|
||
"ndcg_at": 25,
|
||
"learning_rate": 0.1,
|
||
"n_estimators": 1000,
|
||
"early_stopping_round": 50,
|
||
"max_depth": 4,
|
||
"num_leaves": 32,
|
||
"min_data_in_leaf": 256,
|
||
"subsample": 0.4,
|
||
"subsample_freq": 1,
|
||
"colsample_bytree": 0.4,
|
||
"reg_alpha": 10.0,
|
||
"reg_lambda": 50.0,
|
||
"lambdarank_truncation_level": 50,
|
||
"label_gain": [i * i for i in range(1, 21)],
|
||
"verbose": -1,
|
||
"random_state": 42,
|
||
}
|
||
|
||
# 保持向后兼容
|
||
MODEL_PARAMS = MODEL_PARAMS_REGRESSION
|
||
```
|
||
|
||
**Step 3: Run tests to verify changes don't break existing code**
|
||
|
||
```bash
|
||
uv run pytest tests/test_sync.py -v -x
|
||
```
|
||
|
||
Expected: Tests pass (or at least not broken by our changes)
|
||
|
||
**Step 4: Commit**
|
||
|
||
```bash
|
||
git add src/experiment/common.py
|
||
git commit -m "refactor(common): add unified config structure for modular trainer
|
||
|
||
- Add TrainingConfig dataclass for unified configuration
|
||
- Add ModelConfig, RegressionModelConfig, RankModelConfig
|
||
- Separate MODEL_PARAMS into MODEL_PARAMS_REGRESSION and MODEL_PARAMS_RANK
|
||
- Add factory functions: create_training_config, create_regression_config, create_rank_config
|
||
- Maintain backward compatibility"
|
||
```
|
||
|
||
---
|
||
|
||
## Task 3: 创建 FactorManager 组件
|
||
|
||
**Files:**
|
||
- Create: `src/training/factor_manager.py`
|
||
- Test: `tests/test_factor_manager.py`
|
||
|
||
**Step 1: Create the FactorManager implementation**
|
||
|
||
```python
|
||
"""因子管理器
|
||
|
||
管理多种来源的因子:
|
||
- metadata 中注册的因子
|
||
- DSL 表达式定义的因子
|
||
- Label 因子
|
||
- 排除的因子列表
|
||
"""
|
||
|
||
from typing import Dict, List, Optional, Any
|
||
import polars as pl
|
||
|
||
from src.factors import FactorEngine
|
||
|
||
|
||
class FactorManager:
|
||
"""因子管理器
|
||
|
||
统一管理多种来源的因子注册和准备:
|
||
1. metadata 中已注册的因子(通过名称引用)
|
||
2. DSL 表达式定义的因子(动态注册)
|
||
3. Label 因子(通过表达式定义)
|
||
4. 排除的因子列表(从最终列表中移除)
|
||
|
||
Attributes:
|
||
selected_factors: 从 metadata 中选择的因子名称列表
|
||
factor_definitions: DSL 表达式定义的因子字典 {name: dsl_expression}
|
||
label_factor: Label 因子定义 {name: dsl_expression}
|
||
excluded_factors: 需要排除的因子名称列表
|
||
registered_factors: 已注册到 FactorEngine 的因子列表
|
||
"""
|
||
|
||
def __init__(
|
||
self,
|
||
selected_factors: List[str],
|
||
factor_definitions: Dict[str, str],
|
||
label_factor: Dict[str, str],
|
||
excluded_factors: Optional[List[str]] = None,
|
||
):
|
||
"""初始化因子管理器
|
||
|
||
Args:
|
||
selected_factors: 从 metadata 中选择的因子名称列表
|
||
factor_definitions: DSL 表达式定义的因子字典
|
||
label_factor: Label 因子定义字典
|
||
excluded_factors: 需要排除的因子名称列表
|
||
"""
|
||
self.selected_factors = selected_factors or []
|
||
self.factor_definitions = factor_definitions or {}
|
||
self.label_factor = label_factor or {}
|
||
self.excluded_factors = excluded_factors or []
|
||
self.registered_factors: List[str] = []
|
||
|
||
def register_to_engine(
|
||
self,
|
||
engine: FactorEngine,
|
||
verbose: bool = True,
|
||
) -> List[str]:
|
||
"""注册所有因子到 FactorEngine
|
||
|
||
按以下顺序注册:
|
||
1. metadata 中的因子(通过名称从 metadata 加载)
|
||
2. DSL 表达式定义的因子(使用 add_factor 注册)
|
||
3. Label 因子(使用 add_factor 注册)
|
||
4. 排除指定的因子
|
||
|
||
Args:
|
||
engine: FactorEngine 实例
|
||
verbose: 是否打印注册信息
|
||
|
||
Returns:
|
||
最终的特征列名列表(已排除指定因子)
|
||
"""
|
||
if verbose:
|
||
print("\n" + "=" * 80)
|
||
print("因子注册")
|
||
print("=" * 80)
|
||
|
||
# Step 1: 从 metadata 注册选中的因子
|
||
if verbose:
|
||
print(f"\n[1/4] 从 metadata 注册 {len(self.selected_factors)} 个因子...")
|
||
|
||
feature_cols = []
|
||
for factor_name in self.selected_factors:
|
||
try:
|
||
engine.add_factor(factor_name)
|
||
feature_cols.append(factor_name)
|
||
if verbose:
|
||
print(f" ✓ {factor_name}")
|
||
except Exception as e:
|
||
if verbose:
|
||
print(f" ✗ {factor_name}: {e}")
|
||
|
||
# Step 2: 注册 DSL 定义的因子
|
||
if self.factor_definitions:
|
||
if verbose:
|
||
print(f"\n[2/4] 注册 {len(self.factor_definitions)} 个 DSL 定义因子...")
|
||
|
||
for factor_name, dsl_expr in self.factor_definitions.items():
|
||
if factor_name not in self.excluded_factors:
|
||
try:
|
||
engine.add_factor(factor_name, dsl_expr)
|
||
feature_cols.append(factor_name)
|
||
if verbose:
|
||
print(f" ✓ {factor_name}: {dsl_expr[:50]}...")
|
||
except Exception as e:
|
||
if verbose:
|
||
print(f" ✗ {factor_name}: {e}")
|
||
|
||
# Step 3: 注册 Label 因子
|
||
if self.label_factor:
|
||
if verbose:
|
||
print(f"\n[3/4] 注册 Label 因子...")
|
||
|
||
for factor_name, dsl_expr in self.label_factor.items():
|
||
try:
|
||
engine.add_factor(factor_name, dsl_expr)
|
||
if verbose:
|
||
print(f" ✓ Label: {factor_name}")
|
||
except Exception as e:
|
||
if verbose:
|
||
print(f" ✗ Label {factor_name}: {e}")
|
||
|
||
# Step 4: 排除指定因子
|
||
if self.excluded_factors:
|
||
if verbose:
|
||
print(f"\n[4/4] 排除 {len(self.excluded_factors)} 个因子...")
|
||
|
||
original_count = len(feature_cols)
|
||
feature_cols = [f for f in feature_cols if f not in self.excluded_factors]
|
||
excluded_count = original_count - len(feature_cols)
|
||
|
||
if verbose:
|
||
print(f" 排除 {excluded_count} 个因子")
|
||
for f in self.excluded_factors:
|
||
if f in self.selected_factors or f in self.factor_definitions:
|
||
print(f" - {f}")
|
||
|
||
self.registered_factors = feature_cols
|
||
|
||
if verbose:
|
||
print(f"\n[结果] 最终特征数: {len(feature_cols)}")
|
||
print("=" * 80)
|
||
|
||
return feature_cols
|
||
|
||
def get_feature_cols(self) -> List[str]:
|
||
"""获取已注册的特征列名列表
|
||
|
||
Returns:
|
||
特征列名列表
|
||
"""
|
||
return self.registered_factors
|
||
|
||
def get_label_col(self) -> Optional[str]:
|
||
"""获取 Label 列名
|
||
|
||
Returns:
|
||
Label 列名,如果没有则返回 None
|
||
"""
|
||
if self.label_factor:
|
||
return list(self.label_factor.keys())[0]
|
||
return None
|
||
```
|
||
|
||
**Step 2: Create test file**
|
||
|
||
```python
|
||
"""FactorManager 测试"""
|
||
|
||
import pytest
|
||
from unittest.mock import Mock, MagicMock
|
||
|
||
from src.training.factor_manager import FactorManager
|
||
|
||
|
||
class TestFactorManager:
|
||
"""测试 FactorManager"""
|
||
|
||
def test_init(self):
|
||
"""测试初始化"""
|
||
fm = FactorManager(
|
||
selected_factors=["factor1", "factor2"],
|
||
factor_definitions={"factor3": "close + open"},
|
||
label_factor={"label": "future_return_5"},
|
||
excluded_factors=["factor2"],
|
||
)
|
||
|
||
assert fm.selected_factors == ["factor1", "factor2"]
|
||
assert fm.factor_definitions == {"factor3": "close + open"}
|
||
assert fm.label_factor == {"label": "future_return_5"}
|
||
assert fm.excluded_factors == ["factor2"]
|
||
assert fm.registered_factors == []
|
||
|
||
def test_register_to_engine(self):
|
||
"""测试注册到引擎"""
|
||
# 创建 mock engine
|
||
engine = Mock()
|
||
engine.add_factor = Mock()
|
||
|
||
fm = FactorManager(
|
||
selected_factors=["factor1", "factor2"],
|
||
factor_definitions={"factor3": "close + open"},
|
||
label_factor={"label": "future_return"},
|
||
excluded_factors=["factor2"],
|
||
)
|
||
|
||
feature_cols = fm.register_to_engine(engine, verbose=False)
|
||
|
||
# 验证调用
|
||
assert engine.add_factor.call_count == 4 # 2 selected + 1 dsl + 1 label
|
||
|
||
# 验证结果(factor2 被排除)
|
||
assert "factor1" in feature_cols
|
||
assert "factor2" not in feature_cols
|
||
assert "factor3" in feature_cols
|
||
assert fm.registered_factors == feature_cols
|
||
|
||
def test_get_feature_cols(self):
|
||
"""测试获取特征列"""
|
||
fm = FactorManager(
|
||
selected_factors=["factor1"],
|
||
factor_definitions={},
|
||
label_factor={},
|
||
)
|
||
|
||
# 注册前为空
|
||
assert fm.get_feature_cols() == []
|
||
|
||
# 注册后
|
||
engine = Mock()
|
||
engine.add_factor = Mock()
|
||
fm.register_to_engine(engine, verbose=False)
|
||
|
||
assert fm.get_feature_cols() == ["factor1"]
|
||
|
||
def test_get_label_col(self):
|
||
"""测试获取 Label 列"""
|
||
fm = FactorManager(
|
||
selected_factors=[],
|
||
factor_definitions={},
|
||
label_factor={"label": "future_return"},
|
||
)
|
||
|
||
assert fm.get_label_col() == "label"
|
||
|
||
# 没有 label 时返回 None
|
||
fm2 = FactorManager(selected_factors=[], factor_definitions={}, label_factor={})
|
||
assert fm2.get_label_col() is None
|
||
```
|
||
|
||
**Step 3: Run tests**
|
||
|
||
```bash
|
||
uv run pytest tests/test_factor_manager.py -v
|
||
```
|
||
|
||
Expected: All tests pass
|
||
|
||
**Step 4: Commit**
|
||
|
||
```bash
|
||
git add src/training/factor_manager.py tests/test_factor_manager.py
|
||
git commit -m "feat(training): add FactorManager component
|
||
|
||
- Manage factors from multiple sources (metadata, DSL, label, excluded)
|
||
- Register factors to FactorEngine with proper ordering
|
||
- Support factor exclusion
|
||
- Add comprehensive tests"
|
||
```
|
||
|
||
---
|
||
|
||
## Task 4: 创建 DataPipeline 组件
|
||
|
||
**Files:**
|
||
- Create: `src/training/pipeline.py`
|
||
- Test: `tests/test_pipeline.py`
|
||
|
||
**Step 1: Create the DataPipeline implementation**
|
||
|
||
```python
|
||
"""数据流水线
|
||
|
||
完整的数据处理流程:
|
||
1. 因子注册和数据准备
|
||
2. 应用过滤器(STFilter 等)
|
||
3. 股票池筛选(自定义函数)
|
||
4. 数据质量检查
|
||
5. 数据划分(train/val/test)
|
||
6. 数据预处理(fit_transform/transform)
|
||
"""
|
||
|
||
from typing import Any, Callable, Dict, List, Optional, Tuple
|
||
import polars as pl
|
||
import numpy as np
|
||
|
||
from src.factors import FactorEngine
|
||
from src.training.factor_manager import FactorManager
|
||
from src.training.components.base import BaseProcessor
|
||
from src.training.core.stock_pool_manager import StockPoolManager
|
||
|
||
|
||
class DataPipeline:
|
||
"""数据流水线
|
||
|
||
执行完整的数据处理流程,返回标准化的数据字典。
|
||
|
||
Attributes:
|
||
factor_manager: 因子管理器
|
||
filters: 类形式的过滤器列表(如 STFilter)
|
||
stock_pool_filter_func: 函数形式的股票池筛选器
|
||
processors: 数据处理器列表
|
||
stock_pool_required_columns: 股票池筛选所需的额外列
|
||
fitted_processors: 已拟合的处理器列表(训练后填充)
|
||
"""
|
||
|
||
def __init__(
|
||
self,
|
||
factor_manager: FactorManager,
|
||
processors: List[BaseProcessor],
|
||
filters: Optional[List[Any]] = None,
|
||
stock_pool_filter_func: Optional[Callable] = None,
|
||
stock_pool_required_columns: Optional[List[str]] = None,
|
||
):
|
||
"""初始化数据流水线
|
||
|
||
Args:
|
||
factor_manager: 因子管理器实例
|
||
processors: 数据处理器列表(顺序执行)
|
||
filters: 类形式的过滤器列表(如 [STFilter])
|
||
stock_pool_filter_func: 函数形式的股票池筛选器
|
||
stock_pool_required_columns: 股票池筛选所需的额外列
|
||
"""
|
||
self.factor_manager = factor_manager
|
||
self.processors = processors or []
|
||
self.filters = filters or []
|
||
self.stock_pool_filter_func = stock_pool_filter_func
|
||
self.stock_pool_required_columns = stock_pool_required_columns or []
|
||
self.fitted_processors: List[BaseProcessor] = []
|
||
|
||
def prepare_data(
|
||
self,
|
||
engine: FactorEngine,
|
||
date_range: Dict[str, Tuple[str, str]],
|
||
label_name: str,
|
||
verbose: bool = True,
|
||
) -> Dict[str, Dict[str, Any]]:
|
||
"""执行完整数据流程
|
||
|
||
流程:
|
||
1. 注册因子并准备数据
|
||
2. 应用类过滤器(STFilter)
|
||
3. 应用股票池筛选(函数形式)
|
||
4. 数据质量检查
|
||
5. 数据划分
|
||
6. 数据预处理
|
||
|
||
Args:
|
||
engine: FactorEngine 实例
|
||
date_range: 日期范围字典 {"train": (start, end), "val": ..., "test": ...}
|
||
label_name: Label 列名
|
||
verbose: 是否打印处理信息
|
||
|
||
Returns:
|
||
标准化的数据字典:
|
||
{
|
||
"train": {
|
||
"X": pl.DataFrame, # 特征矩阵
|
||
"y": pl.Series, # 目标变量
|
||
"raw_data": pl.DataFrame, # 原始数据(保留完整信息)
|
||
"feature_cols": List[str], # 特征列名
|
||
},
|
||
"val": {...},
|
||
"test": {...},
|
||
}
|
||
"""
|
||
if verbose:
|
||
print("\n" + "=" * 80)
|
||
print("数据流水线")
|
||
print("=" * 80)
|
||
|
||
# Step 1: 注册因子并准备数据
|
||
if verbose:
|
||
print("\n[1/6] 注册因子并准备数据...")
|
||
|
||
feature_cols = self.factor_manager.register_to_engine(engine, verbose=verbose)
|
||
|
||
# 计算完整日期范围
|
||
all_start = min(date_range["train"][0], date_range["val"][0], date_range["test"][0])
|
||
all_end = max(date_range["train"][1], date_range["val"][1], date_range["test"][1])
|
||
|
||
# 准备数据
|
||
data = engine.compute(
|
||
factors=feature_cols + [label_name],
|
||
start_date=all_start,
|
||
end_date=all_end,
|
||
)
|
||
|
||
if verbose:
|
||
print(f" 原始数据规模: {data.shape}")
|
||
print(f" 特征数: {len(feature_cols)}")
|
||
|
||
# Step 2: 应用类过滤器(STFilter)
|
||
if self.filters:
|
||
if verbose:
|
||
print(f"\n[2/6] 应用过滤器({len(self.filters)}个)...")
|
||
|
||
for filter_obj in self.filters:
|
||
data_before = len(data)
|
||
data = filter_obj.filter(data)
|
||
data_after = len(data)
|
||
|
||
if verbose:
|
||
print(f" {filter_obj.__class__.__name__}:")
|
||
print(f" 过滤前: {data_before}, 过滤后: {data_after}")
|
||
print(f" 删除: {data_before - data_after}")
|
||
|
||
# Step 3: 应用股票池筛选(函数形式)
|
||
if self.stock_pool_filter_func:
|
||
if verbose:
|
||
print(f"\n[3/6] 股票池筛选...")
|
||
|
||
data_before = len(data)
|
||
|
||
# 创建 StockPoolManager
|
||
pool_manager = StockPoolManager(
|
||
filter_func=self.stock_pool_filter_func,
|
||
required_columns=self.stock_pool_required_columns,
|
||
data_router=engine.router,
|
||
)
|
||
|
||
data = pool_manager.filter_and_select_daily(data)
|
||
data_after = len(data)
|
||
|
||
if verbose:
|
||
print(f" 筛选前: {data_before}, 筛选后: {data_after}")
|
||
print(f" 删除: {data_before - data_after}")
|
||
|
||
# Step 4: 数据质量检查
|
||
if verbose:
|
||
print(f"\n[4/6] 数据质量检查...")
|
||
|
||
self._check_data_quality(data, feature_cols, verbose=verbose)
|
||
|
||
# Step 5: 数据划分
|
||
if verbose:
|
||
print(f"\n[5/6] 数据划分...")
|
||
|
||
split_data = self._split_data(data, date_range, feature_cols, label_name, verbose=verbose)
|
||
|
||
# Step 6: 数据预处理
|
||
if verbose:
|
||
print(f"\n[6/6] 数据预处理...")
|
||
|
||
split_data = self._preprocess(split_data, verbose=verbose)
|
||
|
||
if verbose:
|
||
print("\n" + "=" * 80)
|
||
print("数据流水线完成")
|
||
print("=" * 80)
|
||
|
||
return split_data
|
||
|
||
def _check_data_quality(
|
||
self,
|
||
data: pl.DataFrame,
|
||
feature_cols: List[str],
|
||
verbose: bool = True,
|
||
) -> None:
|
||
"""检查数据质量
|
||
|
||
Args:
|
||
data: 数据框
|
||
feature_cols: 特征列名列表
|
||
verbose: 是否打印信息
|
||
"""
|
||
# 检查缺失值
|
||
null_counts = {}
|
||
for col in feature_cols:
|
||
null_count = data[col].null_count()
|
||
if null_count > 0:
|
||
null_counts[col] = null_count
|
||
|
||
if null_counts and verbose:
|
||
print(f" [警告] 发现缺失值:")
|
||
for col, count in list(null_counts.items())[:5]: # 只显示前5个
|
||
pct = count / len(data) * 100
|
||
print(f" {col}: {count} ({pct:.2f}%)")
|
||
|
||
def _split_data(
|
||
self,
|
||
data: pl.DataFrame,
|
||
date_range: Dict[str, Tuple[str, str]],
|
||
feature_cols: List[str],
|
||
label_name: str,
|
||
verbose: bool = True,
|
||
) -> Dict[str, Dict[str, Any]]:
|
||
"""划分数据集
|
||
|
||
Args:
|
||
data: 完整数据
|
||
date_range: 日期范围字典
|
||
feature_cols: 特征列名
|
||
label_name: Label 列名
|
||
verbose: 是否打印信息
|
||
|
||
Returns:
|
||
划分后的数据字典
|
||
"""
|
||
result = {}
|
||
|
||
for split_name, (start, end) in date_range.items():
|
||
mask = (data["trade_date"] >= start) & (data["trade_date"] <= end)
|
||
split_df = data.filter(mask)
|
||
|
||
result[split_name] = {
|
||
"X": split_df.select(feature_cols),
|
||
"y": split_df[label_name],
|
||
"raw_data": split_df,
|
||
"feature_cols": feature_cols,
|
||
}
|
||
|
||
if verbose:
|
||
print(f" {split_name}: {len(split_df)} 条记录")
|
||
|
||
return result
|
||
|
||
def _preprocess(
|
||
self,
|
||
split_data: Dict[str, Dict[str, Any]],
|
||
verbose: bool = True,
|
||
) -> Dict[str, Dict[str, Any]]:
|
||
"""预处理数据
|
||
|
||
训练集使用 fit_transform,验证集和测试集使用 transform
|
||
|
||
Args:
|
||
split_data: 划分后的数据字典
|
||
verbose: 是否打印信息
|
||
|
||
Returns:
|
||
预处理后的数据字典
|
||
"""
|
||
if not self.processors:
|
||
return split_data
|
||
|
||
self.fitted_processors = []
|
||
|
||
# 训练集:fit_transform
|
||
if verbose:
|
||
print(f" 训练集预处理(fit_transform)...")
|
||
|
||
train_data = split_data["train"]["raw_data"]
|
||
for processor in self.processors:
|
||
train_data = processor.fit_transform(train_data)
|
||
self.fitted_processors.append(processor)
|
||
|
||
# 更新训练集
|
||
split_data["train"]["raw_data"] = train_data
|
||
split_data["train"]["X"] = train_data.select(split_data["train"]["feature_cols"])
|
||
split_data["train"]["y"] = train_data[split_data["train"]["y"].name]
|
||
|
||
# 验证集和测试集:transform
|
||
for split_name in ["val", "test"]:
|
||
if split_name in split_data:
|
||
if verbose:
|
||
print(f" {split_name}集预处理(transform)...")
|
||
|
||
split_df = split_data[split_name]["raw_data"]
|
||
for processor in self.fitted_processors:
|
||
split_df = processor.transform(split_df)
|
||
|
||
split_data[split_name]["raw_data"] = split_df
|
||
split_data[split_name]["X"] = split_df.select(split_data[split_name]["feature_cols"])
|
||
split_data[split_name]["y"] = split_df[split_data[split_name]["y"].name]
|
||
|
||
return split_data
|
||
|
||
def get_fitted_processors(self) -> List[BaseProcessor]:
|
||
"""获取已拟合的处理器列表
|
||
|
||
Returns:
|
||
已拟合的处理器列表(用于模型保存)
|
||
"""
|
||
return self.fitted_processors
|
||
```
|
||
|
||
**Step 2: Create test file**
|
||
|
||
```python
|
||
"""DataPipeline 测试"""
|
||
|
||
import pytest
|
||
from unittest.mock import Mock, MagicMock
|
||
import polars as pl
|
||
|
||
from src.training.pipeline import DataPipeline
|
||
from src.training.factor_manager import FactorManager
|
||
from src.training.components.processors import NullFiller
|
||
|
||
|
||
class TestDataPipeline:
|
||
"""测试 DataPipeline"""
|
||
|
||
def test_init(self):
|
||
"""测试初始化"""
|
||
fm = Mock(spec=FactorManager)
|
||
processors = [NullFiller(feature_cols=["f1"])]
|
||
|
||
pipeline = DataPipeline(
|
||
factor_manager=fm,
|
||
processors=processors,
|
||
)
|
||
|
||
assert pipeline.factor_manager == fm
|
||
assert pipeline.processors == processors
|
||
assert pipeline.fitted_processors == []
|
||
|
||
def test_get_fitted_processors(self):
|
||
"""测试获取已拟合处理器"""
|
||
pipeline = DataPipeline(
|
||
factor_manager=Mock(),
|
||
processors=[],
|
||
)
|
||
|
||
# 模拟已拟合处理器
|
||
pipeline.fitted_processors = [Mock()]
|
||
|
||
assert len(pipeline.get_fitted_processors()) == 1
|
||
```
|
||
|
||
**Step 3: Commit**
|
||
|
||
```bash
|
||
git add src/training/pipeline.py tests/test_pipeline.py
|
||
git commit -m "feat(training): add DataPipeline component
|
||
|
||
- Complete data processing pipeline: register factors, filter, split, preprocess
|
||
- Support both class filters (STFilter) and function filters (stock_pool_filter)
|
||
- Proper fit_transform/transform separation for processors
|
||
- Add comprehensive tests"
|
||
```
|
||
|
||
---
|
||
|
||
## Task 5: 创建 Task 策略组件
|
||
|
||
**Files:**
|
||
- Create: `src/training/tasks/base.py`
|
||
- Create: `src/training/tasks/regression_task.py`
|
||
- Create: `src/training/tasks/rank_task.py`
|
||
- Create: `src/training/tasks/__init__.py`
|
||
- Test: `tests/test_tasks.py`
|
||
|
||
**Step 1: Create base Task protocol**
|
||
|
||
```python
|
||
"""任务抽象基类
|
||
|
||
定义 Task 接口,所有具体任务必须实现此接口。
|
||
"""
|
||
|
||
from abc import ABC, abstractmethod
|
||
from typing import Any, Dict, Optional
|
||
import numpy as np
|
||
|
||
|
||
class BaseTask(ABC):
|
||
"""任务抽象基类
|
||
|
||
所有训练任务(回归、排序学习、分类等)必须继承此类。
|
||
提供统一的接口:Label处理、模型训练、预测、评估。
|
||
|
||
Attributes:
|
||
label_name: Label 列名
|
||
model_params: 模型参数字典
|
||
"""
|
||
|
||
def __init__(self, model_params: Dict[str, Any], label_name: str):
|
||
"""初始化任务
|
||
|
||
Args:
|
||
model_params: 模型参数字典
|
||
label_name: Label 列名
|
||
"""
|
||
self.model_params = model_params
|
||
self.label_name = label_name
|
||
self.model = None
|
||
|
||
@abstractmethod
|
||
def prepare_labels(self, data: Dict[str, Dict]) -> Dict[str, Dict]:
|
||
"""准备标签
|
||
|
||
子类可实现特定的 Label 转换逻辑(如排序学习的分位数转换)。
|
||
|
||
Args:
|
||
data: 数据字典
|
||
|
||
Returns:
|
||
处理后的数据字典
|
||
"""
|
||
raise NotImplementedError
|
||
|
||
@abstractmethod
|
||
def fit(self, train_data: Dict, val_data: Dict) -> None:
|
||
"""训练模型
|
||
|
||
Args:
|
||
train_data: 训练数据字典 {"X": DataFrame, "y": Series, ...}
|
||
val_data: 验证数据字典
|
||
"""
|
||
raise NotImplementedError
|
||
|
||
@abstractmethod
|
||
def predict(self, test_data: Dict) -> np.ndarray:
|
||
"""生成预测
|
||
|
||
Args:
|
||
test_data: 测试数据字典
|
||
|
||
Returns:
|
||
预测结果数组
|
||
"""
|
||
raise NotImplementedError
|
||
|
||
def get_model(self):
|
||
"""获取底层模型
|
||
|
||
Returns:
|
||
训练后的模型实例
|
||
"""
|
||
return self.model
|
||
|
||
def plot_training_metrics(self) -> None:
|
||
"""绘制训练指标曲线(可选)"""
|
||
pass
|
||
```
|
||
|
||
**Step 2: Create RegressionTask**
|
||
|
||
```python
|
||
"""回归任务实现
|
||
|
||
实现回归任务的训练流程:
|
||
- Label 无需转换(保持连续值)
|
||
- 使用 LightGBM 回归模型
|
||
- 支持 MAE/RMSE 评估
|
||
"""
|
||
|
||
from typing import Any, Dict, Optional
|
||
import numpy as np
|
||
import polars as pl
|
||
|
||
from src.training.tasks.base import BaseTask
|
||
from src.training.components.models.lightgbm import LightGBMModel
|
||
|
||
|
||
class RegressionTask(BaseTask):
|
||
"""回归任务
|
||
|
||
使用 LightGBM 进行回归训练,支持早停和训练曲线。
|
||
"""
|
||
|
||
def __init__(
|
||
self,
|
||
model_params: Dict[str, Any],
|
||
label_name: str = "future_return_5",
|
||
):
|
||
"""初始化回归任务
|
||
|
||
Args:
|
||
model_params: LightGBM 参数字典
|
||
label_name: Label 列名
|
||
"""
|
||
super().__init__(model_params, label_name)
|
||
self.evals_result: Optional[Dict] = None
|
||
|
||
def prepare_labels(self, data: Dict[str, Dict]) -> Dict[str, Dict]:
|
||
"""准备标签(回归任务无需转换)
|
||
|
||
Args:
|
||
data: 数据字典
|
||
|
||
Returns:
|
||
原样返回数据字典
|
||
"""
|
||
# 回归任务不需要转换 Label
|
||
return data
|
||
|
||
def fit(self, train_data: Dict, val_data: Dict) -> None:
|
||
"""训练回归模型
|
||
|
||
Args:
|
||
train_data: 训练数据 {"X": DataFrame, "y": Series}
|
||
val_data: 验证数据
|
||
"""
|
||
self.model = LightGBMModel(params=self.model_params)
|
||
|
||
X_train = train_data["X"]
|
||
y_train = train_data["y"]
|
||
X_val = val_data["X"]
|
||
y_val = val_data["y"]
|
||
|
||
self.model.fit(
|
||
X_train, y_train,
|
||
eval_set=(X_val, y_val) if X_val is not None else None
|
||
)
|
||
|
||
def predict(self, test_data: Dict) -> np.ndarray:
|
||
"""生成预测
|
||
|
||
Args:
|
||
test_data: 测试数据
|
||
|
||
Returns:
|
||
预测结果
|
||
"""
|
||
return self.model.predict(test_data["X"])
|
||
|
||
def plot_training_metrics(self) -> None:
|
||
"""绘制训练指标曲线"""
|
||
if self.model and hasattr(self.model, 'model') and self.model.model:
|
||
try:
|
||
import lightgbm as lgb
|
||
import matplotlib.pyplot as plt
|
||
|
||
fig, ax = plt.subplots(figsize=(10, 6))
|
||
lgb.plot_metric(self.model.model, ax=ax)
|
||
plt.title("Training Metrics", fontsize=12, fontweight="bold")
|
||
plt.tight_layout()
|
||
plt.show()
|
||
except Exception as e:
|
||
print(f"[警告] 无法绘制训练曲线: {e}")
|
||
```
|
||
|
||
**Step 3: Create RankTask**
|
||
|
||
```python
|
||
"""排序学习任务实现
|
||
|
||
实现排序学习任务的训练流程:
|
||
- Label 转换为分位数标签
|
||
- 生成 group 数组
|
||
- 使用 LightGBM LambdaRank
|
||
- 支持 NDCG@k 评估
|
||
"""
|
||
|
||
from typing import Any, Dict, List, Optional
|
||
import numpy as np
|
||
import polars as pl
|
||
|
||
from src.training.tasks.base import BaseTask
|
||
from src.training.components.models.lightgbm_lambdarank import LightGBMLambdaRankModel
|
||
|
||
|
||
class RankTask(BaseTask):
|
||
"""排序学习任务
|
||
|
||
使用 LightGBM LambdaRank 进行排序学习训练。
|
||
将连续收益率转换为分位数标签进行训练。
|
||
"""
|
||
|
||
def __init__(
|
||
self,
|
||
model_params: Dict[str, Any],
|
||
label_name: str = "future_return_5",
|
||
n_quantiles: int = 20,
|
||
):
|
||
"""初始化排序学习任务
|
||
|
||
Args:
|
||
model_params: LightGBM 参数字典
|
||
label_name: Label 列名
|
||
n_quantiles: 分位数数量
|
||
"""
|
||
super().__init__(model_params, label_name)
|
||
self.n_quantiles = n_quantiles
|
||
|
||
def prepare_labels(self, data: Dict[str, Dict]) -> Dict[str, Dict]:
|
||
"""准备标签(转换为分位数标签)
|
||
|
||
将连续收益率转换为分位数标签,并生成 group 数组。
|
||
|
||
Args:
|
||
data: 数据字典
|
||
|
||
Returns:
|
||
处理后的数据字典(添加了 y_rank 和 groups)
|
||
"""
|
||
for split in ["train", "val", "test"]:
|
||
if split not in data:
|
||
continue
|
||
|
||
df = data[split]["raw_data"]
|
||
|
||
# 分位数转换
|
||
rank_col = f"{self.label_name}_rank"
|
||
df_ranked = (
|
||
df.with_columns(
|
||
pl.col(self.label_name)
|
||
.rank(method="min")
|
||
.over("trade_date")
|
||
.alias("_rank")
|
||
)
|
||
.with_columns(
|
||
((pl.col("_rank") - 1) / pl.len().over("trade_date") * self.n_quantiles)
|
||
.floor()
|
||
.cast(pl.Int64)
|
||
.clip(0, self.n_quantiles - 1)
|
||
.alias(rank_col)
|
||
)
|
||
.drop("_rank")
|
||
)
|
||
|
||
# 更新数据
|
||
data[split]["raw_data"] = df_ranked
|
||
data[split]["y"] = df_ranked[rank_col]
|
||
data[split]["y_raw"] = df_ranked[self.label_name] # 保留原始值
|
||
|
||
# 生成 group 数组
|
||
data[split]["groups"] = self._compute_group_array(df_ranked, "trade_date")
|
||
|
||
return data
|
||
|
||
def _compute_group_array(
|
||
self,
|
||
df: pl.DataFrame,
|
||
date_col: str = "trade_date",
|
||
) -> np.ndarray:
|
||
"""计算 group 数组
|
||
|
||
Args:
|
||
df: 数据框
|
||
date_col: 日期列名
|
||
|
||
Returns:
|
||
group 数组(每个日期的样本数)
|
||
"""
|
||
group_counts = df.group_by(date_col, maintain_order=True).agg(
|
||
pl.count().alias("count")
|
||
)
|
||
return group_counts["count"].to_numpy()
|
||
|
||
def fit(self, train_data: Dict, val_data: Dict) -> None:
|
||
"""训练排序模型
|
||
|
||
Args:
|
||
train_data: 训练数据
|
||
val_data: 验证数据
|
||
"""
|
||
self.model = LightGBMLambdaRankModel(params=self.model_params)
|
||
|
||
self.model.fit(
|
||
train_data["X"], train_data["y"],
|
||
group=train_data["groups"],
|
||
eval_set=(val_data["X"], val_data["y"], val_data["groups"]) if val_data else None
|
||
)
|
||
|
||
def predict(self, test_data: Dict) -> np.ndarray:
|
||
"""生成预测
|
||
|
||
Args:
|
||
test_data: 测试数据
|
||
|
||
Returns:
|
||
预测结果
|
||
"""
|
||
return self.model.predict(test_data["X"])
|
||
|
||
def evaluate_ndcg(
|
||
self,
|
||
test_data: Dict,
|
||
k_list: List[int] = None,
|
||
) -> Dict[str, float]:
|
||
"""评估 NDCG@k
|
||
|
||
Args:
|
||
test_data: 测试数据
|
||
k_list: k 值列表,默认 [1, 5, 10, 20]
|
||
|
||
Returns:
|
||
NDCG 分数字典 {"ndcg@1": score, ...}
|
||
"""
|
||
if k_list is None:
|
||
k_list = [1, 5, 10, 20]
|
||
|
||
y_true = test_data["y_raw"]
|
||
y_pred = self.predict(test_data)
|
||
groups = test_data["groups"]
|
||
|
||
from sklearn.metrics import ndcg_score
|
||
|
||
results = {}
|
||
|
||
# 按 group 拆分
|
||
start_idx = 0
|
||
y_true_groups = []
|
||
y_pred_groups = []
|
||
|
||
for group_size in groups:
|
||
end_idx = start_idx + group_size
|
||
y_true_groups.append(y_true.to_numpy()[start_idx:end_idx])
|
||
y_pred_groups.append(y_pred[start_idx:end_idx])
|
||
start_idx = end_idx
|
||
|
||
# 计算每个 k 的 NDCG
|
||
for k in k_list:
|
||
ndcg_scores = []
|
||
for yt, yp in zip(y_true_groups, y_pred_groups):
|
||
if len(yt) > 1:
|
||
try:
|
||
score = ndcg_score([yt], [yp], k=k)
|
||
ndcg_scores.append(score)
|
||
except ValueError:
|
||
pass
|
||
|
||
results[f"ndcg@{k}"] = float(np.mean(ndcg_scores)) if ndcg_scores else 0.0
|
||
|
||
return results
|
||
|
||
def plot_training_metrics(self) -> None:
|
||
"""绘制训练指标曲线(NDCG)"""
|
||
if self.model:
|
||
try:
|
||
self.model.plot_all_metrics()
|
||
except Exception as e:
|
||
print(f"[警告] 无法绘制训练曲线: {e}")
|
||
```
|
||
|
||
**Step 4: Create tasks/__init__.py**
|
||
|
||
```python
|
||
"""Tasks 模块
|
||
|
||
提供各种训练任务的实现。
|
||
"""
|
||
|
||
from src.training.tasks.base import BaseTask
|
||
from src.training.tasks.regression_task import RegressionTask
|
||
from src.training.tasks.rank_task import RankTask
|
||
|
||
__all__ = [
|
||
"BaseTask",
|
||
"RegressionTask",
|
||
"RankTask",
|
||
]
|
||
```
|
||
|
||
**Step 5: Create test file**
|
||
|
||
```python
|
||
"""Task 测试"""
|
||
|
||
import pytest
|
||
from unittest.mock import Mock, patch
|
||
import numpy as np
|
||
import polars as pl
|
||
|
||
from src.training.tasks import RegressionTask, RankTask
|
||
|
||
|
||
class TestRegressionTask:
|
||
"""测试 RegressionTask"""
|
||
|
||
def test_init(self):
|
||
"""测试初始化"""
|
||
task = RegressionTask(
|
||
model_params={"objective": "regression"},
|
||
label_name="target",
|
||
)
|
||
|
||
assert task.model_params == {"objective": "regression"}
|
||
assert task.label_name == "target"
|
||
assert task.model is None
|
||
|
||
def test_prepare_labels(self):
|
||
"""测试 Label 准备(回归无需转换)"""
|
||
task = RegressionTask(model_params={}, label_name="target")
|
||
|
||
data = {"train": {"y": Mock()}}
|
||
result = task.prepare_labels(data)
|
||
|
||
# 回归任务应该原样返回
|
||
assert result == data
|
||
|
||
|
||
class TestRankTask:
|
||
"""测试 RankTask"""
|
||
|
||
def test_init(self):
|
||
"""测试初始化"""
|
||
task = RankTask(
|
||
model_params={"objective": "lambdarank"},
|
||
label_name="target",
|
||
n_quantiles=10,
|
||
)
|
||
|
||
assert task.n_quantiles == 10
|
||
|
||
def test_compute_group_array(self):
|
||
"""测试 group 数组计算"""
|
||
task = RankTask(model_params={}, label_name="target")
|
||
|
||
# 创建测试数据
|
||
df = pl.DataFrame({
|
||
"trade_date": ["20240101", "20240101", "20240102", "20240102", "20240102"],
|
||
"value": [1, 2, 3, 4, 5],
|
||
})
|
||
|
||
groups = task._compute_group_array(df, "trade_date")
|
||
|
||
assert len(groups) == 2 # 两个日期
|
||
assert groups[0] == 2 # 第一个日期2条
|
||
assert groups[1] == 3 # 第二个日期3条
|
||
```
|
||
|
||
**Step 6: Commit**
|
||
|
||
```bash
|
||
git add src/training/tasks/
|
||
git add tests/test_tasks.py
|
||
git commit -m "feat(training): add Task strategy components
|
||
|
||
- Add BaseTask abstract base class
|
||
- Add RegressionTask for regression training
|
||
- Add RankTask for learning-to-rank with LambdaRank
|
||
- Support quantile label conversion and NDCG evaluation
|
||
- Add comprehensive tests"
|
||
```
|
||
|
||
---
|
||
|
||
## Task 6: 创建 ResultAnalyzer 组件
|
||
|
||
**Files:**
|
||
- Create: `src/training/result_analyzer.py`
|
||
- Test: `tests/test_result_analyzer.py`
|
||
|
||
**Step 1: Create the ResultAnalyzer implementation**
|
||
|
||
```python
|
||
"""结果分析器
|
||
|
||
训练后的分析和结果处理:
|
||
1. 特征重要性分析(Top N、零贡献特征)
|
||
2. 结果组装(生成每日 Top N)
|
||
3. 结果保存
|
||
"""
|
||
|
||
from typing import Any, Dict, List, Optional
|
||
import os
|
||
import polars as pl
|
||
import pandas as pd
|
||
import numpy as np
|
||
|
||
|
||
class ResultAnalyzer:
|
||
"""结果分析器
|
||
|
||
分析训练结果,生成报告并保存。
|
||
"""
|
||
|
||
def analyze_feature_importance(
|
||
self,
|
||
model,
|
||
feature_cols: List[str],
|
||
top_n: int = 20,
|
||
verbose: bool = True,
|
||
) -> Dict[str, Any]:
|
||
"""分析特征重要性
|
||
|
||
Args:
|
||
model: 训练好的模型
|
||
feature_cols: 特征列名列表
|
||
top_n: 显示 Top N 特征
|
||
verbose: 是否打印信息
|
||
|
||
Returns:
|
||
分析结果字典
|
||
"""
|
||
importance = model.feature_importance()
|
||
|
||
if importance is None:
|
||
if verbose:
|
||
print("[警告] 无法获取特征重要性")
|
||
return {}
|
||
|
||
# 按重要性排序
|
||
importance_sorted = importance.sort_values(ascending=False)
|
||
|
||
# 计算百分比
|
||
total_importance = importance_sorted.sum()
|
||
importance_pct = (importance_sorted / total_importance * 100).round(2)
|
||
|
||
# 识别零贡献特征
|
||
zero_importance_features = importance_sorted[importance_sorted == 0].index.tolist()
|
||
|
||
if verbose:
|
||
print("\n" + "=" * 80)
|
||
print("特征重要性分析")
|
||
print("=" * 80)
|
||
|
||
# 打印 Top N
|
||
print(f"\nTop {top_n} 特征:")
|
||
print("-" * 80)
|
||
print(f"{'排名':<6}{'特征名':<35}{'重要性':<15}{'占比':<10}")
|
||
print("-" * 80)
|
||
|
||
for i, (feature, score) in enumerate(importance_sorted.head(top_n).items(), 1):
|
||
pct = importance_pct[feature]
|
||
if pct >= 10:
|
||
marker = " [高贡献]"
|
||
elif pct >= 1:
|
||
marker = " [中贡献]"
|
||
else:
|
||
marker = " [低贡献]"
|
||
print(f"{i:<6}{feature:<35}{score:<15.2f}{pct:<8.2f}%{marker}")
|
||
|
||
# 打印零贡献特征
|
||
if zero_importance_features:
|
||
print("\n" + "-" * 80)
|
||
print(f"[警告] 贡献为0的特征(共 {len(zero_importance_features)} 个):")
|
||
for i, feature in enumerate(zero_importance_features, 1):
|
||
print(f" {i}. {feature}")
|
||
|
||
# 统计摘要
|
||
print("\n" + "=" * 80)
|
||
print("统计摘要:")
|
||
print("-" * 80)
|
||
print(f" 特征总数: {len(importance_sorted)}")
|
||
print(f" 有贡献特征数: {len(importance_sorted) - len(zero_importance_features)}")
|
||
print(f" 零贡献特征数: {len(zero_importance_features)}")
|
||
if len(importance_sorted) > 0:
|
||
print(f" 零贡献占比: {len(zero_importance_features) / len(importance_sorted) * 100:.1f}%")
|
||
print(f" Top {top_n} 累计占比: {importance_pct.head(top_n).sum():.1f}%")
|
||
print("=" * 80)
|
||
|
||
return {
|
||
"importance": importance_sorted,
|
||
"importance_pct": importance_pct,
|
||
"zero_importance_features": zero_importance_features,
|
||
"top_n": importance_sorted.head(top_n),
|
||
}
|
||
|
||
def assemble_results(
|
||
self,
|
||
test_data: Dict[str, Any],
|
||
predictions: np.ndarray,
|
||
top_n: int = 50,
|
||
verbose: bool = True,
|
||
) -> pl.DataFrame:
|
||
"""组装结果
|
||
|
||
生成每日 Top N 股票推荐列表。
|
||
|
||
Args:
|
||
test_data: 测试数据字典
|
||
predictions: 预测结果数组
|
||
top_n: 每日选择的股票数
|
||
verbose: 是否打印信息
|
||
|
||
Returns:
|
||
结果数据框
|
||
"""
|
||
# 添加预测列
|
||
raw_data = test_data["raw_data"]
|
||
results = raw_data.with_columns([
|
||
pl.Series("prediction", predictions)
|
||
])
|
||
|
||
# 按日期分组取 Top N
|
||
unique_dates = results["trade_date"].unique().sort()
|
||
topn_by_date = []
|
||
|
||
for date in unique_dates:
|
||
day_data = results.filter(results["trade_date"] == date)
|
||
topn = day_data.sort("prediction", descending=True).head(top_n)
|
||
topn_by_date.append(topn)
|
||
|
||
# 合并所有日期的 Top N
|
||
topn_results = pl.concat(topn_by_date)
|
||
|
||
if verbose:
|
||
print(f"\n生成每日 Top {top_n} 股票列表:")
|
||
print(f" 交易日数: {len(unique_dates)}")
|
||
print(f" 总推荐数: {len(topn_results)}")
|
||
|
||
return topn_results
|
||
|
||
def save_results(
|
||
self,
|
||
results: pl.DataFrame,
|
||
output_path: str,
|
||
verbose: bool = True,
|
||
) -> None:
|
||
"""保存结果
|
||
|
||
Args:
|
||
results: 结果数据框
|
||
output_path: 输出路径
|
||
verbose: 是否打印信息
|
||
"""
|
||
# 格式化日期并调整列顺序
|
||
formatted = results.select([
|
||
(pl.col("trade_date").str.slice(0, 4) + "-" +
|
||
pl.col("trade_date").str.slice(4, 2) + "-" +
|
||
pl.col("trade_date").str.slice(6, 2)).alias("date"),
|
||
pl.col("prediction").alias("score"),
|
||
pl.col("ts_code"),
|
||
])
|
||
|
||
# 确保目录存在
|
||
os.makedirs(os.path.dirname(output_path), exist_ok=True)
|
||
|
||
# 保存 CSV
|
||
formatted.write_csv(output_path, include_header=True)
|
||
|
||
if verbose:
|
||
print(f" 保存路径: {output_path}")
|
||
print(f" 保存行数: {len(formatted)}")
|
||
```
|
||
|
||
**Step 2: Create test file**
|
||
|
||
```python
|
||
"""ResultAnalyzer 测试"""
|
||
|
||
import pytest
|
||
from unittest.mock import Mock
|
||
import polars as pl
|
||
import pandas as pd
|
||
import numpy as np
|
||
|
||
from src.training.result_analyzer import ResultAnalyzer
|
||
|
||
|
||
class TestResultAnalyzer:
|
||
"""测试 ResultAnalyzer"""
|
||
|
||
def test_init(self):
|
||
"""测试初始化"""
|
||
analyzer = ResultAnalyzer()
|
||
assert analyzer is not None
|
||
|
||
def test_analyze_feature_importance(self):
|
||
"""测试特征重要性分析"""
|
||
analyzer = ResultAnalyzer()
|
||
|
||
# 创建 mock model
|
||
model = Mock()
|
||
importance = pd.Series(
|
||
[100, 50, 0, 0, 30],
|
||
index=["feat1", "feat2", "feat3", "feat4", "feat5"]
|
||
)
|
||
model.feature_importance.return_value = importance
|
||
|
||
result = analyzer.analyze_feature_importance(
|
||
model=model,
|
||
feature_cols=["feat1", "feat2", "feat3", "feat4", "feat5"],
|
||
top_n=3,
|
||
verbose=False,
|
||
)
|
||
|
||
assert "importance" in result
|
||
assert "zero_importance_features" in result
|
||
assert len(result["zero_importance_features"]) == 2 # feat3, feat4
|
||
|
||
def test_assemble_results(self):
|
||
"""测试结果组装"""
|
||
analyzer = ResultAnalyzer()
|
||
|
||
# 创建测试数据
|
||
test_data = {
|
||
"raw_data": pl.DataFrame({
|
||
"trade_date": ["20240101", "20240101", "20240102", "20240102"],
|
||
"ts_code": ["000001.SZ", "000002.SZ", "000001.SZ", "000002.SZ"],
|
||
})
|
||
}
|
||
predictions = np.array([0.5, 0.3, 0.8, 0.2])
|
||
|
||
results = analyzer.assemble_results(
|
||
test_data=test_data,
|
||
predictions=predictions,
|
||
top_n=1,
|
||
verbose=False,
|
||
)
|
||
|
||
assert len(results) == 2 # 每天选1个,共2天
|
||
```
|
||
|
||
**Step 3: Commit**
|
||
|
||
```bash
|
||
git add src/training/result_analyzer.py tests/test_result_analyzer.py
|
||
git commit -m "feat(training): add ResultAnalyzer component
|
||
|
||
- Analyze feature importance with top N and zero-contribution features
|
||
- Assemble daily Top N stock recommendations
|
||
- Save results to CSV with proper formatting
|
||
- Add comprehensive tests"
|
||
```
|
||
|
||
---
|
||
|
||
## Task 7: 重构 Trainer 为调度引擎
|
||
|
||
**Files:**
|
||
- Create: `src/training/core/trainer_new.py` (新实现)
|
||
- Modify: `src/training/__init__.py` - 添加新导出
|
||
|
||
**Step 1: Create new Trainer implementation**
|
||
|
||
```python
|
||
"""训练调度引擎
|
||
|
||
协调 FactorManager、DataPipeline、Task 和 ResultAnalyzer 完成训练流程。
|
||
"""
|
||
|
||
from typing import Any, Callable, Dict, List, Optional, Tuple
|
||
import os
|
||
from datetime import datetime
|
||
|
||
import polars as pl
|
||
|
||
from src.factors import FactorEngine
|
||
from src.training.pipeline import DataPipeline
|
||
from src.training.tasks.base import BaseTask
|
||
from src.training.result_analyzer import ResultAnalyzer
|
||
|
||
|
||
class Trainer:
|
||
"""训练调度引擎
|
||
|
||
协调各个组件执行完整训练流程:
|
||
1. 准备数据(DataPipeline)
|
||
2. 处理标签(Task)
|
||
3. 训练模型(Task)
|
||
4. 绘制指标(Task)
|
||
5. 生成预测(Task)
|
||
6. 分析结果(ResultAnalyzer)
|
||
7. 保存结果
|
||
|
||
Attributes:
|
||
data_pipeline: 数据流水线
|
||
task: 任务实例(RegressionTask/RankTask)
|
||
analyzer: 结果分析器
|
||
output_config: 输出配置
|
||
verbose: 是否打印详细信息
|
||
results: 训练结果
|
||
"""
|
||
|
||
def __init__(
|
||
self,
|
||
data_pipeline: DataPipeline,
|
||
task: BaseTask,
|
||
analyzer: Optional[ResultAnalyzer] = None,
|
||
output_config: Optional[Dict[str, Any]] = None,
|
||
verbose: bool = True,
|
||
):
|
||
"""初始化训练器
|
||
|
||
Args:
|
||
data_pipeline: 数据流水线实例
|
||
task: 任务实例(RegressionTask 或 RankTask)
|
||
analyzer: 结果分析器(可选,默认创建新实例)
|
||
output_config: 输出配置字典
|
||
verbose: 是否打印详细信息
|
||
"""
|
||
self.data_pipeline = data_pipeline
|
||
self.task = task
|
||
self.analyzer = analyzer or ResultAnalyzer()
|
||
self.output_config = output_config or {}
|
||
self.verbose = verbose
|
||
self.results: Optional[pl.DataFrame] = None
|
||
|
||
def run(
|
||
self,
|
||
engine: FactorEngine,
|
||
date_range: Dict[str, Tuple[str, str]],
|
||
) -> pl.DataFrame:
|
||
"""执行完整训练流程
|
||
|
||
Args:
|
||
engine: FactorEngine 实例
|
||
date_range: 日期范围字典
|
||
{
|
||
"train": (start_date, end_date),
|
||
"val": (start_date, end_date),
|
||
"test": (start_date, end_date),
|
||
}
|
||
|
||
Returns:
|
||
训练结果数据框
|
||
"""
|
||
if self.verbose:
|
||
print("\n" + "=" * 80)
|
||
print(f"开始训练: {self.task.__class__.__name__}")
|
||
print("=" * 80)
|
||
|
||
# Step 1: 准备数据
|
||
if self.verbose:
|
||
print("\n[Step 1/7] 准备数据...")
|
||
|
||
data = self.data_pipeline.prepare_data(
|
||
engine=engine,
|
||
date_range=date_range,
|
||
label_name=self.task.label_name,
|
||
verbose=self.verbose,
|
||
)
|
||
|
||
# Step 2: 处理标签
|
||
if self.verbose:
|
||
print("\n[Step 2/7] 处理标签...")
|
||
|
||
data = self.task.prepare_labels(data)
|
||
|
||
# Step 3: 训练模型
|
||
if self.verbose:
|
||
print("\n[Step 3/7] 训练模型...")
|
||
|
||
self.task.fit(data["train"], data["val"])
|
||
|
||
# Step 4: 绘制训练指标
|
||
if self.verbose:
|
||
print("\n[Step 4/7] 绘制训练指标...")
|
||
|
||
self.task.plot_training_metrics()
|
||
|
||
# Step 5: 生成预测
|
||
if self.verbose:
|
||
print("\n[Step 5/7] 生成预测...")
|
||
|
||
predictions = self.task.predict(data["test"])
|
||
|
||
# Step 6: 分析结果
|
||
if self.verbose:
|
||
print("\n[Step 6/7] 分析结果...")
|
||
|
||
# 特征重要性
|
||
self.analyzer.analyze_feature_importance(
|
||
model=self.task.get_model(),
|
||
feature_cols=data["test"]["feature_cols"],
|
||
top_n=20,
|
||
verbose=self.verbose,
|
||
)
|
||
|
||
# NDCG 评估(排序任务特有)
|
||
if hasattr(self.task, 'evaluate_ndcg'):
|
||
ndcg_scores = self.task.evaluate_ndcg(data["test"])
|
||
if self.verbose:
|
||
print("\nNDCG 评估结果:")
|
||
for metric, score in ndcg_scores.items():
|
||
print(f" {metric}: {score:.4f}")
|
||
|
||
# 组装结果
|
||
self.results = self.analyzer.assemble_results(
|
||
test_data=data["test"],
|
||
predictions=predictions,
|
||
top_n=self.output_config.get("top_n", 50),
|
||
verbose=self.verbose,
|
||
)
|
||
|
||
# Step 7: 保存结果
|
||
if self.verbose:
|
||
print("\n[Step 7/7] 保存结果...")
|
||
|
||
if self.output_config.get("save_predictions", True):
|
||
self._save_predictions()
|
||
|
||
if self.output_config.get("save_model", False):
|
||
self._save_model()
|
||
|
||
if self.verbose:
|
||
print("\n" + "=" * 80)
|
||
print("训练完成!")
|
||
print("=" * 80)
|
||
|
||
return self.results
|
||
|
||
def _save_predictions(self) -> None:
|
||
"""保存预测结果"""
|
||
output_dir = self.output_config.get("output_dir", "experiment/output")
|
||
output_filename = self.output_config.get("output_filename", "output.csv")
|
||
output_path = os.path.join(output_dir, output_filename)
|
||
|
||
self.analyzer.save_results(
|
||
results=self.results,
|
||
output_path=output_path,
|
||
verbose=self.verbose,
|
||
)
|
||
|
||
def _save_model(self) -> None:
|
||
"""保存模型"""
|
||
model_save_path = self.output_config.get("model_save_path")
|
||
if not model_save_path:
|
||
return
|
||
|
||
# 确保目录存在
|
||
os.makedirs(os.path.dirname(model_save_path), exist_ok=True)
|
||
|
||
# 获取模型和相关信息
|
||
model = self.task.get_model()
|
||
|
||
# 保存模型
|
||
model.save(model_save_path)
|
||
|
||
if self.verbose:
|
||
print(f" 模型保存路径: {model_save_path}")
|
||
|
||
def get_results(self) -> Optional[pl.DataFrame]:
|
||
"""获取训练结果
|
||
|
||
Returns:
|
||
训练结果数据框,如果尚未训练则返回 None
|
||
"""
|
||
return self.results
|
||
|
||
def get_task(self) -> BaseTask:
|
||
"""获取任务实例
|
||
|
||
Returns:
|
||
任务实例
|
||
"""
|
||
return self.task
|
||
```
|
||
|
||
**Step 2: Update __init__.py to export new components**
|
||
|
||
Add to `src/training/__init__.py`:
|
||
|
||
```python
|
||
# 新增导出(模块化 Trainer 组件)
|
||
from src.training.factor_manager import FactorManager
|
||
from src.training.pipeline import DataPipeline
|
||
from src.training.result_analyzer import ResultAnalyzer
|
||
from src.training.tasks import RegressionTask, RankTask
|
||
# 可以选择性地导出新的 Trainer,或者保持原有 Trainer 不变
|
||
# from src.training.core.trainer_new import Trainer as ModularTrainer
|
||
|
||
__all__ = [
|
||
# 原有导出
|
||
"Trainer",
|
||
"DateSplitter",
|
||
"StockPoolManager",
|
||
"check_data_quality",
|
||
"STFilter",
|
||
"Winsorizer",
|
||
"NullFiller",
|
||
"StandardScaler",
|
||
"CrossSectionalStandardScaler",
|
||
"TrainingConfig",
|
||
# 新增导出
|
||
"FactorManager",
|
||
"DataPipeline",
|
||
"ResultAnalyzer",
|
||
"RegressionTask",
|
||
"RankTask",
|
||
]
|
||
```
|
||
|
||
**Step 3: Run basic import tests**
|
||
|
||
```bash
|
||
uv run python -c "from src.training import FactorManager, DataPipeline, RegressionTask, RankTask, ResultAnalyzer; print('All imports successful')"
|
||
```
|
||
|
||
Expected: All imports successful
|
||
|
||
**Step 4: Commit**
|
||
|
||
```bash
|
||
git add src/training/core/trainer_new.py
|
||
git add src/training/__init__.py
|
||
git add src/training/factor_manager.py
|
||
git add src/training/pipeline.py
|
||
git add src/training/result_analyzer.py
|
||
git add src/training/tasks/
|
||
git commit -m "feat(training): add modular Trainer architecture
|
||
|
||
- Add FactorManager for unified factor management
|
||
- Add DataPipeline for complete data processing workflow
|
||
- Add Task strategy components (RegressionTask, RankTask)
|
||
- Add ResultAnalyzer for post-training analysis
|
||
- Add new Trainer as orchestration engine
|
||
- Update __init__.py exports"
|
||
```
|
||
|
||
---
|
||
|
||
## Task 8: 重写 regression.py 使用新架构
|
||
|
||
**Files:**
|
||
- Create: `src/experiment/regression_v2.py` (新实现)
|
||
- Keep: `src/experiment/regression.py` (原文件保留,添加注释说明已迁移)
|
||
|
||
**Step 1: Create new regression.py with new architecture**
|
||
|
||
```python
|
||
# %% md
|
||
# # LightGBM 回归训练流程(模块化版本)
|
||
#
|
||
# 使用新的模块化 Trainer 架构
|
||
# %% md
|
||
# ## 1. 导入依赖
|
||
# %%
|
||
from src.training import (
|
||
Trainer,
|
||
DataPipeline,
|
||
FactorManager,
|
||
RegressionTask,
|
||
NullFiller,
|
||
Winsorizer,
|
||
StandardScaler,
|
||
)
|
||
from src.training.components.filters import STFilter
|
||
from src.experiment.common import (
|
||
create_training_config,
|
||
create_regression_config,
|
||
FactorEngine,
|
||
)
|
||
|
||
# %% md
|
||
# ## 2. 配置参数
|
||
# %%
|
||
# 创建统一配置
|
||
training_config = create_training_config()
|
||
model_config = create_regression_config()
|
||
|
||
print("训练配置:")
|
||
print(f" 训练期: {training_config.train_start} - {training_config.train_end}")
|
||
print(f" 验证期: {training_config.val_start} - {training_config.val_end}")
|
||
print(f" 测试期: {training_config.test_start} - {training_config.test_end}")
|
||
print(f" 特征数: {len(training_config.selected_factors)}")
|
||
print(f" Label: {model_config.label_name}")
|
||
|
||
# %% md
|
||
# ## 3. 创建组件
|
||
# %%
|
||
# 1. 创建 FactorEngine
|
||
engine = FactorEngine()
|
||
|
||
# 2. 创建 FactorManager
|
||
factor_manager = FactorManager(
|
||
selected_factors=training_config.selected_factors,
|
||
factor_definitions=training_config.factor_definitions,
|
||
label_factor=training_config.label_factor,
|
||
excluded_factors=training_config.excluded_factors,
|
||
)
|
||
|
||
# 3. 创建 DataPipeline
|
||
processors = [
|
||
NullFiller(strategy="mean"),
|
||
Winsorizer(lower=0.01, upper=0.99),
|
||
StandardScaler(),
|
||
]
|
||
|
||
filters = [STFilter(data_router=engine.router)] if training_config.st_filter_enabled else []
|
||
|
||
pipeline = DataPipeline(
|
||
factor_manager=factor_manager,
|
||
processors=processors,
|
||
filters=filters,
|
||
stock_pool_filter_func=training_config.stock_pool_filter,
|
||
stock_pool_required_columns=training_config.stock_pool_required_columns,
|
||
)
|
||
|
||
# 4. 创建 Task
|
||
task = RegressionTask(
|
||
model_params=model_config.model_params,
|
||
label_name=model_config.label_name,
|
||
)
|
||
|
||
# 5. 创建 Trainer
|
||
output_config = {
|
||
"output_dir": training_config.output_dir,
|
||
"output_filename": "regression_output.csv",
|
||
"save_predictions": training_config.save_predictions,
|
||
"save_model": training_config.save_model,
|
||
"model_save_path": f"{training_config.output_dir}/regression_model.txt",
|
||
"top_n": training_config.top_n,
|
||
}
|
||
|
||
trainer = Trainer(
|
||
data_pipeline=pipeline,
|
||
task=task,
|
||
output_config=output_config,
|
||
verbose=True,
|
||
)
|
||
|
||
# %% md
|
||
# ## 4. 执行训练
|
||
# %%
|
||
results = trainer.run(
|
||
engine=engine,
|
||
date_range=training_config.date_range,
|
||
)
|
||
|
||
# %% md
|
||
# ## 5. 额外分析(可选)
|
||
# %%
|
||
# 获取模型进行进一步分析
|
||
model = task.get_model()
|
||
|
||
# 可以在这里添加自定义可视化
|
||
print("\n训练完成!")
|
||
print(f"结果保存路径: {output_config['output_dir']}/regression_output.csv")
|
||
```
|
||
|
||
**Step 2: Add deprecation notice to old regression.py**
|
||
|
||
在原有 `regression.py` 文件顶部添加:
|
||
|
||
```python
|
||
# 注意:此文件已迁移到 regression_v2.py
|
||
# 新文件使用模块化 Trainer 架构
|
||
# 此文件保留用于参考和对比
|
||
```
|
||
|
||
**Step 3: Test new regression script**
|
||
|
||
```bash
|
||
# 注意:这会实际运行训练,可能需要较长时间
|
||
# 建议先用小数据测试
|
||
uv run python src/experiment/regression_v2.py
|
||
```
|
||
|
||
**Step 4: Commit**
|
||
|
||
```bash
|
||
git add src/experiment/regression_v2.py
|
||
git add src/experiment/regression.py # 已添加弃用注释
|
||
git commit -m "feat(experiment): add modular regression training script
|
||
|
||
- Create regression_v2.py using new modular Trainer architecture
|
||
- Reduce code from 640 lines to ~80 lines
|
||
- Add deprecation notice to old regression.py
|
||
- All functionality preserved"
|
||
```
|
||
|
||
---
|
||
|
||
## Task 9: 重写 learn_to_rank.py 使用新架构
|
||
|
||
**Files:**
|
||
- Create: `src/experiment/learn_to_rank_v2.py` (新实现)
|
||
- Keep: `src/experiment/learn_to_rank.py` (原文件保留,添加注释说明已迁移)
|
||
|
||
**Step 1: Create new learn_to_rank.py with new architecture**
|
||
|
||
```python
|
||
# %% md
|
||
# # LightGBM LambdaRank 排序学习训练流程(模块化版本)
|
||
#
|
||
# 使用新的模块化 Trainer 架构
|
||
# %% md
|
||
# ## 1. 导入依赖
|
||
# %%
|
||
from src.training import (
|
||
Trainer,
|
||
DataPipeline,
|
||
FactorManager,
|
||
RankTask,
|
||
NullFiller,
|
||
Winsorizer,
|
||
CrossSectionalStandardScaler,
|
||
)
|
||
from src.training.components.filters import STFilter
|
||
from src.experiment.common import (
|
||
create_training_config,
|
||
create_rank_config,
|
||
FactorEngine,
|
||
)
|
||
|
||
# %% md
|
||
# ## 2. 配置参数
|
||
# %%
|
||
# 创建统一配置
|
||
training_config = create_training_config()
|
||
model_config = create_rank_config()
|
||
|
||
print("训练配置:")
|
||
print(f" 训练期: {training_config.train_start} - {training_config.train_end}")
|
||
print(f" 验证期: {training_config.val_start} - {training_config.val_end}")
|
||
print(f" 测试期: {training_config.test_start} - {training_config.test_end}")
|
||
print(f" 特征数: {len(training_config.selected_factors)}")
|
||
print(f" Label: {model_config.label_name}")
|
||
print(f" 分位数: {model_config.n_quantiles}")
|
||
|
||
# %% md
|
||
# ## 3. 创建组件
|
||
# %%
|
||
# 1. 创建 FactorEngine
|
||
engine = FactorEngine()
|
||
|
||
# 2. 创建 FactorManager
|
||
factor_manager = FactorManager(
|
||
selected_factors=training_config.selected_factors,
|
||
factor_definitions=training_config.factor_definitions,
|
||
label_factor=training_config.label_factor,
|
||
excluded_factors=training_config.excluded_factors,
|
||
)
|
||
|
||
# 3. 创建 DataPipeline(使用截面标准化)
|
||
processors = [
|
||
NullFiller(strategy="mean"),
|
||
Winsorizer(lower=0.01, upper=0.99),
|
||
CrossSectionalStandardScaler(),
|
||
]
|
||
|
||
filters = [STFilter(data_router=engine.router)] if training_config.st_filter_enabled else []
|
||
|
||
pipeline = DataPipeline(
|
||
factor_manager=factor_manager,
|
||
processors=processors,
|
||
filters=filters,
|
||
stock_pool_filter_func=training_config.stock_pool_filter,
|
||
stock_pool_required_columns=training_config.stock_pool_required_columns,
|
||
)
|
||
|
||
# 4. 创建 Task(排序学习特有 n_quantiles)
|
||
task = RankTask(
|
||
model_params=model_config.model_params,
|
||
label_name=model_config.label_name,
|
||
n_quantiles=model_config.n_quantiles,
|
||
)
|
||
|
||
# 5. 创建 Trainer
|
||
output_config = {
|
||
"output_dir": training_config.output_dir,
|
||
"output_filename": "rank_output.csv",
|
||
"save_predictions": training_config.save_predictions,
|
||
"save_model": training_config.save_model,
|
||
"model_save_path": f"{training_config.output_dir}/rank_model.txt",
|
||
"top_n": training_config.top_n,
|
||
}
|
||
|
||
trainer = Trainer(
|
||
data_pipeline=pipeline,
|
||
task=task,
|
||
output_config=output_config,
|
||
verbose=True,
|
||
)
|
||
|
||
# %% md
|
||
# ## 4. 执行训练
|
||
# %%
|
||
results = trainer.run(
|
||
engine=engine,
|
||
date_range=training_config.date_range,
|
||
)
|
||
|
||
# %% md
|
||
# ## 5. 额外分析(NDCG)
|
||
# %%
|
||
# NDCG 评估已在 Trainer.run() 中自动执行
|
||
# 可以在这里添加额外的可视化
|
||
|
||
print("\n训练完成!")
|
||
print(f"结果保存路径: {output_config['output_dir']}/rank_output.csv")
|
||
```
|
||
|
||
**Step 2: Add deprecation notice to old learn_to_rank.py**
|
||
|
||
在原有 `learn_to_rank.py` 文件顶部添加:
|
||
|
||
```python
|
||
# 注意:此文件已迁移到 learn_to_rank_v2.py
|
||
# 新文件使用模块化 Trainer 架构
|
||
# 此文件保留用于参考和对比
|
||
```
|
||
|
||
**Step 3: Test new learn_to_rank script**
|
||
|
||
```bash
|
||
# 注意:这会实际运行训练
|
||
uv run python src/experiment/learn_to_rank_v2.py
|
||
```
|
||
|
||
**Step 4: Commit**
|
||
|
||
```bash
|
||
git add src/experiment/learn_to_rank_v2.py
|
||
git add src/experiment/learn_to_rank.py # 已添加弃用注释
|
||
git commit -m "feat(experiment): add modular learn-to-rank training script
|
||
|
||
- Create learn_to_rank_v2.py using new modular Trainer architecture
|
||
- Reduce code from 876 lines to ~80 lines
|
||
- Add deprecation notice to old learn_to_rank.py
|
||
- All functionality preserved including NDCG evaluation"
|
||
```
|
||
|
||
---
|
||
|
||
## Task 10: 验证和对比
|
||
|
||
**Files:**
|
||
- Test both implementations
|
||
|
||
**Step 1: Compare outputs**
|
||
|
||
```bash
|
||
# 运行旧版本(如果数据已存在,可以直接比较输出)
|
||
# 注意:这会运行实际训练,需要较长时间
|
||
|
||
# 运行新版本
|
||
uv run python src/experiment/regression_v2.py 2>&1 | tee regression_v2.log
|
||
uv run python src/experiment/learn_to_rank_v2.py 2>&1 | tee rank_v2.log
|
||
|
||
# 检查输出文件
|
||
ls -lh experiment/output/
|
||
# 应该生成 regression_output.csv 和 rank_output.csv
|
||
```
|
||
|
||
**Step 2: Validate feature importance output**
|
||
|
||
确保特征重要性分析输出格式正确:
|
||
- Top 20 特征列表
|
||
- 零贡献特征列表
|
||
- 统计摘要
|
||
|
||
**Step 3: Validate NDCG evaluation (learn_to_rank)**
|
||
|
||
确保 NDCG@k 评估正确执行:
|
||
- ndcg@1, ndcg@5, ndcg@10, ndcg@20 都计算
|
||
|
||
**Step 4: Code statistics**
|
||
|
||
```bash
|
||
# 对比代码行数
|
||
echo "=== Old implementation ==="
|
||
wc -l src/experiment/regression.py src/experiment/learn_to_rank.py
|
||
|
||
echo "=== New implementation ==="
|
||
wc -l src/experiment/regression_v2.py src/experiment/learn_to_rank_v2.py
|
||
|
||
echo "=== New components ==="
|
||
wc -l src/training/factor_manager.py src/training/pipeline.py src/training/result_analyzer.py
|
||
find src/training/tasks -name "*.py" -exec wc -l {} +
|
||
```
|
||
|
||
Expected:
|
||
- Old: ~640 + ~876 = ~1516 lines
|
||
- New: ~80 + ~80 = ~160 lines
|
||
- New components: ~500-800 lines (reusable)
|
||
|
||
**Step 5: Commit final changes**
|
||
|
||
```bash
|
||
git add -A
|
||
git commit -m "refactor(training): complete modular Trainer architecture
|
||
|
||
- Implement FactorManager, DataPipeline, Task strategies, ResultAnalyzer
|
||
- Rewrite regression.py (640 -> 80 lines)
|
||
- Rewrite learn_to_rank.py (876 -> 80 lines)
|
||
- Preserve all functionality:
|
||
* Factor management (metadata, DSL, label, exclusion)
|
||
* Data filtering (STFilter, stock_pool_filter)
|
||
* Data preprocessing (NullFiller, Winsorizer, Scaler)
|
||
* Model training with early stopping
|
||
* Feature importance analysis
|
||
* NDCG evaluation for ranking
|
||
* Result saving (predictions, model)
|
||
- Add comprehensive tests for all components
|
||
- Code reduction: 94% less duplication in experiment scripts"
|
||
```
|
||
|
||
---
|
||
|
||
## Summary
|
||
|
||
### 代码结构变化
|
||
|
||
```
|
||
Before:
|
||
├── src/experiment/regression.py (640 lines) - 独立完整实现
|
||
├── src/experiment/learn_to_rank.py (876 lines) - 独立完整实现
|
||
└── 重复代码: 80%+
|
||
|
||
After:
|
||
├── src/experiment/regression_v2.py (80 lines) - 配置+运行
|
||
├── src/experiment/learn_to_rank_v2.py (80 lines) - 配置+运行
|
||
├── src/training/factor_manager.py - 因子管理(可复用)
|
||
├── src/training/pipeline.py - 数据流水线(可复用)
|
||
├── src/training/tasks/
|
||
│ ├── base.py - 任务接口
|
||
│ ├── regression_task.py - 回归任务
|
||
│ └── rank_task.py - 排序任务
|
||
├── src/training/result_analyzer.py - 结果分析(可复用)
|
||
└── src/training/core/trainer_new.py - 调度引擎
|
||
```
|
||
|
||
### 新增训练类型的工作量
|
||
|
||
添加**分类任务**:
|
||
1. 创建 `ClassificationTask` 类(继承 BaseTask,实现3个方法)
|
||
2. 在实验脚本中使用(80行,与回归/排序类似)
|
||
|
||
无需复制任何数据流程代码!
|
||
|
||
### 测试覆盖
|
||
|
||
- FactorManager: ✓
|
||
- DataPipeline: ✓
|
||
- Tasks: ✓
|
||
- ResultAnalyzer: ✓
|
||
|
||
---
|
||
|
||
## 后续可选优化
|
||
|
||
1. **完全移除旧文件**:验证新文件工作正常后,可以删除 regression.py 和 learn_to_rank.py,将 v2 文件重命名
|
||
2. **添加更多测试**:集成测试、端到端测试
|
||
3. **文档更新**:更新 README,添加新架构使用说明
|
||
4. **配置优化**:支持从 YAML/JSON 文件加载配置
|