liaozhaorun e41a128ca3 feat(training): implement modular Trainer refactor (Trainer V2)
- Add FactorManager component: unified management of factors from multiple sources
- Add DataPipeline component: complete data-processing flow (registration, filtering, splitting, preprocessing)
- Add Task strategy components: BaseTask abstract base class, RegressionTask, RankTask
- Add ResultAnalyzer component: feature-importance analysis and result assembly
- Add TrainerV2 as a pure orchestration engine coordinating the components
- Support both regression and learning-to-rank training modes
- Use composition to decouple the training flow and eliminate code duplication
2026-03-24 23:35:31 +08:00

# Trainer Modular Refactoring Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Refactor the Trainer into a modular training flow, fully preserving all existing functionality and eliminating the code duplication between regression.py and learn_to_rank.py.

**Architecture:** Use composition (Composition over Inheritance) to decouple the training flow into four independent components: FactorManager (factor management), DataPipeline (data flow), Task (task strategy), and ResultAnalyzer (result analysis). The Trainer acts as a pure orchestration engine coordinating these components.

**Tech Stack:** Python 3.10+, Polars, LightGBM, Pydantic
---
## Pre-flight Checks

**Read the reference files to understand the current implementation:**
- @src/experiment/common.py - current configuration and shared functions
- @src/experiment/regression.py - regression training flow (640 lines)
- @src/experiment/learn_to_rank.py - learning-to-rank flow (876 lines)
- @src/training/core/trainer.py - current Trainer implementation
- @src/training/components/models/lightgbm.py - LightGBM regression model
- @src/training/components/models/lightgbm_lambdarank.py - LambdaRank ranking model
- @src/training/components/base.py - base abstract classes
---
## Task 1: Create the docs/plans directory and save the plan
**Files:**
- Create: `docs/plans/2026-03-23-trainer-refactor-plan.md`

**Step 1: Create the directory and copy the plan file**
```bash
mkdir -p docs/plans
cp .plannotator/plans/trainer-v3-2026-03-23-approved.md docs/plans/2026-03-23-trainer-refactor-plan.md
```
**Step 2: Commit**
```bash
git add docs/plans/
git commit -m "docs: add trainer refactoring implementation plan"
```
---
## Task 2: Refactor common.py - add a unified configuration structure
**Files:**
- Modify: `src/experiment/common.py` - append the new configuration structures at the end of the file

**Step 1: Add TRAINING_CONFIG and helper functions at the end of common.py**
```python
# ============================================================
# New: unified configuration structures (for the modular Trainer)
# ============================================================
from typing import Dict, List, Tuple, Any, Callable, Optional
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    """Training configuration."""

    # Factor configuration
    selected_factors: List[str]
    factor_definitions: Dict[str, str]
    label_factor: Dict[str, str]
    excluded_factors: List[str]
    # Data configuration
    stock_pool_filter: Callable
    stock_pool_required_columns: List[str]
    # Date ranges
    train_start: str
    train_end: str
    val_start: str
    val_end: str
    test_start: str
    test_end: str
    # Output configuration
    output_dir: str
    save_predictions: bool
    save_model: bool
    top_n: int
    # Fields with defaults must come after all non-default fields
    st_filter_enabled: bool = True

    @property
    def date_range(self) -> Dict[str, Tuple[str, str]]:
        """Return the date ranges as a dict."""
        return {
            "train": (self.train_start, self.train_end),
            "val": (self.val_start, self.val_end),
            "test": (self.test_start, self.test_end),
        }


@dataclass
class ModelConfig:
    """Base model configuration."""

    model_params: Dict[str, Any]
    label_name: str


@dataclass
class RegressionModelConfig(ModelConfig):
    """Regression model configuration."""


@dataclass
class RankModelConfig(ModelConfig):
    """Learning-to-rank model configuration."""

    n_quantiles: int = 20


# Factory functions for the unified configuration instances
def create_training_config() -> TrainingConfig:
    """Create the training configuration."""
    return TrainingConfig(
        selected_factors=SELECTED_FACTORS,
        factor_definitions=FACTOR_DEFINITIONS,
        label_factor=LABEL_FACTOR,
        excluded_factors=EXCLUDED_FACTORS,
        stock_pool_filter=stock_pool_filter,
        stock_pool_required_columns=STOCK_FILTER_REQUIRED_COLUMNS,
        st_filter_enabled=True,
        train_start=TRAIN_START,
        train_end=TRAIN_END,
        val_start=VAL_START,
        val_end=VAL_END,
        test_start=TEST_START,
        test_end=TEST_END,
        output_dir=OUTPUT_DIR,
        save_predictions=SAVE_PREDICTIONS,
        save_model=SAVE_MODEL,
        top_n=TOP_N,
    )


def create_regression_config() -> RegressionModelConfig:
    """Create the regression model configuration."""
    return RegressionModelConfig(
        model_params=MODEL_PARAMS_REGRESSION,
        label_name="future_return_5",
    )


def create_rank_config() -> RankModelConfig:
    """Create the learning-to-rank model configuration."""
    return RankModelConfig(
        model_params=MODEL_PARAMS_RANK,
        label_name="future_return_5",
        n_quantiles=20,
    )


# Keep backward-compatible exports
__all__ = [
    # Existing exports
    "SELECTED_FACTORS",
    "FACTOR_DEFINITIONS",
    "LABEL_FACTOR",
    "EXCLUDED_FACTORS",
    "register_factors",
    "prepare_data",
    "stock_pool_filter",
    "STOCK_FILTER_REQUIRED_COLUMNS",
    "TRAIN_START",
    "TRAIN_END",
    "VAL_START",
    "VAL_END",
    "TEST_START",
    "TEST_END",
    "OUTPUT_DIR",
    "SAVE_PREDICTIONS",
    "SAVE_MODEL",
    "TOP_N",
    "get_model_save_path",
    "save_model_with_factors",
    "get_label_factor",
    # New exports
    "TrainingConfig",
    "ModelConfig",
    "RegressionModelConfig",
    "RankModelConfig",
    "create_training_config",
    "create_regression_config",
    "create_rank_config",
]
```
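The `date_range` property simply bundles the six date fields into the dict the pipeline consumes. A quick standalone check of that mapping, using a trimmed-down stand-in for TrainingConfig (only the date fields, so it runs without the module-level constants):

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class DateRangeConfig:
    # Stand-in for TrainingConfig's date fields only
    train_start: str
    train_end: str
    val_start: str
    val_end: str
    test_start: str
    test_end: str

    @property
    def date_range(self) -> Dict[str, Tuple[str, str]]:
        return {
            "train": (self.train_start, self.train_end),
            "val": (self.val_start, self.val_end),
            "test": (self.test_start, self.test_end),
        }


cfg = DateRangeConfig("20200101", "20221231", "20230101", "20230630", "20230701", "20231231")
print(cfg.date_range["val"])  # ('20230101', '20230630')
```

The example dates are placeholders, not the project's actual configuration values.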
**Step 2: Split the model parameters in common.py into separate regression and ranking definitions**

Find the MODEL_PARAMS definition (around line 400), rename it to MODEL_PARAMS_REGRESSION, then add the learning-to-rank parameters:
```python
# Regression model parameters
MODEL_PARAMS_REGRESSION = {
    # ... original MODEL_PARAMS content ...
}

# Learning-to-rank model parameters
MODEL_PARAMS_RANK = {
    "objective": "lambdarank",
    "metric": "ndcg",
    "ndcg_at": 25,
    "learning_rate": 0.1,
    "n_estimators": 1000,
    "early_stopping_round": 50,
    "max_depth": 4,
    "num_leaves": 32,
    "min_data_in_leaf": 256,
    "subsample": 0.4,
    "subsample_freq": 1,
    "colsample_bytree": 0.4,
    "reg_alpha": 10.0,
    "reg_lambda": 50.0,
    "lambdarank_truncation_level": 50,
    "label_gain": [i * i for i in range(1, 21)],
    "verbose": -1,
    "random_state": 42,
}

# Backward compatibility
MODEL_PARAMS = MODEL_PARAMS_REGRESSION
```
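The `label_gain` entry above controls how much each quantile label contributes to the NDCG objective: label `k` (0..19) receives gain `(k+1)²`, so the top quantile is weighted 400x the bottom one. A quick sketch of that mapping:

```python
# Quadratic gains for the 20 quantile labels (0..19): label k gets (k + 1) ** 2
label_gain = [i * i for i in range(1, 21)]

print(len(label_gain))                 # 20 entries, one per quantile label
print(label_gain[0], label_gain[-1])   # 1 400
```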
**Step 3: Run tests to verify changes don't break existing code**
```bash
uv run pytest tests/test_sync.py -v -x
```
Expected: Tests pass (or at least not broken by our changes)
**Step 4: Commit**
```bash
git add src/experiment/common.py
git commit -m "refactor(common): add unified config structure for modular trainer
- Add TrainingConfig dataclass for unified configuration
- Add ModelConfig, RegressionModelConfig, RankModelConfig
- Separate MODEL_PARAMS into MODEL_PARAMS_REGRESSION and MODEL_PARAMS_RANK
- Add factory functions: create_training_config, create_regression_config, create_rank_config
- Maintain backward compatibility"
```
---
## Task 3: Create the FactorManager component
**Files:**
- Create: `src/training/factor_manager.py`
- Test: `tests/test_factor_manager.py`
**Step 1: Create the FactorManager implementation**
```python
"""Factor manager.

Manages factors from multiple sources:
- factors registered in metadata
- factors defined by DSL expressions
- the label factor
- an exclusion list
"""
from typing import Dict, List, Optional

from src.factors import FactorEngine


class FactorManager:
    """Factor manager.

    Unifies registration and preparation of factors from multiple sources:
    1. factors already registered in metadata (referenced by name)
    2. factors defined by DSL expressions (registered dynamically)
    3. the label factor (defined by an expression)
    4. an exclusion list (removed from the final feature list)

    Attributes:
        selected_factors: factor names selected from metadata
        factor_definitions: DSL-defined factors {name: dsl_expression}
        label_factor: label factor definition {name: dsl_expression}
        excluded_factors: factor names to exclude
        registered_factors: factors registered to the FactorEngine
    """

    def __init__(
        self,
        selected_factors: List[str],
        factor_definitions: Dict[str, str],
        label_factor: Dict[str, str],
        excluded_factors: Optional[List[str]] = None,
    ):
        """Initialize the factor manager.

        Args:
            selected_factors: factor names selected from metadata
            factor_definitions: DSL-defined factor dict
            label_factor: label factor definition dict
            excluded_factors: factor names to exclude
        """
        self.selected_factors = selected_factors or []
        self.factor_definitions = factor_definitions or {}
        self.label_factor = label_factor or {}
        self.excluded_factors = excluded_factors or []
        self.registered_factors: List[str] = []

    def register_to_engine(
        self,
        engine: FactorEngine,
        verbose: bool = True,
    ) -> List[str]:
        """Register all factors to the FactorEngine.

        Registration order:
        1. metadata factors (loaded from metadata by name)
        2. DSL-defined factors (registered via add_factor)
        3. the label factor (registered via add_factor)
        4. exclusion of the specified factors

        Args:
            engine: FactorEngine instance
            verbose: print registration information

        Returns:
            Final feature column names (exclusions applied)
        """
        if verbose:
            print("\n" + "=" * 80)
            print("Factor registration")
            print("=" * 80)

        # Step 1: register the factors selected from metadata
        if verbose:
            print(f"\n[1/4] Registering {len(self.selected_factors)} factors from metadata...")
        feature_cols = []
        for factor_name in self.selected_factors:
            try:
                engine.add_factor(factor_name)
                feature_cols.append(factor_name)
                if verbose:
                    print(f"  ✓ {factor_name}")
            except Exception as e:
                if verbose:
                    print(f"  ✗ {factor_name}: {e}")

        # Step 2: register the DSL-defined factors
        if self.factor_definitions:
            if verbose:
                print(f"\n[2/4] Registering {len(self.factor_definitions)} DSL-defined factors...")
            for factor_name, dsl_expr in self.factor_definitions.items():
                if factor_name not in self.excluded_factors:
                    try:
                        engine.add_factor(factor_name, dsl_expr)
                        feature_cols.append(factor_name)
                        if verbose:
                            print(f"  ✓ {factor_name}: {dsl_expr[:50]}...")
                    except Exception as e:
                        if verbose:
                            print(f"  ✗ {factor_name}: {e}")

        # Step 3: register the label factor
        if self.label_factor:
            if verbose:
                print("\n[3/4] Registering the label factor...")
            for factor_name, dsl_expr in self.label_factor.items():
                try:
                    engine.add_factor(factor_name, dsl_expr)
                    if verbose:
                        print(f"  ✓ Label: {factor_name}")
                except Exception as e:
                    if verbose:
                        print(f"  ✗ Label {factor_name}: {e}")

        # Step 4: apply the exclusion list
        if self.excluded_factors:
            if verbose:
                print(f"\n[4/4] Excluding {len(self.excluded_factors)} factors...")
            original_count = len(feature_cols)
            feature_cols = [f for f in feature_cols if f not in self.excluded_factors]
            excluded_count = original_count - len(feature_cols)
            if verbose:
                print(f"  excluded {excluded_count} factors")
                for f in self.excluded_factors:
                    if f in self.selected_factors or f in self.factor_definitions:
                        print(f"    - {f}")

        self.registered_factors = feature_cols
        if verbose:
            print(f"\n[Result] final feature count: {len(feature_cols)}")
            print("=" * 80)
        return feature_cols

    def get_feature_cols(self) -> List[str]:
        """Return the registered feature column names."""
        return self.registered_factors

    def get_label_col(self) -> Optional[str]:
        """Return the label column name, or None if no label factor is defined."""
        if self.label_factor:
            return list(self.label_factor.keys())[0]
        return None
```
**Step 2: Create test file**
```python
"""FactorManager tests."""
from unittest.mock import Mock

from src.training.factor_manager import FactorManager


class TestFactorManager:
    """FactorManager tests."""

    def test_init(self):
        """Initialization."""
        fm = FactorManager(
            selected_factors=["factor1", "factor2"],
            factor_definitions={"factor3": "close + open"},
            label_factor={"label": "future_return_5"},
            excluded_factors=["factor2"],
        )
        assert fm.selected_factors == ["factor1", "factor2"]
        assert fm.factor_definitions == {"factor3": "close + open"}
        assert fm.label_factor == {"label": "future_return_5"}
        assert fm.excluded_factors == ["factor2"]
        assert fm.registered_factors == []

    def test_register_to_engine(self):
        """Registering to the engine."""
        # Mock engine
        engine = Mock()
        fm = FactorManager(
            selected_factors=["factor1", "factor2"],
            factor_definitions={"factor3": "close + open"},
            label_factor={"label": "future_return"},
            excluded_factors=["factor2"],
        )
        feature_cols = fm.register_to_engine(engine, verbose=False)
        # Verify the calls
        assert engine.add_factor.call_count == 4  # 2 selected + 1 dsl + 1 label
        # Verify the result (factor2 is excluded)
        assert "factor1" in feature_cols
        assert "factor2" not in feature_cols
        assert "factor3" in feature_cols
        assert fm.registered_factors == feature_cols

    def test_get_feature_cols(self):
        """Fetching the feature columns."""
        fm = FactorManager(
            selected_factors=["factor1"],
            factor_definitions={},
            label_factor={},
        )
        # Empty before registration
        assert fm.get_feature_cols() == []
        # After registration
        engine = Mock()
        fm.register_to_engine(engine, verbose=False)
        assert fm.get_feature_cols() == ["factor1"]

    def test_get_label_col(self):
        """Fetching the label column."""
        fm = FactorManager(
            selected_factors=[],
            factor_definitions={},
            label_factor={"label": "future_return"},
        )
        assert fm.get_label_col() == "label"
        # None when no label is defined
        fm2 = FactorManager(selected_factors=[], factor_definitions={}, label_factor={})
        assert fm2.get_label_col() is None
```
**Step 3: Run tests**
```bash
uv run pytest tests/test_factor_manager.py -v
```
Expected: All tests pass
**Step 4: Commit**
```bash
git add src/training/factor_manager.py tests/test_factor_manager.py
git commit -m "feat(training): add FactorManager component
- Manage factors from multiple sources (metadata, DSL, label, excluded)
- Register factors to FactorEngine with proper ordering
- Support factor exclusion
- Add comprehensive tests"
```
---
## Task 4: Create the DataPipeline component
**Files:**
- Create: `src/training/pipeline.py`
- Test: `tests/test_pipeline.py`
**Step 1: Create the DataPipeline implementation**
```python
"""Data pipeline.

The complete data-processing flow:
1. factor registration and data preparation
2. class-based filters (e.g. STFilter)
3. stock-pool filtering (custom function)
4. data-quality checks
5. data splitting (train/val/test)
6. preprocessing (fit_transform/transform)
"""
from typing import Any, Callable, Dict, List, Optional, Tuple

import polars as pl

from src.factors import FactorEngine
from src.training.factor_manager import FactorManager
from src.training.components.base import BaseProcessor
from src.training.core.stock_pool_manager import StockPoolManager


class DataPipeline:
    """Data pipeline.

    Runs the complete data-processing flow and returns a standardized
    data dict.

    Attributes:
        factor_manager: factor manager
        filters: class-based filters (e.g. STFilter)
        stock_pool_filter_func: function-based stock-pool filter
        processors: data processors
        stock_pool_required_columns: extra columns needed by the stock-pool filter
        fitted_processors: processors fitted during training
    """

    def __init__(
        self,
        factor_manager: FactorManager,
        processors: List[BaseProcessor],
        filters: Optional[List[Any]] = None,
        stock_pool_filter_func: Optional[Callable] = None,
        stock_pool_required_columns: Optional[List[str]] = None,
    ):
        """Initialize the data pipeline.

        Args:
            factor_manager: FactorManager instance
            processors: data processors (applied in order)
            filters: class-based filters (e.g. [STFilter])
            stock_pool_filter_func: function-based stock-pool filter
            stock_pool_required_columns: extra columns needed by the stock-pool filter
        """
        self.factor_manager = factor_manager
        self.processors = processors or []
        self.filters = filters or []
        self.stock_pool_filter_func = stock_pool_filter_func
        self.stock_pool_required_columns = stock_pool_required_columns or []
        self.fitted_processors: List[BaseProcessor] = []

    def prepare_data(
        self,
        engine: FactorEngine,
        date_range: Dict[str, Tuple[str, str]],
        label_name: str,
        verbose: bool = True,
    ) -> Dict[str, Dict[str, Any]]:
        """Run the complete data flow.

        Steps:
        1. register factors and prepare the data
        2. apply class-based filters (STFilter)
        3. apply the stock-pool filter (function form)
        4. data-quality checks
        5. data splitting
        6. preprocessing

        Args:
            engine: FactorEngine instance
            date_range: {"train": (start, end), "val": ..., "test": ...}
            label_name: label column name
            verbose: print progress information

        Returns:
            Standardized data dict:
            {
                "train": {
                    "X": pl.DataFrame,         # feature matrix
                    "y": pl.Series,            # target
                    "raw_data": pl.DataFrame,  # raw data (full information)
                    "feature_cols": List[str], # feature column names
                },
                "val": {...},
                "test": {...},
            }
        """
        if verbose:
            print("\n" + "=" * 80)
            print("Data pipeline")
            print("=" * 80)

        # Step 1: register factors and prepare the data
        if verbose:
            print("\n[1/6] Registering factors and preparing data...")
        feature_cols = self.factor_manager.register_to_engine(engine, verbose=verbose)

        # Full date range covering all splits
        all_start = min(date_range["train"][0], date_range["val"][0], date_range["test"][0])
        all_end = max(date_range["train"][1], date_range["val"][1], date_range["test"][1])

        # Prepare the data
        data = engine.compute(
            factors=feature_cols + [label_name],
            start_date=all_start,
            end_date=all_end,
        )
        if verbose:
            print(f"  raw data shape: {data.shape}")
            print(f"  feature count: {len(feature_cols)}")

        # Step 2: apply class-based filters (STFilter)
        if self.filters:
            if verbose:
                print(f"\n[2/6] Applying {len(self.filters)} filter(s)...")
            for filter_obj in self.filters:
                data_before = len(data)
                data = filter_obj.filter(data)
                data_after = len(data)
                if verbose:
                    print(f"  {filter_obj.__class__.__name__}:")
                    print(f"    before: {data_before}, after: {data_after}")
                    print(f"    removed: {data_before - data_after}")

        # Step 3: apply the stock-pool filter (function form)
        if self.stock_pool_filter_func:
            if verbose:
                print("\n[3/6] Stock-pool filtering...")
            data_before = len(data)
            # Build a StockPoolManager
            pool_manager = StockPoolManager(
                filter_func=self.stock_pool_filter_func,
                required_columns=self.stock_pool_required_columns,
                data_router=engine.router,
            )
            data = pool_manager.filter_and_select_daily(data)
            data_after = len(data)
            if verbose:
                print(f"  before: {data_before}, after: {data_after}")
                print(f"  removed: {data_before - data_after}")

        # Step 4: data-quality checks
        if verbose:
            print("\n[4/6] Data-quality checks...")
        self._check_data_quality(data, feature_cols, verbose=verbose)

        # Step 5: data splitting
        if verbose:
            print("\n[5/6] Data splitting...")
        split_data = self._split_data(data, date_range, feature_cols, label_name, verbose=verbose)

        # Step 6: preprocessing
        if verbose:
            print("\n[6/6] Preprocessing...")
        split_data = self._preprocess(split_data, verbose=verbose)

        if verbose:
            print("\n" + "=" * 80)
            print("Data pipeline finished")
            print("=" * 80)
        return split_data

    def _check_data_quality(
        self,
        data: pl.DataFrame,
        feature_cols: List[str],
        verbose: bool = True,
    ) -> None:
        """Check data quality.

        Args:
            data: data frame
            feature_cols: feature column names
            verbose: print information
        """
        # Check for nulls
        null_counts = {}
        for col in feature_cols:
            null_count = data[col].null_count()
            if null_count > 0:
                null_counts[col] = null_count
        if null_counts and verbose:
            print("  [Warning] null values found:")
            for col, count in list(null_counts.items())[:5]:  # show the first 5 only
                pct = count / len(data) * 100
                print(f"    {col}: {count} ({pct:.2f}%)")

    def _split_data(
        self,
        data: pl.DataFrame,
        date_range: Dict[str, Tuple[str, str]],
        feature_cols: List[str],
        label_name: str,
        verbose: bool = True,
    ) -> Dict[str, Dict[str, Any]]:
        """Split the dataset.

        Args:
            data: full dataset
            date_range: date-range dict
            feature_cols: feature column names
            label_name: label column name
            verbose: print information

        Returns:
            Dict of the split data
        """
        result = {}
        for split_name, (start, end) in date_range.items():
            split_df = data.filter(
                (pl.col("trade_date") >= start) & (pl.col("trade_date") <= end)
            )
            result[split_name] = {
                "X": split_df.select(feature_cols),
                "y": split_df[label_name],
                "raw_data": split_df,
                "feature_cols": feature_cols,
            }
            if verbose:
                print(f"  {split_name}: {len(split_df)} rows")
        return result

    def _preprocess(
        self,
        split_data: Dict[str, Dict[str, Any]],
        verbose: bool = True,
    ) -> Dict[str, Dict[str, Any]]:
        """Preprocess the data.

        The training split uses fit_transform; the validation and test
        splits use transform only.

        Args:
            split_data: dict of the split data
            verbose: print information

        Returns:
            Dict of the preprocessed data
        """
        if not self.processors:
            return split_data
        self.fitted_processors = []

        # Training split: fit_transform
        if verbose:
            print("  preprocessing train split (fit_transform)...")
        train_data = split_data["train"]["raw_data"]
        for processor in self.processors:
            train_data = processor.fit_transform(train_data)
            self.fitted_processors.append(processor)
        # Update the training split
        split_data["train"]["raw_data"] = train_data
        split_data["train"]["X"] = train_data.select(split_data["train"]["feature_cols"])
        split_data["train"]["y"] = train_data[split_data["train"]["y"].name]

        # Validation and test splits: transform only
        for split_name in ["val", "test"]:
            if split_name in split_data:
                if verbose:
                    print(f"  preprocessing {split_name} split (transform)...")
                split_df = split_data[split_name]["raw_data"]
                for processor in self.fitted_processors:
                    split_df = processor.transform(split_df)
                split_data[split_name]["raw_data"] = split_df
                split_data[split_name]["X"] = split_df.select(split_data[split_name]["feature_cols"])
                split_data[split_name]["y"] = split_df[split_data[split_name]["y"].name]
        return split_data

    def get_fitted_processors(self) -> List[BaseProcessor]:
        """Return the fitted processors (used when saving the model)."""
        return self.fitted_processors
```
**Step 2: Create test file**
```python
"""DataPipeline tests."""
from unittest.mock import Mock

from src.training.pipeline import DataPipeline
from src.training.factor_manager import FactorManager
from src.training.components.processors import NullFiller


class TestDataPipeline:
    """DataPipeline tests."""

    def test_init(self):
        """Initialization."""
        fm = Mock(spec=FactorManager)
        processors = [NullFiller(feature_cols=["f1"])]
        pipeline = DataPipeline(
            factor_manager=fm,
            processors=processors,
        )
        assert pipeline.factor_manager == fm
        assert pipeline.processors == processors
        assert pipeline.fitted_processors == []

    def test_get_fitted_processors(self):
        """Fetching the fitted processors."""
        pipeline = DataPipeline(
            factor_manager=Mock(),
            processors=[],
        )
        # Simulate fitted processors
        pipeline.fitted_processors = [Mock()]
        assert len(pipeline.get_fitted_processors()) == 1
```
**Step 3: Commit**
```bash
git add src/training/pipeline.py tests/test_pipeline.py
git commit -m "feat(training): add DataPipeline component
- Complete data processing pipeline: register factors, filter, split, preprocess
- Support both class filters (STFilter) and function filters (stock_pool_filter)
- Proper fit_transform/transform separation for processors
- Add comprehensive tests"
```
---
## Task 5: Create the Task strategy components
**Files:**
- Create: `src/training/tasks/base.py`
- Create: `src/training/tasks/regression_task.py`
- Create: `src/training/tasks/rank_task.py`
- Create: `src/training/tasks/__init__.py`
- Test: `tests/test_tasks.py`
**Step 1: Create base Task protocol**
```python
"""Abstract base class for tasks.

Defines the Task interface that every concrete task must implement.
"""
from abc import ABC, abstractmethod
from typing import Any, Dict

import numpy as np


class BaseTask(ABC):
    """Abstract base class for training tasks.

    Every training task (regression, learning-to-rank, classification, ...)
    must inherit from this class. It provides a unified interface for label
    preparation, model training, prediction, and evaluation.

    Attributes:
        label_name: label column name
        model_params: model parameter dict
    """

    def __init__(self, model_params: Dict[str, Any], label_name: str):
        """Initialize the task.

        Args:
            model_params: model parameter dict
            label_name: label column name
        """
        self.model_params = model_params
        self.label_name = label_name
        self.model = None

    @abstractmethod
    def prepare_labels(self, data: Dict[str, Dict]) -> Dict[str, Dict]:
        """Prepare the labels.

        Subclasses may implement task-specific label transforms
        (e.g. quantile conversion for learning-to-rank).

        Args:
            data: data dict

        Returns:
            The processed data dict
        """
        raise NotImplementedError

    @abstractmethod
    def fit(self, train_data: Dict, val_data: Dict) -> None:
        """Train the model.

        Args:
            train_data: training data dict {"X": DataFrame, "y": Series, ...}
            val_data: validation data dict
        """
        raise NotImplementedError

    @abstractmethod
    def predict(self, test_data: Dict) -> np.ndarray:
        """Generate predictions.

        Args:
            test_data: test data dict

        Returns:
            Prediction array
        """
        raise NotImplementedError

    def get_model(self):
        """Return the trained underlying model."""
        return self.model

    def plot_training_metrics(self) -> None:
        """Plot training-metric curves (optional)."""
```
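A concrete task only has to fill in the three abstract hooks. A minimal sketch with a hypothetical `MeanTask` that "trains" by memorizing the training-set label mean (illustration only, using a trimmed stand-in for the BaseTask interface above):

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

import numpy as np


class BaseTask(ABC):
    """Trimmed stand-in for the BaseTask interface."""

    def __init__(self, model_params: Dict[str, Any], label_name: str):
        self.model_params = model_params
        self.label_name = label_name
        self.model = None

    @abstractmethod
    def prepare_labels(self, data): ...

    @abstractmethod
    def fit(self, train_data, val_data): ...

    @abstractmethod
    def predict(self, test_data): ...


class MeanTask(BaseTask):
    """Hypothetical task: the 'model' is just the training-set label mean."""

    def prepare_labels(self, data):
        return data  # no label transform needed

    def fit(self, train_data, val_data):
        self.model = float(np.mean(train_data["y"]))

    def predict(self, test_data):
        return np.full(len(test_data["X"]), self.model)


task = MeanTask(model_params={}, label_name="y")
task.fit({"y": [1.0, 2.0, 3.0]}, None)
preds = task.predict({"X": [0, 0]})
print(preds)  # [2. 2.]
```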
**Step 2: Create RegressionTask**
```python
"""Regression task.

Training flow for regression:
- labels need no transform (kept continuous)
- uses the LightGBM regression model
- supports MAE/RMSE evaluation
"""
from typing import Any, Dict, Optional

import numpy as np

from src.training.tasks.base import BaseTask
from src.training.components.models.lightgbm import LightGBMModel


class RegressionTask(BaseTask):
    """Regression task.

    Trains a LightGBM regression model with early stopping and
    training curves.
    """

    def __init__(
        self,
        model_params: Dict[str, Any],
        label_name: str = "future_return_5",
    ):
        """Initialize the regression task.

        Args:
            model_params: LightGBM parameter dict
            label_name: label column name
        """
        super().__init__(model_params, label_name)
        self.evals_result: Optional[Dict] = None

    def prepare_labels(self, data: Dict[str, Dict]) -> Dict[str, Dict]:
        """Prepare the labels (no transform for regression).

        Args:
            data: data dict

        Returns:
            The data dict, unchanged
        """
        # Regression needs no label transform
        return data

    def fit(self, train_data: Dict, val_data: Dict) -> None:
        """Train the regression model.

        Args:
            train_data: training data {"X": DataFrame, "y": Series}
            val_data: validation data
        """
        self.model = LightGBMModel(params=self.model_params)
        X_train = train_data["X"]
        y_train = train_data["y"]
        X_val = val_data["X"]
        y_val = val_data["y"]
        self.model.fit(
            X_train,
            y_train,
            eval_set=(X_val, y_val) if X_val is not None else None,
        )

    def predict(self, test_data: Dict) -> np.ndarray:
        """Generate predictions.

        Args:
            test_data: test data

        Returns:
            Prediction array
        """
        return self.model.predict(test_data["X"])

    def plot_training_metrics(self) -> None:
        """Plot training-metric curves."""
        if self.model and hasattr(self.model, "model") and self.model.model:
            try:
                import lightgbm as lgb
                import matplotlib.pyplot as plt

                fig, ax = plt.subplots(figsize=(10, 6))
                lgb.plot_metric(self.model.model, ax=ax)
                plt.title("Training Metrics", fontsize=12, fontweight="bold")
                plt.tight_layout()
                plt.show()
            except Exception as e:
                print(f"[Warning] failed to plot training curves: {e}")
```
**Step 3: Create RankTask**
```python
"""Learning-to-rank task.

Training flow for learning-to-rank:
- convert labels to quantile labels
- build the group arrays
- train LightGBM LambdaRank
- support NDCG@k evaluation
"""
from typing import Any, Dict, List, Optional

import numpy as np
import polars as pl

from src.training.tasks.base import BaseTask
from src.training.components.models.lightgbm_lambdarank import LightGBMLambdaRankModel


class RankTask(BaseTask):
    """Learning-to-rank task.

    Trains a LightGBM LambdaRank model by converting continuous
    returns into quantile labels.
    """

    def __init__(
        self,
        model_params: Dict[str, Any],
        label_name: str = "future_return_5",
        n_quantiles: int = 20,
    ):
        """Initialize the learning-to-rank task.

        Args:
            model_params: LightGBM parameter dict
            label_name: label column name
            n_quantiles: number of quantiles
        """
        super().__init__(model_params, label_name)
        self.n_quantiles = n_quantiles

    def prepare_labels(self, data: Dict[str, Dict]) -> Dict[str, Dict]:
        """Prepare the labels (convert to quantile labels).

        Converts continuous returns into quantile labels and builds
        the group arrays.

        Args:
            data: data dict

        Returns:
            The processed data dict (with y_rank and groups added)
        """
        for split in ["train", "val", "test"]:
            if split not in data:
                continue
            df = data[split]["raw_data"]
            # Quantile conversion within each trading day
            rank_col = f"{self.label_name}_rank"
            df_ranked = (
                df.with_columns(
                    pl.col(self.label_name)
                    .rank(method="min")
                    .over("trade_date")
                    .alias("_rank")
                )
                .with_columns(
                    ((pl.col("_rank") - 1) / pl.len().over("trade_date") * self.n_quantiles)
                    .floor()
                    .cast(pl.Int64)
                    .clip(0, self.n_quantiles - 1)
                    .alias(rank_col)
                )
                .drop("_rank")
            )
            # Update the split
            data[split]["raw_data"] = df_ranked
            data[split]["y"] = df_ranked[rank_col]
            data[split]["y_raw"] = df_ranked[self.label_name]  # keep the raw values
            # Build the group array
            data[split]["groups"] = self._compute_group_array(df_ranked, "trade_date")
        return data

    def _compute_group_array(
        self,
        df: pl.DataFrame,
        date_col: str = "trade_date",
    ) -> np.ndarray:
        """Compute the group array.

        Args:
            df: data frame
            date_col: date column name

        Returns:
            Group array (sample count per date)
        """
        group_counts = df.group_by(date_col, maintain_order=True).agg(
            pl.len().alias("count")
        )
        return group_counts["count"].to_numpy()

    def fit(self, train_data: Dict, val_data: Dict) -> None:
        """Train the ranking model.

        Args:
            train_data: training data
            val_data: validation data
        """
        self.model = LightGBMLambdaRankModel(params=self.model_params)
        self.model.fit(
            train_data["X"],
            train_data["y"],
            group=train_data["groups"],
            eval_set=(val_data["X"], val_data["y"], val_data["groups"]) if val_data else None,
        )

    def predict(self, test_data: Dict) -> np.ndarray:
        """Generate predictions.

        Args:
            test_data: test data

        Returns:
            Prediction array
        """
        return self.model.predict(test_data["X"])

    def evaluate_ndcg(
        self,
        test_data: Dict,
        k_list: Optional[List[int]] = None,
    ) -> Dict[str, float]:
        """Evaluate NDCG@k.

        Args:
            test_data: test data
            k_list: list of k values, defaults to [1, 5, 10, 20]

        Returns:
            NDCG scores {"ndcg@1": score, ...}
        """
        if k_list is None:
            k_list = [1, 5, 10, 20]
        y_true = test_data["y_raw"]
        y_pred = self.predict(test_data)
        groups = test_data["groups"]

        from sklearn.metrics import ndcg_score

        results = {}
        # Split by group
        start_idx = 0
        y_true_groups = []
        y_pred_groups = []
        for group_size in groups:
            end_idx = start_idx + group_size
            y_true_groups.append(y_true.to_numpy()[start_idx:end_idx])
            y_pred_groups.append(y_pred[start_idx:end_idx])
            start_idx = end_idx
        # NDCG for each k
        for k in k_list:
            ndcg_scores = []
            for yt, yp in zip(y_true_groups, y_pred_groups):
                if len(yt) > 1:
                    try:
                        score = ndcg_score([yt], [yp], k=k)
                        ndcg_scores.append(score)
                    except ValueError:
                        pass
            results[f"ndcg@{k}"] = float(np.mean(ndcg_scores)) if ndcg_scores else 0.0
        return results

    def plot_training_metrics(self) -> None:
        """Plot training-metric curves (NDCG)."""
        if self.model:
            try:
                self.model.plot_all_metrics()
            except Exception as e:
                print(f"[Warning] failed to plot training curves: {e}")
```
**Step 4: Create tasks/__init__.py**
```python
"""Tasks module.

Implementations of the training tasks.
"""
from src.training.tasks.base import BaseTask
from src.training.tasks.regression_task import RegressionTask
from src.training.tasks.rank_task import RankTask

__all__ = [
    "BaseTask",
    "RegressionTask",
    "RankTask",
]
```
**Step 5: Create test file**
```python
"""Task tests."""
from unittest.mock import Mock

import polars as pl

from src.training.tasks import RegressionTask, RankTask


class TestRegressionTask:
    """RegressionTask tests."""

    def test_init(self):
        """Initialization."""
        task = RegressionTask(
            model_params={"objective": "regression"},
            label_name="target",
        )
        assert task.model_params == {"objective": "regression"}
        assert task.label_name == "target"
        assert task.model is None

    def test_prepare_labels(self):
        """Label preparation (no transform for regression)."""
        task = RegressionTask(model_params={}, label_name="target")
        data = {"train": {"y": Mock()}}
        result = task.prepare_labels(data)
        # Regression should return the data unchanged
        assert result == data


class TestRankTask:
    """RankTask tests."""

    def test_init(self):
        """Initialization."""
        task = RankTask(
            model_params={"objective": "lambdarank"},
            label_name="target",
            n_quantiles=10,
        )
        assert task.n_quantiles == 10

    def test_compute_group_array(self):
        """Group-array computation."""
        task = RankTask(model_params={}, label_name="target")
        # Test data
        df = pl.DataFrame({
            "trade_date": ["20240101", "20240101", "20240102", "20240102", "20240102"],
            "value": [1, 2, 3, 4, 5],
        })
        groups = task._compute_group_array(df, "trade_date")
        assert len(groups) == 2   # two dates
        assert groups[0] == 2     # 2 rows on the first date
        assert groups[1] == 3     # 3 rows on the second date
```
**Step 6: Commit**
```bash
git add src/training/tasks/
git add tests/test_tasks.py
git commit -m "feat(training): add Task strategy components
- Add BaseTask abstract base class
- Add RegressionTask for regression training
- Add RankTask for learning-to-rank with LambdaRank
- Support quantile label conversion and NDCG evaluation
- Add comprehensive tests"
```
---
## Task 6: Create the ResultAnalyzer component
**Files:**
- Create: `src/training/result_analyzer.py`
- Test: `tests/test_result_analyzer.py`
**Step 1: Create the ResultAnalyzer implementation**
```python
"""Result analyzer.

Post-training analysis and result handling:
1. feature-importance analysis (Top N, zero-contribution features)
2. result assembly (daily Top N lists)
3. result saving
"""
import os
from typing import Any, Dict, List

import numpy as np
import polars as pl


class ResultAnalyzer:
    """Result analyzer.

    Analyzes training results, produces reports, and saves them.
    """

    def analyze_feature_importance(
        self,
        model,
        feature_cols: List[str],
        top_n: int = 20,
        verbose: bool = True,
    ) -> Dict[str, Any]:
        """Analyze feature importance.

        Args:
            model: trained model
            feature_cols: feature column names
            top_n: number of top features to show
            verbose: print information

        Returns:
            Analysis result dict
        """
        importance = model.feature_importance()
        if importance is None:
            if verbose:
                print("[Warning] feature importance is unavailable")
            return {}

        # Sort by importance
        importance_sorted = importance.sort_values(ascending=False)
        # Percentages
        total_importance = importance_sorted.sum()
        importance_pct = (importance_sorted / total_importance * 100).round(2)
        # Zero-contribution features
        zero_importance_features = importance_sorted[importance_sorted == 0].index.tolist()

        if verbose:
            print("\n" + "=" * 80)
            print("Feature importance analysis")
            print("=" * 80)
            # Top N
            print(f"\nTop {top_n} features:")
            print("-" * 80)
            print(f"{'Rank':<6}{'Feature':<35}{'Importance':<15}{'Share':<10}")
            print("-" * 80)
            for i, (feature, score) in enumerate(importance_sorted.head(top_n).items(), 1):
                pct = importance_pct[feature]
                if pct >= 10:
                    marker = " [high]"
                elif pct >= 1:
                    marker = " [medium]"
                else:
                    marker = " [low]"
                print(f"{i:<6}{feature:<35}{score:<15.2f}{pct:<8.2f}%{marker}")
            # Zero-contribution features
            if zero_importance_features:
                print("\n" + "-" * 80)
                print(f"[Warning] zero-contribution features ({len(zero_importance_features)}):")
                for i, feature in enumerate(zero_importance_features, 1):
                    print(f"  {i}. {feature}")
            # Summary
            print("\n" + "=" * 80)
            print("Summary:")
            print("-" * 80)
            print(f"  total features: {len(importance_sorted)}")
            print(f"  contributing features: {len(importance_sorted) - len(zero_importance_features)}")
            print(f"  zero-contribution features: {len(zero_importance_features)}")
            if len(importance_sorted) > 0:
                print(f"  zero-contribution share: {len(zero_importance_features) / len(importance_sorted) * 100:.1f}%")
                print(f"  Top {top_n} cumulative share: {importance_pct.head(top_n).sum():.1f}%")
            print("=" * 80)

        return {
            "importance": importance_sorted,
            "importance_pct": importance_pct,
            "zero_importance_features": zero_importance_features,
            "top_n": importance_sorted.head(top_n),
        }

    def assemble_results(
        self,
        test_data: Dict[str, Any],
        predictions: np.ndarray,
        top_n: int = 50,
        verbose: bool = True,
    ) -> pl.DataFrame:
        """Assemble the results.

        Builds the daily Top N stock recommendation list.

        Args:
            test_data: test data dict
            predictions: prediction array
            top_n: stocks selected per day
            verbose: print information

        Returns:
            Result data frame
        """
        # Attach the prediction column
        raw_data = test_data["raw_data"]
        results = raw_data.with_columns([
            pl.Series("prediction", predictions)
        ])
        # Take the Top N per date
        unique_dates = results["trade_date"].unique().sort()
        topn_by_date = []
        for date in unique_dates:
            day_data = results.filter(pl.col("trade_date") == date)
            topn = day_data.sort("prediction", descending=True).head(top_n)
            topn_by_date.append(topn)
        # Concatenate the per-date Top N frames
        topn_results = pl.concat(topn_by_date)

        if verbose:
            print(f"\nDaily Top {top_n} stock list:")
            print(f"  trading days: {len(unique_dates)}")
            print(f"  total recommendations: {len(topn_results)}")
        return topn_results

    def save_results(
        self,
        results: pl.DataFrame,
        output_path: str,
        verbose: bool = True,
    ) -> None:
        """Save the results.

        Args:
            results: result data frame
            output_path: output path
            verbose: print information
        """
        # Format the date and reorder the columns
        formatted = results.select([
            (pl.col("trade_date").str.slice(0, 4) + "-" +
             pl.col("trade_date").str.slice(4, 2) + "-" +
             pl.col("trade_date").str.slice(6, 2)).alias("date"),
            pl.col("prediction").alias("score"),
            pl.col("ts_code"),
        ])
        # Ensure the directory exists
        os.makedirs(os.path.dirname(output_path), exist_ok=True)
        # Write the CSV
        formatted.write_csv(output_path, include_header=True)
        if verbose:
            print(f"  saved to: {output_path}")
            print(f"  rows written: {len(formatted)}")
```
**Step 2: Create test file**
```python
"""ResultAnalyzer 测试"""
import pytest
from unittest.mock import Mock
import polars as pl
import pandas as pd
import numpy as np
from src.training.result_analyzer import ResultAnalyzer
class TestResultAnalyzer:
"""测试 ResultAnalyzer"""
def test_init(self):
"""测试初始化"""
analyzer = ResultAnalyzer()
assert analyzer is not None
def test_analyze_feature_importance(self):
"""测试特征重要性分析"""
analyzer = ResultAnalyzer()
# 创建 mock model
model = Mock()
importance = pd.Series(
[100, 50, 0, 0, 30],
index=["feat1", "feat2", "feat3", "feat4", "feat5"]
)
model.feature_importance.return_value = importance
result = analyzer.analyze_feature_importance(
model=model,
feature_cols=["feat1", "feat2", "feat3", "feat4", "feat5"],
top_n=3,
verbose=False,
)
assert "importance" in result
assert "zero_importance_features" in result
assert len(result["zero_importance_features"]) == 2 # feat3, feat4
def test_assemble_results(self):
"""测试结果组装"""
analyzer = ResultAnalyzer()
# 创建测试数据
test_data = {
"raw_data": pl.DataFrame({
"trade_date": ["20240101", "20240101", "20240102", "20240102"],
"ts_code": ["000001.SZ", "000002.SZ", "000001.SZ", "000002.SZ"],
})
}
predictions = np.array([0.5, 0.3, 0.8, 0.2])
results = analyzer.assemble_results(
test_data=test_data,
predictions=predictions,
top_n=1,
verbose=False,
)
assert len(results) == 2 # 每天选1个共2天
```
**Step 3: Commit**
```bash
git add src/training/result_analyzer.py tests/test_result_analyzer.py
git commit -m "feat(training): add ResultAnalyzer component
- Analyze feature importance with top N and zero-contribution features
- Assemble daily Top N stock recommendations
- Save results to CSV with proper formatting
- Add comprehensive tests"
```
---
## Task 7: Refactor Trainer into an Orchestration Engine
**Files:**
- Create: `src/training/core/trainer_new.py` (new implementation)
- Modify: `src/training/__init__.py` - add new exports
**Step 1: Create new Trainer implementation**
```python
"""训练调度引擎
协调 FactorManager、DataPipeline、Task 和 ResultAnalyzer 完成训练流程。
"""
from typing import Any, Callable, Dict, List, Optional, Tuple
import os
from datetime import datetime
import polars as pl
from src.factors import FactorEngine
from src.training.pipeline import DataPipeline
from src.training.tasks.base import BaseTask
from src.training.result_analyzer import ResultAnalyzer
class Trainer:
"""训练调度引擎
协调各个组件执行完整训练流程:
1. 准备数据DataPipeline
2. 处理标签Task
3. 训练模型Task
4. 绘制指标Task
5. 生成预测Task
6. 分析结果ResultAnalyzer
7. 保存结果
Attributes:
data_pipeline: 数据流水线
task: 任务实例RegressionTask/RankTask
analyzer: 结果分析器
output_config: 输出配置
verbose: 是否打印详细信息
results: 训练结果
"""
def __init__(
self,
data_pipeline: DataPipeline,
task: BaseTask,
analyzer: Optional[ResultAnalyzer] = None,
output_config: Optional[Dict[str, Any]] = None,
verbose: bool = True,
):
"""初始化训练器
Args:
data_pipeline: 数据流水线实例
task: 任务实例RegressionTask 或 RankTask
analyzer: 结果分析器(可选,默认创建新实例)
output_config: 输出配置字典
verbose: 是否打印详细信息
"""
self.data_pipeline = data_pipeline
self.task = task
self.analyzer = analyzer or ResultAnalyzer()
self.output_config = output_config or {}
self.verbose = verbose
self.results: Optional[pl.DataFrame] = None
def run(
self,
engine: FactorEngine,
date_range: Dict[str, Tuple[str, str]],
) -> pl.DataFrame:
"""执行完整训练流程
Args:
engine: FactorEngine 实例
date_range: 日期范围字典
{
"train": (start_date, end_date),
"val": (start_date, end_date),
"test": (start_date, end_date),
}
Returns:
训练结果数据框
"""
if self.verbose:
print("\n" + "=" * 80)
print(f"开始训练: {self.task.__class__.__name__}")
print("=" * 80)
# Step 1: 准备数据
if self.verbose:
print("\n[Step 1/7] 准备数据...")
data = self.data_pipeline.prepare_data(
engine=engine,
date_range=date_range,
label_name=self.task.label_name,
verbose=self.verbose,
)
# Step 2: 处理标签
if self.verbose:
print("\n[Step 2/7] 处理标签...")
data = self.task.prepare_labels(data)
# Step 3: 训练模型
if self.verbose:
print("\n[Step 3/7] 训练模型...")
self.task.fit(data["train"], data["val"])
# Step 4: 绘制训练指标
if self.verbose:
print("\n[Step 4/7] 绘制训练指标...")
self.task.plot_training_metrics()
# Step 5: 生成预测
if self.verbose:
print("\n[Step 5/7] 生成预测...")
predictions = self.task.predict(data["test"])
# Step 6: 分析结果
if self.verbose:
print("\n[Step 6/7] 分析结果...")
# 特征重要性
self.analyzer.analyze_feature_importance(
model=self.task.get_model(),
feature_cols=data["test"]["feature_cols"],
top_n=20,
verbose=self.verbose,
)
# NDCG 评估(排序任务特有)
if hasattr(self.task, 'evaluate_ndcg'):
ndcg_scores = self.task.evaluate_ndcg(data["test"])
if self.verbose:
print("\nNDCG 评估结果:")
for metric, score in ndcg_scores.items():
print(f" {metric}: {score:.4f}")
# 组装结果
self.results = self.analyzer.assemble_results(
test_data=data["test"],
predictions=predictions,
top_n=self.output_config.get("top_n", 50),
verbose=self.verbose,
)
# Step 7: 保存结果
if self.verbose:
print("\n[Step 7/7] 保存结果...")
if self.output_config.get("save_predictions", True):
self._save_predictions()
if self.output_config.get("save_model", False):
self._save_model()
if self.verbose:
print("\n" + "=" * 80)
print("训练完成!")
print("=" * 80)
return self.results
def _save_predictions(self) -> None:
"""保存预测结果"""
output_dir = self.output_config.get("output_dir", "experiment/output")
output_filename = self.output_config.get("output_filename", "output.csv")
output_path = os.path.join(output_dir, output_filename)
self.analyzer.save_results(
results=self.results,
output_path=output_path,
verbose=self.verbose,
)
def _save_model(self) -> None:
"""保存模型"""
model_save_path = self.output_config.get("model_save_path")
if not model_save_path:
return
        # 确保目录存在dirname 可能为空字符串,需防护)
        save_dir = os.path.dirname(model_save_path)
        if save_dir:
            os.makedirs(save_dir, exist_ok=True)
# 获取模型和相关信息
model = self.task.get_model()
# 保存模型
model.save(model_save_path)
if self.verbose:
print(f" 模型保存路径: {model_save_path}")
def get_results(self) -> Optional[pl.DataFrame]:
"""获取训练结果
Returns:
训练结果数据框,如果尚未训练则返回 None
"""
return self.results
def get_task(self) -> BaseTask:
"""获取任务实例
Returns:
任务实例
"""
return self.task
```
**Step 2: Update __init__.py to export new components**
Add to `src/training/__init__.py`:
```python
# 新增导出(模块化 Trainer 组件)
from src.training.factor_manager import FactorManager
from src.training.pipeline import DataPipeline
from src.training.result_analyzer import ResultAnalyzer
from src.training.tasks import RegressionTask, RankTask
# 可以选择性地导出新的 Trainer或者保持原有 Trainer 不变
# from src.training.core.trainer_new import Trainer as ModularTrainer
__all__ = [
# 原有导出
"Trainer",
"DateSplitter",
"StockPoolManager",
"check_data_quality",
"STFilter",
"Winsorizer",
"NullFiller",
"StandardScaler",
"CrossSectionalStandardScaler",
"TrainingConfig",
# 新增导出
"FactorManager",
"DataPipeline",
"ResultAnalyzer",
"RegressionTask",
"RankTask",
]
```
**Step 3: Run basic import tests**
```bash
uv run python -c "from src.training import FactorManager, DataPipeline, RegressionTask, RankTask, ResultAnalyzer; print('All imports successful')"
```
Expected: All imports successful
**Step 4: Commit**
```bash
git add src/training/core/trainer_new.py
git add src/training/__init__.py
git add src/training/factor_manager.py
git add src/training/pipeline.py
git add src/training/result_analyzer.py
git add src/training/tasks/
git commit -m "feat(training): add modular Trainer architecture
- Add FactorManager for unified factor management
- Add DataPipeline for complete data processing workflow
- Add Task strategy components (RegressionTask, RankTask)
- Add ResultAnalyzer for post-training analysis
- Add new Trainer as orchestration engine
- Update __init__.py exports"
```
---
## Task 8: Rewrite regression.py on the New Architecture
**Files:**
- Create: `src/experiment/regression_v2.py` (new implementation)
- Keep: `src/experiment/regression.py` (original kept, with a note that it has been migrated)
**Step 1: Create new regression.py with new architecture**
```python
# %% md
# # LightGBM 回归训练流程(模块化版本)
#
# 使用新的模块化 Trainer 架构
# %% md
# ## 1. 导入依赖
# %%
from src.training import (
Trainer,
DataPipeline,
FactorManager,
RegressionTask,
NullFiller,
Winsorizer,
StandardScaler,
)
from src.training.components.filters import STFilter
from src.experiment.common import (
create_training_config,
create_regression_config,
FactorEngine,
)
# %% md
# ## 2. 配置参数
# %%
# 创建统一配置
training_config = create_training_config()
model_config = create_regression_config()
print("训练配置:")
print(f" 训练期: {training_config.train_start} - {training_config.train_end}")
print(f" 验证期: {training_config.val_start} - {training_config.val_end}")
print(f" 测试期: {training_config.test_start} - {training_config.test_end}")
print(f" 特征数: {len(training_config.selected_factors)}")
print(f" Label: {model_config.label_name}")
# %% md
# ## 3. 创建组件
# %%
# 1. 创建 FactorEngine
engine = FactorEngine()
# 2. 创建 FactorManager
factor_manager = FactorManager(
selected_factors=training_config.selected_factors,
factor_definitions=training_config.factor_definitions,
label_factor=training_config.label_factor,
excluded_factors=training_config.excluded_factors,
)
# 3. 创建 DataPipeline
processors = [
NullFiller(strategy="mean"),
Winsorizer(lower=0.01, upper=0.99),
StandardScaler(),
]
filters = [STFilter(data_router=engine.router)] if training_config.st_filter_enabled else []
pipeline = DataPipeline(
factor_manager=factor_manager,
processors=processors,
filters=filters,
stock_pool_filter_func=training_config.stock_pool_filter,
stock_pool_required_columns=training_config.stock_pool_required_columns,
)
# 4. 创建 Task
task = RegressionTask(
model_params=model_config.model_params,
label_name=model_config.label_name,
)
# 5. 创建 Trainer
output_config = {
"output_dir": training_config.output_dir,
"output_filename": "regression_output.csv",
"save_predictions": training_config.save_predictions,
"save_model": training_config.save_model,
"model_save_path": f"{training_config.output_dir}/regression_model.txt",
"top_n": training_config.top_n,
}
trainer = Trainer(
data_pipeline=pipeline,
task=task,
output_config=output_config,
verbose=True,
)
# %% md
# ## 4. 执行训练
# %%
results = trainer.run(
engine=engine,
date_range=training_config.date_range,
)
# %% md
# ## 5. 额外分析(可选)
# %%
# 获取模型进行进一步分析
model = task.get_model()
# 可以在这里添加自定义可视化
print("\n训练完成!")
print(f"结果保存路径: {output_config['output_dir']}/regression_output.csv")
```
**Step 2: Add deprecation notice to old regression.py**
Add at the top of the existing `regression.py`:
```python
# 注意:此文件已迁移到 regression_v2.py
# 新文件使用模块化 Trainer 架构
# 此文件保留用于参考和对比
```
**Step 3: Test new regression script**
```bash
# NOTE: this runs real training and may take a while
# try a small date range first
uv run python src/experiment/regression_v2.py
```
**Step 4: Commit**
```bash
git add src/experiment/regression_v2.py
git add src/experiment/regression.py  # deprecation notice added
git commit -m "feat(experiment): add modular regression training script
- Create regression_v2.py using new modular Trainer architecture
- Reduce code from 640 lines to ~80 lines
- Add deprecation notice to old regression.py
- All functionality preserved"
```
---
## Task 9: Rewrite learn_to_rank.py on the New Architecture
**Files:**
- Create: `src/experiment/learn_to_rank_v2.py` (new implementation)
- Keep: `src/experiment/learn_to_rank.py` (original kept, with a note that it has been migrated)
**Step 1: Create new learn_to_rank.py with new architecture**
```python
# %% md
# # LightGBM LambdaRank 排序学习训练流程(模块化版本)
#
# 使用新的模块化 Trainer 架构
# %% md
# ## 1. 导入依赖
# %%
from src.training import (
Trainer,
DataPipeline,
FactorManager,
RankTask,
NullFiller,
Winsorizer,
CrossSectionalStandardScaler,
)
from src.training.components.filters import STFilter
from src.experiment.common import (
create_training_config,
create_rank_config,
FactorEngine,
)
# %% md
# ## 2. 配置参数
# %%
# 创建统一配置
training_config = create_training_config()
model_config = create_rank_config()
print("训练配置:")
print(f" 训练期: {training_config.train_start} - {training_config.train_end}")
print(f" 验证期: {training_config.val_start} - {training_config.val_end}")
print(f" 测试期: {training_config.test_start} - {training_config.test_end}")
print(f" 特征数: {len(training_config.selected_factors)}")
print(f" Label: {model_config.label_name}")
print(f" 分位数: {model_config.n_quantiles}")
# %% md
# ## 3. 创建组件
# %%
# 1. 创建 FactorEngine
engine = FactorEngine()
# 2. 创建 FactorManager
factor_manager = FactorManager(
selected_factors=training_config.selected_factors,
factor_definitions=training_config.factor_definitions,
label_factor=training_config.label_factor,
excluded_factors=training_config.excluded_factors,
)
# 3. 创建 DataPipeline使用截面标准化
processors = [
NullFiller(strategy="mean"),
Winsorizer(lower=0.01, upper=0.99),
CrossSectionalStandardScaler(),
]
filters = [STFilter(data_router=engine.router)] if training_config.st_filter_enabled else []
pipeline = DataPipeline(
factor_manager=factor_manager,
processors=processors,
filters=filters,
stock_pool_filter_func=training_config.stock_pool_filter,
stock_pool_required_columns=training_config.stock_pool_required_columns,
)
# 4. 创建 Task排序学习特有 n_quantiles
task = RankTask(
model_params=model_config.model_params,
label_name=model_config.label_name,
n_quantiles=model_config.n_quantiles,
)
# 5. 创建 Trainer
output_config = {
"output_dir": training_config.output_dir,
"output_filename": "rank_output.csv",
"save_predictions": training_config.save_predictions,
"save_model": training_config.save_model,
"model_save_path": f"{training_config.output_dir}/rank_model.txt",
"top_n": training_config.top_n,
}
trainer = Trainer(
data_pipeline=pipeline,
task=task,
output_config=output_config,
verbose=True,
)
# %% md
# ## 4. 执行训练
# %%
results = trainer.run(
engine=engine,
date_range=training_config.date_range,
)
# %% md
# ## 5. 额外分析NDCG
# %%
# NDCG 评估已在 Trainer.run() 中自动执行
# 可以在这里添加额外的可视化
print("\n训练完成!")
print(f"结果保存路径: {output_config['output_dir']}/rank_output.csv")
```
**Step 2: Add deprecation notice to old learn_to_rank.py**
Add at the top of the existing `learn_to_rank.py`:
```python
# 注意:此文件已迁移到 learn_to_rank_v2.py
# 新文件使用模块化 Trainer 架构
# 此文件保留用于参考和对比
```
**Step 3: Test new learn_to_rank script**
```bash
# NOTE: this runs real training
uv run python src/experiment/learn_to_rank_v2.py
```
**Step 4: Commit**
```bash
git add src/experiment/learn_to_rank_v2.py
git add src/experiment/learn_to_rank.py  # deprecation notice added
git commit -m "feat(experiment): add modular learn-to-rank training script
- Create learn_to_rank_v2.py using new modular Trainer architecture
- Reduce code from 876 lines to ~80 lines
- Add deprecation notice to old learn_to_rank.py
- All functionality preserved including NDCG evaluation"
```
---
## Task 10: Validation and Comparison
**Files:**
- Test both implementations
**Step 1: Compare outputs**
```bash
# Run the old version (if its outputs already exist, just compare them)
# NOTE: this runs real training and takes a while
# Run the new version
uv run python src/experiment/regression_v2.py 2>&1 | tee regression_v2.log
uv run python src/experiment/learn_to_rank_v2.py 2>&1 | tee rank_v2.log
# Check the output files
ls -lh experiment/output/
# Should produce regression_output.csv and rank_output.csv
```
**Step 2: Validate feature importance output**
Make sure the feature-importance analysis output is formatted correctly:
- Top 20 feature list
- zero-contribution feature list
- summary statistics
**Step 3: Validate NDCG evaluation (learn_to_rank)**
Make sure the NDCG@k evaluation runs correctly:
- ndcg@1, ndcg@5, ndcg@10, ndcg@20 are all computed
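For reference when eyeballing these numbers, NDCG@k can be sketched in a few lines of standalone NumPy, using the exponential gain (`2^rel - 1`) that lambdarank-style objectives assume. This is only an illustrative sketch; the project's actual evaluation lives in `RankTask.evaluate_ndcg`:

```python
import numpy as np

def ndcg_at_k(labels: np.ndarray, scores: np.ndarray, k: int) -> float:
    """NDCG@k for one query (here: one trading day's cross-section)."""
    order = np.argsort(scores)[::-1]                   # rank by predicted score, desc
    gains = 2.0 ** labels[order][:k] - 1.0             # exponential gain
    discounts = 1.0 / np.log2(np.arange(2, gains.size + 2))
    dcg = float(np.sum(gains * discounts))
    ideal = np.sort(labels)[::-1][:k]                  # best possible ordering
    idcg = float(np.sum((2.0 ** ideal - 1.0) * discounts[: ideal.size]))
    return dcg / idcg if idcg > 0 else 0.0

labels = np.array([3, 2, 0, 1])                        # quantile labels, higher = better
perfect = ndcg_at_k(labels, np.array([0.9, 0.7, 0.1, 0.3]), k=4)
print(perfect)  # 1.0 — predictions rank exactly like the labels
```

A perfectly ordered day scores 1.0; any inversion pushes the score below 1, which is a quick sanity check on the reported ndcg@k values.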
**Step 4: Code statistics**
```bash
# Compare line counts
echo "=== Old implementation ==="
wc -l src/experiment/regression.py src/experiment/learn_to_rank.py
echo "=== New implementation ==="
wc -l src/experiment/regression_v2.py src/experiment/learn_to_rank_v2.py
echo "=== New components ==="
wc -l src/training/factor_manager.py src/training/pipeline.py src/training/result_analyzer.py
find src/training/tasks -name "*.py" -exec wc -l {} +
```
Expected:
- Old: ~640 + ~876 = ~1516 lines
- New: ~80 + ~80 = ~160 lines
- New components: ~500-800 lines (reusable)
**Step 5: Commit final changes**
```bash
git add -A
git commit -m "refactor(training): complete modular Trainer architecture
- Implement FactorManager, DataPipeline, Task strategies, ResultAnalyzer
- Rewrite regression.py (640 -> 80 lines)
- Rewrite learn_to_rank.py (876 -> 80 lines)
- Preserve all functionality:
* Factor management (metadata, DSL, label, exclusion)
* Data filtering (STFilter, stock_pool_filter)
* Data preprocessing (NullFiller, Winsorizer, Scaler)
* Model training with early stopping
* Feature importance analysis
* NDCG evaluation for ranking
* Result saving (predictions, model)
- Add comprehensive tests for all components
- Code reduction: 94% less duplication in experiment scripts"
```
---
## Summary
### Code Structure Changes
```
Before:
├── src/experiment/regression.py (640 lines) - standalone full implementation
├── src/experiment/learn_to_rank.py (876 lines) - standalone full implementation
└── duplicated code: 80%+
After:
├── src/experiment/regression_v2.py (80 lines) - config + run
├── src/experiment/learn_to_rank_v2.py (80 lines) - config + run
├── src/training/factor_manager.py - factor management (reusable)
├── src/training/pipeline.py - data pipeline (reusable)
├── src/training/tasks/
│   ├── base.py - task interface
│   ├── regression_task.py - regression task
│   └── rank_task.py - rank task
├── src/training/result_analyzer.py - result analysis (reusable)
└── src/training/core/trainer_new.py - orchestration engine
```
### Effort to Add a New Training Type
To add a **classification task**:
1. Create a `ClassificationTask` class (subclass `BaseTask`, implement its 3 methods)
2. Use it from an experiment script (~80 lines, same shape as regression/rank)
No data-pipeline code needs to be copied!
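As a sketch of what that subclass might look like: the `BaseTask` stub below is an assumed stand-in for the real `src/training/tasks/base.BaseTask` interface, and the fit/predict bodies are placeholders (a real version would train a LightGBM binary classifier), not existing project code:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict
import numpy as np

# Assumed minimal interface, mirroring how Trainer.run() calls the task.
class BaseTask(ABC):
    label_name: str
    @abstractmethod
    def prepare_labels(self, data: Dict[str, Any]) -> Dict[str, Any]: ...
    @abstractmethod
    def fit(self, train: Dict[str, Any], val: Dict[str, Any]) -> None: ...
    @abstractmethod
    def predict(self, test: Dict[str, Any]) -> np.ndarray: ...

class ClassificationTask(BaseTask):
    """Hypothetical task: classify whether the forward return is positive."""
    def __init__(self, label_name: str = "fwd_return"):
        self.label_name = label_name
        self.model = None
    def prepare_labels(self, data):
        for split in ("train", "val", "test"):
            y = data[split]["labels"]
            data[split]["labels"] = (y > 0).astype(int)  # binarize returns
        return data
    def fit(self, train, val):
        # placeholder: a real version would call lgb.train(objective="binary", ...)
        self.model = {"fitted": True}
    def predict(self, test):
        return np.full(len(test["labels"]), 0.5)  # placeholder probabilities

task = ClassificationTask()
data = {s: {"labels": np.array([0.02, -0.01, 0.0])} for s in ("train", "val", "test")}
data = task.prepare_labels(data)
print(data["train"]["labels"])  # [1 0 0]
```

The point is the shape: label handling, model fitting, and prediction live in the task, while data preparation and result analysis stay in the shared components.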
### Test Coverage
- FactorManager: ✓
- DataPipeline: ✓
- Tasks: ✓
- ResultAnalyzer: ✓
---
## Optional Follow-ups
1. **Remove the old files**: once the new scripts are verified, delete regression.py and learn_to_rank.py and rename the v2 files
2. **More tests**: integration and end-to-end tests
3. **Documentation**: update the README with usage notes for the new architecture
4. **Config loading**: support loading configuration from YAML/JSON files
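The config-loading item could be sketched with stdlib-only JSON into a dataclass. Field names here mirror the `output_config` keys from Task 8; `OutputConfig` and `load_output_config` are hypothetical names, not existing code:

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path

# Hypothetical typed view of the output_config dict passed to Trainer.
@dataclass
class OutputConfig:
    output_dir: str = "experiment/output"
    output_filename: str = "output.csv"
    top_n: int = 50
    save_model: bool = False

def load_output_config(path: str) -> OutputConfig:
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    return OutputConfig(**raw)  # unknown keys fail fast with TypeError

# round-trip demo
with tempfile.TemporaryDirectory() as d:
    cfg_path = Path(d) / "output_config.json"
    cfg_path.write_text(json.dumps({"output_dir": "out", "top_n": 20}), encoding="utf-8")
    cfg = load_output_config(str(cfg_path))
print(cfg.top_n)            # 20
print(cfg.output_filename)  # output.csv (default preserved)
```

Since the project already uses Pydantic, a `BaseModel` with `model_validate` would give the same shape plus type coercion and validation errors out of the box.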