# Trainer Modular Refactoring Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Refactor the Trainer into a modular training workflow, preserving all existing functionality while eliminating the code duplication between regression.py and learn_to_rank.py

**Architecture:** Use composition over inheritance: decouple the training workflow into four independent components, FactorManager (factor management), DataPipeline (data processing), Task (task strategy), and ResultAnalyzer (result analysis), with the Trainer acting as a pure orchestration engine that coordinates them

**Tech Stack:** Python 3.10+, Polars, LightGBM, Pydantic

---

## Pre-flight Checks

**Read the reference files to understand the current implementation:**

- @src/experiment/common.py - current configuration and shared functions
- @src/experiment/regression.py - regression training workflow (640 lines)
- @src/experiment/learn_to_rank.py - learning-to-rank workflow (876 lines)
- @src/training/core/trainer.py - current Trainer implementation
- @src/training/components/models/lightgbm.py - LightGBM regression model
- @src/training/components/models/lightgbm_lambdarank.py - LambdaRank ranking model
- @src/training/components/base.py - base abstract classes

---

## Task 1: Create the docs/plans directory and save the plan

**Files:**

- Create: `docs/plans/2026-03-23-trainer-refactor-plan.md`

**Step 1: Create the directory and copy the plan file**

```bash
mkdir -p docs/plans
cp .plannotator/plans/trainer-v3-2026-03-23-approved.md docs/plans/2026-03-23-trainer-refactor-plan.md
```

**Step 2: Commit**

```bash
git add docs/plans/
git commit -m "docs: add trainer refactoring implementation plan"
```

---

## Task 2: Refactor common.py - add unified configuration structures

**Files:**

- Modify: `src/experiment/common.py` - append the new configuration structures at the end of the file

**Step 1: Append TRAINING_CONFIG and the helper functions to the end of common.py**

```python
# ============================================================
# New: unified configuration structures (for the modular Trainer)
# ============================================================

from typing import Dict, List, Tuple, Any, Callable
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    """Training configuration."""

    # Factor configuration
    selected_factors: List[str]
    factor_definitions: Dict[str, str]
    label_factor: Dict[str, str]
    excluded_factors: List[str]

    # Data configuration
    stock_pool_filter: Callable
    stock_pool_required_columns: List[str]

    # Date ranges
    train_start: str
    train_end: str
    val_start: str
    val_end: str
    test_start: str
    test_end: str

    # Output configuration
    output_dir: str
    save_predictions: bool
    save_model: bool
    top_n: int

    # Fields with defaults must come after all non-default fields
    st_filter_enabled: bool = True

    @property
    def date_range(self) -> Dict[str, Tuple[str, str]]:
        """Return the date ranges as a dict."""
        return {
            "train": (self.train_start, self.train_end),
            "val": (self.val_start, self.val_end),
            "test": (self.test_start, self.test_end),
        }


@dataclass
class ModelConfig:
    """Base model configuration."""
    model_params: Dict[str, Any]
    label_name: str


@dataclass
class RegressionModelConfig(ModelConfig):
    """Regression model configuration."""
    pass


@dataclass
class RankModelConfig(ModelConfig):
    """Learning-to-rank model configuration."""
    n_quantiles: int = 20


# Factory for the unified configuration instance
def create_training_config() -> TrainingConfig:
    """Create the training configuration."""
    return TrainingConfig(
        selected_factors=SELECTED_FACTORS,
        factor_definitions=FACTOR_DEFINITIONS,
        label_factor=LABEL_FACTOR,
        excluded_factors=EXCLUDED_FACTORS,
        stock_pool_filter=stock_pool_filter,
        stock_pool_required_columns=STOCK_FILTER_REQUIRED_COLUMNS,
        st_filter_enabled=True,
        train_start=TRAIN_START,
        train_end=TRAIN_END,
        val_start=VAL_START,
        val_end=VAL_END,
        test_start=TEST_START,
        test_end=TEST_END,
        output_dir=OUTPUT_DIR,
        save_predictions=SAVE_PREDICTIONS,
        save_model=SAVE_MODEL,
        top_n=TOP_N,
    )


def create_regression_config() -> RegressionModelConfig:
    """Create the regression model configuration."""
    return RegressionModelConfig(
        model_params=MODEL_PARAMS_REGRESSION,
        label_name="future_return_5",
    )


def create_rank_config() -> RankModelConfig:
    """Create the learning-to-rank model configuration."""
    return RankModelConfig(
        model_params=MODEL_PARAMS_RANK,
        label_name="future_return_5",
        n_quantiles=20,
    )


# Keep backward-compatible exports
__all__ = [
    # Existing exports
    "SELECTED_FACTORS",
    "FACTOR_DEFINITIONS",
    "LABEL_FACTOR",
    "EXCLUDED_FACTORS",
    "register_factors",
    "prepare_data",
    "stock_pool_filter",
    "STOCK_FILTER_REQUIRED_COLUMNS",
    "TRAIN_START",
    "TRAIN_END",
    "VAL_START",
    "VAL_END",
    "TEST_START",
    "TEST_END",
    "OUTPUT_DIR",
    "SAVE_PREDICTIONS",
    "SAVE_MODEL",
    "TOP_N",
    "get_model_save_path",
    "save_model_with_factors",
    "get_label_factor",
    # New exports
    "TrainingConfig",
    "ModelConfig",
    "RegressionModelConfig",
    "RankModelConfig",
    "create_training_config",
    "create_regression_config",
    "create_rank_config",
]
```
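
One gotcha when extending `TrainingConfig`: dataclass fields with defaults (such as `st_filter_enabled: bool = True`) must be declared after every non-default field, otherwise the class definition itself raises a TypeError. A minimal standalone demonstration:

```python
from dataclasses import dataclass

@dataclass
class Ok:
    train_start: str                  # non-default fields first
    st_filter_enabled: bool = True    # defaulted fields last

print(Ok(train_start="2018-01-01").st_filter_enabled)  # True

# Reversing the order fails at class-definition time
try:
    @dataclass
    class Bad:
        st_filter_enabled: bool = True
        train_start: str
except TypeError as e:
    print(type(e).__name__)  # TypeError
```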

**Step 2: After the MODEL_PARAMS definition in common.py, split the parameters into separate regression and ranking dicts**

Find the MODEL_PARAMS definition (around line 400), rename it to MODEL_PARAMS_REGRESSION, then add the learning-to-rank parameters:

```python
# Regression model parameters
MODEL_PARAMS_REGRESSION = {
    # ... original MODEL_PARAMS content ...
}

# Learning-to-rank model parameters
MODEL_PARAMS_RANK = {
    "objective": "lambdarank",
    "metric": "ndcg",
    "ndcg_at": 25,
    "learning_rate": 0.1,
    "n_estimators": 1000,
    "early_stopping_round": 50,
    "max_depth": 4,
    "num_leaves": 32,
    "min_data_in_leaf": 256,
    "subsample": 0.4,
    "subsample_freq": 1,
    "colsample_bytree": 0.4,
    "reg_alpha": 10.0,
    "reg_lambda": 50.0,
    "lambdarank_truncation_level": 50,
    "label_gain": [i * i for i in range(1, 21)],
    "verbose": -1,
    "random_state": 42,
}

# Backward compatibility
MODEL_PARAMS = MODEL_PARAMS_REGRESSION
```
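
A quick sanity check on `label_gain` above: LightGBM indexes this list by the integer relevance label, so with 20 quantile buckets (labels 0 through 19) it needs exactly 20 entries, and the quadratic gains concentrate reward on the top quantiles:

```python
# Same expression as in MODEL_PARAMS_RANK above
label_gain = [i * i for i in range(1, 21)]

assert len(label_gain) == 20  # one gain per relevance label 0..19
print(label_gain[0], label_gain[-1])  # gain for the worst and best quantile
# 1 400
```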

**Step 3: Run tests to verify the changes don't break existing code**

```bash
uv run pytest tests/test_sync.py -v -x
```

Expected: Tests pass (or at least are not broken by our changes)

**Step 4: Commit**

```bash
git add src/experiment/common.py
git commit -m "refactor(common): add unified config structure for modular trainer

- Add TrainingConfig dataclass for unified configuration
- Add ModelConfig, RegressionModelConfig, RankModelConfig
- Separate MODEL_PARAMS into MODEL_PARAMS_REGRESSION and MODEL_PARAMS_RANK
- Add factory functions: create_training_config, create_regression_config, create_rank_config
- Maintain backward compatibility"
```

---

## Task 3: Create the FactorManager component

**Files:**

- Create: `src/training/factor_manager.py`
- Test: `tests/test_factor_manager.py`

**Step 1: Create the FactorManager implementation**

```python
"""Factor manager.

Manages factors from multiple sources:
- factors registered in metadata
- factors defined by DSL expressions
- the label factor
- an exclusion list
"""

from typing import Dict, List, Optional

from src.factors import FactorEngine


class FactorManager:
    """Factor manager.

    Unifies registration and preparation of factors from multiple sources:
    1. Factors already registered in metadata (referenced by name)
    2. Factors defined by DSL expressions (registered dynamically)
    3. The label factor (defined by an expression)
    4. An exclusion list (removed from the final feature list)

    Attributes:
        selected_factors: Factor names selected from metadata.
        factor_definitions: DSL-defined factors, {name: dsl_expression}.
        label_factor: Label factor definition, {name: dsl_expression}.
        excluded_factors: Factor names to exclude.
        registered_factors: Factors already registered with the FactorEngine.
    """

    def __init__(
        self,
        selected_factors: List[str],
        factor_definitions: Dict[str, str],
        label_factor: Dict[str, str],
        excluded_factors: Optional[List[str]] = None,
    ):
        """Initialize the factor manager.

        Args:
            selected_factors: Factor names selected from metadata.
            factor_definitions: DSL-defined factor dict.
            label_factor: Label factor definition dict.
            excluded_factors: Factor names to exclude.
        """
        self.selected_factors = selected_factors or []
        self.factor_definitions = factor_definitions or {}
        self.label_factor = label_factor or {}
        self.excluded_factors = excluded_factors or []
        self.registered_factors: List[str] = []

    def register_to_engine(
        self,
        engine: FactorEngine,
        verbose: bool = True,
    ) -> List[str]:
        """Register all factors with the FactorEngine.

        Registration order:
        1. Factors from metadata (loaded by name)
        2. DSL-defined factors (registered via add_factor)
        3. The label factor (registered via add_factor)
        4. Remove the excluded factors

        Args:
            engine: A FactorEngine instance.
            verbose: Whether to print registration info.

        Returns:
            The final feature column names (with exclusions removed).
        """
        if verbose:
            print("\n" + "=" * 80)
            print("Factor registration")
            print("=" * 80)

        # Step 1: register the selected factors from metadata
        if verbose:
            print(f"\n[1/4] Registering {len(self.selected_factors)} factors from metadata...")

        feature_cols = []
        for factor_name in self.selected_factors:
            try:
                engine.add_factor(factor_name)
                feature_cols.append(factor_name)
                if verbose:
                    print(f"  ✓ {factor_name}")
            except Exception as e:
                if verbose:
                    print(f"  ✗ {factor_name}: {e}")

        # Step 2: register the DSL-defined factors
        if self.factor_definitions:
            if verbose:
                print(f"\n[2/4] Registering {len(self.factor_definitions)} DSL-defined factors...")

            for factor_name, dsl_expr in self.factor_definitions.items():
                if factor_name not in self.excluded_factors:
                    try:
                        engine.add_factor(factor_name, dsl_expr)
                        feature_cols.append(factor_name)
                        if verbose:
                            print(f"  ✓ {factor_name}: {dsl_expr[:50]}...")
                    except Exception as e:
                        if verbose:
                            print(f"  ✗ {factor_name}: {e}")

        # Step 3: register the label factor
        if self.label_factor:
            if verbose:
                print("\n[3/4] Registering the label factor...")

            for factor_name, dsl_expr in self.label_factor.items():
                try:
                    engine.add_factor(factor_name, dsl_expr)
                    if verbose:
                        print(f"  ✓ Label: {factor_name}")
                except Exception as e:
                    if verbose:
                        print(f"  ✗ Label {factor_name}: {e}")

        # Step 4: remove the excluded factors
        if self.excluded_factors:
            if verbose:
                print(f"\n[4/4] Excluding {len(self.excluded_factors)} factors...")

            original_count = len(feature_cols)
            feature_cols = [f for f in feature_cols if f not in self.excluded_factors]
            excluded_count = original_count - len(feature_cols)

            if verbose:
                print(f"  Excluded {excluded_count} factors")
                for f in self.excluded_factors:
                    if f in self.selected_factors or f in self.factor_definitions:
                        print(f"    - {f}")

        self.registered_factors = feature_cols

        if verbose:
            print(f"\n[Result] Final feature count: {len(feature_cols)}")
            print("=" * 80)

        return feature_cols

    def get_feature_cols(self) -> List[str]:
        """Return the registered feature column names."""
        return self.registered_factors

    def get_label_col(self) -> Optional[str]:
        """Return the label column name, or None if no label is defined."""
        if self.label_factor:
            return list(self.label_factor.keys())[0]
        return None
```

**Step 2: Create test file**

```python
"""FactorManager tests."""

from unittest.mock import Mock

from src.training.factor_manager import FactorManager


class TestFactorManager:
    """Tests for FactorManager."""

    def test_init(self):
        """Test initialization."""
        fm = FactorManager(
            selected_factors=["factor1", "factor2"],
            factor_definitions={"factor3": "close + open"},
            label_factor={"label": "future_return_5"},
            excluded_factors=["factor2"],
        )

        assert fm.selected_factors == ["factor1", "factor2"]
        assert fm.factor_definitions == {"factor3": "close + open"}
        assert fm.label_factor == {"label": "future_return_5"}
        assert fm.excluded_factors == ["factor2"]
        assert fm.registered_factors == []

    def test_register_to_engine(self):
        """Test registration with the engine."""
        # Create a mock engine
        engine = Mock()
        engine.add_factor = Mock()

        fm = FactorManager(
            selected_factors=["factor1", "factor2"],
            factor_definitions={"factor3": "close + open"},
            label_factor={"label": "future_return"},
            excluded_factors=["factor2"],
        )

        feature_cols = fm.register_to_engine(engine, verbose=False)

        # Verify the calls
        assert engine.add_factor.call_count == 4  # 2 selected + 1 dsl + 1 label

        # Verify the result (factor2 is excluded)
        assert "factor1" in feature_cols
        assert "factor2" not in feature_cols
        assert "factor3" in feature_cols
        assert fm.registered_factors == feature_cols

    def test_get_feature_cols(self):
        """Test fetching the feature columns."""
        fm = FactorManager(
            selected_factors=["factor1"],
            factor_definitions={},
            label_factor={},
        )

        # Empty before registration
        assert fm.get_feature_cols() == []

        # After registration
        engine = Mock()
        engine.add_factor = Mock()
        fm.register_to_engine(engine, verbose=False)

        assert fm.get_feature_cols() == ["factor1"]

    def test_get_label_col(self):
        """Test fetching the label column."""
        fm = FactorManager(
            selected_factors=[],
            factor_definitions={},
            label_factor={"label": "future_return"},
        )

        assert fm.get_label_col() == "label"

        # Returns None when there is no label
        fm2 = FactorManager(selected_factors=[], factor_definitions={}, label_factor={})
        assert fm2.get_label_col() is None
```

**Step 3: Run tests**

```bash
uv run pytest tests/test_factor_manager.py -v
```

Expected: All tests pass

**Step 4: Commit**

```bash
git add src/training/factor_manager.py tests/test_factor_manager.py
git commit -m "feat(training): add FactorManager component

- Manage factors from multiple sources (metadata, DSL, label, excluded)
- Register factors to FactorEngine with proper ordering
- Support factor exclusion
- Add comprehensive tests"
```

---

## Task 4: Create the DataPipeline component

**Files:**

- Create: `src/training/pipeline.py`
- Test: `tests/test_pipeline.py`

**Step 1: Create the DataPipeline implementation**

```python
"""Data pipeline.

The complete data-processing flow:
1. Factor registration and data preparation
2. Apply filters (STFilter, etc.)
3. Stock-pool selection (custom function)
4. Data quality checks
5. Data splitting (train/val/test)
6. Preprocessing (fit_transform/transform)
"""

from typing import Any, Callable, Dict, List, Optional, Tuple

import polars as pl

from src.factors import FactorEngine
from src.training.factor_manager import FactorManager
from src.training.components.base import BaseProcessor
from src.training.core.stock_pool_manager import StockPoolManager


class DataPipeline:
    """Data pipeline.

    Runs the complete data-processing flow and returns a standardized data dict.

    Attributes:
        factor_manager: The factor manager.
        filters: Class-based filters (e.g. STFilter).
        stock_pool_filter_func: Function-based stock-pool filter.
        processors: Data processors.
        stock_pool_required_columns: Extra columns needed for stock-pool filtering.
        fitted_processors: Fitted processors (populated after training).
    """

    def __init__(
        self,
        factor_manager: FactorManager,
        processors: List[BaseProcessor],
        filters: Optional[List[Any]] = None,
        stock_pool_filter_func: Optional[Callable] = None,
        stock_pool_required_columns: Optional[List[str]] = None,
    ):
        """Initialize the data pipeline.

        Args:
            factor_manager: A FactorManager instance.
            processors: Data processors (applied in order).
            filters: Class-based filters (e.g. [STFilter]).
            stock_pool_filter_func: Function-based stock-pool filter.
            stock_pool_required_columns: Extra columns needed for stock-pool filtering.
        """
        self.factor_manager = factor_manager
        self.processors = processors or []
        self.filters = filters or []
        self.stock_pool_filter_func = stock_pool_filter_func
        self.stock_pool_required_columns = stock_pool_required_columns or []
        self.fitted_processors: List[BaseProcessor] = []

    def prepare_data(
        self,
        engine: FactorEngine,
        date_range: Dict[str, Tuple[str, str]],
        label_name: str,
        verbose: bool = True,
    ) -> Dict[str, Dict[str, Any]]:
        """Run the full data flow.

        Steps:
        1. Register factors and prepare the data
        2. Apply class-based filters (STFilter)
        3. Apply the stock-pool filter (function-based)
        4. Data quality checks
        5. Data splitting
        6. Preprocessing

        Args:
            engine: A FactorEngine instance.
            date_range: Date ranges, {"train": (start, end), "val": ..., "test": ...}.
            label_name: Label column name.
            verbose: Whether to print progress.

        Returns:
            A standardized data dict:
            {
                "train": {
                    "X": pl.DataFrame,         # feature matrix
                    "y": pl.Series,            # target variable
                    "raw_data": pl.DataFrame,  # raw data (keeps all columns)
                    "feature_cols": List[str], # feature column names
                },
                "val": {...},
                "test": {...},
            }
        """
        if verbose:
            print("\n" + "=" * 80)
            print("Data pipeline")
            print("=" * 80)

        # Step 1: register factors and prepare the data
        if verbose:
            print("\n[1/6] Registering factors and preparing data...")

        feature_cols = self.factor_manager.register_to_engine(engine, verbose=verbose)

        # Compute the full date span
        all_start = min(date_range["train"][0], date_range["val"][0], date_range["test"][0])
        all_end = max(date_range["train"][1], date_range["val"][1], date_range["test"][1])

        # Prepare the data
        data = engine.compute(
            factors=feature_cols + [label_name],
            start_date=all_start,
            end_date=all_end,
        )

        if verbose:
            print(f"  Raw data shape: {data.shape}")
            print(f"  Feature count: {len(feature_cols)}")

        # Step 2: apply class-based filters (STFilter)
        if self.filters:
            if verbose:
                print(f"\n[2/6] Applying {len(self.filters)} filter(s)...")

            for filter_obj in self.filters:
                data_before = len(data)
                data = filter_obj.filter(data)
                data_after = len(data)

                if verbose:
                    print(f"  {filter_obj.__class__.__name__}:")
                    print(f"    before: {data_before}, after: {data_after}")
                    print(f"    removed: {data_before - data_after}")

        # Step 3: apply the stock-pool filter (function-based)
        if self.stock_pool_filter_func:
            if verbose:
                print("\n[3/6] Stock-pool filtering...")

            data_before = len(data)

            # Create a StockPoolManager
            pool_manager = StockPoolManager(
                filter_func=self.stock_pool_filter_func,
                required_columns=self.stock_pool_required_columns,
                data_router=engine.router,
            )

            data = pool_manager.filter_and_select_daily(data)
            data_after = len(data)

            if verbose:
                print(f"  before: {data_before}, after: {data_after}")
                print(f"  removed: {data_before - data_after}")

        # Step 4: data quality checks
        if verbose:
            print("\n[4/6] Data quality checks...")

        self._check_data_quality(data, feature_cols, verbose=verbose)

        # Step 5: data splitting
        if verbose:
            print("\n[5/6] Splitting data...")

        split_data = self._split_data(data, date_range, feature_cols, label_name, verbose=verbose)

        # Step 6: preprocessing
        if verbose:
            print("\n[6/6] Preprocessing...")

        split_data = self._preprocess(split_data, verbose=verbose)

        if verbose:
            print("\n" + "=" * 80)
            print("Data pipeline complete")
            print("=" * 80)

        return split_data

    def _check_data_quality(
        self,
        data: pl.DataFrame,
        feature_cols: List[str],
        verbose: bool = True,
    ) -> None:
        """Check data quality.

        Args:
            data: The data frame.
            feature_cols: Feature column names.
            verbose: Whether to print info.
        """
        # Check for missing values
        null_counts = {}
        for col in feature_cols:
            null_count = data[col].null_count()
            if null_count > 0:
                null_counts[col] = null_count

        if null_counts and verbose:
            print("  [Warning] Missing values found:")
            for col, count in list(null_counts.items())[:5]:  # show only the first 5
                pct = count / len(data) * 100
                print(f"    {col}: {count} ({pct:.2f}%)")

    def _split_data(
        self,
        data: pl.DataFrame,
        date_range: Dict[str, Tuple[str, str]],
        feature_cols: List[str],
        label_name: str,
        verbose: bool = True,
    ) -> Dict[str, Dict[str, Any]]:
        """Split the dataset.

        Args:
            data: The full dataset.
            date_range: Date ranges.
            feature_cols: Feature column names.
            label_name: Label column name.
            verbose: Whether to print info.

        Returns:
            The split data dict.
        """
        result = {}

        for split_name, (start, end) in date_range.items():
            mask = (data["trade_date"] >= start) & (data["trade_date"] <= end)
            split_df = data.filter(mask)

            result[split_name] = {
                "X": split_df.select(feature_cols),
                "y": split_df[label_name],
                "raw_data": split_df,
                "feature_cols": feature_cols,
            }

            if verbose:
                print(f"  {split_name}: {len(split_df)} rows")

        return result

    def _preprocess(
        self,
        split_data: Dict[str, Dict[str, Any]],
        verbose: bool = True,
    ) -> Dict[str, Dict[str, Any]]:
        """Preprocess the data.

        The training split uses fit_transform; the validation and test splits use transform.

        Args:
            split_data: The split data dict.
            verbose: Whether to print info.

        Returns:
            The preprocessed data dict.
        """
        if not self.processors:
            return split_data

        self.fitted_processors = []

        # Training split: fit_transform
        if verbose:
            print("  Preprocessing the training split (fit_transform)...")

        train_data = split_data["train"]["raw_data"]
        for processor in self.processors:
            train_data = processor.fit_transform(train_data)
            self.fitted_processors.append(processor)

        # Update the training split
        split_data["train"]["raw_data"] = train_data
        split_data["train"]["X"] = train_data.select(split_data["train"]["feature_cols"])
        split_data["train"]["y"] = train_data[split_data["train"]["y"].name]

        # Validation and test splits: transform
        for split_name in ["val", "test"]:
            if split_name in split_data:
                if verbose:
                    print(f"  Preprocessing the {split_name} split (transform)...")

                split_df = split_data[split_name]["raw_data"]
                for processor in self.fitted_processors:
                    split_df = processor.transform(split_df)

                split_data[split_name]["raw_data"] = split_df
                split_data[split_name]["X"] = split_df.select(split_data[split_name]["feature_cols"])
                split_data[split_name]["y"] = split_df[split_data[split_name]["y"].name]

        return split_data

    def get_fitted_processors(self) -> List[BaseProcessor]:
        """Return the fitted processors (used when saving the model)."""
        return self.fitted_processors
```
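
The `fit_transform`/`transform` split in `_preprocess` above is the standard leakage guard: statistics are learned on the training split only and then reused unchanged on the validation and test splits. A toy processor (hypothetical `MeanFiller`, not part of the codebase, using plain lists instead of DataFrames) illustrates the contract:

```python
class MeanFiller:
    """Fills nulls with the column mean learned from the training split."""

    def __init__(self):
        self.mean = None

    def fit_transform(self, values):
        known = [v for v in values if v is not None]
        self.mean = sum(known) / len(known)  # statistic learned on train only
        return [self.mean if v is None else v for v in values]

    def transform(self, values):
        # Reuses the mean learned on train: no peeking at val/test
        return [self.mean if v is None else v for v in values]

p = MeanFiller()
train = p.fit_transform([1.0, None, 3.0])  # mean learned here: 2.0
val = p.transform([None, 5.0])             # null filled with the train mean
print(train, val)  # [1.0, 2.0, 3.0] [2.0, 5.0]
```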

**Step 2: Create test file**

```python
"""DataPipeline tests."""

from unittest.mock import Mock

from src.training.pipeline import DataPipeline
from src.training.factor_manager import FactorManager
from src.training.components.processors import NullFiller


class TestDataPipeline:
    """Tests for DataPipeline."""

    def test_init(self):
        """Test initialization."""
        fm = Mock(spec=FactorManager)
        processors = [NullFiller(feature_cols=["f1"])]

        pipeline = DataPipeline(
            factor_manager=fm,
            processors=processors,
        )

        assert pipeline.factor_manager == fm
        assert pipeline.processors == processors
        assert pipeline.fitted_processors == []

    def test_get_fitted_processors(self):
        """Test fetching the fitted processors."""
        pipeline = DataPipeline(
            factor_manager=Mock(),
            processors=[],
        )

        # Simulate fitted processors
        pipeline.fitted_processors = [Mock()]

        assert len(pipeline.get_fitted_processors()) == 1
```

**Step 3: Commit**

```bash
git add src/training/pipeline.py tests/test_pipeline.py
git commit -m "feat(training): add DataPipeline component

- Complete data processing pipeline: register factors, filter, split, preprocess
- Support both class filters (STFilter) and function filters (stock_pool_filter)
- Proper fit_transform/transform separation for processors
- Add comprehensive tests"
```

---

## Task 5: Create the Task strategy components

**Files:**

- Create: `src/training/tasks/base.py`
- Create: `src/training/tasks/regression_task.py`
- Create: `src/training/tasks/rank_task.py`
- Create: `src/training/tasks/__init__.py`
- Test: `tests/test_tasks.py`

**Step 1: Create base Task protocol**

```python
"""Task abstract base class.

Defines the Task interface; every concrete task must implement it.
"""

from abc import ABC, abstractmethod
from typing import Any, Dict

import numpy as np


class BaseTask(ABC):
    """Task abstract base class.

    Every training task (regression, learning-to-rank, classification, ...)
    must inherit from this class. It provides a unified interface:
    label preparation, model training, prediction, and evaluation.

    Attributes:
        label_name: Label column name.
        model_params: Model parameter dict.
    """

    def __init__(self, model_params: Dict[str, Any], label_name: str):
        """Initialize the task.

        Args:
            model_params: Model parameter dict.
            label_name: Label column name.
        """
        self.model_params = model_params
        self.label_name = label_name
        self.model = None

    @abstractmethod
    def prepare_labels(self, data: Dict[str, Dict]) -> Dict[str, Dict]:
        """Prepare the labels.

        Subclasses may implement task-specific label transformations
        (e.g. the quantile transform for learning-to-rank).

        Args:
            data: The data dict.

        Returns:
            The processed data dict.
        """
        raise NotImplementedError

    @abstractmethod
    def fit(self, train_data: Dict, val_data: Dict) -> None:
        """Train the model.

        Args:
            train_data: Training data dict {"X": DataFrame, "y": Series, ...}.
            val_data: Validation data dict.
        """
        raise NotImplementedError

    @abstractmethod
    def predict(self, test_data: Dict) -> np.ndarray:
        """Generate predictions.

        Args:
            test_data: Test data dict.

        Returns:
            The prediction array.
        """
        raise NotImplementedError

    def get_model(self):
        """Return the underlying trained model instance."""
        return self.model

    def plot_training_metrics(self) -> None:
        """Plot training metric curves (optional)."""
        pass
```
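
The Trainer is expected to drive any task purely through this interface: `prepare_labels`, then `fit`, then `predict`. A toy subclass (hypothetical `MeanBaselineTask`, with the base class re-sketched inline so the snippet is self-contained) shows the call sequence:

```python
from abc import ABC, abstractmethod

# Condensed re-sketch of BaseTask for this example
class BaseTask(ABC):
    def __init__(self, model_params, label_name):
        self.model_params = model_params
        self.label_name = label_name
        self.model = None

    @abstractmethod
    def prepare_labels(self, data): ...
    @abstractmethod
    def fit(self, train_data, val_data): ...
    @abstractmethod
    def predict(self, test_data): ...

class MeanBaselineTask(BaseTask):
    """Toy task: predicts the training-label mean for every row."""

    def prepare_labels(self, data):
        return data  # no transformation, like RegressionTask

    def fit(self, train_data, val_data):
        y = train_data["y"]
        self.model = sum(y) / len(y)

    def predict(self, test_data):
        return [self.model] * len(test_data["X"])

# The same sequence the Trainer would run
task = MeanBaselineTask(model_params={}, label_name="future_return_5")
data = {"train": {"X": [[1], [2]], "y": [1.0, 3.0]}, "test": {"X": [[3]]}}
data = task.prepare_labels(data)
task.fit(data["train"], None)
print(task.predict(data["test"]))  # [2.0]
```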

**Step 2: Create RegressionTask**

```python
"""Regression task implementation.

Implements the regression training flow:
- Labels need no transformation (kept continuous)
- Uses the LightGBM regression model
- Supports MAE/RMSE evaluation
"""

from typing import Any, Dict, Optional

import numpy as np

from src.training.tasks.base import BaseTask
from src.training.components.models.lightgbm import LightGBMModel


class RegressionTask(BaseTask):
    """Regression task.

    Trains a regression model with LightGBM, with early stopping and
    training-curve plotting.
    """

    def __init__(
        self,
        model_params: Dict[str, Any],
        label_name: str = "future_return_5",
    ):
        """Initialize the regression task.

        Args:
            model_params: LightGBM parameter dict.
            label_name: Label column name.
        """
        super().__init__(model_params, label_name)
        self.evals_result: Optional[Dict] = None

    def prepare_labels(self, data: Dict[str, Dict]) -> Dict[str, Dict]:
        """Prepare the labels (no transformation for regression).

        Args:
            data: The data dict.

        Returns:
            The data dict unchanged.
        """
        # Regression needs no label transformation
        return data

    def fit(self, train_data: Dict, val_data: Dict) -> None:
        """Train the regression model.

        Args:
            train_data: Training data {"X": DataFrame, "y": Series}.
            val_data: Validation data.
        """
        self.model = LightGBMModel(params=self.model_params)

        X_train = train_data["X"]
        y_train = train_data["y"]
        X_val = val_data["X"]
        y_val = val_data["y"]

        self.model.fit(
            X_train, y_train,
            eval_set=(X_val, y_val) if X_val is not None else None
        )

    def predict(self, test_data: Dict) -> np.ndarray:
        """Generate predictions.

        Args:
            test_data: Test data.

        Returns:
            The predictions.
        """
        return self.model.predict(test_data["X"])

    def plot_training_metrics(self) -> None:
        """Plot training metric curves."""
        if self.model and hasattr(self.model, 'model') and self.model.model:
            try:
                import lightgbm as lgb
                import matplotlib.pyplot as plt

                fig, ax = plt.subplots(figsize=(10, 6))
                lgb.plot_metric(self.model.model, ax=ax)
                plt.title("Training Metrics", fontsize=12, fontweight="bold")
                plt.tight_layout()
                plt.show()
            except Exception as e:
                print(f"[Warning] Could not plot training curves: {e}")
```

**Step 3: Create RankTask (`src/training/tasks/rank_task.py`)**

```python
"""Learning-to-rank task implementation.

Training flow for the ranking task:
- converts labels to quantile labels
- builds the group array
- uses LightGBM LambdaRank
- supports NDCG@k evaluation
"""

from typing import Any, Dict, List, Optional

import numpy as np
import polars as pl

from src.training.tasks.base import BaseTask
from src.training.components.models.lightgbm_lambdarank import LightGBMLambdaRankModel


class RankTask(BaseTask):
    """Learning-to-rank task.

    Trains LightGBM LambdaRank, converting continuous returns into
    quantile labels for training.
    """

    def __init__(
        self,
        model_params: Dict[str, Any],
        label_name: str = "future_return_5",
        n_quantiles: int = 20,
    ):
        """Initialize the ranking task.

        Args:
            model_params: LightGBM parameter dict
            label_name: label column name
            n_quantiles: number of quantile buckets
        """
        super().__init__(model_params, label_name)
        self.n_quantiles = n_quantiles

    def prepare_labels(self, data: Dict[str, Dict]) -> Dict[str, Dict]:
        """Prepare labels (convert to quantile labels).

        Converts continuous returns to quantile labels and builds the
        group array for each split.

        Args:
            data: data dict

        Returns:
            the processed data dict (with y_rank and groups added)
        """
        for split in ["train", "val", "test"]:
            if split not in data:
                continue

            df = data[split]["raw_data"]

            # Quantile conversion: rank within each trading date, then bucket
            rank_col = f"{self.label_name}_rank"
            df_ranked = (
                df.with_columns(
                    pl.col(self.label_name)
                    .rank(method="min")
                    .over("trade_date")
                    .alias("_rank")
                )
                .with_columns(
                    ((pl.col("_rank") - 1) / pl.len().over("trade_date") * self.n_quantiles)
                    .floor()
                    .cast(pl.Int64)
                    .clip(0, self.n_quantiles - 1)
                    .alias(rank_col)
                )
                .drop("_rank")
            )

            # Update the split
            data[split]["raw_data"] = df_ranked
            data[split]["y"] = df_ranked[rank_col]
            data[split]["y_raw"] = df_ranked[self.label_name]  # keep raw values

            # Build the group array
            data[split]["groups"] = self._compute_group_array(df_ranked, "trade_date")

        return data

    def _compute_group_array(
        self,
        df: pl.DataFrame,
        date_col: str = "trade_date",
    ) -> np.ndarray:
        """Compute the group array.

        Args:
            df: data frame
            date_col: date column name

        Returns:
            group array (sample count per date)
        """
        group_counts = df.group_by(date_col, maintain_order=True).agg(
            pl.len().alias("count")
        )
        return group_counts["count"].to_numpy()

    def fit(self, train_data: Dict, val_data: Dict) -> None:
        """Train the ranking model.

        Args:
            train_data: training data
            val_data: validation data
        """
        self.model = LightGBMLambdaRankModel(params=self.model_params)

        self.model.fit(
            train_data["X"], train_data["y"],
            group=train_data["groups"],
            eval_set=(val_data["X"], val_data["y"], val_data["groups"]) if val_data else None,
        )

    def predict(self, test_data: Dict) -> np.ndarray:
        """Generate predictions for the test split."""
        return self.model.predict(test_data["X"])

    def evaluate_ndcg(
        self,
        test_data: Dict,
        k_list: Optional[List[int]] = None,
    ) -> Dict[str, float]:
        """Evaluate NDCG@k.

        Args:
            test_data: test data
            k_list: list of k values, defaults to [1, 5, 10, 20]

        Returns:
            dict of NDCG scores {"ndcg@1": score, ...}
        """
        from sklearn.metrics import ndcg_score

        if k_list is None:
            k_list = [1, 5, 10, 20]

        y_true = test_data["y_raw"]
        y_pred = self.predict(test_data)
        groups = test_data["groups"]

        results = {}

        # Split the flat arrays back into per-date groups
        start_idx = 0
        y_true_groups = []
        y_pred_groups = []

        for group_size in groups:
            end_idx = start_idx + group_size
            y_true_groups.append(y_true.to_numpy()[start_idx:end_idx])
            y_pred_groups.append(y_pred[start_idx:end_idx])
            start_idx = end_idx

        # Compute NDCG for each k
        for k in k_list:
            ndcg_scores = []
            for yt, yp in zip(y_true_groups, y_pred_groups):
                if len(yt) > 1:
                    try:
                        score = ndcg_score([yt], [yp], k=k)
                        ndcg_scores.append(score)
                    except ValueError:
                        pass

            results[f"ndcg@{k}"] = float(np.mean(ndcg_scores)) if ndcg_scores else 0.0

        return results

    def plot_training_metrics(self) -> None:
        """Plot the training metric curves (NDCG)."""
        if self.model:
            try:
                self.model.plot_all_metrics()
            except Exception as e:
                print(f"[WARN] could not plot training curves: {e}")
```
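
The quantile formula above — `floor((rank - 1) / n_per_date * n_quantiles)`, clipped to `[0, n_quantiles - 1]` — can be checked in isolation with plain Python (no polars), e.g. for the samples of one trading date:

```python
import math


def quantile_labels(values, n_quantiles=20):
    """Mirror of the polars expression: rank within the group
    (method="min", so ties share the smallest rank), then map
    ranks to integer quantile buckets in [0, n_quantiles - 1]."""
    n = len(values)
    # rank(method="min") is 1-based
    ranks = [1 + sum(1 for w in values if w < v) for v in values]
    return [
        min(n_quantiles - 1, max(0, math.floor((r - 1) / n * n_quantiles)))
        for r in ranks
    ]


# 4 samples into 4 buckets: each sample lands in its own quantile
print(quantile_labels([0.05, -0.02, 0.10, 0.01], n_quantiles=4))  # → [2, 0, 3, 1]
# ties share a bucket
print(quantile_labels([1.0, 1.0], n_quantiles=2))  # → [0, 0]
```

This also shows why the `clip` is needed only as a safety net: with `rank - 1` in `[0, n - 1]`, the bucket index already stays below `n_quantiles` except for floating-point edge cases.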

**Step 4: Create tasks/__init__.py**

```python
"""Tasks module.

Provides the training task implementations.
"""

from src.training.tasks.base import BaseTask
from src.training.tasks.regression_task import RegressionTask
from src.training.tasks.rank_task import RankTask

__all__ = [
    "BaseTask",
    "RegressionTask",
    "RankTask",
]
```

**Step 5: Create test file (`tests/test_tasks.py`)**

```python
"""Task tests."""

from unittest.mock import Mock

import polars as pl

from src.training.tasks import RegressionTask, RankTask


class TestRegressionTask:
    """Tests for RegressionTask."""

    def test_init(self):
        """Initialization stores params and leaves the model unset."""
        task = RegressionTask(
            model_params={"objective": "regression"},
            label_name="target",
        )

        assert task.model_params == {"objective": "regression"}
        assert task.label_name == "target"
        assert task.model is None

    def test_prepare_labels(self):
        """Regression returns labels unchanged."""
        task = RegressionTask(model_params={}, label_name="target")

        data = {"train": {"y": Mock()}}
        result = task.prepare_labels(data)

        # The regression task must return the data as-is
        assert result == data


class TestRankTask:
    """Tests for RankTask."""

    def test_init(self):
        """Initialization stores the quantile count."""
        task = RankTask(
            model_params={"objective": "lambdarank"},
            label_name="target",
            n_quantiles=10,
        )

        assert task.n_quantiles == 10

    def test_compute_group_array(self):
        """Group array counts samples per date."""
        task = RankTask(model_params={}, label_name="target")

        # Build test data
        df = pl.DataFrame({
            "trade_date": ["20240101", "20240101", "20240102", "20240102", "20240102"],
            "value": [1, 2, 3, 4, 5],
        })

        groups = task._compute_group_array(df, "trade_date")

        assert len(groups) == 2  # two dates
        assert groups[0] == 2  # first date has 2 rows
        assert groups[1] == 3  # second date has 3 rows
```

**Step 6: Commit**

```bash
git add src/training/tasks/
git add tests/test_tasks.py
git commit -m "feat(training): add Task strategy components

- Add BaseTask abstract base class
- Add RegressionTask for regression training
- Add RankTask for learning-to-rank with LambdaRank
- Support quantile label conversion and NDCG evaluation
- Add comprehensive tests"
```

---

## Task 6: Create the ResultAnalyzer component

**Files:**
- Create: `src/training/result_analyzer.py`
- Test: `tests/test_result_analyzer.py`

**Step 1: Create the ResultAnalyzer implementation**

```python
"""Result analyzer.

Post-training analysis and result handling:
1. feature-importance analysis (top N, zero-contribution features)
2. result assembly (daily top-N lists)
3. result saving
"""

import os
from typing import Any, Dict, List

import numpy as np
import polars as pl


class ResultAnalyzer:
    """Result analyzer.

    Analyzes training results, prints reports, and saves output.
    """

    def analyze_feature_importance(
        self,
        model,
        feature_cols: List[str],
        top_n: int = 20,
        verbose: bool = True,
    ) -> Dict[str, Any]:
        """Analyze feature importance.

        Args:
            model: trained model
            feature_cols: feature column names
            top_n: number of top features to show
            verbose: whether to print the report

        Returns:
            analysis result dict
        """
        importance = model.feature_importance()

        if importance is None:
            if verbose:
                print("[WARN] feature importance is unavailable")
            return {}

        # Sort by importance
        importance_sorted = importance.sort_values(ascending=False)

        # Percentage of total importance
        total_importance = importance_sorted.sum()
        importance_pct = (importance_sorted / total_importance * 100).round(2)

        # Identify zero-contribution features
        zero_importance_features = importance_sorted[importance_sorted == 0].index.tolist()

        if verbose:
            print("\n" + "=" * 80)
            print("Feature importance analysis")
            print("=" * 80)

            # Print top N
            print(f"\nTop {top_n} features:")
            print("-" * 80)
            print(f"{'rank':<6}{'feature':<35}{'importance':<15}{'share':<10}")
            print("-" * 80)

            for i, (feature, score) in enumerate(importance_sorted.head(top_n).items(), 1):
                pct = importance_pct[feature]
                if pct >= 10:
                    marker = " [high]"
                elif pct >= 1:
                    marker = " [mid]"
                else:
                    marker = " [low]"
                print(f"{i:<6}{feature:<35}{score:<15.2f}{pct:<8.2f}%{marker}")

            # Print zero-contribution features
            if zero_importance_features:
                print("\n" + "-" * 80)
                print(f"[WARN] features with zero contribution ({len(zero_importance_features)} total):")
                for i, feature in enumerate(zero_importance_features, 1):
                    print(f"  {i}. {feature}")

            # Summary statistics
            print("\n" + "=" * 80)
            print("Summary:")
            print("-" * 80)
            print(f"  total features: {len(importance_sorted)}")
            print(f"  contributing features: {len(importance_sorted) - len(zero_importance_features)}")
            print(f"  zero-contribution features: {len(zero_importance_features)}")
            if len(importance_sorted) > 0:
                print(f"  zero-contribution share: {len(zero_importance_features) / len(importance_sorted) * 100:.1f}%")
                print(f"  top {top_n} cumulative share: {importance_pct.head(top_n).sum():.1f}%")
            print("=" * 80)

        return {
            "importance": importance_sorted,
            "importance_pct": importance_pct,
            "zero_importance_features": zero_importance_features,
            "top_n": importance_sorted.head(top_n),
        }

    def assemble_results(
        self,
        test_data: Dict[str, Any],
        predictions: np.ndarray,
        top_n: int = 50,
        verbose: bool = True,
    ) -> pl.DataFrame:
        """Assemble results.

        Builds the daily top-N stock recommendation list.

        Args:
            test_data: test data dict
            predictions: prediction array
            top_n: stocks to select per day
            verbose: whether to print the summary

        Returns:
            result data frame
        """
        # Attach the prediction column
        raw_data = test_data["raw_data"]
        results = raw_data.with_columns([
            pl.Series("prediction", predictions)
        ])

        # Take the top N per date
        unique_dates = results["trade_date"].unique().sort()
        topn_by_date = []

        for date in unique_dates:
            day_data = results.filter(results["trade_date"] == date)
            topn = day_data.sort("prediction", descending=True).head(top_n)
            topn_by_date.append(topn)

        # Concatenate the daily top-N frames
        topn_results = pl.concat(topn_by_date)

        if verbose:
            print(f"\nBuilt daily top-{top_n} stock list:")
            print(f"  trading days: {len(unique_dates)}")
            print(f"  total recommendations: {len(topn_results)}")

        return topn_results

    def save_results(
        self,
        results: pl.DataFrame,
        output_path: str,
        verbose: bool = True,
    ) -> None:
        """Save results.

        Args:
            results: result data frame
            output_path: output path
            verbose: whether to print the summary
        """
        # Format dates (YYYYMMDD -> YYYY-MM-DD) and reorder columns
        formatted = results.select([
            (pl.col("trade_date").str.slice(0, 4) + "-" +
             pl.col("trade_date").str.slice(4, 2) + "-" +
             pl.col("trade_date").str.slice(6, 2)).alias("date"),
            pl.col("prediction").alias("score"),
            pl.col("ts_code"),
        ])

        # Ensure the directory exists
        os.makedirs(os.path.dirname(output_path), exist_ok=True)

        # Save as CSV
        formatted.write_csv(output_path, include_header=True)

        if verbose:
            print(f"  saved to: {output_path}")
            print(f"  rows saved: {len(formatted)}")
```
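
The per-date loop in `assemble_results` is correct but O(dates) filter passes; the same "top N per group" selection can be expressed as one sort followed by a grouped head. A plain-Python sketch of the selection logic on hypothetical toy rows (not the real schema):

```python
from itertools import groupby
from operator import itemgetter

rows = [
    {"trade_date": "20240101", "ts_code": "000001.SZ", "prediction": 0.5},
    {"trade_date": "20240101", "ts_code": "000002.SZ", "prediction": 0.3},
    {"trade_date": "20240102", "ts_code": "000001.SZ", "prediction": 0.8},
    {"trade_date": "20240102", "ts_code": "000002.SZ", "prediction": 0.2},
]


def daily_top_n(rows, top_n=1):
    """Sort by (date, -prediction), then keep the first top_n rows per date."""
    ordered = sorted(rows, key=lambda r: (r["trade_date"], -r["prediction"]))
    out = []
    for _, group in groupby(ordered, key=itemgetter("trade_date")):
        out.extend(list(group)[:top_n])
    return out


print([r["ts_code"] for r in daily_top_n(rows)])  # → ['000001.SZ', '000001.SZ']
```

In polars the equivalent should be roughly `results.sort("prediction", descending=True).group_by("trade_date", maintain_order=True).head(top_n)`; either form is fine, but the single-expression version avoids one filter pass per trading day.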

**Step 2: Create test file (`tests/test_result_analyzer.py`)**

```python
"""ResultAnalyzer tests."""

from unittest.mock import Mock

import numpy as np
import pandas as pd
import polars as pl

from src.training.result_analyzer import ResultAnalyzer


class TestResultAnalyzer:
    """Tests for ResultAnalyzer."""

    def test_init(self):
        """The analyzer constructs without arguments."""
        analyzer = ResultAnalyzer()
        assert analyzer is not None

    def test_analyze_feature_importance(self):
        """Zero-importance features are identified."""
        analyzer = ResultAnalyzer()

        # Build a mock model
        model = Mock()
        importance = pd.Series(
            [100, 50, 0, 0, 30],
            index=["feat1", "feat2", "feat3", "feat4", "feat5"]
        )
        model.feature_importance.return_value = importance

        result = analyzer.analyze_feature_importance(
            model=model,
            feature_cols=["feat1", "feat2", "feat3", "feat4", "feat5"],
            top_n=3,
            verbose=False,
        )

        assert "importance" in result
        assert "zero_importance_features" in result
        assert len(result["zero_importance_features"]) == 2  # feat3, feat4

    def test_assemble_results(self):
        """One stock is selected per day when top_n=1."""
        analyzer = ResultAnalyzer()

        # Build test data
        test_data = {
            "raw_data": pl.DataFrame({
                "trade_date": ["20240101", "20240101", "20240102", "20240102"],
                "ts_code": ["000001.SZ", "000002.SZ", "000001.SZ", "000002.SZ"],
            })
        }
        predictions = np.array([0.5, 0.3, 0.8, 0.2])

        results = analyzer.assemble_results(
            test_data=test_data,
            predictions=predictions,
            top_n=1,
            verbose=False,
        )

        assert len(results) == 2  # one pick per day across two days
```

**Step 3: Commit**

```bash
git add src/training/result_analyzer.py tests/test_result_analyzer.py
git commit -m "feat(training): add ResultAnalyzer component

- Analyze feature importance with top N and zero-contribution features
- Assemble daily Top N stock recommendations
- Save results to CSV with proper formatting
- Add comprehensive tests"
```

---

## Task 7: Refactor Trainer into an orchestration engine

**Files:**
- Create: `src/training/core/trainer_new.py` (new implementation)
- Modify: `src/training/__init__.py` - add the new exports

**Step 1: Create new Trainer implementation**

```python
"""Training orchestration engine.

Coordinates FactorManager, DataPipeline, Task, and ResultAnalyzer to run
the full training workflow.
"""

import os
from typing import Any, Dict, Optional, Tuple

import polars as pl

from src.factors import FactorEngine
from src.training.pipeline import DataPipeline
from src.training.tasks.base import BaseTask
from src.training.result_analyzer import ResultAnalyzer


class Trainer:
    """Training orchestration engine.

    Drives the components through the full workflow:
    1. prepare data (DataPipeline)
    2. prepare labels (Task)
    3. train the model (Task)
    4. plot metrics (Task)
    5. generate predictions (Task)
    6. analyze results (ResultAnalyzer)
    7. save results

    Attributes:
        data_pipeline: the data pipeline
        task: the task instance (RegressionTask/RankTask)
        analyzer: the result analyzer
        output_config: output configuration
        verbose: whether to print progress
        results: training results
    """

    def __init__(
        self,
        data_pipeline: DataPipeline,
        task: BaseTask,
        analyzer: Optional[ResultAnalyzer] = None,
        output_config: Optional[Dict[str, Any]] = None,
        verbose: bool = True,
    ):
        """Initialize the trainer.

        Args:
            data_pipeline: data pipeline instance
            task: task instance (RegressionTask or RankTask)
            analyzer: result analyzer (optional; a fresh instance by default)
            output_config: output configuration dict
            verbose: whether to print progress
        """
        self.data_pipeline = data_pipeline
        self.task = task
        self.analyzer = analyzer or ResultAnalyzer()
        self.output_config = output_config or {}
        self.verbose = verbose
        self.results: Optional[pl.DataFrame] = None

    def run(
        self,
        engine: FactorEngine,
        date_range: Dict[str, Tuple[str, str]],
    ) -> pl.DataFrame:
        """Run the full training workflow.

        Args:
            engine: FactorEngine instance
            date_range: date-range dict
                {
                    "train": (start_date, end_date),
                    "val": (start_date, end_date),
                    "test": (start_date, end_date),
                }

        Returns:
            training result data frame
        """
        if self.verbose:
            print("\n" + "=" * 80)
            print(f"Starting training: {self.task.__class__.__name__}")
            print("=" * 80)

        # Step 1: prepare data
        if self.verbose:
            print("\n[Step 1/7] Preparing data...")

        data = self.data_pipeline.prepare_data(
            engine=engine,
            date_range=date_range,
            label_name=self.task.label_name,
            verbose=self.verbose,
        )

        # Step 2: prepare labels
        if self.verbose:
            print("\n[Step 2/7] Preparing labels...")

        data = self.task.prepare_labels(data)

        # Step 3: train the model
        if self.verbose:
            print("\n[Step 3/7] Training model...")

        self.task.fit(data["train"], data["val"])

        # Step 4: plot training metrics
        if self.verbose:
            print("\n[Step 4/7] Plotting training metrics...")

        self.task.plot_training_metrics()

        # Step 5: generate predictions
        if self.verbose:
            print("\n[Step 5/7] Generating predictions...")

        predictions = self.task.predict(data["test"])

        # Step 6: analyze results
        if self.verbose:
            print("\n[Step 6/7] Analyzing results...")

        # Feature importance
        self.analyzer.analyze_feature_importance(
            model=self.task.get_model(),
            feature_cols=data["test"]["feature_cols"],
            top_n=20,
            verbose=self.verbose,
        )

        # NDCG evaluation (ranking task only)
        if hasattr(self.task, "evaluate_ndcg"):
            ndcg_scores = self.task.evaluate_ndcg(data["test"])
            if self.verbose:
                print("\nNDCG evaluation:")
                for metric, score in ndcg_scores.items():
                    print(f"  {metric}: {score:.4f}")

        # Assemble results
        self.results = self.analyzer.assemble_results(
            test_data=data["test"],
            predictions=predictions,
            top_n=self.output_config.get("top_n", 50),
            verbose=self.verbose,
        )

        # Step 7: save results
        if self.verbose:
            print("\n[Step 7/7] Saving results...")

        if self.output_config.get("save_predictions", True):
            self._save_predictions()

        if self.output_config.get("save_model", False):
            self._save_model()

        if self.verbose:
            print("\n" + "=" * 80)
            print("Training complete!")
            print("=" * 80)

        return self.results

    def _save_predictions(self) -> None:
        """Save the prediction results."""
        output_dir = self.output_config.get("output_dir", "experiment/output")
        output_filename = self.output_config.get("output_filename", "output.csv")
        output_path = os.path.join(output_dir, output_filename)

        self.analyzer.save_results(
            results=self.results,
            output_path=output_path,
            verbose=self.verbose,
        )

    def _save_model(self) -> None:
        """Save the model."""
        model_save_path = self.output_config.get("model_save_path")
        if not model_save_path:
            return

        # Ensure the directory exists
        os.makedirs(os.path.dirname(model_save_path), exist_ok=True)

        # Fetch and save the trained model
        model = self.task.get_model()
        model.save(model_save_path)

        if self.verbose:
            print(f"  model saved to: {model_save_path}")

    def get_results(self) -> Optional[pl.DataFrame]:
        """Return the training results, or None if training has not run."""
        return self.results

    def get_task(self) -> BaseTask:
        """Return the task instance."""
        return self.task
```
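
Because `run()` only touches the collaborators through their interfaces, the orchestration contract can be exercised without any real data by duck-typing the pipeline and task. A minimal sketch (toy stubs, not the real classes; analysis and saving omitted) showing the call order the engine guarantees:

```python
class StubPipeline:
    def prepare_data(self, engine, date_range, label_name, verbose):
        calls.append("prepare_data")
        return {"train": {}, "val": {}, "test": {"X": [1, 2]}}


class StubTask:
    label_name = "future_return_5"

    def prepare_labels(self, data):
        calls.append("prepare_labels")
        return data

    def fit(self, train, val):
        calls.append("fit")

    def plot_training_metrics(self):
        calls.append("plot")

    def predict(self, test):
        calls.append("predict")
        return [0.1, 0.2]


calls = []
pipeline, task = StubPipeline(), StubTask()

# Same sequence Trainer.run() drives (steps 1-5)
data = pipeline.prepare_data(None, {}, task.label_name, False)
data = task.prepare_labels(data)
task.fit(data["train"], data["val"])
task.plot_training_metrics()
preds = task.predict(data["test"])

print(calls)  # → ['prepare_data', 'prepare_labels', 'fit', 'plot', 'predict']
```

A unit test for the real `Trainer` can follow the same pattern with `unittest.mock` to assert the step ordering without touching LightGBM or the factor data.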

**Step 2: Update __init__.py to export new components**

Add to `src/training/__init__.py`:

```python
# New exports (modular Trainer components)
from src.training.factor_manager import FactorManager
from src.training.pipeline import DataPipeline
from src.training.result_analyzer import ResultAnalyzer
from src.training.tasks import RegressionTask, RankTask

# Optionally export the new Trainer, or keep the existing Trainer unchanged
# from src.training.core.trainer_new import Trainer as ModularTrainer

__all__ = [
    # Existing exports
    "Trainer",
    "DateSplitter",
    "StockPoolManager",
    "check_data_quality",
    "STFilter",
    "Winsorizer",
    "NullFiller",
    "StandardScaler",
    "CrossSectionalStandardScaler",
    "TrainingConfig",
    # New exports
    "FactorManager",
    "DataPipeline",
    "ResultAnalyzer",
    "RegressionTask",
    "RankTask",
]
```

**Step 3: Run basic import tests**

```bash
uv run python -c "from src.training import FactorManager, DataPipeline, RegressionTask, RankTask, ResultAnalyzer; print('All imports successful')"
```

Expected: All imports successful

**Step 4: Commit**

```bash
git add src/training/core/trainer_new.py
git add src/training/__init__.py
git add src/training/factor_manager.py
git add src/training/pipeline.py
git add src/training/result_analyzer.py
git add src/training/tasks/
git commit -m "feat(training): add modular Trainer architecture

- Add FactorManager for unified factor management
- Add DataPipeline for complete data processing workflow
- Add Task strategy components (RegressionTask, RankTask)
- Add ResultAnalyzer for post-training analysis
- Add new Trainer as orchestration engine
- Update __init__.py exports"
```

---

## Task 8: Rewrite regression.py on the new architecture

**Files:**
- Create: `src/experiment/regression_v2.py` (new implementation)
- Keep: `src/experiment/regression.py` (kept for reference, with a migration note added)

**Step 1: Create new regression.py with new architecture**

```python
# %% md
# # LightGBM regression training flow (modular version)
#
# Uses the new modular Trainer architecture.
# %% md
# ## 1. Imports
# %%
from src.training import (
    Trainer,
    DataPipeline,
    FactorManager,
    RegressionTask,
    NullFiller,
    Winsorizer,
    StandardScaler,
)
from src.training.components.filters import STFilter
from src.experiment.common import (
    create_training_config,
    create_regression_config,
    FactorEngine,
)

# %% md
# ## 2. Configuration
# %%
# Build the unified configuration
training_config = create_training_config()
model_config = create_regression_config()

print("Training configuration:")
print(f"  train: {training_config.train_start} - {training_config.train_end}")
print(f"  val:   {training_config.val_start} - {training_config.val_end}")
print(f"  test:  {training_config.test_start} - {training_config.test_end}")
print(f"  features: {len(training_config.selected_factors)}")
print(f"  label: {model_config.label_name}")

# %% md
# ## 3. Build the components
# %%
# 1. Create the FactorEngine
engine = FactorEngine()

# 2. Create the FactorManager
factor_manager = FactorManager(
    selected_factors=training_config.selected_factors,
    factor_definitions=training_config.factor_definitions,
    label_factor=training_config.label_factor,
    excluded_factors=training_config.excluded_factors,
)

# 3. Create the DataPipeline
processors = [
    NullFiller(strategy="mean"),
    Winsorizer(lower=0.01, upper=0.99),
    StandardScaler(),
]

filters = [STFilter(data_router=engine.router)] if training_config.st_filter_enabled else []

pipeline = DataPipeline(
    factor_manager=factor_manager,
    processors=processors,
    filters=filters,
    stock_pool_filter_func=training_config.stock_pool_filter,
    stock_pool_required_columns=training_config.stock_pool_required_columns,
)

# 4. Create the Task
task = RegressionTask(
    model_params=model_config.model_params,
    label_name=model_config.label_name,
)

# 5. Create the Trainer
output_config = {
    "output_dir": training_config.output_dir,
    "output_filename": "regression_output.csv",
    "save_predictions": training_config.save_predictions,
    "save_model": training_config.save_model,
    "model_save_path": f"{training_config.output_dir}/regression_model.txt",
    "top_n": training_config.top_n,
}

trainer = Trainer(
    data_pipeline=pipeline,
    task=task,
    output_config=output_config,
    verbose=True,
)

# %% md
# ## 4. Run training
# %%
results = trainer.run(
    engine=engine,
    date_range=training_config.date_range,
)

# %% md
# ## 5. Extra analysis (optional)
# %%
# Fetch the model for further analysis
model = task.get_model()

# Custom visualizations can go here
print("\nTraining complete!")
print(f"Results saved to: {output_config['output_dir']}/regression_output.csv")
```

**Step 2: Add deprecation notice to old regression.py**

Add at the top of the existing `regression.py`:

```python
# NOTE: this file has been migrated to regression_v2.py.
# The new file uses the modular Trainer architecture.
# This file is kept for reference and comparison.
```

**Step 3: Test new regression script**

```bash
# Note: this runs real training and may take a while;
# test with a small dataset first.
uv run python src/experiment/regression_v2.py
```

**Step 4: Commit**

```bash
git add src/experiment/regression_v2.py
git add src/experiment/regression.py  # deprecation notice added
git commit -m "feat(experiment): add modular regression training script

- Create regression_v2.py using new modular Trainer architecture
- Reduce code from 640 lines to ~80 lines
- Add deprecation notice to old regression.py
- All functionality preserved"
```

---

## Task 9: Rewrite learn_to_rank.py on the new architecture

**Files:**
- Create: `src/experiment/learn_to_rank_v2.py` (new implementation)
- Keep: `src/experiment/learn_to_rank.py` (kept in place, with a note that it has been migrated)

**Step 1: Create learn_to_rank_v2.py using the new architecture**

```python
# %% md
# # LightGBM LambdaRank learning-to-rank training flow (modular version)
#
# Uses the new modular Trainer architecture
# %% md
# ## 1. Imports
# %%
from src.training import (
    Trainer,
    DataPipeline,
    FactorManager,
    RankTask,
    NullFiller,
    Winsorizer,
    CrossSectionalStandardScaler,
)
from src.training.components.filters import STFilter
from src.experiment.common import (
    create_training_config,
    create_rank_config,
    FactorEngine,
)

# %% md
# ## 2. Configuration
# %%
# Build the unified configuration
training_config = create_training_config()
model_config = create_rank_config()

print("Training config:")
print(f"  Train period: {training_config.train_start} - {training_config.train_end}")
print(f"  Validation period: {training_config.val_start} - {training_config.val_end}")
print(f"  Test period: {training_config.test_start} - {training_config.test_end}")
print(f"  Feature count: {len(training_config.selected_factors)}")
print(f"  Label: {model_config.label_name}")
print(f"  Quantiles: {model_config.n_quantiles}")

# %% md
# ## 3. Build the components
# %%
# 1. Create the FactorEngine
engine = FactorEngine()

# 2. Create the FactorManager
factor_manager = FactorManager(
    selected_factors=training_config.selected_factors,
    factor_definitions=training_config.factor_definitions,
    label_factor=training_config.label_factor,
    excluded_factors=training_config.excluded_factors,
)

# 3. Create the DataPipeline (with cross-sectional standardization)
processors = [
    NullFiller(strategy="mean"),
    Winsorizer(lower=0.01, upper=0.99),
    CrossSectionalStandardScaler(),
]

filters = [STFilter(data_router=engine.router)] if training_config.st_filter_enabled else []

pipeline = DataPipeline(
    factor_manager=factor_manager,
    processors=processors,
    filters=filters,
    stock_pool_filter_func=training_config.stock_pool_filter,
    stock_pool_required_columns=training_config.stock_pool_required_columns,
)

# 4. Create the Task (n_quantiles is specific to learning-to-rank)
task = RankTask(
    model_params=model_config.model_params,
    label_name=model_config.label_name,
    n_quantiles=model_config.n_quantiles,
)

# 5. Create the Trainer
output_config = {
    "output_dir": training_config.output_dir,
    "output_filename": "rank_output.csv",
    "save_predictions": training_config.save_predictions,
    "save_model": training_config.save_model,
    "model_save_path": f"{training_config.output_dir}/rank_model.txt",
    "top_n": training_config.top_n,
}

trainer = Trainer(
    data_pipeline=pipeline,
    task=task,
    output_config=output_config,
    verbose=True,
)

# %% md
# ## 4. Run training
# %%
results = trainer.run(
    engine=engine,
    date_range=training_config.date_range,
)

# %% md
# ## 5. Further analysis (NDCG)
# %%
# NDCG evaluation already runs automatically inside Trainer.run()
# Add extra visualizations here as needed

print("\nTraining complete!")
print(f"Results saved to: {output_config['output_dir']}/rank_output.csv")
```

**Step 2: Add a deprecation notice to the old learn_to_rank.py**

Add at the top of the existing `learn_to_rank.py`:

```python
# NOTE: this file has been superseded by learn_to_rank_v2.py,
# which uses the modular Trainer architecture.
# It is kept for reference and comparison.
```

**Step 3: Test the new learn_to_rank script**

```bash
# NOTE: this runs an actual training job
uv run python src/experiment/learn_to_rank_v2.py
```

**Step 4: Commit**

```bash
git add src/experiment/learn_to_rank_v2.py
git add src/experiment/learn_to_rank.py  # deprecation notice added
git commit -m "feat(experiment): add modular learn-to-rank training script

- Create learn_to_rank_v2.py using new modular Trainer architecture
- Reduce code from 876 lines to ~80 lines
- Add deprecation notice to old learn_to_rank.py
- All functionality preserved including NDCG evaluation"
```

---

## Task 10: Verification and comparison

**Files:**
- Test both implementations

**Step 1: Compare outputs**

```bash
# Run the old version first (or reuse its existing outputs for comparison)
# NOTE: this runs actual training and takes a while

# Run the new version
uv run python src/experiment/regression_v2.py 2>&1 | tee regression_v2.log
uv run python src/experiment/learn_to_rank_v2.py 2>&1 | tee rank_v2.log

# Check the output files
ls -lh experiment/output/
# regression_output.csv and rank_output.csv should be present
```

**Step 2: Validate the feature-importance output**

Make sure the feature-importance report is formatted correctly:
- top-20 feature list
- zero-contribution feature list
- summary statistics
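
All three report sections can be derived directly from a trained LightGBM `Booster` via `feature_name()` and `feature_importance()`. A sketch of that derivation — the project's ResultAnalyzer may structure its report differently:

```python
import numpy as np

def importance_report(booster, importance_type: str = "gain"):
    """Top-20 features, zero-contribution features, and summary stats from a Booster."""
    names = booster.feature_name()
    scores = booster.feature_importance(importance_type=importance_type)
    order = np.argsort(scores)[::-1]                      # descending by importance
    top20 = [(names[i], float(scores[i])) for i in order[:20]]
    zero = [names[i] for i in order if scores[i] == 0]    # features the model never used
    summary = {
        "n_features": len(names),
        "n_zero": len(zero),
        "mean": float(np.mean(scores)),
        "max": float(np.max(scores)),
    }
    return top20, zero, summary
```

Zero-gain features are natural candidates for `excluded_factors` in the next training round.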

**Step 3: Validate the NDCG evaluation (learn_to_rank)**

Make sure NDCG@k evaluation runs correctly:
- ndcg@1, ndcg@5, ndcg@10, and ndcg@20 are all computed
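
As an independent sanity check on the reported metrics, NDCG@k can be recomputed by hand for a single query group (one trading day) using the common exponential-gain formulation:

```python
import numpy as np

def ndcg_at_k(relevance: np.ndarray, scores: np.ndarray, k: int) -> float:
    """NDCG@k for one query group: relevance = true grades, scores = model output."""
    order = np.argsort(scores)[::-1][:k]                  # top-k items by predicted score
    gains = 2.0 ** relevance[order] - 1.0
    discounts = 1.0 / np.log2(np.arange(2, len(order) + 2))
    dcg = float(np.sum(gains * discounts))
    ideal = np.sort(relevance)[::-1][:k]                  # best possible ordering
    idcg = float(np.sum((2.0 ** ideal - 1.0) * discounts[: len(ideal)]))
    return dcg / idcg if idcg > 0 else 0.0
```

A perfect ranking scores 1.0 at every k; values the Trainer reports should fall in [0, 1] and typically decrease only mildly as k grows.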

**Step 4: Code statistics**

```bash
# Compare line counts
echo "=== Old implementation ==="
wc -l src/experiment/regression.py src/experiment/learn_to_rank.py

echo "=== New implementation ==="
wc -l src/experiment/regression_v2.py src/experiment/learn_to_rank_v2.py

echo "=== New components ==="
wc -l src/training/factor_manager.py src/training/pipeline.py src/training/result_analyzer.py
find src/training/tasks -name "*.py" -exec wc -l {} +
```

Expected:
- Old: ~640 + ~876 = ~1516 lines
- New: ~80 + ~80 = ~160 lines
- New components: ~500-800 lines (reusable)

**Step 5: Commit final changes**

```bash
git add -A
git commit -m "refactor(training): complete modular Trainer architecture

- Implement FactorManager, DataPipeline, Task strategies, ResultAnalyzer
- Rewrite regression.py (640 -> 80 lines)
- Rewrite learn_to_rank.py (876 -> 80 lines)
- Preserve all functionality:
  * Factor management (metadata, DSL, label, exclusion)
  * Data filtering (STFilter, stock_pool_filter)
  * Data preprocessing (NullFiller, Winsorizer, Scaler)
  * Model training with early stopping
  * Feature importance analysis
  * NDCG evaluation for ranking
  * Result saving (predictions, model)
- Add comprehensive tests for all components
- Code reduction: ~90% less code in experiment scripts"
```

---

## Summary

### Code structure changes

```
Before:
├── src/experiment/regression.py (640 lines) - standalone full implementation
├── src/experiment/learn_to_rank.py (876 lines) - standalone full implementation
└── duplicated code: 80%+

After:
├── src/experiment/regression_v2.py (80 lines) - config + run
├── src/experiment/learn_to_rank_v2.py (80 lines) - config + run
├── src/training/factor_manager.py - factor management (reusable)
├── src/training/pipeline.py - data pipeline (reusable)
├── src/training/tasks/
│   ├── base.py - task interface
│   ├── regression_task.py - regression task
│   └── rank_task.py - ranking task
├── src/training/result_analyzer.py - result analysis (reusable)
└── src/training/core/trainer_new.py - orchestration engine
```

### Effort to add a new training type

To add a **classification task**:
1. Create a `ClassificationTask` class (inherit from BaseTask, implement its 3 methods)
2. Use it from an experiment script (~80 lines, analogous to regression/ranking)

No data-pipeline code needs to be copied!
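
A sketch of what that extension could look like. The real BaseTask interface lives in `src/training`; the three method names below (`prepare_labels`, `train`, `predict`) and the binary-label scheme are illustrative assumptions, not the actual API:

```python
# Illustrative skeleton only: method names and signatures are assumptions,
# not the real BaseTask interface from src/training.
from abc import ABC, abstractmethod

class BaseTask(ABC):
    @abstractmethod
    def prepare_labels(self, labels): ...
    @abstractmethod
    def train(self, X_train, y_train, X_val, y_val): ...
    @abstractmethod
    def predict(self, X): ...

class ClassificationTask(BaseTask):
    """Binary classification: label = whether the forward return is positive."""

    def __init__(self, model_params: dict, label_name: str):
        self.model_params = {"objective": "binary", **model_params}
        self.label_name = label_name
        self.model = None

    def prepare_labels(self, labels):
        # Binarize the continuous return label
        return [1 if y > 0 else 0 for y in labels]

    def train(self, X_train, y_train, X_val, y_val):
        # Fit here, e.g. lightgbm.train with early stopping on the validation set
        raise NotImplementedError

    def predict(self, X):
        raise NotImplementedError
```

Because the Trainer only talks to the Task interface, the FactorManager, DataPipeline, and ResultAnalyzer are reused unchanged.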

### Test coverage

- FactorManager: ✓
- DataPipeline: ✓
- Tasks: ✓
- ResultAnalyzer: ✓

---

## Optional follow-ups

1. **Remove the old files entirely**: once the new scripts are verified, delete regression.py and learn_to_rank.py and rename the v2 files
2. **More tests**: integration and end-to-end tests
3. **Documentation**: update the README with usage notes for the new architecture
4. **Config loading**: support loading configuration from YAML/JSON files
|