diff --git a/docs/plans/2026-04-07-factorminer-local-integration.md b/docs/plans/2026-04-07-factorminer-local-integration.md index 574d1c2..d940c9b 100644 --- a/docs/plans/2026-04-07-factorminer-local-integration.md +++ b/docs/plans/2026-04-07-factorminer-local-integration.md @@ -1,6 +1,10 @@ -# FactorMiner 本地框架整合实施计划 +# FactorMiner 本地框架整合实施计划(修订版) -> 目标:将 `src/factorminer` 完全整合进 ProStock 项目,数据加载、因子计算全部使用本地框架,仅在因子生成、落库、指标分析时保留 FactorMiner 代码。 +> 目标:将 `src/factorminer` 完全整合进 ProStock 项目。数据读取与因子计算全部复用本地 `FactorEngine`,不再引入 FactorMiner 原生的数据加载、DSL 计算与 `(M,T)` 矩阵缓存。仅在因子生成、落库、指标分析时保留 FactorMiner 代码。 +> +> 本次修订核心变更: +> 1. **删除 Step 1(LocalDataLoader)**:本地 `FactorEngine.compute()` 已自带数据路由与读取能力,无需自行封装数据加载层。 +> 2. **删除运行时 DSL 翻译器**:不再维护 `FmToLocalTranslator`。改为一次性脚本把 110 个 paper factors 的 CamelCase DSL 翻译成本地 DSL,并直接回填到常量列表中;LLM Prompt 同步改造为直接输出本地 DSL。 --- @@ -68,138 +72,166 @@ --- -## Step 1: 本地数据加载层(`LocalDataLoader`) +## Step 1: 一次性 Paper Factors DSL 迁移脚本 **文件** -- 新建:`src/factorminer/factorminer/data/local_data_loader.py` -- 测试:`tests/test_factorminer_local_data_loader.py` +- 新建:`scripts/translate_paper_factors.py` **目标** -- 弃用 `loader.py` + `preprocessor.py`,改为从本地 DuckDB `pro_bar` 表读取数据 -- 统一日期范围:`20190101` ~ `20231231` -- 支持股票池筛选(与 `experiment/common.py` 的 `stock_pool_filter` 对齐) -- 生成 `$vwap` 等价字段(`amount / vol`),并提供统一的 `asset_ids` / `timestamps` 索引 +- 将 `src/factorminer/core/library_io.py` 中硬编码的 110 个 `PAPER_FACTORS` 的 CamelCase DSL 公式,一次性翻译为本地 snake_case DSL 字符串。 +- 翻译结果直接替换回原常量列表,后续 `import_from_paper()` 加载的公式已经是本地格式,无需运行时翻译。 -**实现要点** -- 使用 `Storage(read_only=True).load_polars("pro_bar", ...)` 读取数据 -- 日期格式统一为字符串 `YYYYMMDD` -- 股票池筛选通过注入的 `filter_func` 完成(默认使用 `experiment/common.py` 的筛选逻辑) -- 返回封装对象 `LocalPanel`,包含: - - `df: pl.DataFrame`(原始长表) - - `asset_ids: np.ndarray` - - `timestamps: np.ndarray` - -**代码风格检查点** -- 类名 `LocalDataLoader` / `LocalPanel` -- 所有公共方法带类型提示和中文 docstring -- 导入顺序正确 - ---- - -## Step 2: DSL 翻译器(`FmToLocalTranslator`) - -**文件** -- 
新建:`src/factorminer/factorminer/core/formula_translator.py` -- 测试:`tests/test_factorminer_formula_translator.py` - -**目标** -- 将 FactorMiner 论文中的 110 个 CamelCase DSL 公式翻译成本地 snake_case DSL -- 覆盖全部算子,未覆盖的算子翻译结果前加 `# TODO` 标记 -- 翻译器**仅用于** paper factors 导入和向后兼容,不用于 LLM 生成路径 - -**映射规则示例** +**映射规则(核心算子)** | FactorMiner | 本地 DSL | |-------------|----------| | `Neg(X)` | `-X` | +| `Add(A, B)` | `A + B` | | `Sub(A, B)` | `A - B` | +| `Mul(A, B)` | `A * B` | | `Div(A, B)` | `A / B` | +| `Greater(A, B)` | `A > B` | +| `Square(X)` | `X ** 2` | | `CsRank(X)` | `cs_rank(X)` | -| `TsMean(X, 20)` | `ts_mean(X, 20)` | +| `CsZscore(X)` | `cs_zscore(X)` | +| `TsMean(X, n)` | `ts_mean(X, n)` | +| `TsMax(X, n)` | `ts_max(X, n)` | +| `TsMin(X, n)` | `ts_min(X, n)` | +| `Std(X, n)` | `ts_std(X, n)` | +| `Delta(X, n)` | `ts_delta(X, n)` | +| `Delay(X, n)` | `ts_delay(X, n)` | +| `Corr(X, Y, n)` | `ts_corr(X, Y, n)` | +| `Cov(X, Y, n)` | `ts_cov(X, Y, n)` | +| `Sum(X, n)` | `ts_sum(X, n)` | +| `Return(X, n)` | `ts_pct_change(X, n)`(或 `X / ts_delay(X, n) - 1`) | +| `EMA(X, n)` | `ts_ema(X, n)` | +| `WMA(X, n)` | `ts_wma(X, n)` | +| `SMA(X, n)` | `ts_mean(X, n)`(FactorMiner 里的 SMA 即简单移动平均) | +| `Skew(X, n)` | `ts_skew(X, n)` | +| `Kurt(X, n)` | `ts_kurt(X, n)` | +| `Abs(X)` | `abs(X)` | +| `Sign(X)` | `sign(X)` | +| `Max(A, B)` | `max_(A, B)` | +| `Min(A, B)` | `min_(A, B)` | +| `IfElse(C, T, F)` | `if_(C, T, F)` | | `$close` | `close` | | `$volume` | `vol` | | `$amt` | `amount` | | `$vwap` | `amount / vol` | +| `$returns` | `close / ts_delay(close, 1) - 1` | + +**未实现算子处理** +本地框架缺少以下 FactorMiner 算子,翻译时将其整条公式替换为 `# TODO: <原始公式>`: +- `Decay(...)` +- `TsLinRegSlope(...)` +- `TsLinRegResid(...)` +- `Resid(...)` +- `Quantile(...)` +- `HMA(...)` +- `DEMA(...)` **实现要点** -- 使用递归下降直接翻译 `ExpressionTree` 节点,不依赖字符串替换(避免括号歧义) -- `LeafNode` 处理字段映射;`OperatorNode` 处理算子映射 -- 对二元算术算子输出中缀表达式并合理加括号 -- 未实现的算子返回 `# TODO: <原始算子名>(...)` +- 脚本解析 `PAPER_FACTORS` 中的字符串公式,使用括号递归栈做 AST 风格的拆分。 +- 对 
`LeafNode`(字段和数字常量)做直接映射;对 `OperatorNode` 做算子映射。 +- 脚本输出新的 Python 列表代码,可直接复制并替换 `library_io.py` 中的 `PAPER_FACTORS`。 +- 运行脚本后手动校验前 10 个公式的正确性,确保括号匹配。 **代码风格检查点** -- 翻译器为一个纯函数类,无状态 -- 单元测试覆盖 paper factors 中的高频算子和至少 5 个完整公式 +- 脚本放 `scripts/` 目录,使用 `snake_case` 命名。 +- 带中文 docstring,打印统计:`print("[translate] 成功 {n}/110,TODO {m} 个")`。 --- -## Step 3: 禁用 npz 并将翻译器集成到库 I/O +## Step 2: 禁用 npz 并将库 I/O 对接本地 DSL **文件** -- 修改:`src/factorminer/factorminer/core/library_io.py` -- 修改:`src/factorminer/factorminer/cli.py`(如有 `save_signals` 参数则改为始终 False) +- 修改:`src/factorminer/core/library_io.py` +- 修改:`src/factorminer/cli.py`(如有 `save_signals` 参数则改为始终 False) - 测试:`tests/test_factorminer_library_io.py` **目标** -- 彻底禁止 `.npz` 信号缓存落盘 -- `load_library` 加载内置 110 个 paper factors 时,自动调用翻译器将其转换为本地的 snake_case DSL -- 如果翻译结果是 `# TODO`,则在 factor metadata 中标记 `unsupported=True` +- 彻底禁止 `.npz` 信号缓存落盘。 +- `PAPER_FACTORS` 中的公式已通过 Step 1 变为本地 DSL,`import_from_paper()` 直接加载即可,不再做运行时翻译。 +- 对于 Step 1 中标记为 `# TODO` 的公式,在构建 `FactorLibrary` 时设置 `factor.metadata["unsupported"] = True`。 **修改要点** -- `save_library(..., save_signals)`:无论传入什么,均忽略 `save_signals`,且不写 `.npz` -- `load_library(path)`:恢复 JSON 后,将每个 `factor.formula` 通过翻译器转换 -- `import_from_paper()`:在构建 FactorLibrary 时直接翻译所有公式 +- `save_library(..., save_signals)`:无论传入什么,均忽略 `save_signals`,不写 `.npz`。 +- `load_library(path)`:恢复 JSON 后,若公式以 `# TODO` 开头,标记 `unsupported=True`。 +- `import_from_paper()`:由于 `PAPER_FACTORS` 已本地化,直接构建 `FactorLibrary`。 +- 移除 `library_io.py` 中对 `ExpressionTree` / `canonicalizer` 的任何依赖(如果存在)。 **代码风格检查点** -- 修改点尽量少,废弃参数保留以兼容旧签名,但内部忽略 -- 打印日志说明 npz 已禁用:`print("[library_io] 信号缓存已禁用,仅保存 JSON 元数据")` +- 废弃参数保留以兼容旧签名,但内部忽略。 +- 打印日志说明 npz 已禁用:`print("[library_io] 信号缓存已禁用,仅保存 JSON 元数据")`。 --- -## Step 4: LLM Prompt 改造(让 Agent 直接生成本地 DSL) +## Step 3: LLM Prompt 改造(让 Agent 直接生成本地 DSL) **文件** - 修改:`src/factorminer/factorminer/agent/prompt_builder.py` +- 修改:`src/factorminer/factorminer/agent/output_parser.py` - 
修改:`src/factorminer/factorminer/agent/factor_generator.py`(如有必要) - 测试:`tests/test_factorminer_prompt.py` **目标** -- 将 Prompt 中的 DSL 规范从 CamelCase + `$` 前缀改为本地 snake_case DSL -- 修改示例公式,使其全部为本地 DSL 格式(如 `cs_rank(close / ts_delay(close, 5) - 1)`) -- 明确可用字段:`open`, `high`, `low`, `close`, `vol`, `amount`, `vwap`(可用 `amount / vol` 计算) +- 将 Prompt 中的 DSL 规范从 CamelCase + `$` 前缀改为本地 snake_case DSL。 +- 所有示例公式替换为本地格式(如 `cs_rank(close / ts_delay(close, 5) - 1)`)。 +- 明确可用字段:`open`, `high`, `low`, `close`, `vol`, `amount`, `vwap`(可用 `amount / vol` 计算)。 +- LLM 输出直接是本地 DSL 字符串,解析层只需提取字符串,**不再**做 `$` 替换或 CamelCase 转换。 **修改要点** -- 重写 `SYSTEM_PROMPT` 中的 DSL 规则段落 -- 将所有 prompt 示例公式替换为本地 DSL -- `OutputParser` 中的公式清洗逻辑需同步适配(去掉 `$`,但保留中文描述) +- 重写 `SYSTEM_PROMPT` 中的 DSL 规则段落,列出现有函数名与字段名。 +- 将所有 prompt 示例公式替换为本地 DSL。 +- `OutputParser` 去掉 `$` 清洗逻辑;改为直接截取公式字符串(保留中文描述之外的纯公式部分)。 +- `factor_generator.py` 中的 `generate` / `try_parse` 不再调用 FactorMiner 的 `ExpressionTree.from_string`,改为直接返回字符串(因为本地 DSL 由 `FactorEngine` 在计算时解析)。 **代码风格检查点** -- Prompt 内容易读、无 emoji -- 通过单元测试验证 prompt 中生成本地 DSL 示例的正确性 +- Prompt 内容易读、无 emoji。 +- 单元测试验证 prompt 中包含的示例公式均为本地 DSL,且 `OutputParser` 能正确提取。 --- -## Step 5: `LocalFactorEvaluator`(FactorEngine 执行封装) +## Step 4: `LocalFactorEvaluator`(FactorEngine 执行封装) **文件** - 新建:`src/factorminer/factorminer/evaluation/local_engine.py` - 测试:`tests/test_factorminer_local_engine.py` **目标** -- 封装 `FactorEngine`,提供与 FactorMiner `compute_tree_signals` 兼容的接口 -- 输入:候选因子 DSL 列表;输出:`(M, T)` numpy 信号矩阵字典 -- 支持批量计算 + 立即清理 engine 状态 +- 封装 `FactorEngine`,提供与 FactorMiner `compute_tree_signals` 兼容的输出接口。 +- 输入:候选因子 DSL 列表(`(name, formula)`);输出:`{name: (M, T) np.ndarray}`。 +- 无需外部数据加载器,直接利用 `FactorEngine` 内建的数据路由读取 `pro_bar` 表。 **类签名设计** ```python class LocalFactorEvaluator: - def __init__(self, data_loader: LocalDataLoader) -> None: + def __init__( + self, + start_date: str, + end_date: str, + stock_codes: Optional[List[str]] = None, + ) -> None: + """初始化评估器。 + + Args: + start_date: 
计算开始日期,YYYYMMDD 格式 + end_date: 计算结束日期,YYYYMMDD 格式 + stock_codes: 可选的股票代码列表,None 表示全量 + """ ... def evaluate( self, specs: List[Tuple[str, str]], ) -> Dict[str, np.ndarray]: - """批量计算并返回 {name: (M, T) 矩阵}。""" + """批量计算并返回 {name: (M, T) 矩阵}。 + + Args: + specs: (因子名, 本地 DSL 公式) 列表 + + Returns: + 每个因子对应的 (asset, time) numpy 矩阵,缺失值填充 np.nan + """ ... def evaluate_single( @@ -209,95 +241,112 @@ class LocalFactorEvaluator: ) -> np.ndarray: """计算单个因子。""" ... + + def evaluate_returns( + self, + periods: int = 1, + ) -> np.ndarray: + """计算收益率矩阵,用于后续 IC / quintile 分析。 + + Returns: + (M, T) 的 forward returns 矩阵 + """ + ... ``` **实现要点** -- `evaluate` 中一次性注册所有 specs,调用 `engine.compute(...)` -- 使用 `pivot_table` 将返回的 Polars 长表转换为 `(M, T)` numpy 矩阵 -- 缺失值填充 `np.nan` -- 计算结束后调用 `engine.clear()` +- `evaluate` 中一次性注册所有 specs,调用 `FactorEngine.compute(...)`。 +- 返回的 Polars 长表按 `ts_code`(字母序)和 `trade_date`(时间序)`pivot` 为 numpy 矩阵。 +- 缺失值填充 `np.nan`。 +- 计算结束后调用 `engine.clear()`。 +- `evaluate_returns` 计算 `ts_pct_change(close, periods)`(或 `close / ts_delay(close, periods) - 1`),同样 pivot 为矩阵。 **代码风格检查点** -- 严格的类型提示和中文 docstring -- 日志打印:`print("[local_engine] 开始批量计算 {n} 个因子...")` +- 严格的类型提示和中文 docstring。 +- 日志打印:`print("[local_engine] 开始批量计算 {n} 个因子...")`。 --- -## Step 6: 替换计算管线(`pipeline.py` / `runtime.py`) +## Step 5: 替换计算管线(`pipeline.py` / `runtime.py`) **文件** - 修改:`src/factorminer/factorminer/evaluation/pipeline.py` - 修改:`src/factorminer/factorminer/evaluation/runtime.py` +- 修改:`src/factorminer/factorminer/data/loader.py`(弃用标记,可选) +- 修改:`src/factorminer/factorminer/data/preprocessor.py`(弃用标记,可选) +- 修改:`src/factorminer/factorminer/data/tensor_builder.py`(弃用标记,可选) - 测试:`tests/test_factorminer_pipeline_integration.py` **目标** -- 将 `compute_tree_signals(..., data_dict)` 替换为通过 `LocalFactorEvaluator` 计算 -- 保留原有 IC、stats、quintile 分析逻辑 - -**修改 `pipeline.py` 要点** -- `ValidationPipeline.__init__` 接收 `data_loader: LocalDataLoader` -- 构建内部 `LocalFactorEvaluator` -- `compute_tree_signals` 改为调用 
`evaluator.evaluate_single(name, formula)` -- `evaluate` 方法中,一次性批量计算所有候选因子,再逐个进入 stats +- 移除 `compute_tree_signals(..., data_dict)` 及其对 FactorMiner 原生 `(M,T)` 数据面板的依赖。 +- 所有信号计算统一通过 `LocalFactorEvaluator` 完成。 +- 保留原有 IC、stats、quintile 分析逻辑。 **修改 `runtime.py` 要点** -- `evaluate_factors` 中实例化 `LocalFactorEvaluator` -- 对每个 factor 调用 `evaluate_single`;若 formula 以 `# TODO` 开头,标记为 reject -- 保留 split-mask 和 stats 计算逻辑 +- `EvaluationDataset` 不再持有 `data_dict` 和 `returns`。 +- `evaluate_factors` 接收 `evaluator: LocalFactorEvaluator` 和 `returns: np.ndarray`。 +- 不再需要 `load_runtime_dataset` 做面板预处理;改为由调用方直接构造 `evaluator`(指定日期范围)即可。 +- 对每个 factor 调用 `evaluator.evaluate_single(name, formula)`;若 formula 以 `# TODO` 开头,标记为 `reject`。 +- 保留 split-mask 和 stats 计算逻辑(它们只消费 `(M,T)` 矩阵,无需改动)。 + +**修改 `pipeline.py` 要点** +- `ValidationPipeline.__init__` 改为接收 `evaluator: LocalFactorEvaluator` 和 `returns: np.ndarray`。 +- 删除 `compute_signals_fn` 参数(或保留为向后兼容的弃用参数)。 +- `compute_tree_signals` 改为调用 `evaluator.evaluate_single(name, formula)`。 +- `evaluate` 方法中,一次性批量计算所有候选因子的信号,再逐个进入 stats / correlation / replacement 阶段。 **代码风格检查点** -- 修改点精确定位,不改变评估函数的返回数据结构 -- 兼容测试通过后再提交 +- 修改点精确定位,不改变评估函数的返回数据结构。 +- 兼容测试通过后再提交。 --- -## Step 7: 内存优化——库中因子按需重算 +## Step 6: 内存优化——库中因子按需重算 **文件** - 修改:`src/factorminer/factorminer/core/factor_library.py` - 测试:`tests/test_factorminer_library_memory.py` **目标** -- 库内因子对象不再长期持有 `(M, T)` numpy signals -- 相关性检查改为按需调用 `LocalFactorEvaluator` 重算 +- 库内 `Factor` 对象不再长期持有 `(M, T)` numpy signals。 +- 相关性检查改为按需调用 `LocalFactorEvaluator` 重算。 **修改要点** -- `admit()` 时不再保存 `signals` 到 `Factor` 对象 -- `compute_correlation` 签名改为接收 `evaluator: LocalFactorEvaluator` -- 内部遍历库中因子,临时调用 `evaluator.evaluate_single` 计算信号,再与候选信号求相关 -- 若 formula 为 `# TODO` 则跳过(返回 `0.0`) -- 删除 `_extend_correlation_matrix` / `_recompute_matrix_slot` 增量维护逻辑(改为动态求最大相关) +- `admit()` 时不再保存 `signals` 到 `Factor` 对象。 +- `compute_correlation` 签名改为接收 `evaluator: LocalFactorEvaluator, candidate_signals: np.ndarray`。 +- 内部遍历库中因子,临时调用 
`evaluator.evaluate_single(name, formula)` 计算信号,再与候选信号求相关。 +- 若 formula 为 `# TODO` 则跳过(返回 `0.0`)。 +- 删除 `_extend_correlation_matrix` / `_recompute_matrix_slot` 增量维护逻辑(改为动态求最大相关)。 **代码风格检查点** -- 废弃旧方法时保留空壳或私有方法,避免测试大面积报错 -- 中文注释说明为什么删除增量矩阵(本地引擎重算成本低,内存优先) +- 废弃旧方法时保留空壳或私有方法,避免测试大面积报错。 +- 中文注释说明为什么删除增量矩阵(本地引擎重算成本低,内存优先)。 --- -## Step 8: 端到端集成测试(110 Paper Factors) +## Step 7: 端到端集成测试(110 Paper Factors) **文件** - 新建:`tests/test_factorminer_e2e.py` **目标** -- 验证翻译后的 110 个 paper factors 全部能在本地引擎上成功计算信号 -- 排除因未实现算子导致的 TODO 公式,统计成功率 +- 验证迁移后的 110 个 paper factors 全部能在本地引擎上成功计算信号。 +- 排除 `# TODO` 公式,统计实际可运行因子的成功率。 **测试逻辑** -1. 调用 `import_from_paper()` 加载因子库 -2. 实例化 `LocalDataLoader` 读取 20200101 ~ 20201231 数据 -3. 实例化 `LocalFactorEvaluator` -4. 过滤掉 `unsupported=True` 的因子 -5. 批量计算剩余因子,断言输出形状为 `(M, T)` 且不含全 NaN -6. 打印统计:`print("[e2e] 成功 {x}/110,跳过 {y} 个未实现算子")` +1. 调用 `import_from_paper()` 加载因子库。 +2. 实例化 `LocalFactorEvaluator(start_date="20200101", end_date="20201231")`。 +3. 过滤掉 `unsupported=True` 的因子。 +4. 批量计算剩余因子,断言输出形状为 `(M, T)` 且不含全 NaN。 +5. 打印统计:`print("[e2e] 成功 {x}/110,跳过 {y} 个未实现算子")`。 **代码风格检查点** -- 使用 `pytest.mark.slow` 标记(若运行时间 > 30 秒) -- 不依赖外部 API Key +- 使用 `pytest.mark.slow` 标记(若运行时间 > 30 秒)。 +- 不依赖外部 API Key。 --- -## Step 9: 清理所有 checkpoint 和 demo 中的 npz 保存逻辑 +## Step 8: 清理所有 checkpoint 和 demo 中的 npz 保存逻辑 **文件** - 修改:`src/factorminer/factorminer/core/ralph_loop.py` @@ -307,27 +356,27 @@ class LocalFactorEvaluator: - 修改:`src/factorminer/factorminer/benchmark/*.py`(如有 `save_signals` 调用) **目标** -- 确保任何运行路径都不会意外触发 `.npz` 信号缓存落盘 -- 移除或注释掉所有 `library_io.save_library(..., save_signals=True)` 调用 +- 确保任何运行路径都不会意外触发 `.npz` 信号缓存落盘。 +- 移除或注释掉所有 `library_io.save_library(..., save_signals=True)` 调用。 **修改要点** -- 搜索 `save_signals=True` 和 `.npz` 关键字,逐一处理 -- 改为 `save_signals=False` 或直接调用不带该参数的 `save_library` +- 搜索 `save_signals=True` 和 `.npz` 关键字,逐一处理。 +- 改为 `save_signals=False` 或直接调用不带该参数的 `save_library`。 --- -## Step 10: 代码风格审查、测试全量回归与提交 +## Step 9: 代码风格审查、测试全量回归与提交 **执行清单** -1. 
运行 `uv run pytest tests/test_factorminer_* -v`,确保全部通过 -2. 运行 `uv run pytest tests/test_factor_engine.py tests/test_factor_integration.py -v`,确保本地框架未受影响 -3. 检查新增代码中是否混入 emoji -4. 检查新增代码的导入顺序和 docstring 完整性 -5. 提交前做一次 `git diff --stat`,确认没有误删或大规模重写无关文件 +1. 运行 `uv run pytest tests/test_factorminer_* -v`,确保全部通过。 +2. 运行 `uv run pytest tests/test_factor_engine.py tests/test_factor_integration.py -v`,确保本地框架未受影响。 +3. 检查新增代码中是否混入 emoji。 +4. 检查新增代码的导入顺序和 docstring 完整性。 +5. 提交前做一次 `git diff --stat`,确认没有误删或大规模重写无关文件。 **提交建议** -- 按模块分几个 commit,而不是一个巨大的 commit -- 使用 Conventional Commits 风格(`feat:` / `refactor:` / `perf:` / `test:`) +- 按模块分几个 commit,而不是一个巨大的 commit。 +- 使用 Conventional Commits 风格(`feat:` / `refactor:` / `perf:` / `test:`)。 --- @@ -335,29 +384,41 @@ class LocalFactorEvaluator: | 风险 | 应对 | |------|------| -| FactorMiner 某些算子本地框架没有实现 | 翻译时标记 `# TODO`,评估阶段 reject | -| `FactorEngine` 在极宽表(>1000 列)时内存激增 | 以 batch 为单位分批计算,并配合 `engine.clear()` | -| 本地 `pro_bar` 表数据不完整或缺少某些日期 | 在 `LocalDataLoader` 中加入 coverage check,缺失率过高时抛异常 | +| FactorMiner 某些算子本地框架没有实现 | 一次性脚本翻译时标记 `# TODO`,`unsupported=True` 的因子在评估阶段直接 reject | +| `FactorEngine` 在极宽表(>1000 列)时内存激增 | `LocalFactorEvaluator` 以 batch 为单位分批计算,并配合 `engine.clear()` | +| 本地 `pro_bar` 表数据不完整或缺少某些日期 | `FactorEngine` 本身有数据完整性校验;缺失率过高时会在计算阶段报错 | | `OutputParser` 对本地 DSL 的括号/逗号解析不兼容 | 修改 `OutputParser` 的清洗正则,增加单元测试 | +| 110 个 paper factors 中有大量使用未实现算子 | 统计 TODO 比例,若 >30% 则优先在本地框架补充 `ts_linreg_slope`、`ts_decay` 等高频算子 | --- -## 附:核心模块依赖关系 +## 附:核心模块依赖关系(修订后) ``` -┌────────────────────┐ -│ LocalDataLoader │ ← Storage(read_only=True) -└────────┬───────────┘ - │ - ▼ -┌────────────────────┐ -│ LocalFactorEvaluator│ ← FactorEngine (批量计算 -> pivot -> np.ndarray) -└────────┬───────────┘ - │ - ┌────┴────┐ - ▼ ▼ -pipeline.py runtime.py ← 保留 FactorMiner 的 stats / metrics / admission 逻辑 - │ - ▼ -factor_library.py ← 按需重算,不保存 signals +┌─────────────────────────────┐ +│ scripts/translate_paper_ │ ← 一次性脚本(跑完即删除/保留归档) +│ factors.py │ 
+└─────────────┬───────────────┘ + │ 替换 PAPER_FACTORS + ▼ +┌─────────────────────────────┐ +│ library_io.py │ ← 禁用 npz,公式已本地 DSL 化 +└─────────────┬───────────────┘ + │ 加载 FactorLibrary + ▼ +┌─────────────────────────────┐ +│ LocalFactorEvaluator │ ← FactorEngine (read_only 自动读取数据) +│ (local_engine.py) │ +└─────────────┬───────────────┘ + │ + ┌─────┴─────┐ + ▼ ▼ + pipeline.py runtime.py ← 保留 FactorMiner 的 stats / metrics / admission 逻辑 + │ + ▼ + factor_library.py ← 按需重算,不保存 signals + │ + ▼ + prompt_builder.py ← LLM 直接生成本地 DSL ``` + diff --git a/src/factorminer/core/library_io.py b/src/factorminer/core/library_io.py index 2146d75..0902852 100644 --- a/src/factorminer/core/library_io.py +++ b/src/factorminer/core/library_io.py @@ -203,669 +203,560 @@ def export_formulas(library: FactorLibrary, path: Union[str, Path]) -> None: # Representative subset of the 110 factors discovered by FactorMiner. # Each entry: (name, formula, category) PAPER_FACTORS: List[Dict[str, str]] = [ - # Factor 001 - { - "name": "Intraday Range Position", - "formula": "Neg(CsRank(Div(Sub($close, TsMin($close, 48)), Add(Sub(TsMax($close, 48), TsMin($close, 48)), 1e-8))))", - "category": "Mean-reversion", - }, - # Factor 002 - { - "name": "Volume-Weighted Momentum", - "formula": "Neg(CsRank(Mul(Return($close, 5), Div($volume, Mean($volume, 20)))))", - "category": "Momentum", - }, - # Factor 003 - { - "name": "Residual Volatility", - "formula": "Neg(CsRank(Std(Sub($close, EMA($close, 10)), 20)))", - "category": "Volatility", - }, - # Factor 004 - { - "name": "Intraday Amplitude Ratio", - "formula": "Neg(CsRank(Div(Sub($high, $low), Add($close, 1e-8))))", - "category": "Volatility", - }, - # Factor 005 - { - "name": "Volume Surprise", - "formula": "Neg(CsRank(Div(Sub($volume, Mean($volume, 20)), Add(Std($volume, 20), 1e-8))))", - "category": "Volume", - }, - # Factor 006 - { - "name": "VWAP Deviation", - "formula": "Neg(Div(Sub($close, $vwap), $vwap))", - "category": "VWAP", - }, - # Factor 007 - { 
- "name": "Short-term Reversal", - "formula": "Neg(CsRank(Return($close, 3)))", - "category": "Mean-reversion", - }, - # Factor 008 - { - "name": "Turnover Momentum", - "formula": "Neg(CsRank(Delta(Div($amt, Add($volume, 1e-8)), 5)))", - "category": "Turnover", - }, - # Factor 009 - { - "name": "High-Low Midpoint Reversion", - "formula": "Neg(CsRank(Sub($close, Div(Add($high, $low), 2))))", - "category": "Mean-reversion", - }, - # Factor 010 - { - "name": "Rolling Beta Residual", - "formula": "Neg(CsRank(Resid($returns, Mean($returns, 20), 20)))", - "category": "Risk", - }, - # Factor 011 - { - "name": "VWAP Slope", - "formula": "Neg(CsRank(TsLinRegSlope(Div(Sub($close, $vwap), $vwap), 10)))", - "category": "VWAP", - }, - # Factor 012 - { - "name": "Accumulation-Distribution", - "formula": "Neg(CsRank(Sum(Mul(Div(Sub(Mul(2, $close), Add($high, $low)), Add(Sub($high, $low), 1e-8)), $volume), 10)))", - "category": "Volume", - }, - # Factor 013 - { - "name": "Relative Strength Index Deviation", - "formula": "Neg(CsRank(Sub(Mean(Max(Delta($close, 1), 0), 14), Mean(Abs(Min(Delta($close, 1), 0)), 14))))", - "category": "Momentum", - }, - # Factor 014 - { - "name": "Price-Volume Correlation", - "formula": "Neg(Corr($close, $volume, 10))", - "category": "Volume", - }, - # Factor 015 - { - "name": "Skewness of Returns", - "formula": "Neg(CsRank(Skew($returns, 20)))", - "category": "Higher-moment", - }, - # Factor 016 - { - "name": "Kurtosis of Returns", - "formula": "Neg(CsRank(Kurt($returns, 20)))", - "category": "Higher-moment", - }, - # Factor 017 - { - "name": "Volume-Weighted Return", - "formula": "Neg(CsRank(Div(Sum(Mul($returns, $volume), 10), Add(Sum($volume, 10), 1e-8))))", - "category": "Volume", - }, - # Factor 018 - { - "name": "Close-to-High Ratio", - "formula": "Neg(CsRank(Div(Sub($high, $close), Add($high, 1e-8))))", - "category": "Mean-reversion", - }, - # Factor 019 - { - "name": "Delayed Correlation Shift", - "formula": "Neg(CsRank(Sub(Corr($close, 
$volume, 10), Corr(Delay($close, 5), $volume, 10))))", - "category": "Volume", - }, - # Factor 020 - { - "name": "Exponential Momentum", - "formula": "Neg(CsRank(Sub($close, EMA($close, 20))))", - "category": "Momentum", - }, - # Factor 021 - { - "name": "Range-Adjusted Volume", - "formula": "Neg(CsRank(Div($volume, Add(Sub($high, $low), 1e-8))))", - "category": "Volume", - }, - # Factor 022 - { - "name": "Cumulative Return Rank", - "formula": "Neg(CsRank(Sum($returns, 10)))", - "category": "Momentum", - }, - # Factor 023 - { - "name": "VWAP Momentum", - "formula": "Neg(CsRank(Return($vwap, 5)))", - "category": "VWAP", - }, - # Factor 024 - { - "name": "Bollinger Band Position", - "formula": "Neg(CsRank(Div(Sub($close, Mean($close, 20)), Add(Std($close, 20), 1e-8))))", - "category": "Mean-reversion", - }, - # Factor 025 - { - "name": "Volume Decay Weighted", - "formula": "Neg(CsRank(Decay($volume, 10)))", - "category": "Volume", - }, - # Factor 026 - { - "name": "Overnight Return", - "formula": "Neg(CsRank(Div(Sub($open, Delay($close, 1)), Add(Delay($close, 1), 1e-8))))", - "category": "Overnight", - }, - # Factor 027 - { - "name": "Intraday Return", - "formula": "Neg(CsRank(Div(Sub($close, $open), Add($open, 1e-8))))", - "category": "Intraday", - }, - # Factor 028 - { - "name": "Max Drawdown", - "formula": "Neg(CsRank(Div(Sub($close, TsMax($close, 20)), Add(TsMax($close, 20), 1e-8))))", - "category": "Risk", - }, - # Factor 029 - { - "name": "Hurst Exponent Proxy", - "formula": "Neg(CsRank(Div(Std($returns, 20), Add(Std($returns, 5), 1e-8))))", - "category": "Volatility", - }, - # Factor 030 - { - "name": "Volume Imbalance", - "formula": "Neg(CsRank(Sub(Mean($volume, 5), Mean($volume, 20))))", - "category": "Volume", - }, - # Factor 031 - { - "name": "Weighted Close Position", - "formula": "Neg(CsRank(Div(Sub(Mul(2, $close), Add($high, $low)), Add(Sub($high, $low), 1e-8))))", - "category": "Mean-reversion", - }, - # Factor 032 - { - "name": "Trend Intensity", - 
"formula": "Neg(CsRank(Div(Abs(Delta($close, 10)), Add(Sum(Abs(Delta($close, 1)), 10), 1e-8))))", - "category": "Trend", - }, - # Factor 033 - { - "name": "Return Dispersion", - "formula": "Neg(CsRank(Std($returns, 5)))", - "category": "Volatility", - }, - # Factor 034 - { - "name": "VWAP Relative Strength", - "formula": "Neg(CsRank(Div(Sub(Mean($close, 5), $vwap), Add($vwap, 1e-8))))", - "category": "VWAP", - }, - # Factor 035 - { - "name": "Rank Reversal", - "formula": "Neg(CsRank(Sub(TsRank($close, 10), TsRank($close, 30))))", - "category": "Mean-reversion", - }, - # Factor 036 - { - "name": "Money Flow Index", - "formula": "Neg(CsRank(Div(Sum(Mul(Max(Delta($close, 1), 0), $volume), 14), Add(Sum(Mul(Abs(Delta($close, 1)), $volume), 14), 1e-8))))", - "category": "Volume", - }, - # Factor 037 - { - "name": "Adaptive Momentum", - "formula": "Neg(CsRank(Mul(Return($close, 10), Div(Std($returns, 5), Add(Std($returns, 20), 1e-8)))))", - "category": "Momentum", - }, - # Factor 038 - { - "name": "Volume Trend", - "formula": "Neg(CsRank(TsLinRegSlope($volume, 10)))", - "category": "Volume", - }, - # Factor 039 - { - "name": "Price Acceleration", - "formula": "Neg(CsRank(Sub(Delta($close, 5), Delta(Delay($close, 5), 5))))", - "category": "Momentum", - }, - # Factor 040 - { - "name": "Realized Volatility Ratio", - "formula": "Neg(CsRank(Div(Std($returns, 10), Add(Std($returns, 30), 1e-8))))", - "category": "Volatility", - }, - # Factor 041 - { - "name": "Amount Concentration", - "formula": "Neg(CsRank(Div(TsMax($amt, 5), Add(Mean($amt, 20), 1e-8))))", - "category": "Turnover", - }, - # Factor 042 - { - "name": "Cross-Sectional Volume Rank", - "formula": "Neg(CsRank(Div($volume, Add(Mean($volume, 60), 1e-8))))", - "category": "Volume", - }, - # Factor 043 - { - "name": "Gap Momentum", - "formula": "Neg(CsRank(Sum(Div(Sub($open, Delay($close, 1)), Add(Delay($close, 1), 1e-8)), 5)))", - "category": "Overnight", - }, - # Factor 044 - { - "name": "VWAP Distance Decay", - 
"formula": "Neg(CsRank(Decay(Div(Sub($close, $vwap), Add($vwap, 1e-8)), 10)))", - "category": "VWAP", - }, - # Factor 045 - { - "name": "Tail Risk Indicator", - "formula": "Neg(CsRank(Div(TsMin($returns, 20), Add(Std($returns, 20), 1e-8))))", - "category": "Risk", - }, - # Factor 046 - { - "name": "Volatility-Regime Reversal Divergence", - "formula": "IfElse(Greater(Std($returns, 12), Mean(Std($returns, 12), 48)), Neg(CsRank(Delta($close, 3))), Neg(CsRank(Div(Sub($close, $low), Add(Sub($high, $low), 0.0001)))))", - "category": "Regime-switching", - }, - # Factor 047 - { - "name": "Regime Volume Signal", - "formula": "IfElse(Greater($volume, Mean($volume, 20)), Neg(CsRank($returns)), Neg(CsRank(Return($close, 5))))", - "category": "Regime-switching", - }, - # Factor 048 - { - "name": "Liquidity-Adjusted Reversal", - "formula": "Neg(CsRank(Mul(Return($close, 3), Div($volume, Add(Mean($volume, 20), 1e-8)))))", - "category": "Mean-reversion", - }, - # Factor 049 - { - "name": "Cross-Sectional Volatility Rank", - "formula": "Neg(CsRank(CsRank(Std($returns, 10))))", - "category": "Volatility", - }, - # Factor 050 - { - "name": "VWAP Bollinger", - "formula": "Neg(CsRank(Div(Sub($vwap, Mean($vwap, 20)), Add(Std($vwap, 20), 1e-8))))", - "category": "VWAP", - }, - # Factor 051 - { - "name": "Smoothed Return Reversal", - "formula": "Neg(CsRank(EMA($returns, 5)))", - "category": "Mean-reversion", - }, - # Factor 052 - { - "name": "Volume-Price Divergence", - "formula": "Neg(CsRank(Sub(TsRank($volume, 10), TsRank($close, 10))))", - "category": "Volume", - }, - # Factor 053 - { - "name": "Decay Weighted Momentum", - "formula": "Neg(CsRank(Decay($returns, 20)))", - "category": "Momentum", - }, - # Factor 054 - { - "name": "Range Percentile", - "formula": "Neg(CsRank(Div(Sub($close, TsMin($close, 20)), Add(Sub(TsMax($close, 20), TsMin($close, 20)), 1e-8))))", - "category": "Mean-reversion", - }, - # Factor 055 - { - "name": "Volume Skewness", - "formula": "Neg(CsRank(Skew($volume, 
20)))", - "category": "Volume", - }, - # Factor 056 - { - "name": "Residual Momentum", - "formula": "Neg(CsRank(TsLinRegResid($close, 20)))", - "category": "Momentum", - }, - # Factor 057 - { - "name": "VWAP Trend", - "formula": "Neg(CsRank(Delta(Div(Sub($close, $vwap), $vwap), 5)))", - "category": "VWAP", - }, - # Factor 058 - { - "name": "Return Autocorrelation", - "formula": "Neg(CsRank(Corr($returns, Delay($returns, 1), 10)))", - "category": "Mean-reversion", - }, - # Factor 059 - { - "name": "Price Efficiency", - "formula": "Neg(CsRank(Div(Abs(Sum($returns, 10)), Add(Sum(Abs($returns), 10), 1e-8))))", - "category": "Trend", - }, - # Factor 060 - { - "name": "Relative Volume Change", - "formula": "Neg(CsRank(Return($volume, 5)))", - "category": "Volume", - }, - # Factor 061 - { - "name": "Weighted VWAP Position", - "formula": "Neg(CsRank(WMA(Div(Sub($close, $vwap), $vwap), 10)))", - "category": "VWAP", - }, - # Factor 062 - { - "name": "Regime Momentum Flip", - "formula": "IfElse(Greater(Mean($returns, 5), 0), Neg(CsRank(Return($close, 10))), CsRank(Return($close, 3)))", - "category": "Regime-switching", - }, - # Factor 063 - { - "name": "High-Low Volatility", - "formula": "Neg(CsRank(Mean(Div(Sub($high, $low), Add($close, 1e-8)), 10)))", - "category": "Volatility", - }, - # Factor 064 - { - "name": "Opening Gap Reversal", - "formula": "Neg(CsRank(Div(Sub($open, Delay($close, 1)), Add(Std($returns, 10), 1e-8))))", - "category": "Overnight", - }, - # Factor 065 - { - "name": "Volume Momentum Spread", - "formula": "Neg(CsRank(Sub(Mean($volume, 5), Mean($volume, 40))))", - "category": "Volume", - }, - # Factor 066 - { - "name": "Regime Volume Reversal", - "formula": "IfElse(Greater(Div($volume, Add(Mean($volume, 20), 1e-8)), 1.5), Neg(CsRank($returns)), Neg(CsRank(Return($close, 10))))", - "category": "Regime-switching", - }, - # Factor 067 - { - "name": "Slope Reversal", - "formula": "Neg(CsRank(TsLinRegSlope($close, 5)))", - "category": "Mean-reversion", - }, - 
# Factor 068 - { - "name": "VWAP Momentum Decay", - "formula": "Neg(CsRank(Decay(Return($vwap, 3), 10)))", - "category": "VWAP", - }, - # Factor 069 - { - "name": "Turnover Rate Change", - "formula": "Neg(CsRank(Delta(Div($amt, Add($volume, 1e-8)), 10)))", - "category": "Turnover", - }, - # Factor 070 - { - "name": "Return Quantile Signal", - "formula": "Neg(CsRank(Quantile($returns, 20, 0.75)))", - "category": "Higher-moment", - }, - # Factor 071 - { - "name": "Double EMA Crossover", - "formula": "Neg(CsRank(Sub(EMA($close, 5), EMA($close, 20))))", - "category": "Trend", - }, - # Factor 072 - { - "name": "Conditional Volatility Return", - "formula": "Neg(CsRank(Div($returns, Add(Std($returns, 10), 1e-8))))", - "category": "Risk", - }, - # Factor 073 - { - "name": "Amplitude Trend", - "formula": "Neg(CsRank(TsLinRegSlope(Div(Sub($high, $low), Add($close, 1e-8)), 10)))", - "category": "Volatility", - }, - # Factor 074 - { - "name": "Volume-Weighted Range", - "formula": "Neg(CsRank(Mean(Mul(Div(Sub($high, $low), Add($close, 1e-8)), $volume), 10)))", - "category": "Volume", - }, - # Factor 075 - { - "name": "Intraday Efficiency Ratio", - "formula": "Neg(CsRank(Div(Abs(Sub($close, $open)), Add(Sub($high, $low), 1e-8))))", - "category": "Intraday", - }, - # Factor 076 - { - "name": "Cumulative Volume Signal", - "formula": "Neg(CsRank(Div(Sum(Mul($returns, $volume), 20), Add(Sum($volume, 20), 1e-8))))", - "category": "Volume", - }, - # Factor 077 - { - "name": "VWAP Cross-Sectional Momentum", - "formula": "Neg(CsRank(CsRank(Return($vwap, 10))))", - "category": "VWAP", - }, - # Factor 078 - { - "name": "Mean-Reversion Indicator", - "formula": "Neg(CsRank(Div(Sub($close, SMA($close, 10)), Add(SMA($close, 10), 1e-8))))", - "category": "Mean-reversion", - }, - # Factor 079 - { - "name": "Volume Regime Indicator", - "formula": "Neg(CsRank(Div(Std($volume, 5), Add(Std($volume, 20), 1e-8))))", - "category": "Volume", - }, - # Factor 080 - { - "name": "Return Persistence", - 
"formula": "Neg(CsRank(Mul(Sign(Delta($close, 1)), Sign(Delta($close, 5)))))", - "category": "Momentum", - }, - # Factor 081 - { - "name": "Regime Trend Strength", - "formula": "IfElse(Greater(Abs(TsLinRegSlope($close, 20)), Std($close, 20)), Neg(CsRank(TsLinRegSlope($close, 5))), Neg(CsRank(Return($close, 3))))", - "category": "Regime-switching", - }, - # Factor 082 - { - "name": "VWAP Dispersion", - "formula": "Neg(CsRank(Std(Div(Sub($close, $vwap), $vwap), 10)))", - "category": "VWAP", - }, - # Factor 083 - { - "name": "Smart Money Flow", - "formula": "Neg(CsRank(Sum(Mul(IfElse(Greater($close, Delay($close, 1)), $volume, Neg($volume)), Div(Sub($high, $low), Add($close, 1e-8))), 10)))", - "category": "Volume", - }, - # Factor 084 - { - "name": "Return Rank Dispersion", - "formula": "Neg(CsRank(Sub(TsRank($returns, 5), TsRank($returns, 20))))", - "category": "Mean-reversion", - }, - # Factor 085 - { - "name": "Volume Acceleration", - "formula": "Neg(CsRank(Sub(Delta($volume, 5), Delta(Delay($volume, 5), 5))))", - "category": "Volume", - }, - # Factor 086 - { - "name": "Close-Low Ratio Trend", - "formula": "Neg(CsRank(Mean(Div(Sub($close, $low), Add(Sub($high, $low), 1e-8)), 5)))", - "category": "Mean-reversion", - }, - # Factor 087 - { - "name": "Hull MA Deviation", - "formula": "Neg(CsRank(Div(Sub($close, HMA($close, 10)), Add(Std($close, 10), 1e-8))))", - "category": "Trend", - }, - # Factor 088 - { - "name": "DEMA Momentum Signal", - "formula": "Neg(CsRank(Sub(DEMA($close, 5), DEMA($close, 20))))", - "category": "Momentum", - }, - # Factor 089 - { - "name": "Volume Profile Skew", - "formula": "Neg(CsRank(Skew(Div($volume, Add(Mean($volume, 20), 1e-8)), 10)))", - "category": "Volume", - }, - # Factor 090 - { - "name": "Conditional VWAP Signal", - "formula": "IfElse(Greater($close, $vwap), Neg(CsRank(Div(Sub($close, $vwap), $vwap))), CsRank(Div(Sub($vwap, $close), $vwap)))", - "category": "VWAP", - }, - # Factor 091 - { - "name": "Extreme Volume Reversal", - 
"formula": "Neg(CsRank(Mul(IfElse(Greater($volume, Mul(2, Mean($volume, 20))), 1, 0), $returns)))", - "category": "Volume", - }, - # Factor 092 - { - "name": "Range Expansion Signal", - "formula": "Neg(CsRank(Div(Sub($high, $low), Add(Mean(Sub($high, $low), 20), 1e-8))))", - "category": "Volatility", - }, - # Factor 093 - { - "name": "Short-Term IC Momentum", - "formula": "Neg(CsRank(Sum(Mul(Sign($returns), Abs($returns)), 5)))", - "category": "Momentum", - }, - # Factor 094 - { - "name": "VWAP Curvature", - "formula": "Neg(CsRank(Sub(Div(Sub($vwap, Delay($vwap, 5)), Add(Delay($vwap, 5), 1e-8)), Div(Sub(Delay($vwap, 5), Delay($vwap, 10)), Add(Delay($vwap, 10), 1e-8)))))", - "category": "VWAP", - }, - # Factor 095 - { - "name": "Relative Strength", - "formula": "Neg(CsRank(Div(Return($close, 5), Add(Return($close, 20), 1e-8))))", - "category": "Momentum", - }, - # Factor 096 - { - "name": "Volume-Correlated Return", - "formula": "Neg(CsRank(Cov($returns, $volume, 10)))", - "category": "Volume", - }, - # Factor 097 - { - "name": "Regime Volatility Band", - "formula": "IfElse(Greater(Std($returns, 5), Mul(1.5, Std($returns, 20))), Neg(CsRank(Return($close, 1))), Neg(CsRank(Return($close, 10))))", - "category": "Regime-switching", - }, - # Factor 098 - { - "name": "Open-Close Spread Momentum", - "formula": "Neg(CsRank(Mean(Div(Sub($close, $open), Add($open, 1e-8)), 5)))", - "category": "Intraday", - }, - # Factor 099 - { - "name": "Volatility-Scaled Reversal", - "formula": "Neg(CsRank(Div(Return($close, 5), Add(Std($returns, 20), 1e-8))))", - "category": "Mean-reversion", - }, - # Factor 100 - { - "name": "VWAP Time-Weighted Signal", - "formula": "Neg(CsRank(WMA(Div(Sub($close, $vwap), Add($vwap, 1e-8)), 20)))", - "category": "VWAP", - }, - # Factor 101 - { - "name": "Covariance Structure Shift", - "formula": "Neg(CsRank(Sub(Cov($returns, $volume, 5), Cov($returns, $volume, 20))))", - "category": "Volume", - }, - # Factor 102 - { - "name": "Quadratic Regression 
Residual", - "formula": "Neg(CsRank(TsLinRegResid(Square($returns), 20)))", - "category": "Higher-moment", - }, - # Factor 103 - { - "name": "VWAP Mean-Reversion Strength", - "formula": "Neg(CsRank(Mul(Div(Sub($close, $vwap), $vwap), Div($volume, Add(Mean($volume, 20), 1e-8)))))", - "category": "VWAP", - }, - # Factor 104 - { - "name": "Multi-Scale Momentum", - "formula": "Neg(CsRank(Add(Return($close, 5), Return($close, 20))))", - "category": "Momentum", - }, - # Factor 105 - { - "name": "Relative High Position", - "formula": "Neg(CsRank(Div(Sub(TsMax($high, 20), $close), Add(TsMax($high, 20), 1e-8))))", - "category": "Mean-reversion", - }, - # Factor 106 - { - "name": "Turnover Volatility", - "formula": "Neg(CsRank(Std(Div($amt, Add($volume, 1e-8)), 10)))", - "category": "Turnover", - }, - # Factor 107 - { - "name": "Regime Correlation Signal", - "formula": "IfElse(Greater(Abs(Corr($close, $volume, 10)), 0.5), Neg(CsRank(Return($close, 3))), Neg(CsRank(Return($close, 10))))", - "category": "Regime-switching", - }, - # Factor 108 - { - "name": "Intraday Momentum Reversal", - "formula": "Neg(CsRank(Div(Sub($close, $open), Add(Sub($high, $low), 1e-8))))", - "category": "Intraday", - }, - # Factor 109 - { - "name": "Volume-Weighted Slope", - "formula": "Neg(CsRank(TsLinRegSlope(Mul($returns, $volume), 10)))", - "category": "Volume", - }, - # Factor 110 - { - "name": "Adaptive Range Reversal", - "formula": "IfElse(Greater(Std($returns, 10), Mean(Std($returns, 10), 40)), Neg(CsRank(Div(Sub($close, TsMin($close, 10)), Add(Sub(TsMax($close, 10), TsMin($close, 10)), 1e-8)))), Neg(CsRank(Return($close, 5))))", - "category": "Regime-switching", - }, + { + "name": 'Intraday Range Position', + "formula": '(-cs_rank(((close - ts_min(close, 48)) / ((ts_max(close, 48) - ts_min(close, 48)) + 1e-8))))', + "category": 'Mean-reversion', + }, + { + "name": 'Volume-Weighted Momentum', + "formula": '(-cs_rank((ts_pct_change(close, 5) * (vol / ts_mean(vol, 20)))))', + "category": 
'Momentum', + }, + { + "name": 'Residual Volatility', + "formula": '(-cs_rank(ts_std((close - ts_ema(close, 10)), 20)))', + "category": 'Volatility', + }, + { + "name": 'Intraday Amplitude Ratio', + "formula": '(-cs_rank(((high - low) / (close + 1e-8))))', + "category": 'Volatility', + }, + { + "name": 'Volume Surprise', + "formula": '(-cs_rank(((vol - ts_mean(vol, 20)) / (ts_std(vol, 20) + 1e-8))))', + "category": 'Volume', + }, + { + "name": 'VWAP Deviation', + "formula": '(-((close - (amount / vol)) / (amount / vol)))', + "category": 'VWAP', + }, + { + "name": 'Short-term Reversal', + "formula": '(-cs_rank(ts_pct_change(close, 3)))', + "category": 'Mean-reversion', + }, + { + "name": 'Turnover Momentum', + "formula": '(-cs_rank(ts_delta((amount / (vol + 1e-8)), 5)))', + "category": 'Turnover', + }, + { + "name": 'High-Low Midpoint Reversion', + "formula": '(-cs_rank((close - ((high + low) / 2))))', + "category": 'Mean-reversion', + }, + # { + # "name": 'Rolling Beta Residual', + # "formula": '# TODO: Neg(CsRank(Resid($returns, Mean($returns, 20), 20)))', + # "category": 'Risk', + # }, + # { + # "name": 'VWAP Slope', + # "formula": '# TODO: Neg(CsRank(TsLinRegSlope(Div(Sub($close, $vwap), $vwap), 10)))', + # "category": 'VWAP', + # }, + { + "name": 'Accumulation-Distribution', + "formula": '(-cs_rank(ts_sum(((((2 * close) - (high + low)) / ((high - low) + 1e-8)) * vol), 10)))', + "category": 'Volume', + }, + { + "name": 'Relative Strength Index Deviation', + "formula": '(-cs_rank((ts_mean(max_(ts_delta(close, 1), 0), 14) - ts_mean(abs(min_(ts_delta(close, 1), 0)), 14))))', + "category": 'Momentum', + }, + { + "name": 'Price-Volume Correlation', + "formula": '(-ts_corr(close, vol, 10))', + "category": 'Volume', + }, + { + "name": 'Skewness of Returns', + "formula": '(-cs_rank(ts_skew((close / ts_delay(close, 1) - 1), 20)))', + "category": 'Higher-moment', + }, + { + "name": 'Kurtosis of Returns', + "formula": '(-cs_rank(ts_kurt((close / ts_delay(close, 1) - 1), 
20)))', + "category": 'Higher-moment', + }, + { + "name": 'Volume-Weighted Return', + "formula": '(-cs_rank((ts_sum(((close / ts_delay(close, 1) - 1) * vol), 10) / (ts_sum(vol, 10) + 1e-8))))', + "category": 'Volume', + }, + { + "name": 'Close-to-High Ratio', + "formula": '(-cs_rank(((high - close) / (high + 1e-8))))', + "category": 'Mean-reversion', + }, + { + "name": 'Delayed Correlation Shift', + "formula": '(-cs_rank((ts_corr(close, vol, 10) - ts_corr(ts_delay(close, 5), vol, 10))))', + "category": 'Volume', + }, + { + "name": 'Exponential Momentum', + "formula": '(-cs_rank((close - ts_ema(close, 20))))', + "category": 'Momentum', + }, + { + "name": 'Range-Adjusted Volume', + "formula": '(-cs_rank((vol / ((high - low) + 1e-8))))', + "category": 'Volume', + }, + { + "name": 'Cumulative Return Rank', + "formula": '(-cs_rank(ts_sum((close / ts_delay(close, 1) - 1), 10)))', + "category": 'Momentum', + }, + { + "name": 'VWAP Momentum', + "formula": '(-cs_rank(ts_pct_change((amount / vol), 5)))', + "category": 'VWAP', + }, + { + "name": 'Bollinger Band Position', + "formula": '(-cs_rank(((close - ts_mean(close, 20)) / (ts_std(close, 20) + 1e-8))))', + "category": 'Mean-reversion', + }, + # { + # "name": 'Volume Decay Weighted', + # "formula": '# TODO: Neg(CsRank(Decay($volume, 10)))', + # "category": 'Volume', + # }, + { + "name": 'Overnight Return', + "formula": '(-cs_rank(((open - ts_delay(close, 1)) / (ts_delay(close, 1) + 1e-8))))', + "category": 'Overnight', + }, + { + "name": 'Intraday Return', + "formula": '(-cs_rank(((close - open) / (open + 1e-8))))', + "category": 'Intraday', + }, + { + "name": 'Max Drawdown', + "formula": '(-cs_rank(((close - ts_max(close, 20)) / (ts_max(close, 20) + 1e-8))))', + "category": 'Risk', + }, + { + "name": 'Hurst Exponent Proxy', + "formula": '(-cs_rank((ts_std((close / ts_delay(close, 1) - 1), 20) / (ts_std((close / ts_delay(close, 1) - 1), 5) + 1e-8))))', + "category": 'Volatility', + }, + { + "name": 'Volume Imbalance', + 
"formula": '(-cs_rank((ts_mean(vol, 5) - ts_mean(vol, 20))))', + "category": 'Volume', + }, + { + "name": 'Weighted Close Position', + "formula": '(-cs_rank((((2 * close) - (high + low)) / ((high - low) + 1e-8))))', + "category": 'Mean-reversion', + }, + { + "name": 'Trend Intensity', + "formula": '(-cs_rank((abs(ts_delta(close, 10)) / (ts_sum(abs(ts_delta(close, 1)), 10) + 1e-8))))', + "category": 'Trend', + }, + { + "name": 'Return Dispersion', + "formula": '(-cs_rank(ts_std((close / ts_delay(close, 1) - 1), 5)))', + "category": 'Volatility', + }, + { + "name": 'VWAP Relative Strength', + "formula": '(-cs_rank(((ts_mean(close, 5) - (amount / vol)) / ((amount / vol) + 1e-8))))', + "category": 'VWAP', + }, + { + "name": 'Rank Reversal', + "formula": '(-cs_rank((ts_rank(close, 10) - ts_rank(close, 30))))', + "category": 'Mean-reversion', + }, + { + "name": 'Money Flow Index', + "formula": '(-cs_rank((ts_sum((max_(ts_delta(close, 1), 0) * vol), 14) / (ts_sum((abs(ts_delta(close, 1)) * vol), 14) + 1e-8))))', + "category": 'Volume', + }, + { + "name": 'Adaptive Momentum', + "formula": '(-cs_rank((ts_pct_change(close, 10) * (ts_std((close / ts_delay(close, 1) - 1), 5) / (ts_std((close / ts_delay(close, 1) - 1), 20) + 1e-8)))))', + "category": 'Momentum', + }, + # { + # "name": 'Volume Trend', + # "formula": '# TODO: Neg(CsRank(TsLinRegSlope($volume, 10)))', + # "category": 'Volume', + # }, + { + "name": 'Price Acceleration', + "formula": '(-cs_rank((ts_delta(close, 5) - ts_delta(ts_delay(close, 5), 5))))', + "category": 'Momentum', + }, + { + "name": 'Realized Volatility Ratio', + "formula": '(-cs_rank((ts_std((close / ts_delay(close, 1) - 1), 10) / (ts_std((close / ts_delay(close, 1) - 1), 30) + 1e-8))))', + "category": 'Volatility', + }, + { + "name": 'Amount Concentration', + "formula": '(-cs_rank((ts_max(amount, 5) / (ts_mean(amount, 20) + 1e-8))))', + "category": 'Turnover', + }, + { + "name": 'Cross-Sectional Volume Rank', + "formula": '(-cs_rank((vol / 
(ts_mean(vol, 60) + 1e-8))))', + "category": 'Volume', + }, + { + "name": 'Gap Momentum', + "formula": '(-cs_rank(ts_sum(((open - ts_delay(close, 1)) / (ts_delay(close, 1) + 1e-8)), 5)))', + "category": 'Overnight', + }, + # { + # "name": 'VWAP Distance Decay', + # "formula": '# TODO: Neg(CsRank(Decay(Div(Sub($close, $vwap), Add($vwap, 1e-8)), 10)))', + # "category": 'VWAP', + # }, + { + "name": 'Tail Risk Indicator', + "formula": '(-cs_rank((ts_min((close / ts_delay(close, 1) - 1), 20) / (ts_std((close / ts_delay(close, 1) - 1), 20) + 1e-8))))', + "category": 'Risk', + }, + { + "name": 'Volatility-Regime Reversal Divergence', + "formula": 'if_((ts_std((close / ts_delay(close, 1) - 1), 12) > ts_mean(ts_std((close / ts_delay(close, 1) - 1), 12), 48)), (-cs_rank(ts_delta(close, 3))), (-cs_rank(((close - low) / ((high - low) + 0.0001)))))', + "category": 'Regime-switching', + }, + { + "name": 'Regime Volume Signal', + "formula": 'if_((vol > ts_mean(vol, 20)), (-cs_rank((close / ts_delay(close, 1) - 1))), (-cs_rank(ts_pct_change(close, 5))))', + "category": 'Regime-switching', + }, + { + "name": 'Liquidity-Adjusted Reversal', + "formula": '(-cs_rank((ts_pct_change(close, 3) * (vol / (ts_mean(vol, 20) + 1e-8)))))', + "category": 'Mean-reversion', + }, + { + "name": 'Cross-Sectional Volatility Rank', + "formula": '(-cs_rank(cs_rank(ts_std((close / ts_delay(close, 1) - 1), 10))))', + "category": 'Volatility', + }, + { + "name": 'VWAP Bollinger', + "formula": '(-cs_rank((((amount / vol) - ts_mean((amount / vol), 20)) / (ts_std((amount / vol), 20) + 1e-8))))', + "category": 'VWAP', + }, + { + "name": 'Smoothed Return Reversal', + "formula": '(-cs_rank(ts_ema((close / ts_delay(close, 1) - 1), 5)))', + "category": 'Mean-reversion', + }, + { + "name": 'Volume-Price Divergence', + "formula": '(-cs_rank((ts_rank(vol, 10) - ts_rank(close, 10))))', + "category": 'Volume', + }, + # { + # "name": 'Decay Weighted Momentum', + # "formula": '# TODO: Neg(CsRank(Decay($returns, 20)))', + 
# "category": 'Momentum', + # }, + { + "name": 'Range Percentile', + "formula": '(-cs_rank(((close - ts_min(close, 20)) / ((ts_max(close, 20) - ts_min(close, 20)) + 1e-8))))', + "category": 'Mean-reversion', + }, + { + "name": 'Volume Skewness', + "formula": '(-cs_rank(ts_skew(vol, 20)))', + "category": 'Volume', + }, + # { + # "name": 'Residual Momentum', + # "formula": '# TODO: Neg(CsRank(TsLinRegResid($close, 20)))', + # "category": 'Momentum', + # }, + { + "name": 'VWAP Trend', + "formula": '(-cs_rank(ts_delta(((close - (amount / vol)) / (amount / vol)), 5)))', + "category": 'VWAP', + }, + { + "name": 'Return Autocorrelation', + "formula": '(-cs_rank(ts_corr((close / ts_delay(close, 1) - 1), ts_delay((close / ts_delay(close, 1) - 1), 1), 10)))', + "category": 'Mean-reversion', + }, + { + "name": 'Price Efficiency', + "formula": '(-cs_rank((abs(ts_sum((close / ts_delay(close, 1) - 1), 10)) / (ts_sum(abs((close / ts_delay(close, 1) - 1)), 10) + 1e-8))))', + "category": 'Trend', + }, + { + "name": 'Relative Volume Change', + "formula": '(-cs_rank(ts_pct_change(vol, 5)))', + "category": 'Volume', + }, + { + "name": 'Weighted VWAP Position', + "formula": '(-cs_rank(ts_wma(((close - (amount / vol)) / (amount / vol)), 10)))', + "category": 'VWAP', + }, + { + "name": 'Regime Momentum Flip', + "formula": 'if_((ts_mean((close / ts_delay(close, 1) - 1), 5) > 0), (-cs_rank(ts_pct_change(close, 10))), cs_rank(ts_pct_change(close, 3)))', + "category": 'Regime-switching', + }, + { + "name": 'High-Low Volatility', + "formula": '(-cs_rank(ts_mean(((high - low) / (close + 1e-8)), 10)))', + "category": 'Volatility', + }, + { + "name": 'Opening Gap Reversal', + "formula": '(-cs_rank(((open - ts_delay(close, 1)) / (ts_std((close / ts_delay(close, 1) - 1), 10) + 1e-8))))', + "category": 'Overnight', + }, + { + "name": 'Volume Momentum Spread', + "formula": '(-cs_rank((ts_mean(vol, 5) - ts_mean(vol, 40))))', + "category": 'Volume', + }, + { + "name": 'Regime Volume Reversal', + 
"formula": 'if_(((vol / (ts_mean(vol, 20) + 1e-8)) > 1.5), (-cs_rank((close / ts_delay(close, 1) - 1))), (-cs_rank(ts_pct_change(close, 10))))', + "category": 'Regime-switching', + }, + # { + # "name": 'Slope Reversal', + # "formula": '# TODO: Neg(CsRank(TsLinRegSlope($close, 5)))', + # "category": 'Mean-reversion', + # }, + # { + # "name": 'VWAP Momentum Decay', + # "formula": '# TODO: Neg(CsRank(Decay(Return($vwap, 3), 10)))', + # "category": 'VWAP', + # }, + { + "name": 'Turnover Rate Change', + "formula": '(-cs_rank(ts_delta((amount / (vol + 1e-8)), 10)))', + "category": 'Turnover', + }, + # { + # "name": 'Return Quantile Signal', + # "formula": '# TODO: Neg(CsRank(Quantile($returns, 20, 0.75)))', + # "category": 'Higher-moment', + # }, + { + "name": 'Double EMA Crossover', + "formula": '(-cs_rank((ts_ema(close, 5) - ts_ema(close, 20))))', + "category": 'Trend', + }, + { + "name": 'Conditional Volatility Return', + "formula": '(-cs_rank(((close / ts_delay(close, 1) - 1) / (ts_std((close / ts_delay(close, 1) - 1), 10) + 1e-8))))', + "category": 'Risk', + }, + # { + # "name": 'Amplitude Trend', + # "formula": '# TODO: Neg(CsRank(TsLinRegSlope(Div(Sub($high, $low), Add($close, 1e-8)), 10)))', + # "category": 'Volatility', + # }, + { + "name": 'Volume-Weighted Range', + "formula": '(-cs_rank(ts_mean((((high - low) / (close + 1e-8)) * vol), 10)))', + "category": 'Volume', + }, + { + "name": 'Intraday Efficiency Ratio', + "formula": '(-cs_rank((abs((close - open)) / ((high - low) + 1e-8))))', + "category": 'Intraday', + }, + { + "name": 'Cumulative Volume Signal', + "formula": '(-cs_rank((ts_sum(((close / ts_delay(close, 1) - 1) * vol), 20) / (ts_sum(vol, 20) + 1e-8))))', + "category": 'Volume', + }, + { + "name": 'VWAP Cross-Sectional Momentum', + "formula": '(-cs_rank(cs_rank(ts_pct_change((amount / vol), 10))))', + "category": 'VWAP', + }, + { + "name": 'Mean-Reversion Indicator', + "formula": '(-cs_rank(((close - ts_mean(close, 10)) / (ts_mean(close, 10) + 
1e-8))))', + "category": 'Mean-reversion', + }, + { + "name": 'Volume Regime Indicator', + "formula": '(-cs_rank((ts_std(vol, 5) / (ts_std(vol, 20) + 1e-8))))', + "category": 'Volume', + }, + { + "name": 'Return Persistence', + "formula": '(-cs_rank((sign(ts_delta(close, 1)) * sign(ts_delta(close, 5)))))', + "category": 'Momentum', + }, + # { + # "name": 'Regime Trend Strength', + # "formula": '# TODO: IfElse(Greater(Abs(TsLinRegSlope($close, 20)), Std($close, 20)), Neg(CsRank(TsLinRegSlope($close, 5))), Neg(CsRank(Return($close, 3))))', + # "category": 'Regime-switching', + # }, + { + "name": 'VWAP Dispersion', + "formula": '(-cs_rank(ts_std(((close - (amount / vol)) / (amount / vol)), 10)))', + "category": 'VWAP', + }, + { + "name": 'Smart Money Flow', + "formula": '(-cs_rank(ts_sum((if_((close > ts_delay(close, 1)), vol, (-vol)) * ((high - low) / (close + 1e-8))), 10)))', + "category": 'Volume', + }, + { + "name": 'Return Rank Dispersion', + "formula": '(-cs_rank((ts_rank((close / ts_delay(close, 1) - 1), 5) - ts_rank((close / ts_delay(close, 1) - 1), 20))))', + "category": 'Mean-reversion', + }, + { + "name": 'Volume Acceleration', + "formula": '(-cs_rank((ts_delta(vol, 5) - ts_delta(ts_delay(vol, 5), 5))))', + "category": 'Volume', + }, + { + "name": 'Close-Low Ratio Trend', + "formula": '(-cs_rank(ts_mean(((close - low) / ((high - low) + 1e-8)), 5)))', + "category": 'Mean-reversion', + }, + # { + # "name": 'Hull MA Deviation', + # "formula": '# TODO: Neg(CsRank(Div(Sub($close, HMA($close, 10)), Add(Std($close, 10), 1e-8))))', + # "category": 'Trend', + # }, + # { + # "name": 'DEMA Momentum Signal', + # "formula": '# TODO: Neg(CsRank(Sub(DEMA($close, 5), DEMA($close, 20))))', + # "category": 'Momentum', + # }, + { + "name": 'Volume Profile Skew', + "formula": '(-cs_rank(ts_skew((vol / (ts_mean(vol, 20) + 1e-8)), 10)))', + "category": 'Volume', + }, + { + "name": 'Conditional VWAP Signal', + "formula": 'if_((close > (amount / vol)), (-cs_rank(((close - (amount 
/ vol)) / (amount / vol)))), cs_rank((((amount / vol) - close) / (amount / vol))))', + "category": 'VWAP', + }, + { + "name": 'Extreme Volume Reversal', + "formula": '(-cs_rank((if_((vol > (2 * ts_mean(vol, 20))), 1, 0) * (close / ts_delay(close, 1) - 1))))', + "category": 'Volume', + }, + { + "name": 'Range Expansion Signal', + "formula": '(-cs_rank(((high - low) / (ts_mean((high - low), 20) + 1e-8))))', + "category": 'Volatility', + }, + { + "name": 'Short-Term IC Momentum', + "formula": '(-cs_rank(ts_sum((sign((close / ts_delay(close, 1) - 1)) * abs((close / ts_delay(close, 1) - 1))), 5)))', + "category": 'Momentum', + }, + { + "name": 'VWAP Curvature', + "formula": '(-cs_rank(((((amount / vol) - ts_delay((amount / vol), 5)) / (ts_delay((amount / vol), 5) + 1e-8)) - ((ts_delay((amount / vol), 5) - ts_delay((amount / vol), 10)) / (ts_delay((amount / vol), 10) + 1e-8)))))', + "category": 'VWAP', + }, + { + "name": 'Relative Strength', + "formula": '(-cs_rank((ts_pct_change(close, 5) / (ts_pct_change(close, 20) + 1e-8))))', + "category": 'Momentum', + }, + { + "name": 'Volume-Correlated Return', + "formula": '(-cs_rank(ts_cov((close / ts_delay(close, 1) - 1), vol, 10)))', + "category": 'Volume', + }, + { + "name": 'Regime Volatility Band', + "formula": 'if_((ts_std((close / ts_delay(close, 1) - 1), 5) > (1.5 * ts_std((close / ts_delay(close, 1) - 1), 20))), (-cs_rank(ts_pct_change(close, 1))), (-cs_rank(ts_pct_change(close, 10))))', + "category": 'Regime-switching', + }, + { + "name": 'Open-Close Spread Momentum', + "formula": '(-cs_rank(ts_mean(((close - open) / (open + 1e-8)), 5)))', + "category": 'Intraday', + }, + { + "name": 'Volatility-Scaled Reversal', + "formula": '(-cs_rank((ts_pct_change(close, 5) / (ts_std((close / ts_delay(close, 1) - 1), 20) + 1e-8))))', + "category": 'Mean-reversion', + }, + { + "name": 'VWAP Time-Weighted Signal', + "formula": '(-cs_rank(ts_wma(((close - (amount / vol)) / ((amount / vol) + 1e-8)), 20)))', + "category": 'VWAP', + }, + 
{ + "name": 'Covariance Structure Shift', + "formula": '(-cs_rank((ts_cov((close / ts_delay(close, 1) - 1), vol, 5) - ts_cov((close / ts_delay(close, 1) - 1), vol, 20))))', + "category": 'Volume', + }, + # { + # "name": 'Quadratic Regression Residual', + # "formula": '# TODO: Neg(CsRank(TsLinRegResid(Square($returns), 20)))', + # "category": 'Higher-moment', + # }, + { + "name": 'VWAP Mean-Reversion Strength', + "formula": '(-cs_rank((((close - (amount / vol)) / (amount / vol)) * (vol / (ts_mean(vol, 20) + 1e-8)))))', + "category": 'VWAP', + }, + { + "name": 'Multi-Scale Momentum', + "formula": '(-cs_rank((ts_pct_change(close, 5) + ts_pct_change(close, 20))))', + "category": 'Momentum', + }, + { + "name": 'Relative High Position', + "formula": '(-cs_rank(((ts_max(high, 20) - close) / (ts_max(high, 20) + 1e-8))))', + "category": 'Mean-reversion', + }, + { + "name": 'Turnover Volatility', + "formula": '(-cs_rank(ts_std((amount / (vol + 1e-8)), 10)))', + "category": 'Turnover', + }, + { + "name": 'Regime Correlation Signal', + "formula": 'if_((abs(ts_corr(close, vol, 10)) > 0.5), (-cs_rank(ts_pct_change(close, 3))), (-cs_rank(ts_pct_change(close, 10))))', + "category": 'Regime-switching', + }, + { + "name": 'Intraday Momentum Reversal', + "formula": '(-cs_rank(((close - open) / ((high - low) + 1e-8))))', + "category": 'Intraday', + }, + # { + # "name": 'Volume-Weighted Slope', + # "formula": '# TODO: Neg(CsRank(TsLinRegSlope(Mul($returns, $volume), 10)))', + # "category": 'Volume', + # }, + { + "name": 'Adaptive Range Reversal', + "formula": 'if_((ts_std((close / ts_delay(close, 1) - 1), 10) > ts_mean(ts_std((close / ts_delay(close, 1) - 1), 10), 40)), (-cs_rank(((close - ts_min(close, 10)) / ((ts_max(close, 10) - ts_min(close, 10)) + 1e-8)))), (-cs_rank(ts_pct_change(close, 5))))', + "category": 'Regime-switching', + }, ] + def import_from_paper( path: Optional[Union[str, Path]] = None, ) -> FactorLibrary: diff --git a/src/scripts/translate_paper_factors.py 
b/src/scripts/translate_paper_factors.py
new file mode 100644
index 0000000..92c59fd
--- /dev/null
+++ b/src/scripts/translate_paper_factors.py
@@ -0,0 +1,293 @@
+"""One-off Paper Factors DSL migration script.
+
+Translates the CamelCase DSL formulas of the 110 PAPER_FACTORS hard-coded in
+src.factorminer.core.library_io into local snake_case DSL strings.
+The translated results are written straight back into the original constant list.
+"""
+
+import re
+import sys
+from pathlib import Path
+from typing import List, Tuple
+
+# Make sure project modules can be imported
+sys.path.insert(0, str(Path(__file__).parent.parent.parent))
+
+from src.factorminer.core.library_io import PAPER_FACTORS
+
+
+UNSUPPORTED_OPS = {
+    "Decay",
+    "TsLinRegSlope",
+    "TsLinRegResid",
+    "Resid",
+    "Quantile",
+    "HMA",
+    "DEMA",
+}
+
+FIELD_MAP = {
+    "$close": "close",
+    "$volume": "vol",
+    "$amt": "amount",
+    "$vwap": "(amount / vol)",
+    "$returns": "(close / ts_delay(close, 1) - 1)",
+    "$high": "high",
+    "$low": "low",
+    "$open": "open",
+}
+
+
+class Node:
+    """Base class for AST nodes."""
+
+    pass
+
+
+class LeafNode(Node):
+    """Leaf node (field or constant)."""
+
+    def __init__(self, value: str) -> None:
+        self.value = value
+
+
+class FuncNode(Node):
+    """Function call node."""
+
+    def __init__(self, name: str, args: List[Node]) -> None:
+        self.name = name
+        self.args = args
+
+
+def _parse_expr(s: str, i: int = 0) -> Tuple[Node, int]:
+    """Recursively parse a CamelCase DSL expression.
+
+    Args:
+        s: expression string
+        i: start index
+
+    Returns:
+        (parsed node, next index)
+    """
+    j = i
+
+    # Try to read a numeric constant (supports 1e-8, 0.75, etc.)
+    num_match = re.match(r"-?\d+(\.\d+)?(e[+-]?\d+)?", s[j:])
+    if num_match:
+        val = num_match.group(0)
+        j += len(val)
+        return LeafNode(val), j
+
+    # Read an identifier
+    while j < len(s) and (s[j].isalnum() or s[j] == "$" or s[j] == "_"):
+        j += 1
+    name = s[i:j]
+
+    if j >= len(s) or s[j] != "(":
+        return LeafNode(name), j
+
+    # Function call
+    j += 1 # skip '('
+    args: List[Node] = []
+    while j < len(s) and s[j] != ")":
+        # Skip whitespace
+        while j < len(s) and s[j] == " ":
+            j += 1
+        if j >= len(s):
+            break
+        if s[j] == ")":
+            break
+        arg, j = _parse_expr(s, j)
+        args.append(arg)
+        while j < len(s) and s[j] == " ":
+            j += 1
+        if j < len(s) and s[j] == ",":
+            j += 1
+        elif j < len(s) and s[j] == ")":
+            break
+
+    if j < len(s) and s[j] == ")":
+        j += 1
+    return FuncNode(name, args), j
+
+
+def _reconstruct_original(node: Node) -> str:
+    """Reassemble the AST into the original CamelCase string."""
+    if isinstance(node, LeafNode):
+        return node.value
+    args_str = ", ".join(_reconstruct_original(a) for a in node.args)
+    return f"{node.name}({args_str})"
+
+
+def _contains_unsupported(node: Node) -> bool:
+    """Check whether the AST contains an unimplemented operator."""
+    if isinstance(node, LeafNode):
+        return False
+    if node.name in UNSUPPORTED_OPS:
+        return True
+    return any(_contains_unsupported(a) for a in node.args)
+
+
+def _translate(node: Node, toplevel: bool = True) -> str:
+    """Translate the AST into the local snake_case DSL.
+
+    Args:
+        node: AST node
+        toplevel: whether this is the top-level call
+
+    Returns:
+        Local DSL string; "# TODO: <original formula>" if it contains an unimplemented operator
+    """
+    if toplevel and _contains_unsupported(node):
+        return f"# TODO: {_reconstruct_original(node)}"
+
+    if isinstance(node, LeafNode):
+        val = node.value
+        if val in FIELD_MAP:
+            return FIELD_MAP[val]
+        return val
+
+    name = node.name
+    args = node.args
+    ta = [a for a in args]
+
+    if name == "Neg":
+        return f"(-{_translate(ta[0], toplevel=False)})"
+    if name == "Add":
+        return f"({_translate(ta[0])} + {_translate(ta[1])})"
+    if name == "Sub":
+        return f"({_translate(ta[0])} - {_translate(ta[1])})"
+    if name == "Mul":
+        return f"({_translate(ta[0])} * {_translate(ta[1])})"
+    if name == "Div":
+        return f"({_translate(ta[0])} / {_translate(ta[1])})"
+    if name == "Greater":
+        return f"({_translate(ta[0])} > {_translate(ta[1])})"
+    if name == "Square":
+        return f"({_translate(ta[0])} ** 2)"
+    if name == "CsRank":
+        return f"cs_rank({_translate(ta[0])})"
+    if name == "CsZscore":
+        return f"cs_zscore({_translate(ta[0])})"
+    if name in ("TsMean", "Mean"):
+        return f"ts_mean({_translate(ta[0])}, {ta[1].value})"
+    if name == "TsMax":
+        return f"ts_max({_translate(ta[0])}, {ta[1].value})"
+    if name == "TsMin":
+        return f"ts_min({_translate(ta[0])}, {ta[1].value})"
+    if name == "Std":
+        return f"ts_std({_translate(ta[0])}, {ta[1].value})"
+    if name == "Delta":
+        return f"ts_delta({_translate(ta[0])}, {ta[1].value})"
+    if name == "Delay":
+        return f"ts_delay({_translate(ta[0])}, {ta[1].value})"
+    if name == "Corr":
+        return f"ts_corr({_translate(ta[0])}, {_translate(ta[1])}, {ta[2].value})"
+    if name == "Cov":
+        return f"ts_cov({_translate(ta[0])}, {_translate(ta[1])}, {ta[2].value})"
+    if name == "Sum":
+        return f"ts_sum({_translate(ta[0])}, {ta[1].value})"
+    if name == "Return":
+        return f"ts_pct_change({_translate(ta[0])}, {ta[1].value})"
+    if name == "EMA":
+        return f"ts_ema({_translate(ta[0])}, {ta[1].value})"
+    if name == "WMA":
+        return f"ts_wma({_translate(ta[0])}, {ta[1].value})"
+    if name == "SMA":
+        return f"ts_mean({_translate(ta[0])}, {ta[1].value})"
+    if name == "Skew":
+        return f"ts_skew({_translate(ta[0])}, {ta[1].value})"
+    if name == "Kurt":
+        return f"ts_kurt({_translate(ta[0])}, {ta[1].value})"
+    if name == "Abs":
+        return f"abs({_translate(ta[0])})"
+    if name == "Sign":
+        return f"sign({_translate(ta[0])})"
+    if name == "Max":
+        return f"max_({_translate(ta[0])}, {_translate(ta[1])})"
+    if name == "Min":
+        return f"min_({_translate(ta[0])}, {_translate(ta[1])})"
+    if name == "IfElse":
+        return f"if_({_translate(ta[0])}, {_translate(ta[1])}, {_translate(ta[2])})"
+    if name == "TsRank":
+        return f"ts_rank({_translate(ta[0])}, {ta[1].value})"
+
+    raise ValueError(f"Unknown function: {name}")
+
+
+def _indent_block(lines: List[str], spaces: int = 4) -> List[str]:
+    """Add uniform indentation to a code block."""
+    prefix = " " * spaces
+    return [prefix + line for line in lines]
+
+
+def main() -> None:
+    """Entry point: translate PAPER_FACTORS and write them back to library_io.py."""
+    success_count = 0
+    todo_count = 0
+    translated_entries: List[str] = []
+
+    for entry in PAPER_FACTORS:
+        formula = entry["formula"]
+        tree, next_pos = _parse_expr(formula)
+        if next_pos != len(formula):
+            print(f"[ERROR] Parsing incomplete: {formula} (stopped at {next_pos})")
+            raise SystemExit(1)
+
+        translated = _translate(tree)
+        is_todo = translated.startswith("# TODO:")
+
+        if is_todo:
+            todo_count += 1
+            # Comment out the whole dict entry so it does not affect downstream steps
+            lines = [
+                "# {",
+                f'# "name": {entry["name"]!r},',
+                f'# "formula": {translated!r},',
+                f'# "category": {entry["category"]!r},',
+                "# },",
+            ]
+        else:
+            success_count += 1
+            lines = [
+                "{",
+                f' "name": {entry["name"]!r},',
+                f' "formula": {translated!r},',
+                f' "category": {entry["category"]!r},',
+                " },",
+            ]
+        translated_entries.extend(lines)
+
+    # Build the new PAPER_FACTORS code block
+    new_factor_block = "PAPER_FACTORS: List[Dict[str, str]] = [\n"
+    for line in translated_entries:
+        if line.startswith("# "):
+            new_factor_block += f" {line}\n"
+        elif line in ("{", " },"):
+            new_factor_block += f" {line}\n"
+        else:
+            new_factor_block += f" {line}\n"
+    new_factor_block += "]\n"
+
+    # Read the original file
+    lib_io_path = Path("src/factorminer/core/library_io.py")
+    original_text = lib_io_path.read_text(encoding="utf-8")
+
+    # Replace the PAPER_FACTORS definition block via regex
+    pattern = r"PAPER_FACTORS: List\[Dict\[str, str\]\] = \[.*?^\]"
+    match = re.search(pattern, original_text, re.DOTALL | re.MULTILINE)
+    if not match:
+        print("[ERROR] Could not locate the PAPER_FACTORS definition block in library_io.py")
+        raise SystemExit(1)
+
+    new_text = (
+        original_text[: match.start()] + new_factor_block + original_text[match.end() :]
+    )
+    lib_io_path.write_text(new_text, encoding="utf-8")
+
+    total = len(PAPER_FACTORS)
+    print(f"[translate] {success_count}/{total} translated, {todo_count} TODO (commented out)")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/tests/test_factorminer_paper_factors.py b/tests/test_factorminer_paper_factors.py
new file mode 100644
index 0000000..6b93f14
--- /dev/null
+++ b/tests/test_factorminer_paper_factors.py
@@ -0,0 +1,54 @@
+"""Tests DSL formula parsing for every factor in PAPER_FACTORS.
+
+Commented-out (# TODO) factors are excluded; all others must be parsed correctly by the local FormulaParser.
+"""
+
+import pytest
+
+from src.factorminer.core.library_io import PAPER_FACTORS
+from src.factors.parser import FormulaParser
+from src.factors.registry import FunctionRegistry
+
+
+@pytest.fixture(scope="module")
+def parser():
+    """Provide a shared FormulaParser instance."""
+    return FormulaParser(FunctionRegistry())
+
+
+class TestPaperFactorParsing:
+    """Verify the DSL parse success rate of the 110 paper factors."""
+
+    def test_all_paper_factors_parse(self, parser):
+        """Iterate over PAPER_FACTORS and assert that every non-TODO formula parses."""
+        success = 0
+        skipped = 0
+        failures = []
+
+        for entry in PAPER_FACTORS:
+            formula = entry["formula"]
+            name = entry["name"]
+
+            if formula.startswith("# TODO:"):
+                skipped += 1
+                continue
+
+            try:
+                parser.parse(formula)
+                success += 1
+            except Exception as e:
+                failures.append((name, formula, str(e)))
+
+        total = len(PAPER_FACTORS)
+        print(f"[paper_factors] {success}/{total} parsed, {skipped} skipped for unimplemented operators")
+
+        assert not failures, "The following factors failed to parse:\n" + "\n".join(
+            f" - {name}: {err}\n formula: {formula}"
+            for name, formula, err in failures
+        )
+
+    def test_todo_ratio_acceptable(self):
+        """Ensure the TODO ratio is not too high (current threshold 20%)."""
+        todo_count = sum(1 for e in PAPER_FACTORS if e["formula"].startswith("# TODO:"))
+        ratio = todo_count / len(PAPER_FACTORS)
+        assert ratio <= 0.20, f"TODO factor ratio {ratio:.1%} exceeds 20%"
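
A translated formula should contain no leftover FactorMiner syntax, that is, no CamelCase operator calls and no `$`-prefixed fields. The following is a minimal sketch of a sanity check that runs without the local `FormulaParser`. The helper names `find_untranslated_tokens` and `parens_balanced` are hypothetical and do not appear anywhere in this diff. The sketch also assumes that every local DSL function name starts with a lowercase letter, as the translated entries in this diff do.

```python
import re
from typing import List

# Hypothetical helpers for illustration only; not part of the diff.
# Assumption: local DSL function names are lowercase (cs_rank, ts_mean, if_, ...),
# so any call that starts with an uppercase letter is untranslated FactorMiner syntax.
_CAMEL_CALL = re.compile(r"\b[A-Z][A-Za-z]*\s*\(")
_DOLLAR_FIELD = re.compile(r"\$[A-Za-z_]+")


def find_untranslated_tokens(formula: str) -> List[str]:
    """Return FactorMiner-style tokens still present in a formula string.

    CamelCase operator names come first, in order of appearance,
    followed by any $-prefixed field names.
    """
    leftovers = [
        m.group(0).rstrip("(").strip() for m in _CAMEL_CALL.finditer(formula)
    ]
    leftovers += _DOLLAR_FIELD.findall(formula)
    return leftovers


def parens_balanced(formula: str) -> bool:
    """Check that every ')' closes an earlier '(' and none is left open."""
    depth = 0
    for ch in formula:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


if __name__ == "__main__":
    translated = "(-cs_rank(ts_pct_change(close, 3)))"
    original = "Neg(CsRank(Decay($volume, 10)))"
    print(find_untranslated_tokens(translated), parens_balanced(translated))
    print(find_untranslated_tokens(original), parens_balanced(original))
```

Running both helpers over each non-TODO entry of `PAPER_FACTORS` before calling the real parser would flag a half-translated formula without needing the local function registry.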