refactor(factor): completely rebuild the factor computation framework - introduce a DSL expression system

- Remove the old factor framework: delete base.py, composite.py, data_loader.py, data_spec.py and all submodules (momentum, financial, quality, sentiment, etc.)
- Add a DSL expression system: implement the factor DSL compiler and translator
  - dsl.py: domain-specific language definitions
  - compiler.py: AST compilation and optimization
  - translator.py: Polars expression translation
  - api.py: unified API surface
- Add a data routing layer: data_router.py dynamically routes fields to tables
- Add an API wrapper: api_pro_bar.py provides the pro_bar data interface
- Update the execution engine: engine.py adapted to the new DSL architecture
- Rework the test suite: delete old tests; add test_dsl_promotion.py, test_factor_integration.py, test_pro_bar.py
- Clean up docs: delete 8 outdated documents (factor_design, db_sync_guide, etc.)
### 4.5 Token-Bucket Rate Limiting Requirements

All API calls must go through `TushareClient`, which automatically satisfies the token-bucket rate-limiting requirement.
#### 4.5.1 Basic Usage (Single-Threaded)
```python
import pandas as pd

from src.data.client import TushareClient


def get_{data_type}(...) -> pd.DataFrame:
    client = TushareClient()

    # Build parameters
    params = {}
    if trade_date:
        params["trade_date"] = trade_date
    if ts_code:
        params["ts_code"] = ts_code
    # ...

    # Fetch data (rate limiting handled automatically)
    data = client.query("{api_name}", **params)
    return data
```
**Note**: `TushareClient` automatically handles:

- Token-bucket rate limiting
- API retry logic with exponential backoff
- Configuration loading
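For background, a token-bucket limiter of the kind described above can be sketched as follows. The class below is an illustrative assumption, not the project's actual implementation inside `TushareClient`:

```python
import threading
import time


class TokenBucket:
    """Illustrative thread-safe token bucket (hypothetical sketch,
    not the project's actual TushareClient internals)."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self, tokens: float = 1.0) -> None:
        """Block until `tokens` are available, then consume them."""
        while True:
            with self.lock:
                now = time.monotonic()
                # Refill proportionally to elapsed time, capped at capacity
                self.tokens = min(
                    self.capacity,
                    self.tokens + (now - self.last_refill) * self.rate,
                )
                self.last_refill = now
                if self.tokens >= tokens:
                    self.tokens -= tokens
                    return
                wait = (tokens - self.tokens) / self.rate
            time.sleep(wait)
```

Note that the bucket only throttles correctly when a single instance is shared by all callers, which is exactly the concern section 4.5.2 addresses.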
#### 4.5.2 Multi-Threaded / Concurrent Usage (Important)
**Problem**: When multiple threads call the API concurrently and each thread creates its own `TushareClient` instance, each instance gets an independent rate limiter. The effective concurrent request rate becomes the number of threads times the single-limiter rate, so **rate limiting silently fails**.
**Solution**: Data-fetching functions must accept an optional `client` parameter, and the Sync class passes in a shared client instance.
**Data-fetching function signature** (must support the `client` parameter):
```python
from typing import Optional

import pandas as pd

from src.data.client import TushareClient


def get_{data_type}(
    ts_code: str,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
    client: Optional[TushareClient] = None,  # New: optional client parameter
) -> pd.DataFrame:
    """Fetch {data description} from Tushare.

    Args:
        ts_code: Stock code
        start_date: Start date (YYYYMMDD)
        end_date: End date (YYYYMMDD)
        client: Optional TushareClient instance for shared rate limiting.
            If None, creates a new client. For concurrent sync operations,
            pass a shared client to ensure proper rate limiting.

    Returns:
        pd.DataFrame with data
    """
    client = client or TushareClient()  # Create a new instance only if none was provided

    params = {"ts_code": ts_code}
    if start_date:
        params["start_date"] = start_date
    if end_date:
        params["end_date"] = end_date

    data = client.query("{api_name}", **params)
    return data
```
**Sync class implementation** (must pass the shared client):
```python
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

import pandas as pd

from src.data.client import TushareClient
from src.data.storage import ThreadSafeStorage


class {DataType}Sync:
    def __init__(self, max_workers: Optional[int] = None):
        self.storage = ThreadSafeStorage()
        self.client = TushareClient()  # Shared client instance
        self.max_workers = max_workers or 10

    def sync_single_stock(
        self,
        ts_code: str,
        start_date: str,
        end_date: str,
    ) -> pd.DataFrame:
        """Sync data for a single stock."""
        # Pass the shared client so rate limiting stays effective across threads
        data = get_{data_type}(
            ts_code=ts_code,
            start_date=start_date,
            end_date=end_date,
            client=self.client,  # Key: pass the shared client
        )
        return data

    def sync_all(self, ...):
        # Run concurrently with ThreadPoolExecutor
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            # All threads share self.client, so the rate limiter works correctly
            ...
```
**Key rules**:

1. Every per-stock fetch interface must accept a `client: Optional[TushareClient] = None` parameter
2. The Sync class creates `self.client = TushareClient()` in `__init__`
3. The Sync class's sync methods must pass `self.client` to the data-fetching functions
4. Data-fetching functions use the `client = client or TushareClient()` pattern
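One way to check rules 1, 3, and 4 without touching the network is to pass in a recording stub. `FakeClient` and `get_daily` below are hypothetical illustrations of the pattern, not project code:

```python
from typing import Optional


class FakeClient:
    """Hypothetical stand-in for TushareClient that records every call."""

    def __init__(self):
        self.calls = []

    def query(self, api_name, **params):
        self.calls.append((api_name, params))
        return []  # placeholder result


def get_daily(ts_code: str, client: Optional[FakeClient] = None):
    # Rule 4: fall back to a fresh client only when none is provided
    client = client or FakeClient()
    return client.query("daily", ts_code=ts_code)


# Rule 3: the sync layer passes one shared client for every call
shared = FakeClient()
for code in ["000001.SZ", "600000.SH"]:
    get_daily(code, client=shared)
```

If the function ignored the `client` argument, `shared.calls` would stay empty and the check would fail, which is the failure mode this section warns about.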
## 5. DuckDB Storage Specification
### 5.0 Mandatory Persistence Requirement (Critical)
**Every wrapped API interface must persist its data to DuckDB.**
This is the core principle of data synchronization. It ensures:

- Data persistence: avoids repeated API calls, saving tokens
- Incremental updates: sync intelligently based on the local data state
- Data consistency: all data shares a uniform storage and access path
- Offline availability: data can be queried without network access
**Persistence checklist**:

- [ ] Create the corresponding table in the `_init_db()` method of `storage.py`
- [ ] The table schema must include `ts_code` and `trade_date` as the primary key
- [ ] Implement a `sync_{data_type}()` function that saves data via `Storage` or `ThreadSafeStorage`
- [ ] Make sure the sync logic correctly handles incremental updates
**Cautionary example**: an early version of `api_pro_bar.py` implemented the `sync_pro_bar()` function but forgot to create the `pro_bar` table in `storage.py`, so the synced data was never persisted, wasting tokens and losing data.
### 5.1 Storage Architecture
The project uses **DuckDB** for persistent storage: