# ProStock HDF5 to DuckDB Migration Plan

**Document version**: v1.1

**Created**: 2026-02-22

**Completed**: 2026-02-22

**Status**: ✅ Completed

**Scope**: data module, factors module, related documentation

## Related Documents

[DuckDB Data Sync Guide](./db_sync_guide.md) - usage of the sync API

[Migration Test Report](./test_report_duckdb_migration.md) - test and validation results

---
## Table of Contents

1. [Executive Summary](#1-executive-summary)
2. [Migration Approach](#2-migration-approach)
3. [Migration Plan](#3-migration-plan)
4. [Impact Analysis](#4-impact-analysis)
5. [Risks and Rollback Strategy](#5-risks-and-rollback-strategy)
6. [Appendix](#6-appendix)

---
## 1. Executive Summary

### 1.1 Migration Goals

Migrate ProStock's data storage from **HDF5 files** to the **DuckDB embedded database** to address the following core problems:

| Problem | Current (HDF5) | Target (DuckDB) | Expected Benefit |
|------|------------|--------------|---------|
| **Full-table loads** | Every query loads 1GB+ of data | Query pushdown, load on demand | **~100x speedup for single-stock queries** |
| **Memory footprint** | Entire table must fit in memory | Filtering at the disk level | **~80% lower memory usage** |
| **Concurrent writes** | File locks, pseudo-concurrency | Transaction support | **More reliable incremental updates** |
| **Data compression** | HDF5 built-in compression | DuckDB columnar compression | **20-50% less storage** |

### 1.2 Effort Estimate

| Phase | Effort | Notes |
|------|--------|------|
| **Core development** | 6-8 hours | Storage rewrite, DataLoader adaptation, Sync changes |
| **Documentation updates** | 2-3 hours | 3 design documents to revise |
| **Data migration** | 30 minutes | Run the H5 → DuckDB data sync |
| **Testing and validation** | 2-4 hours | Unit tests, integration tests, performance benchmarks |
| **Total** | **10-15 hours** | 1-2 working days |

### 1.3 Key Decisions

- ✅ **Full migration**: remove the HDF5 code entirely and move to DuckDB
- ✅ **API compatibility**: keep the `Storage` class interface unchanged so callers need no changes
- ✅ **Polars integration**: add a `load_polars()` method so DataLoader plugs in seamlessly
- ✅ **Concurrency safety**: funnel writes through a single-threaded queue to avoid DuckDB lock contention

---
## 2. Migration Approach

### 2.1 Architecture Comparison

#### Current Architecture (HDF5)

```
┌─────────────────────────────────────────────────────────────┐
│                Factor Engine (execution engine)              │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│                DataLoader (data loading layer)               │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐ │
│  │ Multi-File     │  │ Column         │  │ Lookback       │ │
│  │ Aggregation    │  │ Selector       │  │ Window Control │ │
│  └────────────────┘  └────────────────┘  └────────────────┘ │
└──────────────────────────┬──────────────────────────────────┘
                           │
                    ┌──────▼──────┐
                    │ HDF5 Files  │ ←── one .h5 file per table;
                    └─────────────┘     full table loaded into memory, then filtered
```

#### Target Architecture (DuckDB)

```
┌─────────────────────────────────────────────────────────────┐
│                Factor Engine (execution engine)              │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│                DataLoader (data loading layer)               │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐ │
│  │ SQL Query      │  │ Predicate      │  │ Polars Export  │ │
│  │ Generation     │  │ Pushdown       │  │ (Zero-Copy)    │ │
│  └────────────────┘  └────────────────┘  └────────────────┘ │
└──────────────────────────┬──────────────────────────────────┘
                           │
                    ┌──────▼──────┐
                    │   DuckDB    │ ←── single .duckdb file;
                    │ (Embedded)  │     SQL pushdown reads only the data needed
                    └─────────────┘
```
### 2.2 Database Schema Design

#### 2.2.1 Table Definitions

```sql
-- Daily bar table (replaces daily.h5)
CREATE TABLE daily (
    ts_code VARCHAR(16) NOT NULL,      -- stock code
    trade_date DATE NOT NULL,          -- trading date
    open DOUBLE,
    high DOUBLE,
    low DOUBLE,
    close DOUBLE,
    pre_close DOUBLE,
    change DOUBLE,
    pct_chg DOUBLE,
    vol DOUBLE,
    amount DOUBLE,
    turnover_rate DOUBLE,              -- turnover rate
    volume_ratio DOUBLE,               -- volume ratio
    -- other columns ...

    PRIMARY KEY (ts_code, trade_date)  -- composite primary key, deduplicates automatically
);

-- Composite index covering the common query pattern (filter by date range + stock code)
CREATE INDEX idx_daily_date_code ON daily(trade_date, ts_code);

-- Stock basic info table (replaces stock_basic.h5)
CREATE TABLE stock_basic (
    ts_code VARCHAR(16) PRIMARY KEY,
    symbol VARCHAR(10),
    name VARCHAR(50),
    area VARCHAR(20),
    industry VARCHAR(50),
    market VARCHAR(10),
    list_date DATE
    -- other columns ...
);

-- Trading calendar table (replaces trade_cal.h5)
CREATE TABLE trade_cal (
    exchange VARCHAR(10),
    cal_date DATE,
    is_open BOOLEAN,
    PRIMARY KEY (exchange, cal_date)
);
```

#### 2.2.2 Data Type Mapping

| HDF5/Pandas | DuckDB | Notes |
|------------|--------|------|
| `object` (string) | `VARCHAR` | stock codes, names |
| `int64` | `BIGINT` | volumes (integers) |
| `float64` | `DOUBLE` | prices, returns |
| `object` (date) | `DATE` | trade dates, supports range queries |
| `bool` | `BOOLEAN` | trading-day flag |
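
The mapping above can be spot-checked by registering a small DataFrame with an in-memory DuckDB connection and inspecting the inferred types. A minimal sketch; the sample values are made up for illustration:

```python
import duckdb
import pandas as pd

# Tiny illustrative frame; column names follow the daily schema above.
df = pd.DataFrame({
    "ts_code": ["000001.SZ"],
    "trade_date": pd.to_datetime(["20240102"], format="%Y%m%d"),
    "close": [9.87],
    "vol": [1_000_000.0],
})
df["trade_date"] = df["trade_date"].dt.date  # datetime64 -> python date -> DuckDB DATE

con = duckdb.connect()  # in-memory database, just for the check
con.register("df", df)
print(con.execute("DESCRIBE SELECT * FROM df").fetchdf())
# Expected column types: VARCHAR, DATE, DOUBLE, DOUBLE
```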
### 2.3 Core Code Changes

#### 2.3.1 Storage Class Rewrite (`src/data/storage.py`)

**Current HDF5 implementation** (151 lines) → **DuckDB implementation** (about 200 lines)

```python
"""DuckDB storage for data persistence."""

import duckdb
import pandas as pd
import polars as pl
from pathlib import Path
from typing import Optional, List

from src.data.config import get_config


class Storage:
    """DuckDB storage manager for saving and loading data.

    Migration notes:
    - The API stays fully compatible; callers need no changes.
    - Adds load_polars() for zero-copy export to Polars.
    - A singleton manages the database connection.
    - Concurrent writes go through a queue (see ThreadSafeStorage).
    """

    _instance = None
    _connection = None

    def __new__(cls, *args, **kwargs):
        """Singleton to ensure a single connection."""
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self, path: Optional[Path] = None):
        """Initialize storage."""
        if hasattr(self, '_initialized'):
            return

        cfg = get_config()
        self.base_path = path or cfg.data_path_resolved
        self.base_path.mkdir(parents=True, exist_ok=True)
        self.db_path = self.base_path / "prostock.db"

        self._init_db()
        self._initialized = True

    def _init_db(self):
        """Initialize database connection and schema."""
        self._connection = duckdb.connect(str(self.db_path))

        # Create tables with schema validation
        self._connection.execute("""
            CREATE TABLE IF NOT EXISTS daily (
                ts_code VARCHAR(16) NOT NULL,
                trade_date DATE NOT NULL,
                open DOUBLE,
                high DOUBLE,
                low DOUBLE,
                close DOUBLE,
                pre_close DOUBLE,
                change DOUBLE,
                pct_chg DOUBLE,
                vol DOUBLE,
                amount DOUBLE,
                turnover_rate DOUBLE,
                volume_ratio DOUBLE,
                PRIMARY KEY (ts_code, trade_date)
            )
        """)

        # Create composite index for query optimization (trade_date, ts_code)
        self._connection.execute("""
            CREATE INDEX IF NOT EXISTS idx_daily_date_code ON daily(trade_date, ts_code)
        """)

    def save(self, name: str, data: pd.DataFrame, mode: str = "append") -> dict:
        """Save data to DuckDB.

        Args:
            name: Table name
            data: DataFrame to save
            mode: 'append' (UPSERT) or 'replace' (DELETE + INSERT)

        Returns:
            Dict with save result
        """
        if data.empty:
            return {"status": "skipped", "rows": 0}

        # Ensure date column is proper type
        if 'trade_date' in data.columns:
            data = data.copy()
            data['trade_date'] = pd.to_datetime(data['trade_date'], format='%Y%m%d').dt.date

        # Register DataFrame as temporary view
        self._connection.register("temp_data", data)

        try:
            if mode == "replace":
                self._connection.execute(f"DELETE FROM {name}")

            # UPSERT: INSERT OR REPLACE
            columns = ", ".join(data.columns)
            self._connection.execute(f"""
                INSERT OR REPLACE INTO {name} ({columns})
                SELECT {columns} FROM temp_data
            """)

            row_count = len(data)
            print(f"[Storage] Saved {row_count} rows to DuckDB ({name})")
            return {"status": "success", "rows": row_count}

        except Exception as e:
            print(f"[Storage] Error saving {name}: {e}")
            return {"status": "error", "error": str(e)}
        finally:
            self._connection.unregister("temp_data")

    def load(
        self,
        name: str,
        start_date: Optional[str] = None,
        end_date: Optional[str] = None,
        ts_code: Optional[str] = None,
    ) -> pd.DataFrame:
        """Load data from DuckDB with query pushdown.

        Key optimizations:
        - WHERE conditions are evaluated inside the database; no full-table load.
        - Only matching rows are returned, which cuts memory usage sharply.

        Args:
            name: Table name
            start_date: Start date filter (YYYYMMDD)
            end_date: End date filter (YYYYMMDD)
            ts_code: Stock code filter

        Returns:
            Filtered DataFrame
        """
        # Build WHERE clause with parameterized queries
        conditions = []
        params = []

        if start_date and end_date:
            conditions.append("trade_date BETWEEN ? AND ?")
            # Convert to DATE type
            start = pd.to_datetime(start_date, format='%Y%m%d').date()
            end = pd.to_datetime(end_date, format='%Y%m%d').date()
            params.extend([start, end])
        elif start_date:
            conditions.append("trade_date >= ?")
            params.append(pd.to_datetime(start_date, format='%Y%m%d').date())
        elif end_date:
            conditions.append("trade_date <= ?")
            params.append(pd.to_datetime(end_date, format='%Y%m%d').date())

        if ts_code:
            conditions.append("ts_code = ?")
            params.append(ts_code)

        where_clause = f"WHERE {' AND '.join(conditions)}" if conditions else ""
        query = f"SELECT * FROM {name} {where_clause} ORDER BY trade_date"

        try:
            # Execute query with parameters (SQL injection safe)
            result = self._connection.execute(query, params).fetchdf()

            # Convert trade_date back to string format for compatibility
            if 'trade_date' in result.columns:
                result['trade_date'] = result['trade_date'].dt.strftime('%Y%m%d')

            return result
        except Exception as e:
            print(f"[Storage] Error loading {name}: {e}")
            return pd.DataFrame()

    def load_polars(
        self,
        name: str,
        start_date: Optional[str] = None,
        end_date: Optional[str] = None,
        ts_code: Optional[str] = None,
    ) -> pl.DataFrame:
        """Load data as a Polars DataFrame (for DataLoader).

        Performance advantages:
        - Zero-copy export (DuckDB → Polars)
        - No intermediate Pandas conversion
        """
        # Build query
        conditions = []
        if start_date and end_date:
            start = pd.to_datetime(start_date, format='%Y%m%d').date()
            end = pd.to_datetime(end_date, format='%Y%m%d').date()
            conditions.append(f"trade_date BETWEEN '{start}' AND '{end}'")
        if ts_code:
            conditions.append(f"ts_code = '{ts_code}'")

        where_clause = f"WHERE {' AND '.join(conditions)}" if conditions else ""
        query = f"SELECT * FROM {name} {where_clause} ORDER BY trade_date"

        # Return Polars DataFrame directly
        return self._connection.sql(query).pl()

    def exists(self, name: str) -> bool:
        """Check if a table exists."""
        result = self._connection.execute("""
            SELECT COUNT(*) FROM information_schema.tables
            WHERE table_name = ?
        """, [name]).fetchone()
        return result[0] > 0

    def delete(self, name: str) -> bool:
        """Delete a table."""
        try:
            self._connection.execute(f"DROP TABLE IF EXISTS {name}")
            print(f"[Storage] Deleted table {name}")
            return True
        except Exception as e:
            print(f"[Storage] Error deleting {name}: {e}")
            return False

    def get_last_date(self, name: str) -> Optional[str]:
        """Get the latest trade date in storage."""
        try:
            result = self._connection.execute(f"""
                SELECT MAX(trade_date) FROM {name}
            """).fetchone()
            if result[0]:
                # Convert date back to string format
                return result[0].strftime('%Y%m%d') if hasattr(result[0], 'strftime') else str(result[0])
            return None
        except Exception:
            return None

    def close(self):
        """Close the database connection and reset the singleton."""
        if self._connection:
            self._connection.close()
        Storage._connection = None
        Storage._instance = None


class ThreadSafeStorage:
    """Thread-safe write wrapper around the DuckDB Storage.

    DuckDB does not support concurrent writers, so write requests are
    collected in a queue and flushed in one batch at the end of a sync run.
    """

    def __init__(self):
        self.storage = Storage()
        self._pending_writes: List[tuple] = []  # [(name, data), ...]

    def queue_save(self, name: str, data: pd.DataFrame):
        """Put data on the write queue (no immediate write)."""
        if not data.empty:
            self._pending_writes.append((name, data))

    def flush(self):
        """Write all queued data in one batch.

        Call this once at the end of a sync run to avoid concurrent-write conflicts.
        """
        if not self._pending_writes:
            return

        # Merge data destined for the same table
        from collections import defaultdict
        table_data = defaultdict(list)

        for name, data in self._pending_writes:
            table_data[name].append(data)

        # Batch-write each table
        for name, data_list in table_data.items():
            combined = pd.concat(data_list, ignore_index=True)
            # Deduplicate within the batch first
            if 'ts_code' in combined.columns and 'trade_date' in combined.columns:
                combined = combined.drop_duplicates(
                    subset=["ts_code", "trade_date"],
                    keep="last"
                )
            self.storage.save(name, combined, mode="append")

        self._pending_writes.clear()

    def __getattr__(self, name):
        """Delegate all other methods to the underlying Storage instance."""
        return getattr(self.storage, name)
```
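
For reference, a minimal sketch of how call sites use the rewritten class, unchanged from the HDF5 backend apart from the new `load_polars()`; `fetched_df` stands in for a batch returned by the Tushare fetcher, and the values and dates are illustrative:

```python
import pandas as pd

from src.data.storage import Storage

storage = Storage()

# A stand-in for a batch returned by the fetcher (values are made up).
fetched_df = pd.DataFrame({
    "ts_code": ["000001.SZ", "000001.SZ"],
    "trade_date": ["20240102", "20240103"],
    "close": [9.87, 9.91],
})

# Append (UPSERT): duplicate (ts_code, trade_date) rows overwrite existing ones
# because of the composite primary key.
print(storage.save("daily", fetched_df, mode="append"))

# Pushdown query: only January 2024 rows for one stock are read from disk.
df = storage.load("daily", start_date="20240101", end_date="20240131",
                  ts_code="000001.SZ")

# Same filter, returned as a Polars DataFrame for the factor pipeline.
pl_df = storage.load_polars("daily", start_date="20240101", end_date="20240131")
print(len(df), pl_df.height)
```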
#### 2.3.2 DataLoader Adaptation (`src/factors/data_loader.py`)

**Change**: rewrite the `_read_h5` method to query DuckDB

```python
def _read_h5(self, source: str) -> pl.DataFrame:
    """Read data - load from DuckDB as a Polars DataFrame.

    Migration notes:
    - The method keeps the name _read_h5 for compatibility with existing code
      (it actually reads from DuckDB now).
    - Uses Storage.load_polars() to return a Polars DataFrame directly.
    - Zero-copy export; faster than the HDF5 + Pandas + Polars conversion chain.
    """
    from src.data.storage import Storage

    storage = Storage()

    # If the DataLoader carries a date_range, pass it to Storage so the filter
    # is pushed down and only the necessary rows are loaded.
    return storage.load_polars(source)
```
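
The comment above hints at pushing the loader's date range down to Storage. A minimal sketch of what that could look like, assuming the loader keeps its range in a `_date_range` attribute holding `(start, end)` strings in `YYYYMMDD` format; that attribute name is hypothetical, not the actual DataLoader field:

```python
def _read_h5(self, source: str) -> pl.DataFrame:
    """Variant that pushes the loader's date range down to DuckDB."""
    from src.data.storage import Storage

    storage = Storage()

    # `_date_range` is a hypothetical (start, end) tuple of "YYYYMMDD" strings;
    # substitute whatever attribute the real DataLoader uses.
    date_range = getattr(self, "_date_range", None)
    if date_range:
        start, end = date_range
        return storage.load_polars(source, start_date=start, end_date=end)
    return storage.load_polars(source)
```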
#### 2.3.3 Sync Module Changes (`src/data/sync.py`)

**Change**: use ThreadSafeStorage instead of Storage

```python
# Before
from src.data.storage import Storage

class DataSync:
    def __init__(self, max_workers: Optional[int] = None):
        self.storage = Storage()  # writes directly
        ...

    def sync_daily(self, ...):
        # worker threads call save directly
        self.storage.save("daily", data, mode="append")


# After
from src.data.storage import ThreadSafeStorage

class DataSync:
    def __init__(self, max_workers: Optional[int] = None):
        self.storage = ThreadSafeStorage()  # queued writes
        ...

    def sync_daily(self, ...):
        # worker threads only enqueue; nothing is written yet
        self.storage.queue_save("daily", data)

    def sync_all(self, ...):
        try:
            # ... fetch data with multiple threads ...
            pass
        finally:
            # single batched write at the end
            self.storage.flush()
```
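
The queued-write behaviour (the concurrent write test planned in Phase 2) can be exercised with a small test along these lines. A sketch assuming pytest, with fabricated stock codes and rows; a real test would also point Storage at a temporary directory first:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

from src.data.storage import ThreadSafeStorage


def test_queued_writes_flush_once():
    storage = ThreadSafeStorage()

    def fetch_and_queue(code: str):
        # Stand-in for a Tushare fetch; one fabricated row per stock.
        df = pd.DataFrame({
            "ts_code": [code],
            "trade_date": ["20240102"],
            "close": [10.0],
        })
        storage.queue_save("daily", df)

    # Many threads enqueue concurrently; no DuckDB write happens here.
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(fetch_and_queue, [f"{i:06d}.SZ" for i in range(100)]))

    # The single flush performs the only write, so no lock conflicts can occur.
    storage.flush()
    loaded = storage.load("daily", start_date="20240102", end_date="20240102")
    assert loaded.shape[0] >= 100
```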
### 2.4 Data Sync Approach

**No migration script is needed; data is populated through the existing sync module**

Because the DuckDB storage layer is fully API-compatible, there is no dedicated data-migration script. The strategy is:

1. **New environment / first deployment**: run `sync_all()` to fetch the full dataset from Tushare
2. **Existing HDF5 data**: keep the HDF5 files as a backup; DuckDB syncs incrementally starting from the most recent stored date

**Sync commands**:

```bash
# Full sync (first deployment, or whenever a complete dataset is needed)
uv run python -c "from src.data.sync import sync_all; sync_all(force_full=True)"

# Incremental sync (daily use)
uv run python -c "from src.data.sync import sync_all; sync_all()"

# With a specific number of worker threads
uv run python -c "from src.data.sync import sync_all; sync_all(max_workers=20)"
```

**Advantages**:

- ✅ No separate migration script to maintain
- ✅ Data comes straight from the source, so it is always up to date
- ✅ Reuses the existing sync logic
- ✅ Supports incremental updates, saving time (the start-date logic is sketched below)
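
A minimal sketch of how an incremental run can derive its start date from what is already stored, using `Storage.get_last_date()`; the helper name and fallback date are illustrative, not the actual sync internals:

```python
from datetime import datetime, timedelta

from src.data.storage import Storage


def incremental_start_date(table: str = "daily", default: str = "20100101") -> str:
    """Return the first date an incremental sync should fetch.

    Uses Storage.get_last_date(); if the table is empty, fall back to a
    full-history start date (the default here is illustrative).
    """
    last = Storage().get_last_date(table)
    if last is None:
        return default
    next_day = datetime.strptime(last, "%Y%m%d") + timedelta(days=1)
    return next_day.strftime("%Y%m%d")
```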
---
## 3. Migration Plan

### 3.1 Implementation Phases

#### Phase 1: Preparation and Development (Day 1)

**Task list**:

| # | Task | File | Estimate | Owner |
|------|------|------|---------|--------|
| 1.1 | Add the DuckDB dependency | `pyproject.toml` | 10 min | Dev |
| 1.2 | Rewrite the Storage class | `src/data/storage.py` | 2 h | Dev |
| 1.3 | Add ThreadSafeStorage | `src/data/storage.py` | 30 min | Dev |
| 1.4 | Adapt DataLoader | `src/factors/data_loader.py` | 30 min | Dev |
| 1.5 | Rework Sync concurrency | `src/data/sync.py` | 1 h | Dev |

**Deliverables**:
- ✅ A working DuckDB Storage implementation
- ✅ Unit tests passing

#### Phase 2: Testing and Validation (Day 1-2)

**Task list**:

| # | Task | Notes | Estimate |
|------|------|------|---------|
| 2.1 | Run existing unit tests | `uv run pytest tests/test_sync.py` | 15 min |
| 2.2 | Run DataLoader tests | `uv run pytest tests/factors/test_data_spec.py` | 15 min |
| 2.3 | Data sync test | `uv run python -c "from src.data.sync import sync_all; sync_all()"` | 10 min |
| 2.4 | Performance benchmark | Compare HDF5 vs DuckDB query performance | 1 h |
| 2.5 | Concurrent write test | Verify ThreadSafeStorage correctness | 30 min |

**Acceptance criteria**:
- [ ] All existing tests pass
- [ ] Single-stock query < 1 s (HDF5 takes 5-10 s)
- [ ] Date-range query < 0.5 s
- [ ] Data integrity check passes (row counts match)

#### Phase 3: Documentation Updates (Day 2)

**Documents to revise**:

| Document | Changes | Estimate |
|------|---------|---------|
| `docs/factor_framework_design.md` | Architecture diagram HDF5 → DuckDB, DataSpec notes | 30 min |
| `docs/factor_implementation_plan.md` | DataLoader description, Phase 3 implementation details | 30 min |
| `docs/data_sync.md` | Storage format, sync logic description | 30 min |
| `README.md` | Data storage notes | 15 min |

Details of the documentation changes are in [Section 4: Impact Analysis](#4-impact-analysis).

#### Phase 4: Deployment and Cleanup (Day 2)

**Task list** (the integrity check for step 4.3 is sketched after the table):

| # | Task | Notes |
|------|------|------|
| 4.1 | Back up the HDF5 files | `cp data/*.h5 data/backup/` |
| 4.2 | Run a full sync | `uv run python -c "from src.data.sync import sync_all; sync_all(force_full=True)"` |
| 4.3 | Verify data integrity | Spot checks (query DuckDB and compare key data points) |
| 4.4 | Delete the HDF5 files | `rm data/*.h5` (only after verification passes) |
| 4.5 | Commit the code | `git add . && git commit -m "migrate: HDF5 to DuckDB"` |
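
A minimal sketch of the spot check in step 4.3, comparing the HDF5 backup with the DuckDB table; it assumes the backup file is `data/backup/daily.h5` with key `daily` (adjust paths and keys to the real layout):

```python
import duckdb
import pandas as pd

# Row count from the old HDF5 backup.
h5 = pd.read_hdf("data/backup/daily.h5", key="daily")
h5_rows = len(h5)

con = duckdb.connect("data/prostock.db", read_only=True)
db_rows = con.execute("SELECT COUNT(*) FROM daily").fetchone()[0]

print(f"HDF5 rows: {h5_rows}, DuckDB rows: {db_rows}")
assert db_rows >= h5_rows, "DuckDB is missing rows that exist in the HDF5 backup"

# Spot-check one (ts_code, trade_date) cell in both stores.
sample = h5.iloc[0]
db_close = con.execute(
    "SELECT close FROM daily WHERE ts_code = ? AND trade_date = ?",
    [sample["ts_code"], pd.to_datetime(sample["trade_date"]).date()],
).fetchone()[0]
assert abs(db_close - float(sample["close"])) < 1e-9
con.close()
```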
### 3.2 Rollback Plan

If problems surface after the migration, roll back as follows:

```bash
# 1. Restore the HDF5 files
cp data/backup/*.h5 data/

# 2. Restore the Storage code (from git history)
git checkout HEAD~1 -- src/data/storage.py

# 3. Reinstall dependencies (if needed)
# pip uninstall duckdb

# 4. Verify
uv run pytest tests/test_sync.py
```

---
## 4. Impact Analysis

### 4.1 Code Changes

#### Core files (must change)

| File | Change type | Description | Impact |
|---------|---------|---------|---------|
| `src/data/storage.py` | Rewrite | HDF5 → DuckDB implementation | 🔴 High |
| `src/data/sync.py` | Modify | Use ThreadSafeStorage | 🟡 Medium |
| `src/factors/data_loader.py` | Modify | Adapt `_read_h5()` | 🟡 Medium |
| `pyproject.toml` | Modify | Add the `duckdb` dependency | 🟢 Low |

#### New files

| File | Description |
|---------|------|
| `docs/hdf5_to_duckdb_migration.md` | This document |

#### Test files (to be verified)

| File | What to verify |
|---------|---------|
| `tests/test_sync.py` | Sync flow works |
| `tests/test_daily_storage.py` | Storage interface stays compatible |
| `tests/factors/test_data_spec.py` | DataLoader works |

### 4.2 Design Document Changes

#### 4.2.1 `docs/factor_framework_design.md`

**Location**: Section 2, architecture overview

**Current**:
```markdown
┌──────▼──────┐
│ HDF5 Files  │
└─────────────┘
```

**Change to**:
```markdown
┌──────▼──────┐
│   DuckDB    │
│ (Embedded)  │
└─────────────┘
```

**Location**: Section 3.1, DataSpec

**Current**:
```python
source: str  # H5 file name (without extension)
```

**Change to**:
```python
source: str  # table name (a DuckDB table such as "daily" or "stock_basic")
```

#### 4.2.2 `docs/factor_implementation_plan.md`

**Location**: Phase 3, data loading

**Current**:
```markdown
### 3.1 DataLoader - Data Loader

"""Data loader - responsible for safely loading data from HDF5"""

Implementation: pandas.read_hdf(), then pl.from_pandas()
```

**Change to**:
```markdown
### 3.1 DataLoader - Data Loader

"""Data loader - responsible for safely loading data from DuckDB"""

Implementation: Storage.load_polars() returns a Polars DataFrame directly,
with SQL query pushdown so only the necessary data is loaded.
```

**Location**: Phase 3, test requirements

**Add**:
```markdown
**DuckDB integration test requirements:**
- [ ] Verify DuckDB query pushdown correctness
- [ ] Verify zero-copy export to Polars
- [ ] Verify the queued concurrent-write mechanism
```

#### 4.2.3 New/updated data documentation

**`docs/data_sync.md`** (new or updated)

Content to add or revise:
- Storage format: HDF5 → DuckDB
- Database file location: `data/prostock.db`
- Query optimization: SQL conditions instead of in-memory filtering
### 4.3 API Compatibility

#### Unchanged interfaces ✅

The following interfaces stay fully compatible; callers need no changes:

```python
# Core Storage methods
storage.save(name, data, mode="append")
storage.load(name, start_date, end_date, ts_code)
storage.exists(name)
storage.delete(name)
storage.get_last_date(name)

# DataLoader
loader.load(specs, date_range)
loader._read_h5(source)  # internal method, behaviour unchanged
```

#### New interfaces 🆕

```python
# New Storage method
storage.load_polars(name, start_date, end_date, ts_code)  # returns Polars directly

# ThreadSafeStorage (used internally by Sync)
thread_safe_storage.queue_save(name, data)
thread_safe_storage.flush()
```

#### Removed interfaces ❌

```python
# No HDF5-specific methods are removed
# (all HDF5-specific logic lived inside Storage)
```

### 4.4 Dependency Changes

#### `pyproject.toml`

```toml
[project]
dependencies = [
    # ... existing dependencies ...
    "duckdb>=0.10.0",  # new
]

[project.optional-dependencies]
dev = [
    # ... existing dev dependencies ...
    "pytest-duckdb",  # optional: DuckDB test helpers
]
```

#### Install commands

```bash
# Install DuckDB
uv pip install duckdb

# Or install everything from the project metadata
uv pip install -e ".[dev]"
```

---
## 5. Risks and Rollback Strategy

### 5.1 Risks

| Risk | Probability | Impact | Mitigation |
|------|------|------|---------|
| **Concurrent write conflicts** | Medium | High | Queue writes through ThreadSafeStorage |
| **Data type mismatches** | Low | Medium | Strict schema definition and conversion logic |
| **Performance below expectations** | Low | High | Benchmark first; keep a rollback path |
| **Dependency compatibility issues** | Low | Medium | Test in an isolated virtual environment |
| **Data loss** | Low | Critical | Full HDF5 backup before migrating |

### 5.2 Rollback Triggers

Roll back if any of the following occurs:

1. **Data integrity check fails**
   - Row counts do not match
   - Spot-checked values differ

2. **Performance degrades by more than 20%**
   - Full-table scans slower than HDF5
   - Memory usage goes up instead of down

3. **Core tests fail**
   - `test_sync.py` fails
   - `test_data_loader.py` fails

4. **Production issues**
   - Data sync fails
   - Queries time out

### 5.3 Rollback Steps

```bash
#!/bin/bash
# rollback.sh - rollback script

echo "[Rollback] Starting rollback to HDF5..."

# 1. Stop any running sync processes
pkill -f "python.*sync"

# 2. Restore the HDF5 files
echo "[Rollback] Restoring HDF5 files..."
cp data/backup/*.h5 data/ 2>/dev/null || echo "No backup found, keeping existing"

# 3. Restore code from git
echo "[Rollback] Restoring code from git..."
git checkout HEAD~1 -- src/data/storage.py
git checkout HEAD~1 -- src/data/sync.py
git checkout HEAD~1 -- src/factors/data_loader.py
git checkout HEAD~1 -- pyproject.toml

# 4. Reinstall dependencies (if needed)
echo "[Rollback] Reinstalling dependencies..."
uv pip install -e .

# 5. Verify
echo "[Rollback] Running tests..."
uv run pytest tests/test_sync.py -v

echo "[Rollback] Rollback completed!"
```

### 5.4 Backup Strategy

**Before the migration**:

```bash
# Create a timestamped backup directory (captured once so both copies land in the same place)
backup_dir="data/backup_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$backup_dir"

# Back up all HDF5 files
cp data/*.h5 "$backup_dir"/

# Back up the DuckDB file as well (after the migration)
cp data/prostock.db "$backup_dir"/ 2>/dev/null || true
```

**Recurring backups** (after the migration):

```bash
# Daily DuckDB file backup (crontab entry)
0 2 * * * cp /path/to/prostock.db /path/to/backup/prostock_$(date +\%Y\%m\%d).db
```
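
Copying the `.db` file is only safe while nothing is writing. As an alternative, DuckDB can export a consistent snapshot from a live connection; a sketch, with an illustrative target directory:

```python
import duckdb

# Export a consistent snapshot of every table to Parquet files.
con = duckdb.connect("data/prostock.db")
con.execute("EXPORT DATABASE 'data/backup/export_20260222' (FORMAT PARQUET)")
con.close()
```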
---
## 6. Appendix

### Appendix A: Performance Benchmark Plan

**Benchmark script**: `scripts/benchmark_storage.py`

```python
"""Storage performance benchmark: HDF5 vs DuckDB"""

import time

from src.data.storage import Storage


def benchmark_load(storage, name, iterations=5):
    """Measure load performance."""
    times = []

    for _ in range(iterations):
        start = time.time()
        # Single-stock query
        df = storage.load(name, ts_code="000001.SZ")
        elapsed = time.time() - start
        times.append(elapsed)

    return {
        "mean": sum(times) / len(times),
        "min": min(times),
        "max": max(times),
    }


def main():
    storage = Storage()

    print("=== Storage Performance Benchmark ===\n")

    # Single-stock query
    print("Single stock query (000001.SZ):")
    result = benchmark_load(storage, "daily")
    print(f"  Mean: {result['mean']:.3f}s")
    print(f"  Min:  {result['min']:.3f}s")
    print(f"  Max:  {result['max']:.3f}s")

    # Date-range query
    print("\nDate range query (20240101-20240131):")
    start = time.time()
    df = storage.load("daily", start_date="20240101", end_date="20240131")
    elapsed = time.time() - start
    print(f"  Time: {elapsed:.3f}s")
    print(f"  Rows: {len(df)}")


if __name__ == "__main__":
    main()
```

**Expected results**:

| Test | HDF5 | DuckDB | Improvement |
|--------|------|--------|------|
| Single-stock query | 5-10s | 0.1-0.5s | **10-100x** |
| Date-range query | 5-10s | 0.2-1s | **5-50x** |
| Full-table scan | 5-10s | 3-5s | 1.5-2x |
| Memory usage | 1GB+ | 100-500MB | **50-90%** |
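
The memory row above can be measured with a small helper that samples the process RSS around a query; a sketch assuming `psutil` is installed (it is not currently a project dependency):

```python
import os

import psutil

from src.data.storage import Storage


def rss_mb() -> float:
    """Resident set size of the current process, in MiB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024


def measure_query_memory():
    storage = Storage()
    before = rss_mb()
    df = storage.load("daily", start_date="20240101", end_date="20240131")
    after = rss_mb()
    print(f"Rows: {len(df)}, RSS before: {before:.1f} MiB, after: {after:.1f} MiB")


if __name__ == "__main__":
    measure_query_memory()
```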
### Appendix B: DuckDB Operations Guide

#### Database file locations

```
data/
├── prostock.db       # main DuckDB database file
├── prostock.db.wal   # WAL log file (present while writing)
└── backup/           # backup directory
```

#### Common maintenance commands

```python
import duckdb

# Open the database
conn = duckdb.connect("data/prostock.db")

# List all tables with their estimated sizes
tables = conn.execute("""
    SELECT table_name,
           estimated_size
    FROM duckdb_tables()
    WHERE schema_name = 'main'
""").fetchall()

# Inspect a table's schema
schema = conn.execute("DESCRIBE daily").fetchall()

# Refresh table statistics (helps the query planner)
conn.execute("ANALYZE daily")

# Compact the database (VACUUM)
conn.execute("VACUUM")

conn.close()
```

#### Performance tuning tips

1. **Create appropriate indexes**:
   ```sql
   CREATE INDEX idx_daily_date_code ON daily(trade_date, ts_code);
   ```

2. **Partition very large tables**:
   ```sql
   -- Partition by year (if row counts reach the hundreds of millions)
   CREATE TABLE daily_partitioned AS
   SELECT *, YEAR(trade_date) AS year
   FROM daily;
   ```

3. **Batch inserts**:
   ```python
   # Wrap many inserts in one transaction
   conn.execute("BEGIN TRANSACTION")
   # ... multiple insert statements ...
   conn.execute("COMMIT")
   ```
### Appendix C: FAQ

**Q: Does DuckDB support concurrent writes from multiple threads?**

A: DuckDB supports concurrent reads, but writes take a lock. The `ThreadSafeStorage` queue turns concurrent writes into a single batched write, avoiding lock conflicts.

**Q: Can the HDF5 files be deleted after the migration?**

A: Yes, once verification passes. Keep the backup for at least a week.

**Q: What if the DuckDB file is corrupted?**

A: DuckDB has a write-ahead log (WAL), so corruption is unlikely in normal operation. If it happens:
1. Restore the `.db` file from a backup
2. Delete the `.db.wal` file (if present)
3. Reconnect

**Q: How do I view a DuckDB query plan?**

A: Use `EXPLAIN`:
```python
conn.execute("EXPLAIN SELECT * FROM daily WHERE ts_code = '000001.SZ'").fetchall()
```

**Q: Can data be exported from DuckDB back to HDF5?**

A: Yes, via Pandas:
```python
df = conn.execute("SELECT * FROM daily").fetchdf()
df.to_hdf("backup.h5", key="daily")
```

---

## Document History

| Version | Date | Author | Changes |
|------|------|------|---------|
| v1.0 | 2026-02-22 | Sisyphus | Initial version, full migration plan |

---

## Approvals

| Role | Name | Date | Decision |
|------|------|------|------|
| Tech lead | ______ | ______ | ______ |
| Project lead | ______ | ______ | ______ |

---

**Next steps**:
1. [ ] Tech lead approves the plan
2. [ ] Set an implementation date
3. [ ] Assign development resources
4. [ ] Execute Phase 1 development