feat: migrate HDF5 storage to DuckDB
- Add DuckDB Storage and ThreadSafeStorage implementations
- Add db_manager module with an incremental sync strategy
- Adapt DataLoader and the Sync module to DuckDB
- Add migration docs and tests
- Fix README links
```diff
@@ -120,9 +120,8 @@ CREATE TABLE daily (
     PRIMARY KEY (ts_code, trade_date) -- composite primary key, deduplicates automatically
 );
 
--- Create indexes (DuckDB automatically creates an index for the primary key)
-CREATE INDEX idx_daily_date ON daily(trade_date);
-CREATE INDEX idx_daily_code ON daily(ts_code);
+-- Create a composite index (covers the common query pattern: filter by date range + stock code)
+CREATE INDEX idx_daily_date_code ON daily(trade_date, ts_code);
 
 -- Stock basic-info table (replaces stock_basic.h5)
 CREATE TABLE stock_basic (
```
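The composite primary key is what makes re-syncs idempotent. A minimal sketch of that dedup behaviour, using `sqlite3` as a stand-in engine (the `INSERT OR REPLACE` form is also accepted by DuckDB); the table and column names follow the schema above, and the values are made up for illustration:

```python
import sqlite3

# sqlite3 stands in for DuckDB here; the schema mirrors the daily table above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE daily (
        ts_code TEXT,
        trade_date TEXT,
        close REAL,
        PRIMARY KEY (ts_code, trade_date)
    )
""")

# Re-syncing the same (ts_code, trade_date) key overwrites instead of duplicating.
for close in (10.0, 10.5):
    conn.execute(
        "INSERT OR REPLACE INTO daily VALUES (?, ?, ?)",
        ("000001.SZ", "2024-01-10", close),
    )

rows = conn.execute("SELECT COUNT(*), MAX(close) FROM daily").fetchone()
print(rows)  # the duplicate key collapsed to a single row with the latest value
```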
```diff
@@ -229,12 +228,9 @@ class Storage:
             )
         """)
 
-        # Create indexes for query optimization
+        # Create composite index for query optimization (trade_date, ts_code)
         self._connection.execute("""
-            CREATE INDEX IF NOT EXISTS idx_daily_date ON daily(trade_date)
-        """)
-        self._connection.execute("""
-            CREATE INDEX IF NOT EXISTS idx_daily_code ON daily(ts_code)
+            CREATE INDEX IF NOT EXISTS idx_daily_date_code ON daily(trade_date, ts_code)
         """)
 
     def save(self, name: str, data: pd.DataFrame, mode: str = "append") -> dict:
```
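The ThreadSafeStorage mentioned in the commit message is not shown in this diff. A minimal sketch of the idea, under the assumption that it simply serializes writes to one `Storage` through a lock (a single DuckDB connection must not be shared across threads without coordination); the actual implementation in `src/data/storage.py` may differ:

```python
import threading

class ThreadSafeStorage:
    """Sketch: wrap a Storage-like object and serialize every save() call
    through one lock. `storage` is any object with a save method."""

    def __init__(self, storage):
        self._storage = storage
        self._lock = threading.Lock()

    def save(self, name, data, mode="append"):
        # Only one thread at a time may touch the underlying connection.
        with self._lock:
            return self._storage.save(name, data, mode)

# Demo with a dummy in-memory storage standing in for the DuckDB-backed one.
class _DummyStorage:
    def __init__(self):
        self.rows = []
    def save(self, name, data, mode="append"):
        self.rows.extend(data)
        return {"table": name, "written": len(data)}

dummy = _DummyStorage()
store = ThreadSafeStorage(dummy)
threads = [
    threading.Thread(target=lambda: [store.save("daily", [i]) for i in range(100)])
    for _ in range(8)
]
for t in threads: t.start()
for t in threads: t.join()
print(len(dummy.rows))  # 800: all 8 threads' writes arrive, none lost
```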
````diff
@@ -515,111 +511,34 @@ class DataSync:
         self.storage.flush()
 ```
 
-### 2.4 Data migration script
+### 2.4 Data sync plan
 
-**Create `scripts/migrate_h5_to_duckdb.py`**
+**No migration script needed; sync data directly through the sync module**
 
-```python
-"""Data migration script: migrate the HDF5 files into DuckDB.
+Because the DuckDB storage layer is fully compatible with the existing API, no dedicated migration script is needed. The strategy is:
 
-Usage:
-    uv run python scripts/migrate_h5_to_duckdb.py
+1. **New environment / first deployment**: run `sync_all()` to fetch all data from Tushare
+2. **Migrating existing HDF5 data**: keep the HDF5 files as a backup; DuckDB syncs incrementally from the latest date onward
 
-Steps:
-1. Read all .h5 files
-2. Convert data types (date format)
-3. Write into DuckDB
-4. Verify data integrity
-"""
+**Sync commands**:
 
-import pandas as pd
-import duckdb
-from pathlib import Path
-from tqdm import tqdm
+```bash
+# Full sync (first deployment, or whenever the complete dataset is needed)
+uv run python -c "from src.data.sync import sync_all; sync_all(force_full=True)"
 
-
-def migrate_table(h5_path: Path, db_path: Path, table_name: str):
-    """Migrate a single H5 table into DuckDB."""
-    print(f"[Migrate] Migrating {table_name} from {h5_path}")
-
-    # Read HDF5
-    df = pd.read_hdf(h5_path, key=f"/{table_name}")
-
-    # Convert date columns
-    if 'trade_date' in df.columns:
-        df['trade_date'] = pd.to_datetime(df['trade_date'], format='%Y%m%d')
-    if 'list_date' in df.columns:
-        df['list_date'] = pd.to_datetime(df['list_date'], format='%Y%m%d')
-
-    # Connect to DuckDB
-    conn = duckdb.connect(str(db_path))
-
-    # Register and insert
-    conn.register("migration_data", df)
-
-    # Create table and insert
-    columns = ", ".join([f"{col} {infer_dtype(df[col])}" for col in df.columns])
-    conn.execute(f"CREATE TABLE IF NOT EXISTS {table_name} ({columns})")
-
-    col_names = ", ".join(df.columns)
-    conn.execute(f"INSERT INTO {table_name} ({col_names}) SELECT {col_names} FROM migration_data")
-
-    conn.close()
-
-    print(f"[Migrate] Migrated {len(df)} rows to {table_name}")
+# Incremental sync (day-to-day use)
+uv run python -c "from src.data.sync import sync_all; sync_all()"
 
-
-def infer_dtype(series: pd.Series) -> str:
-    """Infer the DuckDB data type for a pandas Series."""
-    if pd.api.types.is_datetime64_any_dtype(series):
-        return "DATE"
-    elif pd.api.types.is_integer_dtype(series):
-        return "BIGINT"
-    elif pd.api.types.is_float_dtype(series):
-        return "DOUBLE"
-    else:
-        return "VARCHAR"
-
-
-def verify_migration(db_path: Path):
-    """Verify data integrity after the migration."""
-    conn = duckdb.connect(str(db_path))
-
-    # Check tables
-    tables = conn.execute("""
-        SELECT table_name FROM information_schema.tables
-        WHERE table_schema = 'main'
-    """).fetchall()
-
-    print("\n[Verify] Tables in DuckDB:")
-    for (table_name,) in tables:
-        count = conn.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]
-        print(f"  - {table_name}: {count} rows")
-
-    conn.close()
-
-
-if __name__ == "__main__":
-    data_dir = Path("data")
-    db_path = data_dir / "prostock.db"
-
-    # Find all H5 files
-    h5_files = list(data_dir.glob("*.h5"))
-
-    if not h5_files:
-        print("[Migrate] No HDF5 files found in data/ directory")
-        exit(0)
-
-    print(f"[Migrate] Found {len(h5_files)} HDF5 files to migrate\n")
-
-    for h5_file in tqdm(h5_files, desc="Migrating"):
-        table_name = h5_file.stem
-        migrate_table(h5_file, db_path, table_name)
-
-    # Verify
-    verify_migration(db_path)
-
-    print("\n[Done] Migration completed successfully!")
-    print(f"[Done] DuckDB file: {db_path}")
-    print("[Done] You can now delete HDF5 files if verification passed")
+# Set the number of worker threads
+uv run python -c "from src.data.sync import sync_all; sync_all(max_workers=20)"
 ```
 
+**Advantages**:
+- ✅ No standalone migration script to maintain
+- ✅ Data is synced straight from the source, so it is always current
+- ✅ Reuses the existing sync logic
+- ✅ Incremental updates save time
 
 ---
 
 ## 3. Migration plan
````
```diff
@@ -637,11 +556,9 @@ if __name__ == "__main__":
 | 1.3 | Create ThreadSafeStorage | `src/data/storage.py` | 30 min | Dev |
 | 1.4 | Adapt DataLoader | `src/factors/data_loader.py` | 30 min | Dev |
 | 1.5 | Rework Sync concurrency logic | `src/data/sync.py` | 1 h | Dev |
-| 1.6 | Create migration script | `scripts/migrate_h5_to_duckdb.py` | 30 min | Dev |
 
 **Deliverables**:
 - ✅ Working DuckDB Storage implementation
-- ✅ Migration script
 - ✅ Unit tests passing
 
 #### Phase 2: Testing and validation (Day 1-2)
```
```diff
@@ -652,7 +569,7 @@ if __name__ == "__main__":
 |------|------|------|---------|
 | 2.1 | Run existing unit tests | `uv run pytest tests/test_sync.py` | 15 min |
 | 2.2 | Run DataLoader tests | `uv run pytest tests/factors/test_data_spec.py` | 15 min |
-| 2.3 | Data migration test | `uv run python scripts/migrate_h5_to_duckdb.py` | 10 min |
+| 2.3 | Data sync test | `uv run python -c "from src.data.sync import sync_all; sync_all()"` | 10 min |
 | 2.4 | Performance benchmark | Compare HDF5 vs DuckDB query performance | 1 h |
 | 2.5 | Concurrent-write test | Verify ThreadSafeStorage correctness | 30 min |
 
```
```diff
@@ -682,8 +599,8 @@ if __name__ == "__main__":
 | No. | Task | Notes |
 |------|------|------|
 | 4.1 | Back up HDF5 files | `cp data/*.h5 data/backup/` |
-| 4.2 | Run the data migration | `uv run python scripts/migrate_h5_to_duckdb.py` |
-| 4.3 | Verify data integrity | Compare record counts, spot-check samples |
+| 4.2 | Run a full sync | `uv run python -c "from src.data.sync import sync_all; sync_all(force_full=True)"` |
+| 4.3 | Verify data integrity | Spot checks (query DuckDB and compare key data points) |
 | 4.4 | Delete HDF5 files | `rm data/*.h5` (after verification passes) |
 | 4.5 | Commit the code | `git add . && git commit -m "migrate: HDF5 to DuckDB"` |
 
```
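Step 4.3's spot check can start as simply as comparing per-table row counts from the HDF5 backup against DuckDB before anything is deleted. A small illustrative helper (the function name and the counts are assumptions for the example):

```python
def diverging_tables(expected: dict[str, int], actual: dict[str, int]) -> list[str]:
    """Return tables whose row counts differ between the two stores,
    treating a table missing from `actual` as a divergence."""
    return [t for t in expected if actual.get(t) != expected[t]]

# Counts from the HDF5 backup vs. what DuckDB reports after the sync.
h5_counts = {"daily": 1_200_000, "stock_basic": 5_300}
duckdb_counts = {"daily": 1_200_000, "stock_basic": 5_299}
print(diverging_tables(h5_counts, duckdb_counts))  # ['stock_basic'] needs a closer look
```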
```diff
@@ -724,7 +641,6 @@ uv run pytest tests/test_sync.py
 
 | File path | Description |
 |---------|------|
-| `scripts/migrate_h5_to_duckdb.py` | Data migration script |
 | `docs/hdf5_to_duckdb_migration.md` | This document |
 
 #### Test files (to be verified)
```
````diff
@@ -1072,7 +988,7 @@ conn.close()
 
 1. **Create appropriate indexes**:
    ```sql
-   CREATE INDEX idx_daily_code_date ON daily(ts_code, trade_date);
+   CREATE INDEX idx_daily_date_code ON daily(trade_date, ts_code);
    ```
 
 2. **Use partitioning (for large data volumes)**:
````
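Column order in a composite index decides which predicates it can serve: only queries that constrain the leading column can seek through it. A quick way to see the rule, again with `sqlite3` as a stand-in planner (DuckDB's `EXPLAIN` output looks different, but the leading-column behaviour is analogous):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (ts_code TEXT, trade_date TEXT, close REAL)")
conn.execute("CREATE INDEX idx_daily_date_code ON daily(trade_date, ts_code)")

def plan(sql: str) -> str:
    """Return sqlite's query plan as a string for inspection."""
    return str(conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall())

# Filtering on the leading column (trade_date) can use the index...
by_date = plan("SELECT * FROM daily WHERE trade_date = '2024-01-10'")
# ...but filtering on ts_code alone cannot, and falls back to a table scan.
by_code = plan("SELECT * FROM daily WHERE ts_code = '000001.SZ'")
print(by_date)
print(by_code)
```

This is why the document keeps both orderings (`ts_code, trade_date` via the primary key, `trade_date, ts_code` via the secondary index): each serves one side of the common query patterns.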