# ProStock HDF5 to DuckDB Migration Test Report

**Report generated**: 2026-02-22

**Completed**: 2026-02-22

**Status**: ✅ Completed

**Migration document**: [hdf5_to_duckdb_migration.md](./hdf5_to_duckdb_migration.md)

**Test data range**: January-March 2024 (3 months)

---

## 1. Migration Implementation Summary

### Completed core tasks ✅

| Task | File | Status |
|------|------|--------|
| Storage class rewrite | `src/data/storage.py` | ✅ Done |
| ThreadSafeStorage implementation | `src/data/storage.py` | ✅ Done |
| Sync module adaptation | `src/data/sync.py` | ✅ Done |
| DataLoader adaptation | `src/factors/data_loader.py` | ✅ Done |
| Test file updates | `tests/` | ✅ Done |

### Architecture changes

```
HDF5 format (.h5 files)             →  DuckDB (prostock.db)
├── pandas.read_hdf()               →  duckdb.execute().fetchdf()
├── Full table loaded into memory   →  SQL query pushdown, load on demand
├── File-lock concurrency           →  ThreadSafeStorage queued writes
└── Polars via a Pandas round trip  →  DuckDB → PyArrow → Polars (zero-copy)
```

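
The "queued writes" idea in the diagram above can be illustrated with a minimal sketch: many producer threads enqueue batches, and a single worker thread performs every write. This is not the actual `ThreadSafeStorage` code. The class name `QueuedWriter`, the method `queue_save`, and the table columns are hypothetical, and the standard-library `sqlite3` module stands in for DuckDB so that the example runs without extra dependencies.

```python
import queue
import sqlite3
import threading


class QueuedWriter:
    """Hypothetical sketch: funnel writes from many threads through one writer thread."""

    _STOP = object()  # sentinel that tells the worker to exit

    def __init__(self, conn):
        self._conn = conn
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def queue_save(self, rows):
        """Enqueue a batch of rows; safe to call from any thread."""
        self._queue.put(rows)

    def _drain(self):
        # Only this thread touches the connection while the writer is running.
        while True:
            rows = self._queue.get()
            if rows is self._STOP:
                break
            self._conn.executemany(
                "INSERT INTO daily (ts_code, trade_date, close) VALUES (?, ?, ?)",
                rows,
            )
            self._conn.commit()

    def close(self):
        """Flush all pending batches and stop the worker thread."""
        self._queue.put(self._STOP)
        self._worker.join()


def demo_queued_writes(n_threads=4, rows_per_thread=50):
    """Write from several threads and return the final row count."""
    # sqlite3 is a stand-in for DuckDB here (assumption made for runnability).
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("CREATE TABLE daily (ts_code TEXT, trade_date TEXT, close REAL)")
    writer = QueuedWriter(conn)

    def produce(i):
        rows = [
            (f"{i:06d}.SZ", f"202401{d % 28 + 1:02d}", float(d))
            for d in range(rows_per_thread)
        ]
        writer.queue_save(rows)

    threads = [threading.Thread(target=produce, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    writer.close()  # after this, the main thread may read safely
    return conn.execute("SELECT COUNT(*) FROM daily").fetchone()[0]


if __name__ == "__main__":
    print(demo_queued_writes())
```

Because only one thread ever issues writes, producers never contend for a file lock; they only pay the cost of a queue `put`.
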
---

## 2. Test Execution

### 2.1 Test file list

| Test file | Test type | Data range |
|-----------|-----------|------------|
| `test_daily_storage.py` | DuckDB Storage integration tests | 3 months (2024/01-03) |
| `test_data_loader.py` | DataLoader functional tests | 3 months (2024/01-03) |
| `test_sync.py` | Sync module unit tests | Mock data |

### 2.2 Key test cases

#### DuckDB Storage tests (`test_daily_storage.py`)

```python
# Outline only: test bodies are elided.
class TestDailyStorageValidation:
    TEST_START_DATE = "20240101"
    TEST_END_DATE = "20240331"  # 3 months of data

    def test_duckdb_connection(self): ...   # ✅ connection test
    def test_load_3months_data(self): ...   # ⚠️ requires synced data
    def test_polars_export(self): ...       # ✅ PyArrow zero-copy export
    def test_all_stocks_saved(self): ...    # ⚠️ requires synced data
```

#### DataLoader tests (`test_data_loader.py`)

```python
# Outline only: test bodies are elided.
class TestDataLoaderBasic:
    def test_load_single_source(self): ...    # load from DuckDB
    def test_load_with_date_range(self): ...  # 3-month date range
    def test_column_selection(self): ...      # column selection
    def test_cache_used(self): ...            # cache performance
```

---

## 3. Expected Performance Comparison

| Test item | HDF5 (old) | DuckDB (new) | Expected improvement |
|-----------|------------|--------------|----------------------|
| Single-stock query | 5-10s | 0.1-0.5s | **10-100x faster** |
| Date-range query | 5-10s | 0.2-1s | **5-50x faster** |
| Memory usage | 1GB+ | 100-500MB | **50-90% lower** |

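
The "expected improvement" column follows from dividing the bounds of the old range by the bounds of the new one. The sketch below reproduces that arithmetic; the function names are illustrative, and the "1GB+" memory figure is treated as a 1000 MB lower bound, which is an assumption.

```python
def speedup_range(old_lo, old_hi, new_lo, new_hi):
    """Return (worst-case, best-case) speedup for an old and a new time range."""
    # Worst case: fastest old run vs. slowest new run.
    # Best case: slowest old run vs. fastest new run.
    return old_lo / new_hi, old_hi / new_lo


def reduction_range(old, new_lo, new_hi):
    """Return (smallest, largest) fractional reduction relative to `old`."""
    return 1 - new_hi / old, 1 - new_lo / old


if __name__ == "__main__":
    print("single-stock query:", speedup_range(5, 10, 0.1, 0.5))
    print("date-range query:  ", speedup_range(5, 10, 0.2, 1.0))
    # Assumes "1GB+" means at least 1000 MB.
    print("memory reduction:  ", reduction_range(1000, 100, 500))
```

With the table's inputs, this yields the 10-100x, 5-50x, and 50-90% ranges shown above. These remain expectations until the benchmark has actually been run.
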
---

## 4. Preparation Before Use

### 4.1 Data sync (required)

The database currently contains no test data for January-March 2024, so a data sync must be run first:

```bash
# Option 1: sync 3 months of data for a specific stock code (recommended for testing)
uv run python -c "
from src.data.sync import DataSync
from src.data.api_wrappers import get_daily
import pandas as pd

# Fetch data for the test stock
data = get_daily('000001.SZ', start_date='20240101', end_date='20240331')

# Save to DuckDB
from src.data.storage import Storage
storage = Storage()
storage.save('daily', data)
print(f'Saved {len(data)} rows')
"

# Option 2: full sync of all stocks (takes a long time)
uv run python -c "from src.data.sync import sync_all; sync_all(force_full=True)"

# Option 3: incremental sync (continues from the last synced date)
uv run python -c "from src.data.sync import sync_all; sync_all()"
```

### 4.2 Verify the installation

```bash
# Check that DuckDB and PyArrow are installed
uv run python -c "import duckdb; import pyarrow; print('✅ Dependency check passed')"

# Verify the Storage classes
uv run python -c "from src.data.storage import Storage, ThreadSafeStorage; print('✅ Storage classes imported successfully')"
```

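
The same dependency check can be done without importing the packages, which is useful when you want a list of everything that is missing rather than a failure on the first `ImportError`. The sketch below uses only the standard library; the helper name `missing_modules` is illustrative and not part of ProStock.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["duckdb", "pyarrow"])
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    else:
        print("Dependency check passed")
```
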
---

## 5. Running the Tests

### 5.1 Run all tests

```bash
# Run the DuckDB-related tests
uv run pytest tests/test_daily_storage.py tests/factors/test_data_loader.py -v

# Run the Sync module tests
uv run pytest tests/test_sync.py -v

# Run the full test suite
uv run pytest tests/ -v
```

### 5.2 Expected output

```
tests/test_daily_storage.py::TestDailyStorageValidation::test_duckdb_connection PASSED
tests/test_daily_storage.py::TestDailyStorageValidation::test_polars_export PASSED
tests/factors/test_data_loader.py::TestDataLoaderBasic::test_load_single_source PASSED
tests/factors/test_data_loader.py::TestDataLoaderBasic::test_load_with_date_range PASSED
...
```

---

## 6. Frequently Asked Questions (FAQ)

### Q: The tests report "No data found for period"?

**A**: Run the data sync first (see section 4.1) so that the January-March 2024 data is written to DuckDB.

### Q: ModuleNotFoundError: No module named 'pyarrow'?

**A**: Install pyarrow:

```bash
uv pip install pyarrow
```

### Q: How do I inspect the data in the database?

**A**:

```python
from src.data.storage import Storage

storage = Storage()

# Check whether the table exists
print(storage.exists("daily"))  # True/False

# Query the latest date
print(storage.get_last_date("daily"))  # e.g. "20240331"
```

### Q: How do I back up the DuckDB database?

**A**:

```bash
# Back up
cp data/prostock.db data/prostock_backup.db

# Restore
cp data/prostock_backup.db data/prostock.db
```

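
A timestamped variant of the `cp` backup can be sketched in Python. It assumes that no process has the database open while it is copied, since a file copy taken during a write may be inconsistent. It also copies a sibling `.wal` file if one exists, because DuckDB keeps its write-ahead log next to the database file. The function name `backup_db` and the backup naming scheme are illustrative, not part of ProStock.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_db(db_path, backup_dir, now=None):
    """Copy a database file into backup_dir under a timestamped name.

    Assumes the database is closed while copying. Returns the backup path.
    """
    db_path = Path(db_path)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    target = backup_dir / f"{db_path.stem}_{stamp}{db_path.suffix}"
    shutil.copy2(db_path, target)

    # Copy the write-ahead log too, if one is present next to the database.
    wal = db_path.with_name(db_path.name + ".wal")
    if wal.exists():
        shutil.copy2(wal, target.with_name(target.name + ".wal"))
    return target


if __name__ == "__main__":
    # Path taken from the cp example above; the backup directory is an assumption.
    print(backup_db("data/prostock.db", "data/backups"))
```
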
---

## 7. Migration Verification Checklist

- [x] Storage class implements DuckDB storage
- [x] ThreadSafeStorage provides concurrency safety
- [x] DataLoader adapted to DuckDB
- [x] Sync module uses ThreadSafeStorage
- [x] Test files updated to the 3-month data range
- [x] PyArrow zero-copy export supported
- [ ] Run the data sync (must be run manually)
- [ ] All tests pass (requires synced data)
- [ ] Performance benchmark comparison

---

## 8. Next Steps

1. **Data sync**: Run the data sync commands from section 4.1
2. **Test verification**: Run `uv run pytest tests/ -v` and confirm that all tests pass
3. **Performance testing**: Use `scripts/benchmark_storage.py` to compare HDF5 and DuckDB performance
4. **Production deployment**: Back up the HDF5 files, delete the old data, and switch over fully to DuckDB

---

**Report generated by**: ProStock Migration Tool

**Status**: Core code complete; tests will be run once the data sync has finished