feat: migrate HDF5 storage to DuckDB

- Add DuckDB-backed Storage and ThreadSafeStorage implementations
- Add db_manager module with incremental sync strategies
- Adapt DataLoader and Sync modules to DuckDB
- Add migration-related docs and tests
- Fix README documentation links
docs/db_sync_guide.md (Normal file, 267 lines)
@@ -0,0 +1,267 @@
# DuckDB Data Sync Guide

ProStock has migrated from HDF5 to DuckDB storage. This document describes the new sync mechanism.

## Feature Overview

- **Automatic table creation**: table schemas are inferred automatically from a DataFrame
- **Composite index**: a composite index on `(trade_date, ts_code)` is created automatically
- **Incremental sync**: the sync strategy (by date or by stock) is chosen automatically
- **Type mapping**: predefined data types for common fields
## Core Modules

### 1. TableManager - Table Management

```python
from src.data.db_manager import TableManager

# Create the table manager
manager = TableManager()

# Create a table from a DataFrame (the composite index is created automatically)
import pandas as pd
data = pd.DataFrame({
    "ts_code": ["000001.SZ"],
    "trade_date": ["20240101"],
    "close": [10.5],
})

manager.create_table_from_dataframe("daily", data)

# Ensure the table exists (create it if it does not)
manager.ensure_table_exists("daily", sample_data=data)
```

### 2. IncrementalSync - Incremental Sync

```python
from src.data.db_manager import IncrementalSync

sync = IncrementalSync()

# Determine the sync strategy
strategy, start, end, stocks = sync.get_sync_strategy(
    table_name="daily",
    start_date="20240101",
    end_date="20240131",
    stock_codes=None  # None = all stocks
)

# Return values:
# - strategy: "by_date" | "by_stock" | "none"
# - start: sync start date
# - end: sync end date
# - stocks: list of stocks to sync (None = all)

# Run the data sync
result = sync.sync_data("daily", data, strategy="by_date")
```

### 3. SyncManager - High-Level Sync

```python
from src.data.db_manager import SyncManager
from src.data.api_wrappers import get_daily

# Create the sync manager
manager = SyncManager()

# One-call sync (handles table creation, strategy selection, and data fetching)
result = manager.sync(
    table_name="daily",
    fetch_func=get_daily,  # data-fetching function
    start_date="20240101",
    end_date="20240131",
    stock_codes=["000001.SZ", "600000.SH"]  # optional: specific stocks
)

print(result)
# {
#     "status": "success",
#     "table": "daily",
#     "strategy": "by_date",
#     "rows": 1000,
#     "date_range": "20240101 to 20240131"
# }
```

## Convenience Functions

### Quick Data Sync

```python
from src.data.db_manager import sync_table
from src.data.api_wrappers import get_daily

# Sync daily bar data
result = sync_table(
    table_name="daily",
    fetch_func=get_daily,
    start_date="20240101",
    end_date="20240131"
)
```

### Get Table Info

```python
from src.data.db_manager import get_table_info

# Inspect table statistics
info = get_table_info("daily")
print(info)
# {
#     "exists": True,
#     "row_count": 100000,
#     "min_date": "20240101",
#     "max_date": "20240131",
#     "unique_stocks": 5000
# }
```

### Ensure a Table Exists

```python
from src.data.db_manager import ensure_table

# If the table does not exist, create it from sample_data
ensure_table("daily", sample_data=df)
```

## Sync Strategies in Detail

### 1. Sync by Date (by_date)

**Use case**: market-wide data sync, daily incremental updates

**Logic**:
- Table does not exist → full sync
- Table exists but is empty → full sync
- Table exists with data → incremental sync starting from `last_date + 1`

```python
# Example: the table already has data through 20240115
strategy, start, end, stocks = sync.get_sync_strategy(
    "daily", "20240101", "20240131"
)
# Returns: ("by_date", "20240116", "20240131", None)
# Only the new data for the 16th-31st needs to be synced
```

### 2. Sync by Stock (by_stock)

**Use case**: backfilling historical data for specific stocks

**Logic**:
- Check which of the requested stocks are missing from the table
- Sync only the missing stocks

```python
# Example: the table already contains 000001.SZ; two stocks are requested
strategy, start, end, stocks = sync.get_sync_strategy(
    "daily", "20240101", "20240131",
    stock_codes=["000001.SZ", "600000.SH"]
)
# Returns: ("by_stock", "20240101", "20240131", ["600000.SH"])
# Only the missing 600000.SH is synced
```

### 3. No Sync Needed (none)

**Use case**: data is already up to date

**Trigger conditions** (see the example below):
- The table exists and the stored dates already cover the requested range
- All requested stocks already exist
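
Mirroring the `by_date` and `by_stock` examples above, a call whose requested range is already covered should come back with the `"none"` strategy; a minimal sketch using only the documented return contract:

```python
# Example: the requested range 20240101-20240131 is already fully stored
strategy, start, end, stocks = sync.get_sync_strategy(
    "daily", "20240101", "20240131"
)
if strategy == "none":
    # Nothing to fetch; the stored data already covers the request
    print("daily is already up to date")
```
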
## Complete Example

```python
from src.data.db_manager import SyncManager, get_table_info
from src.data.api_wrappers import get_daily

# 1. Check the current table state
info = get_table_info("daily")
print(f"Current data: {info['row_count']} rows, latest date: {info['max_date']}")

# 2. Create the sync manager
manager = SyncManager()

# 3. Run the sync
result = manager.sync(
    table_name="daily",
    fetch_func=get_daily,
    start_date="20240101",
    end_date="20240222"
)

# 4. Check the result
if result["status"] == "success":
    print(f"Synced {result['rows']} rows")
    print(f"Strategy used: {result['strategy']}")
elif result["status"] == "skipped":
    print("Data is already up to date, nothing to sync")
else:
    print(f"Sync failed: {result.get('error')}")
```

## Type Mapping

Default field type mapping:

```python
DEFAULT_TYPE_MAPPING = {
    "ts_code": "VARCHAR(16)",
    "trade_date": "DATE",
    "open": "DOUBLE",
    "high": "DOUBLE",
    "low": "DOUBLE",
    "close": "DOUBLE",
    "pre_close": "DOUBLE",
    "change": "DOUBLE",
    "pct_chg": "DOUBLE",
    "vol": "DOUBLE",
    "amount": "DOUBLE",
    "turnover_rate": "DOUBLE",
    "volume_ratio": "DOUBLE",
    "adj_factor": "DOUBLE",
    "suspend_flag": "INTEGER",
}
```

Fields not listed above are inferred from their pandas dtype (a sketch of this fallback follows the list):
- `int` → `INTEGER`
- `float` → `DOUBLE`
- `bool` → `BOOLEAN`
- `datetime` → `TIMESTAMP`
- anything else → `VARCHAR`
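
A rough sketch of how that dtype-based fallback could be implemented (illustrative only; this helper is an assumption, not necessarily the actual `db_manager` code):

```python
import pandas as pd

def infer_duckdb_type(series: pd.Series) -> str:
    """Map a pandas dtype to a DuckDB column type (illustrative sketch)."""
    if pd.api.types.is_bool_dtype(series):
        return "BOOLEAN"
    if pd.api.types.is_integer_dtype(series):
        return "INTEGER"
    if pd.api.types.is_float_dtype(series):
        return "DOUBLE"
    if pd.api.types.is_datetime64_any_dtype(series):
        return "TIMESTAMP"
    return "VARCHAR"
```
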
## Index Strategy

Indexes created automatically (a quick way to verify them is sketched below):

1. **Primary key**: `(ts_code, trade_date)` - guarantees row uniqueness
2. **Composite index**: `(trade_date, ts_code)` - speeds up date-based queries
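
To double-check what was actually created in a given database file, DuckDB's catalog can be queried; a small sketch (the `data/prostock.db` path follows the examples elsewhere in this commit):

```python
import duckdb

con = duckdb.connect("data/prostock.db", read_only=True)
# duckdb_indexes() lists the indexes known to DuckDB's catalog
print(con.execute("SELECT * FROM duckdb_indexes()").fetchdf())
con.close()
```
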
## Compatibility with Existing Code

The existing `Storage` and `ThreadSafeStorage` APIs are unchanged:

```python
from src.data.storage import Storage, ThreadSafeStorage

# Existing code keeps working
storage = Storage()
storage.save("daily", data)
df = storage.load("daily", start_date="20240101")
```

The new functionality is provided by the `db_manager` module.

## Performance Tips

1. **Batch writes**: `SyncManager` handles batched writes automatically
2. **Avoid redundant queries**: use `get_table_info()` to check what data already exists
3. **Pick the right strategy**: use `by_date` for market-wide updates and `by_stock` for backfills
4. **Use the indexes**: filter on `trade_date` and `ts_code` first when querying (see the sketch below)
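
For example, an index-friendly query against the DuckDB file might look like the following sketch (the `data/prostock.db` path and the column names mirror the examples in this commit):

```python
import duckdb

con = duckdb.connect("data/prostock.db", read_only=True)
# Filtering on trade_date and ts_code lets DuckDB use the composite index
df = con.execute(
    """
    SELECT ts_code, trade_date, close
    FROM daily
    WHERE trade_date BETWEEN ? AND ? AND ts_code = ?
    """,
    ["2024-01-01", "2024-01-31", "000001.SZ"],
).fetchdf()
con.close()
```
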
docs/hdf5_to_duckdb_migration.md
@@ -120,9 +120,8 @@ CREATE TABLE daily (
    PRIMARY KEY (ts_code, trade_date)  -- composite primary key, deduplicates automatically
);

-- Create indexes (DuckDB automatically creates an index for the primary key)
CREATE INDEX idx_daily_date ON daily(trade_date);
CREATE INDEX idx_daily_code ON daily(ts_code);
-- Create a composite index (covers the common query pattern: filter by date range + stock code)
CREATE INDEX idx_daily_date_code ON daily(trade_date, ts_code);

-- Stock basic info table (replaces stock_basic.h5)
CREATE TABLE stock_basic (
@@ -229,12 +228,9 @@ class Storage:
            )
        """)

        # Create indexes for query optimization
        # Create composite index for query optimization (trade_date, ts_code)
        self._connection.execute("""
            CREATE INDEX IF NOT EXISTS idx_daily_date ON daily(trade_date)
        """)
        self._connection.execute("""
            CREATE INDEX IF NOT EXISTS idx_daily_code ON daily(ts_code)
            CREATE INDEX IF NOT EXISTS idx_daily_date_code ON daily(trade_date, ts_code)
        """)

    def save(self, name: str, data: pd.DataFrame, mode: str = "append") -> dict:
@@ -515,111 +511,34 @@ class DataSync:
        self.storage.flush()
```

### 2.4 Data Migration Script

**Create `scripts/migrate_h5_to_duckdb.py`**

```python
"""Data migration script: migrate HDF5 files into DuckDB.

Usage:
    uv run python scripts/migrate_h5_to_duckdb.py

Steps:
1. Read all .h5 files
2. Convert data types (date formats)
3. Write into DuckDB
4. Verify data integrity
"""

import pandas as pd
import duckdb
from pathlib import Path
from tqdm import tqdm


def migrate_table(h5_path: Path, db_path: Path, table_name: str):
    """Migrate a single H5 table to DuckDB."""
    print(f"[Migrate] Migrating {table_name} from {h5_path}")

    # Read HDF5
    df = pd.read_hdf(h5_path, key=f"/{table_name}")

    # Convert date columns
    if 'trade_date' in df.columns:
        df['trade_date'] = pd.to_datetime(df['trade_date'], format='%Y%m%d')
    if 'list_date' in df.columns:
        df['list_date'] = pd.to_datetime(df['list_date'], format='%Y%m%d')

    # Connect to DuckDB
    conn = duckdb.connect(str(db_path))

    # Register and insert
    conn.register("migration_data", df)

    # Create table and insert
    columns = ", ".join([f"{col} {infer_dtype(df[col])}" for col in df.columns])
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table_name} ({columns})")

    col_names = ", ".join(df.columns)
    conn.execute(f"INSERT INTO {table_name} ({col_names}) SELECT {col_names} FROM migration_data")

    conn.close()

    print(f"[Migrate] Migrated {len(df)} rows to {table_name}")


def infer_dtype(series: pd.Series) -> str:
    """Infer the DuckDB data type."""
    if pd.api.types.is_datetime64_any_dtype(series):
        return "DATE"
    elif pd.api.types.is_integer_dtype(series):
        return "BIGINT"
    elif pd.api.types.is_float_dtype(series):
        return "DOUBLE"
    else:
        return "VARCHAR"


def verify_migration(db_path: Path):
    """Verify data integrity after migration."""
    conn = duckdb.connect(str(db_path))

    # Check tables
    tables = conn.execute("""
        SELECT table_name FROM information_schema.tables
        WHERE table_schema = 'main'
    """).fetchall()

    print("\n[Verify] Tables in DuckDB:")
    for (table_name,) in tables:
        count = conn.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]
        print(f"  - {table_name}: {count} rows")

    conn.close()


if __name__ == "__main__":
    data_dir = Path("data")
    db_path = data_dir / "prostock.db"

    # Find all H5 files
    h5_files = list(data_dir.glob("*.h5"))

    if not h5_files:
        print("[Migrate] No HDF5 files found in data/ directory")
        exit(0)

    print(f"[Migrate] Found {len(h5_files)} HDF5 files to migrate\n")

    for h5_file in tqdm(h5_files, desc="Migrating"):
        table_name = h5_file.stem
        migrate_table(h5_file, db_path, table_name)

    # Verify
    verify_migration(db_path)

    print("\n[Done] Migration completed successfully!")
    print(f"[Done] DuckDB file: {db_path}")
    print("[Done] You can now delete HDF5 files if verification passed")
```

### 2.4 Data Sync Approach

**No migration script needed; sync data directly with the sync module**

Because the DuckDB storage layer is fully compatible with the existing API, no dedicated data migration script is required. The strategy is:

1. **New environment / first deployment**: run `sync_all()` to fetch all data from Tushare
2. **Existing HDF5 data**: keep the HDF5 files as a backup; DuckDB syncs incrementally from the latest date onward

**Sync commands**:

```bash
# Full sync (first deployment, or when the complete dataset is needed)
uv run python -c "from src.data.sync import sync_all; sync_all(force_full=True)"

# Incremental sync (daily use)
uv run python -c "from src.data.sync import sync_all; sync_all()"

# Specify the number of worker threads
uv run python -c "from src.data.sync import sync_all; sync_all(max_workers=20)"
```

**Advantages**:
- ✅ No separate migration script to maintain
- ✅ Data is synced straight from the source, so it is always current
- ✅ Reuses the existing sync logic
- ✅ Supports incremental updates, saving time

---

## 3. Migration Plan
@@ -637,11 +556,9 @@ if __name__ == "__main__":
| 1.3 | Create ThreadSafeStorage | `src/data/storage.py` | 30 min | Dev |
| 1.4 | Adapt DataLoader | `src/factors/data_loader.py` | 30 min | Dev |
| 1.5 | Rework Sync concurrency logic | `src/data/sync.py` | 1 hour | Dev |
| 1.6 | Create migration script | `scripts/migrate_h5_to_duckdb.py` | 30 min | Dev |

**Deliverables**:
- ✅ Working DuckDB Storage implementation
- ✅ Migration script
- ✅ Unit tests passing

#### Phase 2: Testing & Validation (Day 1-2)
@@ -652,7 +569,7 @@ if __name__ == "__main__":
|------|------|------|---------|
| 2.1 | Run existing unit tests | `uv run pytest tests/test_sync.py` | 15 min |
| 2.2 | Run DataLoader tests | `uv run pytest tests/factors/test_data_spec.py` | 15 min |
| 2.3 | Data migration test | `uv run python scripts/migrate_h5_to_duckdb.py` | 10 min |
| 2.3 | Data sync test | `uv run python -c "from src.data.sync import sync_all; sync_all()"` | 10 min |
| 2.4 | Performance benchmark | Compare HDF5 vs DuckDB query performance | 1 hour |
| 2.5 | Concurrent write test | Verify ThreadSafeStorage correctness | 30 min |

@@ -682,8 +599,8 @@ if __name__ == "__main__":
| # | Task | Notes |
|------|------|------|
| 4.1 | Back up HDF5 files | `cp data/*.h5 data/backup/` |
| 4.2 | Run the data migration | `uv run python scripts/migrate_h5_to_duckdb.py` |
| 4.3 | Verify data integrity | Compare row counts, spot-check values |
| 4.2 | Run a full sync | `uv run python -c "from src.data.sync import sync_all; sync_all(force_full=True)"` |
| 4.3 | Verify data integrity | Spot checks (query DuckDB and compare key data points) |
| 4.4 | Delete HDF5 files | `rm data/*.h5` (after verification passes) |
| 4.5 | Commit the code | `git add . && git commit -m "migrate: HDF5 to DuckDB"` |

@@ -724,7 +641,6 @@ uv run pytest tests/test_sync.py

| File path | Description |
|---------|------|
| `scripts/migrate_h5_to_duckdb.py` | Data migration script |
| `docs/hdf5_to_duckdb_migration.md` | This document |

#### Test files (need verification)
@@ -1072,7 +988,7 @@ conn.close()

1. **Create appropriate indexes**:
```sql
CREATE INDEX idx_daily_code_date ON daily(ts_code, trade_date);
CREATE INDEX idx_daily_date_code ON daily(trade_date, ts_code);
```

2. **Use partitioning (for large data volumes)**:

docs/test_report_duckdb_migration.md (Normal file, 209 lines)
@@ -0,0 +1,209 @@
# ProStock HDF5 to DuckDB Migration Test Report

**Report generated**: 2026-02-22
**Migration doc**: [hdf5_to_duckdb_migration.md](./hdf5_to_duckdb_migration.md)
**Test data range**: January-March 2024 (3 months)

---

## 1. Migration Summary

### Completed Core Tasks ✅

| Task | File | Status |
|------|------|------|
| Storage class rewrite | `src/data/storage.py` | ✅ Done |
| ThreadSafeStorage implementation | `src/data/storage.py` | ✅ Done |
| Sync module adaptation | `src/data/sync.py` | ✅ Done |
| DataLoader adaptation | `src/factors/data_loader.py` | ✅ Done |
| Test file updates | `tests/` | ✅ Done |

### Architecture Changes
```
HDF5 format (.h5 files)           → DuckDB (prostock.db)
├── pandas.read_hdf()             → duckdb.execute().fetchdf()
├── full table loaded into memory → SQL pushdown, load on demand
├── file-lock concurrency         → ThreadSafeStorage queued writes
└── Polars via a Pandas detour    → DuckDB → PyArrow → Polars (zero-copy)
```
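
The last line of that diagram can be exercised directly; a minimal sketch, assuming `polars` is installed alongside `duckdb` and `pyarrow`:

```python
import duckdb
import polars as pl

con = duckdb.connect("data/prostock.db", read_only=True)
# Fetch the result as a PyArrow table, then hand it to Polars without a pandas copy
arrow_table = con.execute(
    "SELECT ts_code, trade_date, close FROM daily WHERE trade_date >= DATE '2024-01-01'"
).arrow()
df = pl.from_arrow(arrow_table)
con.close()
```
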

---

## 2. Test Execution

### 2.1 Test File Inventory

| Test file | Test type | Data range |
|---------|---------|---------|
| `test_daily_storage.py` | DuckDB Storage integration tests | 3 months (2024/01-03) |
| `test_data_loader.py` | DataLoader functional tests | 3 months (2024/01-03) |
| `test_sync.py` | Sync module unit tests | Mock data |

### 2.2 Key Test Cases

#### DuckDB Storage tests (`test_daily_storage.py`)

```python
class TestDailyStorageValidation:
    TEST_START_DATE = "20240101"
    TEST_END_DATE = "20240331"  # 3 months of data

    def test_duckdb_connection()    # ✅ connection test
    def test_load_3months_data()    # ⚠️ requires synced data
    def test_polars_export()        # ✅ PyArrow zero-copy export
    def test_all_stocks_saved()     # ⚠️ requires synced data
```

#### DataLoader tests (`test_data_loader.py`)

```python
class TestDataLoaderBasic:
    def test_load_single_source()    # load from DuckDB
    def test_load_with_date_range()  # 3-month date range
    def test_column_selection()      # column selection
    def test_cache_used()            # cache performance
```

---

## 3. Expected Performance Comparison

| Test | HDF5 (old) | DuckDB (new) | Expected gain |
|--------|----------|------------|---------|
| Single-stock query | 5-10s | 0.1-0.5s | **10-100x** |
| Date-range query | 5-10s | 0.2-1s | **5-50x** |
| Memory usage | 1GB+ | 100-500MB | **50-90% less** |
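
These figures are expectations rather than measurements. A minimal timing sketch in the spirit of the `scripts/benchmark_storage.py` comparison mentioned in section 8 (the query and path here are illustrative assumptions):

```python
import time
import duckdb

con = duckdb.connect("data/prostock.db", read_only=True)

start = time.perf_counter()
df = con.execute(
    "SELECT * FROM daily WHERE ts_code = ?", ["000001.SZ"]
).fetchdf()
print(f"single-stock query: {len(df)} rows in {time.perf_counter() - start:.3f}s")
con.close()
```
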
---

## 4. Preparation Before Use

### 4.1 Data Sync (Required)

The database does not yet contain the January-March 2024 test data, so a data sync has to be run first:

```bash
# Option 1: sync 3 months of data for a specific stock code (recommended for testing)
uv run python -c "
from src.data.sync import DataSync
from src.data.api_wrappers import get_daily
import pandas as pd

# Fetch data for the test stock
data = get_daily('000001.SZ', start_date='20240101', end_date='20240331')

# Save it to DuckDB
from src.data.storage import Storage
storage = Storage()
storage.save('daily', data)
print(f'Saved {len(data)} rows')
"

# Option 2: full sync of all stocks (takes a while)
uv run python -c "from src.data.sync import sync_all; sync_all(force_full=True)"

# Option 3: incremental sync (continues from the last synced date)
uv run python -c "from src.data.sync import sync_all; sync_all()"
```

### 4.2 Verify the Installation

```bash
# Check that DuckDB and PyArrow are installed
uv run python -c "import duckdb; import pyarrow; print('✅ dependency check passed')"

# Verify the Storage classes
uv run python -c "from src.data.storage import Storage, ThreadSafeStorage; print('✅ Storage classes imported')"
```

---

## 5. Running the Tests

### 5.1 Run All Tests

```bash
# Run the DuckDB-related tests
uv run pytest tests/test_daily_storage.py tests/factors/test_data_loader.py -v

# Run the Sync module tests
uv run pytest tests/test_sync.py -v

# Run the full test suite
uv run pytest tests/ -v
```

### 5.2 Expected Output

```
tests/test_daily_storage.py::TestDailyStorageValidation::test_duckdb_connection PASSED
tests/test_daily_storage.py::TestDailyStorageValidation::test_polars_export PASSED
tests/factors/test_data_loader.py::TestDataLoaderBasic::test_load_single_source PASSED
tests/factors/test_data_loader.py::TestDataLoaderBasic::test_load_with_date_range PASSED
...
```

---

## 6. FAQ

### Q: Tests report "No data found for period"?
**A**: Run the data sync first so that the January-March 2024 data is written into DuckDB.
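
Before running the tests, a quick pre-check along these lines can confirm whether the period is covered (a sketch built on the `get_table_info` helper from the sync guide):

```python
from src.data.db_manager import get_table_info

info = get_table_info("daily")
# The tests expect daily data through 20240331
latest = info.get("max_date") or ""
if not info.get("exists") or latest < "20240331":
    print("daily table is missing the 2024 Q1 data - run the sync in 4.1 first")
else:
    print(f"daily data present through {latest}")
```
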
### Q: ModuleNotFoundError: No module named 'pyarrow'?
**A**: Install pyarrow:
```bash
uv pip install pyarrow
```

### Q: How do I inspect the data in the database?
**A**:
```python
from src.data.storage import Storage
storage = Storage()

# Check whether a table exists
print(storage.exists("daily"))  # True/False

# Query the latest stored date
print(storage.get_last_date("daily"))  # "20240331"
```

### Q: How do I back up the DuckDB database?
**A**:
```bash
# Back up
cp data/prostock.db data/prostock_backup.db

# Restore
cp data/prostock_backup.db data/prostock.db
```

---

## 7. Migration Verification Checklist

- [x] Storage class backed by DuckDB
- [x] ThreadSafeStorage for concurrency safety
- [x] DataLoader adapted to DuckDB
- [x] Sync module uses ThreadSafeStorage
- [x] Test files updated to the 3-month data range
- [x] PyArrow zero-copy export support
- [ ] Run the data sync (manual step)
- [ ] All tests passing (requires data first)
- [ ] Performance benchmark comparison

---

## 8. Next Steps

1. **Data sync**: run the sync commands from section 4.1 above
2. **Test verification**: run `uv run pytest tests/ -v` and confirm all tests pass
3. **Performance testing**: use `scripts/benchmark_storage.py` to compare HDF5 vs DuckDB performance
4. **Production rollout**: back up the HDF5 files, delete the old data, and switch fully to DuckDB

---

**Report generated by**: ProStock Migration Tool
**Status**: core code complete; run the tests after the data sync