# DuckDB Data Sync Guide

ProStock has migrated its storage from HDF5 to DuckDB. This guide describes the new sync mechanism.

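## New Features at a Glance

- **Automatic table creation**: infers the table schema from a DataFrame
- **Composite index**: automatically creates a composite index on `(trade_date, ts_code)`
- **Incremental sync**: chooses the sync strategy automatically (by date or by stock)
- **Type mapping**: predefined data types for common fields
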
## Core Modules

### 1. TableManager - Table Management

```python
import pandas as pd

from src.data.db_manager import TableManager

# Create the table manager
manager = TableManager()

# Create a table from a DataFrame (the composite index is created automatically)
data = pd.DataFrame({
    "ts_code": ["000001.SZ"],
    "trade_date": ["20240101"],
    "close": [10.5],
})

manager.create_table_from_dataframe("daily", data)

# Ensure the table exists (it is created if missing)
manager.ensure_table_exists("daily", sample_data=data)
```

### 2. IncrementalSync - Incremental Sync

```python
from src.data.db_manager import IncrementalSync

sync = IncrementalSync()

# Determine the sync strategy
strategy, start, end, stocks = sync.get_sync_strategy(
    table_name="daily",
    start_date="20240101",
    end_date="20240131",
    stock_codes=None,  # None = all stocks
)

# Return values:
# - strategy: "by_date" | "by_stock" | "none"
# - start: sync start date
# - end: sync end date
# - stocks: list of stocks to sync (None = all)

# Run the sync (`data` is a DataFrame of rows to write,
# e.g. the one built in the TableManager example)
result = sync.sync_data("daily", data, strategy="by_date")
```

### 3. SyncManager - High-Level Sync

```python
from src.data.db_manager import SyncManager
from src.data.api_wrappers import get_daily

# Create the sync manager
manager = SyncManager()

# One-call sync (handles table creation, strategy selection, and data fetching)
result = manager.sync(
    table_name="daily",
    fetch_func=get_daily,  # data-fetching function
    start_date="20240101",
    end_date="20240131",
    stock_codes=["000001.SZ", "600000.SH"],  # optional: restrict to these stocks
)

print(result)
# {
#     "status": "success",
#     "table": "daily",
#     "strategy": "by_date",
#     "rows": 1000,
#     "date_range": "20240101 to 20240131"
# }
```

## Convenience Functions

### Quick Data Sync

```python
from src.data.db_manager import sync_table
from src.data.api_wrappers import get_daily

# Sync daily bar data
result = sync_table(
    table_name="daily",
    fetch_func=get_daily,
    start_date="20240101",
    end_date="20240131",
)
```

### Getting Table Info

```python
from src.data.db_manager import get_table_info

# Inspect table statistics
info = get_table_info("daily")
print(info)
# {
#     "exists": True,
#     "row_count": 100000,
#     "min_date": "20240101",
#     "max_date": "20240131",
#     "unique_stocks": 5000
# }
```

### Ensuring a Table Exists

```python
from src.data.db_manager import ensure_table

# If the table does not exist, create it from sample_data
# (`df` is a pandas DataFrame with the expected columns)
ensure_table("daily", sample_data=df)
```

## Sync Strategies in Detail

### 1. Sync by Date (by_date)

**When to use**: market-wide syncs and daily incremental updates

**Logic**:
- Table does not exist → full sync
- Table exists but is empty → full sync
- Table exists and has data → incremental sync starting from `last_date + 1`

```python
# Example: the table already holds data through 20240115
strategy, start, end, stocks = sync.get_sync_strategy(
    "daily", "20240101", "20240131"
)
# Returns: ("by_date", "20240116", "20240131", None)
# Only the new data for the 16th through the 31st needs to be synced
```

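
The by-date rules above can be sketched with the standard library alone. This is only an illustration, not the `IncrementalSync` implementation: `incremental_range` and its parameters are hypothetical names, dates are assumed to be `YYYYMMDD` strings, `last_date` is assumed to be `None` when the table is missing or empty, and `last_date + 1` is taken as the next calendar day rather than the next trading day.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

DATE_FMT = "%Y%m%d"


def incremental_range(
    last_date: Optional[str], start_date: str, end_date: str
) -> Optional[Tuple[str, str]]:
    """Return the (start, end) range still to fetch, or None if nothing is missing.

    Hypothetical helper: last_date is the newest trade_date already stored,
    or None when the table is missing or empty.
    """
    if last_date is None:
        # Missing or empty table: full sync of the requested range
        return start_date, end_date

    next_day = datetime.strptime(last_date, DATE_FMT) + timedelta(days=1)
    # Never start earlier than the requested start date
    # (YYYYMMDD strings sort in chronological order)
    start = max(next_day.strftime(DATE_FMT), start_date)
    if start > end_date:
        # Stored data already covers the requested range
        return None
    return start, end_date


print(incremental_range("20240115", "20240101", "20240131"))
# ('20240116', '20240131')
```

Returning `None` when the stored data already reaches `end_date` corresponds to the case where no sync is needed.
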
### 2. Sync by Stock (by_stock)

**When to use**: backfilling historical data for specific stocks

**Logic**:
- Check which of the requested stocks are missing from the table
- Sync only the missing stocks

```python
# Example: the table already contains 000001.SZ, and two stocks are requested
strategy, start, end, stocks = sync.get_sync_strategy(
    "daily", "20240101", "20240131",
    stock_codes=["000001.SZ", "600000.SH"],
)
# Returns: ("by_stock", "20240101", "20240131", ["600000.SH"])
# Only the missing 600000.SH is synced
```

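
The "missing stocks" check amounts to a set difference that keeps the request order. A minimal sketch, assuming the codes already present in the table have been read into a collection; `missing_stocks` is a hypothetical helper, not part of `db_manager`.

```python
from typing import Iterable, List


def missing_stocks(requested: Iterable[str], existing: Iterable[str]) -> List[str]:
    """Return requested codes that are absent from the table, in request order."""
    existing_set = set(existing)
    seen = set()
    missing = []
    for code in requested:
        # Skip codes already stored, and duplicates within the request itself
        if code not in existing_set and code not in seen:
            seen.add(code)
            missing.append(code)
    return missing


print(missing_stocks(["000001.SZ", "600000.SH"], ["000001.SZ"]))
# ['600000.SH']
```
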
### 3. No Sync Needed (none)

**When it applies**: the data is already up to date

**Trigger conditions**:
- The table exists and its dates already cover the requested range
- All requested stocks are already present

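
One possible reading of these two conditions, based on the examples in the previous sections: the stock condition applies when `stock_codes` is passed, and the date condition applies otherwise. The sketch below encodes that reading only; `is_up_to_date` and its parameters are hypothetical, and the actual precedence inside `get_sync_strategy` is not specified in this guide.

```python
from typing import Iterable, Optional


def is_up_to_date(
    max_date: Optional[str],
    existing_stocks: Iterable[str],
    end_date: str,
    stock_codes: Optional[Iterable[str]] = None,
) -> bool:
    """Hypothetical check for the "none" outcome (dates are YYYYMMDD strings)."""
    if stock_codes is not None:
        # Stock-scoped request: nothing to do when every requested code is stored
        return set(stock_codes) <= set(existing_stocks)
    # Market-wide request: nothing to do when stored dates reach the requested end
    return max_date is not None and max_date >= end_date
```
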
## Complete Example

```python
from src.data.db_manager import SyncManager, get_table_info
from src.data.api_wrappers import get_daily

# 1. Check the current table state
info = get_table_info("daily")
print(f"Current data: {info['row_count']} rows, latest date: {info['max_date']}")

# 2. Create the sync manager
manager = SyncManager()

# 3. Run the sync
result = manager.sync(
    table_name="daily",
    fetch_func=get_daily,
    start_date="20240101",
    end_date="20240222",
)

# 4. Check the result
if result["status"] == "success":
    print(f"Synced {result['rows']} rows")
    print(f"Strategy used: {result['strategy']}")
elif result["status"] == "skipped":
    print("Data is already up to date, nothing to sync")
else:
    print(f"Sync failed: {result.get('error')}")
```

## Type Mapping

Default field type mapping:

```python
DEFAULT_TYPE_MAPPING = {
    "ts_code": "VARCHAR(16)",
    "trade_date": "DATE",
    "open": "DOUBLE",
    "high": "DOUBLE",
    "low": "DOUBLE",
    "close": "DOUBLE",
    "pre_close": "DOUBLE",
    "change": "DOUBLE",
    "pct_chg": "DOUBLE",
    "vol": "DOUBLE",
    "amount": "DOUBLE",
    "turnover_rate": "DOUBLE",
    "volume_ratio": "DOUBLE",
    "adj_factor": "DOUBLE",
    "suspend_flag": "INTEGER",
}
```

Fields without a predefined type are inferred from the pandas dtype:
- `int` → `INTEGER`
- `float` → `DOUBLE`
- `bool` → `BOOLEAN`
- `datetime` → `TIMESTAMP`
- anything else → `VARCHAR`

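
The lookup order described above (predefined mapping first, dtype fallback second) could look roughly like this. To stay dependency-free, the sketch works on dtype names such as `"int64"` or `"datetime64[ns]"`, which is what `str(df[col].dtype)` yields in pandas; `duckdb_type` and `PREDEFINED` are hypothetical names, and the real inference code in `db_manager` may differ.

```python
from typing import Dict

# Small subset of the predefined mapping, for illustration only
PREDEFINED: Dict[str, str] = {
    "ts_code": "VARCHAR(16)",
    "trade_date": "DATE",
    "close": "DOUBLE",
}


def duckdb_type(column: str, dtype_name: str, predefined: Dict[str, str] = PREDEFINED) -> str:
    """Map a column to a DuckDB type: predefined fields first, then dtype fallback."""
    if column in predefined:
        return predefined[column]
    name = dtype_name.lower()
    if name.startswith("bool"):
        return "BOOLEAN"
    if name.startswith(("int", "uint")):
        return "INTEGER"
    if name.startswith("float"):
        return "DOUBLE"
    if name.startswith("datetime"):
        return "TIMESTAMP"
    # Strings, objects, and anything unrecognized
    return "VARCHAR"


print(duckdb_type("ts_code", "object"))   # VARCHAR(16)
print(duckdb_type("list_days", "int64"))  # INTEGER
```

DuckDB's `INTEGER` is a 32-bit type, so an integer column holding larger values would need `BIGINT`; this guide does not say whether the project handles that case.
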
## Index Strategy

Indexes created automatically:

1. **Primary key**: `(ts_code, trade_date)`, which guarantees that each row is unique
2. **Composite index**: `(trade_date, ts_code)`, which speeds up queries by date

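
To make the layout concrete, here is a sketch of the kind of DDL that would produce this primary key and composite index. It only builds SQL strings; `build_ddl` and the index name `idx_<table>_date_code` are assumptions for illustration, and the statements actually issued by `TableManager` are not shown in this guide.

```python
from typing import Dict, List


def build_ddl(table: str, columns: Dict[str, str]) -> List[str]:
    """Build CREATE TABLE and CREATE INDEX statements for the layout described above."""
    column_defs = [f"{name} {sql_type}" for name, sql_type in columns.items()]
    # Primary key enforces one row per (stock, date)
    column_defs.append("PRIMARY KEY (ts_code, trade_date)")
    create_table = f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(column_defs)})"
    # Secondary index with the date first, for date-range scans
    create_index = (
        f"CREATE INDEX IF NOT EXISTS idx_{table}_date_code "
        f"ON {table} (trade_date, ts_code)"
    )
    return [create_table, create_index]


for statement in build_ddl(
    "daily", {"ts_code": "VARCHAR(16)", "trade_date": "DATE", "close": "DOUBLE"}
):
    print(statement)
```
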
## Compatibility with Existing Code

The existing `Storage` and `ThreadSafeStorage` APIs are unchanged:

```python
from src.data.storage import Storage, ThreadSafeStorage

# Existing code keeps working
storage = Storage()
storage.save("daily", data)
df = storage.load("daily", start_date="20240101")
```

New functionality is provided through the `db_manager` module.

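## Performance Tips

1. **Batch writes**: use `SyncManager`, which handles batch writes automatically
2. **Avoid redundant queries**: check existing data with `get_table_info()` before fetching
3. **Pick the right strategy**: use `by_date` for market-wide updates and `by_stock` to backfill specific stocks
4. **Use the indexes**: filter queries on `trade_date` and `ts_code` where possible
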
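
As an illustration of the last tip, a query that restricts on the indexed columns might be built as follows. This is a sketch under assumptions: `build_daily_query` is a hypothetical helper, the table is assumed to be named `daily`, and the date values must be in whatever form the `trade_date` column expects.

```python
from typing import List, Optional, Sequence, Tuple


def build_daily_query(
    start_date: str,
    end_date: str,
    stock_codes: Optional[Sequence[str]] = None,
) -> Tuple[str, List[str]]:
    """Return a parameterized SELECT that filters on the indexed columns."""
    sql = "SELECT * FROM daily WHERE trade_date BETWEEN ? AND ?"
    params: List[str] = [start_date, end_date]
    if stock_codes:
        # One placeholder per requested stock code
        placeholders = ", ".join("?" for _ in stock_codes)
        sql += f" AND ts_code IN ({placeholders})"
        params.extend(stock_codes)
    return sql, params


sql, params = build_daily_query("20240101", "20240131", ["000001.SZ", "600000.SH"])
print(sql)
print(params)
```

The resulting `sql` and `params` can be passed to a DuckDB connection's `execute(sql, params)`, which accepts `?` placeholders.
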