# ProStock 数据接口封装规范

## 1. 概述

本文档定义了在 `src/data/` 目录下新增 Tushare API 接口封装的标准规范。所有非特殊接口(因子和基础数据)必须遵循此规范,以确保:

- 代码风格统一
- 自动 sync 支持
- 增量更新逻辑一致
- 减少存储写入压力

## 2. 接口分类

### 2.1 特殊接口(不参与统一 sync)

以下接口有独立的同步逻辑,不参与本文档定义的自动 sync 机制:

| 接口类型 | 示例 | 说明 |
|---------|------|------|
| 交易日历 | `trade_cal` | 全局数据,按日期范围获取 |
| 股票基础信息 | `stock_basic` | 一次性全量获取,CSV 存储 |
| 辅助数据 | 行业分类、概念分类 | 低频更新,独立管理 |

### 2.2 标准接口(必须遵循本规范)

所有**按股票**或**按日期**获取的因子数据、行情数据、财务数据等,必须遵循本规范。

## 3. 文件结构

### 3.1 文件命名

```
{data_type}.py
```

示例:

- `daily.py` - 日线行情
- `moneyflow.py` - 资金流向
- `limit_list.py` - 涨跌停数据
- `stk_holdernumber.py` - 股东人数

### 3.2 文件位置

```
src/data/
├── __init__.py          # 导出公共接口
├── client.py            # TushareClient(已有)
├── config.py            # 配置管理(已有)
├── storage.py           # 存储管理(已有)
├── rate_limiter.py      # 速率限制(已有)
├── trade_cal.py         # 交易日历(特殊接口)
├── stock_basic.py       # 股票基础(特殊接口)
├── daily.py             # 日线行情(参考示例)
└── {new_data_type}.py   # 新增接口文件
```

## 4. 接口设计规范

### 4.1 数据获取函数

#### 4.1.1 按股票获取的接口

适用于:日线行情、分钟线、资金流向等。

```python
def get_{data_type}(
    ts_code: str,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
    # 其他可选参数...
) -> pd.DataFrame:
    """获取 {数据描述}。

    Args:
        ts_code: 股票代码(如 '000001.SZ')
        start_date: 开始日期(YYYYMMDD 格式)
        end_date: 结束日期(YYYYMMDD 格式)
        # 其他参数说明...

    Returns:
        pd.DataFrame,包含以下字段:
        - ts_code: 股票代码
        - trade_date: 交易日期
        # 其他字段...

    Example:
        >>> data = get_{data_type}('000001.SZ', start_date='20240101', end_date='20240131')
    """
    client = TushareClient()

    params = {"ts_code": ts_code}
    if start_date:
        params["start_date"] = start_date
    if end_date:
        params["end_date"] = end_date
    # 其他参数...

    data = client.query("{api_name}", **params)
    return data
```

#### 4.1.2 按日期获取的接口

适用于:每日涨跌停、每日龙虎榜、每日筹码分布等。

```python
def get_{data_type}(
    trade_date: Optional[str] = None,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
    ts_code: Optional[str] = None,
    # 其他可选参数...
) -> pd.DataFrame:
    """获取 {数据描述}。

    **优先按日期获取**(推荐):
    - 使用 trade_date 获取单日全市场数据
    - 或使用 start_date + end_date 获取区间数据

    Args:
        trade_date: 交易日期(YYYYMMDD 格式),获取单日全市场数据
        start_date: 开始日期(YYYYMMDD 格式)
        end_date: 结束日期(YYYYMMDD 格式)
        ts_code: 股票代码(可选,用于过滤特定股票)
        # 其他参数说明...

    Returns:
        pd.DataFrame,包含以下字段:
        - ts_code: 股票代码
        - trade_date: 交易日期
        # 其他字段...

    Example:
        >>> # 获取单日全市场数据(推荐)
        >>> data = get_{data_type}(trade_date='20240115')
        >>> # 获取区间数据
        >>> data = get_{data_type}(start_date='20240101', end_date='20240131')
    """
    client = TushareClient()

    params = {}
    if trade_date:
        params["trade_date"] = trade_date
    if start_date:
        params["start_date"] = start_date
    if end_date:
        params["end_date"] = end_date
    if ts_code:
        params["ts_code"] = ts_code
    # 其他参数...

    data = client.query("{api_name}", **params)
    return data
```

### 4.2 关键设计原则

#### 4.2.1 优先按日期获取

**强烈建议**优先实现按日期获取的接口:

1. **效率更高**:一次请求获取全市场单日数据
2. **API 调用更少**:同步 N 个交易日只需 N 次调用;按股票获取则需对全市场 M 只股票各调用一次(M 通常在 5000 以上)
3. **更适合增量更新**:按天检查本地数据,只获取缺失日期

#### 4.2.2 日期字段统一

- 统一使用 `trade_date` 作为日期字段名
- 日期格式:`YYYYMMDD` 字符串
- 如果 API 返回其他字段名(如 `date`、`end_date`),在返回前重命名为 `trade_date`

#### 4.2.3 股票代码字段

- 统一使用 `ts_code` 作为股票代码字段名
- 格式:`{code}.{exchange}`,如 `000001.SZ`、`600000.SH`

## 5. Sync 集成规范

### 5.1 在 sync.py 中注册新数据类型

在 `DataSync` 类中添加新数据类型的同步支持:

```python
class DataSync:
    """Data synchronization manager with full/incremental sync support."""

    DEFAULT_MAX_WORKERS = 10

    # 数据类型配置
    DATASET_CONFIG = {
        "daily": {
            "api_name": "pro_bar",
            "fetch_by": "stock",  # 按股票获取
            "date_field": "trade_date",
            "key_fields": ["ts_code", "trade_date"],
        },
        "moneyflow": {
            "api_name": "moneyflow",
            "fetch_by": "stock",  # 按股票获取
            "date_field": "trade_date",
            "key_fields": ["ts_code", "trade_date"],
        },
        "limit_list": {
            "api_name": "limit_list",
            "fetch_by": "date",  # 按日期获取(优先)
            "date_field": "trade_date",
            "key_fields": ["ts_code", "trade_date"],
        },
        # 新增数据类型...
        "{new_data_type}": {
            "api_name": "{tushare_api_name}",
            "fetch_by": "date",  # "date" 或 "stock"
            "date_field": "trade_date",
            "key_fields": ["ts_code", "trade_date"],  # 用于去重的主键
        },
    }
```

### 5.2 实现同步方法

#### 5.2.1 按日期获取的同步方法(推荐)

```python
def sync_by_date(
    self,
    dataset_name: str,
    start_date: str,
    end_date: str,
) -> pd.DataFrame:
    """Sync data by date (fetch all stocks for each date).

    This is the RECOMMENDED approach for date-based data like:
    - limit_list (涨跌停)
    - top_list (龙虎榜)
    - cyq_perf (筹码分布)

    Args:
        dataset_name: Name of the dataset in DATASET_CONFIG
        start_date: Start date (YYYYMMDD)
        end_date: End date (YYYYMMDD)

    Returns:
        Combined DataFrame with all data
    """
    from src.data.trade_cal import get_trading_days

    config = self.DATASET_CONFIG[dataset_name]
    api_name = config["api_name"]
    date_field = config["date_field"]

    # Get trading days in the range
    trading_days = get_trading_days(start_date, end_date)
    if not trading_days:
        print(f"[DataSync] No trading days in range {start_date} to {end_date}")
        return pd.DataFrame()

    print(f"[DataSync] Fetching {dataset_name} for {len(trading_days)} trading days")

    # 约定:_stop_flag 为 set 表示"继续运行",cleared 表示"请求停止"
    self._stop_flag.set()

    all_data = []
    for trade_date in tqdm(trading_days, desc=f"Syncing {dataset_name}"):
        if not self._stop_flag.is_set():
            break
        try:
            data = self.client.query(
                api_name,
                trade_date=trade_date,
            )
            if not data.empty:
                all_data.append(data)
        except Exception as e:
            self._stop_flag.clear()
            print(f"[ERROR] Failed to fetch {dataset_name} for {trade_date}: {e}")
            raise

    if not all_data:
        return pd.DataFrame()

    # Combine all data
    combined = pd.concat(all_data, ignore_index=True)

    # Ensure date field is consistent
    if date_field not in combined.columns and "trade_date" in combined.columns:
        combined = combined.rename(columns={"trade_date": date_field})

    return combined
```

#### 5.2.2 按股票获取的同步方法

```python
def sync_by_stock(
    self,
    dataset_name: str,
    ts_code: str,
    start_date: str,
    end_date: str,
) -> pd.DataFrame:
    """Sync data by stock (fetch all dates for each stock).

    Use this for stock-based data like:
    - daily (日线行情)
    - moneyflow (资金流向)
    - stk_holdernumber (股东人数)

    Args:
        dataset_name: Name of the dataset in DATASET_CONFIG
        ts_code: Stock code
        start_date: Start date (YYYYMMDD)
        end_date: End date (YYYYMMDD)

    Returns:
        DataFrame with data for the stock
    """
    config = self.DATASET_CONFIG[dataset_name]
    api_name = config["api_name"]

    # _stop_flag 已被 clear 说明收到停止请求,直接返回空结果
    if not self._stop_flag.is_set():
        return pd.DataFrame()

    try:
        data = self.client.query(
            api_name,
            ts_code=ts_code,
            start_date=start_date,
            end_date=end_date,
        )
        return data
    except Exception as e:
        self._stop_flag.clear()
        print(f"[ERROR] Exception syncing {dataset_name} for {ts_code}: {e}")
        raise
```

### 5.3 增量更新逻辑

#### 5.3.1 通用增量更新检查

```python
def check_incremental_sync(
    self,
    dataset_name: str,
    force_full: bool = False,
) -> tuple[bool, Optional[str], Optional[str], Optional[str]]:
    """Check if incremental sync is needed for a dataset.

    Args:
        dataset_name: Name of the dataset
        force_full: If True, force full sync

    Returns:
        Tuple of (sync_needed, start_date, end_date, local_last_date)
    """
    config = self.DATASET_CONFIG[dataset_name]
    date_field = config["date_field"]

    # If force_full, always sync from default start
    if force_full:
        print(f"[DataSync] Force full sync for {dataset_name}")
        return (True, DEFAULT_START_DATE, get_today_date(), None)

    # Check local data
    local_data = self.storage.load(dataset_name)
    if local_data.empty or date_field not in local_data.columns:
        print(f"[DataSync] No local {dataset_name} data, full sync needed")
        return (True, DEFAULT_START_DATE, get_today_date(), None)

    # Get local last date
    local_last_date = str(local_data[date_field].max())
    print(f"[DataSync] Local {dataset_name} last date: {local_last_date}")

    # Get calendar last trading day
    today = get_today_date()
    _, cal_last = self.get_trade_calendar_bounds(DEFAULT_START_DATE, today)
    if cal_last is None:
        print(f"[DataSync] Failed to get trade calendar, proceeding with sync")
        return (True, DEFAULT_START_DATE, today, local_last_date)

    print(f"[DataSync] Calendar last trading day: {cal_last}")

    # Compare dates
    if int(local_last_date) >= int(cal_last):
        print(f"[DataSync] {dataset_name} is up-to-date, skipping sync")
        return (False, None, None, None)

    # Need incremental sync
    sync_start = get_next_date(local_last_date)
    print(f"[DataSync] Incremental sync for {dataset_name} from {sync_start} to {cal_last}")
    return (True, sync_start, cal_last, local_last_date)
```

#### 5.3.2 完整的同步入口

```python
def sync_dataset(
    self,
    dataset_name: str,
    force_full: bool = False,
    max_workers: Optional[int] = None,
) -> pd.DataFrame:
    """Sync a dataset with automatic incremental update.

    This is the main entry point for syncing any dataset.

    Args:
        dataset_name: Name of the dataset in DATASET_CONFIG
        force_full: If True, force full reload
        max_workers: Number of worker threads (for stock-based sync)

    Returns:
        DataFrame with synced data
    """
    print("\n" + "=" * 60)
    print(f"[DataSync] Starting {dataset_name} sync...")
    print("=" * 60)

    # Ensure trade calendar is up-to-date
    sync_trade_cal_cache()

    # Check if sync is needed
    sync_needed, start_date, end_date, local_last = self.check_incremental_sync(
        dataset_name, force_full
    )
    if not sync_needed:
        print(f"[DataSync] {dataset_name} is up-to-date, skipping")
        return pd.DataFrame()

    config = self.DATASET_CONFIG[dataset_name]
    fetch_by = config["fetch_by"]

    # Fetch data based on strategy
    if fetch_by == "date":
        # Fetch by date (all stocks per day)
        data = self.sync_by_date(dataset_name, start_date, end_date)
    else:
        # Fetch by stock (all dates per stock)
        data = self._sync_all_stocks(dataset_name, start_date, end_date, max_workers)

    if data.empty:
        print(f"[DataSync] No new data for {dataset_name}")
        return pd.DataFrame()

    # Save to storage (single write)
    self.storage.save(dataset_name, data, mode="append")
    print(f"[DataSync] Synced {len(data)} rows for {dataset_name}")
    return data


def _sync_all_stocks(
    self,
    dataset_name: str,
    start_date: str,
    end_date: str,
    max_workers: Optional[int] = None,
) -> pd.DataFrame:
    """Sync data for all stocks (stock-based fetch)."""
    stock_codes = self.get_all_stock_codes()
    if not stock_codes:
        return pd.DataFrame()

    print(f"[DataSync] Syncing {dataset_name} for {len(stock_codes)} stocks")

    # _stop_flag 为 set 表示"继续运行"
    self._stop_flag.set()
    results = []
    workers = max_workers or self.max_workers

    with ThreadPoolExecutor(max_workers=workers) as executor:
        future_to_code = {
            executor.submit(
                self.sync_by_stock, dataset_name, ts_code, start_date, end_date
            ): ts_code
            for ts_code in stock_codes
        }
        with tqdm(total=len(stock_codes), desc=f"Syncing {dataset_name}") as pbar:
            for future in as_completed(future_to_code):
                try:
                    data = future.result()
                    if not data.empty:
                        results.append(data)
                except Exception:
                    executor.shutdown(wait=False, cancel_futures=True)
                    raise
                pbar.update(1)

    if not results:
        return pd.DataFrame()
    return pd.concat(results, ignore_index=True)
```

## 6. 存储规范

### 6.1 Storage 类使用

所有数据通过 `Storage` 类进行 HDF5 存储:

```python
from src.data.storage import Storage

storage = Storage()

# 保存数据(自动增量合并)
storage.save("dataset_name", data, mode="append")

# 加载数据
all_data = storage.load("dataset_name")
filtered_data = storage.load("dataset_name", start_date="20240101", end_date="20240131")

# 获取最新日期
last_date = storage.get_last_date("dataset_name")

# 检查是否存在
exists = storage.exists("dataset_name")
```

### 6.2 增量写入策略

**关键原则**:所有数据在请求完成后**一次性写入**,而非逐条写入:

```python
# ❌ 错误:逐条写入(性能差)
for date in dates:
    data = fetch(date)
    storage.save("dataset", data, mode="append")  # 多次写入

# ✅ 正确:批量写入(性能好)
all_data = []
for date in dates:
    data = fetch(date)
    all_data.append(data)
combined = pd.concat(all_data, ignore_index=True)
storage.save("dataset", combined, mode="append")  # 单次写入
```

### 6.3 去重策略

`Storage.save()` 方法会自动去重,基于配置中的 `key_fields`:

```python
# storage.py 中的实现
combined = pd.concat([existing, data], ignore_index=True)
combined = combined.drop_duplicates(
    subset=["ts_code", "trade_date"],  # 使用 key_fields
    keep="last",  # 保留最新数据
)
```
## 7. 完整示例:新增涨跌停数据接口

### 7.1 创建 limit_list.py

```python
"""Limit up/down list interface.

Fetch stocks that hit limit up or limit down for a specific trade date.
This is a date-based interface (recommended approach).
"""

import pandas as pd
from typing import Optional

from src.data.client import TushareClient


def get_limit_list(
    trade_date: Optional[str] = None,
    ts_code: Optional[str] = None,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
) -> pd.DataFrame:
    """获取涨跌停数据。

    **优先按日期获取**(推荐):
    - 使用 trade_date 获取单日全市场涨跌停数据
    - 或使用 start_date + end_date 获取区间数据

    Args:
        trade_date: 交易日期(YYYYMMDD 格式),获取单日全市场数据
        ts_code: 股票代码(可选,用于过滤)
        start_date: 开始日期(YYYYMMDD 格式)
        end_date: 结束日期(YYYYMMDD 格式)

    Returns:
        pd.DataFrame,包含以下字段:
        - ts_code: 股票代码
        - trade_date: 交易日期
        - name: 股票名称
        - close: 收盘价
        - pct_chg: 涨跌幅
        - amp: 振幅
        - fc_ratio: 封单金额/日成交额
        - fl_ratio: 封单手数/流通股本
        - fd_amount: 封单金额
        - first_time: 首次涨停时间
        - last_time: 最后封板时间
        - open_times: 打开次数
        - strth: 涨停强度
        - limit: 涨停类型(U 涨停,D 跌停)

    Example:
        >>> # 获取单日全市场涨跌停数据(推荐)
        >>> data = get_limit_list(trade_date='20240115')
        >>> # 获取区间数据
        >>> data = get_limit_list(start_date='20240101', end_date='20240131')
    """
    client = TushareClient()

    params = {}
    if trade_date:
        params["trade_date"] = trade_date
    if ts_code:
        params["ts_code"] = ts_code
    if start_date:
        params["start_date"] = start_date
    if end_date:
        params["end_date"] = end_date

    data = client.query("limit_list", **params)
    return data
```

### 7.2 在 sync.py 中注册

```python
class DataSync:
    """Data synchronization manager with full/incremental sync support."""

    DATASET_CONFIG = {
        # ... 其他配置 ...
        "limit_list": {
            "api_name": "limit_list",
            "fetch_by": "date",  # 按日期获取
            "date_field": "trade_date",
            "key_fields": ["ts_code", "trade_date"],
        },
    }

    # ... 其他方法 ...

    def sync_limit_list(
        self,
        force_full: bool = False,
    ) -> pd.DataFrame:
        """Sync limit list data."""
        return self.sync_dataset("limit_list", force_full)


# 便捷函数
def sync_limit_list(force_full: bool = False) -> pd.DataFrame:
    """Sync limit up/down data."""
    sync_manager = DataSync()
    return sync_manager.sync_limit_list(force_full)
```

### 7.3 更新 __init__.py

```python
from src.data.limit_list import get_limit_list

__all__ = [
    # ... 其他导出 ...
    "get_limit_list",
]
```

## 8. 测试规范

### 8.1 测试文件结构

```
tests/
├── test_sync.py            # sync 模块测试
├── test_daily.py           # daily 模块测试
└── test_{new_module}.py    # 新增模块测试
```

### 8.2 测试模板

```python
"""Tests for {module_name} module."""

import pytest
from unittest.mock import patch, MagicMock

import pandas as pd

from src.data.{module_name} import get_{data_type}


class Test{DataType}:
    """Test cases for {data_type} data fetching."""

    @patch("src.data.{module_name}.TushareClient")
    def test_get_{data_type}_by_date(self, mock_client_class):
        """Test fetching data by date."""
        # Setup mock
        mock_client = MagicMock()
        mock_client_class.return_value = mock_client
        mock_client.query.return_value = pd.DataFrame({
            "ts_code": ["000001.SZ"],
            "trade_date": ["20240115"],
            # ... 其他字段 ...
        })

        # Call function
        result = get_{data_type}(trade_date="20240115")

        # Verify
        assert not result.empty
        mock_client.query.assert_called_once_with(
            "{api_name}",
            trade_date="20240115",
        )

    @patch("src.data.{module_name}.TushareClient")
    def test_get_{data_type}_by_stock(self, mock_client_class):
        """Test fetching data by stock code."""
        # Setup mock
        mock_client = MagicMock()
        mock_client_class.return_value = mock_client
        mock_client.query.return_value = pd.DataFrame({
            "ts_code": ["000001.SZ"],
            "trade_date": ["20240115"],
            # ... 其他字段 ...
        })

        # Call function
        result = get_{data_type}(
            ts_code="000001.SZ",
            start_date="20240101",
            end_date="20240131",
        )

        # Verify
        assert not result.empty
        mock_client.query.assert_called_once()
```
## 9. 检查清单

在提交新接口前,请确认以下事项:

### 9.1 文件结构

- [ ] 文件位于 `src/data/{data_type}.py`
- [ ] 已更新 `src/data/__init__.py` 导出公共接口
- [ ] 已创建 `tests/test_{data_type}.py` 测试文件

### 9.2 接口实现

- [ ] 数据获取函数使用 `TushareClient`
- [ ] 函数包含完整的 Google 风格文档字符串
- [ ] 日期参数使用 `YYYYMMDD` 格式
- [ ] 返回的 DataFrame 包含 `ts_code` 和 `trade_date` 字段
- [ ] 优先实现按日期获取的接口(如果 API 支持)

### 9.3 Sync 集成

- [ ] 已在 `DataSync.DATASET_CONFIG` 中注册
- [ ] 正确设置 `fetch_by`("date" 或 "stock")
- [ ] 正确设置 `date_field` 和 `key_fields`
- [ ] 已实现对应的 sync 方法或复用通用方法
- [ ] 增量更新逻辑正确(检查本地最新日期)

### 9.4 存储优化

- [ ] 所有数据一次性写入(非逐条)
- [ ] 使用 `storage.save(mode="append")` 进行增量保存
- [ ] 去重字段配置正确

### 9.5 测试

- [ ] 已编写单元测试
- [ ] 已 mock TushareClient
- [ ] 测试覆盖正常和异常情况

## 10. 常见问题

### Q1: API 返回的日期字段名不是 trade_date 怎么办?

在返回前重命名:

```python
data = client.query("api_name", **params)
if "end_date" in data.columns:
    data = data.rename(columns={"end_date": "trade_date"})
return data
```

### Q2: 如何处理分页(limit/offset)?

Tushare Pro API 通常不需要手动分页,但如果需要:

```python
all_data = []
offset = 0
limit = 5000

while True:
    data = client.query(
        "api_name",
        trade_date=trade_date,
        limit=limit,
        offset=offset,
    )
    all_data.append(data)
    # 返回为空或不足一页,说明已取完
    if data.empty or len(data) < limit:
        break
    offset += limit

return pd.concat(all_data, ignore_index=True)
```

### Q3: 如何处理需要额外参数的接口?

在函数签名中添加参数,并传递给 client.query:

```python
def get_data(
    ts_code: str,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
    fields: Optional[list] = None,  # 额外参数
) -> pd.DataFrame:
    client = TushareClient()

    params = {"ts_code": ts_code}
    if start_date:
        params["start_date"] = start_date
    if end_date:
        params["end_date"] = end_date
    if fields:
        params["fields"] = ",".join(fields)

    return client.query("api_name", **params)
```

### Q4: 如何处理没有 trade_date 字段的数据?

如果数据确实不包含日期字段(如静态数据),可以:

1. 将其归类为"特殊接口",独立管理
2. 或者添加一个 `sync_date` 字段记录同步时间

### Q5: 如何处理按日期获取但 API 不支持的情况?

如果 API 只支持按股票获取:

1. 在 `DATASET_CONFIG` 中设置 `fetch_by: "stock"`
2. 使用 `_sync_all_stocks` 方法进行同步
3. 在文档中说明这是按股票获取的接口

---

**最后更新**: 2026-02-01
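## 附录:`get_next_date` 参考实现(示意)

§5.3 的 `check_incremental_sync` 调用了 `get_next_date` 辅助函数,本文未给出其实现。下面是一个仅依赖标准库的参考草图,**假设**其语义为"返回给定 `YYYYMMDD` 日期的下一个自然日";若项目中已有实现,以实际代码为准:

```python
from datetime import datetime, timedelta


def get_next_date(date_str: str) -> str:
    """返回 date_str(YYYYMMDD)之后一个自然日的日期字符串。

    增量同步从本地最新日期的次日开始;区间内的非交易日
    会被 get_trading_days 过滤,因此这里只需按自然日推进。
    """
    d = datetime.strptime(date_str, "%Y%m%d")
    return (d + timedelta(days=1)).strftime("%Y%m%d")


print(get_next_date("20240131"))  # → 20240201
```

按自然日(而非交易日)推进可以避免在此处再查一次交易日历;跨月、跨年和闰年均由 `datetime` 正确处理。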