New environment
BIN
main/factor/__pycache__/__init__.cpython-313.pyc
Normal file
Binary file not shown.
BIN
main/factor/__pycache__/factor.cpython-313.pyc
Normal file
Binary file not shown.
@@ -1,63 +1,63 @@
No. | Factor Name / Column Name | Factor Category | Description
1 | pe_ttm | Value | Price-to-earnings ratio (TTM)
2 | return_5, return_20 | Momentum | Return over the past 5 / 20 days
3 | act_factor1 to act_factor4 | Momentum / Technical | Momentum/trend factors computed from EMA slopes over different periods
4 | std_return_5, std_return_90, std_return_90_2 | Volatility | Rolling standard deviation of returns over different windows or lagged windows
5 | upside_vol, downside_vol | Volatility | N-day rolling upside / downside volatility
6 | vol_ratio | Volatility | Upside volatility / downside volatility
7 | std_return_5 / std_return_90 | Volatility | Ratio of short-term to long-term volatility
8 | std_return_90 - std_return_90_2 | Volatility | Long-term volatility minus its value 10 days earlier (change in volatility)
9 | volatility (from the index calculation) | Volatility / Market | 20-day rolling standard deviation of returns of the index (or individual stock)
10 | log(circ_mv) (or log_circ_mv) | Size | Log of circulating market cap
11 | cs_rank_size | Size | Cross-sectional rank of log circulating market cap
12 | vol | Liquidity | Trading volume (usually needs to be combined with other indicators or processed)
13 | turnover_rate | Liquidity | Turnover rate
14 | volume_ratio | Liquidity | Volume ratio
15 | turnover_deviation | Liquidity | Deviation of the turnover rate from its 3-day rolling mean, in standard-deviation multiples
16 | cat_turnover_spike | Liquidity / Categorical | Whether the turnover rate is significantly above its recent mean
17 | volume_change_rate | Liquidity | Short-term rolling mean volume / long-term rolling mean volume - 1
18 | cat_volume_breakout | Liquidity / Categorical | Whether the day's volume exceeds the maximum volume of the past 5 days
19 | avg_volume_ratio | Liquidity | 3-day rolling mean of the volume ratio
20 | cat_volume_ratio_breakout | Liquidity / Categorical | Whether the day's volume ratio exceeds the maximum volume ratio of the past 5 days
21 | vol_spike (Rolling Mean Vol) | Liquidity | 20-day rolling mean of volume
22 | vol_std_5 | Liquidity / Volatility | 5-day rolling standard deviation of the daily volume change rate
23 | volume_growth | Liquidity | 20-day volume change rate
24 | turnover_std | Liquidity / Volatility | 20-day rolling standard deviation of the turnover rate
25 | flow_lg_elg_intensity | Money Flow / Liquidity | (Large + extra-large order) net buy volume / total volume
26 | flow_divergence_diff, flow_divergence_ratio | Money Flow / Sentiment | Difference or ratio between retail and main-force money flow
27 | lg_elg_buy_prop | Money Flow / Liquidity | (Large + extra-large order) buy volume / total buy volume
28 | flow_struct_buy_change | Money Flow / Liquidity | Daily change in the main-force share of buying
29 | flow_lg_elg_accel | Money Flow / Momentum | Acceleration of main-force money flow
30 | active_buy_volume_large/big/small | Money Flow / Liquidity | Active buy volume by order size / net inflow
31 | buy_lg/elg_vol_minus_sell_lg/elg_vol | Money Flow / Liquidity | Net buy volume by order size / total net inflow
32 | cs_rank_net_lg_flow_val, cs_rank_elg_buy_ratio, cs_rank_lg_sm_flow_diverge, cs_rank_elg_buy_sell_sm_ratio | Money Flow / Composite (cross-sectional rank) | Cross-sectional ranks of various money-flow indicators
33 | cs_rank_ind_adj_lg_flow | Money Flow / Composite (industry-adjusted + cross-sectional rank) | Cross-sectional rank of industry-adjusted large-order net inflow
34 | chip_concentration_range, chip_skewness, cost_support_15pct_change, weight_roc5, cost_stability, ctrl_strength, low_cost_dev, asymmetry, cost_conc_std_N, profit_pressure, underwater_resistance, cs_rank_rel_profit_margin, cs_rank_cost_breadth, cs_rank_dist_to_upper_cost | Positioning / Technical | Indicators computed from the holding-cost distribution (cost_*, weight_avg) and their cross-sectional ranks
35 | winner_rate, cs_rank_winner_rate | Positioning / Technical | Share of holdings in profit and its cross-sectional rank
36 | floating_chip_proxy, price_cost_divergence, high_cost_break_days, liquidity_risk, lock_factor, cost_atr_adj, smallcap_concentration, cat_golden_resonance | Positioning / Composite | Composite indicators combining holding cost with other information (price, volume, volatility, market cap)
37 | cat_winner_price_zone | Positioning / Categorical | Zone category defined by cost and share of holdings in profit
38 | flow_chip_consistency, profit_taking_vs_absorb, vol_amp_loss, vol_drop_profit_cnt, cost_break_confirm_cnt, vol_wgt_hist_pos, cs_rank_vol_x_profit_margin, cs_rank_cost_dist_vol_ratio | Positioning / Composite | Complex interaction factors that further combine positioning, money flow, and price-volume data
39 | return_skew, return_kurtosis | Technical / Stats | Skewness and kurtosis of rolling returns
40 | rsi_3 | Technical / Momentum | 3-day Relative Strength Index
41 | obv, maobv_6, obv-maobv_6 | Technical / Volume | On-balance volume, its moving average, and their difference
42 | atr_14, atr_6 | Technical / Volatility | Average true range
43 | log_close | Technical / Price | Log of the closing price
44 | up, down | Technical / Price Action | Normalized upper / lower shadow length
45 | alpha_22_improved, alpha_003, alpha_007, alpha_013 | Technical / Alpha | Implementations of WorldQuant alpha factors
46 | atr_norm_channel_pos | Technical / Price Action | ATR-normalized price channel position
47 | turnover_diff_skew | Technical / Liquidity | Skewness of the turnover rate change
48 | pullback_strong_N_M | Technical / Momentum | Pullback magnitude of recently strong stocks
49 | vol_adj_roc | Technical / Composite (momentum + volatility) | Volatility-adjusted N-day rate of change
50 | ar, br, arbr | Sentiment / Technical | ARBR sentiment indicators
51 | up_ratio_20d (from the index calculation) | Sentiment / Market | Share of up days over the past 20 days for the index (or individual stock)
52 | cat_up_limit, cat_down_limit, up_limit_count_10d, down_limit_count_10d, consecutive_up_limit | Event / Market State | Limit-up / limit-down states and counts
53 | momentum_factor, resonance_factor | Composite - P/V | Simple composites based on volume, price, turnover rate, etc.
54 | cat_af2, cat_af3, cat_af4 | Composite / Cat. | Comparisons between act_factor values
55 | act_factor5, act_factor6 | Composite - Technical | Combinations of act_factor 1-4
56 | mv_volatility, mv_growth, mv_turnover_ratio, mv_adjusted_volume, mv_weighted_turnover, nonlinear_mv_volume, mv_volume_ratio, mv_momentum | Composite (market cap + liquidity / price-volume) | Price-volume, liquidity, or momentum indicators that account for market cap
57 | cap_neutral_cost_metric (placeholder) | Composite / Alpha (placeholder) | Market-cap- and industry-neutralized cost metric (to be implemented)
58 | hurst_exponent_flow (placeholder) | Money Flow / Stats (placeholder) | Hurst exponent of money flow (to be implemented)
59 | intraday_lg_flow_corr_N (placeholder) | Composite (price action + money flow) (placeholder) | Correlation between the intraday trend and large-order flow (to be implemented)
60 | industry_* (from industry_df) | Industry | Indicators of the corresponding industry (e.g. industry return, industry momentum)
61 | *_deviation (from create_deviation_within_dates) | Composite (relative to industry) | Deviation of a stock's factor from its industry mean
No. | Factor Name / Column Name | Factor Category | Description
1 | pe_ttm | Value | Price-to-earnings ratio (TTM)
2 | return_5, return_20 | Momentum | Return over the past 5 / 20 days
3 | act_factor1 to act_factor4 | Momentum / Technical | Momentum/trend factors computed from EMA slopes over different periods
4 | std_return_5, std_return_90, std_return_90_2 | Volatility | Rolling standard deviation of returns over different windows or lagged windows
5 | upside_vol, downside_vol | Volatility | N-day rolling upside / downside volatility
6 | vol_ratio | Volatility | Upside volatility / downside volatility
7 | std_return_5 / std_return_90 | Volatility | Ratio of short-term to long-term volatility
8 | std_return_90 - std_return_90_2 | Volatility | Long-term volatility minus its value 10 days earlier (change in volatility)
9 | volatility (from the index calculation) | Volatility / Market | 20-day rolling standard deviation of returns of the index (or individual stock)
10 | log(circ_mv) (or log_circ_mv) | Size | Log of circulating market cap
11 | cs_rank_size | Size | Cross-sectional rank of log circulating market cap
12 | vol | Liquidity | Trading volume (usually needs to be combined with other indicators or processed)
13 | turnover_rate | Liquidity | Turnover rate
14 | volume_ratio | Liquidity | Volume ratio
15 | turnover_deviation | Liquidity | Deviation of the turnover rate from its 3-day rolling mean, in standard-deviation multiples
16 | cat_turnover_spike | Liquidity / Categorical | Whether the turnover rate is significantly above its recent mean
17 | volume_change_rate | Liquidity | Short-term rolling mean volume / long-term rolling mean volume - 1
18 | cat_volume_breakout | Liquidity / Categorical | Whether the day's volume exceeds the maximum volume of the past 5 days
19 | avg_volume_ratio | Liquidity | 3-day rolling mean of the volume ratio
20 | cat_volume_ratio_breakout | Liquidity / Categorical | Whether the day's volume ratio exceeds the maximum volume ratio of the past 5 days
21 | vol_spike (Rolling Mean Vol) | Liquidity | 20-day rolling mean of volume
22 | vol_std_5 | Liquidity / Volatility | 5-day rolling standard deviation of the daily volume change rate
23 | volume_growth | Liquidity | 20-day volume change rate
24 | turnover_std | Liquidity / Volatility | 20-day rolling standard deviation of the turnover rate
25 | flow_lg_elg_intensity | Money Flow / Liquidity | (Large + extra-large order) net buy volume / total volume
26 | flow_divergence_diff, flow_divergence_ratio | Money Flow / Sentiment | Difference or ratio between retail and main-force money flow
27 | lg_elg_buy_prop | Money Flow / Liquidity | (Large + extra-large order) buy volume / total buy volume
28 | flow_struct_buy_change | Money Flow / Liquidity | Daily change in the main-force share of buying
29 | flow_lg_elg_accel | Money Flow / Momentum | Acceleration of main-force money flow
30 | active_buy_volume_large/big/small | Money Flow / Liquidity | Active buy volume by order size / net inflow
31 | buy_lg/elg_vol_minus_sell_lg/elg_vol | Money Flow / Liquidity | Net buy volume by order size / total net inflow
32 | cs_rank_net_lg_flow_val, cs_rank_elg_buy_ratio, cs_rank_lg_sm_flow_diverge, cs_rank_elg_buy_sell_sm_ratio | Money Flow / Composite (cross-sectional rank) | Cross-sectional ranks of various money-flow indicators
33 | cs_rank_ind_adj_lg_flow | Money Flow / Composite (industry-adjusted + cross-sectional rank) | Cross-sectional rank of industry-adjusted large-order net inflow
34 | chip_concentration_range, chip_skewness, cost_support_15pct_change, weight_roc5, cost_stability, ctrl_strength, low_cost_dev, asymmetry, cost_conc_std_N, profit_pressure, underwater_resistance, cs_rank_rel_profit_margin, cs_rank_cost_breadth, cs_rank_dist_to_upper_cost | Positioning / Technical | Indicators computed from the holding-cost distribution (cost_*, weight_avg) and their cross-sectional ranks
35 | winner_rate, cs_rank_winner_rate | Positioning / Technical | Share of holdings in profit and its cross-sectional rank
36 | floating_chip_proxy, price_cost_divergence, high_cost_break_days, liquidity_risk, lock_factor, cost_atr_adj, smallcap_concentration, cat_golden_resonance | Positioning / Composite | Composite indicators combining holding cost with other information (price, volume, volatility, market cap)
37 | cat_winner_price_zone | Positioning / Categorical | Zone category defined by cost and share of holdings in profit
38 | flow_chip_consistency, profit_taking_vs_absorb, vol_amp_loss, vol_drop_profit_cnt, cost_break_confirm_cnt, vol_wgt_hist_pos, cs_rank_vol_x_profit_margin, cs_rank_cost_dist_vol_ratio | Positioning / Composite | Complex interaction factors that further combine positioning, money flow, and price-volume data
39 | return_skew, return_kurtosis | Technical / Stats | Skewness and kurtosis of rolling returns
40 | rsi_3 | Technical / Momentum | 3-day Relative Strength Index
41 | obv, maobv_6, obv-maobv_6 | Technical / Volume | On-balance volume, its moving average, and their difference
42 | atr_14, atr_6 | Technical / Volatility | Average true range
43 | log_close | Technical / Price | Log of the closing price
44 | up, down | Technical / Price Action | Normalized upper / lower shadow length
45 | alpha_22_improved, alpha_003, alpha_007, alpha_013 | Technical / Alpha | Implementations of WorldQuant alpha factors
46 | atr_norm_channel_pos | Technical / Price Action | ATR-normalized price channel position
47 | turnover_diff_skew | Technical / Liquidity | Skewness of the turnover rate change
48 | pullback_strong_N_M | Technical / Momentum | Pullback magnitude of recently strong stocks
49 | vol_adj_roc | Technical / Composite (momentum + volatility) | Volatility-adjusted N-day rate of change
50 | ar, br, arbr | Sentiment / Technical | ARBR sentiment indicators
51 | up_ratio_20d (from the index calculation) | Sentiment / Market | Share of up days over the past 20 days for the index (or individual stock)
52 | cat_up_limit, cat_down_limit, up_limit_count_10d, down_limit_count_10d, consecutive_up_limit | Event / Market State | Limit-up / limit-down states and counts
53 | momentum_factor, resonance_factor | Composite - P/V | Simple composites based on volume, price, turnover rate, etc.
54 | cat_af2, cat_af3, cat_af4 | Composite / Cat. | Comparisons between act_factor values
55 | act_factor5, act_factor6 | Composite - Technical | Combinations of act_factor 1-4
56 | mv_volatility, mv_growth, mv_turnover_ratio, mv_adjusted_volume, mv_weighted_turnover, nonlinear_mv_volume, mv_volume_ratio, mv_momentum | Composite (market cap + liquidity / price-volume) | Price-volume, liquidity, or momentum indicators that account for market cap
57 | cap_neutral_cost_metric (placeholder) | Composite / Alpha (placeholder) | Market-cap- and industry-neutralized cost metric (to be implemented)
58 | hurst_exponent_flow (placeholder) | Money Flow / Stats (placeholder) | Hurst exponent of money flow (to be implemented)
59 | intraday_lg_flow_corr_N (placeholder) | Composite (price action + money flow) (placeholder) | Correlation between the intraday trend and large-order flow (to be implemented)
60 | industry_* (from industry_df) | Industry | Indicators of the corresponding industry (e.g. industry return, industry momentum)
61 | *_deviation (from create_deviation_within_dates) | Composite (relative to industry) | Deviation of a stock's factor from its industry mean
62 | complex_factor_gplearn_1 | Composite (GP-generated) | Factor expression 1 found by DEAP/GP
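The factor list names columns but this diff does not show how they are computed. As a rough sketch of how rows 2, 4, 7, 10 and 11 might be derived with pandas, assuming a long-format daily frame with columns `ts_code`, `trade_date`, `close` and `circ_mv`: the helper name `add_basic_factors`, the column name `std_ratio`, and the synthetic data are illustrative assumptions, not the repository's implementation.

```python
import numpy as np
import pandas as pd


def add_basic_factors(df: pd.DataFrame) -> pd.DataFrame:
    """Add a few of the tabulated factors to a long-format daily frame.

    Assumed (hypothetical) input columns: ts_code, trade_date, close, circ_mv.
    """
    out = df.sort_values(["ts_code", "trade_date"]).reset_index(drop=True)
    by_stock = out.groupby("ts_code")["close"]

    # Row 2: trailing returns, computed per stock so windows never cross stocks.
    out["return_5"] = out["close"] / by_stock.shift(5) - 1
    out["return_20"] = out["close"] / by_stock.shift(20) - 1

    # Row 4: rolling standard deviation of daily returns.
    out["_ret_1"] = out["close"] / by_stock.shift(1) - 1
    by_stock_ret = out.groupby("ts_code")["_ret_1"]
    out["std_return_5"] = by_stock_ret.transform(lambda s: s.rolling(5).std())
    out["std_return_90"] = by_stock_ret.transform(lambda s: s.rolling(90).std())

    # Row 7: short-term / long-term volatility ratio (column name is made up).
    out["std_ratio"] = out["std_return_5"] / out["std_return_90"]

    # Rows 10-11: log circulating market cap and its per-day percentile rank.
    out["log_circ_mv"] = np.log(out["circ_mv"])
    out["cs_rank_size"] = out.groupby("trade_date")["log_circ_mv"].rank(pct=True)
    return out.drop(columns="_ret_1")


# Synthetic demo data: two stocks, 120 business days of random-walk prices.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2024-01-01", periods=120)
frames = []
for code, cap in [("000001.SZ", 1e6), ("600000.SH", 5e6)]:
    close = 10 * np.cumprod(1 + rng.normal(0, 0.01, len(dates)))
    frames.append(
        pd.DataFrame({"ts_code": code, "trade_date": dates, "close": close, "circ_mv": cap})
    )
demo = add_basic_factors(pd.concat(frames, ignore_index=True))
print(demo.tail(3))
```

Grouping by `ts_code` before shifting or rolling keeps one stock's history from leaking into another's window, while grouping by `trade_date` for the rank makes `cs_rank_size` a per-day cross-sectional quantity.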
File diff suppressed because it is too large
@@ -1,193 +1,193 @@
import pandas as pd
import numpy as np
from scipy.stats import spearmanr  # Left over from the original idea for factor 3; the simplified version does not use it

epsilon = 1e-10


def _safe_divide(numerator, denominator, default_val=0.0):
    """Safe division: non-finite results (inf, -inf, NaN) are replaced with default_val."""
    with np.errstate(divide='ignore', invalid='ignore'):
        result = numerator / denominator
    result[~np.isfinite(result)] = default_val
    return result


# --- Revised factor calculation functions ---

def calculate_size_style_strength_factor(df: pd.DataFrame, N: int = 5, factor_name_suffix: str = '') -> pd.DataFrame:
    """
    Compute the large-cap vs. small-cap style relative strength factor.
    Returns: a DataFrame indexed by trade_date with the factor values as its column.
    """
    factor_name = f'size_style_strength_{N}{factor_name_suffix}'
    print(f"Calculating {factor_name}...")

    required_indices = ['399300.SZ', '000905.SH', '000852.SH']
    if not all(idx in df['ts_code'].unique() for idx in required_indices):
        print(f"Error: DataFrame is missing some required index codes ({required_indices}). Returning an empty factor DataFrame.")
        return pd.DataFrame(index=df['trade_date'].unique(), columns=[factor_name]).rename_axis('trade_date')

    # 1. Compute each index's N-day return
    df_copy = df.copy()  # Work on a copy so the caller's df is not modified
    df_copy['_ret_N'] = df_copy.groupby('ts_code')['close'].pct_change(periods=N)

    # 2. Pivot to make cross-sectional calculations easier
    pivot_ret_N = df_copy.pivot_table(index='trade_date', columns='ts_code', values='_ret_N')

    # Fetch each column, falling back to an all-NaN Series if it is missing
    large_ret = pivot_ret_N.get('399300.SZ', pd.Series(np.nan, index=pivot_ret_N.index))
    mid_ret = pivot_ret_N.get('000905.SH', pd.Series(np.nan, index=pivot_ret_N.index))
    small_ret = pivot_ret_N.get('000852.SH', pd.Series(np.nan, index=pivot_ret_N.index))

    # 3. Compute the factor (one scalar value per day)
    large_small_diff = large_ret - small_ret
    avg_large_small_ret = (large_ret + small_ret) / 2
    # Mid-cap deviation factor. NaNs are filled with 0, so when the mid-cap return is NaN
    # the deviation factor has no adjusting effect (it multiplies by 1).
    mid_deviation_raw = mid_ret - avg_large_small_ret
    mid_deviation_factor = 1 + np.sign(mid_ret.fillna(0)) * np.abs(mid_deviation_raw.fillna(0))

    daily_factor_values = large_small_diff * mid_deviation_factor
    daily_factor_values.name = factor_name  # Name the Series

    print(f"Finished {factor_name}.")
    return daily_factor_values.to_frame()  # Return as a DataFrame


def calculate_volatility_structure_factor(df: pd.DataFrame, N: int = 10, factor_name_suffix: str = '') -> pd.DataFrame:
    """
    Compute the market volatility structure factor.
    Returns: a DataFrame indexed by trade_date with the factor values as its column.
    """
    factor_name = f'vol_structure_idx_{N}{factor_name_suffix}'
    print(f"Calculating {factor_name}...")

    required_indices = ['399300.SZ', '000905.SH', '000852.SH']
    if not all(idx in df['ts_code'].unique() for idx in required_indices):
        print(f"Error: DataFrame is missing some required index codes ({required_indices}). Returning an empty factor DataFrame.")
        return pd.DataFrame(index=df['trade_date'].unique(), columns=[factor_name]).rename_axis('trade_date')

    if 'pct_chg' not in df.columns:
        print(f"Error: DataFrame is missing the 'pct_chg' column. {factor_name} will be filled with NaN.")
        return pd.DataFrame(index=df['trade_date'].unique(), columns=[factor_name]).rename_axis('trade_date')

    df_copy = df.copy()
    # 1. Compute each index's N-day volatility
    df_copy['_vol_N'] = df_copy.groupby('ts_code')['pct_chg'].rolling(N, min_periods=max(1, N//2)).std().reset_index(level=0, drop=True)

    # 2. Pivot
    pivot_vol_N = df_copy.pivot_table(index='trade_date', columns='ts_code', values='_vol_N')

    large_vol = pivot_vol_N.get('399300.SZ', pd.Series(np.nan, index=pivot_vol_N.index))
    mid_vol = pivot_vol_N.get('000905.SH', pd.Series(np.nan, index=pivot_vol_N.index))
    small_vol = pivot_vol_N.get('000852.SH', pd.Series(np.nan, index=pivot_vol_N.index))

    # 3. Compute the factor
    daily_factor_values = _safe_divide((small_vol - mid_vol), large_vol)
    daily_factor_values.name = factor_name

    print(f"Finished {factor_name}.")
    return daily_factor_values.to_frame()


def calculate_market_divergence_factor(df: pd.DataFrame, factor_name_suffix: str = '') -> pd.DataFrame:
    """
    Compute the market divergence factor (based on how consistent the signs of the three indices' daily returns are).
    Returns: a DataFrame indexed by trade_date with the factor values as its column.
    """
    factor_name = f'market_divergence_score{factor_name_suffix}'
    print(f"Calculating {factor_name}...")

    required_indices = ['399300.SZ', '000905.SH', '000852.SH']
    if not all(idx in df['ts_code'].unique() for idx in required_indices):
        print(f"Error: DataFrame is missing some required index codes ({required_indices}). Returning an empty factor DataFrame.")
        return pd.DataFrame(index=df['trade_date'].unique(), columns=[factor_name]).rename_axis('trade_date')

    if 'pct_chg' not in df.columns:
        print(f"Error: DataFrame is missing the 'pct_chg' column. {factor_name} will be filled with NaN.")
        return pd.DataFrame(index=df['trade_date'].unique(), columns=[factor_name]).rename_axis('trade_date')

    pivot_pct_chg = df.pivot_table(index='trade_date', columns='ts_code', values='pct_chg')

    # Column names of the three indices
    idx_large_col = '399300.SZ'
    idx_mid_col = '000905.SH'
    idx_small_col = '000852.SH'

    # reindex guarantees that all expected columns exist; missing ones are filled with NaN
    pivot_pct_chg = pivot_pct_chg.reindex(columns=[idx_large_col, idx_mid_col, idx_small_col])

    def daily_divergence_score_calc(row):
        # row holds only that day's returns of the three indices
        valid_returns = row.dropna()  # Keep the non-NaN returns
        if len(valid_returns) < 2:  # With fewer than 2 valid values, divergence cannot be judged
            return np.nan

        signs = np.sign(valid_returns)
        unique_sign_count = len(signs.unique())

        if unique_sign_count == 1:  # All signs equal (or all returns are 0, whose sign is also 0)
            return 0.0  # Lowest divergence (highly consistent)
        elif unique_sign_count == 2 and 0 in signs.unique():  # One direction plus a zero
            return 0.25  # Fairly low divergence
        elif unique_sign_count == 2:  # Two directions (e.g. two up and one down, or two down and one up)
            return 0.75  # Fairly high divergence
        elif unique_sign_count == 3:  # Three different signs (+, -, 0)
            return 1.0  # Highest divergence
        return np.nan  # Any uncovered case (should not happen in theory)

    daily_factor_values = pivot_pct_chg[[idx_large_col, idx_mid_col, idx_small_col]].apply(daily_divergence_score_calc, axis=1)
    daily_factor_values.name = factor_name

    print(f"Finished {factor_name}.")
    return daily_factor_values.to_frame()


# --- Combine all factor calculations in one main function ---
def generate_daily_index_relation_factors(df_input: pd.DataFrame) -> pd.DataFrame:
    """
    Compute all daily cross-sectional factors based on the relationship between the large-, mid- and small-cap indices.

    Args:
        df_input (pd.DataFrame): Long-format index quote data containing 'ts_code', 'trade_date', 'close', 'pct_chg'.

    Returns:
        pd.DataFrame: A DataFrame indexed by 'trade_date' with one column per factor.
    """
    # sort_values returns a new frame, so the input df is not modified
    df = df_input.sort_values(['ts_code', 'trade_date']).reset_index(drop=True)

    # Compute each factor (every function returns a single- or multi-column DataFrame indexed by trade_date)
    factor1_df = calculate_size_style_strength_factor(df, N=5)
    factor2_df = calculate_volatility_structure_factor(df, N=10)
    factor3_df = calculate_market_divergence_factor(df)

    # Further daily cross-sectional factors can be added here...

    # Merge all factor DataFrames
    # functools.reduce with pd.merge folds a list of DataFrames into one
    from functools import reduce
    daily_factors_list = [factor1_df, factor2_df, factor3_df]
    # Drop DataFrames that are empty or all-NaN, which may result from errors
    daily_factors_list = [f_df for f_df in daily_factors_list if not f_df.empty and not f_df.iloc[:,0].isna().all()]

    if not daily_factors_list:
        print("Warning: no factor could be computed. Returning an empty DataFrame.")
        # Return an empty DataFrame indexed by the dates of the input df
        return pd.DataFrame(index=df['trade_date'].unique()).rename_axis('trade_date')

    # Outer join keeps all dates; merging on the index level name keeps trade_date as the index
    final_factors_df = reduce(lambda left, right: pd.merge(left, right, on='trade_date', how='outer'),
                              daily_factors_list)

    final_factors_df = final_factors_df.sort_index()  # Sort by date

    return final_factors_df


# --- Usage example ---
# Assume all_indices_df is your long-format quote data covering the three indices '399300.SZ', '000905.SH', '000852.SH'
# Make sure it has the columns 'ts_code', 'trade_date', 'open', 'high', 'low', 'close', 'vol', 'pct_chg'
# all_indices_df['trade_date'] = pd.to_datetime(all_indices_df['trade_date'])
# all_indices_df = all_indices_df.sort_values(['ts_code', 'trade_date'])

# daily_market_factors = generate_daily_index_relation_factors(all_indices_df)
# print("\nDaily market style/sentiment factors:")
# print(daily_market_factors.tail())

# Afterwards, daily_market_factors can be merged with your per-stock data pdf on 'trade_date'
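To make the arithmetic in `calculate_size_style_strength_factor` and `calculate_market_divergence_factor` concrete, here is a small worked example on made-up numbers. It is a sketch only: the wide-format column names `large`, `mid`, `small` and the helper `divergence_score` are hypothetical, and the same three rows are reused first as N-day returns and then as daily percentage changes purely for brevity.

```python
import numpy as np
import pandas as pd

# Made-up returns for the large-, mid- and small-cap indices on three days.
rets = pd.DataFrame(
    {
        "large": [0.02, -0.01, 0.00],
        "mid": [0.03, -0.02, np.nan],
        "small": [0.05, 0.02, -0.01],
    },
    index=pd.DatetimeIndex(["2024-01-02", "2024-01-03", "2024-01-04"], name="trade_date"),
)

# Size-style strength: (large - small) scaled by 1 + sign(mid) * |mid - mean(large, small)|.
large_small_diff = rets["large"] - rets["small"]
mid_deviation = rets["mid"] - (rets["large"] + rets["small"]) / 2
adjustment = 1 + np.sign(rets["mid"].fillna(0)) * mid_deviation.fillna(0).abs()
style_strength = large_small_diff * adjustment
# Day 1: -0.03 * (1 + 0.005) = -0.03015
# Day 2: -0.03 * (1 - 0.025) = -0.02925
# Day 3: mid is NaN, so the adjustment is 1 and the factor is 0.01


def divergence_score(row: pd.Series) -> float:
    """Sign-agreement score mirroring the branch logic of the diffed function."""
    valid = row.dropna()
    if len(valid) < 2:
        return np.nan
    signs = set(np.sign(valid))
    if len(signs) == 1:
        return 0.0   # every index moved the same way
    if len(signs) == 3:
        return 1.0   # up, down and flat on the same day
    return 0.25 if 0.0 in signs else 0.75


divergence = rets.apply(divergence_score, axis=1)
print(pd.DataFrame({"style_strength": style_strength, "divergence": divergence}))
```

Day 3 shows the NaN handling in both factors: a missing mid-cap return leaves the style factor unadjusted, and the divergence score is still defined because two valid returns remain.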
import pandas as pd
|
||||
import numpy as np
|
||||
from scipy.stats import spearmanr # 用于因子3的原始思路,但实际简化了
|
||||
|
||||
epsilon = 1e-10
|
||||
|
||||
def _safe_divide(numerator, denominator, default_val=0.0):
|
||||
"""安全除法"""
|
||||
with np.errstate(divide='ignore', invalid='ignore'):
|
||||
result = numerator / denominator
|
||||
result[~np.isfinite(result)] = default_val
|
||||
return result
|
||||
|
||||
# --- Revised factor calculation functions ---

def calculate_size_style_strength_factor(df: pd.DataFrame, N: int = 5, factor_name_suffix: str = '') -> pd.DataFrame:
    """
    Compute the large-cap vs. small-cap style relative strength factor.
    Returns: a DataFrame indexed by trade_date with the factor values as its column.
    """
    factor_name = f'size_style_strength_{N}{factor_name_suffix}'
    print(f"Calculating {factor_name}...")

    required_indices = ['399300.SZ', '000905.SH', '000852.SH']
    if not all(idx in df['ts_code'].unique() for idx in required_indices):
        print(f"Error: DataFrame is missing some of the required index codes ({required_indices}). Returning an all-NaN factor DataFrame.")
        return pd.DataFrame(index=df['trade_date'].unique(), columns=[factor_name]).rename_axis('trade_date')

    # 1. N-day return of each index
    df_copy = df.copy()  # work on a copy so the caller's df is not modified
    df_copy['_ret_N'] = df_copy.groupby('ts_code')['close'].pct_change(periods=N)

    # 2. Pivot to wide format for the cross-sectional calculation
    pivot_ret_N = df_copy.pivot_table(index='trade_date', columns='ts_code', values='_ret_N')

    # Fetch each column, falling back to an all-NaN Series if it is missing
    large_ret = pivot_ret_N.get('399300.SZ', pd.Series(np.nan, index=pivot_ret_N.index))
    mid_ret = pivot_ret_N.get('000905.SH', pd.Series(np.nan, index=pivot_ret_N.index))
    small_ret = pivot_ret_N.get('000852.SH', pd.Series(np.nan, index=pivot_ret_N.index))

    # 3. Compute the factor (one scalar value per day)
    large_small_diff = large_ret - small_ret
    avg_large_small_ret = (large_ret + small_ret) / 2
    # Mid-cap deviation adjustment. NaNs are filled with 0, so when the mid-cap
    # return is NaN the multiplier is 1 and no adjustment is applied.
    mid_deviation_raw = mid_ret - avg_large_small_ret
    mid_deviation_factor = 1 + np.sign(mid_ret.fillna(0)) * np.abs(mid_deviation_raw.fillna(0))

    daily_factor_values = large_small_diff * mid_deviation_factor
    daily_factor_values.name = factor_name  # name the Series so the column is labelled

    print(f"Finished {factor_name}.")
    return daily_factor_values.to_frame()  # return as a DataFrame

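# A worked example of the formula used above,
# (large - small) * (1 + sign(mid) * |mid - (large + small) / 2|),
# on two made-up days. The function name size_style_strength and all return
# values are illustrative assumptions, not data from the project.

```python
import numpy as np
import pandas as pd


def size_style_strength(large_ret: pd.Series, mid_ret: pd.Series,
                        small_ret: pd.Series) -> pd.Series:
    """Large-minus-small return spread, scaled by a mid-cap deviation term."""
    diff = large_ret - small_ret
    avg = (large_ret + small_ret) / 2
    mid_dev = mid_ret - avg
    # With a NaN mid-cap return both fillna(0) terms vanish, so the multiplier is 1.
    adjustment = 1 + np.sign(mid_ret.fillna(0)) * np.abs(mid_dev.fillna(0))
    return diff * adjustment


dates = pd.to_datetime(["2024-01-02", "2024-01-03"])
large = pd.Series([0.02, -0.01], index=dates)
mid = pd.Series([0.03, np.nan], index=dates)
small = pd.Series([0.00, 0.01], index=dates)

strength = size_style_strength(large, mid, small)
# Day 1: 0.02 * (1 + 1 * |0.03 - 0.01|) = 0.02 * 1.02
# Day 2: mid is NaN, so the spread -0.02 passes through unchanged
print(strength)
```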
def calculate_volatility_structure_factor(df: pd.DataFrame, N: int = 10, factor_name_suffix: str = '') -> pd.DataFrame:
    """
    Compute the market volatility structure factor.
    Returns: a DataFrame indexed by trade_date with the factor values as its column.
    """
    factor_name = f'vol_structure_idx_{N}{factor_name_suffix}'
    print(f"Calculating {factor_name}...")

    required_indices = ['399300.SZ', '000905.SH', '000852.SH']
    if not all(idx in df['ts_code'].unique() for idx in required_indices):
        print(f"Error: DataFrame is missing some of the required index codes ({required_indices}). Returning an all-NaN factor DataFrame.")
        return pd.DataFrame(index=df['trade_date'].unique(), columns=[factor_name]).rename_axis('trade_date')

    if 'pct_chg' not in df.columns:
        print(f"Error: DataFrame has no 'pct_chg' column. {factor_name} will be filled with NaN.")
        return pd.DataFrame(index=df['trade_date'].unique(), columns=[factor_name]).rename_axis('trade_date')

    df_copy = df.copy()
    # 1. N-day rolling volatility of each index
    df_copy['_vol_N'] = df_copy.groupby('ts_code')['pct_chg'].rolling(N, min_periods=max(1, N//2)).std().reset_index(level=0, drop=True)

    # 2. Pivot to wide format
    pivot_vol_N = df_copy.pivot_table(index='trade_date', columns='ts_code', values='_vol_N')

    large_vol = pivot_vol_N.get('399300.SZ', pd.Series(np.nan, index=pivot_vol_N.index))
    mid_vol = pivot_vol_N.get('000905.SH', pd.Series(np.nan, index=pivot_vol_N.index))
    small_vol = pivot_vol_N.get('000852.SH', pd.Series(np.nan, index=pivot_vol_N.index))

    # 3. Compute the factor: (small-cap vol - mid-cap vol) / large-cap vol
    daily_factor_values = _safe_divide((small_vol - mid_vol), large_vol)
    daily_factor_values.name = factor_name

    print(f"Finished {factor_name}.")
    return daily_factor_values.to_frame()

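# A small sketch of the per-index rolling volatility step on toy data. The codes
# "A" and "B", the window N = 2 and the pct_chg values are assumptions chosen
# so that the result is easy to check by hand.

```python
import pandas as pd

toy = pd.DataFrame({
    "ts_code": ["A"] * 4 + ["B"] * 4,
    "trade_date": pd.to_datetime(
        ["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"] * 2),
    "pct_chg": [1.0, 3.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0],
})

N = 2
# Rolling sample standard deviation within each ts_code; dropping the group
# level restores the original row index so the result aligns with toy.
toy["vol_N"] = (
    toy.groupby("ts_code")["pct_chg"]
    .rolling(N, min_periods=max(1, N // 2))
    .std()
    .reset_index(level=0, drop=True)
)

# Wide layout: one row per date, one column per index code.
wide = toy.pivot_table(index="trade_date", columns="ts_code", values="vol_N")
print(wide)
```

# With a single observation the sample standard deviation is undefined, so the
# first date of each code is NaN even though min_periods allows one point.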
def calculate_market_divergence_factor(df: pd.DataFrame, factor_name_suffix: str = '') -> pd.DataFrame:
    """
    Compute the market divergence factor (based on how consistent the signs of the
    three indices' daily returns are).
    Returns: a DataFrame indexed by trade_date with the factor values as its column.
    """
    factor_name = f'market_divergence_score{factor_name_suffix}'
    print(f"Calculating {factor_name}...")

    required_indices = ['399300.SZ', '000905.SH', '000852.SH']
    if not all(idx in df['ts_code'].unique() for idx in required_indices):
        print(f"Error: DataFrame is missing some of the required index codes ({required_indices}). Returning an all-NaN factor DataFrame.")
        return pd.DataFrame(index=df['trade_date'].unique(), columns=[factor_name]).rename_axis('trade_date')

    if 'pct_chg' not in df.columns:
        print(f"Error: DataFrame has no 'pct_chg' column. {factor_name} will be filled with NaN.")
        return pd.DataFrame(index=df['trade_date'].unique(), columns=[factor_name]).rename_axis('trade_date')

    pivot_pct_chg = df.pivot_table(index='trade_date', columns='ts_code', values='pct_chg')

    # Column names of the three indices
    idx_large_col = '399300.SZ'
    idx_mid_col = '000905.SH'
    idx_small_col = '000852.SH'

    # reindex guarantees all expected columns exist; missing ones are filled with NaN
    pivot_pct_chg = pivot_pct_chg.reindex(columns=[idx_large_col, idx_mid_col, idx_small_col])

    def daily_divergence_score_calc(row):
        # row holds the three indices' returns for one day
        valid_returns = row.dropna()  # keep non-NaN returns only
        if len(valid_returns) < 2:  # fewer than 2 valid values: divergence is undefined
            return np.nan

        signs = np.sign(valid_returns)
        unique_sign_count = len(signs.unique())

        if unique_sign_count == 1:  # all signs equal (all zero also gives a single sign, 0)
            return 0.0  # lowest divergence (fully aligned)
        elif unique_sign_count == 2 and 0 in signs.unique():  # one direction plus a flat index
            return 0.25  # low divergence
        elif unique_sign_count == 2:  # two directions (e.g. two up and one down, or two down and one up)
            return 0.75  # high divergence
        elif unique_sign_count == 3:  # three different signs (+, -, 0)
            return 1.0  # highest divergence
        return np.nan  # any other case (should not happen in theory)

    daily_factor_values = pivot_pct_chg[[idx_large_col, idx_mid_col, idx_small_col]].apply(daily_divergence_score_calc, axis=1)
    daily_factor_values.name = factor_name

    print(f"Finished {factor_name}.")
    return daily_factor_values.to_frame()

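# The scoring rule above, restated as a standalone function over a plain list of
# returns so each branch can be checked in isolation. The name divergence_score
# and the sample returns are illustrative assumptions.

```python
import numpy as np


def divergence_score(returns):
    """Map the signs of same-day index returns to a divergence score in [0, 1]."""
    valid = [r for r in returns if not np.isnan(r)]
    if len(valid) < 2:
        return float("nan")  # cannot judge divergence from fewer than two values
    signs = {int(np.sign(r)) for r in valid}
    if len(signs) == 1:
        return 0.0           # every index moved the same way (or all flat)
    if len(signs) == 3:
        return 1.0           # up, down and flat all present
    return 0.25 if 0 in signs else 0.75  # one direction plus flat, or up vs. down


scores = {
    "all_up": divergence_score([0.5, 1.2, 0.3]),
    "up_and_flat": divergence_score([0.5, 0.0, 0.3]),
    "mixed": divergence_score([0.5, -0.2, 0.3]),
    "all_three": divergence_score([0.5, -0.2, 0.0]),
}
print(scores)
```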
# --- Combine all factor calculations in one entry point ---
def generate_daily_index_relation_factors(df_input: pd.DataFrame) -> pd.DataFrame:
    """
    Compute all daily cross-sectional factors based on the relationship between
    the large-, mid- and small-cap indices.

    Args:
        df_input (pd.DataFrame): long-format index quotes containing 'ts_code', 'trade_date', 'close', 'pct_chg'.

    Returns:
        pd.DataFrame: a DataFrame indexed by 'trade_date' with one column per factor.
    """
    # sort_values returns a new frame, so the caller's df_input is not modified
    df = df_input.sort_values(['ts_code', 'trade_date']).reset_index(drop=True)

    # Compute each factor (every function returns a DataFrame indexed by trade_date)
    factor1_df = calculate_size_style_strength_factor(df, N=5)
    factor2_df = calculate_volatility_structure_factor(df, N=10)
    factor3_df = calculate_market_divergence_factor(df)

    # Further daily cross-sectional factors can be added here...

    # Merge all factor DataFrames with functools.reduce and pd.merge
    from functools import reduce
    daily_factors_list = [factor1_df, factor2_df, factor3_df]
    # Drop frames that are empty or all-NaN, which is what the factor functions return on error
    daily_factors_list = [f_df for f_df in daily_factors_list if not f_df.empty and not f_df.iloc[:,0].isna().all()]

    if not daily_factors_list:
        print("Warning: no factor could be computed. Returning an empty DataFrame.")
        # Return a column-less DataFrame indexed by the dates of the input df
        return pd.DataFrame(index=df['trade_date'].unique()).rename_axis('trade_date')

    # Outer join keeps every date; 'trade_date' is the index level name of each frame
    final_factors_df = reduce(lambda left, right: pd.merge(left, right, on='trade_date', how='outer'),
                              daily_factors_list)

    final_factors_df = final_factors_df.sort_index()  # sort by date

    return final_factors_df

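# A minimal sketch of combining several date-indexed factor frames with an outer
# join. It uses DataFrame.join inside functools.reduce instead of the
# pd.merge(on='trade_date') call above; the frame names and values are
# illustrative assumptions.

```python
from functools import reduce

import pandas as pd

idx_a = pd.Index(pd.to_datetime(["2024-01-02", "2024-01-03"]), name="trade_date")
idx_b = pd.Index(pd.to_datetime(["2024-01-03", "2024-01-04"]), name="trade_date")
frame_a = pd.DataFrame({"factor_a": [1.0, 2.0]}, index=idx_a)
frame_b = pd.DataFrame({"factor_b": [10.0, 20.0]}, index=idx_b)

# The outer join keeps the union of dates; a factor missing on a date becomes NaN.
merged = reduce(
    lambda left, right: left.join(right, how="outer"),
    [frame_a, frame_b],
).sort_index()
print(merged)
```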
# --- Usage example ---
# Assume all_indices_df holds long-format quotes for the three indices '399300.SZ', '000905.SH', '000852.SH'
# and has the columns 'ts_code', 'trade_date', 'open', 'high', 'low', 'close', 'vol', 'pct_chg'
# all_indices_df['trade_date'] = pd.to_datetime(all_indices_df['trade_date'])
# all_indices_df = all_indices_df.sort_values(['ts_code', 'trade_date'])

# daily_market_factors = generate_daily_index_relation_factors(all_indices_df)
# print("\nDaily market style / sentiment factors:")
# print(daily_market_factors.tail())

# Afterwards, daily_market_factors can be merged into the per-stock data pdf on 'trade_date'
# pdf_with_market_factors = pd.merge(pdf, daily_market_factors, on='trade_date', how='left')
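# A small sketch of the final merge step described in the usage notes: attaching
# daily market factors to per-stock rows. The stock codes, prices and factor
# values are made up for illustration, and reset_index() is used so that
# 'trade_date' is an ordinary column on both sides of the merge.

```python
import pandas as pd

market_factors = pd.DataFrame(
    {"market_divergence_score": [0.0, 0.75]},
    index=pd.Index(pd.to_datetime(["2024-01-02", "2024-01-03"]), name="trade_date"),
)
stock_rows = pd.DataFrame({
    "ts_code": ["000001.SZ", "600000.SH", "000001.SZ"],
    "trade_date": pd.to_datetime(["2024-01-02", "2024-01-02", "2024-01-03"]),
    "close": [10.0, 8.0, 10.2],
})

# A left merge keeps every stock row and repeats the day's market factor on it.
stock_with_market = stock_rows.merge(
    market_factors.reset_index(), on="trade_date", how="left")
print(stock_with_market)
```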
@@ -1,7 +1,7 @@
from main.utils.utils import read_and_merge_h5_data, merge_with_industry_data

import sys
print(sys.path)
@@ -1,222 +1,222 @@
from clickhouse_driver import Client  # assumed source of Client (clickhouse-driver package); the original file never imports it
from tqdm import tqdm

from main.factor.factor import get_rolling_factor, get_simple_factor
from main.utils.utils import read_and_merge_h5_data
import pandas as pd


def create_factor_table_clickhouse(clickhouse_host: str, clickhouse_port: int,
                                   clickhouse_user: str, clickhouse_password: str,
                                   clickhouse_database: str, table_name: str = 'factor_data'):
    """
    Create the factor_data table in ClickHouse, laid out with read speed in mind.
    """
    try:
        print('create factor table')
        client = Client(host=clickhouse_host, port=clickhouse_port, user=clickhouse_user,
                        password=clickhouse_password, database=clickhouse_database)

        create_table_query = f"""
        CREATE TABLE IF NOT EXISTS {table_name}
        (
            date Date,
            asset_id String,
            factor_name String,
            factor_value Float64
        )
        ENGINE = MergeTree()
        PARTITION BY toYYYYMM(date)
        ORDER BY (date, asset_id, factor_name)
        """

        client.execute(create_table_query)
        print(f"Created table '{table_name}' in ClickHouse database '{clickhouse_database}'.")

    except Exception as e:
        print(f"Error while creating the ClickHouse table: {e}")
    finally:
        if 'client' in locals() and client.connection:
            client.disconnect()


def write_features_to_clickhouse(df: pd.DataFrame, feature_columns: list,
                                 clickhouse_host: str, clickhouse_port: int,
                                 clickhouse_user: str, clickhouse_password: str,
                                 clickhouse_database: str, table_name: str = 'stock_factor',
                                 batch_size: int = 5000):  # rows per insert batch
    """
    Write the given feature columns of a DataFrame to a wide ClickHouse table in
    batches, adding any missing columns on the fly.
    """
    try:
        client = Client(host=clickhouse_host, port=clickhouse_port, user=clickhouse_user,
                        password=clickhouse_password, database=clickhouse_database)

        if 'ts_code' not in df.columns or 'trade_date' not in df.columns:
            raise ValueError("DataFrame must contain the 'ts_code' and 'trade_date' columns.")

        # Columns that already exist in the target table
        existing_columns = set()
        columns_query = f"DESCRIBE TABLE {table_name}"
        columns_result = client.execute(columns_query)
        for col in columns_result:
            existing_columns.add(col[0])

        for factor_name in feature_columns:
            if factor_name not in existing_columns:
                if factor_name not in df.columns:
                    print(f"Warning: feature '{factor_name}' is not in the DataFrame; skipping column creation.")
                    continue

                factor_series = df[factor_name]
                factor_dtype = factor_series.dtype

                # Map the pandas dtype to a ClickHouse column type
                clickhouse_dtype = None
                if pd.api.types.is_float_dtype(factor_dtype):
                    clickhouse_dtype = 'Float64'
                elif pd.api.types.is_integer_dtype(factor_dtype):
                    clickhouse_dtype = 'Int64'
                elif factor_dtype == 'object':
                    print(f"Warning: feature '{factor_name}' has dtype object; skipping column creation.")
                    continue
                else:
                    clickhouse_dtype = 'Float64'

                if clickhouse_dtype:
                    add_column_query = f"ALTER TABLE {table_name} ADD COLUMN IF NOT EXISTS {factor_name} {clickhouse_dtype}"
                    client.execute(add_column_query)
                    print(f"Added new column to table '{table_name}': {factor_name} ({clickhouse_dtype})")
                    existing_columns.add(factor_name)

        insert_columns_order = ['date', 'asset_id'] + [col for col in feature_columns if
                                                       col in existing_columns and col in df.columns]

        # Process the DataFrame in batches
        num_rows = len(df)
        for i in tqdm(range(0, num_rows, batch_size), desc="Writing batches"):
            batch_df = df.iloc[i:i + batch_size]  # positional slice of rows
            data_to_insert_batch = []
            for row in batch_df.itertuples(index=False):
                insert_row = [getattr(row, 'trade_date'), getattr(row, 'ts_code')]
                for factor in feature_columns:
                    if factor in existing_columns and factor in df.columns:
                        try:
                            insert_row.append(getattr(row, factor))
                        except AttributeError:
                            insert_row.append(None)
                data_to_insert_batch.append(tuple(insert_row))
            write_batch_to_clickhouse(client, table_name, data_to_insert_batch, insert_columns_order)

    except Exception as e:
        print(f"Error while writing to ClickHouse: {e}")
    finally:
        if 'client' in locals() and client.connection:
            client.disconnect()


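# The dtype-to-column-type mapping used above, pulled out as a small helper so it
# can be checked without a database. The helper name clickhouse_type_for is an
# assumption for illustration; the mapping itself mirrors the branches in
# write_features_to_clickhouse (object columns are skipped, anything else that
# is not an integer falls back to Float64).

```python
import pandas as pd


def clickhouse_type_for(series: pd.Series):
    """Return the ClickHouse type name for a column, or None if it is skipped."""
    dtype = series.dtype
    if pd.api.types.is_float_dtype(dtype):
        return "Float64"
    if pd.api.types.is_integer_dtype(dtype):
        return "Int64"
    if pd.api.types.is_object_dtype(dtype):
        return None  # object columns are not written
    return "Float64"  # fallback for remaining dtypes such as bool


float_type = clickhouse_type_for(pd.Series([1.5, 2.5]))
int_type = clickhouse_type_for(pd.Series([1, 2]))
object_type = clickhouse_type_for(pd.Series(["a", "b"], dtype=object))
bool_type = clickhouse_type_for(pd.Series([True, False]))
print(float_type, int_type, object_type, bool_type)
```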
def write_batch_to_clickhouse(client, table_name, data_to_insert, columns_order):
    """Write one batch of rows to ClickHouse."""
    if data_to_insert:
        insert_query_final = f"INSERT INTO {table_name} ({', '.join(columns_order)}) VALUES"
        try:
            client.execute(insert_query_final, data_to_insert)
            print(f"Wrote {len(data_to_insert)} rows to ClickHouse table '{table_name}'.")
        except Exception as e:
            print(f"Error while writing a batch to ClickHouse: {e}")


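# A hedged sketch of the batching step on its own: slicing a DataFrame into
# fixed-size chunks of plain tuples, ready to hand to an insert call. The
# generator name iter_row_batches and the sample frame are assumptions;
# itertuples(name=None) is used here in place of the per-row getattr loop above.

```python
import pandas as pd


def iter_row_batches(df: pd.DataFrame, columns, batch_size: int):
    """Yield lists of row tuples, at most batch_size rows per list."""
    for start in range(0, len(df), batch_size):
        chunk = df.iloc[start:start + batch_size]
        # name=None makes itertuples return ordinary tuples in column order.
        yield list(chunk[columns].itertuples(index=False, name=None))


sample = pd.DataFrame({
    "ts_code": ["000001.SZ"] * 5,
    "factor_a": [1.0, 2.0, 3.0, 4.0, 5.0],
})
batches = list(iter_row_batches(sample, ["ts_code", "factor_a"], batch_size=2))
print([len(b) for b in batches])
```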
# -------------------- Usage example --------------------
if __name__ == "__main__":
    # Example DataFrame, built by merging several HDF5 sources

    print('daily data')
    df = read_and_merge_h5_data('../../data/daily_data.h5', key='daily_data',
                                columns=['ts_code', 'trade_date', 'open', 'close', 'high', 'low', 'vol', 'pct_chg'],
                                df=None)

    print('daily basic')
    df = read_and_merge_h5_data('../../data/daily_basic.h5', key='daily_basic',
                                columns=['ts_code', 'trade_date', 'turnover_rate', 'pe_ttm', 'circ_mv', 'volume_ratio',
                                         'is_st'], df=df, join='inner')
    df = df[df['trade_date'] >= '2021-01-01']

    print('stk limit')
    df = read_and_merge_h5_data('../../data/stk_limit.h5', key='stk_limit',
                                columns=['ts_code', 'trade_date', 'pre_close', 'up_limit', 'down_limit'],
                                df=df)
    print('money flow')
    df = read_and_merge_h5_data('../../data/money_flow.h5', key='money_flow',
                                columns=['ts_code', 'trade_date', 'buy_sm_vol', 'sell_sm_vol', 'buy_lg_vol',
                                         'sell_lg_vol',
                                         'buy_elg_vol', 'sell_elg_vol', 'net_mf_vol'],
                                df=df)
    print('cyq perf')
    df = read_and_merge_h5_data('../../data/cyq_perf.h5', key='cyq_perf',
                                columns=['ts_code', 'trade_date', 'his_low', 'his_high', 'cost_5pct', 'cost_15pct',
                                         'cost_50pct',
                                         'cost_85pct', 'cost_95pct', 'weight_avg', 'winner_rate'],
                                df=df)
    print(df.info())

    # Remember the raw input columns so they can be excluded from the feature list later
    origin_columns = df.columns.tolist()
    origin_columns = [col for col in origin_columns if 'cyq' not in col]
    print(origin_columns)


    def filter_data(df):
        # df = df.groupby('trade_date').apply(lambda x: x.nlargest(1000, 'act_factor1'))
        df = df[~df['is_st']]                          # drop ST stocks
        df = df[~df['ts_code'].str.endswith('BJ')]     # drop Beijing Stock Exchange listings
        df = df[~df['ts_code'].str.startswith('30')]   # drop ChiNext codes
        df = df[~df['ts_code'].str.startswith('68')]   # drop STAR Market codes
        df = df[~df['ts_code'].str.startswith('8')]    # drop codes starting with 8
        df = df[df['trade_date'] >= '20180101']
        if 'in_date' in df.columns:
            df = df.drop(columns=['in_date'])
        df = df.reset_index(drop=True)
        return df


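# A compact variant of the code-prefix filter, shown on a handful of sample
# codes. It assumes a pandas version in which Series.str.startswith accepts a
# tuple of prefixes (1.5 or later); the sample codes and variable names are
# illustrative only.

```python
import pandas as pd

codes = pd.Series(
    ["000001.SZ", "300750.SZ", "688981.SH", "830799.BJ", "600000.SH"])

excluded_prefixes = ("30", "68", "8")
# Keep a code only if it has none of the excluded prefixes and no BJ suffix.
keep = ~codes.str.startswith(excluded_prefixes) & ~codes.str.endswith("BJ")
kept_codes = codes[keep].tolist()
print(kept_codes)
```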
    df = filter_data(df)
    df, _ = get_rolling_factor(df)
    df, _ = get_simple_factor(df)
    # df['test'] = 1
    # df['test2'] = 2
    # df = df.merge(industry_df, on=['l2_code', 'trade_date'], how='left')
    df = df.rename(columns={'l2_code': 'cat_l2_code'})
    # df = df.merge(index_data, on='trade_date', how='left')

    print(df.info())

    # Start from every column, then strip keys, labels, raw inputs and temporary columns
    feature_columns = list(df.columns)
    feature_columns = [col for col in feature_columns if col not in ['trade_date',
                                                                      'ts_code',
                                                                      'label']]
    feature_columns = [col for col in feature_columns if 'future' not in col]
    feature_columns = [col for col in feature_columns if 'label' not in col]
    feature_columns = [col for col in feature_columns if 'score' not in col]
    feature_columns = [col for col in feature_columns if 'gen' not in col]
    feature_columns = [col for col in feature_columns if 'is_st' not in col]
    # feature_columns = [col for col in feature_columns if 'pe_ttm' not in col]
    # feature_columns = [col for col in feature_columns if 'volatility' not in col]
    # feature_columns = [col for col in feature_columns if 'circ_mv' not in col]
    feature_columns = [col for col in feature_columns if 'cat_l2_code' not in col]
    feature_columns = [col for col in feature_columns if col not in origin_columns]
    feature_columns = [col for col in feature_columns if not col.startswith('_')]

    print(feature_columns)

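# The chain of list comprehensions above can be folded into one helper. This is a
# sketch under the assumption that the same exclusion rules apply; the function
# name select_feature_columns and the sample column names are hypothetical.

```python
def select_feature_columns(all_columns, origin_columns,
                           key_columns=("trade_date", "ts_code"),
                           banned_substrings=("future", "label", "score",
                                              "gen", "is_st", "cat_l2_code")):
    """Keep derived factor columns; drop keys, raw inputs, labels and temp columns."""
    origin = set(origin_columns)
    return [
        col for col in all_columns
        if col not in key_columns
        and col not in origin
        and not col.startswith("_")
        and not any(part in col for part in banned_substrings)
    ]


all_columns = ["ts_code", "trade_date", "close", "act_factor1", "future_return_5",
               "_tmp", "label", "vol_std_5", "market_divergence_score"]
selected = select_feature_columns(all_columns, ["ts_code", "trade_date", "close"])
print(selected)
```

# Substring rules are broad: any column whose name contains "gen" or "score" is
# dropped, so factor names need to avoid those fragments.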
    # Replace with your own ClickHouse connection details
    clickhouse_host = '127.0.0.1'
    clickhouse_port = 9000
    clickhouse_user = 'default'
    clickhouse_password = 'clickhouse520102'
    clickhouse_database = 'stock_data'

    # create_factor_table_clickhouse(clickhouse_host, clickhouse_port,
    #                                clickhouse_user, clickhouse_password,
    #                                clickhouse_database)

    write_features_to_clickhouse(
        df[[col for col in df.columns if col in ['ts_code', 'trade_date'] or col in feature_columns]], feature_columns,
        clickhouse_host, clickhouse_port,
        clickhouse_user, clickhouse_password,
        clickhouse_database)
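# One possible way to keep connection details out of the source file is to read
# them from environment variables. This is only a sketch: the variable names
# CLICKHOUSE_HOST, CLICKHOUSE_PORT, CLICKHOUSE_USER, CLICKHOUSE_PASSWORD and
# CLICKHOUSE_DATABASE and the helper load_clickhouse_settings are hypothetical,
# and the defaults simply repeat the non-secret values used in the script.

```python
import os


def load_clickhouse_settings(env=None) -> dict:
    """Build connection settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return {
        "host": env.get("CLICKHOUSE_HOST", "127.0.0.1"),
        "port": int(env.get("CLICKHOUSE_PORT", "9000")),
        "user": env.get("CLICKHOUSE_USER", "default"),
        "password": env.get("CLICKHOUSE_PASSWORD", ""),
        "database": env.get("CLICKHOUSE_DATABASE", "stock_data"),
    }


# A plain dict stands in for the process environment in this example.
settings = load_clickhouse_settings(
    {"CLICKHOUSE_PORT": "9440", "CLICKHOUSE_PASSWORD": "secret"})
print(settings["host"], settings["port"])
```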