Overview

CausalInvestData provides simulated datasets for causal inference in institutional investment management. The package includes four core datasets (fund_performance, portfolio_allocations, client_behavior, and macro_shocks) that mimic real-world data structures, so users can prototype, teach, and evaluate methods such as propensity score matching, causal forests, and impact analysis.


Dataset: fund_performance

data("fund_performance", package = "CausalInvestData")
head(fund_performance)
##   fund_id market_return        alpha      beta treatment       return
## 1       1   0.003952435 -0.009915974 0.9217857         0 -0.007775751
## 2       2   0.036982251 -0.010799101 1.1331275         1  0.032828934
## 3       3   0.215870831  0.009640395 1.0374590         1  0.224115880
## 4       4   0.067050839  0.007356497 1.1228787         1  0.080673608
## 5       5   0.072928774 -0.040986855 0.9176203         0  0.051918971
## 6       6   0.231506499  0.030811469 0.8564341         0  0.228707376
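Before any adjustment, it can help to look at the raw gap in mean returns between treated and untreated funds. The sketch below is a minimal, unadjusted comparison; the helper name `naive_effect` is hypothetical and not part of the package, and the result should not be read as a causal estimate.

```r
# Naive (unadjusted) comparison of mean returns by treatment status.
# `naive_effect` is a hypothetical helper name, not part of CausalInvestData.
naive_effect <- function(df) {
  treated <- df$return[df$treatment == 1]
  control <- df$return[df$treatment == 0]
  c(
    mean_treated = mean(treated),
    mean_control = mean(control),
    difference   = mean(treated) - mean(control)
  )
}

# Run on the packaged data only if the package is installed.
if (requireNamespace("CausalInvestData", quietly = TRUE)) {
  data("fund_performance", package = "CausalInvestData")
  print(naive_effect(fund_performance))
}
```

This difference serves as a baseline against which matched or regression-adjusted estimates can be compared.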

Propensity Score Matching Example

library(MatchIt)
## Warning: package 'MatchIt' was built under R version 4.3.3
m.out <- matchit(treatment ~ market_return + alpha + beta, data = fund_performance)
summary(m.out)
## 
## Call:
## matchit(formula = treatment ~ market_return + alpha + beta, data = fund_performance)
## 
## Summary of Balance for All Data:
##               Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance             0.4942        0.4919          0.0961     0.9914    0.0250
## market_return        0.0623        0.0610          0.0134     0.9360    0.0079
## alpha                0.0102        0.0115         -0.0687     0.9442    0.0194
## beta                 0.9919        0.9993         -0.0654     0.9965    0.0186
##               eCDF Max
## distance        0.0725
## market_return   0.0232
## alpha           0.0512
## beta            0.0413
## 
## Summary of Balance for Matched Data:
##               Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance             0.4942        0.4920          0.0887     1.0413    0.0230
## market_return        0.0623        0.0602          0.0208     0.9314    0.0088
## alpha                0.0102        0.0113         -0.0568     0.9830    0.0167
## beta                 0.9919        0.9992         -0.0650     1.0060    0.0185
##               eCDF Max Std. Pair Dist.
## distance        0.0730          0.0927
## market_return   0.0243          1.1393
## alpha           0.0467          0.8110
## beta            0.0426          0.8197
## 
## Sample Sizes:
##           Control Treated
## All           507     493
## Matched       493     493
## Unmatched      14       0
## Discarded       0       0
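The matching step above only balances covariates; it does not itself produce an effect estimate. One possible next step, sketched here under assumptions, is to extract the matched sample with `match.data()` and fit a weighted regression of `return` on `treatment`. The helper name `matched_effect` is hypothetical, and including the matching covariates in the outcome model is an illustrative choice rather than the package author's documented method.

```r
# Sketch: point estimate of the treatment effect on the matched sample.
# `matched_effect` is a hypothetical helper, not part of CausalInvestData or MatchIt.
matched_effect <- function(df) {
  # Same specification as the matchit() call above
  # (MatchIt's default is 1:1 nearest-neighbor matching on a logistic propensity score).
  m  <- MatchIt::matchit(treatment ~ market_return + alpha + beta, data = df)
  md <- MatchIt::match.data(m)  # adds `distance`, `weights`, `subclass` columns

  # Weighted outcome regression on the matched data.
  fit <- lm(return ~ treatment + market_return + alpha + beta,
            data = md, weights = weights)
  coef(fit)[["treatment"]]
}

# Run on the packaged data only if both packages are installed.
if (requireNamespace("CausalInvestData", quietly = TRUE) &&
    requireNamespace("MatchIt", quietly = TRUE)) {
  data("fund_performance", package = "CausalInvestData")
  print(matched_effect(fund_performance))
}
```

This returns a point estimate only. Standard errors after matching usually need extra care (for example, cluster-robust errors by matched pair), which is outside the scope of this sketch.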

Dataset: portfolio_allocations

data("portfolio_allocations", package = "CausalInvestData")
head(portfolio_allocations)
##   portfolio_id risk_level equity_allocation treatment     return bond_allocation
## 1            1        Low         0.4080645         1 0.06659800       0.5919355
## 2            2       High         0.7138592         1 0.09565076       0.2861408
## 3            3     Medium         0.7284506         1 0.07306955       0.2715494
## 4            4       High         0.2891309         1 0.09795321       0.7108691
## 5            5        Low         0.8376339         0 0.10054662       0.1623661
## 6            6     Medium         0.4115289         1 0.08804457       0.5884711
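A simple way to use this dataset is regression adjustment: regress `return` on `treatment` while controlling for `risk_level` and `equity_allocation`. In the six rows shown, `equity_allocation` and `bond_allocation` sum to 1, so the sketch below includes only one of them to avoid collinearity; whether that holds for every row is an assumption worth checking. The helper name `adjusted_effect` is hypothetical.

```r
# Sketch: regression-adjusted treatment coefficient for portfolio returns.
# `adjusted_effect` is a hypothetical helper, not part of CausalInvestData.
adjusted_effect <- function(df) {
  df$risk_level <- factor(df$risk_level)  # treat risk level as categorical
  # bond_allocation is left out on the assumption that it equals
  # 1 - equity_allocation, which would make the two columns collinear.
  fit <- lm(return ~ treatment + risk_level + equity_allocation, data = df)
  coef(fit)[["treatment"]]
}

# Run on the packaged data only if the package is installed.
if (requireNamespace("CausalInvestData", quietly = TRUE)) {
  data("portfolio_allocations", package = "CausalInvestData")
  # Check the allocation-sum assumption before relying on it.
  print(range(portfolio_allocations$equity_allocation +
              portfolio_allocations$bond_allocation))
  print(adjusted_effect(portfolio_allocations))
}
```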

Dataset: client_behavior

data("client_behavior", package = "CausalInvestData")
head(client_behavior)
##   client_id age   income satisfaction_score treatment churned
## 1         1  54 65785.65           4.898391         0       0
## 2         2  66 56907.43           2.481101         1       1
## 3         3  32 57223.12           2.072351         0       0
## 4         4  48 49584.93           6.903271         0       0
## 5         5  75 46669.48           3.650145         0       0
## 6         6  33 52759.41           1.038131         1       0
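Because `churned` is a binary outcome, a logistic regression is a natural first model for this dataset. The following sketch fits one with base R's `glm()` and reports odds ratios; the helper name `churn_odds_ratios` and the choice of covariates are illustrative assumptions, not a method prescribed by the package.

```r
# Sketch: logistic regression of churn on treatment and client covariates.
# `churn_odds_ratios` is a hypothetical helper, not part of CausalInvestData.
churn_odds_ratios <- function(df) {
  fit <- glm(churned ~ treatment + age + income + satisfaction_score,
             data = df, family = binomial())
  exp(coef(fit))  # exponentiated coefficients are odds ratios
}

# Run on the packaged data only if the package is installed.
if (requireNamespace("CausalInvestData", quietly = TRUE)) {
  data("client_behavior", package = "CausalInvestData")
  print(churn_odds_ratios(client_behavior))
}
```

An odds ratio above 1 for `treatment` would indicate higher odds of churn among treated clients, conditional on the other covariates in the model.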

Dataset: macro_shocks

data("macro_shocks", package = "CausalInvestData")
head(macro_shocks)
##         date interest_rate gdp_growth market_index
## 1 2020-01-01    0.04858548 0.02098241   0.03418012
## 2 2020-02-01    0.03740164 0.02505696   0.05356738
## 3 2020-03-01    0.04924685 0.02460699   0.04155380
## 4 2020-04-01    0.05109889 0.01260686   0.01707771
## 5 2020-05-01    0.02392792 0.02387595   0.02444632
## 6 2020-06-01    0.06718417 0.01306215  -0.04220713
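Unlike the other three datasets, `macro_shocks` has no `treatment` column, so a first look is descriptive rather than causal. As a minimal sketch, the code below regresses `market_index` on `interest_rate` and `gdp_growth`; the helper name `macro_fit` is hypothetical, and the linear specification is an assumption made purely for illustration.

```r
# Sketch: descriptive linear association between the market index and macro variables.
# `macro_fit` is a hypothetical helper, not part of CausalInvestData.
macro_fit <- function(df) {
  df$date <- as.Date(df$date)   # works whether `date` is stored as Date or character
  df <- df[order(df$date), ]    # keep observations in time order
  lm(market_index ~ interest_rate + gdp_growth, data = df)
}

# Run on the packaged data only if the package is installed.
if (requireNamespace("CausalInvestData", quietly = TRUE)) {
  data("macro_shocks", package = "CausalInvestData")
  print(coef(macro_fit(macro_shocks)))
}
```

The coefficients describe co-movement only. Time-series issues such as autocorrelation are ignored here.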

Summary

This package is ideal for:

  • Financial data scientists building causal ML pipelines
  • Academics teaching causal inference methods
  • Practitioners evaluating financial interventions

To cite the package, run:

citation("CausalInvestData")
## To cite the CausalInvestData package in publications, use:
## 
##   Conilias Zvobwo E (2025). _CausalInvestData: Simulated Datasets for
##   Causal Inference in Investment Management_. R package version 0.1.0,
##   <https://github.com/edzai/CausalInvestData>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {CausalInvestData: Simulated Datasets for Causal Inference in Investment Management},
##     author = {Edzai {Conilias Zvobwo}},
##     year = {2025},
##     note = {R package version 0.1.0},
##     url = {https://github.com/edzai/CausalInvestData},
##   }