Research Objective
To illustrate that regression tree full model selection (rtFMS) is easy to implement and yields an optimized prediction model, with benefits that include showing the relative importance of modeling options, allowing indirect comparisons among models, enabling comparison of more preprocessing and algorithm options, and removing user bias.
Research Findings
rtFMS provides a systematic, unbiased approach to optimizing prediction model performance, removing the bias that comes with empirical selection of methods. It selected different preprocessing options for the NEFA and BHBA models, which shows that the best options depend on the outcome being predicted. The method is user-friendly, has no hyperparameters, and allows indirect comparisons among models as well as empirical decisions when appropriate. Future work should evaluate rtFMS with additional data types and automate the process for broader application in epidemiology and other fields.
Research Limitations
The study is limited to a specific milk FTIR dataset; generalizability to other data types or sizes is not established. Computational time is high (approximately one week per outcome), which may limit accessibility. The method requires expertise in R and machine learning. Performance measures are based on cross-validation, and external validation was not performed. The models may not be sufficient as stand-alone tests and should be combined with other indicators.
1:Experimental Design and Method Selection:
The study used regression tree full model selection (rtFMS) to systematically select predictive modeling methods. It involved developing models for every combination of options in categories like input subsets, preprocessing methods, and algorithms, using iterated cross-validation and regression trees for selection.
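The "every combination of options" step can be sketched as a Cartesian product over the option categories. This is a minimal Python illustration, not the study's R code; the category names and option values below are placeholders chosen for the example, not the options the study actually evaluated.

```python
from itertools import product

# Hypothetical option categories. The names and values are placeholders
# for illustration only, not the study's actual option lists.
options = {
    "input_subset": ["spectra_only", "spectra_plus_cow_info"],
    "preprocessing": ["none", "first_derivative", "standardize"],
    "algorithm": ["glmnet", "random_forest", "gbm"],
}

def enumerate_combinations(options):
    """Return one dict per combination of options (the full grid)."""
    names = list(options)
    return [dict(zip(names, combo)) for combo in product(*options.values())]

grid = enumerate_combinations(options)
print(len(grid))  # 2 * 3 * 3 = 18 combinations
```

Each entry of `grid` describes one candidate model to fit and score under cross-validation. The grid grows multiplicatively with each added category, which is consistent with the long run times reported in the limitations.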
2:Sample Selection and Data Sources:
The dataset included cow information, milk FTIR data, fatty acid predictions, FOSS predictions for milk BHBA and acetone, blood measurements, and milk components from 26 farms, 346 cows, and 115 sampling days.
3:List of Experimental Equipment and Materials:
FTIR spectrometers (brands not specified), R software with packages (DMwR, MLmetrics, party, partykit, glmnet, randomForest, gbm, earth, klaR, epiR, caret), and computational resources.
4:Experimental Procedures and Operational Workflow:
Steps included data preparation, outcome selection (NEFA ≥ 0.7 mmol/L or BHBA ≥ 1.2 mmol/L), standard methods (e.g., removing water wavenumbers), 10-fold cross-validation repeated 10 times with groupKFold to keep each farm's samples within a single fold, SMOTE to address class imbalance, fitting a model for every combination of options, performance measurement (balanced accuracy), regression tree analysis, and final model selection.
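The farm-separation idea in the cross-validation step can be sketched as follows. This is a simplified stdlib-only Python illustration of grouped folds, not a reimplementation of caret's `groupKFold`; the function name `group_k_fold`, the farm labels, and the fold-assignment rule are assumptions made for the example.

```python
import random

def group_k_fold(groups, k, seed=0):
    """Split sample indices into k folds so that all samples from
    one group (here, one farm) land in the same fold."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Assign shuffled groups to folds in round-robin order.
    fold_of_group = {g: i % k for i, g in enumerate(unique)}
    folds = []
    for f in range(k):
        test = [i for i, g in enumerate(groups) if fold_of_group[g] == f]
        train = [i for i, g in enumerate(groups) if fold_of_group[g] != f]
        folds.append((train, test))
    return folds

# Toy data: 12 samples from 6 hypothetical farms.
farms = ["A", "A", "B", "B", "C", "C", "D", "D", "E", "E", "F", "F"]
folds = group_k_fold(farms, k=3, seed=1)

for train, test in folds:
    train_farms = {farms[i] for i in train}
    test_farms = {farms[i] for i in test}
    # No farm contributes samples to both sides of a split.
    assert train_farms.isdisjoint(test_farms)
```

Calling the function with different `seed` values gives different fold assignments, which is one way to obtain repeated cross-validation. Any resampling for class imbalance, such as SMOTE, would normally be applied to the training indices only, so that the test fold keeps its original class balance.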
5:Data Analysis Methods:
Balanced accuracy was used as the performance measure. Regression trees (ctree function) were employed to analyze model performances, with statistical significance assessed using p-values and Bonferroni correction. Final models were evaluated with confidence intervals and variable importance rankings.
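For reference, balanced accuracy is the mean of sensitivity and specificity, and a Bonferroni correction multiplies each p-value by the number of tests (capped at 1). The Python sketch below shows both calculations on made-up labels and p-values; it is a generic illustration and does not claim to reproduce how `ctree` applies the correction internally.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = positive).
    Assumes both classes are present in y_true."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Made-up example: 3 positives and 7 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
ba = balanced_accuracy(y_true, y_pred)  # sensitivity 2/3, specificity 6/7

adjusted = bonferroni([0.01, 0.04, 0.5])  # made-up p-values
```

Because it weights both classes equally, balanced accuracy is less flattering than plain accuracy when the positive class is rare: in the toy data, plain accuracy is 0.8, while balanced accuracy is about 0.76.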