Automated Machine Learning Driven Stacked Ensemble Modeling for Forest Aboveground Biomass Prediction Using Multitemporal Sentinel-2 Data

Modeling and large-scale mapping of forest aboveground biomass (AGB) is a complicated, challenging, and expensive task. There are considerable variations in forest characteristics that create functional disparity for different models and needs comprehensive evaluation. Moreover, the human-bias involved in the process of modeling and evaluation affects the generalization of models at larger scales. In this article, we present an automated machine learning framework for modeling, evaluation, and stacking of multiple base models for AGB prediction. We incorporate a hyperparameter optimization procedure for automatic extraction of targeted features from multitemporal Sentinel-2 data that minimizes human-bias in the proposed modeling pipeline. We integrate the two independent frameworks for automatic feature extraction and automatic model ensembling and evaluation. The results suggest that the extracted target-oriented features have an excessive contribution of red-edge and short-wave infrared spectrum. The feature importance scale indicates a dominant role of summer-based features as compared to other seasons. The automated ensembling and evaluation framework produced a stacked ensemble of base models that outperformed individual base models in accurately predicting forest AGB. The stacked ensemble model delivered the best scores of R2cv = 0.71 and RMSE = 74.44 Mgha−1. The other base models delivered R2cv and RMSE ranging between 0.38–0.66 and 81.27–109.44 Mg ha−1, respectively. The model evaluation metrics indicated that the stacked ensemble model was more resistant to outliers and achieved a better generalization. Thus, the proposed study demonstrated an effective automated modeling pipeline for predicting AGB by minimizing human-bias and deployable over large and diverse forest areas.


I. INTRODUCTION
T HE selection of suitable machine learning (ML) algorithms to solve the forest aboveground biomass (AGB) problem Manuscript  requires domain expertise for improving and regulating model performances [1], [2], [3], [4], [5]. There are different ML-based models developed in the literature for the prediction of forest AGB. The most fundamental modeling approach to AGB prediction is generalized linear regression that allows to relate model parameters to the response variable via a link function [73], [77], [78]. Furthermore, kernel-based learners that include methods such as support vector machine [79], [80], [81] and Gaussian process regression [82], [83] have proven to be effective for the development of AGB estimation models. Additionally, the tree-based models such as Random forest (bagging), XGBoost, and CatBoost (boosting) were also found to be very efficient for the prediction of forest AGB [58], [84]. The latest developments in AGB estimation include artificial neural network based learners that use backpropagation algorithm for training the network to reliably predict AGB. These approaches include the use of some basic multilayer perceptron models [85], sparse autoencoders [86], and deep neural networks with generative learning strategy [87], [88]. However, these various ML models are trained and optimized with different strategies that make it difficult to determine whether a given technique is genuinely better or simply better tuned. Automated machine learning (AutoML) approaches can support a wide range of ML algorithms and automate algorithm selection, feature generation, hyperparameter tuning, iterative modeling, and model assessment to develop an optimized ML pipeline. The task of creating an optimized ML pipeline for AGB prediction is complex and time-consuming due to the diversity in forest type and species. The traditional approaches proposed for the development of ML pipelines use a trial and error mechanism for stacking and selection of models for AGB prediction [6], [7], [8]. In practice, the human element cannot be completely eliminated from the process due to the need of performing model checks and model explanation w.r.t target parameter, i.e., AGB. However, the human intervention can be reduced and replaced with efficient search and hyperparameter importance learning algorithms that minimize the number of user-regulated parameters and achieve faster and efficient modeling with reduced human-bias [9], [10], [11], [12]. In this regard, the concept of AutoML can be very instrumental in gaining better machine learning (ML) performance with the available computational budget and reduced human assistance to model forest AGB. Au-toML automates knowledge intensive tasks such as configuring learning tools for feature engineering, neural architecture search (NAS) and algorithm selection using an optimization-evaluation mechanism [13]. In a general architecture, the AutoML controller consists of an evaluator and an optimizer. The evaluator measures the performance of learning tools and provides feedback to the optimizer in order to update configurations for better performance. The optimizer generates configurations based on a search space that is determined by a process of targeted learning [14]. For example, if the learning process is "feature engineering," the learning tools to be configured are the "classifiers" and the search space would consist of "feature sets, feature enhancing methods (dimension reduction, feature generation, feature encoding) and related hyperparameters." There are a wide-ranging applications of AutoML such as medical image recognition [15], [16], object detection [17], [18], [19], super resolution [20], [21], [22], language modeling [23], text classification [24], semantic segmentation [25], etc. AutoML has delivered a quality performance for various tasks but scarcely experimented with developing an ML pipeline for AGB prediction. An ML pipeline is a directed graph of learning elements that can be automated by using a search of estimators/predictors, search of learning algorithms, and search of ensemble models [26], [27], [28]. In literature, there are various algorithms developed for automation of these learning elements of an ML pipeline [29]. The evolutionary algorithms such as particle swarm model selection (PSMS) use particle swarm optimization to automate the full model selection problem [30]. Such evolutionary algorithms inspired ensemble PSMS [31] which is a precursor to the development of the latest AutoML systems [28]. Recent AutoML systems are capable of building regression/classification pipelines, full model selection, multiobjective optimization and best architecture/hyperparameter search for deep learning models [32]. Auto-WEKA [33] system uses the sequential model-based algorithm configuration method [34] which is a robust stochastic optimization framework under noisy function evaluations for the lowest cross-validation misclassification. AutoSklearn [35], which is a system similar to sequential model-based optimization (SMBO), has a distinctive search process initialization that exploits a meta-learner delivered the best performance in many AutoML challenges. The tree-based pipeline optimization tool [37] is an open-source genetic AutoML system that optimizes a series of ML models for high-accuracy and compact pipelines for supervised classification. The recent surge in the application of deep neural networks led to the development of AutoML systems like Auto-Net [38] and NAS [39], [40] which are built on combination of Bayesian optimization and Hyperband to automatically tune deep neural networks without human intervention.
The concept of meta-learning is integral to AutoML systems and enables them to learn from the meta-data of the learning elements [41]. AutoML systems perform faster and more efficiently on a new task with meta-learning techniques that replace human-engineered ML pipelines with data-driven pipelines [42]. Meta-learning techniques have been successfully implemented in literature to automate various learning elements (feature engineering, architecture search, hyperparameter optimization) of the AutoML systems [26], [43], [44], [45], [46], [47]. In this context, there are many studies conducted using meta-learning techniques to execute various ML tasks in remote sensing applications. A trivial problem in deploying advanced ML algorithms in the RS domain is few-shot (low training samples) learning for which meta-learning techniques have provided some significant solutions [48], [49], [50], [51]. Meta-learning techniques can efficiently handle multiscale RS data and provide efficient solutions to inversion modeling problems [52], [53]. These techniques have been used to solve more specific RS problems such as the study in [54] specifically focused on deep neural network model for semantic segmentation of circular objects in satellite images. Apart from such specific problems, meta-learning techniques are capable of dealing with combination of such problems to solve larger problems in RS. For example, a recent study [55] aimed at generating captions for RS images using meta-learning-a task that requires dealing with a combination of problems based on visual and textual features to generate RS image captions. Thus, meta-learning approaches have a significant contribution in solving complex RS problems and hold a great potential for improving AGB modeling in diverse scenarios such as mixed tree species, forest types, and varying density of forests.
The possibility is unlikely for a single modeling algorithm to perform effectively on most types of AGB modeling scenarios. Ensemble models are a combination of base models built on the hypothesis that the combination of multiple (weak) models can produce a more reliable and accurate model as compared to the individual base models [56]. Ensemble models are developed based on techniques such as bagging, boosting, and stacking. The bagging and boosting models both involve homogeneous weak learners that are combined by a deterministic strategy. The difference is bagging algorithms follow a parallel learning strategy (e.g., Random Forest) and boosting algorithms follow a sequential learning strategy (e.g., AdaBoost, XGBoost). Differently, stacking involves heterogeneous weak learners that are combined using a meta-model with a parallel learning strategy. Ensemble models have been used for AGB modeling and prediction by various studies in literature [8], [57], and [58]. However, developing a stacked ensemble with manual trial and error approaches to model AGB is an inefficient and time-consuming task. Meta-learning-driven AutoML systems can be adequately used for developing optimal stacked ensemble models (SEM) for AGB prediction [59]. AutoML systems use a collection of base models with hyperparameter tuning algorithms for producing SEMs [60]. The AutoGluon system [61] uses a multilayer stack ensembling with K-fold bagging that stacks models in multiple layers and trains in a layer-wise manner. The AutoSklearn system [28] employs a Bayesian optimization algorithm for searching through the hyperparameter space and meta-learning for warm-starting of the search procedure. The AutoSklearn 2.0 [35], which is an improvement upon Autosklearn, uses portfolio learning after performing Bayesian optimization. Among the most recent AutoML systems, "H2O" AutoML [62] performs stacked ensembling with random forest, gradient boosting machines, linear models, and deep learning models using a super learning algorithm.
The primary objective of this article is the prediction of AGB using multitemporal multispectral (MT-MS) satellite remote sensing data. Prediction of AGB using satellite remote sensing requires a robust ML pipeline to deal with outliers in the training data and increase the generalization ability of the model. Moreover, satellite remote sensing is generally used for large-scale AGB mapping and may require testing multiple models due to the spatially varying characteristics of the forest. Thus, it is required to have a comprehensive evaluation framework for the multiple models that are considered. Therefore, we propose an AutoML system for the prediction of forest AGB that enables training and evaluation of models within a single framework. In particular, we use a meta-learning-driven AutoML system that automates the selection and stacking of candidate base learners for modeling AGB. Additionally, we propose to incorporate a SMBO procedure to automatically extract features from the MT-MS satellite data to minimize human-bias in the proposed ML pipeline.

A. Study Area Description and Field Data
The study area is the province of Trento (6216 km 2 ) situated in north-eastern Italy in the southern part of the Alps. The area is mountainous and has a 60% of forest cover mostly owned by public institutions that are subject to broad goals of forest management (i.e., forest protection, species biodiversity, carbon storage etc.). The area has mountains with high elevations and landlocked valleys suitable for species such as Norway spruce (Picea abies (L.) Karst.) and Swiss pine (Pinus cembra L.). The area has mountains with lower elevations of an average 1000 m above sea level that are suitable for species such as Silver fir (Abies alba Mill.) and European beech (Fagus sylvatica L.). The areas that are below 1000 m are mainly characterized by broadleaves species such as Ostrya carpinifolia, Carpinus betulus, Fraxinus ornus, Quercus pubescens, Quercus petrae, etc. The geographical location of the study area and the distribution of the field plots in the study area are shown in Fig. 1.
The field data consists of 315 circular plots with a fixed radius of 15 m (see Fig. 1). These data include 98 broadleaf plots, 152 coniferous plots, and 65 mixed plots. The total 315 plots were divided into three categories (broadleaf, coniferous, and mixed plot) such that more than 80% of the plot AGB should be derived from one particular tree type (broadleaf or coniferous), otherwise the plot was considered as a mixed plot. All trees having a diameter at breast height (DBH) above 7 cm inside a plot were geolocated and their species, DBH and heights were measured. The AGB of each tree was computed using the allometric equations stated in [63], and the plot level AGB was computed as the sum of the AGB of all measured trees inside the plot. The field-estimated plot level AGB values ranged from 1.07 to 711.41 Mg ha −1 . The plot coordinates were recorded using a survey-grade GPS unit.

B. Remote Sensing Data
The study was performed using MT-MS images acquired by ESA's Sentinel-2A, 2B satellites. The acquisition dates of the Sentinel-2 images are stated in Table I. Sentinel-2 images are characterized by 13 spectral bands at 60, 20, and 10 m spatial resolution. We used ten spectral bands for our experiments, i.e., four spectral bands (B, G, R, and NIR) at 10 m spatial resolution and six spectral bands (3-Red Edge, Narrow NIR, and 2-SWIR) at 20 m spatial resolution. The central wavelengths of these ten considered spectral bands ranged from 490 to 2190 nm.

III. DESCRIPTION OF ALGORITHMS
The developed mechanism for achieving the proposed objectives was based on coupling two algorithms: 1) Tree-structured Parzen Estimator (TPE) [64] which is a hyperparameter optimization algorithm based on an iterative search process on a dynamic search space and 2) the Super Learner (SL) [65] which is a V-fold cross-validation based meta-learning algorithm that selects weights for combining candidate models. The following subsections provide a description of the two algorithms.

A. TPE for Automatic Feature Extraction
The TPE algorithm is based on a SMBO approach that overcomes the limitations of computationally expensive and inefficient random search and grid search algorithms. The inputs required by the TPE algorithm are parameters (x) and loss (y) based on the prior search history to deduce hyperparameters for the next trial. The input pair (parameters and loss) is split into two densities ( (x) and g(x)) based on the loss of the historical data by a γ quantile (2) of the search result. The TPE defines this split p(x|y) using the two densities according to the following equations: and where (x) should be maximized as it represents the density formed from the observations x (i) that correspond to the loss (y) smaller than the target performance (y * ), whereas g(x) should be minimized as it represents the density formed by using the remaining observations. Therefore, the output of the TPE algorithm can be simply characterized as the ratio between g(x) and (x). The idea of the TPE algorithm is based on maximizing the expected improvement (EI y * (x)) computed by (3) based on convergence of loss (y) and target performance (y * ) in subsequent trials where p(x) can be written as . (4) Therefore, we can deduce Finally combining (4) and (5), we obtain The expression in (6) is a product of two terms of which the second term is independent of "x" and hence the expression is directly proportional to the first term and implies that to maximize improvement points "x" with high probability under (x) and low probability under g(x) are desirable. Also, this expression shows that the ratio g(x) (x) should be minimized in order to increase the expected improvement.

B. SL Algorithm for Automated Stacked Ensemble Modeling
The SL algorithm can be used for ensemble modeling by considering a set of diverse modeling algorithms as base learners and selecting an optimal ensemble through V-fold cross-validation. We first define a library "L" of "K" modeling algorithms trained to solve a regression problem of estimating the expectation of target A given the observed variable A . The optimal value of the parameter of interest ψ 0 (A) that is required to be estimated and the related loss function L(O, ψ) that indicates the difference between the observed and the predicted value (squared error loss) can be written as Thus, the optimal values of the parameter of interest for the "K" individual algorithms included in the library "L" can be written as ψ k (A) where k = 1, 2, ….., K. The number of algorithms "K" to be considered in the library is dependent on the sample size of the data. Note that all the considered algorithms estimate the same parameter ψ k (A) but may use different subsets of A , different basis function, estimation procedures, and range of tuning parameters. The identification of the best predicting algorithm from the library over the data distribution (P o ) is determined by minimizing the expected risk difference d n (ψ k , ψ 0 ) , i.e., The use of the same data to estimateψ k (A) and d n (ψ k , ψ 0 ) generates bias in the estimation of the true risk to determine the best algorithm. Thus a V-fold cross-validation selector is used for unbiased estimation of the risk. Given the empirical distributions for the training and validation sets for each V-fold, the cross-validation selector can be defined as where P 0 n,C n and P 1 n,C n are the distributions of training set (C n (i) = 0) and validation set (C n (i) = 1), respectively.

IV. PROPOSED APPROACH
The flowchart of the proposed approach given in Fig. 2 shows the process of integrated implementation of TPE and SL algorithms to create an ML pipeline for AGB prediction. In the following subsections, the pre-processing of Sentinel-2 data, implementation of TPE and SL algorithms, and model explanation parameters are described in detail.
The extraction of targeted spectral features from Sentinel-2 data for forest AGB prediction requires a comprehensive understanding of the dynamic changes of optical properties with respect to AGB. Vegetation indices are standardized yet retain bias attributed to their pre-defined equations. The TPE algorithm provides flexibility to the spectral features and fits spectral bands to empirical equations based on dynamics with AGB. The TPE algorithm supports spectral bands in parameter search space and composes the spectral features in computationally efficient manner. This makes TPE algorithm an optimal choice for producing target-oriented spectral features from empirical equations. This process was followed by the deployment of the SL algorithm that uses K-fold crossvalidation to estimate the performance of multiple ML models (base learners) and the same model with different hyperparameter settings. This creates an optimal weighted average of base learners (i.e., an ensemble model) that improves prediction accuracy, avoids overfitting, and minimizes parametric assumptions. Thus, the SL algorithm becomes a suitable choice to achieve the objective of stacked ensemble modeling for prediction of forest AGB.

A. Data Preprocessing
The Sentinel-2 images were acquired in Level-1C (top of the atmosphere reflectance) format and converted to Level-2A (bottom of the atmosphere reflectance) format including atmospheric and terrain correction using the Sen2cor processor [66]. The spectral bands at 20 m spatial resolution were resampled at 10 m using the nearest neighbor sampling method for spatial consistency and performing computations. The preprocessed data were used for extracting plot reflectance and preparing season-wise analysis-ready data frames using "rgeos" and "rgdal" packages of R software. The analysis-ready data frames of each season consisted of plot reflectance values for each spectral band for all sample plots.

B. Implementation of Algorithms
The two core frameworks of the proposed approach are based on TPE and SL algorithms implemented sequentially to produce a robust ML pipeline for AGB prediction. First, the TPE algorithm generates automatically optimized features from the preprocessed analysis-ready data frames. Later, these optimized features and the response variable are used for training the SL algorithm that automates the process of training a large number of base models and performs stacked ensembling to produce a leaderboard of models. The TPE algorithm was implemented using "Optuna" framework that was developed in [67] and the SL algorithm was implemented using "H2O-3" framework that was developed in [62]. The sequential implementation of the two frameworks is explained in detail in the following paragraphs.
The implementation of the TPE algorithm to extract features from the input spectral bands requires a model, i.e., an empirical equation for which it can generate a set of parameters and select a combination of optimal spectral bands to accurately predict the Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.  Table II. The library of 33 empirical equations (indexed as I n ) framed from an exhaustive database of 500+ spectral indices [68] was available from [69]. The TPE randomly selects an empirical equation and generates parameters depending on the number of spectral bands (N SB ) and the coefficients (CF = α, β, γ, ρ, and σ) in the equation. A generalized linear regression model is fitted to select the optimal spectral bands and coefficient values to maximize the coefficient of determination (objective function). The coefficient of determination enables quantifying the proportion of the variance in extracted optimized feature w.r.t the target AGB values. The coefficient of determination should be maximized to determine the best fit in terms of spectral bands and coefficient values for the empirical equations. The TPE initialization provides a pair of parameters and loss as stated in Section III-A that are split into two densities ( (x) and g(x)) as per (1) and (2). The ratio of the two densities should be minimized (thus maximizing the objective function (R 2 ) in our case) through an iterative process and the best empirical equation is determined with the related optimum set of parameters (spectral bands, coefficient value). However, the random initialization creates a selection bias in the TPE algorithm that favors the empirical equations with less number of parameters, i.e., empirical equations involving less spectral bands. This selection bias problem was resolved using a meta-learning strategy by performing optimization process in groups. In total, the empirical equations were divided into five groups based on the number of spectral bands (N SB = 2, 3, 4, 5, 6) and 1000 iterations were performed for each group. The best empirical equation in each group was identified and after performing three repetitions of this process, the overall best empirical equation was identified. The selected empirical equation, optimal spectral bands, and coefficient values were used to compute the features and directed to the subsequent framework.
The SL framework receives the computed TPE-optimized automatic features and the target response variable as the input training frame. As previously mentioned, we used the H2O-3 platform to deploy the SL algorithm which consists of a library of base models stated in Table III. Additional details regarding the specifications of the base models and model parameters can be accessed from H2O.ai documentation (https://docs.h2o.ai/). The SL algorithm exploits the base models and derives an optimal SEM that minimizes the expected risk difference as per (10) and (11). In order to achieve the defined objective of this study, we used all the available base models in the library (see Table III). All base models were trained until their convergence with five-fold cross-validation on a 128 GB NVIDIA GeForce GTX 3090 GPU and Linux-based operating system. However, the maximum number of models and maximum runtime for models can be specified before the training process depending on the available computational budget and time restrictions.

C. Model Explanations
The process of sequential deployment of TPE and SL frameworks followed in the proposed approach results in TPEoptimized features, model rankings, model predictions, and prediction assessment metrics. Model explanations such as model rankings, evaluation statistics, regression scatterplots, and feature importance chart are provided as results. The performance of the base models and the SEM were evaluated on the basis of different evaluation metrics and ranked according to the model agreement score (coefficient of determination). All models were cross-validated using five-fold cross-validation method. The metrics used to evaluate these models are coefficient of determination (R 2 cv ), root mean squared error (RMSE), root mean squared log error (RMSLE), mean absolute error (MAE), mean absolute percentage error (MAPE), and relative absolute error (RAE). The R 2 cv shows the goodness-of-fit of the regression model, the RMSE shows the standard deviation of the residuals, RMSLE shows the log-transformed standard deviation of residuals, MAE shows the mean of absolute values of residuals, MAPE shows the accuracy of prediction as a percentage, and RAE shows accuracy of measurements relative to the range of the observed variable.

A. TPE Optimized Features and Feature Importance
The features for the SL framework were extracted based on the hyperparameter optimization results of the TPE algorithm. The empirical equations with respective spectral bands and coefficient values for each season selected post-TPE optimization procedure are given in Table IV. The selected empirical equations have been referenced using index numbers (I n ) mentioned in Table II. The spectral bands selected for the respective equations have been referenced w.r.t their central wavelengths (in nm). The corresponding Sentinel-2 spectral band for each central wavelength can be identified from Copernicus website (https://sentinels.copernicus.eu/). The empirical equations that had multiple selected combinations of spectral bands were referenced as "a," "b," and "c." In total, 24 extracted features were selected for training the base models for AGB prediction following the TPE-based optimization procedure. Particularly, there were six features from autumn, seven from spring, six from summer, and five from winter seasons. Interestingly, all extracted features consisted of spectral bands from the vegetation red-edge spectrum (VRE) and short-wave infrared (SWIR) spectrum of the Sentinel-2 data. The TPE algorithm that was conditional on the target variable for the optimization process suggested an important role of the VRE and SWIR spectral bands for modeling forest AGB. Precisely, the SWIR band at λ c = 1610 were selected for 21 out of the 24 extracted features and the contribution of VRE was distributed at λ c = 705, 740, and 865. The season-wise analysis indicated that VRE at λ c = 740 and SWIR at λ c = 1610 was selected for all autumn features. Also, SWIR at λ c = 1610 was selected for five features and λ c = 2190 for the remaining two features. For the summer season, VRE at λ c = 705 and 865 were selected for five out of the total six features and SWIR at λ c = 1610 for all features. Thus, the clear observable pattern in the specific spectral contributions of the extracted features accounts for their target-oriented properties.
The stacked bar chart in Fig. 3 shows the computed feature importance for all the considered base models. The features have been referenced as "index_season" and the scaled feature importance values of the base models have been color coded for analyzing the percentage contributions. The highest average feature importance was observed for "12a_sum" (79%) that was extracted with λ c = 705, 490, and 665. The feature achieved specific feature importance of 74% for DNN, 24% for DRF, and 100% for GBM, XGBoost, and GLM. This indicated that feature "12_sum" contributed significantly for most of the base models and has a dominant role in modeling AGB. A total of five features for DNN (26_spr, 12a_sum, 8b_sum, 8a_sum, 12b_sum), two features for DRF (28c_aut, 33_sum), five features for GBM  IV  TPE OPTIMIZED FEATURES FOR ALL SEASONS (12a_sum, 32_aut, 23_aut, 28b_aut, 7_sum), three features for XGBoost (12a_sum, 7_sum, 12b_sum), and five features for GLM (12a_sum, 7_sum, 33_sum, 23_aut, 17_aut) were characterized by feature importance score greater than 50%. The feature "12a_sum" achieved the highest feature importance for three (GBM, XGBoost, and GLM) out of the five base models and the features '26_spr' and '28b_aut' achieved highest feature importance for DNN, and DRF, respectively. All these features were associated with either summer, autumn, or spring season indicating comparatively low importance of winter features in predicting AGB. Overall, the summer features were the most dominant with a total of 12 out of 20 features with a feature importance greater than 50%.

B. Model Leaderboard and Predictive Analysis
The H2O-3-based AutoML framework used to implement the SL algorithm trains all the base models (see Table III) and also produces an optimal SEM for predicting AGB. The total computational time for executing the SL algorithm to produce the modeling results was 1215 s.
The model ranking results and computed assessment metrics shown in Table V indicated that SEM achieved the best overall performance for predicting forest AGB. The SEM model achieved the highest agreement of R 2 cv = 0.71 and the best prediction precision RMSE = 74.44 Mg ha −1 . The RMSLE = 0.56 was equal for the top three ranked models despite a significant difference in the RMSE indicating the existence of outliers in the data. This explained the slightly higher MAE of the SEM despite achieving better model agreement as compared to DNN and DRF. The same explanation is applicable to the DNN and DRF which have a significant difference in the MAE and MAPE despite an identical RMSLE score. This suggests that SEM is more robust to outliers and the meta-learning process enabled the model to achieve better model fitting. The DNN model achieved the second-best performance on the leaderboard and the overall results suggested DNN to be more prone to outliers as compared to SEM.   The RAE score for the SEM model was 0.43 and successively increased to 0.75 for other models on the leaderboard. This indicated the lowest saturation from SEM and successive increment in saturation for other models to predict AGB. The low RAE score for SEM and DNN models shows that these models can predict greater AGB values with better accuracy as compared to the other models. This is also evident from the scatterplots of the regression models shown in Fig. 4. The scatterplots clearly indicate that SEM, DNN, and DRF models efficiently predicted large values of AGB as compared to GBM, XGBoost, and GLM models. Thus, the proposed automated ML pipeline yielded SEM that outperformed single base models and efficiently modeled AGB. The results also pointed out the crucial role of the meta-learning process in modeling AGB for eliminating uncertainties associated with the data and producing robust models.
The forest plot-type-based analysis of model performance was carried out by independently computing RMSE for broadleaves, coniferous, and mixed plots. The results (shown in Table VI) indicate that the SEM model (which showed the best overall performance) was sensitive to the type of forest plot. Overall, the least RMSE errors were observed for the mixed-type plots and the highest RMSE errors were recorded for broadleaf plots. This type of behavior was consistent with the results from previous studies that obtained more accurate AGB estimations results for conifers as compared to the broadleaves using spectral data

VI. DISCUSSION
In this article, we have proposed an automated ML pipeline for developing an SEM for the prediction of forest AGB using multitemporal Sentinel-2 data. The key elements of the ML pipeline were the TPE and SL algorithms that automated the process of feature extraction from the data and training a library of base models leading to the development of a stacked ensemble for modeling AGB. The automated ML pipeline was proposed for dealing with various issues identified in the literature related to the human-bias in AGB modeling and systematic evaluation of models. In this section, we extensively analyze our results with respect to the contemporary literature and precisely identify the advancements delivered by the study for AGB modeling using satellite remote sensing data.
The choice of features is very crucial in any modeling process and the performance of the models is highly dependent on their quality. The use of MS data for AGB modeling has led to the development and testing of various features. Practically, there are numerous combinations of spectral bands that could be used for extracting a feature from MS data for modeling AGB. Studies in literature use a few standard vegetation indices as features for modeling and mapping forest AGB [70], [71], [72], [73].
A comparative analysis of these studies indicates an unstable response of vegetation indices depending upon the spatial resolution, sensor specifications, and available spectral bands of the data. To state simply, a vegetation index identified as effective for a particular study area or data with a certain radiometric specification may not be as effective for other study areas or radiometric specifications. Thus, the human-intelligence-based selection of these vegetation indices can be highly inefficient and ambiguous for AGB modeling. In order to overcome this issue, our study proposed an automated mechanism for extracting such features. The TPE algorithm-based optimization procedure extracts features that are highly target-oriented and involves less human intervention. It is capable of overcoming the shortcomings identified in the literature with regard to extraction and selection of effective features from satellite MS data for AGB modeling. The reduction of human-bias from the process enables the development of more robust and reliable features. Moreover, it performs necessary changes in composing the features based on the specifications of the data such as spatial, spectral, and radiometric resolution. Thus, our study demonstrated the success of the proposed automated approach that led to consistent and accurate modeling results.
The second component of the modeling process after developing robust features is the choice of a modeling algorithm. The type of modeling algorithm substantially affects the precision and accuracy of the predictions. As per literature, there are several studies that use different ML models for predicting forest AGB [7], [57], [58] and provide a comparative assessment of the models in order to identify the best modeling algorithm for AGB prediction. Apart from the range of ML models that can be used, each model has associated hyperparameters that require tuning. A faulty hyperparameter tuning can adversely affect the performance of an ML model and restrict its generalization capability. Moreover, studies that use the same ML models but with a different combination of hyperparameters or architecture sometimes lead to contrasting performance on an identical task. For example, the study in [58] identified XGBoost to deliver the best performance as compared to the other competing models. However, the XGBoost model has multiple associated hyperparameters that were defined manually for finding the best combination of hyperparameters using a grid search. This introduces a human-bias in the process and reduces the chances of reproducing the results for other data or scenarios. To deal with this problem, our study proposed the use of an AutoML approach that effectively automates the iterative tasks associated with the development of a model. Precisely, it automates the process of hyperparameter selection and a range of ML models are trained in the same pipeline for effective comparison and reproducible results.
Studies have identified that a combination of models (stacked ensemble) can produce more efficient results as compared to a single model [74], [75]. There are only a few studies that focused on using SEMs for remote-sensing-based forest applications [6], [76]. These studies used a manual or semi-automatic approach for identifying an optimal combination of models for AGB prediction. An optimal SEM requires a library of diverse models and a systematic algorithm to evaluate the combination of models with optimized hyperparameters and generate a meaningful explanation of models. The solution for this complex problem was provided by the proposed use of SL algorithm implemented using H2O-3 framework that produced a stacked ensemble of base models. The developed SEM model in this study delivered the best performance as compared to the individual base models. Moreover, it limited the required number of user-defined parameters reducing human-bias in the ensemble selection. Thus, a reliable and automated pipeline for robust stacked ensembling and model training was established in this study.

VII. CONCLUSION AND FUTURE WORK
This article proposed an end-to-end ML pipeline for modeling forest AGB using MT-MS satellite remote sensing data. The results demonstrated that reducing human-bias from the modeling process and deploying a comprehensive model evaluation strategy under a single framework can provide better model explanations. The derived model explanations can be instrumental to frame effective schemes for accurately mapping AGB on large areas with diverse forest characteristics. Moreover, instead of using predefined features for modeling, an automated optimization procedure can produce more effective features by weighing more on the spectrum of the data that holds greater importance in explaining the target AGB. The future developments of this work can possibly aim at improving the robustness of the proposed pipeline by the addition of optimization elements and by replacing or improving the deployed meta-learning strategies. The advances in the latest AutoML systems can be possibly incorporated to derive additional model explanations for framing better modeling schemes for large-scale AGB mapping. These advanced AutoML systems could be possibly deployed at a global scale to develop models that can handle a greater diversity of forest types, species, and geographical conditions.