H.138 Dataset ============= Description ----------- * Target Soil Properties: SOC, pH, Clay * Groups of Features: MIR * Sample size: 138 * Number of Features: 2,489 * Coordinates: With coordinates (EPSG: 32649) * Location: Hubei, China * Sampling Design: Two sampling designs: (1) adapted latin hypercube sampling taking into account legacy samples, correlation and accessibility and (2) uncertainty guided sampling based on uncertainty predictions from a random forest model (Stumpf et al. 2017) * Study Area Size: 420 ha * Geological Setting: Sedimentary rocks, mainly dolomite with silt and limestone formed in the middle and lower Jurassic * Previous Data Publication: Full dataset published in Wadoux et al. (2024) * Contact Information: * Alexandre M.J.-C- Wadoux (Alexandre.Wadoux@inrae.fr), French National Institute for Agriculture, Food, and Environment (INRAE) * License: CC BY-SA 4.0 * Publication/Modification Date (d/m/y): 28.02.25, version 1.0 * Changelog: * Version 1.0 (28.02.25): Initial release Details ------- Dataset ^^^^^^^ The dataset contains the following target soil properties and features: Target Soil Properties: """""""""""""""""""""" SOC - Soil Organic Carbon * Code: ``SOC_target`` * Unit: % * Protocol: Determined by the difference of total carbon and inorganic carbon, where total carbon was obtained through elemental analysis by measuring the CO₂ release during dry combustion (DIN ISO 10694) without acid pretreatment and inorganic carbon as 0.12 x the calcium carbonate content, determined by the gas-volumetric Scheibler Method (ISO 10693) * Sampling Date: June 2013, May, 2014 and November 2014 * Sampling Depth: 0 - 20 cm pH * Code: ``pH_target`` * Unit: Unitless * Protocol: Measured in water suspension with a glass electrode with unspecified liquid:soil ratio * Sampling Date: June 2013, May, 2014 and November 2014 * Sampling Depth: 0 - 20 cm Clay * Code: ``Clay_target`` * Unit: % * Protocol: Measured through fractioning the soil into the sand fractions by sieving, and the silt and clay fractions by x-ray sedimentation * Sampling Date: June 2013, May, 2014 and November 2014 * Sampling Depth: 0 - 20 cm Groups of Features: """"""""""""""""" MIR – Mid Infrared Spectroscopy * Number of Features: 2,489 * Code(s): ``wn_5397.9``, ``wn_5396``, ``wn_5394`` ... ``wn_599.8`` * Unit: % (Reflectance) * Sensing: VERTEX 70v FT-IR Spectrometer (Bruker Optik, Ettlingen, Germany), on dried and sieved samples (<2 mm) in the laboratory, spectral range was 7,500 - 370 cm^-1 at 0.4 cm^-1 intervals * Processing: Discarding irrelevant spectral data of the spectrum (7,500 - 5,397.9 cm^-1) and noisy edges of the spectrum (599.8 - 370 cm^-1) * Sampling Date: June 2013, May, 2014 and November 2014 * Spectral Information (After Data Processing): * Data Representation: Wavenumber (in cm^-1) * Spectral Resolution: ~2 cm^-1 * Spectral Range: 5,397.9 – 599.8 cm^-1 Examples -------- .. code-block:: python from LimeSoDa import load_dataset, split_dataset from sklearn.linear_model import LinearRegression from sklearn.metrics import r2_score, mean_squared_error import numpy as np # Load and explore the dataset data = load_dataset("H.138") dataset = data["Dataset"] folds = data["Folds"] coords = data["Coordinates"] # Split into train/test using fold 1 X_train, X_test, y_train, y_test = split_dataset( data=data, fold=1, targets=["pH_target", "SOC_target", "Clay_target"] ) # Fit model and get predictions model = LinearRegression() model.fit(X_train, y_train) predictions = model.predict(X_test) # Calculate performance metrics r2 = r2_score(y_test, predictions) rmse = np.sqrt(mean_squared_error(y_test, predictions)) print(f"R-squared: {r2:.7f}") print(f"RMSE: {rmse:.7f}") References ---------- Wadoux, A. M. J.-C., Stumpf, F., & Scholten, T.. (2024). A catchment-scale dataset of soil properties and their mid-infrared spectra. Zenodo repository. https://doi.org/10.5281/zenodo.14557348 Stumpf, F., Schmidt, K., Goebes, P., Behrens, T., Schönbrodt-Stitt, S., Wadoux, A., Xiang, W. & Scholten, T. (2017). Uncertainty-guided sampling to improve digital soil maps. Catena, 153, 30-38.