H.138 Dataset

Description

  • Target Soil Properties: SOC, pH, Clay

  • Groups of Features: MIR

  • Sample size: 138

  • Number of Features: 2,489

  • Coordinates: With coordinates (EPSG: 32649)

  • Location: Hubei, China

  • Sampling Design: Two sampling designs: (1) adapted Latin hypercube sampling that accounts for legacy samples, correlation and accessibility, and (2) uncertainty-guided sampling based on uncertainty predictions from a random forest model (Stumpf et al. 2017)

  • Study Area Size: 420 ha

  • Geological Setting: Sedimentary rocks, mainly dolomite with silt and limestone, formed in the Middle and Lower Jurassic

  • Previous Data Publication: Full dataset published in Wadoux et al. (2024)

  • Contact Information:
    • Alexandre M.J.-C. Wadoux (Alexandre.Wadoux@inrae.fr), French National Institute for Agriculture, Food, and Environment (INRAE)

  • License: CC BY-SA 4.0

  • Publication/Modification Date (d/m/y): 28.02.25, version 1.0

  • Changelog:
    • Version 1.0 (28.02.25): Initial release

Details

Dataset

The dataset contains the following target soil properties and features:

Target Soil Properties:

SOC - Soil Organic Carbon
  • Code: SOC_target

  • Unit: %

  • Protocol: Calculated as total carbon minus inorganic carbon. Total carbon was obtained by elemental analysis, measuring the CO₂ released during dry combustion (DIN ISO 10694) without acid pretreatment. Inorganic carbon was taken as 0.12 × the calcium carbonate content, determined by the gas-volumetric Scheibler method (ISO 10693)

  • Sampling Date: June 2013, May 2014 and November 2014

  • Sampling Depth: 0 - 20 cm
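
The SOC protocol above amounts to simple arithmetic, sketched below. The function name and the input values are hypothetical and only illustrate the calculation; they are not part of the dataset or its processing code. The factor 0.12 is the approximate carbon mass fraction of CaCO₃ (about 12.01 / 100.09).

```python
def soc_from_total_carbon(total_c_pct, caco3_pct, c_fraction=0.12):
    """Return SOC (%) as total carbon minus inorganic carbon.

    Inorganic carbon is approximated as c_fraction * CaCO3 content,
    following the protocol described above.
    """
    inorganic_c_pct = c_fraction * caco3_pct
    return total_c_pct - inorganic_c_pct


# Illustrative values only, not taken from the dataset
soc = soc_from_total_carbon(total_c_pct=2.0, caco3_pct=5.0)
print(f"SOC: {soc:.2f} %")
```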

pH
  • Code: pH_target

  • Unit: Unitless

  • Protocol: Measured in a water suspension with a glass electrode; the liquid:soil ratio is unspecified

  • Sampling Date: June 2013, May 2014 and November 2014

  • Sampling Depth: 0 - 20 cm

Clay
  • Code: Clay_target

  • Unit: %

  • Protocol: Measured by fractionating the soil, with the sand fractions separated by sieving and the silt and clay fractions determined by X-ray sedimentation

  • Sampling Date: June 2013, May 2014 and November 2014

  • Sampling Depth: 0 - 20 cm

Groups of Features:

MIR – Mid Infrared Spectroscopy
  • Number of Features: 2,489

  • Code(s): wn_5397.9, wn_5396, wn_5394, ..., wn_599.8

  • Unit: % (Reflectance)

  • Sensing: VERTEX 70v FT-IR Spectrometer (Bruker Optik, Ettlingen, Germany), on dried and sieved samples (<2 mm) in the laboratory; spectral range 7,500 - 370 cm^-1 at 0.4 cm^-1 intervals

  • Processing: Discarded the irrelevant part of the spectrum (7,500 - 5,397.9 cm^-1) and the noisy edge of the spectrum (599.8 - 370 cm^-1)

  • Sampling Date: June 2013, May 2014 and November 2014

  • Spectral Information (After Data Processing):
    • Data Representation: Wavenumber (in cm^-1)

    • Spectral Resolution: ~2 cm^-1

    • Spectral Range: 5,397.9 – 599.8 cm^-1
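
The MIR column codes carry the wavenumber after the `wn_` prefix, so the spectral axis can be recovered from the column names. The following is a minimal sketch, assuming every MIR column follows the `wn_<value>` pattern shown above; the helper name is hypothetical and the short column list stands in for the full set of 2,489 columns. The last lines derive the mean spacing implied by the stated spectral range and number of features, which is consistent with the stated resolution of about 2 cm^-1.

```python
def wavenumbers_from_columns(columns, prefix="wn_"):
    """Extract wavenumbers (cm^-1) from column names such as 'wn_5397.9'."""
    return [float(name[len(prefix):]) for name in columns if name.startswith(prefix)]


# Stand-in for the dataset's column names (targets plus a few MIR columns)
columns = ["SOC_target", "pH_target", "Clay_target",
           "wn_5397.9", "wn_5396", "wn_5394", "wn_599.8"]
wavenumbers = wavenumbers_from_columns(columns)
print(wavenumbers)

# Mean spacing implied by the documented range and number of features
n_features = 2489
mean_spacing = (5397.9 - 599.8) / (n_features - 1)
print(f"Mean spacing: {mean_spacing:.2f} cm^-1")
```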

Examples

from LimeSoDa import load_dataset, split_dataset
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

# Load and explore the dataset
data = load_dataset("H.138")
dataset = data["Dataset"]
folds = data["Folds"]
coords = data["Coordinates"]

# Split into train/test using fold 1
X_train, X_test, y_train, y_test = split_dataset(
    data=data,
    fold=1,
    targets=["pH_target", "SOC_target", "Clay_target"]
)

# Fit model and get predictions
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Calculate performance metrics (scikit-learn averages over targets when several are passed)
r2 = r2_score(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(f"R-squared: {r2:.7f}")
print(f"RMSE: {rmse:.7f}")
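
Because the example above fits three targets at once, the reported scores are averaged over targets with different units whenever `y_test` holds all three as columns. A per-target breakdown can be easier to interpret. The sketch below computes RMSE and R² column by column in plain Python; the function name and the toy values are hypothetical, and the column order is assumed to match the order of the target names passed in.

```python
import math


def per_target_metrics(y_true, y_pred, names):
    """Return {name: (rmse, r2)} computed column by column.

    y_true and y_pred are sequences of rows, one value per target.
    """
    results = {}
    for j, name in enumerate(names):
        obs = [row[j] for row in y_true]
        pred = [row[j] for row in y_pred]
        mean_obs = sum(obs) / len(obs)
        ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
        ss_tot = sum((o - mean_obs) ** 2 for o in obs)
        rmse = math.sqrt(ss_res / len(obs))
        r2 = 1.0 - ss_res / ss_tot  # undefined if all observations are equal
        results[name] = (rmse, r2)
    return results


# Toy values for illustration only (rows = samples, columns = pH, SOC)
y_true = [[6.0, 1.0], [7.0, 2.0], [8.0, 3.0]]
y_pred = [[6.0, 1.5], [7.0, 2.0], [8.0, 2.5]]
metrics = per_target_metrics(y_true, y_pred, ["pH_target", "SOC_target"])
for name, (rmse, r2) in metrics.items():
    print(f"{name}: RMSE={rmse:.3f}, R2={r2:.3f}")
```

With the real data, `y_test` and `predictions` could be converted to lists of rows first, for example with `np.asarray(y_test).tolist()`.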

References

Wadoux, A. M. J.-C., Stumpf, F., & Scholten, T. (2024). A catchment-scale dataset of soil properties and their mid-infrared spectra. Zenodo repository. https://doi.org/10.5281/zenodo.14557348

Stumpf, F., Schmidt, K., Goebes, P., Behrens, T., Schönbrodt-Stitt, S., Wadoux, A., Xiang, W. & Scholten, T. (2017). Uncertainty-guided sampling to improve digital soil maps. Catena, 153, 30-38.