BB.30_2 Dataset
==============

Description
-----------
* Target Soil Properties: SOC, pH, Clay
* Groups of Features: DEM, ERa, Gamma, RSS, VI
* Sample size: 30
* Number of Features: 13
* Coordinates: With coordinates (EPSG: 25833)
* Location: Brandenburg, Germany
* Sampling Design: Regular grid sampling
* Study Area Size: 1.4 ha
* Geological Setting: Pleistocene young morainic landscape of the Weichselian glaciation with predominantly glacial sand
* Previous Data Publication: None
* Contact Information:
    * Pablo Rosso (Pablo.Rosso@zalf.de), Leibniz Centre for Agricultural Landscape Research (ZALF)
    * Sebastian Vogel (SVogel@atb-potsdam.de), Leibniz Institute for Agricultural Engineering and Bioeconomy (ATB)
* License: CC BY-SA 4.0
* Publication/Modification Date (d/m/y): 28.02.25, version 1.0
* Changelog:
    * Version 1.0 (28.02.25): Initial release

Details
-------

Dataset
^^^^^^^
The dataset contains the following target soil properties and features:

Target Soil Properties:
""""""""""""""""""""""

SOC - Soil Organic Carbon
    * Code: ``SOC_target``
    * Unit: %
    * Protocol: Measured CO₂ release during dry combustion after removing inorganic carbon with an acid (DIN ISO 10694)
    * Sampling Date: September 2022
    * Sampling Depth: 0 - 30 cm

pH
    * Code: ``pH_target``
    * Unit: Unitless
    * Protocol: Measured in CaCl₂ suspension with a glass electrode with a 5:1 liquid:soil volumetric ratio (DIN ISO 10390)
    * Sampling Date: September 2022
    * Sampling Depth: 0 - 30 cm

Clay
    * Code: ``Clay_target``
    * Unit: %
    * Protocol: Sieve-Pipette method, measured through fractioning the soil into the sand fractions by sieving, and the silt and clay fractions by sedimentation in water, German adaptation (DIN ISO 11277)
    * Sampling Date: September 2022
    * Sampling Depth: 0 - 30 cm

Groups of Features:
"""""""""""""""""

DEM – Digital Elevation Model and Terrain Parameters
    * Number of Features: 2
    * Code(s): ``Altitude``, ``Slope``
    * Unit: ``Altitude`` in m, ``Slope`` in °
    * Sensing: Digital elevation model raster (5 m) based on LiDAR and photogrammetry from "GeoBasis-DE/LGB"
    * Processing: Calculating ``Slope`` with ``terrain`` function of the raster R-package, extracting DEM values from raster at soil sampling locations
    * Sampling Date: LiDAR March 2011, images for photogrammetry May 2022

ERa – Apparent Electrical Resistivity
    * Number of Features: 1
    * Code(s): ``ERa``
    * Unit: Ω m
    * Sensing: Array of multiple rolling electrodes (Geophilus company, Caputh, Germany) on RapidMapper platform with exploration depth of 0 - 50 cm, in-situ
    * Processing: Ordinary Kriging to align sensing- with soil sampling locations
    * Sampling Date: September 2022

Gamma
    * Number of Features: 5
    * Code(s): ``G_Total_Counts``, ``G_K``, ``G_U``, ``G_Th``, ``G_Cs``
    * Unit: Unitless
    * Sensing: Passive gamma sensor (MS-2000-CsI-MTS, Medusa Radiometrics BV, Groningen, Netherlands) on RapidMapper platform, in-situ
    * Processing: Ordinary Kriging to align sensing- with soil sampling locations
    * Sampling Date: September 2022

RSS – Remote Sensing Derived Spectral Data
    * Number of Features: 1
    * Code(s): ``B04``
    * Unit: Unitless
    * Sensing: Sentinel-2 bare soil image (Level-2A) from "Copernicus Open Access Hub"
    * Processing: Extracting RSS values from raster at soil sampling locations, selecting single band due to low sample size
    * Sampling Date: September 2022

VI - Vegetation Indices
    * Number of Features: 1
    * Code(s): ``NDVI``
    * Unit: Unitless
    * Sensing: Sentinel-2 image during vegetative period (Level-2A) from "Copernicus Open Access Hub"
    * Processing: Calculating ``NDVI`` as (B08 - B04) / (B08 + B04), extracting VI values from raster at soil sampling locations
    * Sampling Date: April 2023

Examples
--------

.. code-block:: python

    from LimeSoDa import load_dataset, split_dataset
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score, mean_squared_error
    import numpy as np

    # Load and explore the dataset
    data = load_dataset("BB.30_2")
    dataset = data["Dataset"]
    folds = data["Folds"]
    coords = data["Coordinates"]

    # Split into train/test using fold 1
    X_train, X_test, y_train, y_test = split_dataset(
        data=data,
        fold=1,
        targets=["pH_target", "SOC_target", "Clay_target"]
    )

    # Fit model and get predictions
    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    # Calculate performance metrics
    r2 = r2_score(y_test, predictions)
    rmse = np.sqrt(mean_squared_error(y_test, predictions))
    print(f"R-squared: {r2:.7f}")
    print(f"RMSE: {rmse:.7f}")

References
----------