Dataset Overview¶
Here we will go through the provided data and explore the structure and metadata.
For examples of how to use the data in machine learning and deep learning applications, see the next section.
import polars as pl
import geopandas as gpd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import tifffile
import plotly.io as pio
import disfor
pio.renderers.default = "sphinx_gallery"
DISFOR provides four datasets:
- samples.parquet: location and metadata of the sampled points
- labels.parquet: labels for each sampled time-series
- pixel_data.parquet: Sentinel-2 band data for each acquisition in the time-series
- Sentinel-2 chips: image chip time-series for each sample
These datasets are not included in this repository due to their size, but are available for download on Hugging Face.
This package provides a data getter (disfor.get()), which downloads the appropriate files.
We will now go into the different datasets with more detail.
Samples dataset¶
This is a geoparquet file which can be read with geopandas or QGIS.
samples = gpd.read_parquet(disfor.get("samples.parquet"))
samples
| sample_id | original_sample_id | interpreter | dataset | source | source_description | s2_tile | cluster_id | cluster_description | comment | confidence | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | vij | 1 | EFFIS | Evoland Project, EFFIS Source of Wildfire Poly... | 30SUF | 0.0 | Damage polygons | border, thinning, then clear cut | high | POINT (-4.12212 36.74179) |
| 1 | 1 | 1 | vij | 1 | EFFIS | Evoland Project, EFFIS Source of Wildfire Poly... | 30SUF | 1.0 | Damage polygons | unclear progression, edge | medium | POINT (-4.12161 36.74231) |
| 2 | 2 | 2 | vij | 1 | EFFIS | Evoland Project, EFFIS Source of Wildfire Poly... | 30SUF | 2.0 | Damage polygons | plantation | high | POINT (-4.1192 36.74203) |
| 3 | 3 | 3 | vij | 1 | EFFIS | Evoland Project, EFFIS Source of Wildfire Poly... | 30SUF | 3.0 | Damage polygons | plantation | high | POINT (-4.12845 36.75831) |
| 4 | 4 | 4 | vij | 1 | EFFIS | Evoland Project, EFFIS Source of Wildfire Poly... | 30SUF | 5.0 | Damage polygons | clear cut | high | POINT (-4.12816 36.75908) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3823 | 3818 | 16066 | vij | 3 | FORWIND + Copernicus Emergency Service | https://mapping.emergency.copernicus.eu/activa... | <NA> | SI20200205 | Id of the Event, given as ISO2 + Date of storm | <NA> | high | POINT (14.39175 46.2299) |
| 3824 | 3819 | 16067 | vij | 3 | FORWIND + Copernicus Emergency Service | https://mapping.emergency.copernicus.eu/activa... | <NA> | SI20200205 | Id of the Event, given as ISO2 + Date of storm | unclear salvage | high | POINT (14.4004 46.2291) |
| 3825 | 3820 | 16070 | vij | 3 | FORWIND + Copernicus Emergency Service | https://mapping.emergency.copernicus.eu/activa... | <NA> | SI20200205 | Id of the Event, given as ISO2 + Date of storm | <NA> | high | POINT (14.39046 46.24575) |
| 3826 | 3821 | 16071 | vij | 3 | FORWIND + Copernicus Emergency Service | https://mapping.emergency.copernicus.eu/activa... | <NA> | SI20200205 | Id of the Event, given as ISO2 + Date of storm | unclear salvage | high | POINT (14.39247 46.2441) |
| 3827 | 3822 | 16076 | vij | 3 | FORWIND + Copernicus Emergency Service | https://mapping.emergency.copernicus.eu/activa... | <NA> | SI20200205 | Id of the Event, given as ISO2 + Date of storm | <NA> | high | POINT (14.4381 46.25011) |
3822 rows × 12 columns
The dataset provides the following columns:
| Column name | Description |
|---|---|
| sample_id | Unique sample ID for each sample point |
| original_sample_id | Sample ID of the point in the original publication of the dataset |
| interpreter | Shorthand code for the interpreter who labelled this sample |
| dataset | Number of the original sampling campaign in which this point was labelled |
| source | The ancillary data source used to interpret the agent |
| source_description | A long text description of the used source. Link to the original data if available |
| s2_tile | If available, which Sentinel 2 Tile the sample intersects |
| cluster_id | Unique ID to group samples which are spatio-temporally autocorrelated |
| cluster_description | What type of cluster it is |
| comment | Free text comment about the interpretation of the sampled point |
| confidence | Confidence of sampling: high where both timing and agent are confident, medium where only the timing is confident |
| geometry | Coordinates of the sampled point. In CRS EPSG:4326 |
For model training, this samples dataset can be used to subset the data. Some notes on how its columns support this:

- confidence: If the agent is to be modelled, only samples with high confidence should be included. If, however, a disturbance detection algorithm is to be calibrated, both high and medium confidence samples can be included.
- cluster_id: If some data is to be held out during training for validation purposes, this column should be used to avoid high spatial autocorrelation between the train and test sets. The cluster_description column specifies, per value of the dataset column, how samples are clustered: for dataset 1 the samples are clustered by disturbance patch (i.e. there may be multiple samples within the same disturbance patch), for dataset 2 by Sentinel-2 tile, and for dataset 3 by windthrow event.
- comment: The comments provide more context on the sample. They are free text, but two of the more common comments are "border", where the sample is on the border between two areas with different dynamics and thus exhibits high variability in the time-series, and "low TCD", which usually flags Mediterranean forests with low tree cover density.
Following is an interactive overview of where data is available.
samples.explore()
Labels dataset¶
The labels table provides labels for each data sample. These labels designate the start times of different events and segments within the sample time-series.
labels = pl.read_parquet(disfor.get("labels.parquet"))
labels
| original_sample_id | dataset | label | original_label | start | end | sample_id | start_next_label |
|---|---|---|---|---|---|---|---|
| i64 | u8 | u16 | str | datetime[ms, UTC] | datetime[ms, UTC] | u16 | datetime[ms, UTC] |
| 0 | 1 | 110 | "0" | 2016-11-10 00:00:00 UTC | 2022-03-09 00:00:00 UTC | 0 | 2022-04-08 00:00:00 UTC |
| 0 | 1 | 212 | "6" | 2022-04-08 00:00:00 UTC | 2022-04-08 23:59:59 UTC | 0 | 2024-04-02 00:00:00 UTC |
| 0 | 1 | 211 | "5" | 2024-04-02 00:00:00 UTC | 2024-04-02 23:59:59 UTC | 0 | null |
| 1 | 1 | 110 | "0" | 2016-11-10 00:00:00 UTC | 2023-05-03 00:00:00 UTC | 1 | 2023-06-22 00:00:00 UTC |
| 1 | 1 | 211 | "5" | 2023-06-22 00:00:00 UTC | 2023-06-22 23:59:59 UTC | 1 | null |
| … | … | … | … | … | … | … | … |
| 16071 | 3 | 243 | "7" | 2020-03-11 00:00:00 UTC | 2020-03-11 23:59:59 UTC | 3821 | 2020-03-16 00:00:00 UTC |
| 16071 | 3 | 120 | "3" | 2020-03-16 00:00:00 UTC | 2024-12-30 00:00:00 UTC | 3821 | null |
| 16076 | 3 | 110 | "0" | 2017-08-04 00:00:00 UTC | 2020-01-21 00:00:00 UTC | 3822 | 2020-03-11 00:00:00 UTC |
| 16076 | 3 | 243 | "7" | 2020-03-11 00:00:00 UTC | 2020-03-11 23:59:59 UTC | 3822 | 2020-06-29 00:00:00 UTC |
| 16076 | 3 | 123 | "4" | 2020-06-29 00:00:00 UTC | 2024-12-30 00:00:00 UTC | 3822 | null |
The following columns are available:
| Column name | Description |
|---|---|
| sample_id | Taken from sample table |
| original_sample_id | Taken from sample table |
| dataset | Taken from sample table |
| label | Interpreted class of the segment (see next table) |
| original_label | The label which was originally assigned and remapped to label |
| start | Start date of the segment |
| end | End date of the segment |
| start_next_label | Start date of the next label. Some labels are encoded as events (clear cuts, for example) and are not immediately followed by another label; this column allows a full segmentation of the time-series. Null if it is the last label of the sample |
The labelling campaign makes a distinction between events and segments. Temporal segments are periods of a distinct condition of the forest. This can include stable undisturbed forest, re-vegetation or bark beetle decline. Events on the other hand are singular changes of short duration. This includes most other disturbances like clear cutting, wildfire and windthrow events. In practice this distinction means that a single label is set for events, while two labels are set for segments, one for the start of the segment and one for the end.
In this table, labels which are events start at 00:00:00 and end at 23:59:59 of the same day. To enable a gap-free segmentation of the timeline, the start_next_label column is computed: it gives the date at which the next label starts, so each label can be extended up to the start of the following one.
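A minimal sketch of this segmentation logic, using the label rows of sample 0 from the table above: an acquisition date belongs to a label if it falls in the half-open interval [start, start_next_label), with the last label open-ended.

```python
from datetime import date

# Label rows for sample 0 (label, start, start_next_label), as in the table above
labels = [
    (110, date(2016, 11, 10), date(2022, 4, 8)),  # undisturbed forest segment
    (212, date(2022, 4, 8), date(2024, 4, 2)),    # thinning event
    (211, date(2024, 4, 2), None),                # clear cut, last label
]


def label_for(ts: date):
    """Return the label active at acquisition date ts, or None before the first label."""
    for label, start, start_next in labels:
        if start <= ts and (start_next is None or ts < start_next):
            return label
    return None
```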
The labels follow a hierarchical classification scheme:
| Level 1 | Level 2 | Level 3 |
|---|---|---|
| 100 - Healthy Vegetation | 110 - Undisturbed Forest | |
| 120 - Revegetation | 121 - With Trees (after clear cut) | |
| 122 - Canopy closing (after thinning/defoliation) | ||
| 123 - Without Trees (shrubs and grasses, no reforestation visible) | ||
| 200 - Disturbed | 210 - Planned | 211 - Clear Cut |
| 212 - Thinning | ||
| 213 - Forestry Mulching (Non Forest Vegetation Removal) | ||
| 220 - Salvage | 221 - After Biotic Disturbances | |
| 222 - After Abiotic Disturbances | ||
| 230 - Biotic | 231 - Bark Beetle | |
| 232 - Gypsy Moth (temporal segment of visible disturbance) | ||
| 240 - Abiotic | 241 - Drought | |
| 242 - Wildfire | ||
| 243 - Wind | ||
| 244 - Avalanche | ||
| 245 - Flood |
This allows for flexible labelling and classification.
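Because the codes are three-digit numbers, coarser levels can be recovered with integer arithmetic. A small sketch of collapsing level-3 labels to their parents:

```python
def to_level1(label: int) -> int:
    """Collapse a label to its level-1 parent, e.g. 211 (Clear Cut) -> 200 (Disturbed)."""
    return label // 100 * 100


def to_level2(label: int) -> int:
    """Collapse a label to its level-2 parent, e.g. 211 (Clear Cut) -> 210 (Planned)."""
    return label // 10 * 10
```

This makes it cheap to train, for example, a binary disturbed/undisturbed model on the same label column used for fine-grained agent classification.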
Pixel data¶
The pixel_data.parquet table provides Sentinel-2 data for the sample time-series, for use in classification tasks. It also provides pre-computed columns which give the percentage of "clear" pixels for several chip sizes.
pixel_data = pl.read_parquet(disfor.get("pixel_data.parquet"))
pixel_data
| B02 | B03 | B04 | B05 | B06 | B07 | B08 | B8A | B11 | B12 | SCL | sample_id | timestamps | percent_clear_4x4 | percent_clear_8x8 | percent_clear_16x16 | percent_clear_32x32 | clear | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| u16 | u16 | u16 | u16 | u16 | u16 | u16 | u16 | u16 | u16 | u8 | u16 | date | u8 | u8 | u8 | u8 | bool | u16 |
| 103 | 530 | 343 | 1149 | 3651 | 4455 | 4544 | 4675 | 2111 | 1053 | 4 | 0 | 2015-07-29 | 100 | 100 | 100 | 100 | true | null |
| 3320 | 3340 | 3188 | 3580 | 3825 | 3887 | 4240 | 3912 | 3430 | 2714 | 9 | 0 | 2015-08-08 | 0 | 0 | 0 | 0 | false | null |
| 0 | 323 | 72 | 1131 | 3821 | 4624 | 4749 | 4913 | 1934 | 894 | 4 | 0 | 2015-08-18 | 100 | 100 | 100 | 100 | true | null |
| 344 | 756 | 608 | 1416 | 3396 | 4183 | 4084 | 4417 | 2013 | 1018 | 4 | 0 | 2015-08-28 | 100 | 100 | 100 | 100 | true | null |
| 346 | 794 | 539 | 1232 | 3381 | 4128 | 4724 | 4315 | 1715 | 940 | 4 | 0 | 2015-09-17 | 100 | 100 | 92 | 84 | true | null |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 467 | 558 | 628 | 1013 | 1289 | 1394 | 1402 | 1638 | 1782 | 1274 | 5 | 999 | 2024-12-19 | 100 | 100 | 100 | 100 | true | 110 |
| 290 | 431 | 608 | 1044 | 1339 | 1576 | 1454 | 1770 | 2160 | 1524 | 5 | 999 | 2024-12-21 | 100 | 100 | 100 | 100 | true | 110 |
| 216 | 364 | 542 | 1032 | 1311 | 1476 | 1306 | 1648 | 2108 | 1589 | 5 | 999 | 2024-12-24 | 100 | 100 | 100 | 100 | true | 110 |
| 352 | 432 | 596 | 1033 | 1278 | 1549 | 1438 | 1722 | 2290 | 1711 | 5 | 999 | 2024-12-26 | 100 | 100 | 100 | 100 | true | 110 |
| 284 | 415 | 585 | 1044 | 1272 | 1432 | 1312 | 1643 | 2035 | 1628 | 5 | 999 | 2024-12-29 | 100 | 100 | 100 | 100 | true | 110 |
The following columns are available:
| Column name | Datatype | Description |
|---|---|---|
| sample_id | UINT16 | Taken from sample table |
| timestamps | DATE | UTC date of the S2 acquisition |
| label | UINT16 | Interpreted class of the segment, see previous table |
| clear | BOOL | True if the pixel is clear (SCL value any of 2,4,5,6) |
| percent_clear_4x4 [8x8, 16x16, 32x32] | UINT8 | The percentage of clear pixels (SCL in 2,4,5,6) within a 4x4, 8x8, 16x16 or 32x32 pixel image chip |
| B02, B03, B04, B05, B06, B07, B08, B8A, B11, B12 | UINT16 | DN value for the spectral band |
| SCL | UINT8 | Sentinel 2 Scene Classification Value |
Following is an example of a labelled time-series, together with Sentinel-2 image chips.
sample_id = 3733
sample_data = (
pixel_data.filter(
# percent_clear_* columns hold percentages (0-100), not fractions
pl.col.percent_clear_8x8 > 90,
sample_id=sample_id,
)
.select(
"timestamps",
"label",
NDMI=(pl.col.B08.cast(pl.Int32) - pl.col.B11) / (pl.col.B08 + pl.col.B11),
)
.sort("timestamps")
)
def get_rgb_array(sample_id, timestamp, data_path):
arr = (
tifffile.imread(f"{data_path}/{sample_id}/{timestamp}.tif")[
:,
:,
[2, 1, 0],
]
/ 10000
)
gain = 5
rgb = np.clip(arr * gain, 0, 1)
return rgb
images = {}
for row in sample_data.iter_rows(named=True):
images[row["timestamps"]] = get_rgb_array(
sample_id,
row["timestamps"].strftime("%Y-%m-%d"),
# If the S2 sample data hasn't been downloaded before,
# this will take a while
disfor.get("tiffs"),
)
# Ensure timestamps are datetime.date or datetime64
samples_data = sample_data.sort("timestamps").to_pandas()
timestamps = [
timestamp.to_pydatetime().date()
for timestamp in samples_data["timestamps"].to_list()
]
labels_arr = samples_data["label"].fillna(110).to_numpy(dtype=int)
ndmi = samples_data["NDMI"].to_numpy()
# --- Segment boundaries ---
segment_changes = np.concatenate(
[np.array([0]), np.where(labels_arr[:-1] != labels_arr[1:])[0] + 1]
)
n_chips = len(segment_changes)
# --- Build subplot specs ---
specs = [[{"type": "xy"} for _ in range(n_chips)]] + [
[{"colspan": n_chips}, *[None] * (n_chips - 1)]
]
fig = make_subplots(
rows=2,
cols=n_chips,
specs=specs,
vertical_spacing=0,
horizontal_spacing=0,
row_heights=[0.3, 0.7],
subplot_titles=[
f"label: {labels_arr[i]}<br>date: {timestamps[i].strftime('%Y-%m-%d')}"
for i in segment_changes
],
)
# --- Row 1: image chips ---
for j, i in enumerate(segment_changes):
try:
next_idx = segment_changes[j + 1] - 1
except IndexError:
next_idx = i + 5
ts = timestamps[min(i + 5, next_idx)]
if ts not in images:
continue # skip if missing
img = np.clip(images[ts], 0, 1)
fig.add_trace(go.Image(z=(img * 255).astype(np.uint8)), row=1, col=j + 1)
# --- Row 2: NDMI timeline with label-colored markers ---
unique_labels = samples_data["label"].dropna().unique().astype(int)
palette = px.colors.qualitative.T10
color_map = {lbl: palette[i % len(palette)] for i, lbl in enumerate(unique_labels)}
for lbl in unique_labels:
mask = samples_data["label"] == lbl
fig.add_trace(
go.Scatter(
x=samples_data.loc[mask, "timestamps"],
y=samples_data.loc[mask, "NDMI"],
mode="lines+markers",
name=f"Label {lbl}",
marker=dict(color=color_map[lbl], size=6, symbol="circle"),
line=dict(color=color_map[lbl]),
),
row=2,
col=1,
)
# --- Layout tweaks ---
fig.update_layout(
height=500,
width=650,
margin=dict(l=0, r=0, t=40, b=0),
title_font_size=12,
showlegend=True,
legend=dict(
orientation="h", # horizontal
yanchor="top",
xanchor="center",
x=0.5,
title_text="Labels",
),
)
# Remove axis labels only from image subplots (row 1)
for j in range(n_chips):
fig.update_xaxes(
showticklabels=False,
title_text="",
showgrid=False,
zeroline=False,
row=1,
col=j + 1,
)
fig.update_yaxes(
showticklabels=False,
title_text="",
showgrid=False,
zeroline=False,
row=1,
col=j + 1,
)
# Keep axis labels for the timeline plot (row 2)
fig.update_yaxes(title_text="NDMI", row=2, col=1)
# Make subplot titles smaller
fig.update_annotations(font_size=10)
# --- Save and show ---
fig
The time-series for this sample shows a complex disturbance pattern. The sample is first thinned (212) at the beginning of 2023, then affected by windthrow (243), with subsequent salvage logging (222) and revegetation (123).