Dataset Overview¶
Here we will go through the provided data and explore the structure and metadata.
For examples of how to use the data in machine learning and deep learning applications, see the next section.
import polars as pl
import geopandas as gpd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import tifffile
import plotly.io as pio
import disfor
pio.renderers.default = "sphinx_gallery"
DISFOR provides four datasets:
- samples.parquet: location and metadata of the sampled points
- labels.parquet: labels for each sampled time-series
- pixel_data.parquet: Sentinel-2 band data for each acquisition in the time-series
- Sentinel-2 chips: image chip time-series for each sample
These datasets are not included in this repository due to their size, but are available for download on Hugging Face.
This package provides a data getter (disfor.get()), which downloads the appropriate files.
We will now go into the different datasets with more detail.
Samples dataset¶
This is a geoparquet file which can be read with geopandas or QGIS.
samples = gpd.read_parquet(disfor.get("samples.parquet"))
samples
| sample_id | original_sample_id | interpreter | dataset | source | source_description | s2_tile | cluster_id | cluster_description | comment | confidence | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | vij | 1 | EFFIS | Evoland Project, EFFIS Source of Wildfire Poly... | 30SUF | 0.0 | Damage polygons | border, thinning, then clear cut | high | POINT (-4.12212 36.74179) |
| 1 | 1 | 1 | vij | 1 | EFFIS | Evoland Project, EFFIS Source of Wildfire Poly... | 30SUF | 1.0 | Damage polygons | unclear progression, edge | medium | POINT (-4.12161 36.74231) |
| 2 | 2 | 2 | vij | 1 | EFFIS | Evoland Project, EFFIS Source of Wildfire Poly... | 30SUF | 2.0 | Damage polygons | plantation | high | POINT (-4.1192 36.74203) |
| 3 | 3 | 3 | vij | 1 | EFFIS | Evoland Project, EFFIS Source of Wildfire Poly... | 30SUF | 3.0 | Damage polygons | plantation | high | POINT (-4.12845 36.75831) |
| 4 | 4 | 4 | vij | 1 | EFFIS | Evoland Project, EFFIS Source of Wildfire Poly... | 30SUF | 5.0 | Damage polygons | clear cut | high | POINT (-4.12816 36.75908) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3823 | 3818 | 16066 | vij | 3 | FORWIND + Copernicus Emergency Service | https://mapping.emergency.copernicus.eu/activa... | <NA> | SI20200205 | Id of the Event, given as ISO2 + Date of storm | <NA> | high | POINT (14.39175 46.2299) |
| 3824 | 3819 | 16067 | vij | 3 | FORWIND + Copernicus Emergency Service | https://mapping.emergency.copernicus.eu/activa... | <NA> | SI20200205 | Id of the Event, given as ISO2 + Date of storm | unclear salvage | high | POINT (14.4004 46.2291) |
| 3825 | 3820 | 16070 | vij | 3 | FORWIND + Copernicus Emergency Service | https://mapping.emergency.copernicus.eu/activa... | <NA> | SI20200205 | Id of the Event, given as ISO2 + Date of storm | <NA> | high | POINT (14.39046 46.24575) |
| 3826 | 3821 | 16071 | vij | 3 | FORWIND + Copernicus Emergency Service | https://mapping.emergency.copernicus.eu/activa... | <NA> | SI20200205 | Id of the Event, given as ISO2 + Date of storm | unclear salvage | high | POINT (14.39247 46.2441) |
| 3827 | 3822 | 16076 | vij | 3 | FORWIND + Copernicus Emergency Service | https://mapping.emergency.copernicus.eu/activa... | <NA> | SI20200205 | Id of the Event, given as ISO2 + Date of storm | <NA> | high | POINT (14.4381 46.25011) |
3822 rows × 12 columns
The dataset provides the following columns:
| Column name | Description |
|---|---|
| sample_id | Unique sample ID for each sample point |
| original_sample_id | Sample ID of the point in the original publication of the dataset |
| interpreter | Shorthand code for the interpreter who labelled this sample |
| dataset | Number of the original sampling campaign in which this point was labelled |
| source | The ancillary data source used to interpret the agent |
| source_description | A long text description of the used source. Link to the original data if available |
| s2_tile | If available, which Sentinel 2 Tile the sample intersects |
| cluster_id | Unique ID to group samples which are spatio-temporally autocorrelated |
| cluster_description | What type of cluster it is |
| comment | Free text comment about the interpretation of the sampled point |
| confidence | Confidence of sampling: high where both timing and agent are confident, medium where only the timing is confident |
| geometry | Coordinates of the sampled point. In CRS EPSG:4326 |
For model training, this samples dataset can be used to subset the data. Some notes on how its columns support this:

- confidence: If the agent is to be modelled, only samples with high confidence should be included. If, however, a disturbance detection algorithm is to be calibrated, both high and medium confidence samples can be included.
- cluster_id: If some data is to be held out during training for validation purposes, this column should be used to avoid high spatial autocorrelation between the train and test sets. The cluster_description column specifies, per value of the dataset column, how samples are clustered: for dataset 1 the samples are clustered by disturbance patch (i.e. there may be multiple samples within the same disturbance patch), for dataset 2 by Sentinel-2 tile, and for dataset 3 by windthrow event.
- comment: The comments provide more context on the sample. They are free text, but two of the more common comments are "border", where the sample is on the border between two areas with different dynamics and thus exhibits high variability in the time-series, and "low TCD", which usually flags Mediterranean forests with low tree cover density.
Following is an interactive overview of where data is available.
samples.explore()
Labels dataset¶
The labels table provides labels for each data sample. These labels designate the start times of different events and segments within the sample time-series.
labels = pl.read_parquet(disfor.get("labels.parquet"))
labels
| original_sample_id | dataset | label | original_label | start | end | sample_id | start_next_label |
|---|---|---|---|---|---|---|---|
| i64 | u8 | u16 | str | datetime[ms, UTC] | datetime[ms, UTC] | u16 | datetime[ms, UTC] |
| 0 | 1 | 110 | "0" | 2016-11-10 00:00:00 UTC | 2022-03-09 00:00:00 UTC | 0 | 2022-04-08 00:00:00 UTC |
| 0 | 1 | 212 | "6" | 2022-04-08 00:00:00 UTC | 2022-04-08 23:59:59 UTC | 0 | 2024-04-02 00:00:00 UTC |
| 0 | 1 | 211 | "5" | 2024-04-02 00:00:00 UTC | 2024-04-02 23:59:59 UTC | 0 | null |
| 1 | 1 | 110 | "0" | 2016-11-10 00:00:00 UTC | 2023-05-03 00:00:00 UTC | 1 | 2023-06-22 00:00:00 UTC |
| 1 | 1 | 211 | "5" | 2023-06-22 00:00:00 UTC | 2023-06-22 23:59:59 UTC | 1 | null |
| … | … | … | … | … | … | … | … |
| 16071 | 3 | 243 | "7" | 2020-03-11 00:00:00 UTC | 2020-03-11 23:59:59 UTC | 3821 | 2020-03-16 00:00:00 UTC |
| 16071 | 3 | 120 | "3" | 2020-03-16 00:00:00 UTC | 2024-12-30 00:00:00 UTC | 3821 | null |
| 16076 | 3 | 110 | "0" | 2017-08-04 00:00:00 UTC | 2020-01-21 00:00:00 UTC | 3822 | 2020-03-11 00:00:00 UTC |
| 16076 | 3 | 243 | "7" | 2020-03-11 00:00:00 UTC | 2020-03-11 23:59:59 UTC | 3822 | 2020-06-29 00:00:00 UTC |
| 16076 | 3 | 123 | "4" | 2020-06-29 00:00:00 UTC | 2024-12-30 00:00:00 UTC | 3822 | null |
The following columns are available:
| Column name | Description |
|---|---|
| sample_id | Taken from sample table |
| original_sample_id | Taken from sample table |
| dataset | Taken from sample table |
| label | Interpreted class of the segment (see next table) |
| original_label | The label which was originally assigned and remapped to label |
| start | Start date of the segment |
| end | End date of the segment |
| start_next_label | Start date of the next label. Some labels are encoded as events (clear cuts, for example) and are not immediately followed by another label; this column allows a full segmentation of the time-series. Null if it is the last label of the sample |
The labelling campaign makes a distinction between events and segments. Temporal segments are periods of a distinct condition of the forest. This can include stable undisturbed forest, re-vegetation or bark beetle decline. Events on the other hand are singular changes of short duration. This includes most other disturbances like clear cutting, wildfire and windthrow events. In practice this distinction means that a single label is set for events, while two labels are set for segments, one for the start of the segment and one for the end.
In this table, labels which are events start at 00:00:00 and end at 23:59:59 of the same day. To enable a gap-free segmentation of the timeline, the start_next_label column is computed: it gives the date at which the next label starts, so each label can be extended up to the start of the following one.
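A minimal sketch of this segmentation logic, using the label rows of sample 0 from the table above: an acquisition date belongs to a label if it falls in the half-open interval [start, start_next_label), with the last label open-ended.

```python
from datetime import date

# Label rows for sample 0 (label, start, start_next_label), as in the table above
labels = [
    (110, date(2016, 11, 10), date(2022, 4, 8)),  # undisturbed forest segment
    (212, date(2022, 4, 8), date(2024, 4, 2)),    # thinning event
    (211, date(2024, 4, 2), None),                # clear cut, last label
]


def label_for(ts: date):
    """Return the label active at acquisition date ts, or None before the first label."""
    for label, start, start_next in labels:
        if start <= ts and (start_next is None or ts < start_next):
            return label
    return None
```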
The labels follow a hierarchical classification scheme:
| Level 1 | Level 2 | Level 3 |
|---|---|---|
| 100 - Healthy Vegetation | 110 - Undisturbed Forest | |
| 120 - Revegetation | 121 - With Trees (after clear cut) | |
| 122 - Canopy closing (after thinning/defoliation) | ||
| 123 - Without Trees (shrubs and grasses, no reforestation visible) | ||
| 200 - Disturbed | 210 - Planned | 211 - Clear Cut |
| 212 - Thinning | ||
| 213 - Forestry Mulching (Non Forest Vegetation Removal) | ||
| 220 - Salvage | 221 - After Biotic Disturbances | |
| 222 - After Abiotic Disturbances | ||
| 230 - Biotic | 231 - Bark Beetle | |
| 232 - Gypsy Moth (temporal segment of visible disturbance) | ||
| 240 - Abiotic | 241 - Drought | |
| 242 - Wildfire | ||
| 243 - Wind | ||
| 244 - Avalanche | ||
| 245 - Flood |
This allows for flexible labelling and classification.
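Because the codes are three-digit numbers, coarser levels can be recovered with integer arithmetic. A small sketch of collapsing level-3 labels to their parents:

```python
def to_level1(label: int) -> int:
    """Collapse a label to its level-1 parent, e.g. 211 (Clear Cut) -> 200 (Disturbed)."""
    return label // 100 * 100


def to_level2(label: int) -> int:
    """Collapse a label to its level-2 parent, e.g. 211 (Clear Cut) -> 210 (Planned)."""
    return label // 10 * 10
```

This makes it cheap to train, for example, a binary disturbed/undisturbed model on the same label column used for fine-grained agent classification.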
Pixel data¶
The pixel_data.parquet table provides Sentinel-2 data for the sample time-series, for use in classification tasks. It also provides pre-computed columns which give the percentage of "clear" pixels for several chip sizes.
pixel_data = pl.read_parquet(disfor.get("pixel_data.parquet"))
pixel_data
| B02 | B03 | B04 | B05 | B06 | B07 | B08 | B8A | B11 | B12 | SCL | sample_id | timestamps | percent_clear_4x4 | percent_clear_8x8 | percent_clear_16x16 | percent_clear_32x32 | clear | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| u16 | u16 | u16 | u16 | u16 | u16 | u16 | u16 | u16 | u16 | u8 | u16 | date | u8 | u8 | u8 | u8 | bool | u16 |
| 103 | 530 | 343 | 1149 | 3651 | 4455 | 4544 | 4675 | 2111 | 1053 | 4 | 0 | 2015-07-29 | 100 | 100 | 100 | 100 | true | null |
| 3320 | 3340 | 3188 | 3580 | 3825 | 3887 | 4240 | 3912 | 3430 | 2714 | 9 | 0 | 2015-08-08 | 0 | 0 | 0 | 0 | false | null |
| 0 | 323 | 72 | 1131 | 3821 | 4624 | 4749 | 4913 | 1934 | 894 | 4 | 0 | 2015-08-18 | 100 | 100 | 100 | 100 | true | null |
| 344 | 756 | 608 | 1416 | 3396 | 4183 | 4084 | 4417 | 2013 | 1018 | 4 | 0 | 2015-08-28 | 100 | 100 | 100 | 100 | true | null |
| 346 | 794 | 539 | 1232 | 3381 | 4128 | 4724 | 4315 | 1715 | 940 | 4 | 0 | 2015-09-17 | 100 | 100 | 92 | 84 | true | null |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 467 | 558 | 628 | 1013 | 1289 | 1394 | 1402 | 1638 | 1782 | 1274 | 5 | 999 | 2024-12-19 | 100 | 100 | 100 | 100 | true | 110 |
| 290 | 431 | 608 | 1044 | 1339 | 1576 | 1454 | 1770 | 2160 | 1524 | 5 | 999 | 2024-12-21 | 100 | 100 | 100 | 100 | true | 110 |
| 216 | 364 | 542 | 1032 | 1311 | 1476 | 1306 | 1648 | 2108 | 1589 | 5 | 999 | 2024-12-24 | 100 | 100 | 100 | 100 | true | 110 |
| 352 | 432 | 596 | 1033 | 1278 | 1549 | 1438 | 1722 | 2290 | 1711 | 5 | 999 | 2024-12-26 | 100 | 100 | 100 | 100 | true | 110 |
| 284 | 415 | 585 | 1044 | 1272 | 1432 | 1312 | 1643 | 2035 | 1628 | 5 | 999 | 2024-12-29 | 100 | 100 | 100 | 100 | true | 110 |
The following columns are available:
| Column name | Datatype | Description |
|---|---|---|
| sample_id | UINT16 | Taken from sample table |
| timestamps | DATE | UTC date of the S2 acquisition |
| label | UINT16 | Interpreted class of the segment, see previous table |
| clear | BOOL | True if the pixel is clear (SCL value any of 2,4,5,6) |
| percent_clear_4x4 [8x8, 16x16, 32x32] | UINT8 | The percentage of clear pixels (SCL in 2,4,5,6) within a 4x4, 8x8, 16x16 or 32x32 pixel image chip |
| B02, B03, B04, B05, B06, B07, B08, B8A, B11, B12 | UINT16 | DN value for the spectral band |
| SCL | UINT8 | Sentinel 2 Scene Classification Value |
Following is an example of a labelled time-series, together with Sentinel-2 image chips.
sample_id = 3733
sample_data = (
pixel_data.filter(
# percent_clear_* columns hold percentages (0-100), not fractions
pl.col.percent_clear_8x8 > 90,
sample_id=sample_id,
)
.select(
"timestamps",
"label",
NDMI=(pl.col.B08.cast(pl.Int32) - pl.col.B11) / (pl.col.B08 + pl.col.B11),
)
.sort("timestamps")
)
def get_rgb_array(sample_id, timestamp, data_path):
arr = (
tifffile.imread(f"{data_path}/{sample_id}/{timestamp}.tif")[
:,
:,
[2, 1, 0],
]
/ 10000
)
gain = 5
rgb = np.clip(arr * gain, 0, 1)
return rgb
images = {}
for row in sample_data.iter_rows(named=True):
images[row["timestamps"]] = get_rgb_array(
sample_id,
row["timestamps"].strftime("%Y-%m-%d"),
# If the S2 sample data hasn't been downloaded before,
# this will take a while
disfor.get("tiffs"),
)
# Ensure timestamps are datetime.date or datetime64
samples_data = sample_data.sort("timestamps").to_pandas()
timestamps = [
timestamp.to_pydatetime().date()
for timestamp in samples_data["timestamps"].to_list()
]
labels_arr = samples_data["label"].fillna(110).to_numpy(dtype=int)
ndmi = samples_data["NDMI"].to_numpy()
# --- Segment boundaries ---
segment_changes = np.concatenate(
[np.array([0]), np.where(labels_arr[:-1] != labels_arr[1:])[0] + 1]
)
n_chips = len(segment_changes)
# --- Build subplot specs ---
specs = [[{"type": "xy"} for _ in range(n_chips)]] + [
[{"colspan": n_chips}, *[None] * (n_chips - 1)]
]
fig = make_subplots(
rows=2,
cols=n_chips,
specs=specs,
vertical_spacing=0,
horizontal_spacing=0,
row_heights=[0.3, 0.7],
subplot_titles=[
f"label: {labels_arr[i]}<br>date: {timestamps[i].strftime('%Y-%m-%d')}"
for i in segment_changes
],
)
# --- Row 1: image chips ---
for j, i in enumerate(segment_changes):
try:
next_idx = segment_changes[j + 1] - 1
except IndexError:
next_idx = i + 5
ts = timestamps[min(i + 5, next_idx)]
if ts not in images:
continue # skip if missing
img = np.clip(images[ts], 0, 1)
fig.add_trace(go.Image(z=(img * 255).astype(np.uint8)), row=1, col=j + 1)
# --- Row 2: NDMI timeline with label-colored markers ---
unique_labels = samples_data["label"].dropna().unique().astype(int)
palette = px.colors.qualitative.T10
color_map = {lbl: palette[i % len(palette)] for i, lbl in enumerate(unique_labels)}
for lbl in unique_labels:
mask = samples_data["label"] == lbl
fig.add_trace(
go.Scatter(
x=samples_data.loc[mask, "timestamps"],
y=samples_data.loc[mask, "NDMI"],
mode="lines+markers",
name=f"Label {lbl}",
marker=dict(color=color_map[lbl], size=6, symbol="circle"),
line=dict(color=color_map[lbl]),
),
row=2,
col=1,
)
# --- Layout tweaks ---
fig.update_layout(
height=500,
width=650,
margin=dict(l=0, r=0, t=40, b=0),
title_font_size=12,
showlegend=True,
legend=dict(
orientation="h", # horizontal
yanchor="top",
xanchor="center",
x=0.5,
title_text="Labels",
),
)
# Remove axis labels only from image subplots (row 1)
for j in range(n_chips):
fig.update_xaxes(
showticklabels=False,
title_text="",
showgrid=False,
zeroline=False,
row=1,
col=j + 1,
)
fig.update_yaxes(
showticklabels=False,
title_text="",
showgrid=False,
zeroline=False,
row=1,
col=j + 1,
)
# Keep axis labels for the timeline plot (row 2)
fig.update_yaxes(title_text="NDMI", row=2, col=1)
# Make subplot titles smaller
fig.update_annotations(font_size=10)
# --- Save and show ---
fig
The time-series for this sample shows a complex disturbance pattern. The sample is first thinned (212) at the beginning of 2023, then affected by windthrow (243), with subsequent salvage logging (222) and revegetation (123).