Data Loaders¶
There are two dataloaders available to make working with the provided data more straightforward.
- A data loader providing spectral data and labels for single pixels, useful for scikit-learn classifiers
- A Pytorch dataset and Pytorch Lightning datamodule providing image chips together with labels
In this notebook we show how these dataloaders can be used.
Pixel dataloader¶
from disfor.datasets import TabularDataset
The class TabularDataset provides arguments to filter the dataset and exposes properties that can be used directly for training scikit-learn classifiers.
data = TabularDataset(
# If None, data gets dynamically downloaded and cached from Huggingface
data_folder=None,
# selecting healthy forest (110), clear cut (211) and bark beetle (231)
target_classes=[110, 211, 231],
# we remap salvage logging (221 and 222) to also be part of the clear cut class
# this happens before filtering the target_classes. This means, that all values in the
# mapping dict need to be in target_classes to be included
class_mapping_overrides={221: 211, 222: 211},
# subset to only include samples with high confidence
confidence=["high"],
# only include acquisitions from "leaf-on" months
months=[5, 6, 7, 8, 9],
# including also dark pixels (2) as valid
valid_scl_values=[2, 4, 5, 6],
# only include acquisitions where the clear cut is recent (maximum of 90 days),
# for all other classes include everything
max_days_since_event={211: 90},
max_samples_per_event=5,
# omit samples which have low tcd in the comment
omit_low_tcd=True,
# omit samples which have border in the comment
omit_border=True,
)
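The comment above describes a remap-then-filter order. This can be illustrated in plain Python (a hypothetical re-implementation of the described semantics, not the actual TabularDataset code):

```python
# Sketch of the remap-then-filter order described above
# (hypothetical re-implementation, not the actual TabularDataset code).
labels = [110, 211, 221, 222, 231, 240]

class_mapping_overrides = {221: 211, 222: 211}  # salvage logging -> clear cut
target_classes = [110, 211, 231]

# Step 1: apply the overrides ...
remapped = [class_mapping_overrides.get(label, label) for label in labels]
# Step 2: ... then filter to the target classes.
filtered = [label for label in remapped if label in target_classes]

print(filtered)  # [110, 211, 211, 211, 231]
```

Because the remap happens first, the salvage logging samples (221, 222) survive the filter as clear cut (211), while unmapped classes outside target_classes (here 240) are dropped.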
Once initialized, the class instance provides train and test data as numpy arrays.
print(data.y_train, data.X_train, data.y_test, data.X_test, sep="\n")
[2 2 0 ... 0 0 0]
[[ 522  632  648 ... 2199 1711 1078]
 [ 632  760  758 ... 2568 1995 1150]
 [ 164  333  229 ... 2614  954  422]
 ...
 [ 194  203  133 ... 1163  555  259]
 [ 208  198  148 ... 1518  692  319]
 [ 119  148  109 ... 1916  905  382]]
[0 0 1 ... 0 0 0]
[[ 224  396  291 ... 2060 1204  593]
 [ 185  416  304 ... 2202 1315  682]
 [ 493  734  936 ... 2708 2751 1605]
 ...
 [ 207  430  361 ... 2271 1855 1139]
 [ 578  790  760 ... 2749 2297 1453]
 [ 484  722  606 ... 3479 2669 1611]]
It also provides the fitted label encoder, which maps the 0 to n-1 encoded labels back to the original class labels.
data.encoder.inverse_transform(data.y_test)[:10]
array([110, 110, 211, 110, 110, 110, 211, 211, 211, 211])
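If the encoder follows scikit-learn's `LabelEncoder` convention (an assumption here, consistent with the output above), the integer labels simply index into the sorted original class codes:

```python
from sklearn.preprocessing import LabelEncoder

# Stand-in for data.encoder: a LabelEncoder fitted on the three class codes.
encoder = LabelEncoder().fit([110, 211, 231])

print(encoder.classes_)                      # [110 211 231]
print(encoder.transform([110, 231]))         # [0 2]
print(encoder.inverse_transform([0, 1, 2]))  # [110 211 231]
```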
Now, let's quickly train a Random Forest model and validate the output:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(oob_score=True)
rf.fit(data.X_train, data.y_train)
print(rf.oob_score_)
0.9209039548022598
The out-of-bag (OOB) accuracy of the Random Forest model is 0.92. However, let's use the held-out test set to get a better estimate of the model accuracy. For this we apply the trained model to the held-out predictors (X_test) and derive accuracy metrics from the predictions.
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
y_pred = rf.predict(data.X_test)
print(
classification_report(
data.y_test, y_pred, target_names=data.encoder.classes_.astype(str)
)
)
disp = ConfusionMatrixDisplay.from_predictions(
data.y_test,
y_pred,
display_labels=["Undisturbed (110)", "Clear Cut (211)", "Bark Beetle (231)"],
normalize="true",
)
precision recall f1-score support
110 0.93 0.97 0.95 1203
211 0.81 0.70 0.75 161
231 0.62 0.41 0.50 92
accuracy 0.90 1456
macro avg 0.79 0.69 0.73 1456
weighted avg 0.90 0.90 0.90 1456
We can see that the undisturbed class is predicted well, while the two disturbance classes are not. For the bark beetle class in particular, both precision and recall are low.
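One common remedy for this kind of class imbalance (1203 vs. 161 vs. 92 support) is scikit-learn's `class_weight` option. A sketch on synthetic stand-in data; whether this actually improves results on the DISFOR pixels would need to be tested:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced stand-in for the pixel data (10 spectral features,
# three classes with roughly 80/12/8 percent frequency).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=6,
    n_classes=3, weights=[0.8, 0.12, 0.08], random_state=0,
)

# "balanced" reweights samples inversely to class frequency, which
# typically trades some majority-class accuracy for better recall
# on the rare classes.
rf = RandomForestClassifier(class_weight="balanced", random_state=0)
rf.fit(X, y)
```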
Pytorch dataloader¶
The Pytorch dataloader is used for loading image chips. The first time the data is loaded, it is downloaded from Huggingface; this requires at least 80GB of free disk space. After the initial download and extraction, around 35GB of space remains in use.
The first data loading can take quite some time, since the data needs to be downloaded and extracted.
from disfor.datasets import MonoTemporalClassification
tiff_dataset = MonoTemporalClassification(
# If None, data gets dynamically downloaded and cached from Huggingface
data_folder=None,
# selecting healthy forest (110), clear cut (211) and bark beetle (231)
target_classes=[110, 211, 231],
# reduce the size of the chip, to include less context
chip_size=8,
# subset to only include samples with high confidence
confidence=["high"],
# only include acquisitions from "leaf-on" months
months=[5, 6, 7, 8, 9],
# only include acquisitions where the clear cut is recent (maximum of 90 days),
# for all other classes include everything
max_days_since_event={211: 90},
max_samples_per_event=5,
# omit samples which have low tcd in the comment
omit_low_tcd=True,
# omit samples which have border in the comment
omit_border=True,
)
The dataset returns a dictionary with the image, label and path of the image.
tiff_dataset[0]
{'image': tensor([[[0.0448, 0.0448, 0.0514, 0.0582, 0.0548, 0.0516, 0.0512, 0.0504],
[0.0459, 0.0465, 0.0537, 0.0554, 0.0508, 0.0466, 0.0496, 0.0500],
[0.0471, 0.0470, 0.0542, 0.0562, 0.0506, 0.0474, 0.0509, 0.0524],
[0.0440, 0.0443, 0.0534, 0.0559, 0.0506, 0.0513, 0.0518, 0.0540],
[0.0470, 0.0512, 0.0568, 0.0562, 0.0522, 0.0527, 0.0588, 0.0588],
[0.0500, 0.0532, 0.0608, 0.0582, 0.0536, 0.0524, 0.0558, 0.0614],
[0.0460, 0.0494, 0.0555, 0.0534, 0.0536, 0.0530, 0.0560, 0.0626],
[0.0462, 0.0433, 0.0478, 0.0482, 0.0517, 0.0562, 0.0574, 0.0548]],
[[0.0704, 0.0664, 0.0632, 0.0692, 0.0674, 0.0630, 0.0634, 0.0713],
[0.0661, 0.0732, 0.0694, 0.0666, 0.0662, 0.0598, 0.0611, 0.0692],
[0.0660, 0.0690, 0.0648, 0.0604, 0.0618, 0.0574, 0.0637, 0.0717],
[0.0670, 0.0575, 0.0632, 0.0642, 0.0638, 0.0618, 0.0694, 0.0754],
[0.0667, 0.0618, 0.0672, 0.0668, 0.0632, 0.0637, 0.0702, 0.0826],
[0.0678, 0.0682, 0.0704, 0.0700, 0.0645, 0.0625, 0.0694, 0.0872],
[0.0626, 0.0634, 0.0638, 0.0631, 0.0632, 0.0640, 0.0700, 0.0824],
[0.0578, 0.0589, 0.0626, 0.0636, 0.0648, 0.0687, 0.0706, 0.0730]],
[[0.0396, 0.0418, 0.0606, 0.0728, 0.0710, 0.0650, 0.0624, 0.0535],
[0.0374, 0.0430, 0.0636, 0.0636, 0.0642, 0.0559, 0.0570, 0.0543],
[0.0368, 0.0434, 0.0557, 0.0674, 0.0644, 0.0571, 0.0612, 0.0563],
[0.0386, 0.0472, 0.0602, 0.0686, 0.0623, 0.0664, 0.0645, 0.0622],
[0.0390, 0.0572, 0.0722, 0.0725, 0.0648, 0.0664, 0.0760, 0.0838],
[0.0400, 0.0612, 0.0763, 0.0780, 0.0702, 0.0677, 0.0786, 0.0871],
[0.0424, 0.0522, 0.0673, 0.0674, 0.0672, 0.0722, 0.0830, 0.0902],
[0.0396, 0.0378, 0.0511, 0.0658, 0.0708, 0.0756, 0.0835, 0.0732]],
[[0.1128, 0.1145, 0.1145, 0.1023, 0.1023, 0.0977, 0.0977, 0.0971],
[0.1065, 0.1008, 0.1008, 0.0978, 0.0978, 0.0945, 0.0945, 0.1060],
[0.1065, 0.1008, 0.1008, 0.0978, 0.0978, 0.0945, 0.0945, 0.1060],
[0.1004, 0.1032, 0.1032, 0.0989, 0.0989, 0.1076, 0.1076, 0.1303],
[0.1004, 0.1032, 0.1032, 0.0989, 0.0989, 0.1076, 0.1076, 0.1303],
[0.0943, 0.1041, 0.1041, 0.1029, 0.1029, 0.1117, 0.1117, 0.1455],
[0.0943, 0.1041, 0.1041, 0.1029, 0.1029, 0.1117, 0.1117, 0.1455],
[0.0898, 0.0927, 0.0927, 0.0996, 0.0996, 0.1096, 0.1096, 0.1127]],
[[0.4431, 0.2933, 0.2933, 0.1756, 0.1756, 0.1955, 0.1955, 0.2538],
[0.3975, 0.2596, 0.2596, 0.1761, 0.1761, 0.1819, 0.1819, 0.2490],
[0.3975, 0.2596, 0.2596, 0.1761, 0.1761, 0.1819, 0.1819, 0.2490],
[0.3717, 0.2216, 0.2216, 0.1723, 0.1723, 0.1791, 0.1791, 0.2648],
[0.3717, 0.2216, 0.2216, 0.1723, 0.1723, 0.1791, 0.1791, 0.2648],
[0.3349, 0.2223, 0.2223, 0.1611, 0.1611, 0.1774, 0.1774, 0.2449],
[0.3349, 0.2223, 0.2223, 0.1611, 0.1611, 0.1774, 0.1774, 0.2449],
[0.2948, 0.2508, 0.2508, 0.2002, 0.2002, 0.1708, 0.1708, 0.1706]],
[[0.5606, 0.3849, 0.3849, 0.1957, 0.1957, 0.2262, 0.2262, 0.3015],
[0.5281, 0.3177, 0.3177, 0.1941, 0.1941, 0.2235, 0.2235, 0.2963],
[0.5281, 0.3177, 0.3177, 0.1941, 0.1941, 0.2235, 0.2235, 0.2963],
[0.4868, 0.2619, 0.2619, 0.1993, 0.1993, 0.2107, 0.2107, 0.3002],
[0.4868, 0.2619, 0.2619, 0.1993, 0.1993, 0.2107, 0.2107, 0.3002],
[0.4440, 0.2564, 0.2564, 0.1926, 0.1926, 0.2043, 0.2043, 0.2831],
[0.4440, 0.2564, 0.2564, 0.1926, 0.1926, 0.2043, 0.2043, 0.2831],
[0.3800, 0.3076, 0.3076, 0.2427, 0.2427, 0.1960, 0.1960, 0.2015]],
[[0.6180, 0.4956, 0.2720, 0.2370, 0.2282, 0.2228, 0.2562, 0.3224],
[0.5692, 0.4920, 0.3028, 0.2218, 0.2168, 0.2188, 0.2454, 0.3100],
[0.5020, 0.3996, 0.2796, 0.2160, 0.2104, 0.2124, 0.2420, 0.2854],
[0.4720, 0.3028, 0.2382, 0.2318, 0.2198, 0.2126, 0.2512, 0.3112],
[0.4940, 0.3028, 0.2522, 0.2328, 0.2172, 0.2112, 0.2424, 0.3276],
[0.4952, 0.3264, 0.2524, 0.2278, 0.2180, 0.2166, 0.2234, 0.3120],
[0.4364, 0.3324, 0.2232, 0.2046, 0.2084, 0.2100, 0.2150, 0.2886],
[0.3288, 0.3160, 0.2950, 0.2352, 0.2084, 0.2204, 0.2158, 0.2366]],
[[0.6057, 0.4248, 0.4248, 0.2250, 0.2250, 0.2593, 0.2593, 0.3387],
[0.5530, 0.3576, 0.3576, 0.2184, 0.2184, 0.2475, 0.2475, 0.3180],
[0.5530, 0.3576, 0.3576, 0.2184, 0.2184, 0.2475, 0.2475, 0.3180],
[0.5337, 0.3043, 0.3043, 0.2199, 0.2199, 0.2465, 0.2465, 0.3330],
[0.5337, 0.3043, 0.3043, 0.2199, 0.2199, 0.2465, 0.2465, 0.3330],
[0.4618, 0.2949, 0.2949, 0.2269, 0.2269, 0.2282, 0.2282, 0.3200],
[0.4618, 0.2949, 0.2949, 0.2269, 0.2269, 0.2282, 0.2282, 0.3200],
[0.4113, 0.3368, 0.3368, 0.2654, 0.2654, 0.2243, 0.2243, 0.2340]],
[[0.2310, 0.2115, 0.2115, 0.1764, 0.1764, 0.1614, 0.1614, 0.1598],
[0.2167, 0.1935, 0.1935, 0.1683, 0.1683, 0.1610, 0.1610, 0.1717],
[0.2167, 0.1935, 0.1935, 0.1683, 0.1683, 0.1610, 0.1610, 0.1717],
[0.2068, 0.1843, 0.1843, 0.1711, 0.1711, 0.1838, 0.1838, 0.2053],
[0.2068, 0.1843, 0.1843, 0.1711, 0.1711, 0.1838, 0.1838, 0.2053],
[0.1993, 0.1865, 0.1865, 0.1810, 0.1810, 0.1968, 0.1968, 0.2413],
[0.1993, 0.1865, 0.1865, 0.1810, 0.1810, 0.1968, 0.1968, 0.2413],
[0.1764, 0.1708, 0.1708, 0.1832, 0.1832, 0.1930, 0.1930, 0.2090]],
[[0.1105, 0.1160, 0.1160, 0.1074, 0.1074, 0.0926, 0.0926, 0.0826],
[0.1057, 0.1062, 0.1062, 0.1040, 0.1040, 0.0949, 0.0949, 0.0911],
[0.1057, 0.1062, 0.1062, 0.1040, 0.1040, 0.0949, 0.0949, 0.0911],
[0.0981, 0.1066, 0.1066, 0.1078, 0.1078, 0.1055, 0.1055, 0.1176],
[0.0981, 0.1066, 0.1066, 0.1078, 0.1078, 0.1055, 0.1055, 0.1176],
[0.0976, 0.1088, 0.1088, 0.1129, 0.1129, 0.1217, 0.1217, 0.1507],
[0.0976, 0.1088, 0.1088, 0.1129, 0.1129, 0.1217, 0.1217, 0.1507],
[0.0863, 0.0950, 0.0950, 0.1083, 0.1083, 0.1199, 0.1199, 0.1254]]]),
'label': tensor(2),
'path': 'C:\\Users\\Jonas.Viehweger\\AppData\\Local\\disfor\\disfor\\Cache\\0.1.0\\tiffs\\797\\2020-06-05.tif'}
The image can also be plotted.
tiff_dataset.plot_chip(5001)
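Outside of Lightning, the dataset can also be wrapped in a plain torch DataLoader. A sketch with a hypothetical stand-in dataset that returns the same dict structure per sample:

```python
import torch
from torch.utils.data import Dataset, DataLoader

# Hypothetical stand-in for MonoTemporalClassification: any Dataset
# returning a dict per sample is batched the same way.
class DictDataset(Dataset):
    def __len__(self):
        return 16

    def __getitem__(self, idx):
        return {"image": torch.randn(10, 8, 8), "label": torch.tensor(idx % 3)}

loader = DataLoader(DictDataset(), batch_size=4, shuffle=True)

# The default collate function stacks each dict entry into a batch tensor.
batch = next(iter(loader))
print(batch["image"].shape, batch["label"].shape)
# torch.Size([4, 10, 8, 8]) torch.Size([4])
```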
Pytorch Lightning¶
For use with Pytorch Lightning, a Lightning datamodule is also available. This datamodule takes care of splitting the dataset into a training and validation set, so that the training progress can be monitored.
The datamodule only takes a few extra parameters, such as batch_size, num_workers and persist_workers. All of the remaining parameters are passed as keyword arguments to TiffDataset.
from disfor.datasets import MonoTemporalClassificationDataModule
tiff_datamodule = MonoTemporalClassificationDataModule(
batch_size=64,
num_workers=6,
# Keyword arguments are passed to TiffDataset
# selecting healthy forest (110), clear cut (211) and bark beetle (231)
target_classes=[110, 211, 231],
# reduce the size of the chip, to include less context
chip_size=8,
# subset to only include samples with high confidence
confidence=["high"],
# only include acquisitions from "leaf-on" months
months=[5, 6, 7, 8, 9],
# only include acquisitions where the clear cut is recent (maximum of 90 days),
# for all other classes include everything
max_days_since_event={211: 90},
max_samples_per_event=5,
# omit samples which have low tcd in the comment
omit_low_tcd=True,
# omit samples which have border in the comment
omit_border=True,
)
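Internally, such a datamodule typically splits the dataset with something like `torch.utils.data.random_split`. A generic sketch on a tiny stand-in dataset; the actual split logic of MonoTemporalClassificationDataModule may differ:

```python
import torch
from torch.utils.data import TensorDataset, random_split, DataLoader

# Tiny stand-in dataset: 100 fake chips with 10 bands of 8x8 pixels.
dataset = TensorDataset(torch.randn(100, 10, 8, 8), torch.randint(0, 3, (100,)))

# 80/20 train/validation split with a fixed seed for reproducibility.
train_set, val_set = random_split(
    dataset, [80, 20], generator=torch.Generator().manual_seed(42)
)

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = DataLoader(val_set, batch_size=64)

print(len(train_set), len(val_set))  # 80 20
```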
To test this datamodule, we define a very simple neural network that predicts classes from the input images.
import torch
import torch.nn as nn
import lightning as L
class SimpleClassifier(L.LightningModule):
def __init__(self, num_classes=2, lr=1e-3):
super().__init__()
self.lr = lr
# Simple feedforward network
# Input: 10 channels * 8 * 8 = 640 features
self.model = nn.Sequential(
nn.Flatten(),
nn.Linear(10 * 8 * 8, 128),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(128, 64),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(64, num_classes),
)
self.criterion = nn.CrossEntropyLoss()
def forward(self, x):
return self.model(x)
def training_step(self, batch, batch_idx):
x, y = batch["image"], batch["label"]
logits = self(x)
loss = self.criterion(logits, y)
self.log("train_loss", loss, prog_bar=True, batch_size=len(batch["label"]))
return loss
def validation_step(self, batch, batch_idx):
x, y = batch["image"], batch["label"]
logits = self(x)
loss = self.criterion(logits, y)
# Calculate accuracy
preds = torch.argmax(logits, dim=1)
acc = (preds == y).float().mean()
self.log("val_loss", loss, prog_bar=True, batch_size=len(batch["label"]))
self.log("val_acc", acc, prog_bar=True, batch_size=len(batch["label"]))
return loss
def configure_optimizers(self):
return torch.optim.Adam(self.parameters(), lr=self.lr)
Finally, we train the neural net using the data from our datamodule. As an example, we train for only 20 epochs.
model = SimpleClassifier(num_classes=3, lr=1e-3)
# Train with your dataloader
trainer = L.Trainer(max_epochs=20)
trainer.fit(model, datamodule=tiff_datamodule)
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name      | Type             | Params | Mode
-------------------------------------------------------
0 | model     | Sequential       | 90.5 K | train
1 | criterion | CrossEntropyLoss | 0      | train
-------------------------------------------------------
90.5 K    Trainable params
0         Non-trainable params
90.5 K    Total params
0.362     Total estimated model params size (MB)
10        Modules in train mode
0         Modules in eval mode
Epoch 19: 100%|██████████| 213/213 [00:04<00:00, 44.52it/s, v_num=7, train_loss=0.147, val_loss=0.212, val_acc=0.926]
`Trainer.fit` stopped: `max_epochs=20` reached.
Now, let's look at the confusion matrix of the trained neural net:
# After training, run validation and collect predictions
model.eval()
all_preds = []
all_labels = []
with torch.no_grad():
for batch in tiff_datamodule.val_dataloader():
x, y = batch["image"], batch["label"]
logits = model(x)
preds = torch.argmax(logits, dim=1)
all_preds.extend(preds.cpu().numpy())
all_labels.extend(y.cpu().numpy())
print(
classification_report(
all_labels,
all_preds,
target_names=["Healthy (110)", "Clear Cut (211)", "Bark Beetle (231)"],
)
)
disp = ConfusionMatrixDisplay.from_predictions(
all_labels,
all_preds,
display_labels=["Healthy (110)", "Clear Cut (211)", "Bark Beetle (231)"],
normalize="true",
)
precision recall f1-score support
Healthy (110) 0.95 0.97 0.96 1203
Clear Cut (211) 0.79 0.75 0.77 161
Bark Beetle (231) 0.85 0.63 0.72 92
accuracy 0.93 1456
macro avg 0.86 0.78 0.82 1456
weighted avg 0.92 0.93 0.92 1456
After 20 epochs of training, this very simple neural net outperforms the pixel-based random forest model. In particular, the bark beetle class is predicted more accurately.
These models are just toy examples to show the integration of the provided datasets into training pipelines.