Finite Carbon

Building a Platform for Digital Energy

Jordan Golinkoff, Bahareh Yekkehkhany, Yasaman Shahhosseini, Brian Andres Zambrano Luna, Arman Jahangiri, Isaac Dante Asamoah, Patrik Coulibaly

Last updated on Jul 3, 2024

Imagine you’re working on a machine learning model to identify areas affected by deforestation. The dataset contains satellite images, and your task is to classify image patches into two categories: “Deforested” (Class A) and “Non-Deforested” (Class B). However, deforested patches are significantly rarer than non-deforested ones.

Data Description:

Class A (Deforested; at most 5% of the dataset): Represents the rare instances of deforested patches (e.g., cleared land, logging areas).
Class B (Non-Deforested): Dominates the dataset and includes natural forest cover, agricultural land, and other non-deforested regions.

Classifier Performance Metrics:

Accuracy: Overall accuracy is commonly reported, but it is misleading under class imbalance. With deforested patches making up at most 5% of the dataset, a model that labels every patch as Non-Deforested reaches at least 95% accuracy while detecting no deforestation at all.
Precision: Precision measures the proportion of correctly predicted deforested instances (Class A) out of all predicted positive instances. Minimizing false positives is crucial to avoid misclassifying non-deforested areas.
Recall (Sensitivity): Recall calculates the proportion of true deforested instances (Class A) correctly identified out of all actual deforested instances. High recall ensures we don’t miss deforested areas.
Specificity: Specificity represents the proportion of true non-deforested instances (Class B) correctly identified out of all actual non-deforested instances. Avoiding false alarms for natural forest cover is essential.
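F1 Score: The F1 Score is the harmonic mean of precision and recall, so it accounts for both false positives and false negatives. It is particularly useful for imbalanced data because, unlike accuracy, it does not reward the model for the large number of correctly classified non-deforested patches.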
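The definitions above can be made concrete with a small sketch. The patch counts below are hypothetical and chosen only to match the "at most 5%" class ratio; the names `BinaryMetrics` and `evaluate` are illustrative and not part of any Finite Carbon code. Labels are assumed to be encoded as 1 for Deforested (Class A) and 0 for Non-Deforested (Class B).

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    specificity: float
    f1: float


def _ratio(numerator, denominator):
    # Report 0.0 when a metric is undefined (zero denominator).
    return numerator / denominator if denominator else 0.0


def evaluate(y_true, y_pred):
    """Compute metrics with 1 = Deforested (Class A), 0 = Non-Deforested (Class B)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return BinaryMetrics(
        accuracy=_ratio(tp + tn, len(pairs)),
        precision=precision,
        recall=recall,
        specificity=_ratio(tn, tn + fp),
        f1=_ratio(2 * precision * recall, precision + recall),
    )


# Hypothetical sample: 1000 patches, 50 of them deforested (5%).
y_true = [1] * 50 + [0] * 950

# Trivial baseline that never predicts deforestation.
baseline = evaluate(y_true, [0] * 1000)

# Hypothetical detector: finds 40 of the 50 deforested patches
# and raises 30 false alarms on non-deforested patches.
detector = evaluate(y_true, [1] * 40 + [0] * 10 + [1] * 30 + [0] * 920)

print(baseline)
print(detector)
```

In this made-up sample the two models differ by only one point of accuracy (0.95 versus 0.96), while recall and F1 separate them clearly: the baseline scores 0 on both, and the detector reaches a recall of 0.8 and an F1 of about 0.67.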

Challenge:

How would you design a binary classifier that optimizes both precision (minimizing false positives) and recall (minimizing false negatives) for detecting deforested areas?
Consider techniques like oversampling deforested patches, using weighted loss functions, or leveraging synthetic data generation.
Which evaluation metric(s) would you prioritize when evaluating your model’s performance on deforestation detection?
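Two of the techniques named above, oversampling and weighted loss functions, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a recommended pipeline: `balanced_class_weights` uses a common inverse-frequency heuristic, `random_oversample` simply duplicates minority samples, and the patch identifiers are hypothetical stand-ins for real image patches.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    The weights can be passed to a loss function so that errors on the
    rare class cost more than errors on the common class.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def random_oversample(samples, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority samples until both classes match in size."""
    rng = random.Random(seed)
    minority = [s for s, y in zip(samples, labels) if y == minority_label]
    if not minority:
        raise ValueError("no samples found for the minority label")
    majority_count = sum(1 for y in labels if y != minority_label)
    n_extra = max(0, majority_count - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return samples + extra, labels + [minority_label] * n_extra


# Hypothetical training split: patch ids 0-49 are deforested (label 1),
# patch ids 50-999 are non-deforested (label 0).
patch_ids = list(range(1000))
labels = [1] * 50 + [0] * 950

weights = balanced_class_weights(labels)
balanced_ids, balanced_labels = random_oversample(patch_ids, labels, minority_label=1)

print(weights)
print(Counter(balanced_labels))
```

Oversampling should be applied to the training split only, after the data has been divided, so that duplicated patches never appear in the validation or test data and inflate the reported metrics.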

Remember, in rare-case binary classification, thoughtful model selection, feature engineering, and robust evaluation are critical to achieving reliable results, especially when dealing with imbalanced data.

Jordan Golinkoff

Senior Director, Research and Development at Finite Carbon

Bahareh Yekkehkhany

Applied Remote Sensing Scientist, Finite Carbon

Yasaman Shahhosseini

Graduate Student

Brian Andres Zambrano Luna

Postdoc Fellow

Arman Jahangiri

Graduate Student

Isaac Dante Asamoah

Graduate

Patrik Coulibaly

PhD candidate