Exploring Extinction

Projects
I used IUCN data to explore extinction in Animalia. This project is part of Statistics for Environmental Data Science (EDS 222).
Published

December 4, 2022

Introduction and Motivation

History of Extinction

Species extinction is a normal part of Earth’s natural history: over 99.9% of the species that have ever existed are now extinct1. Under normal conditions, the background extinction rate for vertebrates is expected to be between 0.1 and 1 extinctions per 10,000 species per 100 years2. However, throughout Earth’s history there have been calamitous events, such as asteroid impacts, volcanic eruptions, and sudden atmospheric changes, that rapidly made conditions on Earth unsuitable for much of life. The worst of these extinction events, the Permian extinction, is thought to have killed off 90% of all species on Earth.

The current species extinction rate is estimated to be between 1,000 and 10,000 times higher than the normal background extinction rate3, which is enough to consider our current time period the sixth mass extinction event. It is widely agreed to be caused by human activities.

Figure 1. The Big Five Mass Extinctions

Extinction Today

You may be familiar with some of the species that have recently gone extinct due to human activity. Among these is the Chinese river dolphin, which lived in the Yangtze River of China and was last seen in 2007. It is thought to have been driven to extinction by heavy use of the river for fishing, transportation, and hydroelectricity.

Figure 2. The Chinese River Dolphin (Baiji)

Another is the Tasmanian tiger, which lived on Tasmania, an island off the southeastern coast of Australia, and was hunted to extinction in the 1930s.

Figure 3. Tasmanian Tiger (Thylacine)

A third is the famous dodo, a bird endemic to the island of Mauritius. It had no defenses against hunting by sailors or the destruction of its habitat, and it went extinct in the late 17th century.

Figure 4. The Dodo went extinct by 1681

Why Study Extinction?

Species not only provide us with important sources of medicine, food, and various other products, but also play important roles in the ecosystems on which much of our society depends. Each species also helps us piece together the story of life’s history on Earth and contextualizes our relationship with the natural world. More importantly, however, species have intrinsic value regardless of what they provide for humans, and each one lost is a tragedy in its own right.

It’s important to understand the factors that render species vulnerable to extinction, as well as the mechanisms of extinction and how they work. Extinction is notoriously difficult to study, mainly due to a lack of data (which I expand upon in the Issues section), but we can hopefully use some of these findings to identify vulnerable species and better protect them and their ecosystems from extinction and collapse.

Methods

For this analysis, I used data from the IUCN Red List of Threatened Species4 to investigate some of the factors that I suspected may influence extinction. For simplicity, I focused on species within the kingdom Animalia. To help explore and contextualize the data, I created a Shiny app that allows users to visualize extinct species on a map with any combination of taxonomic, endemism, habitat, threat, and use type filters. The app can be found here.

Cleaning

After cleaning the data, each row represented one species with a unique assessment_id, and each column contained a variable that I thought might influence extinction. The variables that I focused on were: species endemism (endemic), habitat type (habitat), the type of threat faced (threat), human use (use), and taxonomy (class).

Code
head(predictors) |> 
  datatable()

To clarify some nomenclature before we start modeling:

When I use the term variable, I’m referencing the column names that we’re going to use in our analysis. When I use the term level, I’m referring to the different values that each column name can take on. For example, one of our variables is habitat. The different levels it can take on are below.

Code
levels(predictors$habitat) 
[1] "Caves"          "Desert"         "Forest"         "Grassland"     
[5] "Marine Neritic" "Rocky Areas"    "Savanna"        "Shrubland"     
[9] "Wetlands"      

Modeling

I ran a logistic regression on each of these variables individually to get a feel for which levels might be significant. I then added the variables stepwise to build 5 different models, one for each added variable, and scored each model using AIC, a relative measure of model performance, to see which one did best. I then used the best model to make some predictions. By comparing the coefficients from the individual models to the coefficients from the consolidated model, I was able to evaluate the robustness of each variable. Since we’re testing a large number of levels, we are likely to find some significant results regardless of whether there is actually an effect. Cross-checking the output of our individual models against our consolidated model lets us see whether our significant levels are robust.

First, we’re going to look at using logistic regression on each variable individually. Our logistic regression uses the following logit link function, where \(p\) is the probability that a species is extinct: \[\operatorname{logit}(p)=\log \left(\frac{p}{1-p}\right)=\beta_0+\beta_1 x \]

Where \(\beta_0\) is our intercept (the log-odds of extinction for the reference category), and \(\beta_1\) is the coefficient for a two-level factor variable whose indicator toggles on (\(x = 1\)) when we are not evaluating our reference category. For a variable with \(i\) levels, we will have \(i - 1\) such coefficients, one for each non-reference level. We will use this expression for each of our variables.
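To make the reference-category idea concrete, here is a minimal sketch of how R expands a factor into indicator columns. The three-level `habitat` vector is a made-up toy example, not the project data.

```r
# Hypothetical three-level factor; "Caves" is the reference because it sorts first.
habitat <- factor(c("Caves", "Desert", "Forest", "Desert", "Caves"))

# model.matrix() shows the indicator (dummy) columns that glm() would build.
X <- model.matrix(~ habitat)
print(X)

# One intercept column plus (number of levels - 1) indicator columns.
n_levels <- nlevels(habitat)
n_coefs  <- ncol(X)
```

Each non-reference level gets its own 0/1 column, which is why a variable with \(i\) levels contributes \(i - 1\) coefficients.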

Unfortunately, the model coefficients are hard to interpret directly since they are on the log-odds scale, and we have to back-transform them. Taking endemism as an example, after some rearranging we get an expression for the predicted probability of extinction, which we can use to calculate how much more likely an endemic species is to be extinct than a non-endemic species. We will use this approach for each significant level of our variables. \[\hat p = \frac{e^{\beta_0 + \beta_1 x}}{1+ e^{\beta_0 + \beta_1 x}}\]

We’re going to calculate the right-hand side of the equation above for endemic species (\(x = 1\), since it’s a categorical variable) and for non-endemic species (\(x = 0\)). When we take the difference of the two, we get how much MORE likely it is, in percentage points, for an endemic species to be extinct than a non-endemic species.
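Written out, the quantity being computed is the difference between the two predicted probabilities, using the same \(\beta_0\) and \(\beta_1\) as above (the symbol \(\Delta\) is introduced here only for convenience): \[\Delta = \hat p(x = 1) - \hat p(x = 0) = \frac{e^{\beta_0 + \beta_1}}{1 + e^{\beta_0 + \beta_1}} - \frac{e^{\beta_0}}{1 + e^{\beta_0}}\]

Since the tables report odds ratios rather than raw coefficients, \(e^{\beta_0}\) is the Intercept entry, and \(e^{\beta_0 + \beta_1}\) is the Intercept entry multiplied by the odds ratio of the level in question.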

I first investigated endemism. A species endemic to Santa Barbara, for example, is found nowhere else in the world outside of Santa Barbara. This seemed like a good place to start, since an endemic species is geographically and genetically restricted to one location, which seems likely to render it more prone to extinction than a non-endemic species. Below is a mosaic plot, which uses area to visualize two categorical variables, to see if there is an obvious association.

It’s difficult to tell, but it looks like endemic species might be more likely to be extinct. We’re going to take a look at the logistic regression output for regressing extinct on endemic.

Here is the R output for the model summary:

Table 1. Logistic Regression of Extinction on Endemism
  Endemism
Predictors Odds Ratios Conf. Int (95%) P-value
Intercept 0.0177 0.0166 – 0.0188 <0.001
Endemic 2.0405 1.8307 – 2.2718 <0.001
Observations 70067
R2 Tjur 0.002

Since our p-value is far below any of the conventional significance levels, endemism on its own looks significant in predicting extinction. After the transformation, our results show that an endemic species is about 1.75 percentage points more likely to be extinct than a non-endemic species (roughly 3.5% versus 1.7%).
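The 1.75 figure can be reproduced from the odds ratios in Table 1. This is a small sketch of the arithmetic only. The helper name `odds_to_prob` is my own, and the inputs are the rounded values printed in the table, so the result is approximate.

```r
# Odds ratios copied from Table 1 (rounded as printed).
or_intercept <- 0.0177   # odds of extinction for non-endemic species
or_endemic   <- 2.0405   # multiplicative change in odds for endemic species

# Convert odds to a probability: p = odds / (1 + odds).
odds_to_prob <- function(odds) odds / (1 + odds)

p_non_endemic <- odds_to_prob(or_intercept)
p_endemic     <- odds_to_prob(or_intercept * or_endemic)

# Difference in predicted probability, in percentage points.
diff_pp <- 100 * (p_endemic - p_non_endemic)
print(round(c(non_endemic = 100 * p_non_endemic,
              endemic     = 100 * p_endemic,
              difference  = diff_pp), 2))
```

The same helper works for any other table in this post: multiply the Intercept odds by a level's odds ratio, convert both to probabilities, and subtract.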

I then investigated the type of habitat the species lives in. Running a logistic regression model only on habitat shows that the intercept, which corresponds to the Cave habitat (our reference group, when \(x = 0\)), is significant, with a very small p-value.

Wetlands and Forest habitats are also significant at a 0.05 significance level, and Marine Neritic habitats are significant at a significance level of 0.01. Let’s turn the coefficients into something more interpretable as we did above.

Table 2. Logistic Regression of Extinction on Habitat
  Habitat
Predictors Odds Ratios P-value
Intercept 0.0079 <0.001
Desert 2.4085 0.130
Forest 2.7000 0.049
Grassland 1.4574 0.482
Marine Neritic 0.2040 0.006
Rocky Areas 0.8166 0.727
Savanna 0.3305 0.118
Shrubland 1.8463 0.236
Wetlands 3.2211 0.020
Observations 68395
R2 Tjur 0.003

Summarizing our significant results in comparison to the reference group (Caves/Subterranean Habitats):

Species living in a Cave/Subterranean habitat have a 0.78% chance of being extinct, while species living in a Marine Neritic habitat have a chance that is 0.619 percentage points LOWER than that of species living in a Cave/Subterranean habitat.

Next was the type of threat that the species faces.

Our results show that the Agriculture and Aquaculture (intercept), Pollution, and Invasive species/Disease threat types are the most strongly significant, each with p < 0.001.

Table 3. Logistic Regression of Extinction on Threat
  Threat
Predictors Odds Ratios P-value
Intercept 0.0217 <0.001
Biological resource use 0.7628 0.016
Energy production and Mining 0.5140 0.005
Human intrusions and disturbance 0.9570 0.879
Invasive species, genes and disease 5.2889 <0.001
Natural system modifications 1.3708 0.012
Pollution 1.5498 <0.001
Residential and Commercial Development 1.1123 0.352
Transportation and service corridors 0.5502 0.081
Observations 37633
R2 Tjur 0.019

Again, summarizing our significant results compared to the reference group (Agriculture/Aquaculture):

Species threatened by Agriculture/Aquaculture have a 2.12% chance of being extinct. Species threatened by Invasive species/Disease and by Pollution have a chance of being extinct that is 8.16 and 1.1 percentage points higher, respectively, than species threatened by Agriculture/Aquaculture.

Use seemed like another appropriate variable to investigate. Perhaps species that provide medicinal or energy uses are extracted at more unsustainable rates than species that provide an artisanal use.

Table 4. Logistic Regression of Extinction on Use
  Use
Predictors Odds Ratios P-value
Intercept 0.0000 0.995
ex - situ production 1018924.8700 0.997
fibre 1.0000 1.000
Food - animal 1.0000 1.000
Food - human 2451051.3686 0.996
fuels 1.0000 1.000
handicrafts, jewellery, etc 1213090.6232 0.997
Manufacturing chemicals 1.0000 1.000
Medicine 1120077.2179 0.997
Other 2628363.0170 0.996
other chemicals 21026904.1357 0.996
other household goods 1.0000 1.000
display animals, horticulture 365417.8191 0.997
Poisons 1.0000 1.000
Research 3304227.7928 0.996
sport hunting/Specimen collecting 349389.6458 0.997
unknown 1.0000 1.000
wearing apparel, accessories 2115511.6966 0.996
Observations 19133
R2 Tjur 0.008

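Taken at face value, this shows that the human use of a species is not significant for predicting extinction. However, the intercept odds ratio of essentially zero and the enormous odds ratios with p-values near 1 suggest that the reference level contains no extinct species, in which case the estimates are unstable and should not be over-interpreted. A further problem is the amount of missing data in this column. Out of our over 70,000 species observations, approximately 51,000 do not have associated use cases. This may be because we simply don’t have a human use for many species, or because the uses just aren’t properly documented.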
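The pattern in Table 4 (an intercept odds ratio of 0.0000, very large odds ratios, and p-values close to 1) resembles what a logistic regression produces when the reference level contains no events, often called separation. The sketch below uses simulated data, not the IUCN data, and the level names `a` and `b` are placeholders. It only illustrates the symptom and does not confirm that this is what happened in the original model.

```r
# Simulated data (not the IUCN data): the reference level "a" has no events.
use     <- factor(rep(c("a", "b"), each = 50))
extinct <- c(rep(0, 50),                 # level "a": 0 of 50 extinct
             rep(c(1, 0), c(10, 40)))    # level "b": 10 of 50 extinct

# glm() may warn about fitted probabilities of 0 or 1; suppress for the demo.
fit <- suppressWarnings(glm(extinct ~ use, family = binomial))

est <- coef(summary(fit))
print(round(est, 3))

intercept_logodds <- est["(Intercept)", "Estimate"]
useb_pvalue       <- est["useb", "Pr(>|z|)"]
```

The intercept is pushed toward minus infinity, the other coefficient compensates with a huge positive value, and the standard errors blow up. This makes the p-values uninformative. Choosing a reference level that contains extinct species is one simple way to avoid it.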

Taxonomy seemed like another interesting variable to investigate. It seems likely that more closely-related species will face similar extinction pressures. Since we’re working within the Animalia Kingdom, we will run a logistic regression using the class of each species. I’ve shortened the output below to only include a few of the classes for visual purposes.

Table 5. Logistic Regression of Extinction on Taxa
  Taxa
Predictors Odds Ratios P-value
class [AMPHIBIA] 2.5886 <0.001
class [ANTHOZOA] 0.2351 0.042
class [ARACHNIDA] 7.1234 <0.001
class [AVES] 1.4023 0.001
class [BIVALVIA] 6.4434 <0.001
class [GASTROPODA] 6.3545 <0.001
class [INSECTA] 1.2961 0.017
class [MAMMALIA] 1.9058 <0.001
class [REPTILIA] 0.7869 0.065
Observations 70067
R2 Tjur 0.017

There are quite a few classes that look to be significant here. Especially significant classes appear to be Actinopterygii (ray-finned fishes, our reference group), Amphibia, Arachnida, Aves, Bivalvia, and Gastropoda. Since we’re testing so many levels, we expect that our model will find some significance regardless of whether there is an actual effect. We will keep an eye on the significant classes as we build our larger model. Below is a dendrogram of the evolutionary relationships between classes and their significance levels.

Stepwise Model

Now, we want to see whether any of the significant levels from the previous models are still significant when we combine our variables into a more complete model. If they are, we can be more comfortable concluding that the level is associated with extinction. We start by predicting extinction from one variable, endemism, and then incrementally add our other variables of interest. We use a stepwise AIC function, which outputs a score for each step of the model, indicating which model does the best job of predicting extinction.

Step K AICc Delta_AICc AICcWt LL
5 56 1615.0 0.0 1 -751.2
4 47 8794.1 7179.1 0 -4350.0
3 18 8985.0 7370.0 0 -4474.5
2 10 11713.7 10098.7 0 -5846.9
1 2 14210.9 12595.9 0 -7103.5

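The last step in our model has the lowest AIC score and appears to be the best. This is slightly worrying, since it is the most complicated model: it uses 56 different parameters (roughly one for each level of each variable) to predict extinction. This could indicate over-fitting. Note also that AIC is only comparable between models fit to the same observations, and the steps here appear to use different subsets of the data (Table 6 reports 11,603 observations, versus 70,067 for endemism alone), so part of the drop in AIC may simply reflect the smaller sample. We’re therefore going to look at the coefficients and p-values of our significant variable levels to see which levels remain robust. Again, I’ve shortened the model output to include only the relevant levels.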
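As a rough illustration of the stepwise comparison, here is a minimal sketch that fits two candidate models and compares them with base R's `AIC()`. It is not the function used to produce the table above. The data are simulated, the column names are stand-ins, and the `complete.cases()` step is my own addition so that both models see exactly the same rows.

```r
set.seed(222)
n <- 500
# Simulated stand-in for the species table; all column names are hypothetical.
dat <- data.frame(
  endemic = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  threat  = factor(sample(c("Agriculture", "Pollution", NA), n, replace = TRUE))
)
dat$extinct <- rbinom(n, 1, ifelse(dat$endemic == "Yes", 0.15, 0.05))

# Keep only rows that are complete for every variable in the largest model,
# so that every candidate model is fit to exactly the same observations.
cc <- dat[complete.cases(dat), ]

step1 <- glm(extinct ~ endemic,          data = cc, family = binomial)
step2 <- glm(extinct ~ endemic + threat, data = cc, family = binomial)

# AIC() accepts several fitted models and returns a small comparison table.
aic_table <- AIC(step1, step2)
print(aic_table)
```

Without the `complete.cases()` step, `glm()` silently drops rows with missing values model by model, so larger models can end up being scored on far fewer species than smaller ones.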

Table 6. Generalized Logistic Regression
  Generalized Mod
Predictors Odds Ratios P-value
endemic [Yes] 4.1961 <0.001
habitat [Desert] 1.4347 1.000
habitat [Forest] 3927257.7750 0.996
habitat [Marine Neritic] 1059842.1871 0.996
habitat [Wetlands] 10420495.4785 0.996
threat [Pollution] 6.6093 <0.001
class [AMPHIBIA] 2.2075 0.019
class [ARACHNIDA] 0.0000 0.996
class [AVES] 8.3032 <0.001
class [GASTROPODA] 0.3001 0.245
class [INSECTA] 0.0000 0.990
use [Medicine] 2209540.2542 0.999
use [Poisons] 0.2001 1.000
use [Research] 4638444.2655 0.999
Observations 11603
R2 Tjur 0.101

The p-value for endemism remains far below any conventional significance threshold, and its effect stays in the same direction (the odds ratio rises from about 2.0 to about 4.2), so we can remain confident that endemic species are indeed more likely to be extinct. This is a robust indicator.

Similarly, threats of Pollution and Invasive Species/Disease remain robust in our more complete model. The p-values are far below the usual significance thresholds of 0.05 and 0.01, indicating that these are also robust indicators.

Finally, the Aves and Amphibia classes also look like they’re remaining robust, although Amphibia to a lesser degree (p = 0.019). The Aves odds ratio changes noticeably from when we evaluated taxa on their own (from about 1.4 to about 8.3), which suggests there may be an interaction between taxa and one of the other variables. Perhaps species in the Aves class (birds) are more prone to infectious diseases than species in other classes. Gastropoda, by contrast, is no longer significant in the shortened output of Table 6 (p = 0.245), and its odds ratio drops below 1, so it should be treated with caution.

So, the variable levels that we are most confident are associated with species extinction, at least in this dataset, are endemic species threatened by pollution or disease, in the Aves and Amphibia classes. This is fairly consistent with our current knowledge of extinction.

Code
dendrogram_2

Predictions

I then used the step4 model to predict the probability that a species is extinct. I used the step4 model instead of step5 because the augment function that I used removes any NA values before making predictions. The use variable only contains data for ~20,000 of our 70,000 species, so we’d be losing well over half of our data if we used the step5 model. We’re still losing ~30,000 species with our step4 model, since not every species has an associated threat, so we expect the p-value in our t-test to look slightly different.

The step4 model contains the variables endemic, habitat, threat, and class. I then used augment to output the predicted probability that a species is extinct based on the variables used in the step4 model.

We can evaluate how accurate the model is using a confusion matrix.

Prediction Reference Freq
0 0 36524
1 0 1065
0 1 1
1 1 1

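Although the model is 97% accurate, it is pretty bad at predicting whether species are extinct or not: the accuracy is driven almost entirely by the non-extinct majority, and only one species is correctly predicted as extinct. Here’s where machine learning comes into play.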
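To see why 97% accuracy is misleading here, the counts in the confusion matrix can be turned into a few standard metrics. This sketch reads the table exactly as labeled (Prediction first, Reference second). The variable names `tn`, `fp`, `fn`, and `tp` are my own.

```r
# Counts copied from the confusion matrix above, read as labeled.
tn <- 36524  # predicted 0, reference 0
fp <- 1065   # predicted 1, reference 0
fn <- 1      # predicted 0, reference 1
tp <- 1      # predicted 1, reference 1

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
recall    <- tp / (tp + fn)   # share of reference-extinct species that were flagged
precision <- tp / (tp + fp)   # share of flagged species that are reference-extinct

print(round(c(accuracy = accuracy, recall = recall, precision = precision), 4))
```

If the Prediction and Reference columns were in fact swapped in the output, the false positives and false negatives would trade places, and so would precision and recall. Either way, only one extinct species is identified correctly.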

Issues

Assumptions

I categorized species that are classified as “extinct in the wild” as extinct, since we are interested in species outside of captivity. I also took a case-by-case approach to classifying as extinct some of the critically endangered species for which the Possibly Extinct variable is TRUE, since many of these species have not been seen in many years and are widely agreed to be at least functionally extinct (meaning there are so few members of the species surviving in the wild that it is unlikely they will ever come into contact). Our Chinese river dolphin friend actually falls into this category: it is listed as critically endangered despite not having been seen since 2007.
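The recoding described above could be sketched roughly as follows. The column names `category` and `possibly_extinct` and the category labels are assumptions about the shape of the export, and the blanket rule for Possibly Extinct species is a simplification of the case-by-case approach actually used.

```r
# Hypothetical column names and category labels; the real export may differ.
species <- data.frame(
  category         = c("Extinct", "Extinct in the Wild",
                       "Critically Endangered", "Critically Endangered",
                       "Least Concern"),
  possibly_extinct = c(FALSE, FALSE, TRUE, FALSE, FALSE)
)

# Binary response: 1 = treated as extinct in this analysis, 0 = otherwise.
species$extinct <- as.integer(
  species$category %in% c("Extinct", "Extinct in the Wild") |
    (species$category == "Critically Endangered" & species$possibly_extinct)
)
print(species)
```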

Simplifications

To ensure that each observation was a species and that there weren’t multiple observations of the same species, I categorized each of the sub-habitats into one general habitat: tropical rainforests in Costa Rica and boreal forests in Siberia are both considered forests. I also collapsed species that live in multiple habitats into one habitat. I took similar approaches with the threat type and the use case, generalizing each subtype into one general type and collapsing species facing multiple threats into one threat. Each of these introduces its own oversimplification issues and should be explored more thoroughly in future investigations. Here is the link to IUCN’s classification schemes.

Limitations

On top of all this, it is extremely difficult to study extinction. For many species, we don’t have crucial data on population dynamics, geographic range, reproductive capacity, genetic diversity, and many other important factors. In fact, there are probably still millions of species of plants and animals that we have yet to identify, let alone gather enough pertinent data on to understand their status. The IUCN only describes species that went extinct relatively recently. When we consider the number of extinctions that have happened over geologic time scales, we are looking at an extremely minute sample. The IUCN has assessed only ~7% of described species. Even for species that we are aware of, it is very difficult to tell if a species is actually extinct. The Amsterdam Wigeon was endemic to Île Amsterdam in the French Southern Territories before it went extinct sometime between 1600 and 1800, likely due to visiting sealers and the rats they introduced. No naturalist even visited the island until 1874, and it was not until 1955 that its bones were found, which is the only reason we know it existed5. To illustrate an extreme example of how bad we are at this, we’ll take a look at the Coelacanths.

This ancient group first appeared in the fossil record over 400 million years ago. It disappeared from the fossil record 66 million years ago and was presumed to have gone extinct along with the dinosaurs. In 1938, one species of Coelacanth was rediscovered in a fishing net off the South African coast. Here it is, a living fossil, alive and swimming.

Figure 8. Live Coelacanth

Since then, another Coelacanth species has been discovered, and over 100 individual specimens have been recorded. Coelacanths are classified as critically endangered; the IUCN estimates that fewer than 500 exist in the wild, and they are suffering as a result of over-fishing. This is one example of a Lazarus taxon, an evolutionary line that disappeared from the fossil record only to reappear much later.

Next Steps

A logical next step to continue this analysis would be to more carefully investigate each variable individually by categorizing them even more broadly. Since we’re testing so many different parameters here, our model is likely to find significance regardless of whether the effect is actually there.

Additionally, it is highly likely that many of these variables are related to each other, meaning that one variable both influences the outcome and is correlated with another variable. For example, more closely related species (those with similar taxonomy) are probably more likely to share habitats, which will also influence how likely it is that they are extinct. Species that share habitats are also probably more likely to share similar threats, especially in habitats undergoing destruction. To address this, we would add interaction terms (habitat:class, threat:habitat) and look again at how the coefficients change from model to model. If we find that the coefficients for any of our variables and our intercept change dramatically, then it is likely that we have an interaction effect.
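A minimal sketch of what adding an interaction term might look like, using simulated data in which the effect only appears for one habitat and class combination. The variable names mirror the post, but the values, the two-level factors, and the likelihood-ratio test are illustrative choices of mine.

```r
set.seed(222)
n <- 2000
# Simulated data; variable names mirror the post but the values are made up.
dat <- data.frame(
  habitat = factor(sample(c("Forest", "Wetlands"), n, replace = TRUE)),
  class   = factor(sample(c("AVES", "AMPHIBIA"), n, replace = TRUE))
)
# Build in an effect that only appears for one habitat-class combination.
p <- ifelse(dat$habitat == "Wetlands" & dat$class == "AMPHIBIA", 0.30, 0.05)
dat$extinct <- rbinom(n, 1, p)

additive <- glm(extinct ~ habitat + class, data = dat, family = binomial)
with_int <- glm(extinct ~ habitat * class, data = dat, family = binomial)

# Likelihood-ratio test of the interaction term, plus an AIC comparison.
lrt <- anova(additive, with_int, test = "Chisq")
print(lrt)
print(AIC(additive, with_int))
```

In R's formula syntax, `habitat * class` expands to both main effects plus the `habitat:class` interaction. With many levels per variable the number of interaction coefficients grows quickly, so collapsing levels first would probably be necessary on the real data.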

Another idea to investigate, which I suspect plays an important role in species extinction, is a species’ ecological niche and trophic level. A specialist species with a narrow ecological niche would likely be much more sensitive to changes in environmental conditions, and is therefore likely more prone to extinction than a generalist species that can fill a variety of niches. Similarly, I suspect that species at low trophic levels are much less likely to go extinct than species at higher trophic levels, because they have lower energy requirements and depend on less of the food chain. In the case of an ecosystem collapse, species at higher trophic levels will likely be the first to die out.

Finally, this analysis only uses data from the animal kingdom. An analysis incorporating plants, archaea, fungi, and eubacteria would certainly give us a fuller picture of extinction. However, given the data that we have available, we are very far from being able to perform this analysis properly.

Conclusion

Our picture of extinction is far from complete. In fact, our picture of species is far from complete. Scientists estimate that there are around 8.7 million species of plants and animals in existence6, yet we’ve only identified 1.2 million. This doesn’t even include the kingdoms of life about which we know the least: the Fungi, Archaea, Protozoa, and Bacteria. It’s accepted among scientists that some species alive today will go extinct long before they are discovered. Species will continue to go extinct due to our mistreatment of the natural world, and we will lose far more than simply the services they provide to humans. As of right now, there is no reversing extinction (although this is likely to change).

Despite this bleak reality, we are in a better position to address and mitigate our predicament than we were even just a few decades ago. Our tools to understand the natural world, our policies to safeguard it, and our desire to protect it are becoming more interconnected every day. With better information, we are able to make better decisions to become better stewards of the planet. If one thing is clear, the more we learn about life, the more it surprises us. Despite our ongoing destruction of much of the natural world, life finds a way.

Footnotes

  1. Barnosky, A., Matzke, N., Tomiya, S. et al. Has the Earth’s sixth mass extinction already arrived? Nature 471, 51–57 (2011). https://doi.org/10.1038/nature09678

  2. Ceballos G, Ehrlich PR, Barnosky AD, García A, Pringle RM, Palmer TM. Accelerated modern human-induced species losses: Entering the sixth mass extinction. Sci Adv. 2015 Jun 19;1(5):e1400253. doi: 10.1126/sciadv.1400253. PMID: 26601195; PMCID: PMC4640606.

  3. Barnosky, A., Matzke, N., Tomiya, S. et al. Has the Earth’s sixth mass extinction already arrived? Nature 471, 51–57 (2011). https://doi.org/10.1038/nature09678

  4. https://www.iucnredlist.org/search

  5. https://en.wikipedia.org/wiki/Amsterdam_wigeon

  6. Sweetlove, L. Number of species on Earth tagged at 8.7 million. Nature (2011). https://doi.org/10.1038/news.2011.498

Citation

BibTeX citation:
@online{bartnik2022,
  author = {Bartnik, Andrew},
  title = {Exploring {Extinction}},
  date = {2022-12-04},
  url = {https://andrewbartnik.github.io/Portfolio/exploring-extinction},
  langid = {en}
}
For attribution, please cite this work as:
Bartnik, Andrew. 2022. “Exploring Extinction.” December 4, 2022. https://andrewbartnik.github.io/Portfolio/exploring-extinction.