Re: Taxonomy

Rick Walker (walker@cutter.hpl.hp.com)
Mon, 29 Nov 93 19:35:32 PST

> Then I would classify the 8 species of _Sarracenia_ as...
> 00001=S.alata
> 00011=S.minor
> 01001=S.flava

Barry et al,

This system defines a single point in an N-dimensional space for each
species type. It seems like it might be better to allow a real value
rather than a boolean value for each dimension. Also, it might be good
to give both a mean and a standard deviation for each characteristic.

I think the boolean characteristics have the problem of seeing the world
as "black and white" and fail when a given species has a lot of
variability.

> These queries would reflect the most parsimonious
> way to separate the remaining possible taxa,

With continuous values for each feature, you can rigorously define
Michael's "parsimonious-ness" as the probability that a given specimen
could have come from the statistical distribution of characteristics as
defined by the taxonomist.
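
To make that concrete, here is a minimal sketch in Python (not the C
program mentioned below). It assumes each characteristic is normally
distributed and independent of the others, which the discussion above
does not require. The taxon names, characteristic names, and all numbers
are invented placeholders, not real measurements.

```python
import math

# Hypothetical taxa with made-up (mean, standard deviation) pairs per
# characteristic. None of these numbers are real measurements.
TAXA = {
    "taxon_a": {"pitcher_height_cm": (30.0, 5.0), "lid_width_cm": (4.0, 1.0)},
    "taxon_b": {"pitcher_height_cm": (60.0, 10.0), "lid_width_cm": (4.5, 1.0)},
}


def log_density(x, mean, sd):
    """Log of the normal probability density at x."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (x - mean) ** 2 / (2.0 * sd * sd)


def score(specimen, taxon):
    """Sum the log densities of the measured characteristics.

    Summing logs assumes the characteristics are independent.
    """
    return sum(
        log_density(x, *taxon[name])
        for name, x in specimen.items()
        if name in taxon
    )


def rank(specimen, taxa=TAXA):
    """Return taxon names ordered from most to least plausible."""
    return sorted(taxa, key=lambda t: score(specimen, taxa[t]), reverse=True)
```

Calling rank({"pitcher_height_cm": 33.0, "lid_width_cm": 4.1}) puts
"taxon_a" first, because 33 cm sits well inside its invented height
distribution and far out in the tail of the other one.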

One advantage of such an approach is that it would require the
taxonomist to define not only the "type" but also the degree of
variability. (I realize that this is difficult when only 1 or 2
specimens of the species even exist.)

With such a formulation, it should be possible for the computer to look
at a database and ask the question: "given this data, and what I know so
far, what question will eliminate the most possibilities?" By asking
"the best" question at each stage, the program should rapidly converge.

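One simple way to pick "the best" question can be sketched as follows.
This is a simplification: it uses plain yes/no traits rather than the
statistical formulation, and it takes "best" to mean the question whose
worse answer leaves the fewest candidates. The taxon and trait names are
hypothetical placeholders.

```python
# Hypothetical candidates, each described by the set of traits it has.
CANDIDATES = {
    "taxon_a": {"trait_1", "trait_2", "trait_4"},
    "taxon_b": {"trait_1"},
    "taxon_c": {"trait_2", "trait_3"},
    "taxon_d": {"trait_3"},
}
FEATURES = ["trait_4", "trait_1", "trait_2", "trait_3"]


def best_question(candidates, features):
    """Pick the trait whose worse answer leaves the fewest candidates."""
    def worst_case(feature):
        yes = sum(1 for traits in candidates.values() if feature in traits)
        return max(yes, len(candidates) - yes)
    return min(features, key=worst_case)


def identify(candidates, features, answer):
    """Ask questions until one candidate (or no question) remains.

    `answer` is a callable taking a trait name and returning True or False.
    """
    remaining = dict(candidates)
    unasked = list(features)
    while len(remaining) > 1 and unasked:
        question = best_question(remaining, unasked)
        unasked.remove(question)
        reply = answer(question)
        remaining = {
            name: traits
            for name, traits in remaining.items()
            if (question in traits) == reply
        }
    return sorted(remaining)
```

With four candidates, a trait held by exactly two of them halves the
field whichever way the answer goes, so it is preferred over "trait_4",
which only one candidate has.
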
> The main difference is that while normal botanical keys have the most
> important distinctions keyed out first, and then you look at the
> details, this system has no required order. Of course, it means all
> features must be examined. There are strengths and weaknesses to this
> method, I guess.

With the statistical method you are free to leave out any characteristic
that is difficult to measure. You'll just get a slightly less significant
result. Also, the less diagnostic characteristics will automatically be
taken as less significant because their statistical distributions will
overlap.
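
A small sketch of both points, again assuming normal distributions and
using invented numbers and hypothetical characteristic names: the
evidence a measurement gives for taxon A over taxon B is the difference
of the log densities, and unmeasured characteristics are simply skipped.

```python
import math


def log_density(x, mean, sd):
    """Log of the normal probability density at x."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (x - mean) ** 2 / (2.0 * sd * sd)


def evidence(x, dist_a, dist_b):
    """Log-likelihood ratio: positive favors A, negative favors B."""
    return log_density(x, *dist_a) - log_density(x, *dist_b)


def total_evidence(specimen, taxon_a, taxon_b):
    """Add up the evidence from measured characteristics only.

    A value of None means the characteristic was not measured.
    """
    return sum(
        evidence(x, taxon_a[name], taxon_b[name])
        for name, x in specimen.items()
        if x is not None
    )


# Made-up (mean, standard deviation) pairs, for illustration only.
A = {"height_cm": (30.0, 5.0), "lid_width_cm": (4.0, 1.0)}
B = {"height_cm": (60.0, 5.0), "lid_width_cm": (4.5, 1.0)}

well_separated = evidence(30.0, A["height_cm"], B["height_cm"])
overlapping = evidence(4.0, A["lid_width_cm"], B["lid_width_cm"])
```

Here the well separated characteristic contributes far more evidence
than the overlapping one, so dropping the overlapping measurement (by
passing None) changes the total very little.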

> Barry, I have heard of computerized taxonomic keys, but I have never
> seen or used one.

If anyone is interested, I have a C program that will read a textual
definition of a key and will ask questions in an efficient manner to
identify a specimen. At the moment though, the only database I have
is for identifying 55 families of Sierra Nevada wildflowers.

Here's a snippet of the definition file:

---------------------- cut here ------------------------
if Petals present and evident
if Sepals and petals of each flower in 4's or 5's or multiples
if Petals fused together
if Ovary completely inferior
if Stamens fused into a tube
if Stem erect, not vinelike, many small flowers in single head
thenhyp plant is in the Compositae family
!
if Petals present and evident
if Sepals and petals of each flower in 4's or 5's or multiples
if Petals fused together
if Two or more petal shapes and sizes in 1 flower
if Ovary superior
if Plant stem green
if Flower tubular at least near base
if Fruit of a single capsule, stems rounded
thenhyp plant is in the Scrophulariaceae family
!
if Petals present and evident
if Sepals and petals of each flower in 4's or 5's or multiples
if Petals free from each other, or almost so
if Stamens NOT more than twice as many as petals
if Style 1, not divided or only slightly
if Petals 4
if Ovary superior, 6 stamens
thenhyp plant is in the Cruciferae family
...
ifnot Petals present and evident
ifnot No Petals, but Sepals evident
then No Petals, and no Sepals
!
ifnot Flowers (actually sepals) dark brownish-red
then Flowers (or sepals) not dark brownish-red
!
ifnot Sepals and petals of each flower in 3's or multiples
then Sepals and petals of each flower in 4's or 5's or multiples
...
---------------------- cut here ------------------------
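
For anyone who wants to experiment with the format, here is a rough
Python sketch of a reader for it. The grammar is my guess from the
snippet alone: each line starts with one of the keywords if, ifnot,
then, or thenhyp; "!" separates rules; "thenhyp" marks a final
identification while "then" records an intermediate fact. The function
name and the dictionary layout are hypothetical, and this is not the C
program's actual parser.

```python
def parse_rules(text):
    """Parse rule text into a list of dictionaries (assumed grammar)."""
    rules = []
    conditions = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line in ("!", "..."):
            # "!" ends a rule; "..." marks material elided from the post.
            conditions = []
            continue
        keyword, _, rest = line.partition(" ")
        if keyword in ("if", "ifnot"):
            # Store the statement and whether it must be true or false.
            conditions.append((rest, keyword == "if"))
        elif keyword in ("then", "thenhyp"):
            rules.append({
                "conditions": conditions,
                "conclusion": rest,
                "final": keyword == "thenhyp",
            })
            conditions = []
        else:
            raise ValueError("unrecognized line: " + line)
    return rules


# Two rules copied from the snippet above.
SAMPLE = """\
if Petals present and evident
if Sepals and petals of each flower in 4's or 5's or multiples
if Petals free from each other, or almost so
if Stamens NOT more than twice as many as petals
if Style 1, not divided or only slightly
if Petals 4
if Ovary superior, 6 stamens
thenhyp plant is in the Cruciferae family
!
ifnot Petals present and evident
ifnot No Petals, but Sepals evident
then No Petals, and no Sepals
"""

RULES = parse_rules(SAMPLE)
```

Each parsed rule keeps its conditions in file order, so a driver could
ask about them one at a time and skip any statement already answered.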

And here's a snippet of an actual run:

---------------------- cut here ------------------------
Welcome to flowers: an expert system for identifying
California flowers of the High Sierra region from
Mt. Lassen to the Kern Canyon. Based on the book,
Sierra Wildflowers by Theodore F. Niehaus. Published
by UC Press, Berkeley, CA., 94720.

Author: Rick Walker, August 9, 1991.

Please answer the following questions about your specimen.
you may type y,Y,t,T for yes; n,N,f,F for no.
w or W will give an explanation of what rules have lead up
to the current question...

Is it true that: Petals present and evident ? y
Is it true that: Sepals and petals of each flower in 3's or multiples ? n
Is it true that: Petals free from each other, or almost so ? y
Is it true that: Stamens numerous, MORE than twice as many as petals ? y
Is it true that: No Petals, but Sepals evident ? n
Is it true that: Ovary partially inferior ? y
Is it true that: Stipule present below leaf petiole ? y

I infer that: plant is in the Rosaceae family
---------------------- cut here ------------------------

This still suffers from the concept of integer-valued attributes
(i.e., mostly boolean, but sometimes 3-valued attributes).

Is anyone up to writing (or supplying the knowledge for) a cp
database :-) ?

--
Rick