Visual Diagnostics for Algorithmic Cartridge Case Comparisons

Joseph Zemmels, Heike Hofmann, Susan VanderPlas

Acknowledgements

Thank you to everyone at the Roy J Carver High Resolution Microscopy Facility for collecting cartridge case scans.

Funding statement

This work was partially funded by the Center for Statistics and Applications in Forensic Evidence (CSAFE) through Cooperative Agreement 70NANB20H019 between NIST and Iowa State University, which includes activities carried out at Carnegie Mellon University, Duke University, University of California Irvine, University of Virginia, West Virginia University, University of Pennsylvania, Swarthmore College and University of Nebraska, Lincoln.

Background

Cartridge Case Comparisons

For this project, we’re interested in determining whether two cartridge cases were fired from the same firearm.

A cartridge case is a metal casing containing the primer, powder, and a projectile. In the animation, you can see an example of a cartridge case that is ejected from the barrel after firing.

When a gun is fired and the bullet travels down the barrel, the cartridge case stays in the barrel and is sent backwards as a reaction to the bullet moving forward.

It then slams against the back wall of the barrel, also known as the “breech face,” with great force.

Markings on the breech face are impressed into the surface of the cartridge case, leaving so-called “breech face impressions.”

Forensic examiners use these breech face impressions, analogous to a fingerprint, to identify the gun from which a cartridge case was fired.

Determine whether two cartridge cases were fired from the same firearm.

Cartridge Case: metal casing containing primer, powder, and a projectile
Breech Face: back wall of gun barrel
Breech Face Impressions: markings left on cartridge case surface by the breech face during the firing process

Current Practice

Suppose that you recover two cartridge cases: one from a crime scene and another from a suspect’s firearm.

The way that forensic examinations are commonly performed today involves placing the cartridge cases under a comparison microscope.

An example of a comparison microscope is shown at the bottom of the slide. The examiner would place the two cartridge cases on stages. You can see that there are two microscopes, one for each stage, that are combined into a single view that the examiner can look through.

The goal of the forensic examination is to assess the “agreement” of the impressions on the two cartridge cases. For our purposes, we are particularly interested in the impressions on the cartridge case primer, which you can see highlighted on the left side of the diagram.

The final result is the examiner’s conclusion on whether the cartridge cases originated from the same firearm.

Cartridge case recovered from crime scene vs. fired from suspect’s firearm
Place evidence under a comparison microscope for simultaneous viewing (Thompson 2017)
Assess the “agreement” of impressions on the two cartridge cases (AFTE Criteria for Identification Committee 1992)

Impression Comparison Algorithms

Now why are we talking about comparison algorithms today?

Well, in recent years the scientific validity of many forensic disciplines has been called into question.

For example, in 2009, a report from the National Research Council stated that the decision of a toolmark examiner, which is the examiner who would be making these comparisons, remains a subjective decision based on unarticulated standards and no statistical foundation for estimation of error rates.

NEXT

Seven years later, the President’s Council of Advisors on Science and Technology said something similar and emphasized that firearms analysis should convert from a subjective method to an objective method, which would involve developing image analysis algorithms for comparing the similarity of tool marks on firearm evidence including cartridge cases.

NEXT

Today, we will discuss an image-analysis algorithm called the Automatic Cartridge Evidence Scoring or “ACES” algorithm that compares cartridge case evidence. In particular, we’ll introduce a series of diagnostic tools that are useful in understanding how these algorithms work “under the hood,” so to speak

National Research Council (2009):

“[T]he decision of a toolmark examiner remains a subjective decision based on unarticulated standards and no statistical foundation for estimation of error rates”

President’s Council of Advisors on Science and Technology (2016):

“A second - and more important - direction is (as with latent print analysis) to convert firearms analysis from a subjective method to an objective method. This would involve developing and testing image-analysis algorithms for comparing the similarity of tool marks on bullets [and cartridge cases].”

We discuss the Automatic Cartridge Evidence Scoring (ACES) algorithm to compare 3D topographical images of cartridge cases

Visual diagnostics aid in understanding what the algorithm does “under the hood.”

Cartridge Case Comparison Algorithms

Ames I Study

Before we dive into algorithms, I wanted to first discuss the data that we will be using throughout this presentation.

We use cartridge cases collected as part of a 2014 study, commonly called the “Ames I” study because it included researchers from Iowa State University.

For the study, the researchers fired cartridge cases from 25 Ruger SR9 pistols

NEXT

These cartridge cases were then separated into groups of 4 consisting of 3 “known match” cartridge cases and 1 “unknown source” cartridge case. It’s important to note here that we call cartridge cases a “match” if they were fired from the same firearm and a “non-match” if they were fired from different firearms.

NEXT

The researchers then sent these cartridge case sets to 218 examiners who were tasked with determining whether the one unknown source cartridge case came from the same pistol as the three known-match cartridge cases

Some more terminology here: correctly classifying a set of matching cartridge cases is called a “true positive,” while a “true negative” is correctly classifying a set of non-matching cartridge cases.

NEXT

The results they got back showed that the examiners made very few errors during their examinations. On the slide you will see a table of results from the study. There were a total of 3,270 comparisons from the participants, which we separate here into different outcomes. The three columns represent the participants’ responses - whether they concluded that cartridge cases matched, they did not match, or the evidence was inconclusive.

An “inconclusive” is a type of conclusion sanctioned by the Association of Firearm and Toolmark Examiners for situations in which two cartridge cases may share some similarities or differences, but not enough to conclusively say whether they are a match or non-match.

The rows of the table represent the ground-truth of the comparisons, so we can compare the examiners’ conclusions against the true nature of the comparisons.

We see that there were 1,075 true positives (remember, these are correctly classified matching comparisons) out of a total of 1,090 matching comparisons. There were also 1,421 true negatives out of a total of 2,180 non-matching comparisons.

The other numbers in the first two columns represent errors that the examiners made. There were 4 truly matching comparisons that were misclassified as non-matches – we call these false negatives – and there were 22 non-matching comparisons that were misclassified as matches, these are the false positives.

The third column shows the number of inconclusives where we see that there were many more inconclusive decisions for truly non-matching comparisons than for matching comparisons.

NEXT

The table you see here shows the true positive, true negative, and overall inconclusive rates from the results of this study. The true positive rate is the number of true positives divided by the total number of matching comparisons.

Similarly, the true negative rate is the number of true negatives divided by the total number of non-matching comparisons.

We see that the true negative rate is lower than the true positive rate, although this can be accounted for by the large number of inconclusives made for the non-matching comparisons.

This brings us to the inconclusive rate, which is the number of inconclusives, 748, divided by the total number of comparisons.

Baldwin et al. (2014) collected cartridge cases from 25 Ruger SR9 pistols

Separated cartridge cases into quartets: 3 known-match + 1 unknown source
Match if fired from the same firearm, Non-match if fired from different firearms

218 examiners tasked with determining whether the unknown cartridge case originated from the same pistol as the known-match cartridge cases
- True Positive if a match is correctly classified, True Negative if non-match is correctly classified

	Match Conclusion	Non-match Conclusion	Inconclusive Conclusion	Total
Ground-truth Match	1,075	4	11	1,090
Ground-truth Non-match	22	1,421	735 + 2*	2,180

Inconclusive: conclusion when there is some agreement or disagreement in characteristics, but not enough to make a match or non-match conclusion (AFTE Criteria for Identification Committee 1992)

True Positive (%)	True Negative (%)	Overall Inconclusives (%)
99.6	65.2	22.9
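The rates can be recomputed from the counts in the results table with a few lines of Python. This is only an arithmetic sketch; the variable and function names are made up for illustration. One caveat: dividing the 1,075 true positives by all 1,090 matching comparisons gives 98.6%, while dividing by only the conclusive matching comparisons (1,075 + 4) gives the 99.6% shown in the table. Which denominator the table intends is an inference from the arithmetic, not something stated on the slide, so both are computed.

```python
# Counts copied from the results table (rows: ground truth, columns: conclusion).
counts = {
    "match": {"match": 1075, "non_match": 4, "inconclusive": 11},
    "non_match": {"match": 22, "non_match": 1421, "inconclusive": 735 + 2},
}

def rate(numerator, denominator):
    """Return a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)

total_match = sum(counts["match"].values())          # 1,090 matching comparisons
total_non_match = sum(counts["non_match"].values())  # 2,180 non-matching comparisons
total = total_match + total_non_match                # 3,270 comparisons overall
inconclusives = (counts["match"]["inconclusive"]
                 + counts["non_match"]["inconclusive"])  # 748

true_negative_rate = rate(counts["non_match"]["non_match"], total_non_match)
inconclusive_rate = rate(inconclusives, total)

# Two candidate denominators for the true positive rate (see the caveat above).
tpr_all_matches = rate(counts["match"]["match"], total_match)
tpr_conclusive_only = rate(
    counts["match"]["match"],
    counts["match"]["match"] + counts["match"]["non_match"],
)

print(tpr_all_matches, tpr_conclusive_only, true_negative_rate, inconclusive_rate)
```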

Cartridge Case Data

So we have these cartridge cases from the Ames I study, but the question is: “how do we go from a physical cartridge case to something that we can use on a computer?”

The answer is to take a 3D topographical scan of the cartridge case surface using the Cadre TopMatch scanner. You can see an example of one such topographic image on the slide. This scan is taken at the micrometer, or “micron,” level and stored in the x3p file format.

This image is actually interactive, so I can show you what the scan looks like from different angles. Again, we are interested in the cartridge case primer, which is the circular region in the middle of the scan. You can see that we pick up areas around the primer that we will eventually remove. You will also note that the firing pin impression is a sort of plateaued region in the middle of the primer that is caused by the deformation of the metal when it’s struck by the firing pin.

Keep in mind that we are specifically interested in the circular region around this firing pin impression.

3D topographic images using Cadre™ TopMatch scanner from Roy J Carver High Resolution Microscopy Facility
x3p file contains surface measurements at lateral resolution of 1.8 micrometers (“microns”) per pixel

Cartridge Case Comparison Algorithms

Now let’s discuss comparison algorithms

The goal of a cartridge case comparison algorithm is to obtain a measure of similarity between two cartridge cases

There are different types of algorithms out there right now, but they all follow the same, basic structure

The first step of these algorithms is to pre-process the scans to isolate the breech face impressions. For example, in the scan we talked about on the last slide, we want to remove the firing pin impression and region around the primer to isolate the breech face impressions.

NEXT

Next, once we have two pre-processed cartridge cases, we compare them to extract a set of numerical features that distinguish between matches and non-matches. I’ll talk in a few slides about some common numerical features we calculate.

NEXT

Finally, we combine the numerical features into a single similarity score, such as a continuous score between 0 and 1, for whether the two cartridge cases match.

NEXT

Eventually, we hope that these algorithms will be used in casework to help the examiner reach a conclusion. However, we first need to understand an algorithm’s limitations before we know how the examiner should interpret the similarity score.

One challenge we’ve commonly faced while working with these comparison algorithms is knowing how and when these steps work as we intend them to. In the next few slides, I will discuss challenges we’ve faced at each step.

Obtain an objective measure of similarity between two cartridge cases

Step 1: Independently pre-process scans to isolate breech face impressions

Step 2: Compare two cartridge cases to extract a set of numerical features that distinguish between matches vs. non-matches

Step 3: Combine numerical features into a single similarity score (e.g., similarity score between 0 and 1)

Examiner takes similarity score into account during an examination

Challenging to know how/when these steps work correctly

Step 1: Pre-process

Isolate region in scan that consistently contains breech face impressions

How do we know when a scan is adequately pre-processed?

Step 2: Compare Full Scans

Registration: Determine rotation and translation to align two scans

Cross-correlation function (CCF) measures similarity between scans
- Choose the rotation/translation that maximizes the CCF

The next step is to compare two pre-processed scans.

A common technique we use to compare two scans is called “registration,” which essentially involves finding the rotation and translation at which the two scans align best.

In the example on the slide, we have two matching scans, K013sA1 and K013sA2, that were fired from the same firearm. To “register” the two scans, we need to slightly rotate and shift one of the scans to align it to the other.

NEXT

The way we often choose a registration is by using the Cross-Correlation Function, which measures the similarity between two scans. A large CCF value implies highly similar scans. As such, we choose the rotation and translation that maximizes the cross-correlation function between the two scans.

One important note is that cartridge cases often only have a few regions with distinguishable impressions. This means that even for matching scans, the cross-correlation may not be very large since similarities are “drowned-out” by the dissimilarities.
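As a rough illustration of choosing a registration by maximizing the CCF, the sketch below searches integer translations only and scores each one with the Pearson correlation over the overlapping pixels. Rotation, missing values, and sub-pixel shifts are left out to keep it short, and the function names (`pearson`, `ccf_at`, `register`) are invented for this example rather than taken from any cartridge case software.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def ccf_at(reference, target, dy, dx):
    """Correlation between `reference` and `target` when reference pixel
    (r, c) is laid over target pixel (r + dy, c + dx)."""
    ref_vals, tgt_vals = [], []
    for r, row in enumerate(reference):
        for c, value in enumerate(row):
            tr, tc = r + dy, c + dx
            # Only pixels that land inside the target contribute.
            if 0 <= tr < len(target) and 0 <= tc < len(target[0]):
                ref_vals.append(value)
                tgt_vals.append(target[tr][tc])
    return pearson(ref_vals, tgt_vals) if len(ref_vals) > 1 else 0.0

def register(reference, target, max_shift):
    """Return (best_ccf, dy, dx) over all integer shifts up to `max_shift`."""
    best = (-2.0, 0, 0)  # correlations are >= -1, so any real score beats this
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            score = ccf_at(reference, target, dy, dx)
            if score > best[0]:
                best = (score, dy, dx)
    return best
```

A full registration would add an outer loop over candidate rotation angles and keep the angle and shift with the largest score.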

NEXT SLIDE

Step 2: Compare Cells

Split one scan into a grid of cells that are each registered to the other scan (Song 2013)
For a matching pair, we assume that cells will agree on the same rotation & translation

Why does the algorithm “choose” a particular registration?

To solve this issue, John Song, a researcher at NIST, proposed splitting one of the cartridge cases into a grid of “cells,” each of which is registered to the other scan. So we essentially repeat the process from the last slide, but now for each cell. This allows us to consider specific regions of the scan that might contain distinguishable markings rather than considering the full scans all at once.

The key assumption we make here is that cells will agree on the same rotation and translation if the cartridge case pair is truly matching. In the example on the slide, you see three cells from the scan on the left and where they register in the other scan. One thing I will point out here is that the two cells in the top-right appear to agree on the same registration, as evidenced by these parallel connecting lines. In contrast, the cell in the bottom left does not agree with the registration – you see that the connecting line is not parallel to the other two.

NEXT

So something we wrestled with at this step of the algorithm is understanding why the algorithm “chooses” a particular registration. In a few slides, I will discuss tools we developed to address this question.
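The cell-splitting step can be sketched as follows. This is a minimal illustration, assuming the scan is a rectangular list of rows and that leftover pixels that do not fill a whole cell are simply dropped; the function name `split_into_cells` and the 1-based (row, column) keys are choices made for this example.

```python
def split_into_cells(scan, n_rows, n_cols):
    """Split a 2D list into an n_rows x n_cols grid of equally sized cells.

    Cells are keyed by 1-based (row, column), matching labels such as
    "cell 1, 6". Pixels beyond the last full cell are ignored.
    """
    height, width = len(scan), len(scan[0])
    cell_h, cell_w = height // n_rows, width // n_cols
    cells = {}
    for i in range(n_rows):
        for j in range(n_cols):
            rows = scan[i * cell_h:(i + 1) * cell_h]
            cells[(i + 1, j + 1)] = [
                row[j * cell_w:(j + 1) * cell_w] for row in rows
            ]
    return cells
```

Each cell would then be registered to the other scan in the same way as a full scan, giving one rotation, translation, and CCF value per cell.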

Step 3: Score

Measure of similarity for two cartridge cases
- Maximized CCF (0.27 in example below) (Vorburger et al. 2007; Tai and Eddy 2018)
- Congruent Matching Cells (11 CMCs in example below) (Song 2013)

Our approach: similarity score between 0 and 1 using a statistical model

What factors influence the final similarity score?

Once we compare the two scans, the final step is to return some similarity score. Different scores have been proposed over the years.

For example, the maximum cross-correlation value is a reasonable choice since it measures the similarity of the two scans. However, as we discussed, the CCF may not be very large if we compute it between the full scans. In the example on the slide, the maximum CCF is 0.27, which isn’t very large given that the two cartridge cases we’ve been working with are matching.

Another similarity score, proposed by John Song in 2013, is the number of “Congruent Matching Cells” or CMCs. This CMC algorithm essentially tries to identify cells that “agree” on the same registration. In the example on the slide, you see 11 CMCs in blue and 28 non-CMCs in red. Notice that the blue cells are organized in a grid-like pattern, as we would expect. For example, we expect cell 2, 7 to be to the right of cell 2, 6, and this is indeed what we see in the second scan. On the other hand, cell 2, 5 is not at all where we would expect, so it makes sense that we would exclude it from the similarity score.
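One simplified way to count cells that “agree” on a registration is to compare each cell’s rotation and translation against the median across cells and require a minimum CCF. The sketch below follows that idea; the threshold values, the dictionary keys, and the function name `count_cmcs` are placeholders chosen for illustration and should not be read as the settings used in the published CMC method.

```python
from statistics import median

def count_cmcs(cells, ccf_min=0.5, max_shift_diff=20, max_theta_diff=6):
    """Count cells whose registration is close to the consensus registration.

    Each cell is a dict with keys "ccf", "theta" (degrees), "dx", and "dy"
    (pixels). All default thresholds are hypothetical.
    """
    # The consensus registration is taken to be the per-parameter median.
    theta_ref = median(cell["theta"] for cell in cells)
    dx_ref = median(cell["dx"] for cell in cells)
    dy_ref = median(cell["dy"] for cell in cells)
    return sum(
        1
        for cell in cells
        if cell["ccf"] >= ccf_min
        and abs(cell["theta"] - theta_ref) <= max_theta_diff
        and abs(cell["dx"] - dx_ref) <= max_shift_diff
        and abs(cell["dy"] - dy_ref) <= max_shift_diff
    )
```

A cell like cell 2, 5 in the example, which lands far from where its neighbors suggest it should, would fail the translation check and be excluded from the count.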

NEXT

We take a slightly different approach by using a statistical model to compute a similarity score.

As we worked with these similarity scores, it became clear that we did not understand the factors that influenced the final result. We were sort of at the whim of the algorithm. Again, I will discuss some tools that we created to address this challenge.

Visual Diagnostics

Visual Diagnostics for Algorithms

A number of questions arise out of using comparison algorithms
- How do we know when a scan is adequately pre-processed?
- Why does the algorithm “choose” a particular registration?
- What factors influence the final similarity score?

We wanted to create tools to address these questions
- Well-constructed visuals are intuitive and persuasive
- Useful for both researchers and practitioners to understand the algorithm’s behavior

X3P Plot

Emphasizes extreme values in scan that may need to be removed during pre-processing
Allows for comparison of multiple scans on the same color scheme
Map quantiles of surface values to a divergent color scheme

The first diagnostic tool we will discuss is the X3P plot, which is a way of representing the surface values of cartridge case scans using a purple, white, orange color scheme.

In the example on the slide, you can see two matching cartridge cases. Purple observations are associated with surface values below the median value while orange observations are above the median. We represent the median value with the color white.

We use this color scheme to emphasize extreme values in the scan and to compare the surface values across scans. Extreme values are often highly influential in the comparison step, so it is important that we remove extreme values that aren’t associated with breech face impressions so that the breech face impressions can be properly compared.

In the example, we can make out some similar markings on the two scans, such as the orange scratch at the 7 o’clock position of the firing pin impression or the striped impressions around the 4 o’clock position. There are also noticeable differences, such as the dent-like marking near the 11 o’clock position of the firing pin impression in the right scan that isn’t shared with the left scan.

The X3P plot is useful by itself to compare markings between two scans, but we have found it particularly useful for determining whether a cartridge case requires additional pre-processing.
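The idea of mapping quantiles of surface values to a divergent color scheme can be sketched in a few lines. This is an illustration only: the RGB endpoints are assumed values, missing surface values are assumed to have been removed beforehand, and `quantile_colors` is a hypothetical helper, not part of any plotting package.

```python
from bisect import bisect_left, bisect_right

# Assumed RGB endpoints for a purple-white-orange scale (illustrative only).
PURPLE, WHITE, ORANGE = (84, 39, 136), (255, 255, 255), (230, 97, 1)

def blend(start, end, t):
    """Linearly interpolate between two RGB colors for t in [0, 1]."""
    return tuple(round(a + (b - a) * t) for a, b in zip(start, end))

def quantile_colors(values):
    """Map each surface value to an RGB color based on its empirical quantile.

    The median maps to white, lower values shade towards purple, and higher
    values shade towards orange.
    """
    ordered = sorted(values)
    n = len(ordered)
    colors = []
    for v in values:
        if n == 1:
            q = 0.5
        else:
            # Mid-rank quantile in [0, 1]; tied values share one quantile.
            q = (bisect_left(ordered, v) + bisect_right(ordered, v) - 1) / (2 * (n - 1))
        if q < 0.5:
            colors.append(blend(PURPLE, WHITE, q / 0.5))
        else:
            colors.append(blend(WHITE, ORANGE, (q - 0.5) / 0.5))
    return colors
```

To place several scans on the same color scheme, one option under this sketch would be to pool their surface values before computing the quantiles.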

NEXT SLIDE

X3P Plot Pre-processing Example

Useful for diagnosing when scans need additional pre-processing

As a concrete example of this, consider a matching pair of cartridge cases I came across while working with the comparison algorithms. I noticed these two scans because they had uncharacteristically low similarity scores for a matching pair.

The first row shows the two scans after some pre-processing. You’ll notice that the scan on the left contains a lot of extreme values that are uncharacteristic of breech face impressions. For example, we see that the firing pin impression wasn’t entirely removed during the first pass of pre-processing. There are also large dent-like markings on both scans, such as the purple markings [here] and [here].

These observations are so extreme that they suppress other markings on the two scans that actually look quite similar. For example, we see faint striped markings on the top of the two scans. The maximized cross-correlation value for these scans is 0.14, which is low for a matching pair.

Now compare this to the bottom row, which shows the same two scans after removing the extreme values. It’s a lot easier to see similarities between the two scans. For example, those striped impressions are a deeper shade of purple or orange. The maximized cross-correlation for these scans is 0.29. This CCF value is still low but is much higher than before.

This goes to show how the X3P plot is useful for identifying when scans need additional pre-processing.

Comparison Plot

Separate aligned scans into similarities and differences
Useful for understanding a registration

Similarities: Element-wise average between two scans, keeping only elements that are less than 1 micron apart

Differences: Elements of both scans that are at least 1 micron apart

Now let’s move on to another visual diagnostic tool called the Comparison Plot.

The goal of the comparison plot is to separate two scans into similarities and differences, which is useful for understanding why the algorithm chose a particular registration.

NEXT

To compute the similarities, you can imagine overlaying the two surfaces on top of one another. We then compute the element-wise average and distance between the two scans.

The element-wise average is sort of a hybrid of the two surfaces while the element-wise distance emphasizes where the two surfaces differ the most.

We then consider elements where the distance between the two scans is greater than 1 micron. The black and white image you see on the slide shows white elements where the distance between the two surfaces is larger than 1 micron and black elements where it is less than 1 micron.

We apply a filter to the element-wise average based on whether the distance between the surfaces is larger than 1 micron. So anywhere you see a white pixel in the black and white image, we remove that from the element-wise average. You can see the results of this filtering on the bottom. This results in a visualization of the obvious “similarities” between the two original scans, that is, where the two surfaces are close. I’ll show you a clearer visualization of this plot on the next slide.

NEXT

Conversely, we define “differences” to be elements of the two scans where the distance is greater than 1 micron. We apply a filter to the two original scans, but this time only keep those elements for which the surfaces are far apart.

We combine the similarities and differences into a single visualization to construct the comparison plot.
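The filtering just described can be written as a short function. This is a sketch under a few assumptions: the two scans are already aligned and have the same dimensions, surface values are in microns, and missing values are stored as `None`. The name `comparison_layers` is made up for this example.

```python
def comparison_layers(scan_a, scan_b, threshold=1.0):
    """Split two aligned scans into a similarities layer and two differences layers.

    Elements less than `threshold` microns apart go into the similarities
    layer as their element-wise average. Elements at least `threshold`
    apart are kept, unchanged, in each scan's differences layer.
    """
    similarities, diff_a, diff_b = [], [], []
    for row_a, row_b in zip(scan_a, scan_b):
        sim_row, da_row, db_row = [], [], []
        for a, b in zip(row_a, row_b):
            if a is None or b is None:
                # No comparison is possible where either surface is missing.
                sim_row.append(None)
                da_row.append(None)
                db_row.append(None)
            elif abs(a - b) < threshold:
                sim_row.append((a + b) / 2)
                da_row.append(None)
                db_row.append(None)
            else:
                sim_row.append(None)
                da_row.append(a)
                db_row.append(b)
        similarities.append(sim_row)
        diff_a.append(da_row)
        diff_b.append(db_row)
    return similarities, diff_a, diff_b
```

Plotting the three returned layers side by side with the original scans gives the layout of the comparison plot.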

NEXT SLIDE

Full Scan Comparison Plot

An example of a comparison plot of two full scans is shown here.

We show the original scans in the first column after scan K013sA2 has been aligned to K013sA1

The second and third columns show the filtered element-wise average and differences I discussed on the last slide.

To recap, the element-wise average shows the similarities between two scans. What we’ve found to be most useful when using the comparison plot is to consider visually distinct regions in the second and third columns. In the element-wise average, the eye is naturally drawn towards the darkest shades of purple and orange. For example, I notice some dark purple and orange regions at the bottom of the two surfaces. After identifying these noteworthy regions, we can go back to the original scans in the first column to further explore the similarities around this area. The purple and orange observations appear to be part of striped impressions. So the element-wise average gives us an idea of how close these impressions are to one another.

In a similar manner, we can look at the third column to identify differences between the two scans. Again, the eye is naturally drawn to the darkest shades. I see a dark purple region on the bottom of the top-right plot that isn’t shared with the bottom-right plot. If we go back to the first column, we indeed see that these are clear differences in the original scans. It appears that the deep purple might be part of the firing pin impression that wasn’t fully removed from the scan during pre-processing.

We have found a comparison plot for two full scans gives us a clear, high-level idea of the similarities and differences between the two scans.

Cell Comparison Plot


However, we have found the comparison plot to be most useful when we use it to compare individual cells. This is because we are able to zoom into regions of the scan that may not have as much detail in the full scan plot.

For this example, we’ll be focusing on the cell in the first row, sixth column of the left scan. We see where it aligns in the right scan.

NEXT

The comparison plot allows us to zoom in and get a better idea of the similarities and differences in this region. Again, we have the original cells in the first column. In the middle column, our eye is naturally drawn to the dark purple region near the top of the element-wise average. We see in the original scans that this corresponds to similar elliptical dark purple regions. Further, below that we also see a streak of connected orange observations that look similar between the two scans. So again, the element-wise average gives us a quick reference to assess the similarities of the two cells.

Considering the differences in the third column, we can make a few observations. For one, we notice that the individual regions here are all relatively small - some only a pixel or two in size.

Further, even the larger regions share some similarities. For example, take the larger region in the bottom-left of the two plots. In the top-right plot, we see a descending trend in the height values from the top-left to the bottom-right – the surface values move from dark to light orange.

We actually see the same descending trend in this region in the bottom-right plot, except at a different starting location. Now, the surface values start at a light orange and move to a light purple. This discrepancy might happen because of a difference in the amount of pressure applied to the cartridge case surface by the breech face. These observations might actually be the “same” marking, but one is deeper than the other. So even the “different” regions between these two cells actually share similarities.

In summary, we’ve found that “zooming into” regions of a cartridge case scan using the comparison plot is really useful for understanding why the algorithm chose a particular registration.

Translating Visuals to Statistics

Translate qualitative observations made about the visual diagnostics into complementary numerical statistics

Useful to quantify what our intuition says should be true for (non-)matching scans

For a matching cartridge case pair…
1. There should be (many) more similarities than differences
2. The different regions should be relatively small
3. The surface values of the different regions should follow similar trends
Statistics are useful for justifying/predicting the behavior of the algorithm

The last set of diagnostic tools I’m going to talk about are actually not visuals. Instead, they are statistics that we compute based on the visual diagnostics

Throughout this presentation, I’ve made a lot of qualitative observations about the cartridge case surfaces such as “there are dark purple regions shared between the two scans” or “the differences between the scans are relatively small,” etc.

The purpose of these statistics is to translate the qualitative observations we can make about the visual diagnostics into numerical statistics that complement those observations

Using these statistics, we can quantify what our intuition says should be true about matching and non-matching cartridge case pairs

NEXT

For example, if we have a truly matching cartridge case pair, then we can assume…

The number of similarities should outweigh the number of differences (by a lot)

The regions we call “different” should be relatively small and

similar to the observation I made on the last slide, the surface values of the differences should still follow similar trends

As I said before, these statistics provide a numerical complement to the visual diagnostics we discussed before and are useful to predict the behavior of the algorithm.

So we will talk about three statistics that we calculate based on these qualitative observations

Similarities vs. Differences Ratio

There should be more similarities than differences

Ratio between number of similar vs. different observations

Compare to a non-match cell comparison:

The first observation we made is that there should be more similarities than differences for two matching cartridge cases.

We measure this by computing the ratio between the number of similarities vs. the number of differences. The example we looked at two slides ago is shown again on the slide. The top row shows the similarities for two aligned cells from a matching comparison. Rather than considering the surface values in this cell, the white and gray image represents that we only consider which elements contain surface values. There are 1,449 observations in the top plot.

We do the same thing with the differences, which results in 260 observations on the bottom. The ratio between these two numbers is 5.6, which means there are 5.6 times as many similarities as there are differences. This number sounds large, but it’s useful to consider a non-match example for comparison.

NEXT

The example shown here shows the results from a non-match cell comparison. Now, it just so happens that the number of similarities in this example is also 1,449, but the number of differences is considerably larger: 797. In this case, there are 1.8 times as many similarities as there are differences.

As we would expect, there are many more similarities than differences for the matching comparison above compared to the non-matching comparison. Again, this ratio provides a quantitative measure of the similarity between two aligned scans.
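The ratio itself is simple arithmetic, which the following sketch reproduces using the counts quoted for the two examples. The helper names are hypothetical, and layers are assumed to be lists of rows in which `None` marks an element with no surface value.

```python
def count_observed(layer):
    """Count the elements of a 2D layer that contain a surface value."""
    return sum(1 for row in layer for value in row if value is not None)

def similarity_difference_ratio(n_similar, n_different):
    """Ratio of the number of similar to different observations."""
    if n_different == 0:
        return float("inf")  # no differences at all
    return n_similar / n_different

# Counts quoted for the matching and non-matching cell comparisons.
match_ratio = similarity_difference_ratio(1449, 260)
non_match_ratio = similarity_difference_ratio(1449, 797)
print(round(match_ratio, 1), round(non_match_ratio, 1))
```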

Different Region Size

The different regions should be relatively small

Size of the different regions

Compare to a non-match cell comparison:

The next observation we made was that the regions we call “different” should be relatively small.

To measure this, we consider the size of the individual different regions.

On the slide, you see the differences from the matching cell comparison we’ve been working with. Again, we consider which elements contain surface values, as represented by the white and gray image.

This time, however, we use what’s called a labeling algorithm to identify connected regions. The third visual on the slide demonstrates that we have identified individual regions in the cell, which we distinguish by color.

Once we label the individual regions, it’s straightforward to calculate the size of each region. The graph here shows a histogram of the region sizes for this example where on the horizontal axis we have the region size in square microns and on the vertical axis, we have the number of regions of a particular size. For example, you can see that there are 6 regions that all have very small sizes. The one region that is over 300 square microns large is this orange region you see in the third plot.

Finally, we can calculate some summary statistics based on this distribution. For example, the average size of these regions is 61.8 square microns with a standard deviation of 72.7 square microns.

NEXT

We can compare this to a non-match cell comparison and go through the same process of labeling the connected regions. In this non-match example, we see that the average size of the regions is now 134.1 square microns with a standard deviation of 171.3 square microns.

So these statistics based on the different region sizes provide us with another measure of similarity between the two aligned scans.
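A labeling step of this kind can be sketched with a breadth-first flood fill. The slides do not say which connectivity rule is used, so 4-connectivity (up, down, left, right) is an assumption here, as are the function name `region_sizes` and the `pixel_area` parameter for converting pixel counts to square microns.

```python
from collections import deque
from statistics import mean, stdev

def region_sizes(mask, pixel_area=1.0):
    """Return the size of each 4-connected region of True pixels in `mask`.

    `pixel_area` is the area covered by one pixel (hypothetical parameter);
    with the default of 1.0 the sizes are pixel counts.
    """
    n_rows, n_cols = len(mask), len(mask[0])
    seen = [[False] * n_cols for _ in range(n_rows)]
    sizes = []
    for r in range(n_rows):
        for c in range(n_cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Start a new region and flood-fill everything connected to it.
            queue = deque([(r, c)])
            seen[r][c] = True
            size = 0
            while queue:
                y, x = queue.popleft()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < n_rows and 0 <= nx < n_cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            sizes.append(size * pixel_area)
    return sizes

def summarize(sizes):
    """Mean and sample standard deviation of the region sizes."""
    return mean(sizes), (stdev(sizes) if len(sizes) > 1 else 0.0)
```

Here the mask would be `True` wherever the differences layer contains a surface value, and `summarize` gives the kind of average and standard deviation reported for the two examples.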

Different Region Correlation

The surface values of the different regions should follow similar trends

Correlation between the different regions of the two scans

Compare to a non-match cell comparison:

The last statistic I’ll discuss is based on the observation that the surface values of the different regions should follow similar trends.

We measure this by calculating the correlation between the different regions of the two scans.

The example you see on the slide is from a matching cell comparison. On the left are the differences from cell 1, 6 of scan K013sA1 and on the right are the differences from the aligned cell in scan K013sA2. Just as I noted a few slides back, we can see some similarities in the surface trends between these two visuals. Computing the correlation between these two results in a value of 0.48, which is actually relatively high when we are working with these cartridge case scans.

NEXT

Compare this now to the non-matching example we’ve been working with. We again have the differences from a cell comparison between two non-matching scans. If you study these two plots for a while, you’ll see that the surface trends in the two don’t have a lot in common, which is of course what we would expect given that this is a non-match comparison. The low correlation of 0.09 between these two quantifies this notion.

In summary, these three statistics provide a numerical complement to the qualitative observations we can make from the visual diagnostics. Together, the statistics and the visual diagnostics give us a more holistic idea of the similarity between two cartridge cases.
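To make the different region correlation concrete, here is a minimal sketch that computes a Pearson correlation over only those positions where both surfaces have a value. Representing missing values as `None` and flattening each surface into a list are assumptions for illustration.

```python
from math import sqrt

def pairwise_complete_correlation(a, b):
    """Pearson correlation using only positions where both surfaces are observed."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if len(pairs) < 2:
        return float("nan")
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant surface
    return sxy / sqrt(sxx * syy)

# Toy flattened "different" regions from two aligned cells (None = no surface value).
cell_a = [1.0, 2.0, None, 4.0, 3.0]
cell_b = [2.0, 4.0, 5.0, 8.0, None]

# Only three positions are observed in both cells, and there the trends agree exactly.
r = pairwise_complete_correlation(cell_a, cell_b)
print(r)
```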

Automatic Cartridge Evidence Scoring (ACES) Algorithm

Automatic Cartridge Evidence Scoring

Comparison algorithm that pre-processes, compares, and scores two cartridge case scans

Computes 19 numerical features for each cartridge case pair

Computes similarity score between 0 and 1 for a cartridge case pair using trained statistical model

Visual Diagnostic Features

Use visual diagnostic statistics discussed earlier as numerical features

Features:
- From the full scan comparison:
  - Similarities vs. differences ratio
  - Average and standard deviation of different region sizes
  - Different region correlation
- From cell-based comparison:
  - Average and standard deviation of similarities vs. differences ratios
  - Average and standard deviation of different region sizes
  - Average different region correlation

The first group of features we calculate are based on the visual diagnostic statistics that I discussed a few slides ago.

NEXT

We calculate 7 features at the full scan and cell levels. For example, the similarities vs. differences ratio, the average and standard deviation of the different region sizes, and the correlation between the different regions.

At the bottom of the slide, you can see the distributions of these features based on the comparisons in our training data. The orange distributions represent the feature values for the matching comparisons while the gray distributions represent non-matching comparisons.

You’ll note that the distributions of these features have only some overlap between the matches and non-matches. That is, the values of these features behave differently depending on whether we consider a match or non-match comparison. This means that we can use these features in a classification algorithm to distinguish between matches and non-matches.

So that’s the first group of features.

NEXT SLIDE

Registration-based Features

For a matching cartridge case pair…
- Correlation should be large at the full scan and cell levels
- Cells should “agree” on a particular registration
Compute summary statistics of full-scan and cell-based registration results

Features:
- Correlation from full scan comparison
- Mean and standard deviation of correlations from cell comparisons
- Standard deviation of cell-based registration values (horizontal/vertical translations & rotation)

The next group of features are registration-based features.

We compute these features based on the assumptions that for a matching comparison, the correlation values should be large at the full scan and cell levels AND that the cells should “agree” on a particular registration.

NEXT

The way we measure this is by computing summary statistics of the full scan and cell based registration results.

For example, we compute the correlation between the two full scans and compute the average and standard deviation of the correlations from the cell-based registrations.

We also compute the standard deviation of the cell-based registration values themselves.

Again, I show the distributions of these features at the bottom of the slide, where you can see that these features can distinguish between matching vs. non-matching comparisons.
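A small sketch of how these summary statistics could be assembled is shown below. The dictionary layout, the field names, and the toy values are hypothetical; only the choice of statistics follows the feature list above.

```python
from statistics import mean, stdev

def registration_features(full_scan_cor, cells):
    """Summarize full-scan and cell-based registration results into six features."""
    return {
        "full_scan_cor": full_scan_cor,
        "cell_cor_mean": mean(c["cor"] for c in cells),
        "cell_cor_sd": stdev(c["cor"] for c in cells),
        # Small spreads mean the cells "agree" on a single registration.
        "x_sd": stdev(c["x"] for c in cells),
        "y_sd": stdev(c["y"] for c in cells),
        "theta_sd": stdev(c["theta"] for c in cells),
    }

# Hypothetical per-cell results: correlation, translations in microns, rotation in degrees.
toy_cells = [
    {"cor": 0.2, "x": -100.0, "y": -101.0, "theta": 3.0},
    {"cor": 0.4, "x": -98.0, "y": -99.0, "theta": 3.0},
    {"cor": 0.6, "x": -102.0, "y": -100.0, "theta": 3.0},
]

features = registration_features(full_scan_cor=0.35, cells=toy_cells)
print(features)
```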

Density-based Features

For a matching cartridge case pair…
- Cells should “agree” on a particular registration
- The estimated registrations between the two comparison directions should be opposites

Apply Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm to the cell-based registration results (Ester et al. 1996; Zhang et al. 2021)

Features:
- DBSCAN cluster indicator
- Average DBSCAN cluster size
- Absolute sum of density-estimated rotations
- Root sum of squares of the cluster-estimated translations

The final group of features we calculate are density-based features.

These features are based on the assumptions that for a matching comparison, the cells should “agree” on a particular registration AND that the estimated registrations between the two comparison directions, meaning registering scan A to scan B and also scan B to scan A, should be opposites of each other.

NEXT

To measure this, we apply an algorithm called the Density-Based Spatial Clustering of Applications with Noise or “DBSCAN” algorithm to the cell-based registration results.

On the slide, you can see two scatterplots that represent the estimated translations for each cell in a matching comparison. So each point in the scatterplot corresponds to one of the cells below.

On the left, you can see that there is a cluster of points, colored blue, that corresponds to the cells that are also colored blue below. These are cells that agree on a particular registration; in this case, it looks like about -100 microns horizontally and -100 microns vertically. You can also see below that these cells align in a nice grid pattern (cell 1, 5 is next to cell 1, 6, which is above cell 2, 6, etc.), which we would expect because this is a matching pair of scans.

The red points, on the other hand, correspond to the red cells that don’t seem to agree on a single registration – they are sort of randomly distributed on the scan.

So the left side shows the estimated registrations for the cells in scan K013sA2. There’s nothing stopping us from repeating the cell-based registration, but in the other direction – that is, trying to find where cells align in the other scan, K013sA1. The right-hand side shows these results.

Again, we see a cluster of blue points corresponding to blue cells that seem to agree on a particular registration. The key finding from this visual is that the estimated registration in scan K013sA1 is almost the exact opposite of the registration in scan K013sA2. We can see this by noting that the axes on the right-hand side have been reversed, meaning the axes go from positive to negative from left to right. We do this to demonstrate that the clusters are located at essentially the same spot, just opposite of each other.

So the DBSCAN algorithm allows us to identify these clusters of points. From this, we compute 4 features including an indicator variable for the existence of a cluster, the average cluster size, and the difference in the estimated registrations between the two comparison directions.

The distributions of these features are shown below. The takeaway from this plot is that these features also distinguish between matching and non-matching comparisons.
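The clustering idea can be sketched with a minimal DBSCAN written from scratch, followed by one possible reading of the four density-based features. The `eps` and `min_pts` values, the toy cell results, the use of cluster means as the estimated registration, and the rule that both directions must produce a cluster are all assumptions for illustration rather than the ACES settings.

```python
from math import hypot, sqrt

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: one label per point (-1 = noise, 0, 1, ... = cluster id)."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if hypot(points[i][0] - q[0], points[i][1] - q[1]) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1            # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # noise reachable from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point, so keep expanding from it
                queue.extend(nbrs)
    return labels

def cluster_summary(cells, eps=10.0, min_pts=3):
    """Cluster cells on their (x, y) translations and summarize the largest cluster."""
    labels = dbscan([(x, y) for x, y, _ in cells], eps, min_pts)
    ids = [lab for lab in labels if lab >= 0]
    if not ids:
        return None
    biggest = max(set(ids), key=ids.count)
    members = [c for c, lab in zip(cells, labels) if lab == biggest]
    n = len(members)
    return {"size": n,
            "x": sum(c[0] for c in members) / n,
            "y": sum(c[1] for c in members) / n,
            "theta": sum(c[2] for c in members) / n}

def density_features(cells_ab, cells_ba):
    ab, ba = cluster_summary(cells_ab), cluster_summary(cells_ba)
    if ab is None or ba is None:
        return {"cluster_ind": 0}
    return {
        "cluster_ind": 1,
        "avg_cluster_size": (ab["size"] + ba["size"]) / 2,
        # Opposite registrations should cancel, so both values are near 0 for a match.
        "theta_diff": abs(ab["theta"] + ba["theta"]),
        "translation_diff": sqrt((ab["x"] + ba["x"]) ** 2 + (ab["y"] + ba["y"]) ** 2),
    }

# Toy (x, y, theta) results per cell for the two comparison directions.
cells_ab = [(-100, -100, 3), (-98, -102, 3), (-103, -99, 3), (-101, -97, 3),
            (40, 75, -12), (-20, 160, 27)]
cells_ba = [(101, 99, -3), (99, 102, -3), (100, 98, -3), (102, 101, -3), (-60, 10, 15)]

feats = density_features(cells_ab, cells_ba)
print(feats)
```

In this toy example the two clusters sit at nearly opposite translations, so the translation and rotation differences are close to zero, which is the behavior we would hope to see for a matching pair.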

ACES Statistical Model

Compute 19 features for each pairwise comparison
Use 510 cartridge cases from Baldwin et al. (2014) to fit a logistic regression classifier

Train logistic regression using 21,945 pairwise comparisons from 210 scans
- Classify pairs as a “match” or “non-match” based on similarity score
- Explore two optimization criteria:
  - Model that maximizes the overall accuracy
  - Model that balances true positive and true negative rates

Test model on 44,850 pairwise comparisons from 300 scans
- Compute true positive and true negative rates for each model
- Consider distributions of similarity scores for truly matching and non-matching pairs

In summary, we compute these 19 numerical features for each pair of cartridge cases.

Our goal is to compute a similarity score based on these features.

To do so, we used 510 cartridge cases from the Ames I study to train and test a logistic regression classifier model.

NEXT

The way we did this was by using 21,945 pairwise comparisons to train the logistic regression model. So given a cartridge case pair, this model classifies that pair as a match or a non-match based on an estimated similarity score. The process of training this model involves selecting model parameters that optimize some criteria. For example, we consider two different optimization criteria: one where we select the model that minimizes the overall classification error rate and another that balances the true positive and true negative rates. I’ll talk on the next slide about why we consider these criteria separately.
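To illustrate how the two criteria can pull in different directions, the sketch below scores pairs with a logistic function and then picks a classification cutoff under each criterion. Treating the cutoff as the quantity being selected is an assumption made to keep the example small, and the scores, labels, and candidate cutoffs are toy values.

```python
from math import exp

def similarity_score(features, weights, intercept):
    """Logistic regression score in (0, 1): sigmoid of a linear combination of features."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + exp(-z))

def rates(scores, is_match, cutoff):
    """True positive rate, true negative rate, and overall accuracy at a cutoff."""
    tp = sum(s >= cutoff and m for s, m in zip(scores, is_match))
    tn = sum(s < cutoff and not m for s, m in zip(scores, is_match))
    n_pos = sum(is_match)
    n_neg = len(is_match) - n_pos
    return tp / n_pos, tn / n_neg, (tp + tn) / len(is_match)

# Toy training scores with many more non-matches than matches.
scores = [0.95, 0.80, 0.30, 0.60, 0.40, 0.20, 0.10, 0.05, 0.15, 0.25]
is_match = [True, True, True, False, False, False, False, False, False, False]
cutoffs = [0.25, 0.5, 0.7]

# Criterion 1: maximize overall accuracy (equivalently, minimize the error rate).
best_accuracy_cutoff = max(cutoffs, key=lambda c: rates(scores, is_match, c)[2])

# Criterion 2: make the true positive and true negative rates as close as possible.
def rate_gap(c):
    tpr, tnr, _ = rates(scores, is_match, c)
    return abs(tpr - tnr)

balanced_cutoff = min(cutoffs, key=rate_gap)

print(best_accuracy_cutoff, balanced_cutoff)
```

With imbalanced classes, the accuracy criterion favors the stricter cutoff because correctly labeling the many non-matches dominates the count of correct decisions.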

NEXT

Once we have the trained model, we use 44,850 pairwise comparisons to test the model on whether it can differentiate between new matching and non-matching comparisons. We are interested not only in the true positive and true negative rates, but we also want to consider the similarity scores for the matching and non-matching comparisons since this is actually what an examiner would use during their examination.
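As a quick check on where these counts come from, comparing every scan with every other scan gives n choose 2 pairs. The variable names below are just for illustration.

```python
from math import comb

n_train_pairs = comb(210, 2)  # all pairs among the 210 training scans
n_test_pairs = comb(300, 2)   # all pairs among the 300 test scans

print(n_train_pairs, n_test_pairs)
```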

Test Classification Results

Source	True Pos. (%)	True Neg. (%)	Overall Inconcl. (%)	Overall Acc. (%)
ACES, Min. Error	92.3	99.9	0.0	99.4
ACES, Balanced TP/TN	95.7	98.1	0.0	97.9
Ames I	99.6	65.2	22.9

Ames I (Baldwin et al. 2014) compared quartets (3 to 1) and considered inconclusives

Class imbalance in test data: 3,081 match vs. 41,769 non-match comparisons

The “Balanced TP/TN” model was selected based on the training data. The test data classifications aren’t guaranteed to also be balanced.

The table on this slide summarizes the true positive, true negative, and accuracy rates for the two trained models based on the test data. In the first row, we can see that the model we selected to maximize the overall accuracy has an accuracy of 99.4%, although there is a noticeable difference between the true positive and true negative rates.

In the second row, the model we selected to balance the true positive and true negative rates has a smaller overall accuracy, yet with closer true positive and negative rates.

The reason that I compare these two models to each other is to highlight the fact that these two criteria lead to considerably different results. Eventually, we in the forensics and legal communities will need to decide which criteria we want to use when selecting statistical models. For example, we could consider whether a false positive error has worse ethical consequences than a false negative - or vice versa.

For comparison, I’ve repeated the results reported in the Baldwin study in the last row. We can see that the participants of the study had a larger true positive rate, although our models tend to have a much larger true negative rate. So they act as a nice complement to one another. We don’t include overall accuracy here because there is currently some debate about how to treat inconclusives when calculating accuracy.

NEXT

Now, a few notes I wanted to make here: the results here are somewhat difficult to compare directly to the Ames I results since we compare every pair of cartridge cases whereas Baldwin compared groups of 3 known-match cartridge cases to 1 unknown source cartridge case.

They also considered inconclusives in the study, which may explain the differences in the true negative rates between our models and the study.

NEXT

I also wanted to point out that there is a large class imbalance between the matching and non-matching comparisons in the test data. There are about 3,000 matching comparisons and 42,000 non-matching comparisons. This is not at all unique to cartridge cases compared to other forensic disciplines, but I wanted to point this out to emphasize that we tend to get better results if we err on the side of making non-matching classifications. You can see this illustrated in the first model that has a very high true negative rate but a smaller true positive rate. This model seems to have a preference for making non-match classifications.

NEXT

Finally, some of you may be looking at these results and thinking “why doesn’t the second model have a more balanced true positive and true negative rate?” Keep in mind though that we selected this model because it balances these rates on the training data, not the test data. There isn’t a guarantee that the test classifications will also be balanced.
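The effect of the class imbalance on overall accuracy can be checked with a little arithmetic. The sketch below combines the rounded true positive and true negative rates from the table with the class counts from the slide; the function name is hypothetical, and the last line is an illustrative baseline that is not a result from the talk.

```python
N_MATCH, N_NON_MATCH = 3081, 41769  # test-set class counts from the slide

def overall_accuracy(tp_rate, tn_rate):
    """Overall accuracy as the class-weighted combination of the two rates."""
    correct = tp_rate * N_MATCH + tn_rate * N_NON_MATCH
    return correct / (N_MATCH + N_NON_MATCH)

min_error = overall_accuracy(0.923, 0.999)   # "ACES, Min. Error" row
balanced = overall_accuracy(0.957, 0.981)    # "ACES, Balanced TP/TN" row
all_non_match = overall_accuracy(0.0, 1.0)   # baseline: call every pair a non-match

print(round(100 * min_error, 1), round(100 * balanced, 1), round(100 * all_non_match, 1))
```

Because non-matches make up most of the test pairs, the true negative rate carries most of the weight, and even the trivial baseline scores above 90% accuracy. That is one reason to look at the two rates separately rather than at accuracy alone.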

Similarity Score Distributions

Now, it’s all well and good to look at the classification accuracy for the sake of comparing different models. However, in practice the examiner would instead consider the similarity score computed by the model rather than a binary classification.

The plot on the slide shows the predicted similarity scores for the test data. Each of these points represents a single pairwise comparison, and we distinguish those comparisons by whether they are truly non-matching or matching. We can see on top that the non-match comparisons generally have small similarity scores, which is certainly something we hope for.

While many of the matching comparisons have large similarity scores, there is a group of comparisons that have very small similarity scores.

NEXT

In particular, we discovered that matching comparisons between cartridge cases from the firearm labeled “T” tend to have lower similarity scores, which you can see on the slide. This might contribute to our models’ low true positive rate we noted on the last slide.

We consider classification accuracy as a means of selecting/comparing models.
In practice, the examiner would use the similarity score as part of their examination.

Matching comparisons from Firearm T cartridge cases tend to have lower similarity scores:

Conclusions

Conclusions & Future Work

In conclusion, although automatic comparison algorithms are useful for measuring the similarity between two pieces of evidence, they can often be difficult to interpret or explain.

Visual diagnostics help us understand the behavior of these comparison algorithms.

NEXT

Today, we introduced a set of diagnostic tools that are useful for explaining the behavior of the ACES algorithm and for comparing cartridge case evidence.

We also demonstrated that the ACES algorithm shows promise at measuring the similarity between two cartridge cases.

NEXT

In the future, we hope to develop free and open-source software that implements our visual diagnostics and comparison algorithms.

Our hope in making these tools easily available to others is to apply them to a diverse range of firearm and ammunition types. So far, we have a model that is trained on only 10 firearms and tested on 15 firearms. We need to devise additional “stress tests” so that we understand the strengths and limitations of the visual diagnostics and ACES algorithm.

Automatic comparison algorithms are useful for obtaining numerical measures of similarity for two pieces of evidence
Visual diagnostics help explain the inner mechanisms of comparison algorithms

Our visual diagnostic tools aid in understanding each step of a cartridge case comparison algorithm
- Also useful by themselves to visually compare cartridge case evidence
The Automatic Cartridge Evidence Scoring (ACES) algorithm shows promise at measuring the similarity between cartridge cases

Develop free, open source software to implement visual diagnostics & ACES
- We train our model on 10 firearms, all with the same make/model and ammunition
- Need additional “stress tests” (different ammunition/firearms, degradation, etc.)

Thank You!

impressions R package for visual diagnostics
- https://jzemmels.github.io/impressions/
scored R package for ACES algorithm
- https://jzemmels.github.io/scored/
cartridgeInvestigatR interactive web application
- https://csafe.shinyapps.io/cartridgeInvestigatR/

References

AFTE Criteria for Identification Committee. 1992. “Theory of Identification, Range Striae Comparison Reports and Modified Glossary Definitions.” AFTE Journal 24 (3): 336–40.

Baldwin, David P, Stanley J Bajic, Max Morris, and Daniel Zamzow. 2014. “A Study of False-Positive and False-Negative Error Rates in Cartridge Case Comparisons.” Fort Belvoir, VA: Ames Lab IA, Performing; Defense Technical Information Center. https://doi.org/10.21236/ADA611807.

Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 226–31. KDD’96. Portland, Oregon: AAAI Press.

National Research Council. 2009. Strengthening Forensic Science in the United States: A Path Forward. Washington, D.C.: The National Academies Press.

President’s Council of Advisors on Science and Technology. 2016. “Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods.” Executive Office of the President’s Council of Advisors on Science and Technology, Washington DC.

Song, John. 2013. “Proposed ‘NIST Ballistics Identification System (NBIS)’ Based on 3d Topography Measurements on Correlation Cells.” American Firearm and Tool Mark Examiners Journal 45 (2): 11. https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=910868.

Tai, Xiao Hui, and William F. Eddy. 2018. “A Fully Automatic Method for Comparing Cartridge Case Images.” Journal of Forensic Sciences 63 (2): 440–48. http://doi.wiley.com/10.1111/1556-4029.13577.

Thompson, Robert. 2017. Firearm Identification in the Forensic Science Laboratory. National District Attorneys Association. https://doi.org/10.13140/RG.2.2.16250.59846.

Vorburger, T V, J H Yen, B Bachrach, T B Renegar, J J Filliben, L Ma, H G Rhee, et al. 2007. “Surface Topography Analysis for a Feasibility Assessment of a National Ballistics Imaging Database.” NIST IR 7362. Gaithersburg, MD: National Institute of Standards and Technology. https://doi.org/10.6028/NIST.IR.7362.

Zhang, Hao, Jialing Zhu, Rongjing Hong, Hua Wang, Fuzhong Sun, and Anup Malik. 2021. “Convergence-Improved Congruent Matching Cells (CMC) Method for Firing Pin Impression Comparison.” Journal of Forensic Sciences 66 (2): 571–82. https://doi.org/10.1111/1556-4029.14634.

Appendix: Firearm-wise Similarity Scores

Specific firearms in the test set tend to have lower similarity scores for matching comparisons