Missing Persons Analysis in the U.S.

Missing males in the U.S. from 1955 to 2015. (Data source: The Doe Network: International Center For Unidentified & Missing Persons)

There is a wealth of useful data on the websites we browse every day, but each site has its own structure and form, making it difficult to extract the data you need quickly. In this article, I will show you how I gathered missing persons data from a website (International Center For Unidentified & Missing Persons) using Scrapy, how I cleaned the data, and how I analyzed the spatial-temporal data with ArcGIS tools.

The webpage listing the missing males, from The Doe Network: International Center For Unidentified & Missing Persons (http://www.doenetwork.org/mp-chrono-us-males.php)

Scraping the website

Scrapy is an open-source framework for extracting the data you need from websites. I recommend reading the official tutorial if you are not familiar with it. Scrapy requires Python 3.6+, and I installed it using Anaconda on the command line:

conda install -c conda-forge scrapy
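
If you have not created a Scrapy project yet, one can be generated first with Scrapy's startproject command (the project name missing_persons below is just a placeholder):

scrapy startproject missing_persons
cd missing_persons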

The webpage I scraped lists the missing males in the U.S. in chronological order, from as early as 1881 to 2020. For each person, there is a link to their individual page with descriptions such as name, case classification, missing date, last seen location, DOB, age, etc. My goal was to extract the description data of interest from each person's individual page.

Individual page of a missing person.

1. Overwrite the original items.py file

In the directory of the created project, overwrite the items.py file with the items you want to extract from the site. The items I was interested in were name, case classification, missing date, last seen location, DOB, age, race, gender, height, and weight from each individual page. The items.py file was overwritten as below so that the scraped data is stored in the defined item class:

import scrapy

class WantedItem(scrapy.Item):
    name = scrapy.Field()
    case = scrapy.Field()
    since = scrapy.Field()
    location = scrapy.Field()
    DOB = scrapy.Field()
    age = scrapy.Field()
    race = scrapy.Field()
    gender = scrapy.Field()
    height = scrapy.Field()
    weight = scrapy.Field()
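
Once defined, a WantedItem behaves like a dictionary, so the spider can fill its fields by key; for example (the values here are hypothetical):

item = WantedItem()
item['name'] = 'John Doe'
item['age'] = '27 years old'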

2. Define the Spider class

In the project created, there is a folder named spiders, in which you save the scraping code that Scrapy uses to scrape information from a webpage. There are two major components in the scraping code. First, define the URL of the front page and scrape the URLs of each person's individual page. XPath was used to select elements from the HTML documents. The code is shown as follows:

# define URL of the front page
start_urls = ["http://www.doenetwork.org/mp-chrono-us-males.php"]

# extract individual URLs
def parse(self, response):
    for href in response.xpath("//div[contains(@id,'container')]/ul[contains(@class, 'rig')]/li/a//@href"):
        url = href.extract()
        yield scrapy.Request(url, callback=self.parse_dir_contents)

The second component extracts the text from each individual webpage iteratively, using the code below:

infor_list = response.xpath("//p/text()").getall()

While this extracts the description text I wanted, the newline strings (‘\n’ or ‘\n\n’) from the HTML are picked up by the code above as well. A new list was created to get rid of entries containing ‘\n’ or ‘\n\n’ with the following code:

new_infor = []
for a in infor_list:
    if '\n' not in a:
        new_infor.append(a)
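
Putting the two components together, a minimal version of the whole spider might look like the sketch below. The spider name my_scraper matches the crawl command in the next step; how the extracted strings map to the WantedItem fields depends on the page layout, so that part is only illustrative:

import scrapy
from ..items import WantedItem

class MyScraperSpider(scrapy.Spider):
    name = "my_scraper"
    # front page listing the missing males in chronological order
    start_urls = ["http://www.doenetwork.org/mp-chrono-us-males.php"]

    def parse(self, response):
        # follow the link to each person's individual page
        for href in response.xpath("//div[contains(@id,'container')]/ul[contains(@class, 'rig')]/li/a//@href"):
            yield scrapy.Request(href.extract(), callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # grab all paragraph text and drop the newline-only entries
        infor_list = response.xpath("//p/text()").getall()
        new_infor = [a for a in infor_list if '\n' not in a]
        item = WantedItem()
        # which entry of new_infor holds which field depends on the page
        # layout; the line below is purely an illustration
        item['name'] = new_infor[0]
        yield item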

3. Run the Spider and output the data

Type the command below on the command line to run Scrapy; the extracted data is output to a CSV file.

scrapy crawl my_scraper -o results.csv

A screenshot of the data extracted is shown below:

Part of the data extracted from the CSV file.

Cleaning the data

Before analyzing the data scraped from the website, the dataset needed to be cleaned. While each item had its own cleaning procedure, records with missing or unknown data were removed from the dataset first.

Age

The age data extracted were in the form of “xx years old” or “xx months old” (e.g. 27 years old). To get the ages into a consistent format, I first split the number part from the text part (years old/months old), then divided the number by 12 if the text part was “months old”, and finally dropped the text part, keeping only the number in a single unit (years).
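
A minimal sketch of this conversion, assuming the two formats above are the only ones in the data:

def age_in_years(text):
    # split the number part from the text part, e.g. "27 years old"
    number, unit = text.split(" ", 1)
    value = float(number)
    # divide by 12 when the unit is months so everything is in years
    return value / 12 if unit.startswith("month") else value

print(age_in_years("27 years old"))   # 27.0
print(age_in_years("18 months old"))  # 1.5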

Weight

Some of the weight data extracted were given as a range (e.g. 140–150 pounds). Only the lower limit was kept for the later analysis.
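
A sketch of this step, assuming ranges are written with a hyphen or an en dash and the unit is always “pounds”:

def lower_weight(text):
    # strip the unit, then keep only the lower bound of the range
    number_part = text.replace("pounds", "").strip()
    lower = number_part.replace("–", "-").split("-")[0]
    return float(lower)

print(lower_weight("140–150 pounds"))  # 140.0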

Last seen location

Most of the last seen locations were in the form of “city, county, state” (e.g. Chardon, Geauga County, Ohio). As I was only interested in the state, I split the location by comma and kept only the state part.
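
Since the state is always the last comma-separated part, the split is straightforward; a sketch:

def last_seen_state(location):
    # keep only the last comma-separated part, i.e. the state
    return location.split(",")[-1].strip()

print(last_seen_state("Chardon, Geauga County, Ohio"))  # Ohio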

Height

The height was in foot-inch format (e.g. 7' 1"). I converted it to inches. For example, for 7' 1", I first split the string at the apostrophe, then converted 7' to inches by multiplying by 12, which gives 84 inches. The height in inches is the sum of 84 and 1, i.e. 85 inches.
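
A sketch of the conversion, assuming all heights follow the foot-apostrophe-inch format shown above:

def height_in_inches(text):
    # split at the apostrophe: "7' 1\"" -> feet "7", inches ' 1"'
    feet, inches = text.split("'")
    inches = inches.replace('"', '').strip()
    # convert feet to inches and add the remaining inches
    return int(feet) * 12 + (int(inches) if inches else 0)

print(height_in_inches('7\' 1"'))  # 85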

Analyzing the data

After data cleaning, there were 1893 missing males in the dataset, ranging from the year 1905 to the year 2020. First, I looked into the missing person distribution throughout the years. The scatter plot of missing persons by the year they were last seen, with age information, is shown below. You can see from the plot that most of the data fall between the 1970s and 2000.

Missing persons throughout years with each dot representing one person.
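
A scatter plot like this can be reproduced with matplotlib once the cleaned data is in a pandas DataFrame; the file name and the column names year_last_seen and age below are assumptions about the cleaned output:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("results_cleaned.csv")  # hypothetical cleaned output
plt.scatter(df["year_last_seen"], df["age"], s=10, alpha=0.5)
plt.xlabel("Year last seen")
plt.ylabel("Age when reported missing")
plt.show()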

Second, I compared the data amongst states to get some insight. The missing person counts by state are shown below, with California having the most missing cases, followed by Texas and Florida in second and third place.

U.S. missing male counts by state.

The map below also shows the missing person counts in different states.

U.S. missing male counts by state.

Then I looked into the missing persons’ ages when they were reported missing across different states. A boxplot was drawn to show the age distribution for each state, and the missing males’ average age in each state was also illustrated on a map.

Missing males’ age distribution when reported missing.

With the spatial-temporal information in the dataset, ArcGIS was used to visualize the missing person counts in each state over the years (1955–2015), as shown at the beginning of this article. There were few cases in the dataset before 1955 or after 2015, so only the missing cases between 1955 and 2015 are illustrated, counted in five-year intervals.
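
The five-year counts per state could be prepared in pandas before loading into ArcGIS; a sketch, where the file name and the columns state and year_last_seen are assumptions about the cleaned data:

import pandas as pd

df = pd.read_csv("results_cleaned.csv")  # hypothetical cleaned output
# keep the 1955-2015 window and bin the years into five-year intervals
df = df[(df["year_last_seen"] >= 1955) & (df["year_last_seen"] <= 2015)].copy()
bins = list(range(1955, 2021, 5))
df["period"] = pd.cut(df["year_last_seen"], bins=bins, right=False)
counts = df.groupby(["state", "period"], observed=True).size().reset_index(name="count")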
