There is a wealth of useful data on the websites we browse every day, but each site has its own structure and format, which can make it difficult to extract the data you need quickly. In this article, I will show how I gathered missing persons data from a website (the International Center For Unidentified & Missing Persons) using Scrapy, and how I cleaned the data and analyzed the spatial-temporal data with ArcGIS tools.
Scraping the website
Scrapy is an open-source framework for extracting the data you need from websites. I recommend reading the official tutorial if you are not familiar with it. Scrapy requires Python 3.6+; I installed it with Anaconda from the command line:
conda install -c conda-forge scrapy
The webpage I scraped lists the missing males in the U.S. in chronological order from as early as 1881 to 2020. For each person, there is a link to their individual page with descriptions such as name, case classification, missing date, last seen location, DOB, age, etc. My goal was to extract the description data of interest on each individual page of missing persons.
1. Overwriting the original items.py file
In the directory of the project created, overwrite the items.py file with the items you want to extract from the site. The items I was interested in include name, case classification, missing date, last seen location, DOB, age, race, gender, height, and weight from each individual page. The items.py file was overwritten as below so that the scraped data will be stored in the defined item class:
import scrapy

class MissingPersonItem(scrapy.Item):  # the class name here is illustrative
    name = scrapy.Field()
    case = scrapy.Field()
    since = scrapy.Field()
    location = scrapy.Field()
    DOB = scrapy.Field()
    age = scrapy.Field()
    race = scrapy.Field()
    gender = scrapy.Field()
    height = scrapy.Field()
    weight = scrapy.Field()
2. Defining the Spider class
In the project created, there is a folder named spiders, where you save the scraping code that Scrapy uses to extract information from a webpage. There are two major components in the scraping code. The first defines the URL of the front page and scrapes the URL of each person's individual page; XPath is used to select elements from the HTML documents. The code is as follows:
# define the URL of the front page
start_urls = ["http://www.doenetwork.org/mp-chrono-us-males.php"]

# extract individual URLs
def parse(self, response):
    # select each person's link inside the container div
    # (the exact tail of this XPath is approximate)
    for href in response.xpath("//div[contains(@id,'container')]/ul//a/@href"):
        url = href.extract()
        yield scrapy.Request(url, callback=self.parse_dir_contents)
The second component extracts the text from each individual webpage iteratively, using the code below:
infor_list = response.xpath("//p/text()").getall()
While this extracts the description text I wanted, the newline strings (‘\n’ or ‘\n\n’) from the HTML layout are also picked up by the code above. A new list was created to get rid of the entries containing ‘\n’ or ‘\n\n’:
new_infor = []
for a in infor_list:
    if '\n' not in a:
        new_infor.append(a)
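Equivalently, the filtering can be written as a one-line list comprehension wrapped in a helper (a sketch; the function name `clean_paragraphs` is my own):

```python
def clean_paragraphs(infor_list):
    """Drop the newline-only entries that come from the HTML layout."""
    return [a for a in infor_list if '\n' not in a]

raw = ['\n', 'Case Classification: Endangered', '\n\n', 'Age: 27 years old']
print(clean_paragraphs(raw))  # ['Case Classification: Endangered', 'Age: 27 years old']
```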
3. Running the Spider and outputting the data
Type the command below on the command line to run Scrapy; the extracted data is output to a CSV file.
scrapy crawl my_scraper -o results.csv
A screenshot of the data extracted is shown below:
Cleaning the data
Before analyzing the data scraped from the website, the dataset needed to be cleaned. While each item required its own cleaning procedure, records with missing or unknown values were removed first.
Age
The age data extracted were in the form of “xx years old” or “xx months old” (e.g. 27 years old). To get the ages into a consistent unit, I first split the number from the text part (years old/months old), then divided the number by 12 if the text part was “months old”, and finally dropped the text part, keeping only the number in years.
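The steps above can be sketched as a small helper (the function name `age_in_years` is my own; it assumes the strings always match the “xx years old” / “xx months old” pattern):

```python
def age_in_years(age_str):
    """Convert 'xx years old' or 'xx months old' to a number of years."""
    number, unit = age_str.split(maxsplit=1)
    value = float(number)
    if unit.startswith("months"):
        value /= 12  # months -> years
    return value

print(age_in_years("27 years old"))  # 27.0
print(age_in_years("6 months old"))  # 0.5
```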
Weight
Some of the weight data extracted were given as a range (e.g. 140–150 pounds). Only the lower limit was kept for the later analysis.
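A sketch of that step (the function name `weight_lower_bound` is mine; a regex grabs the first number in the string, so it works whether the range is written with a hyphen or an en dash):

```python
import re

def weight_lower_bound(weight_str):
    """Keep only the lower limit of a weight like '140-150 pounds'."""
    return int(re.search(r"\d+", weight_str).group())

print(weight_lower_bound("140-150 pounds"))  # 140
print(weight_lower_bound("155 pounds"))      # 155
```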
Last seen location
Most of the last seen locations were in the form of “city/county/state ” (e.g. Chardon, Geauga County, Ohio). As I was only interested in the state information, I split the location by comma and only kept the state information.
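That split can be sketched as follows (`last_seen_state` is my own name; it assumes the state is always the last comma-separated part):

```python
def last_seen_state(location):
    """Keep only the state from a 'city, county, state' location string."""
    return location.split(",")[-1].strip()

print(last_seen_state("Chardon, Geauga County, Ohio"))  # Ohio
```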
Height
The height data were in the foot-inch format (e.g. 7' 1"). I converted them to inches. For example, for 7' 1", I first split the string at the apostrophe, then converted 7' to inches by multiplying by 12, giving 84 inches; adding the remaining 1 inch gives a height of 85 inches.
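The conversion can be sketched as a helper function (`height_in_inches` is my own name; it assumes straight quote characters for feet and inches):

```python
def height_in_inches(height_str):
    """Convert a foot-inch height such as 7' 1" to total inches."""
    feet_part, inch_part = height_str.split("'")
    inch_digits = inch_part.strip().rstrip('"')
    inches = int(inch_digits) if inch_digits else 0  # tolerate heights like 6'
    return int(feet_part.strip()) * 12 + inches

print(height_in_inches('7\' 1"'))  # 85
```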
Analyzing the data
After data cleaning, there were 1893 missing males in the dataset, ranging from the year 1905 to the year 2020. First, I looked into the distribution of missing persons throughout the years. The scatter plot below shows the missing persons by the year they were last seen, for the records with age information. You can see from the plot that most of the cases fall between the 1970s and 2000.
Second, I compared the data across states to get some insight. The missing person counts by state are shown below, with California having the most missing cases and Texas and Florida in second and third place.
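The per-state tally itself needs nothing more than the standard library; a minimal sketch with made-up sample data:

```python
from collections import Counter

# made-up sample of cleaned 'last seen' states
states = ["California", "Texas", "California", "Florida", "California", "Texas"]
counts = Counter(states).most_common()
print(counts)  # [('California', 3), ('Texas', 2), ('Florida', 1)]
```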
The map below also shows the missing person counts in different states.
Then I looked into the missing persons’ ages when they were reported missing amongst different states. A boxplot was drawn to show the age distributions for each state and the missing males’ average age in different states was also illustrated in a map.
With the spatial-temporal information in the dataset, ArcGIS was used to visualize the missing person counts in each state throughout the years (1955–2015), as shown at the beginning of this article. Cases before 1955 and after 2015 were scarce in the dataset, so only the missing cases between 1955 and 2015 are illustrated, counted in five-year intervals.