Scraping Websites to Collect Consumption and Price Data
Back in 2009, Kenya was struggling through a major food crisis. One million people faced imminent starvation, and the problem was deeply exacerbated by extreme inflation, making it incredibly difficult for families to keep food on the table. The government struggled to contain an inflation rate that it pegged at 25 percent. But was that really accurate?
The national lending rate around that time was 20 percent. If inflation had truly been 25 percent, banks would have been lending at a negative real return and would eventually have gone bankrupt. Clearly, the banks had made their own guess at the inflation rate, and it wasn’t 25 percent.
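The arithmetic behind that inference can be made explicit with the Fisher relation, which converts a nominal interest rate into a real one. The sketch below plugs in the two rates quoted above; the function name `real_rate` is ours, chosen for illustration.

```python
def real_rate(nominal: float, inflation: float) -> float:
    """Exact Fisher relation: (1 + nominal) / (1 + inflation) - 1."""
    return (1 + nominal) / (1 + inflation) - 1


# Rates quoted in the text: 20 percent lending rate, 25 percent official inflation.
r = real_rate(0.20, 0.25)
print(f"Real lending rate: {r:.1%}")  # prints "Real lending rate: -4.0%"
```

A negative real rate means every loan loses purchasing power by the time it is repaid, which is why a 20 percent lending rate suggests the banks expected inflation to be below 25 percent.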
A couple of facts are important to this situation:
- The cost of food is a crucial component of all inflation rate calculations, often making up as much as half of the price index.
- Inflation is a prime driver of monetary policy and the debt markets.
- And monetary policy executed correctly can suppress inflation and stave off a food crisis. But done poorly, it can trigger catastrophic conditions of poverty and hunger.
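To see why the food component carries so much weight, consider how a price index combines category inflation rates as a weighted average. The sketch below is illustrative only: the "as much as half" food weight comes from the first point above, while the two category inflation rates are made-up numbers, not Kenyan data.

```python
def headline_inflation(components: dict[str, tuple[float, float]]) -> float:
    """Weighted average of category inflation rates.

    components maps a category name to (weight, inflation_rate);
    the weights must sum to 1.
    """
    total_weight = sum(weight for weight, _ in components.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weight * rate for weight, rate in components.values())


# Hypothetical basket: food at half the index, everything else lumped together.
basket = {
    "food": (0.5, 0.40),      # assumed 40 percent food inflation
    "non_food": (0.5, 0.10),  # assumed 10 percent non-food inflation
}
print(f"Headline inflation: {headline_inflation(basket):.1%}")
```

With food at half the index, an error in the food price data passes through to the headline rate at half its size, so stale or mis-measured food prices can move the official figure by several points.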
Add to these facts one truth that rings loud and clear, especially when a national inflation rate is little more than a guesstimate: wrong data is worse than no data.
The food price data traditionally used in too many inflation calculations takes a long time to harvest and exists only at the national level, ignoring differences among regions and sub-regions. That doesn’t cut it when a crisis is in progress.
So we asked: Was better food pricing and consumption data available, in any form and from any source, in real time? Could we figure out how to scrape that data efficiently and organize it in a way that would ensure that banks and governments worldwide had every grain of information possible to manage poverty through monetary policy? Would it be possible to use that information to thwart a food crisis in the making?
We took those questions to the DC Big Data Exploration in March 2013, where a crew of data scientists dubbed Team Ndizi (which means bananas in eastern Africa) took up the challenge.
Rather than start with a foundation of existing data, Team Ndizi, led by Max Richman, decided to scrape several websites to create new datasets. Using a variety of tools such as Python, ScraperWiki, and the Wayback Machine, they identified a number of sources that would help them cover the many layers of food pricing:
- humuch.com: global food prices, which Team Ndizi used to scrape banana prices by continent.
- mFarm: this mobile price provider, which gives farmers crop-price data to help them make better harvesting decisions, held 1,000 days of pricing information about dry maize, a staple food crop for 96 percent of Kenya’s people. Maize supply was especially low during 2009.
- Pick n’ Pay: this South African grocery chain publishes prices on its website, which Team Ndizi managed to free in order to analyze 11 food types essential to a balanced diet.
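As a rough idea of what scraping a price page involves, here is a minimal sketch using only Python's standard library. It is not Team Ndizi's actual code: the HTML snippet, its class names, and the prices in it are hypothetical stand-ins, and any real site's markup will differ.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a grocery price page; the items and
# prices are invented sample values, not scraped data.
SAMPLE_HTML = """
<table id="prices">
  <tr><td class="item">Maize meal 2kg</td><td class="price">R 18.99</td></tr>
  <tr><td class="item">Bananas 1kg</td><td class="price">R 9.49</td></tr>
</table>
"""


class PriceTableParser(HTMLParser):
    """Collects (item, price) pairs from <td class="item"> and <td class="price"> cells."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished (item, price) tuples
        self._field = None   # class of the <td> currently open, if any
        self._current = {}   # cells gathered for the row in progress

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current = {}
        elif tag == "td":
            self._field = dict(attrs).get("class")

    def handle_data(self, data):
        if self._field in ("item", "price") and data.strip():
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr" and all(k in self._current for k in ("item", "price")):
            # Drop the currency prefix and store the price as a float.
            price = float(self._current["price"].replace("R", "").strip())
            self.rows.append((self._current["item"], price))


parser = PriceTableParser()
parser.feed(SAMPLE_HTML)
print(parser.rows)
```

In practice the HTML would come from a fetched page or an archived snapshot rather than an inline string, and the parsing rules would need to be adapted to each source.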
In addition to this on-the-ground data, Team Ndizi looked to databases at the World Bank and the Food and Agriculture Organization of the U.N. to round out the picture. The team also used three cost-of-living sites to validate the information it was getting: Numbeo, Xpatulator, and Expatistan.
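The source does not describe how that validation was done. One plausible approach, sketched here purely as an assumption, is to compare each scraped price with a reference price and flag large relative deviations. The function name, the 20 percent tolerance, and all prices below are hypothetical.

```python
def flag_outliers(scraped: dict, reference: dict, tolerance: float = 0.20) -> dict:
    """Return items whose scraped price deviates from the reference price
    by more than `tolerance`, expressed as a relative difference."""
    flagged = {}
    for item, price in scraped.items():
        ref = reference.get(item)
        if ref is None or ref == 0:
            continue  # nothing to compare against
        deviation = (price - ref) / ref
        if abs(deviation) > tolerance:
            flagged[item] = deviation
    return flagged


# Invented prices in a common currency, for illustration only.
scraped = {"bananas_1kg": 1.10, "maize_2kg": 2.90, "milk_1l": 0.95}
reference = {"bananas_1kg": 1.00, "maize_2kg": 2.00, "milk_1l": 1.00}

for item, deviation in flag_outliers(scraped, reference).items():
    print(f"{item}: {deviation:+.0%} versus reference")
```

Flagged items are not necessarily wrong; they mark where the scraped and reference sources disagree enough to deserve a closer look.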
Finally, the team looked at how pricing can affect a food crisis in progress. The Indonesian government tried to diversify the country’s diet, which is heavily dependent on rice, and the program caused rice prices to spike relative to world pricing.
A World Bank senior economist has expressed an interest in analyzing Team Ndizi’s results and comparing them with the organization’s data.
The “defining chart” on the Indonesia rice crisis can be refined to make it more effective and useful for the World Bank’s work in collecting price and consumption data.