7 minute read

In this post, we’ll take a quick look at the Food Loss and Waste Database (FLW DB) from the Food and Agriculture Organization (FAO) of the United Nations. The post will be mostly about data visualization and the limitations of the dataset.

I used Python and Pandas/Seaborn/Matplotlib libraries to visualize the data. You can find the full Python scripts on my GitHub repository.

TL;DR

As described in the Background of the Food Loss and Waste Database page, the dataset has clear limitations. The data collection research for food waste and loss seems to be at a very early stage, so it’s hard to draw a meaningful and insightful conclusion just by looking at the dataset. However, since data visualization is fun, I plotted some satisfying graphs.

Database Overview

Downloading the Data

Please follow the below steps to download the dataset. (Note: The following steps are valid as of March 2022)

  • Go to FAO’s FLW data page. link.
  • Select “Year Range.” I used 1965 - 2021 for this post.
  • Download data by clicking the “Download Data” button on the bottom left of the page.

Here are the first few rows of the dataset.

# read and show the first few rows of the dataframe
df = pd.read_csv('./data/Data.csv')
df[:8]
m49_code country region cpc_code commodity year loss_percentage loss_percentage_original loss_quantity activity food_supply_stage treatment cause_of_loss sample_size method_data_collection reference url notes
0 104 Myanmar NaN 0142 Groundnuts, excluding shelled 2009 5.22 5.22% 68100 NaN Whole supply chain NaN NaN NaN FAO's annual Agriculture Production Questionna... FAO Sources NaN NaN
1 104 Myanmar NaN 0142 Groundnuts, excluding shelled 2008 5.43 5.43% 65240 NaN Whole supply chain NaN NaN NaN FAO's annual Agriculture Production Questionna... FAO Sources NaN NaN
2 104 Myanmar NaN 0142 Groundnuts, excluding shelled 2007 5.61 5.61% 61080 NaN Whole supply chain NaN NaN NaN FAO's annual Agriculture Production Questionna... FAO Sources NaN NaN
3 104 Myanmar NaN 0142 Groundnuts, excluding shelled 2006 5.40 5.4% 55270 NaN Whole supply chain NaN NaN NaN FAO's annual Agriculture Production Questionna... FAO Sources NaN NaN
4 104 Myanmar NaN 0142 Groundnuts, excluding shelled 2005 5.00 5% 51970 NaN Whole supply chain NaN NaN NaN FAO's annual Agriculture Production Questionna... FAO Sources NaN NaN
5 104 Myanmar NaN 0142 Groundnuts, excluding shelled 2004 5.00 5% 47310 NaN Whole supply chain NaN NaN NaN FAO's annual Agriculture Production Questionna... FAO Sources NaN NaN
6 104 Myanmar NaN 0142 Groundnuts, excluding shelled 2003 5.00 5% 43880 NaN Whole supply chain NaN NaN NaN FAO's annual Agriculture Production Questionna... FAO Sources NaN NaN
7 108 Burundi NaN 0111 Wheat 2020 3.50 3.5 NaN Shelling, Threshing Farm NaN NaN NaN Modelled Estimates NaN https://www.aphlis.net/en/page/20/data-tables#... NaN

Columns

Here are the columns that I’ll be primarily using for this post. I believe the column names are self-explanatory, but please look at the Background of the Food Loss and Waste Database page for the full details.

  • year
    • e.g. 2017, 2018, …
  • country
    • There are 149 unique countries.
    • e.g. ‘Austria’, ‘Jordan’, ‘Kenya’, …
  • commodity
    • There are 194 unique commodities.
    • e.g. ‘Apples’, ‘Pears’, ‘Apricots’, ‘Cherries’, ‘Peaches and nectarines’, ‘Plums and sloes’, ‘Kiwi fruit’, ‘Strawberries’, ‘Blueberries’, …
  • food_supply_stage
    • There are 20 unique food supply stages.
    • e.g. ‘Whole supply chain’, ‘Farm’, ‘Harvest’, ‘Storage’, ‘Processing’, ‘Trader’, ‘Retail’, …
  • loss_percentage
    • e.g. 5%, 13.5%, 43.2%…

Data Visualization

I apologize for the small font size of the plots. I’m too lazy to rewrite the code. Please open the images in new windows to see the plots better.

Which Commodity Has The Most Food Waste?

This section visualizes loss percentages for each commodity.

The U.S. (Whole Supply Chain)

Let’s first look at the U.S. data. I’ll be looking at the “Whole supply chain” data from 2005.

country = 'United States of America'
year = 2005
supply_stage = "Whole supply chain"
df_filtered = df[(df['country'] == country) & 
                 (df['food_supply_stage'] == supply_stage) & 
                 (df['year'] > year)]

plot_order = df_filtered.groupby(by='commodity').mean().sort_values(by='loss_percentage', ascending=False).index.tolist()
fig, ax = plt.subplots(figsize=(20, 6))
plot = sns.barplot(x='commodity', y='loss_percentage', data=df_filtered, order=plot_order, ci='sd')
ax.set_xticklabels(ax.get_xticklabels(), rotation = 45, horizontalalignment='right')
ax.set_title(f"Food Loss ({supply_stage}) by Commodity in {country}, (data from {year}, vertical bars show stddev)".title())
ax.set(xlabel=None)
plt.show()

png

Here are some interesting findings.

  • “Grapefruit juice” has a large loss percentage while “Grapefruits” has a small loss percentage.
  • “Mixed grain” loss is large while individual grains (“Wheat” and “Barley”) losses are very small.
  • There are no data for meat. I don’t see beef, pork, and chicken.

Other Countries (Whole Supply Chain)

Here are some plots for different countries that used the same code as the above. There are more plots in my repo.

Canada

png

  • Canada data is very similar to U.S. data.
  • There is “Meat of pig, fresh or chilled” in Candata data, which is not found in the U.S. data.

    Austria, Germany, Hungary

    The followings data are from three European countries.

png png png

  • The above plots are mostly about grains (e.g., Rye, Wheat, Barley). I guess grain data in these countries were easier to collect than other commodities for FAO.
  • “Plums and sloes” loss is low in Hungary (about 2.5%), while the same number is high for Austria (about 28%). It’s hard to believe that Austria is ten times more efficient in processing plums than Hungary. I looked at the “reference” and “method_data_collection” of the rows, but they seem to be the same. To me, this seems like a limitation of the dataset - even though the data collection method is the same, it’s hard to believe the numbers.
year = 2005
supply_stage = "Whole supply chain"
countries = ["Hungary", "Austria"]
commodity = "Plums and sloes"
df_filtered = df[df['country'].isin(countries) & 
                 (df['food_supply_stage'] == supply_stage) & 
                 (df['year'] > year) & (df['commodity'] == commodity)]
df_filtered[['country', 'commodity', 'loss_percentage', 'food_supply_stage', 'method_data_collection','reference', 'year']]
country commodity loss_percentage food_supply_stage method_data_collection reference year
7720 Hungary Plums and sloes 2.40 Whole supply chain FAO's annual Agriculture Production Questionna... FAO Sources 2014
10267 Austria Plums and sloes 26.12 Whole supply chain FAO's annual Agriculture Production Questionna... FAO Sources 2007
10282 Austria Plums and sloes 35.30 Whole supply chain FAO's annual Agriculture Production Questionna... FAO Sources 2006

Mexico, Peru

png png

  • “Watermelons” loss percentages seem too different for these two countries. Mexico is about 0.4%, and Peru is about 45%.
  • The Mexico data doesn’t have a reference, so the large difference makes sense. One can imagine how difficult collecting food waste data is.
year = 2005
supply_stage = "Whole supply chain"
countries = ["Mexico", "Peru"]
commodity = "Watermelons"
df_filtered = df[df['country'].isin(countries) & (df['food_supply_stage'] == supply_stage) & (df['year'] > year) & (df['commodity'] == commodity)]
df_filtered[['country', 'commodity', 'loss_percentage', 'food_supply_stage', 'method_data_collection','reference', 'year']]
country commodity loss_percentage food_supply_stage method_data_collection reference year
13919 Mexico Watermelons 0.42 Whole supply chain NaN NaN 2011
17149 Peru Watermelons 45.19 Whole supply chain FAO's annual Agriculture Production Questionna... FAO Sources 2007
17178 Peru Watermelons 45.17 Whole supply chain FAO's annual Agriculture Production Questionna... FAO Sources 2006

Food Loss Per Commodity (Whole Supply Chain)

As we have seen from the Watermelon data from Mexico and Peru, there might be a massive difference if the data collection methods are different. From now on, I’ll only use the data where the reference is “FAO Sources” and “method_data_collection” is “FAO’s annual Agriculture Production Questionnaires.” The below box plot shows the food loss percentage distribution per commodity from all the countries.

supply_stage = "Whole supply chain"
year = 2005
reference = "FAO Sources"
data_collection_method = "FAO's annual Agriculture Production Questionnaires"
df_filtered = df[(df['food_supply_stage'] == supply_stage) & 
                 (df['year'] >= year) & 
                 (df['reference'] == reference) & 
                 (df['method_data_collection'] == data_collection_method)]
# Get commodities with data points from many countries.
n_country_threshold = 6  # We only use commidity where data is collected from more than n_country_threshold countries. 
unique_country_commodity_pairs = set(zip(df_filtered['country'], df_filtered['commodity'])) # unique (country, commodity) pairs
commodity_counter = Counter([commodity_ for _, commodity_ in unique_country_commodity_pairs]) 
common_commodities = set({x: count for x, count in commodity_counter.items() if count >= n_country_threshold}.keys())
df_filtered = df_filtered[df_filtered['commodity'].isin(common_commodities)]
commodity_plot_order = df_filtered.groupby(by='commodity').mean().sort_values(by='loss_percentage', ascending=False).index.tolist()
fig, ax = plt.subplots(figsize=(25, 6))
plot = sns.boxplot(x='commodity', y='loss_percentage', data=df_filtered, order=commodity_plot_order)
ax.set_xticklabels(ax.get_xticklabels(), rotation = 45, horizontalalignment='right')
ax.set_title(f"Food Loss ({supply_stage}) from all countries (from {year}, reference={reference})".title() )
ax.set(xlabel=None)
plt.show()

png

  • By looking at the plot, it seems like fruits (e.g., tomatoes, watermelons) produce more waste than grains (e.g., oats, rye)

Which Country Has the Highest Food Loss?

The below plots show how much food loss percentage for “Potatoes”, “Wheat”, “Rice” and “Hen eggs in shell, fresh” for different countries. We can interpret how efficient/inefficient each country’s supply chain is for various commodities.

supply_stage = "Whole supply chain"
year = 2005
reference = "FAO Sources"
data_collection_method = "FAO's annual Agriculture Production Questionnaires"
df_filtered = df[(df['food_supply_stage'] == supply_stage) & 
                 (df['year'] >= year) & 
                 (df['reference'] == reference) & 
                 (df['method_data_collection'] == data_collection_method)]
unique_country_commodity_pairs = set(zip(df_filtered['country'], df_filtered['commodity'])) # unique (country, commodity) pairs
common_commodities = [commodity for commodity, _ in Counter([commodity for _, commodity in unique_country_commodity_pairs]).most_common(10)]

commodity_to_plot = ['Potatoes',
                     'Wheat', 
                     'Rice',
                     'Hen eggs in shell, fresh']
for commodity in common_commodities:
    if commodity not in commodity_to_plot:
        continue
    df_filtered_by_commodity = df_filtered[df_filtered['commodity'] == commodity]
    fig, ax = plt.subplots(figsize=(15, 6))
    plot_order = df_filtered_by_commodity.groupby(by='country').mean().sort_values(by='loss_percentage', ascending=False).index.tolist()
    sns.barplot(x='country', y='loss_percentage', data=df_filtered_by_commodity, ax=ax, order=plot_order)
    ax.set_xticklabels(ax.get_xticklabels(), rotation = 45, horizontalalignment='right')
    ax.set_title(f"'{commodity}' Loss ({supply_stage}) per Country (from {year}, Vertical bars show stddev)".title() )
    ax.set(xlabel=None)
    ax.set_ylabel(f"{commodity} loss percentage")
    plt.show()

png

png

png

png

Limitation of the Dataset

If you look at the Background of the Food Loss and Waste Database page, FAO describes the limitations of the dataset.

  • Often studies will sum the loss percentages across the supply chain. This tends to inflate the overall percentage losses and negate that the quantities of the commodities decrease through the supply chain with respect to losses and other utilizations.
  • The issue of food loss may not be fully captured within national estimates as they often present the lower bound of the problem. On the other hand, studies done with small sample sizes may not be fully representative, even though they represent the upper bound of the problem.
  • The figures show differences between estimates along different stages of the value chains and regions, making a one-size fits all estimate at the international level less practical when addressing losses. Considering how these estimates change over time by stage and ranges, it will be more effective to look at supply stages which fluctuate in their losses in order to focus on effective policy and investments.
  • Effective comparison of data is presently hindered by a lack of comparability on many levels: different measurement methodologies across space and commodities, and estimates that may not be repeated over time.
  • There may be underlying biases when it comes to the data that has been collected, both at the national level and within the individual studies and the focus will be put on problematic areas
  • In the case studies, the sample sizes may not be large enough to extrapolate causes and solutions to a general policy prescription.

IMO, this means we can’t draw any conclusions from the data. I appreciate and highly think of the initial effort to collect the data. Still, I look forward to more standardized data collection so people can do more insightful analyses.

Comments

Leave a comment

Your email address will not be published. Required fields are marked *

Loading...