Using Python to Analyze the Recipes of Binging with Babish
Defining a consistent structure
To begin to analyze a recipe we need to convert the raw information in it into something more easily understood by a computer.
Beginning with the list of ingredients, we will attempt to isolate quantities and units of measurement (e.g. "2 teaspoons"), while omitting things like preparation methods (e.g. "finely chopped").
Here's an example of a spice blend from the #BabishPanini episode:
- 4 cloves of garlic
- ¼ tsp dried thyme
- ¼ tsp marjoram
- ¼ tsp aniseed
"ingredient_list": [
  [ 4, null, "Garlic", "4 cloves of garlic" ],
  [ 0.25, "teaspoon", "Dried Thyme", "¼ tsp dried thyme" ],
  [ 0.25, "teaspoon", "Marjoram", "¼ tsp marjoram" ],
  [ 0.25, "teaspoon", "Aniseed", "¼ tsp aniseed" ]
]
#BabishPanini Winner Recipe parsed with ❤ by DataWithBabish
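As a rough sketch of how that structure could be built in Python, the snippet below stores each row as a named tuple and serializes the list with `json.dumps`. The field names (`qty`, `unit`, `name`, `raw`) are chosen here to mirror the four columns, and the values are typed in by hand from the example rather than parsed.

```python
import json
from collections import namedtuple

# Field order mirrors the rows above: quantity, unit, name, raw text.
Ingredient = namedtuple('Ingredient', 'qty unit name raw')

# Values copied by hand from the spice blend example.
ingredient_list = [
    Ingredient(4, None, 'Garlic', '4 cloves of garlic'),
    Ingredient(0.25, 'teaspoon', 'Dried Thyme', '¼ tsp dried thyme'),
    Ingredient(0.25, 'teaspoon', 'Marjoram', '¼ tsp marjoram'),
    Ingredient(0.25, 'teaspoon', 'Aniseed', '¼ tsp aniseed'),
]

# Named tuples serialize as plain JSON arrays, and None becomes null.
payload = json.dumps({'ingredient_list': ingredient_list})
print(payload)
```

Keeping the raw text alongside the parsed fields makes it easy to re-check a row by eye when the parser gets something wrong.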
So far, all of these ingredients are in a consistent format: a quantity, a unit of measurement, and an ingredient name.
Our first outlier, "4 cloves of garlic", highlights just one of the many ambiguities we will have to deal with in recipe information. Should our ingredient name be "cloves of garlic", or is "cloves" a unit of measurement?
In this case I have chosen to treat "cloves" as a unit of measurement, but to make our lives easier (if only for the moment) we will omit any unit that is not in a predefined list.
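One way to picture "omit units not predefined" is a small lookup table. This is only a sketch: `KNOWN_UNITS` and `normalize_unit` are hypothetical names, and the table lists just a handful of units for illustration.

```python
# Hypothetical lookup table: spelling found in a recipe -> canonical unit.
# Only a few units are listed here; a real table would be longer.
KNOWN_UNITS = {
    'tsp': 'teaspoon',
    'teaspoon': 'teaspoon',
    'tbsp': 'tablespoon',
    'tablespoon': 'tablespoon',
    'cup': 'cup',
    'oz': 'ounce',
    'ounce': 'ounce',
}


def normalize_unit(word):
    """Return the canonical unit for `word`, or None if it is not predefined."""
    key = word.lower().rstrip('.')
    # Try the word as written first, then with a trailing plural "s" removed.
    for candidate in (key, key[:-1] if key.endswith('s') else key):
        if candidate in KNOWN_UNITS:
            return KNOWN_UNITS[candidate]
    return None


print(normalize_unit('tsp'))
print(normalize_unit('Cups'))
print(normalize_unit('cloves'))
```

With this table, "tsp" comes back as "teaspoon" and "cloves" comes back as `None`, which matches the null unit recorded for the garlic in the parsed example above.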
Assuming recipe information will continue to be in this consistent format, we've written a parser to extract these bits of information out of our recipes.
import re
from collections import namedtuple

class RecipeParser(object):
    Ingredient = namedtuple('Ingredient', 'qty unit name raw')
    # Optional known unit, optional plural suffix, optional trailing period.
    units_pattern = r'(?:(\s?mg|g|kg|ml|L|oz|ounce|tbsp|Tbsp|tablespoon|tsp|teaspoon|cup|lb|pound|small|medium|large|whole|half)?(?:s|es)?\.?\b)'
    # Optional quantity and unit, optional "... of ", then the name up to the first comma.
    full_pattern = r'^(?:([-\.\/\s0-9\u2150-\u215E\u00BC-\u00BE]+)?{UNITS_PATTERN})?(?:.*\sof\s)?\s?(.+?)(?:,|$)'.format(
        UNITS_PATTERN=units_pattern)
    pattern = re.compile(full_pattern, flags=re.UNICODE)
recipe_parser.py hosted with ❤ by GitHub
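To see what the pattern actually captures, here is a small sketch that applies it to a few strings from this post. The two pattern strings are repeated from `RecipeParser` so the snippet runs on its own; `split_ingredient` is a hypothetical helper, not part of the original parser.

```python
import re

# Pattern strings copied from RecipeParser so the snippet is self-contained.
units_pattern = r'(?:(\s?mg|g|kg|ml|L|oz|ounce|tbsp|Tbsp|tablespoon|tsp|teaspoon|cup|lb|pound|small|medium|large|whole|half)?(?:s|es)?\.?\b)'
full_pattern = r'^(?:([-\.\/\s0-9\u2150-\u215E\u00BC-\u00BE]+)?{UNITS_PATTERN})?(?:.*\sof\s)?\s?(.+?)(?:,|$)'.format(
    UNITS_PATTERN=units_pattern)
pattern = re.compile(full_pattern, flags=re.UNICODE)


def split_ingredient(text):
    """Hypothetical helper: return the raw (qty, unit, name) groups, or None."""
    match = pattern.match(text)
    return match.groups() if match else None


for text in ('¼ tsp dried thyme', '4 cloves of garlic', 'Bread'):
    print(text, '->', split_ingredient(text))
```

Two things are worth noticing in the output. The quantity group keeps its trailing space ("¼ " rather than "¼"), so it needs stripping before any numeric conversion. And for "4 cloves of garlic" the unit group is empty, because "cloves" is not in the predefined list and the "... of " part is simply skipped over.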
And like all good programmers, we don't just assume our code works - we test it! From our larger dataset, I've pulled out a sample of ingredients with a wider variety of formats than our initial dataset suggested.
Testing our work
def test_parse_ingredient(self):
    # Input, (Quantity, Unit, Name)
    tests = [
        'Bread', (None, None, 'Bread'),
        '6 stalks celery', (6.0, None, 'Stalks Celery'),
        '4 eggs', (4.0, None, 'Eggs'),
        '2 ½ pounds of full fat cream cheese, cut', (2.5, 'pound', 'Full Fat Cream Cheese'),
        '25 oreos, finely processed', (25.0, None, 'Oreos'),
        '1-2 variable ingredients', ('1-2', None, 'Variable Ingredients'),
        '2 1/2 things', (2.5, None, 'Things'),
        '1/2 things', (0.5, None, 'Things'),
        '1 large, long sourdough loaf', (1.0, 'large', 'Long Sourdough Loaf'),
        '100ml Water', (100.0, 'ml', 'Water'),
        '1L Water', (1.0, 'L', 'Water')
    ]
test_recipe_parser.py hosted with ❤ by GitHub
Some notable examples
- Bread
- 25 oreos, finely processed
- 1 large, long sourdough loaf
- 2 ½ pounds of full fat cream cheese, cut
Some ingredients completely lack a specific quantity or unit; the amount is simply implied.
Others include comments about the ingredient, such as a preparation method, or qualifiers on the quantity (e.g. large, small, whole, half, extra) or on the ingredient itself (e.g. full fat).
Our previous assumption about ingredient information being in a consistent format is quickly falling apart. Our parser implementation, built on regular expressions, relies on that consistent structure and is therefore very brittle when faced with the realities of our data. We will need a better approach...
Enter Natural Language Processing (NLP)
But first, an aside about parsing fractions in Python.
Handling fractions
Since we are dealing with text scraped from the internet, there are multiple ways of writing the same quantity. Some fractions are so common that they have their own characters in the Unicode character set (e.g. ½), and these are likely to be more common than their plain-text counterparts (e.g. 1/2). As a programmer, I would be remiss if I didn't also support 0.5, though I doubt a recipe writer would actually use it.
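As a quick illustration, the three spellings can each be turned into the same float using only the standard library. This sketch assumes `unicodedata.numeric` for the single-character case; the variable names are just for the example.

```python
from fractions import Fraction
from unicodedata import numeric

half_unicode = numeric('½')          # single Unicode fraction character
half_text = float(Fraction('1/2'))   # plain-text fraction
half_decimal = float('0.5')          # decimal notation

print(half_unicode, half_text, half_decimal)
```

Note that `numeric` only accepts a single character, so a mixed number such as "2 ½" has to be split into its whole and fractional parts first.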
from fractions import Fraction
from unicodedata import numeric  # assumed source of numeric(): maps '½' to 0.5

def normalize_qty(cls, qty):
    if len(qty) == 0:
        qty = None
    elif len(qty) == 1:
        qty = numeric(qty)
    else:
        try:
            if '/' in qty:
                # 2 1/2
                qty = float(sum(Fraction(s) for s in qty.split()))
            elif qty[-1].isdigit():
                # normal number, ending in [0-9]
                qty = float(qty)
            else:
                # Assume the last character is a vulgar fraction
                qty = float(qty[:-1]) + numeric(qty[-1])
        except ValueError:
            pass  # let it be a string
    return qty
And as always, we test our work!
def test_normalize_qty(self):
    tests = [
        '1', 1.0,
        '1/2', 0.5,
        '1 2/3', (1 + 2 / 3),
        '1 ⅔', (1 + 2 / 3)
    ]
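The test list above is flat, alternating input and expected value. One possible way to walk it in pairs is sketched below, assuming Python 3 (where `2 / 3` is true division). `mixed_number` is a stand-in for the '/' branch of `normalize_qty`, not the original method. The comparison uses `math.isclose` rather than `==`, because converting the exact fraction 5/3 to a float and computing `1 + 2 / 3` in floating point may not agree in the last digit.

```python
import math
from fractions import Fraction

# Flat list alternating input and expected value, as in the test above.
tests = [
    '1/2', 0.5,
    '1 2/3', (1 + 2 / 3),
]

# Pair each input with the value that follows it.
pairs = list(zip(tests[::2], tests[1::2]))


def mixed_number(text):
    """Stand-in for the '/' branch of normalize_qty: sum the parts as Fractions."""
    return float(sum(Fraction(part) for part in text.split()))


for text, expected in pairs:
    # Compare with a tolerance instead of exact float equality.
    assert math.isclose(mixed_number(text), expected), text

print('all pairs within tolerance')
```

In a `unittest` test case, `assertAlmostEqual` plays the same role as `math.isclose` here.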
Beyond REGEX
Examples of current limitations:
- All potential ingredient units must be known beforehand.
- Any info found after a comma or in parentheses is simply ignored.
Bring in the AI! Introducing Natural Language Processing (NLP)
Research
- Using NLP to parse ingredients + comments: https://github.com/NYTimes/ingredient-phrase-tagger
- Example use: https://rajmak.wordpress.com/tag/recipe-ingredients-tagging/
- Running in Docker: https://github.com/ArchSirius/docker-ingredient-phrase-tagger
Call to action
- Help contribute to an ingredient/recipe parsing library