Using Python to Analyze the Recipes of Binging with Babish



Defining a consistent structure

To begin to analyze a recipe we need to convert the raw information in it into something more easily understood by a computer.

Beginning with the list of ingredients, we will attempt to isolate quantities and units of measurement (e.g. "2 teaspoons"), while omitting things like preparation methods (e.g. "finely chopped").

Here's an example of a spice blend from the #BabishPanini episode:







  • 4 cloves of garlic
  • ¼ tsp dried thyme
  • ¼ tsp marjoram
  • ¼ tsp aniseed
"ingredient_list": [      
  [
    4,
    null,
    "Garlic",
    "4 cloves of garlic"
  ],
  [
    0.25,
    "teaspoon",
    "Dried Thyme",
    "¼ tsp dried thyme"
  ],
  [
    0.25,
    "teaspoon",
    "Marjoram",
    "¼ tsp marjoram"
  ],
  [
    0.25,
    "teaspoon",
    "Aniseed",
    "¼ tsp aniseed"
  ]
]

#BabishPanini Winner Recipe parsed with ❤ by DataWithBabish

So far, these ingredients appear to follow a consistent format: a quantity, a unit of measurement, and an ingredient name.
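As a rough sketch of that structure, each ingredient can be held in a four-field record and serialized to the JSON shape shown above. The field names (qty, unit, name, raw) simply mirror the order of the JSON entries, and the values are copied by hand from the spice blend rather than parsed.

```python
import json
from collections import namedtuple

# Four fields, in the same order as each JSON entry above.
Ingredient = namedtuple('Ingredient', 'qty unit name raw')

# Values typed in by hand for illustration; nothing is parsed here.
spice_blend = [
    Ingredient(4, None, 'Garlic', '4 cloves of garlic'),
    Ingredient(0.25, 'teaspoon', 'Dried Thyme', '¼ tsp dried thyme'),
]

# A namedtuple serializes as a plain JSON array, and None becomes null.
encoded = json.dumps({'ingredient_list': spice_blend}, ensure_ascii=False)
print(encoded)
```

Keeping the raw text as the last field makes it easy to compare what was parsed against what was originally written.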

Our first outlier, "4 cloves of garlic", highlights just one of the many ambiguities we will have to deal with in recipe information. Should our ingredient name be "cloves of garlic", or is "cloves" a unit of measurement?

In this case I have chosen to treat "cloves" as a unit of measurement, but to make our lives easier (if only for the moment) we will drop any unit that is not in our predefined list. That is why the parsed unit for the garlic above is null.
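One way to picture "units not predefined" is a small lookup table. The sketch below is an assumption for illustration only: `UNIT_ALIASES` and `normalize_unit` are hypothetical names, and the table is far shorter than a real one would need to be.

```python
# Hypothetical alias table: only units listed here are recognized.
UNIT_ALIASES = {
    'tsp': 'teaspoon', 'teaspoon': 'teaspoon',
    'tbsp': 'tablespoon', 'tablespoon': 'tablespoon',
    'oz': 'ounce', 'ounce': 'ounce',
    'lb': 'pound', 'pound': 'pound',
    'cup': 'cup',
}


def normalize_unit(word):
    """Return the canonical unit name, or None if the word is not predefined."""
    key = word.lower().rstrip('.')
    # Try the word as written, then with a plural "s" removed.
    singular = key[:-1] if key.endswith('s') else key
    for candidate in (key, singular):
        if candidate in UNIT_ALIASES:
            return UNIT_ALIASES[candidate]
    return None


print(normalize_unit('tsp'))     # a known abbreviation
print(normalize_unit('Pounds'))  # a known unit, capitalized and plural
print(normalize_unit('cloves'))  # not predefined, so it is dropped
```

With this approach "cloves" is simply unrecognized until someone adds it to the table, which is the trade-off described above.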

Assuming recipe information will continue to be in this consistent format, we've written a parser to extract these bits of information out of our recipes.

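# recipe_parser.py
import re
from collections import namedtuple


class RecipeParser(object):

    # One parsed ingredient: quantity, unit, name, and the original text.
    Ingredient = namedtuple('Ingredient', 'qty unit name raw')

    # Known units, with an optional plural suffix and trailing period.
    units_pattern = r'(?:(\s?mg|g|kg|ml|L|oz|ounce|tbsp|Tbsp|tablespoon|tsp|teaspoon|cup|lb|pound|small|medium|large|whole|half)?(?:s|es)?\.?\b)'

    # Optional quantity and unit, an optional "... of ", then the name up to the first comma.
    full_pattern = r'^(?:([-\.\/\s0-9\u2150-\u215E\u00BC-\u00BE]+)?{UNITS_PATTERN})?(?:.*\sof\s)?\s?(.+?)(?:,|$)'.format(
        UNITS_PATTERN=units_pattern)

    pattern = re.compile(full_pattern, flags=re.UNICODE)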
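To see what the pattern captures before any clean-up, here is a minimal sketch that applies it to a few ingredients from this post. The two pattern strings are copied from the parser; `raw_groups` is a hypothetical helper added only for this demonstration.

```python
import re

# Patterns copied from the parser so this snippet runs on its own.
units_pattern = r'(?:(\s?mg|g|kg|ml|L|oz|ounce|tbsp|Tbsp|tablespoon|tsp|teaspoon|cup|lb|pound|small|medium|large|whole|half)?(?:s|es)?\.?\b)'
full_pattern = r'^(?:([-\.\/\s0-9\u2150-\u215E\u00BC-\u00BE]+)?{UNITS_PATTERN})?(?:.*\sof\s)?\s?(.+?)(?:,|$)'.format(
    UNITS_PATTERN=units_pattern)
pattern = re.compile(full_pattern, flags=re.UNICODE)


def raw_groups(text):
    """Return the (quantity, unit, name) groups exactly as the regex captures them."""
    match = pattern.match(text)
    return match.groups() if match else None


samples = [
    '4 cloves of garlic',
    '¼ tsp dried thyme',
    '2 ½ pounds of full fat cream cheese, cut',
]
for line in samples:
    print(line, '->', raw_groups(line))
```

The captured quantity keeps its trailing whitespace, the unit comes back as written ("tsp", not "teaspoon"), and the name stops at the first comma, so "cut" is dropped. Turning these raw groups into the values shown earlier still takes a normalization step.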

And like all good programmers, we don't just assume our code works - we test it! From our larger dataset, I've pulled out a sample of ingredients with a wider variety of formats than our initial dataset suggested.

Testing our work

# test_recipe_parser.py
def test_parse_ingredient(self):
    # (input, (quantity, unit, name))
    tests = [
        ('Bread', (None, None, 'Bread')),
        ('6 stalks celery', (6.0, None, 'Stalks Celery')),
        ('4 eggs', (4.0, None, 'Eggs')),
        ('2 ½ pounds of full fat cream cheese, cut', (2.5, 'pound', 'Full Fat Cream Cheese')),
        ('25 oreos, finely processed', (25.0, None, 'Oreos')),
        ('1-2 variable ingredients', ('1-2', None, 'Variable Ingredients')),
        ('2 1/2 things', (2.5, None, 'Things')),
        ('1/2 things', (0.5, None, 'Things')),
        ('1 large, long sourdough loaf', (1.0, 'large', 'Long Sourdough Loaf')),
        ('100ml Water', (100.0, 'ml', 'Water')),
        ('1L Water', (1.0, 'L', 'Water')),
    ]
    # The excerpt ends at the list. The loop below is an assumed completion,
    # and parse_ingredient is a hypothetical method name.
    for raw, expected in tests:
        parsed = RecipeParser.parse_ingredient(raw)
        self.assertEqual(tuple(parsed[:3]), expected)

Some notable examples

  • Bread
  • 25 oreos, finely processed
  • 1 large, long sourdough loaf
  • 2 ½ pounds of full fat cream cheese, cut

Some ingredients lack a specific quantity or unit entirely; it is simply implied.

Others include comments about the ingredient, such as a preparation method, a variation on quantity (e.g. large, small, whole, half, extra), or a qualifier on the ingredient itself (e.g. full fat).

Our previous assumption about ingredient information being in a consistent format is quickly falling apart. Our parser implementation, built on regular expressions, relies on that consistent structure and is therefore brittle in the face of real data. We will need a better approach...

Enter Natural Language Processing (NLP)

But first, an aside about parsing fractions in Python.

Handling fractions

Since we are dealing with text scraped from the internet, the same quantity can be written in multiple ways. Some fractions are so common that they have been incorporated into the Unicode character set (e.g. ½) and may well appear more often than their plain-text counterparts (e.g. 1/2). As a programmer, I would be remiss if I didn't also support 0.5, though I doubt a recipe writer would actually use it.
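As a quick illustration, the standard library can already read each of these spellings. This is only a sketch of the building blocks, using `fractions.Fraction` for plain-text fractions and `unicodedata.numeric` for single Unicode fraction characters.

```python
from fractions import Fraction
from unicodedata import numeric

plain_text = float(Fraction('1/2'))  # fraction written as "1/2"
vulgar = numeric('½')                # single Unicode fraction character
decimal = float('0.5')               # decimal spelling

print(plain_text, vulgar, decimal)

# A mixed number such as "2 1/2" can be summed piece by piece.
mixed = float(sum(Fraction(part) for part in '2 1/2'.split()))
print(mixed)
```

All three spellings come out as the same float, which is what lets a single normalization step treat them alike.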

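# recipe_parser.py
from fractions import Fraction
# Assumed import: the excerpt calls numeric() without showing where it comes from.
from unicodedata import numeric


@classmethod  # excerpt: a method of the parser class
def normalize_qty(cls, qty):
    """Convert a quantity string to a float where possible."""
    if len(qty) == 0:
        qty = None
    elif len(qty) == 1:
        # single character: a digit or a vulgar fraction such as ½
        qty = numeric(qty)
    else:
        try:
            if '/' in qty:
                # 2 1/2
                qty = float(sum(Fraction(s) for s in qty.split()))
            elif qty[-1].isdigit():
                # normal number, ending in [0-9]
                qty = float(qty)
            else:
                # Assume the last character is a vulgar fraction
                qty = float(qty[:-1]) + numeric(qty[-1])
        except ValueError:
            pass  # let it be a string
    return qty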

And as always, we test our work!

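# test_recipe_parser.py
def test_normalize_qty(self):
    # (input, expected value)
    tests = [
        ('1', 1.0),
        ('1/2', 0.5),
        ('1 2/3', (1 + 2 / 3)),
        ('1 ⅔', (1 + 2 / 3)),
    ]
    # Assumed completion of the excerpt: compare approximately, since
    # float(Fraction(5, 3)) and 1 + 2 / 3 can differ in the last bit.
    for raw, expected in tests:
        self.assertAlmostEqual(RecipeParser.normalize_qty(raw), expected)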

Beyond REGEX

Examples of current limitations:

  • All potential ingredient units must be known beforehand.
  • Any info found after a comma or in parentheses is simply ignored.

Bring in the AI! Introducing Natural Language Processing (NLP)
