Beer Talk - Navigating Untappd Data Using Graphs
It’s that time of the year - the festive season. This means a bit of ‘down time’ to play although in reality it’s always used to explore, learn or do ‘worthy’ things for the community such as maintain and extend TRAK (an open source architecture framework).
Untappd Application
One of the many apps on the mobile phone is Untappd. This allows a user to ‘check in’ a beer they’ve tried and rate it. In addition you can identify where it was bought from or the pub in which you’re drinking it. As with most apps nowadays there is a social side and you can add friends, tag their check-ins and generally spy on them. It’s very good at helping maintain a list of what beers you’ve enjoyed and what you’ve had before which is useful when the memory starts to fail. Of course the emergent behaviour is that it encourages the user to explore and try more beers - no doubt something behind the design of the app and it is clear by the feedback from brewers that the data collected from the users is made available (presumably at a price) to the brewers so that they can see what users think of their beers.
I wanted to produce a list of what I’d tried but in the app and on the website there is no mechanism that simply produces a straight list - both produce a more ornate version with photos and details. What I really wanted was one in alphabetic order by the beer name and possibly by the brewer’s name. As a Supporter ( $5 for a month) you can export your data as CSV and JSON. So what sort of (data) model would best fit this?
The data header names are:-
- beer_name
- brewery_name
- beer_type
- beer_abv
- beer_ibu
- comment
- venue_name
- venue_city
- venue_state
- venue_country
- venue_lat
- venue_lng
- rating_score
- created_at
- checkin_url
- beer_url
- brewery_url
- brewery_country
- brewery_state
- flavor_profiles
- purchase_venue
- serving_type
This seems pretty straightforward once you know that IBU = International Bitterness Unit - a measure of how bitter the beer is and ABV is the percentage Alcohol By Volume.
Entities or Concepts
There appear to be some explicit concepts or entities:-
- Beer
- Brewery
- Checkin
- Venue
Although there are names for Beer and Brewery the unique identifier is probably the beer_url and brewery_url since it is possible that beers or breweries might have the same name. These URLs are internal to Untappd and are the address at which the description of the Beer or Brewery are presented.
Venue potentially appears in many guises. It might be the place at which the beer is drunk or it might be the place from which the beer was bought or both. The way in which the app works encourages the user to add the location at checkin which is presumably usually where the beer is tasted. This appears to populate the ‘venue_…’ columns. The purchase_venue is added separately by the user and looking at the data this simply holds the name so there is a potential problem where two purchase venues share the same name. Why they don’t use a venue_url is a mystery since this could be used to link purchase_venue and venue_name et al. It only seems sensible that a purchase venue should be a subset of the total set of venues (with addresses, locations et al). This might need tweaking since an online Purchase Venue might not have or show a geographical address and hence a venue_url is a sensible identifier.
The serving_type (‘draft’, ‘can’, ‘bottle’) might initially seem to be a property of the beer but it’s really the checkin or tasting event as the same beer might be tried in many different styles at different times.
As a first pass gathering the attributes of the entities we have:-
Beer
- name
- type
- abv
- ibu
- url (unique identifier)
- flavour profiles
Brewery
- name
- state
- country
- url (unique identifier)
Checkin
- date / time
- url (unique identifier)
- rating
- comment
- serving type
Venue
- name
- city
- state
- country
- latitude
- longitude
Although Untappd doesn’t state this it is best if the Purchase Venue is simply another Venue. This creates a problem because it only contains a name whereas the (Checkin) Venue is must have the combination of name + latitude + longitude as the unique identifier. We could potentially solve this by creating our own dummy identifier for Venue and some means to check / parse both purchase_venue and venue_name a) to link them and b) to spot any potential conflicts.
Relationships
Entities without relationships are nearly useless. We need relationships to establish patterns, identify boundaries and describe behaviour.
Some instinctive first choices are:-
- Brewery brews Beer
- Checkin at Venue
We need to associate Beer and Checkin and since Checkin isn’t a person we could try
- Checkin concerns Beer
and to tie the place where the beer is served (and usually consumed)
- Beer served at Venue
which is a separate assertion from
- Beer purchased at Venue
which gives us the a first-pass model.
This isn’t complete, however.
- The Untappd app allows a Checkin without requiring either a serving or a purchase Venue
- The rating may well be affected by how well the beer is looked after at the pub or, say, how near it is to the ‘best before’ date if in a bottle or can which affects both the purchase venue and the serving venue. The same beer might be given a different rating at a different venue so the rating is a combination of the [beer + serving venue + purchase venue].
- We can tie these together in a graph database by adding the Checkin identifier (checkin_url) property to the ‘served at’ and ‘purchased at’ relationships.
The flavour profile contains a set of adjectives to describe the taste of the beer. This is something set up by the Untappd database, presumably with input from each brewer, doesn’t vary with the checkin and is therefore a property of the beer but distinct from the user’s comments on drinking it (the ‘comment’ attribute of Checkin). We could leave these as a list property but might we could make them more explicit by having Taste Descriptor as a separate entity and associating it with Beer:-
- Beer described as Taste Descriptor e.g. ‘Headwaters Pale Ale described as smooth’
Since it’s very easy to navigate / query relationships in a graph database and there are only a handful of serving types we might as well having Serving Type as a separate entity.
There are a number of possibilities:-
- simply tie it to a particular Checkin (which is already tied to a Beer and Venue(s))
- tie it to the Beer and tie this to the Checkin using the checkin_url
The last method seems to make for more natural reading as we can then assert:-
- Beer served as Serving Type at Venue e.g. ‘Postman’s Plum Porter served as draft at The Old Post Office’
In using the Untappd App many of these relationships and entities are optional since a checkin only requires the user to identify the beer and hence each entry or row need only contain:-
- Checkin (provided by the Untappd app)
- Beer
- Brewery (provided by the Untappd app)
If we want we can traverse the succession of Checkins by adding ‘Checkin follows Checkin’ .
The final model to import against is now below.
Note that in a directed graph database such as Neo4J <http://www.neo4j.com/> the direction of the relationship does not affect the design of the entity since it is as easy to traverse a path in either direction by specifying the direction in the query, for example using the CYPHER query language:
MATCH breweries_beers_in_cans = (br:Brewery)-[:BREWS]->(b:Beer)-[:`SERVED AS`]->(ty:Serving_Type)
WHERE ty = ‘Can’
RETURN breweries_beers_in_cans`
A CYPHER Graph Query Simply Follows a Path
This is unlike a traditional SQL design where a relationship is inferred by placing a foreign key identifier value in the table with which the entity has a relationship.
The other advantage is that a properly constructed graph can be read as a simple statement or assertion so queries look much more like natural language.
In any notation it is an error not to define the relationship or define a relationship since:-
- If there is no relationship with anything it is not a description of architecture in any meaningful sense and cannot be used / queried (so what’s the point?)
- Without a defined relationship the meaning / semantics are unknown and different readers will interpret in different and therefore inconsistent ways.
- Now to import the Untappd data using this model to structure, cleanse/modify/transform the data to suit the queries to be run ....
External References
- Neo4J Graph Database http://www.untappd.com/
- Untappd http://www.untappd.com/