Scraping Facebook posts with Haskell


Due to changes in the Facebook Graph API, most of this doesn't work anymore.


The main back-end component of Oxfeud (now deprecated) is a Haskell program that grabs content from Facebook and puts it in a database. It makes use of the Facebook Graph API to make requests that return JSON objects, from which data is extracted and inserted into the DB.

It is used in two different ways:

  1. Once, grab all posts (and associated data) from a page to initially populate the database.
  2. Repeatedly, grab the last n posts (and associated data) to update the database.

The specific data we need to get is:

  • All the posts to a certain page. For each post we store
    • The id, message, and time of posting
    • All comments and reactions
  • For a comment we store
    • The id, message, time of posting, and author
    • All reply comments and reactions
    • All tags
  • For a reaction we store
    • The type (Like, Angry, etc), and author
  • We also store all users who have interacted with the page by
    • Reacting to a post or comment
    • Making a comment
    • Being tagged in a comment

It's important that the program be:

  • Modular, so that we can fetch content from arbitrary pages and do arbitrary things with it.
  • Reliable. If one part of the program fails, for example parsing one JSON object out of 20, the rest of the program should carry on as best it can.
  • Communicative. It should log what it does, and any failures that occur.
  • Easy to understand and extend.

Why Haskell?

There are a few reasons I decided to write this in Haskell.

The underlying general pattern of "Get some input, parse it into objects, then process the objects" lends itself well to Haskell. Haskell also handles recursive data types very cleanly, as we'll see in the Paging section. Finally, the Haskell development style of "Write a library with functions that manipulate objects in your problem space, and then use those to write a small executable that solves a specific problem" will suit this rapidly changing project nicely.


A good place to start is to think of what types of objects you're going to deal with. We can define the types of Posts, Comments, Reactions and Users in the obvious way.
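As a sketch, the types might look like the following. The record names are my guesses, and I've left out reply comments and per-comment reactions to keep things short:

```haskell
-- A user who has interacted with the page in some way.
data User = User
  { userID   :: String
  , userName :: String
  } deriving (Show, Eq)

-- A reaction has a type (Like, Angry, ...) and an author.
data Reaction = Reaction
  { reactionType   :: String
  , reactionAuthor :: User
  } deriving (Show, Eq)

-- A comment stores its id, message, time of posting, author and tags.
data Comment = Comment
  { commentID      :: String
  , commentMessage :: String
  , commentTime    :: String
  , commentAuthor  :: User
  , commentTags    :: [User]
  } deriving (Show, Eq)

-- A post stores its id, message, time of posting, comments and reactions.
data Post = Post
  { postID        :: String
  , postMessage   :: String
  , postTime      :: String
  , postComments  :: [Comment]
  , postReactions :: [Reaction]
  } deriving (Show, Eq)
```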

Facebook Graph API

If you have the ID for a post you can use the Facebook Graph API to get all sorts of information about it. An access token is also required for each query. The standard query we'll be making will be an HTTPS GET request on a URL that looks like https://graph.facebook.com/[POST_ID]?fields=reactions%2Ccomments%7Bmessage%2Cfrom%2Cmessage_tags%2Ccreated_time%7D%2Cmessage%2Ccreated_time&access_token=[TOKEN]

This will return a JSON object that looks like the anonymized example below.
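An invented stand-in for that example, with all IDs, names and times made up:

```json
{
  "id": "128754403839_10157305033233840",
  "message": "#Oxfeud_2041 An anonymized feud goes here.",
  "created_time": "2017-11-02T21:10:00+0000",
  "comments": {
    "data": [
      {
        "id": "10157305033233840_10157305100000000",
        "message": "Tagging Bob Jones",
        "from": { "name": "Alice Smith", "id": "100001" },
        "message_tags": [
          { "id": "100002", "name": "Bob Jones", "type": "user" }
        ],
        "created_time": "2017-11-02T21:12:00+0000"
      }
    ],
    "paging": {
      "next": "https://graph.facebook.com/v2.8/10157305033233840/comments?after=MjU="
    }
  },
  "reactions": {
    "data": [
      { "id": "100003", "name": "Carol White", "type": "LIKE" }
    ],
    "paging": {
      "cursors": { "before": "MQ==", "after": "MQ==" }
    }
  }
}
```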

How easy! Everything we need is right there. Now how do we write a function that gets this for us?

Haskell has a few HTTP libraries, but I like the simplicity of req. Looking at the documentation, we see that something like this will work:
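A sketch of such a request, assuming a recent version of req; the function name responseFromFB and the use of defaultHttpConfig are mine:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Network.HTTP.Req
import qualified Data.ByteString.Char8 as B
import Data.Text (Text, pack)

type Token  = String
type PostID = String

-- The fields we ask for in every post query.
fieldsParam :: Text
fieldsParam = "reactions,comments{message,from,message_tags,created_time},message,created_time"

-- GET https://graph.facebook.com/[POST_ID]?fields=...&access_token=[TOKEN]
-- and hand back the raw response body.
responseFromFB :: Token -> PostID -> IO B.ByteString
responseFromFB token postID = runReq defaultHttpConfig $ do
  r <- req GET
           (https "graph.facebook.com" /: pack postID)
           NoReqBody
           bsResponse
           (  "fields"       =: fieldsParam
           <> "access_token" =: token )
  return (responseBody r)
```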

Here Token and PostID are type synonyms for String.

Now all we have to do is use the Aeson library to parse this response into a Post.

Parsing JSON

With the Aeson library, you can easily parse JSON into a Haskell object if you just define an instance of FromJSON for it. I found this guide very valuable.

Let's start by parsing a User, and building up to Post from there.
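Assuming a User record holding an ID and a name, the instance is the standard one:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson

instance FromJSON User where
  parseJSON = withObject "User" $ \o ->
    User <$> o .: "id"
         <*> o .: "name"
```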

This is completely standard. Now, notice how a Reaction JSON object also has "name" and "id" fields, just like a User? We can actually parse a Reaction object by reading the reaction type, and then getting the reacting user by parsing the same object again, this time as a User.
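A sketch of that trick, with Reaction holding a type string and a User:

```haskell
instance FromJSON Reaction where
  parseJSON = withObject "Reaction" $ \o -> do
    rtype  <- o .: "type"
    author <- parseJSON (Object o)  -- parse the very same object, as a User
    return (Reaction rtype author)
```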

We don't even need to annotate parseJSON :: Value -> Parser a as Value -> Parser User; Haskell can infer the type by seeing that we bind the result to reactionAuthor, which it knows is of type User. Neat!

We do something similar for parsing Comments. Not every Comment JSON object has a message_tags field (when the comment doesn't contain any tags the Facebook API doesn't even return a message_tags field), so we need to be careful. Thankfully, Aeson allows us to provide a default value if a key doesn't appear. Finally, we can parse the array of tags directly into a list of Users.
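A sketch, using aeson's (.:?) and (.!=) to default message_tags to the empty list when the field is absent:

```haskell
instance FromJSON Comment where
  parseJSON = withObject "Comment" $ \o -> do
    cid    <- o .: "id"
    msg    <- o .: "message"
    time   <- o .: "created_time"
    author <- o .: "from"
    tags   <- o .:? "message_tags" .!= []  -- absent when there are no tags
    return (Comment cid msg time author tags)
```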

Finally, we can parse a Post like this.
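A sketch; the comments and reactions arrays live under a "data" key inside their respective objects:

```haskell
instance FromJSON Post where
  parseJSON = withObject "Post" $ \o -> do
    pid  <- o .: "id"
    msg  <- o .: "message"
    time <- o .: "created_time"
    commentsObj  <- o .: "comments"       -- { "data": [...], "paging": ... }
    comments     <- commentsObj .: "data"
    reactionsObj <- o .: "reactions"
    reactions    <- reactionsObj .: "data"
    return (Post pid msg time comments reactions)
```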

To tidy things up we can change the response type of our req request to a jsonResponse and actually parse the JSON immediately, like so:
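A sketch of the combined fetch-and-parse, with jsonResponse picking out the FromJSON instance for Post:

```haskell
postFromFB :: Token -> PostID -> IO Post
postFromFB token postID = runReq defaultHttpConfig $ do
  r <- req GET
           (https "graph.facebook.com" /: pack postID)
           NoReqBody
           jsonResponse
           ("fields" =: fieldsParam <> "access_token" =: token)
  return (responseBody r)
```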

Or can we...

Paging

The Facebook Graph API, by default, only returns the 25 most recent things of each type. That is, when making a request on a post with 26 comments and 26 reactions, only the first 25 reactions will appear in the reactions field, and the same goes for the comments. This limit can be increased to 100, but that's just kicking the can down the road. Given that many posts have more than 100 reactions and/or comments, how do we get all of them?

The answer is in the paging field. This (sometimes, if there are any extra things) contains a next field that you can put into your query string to get the next n things. This means we need to rethink our postFromFB function, because the response body of an API call is not a Post, but some top level Post information (the message and time), the first 25 Comments and possibly a pointer to some more Comments, and the first 25 Reactions and possibly a pointer to some more Reactions.

We can abstract this pattern of "Some things, and possibly instructions on how to get more things" as a new data type
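One plausible shape for it; the field names are my guesses:

```haskell
type URL = String

-- "Some things, and possibly instructions on how to get more things."
data Paging a = Paging
  { things :: [a]        -- the things we got this time
  , next   :: Maybe URL  -- where to ask for the next batch, if any
  } deriving (Show)
```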

and class,

First we make an instance of FromJSON (Paging a) whenever we have an instance of FromJSON a:
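A sketch of the instance, reading things from the "data" array and looking for next inside the optional "paging" object:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson

instance FromJSON a => FromJSON (Paging a) where
  parseJSON = withObject "Paging" $ \o -> do
    xs     <- o .: "data"
    paging <- o .:? "paging"        -- the paging object may be absent
    nxt    <- case paging of
                Nothing -> return Nothing
                Just p  -> p .:? "next"
    return (Paging xs nxt)
```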

Then we make a type for initial responses:
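One plausible shape; the pr-prefixed field names are my guesses:

```haskell
-- The top-level information in a response body: the post itself, plus
-- the first page of comments and the first page of reactions.
data PostResponse = PostResponse
  { prID        :: String
  , prMessage   :: String
  , prTime      :: String
  , prComments  :: Paging Comment
  , prReactions :: Paging Reaction
  }
```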

and an instance of FromJSON
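A sketch, assuming a PostResponse record with id/message/time fields and a Paging of comments and of reactions:

```haskell
instance FromJSON PostResponse where
  parseJSON = withObject "PostResponse" $ \o ->
    PostResponse <$> o .: "id"
                 <*> o .: "message"
                 <*> o .: "created_time"
                 <*> o .: "comments"
                 <*> o .: "reactions"
```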

So the function postFromFB now has signature Token -> PostID -> IO PostResponse. We just need a way to elaborate a PostResponse into a Post.

We define some helper methods to make making API requests simpler.
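One possible helper; fetchFB is an assumed name, and it leans on the fact that the next URLs the API hands back already embed the access token, so no Token parameter is needed:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson (FromJSON)
import Network.HTTP.Req
import Text.URI (mkURI)
import Data.Text (pack)

-- Fetch an absolute Graph API URL (such as a "next" link) and decode
-- the JSON response. Returns Nothing if the URL can't be interpreted.
fetchFB :: FromJSON a => URL -> IO (Maybe a)
fetchFB url = do
  uri <- mkURI (pack url)
  case useHttpsURI uri of
    Nothing        -> return Nothing
    Just (u, opts) -> runReq defaultHttpConfig $ do
      r <- req GET u NoReqBody jsonResponse opts
      return (Just (responseBody r))
```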

Now we need to deal with paging comments and reactions.
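Assuming a helper fetchFB :: FromJSON a => URL -> IO (Maybe a) that GETs and decodes an absolute URL, following a next pointer is just:

```haskell
-- Follow a Paging's next pointer, if there is one.
nextPaging :: FromJSON a => Paging a -> IO (Maybe (Paging a))
nextPaging p = case next p of
  Nothing  -> return Nothing
  Just url -> fetchFB url
```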

So given a Paging a we can use nextPaging to get the next Paging a, if it exists. We just need some function to repeat this until there are no more to get, and collect all the as along the way.
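A sketch of that function, here called elab (the name the article uses later):

```haskell
-- Collect this batch, and every subsequent batch.
elab :: FromJSON a => Paging a -> IO [a]
elab p = do
  mp <- nextPaging p
  case mp of
    Nothing -> return (things p)
    Just p' -> (things p ++) <$> elab p'
```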

We finish up here by defining a function to make a Post from a PostResponse (and a Token).
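A sketch, assuming postFromFB now returns a PostResponse and elab fully pages out a Paging a:

```haskell
-- Elaborate a PostResponse into a full Post by paging out its
-- comments and reactions.
postFromResponse :: PostResponse -> IO Post
postFromResponse pr = do
  cs <- elab (prComments pr)
  rs <- elab (prReactions pr)
  return (Post (prID pr) (prMessage pr) (prTime pr) cs rs)

makePostFromFB :: Token -> PostID -> IO Post
makePostFromFB token pid = postFromFB token pid >>= postFromResponse
```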

Cleaning up

One thing to notice is that nearly every function we've defined takes a Token and just passes it to another function. We can hide this by defining a new monad type FB a = Token -> IO a, so that a value of type FB a represents a computation that, given a Token, makes some calls to the Facebook API and returns an a.
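Since a plain type synonym can't be given class instances, a sketch of this would use a newtype; the instance definitions and helper names here are mine:

```haskell
type Token = String

-- A computation that, given a Token, does some IO and returns an a.
newtype FB a = FB { runFB :: Token -> IO a }

instance Functor FB where
  fmap f (FB g) = FB $ \t -> fmap f (g t)

instance Applicative FB where
  pure x        = FB $ \_ -> pure x
  FB f <*> FB g = FB $ \t -> f t <*> g t

instance Monad FB where
  FB g >>= f = FB $ \t -> g t >>= \x -> runFB (f x) t

-- Read the token from inside an FB computation.
askToken :: FB Token
askToken = FB return

-- Lift an ordinary IO action into FB.
liftFB :: IO a -> FB a
liftFB io = FB $ \_ -> io
```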

Then we can change most of our functions to return FB a, and in main call them like:
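For example, with FB as a newtype wrapping Token -> IO a and runFB as its unwrapper:

```haskell
main :: IO ()
main = do
  token  <- getToken
  postID <- getID
  post   <- runFB (makePostFromFB postID) token
  print post
```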

(where getToken :: IO Token, getID :: IO PostID, makePostFromFB :: PostID -> FB Post.)


Now that we can get any post we want, we need some place to put them all. A MySQL database will do. We'll use the following schema, with all the obvious foreign key constraints.

| Table     | Field       | Type      |
|-----------|-------------|-----------|
| comments  | id_comment  | bigint    |
| comments  | id_parent   | bigint    |
| comments  | id_author   | bigint    |
| comments  | message     | text      |
| comments  | time        | timestamp |
| posts     | id_post     | bigint    |
| posts     | message     | text      |
| posts     | feudnum     | int       |
| posts     | time        | timestamp |
| reactions | type        | varchar   |
| reactions | id_author   | bigint    |
| reactions | id_post     | bigint    |
| refs      | id_src      | bigint    |
| refs      | num_dest    | int       |
| tags      | id_comment  | bigint    |
| tags      | tagged_user | bigint    |
| users     | id_user     | bigint    |
| users     | name        | tinytext  |

We're going to use the mysql-simple library for putting posts in the database.

The idea here is to construct query templates like myQuery = "INSERT INTO table(v1,v2) VALUES(?,?)", then execute them by calling execute connection myQuery (x1,x2).

Mostly we're just going to be INSERTing rows into a table. Sometimes we'll want to UPDATE, like if the content of a post or comment changes.

So, to insert the data from a comment we could write a function like:
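One way that function might have looked, using Control.Exception.try around mysql-simple's execute:

```haskell
import Control.Exception (SomeException, try)
import Database.MySQL.Simple

-- A first attempt: INSERT, and on failure report and try an UPDATE instead.
insertComment :: Connection -> PostID -> Comment -> IO ()
insertComment conn pid c = do
  r <- try $ execute conn
         "INSERT INTO comments(id_comment,id_parent,id_author,message,time) \
         \VALUES(?,?,?,?,?)"
         (commentID c, pid, userID (commentAuthor c),
          commentMessage c, commentTime c)
  case r of
    Right _ -> return ()
    Left e  -> do
      putStrLn ("Insert failed: " ++ show (e :: SomeException))
      r' <- try $ execute conn
              "UPDATE comments SET message=? WHERE id_comment=?"
              (commentMessage c, commentID c)
      case r' of
        Right _ -> return ()
        Left e' -> putStrLn ("Update failed: " ++ show (e' :: SomeException))
```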

This will try inserting, and if that fails print the exception and try updating, and if that fails too, print the exception and exit gracefully. It works, but it's not ideal.

There's a lot of code reuse inside insertComment, and given that we'll need something similar for inserting Posts, Reactions, etc, we should try to pull out all the boilerplate. Fundamentally, there's not a lot of difference between putting a Post or a Comment into the database: the only things that really change are which table it goes in and which query you use.

So let's start by keeping track of all the tables we'll be accessing, and let's make an appropriate pair of INSERT and UPDATE queries for each of them.
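A sketch; the Table type and the query pairs are my reconstruction from the schema above:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Database.MySQL.Simple (Query)

-- The tables we touch, each with an INSERT template and, where the
-- content can change, an UPDATE template.
data Table = Comments | Posts | Reactions | Tags | Users

insertQ :: Table -> Query
insertQ Comments  = "INSERT INTO comments(id_comment,id_parent,id_author,message,time) VALUES(?,?,?,?,?)"
insertQ Posts     = "INSERT INTO posts(id_post,message,feudnum,time) VALUES(?,?,?,?)"
insertQ Reactions = "INSERT INTO reactions(type,id_author,id_post) VALUES(?,?,?)"
insertQ Tags      = "INSERT INTO tags(id_comment,tagged_user) VALUES(?,?)"
insertQ Users     = "INSERT INTO users(id_user,name) VALUES(?,?)"

updateQ :: Table -> Query
updateQ Comments = "UPDATE comments SET message=? WHERE id_comment=?"
updateQ Posts    = "UPDATE posts SET message=? WHERE id_post=?"
updateQ Users    = "UPDATE users SET name=? WHERE id_user=?"
updateQ t        = insertQ t  -- reactions and tags never change; just retry
```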

Now we can define a helper method to encapsulate the functionality of "try this, and if that fails try this instead":
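A sketch of that helper, here called orElse:

```haskell
import Control.Exception (SomeException, try)

-- Run the first action; if it throws, log the exception and run the fallback.
orElse :: IO a -> IO a -> IO a
orElse action fallback = do
  r <- try action
  case r of
    Right x -> return x
    Left e  -> do
      putStrLn ("Caught: " ++ show (e :: SomeException))
      fallback
```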

And now we can make a generic function to put things into the database.
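A sketch, assuming the insertQ/updateQ query pairs and the try-then-fall-back orElse helper from the previous steps:

```haskell
import Database.MySQL.Simple
import Database.MySQL.Simple.QueryParams (QueryParams)

-- Generic "put this row in that table": try the INSERT, and if it
-- throws (say, a duplicate key), fall back to the UPDATE.
putDB :: (QueryParams i, QueryParams u)
      => Connection -> Table -> i -> u -> IO ()
putDB conn table insRow updRow =
  (execute conn (insertQ table) insRow >> return ())
    `orElse`
  (execute conn (updateQ table) updRow >> return ())
```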

Finally, we need a function that actually puts a post into the database. We proceed inductively, starting at the component types and building up to Post.

There is something to beware of. As we have foreign key constraints in the database, we need to be careful with the order in which we insert things. For example, since the table tags has a column id_comment with a foreign key constraint to the id_comment column in the comments table, we better not try to insert a comment's tags before we insert the comment itself.

Thankfully we can guarantee this quite easily. If xm, ym :: IO () then in the program
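```haskell
do xm
   ym
```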

any side effects of xm will happen before those of ym. Why is that the case? Well, the do notation is just syntactic sugar for
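```haskell
xm >>= (\_ -> ym)
```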

We can think of the IO monad as a special instance of the State monad, defined as

~~~ {.haskell}
type IO a = RealWorld -> (a, RealWorld)

return :: a -> IO a
return x r = (x, r)

(>>=) :: IO a -> (a -> IO b) -> IO b
(xm >>= f) r = let (y, r') = xm r in f y $ r'
~~~

So our program above simplifies to
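```haskell
\r -> let (_, r') = xm r in ym r'
```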

And we see that the side effects of xm happen first, because xm r has to be evaluated before ym is run.

Putting a user into the database is simple.
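A sketch, assuming the generic putDB insert-or-update helper:

```haskell
insertUser :: Connection -> User -> IO ()
insertUser conn u =
  putDB conn Users (userID u, userName u)  -- INSERT (id_user, name)
                   (userName u, userID u)  -- UPDATE: SET name=? WHERE id_user=?
```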

To insert a reaction, we put in the reacting user first and then the reaction itself.
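A sketch, again assuming a generic putDB helper:

```haskell
insertReaction :: Connection -> PostID -> Reaction -> IO ()
insertReaction conn pid r = do
  insertUser conn (reactionAuthor r)  -- the user row must exist first
  putDB conn Reactions row row
  where
    row = (reactionType r, userID (reactionAuthor r), pid)
```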

Inserting a comment is a little more involved. We insert the author, the comment itself, and then any tagged users.
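A sketch; note the ordering, which respects the foreign key constraints discussed above:

```haskell
insertComment :: Connection -> PostID -> Comment -> IO ()
insertComment conn pid c = do
  insertUser conn (commentAuthor c)  -- author first, for the FK constraint
  putDB conn Comments
        (commentID c, pid, userID (commentAuthor c),
         commentMessage c, commentTime c)
        (commentMessage c, commentID c)
  mapM_ insertTag (commentTags c)    -- the comment exists; now its tags
  where
    insertTag u = do
      insertUser conn u
      putDB conn Tags (commentID c, userID u) (commentID c, userID u)
```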

And from here it's straightforward to define how to insert posts.
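A sketch; feudNum here is a hypothetical helper reading the number out of the "#Oxfeud_[n]" prefix of the message:

```haskell
insertPost :: Connection -> Post -> IO ()
insertPost conn p = do
  putDB conn Posts
        (postID p, postMessage p, feudNum p, postTime p)
        (postMessage p, postID p)
  mapM_ (insertComment conn (postID p))  (postComments p)
  mapM_ (insertReaction conn (postID p)) (postReactions p)

-- Hypothetical: read n from the "#Oxfeud_[n]" prefix of the message.
feudNum :: Post -> Int
feudNum p = case reads (drop (length "#Oxfeud_") (postMessage p)) of
              [(n, _)] -> n
              _        -> 0
```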

Getting Post IDs

So, given a post ID number, we can fetch the data from Facebook and put it into a database. One might wonder, however, how to actually get the post IDs in the first place...

Well, if you make a Graph API query on a page's feed field, you get a list of the last 25 posts made. Then, in a slight modification of what was discussed above, you can use the paging features to get the previous 25 posts, and the 25 before those, and so on. Then, because a page has only a finite number of posts, we can argue by induction that this process terminates, so QED, we can get all the post IDs.

Actually, there's quite a serious issue with the argument above that we haven't taken into account.

No, not the possibility that, in the several hundred milliseconds between making one Graph API call and the next, more than 25 new posts have been submitted.

Rather, it's the fact that the Graph API is really buggy. Once you page sufficiently far back (roughly 1000 posts, I found), Graph API will start missing posts, and only returning about 10% of them. This is a known issue, and can be fixed by querying the published_posts field instead of feed. Unfortunately, that requires a page access token which I obviously don't have.

Instead we can exploit the fact that every Oxfeud post starts with #Oxfeud_[n], for some unique natural number n. So, if you wanted to view the 1000th Oxfeud post, you could simply type #Oxfeud_1000 and Facebook's nifty hashtag search feature will display it as the first result.

From here the solution is simple. Just use Graph API's hashtag search function to search for posts by their Feud Number, and get everything you need from there, because of course Graph API will let you search for any post by hashtag, right? Oh...

Our old friend wget will have to do. We can just download the search results page for every n between 1 and the current number. Extracting the actual post IDs is a nice exercise in Regex.

On the other hand, we can actually get the last ~1000 posts using the Graph API. Using the query string
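My reconstruction of that query string, mirroring the per-post field list from earlier:

```
https://graph.facebook.com/[PAGE_ID]/feed?fields=reactions,comments{message,from,message_tags,created_time},message,created_time&access_token=[TOKEN]
```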


we get a nice response that we can straightforwardly parse into a Paging PostResponse object. If we like, we can call elab on it to fully page it, resulting in about 2000 PostResponses. Or, we can define a function elab' :: Int -> PageID -> Paging a -> FB [a] so that elab' n pages back n times, and take n to be around 10 or so.

Putting it all together

A program to read a list of post IDs from a text file and add the corresponding posts to the database could look like this:
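A sketch; the file name, the database name, and the helper names (runFB, insertPost) are assumptions:

```haskell
main :: IO ()
main = do
  token <- getToken
  conn  <- connect defaultConnectInfo { connectDatabase = "oxfeud" }
  ids   <- lines <$> readFile "postids.txt"
  mapM_ (\pid -> do
           post <- runFB (makePostFromFB pid) token
           insertPost conn post)
        ids
```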

We can also relax the type signature of pagingFromFB, and slightly modify it, to
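That is,

```haskell
pagingFromFB :: PageID -> FB (Paging PostResponse)
```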

(where PageID is just a synonym for String) so that it uses the query string above, and hence returns a FB (Paging PostResponse).

Then a program to update the last 1000 posts might look like:
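A sketch; myPageID, postFromResponse (here living in FB), insertPost, and the connection details are all assumptions:

```haskell
main :: IO ()
main = do
  token <- getToken
  conn  <- connect defaultConnectInfo { connectDatabase = "oxfeud" }
  posts <- runFB (do firstPage <- pagingFromFB myPageID
                     prs       <- elab' 10 myPageID firstPage
                     mapM postFromResponse prs)
                 token
  mapM_ (insertPost conn) posts
```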

Handling Exceptions

The next step is to handle failure. There are three main ways our program can fail:

  • We can fail to make a valid Req request
  • We can fail to parse a JSON object
  • We can fail to connect to/make a query on the database

Let's start with the FB type. Given that any call to Graph API can potentially fail, it would make sense to redefine it as newtype FB a = FB (Token -> IO (Maybe a)). For readability, we'll prefer to use the standard monad transformer, so define type MFB a = MaybeT FB a.

It's useful to make a helper function for promoting Maybe as to MFB as.
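With MaybeT from the transformers package, the helper (name assumed) is one line:

```haskell
import Control.Monad.Trans.Maybe (MaybeT(..))

liftMaybe :: Maybe a -> MFB a
liftMaybe = MaybeT . return
```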

Now we just go around and change every function that returns FB a to return MFB a. This is all pretty standard. One nice thing pops out from the fact that MFB a is an instance of Applicative: we can rewrite elab in a rather more natural way.
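One way the rewrite might look; fetchMFB :: FromJSON a => URL -> MFB (Paging a) is an assumed helper that fails (returns MaybeT's Nothing) when the request or the parse fails:

```haskell
elab :: FromJSON a => Paging a -> MFB [a]
elab p = case next p of
  Nothing  -> pure (things p)
  Just url -> (things p ++) <$> (fetchMFB url >>= elab)
```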