Scraping Facebook posts with Haskell

Deprecated

Due to changes in the Facebook Graph API, most of this doesn't work anymore.

Introduction

The main back-end component of ~~Oxfeud.it~~(Now deprecated) is a Haskell program that grabs content from Facebook and puts it in a database. It makes use of the Facebook Graph Api to make requests that return JSON objects from which data is extracted and inserted into the DB.

It is used in two different ways:

Once, grab all posts (and associated data) from a page to initially populate the database.
Repeatedly, grab the last n posts (and associated data) to update the database.

The specific data we need to get is:

All the posts to a certain page. For each post we store
- The id, message, and time of posting
- All comments and reactions
For a comment we store
- The id, message, time of posting, and author
- All reply comments and reactions
- All tags
For a reaction we store
- The type (Like, Angry, etc), and author
We also store all users who have interacted with the page by
- Reacting to a post or comment
- Making a comment
- Being tagged in a comment

It's important that the program be:

Modular, so that we can get fetch content from arbitrary pages and do arbitrary things to it.
Reliable. If one part of the program fails, for example parsing one JSON object out of 20, the rest of the program should continue as best as it can.
Log what it does, and any failures that occur.
Easy to understand and extend.

Why Haskell?

There are a few reasons I decided to write this in Haskell.

The underlying general pattern of "Get some input, parse it into objects, then process the objects" lends itself well to Haskell. Haskell also handles recursive data types very cleanly, as we'll see in the Paging section. Finally, the Haskell development style of "Write a library with functions that manipulate objects in your problem space, and then use those to write a small executable that solves a specific problem" will suit this rapidly changing project nicely.

Types

A good place to start is to think of what types of objects you're going to deal with. We can define the types of Posts, Comments, Reactions and Users in the obvious way.

data Post = Post { postID :: Integer, postmessage :: String, posttime :: UTCTime, postcomments :: [Comment], postreactions :: [Reaction] } deriving Show

data Comment = Comment { commentid :: Integer, parentid :: Integer, commenttime::String, commentmessage :: String, commentauthor :: User, commenttags :: [User] } deriving Show

data User = User { userid :: Integer, username :: String } deriving Show

data Reaction = Reaction { reactiontype :: String, reactionauthor :: User } deriving Show

Facebook Graph API

If you have the ID for a post you can use the Facebook Graph API to get all sorts of information about it. An access token is also required for each query. The standard query we'll be making will be an HTTPS GET request on a URL that looks like https://graph.facebook.com/v2.10/[POST_ID]?fields=reactions%2Ccomments%7Bmessage%2Cfrom%2Cmessage_tags%2Ccreated_time%7D%2Cmessage%2Ccreated_time&access_token=[TOKEN]

This will return a JSON object that looks like the anonymized example below.

{
  "reactions": {
    "data": [
      {
        "id": "0123456789",
        "name": "John Doe"
        "type": "LIKE"
      }
    ],
    "paging": {
      "cursors": {
        "before": "TVRBd01EQXdORGd5TXpRM016ZAzJPakUxTVRZAd016TTBORFE2TWpVME1EazJNVFl4TXc9PQZDZD",
        "after": "TVRBd01EQTJNVEUwTURrd01UazNPakUxTVRZAd01qazVNakk2TnpnNE5qUTRNRE0zT1RFek16RXkZD"
      }
    }
  },
  "comments": {
    "data": [
      {
        "message": "Jane Doe oh. My. God.",
        "from": {
          "name": "John Doe",
          "id": "0123456789"
        },
        "message_tags": [
          {
            "id": "9876543210",
            "length": 16,
            "name": "Jane Doe",
            "offset": 0,
            "type": "user"
          }
        ],
        "created_time": "2018-01-15T13:42:10+0000",
        "id": "561927510810787_562062857463919"
      },
    ],
    "paging": {
      "cursors": {
        "before": "WTI5dGJXVnVkRjlqZAFhKemIzSTZAOVFl5TURZAeU9EVTNORFl6T1RFNU9qRTFNVFl3TWpNM016QT0ZD",
        "after": "WTI5dGJXVnVkRjlqZAFhKemIzSTZAOVFl5TVRVMk56QTBNVEl4TWpBeE9qRTFNVFl3TXpjME5EYz0ZD"
      }
    }
  },
  "message": "#Oxfeud_7254

To the girl outside cellar who stole my burger, that wasn’t ok.",
  "created_time": "2018-01-15T05:40:50+0000",
  "id": "409018456101694_561927510810787"
}

How easy! Everything we need is right there. Now how do we write a function that gets this for us?

Haskell has a few http libraries, but I like the simplicity of req. Looking at the documentation, we see that something like this will work:

{-# LANGUAGE OverloadedStrings#-}

import Network.HTTP.Req
import qualified Data.ByteString.Lazy as B
import qualified Data.Text as T

instance MonadHttp IO where
  handleHttpException = throwIO

options :: Token -> Option scheme
options token =  
  (mconcat [ "fields" =: ("reactions,comments{message,from,message_tags,created_time},message,created_time" :: String), "access_token" =: (token :: String) ] )

postFromFB :: Token -> PostID -> IO B.ByteString
postFromFB token p = do 
    r <- (req 
      GET 
      (https "graph.facebook.com" /: "v2.10" /: (T.pack p)) 
      NoReqBody 
      lbsResponse 
      options token)
  return (responseBody r :: B.ByteString)

Here Token and PostID are type synonyms for String.

Now all we have to do is use the Aeson library to parse this response into a Post.

Parsing

With the Aeson library, you can easily parse JSON into a Haskell object if you jsut define an instance of FromJSON for it. I found this guide very valuable.

Let's start by parsing a User, and building up to Post from there.


{-# LANGUAGE OverloadedStrings, DeriveGeneric, RecordWildCards #-}
import Data.Aeson
import Data.Aeson.Types
import qualified Data.HashMap.Strict as H
import qualified Data.Vector as V

instance FromJSON User where
  parseJSON = withObject "user'" $ \o -> do
    userid' <- o .: "id"
    username <- o.: "name"
    let userid = read userid'
    return User{..}

This is completely standard. Now, notice how a Reaction JSON object also has "from" and "id" fields, corresponding to the reacting user? We can actually parse a Reaction object by reading the reaction type, and getting the user by parsing the object again, this time as a user.

We don't even need to annotate the type of ParseJSON :: Value -> Parser a as Value -> Parser User, Haskell can infer the type by seeing that we bind it to reactionauthor, which it knows is of type User. Neat!

instance FromJSON Reaction where
  parseJSON = withObject "reaction" $ \o -> do
    reactiontype <- o.: "type"
    reactionauthor <- parseJSON (Object o)
    return Reaction{..}

We do something similar for parsing Comments. Not every Comment JSON object has a message_tags field (when the comment doesn't contain any tags the Facebook API doesn't even return a message_tags field), so we need to be careful. Thankfully, Aeson allows us to provide a default value if a key doesn't appear. Finally, we can parse the array of tags directly into a list of Users.

instance FromJSON Comment where
  parseJSON = withObject "comment" $ \o -> do
    cmessage <- (o .: "message")
    commentid' <- (o.: "id")
    from <- o .: "from"
    author <- parseJSON (Object from)
    commenttime <- o.: "created_time"
    ttags <- o .:? "message_tags" .!= V.empty 
    tags <- parseJSON (Array ttags)
    let commentid = read $ tail . dropWhile (/='_')$ cid' 
        parentid = read $ takeWhile (/='_') $ cid'
    return Comment{..}

Finally, we can parse a Post like this.


instance FromJSON PostResponse where
  parseJSON = withObject "response" $ \o -> do
    rs <- o .:? "reactions" .!= (H.empty :: H.HashMap T.Text Value)
    rd <- rs .:> "data" .!=  V.empty 
    postreactions <- parseJSON (Array rd)
    cs <- o .:? "comments" .!= (H.empty :: H.HashMap T.Text Value)
    cd <- cs .:? "data" .!= V.empty
    postcomments <- parseJSON (Array cd)  
    postmessage <- o .: "message"
    time_string <- o.: "created_time"
    postid' <- o.: "id"
    let postid = read $ takeWhile (/='_') postid'
        posttime = read time_string :: UTCTime
    return Post{..}

To tidy things up we can change the response type of our req request to a jsonResponse and actually parse the JSON immediately, like so:

postFromFB :: Token -> PostID -> IO Post
postFromFB token p = do 
    r <- (req 
      GET 
      (https "graph.facebook.com" /: "v2.10" /: (T.pack p)) 
      NoReqBody 
      jsonResponse 
      options token)
  return (responseBody r :: Post)

Or can we...

Paging

The Facebook Graph API, by default, only returns the most recent 25 things of each type. That is, when making a request on a post with 26 comments and 26 reactions, only the first 25 reactions will appear in the reactions.data field, and same for the comments. This limit can be increased to 100, but that's just kicking the can down the road. Given that many posts have more that 100 reactions and/or comments, how do we get all of them?

The answer is in the paging field. This (sometimes, if there are any extra things) contains a next field that you can put into your query string to get the next n things. This means we need to rethink our postFromFB function, because the response body of an API call is not a Post, but some top level Post information (the message and time), the first 25 Comments and possibly a pointer to some more Comments, and the first 25 Reactions and possibly a pointer to some more Reactions.

We can abstract this pattern of "Some things, and possibly instructions on how to get more things" as a new data type

data Paging a = Paging {content :: [a], next :: Maybe String}

and class,

class CanPage a where getNext :: Token -> PostID -> Paging a -> IO (Maybe (Paging a))

First we make an instance of FromJSON (Paging a) whenever we have an instance of FromJSON a:

instance (FromJSON a) => FromJSON (Paging a) where 
  parseJSON = withObject "" $ \o-> do
    adata <- o.: "data"
    content <- parseJSON (Array adata)
    pg <- o .:? "paging" .!= (H.empty :: H.HashMap T.Text Value)
    cr <- pg .:? "cursors" .!= (H.empty :: H.HashMap T.Text Value)
    next <- cr .:? "after"
    return Paging{..}

Then we make a type for initial responses:

--(id, message, time, comments, reactions)
data PostResponse = PostResponse (PostID, String, UTCTime, Paging Comment, Paging Reaction)

and an instance of FromJSON

instance FromJSON PostResponse where
  parseJSON = withObject "response" $ \o -> do
    r <- o .:? "reactions" .!= (H.empty :: H.HashMap T.Text Value)
    rs <- parseJSON (Object r)
    c <- o .:? "comments" .!= (H.empty :: H.HashMap T.Text Value)
    cs <- parseJSON (Object c)  
    message <- o .: "message"
    time_string <- o.: "created_time"
    ids <- o.: "id"
    return $  PostResponse (ids, message, time_string, cs, rs)

So the function postFromFB now has signature Token -> PostID -> PostResponse. We just need a way to elaborate a PostResponse into a Post.

We define some helper methods to make making API requests simpler.

jsonRequest ::  (FromJSON c) => Url scheme -> Option scheme ->  IO ( c)
jsonRequest url opts=  
    (req GET 
    url
    NoReqBody 
    jsonResponse
    opts ) >>= (\r ->   return (responseBody r )  )

options':: String -> Token -> String -> Option scheme
options' what token next = mconcat ["fields" =: ("message,message_tags,from,created_time,type,id,name"::String),"access_token" =: (token :: String),
              "limit" =: ("100" :: String),
              "after" =: (next :: String)] 

makePostURL :: PostID -> String -> Url 'Https
makePostURL p what = (https "graph.facebook.com" /: "v2.10" /: (T.pack p) /: (T.pack what) )  

pagingFromFB :: (FromJSON c)=> Token -> String -> String ->  PostID -> IO (Paging c)
pagingFromFB token what n  parent = do
  let url = makePostURL parent what
      opts = options' what token n
  jsonRequest url opts

Now we need to deal with paging comments and reactions.

nextPaging ::  (FromJSON a) => Token -> String -> PostID -> Paging a ->IO (Maybe (Paging a))
nextPaging token what postid (Paging{..}) = case next of 
    Nothing -> return Nothing
    Just n -> do
      p <- pagingFromFB token what n postid
      return $ Just p

instance CanPage Comment where
  getNext token  = nextPaging token "comments" 
    
instance CanPage Reaction where
  getNext token = nextPaging token "reactions"

So given a Paging a we can use nextPaging to get the next Paging a, if it exists. We just need some function to repeat this until there are no more to get, and collect all the as along the way.

elab :: (CanPage c) =>  Token -> PostID -> Paging c -> IO [c]
elab token postid (p@Paging{..}) = case next of
  Nothing -> return content
  Just n -> do
    p' <- getNext token postid p
    case p' of
      Nothing -> return content
      Just (p'') -> do
        cs <- elab token postid p''
        return $ content++cs

We finish up here by defining a function to make a Post from a PostResponse (and a Token).

makePost :: Token ->  PostResponse -> IO Post
makePost token (PostResponse(ids,m,t,pcs,prs)) = do
  let postmessage = m
      postID = ids 
      postnum = Nothing
      posttime = t 
      postrefs = [] 
  postcomments <- elab token postID pcs
  postreactions <- elab token postID prs
  return Post{..}
     
makePostFromFB :: Token -> PostID -> IO Post   
makePostFromFB token p  = postFromFB token p >>= makePost token

Cleaning up

One thing to notice is that nearly every function we've defined takes a Token and just passes it to another function. We can hide this by defining a new monad type FB a = Token -> IO a, so that a value of type FB a represents a computation that, given a Token, makes some calls to the Facebook API and returns an a.

newtype FB a = FB (Token -> IO a)

instance Monad FB where
  return  x  = FB (\t -> return x)
  FB xm >>= f  = FB (\t -> xm t >>= (\a ->  let FB ym = f a in ym t) )
instance Functor FB where
  fmap f (FB xm) = FB (\t -> xm t >>= (\a -> return $ f a))
instance Applicative FB where
  pure = return
  (FB xmm) <*> (FB ym) = FB (\t -> ym t >>= (\a -> xmm t >>= (\f -> return  $ f a)))
instance MonadIO FB where
  liftIO xm = FB( \t -> xm >>= \x -> return x)

runFB ::  FB a -> Token -> IO a
runFB (FB xm) token = xm token

Then we can change most of our functions to return FB a, and in main call them like:

main :: IO ()
main = do
  token <- getToken
  postid <- getID
  post <- runFB (makePostFromFB postid) token
  print post

(where getToken :: IO Token, getID :: IO PostID, makePostFromFB :: PostID -> FB Post.)

Database

Now that we can get any post we want, we need some place to put them all. A MYSQL database will do. We'll use the following schema, with all the obvious foreign key constraints.

Table	Field	Type
comments	id_comment	bigint
comments	id_parent	bigint
comments	id_author	bigint
comments	message	text
comments	time t	imestamp
posts	id_post	bigint
posts	message	text
posts	feudnum	int
posts	time t	imestamp
reactions	type v	archar
reactions	id_author	bigint
reactions	id_post	bigint
refs id	_src big	int
refs nu	m_dest int
tags i	d_comment b	igint
tags	tagged_user	bigint
users	id_user	bigint
users	name t	inytext

We're going to use the mysql-simple library for putting posts in the database.

The idea here is to construct query templates like myQuery = "INSERT INTO table(v1,v2) VALUES(?,?)", then execute them by calling execute connection myQuery (x1,x2).

Mostly we're just going to be INSERTing rows into a table. Sometimes we'll want to UPDATE, like if the content of a post or comment changes.

So, to insert the data from a comment we could write a function like:

insertComment :: Connection -> Comment -> IO ()
insertComment cx Comment{..} = catch 
  (execute cx "INSERT INTO comments..." ... )
  (\e -> let err = show (e :: SomeException) in (print err) >> catch 
    (execute cx "UPDATE comments..." ...)
    (\e' -> let err' = show (e' :: SomeException) in print err'))

This will try inserting, and if that fails print the exception and try updating, and if that fails too, print the exception and exit gracefully. It works, but it's not ideal.

There's a lot of code reuse inside insertComment, and given that we'll need something similar for inserting Posts, Reactions, etc, we should try and pull out all the boilerplate, because fundamentally, there's not a lot of difference between putting a Post or a Comment into the database, the only thing that really changes is what table it goes in, and what query you use.

So let's start by keeping track of all the tables we'll be accessing, and let's make an appropriate pair of INSERT and UPDATE queries for each of them.

data Table = Po | Co | Us | Re | Reac | Ta deriving Show

queries :: Table -> (Query, Query)
queries Po = ("INSERT INTO posts(id_post, feudnum, message, time) VALUES(?,?,?,?)", "UPDATE posts SET feudnum = ?, message = ?, time = ? WHERE id_post = ?")
queries Co = ("INSERT INTO comments(id_comment, id_author, message, time, id_parent) VALUES(?,?,?,?,?)", "UPDATE comments SET id_author=?, message=?, time=?,id_parent=? WHERE id_comment=?")
queries Us = ("INSERT INTO users(id_user,name) VALUES(?,?)", "UPDATE users SET name=? WHERE id_user=?")
queries Re = ("INSERT INTO refs(id_src,num_dest) VALUES(?,?)", "")
queries Reac = ("INSERT INTO reactions(id_post, type, id_author) VALUES(?,?,?)", "")
queries Ta = ("INSERT INTO tags(id_comment, tagged_user) VALUES(?,?)", "")

Now we can define a helper method to encapsulate the functionality of "try this, and if that fails try this instead":

tryBoth :: IO a -> IO a -> (SomeException -> IO a) -> IO a
tryBoth a1 a2 handler = catch 
  a1 
  (\e -> let e' = show (e :: SomeException) in catch a2 handler)

And now we can make a generic function to put things into the database.

put :: (QueryParams a, QueryParams b) => Table -> a -> b -> Connection -> IO ()
put t a b cx = let (ins, ups) = queries t in tryBoth
  (execute_ cx ins a)
  (execute_ cx ups b)
  (\e -> print $ "error" ++ show t ++ " " ++ show e)

Finally, we need a function that actually puts a post into the database. We proceed inductively, starting at the component types and building up to Post.

There is something to beware of. As we have foreign key constraints in the database, we need to be careful with the order in which we insert things. For example, since the table tags has a column id_comment with a foreign key constraint to the id_comment column in the comments table, we better not try to insert a comment's tags before we insert the comment itself.

Thankfully we can guarantee this quite easily. If xm, ym :: IO () then in the program

main :: IO ()
main = do
  xm
  ym

any side effects of xm will take happen before those of ym. Why is that the case? Well, the do notation is just syntactic sugar for

main = xm >>= const ym

We can think of the IO monad as a special instance of the State monad, defined as ~~~ {.haskell} type IO a = RealWorld -> (a, RealWorld)

return :: a -> IO a return x r = (x, r)

(>>=) :: IO a -> (a -> IO b) -> IO b (xm >>= f) r = let (y, r') = xm r in f y $ r' ~~~

So our program above simplifies to

main r = let (r', ()) = xm r in ym r'

And we see that because xm has to be evaulated before ym is.

Putting a user into the database is simple.

putUser :: User -> Connection -> IO ()
putUser (User{..}) = put Us (userid,username) (username,userid)

To insert a reaction, we put in the reacting user first and then the reaction itself.

putReaction :: String -> Reaction -> Connection -> IO ()
putReaction p (Reaction{..}) cx = do
  put Us (userid reactor, username reactor) (username reactor, userid reactor) cx
  put Reac(p, rtype, userid reactor) () cx

Inserting a comment is a little more involved. We insert the author, the comment itself, and then any tagged users.

putComment ::  Comment -> Connection -> IO ()
putComment (Comment{..}) cx = do
  putUser author cx
  put Co (commentid, userid author, commentmessage, commenttime, parentid) (userid author, commentmessage, commenttime, parentid, commentid) cx
  sequence_ $ map (\u -> (putUser u cx) >> (put Ta (commentid, userid u) () cx)) tags

And from here it's straightforward to define how to insert posts.

putPost :: ConnectInfo -> Post -> IO ()
putPost ci (Post{..}) = do
  cx <- connect ci
  sequence_ $ map (\c -> putComment c cx) postcomments
  sequence_ $ map (\r -> putReaction postid r cx) postreactions
  sequence_ $ map (\r -> put Re (postid, r) () cx) postrefs 
  put Po (postid, feudnum, postmessage, posttime) (feudnum, postmessage, posttime, postid) cx

Getting Post IDs

So, given a post ID number, we can fetch the data from Facebook and put it into a database. One might wonder, however, how to actually get the post IDs in the first place...

Well, if you make a Graph API query on a pages feed value, you get a list of the last 25 posts made. Then, in a slight modification from what was discussed above, you can use the paging features to get the 25 previous posts, and the previous 25 posts, and so on. Then, because a page has only a finite amount of posts we can argue by induction that this process terminates, so QED we can get all the post IDs.

Actually, there's quite a serious issue with the argument above that we haven't taken into account.

No, not the possibility that in the several hundred milliseconds between making one Graph API call and processing the 25 posts we get and the next one, more than 25 new posts have been submitted.

Rather, it's the fact that the Graph API is really buggy. Once you page sufficiently far back (roughly 1000 posts, I found), Graph API will start missing posts, and only returning about 10% of them. This is a known issue, and can be fixed by querying the published_posts field instead of feed. Unfortunately, that requires a page access token which I obviously don't have.

Instead we can exploit the fact that every Oxfeud post starts with #Oxfeud_[n], for some unique natural number n. So, if you wanted to view the 1000th Oxfeud post, you could simply type #Oxfeud_1000 and Facebook's nifty hashtag search feature will display it as the first result.

From here the solution is simple. Just use Graph API's hashtag search function to search for posts by their Feud Number, and get everything you need from there, because of course Graph API will let you search for any post by hashtag, right? Oh...

Our old friend wget will have to do. We can just download the search results page for every n between 1 and the current number. Extracting the actual post IDs is a nice exercise in Regex.

On the other hand, we can actually get the last ~1000 posts using the Graph API. Using the query string

[PAGEID]/?fields=feed{reactions,comments{message,message_tags,from,created_time},message,created_time}&access_token=[TOKEN]

we get a nice response that we can straightforwardly parse into a Paging PostResponse object. If we like, we can call elab on it to fully page it, resulting in about 2000 PostResponses. Or, we can define a function elab' :: Int -> PageID -> Paging a -> FB [a] so that elab' n pages back n times, and take n to be around 10 or so.

Putting it all together

A program to read a list of post IDs from a text file and add the corresponding posts to the database could look like this:

main :: IO ()
main = do
  postids' <- readFile "post_ids.txt"
  token <- readFile "token.txt"
  let ci = defaultConnectInfo
      postids = lines postids'
  posts <- runFB (sequence $ map makePostFromFB postids) token
  sequence_ $ map (putPost ci) posts

We can also relax the type signature of, and slightly modify the function pagingFromFB to

jsonFromFB :: (FromJSON c) => PageID -> FB c

(where PageID is just a synonym for String) so that it uses the query string above, and hence returns a FB (Paging PostResponse).

Then a program to update the last 1000 posts might look like:

main :: IO ()
main = do
  token <- readFile "token.txt"
  let ci = defaultConnectInfo
      pageid = "..."
  posts <- runFB (jsonFromFB pageid >>= elab 10 pageid >>= map makePost) token
  sequence_ $ map (putPost ci) posts

Handling Exceptions

The next step is to handle failure. There are three main ways our program can fail:

We can fail to make a valid Req request
We can fail to parse a JSON object
We can fail to connect to/make a query on the database

Let's start with the FB type. Given that any call to Graph API can potentially fail, it would make sense to redefine it as newtype FB a = FB (Token -> IO (Maybe a)). For readability, we'll prefer to use the standard monad transformer, so define type MFB a = MaybeT FB a.

It's useful to make a helper function for promoting Maybe as to MFB as.

liftMaybe :: Monad m => Maybe a -> MaybeT m a
liftMaybe = MaybeT . return

Now we just go around and change every function that returns FB a to return MFB a. This is all pretty standard. One nice thing pops out from the fact that MFB a is an instance of Applicative: we can rewrite elab in a rather more natural way.

elab :: CanPage c => PostID -> Paging c -> MFB [c]
elab postid (p@Paging{..}) = 
  (liftMaybe next >> getNext postid p >>= elab postid >>= return . (content++)) <|> (return content)