Sunday, November 20, 2016

Reddit API: The gateway to data

Earlier this week, I worked on a super simple project that uses the Reddit API to gather data on specific users on demand. You can find it here if you're especially curious. The project itself is a demonstration of how easy it is to generate user data. Basically, with just a username, it's trivial to profile which subreddits that user frequents, at what times, and what their posting habits are. This is a little troubling, and over the weekend I gave some serious consideration to the types of data mining you could do with the Reddit API's rather incredible power.

An important note to keep in mind while reading: while these are all theoretical implementations, if I can conceive of them, I guarantee you there are people who have already built them, be they individuals curious about the flow of data or companies trying to leverage the power of advertising. What we post on the internet is effectively public, and a lot can be extrapolated from patterns of behavior. The mere fact that Reddit uses a popularity system to determine the viewership of posts makes it an especially juicy target for large-scale data mining.

Example 1: Deep User Analysis

The tool I wrote to profile users is deliberately quite tame. It doesn't, for example, determine where a user is geographically located, what their gender is, or what age group they belong to. But those are all very easy to determine, unless the poster is exceedingly careful about the type of information they choose to share.

It is, for starters, trivially easy to pull the exact date of every post a user makes. With that, you can generally chart their timezone, or at the very least the hours during which they are awake. Further, you could trivially scan the content of all their posts for keywords, such as city names, to determine where they live, or brand names, to determine hobbies and interests. While this method would absolutely not be foolproof, with a little machine learning and some data smoothing, I predict you could make at least a 90% accurate guess at where a particular user lives, what their gender is, and what age group they belong to, assuming they had enough posts to go on.
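To give a sense of how little code this takes, here is a minimal sketch of the timestamp half of the idea, using only the Python standard library and Reddit's public JSON listings. The endpoint shape and the `profile-demo` user agent string are my assumptions for illustration, not part of the original tool.

```python
import json
import urllib.request
from collections import Counter
from datetime import datetime, timezone

def fetch_recent_comments(username, limit=100):
    """Fetch a user's recent comments via Reddit's public JSON listing.

    Reddit expects a descriptive User-Agent; the one below is a placeholder.
    """
    url = f"https://www.reddit.com/user/{username}/comments.json?limit={limit}"
    req = urllib.request.Request(url, headers={"User-Agent": "profile-demo/0.1"})
    with urllib.request.urlopen(req) as resp:
        listing = json.load(resp)
    return [child["data"] for child in listing["data"]["children"]]

def active_hours(created_utcs):
    """Bucket post timestamps (Unix epoch seconds) by UTC hour.

    The quiet stretch in this histogram is a rough proxy for when the
    user sleeps, which in turn hints at their timezone.
    """
    return Counter(datetime.fromtimestamp(t, tz=timezone.utc).hour
                   for t in created_utcs)
```

Chaining the two, `active_hours([c["created_utc"] for c in fetch_recent_comments("someuser")])` gives you an activity-by-hour profile in a dozen lines.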

But it doesn't stop there. While it's easy to figure out the status of a single user, that by itself seems impractical; after all, expending all this effort to profile one user is, frankly, a waste of time. Which brings us to my second point.

Example 2: Subreddit Profiling


A super simple application of Deep User Analysis is to profile the population of a subreddit. Say, for example, that a makeup company wishes to see how well their product does in /r/MakeupAddiction, a subreddit focused on makeup. It would be trivial for them to monitor for new posts containing their product name: simply run a search targeted at that subreddit, sorted by new and filtered on the specific keyword. That could be done with even a small server running on a Raspberry Pi, checked manually every day. But, assuming this company can spend a little more money, they could go beyond just gathering data on direct mentions in posts.

A sample implementation would go like this. Set up a bot to check whenever anything new is posted to /r/MakeupAddiction. Track each post for 48 hours in a database; after 48 hours, once it has dropped from the rankings, measure its popularity. For posts that pass a certain threshold, run a Deep User Analysis on each participating user, and check the thread for positive and negative reactions using some simple NLP (or, if you'd rather not deal with Markov chains, have an intern go over posts that meet parameters such as mentioning your company or a competing one). Suddenly you have a vast swathe of data: how your product is perceived, who uses it, what demographics they fall into. All it cost you was one skilled developer and some server time. Maybe an intern too, if your developer is busy with other projects.
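The bookkeeping half of that pipeline, tracking each post for 48 hours before scoring it, is a small SQLite table and three functions. This is a skeleton under my own assumptions (table layout, function names); the scoring and analysis steps would hang off `due_for_scoring`.

```python
import sqlite3
import time

TRACK_SECONDS = 48 * 3600  # follow each post for 48 hours

def init_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS posts (
        post_id    TEXT PRIMARY KEY,
        first_seen INTEGER,            -- Unix time we first saw the post
        scored     INTEGER DEFAULT 0   -- 1 once popularity has been measured
    )""")
    return conn

def track(conn, post_id, now=None):
    """Start tracking a newly seen post (ignores posts already tracked)."""
    now = int(time.time()) if now is None else now
    conn.execute("INSERT OR IGNORE INTO posts (post_id, first_seen) VALUES (?, ?)",
                 (post_id, now))

def due_for_scoring(conn, now=None):
    """Posts whose 48-hour window has elapsed and that haven't been scored."""
    now = int(time.time()) if now is None else now
    rows = conn.execute(
        "SELECT post_id FROM posts WHERE scored = 0 AND first_seen <= ?",
        (now - TRACK_SECONDS,)).fetchall()
    return [r[0] for r in rows]

def mark_scored(conn, post_id):
    conn.execute("UPDATE posts SET scored = 1 WHERE post_id = ?", (post_id,))
```

A loop that feeds new posts into `track` and periodically drains `due_for_scoring` is all the orchestration this needs.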

If you're particularly nefarious and unprincipled, you can then task the selfsame intern to do some astroturfing. At the very least your company has a better handle on who uses their products, and can definitively measure the effect of targeted advertising. 

Example 3: Influencing Reddit


This goes into the unethical, but completely possible. The Reddit API allows for remote management of user accounts. As we just showed, it's trivial to implement monitoring of new posts by keyword. What does this mean? It means it's relatively trivial to program a downvote bot. In fact, with maybe 20 bots and a little creativity, you could dominate the discourse of a particular subreddit, especially one with a small moderation team. I'm sure it's something Reddit admins have controls in place for, but realistically, it would be very difficult to distinguish between legitimate uses of the API and malicious uses like this.

The API rules say: "Note: votes must be cast by humans. That is, API clients proxying a human's action one-for-one are OK, but bots deciding how to vote on content or amplifying a human's vote are not. See the reddit rules for more details on what constitutes vote cheating." I can't see this stopping anyone with money on the line, as changing your IP and getting a new API key is as easy as paying 15 dollars a month for a VPN and having an intern generate new accounts, if you can't find your way around a CAPTCHA.

The bottom line is that, given the ease of implementing a vote manipulation bot, a sufficiently unprincipled group could easily reduce the visibility of posts by the competition.

Conclusion

These are only some of the potential uses of the Reddit API: a simple, logical chain of things that can be done with only basic functionality. This is without diving deep into some of the more convoluted ways you might manipulate Reddit, or gather data on it, if you had enough resources (for example, a government or major corporation). There are *some* limitations to the Reddit API, like the 6000 calls per minute limit and the fact that it only returns the last 1000 contributions by author, but it's important to realize that despite these limitations, the Reddit API holds incredible power.

I would never suggest we quit using Reddit altogether, or even that we obfuscate what we post. This is simply a reminder that yes, what you post is public, and with such powerful tools and big payouts, people will use that data. This is not a problem exclusive to Reddit. Any place where data is published in a publicly accessible manner can, and will, be data mined.
