Musings On Code

Monday, January 23, 2017

Intermediate Algorithms: Directed Graphs

Today's problem is a relatively "simple" directed graph problem. In this problem we just want to find if there exists a route between two given nodes. Now, if you remember, a directed graph is one where the edges have *direction*, that is, nodes will not necessarily point back to the nodes that you traversed to get to them.

An example of a directed graph

Now this problem is all about the edge cases. For example, we need to account for internal loops, as well as for running into dead ends. Now for the purposes of this problem, we're going to assume we're allowed to modify the nodes as we visit them. This is really a matter of convenience, if we could not alter the nodes, we would just have to modify our approach slightly.

At a high level, what we're going to do is build a BFS, utilizing a queue, and keep track of visited nodes by giving them a visited property. We could achieve that effect by keeping a list of visited nodes but it would affect our performance time by requiring us to iterate over it each time we visited a node.

To start, we'll begin at one of the nodes, and traverse the graph until we either find the node we're looking for, or run out of nodes to visit.

Lets look at what the code for this would look like, in javascript:

The algorithm will perform in O(n) time as it must theoretically traverse every node in a worst case scenario, but it has a memory complexity of O(1) as we're not storing anything other than state variables, this changes to O(n^2) and O(n) if we can't modify the nodes as we visit them.

If you wrote an algorithm and would like to test it, here's a super simple graph. 1 and 3 are connected, 3 and 1 are not.

Sunday, January 15, 2017

Intermediate Algorithms: Balanced Binary Beast

Algorithms continue to be a fun little pastime that is vital to people interviewing in the CS industry. We've covered some simple array algorithms, so its time for something a little more... interesting, my favorite data structure, trees. We're going to start with generating a tree. Not just any tree though, a balanced tree.

This exercise should take about 20 minutes.

So here we go: given a sorted array of numbers for example [1,3,5,8, 120], or [1,2,3,4,5,6], make a balanced binary tree. A balanced tree is one where the minimum depth and the maximum depth do not have a difference greater than one, or in other words where no "branches" are deeper than any other branches by one level. A binary tree is one in which each node has 0, 1 or 2 children. Unlike in this picture, our binary tree will be organized in such a way that the left child of each node will be greater than its parent, and the right child will be greater, making it useful for purposes of searching.

For the purposes of this exercise we'll be using javascript, as its more bare bones than ruby and syntactically closer to bare metal languages, you could easily however implement this in ruby as well. Of note is the fact that this algorithm cannot exceed O(n) time as we must iterate over the entire original array. An optimal solution then will have O(n) time.

Lets think through a solution: The first parameter to consider is our input, which is a sorted list. By default we want our root to be the middle value of this sorted list, since this way we can just grab a value to right to be the first child, and a value to the left to be the second. Not just any value, because we want the children of the children to follow the same pattern as the root we want to grab the middle of the list. Essentially we'll split the list in half with and then split the list in half again... does this smell like recursion? It does! So we'll make a node creation queue recursively, and then just go through it and make the nodes. Seems simple enough.

Now here's the code that makes the magic happen:

Not so bad is it? The bottom line, as long as you think through an algorithm, you can figure out a simple solution to the problem!

Sunday, December 11, 2016

Basic Algorithms: More Arrays

In simple terms the question we are asking is "Given an unsorted array of elements, return all elements where the value of the element is greater than half of the values in the array."

So for example, given an array of the elements [7,3,5,8], return 7 and 8. In the case of an odd array, do not return the tied value, so an array such as [11,4,7,9,2], you would return 9 and 11, with 7 being the middle value.

Just thinking about it, the immediate answer that springs to mind is to take each value of the array and iterate over each other value of the array and count how many of them are smaller than the current number, if your count of numbers smaller than the current number is greater than half the length of the array, you can just add that number to the array you're going to return. That solution would look something like this

Except... that's not a very good solution. It'll run in O(n^2) time. Yuck. We don't like n^2 solutions, because they're very time consuming. Imagine if your array had a million numbers? It would take minutes to solve!

A better approach might be to consider this, the biggest half of elements would just be the first half of a sorted array! Thus if we want to make this problem trivial, all we need to do is come up with an efficient way of sorting the array, and then take half of it! I'm personally a big fan of the merge sort, but you would get the same result with a pivot sort or a quick sort. Simply sort the array, and then take the left half of it. The solution might look something like this:

This is a much better solution, with a runtime of O(nlogn). A better sort would give us a shorter time.

Some basic benchmarking, with an array with a ten thousand elements, the naive solution takes: 10 seconds, while the non-naive solution takes: 0.03. Quite a difference! This becomes more obvious in the 100,000 element array, where the naive solution takes: 834 seconds, almost 15 minutes, while the optimal solution takes 0.4 seconds!

The core takeaway of this algorithm is to always consider sorting unordered arrays when given a problem that deals with comparing values.

Sunday, November 27, 2016

Finding Challenges

This blog post is going to be a little less technical than I would like, but it touches on a topic that I think every aspiring developer should think on, hard an often. That question is "what should I build?". Its a difficult question, because as you acquire new skills its easy to find a million applications that would be too difficult to implement, or another half a million which are too easy. Googling for "problems for aspiring developers" or even "practice problems" isn't very helpful. Having gone though this extensively, I have found out a number of approaches. The key is to find which, if any, work for you.

Coding Challenges

This is a typically offered solution. There are many websites out there like leetcode, project euler, hacker rank, interviewcake and codeeval that offer what they usually refer to as coding interview prep questions, and if your main goal is to land a job, you will probably learn a lot of specifically useful things, as well as how to think of specific problems to find clever solutions. There are a lot of problems, and its unlikely you'll transcend what these sites have to offer for a long time. The focus on sticking to the core of your language, with no fancy gems or extraneous tools can really serve to ground you in reality and teach you the finer details of your core skills.

These benefits come with a few drawbacks. Its hard to showcase on a resume, and while some sites, such as hacker rank offer a score you can show, unless its truly impressive its probably not something an interviewer is going to care about excessively. It also doesn't really teach you what building big, full fleshed projects feels like, and leaves you stranded in the land of your specific language, rather than bringing in interesting frameworks, and new tools.

If you like puzzles though, Coding challenges will provide small bite sized doses that provide a tangible feeling of progress, and that's important.

Build for yourself

We all have it, the dream project we all have wanted to build, maybe its something that's been built and implemented often, like Conway's Game of Life, or a simple maze program. Or maybe its more ambitious, like a fully fledged simulation of the behavior of twitter users, or a method to get push notifications whenever someone posts on a forum you like. Whatever it is, take this project and break it down into its component parts. Figure out the details, break down what you need to learn how to do, and what you know what to do. Then build it. Start with the smallest parts, and learn what you don't know. An evening of messing around with a database can teach you more than reading a few thousand words on it, and you won't know how to properly build routes until someone comes along and breaks the website you so cleverly crafted. This will teach you many things, be it deployment, or database usage, as well as training you on the number one thing every programmer needs to know. How to learn. You will break things. Things will not work, but you will learn.

This comes with a few drawbacks as well. A lot of what you'll learn is not immediately applicable. You'll run into paywalls and insurmountable problems. Getting a product you're satisfied with takes time. Your github account may wind up littered with the corpses of half finished programs, perhaps discouraging interviewers who don't understand the walls and limitations you've hit. Your raspeberry pi will catch fire when you screw something up badly enough. Interviewers won't care what you've built if you can't figure out reversing a linked list.

Most people who learn to code do so out of a desire to build. If building is your thing, then *build*, you'll be the better for it.

Build for others

I once read on a stack overflow question, that the best way to learn was to go out there and find problems people needed solving, and to figure out how to solve them. Its taken me some time and a growth in my skills to realize that even a very junior coder has skills very few people have and that problems that seem very difficult can be almost trivial with the right tools. When you're done building all your side projects, or when they've grown burdensome, or if you simply don't have anything you really want to build for *you*, go out and find problems that need solving. Earlier this week, i discovered that a certain forum I like has no good working mobile app, and no push notifications. I set out to try to figure out solutions to these problems. Will I build a killer app? Probably not, but I might use twilio to replicate push notifications for interested users. This has many of the same benefits as building for yourself, with the added benefit of teaching you what interacting with real users and their nebulous requirements is actually like

This is not free of burdens. Some problems haven't been solved because they are very hard to solve, or because the owner of the product doesn't wish to fix the issues. Sometimes you'll run into the real issue of hosting and maintaining a product other people use, and problems of ethics. You may find that what you've built relies on others, and doesn't last, consigning your hard work to obscurity and uselessness

The bottom line, is that with some basic coding skills, you can tackle problems others find impossible. By doing so, you can learn more, and have actual deployed projects to show interviewers.

Conclusion

This is not an exclusive list by any means, just a resource I wish I'd had when I started googling around for answers to this question. Some or all of these methods may not work for you, they're simply what work for me. At the very least, you should be asking "what can I do?", and if you are, you're already on the right track.

Sunday, November 20, 2016

Reddit API: The gateway to data

Earlier this week, I worked on a super simple project to use the Reddit API to gather data on specific users on demand. You can find it here if you're especially curious. The project itself is a demonstration of how easy it is to generate user data. Basically, with an username, its trivial to profile what subreddits that user frequents, at what times, and what their posting habits are. This is a little troubling, and over the weekend I gave some serious consideration over the types of data mining you could do with the Reddit API's rather incredible power.

An important note to consider when reading this. While these are all theoretical implementations, if I can conceive of them, I guarantee you, there are people who have already implemented these, be they individuals who are curious about the flow of data, or companies trying to leverage the power of advertising. What we post on the internet is public domain, and a lot can be extrapolated from patterns of behavior. The mere fact that Reddit uses a popularity system to determine the viewership of posts makes it an especially juicy target for large scale data mining.

Example 1: Deep User Analysis

The tool I wrote to profile users is deliberately quite tame. It doesn't for example, see where an user is geographically located. Or what their gender is. Or what age group they belong to. But those are all very easy to determine, unless the poster is exceedingly careful about the type of information they choose to share.

It is, for starters, trivially easy to pull out the exact date of every post an user makes. With that, you can generally chart their timezone, or at the very least, times in which they are up. Further, you could trivially check the content of all their posts for key words, such as cities, to determine where they live, or brands, to determine hobbies and interests. While this method would absolutely not be fool proof, with a little machine learning and some data smoothing, I predict you could get at least a 90% guess on where a particular user lives, what their gender is, and what age group they belong to, assuming they had enough posts to go on.

But it doesn't stop there. While its easy to figure out the status of an user, that by itself seems impractical, after all, expending all this effort to profile a single user is, frankly a waste of time. Which brings us to my second point

Example 2: Subreddit Profiling

A super simple application of Deep User Analysis is to profile the population of a subreddit. Let say, for example, that a make up company wishes to see how well their product does in /r/makeupaddiction, a subreddit focused on makeup. It would be trivial for them to monitor for new posts containing their product name, for example, by simply organizing a search targeted at that subreddit sorting by new and containing the specific keyword. It would look something like this. That would be trivial to do, with even a small server running on a rasberry pi, that could be checked manually every day. But, assuming this company can spend a little more money, they could upgrade, beyond just gathering data of direct mentions in posts.

A sample implementation would go like this. Set up a bot to check whenever anything new is posted to r/makeup addiction. Keep track of that post for 48 hours via a database. After 48 hours it'll have dropped from the rankings, measure its popularity. For posts that pass a certain threshold, run a Deep User Analysis on each user, further checking the thread for positive and negative reactions based on some simple NLP (or, if you'd rather not deal with Markov chains, get an intern to go over posts that meet parameters such as mentioning your company or a competing company.). Suddenly you have a vast swathe of data. How your product is perceived, who uses it, what demographics they fall into. All it cost you was one skilled developer and some server time. Maybe an intern too, if your developer is busy with other projects.

If you're particularly nefarious and unprincipled, you can then task the selfsame intern to do some astroturfing. At the very least your company has a better handle on who uses their products, and can definitively measure the effect of targeted advertising.

Example 3: Influencing Reddit

This goes into the unethical, but completely possible. Reddit API allows for remote management of user accounts. As we just showed, its trivial to implement monitoring of new posts by keyword. What does this mean? It means that its relatively trivial to program a downvote bot. In fact, with maybe 20 bots, and a little creativity, you could dominate the discourse of a particular subreddit, especially one with a small moderation team. I'm sure its something reddit admins have controls in place for, but realistically, it would be very difficult to distinguish between legitimate uses of the API and uses that allow users to be malicious like this.

The api rules say "Note: votes must be cast by humans. That is, API clients proxying a human's action one-for-one are OK, but bots deciding how to vote on content or amplifying a human's vote are not. See the reddit rules for more details on what constitutes vote cheating.". I can't see this stopping anyone with money on the line, as changing your IP and getting a new API key is as easy as paying 15 dollars a month for a VPN, and having an intern to generate new accounts, if you can't find your way around a capcha.

The bottom line is given the ease of implementing a vote manipulation bot, a sufficiently unprincipled group could easily reduce the visibility of posts by the competition.

Conclusion

These are only some of the potential uses of the Reddit API, a simple, logical chain of things that can be done with only basic functionality. This is without diving deep into some of the more convoluted ways you might manipulate Reddit, or gather data on it, if you had enough resources (like, for example a government or major corporation). There are *some* limitations to the Reddit API, like the 6000 calls per minute limit, and the fact that it only returns the last 1000 contributions by author, but its important to realize that despite these limitations, the Reddit API holds incredible power.

I would never suggest we quit using Reddit altogether, or even that we obfuscate what we post, this is simply a reminder that yes, what you post is public, and that with such powerful tools and big payouts, people will use that data. This is not a problem exclusive to Reddit. Any place where data is published in a publicly accessible manner can, and will, be data mined.

Sunday, November 13, 2016

Basic Algorithms: The Duplicate Number

As part of learning to code, it is important to learn how to think about certain problems, specifically simple algorithm problems.

The Problem

You have an array of numbers from 1 to n-1, one of the numbers is duplicated, meaning the array has a length of n. For example

[1,2,3,3,4,5,6,7], where the duplicates is 3.

The array is not necessarily sorted. Find and return the duplicated number.

Another sample array might be

[4,2,3,2,1,5,6]. with 2 being a duplicate

The Naive Solution

A naive solution, that is, one not optimized for memory or computational savings, would likely do something along the lines of,

Create a comparison array, which is a duplicate of the first array
Delete the first number from the comparison array.
Compare the first number of the main array to each value in the comparison array
If it matches any numbers in the comparison array, it is our duplicate, return it
If it doesn't, delete that number from the first array.
Repeat for each number until you have the duplicate

This is obviously not terribly efficient, with a space complexity of O(n), as it requires a new array to be created (two if you want to back up the main array), and a time complexity of O(n^2) as you iterate over a number of arrays. It is not a good solution, but it would work.

The code would look something like this

The Optimized Solution

This can actually be solved with a time complexity of O(n) and a space complexity of O(1). Given that the array has a known length, n, and that the digits only go up to n-1, we can figure out what we would expect the sum of an array of no duplicates to look like, and then subtract the sum of the original array from the theoretical array. This requires iterating over the first array only once to sum it.

In testing, using an array of 1,000,000, digits the optimal solution took about 0.1 second to run, (on an average of 10 trials) while the naive version took 45 seconds. The 10,000,000 trial took one second in the optimized version, and 414(!) seconds (almost 7 minutes) in the naive version. In the 100,000,000 number trial, the optimized version took roughly 10 seconds. I did not run the naive version as the projected time to run would be 414 seconds squared, which is roughly two days! This should highlight the importance of efficient search algorithms in managing large data sets.

The important takeaway here is that, by thinking of the constraints of the problem, it is possible to arrive at a better solution. With algorithm questions, the elegant solutions most often rely on finding the way the limitations of the data set can be leveraged.

Sunday, November 6, 2016

Ruby 101: Objects and Method Calls

One of the biggest conceptual jumps between being a "for fun" coder that works on small side projects, and being someone who codes frequently and tackles difficult projects, is really understanding what goes on "under the hood" so to speak.

The most important thing to get, and something that takes a while to get a handle on at first, is that in ruby, everything is an object. Everything. Not just the thing you made a class out of, but everything you use. When you define a variable, that variable points to an object. When you create an array, that array is an object.

More importantly, every object belongs to a class. Read that again. Every object belongs to a class. If you go far up the chain, every object actually inherits from the Object class, which in turn inherits from the BasicObject class. So really, when we say "everything in ruby is an object" we mean, "everything in ruby inherits from the Object class".

This is super easy to test. Just go into irb, and make an object. Any object. I picked an Array, but a string or a fixnum or even a class you made yourself will do. Try typing the name of the object and then class. As so

[].class

=> Array

Then, add .superclass on top of the class method call.

[].class.superclass

=> Object

Do it one more time

[].class.superclass.superclass

=> BasicObject

This is as far down as the chain goes, if you do it one more time, you get "nil", which means there's no super super super class for array.

What does this mean? It means that if you use an Array you can use all the methods contained in both Object and BasicObject. Object contains many useful methods, like for example, class, which we just used. Which brings us to the next point of discussion, method calls. A method call is quite simply when you call an operation on an object (anything in ruby!) by using a period and then the command. What's interesting is what happens when you call a method on an object.

First ruby checks the class of the object in question and scans from top to bottom looking for the method in question. If it finds it, it runs the last instance of it found, if not it goes into whatever class we inherited from, all the way up to BasicObject. If it finds nothing, it returns an error!

Were you to run SampleClass.new.sample_method, you'd simply get an output of "this method is called", as opposed to "this method is not called" because ruby will always take the last definition. Similarly, if you made a class called say "SonClass" that inherited from "SampleClass" and tried to run sample_method on it, it would just grab it from SampleClass, unless that is, you defined sample method in SampleClass...

Its these properties that make ruby so modular, since you can open up Array and edit its internal functions, or even add functions that can be used by anyone calling on an Array...