How to Scrape a List of Topics from a Subreddit Using Bash
An output file is generated in the same directory and its contents will look something like this:
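For illustration only (these rows are invented, not real posts), the tab-delimited output file has this shape:

```
Mildly interesting pebble	https://example.com/pebble.jpg	/r/MildlyInteresting/comments/abc123/mildly_interesting_pebble/
This stick looks like a spoon	https://example.com/stick.jpg	/r/MildlyInteresting/comments/def456/this_stick_looks_like_a_spoon/
```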
Let's dive into the
jq command we called:
The action begins when
curl is called with a custom header and the URL of the subreddit to scrape. The output is piped to
jq where it's parsed and reduced to three fields: Title, URL, and Permalink. These lines are read, one at a time, and saved into a variable using the read command, all inside of a while loop, which will continue until there are no more lines to read. The last line of the inner while block echoes the three fields, delimited by a tab character, and then pipes it through the
tr command so that the double-quotes can be stripped out. The output is then appended to a file.
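The read/echo/tr pipeline can be sketched in isolation. Here we feed it three hypothetical lines (quoted, as jq prints them) instead of live curl output:

```shell
# Hypothetical stand-in for jq's output: title, URL, permalink, one per line.
printf '%s\n' '"A mildly interesting rock"' '"https://example.com/rock.jpg"' '"/r/MildlyInteresting/comments/abc123/"' | \
    while read -r TITLE; do
        read -r URL
        read -r PERMALINK
        # Join the three fields with tabs and strip the double-quotes.
        echo -e "${TITLE}\t${URL}\t${PERMALINK}" | tr --delete '"'
    done
```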
And, lastly, execute the script with a subreddit name:
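For example, to scrape /r/MildlyInteresting (this assumes the script from the earlier section exists in the current directory):

```shell
./scrape-reddit.sh MildlyInteresting
```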
We're going to use
curl to retrieve the JSON feed from Reddit and
jq to parse the JSON data and extract the fields we want from the results. Install these two dependencies using
apt-get on Ubuntu and other Debian-based Linux distributions. On other Linux distributions, use your distribution's package management tool instead.
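On Ubuntu, that looks like:

```shell
sudo apt-get update
sudo apt-get install curl jq
```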
Each item in the array is an object that also contains two fields called kind and data. The properties we want to grab are in the data object.
jq expects an expression that can be applied to the input data and produces the desired output. It must describe the contents in terms of their hierarchy and membership in an array, as well as how the data should be transformed. Let's run the whole command again with the correct expression:
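The expression below selects the children array, iterates over it, and pulls out the three fields of interest:

```shell
curl -s -A "reddit scraper example" https://www.reddit.com/r/MildlyInteresting.json | \
    jq '.data.children | .[] | .data.title, .data.url, .data.permalink'
```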
There are lots of fields in the output data, but all we're interested in are Title, Permalink, and URL. You can see an exhaustive list of types and their fields on Reddit's API documentation page: https://github.com/reddit-archive/reddit/wiki/JSON
Reddit is a goldmine of interesting content and media, and it's all easily accessed through its JSON API. Now that you have a way to access this data and process the results, you can do things like:
Note the period that follows the command. This expression simply parses the input and prints it as-is. The output looks nicely formatted and color-coded:
All of this is possible using the data provided and the tools you have on your system. Happy hacking!
Reddit offers JSON feeds for every subreddit. Here is how to create a Bash script that downloads and parses a list of posts from any subreddit you like. This is just one thing you can do with Reddit's JSON feeds.
There are three expressions in this command separated by two pipe symbols. The results of each expression are passed to the next for further evaluation. The first expression filters out everything except the array of Reddit listings. This output is piped into the second expression and forced into an array. The third expression acts on each element in the array and extracts three properties. More information about
jq and its expression syntax can be found in jq's official manual.
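To see the three expressions in isolation, you can run them against a trimmed, hypothetical sample of Reddit's listing structure (the sample JSON below is invented for illustration):

```shell
# A trimmed, hypothetical sample of the listing structure.
SAMPLE='{"kind":"Listing","data":{"children":[{"kind":"t3","data":{"title":"A post","url":"https://example.com/img.jpg","permalink":"/r/MildlyInteresting/comments/abc123/a_post/"}}]}}'

# First expression: keep only the array of Reddit listings.
echo "$SAMPLE" | jq '.data.children'

# First two expressions: iterate over each element of that array.
echo "$SAMPLE" | jq '.data.children | .[]'

# All three: extract the three properties from each element.
echo "$SAMPLE" | jq '.data.children | .[] | .data.title, .data.url, .data.permalink'
```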
The output should fill up the terminal window and look something like this:
We want to extract Title, Permalink, and URL from the output data and save it to a tab-delimited file. We can use text processing tools like
grep , but we have another tool at our disposal that understands JSON data structures, called
jq . For our first attempt, let's use it to pretty-print and color-code the output. We'll use the same call as before, but this time, pipe the output through
jq and instruct it to parse and print the JSON data.
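The pipeline, reusing the curl call from earlier and the simplest possible jq program (a single period, meaning identity), is:

```shell
curl -s -A "reddit scraper example" https://www.reddit.com/r/MildlyInteresting.json | jq .
```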
The output shows Title, URL, and Permalink, each on their own line:
Open your editor and copy the contents of this snippet into a file called scrape-reddit.sh
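Here is a sketch of the script, reconstructed from the walkthrough in this article; the user agent string and variable names are our own choices. It is wrapped in a here-doc so you can also paste it straight into a terminal instead of an editor:

```shell
cat > scrape-reddit.sh <<'EOF'
#!/bin/bash

# Bail out if no subreddit name was supplied.
if [ -z "$1" ]; then
    echo "Please specify a subreddit to scrape."
    exit 1
fi

# The first argument is the subreddit; build a date-stamped output filename.
SUBREDDIT=$1
NOW=$(date +"%m_%d_%y-%H_%M")
OUTPUT_FILE="${SUBREDDIT}_${NOW}.txt"

# Fetch the feed, reduce it to three fields, then write tab-delimited lines.
curl -s -A "bash-scrape-topics" "https://www.reddit.com/r/${SUBREDDIT}.json" | \
    jq '.data.children | .[] | .data.title, .data.url, .data.permalink' | \
    while read -r TITLE; do
        read -r URL
        read -r PERMALINK
        # Tab-delimit the fields and strip jq's double-quotes.
        echo -e "${TITLE}\t${URL}\t${PERMALINK}" | tr --delete '"' >> "${OUTPUT_FILE}"
    done
EOF
```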
Let's put the API call and the JSON post-processing together in a script that will generate a file with the posts we want. We'll add support for fetching posts from any subreddit, not just /r/MildlyInteresting.
Let's examine the structure of the JSON data we get back from Reddit. The root result is an object that contains two properties: kind and data. The latter holds a property called
children , which includes an array of posts to this subreddit.
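Trimmed to just the parts named here, the shape of the response is something like this (ellipses stand for the many fields omitted):

```json
{
  "kind": "Listing",
  "data": {
    "children": [
      {
        "kind": "t3",
        "data": {
          "title": "…",
          "url": "…",
          "permalink": "…"
        }
      }
    ]
  }
}
```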
Each line contains the three fields we're after, separated using a tab character.
Before we can execute this script, we must ensure that it has been granted execute permissions. Use the
chmod command to apply these permissions to the file:
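Granting the owner execute permission looks like this:

```shell
chmod u+x scrape-reddit.sh
```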
Note the options used before the URL:
-s forces curl to run in silent mode so that we don't see any output, except the data from the Reddit servers. The next option and the parameter that follows,
-A "reddit scraper example" , sets a custom user agent string that helps Reddit identify the service accessing their data. The Reddit API servers apply rate limits based on the user agent string. Setting a custom value will cause Reddit to segment our rate limit away from other callers and reduce the chance that we get an HTTP 429 Rate Limit Exceeded error.
Let's see what the data feed looks like. Use
curl to fetch the latest posts from the MildlyInteresting subreddit:
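Using the silent flag and custom user agent discussed above, the call is:

```shell
curl -s -A "reddit scraper example" https://www.reddit.com/r/MildlyInteresting.json
```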
Next, it will store the first argument as the subreddit name, and build up a date-stamped filename where the output will be saved.
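In isolation, that step might look like this (the variable names and date format here are assumptions, not taken from a listing):

```shell
# Store the first argument and build a date-stamped output filename.
SUBREDDIT=$1
NOW=$(date +"%m_%d_%y-%H_%M")
OUTPUT_FILE="${SUBREDDIT}_${NOW}.txt"
```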
This script will first check if the user has supplied a subreddit name. If not, it exits with an error message and a non-zero return code.
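Recast as a small function so the check can be exercised on its own (the message text is our own wording, not the script's), that guard looks like:

```shell
# Validate that a subreddit name was supplied; returns non-zero otherwise.
require_subreddit() {
    if [ -z "$1" ]; then
        echo "Please specify a subreddit to scrape." >&2
        return 1
    fi
}

require_subreddit MildlyInteresting && echo "ok"
```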