Web Scraping using JSoup – Getting Weekly Top Songs Project

In this article we will build a simple web scraper project using the JSoup library.

What is Web Scraping?

Web scraping is used to extract data from websites when that data is not available any other way. Many companies expose certain data through an API for developers who want to do some sort of analysis with it. For example, the Twitter and Instagram APIs provide access to pictures or posts on a particular topic or coming from a particular location. Sites such as Amazon or eBay, on the other hand, may not offer an API that covers what we need, such as monitoring the price of certain products. In such cases we can scrape their sites.

Legality of Scraping

Companies generally do not mind one-off scraping, but hitting their website every minute is a big no-no for them (only crawlers like Google's are welcome to do that, of course). So keep the number of hits to the website to a minimum and scrape responsibly.
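
One way to keep the number of hits down is to enforce a minimum gap between requests. The class below is a minimal sketch of that idea using only the Java standard library. The names RequestThrottle and acquire are hypothetical (they are not part of Jsoup), and the 200 ms gap is just an assumed value for the demo.

```java
/** Hypothetical helper (not part of Jsoup): spaces out successive requests. */
final class RequestThrottle {
    private final long minGapNanos;
    private long lastRequestNanos;
    private boolean firstCall = true;

    RequestThrottle(long minGapMillis) {
        if (minGapMillis < 0) {
            throw new IllegalArgumentException("minGapMillis must not be negative");
        }
        this.minGapNanos = minGapMillis * 1_000_000L;
    }

    /** Blocks until at least the minimum gap has passed since the previous call returned. */
    synchronized void acquire() {
        if (!firstCall) {
            long remaining;
            // Re-check after every sleep in case the thread woke up early.
            while ((remaining = minGapNanos - (System.nanoTime() - lastRequestNanos)) > 0) {
                try {
                    // Round up to whole milliseconds before sleeping.
                    Thread.sleep((remaining + 999_999L) / 1_000_000L);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    throw new IllegalStateException("Interrupted while waiting to send a request", e);
                }
            }
        }
        firstCall = false;
        lastRequestNanos = System.nanoTime();
    }

    public static void main(String[] args) {
        RequestThrottle throttle = new RequestThrottle(200); // assumed gap; pick one that suits the site
        for (int i = 1; i <= 3; i++) {
            throttle.acquire();
            // A real scraper would fetch a page here, e.g. with Jsoup.connect(url).get().
            System.out.println("request " + i + " allowed");
        }
    }
}
```

Calling acquire() right before every page fetch means that even a loop over many URLs cannot hit the site faster than the chosen gap allows.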

Weekly top songs – Project

We will get started by scraping the Saavn website to get a list of the weekly top songs. The data that we are looking for is at this URL: https://www.jiosaavn.com/featured/weekly-top-songs

On inspecting the page, these are the web elements we need in order to get the titles in the track list.

Inspect element of the saavn website

So we are looking for an ordered list with a class name of “track-list”, which contains the list items. Inside it we want each div element with a class name of song-json, which holds all the data for that song in JSON format. So let's write some code to parse the HTML.

Scraping using Jsoup API

We are using the Jsoup library, and we will go through what each line does.

Document doc = Jsoup.connect("https://www.jiosaavn.com/featured/weekly-top-songs").get();

Jsoup connects to the website, fetches the page, and parses its HTML into a Document object, which we can then query. Note that get() throws an IOException if the request fails.

Elements elements = doc.select("div.song-json");

We select every div element with the class name song-json. This returns one element per song, each holding the data for that song.

elements.forEach(element -> {
    // The text of each div is a JSON document describing one song
    JsonElement json = JsonParser.parseString(element.text());
    System.out.println(json.getAsJsonObject().get("title").getAsString());
});

Finally, we take the JSON text of each element and parse it with Gson's JsonParser, after which any field is available as a JSON property. In the above case we print the title of each song, and that is all.
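
Putting the steps together, here is a minimal sketch that runs without touching the network. It assumes Jsoup and Gson (2.8.6 or later, for JsonParser.parseString) are on the classpath. The class name SongTitleExtractor and the sample markup are made up for illustration; the markup only mimics the track-list structure described above and is not copied from the real page.

```java
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

final class SongTitleExtractor {

    // Made-up sample markup, standing in for the HTML that Jsoup.connect(...).get() would fetch.
    static final String SAMPLE_HTML =
            "<ol class=\"track-list\">"
            + "<li><div class=\"song-json\">{\"title\":\"First Song\",\"year\":\"2020\"}</div></li>"
            + "<li><div class=\"song-json\">{\"title\":\"Second Song\",\"year\":\"2019\"}</div></li>"
            + "</ol>";

    /** Returns the "title" property of every div.song-json block found in the given HTML. */
    static List<String> extractTitles(String html) {
        Document doc = Jsoup.parse(html);
        List<String> titles = new ArrayList<>();
        for (Element element : doc.select("div.song-json")) {
            // The text of each div is a JSON document describing one song.
            JsonObject song = JsonParser.parseString(element.text()).getAsJsonObject();
            if (song.has("title")) {
                titles.add(song.get("title").getAsString());
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        extractTitles(SAMPLE_HTML).forEach(System.out::println);
    }
}
```

Keeping the parsing in a method that takes an HTML string makes it easy to test against a saved copy of the page, so the live site only needs to be hit when fresh data is wanted.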

Conclusion

JSoup is meant for parsing a simple HTML DOM; it does not execute JavaScript and is not a substitute for a browser. For more complex operations, check out Selenium WebDriver. Once again, scrape responsibly!

GITHUB Link for the project

https://github.com/shahulbasha/WebScraping
