What is QMiner?
QMiner was my first Python program and my first project as a product developer at Future Focus Infotech (FFI). It was built to create a question bank for Interviewbot, a product that automates the technical interview process and performs the initial screening of candidates.
It searches Twitter and Google for interview questions, collects the resulting URLs, and scrapes each page to build a list of possible technical interview questions for programming languages such as Python, Java, and C#.
Tasks to be performed by QMiner
- Collect and extract the URLs of sites with interview questions from tweets and Google
- Scrape each URL, extract the questions, and build a database from them
1. Collect and Extract URLs
To collect tweets, I used Tweet Collector to download all tweets matching the search query “Interview questions”, and then used a regex to extract the URLs from them.
For extracting URLs from Google, I used BeautifulSoup, a Python library for pulling data out of HTML and XML files.
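A minimal sketch of both extraction steps follows. It is an illustration under assumptions, not the original QMiner code: the URL pattern is deliberately loose, the sample tweets and page are made up, and the link extraction uses the standard library's `html.parser` as a dependency-free stand-in for BeautifulSoup. The names `extract_urls`, `LinkCollector`, and `extract_links` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Loose pattern: an http(s) scheme followed by any run of non-space characters.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(tweets):
    """Return the unique URLs found in the tweet texts, in first-seen order."""
    seen = set()
    urls = []
    for text in tweets:
        for match in URL_PATTERN.findall(text):
            url = match.rstrip(".,;:!?)")  # drop trailing sentence punctuation
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


class LinkCollector(HTMLParser):
    """Collect absolute href values from <a> tags (stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.startswith(("http://", "https://")):
                self.links.append(href)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Made-up sample data for illustration only.
sample_tweets = [
    "Top Java interview questions: https://example.com/java-questions.",
    "Python interview questions http://example.org/python and more",
    "Repeat link https://example.com/java-questions",
]
sample_page = (
    '<a href="https://example.com/csharp-questions">C# questions</a>'
    '<a href="/settings">Settings</a>'
)

print(extract_urls(sample_tweets))
print(extract_links(sample_page))
```

Relative links such as `/settings` are skipped here because they usually point back into the search page itself rather than to a result site.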
2. Scraping sites
Scraping questions from pages was not as simple as I thought. Normally, when you scrape a site, you walk the tag structure of the page and extract information from the portions you want, using tags and their associated classes.
In this case, the scraper had to go through a list of hundreds of sites, each with its own way of structuring questions and answers, and the code had no way of knowing in advance whether a site contained questions at all. This complicated the problem, since we had to scrape all the content of a page and parse it to infer whether each piece was a question or not.
2.1 Advertisements ??
Another problem with scraping sites is advertisements, which contain a lot of junk data, so before scraping questions we have to get rid of the ads. Ads are mostly plugged into the site as scripts, so we remove all script and style tags and focus only on the body of the page.
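The clean-up step can be sketched roughly as below. This is an assumption-laden illustration rather than the original code: it uses the standard library's `html.parser` instead of BeautifulSoup, the class name `BodyTextExtractor` and the sample page are hypothetical, and it only handles ads that arrive as script or style content.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect text inside <body>, skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0   # > 0 while inside a script or style element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.in_body and self.skip_depth == 0 and text:
            self.chunks.append(text)


def visible_body_text(html):
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Made-up page with an ad script and inline styles.
sample_html = """<html><head><style>p {color: red}</style><title>Java</title></head>
<body><script>showAd('banner');</script>
<p>What is the JVM?</p><style>.ad {display: none}</style>
<p>It runs Java bytecode.</p></body></html>"""

print(visible_body_text(sample_html))
```

Everything in the head, including the title, is dropped along with the script and style content, so only the two paragraphs in the body survive.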
2.2 Identifying questions
The most human way of identifying questions was to scrape sentences that begin with words like “what”, “how”, “do”, “can”, “why”, “explain”, “does”, and “which”. This worked pretty well and caught most of the questions on the sites, but it also brought in a lot of junk. For example, it scraped sentences like “what are most important interview java questions”. At this point, the goal was to reduce the junk content as much as possible.
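As a rough sketch of this first-word heuristic (the function name `looks_like_question` and the sample sentences are illustrative, and the starter list contains only the words named above):

```python
import re

QUESTION_STARTERS = ("what", "how", "do", "can", "why", "explain", "does", "which")


def looks_like_question(sentence):
    """True if the first alphabetic word of the sentence is a known starter."""
    # Matching only letters skips leading counters such as "1)" or "12.".
    words = re.findall(r"[a-z]+", sentence.lower())
    return bool(words) and words[0] in QUESTION_STARTERS


candidates = [
    "1) What is difference between JDK and JRE?",
    "JVM is an acronym for Java Virtual Machine",
    "Explain garbage collection in Java.",
    "what are most important interview java questions",
]
questions = [s for s in candidates if looks_like_question(s)]
print(questions)
```

The last sample sentence passes the filter even though it is a page heading rather than an interview question, which is exactly the kind of junk described above.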
2.3 What if answers have questions ??
Not all sentences that begin with these words are questions. On most sites, the answers are so elaborate that they contain questions themselves. So the algorithm had to be fine-tuned to capture only the question portion of the text.
The best way we could think of was to find the tag structure of the first question on the site and scrape all other content with the same tag structure, thereby extracting only the question portion of the text. The problem is that the first question you encounter on a site may well be something like “Are you looking for interview programming question”. Hence, taking a question from the middle of the page captures the required tags more accurately.
Now that we have the tag structure of one question, we can use it to scrape the other questions from the site. This way we could build a DB of 1000+ questions.
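One way to sketch this idea is shown below, assuming that "tag structure" can be approximated by the pair (tag name, class attribute) of the element that directly holds the text. That approximation, the names `TaggedTextCollector` and `scrape_by_structure`, and the sample page are all assumptions for illustration; the standard library's `html.parser` stands in for BeautifulSoup.

```python
import re
from html.parser import HTMLParser

QUESTION_STARTERS = ("what", "how", "do", "can", "why", "explain", "does", "which")
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # never hold text


class TaggedTextCollector(HTMLParser):
    """Record ((tag, class), text) for every non-empty text chunk."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open elements as (tag, class) pairs
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.records.append((self.stack[-1], text))


def starts_like_question(text):
    words = re.findall(r"[a-z]+", text.lower())
    return bool(words) and words[0] in QUESTION_STARTERS


def scrape_by_structure(html):
    collector = TaggedTextCollector()
    collector.feed(html)
    collector.close()
    candidates = [sig for sig, text in collector.records if starts_like_question(text)]
    if not candidates:
        return []
    # Take the signature from the middle of the page, not from the first hit.
    signature = candidates[len(candidates) // 2]
    return [text for sig, text in collector.records if sig == signature]


# Made-up page: a junk heading, three questions, and answers in <p> tags.
sample_html = """
<h1 class="title">What are the most important Java interview questions?</h1>
<h3 class="q">What is the JVM?</h3><p>It runs bytecode. What does that mean? See below.</p>
<h3 class="q">Explain the JRE.</h3><p>The runtime environment.</p>
<h3 class="q">List the JDK tools.</h3><p>Compiler, debugger.</p>
"""

print(scrape_by_structure(sample_html))
```

In this sample the junk heading is excluded because its signature differs from the one found mid-page, and "List the JDK tools." is picked up even though it does not start with one of the question words.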
2.4 Categorizing questions
We went one step further with validation and checked whether a question belongs to a particular programming language. We came up with a list of keywords extracted from the documentation glossary of each programming language and cross-checked each question against this dictionary of words. This way we could categorize the questions by programming language.
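A small sketch of keyword-based categorization is given below. The keyword sets are tiny, hand-picked placeholders rather than real glossary extracts, and scoring by the number of matching keywords is an assumption about how the cross-check could work.

```python
import re

# Hypothetical keyword sets; real ones would come from each language's glossary.
LANGUAGE_KEYWORDS = {
    "Python": {"python", "decorator", "generator", "gil", "pep"},
    "Java": {"java", "jvm", "jre", "jdk", "bytecode"},
    "C#": {"c#", "clr", "linq", "delegate"},
}


def categorize(question):
    """Return the language whose keywords match the question most, or None."""
    words = set(re.findall(r"[a-z#+]+", question.lower()))
    scores = {lang: len(words & keywords) for lang, keywords in LANGUAGE_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(categorize("What is difference between JDK and JRE?"))
print(categorize("What is a decorator in Python?"))
print(categorize("How does LINQ work in C#?"))
print(categorize("Where do you see yourself in five years?"))
```

A question that matches no keyword returns `None`, which also makes this step usable as a validity filter for non-technical sentences.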
2.5 Extract answers
That is when we realized how hard it would be to write answers for all these questions manually. Hence we had to rework the script so that it captures the answers along with the questions.
Answers are usually the portion that follows the question, but this is not true in all cases. Answers are not always enclosed in a single div; they are often split across multiple tags. In some cases they are followed by a link pointing to a more elaborate answer, as in the following example:
<h3 class="h3">1) What is difference between JDK and JRE?</h3>
<h3 class="h4">JVM</h3>
<p>JVM is an acronym for Java Virtual Machine, it is an abstract machine which provides the runtime environment </p>
<p>JVMs are available for many hardware and software platforms </p>
<h3 class="h4">JRE</h3>
<p>JRE stands for Java Runtime Environment. </p>
<a href="difference-between-jdk-jre-and-jvm">more details...</a>
To extract the whole answer to a particular question, I had to scrape the content of all tags between the current question and the next question, excluding all anchor links, and add it to the DB.
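The sketch below illustrates this on a shortened version of the example above, with a second, made-up question added to show where an answer ends. It assumes the question's tag structure is already known as a (tag, class) pair, uses the standard library's `html.parser` in place of BeautifulSoup, and stores the result in an in-memory SQLite table whose name and columns are hypothetical.

```python
import sqlite3
from html.parser import HTMLParser


class QAExtractor(HTMLParser):
    """Pair each question with the text that follows it, skipping <a> content."""

    def __init__(self, question_signature):
        super().__init__()
        self.question_signature = question_signature  # (tag, class) of a question
        self.pairs = []          # [question text, [answer chunks]]
        self.in_question = False
        self.anchor_depth = 0    # > 0 while inside an anchor link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
        elif (tag, dict(attrs).get("class")) == self.question_signature:
            self.in_question = True
            self.pairs.append(["", []])

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1
        elif tag == self.question_signature[0]:
            self.in_question = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self.anchor_depth or not self.pairs:
            return
        if self.in_question:
            self.pairs[-1][0] += text
        else:
            self.pairs[-1][1].append(text)


def extract_qa(html, question_signature):
    parser = QAExtractor(question_signature)
    parser.feed(html)
    parser.close()
    return [(question, " ".join(chunks)) for question, chunks in parser.pairs]


sample_html = """
<h3 class="h3">1) What is difference between JDK and JRE?</h3>
<h3 class="h4">JVM</h3><p>JVM is an acronym for Java Virtual Machine.</p>
<h3 class="h4">JRE</h3><p>JRE stands for Java Runtime Environment.</p>
<a href="difference-between-jdk-jre-and-jvm">more details...</a>
<h3 class="h3">2) What is the JIT compiler?</h3>
<p>It compiles bytecode at run time.</p>
"""

qa_pairs = extract_qa(sample_html, ("h3", "h3"))

# Hypothetical schema: one row per question with its collected answer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE questions (question TEXT, answer TEXT)")
conn.executemany("INSERT INTO questions VALUES (?, ?)", qa_pairs)
rows = conn.execute("SELECT question, answer FROM questions ORDER BY rowid").fetchall()
for question, answer in rows:
    print(question, "->", answer)
```

Sub-headings such as `<h3 class="h4">` share the tag name with the question but not the class, so they end up in the answer text rather than starting a new question.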
Technologies used: Python, SQLite, BeautifulSoup