Site crawler for a site (Quora) in Python.

0 votes
623 views

I am writing a crawler in Python to crawl Quora. I can't read Quora's content without logging in, yet Google and Bing crawl it. One option is to use browser automation to log in to my account and then follow the links one by one and crawl the content, but this method is slow. Can anyone tell me how I should start writing this crawler?

posted Aug 3, 2013 by Salil Agrawal


1 Answer

+1 vote

You start with reading the page: http://www.quora.com/about/tos

which you agreed to when you created your account with them. At one place it seems pretty clear that unless you make specific arrangements with Quora, you're limited to using their API.

I suspect that they bend over backwards to get Google and the other big names to index their stuff. But that doesn't make it legal for you to do the same.

In particular, the section labeled "Rules" places constraints on automated crawling, and so do other parts of the TOS. Crawling is permissible, but not scraping. What's that mean? I dunno. Perhaps scraping is what you're describing above as "method is slow."
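
Separately from the TOS, a crawler can at least check a site's robots.txt before fetching a page. The sketch below uses Python's standard urllib.robotparser. The robots.txt content, the "mybot" user agent, and the example.com URLs are made up for illustration and are not Quora's actual rules. Passing this check says nothing about whether the TOS allows what you are doing.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/public/page",
                "https://example.com/private/page"):
        print(url, is_allowed(EXAMPLE_ROBOTS_TXT, "mybot", url))
```

For a live site you would call set_url() and read() on the parser instead of parse(), which downloads the real robots.txt.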

answer Aug 3, 2013 by Deepak Dasgupta
Similar Questions
0 votes

Is there a way to block .php files from being indexed by crawlers, but allow other file types to be indexed? When the crawlers access the .php files, they are executed, creating lots of error messages (and taking up CPU cycles).

–1 vote

I'm working on a new project. I want to receive a request from a user and redirect them to a third-party site, but on the page they land on after the redirect I want them to see injected HTML (on the third-party site).

I'm not really sure how to approach this problem.

+1 vote

I want to find the maximal number of elements contained in a nested dictionary, e.g.

data = {
    'violations': {
        'col1': {'err': [elem1, elem2, elem3]},
        'col2': {'err': [elem1, elem2]}
    }
}

That is, I want the maximal number of elements among the lists stored under key 'err' in 'col1' and 'col2'. The key 'violations' may contain many keys (e.g. 'col1', 'col2', 'col3', etc.), so what's the best way to do this (using a loop)?

max = 0
for col in data.violations:
    if max < len(data.violations.col.err):
        max = len(data.violations.col.err)
...
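
One possible way to do this is sketched below. It assumes every entry under 'violations' has an 'err' list, and it uses placeholder strings in place of elem1, elem2, elem3. The function name max_err_count is made up for illustration. Dictionary values are reached with subscripts (data['violations']) rather than attribute access, and the built-in max() replaces the manual comparison.

```python
# Sample data shaped like the question's example; the strings are placeholders.
data = {
    "violations": {
        "col1": {"err": ["elem1", "elem2", "elem3"]},
        "col2": {"err": ["elem1", "elem2"]},
    }
}


def max_err_count(data):
    """Return the length of the longest 'err' list under data['violations'].

    Returns 0 when 'violations' is empty.
    """
    return max(
        (len(col["err"]) for col in data["violations"].values()),
        default=0,
    )


if __name__ == "__main__":
    print(max_err_count(data))
```

The default=0 argument keeps max() from raising ValueError when 'violations' has no columns.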