Monitoring search positions by hand

Let's build monitoring of query positions in a search engine. The beginning.


Usually we are interested in getting more customers.
And to increase something, you first need to be able to measure it.
It so happens that a large share of customers comes to online shops from search engines.
(I will write about working with contextual advertising and price aggregators in the following articles, if anyone is interested.)
And to evaluate your standing in the search engines, you usually need to collect statistics on where your queries rank in the results.

Our tool will consist of two parts:

  • a script for parsing search results, using curl and lxml
  • a web interface for management and display, in Django


Getting our position for a query from yandex.ru


I want to clarify right away: this article covers the basics and builds the simplest version, which we will keep improving in later articles.

First, let's make a function that returns the html for a given url.

We will download pages using pycurl.
import pycurl 
c = pycurl.Curl()

Set the url that we will load
url = 'ya.ru'
c.setopt(pycurl.URL, url)

To return the body of the page, curl uses a callback function, passing it strings of html.
We will use a StringIO string buffer: it accepts input through its write() method, and we can take everything back out of it through getvalue()
from StringIO import StringIO

c.bodyio = StringIO()
c.setopt(pycurl.WRITEFUNCTION, c.bodyio.write)
c.get_body = c.bodyio.getvalue
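
Just to illustrate the buffer itself, separately from curl: write() appends to the buffer and getvalue() returns everything written so far.

buf = StringIO()
buf.write('<html>')
buf.write('</html>')
print buf.getvalue()  # prints '<html></html>'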

Just in case, let's make curl look like a browser: set timeouts, the UserAgent, headers, etc.
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.setopt(pycurl.CONNECTTIMEOUT, 60)
c.setopt(pycurl.TIMEOUT, 120)
c.setopt(pycurl.NOSIGNAL, 1)
c.setopt(pycurl.USERAGENT, 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:13.0) Gecko/20100101 Firefox/13.0')
httpheader = [
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: ru-ru,ru;q=0.8,en-us;q=0.5,en;q=0.3',
'Accept-Charset: utf-8;q=0.7,*;q=0.5',
'Connection: keep-alive',
]
c.setopt(pycurl.HTTPHEADER, httpheader)

Now load the page
c.perform()

That's it, now we can read the html of the page
print c.get_body()

We can also check the response code from the server
print c.getinfo(pycurl.HTTP_CODE)

And if we got some non-200 response from the server, we will be able to handle it. For now we will just raise an exception; proper exception handling will come in the next articles.
if c.getinfo(pycurl.HTTP_CODE) != 200:
    raise Exception('HTTP code is %s' % c.getinfo(pycurl.HTTP_CODE))


Let's wrap everything we have done into a function; the end result:
import pycurl

try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO

def get_page(url, *args, **kargs):
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    # buffers for the response body and headers
    c.bodyio = StringIO()
    c.setopt(pycurl.WRITEFUNCTION, c.bodyio.write)
    c.get_body = c.bodyio.getvalue
    c.headio = StringIO()
    c.setopt(pycurl.HEADERFUNCTION, c.headio.write)
    c.get_head = c.headio.getvalue

    # pretend to be a browser: redirects, timeouts, UserAgent and headers
    c.setopt(pycurl.FOLLOWLOCATION, 1)
    c.setopt(pycurl.MAXREDIRS, 5)
    c.setopt(pycurl.CONNECTTIMEOUT, 60)
    c.setopt(pycurl.TIMEOUT, 120)
    c.setopt(pycurl.NOSIGNAL, 1)
    c.setopt(pycurl.USERAGENT, 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:13.0) Gecko/20100101 Firefox/13.0')
    httpheader = [
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: ru-ru,ru;q=0.8,en-us;q=0.5,en;q=0.3',
        'Accept-Charset: utf-8;q=0.7,*;q=0.5',
        'Connection: keep-alive',
    ]
    c.setopt(pycurl.HTTPHEADER, httpheader)

    c.perform()
    # anything other than 200 is treated as an error for now
    if c.getinfo(pycurl.HTTP_CODE) != 200:
        raise Exception('HTTP code is %s' % c.getinfo(pycurl.HTTP_CODE))

    return c.get_body()

Let's check the function
print get_page('ya.ru')
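
Since get_page() raises an exception on any non-200 response, a quick check can also wrap the call in try/except. This is only a rough sketch; proper error handling is left for the next articles.

try:
    html = get_page('ya.ru')
    print len(html)
except Exception as e:
    print 'request failed:', e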


Extracting the list of sites and their positions from a results page

Let's construct the search query.
We need to send 3 GET parameters to yandex.ru/yandsearch:
'text' (the query), 'lr' (the search region) and 'p' (the results page)
import urllib
import urlparse

key='brick'
region=213
page=1
params = ['http', 'yandex.ru', '/yandsearch', '', '', '']
params[4] = urllib.urlencode({
    'text': key,
    'lr': region,
    'p': page - 1,
})
url = urlparse.urlunparse(params)

Get the url and check it in the browser
print url 

Get the results page using our earlier function
html = get_page(url)

Now let's parse the DOM with lxml
import lxml.html

site_list = []
for h2 in lxml.html.fromstring(html).find_class('b-serp-item__title'):
    b = h2.find_class('b-serp-item__number')
    if len(b):
        num = b[0].text.strip()
        url = h2.find_class('b-serp-item__title-link')[0].attrib['href']
        site = urlparse.urlparse(url).hostname
        site_list.append((num, site, url))

Let's go through what is happening here in more detail.
lxml.html.fromstring(html) turns the html string into an html document object.
.find_class('b-serp-item__title') searches the document for all tags with the class 'b-serp-item__title'; this gives us a list of h2 elements that contain the position information we need, and we loop over them.
b = h2.find_class('b-serp-item__number') looks inside the found h2 tag for a b element holding the site's position number; if it is there, we collect the position via b[0].text.strip() along with the url of the site.
urlparse.urlparse(url).hostname extracts the domain name.
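
To make these lxml calls easier to follow, here is the same pattern on a toy html snippet; the class names below are made up and do not reflect the real Yandex markup.

toy = '<div><h2 class="item"><b class="num">1.</b><a class="link" href="http://example.com/">Example</a></h2></div>'
doc = lxml.html.fromstring(toy)
for h2 in doc.find_class('item'):
    b = h2.find_class('num')
    a = h2.find_class('link')[0]
    print b[0].text.strip(), a.attrib['href']  # prints: 1. http://example.com/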

Check the resulting list
print site_list

And let's collect it all into a function
def site_list(key, region=213, page=1):
    params = ['http', 'yandex.ru', '/yandsearch', '', '', '']
    params[4] = urllib.urlencode({
        'text': key,
        'lr': region,
        'p': page - 1,
    })
    url = urlparse.urlunparse(params)
    html = get_page(url)
    site_list = []
    for h2 in lxml.html.fromstring(html).find_class('b-serp-item__title'):
        b = h2.find_class('b-serp-item__number')
        if len(b):
            num = b[0].text.strip()
            url = h2.find_class('b-serp-item__title-link')[0].attrib['href']
            site = urlparse.urlparse(url).hostname
            site_list.append((num, site, url))
    return site_list

Let's check the function
print site_list('brick', 213, 2)


Find our site in the list of sites

We need a helper function that cuts off 'www.' at the beginning of a domain
def cut_www(site):
    if site.startswith('www.'):
        site = site[4:]
    return site
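
A quick check of the helper:

print cut_www('www.habrahabr.ru')  # habrahabr.ru
print cut_www('habrahabr.ru')      # habrahabr.ru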


Get the list of sites and compare each one with our website.
site = 'habrahabr.ru'
for pos, s, url in site_list('python', 213, 1):
    if cut_www(s) == site:
        print pos, url

Hmm, Habr is not on the first results page for 'python'. Let's loop through the results pages, going deeper,
but we need to set a limit: max_position, the position up to which we will check.
At the same time let's wrap it all in a function; if the site is not found, it will return None, None
import math

def site_position(site, key, region=213, max_position=10):
    for page in range(1, int(math.ceil(max_position / 10.0)) + 1):
        site = cut_www(site)
        for pos, s, url in site_list(key, region, page):
            if cut_www(s) == site:
                return pos, url
    return None, None

Check
print site_position('habrahabr.ru', 'python', 213, 100)


Here we actually got our position.
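
As a rough sketch of where this is going, the same function can be run over several queries to collect simple statistics for one site; the query list below is made up purely for illustration.

site = 'habrahabr.ru'
keys = ['python', 'django', 'lxml']  # hypothetical list of queries to monitor
for key in keys:
    pos, url = site_position(site, key, 213, 100)
    print key, pos, url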

Please write whether this topic is interesting to you and whether I should continue.

What should the next article be about?
— make an interface for this function, with tables and charts, and run the script on cron
— add captcha handling for this function, both manually and via a special api
— make the monitoring script multi-threaded with threading
— describe working with Direct, for example generating and uploading ads via the api, or managing prices
Article based on information from habrahabr.ru
