How I parsed a website with PowerShell

Preface


First of all, I'm not a programmer. For now I'm an admin. Of course I would like to be called an architect, but among the suitable vacancies I can foresee, with adequate requirements and, most importantly, salaries that match those requirements, there are none. Which is a pity.
In this article I want to tell you about the useful goodies in the new version of PowerShell, in particular the ability to parse web pages quickly and reliably, and to do so in parallel.

Task


So, the task in front of me was quite simple. There is a website; once you get past the initial form, where you select a start date and an end date, you land on this page:

image

The number of such pages for one date range can be large, but never more than 999. So if, for example, you need the data for 5 years, it will not all fit into 999 pages. This page is a directory; I was only interested in the data behind the links in the Permit NO column:

image

Since I'm not a programmer, my knowledge is not enough to use C# or something similar. So my favorite tool, PowerShell, came to the rescue.

Solution


I decided to work in two stages: first download and parse the directory with the links, then walk through the directory and fetch the documents it refers to. A primitive task for a programmer. It took me around 16 hours. That said, I was using cmdlets that were new to me, so I was not only solving the problem but also learning the new cmdlets and features of PowerShell 3, which had only just come out at the time.
I was lucky that the site accepted the parameters directly in the URL, like so:

http://[skip]/[skip]?allcount=$allcount&allstartdate_month=$allstartdate_month [skip] 

That was fortunate, because I don't know how to work with HTML forms. So I decided to simply request the pages I needed by changing the request parameters. For this I used Invoke-WebRequest. It is the simplest way to send a request and get the result without directly using .NET classes or the IE COM objects. The result is a parsed HTML document that can be picked apart further.
In addition, a peculiarity of this page was that it returned not only the HTML code of the table but also the already parsed contents of the table:

image
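
The approach of requesting pages by changing the URL parameters can be sketched roughly as follows. This is only an illustration: the helper New-QueryUri and the host example.invalid are hypothetical and do not come from the original script, and only a few of the site's parameters are shown.

```powershell
# Hypothetical helper (not from the original script): builds a URI
# from a base address and a dictionary of query parameters.
function New-QueryUri {
    param(
        [string] $BaseUri,
        [System.Collections.IDictionary] $Query
    )
    $pairs = foreach ($key in $Query.Keys) {
        $name  = [uri]::EscapeDataString([string] $key)
        $value = [uri]::EscapeDataString([string] $Query[$key])
        '{0}={1}' -f $name, $value
    }
    '{0}?{1}' -f $BaseUri, ($pairs -join '&')
}

$startDate = [datetime] '2012-01-01'

# Only a subset of the real parameters; the names follow the article's URL.
$query = [ordered]@{
    allcount           = '0000'
    allstartdate_month = '{0:d2}' -f $startDate.Month
    allstartdate_day   = '{0:d2}' -f $startDate.Day
    allstartdate_year  = $startDate.Year
}

# example.invalid is a placeholder host, not the real site.
$uri = New-QueryUri -BaseUri 'http://example.invalid/search' -Query $query
$uri

# The request itself would then look like this (not run here):
# $page = Invoke-WebRequest -Uri $uri
# $page.ParsedHtml   # parsed document (Windows PowerShell only)
# $page.Forms        # forms found on the page
```

Keeping the parameters in an ordered dictionary makes it easy to change one value, such as the date or the page counter, and rebuild the URI for the next request.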

Parsing the first half

In this part I simply fetched the directory. The main problem at this stage was to walk through all the pages the system returns and to recognize the last one. To do that, I decided to check whether the page has a Next button.
In addition, I wanted the output of this part to be a flat CSV file containing the directory itself, which would then be passed to the next stage. The code below was born for this. It simply fetches all the directory entries for a date range, parses the page contents with regular expressions using the feature described above, and returns objects that contain all of this information.

function Get-AppList {
    [CmdletBinding()]
    param(
        [datetime] $startDate = '01.01.2012',
        [datetime] $endDate = '01.01.2012',
        [string] $allpermittype = "SG",
        [string] $allcount = "0000",
        [string] $requestid = "1"
    )
    begin {
        # Zero-padded date parts for the query string
        [string] $allstartdate_month = "{0:d2}" -f $startDate.Month
        [string] $allstartdate_day = "{0:d2}" -f $startDate.Day
        [string] $allstartdate_year = $startDate.Year

        [string] $allenddate_month = "{0:d2}" -f $endDate.Month
        [string] $allenddate_day = "{0:d2}" -f $endDate.Day
        [string] $allenddate_year = $endDate.Year

        # One regex per column; the named group captures the value in braces
        $fields = @{Regex="\[0:PtAppFirstName\]\{(?<PtAppFirstName>.+)\}";Column="PtAppFirstName"},
            @{Regex="\[1:PtAppLastName\]\{(?<PtAppLastName>.+)\}";Column="PtAppLastName"},
            @{Regex="\[2:PtAppMI\]\{(?<PtAppMI>.+)\}";Column="PtAppMI"},
            @{Regex="\[3:PtJobNum\]\{(?<PtJobNum>.+)\}";Column="PtJobNum"},
            @{Regex="\[4:PtJobDocNum\]\{(?<PtJobDocNum>.+)\}";Column="PtJobDocNum"},
            @{Regex="\[5:PtJobType\]\{(?<PtJobType>.+)\}";Column="PtJobType"},
            @{Regex="\[6:PtPermitType\]\{(?<PtPermitType>.+)\}";Column="PtPermitType"},
            @{Regex="\[7:PtPermitSubtype\]\{(?<PtPermitSubtype>.+)\}";Column="PtPermitSubtype"},
            @{Regex="\[8:PtPermitSeqNum\]\{(?<PtPermitSeqNum>.+)\}";Column="PtPermitSeqNum"},
            @{Regex="\[10:PtFilingDate\]\{(?<PtFilingDate>.+)\}";Column="PtFilingDate"},
            @{Regex="\[11:PtExpirationDate\]\{(?<PtExpirationDate>.+)\}";Column="PtExpirationDate"},
            @{Regex="\[12:PtBin\]\{(?<PtBin>.+)\}";Column="PtBin"},
            @{Regex="\[13:JHouseNumber\]\{(?<JHouseNumber>.+)\}";Column="JHouseNumber"},
            @{Regex="\[14:JStreetName\]\{(?<JStreetName>.+)\}";Column="JStreetName"},
            @{Regex="\[15:PermitIsn\]\{(?<PermitIsn>.+)\}";Column="PermitIsn"}

        $uri = "http://[skip]/bisweb/[skip]?allcount=$allcount&allstartdate_month=$allstartdate_month&allstartdate_day=$allstartdate_day&allstartdate_year=$allstartdate_year&allenddate_month=$allenddate_month&allenddate_day=$allenddate_day&allenddate_year=$allenddate_year&allpermittype=$allpermittype&go13=+GO+&requestid=0&navflag=T&requestid=$requestid"
    }
    process {
        do {
            # Request the next page; the session is stored in $sv
            $a = Invoke-WebRequest -Uri $uri -SessionVariable sv

            # The already parsed table contents sit in the data of the child nodes
            $s = $a.ParsedHtml.childNodes | ForEach-Object data
            # Split the block into individual records on markers such as [12]
            $s2 = ($s[3] -split "\[\d+\]")

            $obj = @{}

            $s2 | ForEach-Object {
                $item = $_
                if ($item) {
                    # Fill every column, or $null when the field is missing
                    $fields | ForEach-Object {
                        $res = $item -match $_.regex
                        if ($res) {
                            $obj[$_.Column] = $matches[$_.Column]
                        }
                        else {
                            $obj[$_.Column] = $null
                        }
                    }
                    # Emit only records that actually have a permit type
                    if (($null -ne $obj.PtPermitType) -and ($obj.PtPermitType -ne " ")) {
                        New-Object psobject -Property $obj
                    }
                }
            }

            # Check whether this is the last page (specific to this site):
            # the loop continues only while the "frmnext" form is present
            $form = $a.Forms | Where-Object id -eq "frmnext"

            if ($form) {
                # Take the parameters for the next request from the Next form
                $allstartdate_month = $form.Fields["allstartdate_month"]
                $allstartdate_day = $form.Fields["allstartdate_day"]
                $allstartdate_year = $form.Fields["allstartdate_year"]

                $allenddate_month = $form.Fields["allenddate_month"]
                $allenddate_day = $form.Fields["allenddate_day"]
                $allenddate_year = $form.Fields["allenddate_year"]
                $allpermittype = $form.Fields["allpermittype"]
                $allcount = $form.Fields["allcount"]

                $requestid = $form.Fields["requestid"]

                $uri = "http://[skip]/skip?allcount=$allcount&allstartdate_month=$allstartdate_month&allstartdate_day=$allstartdate_day&allstartdate_year=$allstartdate_year&allenddate_month=$allenddate_month&allenddate_day=$allenddate_day&allenddate_year=$allenddate_year&allpermittype=$allpermittype&go13=+GO+&requestid=0&navflag=T&requestid=$requestid"
            }
        } while ($form)
    }
}
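
To make the regular-expression step easier to follow, here is a minimal sketch of the same idea on a made-up record. The sample text and its one-field-per-line layout are assumptions about what the page returns, and the single group name "value" is a simplification of the per-column group names used in the function.

```powershell
# Hypothetical sample imitating one record; the values and the exact
# layout are made up for illustration.
$record = @"
[0:PtAppFirstName]{JOHN}
[1:PtAppLastName]{SMITH}
[15:PermitIsn]{0001234567}
"@

$columns = 'PtAppFirstName', 'PtAppLastName', 'PtBin', 'PermitIsn'

$row = [ordered]@{}
foreach ($name in $columns) {
    # Pattern such as \[\d+:PermitIsn\]\{(?<value>.+)\}
    $pattern = '\[\d+:' + $name + '\]\{(?<value>.+)\}'
    if ($record -match $pattern) {
        # $Matches holds the named group of the last successful match
        $row[$name] = $Matches['value']
    }
    else {
        # Field absent from the record: keep the column, leave it empty
        $row[$name] = $null
    }
}

$obj = [pscustomobject] $row
$obj
```

Because "." does not match a line break, each pattern stops at the closing brace on its own line. A column that is missing from the record, like PtBin here, stays in the object with an empty value, so every object has the same set of properties.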

Parsing the second half

The second part brought a different problem. The number of pages to request was somewhat larger: roughly 30 times more. The first stage produced too many results, and fetching the pages one at a time took a lot of time. So I decided to take advantage of another PowerShell v3 feature, PowerShell Workflow, or rather its foreach -parallel statement.
Strictly speaking, workflow is designed for something quite different, but in this case it did the job. I must say that it is not a tool for parallelizing tasks to improve performance, so do not expect that from it. The idea here was to use it to run the queries for each line of the directory "in parallel". This statement actually launches separate processes, and their number is limited. I did wonder whether the maximum number can be changed.
This mechanism lets you get "concurrency" with very simple code. The quotes are not there because the queries are not parallel. They are parallel; they just run not in lightweight threads but in heavyweight processes of the .NET Workflow framework, and the results have to be passed across process boundaries. So it is not very efficient but, as our dear chief says, "cheap, convenient, and practical", and most importantly for an admin, it takes only 2 lines of code. Losing a few seconds on one particular task makes no difference to the problem as a whole. All in all, good stuff.
This is the code I ended up with.

workflow Get-AppDetails2 ($list) {
    $webList = @()
    # Request the details page for every directory entry in parallel
    foreach -parallel ($i in $list) {
        $PermitIsn = $i.PermitIsn
        $queryUri = "http://[skip]/bisweb/[skip]?allisn=$PermitIsn&allbin=&requestid=1"
        Invoke-WebRequest -Uri $queryUri
    }
}
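
The hand-off between the two stages through a flat CSV file might look roughly like this. It is only a sketch: the sample rows, the file name applist.csv and the temporary folder are assumptions, and the workflow call is left as a comment because it needs Windows PowerShell and the real site.

```powershell
# Hypothetical stand-in for the Get-AppList output; the values are made up.
$rows = @(
    [pscustomobject]@{ PtJobNum = '100000001'; PermitIsn = '0001234567' }
    [pscustomobject]@{ PtJobNum = '100000002'; PermitIsn = '0001234568' }
)

# Stage 1: save the directory as a flat CSV file (the path is an assumption).
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) 'applist.csv'
$rows | Export-Csv -Path $csvPath -NoTypeInformation

# Stage 2: read the file back; every column comes back as a string,
# so leading zeros in PermitIsn survive.
$list = @(Import-Csv -Path $csvPath)

# Same URI shape as in the workflow; host and path are elided in the article.
$uris = foreach ($i in $list) {
    "http://[skip]/bisweb/[skip]?allisn=$($i.PermitIsn)&allbin=&requestid=1"
}
$uris

# With the workflow defined (Windows PowerShell 3.0 or later):
# $pages = Get-AppDetails2 -list $list
```

As for the limit on parallel activities: as far as I know, starting with Windows PowerShell 4.0 the foreach -parallel statement accepts a -throttlelimit parameter, which was not yet available in version 3.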

Conclusions


Overall, all this shows that PowerShell is a powerful and useful thing, suitable for all sorts of important and useful jobs.
Article based on information from habrahabr.ru
