
Commit 8ce9bcb

Merge branch 'add-response'
2 parents: 07f50c3 + 50a6444

File tree: 4 files changed (+45, -88 lines)

README.md

Lines changed: 23 additions & 20 deletions
````diff
@@ -40,45 +40,48 @@ If you installed from source do:
 
 The urls file, by default `urls.csv` must have all the urls you want to check. You can use a text file with 1 url per line or a csv file with the urls on the first column and without headers.
 
-## Checking the urls
+You can use [ecounter](https://github.com/greenpeace/ecounter) to create a urls file from a sitemap.xml file.
 
-To check all urls in `urls.csv` with all the checks use the command:
+## Http info about a list of urls
 
+If you want to obtain information about http status codes, mime-types, file sizes and redirect urls of any urls, you can use `-http`.
+
+You must use this check in a separate command like:
 ```
-./check-my-pages -urls=urls.csv -http -analytics -canonical -redirects -linkpattern -cssjspattern -mediapattern
+./check-my-pages -urls=urls.csv -http -miliseconds=100
+```
+because check-my-pages will stop after executing `-http`
+
+This check creates a file named `httpResponses.csv` with 5 fields:
+1. initial url
+2. http status code
+3. mime type
+4. file size *(adds -1 if the file size is unknown)*
+5. final url
+
+## Checking html urls
+
+To do all the checks in `urls.csv` (html urls) with all the checks use the command:
+
+```
+./check-my-pages -urls=urls.csv -analytics -canonical -linkpattern -cssjspattern -mediapattern
 ```
 
 This repository includes a few testing urls in the file `urls.csv`. Please replace them by your own.
 
 It will create a couple of files, one per check the script is doing:
-* `httpResponses.csv` - Stores the **http response** codes for the URL. 200 means everything is OK.
 * `analytics.csv` - Reports **google analytics** tracking ID
 * `canonicals.csv` - Reports the **canonical url** for every url
-* `redirects.csv` - Reports the requested URL and the final URL. This will be useful to test the **redirects** in the main site.
 * `linkpattern.csv` - Reports on links that include a regular expression pattern. Useful to track **links** to specific **dead sites**. The default pattern can be set by the `-pattern` option.
 * `cssjspattern.csv` - Reports **css and js** urls that include a regular expression pattern. To detect dead css and js urls in large sites. The pattern can also be defined with the option `-pattern` (described bellow)
 * `mediapattern.csv` - Reports **media** links. Images, videos, audios, iframes and objects. Also use `-pattern` to define the urls pattern.
 
 ## Optional command line configurations
 
-`-miliseconds=100` - Sets a delay of 100 miliseconds between requests.
+`-miliseconds=100` - Sets a delay of 100 miliseconds between requests (the default value)
 
 `-pattern='https?://(\w|-)+.greenpeace.org/espana/.+'` - Changes the search link pattern to the regular expression.
 
-## Information about other urls
-
-If you want to obtain information about non-html files, like for example images, it's better to use `-fileinfo`.
-
-You must use this check in a separate command like:
-
-```
-./check-my-pages -urls=urls.csv -fileinfo -miliseconds=100
-```
-
-because check-my-pages will stop after executing `-fileinfo`
-
-This check creates a file named `fileInfo.csv` with 4 fields: url, http status code, mime type and file size (adds -1 if the file size is unknown).
-
 ## Remove the report files
 
 To remove the files created by **check-my-pages**:
````

check-my-pages.go

Lines changed: 11 additions & 56 deletions
```diff
@@ -3,7 +3,6 @@ package main
 import (
 	"flag"
 	"fmt"
-	"net/http"
 	"os"
 	"regexp"
 	"time"
@@ -16,10 +15,8 @@ func main() {
 	isHelp := flag.Bool("help", false, "Help")
 	urlsFileName := flag.String("urls", "urls.csv", "Name of the csv file with the urs in the first column")
 	isHTTP := flag.Bool("http", false, "Http response codes")
-	isRedirects := flag.Bool("redirects", false, "Redirects response codes")
 	isAnalytics := flag.Bool("analytics", false, "Correct analytics tag in the html")
 	isCanonical := flag.Bool("canonical", false, "Canonical URLS in the ")
-	isFileInfo := flag.Bool("fileinfo", false, "Specific calls for files")
 	isLinkpattern := flag.Bool("linkpattern", false, "Link Pattern")
 	isCSSJsPattern := flag.Bool("cssjspattern", false, "CSS and JS Pattern")
 	isMediaPattern := flag.Bool("mediapattern", false, "Image, object and iframe Pattern")
@@ -44,19 +41,22 @@ func main() {
 
 	if *isHTTP == true {
 
-		httpResponses, httpErr := os.OpenFile("httpResponses.csv", os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0600)
-		if httpErr != nil {
-			panic(httpErr)
+		isHTTPfile, isHTTPErr := os.OpenFile("httpResponses.csv", os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0600)
+		if isHTTPErr != nil {
+			panic(isHTTPErr)
 		}
-		defer httpResponses.Close()
+		defer isHTTPfile.Close()
 
-		c.OnResponse(func(r *colly.Response) {
-			lineResponse := fmt.Sprintf("%s,%v\n", r.Request.URL.String(), r.StatusCode)
-			if _, err := httpResponses.WriteString(lineResponse); err != nil {
+		var lineHTTP string
+		for _, v := range allUrls {
+			lineHTTP = getHTTPinfoAsCsvline(v)
+			if _, err := isHTTPfile.WriteString(lineHTTP); err != nil {
 				panic(err)
 			}
+			time.Sleep(time.Millisecond * time.Duration(*waitMiliseconds))
+		}
 
-		})
+		os.Exit(0)
 	}
 
 	if *isAnalytics == true {
@@ -94,26 +94,6 @@ func main() {
 		})
 	}
 
-	if *isFileInfo == true {
-
-		fileInfofile, fileInfofileErr := os.OpenFile("fileInfo.csv", os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0600)
-		if fileInfofileErr != nil {
-			panic(fileInfofileErr)
-		}
-		defer fileInfofile.Close()
-
-		var lineFileInfo string
-		for _, v := range allUrls {
-			lineFileInfo = fileInfo(v)
-			if _, err := fileInfofile.WriteString(lineFileInfo); err != nil {
-				panic(err)
-			}
-			time.Sleep(time.Millisecond * time.Duration(*waitMiliseconds))
-		}
-
-		os.Exit(0)
-	}
-
 	if *isLinkpattern == true {
 
 		linkpattern, linkpatternErr := os.OpenFile("linkpattern.csv", os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0600)
@@ -234,39 +214,14 @@ func main() {
 
 	}
 
-	if *isRedirects == true {
-
-		redirects, redirectsErr := os.OpenFile("redirects.csv", os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0600)
-		if redirectsErr != nil {
-			panic(redirects)
-		}
-		defer redirects.Close()
-
-		c.OnRequest(func(r *colly.Request) {
-			response, error := http.Get(r.URL.String())
-			if error != nil {
-				fmt.Printf("=> %v\n", error.Error())
-			} else {
-				finalURL := response.Request.URL.String()
-				lineCanonical := fmt.Sprintf("%s,%s\n", r.URL.String(), finalURL)
-				if _, err := redirects.WriteString(lineCanonical); err != nil {
-					panic(err)
-				}
-			}
-
-		})
-	}
-
 	if *isClear == true {
 
 		os.Remove("httpResponses.csv")
 		os.Remove("analytics.csv")
 		os.Remove("canonicals.csv")
-		os.Remove("redirects.csv")
 		os.Remove("linkpattern.csv")
 		os.Remove("cssjspattern.csv")
 		os.Remove("mediapattern.csv")
-		os.Remove("fileInfo.csv")
 		os.Exit(0)
 	}
 
```
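The new `-http` path replaces colly's `OnResponse` callback with a plain sequential loop that sleeps for `-miliseconds` between requests. A standalone sketch of that throttling pattern; the `processWithDelay` helper is hypothetical, not code from this commit:

```go
package main

import (
	"fmt"
	"time"
)

// processWithDelay applies fn to each url in order, pausing the given number
// of milliseconds between iterations -- the same time.Sleep throttling the
// merged -http branch uses between requests.
func processWithDelay(urls []string, ms int, fn func(string) string) []string {
	out := make([]string, 0, len(urls))
	for i, u := range urls {
		out = append(out, fn(u))
		if i < len(urls)-1 {
			time.Sleep(time.Millisecond * time.Duration(ms))
		}
	}
	return out
}

func main() {
	// Stub "check" instead of a real HTTP request, to keep the sketch offline.
	lines := processWithDelay([]string{"http://a.example", "http://b.example"}, 1,
		func(u string) string { return fmt.Sprintf("%s,stub\n", u) })
	fmt.Print(lines[0])
	// prints "http://a.example,stub"
}
```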

file.go

Lines changed: 4 additions & 3 deletions
```diff
@@ -5,8 +5,8 @@ import (
 	"net/http"
 )
 
-// Obtains statuscode, content-type and content lenght from a specific url
-func fileInfo(url string) string {
+// getHTTPinfoAsCsvline Obtains statuscode, content-type, content lenght and final URL from a specific http get request
+func getHTTPinfoAsCsvline(url string) string {
 	response, error := http.Get(url)
 	if error != nil {
 		return fmt.Sprintf("%s,%s,,\n", url, error.Error())
@@ -15,5 +15,6 @@ func fileInfo(url string) string {
 	statusCode := response.StatusCode
 	contentType := headers["Content-Type"][0]
 	contentLength := response.ContentLength
-	return fmt.Sprintf("%s,%d,%s,%d\n", url, statusCode, contentType, contentLength)
+	finalURL := response.Request.URL.String()
+	return fmt.Sprintf("%s,%d,%s,%d,%s\n", url, statusCode, contentType, contentLength, finalURL)
 }
```

help.go

Lines changed: 7 additions & 9 deletions
```diff
@@ -16,29 +16,25 @@ check-my-pages is a scrapping script. It checks each url in a list and creates r
 
 EXAMPLES:
 
-./check-my-pages -urls=urls.csv -http -analytics -canonical -redirects -linkpattern -cssjspattern -mediapattern
+./check-my-pages -urls=urls.csv -http -miliseconds=100
 
-./check-my-pages -urls=urls.csv -fileinfo -miliseconds=100
+./check-my-pages -urls=urls.csv -analytics -canonical -linkpattern -cssjspattern -mediapattern
 
 
 CHECKS:
 
--http : Gets the http response code. If it's 200 it should be OK.
+-http : Gets the http response code, mime-type, file size and final url. It must be used separately from the other checks.
 
 -analytics : Gets the first Google Analytics account.
 
 -canonical : Gets the canonical URL for the url.
 
--redirects : Gets info about redirects and final URLs.
-
 -linkpattern : Gets links that match the regular expression pattern.
 
 -cssjspattern : Gets CSS and JS URLs that match the regular expression pattern.
 
 -mediapattern : Gets urls from images, videos, audios, iframes and objects that match the regular expression pattern
 
--fileinfo : Speciall check more suitable for non-html pages (for example images). It needs to be used alone as the example above, without other checks.
-
 
 OPTIONS:
 
@@ -48,6 +44,10 @@ OPTIONS:
 
 -miliseconds=100 : Sets a delay of 100 miliseconds between requests.
 
+OTHER:
+
+-clear : Deletes all the files with the reports
+
 
 FILES WITH THE REPORTS:
 
@@ -57,8 +57,6 @@ FILES WITH THE REPORTS:
 
 - canonicals.csv : Reports the canonical url for every url
 
-- redirects.csv : Reports the requested URL and the final URL. This will be useful to test the redirects in the main site.
-
 - linkpattern.csv : Reports on links that include a regular expression pattern. Useful to track links to specific dead sites. The default pattern can be set by the -pattern option.
 
 - cssjspattern.csv : Reports css and js urls that include a regular expression pattern. To detect dead css and js urls in large sites. The pattern can also be defined with the option -pattern (described bellow)
```
