PERL: Checking Google Reader subscriptions status

RSS subscriptions in my Google Reader account have reached an indecent number. I am subscribed to 398 sites (blogs, forums or whatever).

I will need to unsubscribe from a few of them that I am no longer interested in, but I had the feeling that a lot of the feeds in my list weren't active at all.

Given the number of subscriptions, I wasn't too keen on checking each link manually, so I came up with this Perl script.

I haven't used any external modules, so it can be run with the default Perl environment that comes "out of the box". The only requirements are wget, curl, your subscriptions exported to an OPML file, and internet connectivity :)
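
For reference, the Google Reader export is an OPML file where each subscription is an <outline> element; the script only cares about the xmlUrl attributes. A simplified excerpt (with a placeholder feed, not one of my real subscriptions) looks roughly like this:

<opml version="1.0">
  <body>
    <outline text="Example Blog" title="Example Blog" type="rss" xmlUrl="http://example.com/feed/" htmlUrl="http://example.com/"/>
  </body>
</opml>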

I have tested it under Cygwin 1.7 and Arch Linux, but any other GNU/Linux distro should work fine as well.

The execution is pretty straightforward:

$ perl check_google_subscriptions.pl google-reader-subscriptions.xml

The script will look for duplicated items in your file and then go through all your subscriptions, checking the HTTP response for each one.
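
Each subscription is probed with a wget spider request using the same options the script runs internally, so you can reproduce a single check by hand (the URL below is just a placeholder):

$ wget --spider --server-response --timeout=5 --tries=3 "http://example.com/feed/"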

This is the source code:

#!/usr/bin/perl
#######################################################
#Author: psgonza
#Date: 19-Feb-2012
#Objective: Check Google Reader feed links status
#Version:1
#######################################################
use FileHandle;
use strict;
use warnings;
binmode STDOUT, ":utf8";
$| = 1;
die "[ERR] Usage: $0 <file.xml>\n" unless @ARGV;
my $wget = system ("which wget > /dev/null 2>&1");
if ($wget != 0) {print "[ERR] wget not found\n"; exit 1;}
my $curl = system ("which curl > /dev/null 2>&1");
if ($curl != 0) {print "[ERR] curl not found\n"; exit 1;}
my $googlexml = $ARGV[0];
my @lines;
my @new_lines;


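# Probe each URL in the referenced list with wget in spider mode and
# return a reference to the list of URLs that did not respond successfully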
sub check_link{
        my $urlList = shift;
        my @wrongUrl = ();
        my $counter = 1;
        print "$counter";
        foreach my $url(@$urlList){
                my $result = system ("wget --spider \"$url\" --server-response --timeout=5 --tries=3 > /dev/null 2>&1");
                if ($result != 0) { push (@wrongUrl, $url); }
                if ($counter % 10) {print ".";} else {print "$counter";}
                $counter++;
        }
        printf " DONE!\n\n";
        return \@wrongUrl;
}
print "==============================\n";
#Open xml file
open my $fh, '<', $googlexml or die "[ERR] Cannot open $googlexml: $!\n";
#Create array with feed URLs
while (<$fh>)
{
        if ($_ =~ /xmlUrl="([^"]+)"/) { push (@lines, $1); }
}
#Close xml file
close $fh;
my $total_items = scalar(@lines);
print "$total_items links found in $googlexml\n";
#Create hash with unique items
my %tmp = map { $_, 0 } @lines;
my @links = sort(keys %tmp);

#Check for duplicated items
print "\n===========================================\n";
print "Duplicated links:\n";
print "===========================================\n";
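#Count occurrences of every URL in the original list; anything seen more than once is a duplicate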
my %cnt;
$cnt{$_}++ for @lines;
print "$_\n" for grep $cnt{$_} > 1, keys %cnt;



my $items = scalar(@links);
print "\n==============================================================\n";
print "Checking server response for $items items. It can take a while...\n";
print "===============================================================\n\n";
my $ref_list = check_link(\@links);
my $deadlink = scalar(@$ref_list);
if ( $deadlink != 0 )
{
        print "$deadlink problematic links found\n\n";
}
else
{
        print "\nAll links working\n";
        print "\n===========================================\n";
        print "DONE\n";
        print "===========================================\n";
        undef($ref_list);
        exit 0;
}
print "\n===========================================\n";
print "Those problematic links could fail due to:\n\n";
print "- RSS not available in the server any more\n";
print "- Spiders are blocked in Robots.txt so the query fails\n\n";
print "Downloading the headers using curl.";
foreach my $val(@$ref_list){
        print ".";
        chomp $val;
        if (system ("curl --connect-timeout 5 --max-time 5 --dump-header headers.txt \"$val\" > /dev/null 2>&1 && grep \"HTTP/1.1 200 OK\" headers.txt > /dev/null") != 0)
                { push (@new_lines, $val); }
}
print ". DONE! \n\n";
$deadlink = scalar(@new_lines);
if ( $deadlink != 0 )
{
        print "\n===========================================\n";
        print "$deadlink links found not available\n\n";
}
else
{
        print "All links working\n";
}
foreach my $val(@new_lines){
        print "--> $val <-- NOT AVAILABLE\n";
}

undef($ref_list);
print "\n===========================================\n";
print "DONE\n";
print "===========================================\n";

And this is how the output looks:

$ perl check_google_subscriptions.pl example_google_export
==============================
60 links found in example_google_export
===========================================
Duplicated links:
===========================================
http://www.xtorquemadax.com/feeds/posts/default?alt=rss
==============================================================
Checking server response for 59 items. It can take a while...
===============================================================
1.........10.........20.........30.........40.........50......... DONE!
7 problematic links found

===========================================
Those problematic links could fail due to:
- RSS not available in the server any more
- Spiders are blocked in Robots.txt so the query fails
Downloading the headers using curl......... DONE!

===========================================
7 links found not available
--> http://blog.fredjean.net/articles.atom
--> http://blogs.elcorreo.com/el-navegante/posts.rss
--> http://feeds.feedburner.com/github
--> http://veejoe.net/blog/feed/
--> http://www.irfree.com/feed/rss/
--> http://www.misionurbana.com/site/index.php?q=rss.xml
--> http://www.rsdownload.net/option,com_rss/feed,RSS2.0/no_html,1.html
===========================================
DONE
===========================================

It's not a big deal, but it does the job!

Enjoy :)
