(I wrote this post on my way back to the old continent... According to the map, we were flying over the Jarenga river, in Mother Russia)
I don't know about you, but I have been using RSS clients since, well, forever I'd say... First Akregator in KDE, then Google Reader (I still wonder why Google had to shut it down) and finally, as most people did, I moved to Feedly, and I am pretty happy with it, no complaints at all.
A few days ago I was taking a look at my RSS subscriptions and there were more than 400 websites that I had been adding over all these years, so I decided to clean it up a little bit using Python.
I came up with the script you can see below... It is fairly simple (and it uses only standard library modules) but does the job for me.
In order to use it, you just need to export your Feedly subscriptions (you will get an OPML file) and execute the script. You will get a new file without the "dead" links, ready to be imported.
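For reference, the export is a regular OPML file: the script only cares about the outline elements with type="rss" and their text/xmlUrl attributes. The names and URLs in this snippet are made up, but the structure looks roughly like this:

<opml version="1.0">
  <head>
    <title>Feedly subscriptions</title>
  </head>
  <body>
    <outline text="Tech" title="Tech">
      <outline type="rss" text="Example Blog" title="Example Blog"
               xmlUrl="http://example.com/feed.xml" htmlUrl="http://example.com"/>
    </outline>
  </body>
</opml>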
Usage:
# python3 feedly_checker.py myfeedlyrss.opml
Code:
#!/usr/bin/python3
import os
import sys
from xml.dom import minidom
import http.client
from urllib.parse import urlparse

errors = []


def usage():
    print("Usage: python3 " + sys.argv[0] + " <feedly_export_file>.opml")
    sys.exit(0)


def parse_feed(outlines):
    # Keep only the <outline> nodes of type "rss" as {name: xmlUrl} dicts
    feeds = []
    for node in outlines:
        if node.getAttribute('type') == "rss":
            feeds.append({node.getAttribute('text'): node.getAttribute('xmlUrl')})
    return feeds


def check_http_response(url):
    # Send a HEAD request to each feed URL and record the ones that fail
    for name, rss in url.items():
        addr = urlparse(rss)
        conn = None
        try:
            if addr.scheme == "https":
                conn = http.client.HTTPSConnection(addr.netloc, timeout=3)
            else:
                conn = http.client.HTTPConnection(addr.netloc, timeout=3)
            path = addr.path or "/"
            if addr.query:
                path += "?" + addr.query
            conn.request("HEAD", path)
            res = conn.getresponse()
            if res.status >= 400:
                errors.append(rss)
                print("Name: %s, URL: %s, Error Code: %d" % (name, rss, res.status))
        except (http.client.HTTPException, OSError):
            # Timeouts, DNS failures, refused connections... the feed is unreachable
            errors.append(rss)
            print("Name: %s, URL: %s, connection failed" % (name, rss))
        finally:
            if conn is not None:
                conn.close()


def load_xml(filename):
    long_input_file = os.path.realpath(filename)
    try:
        xmldoc = minidom.parse(long_input_file)
        item_res = xmldoc.getElementsByTagName('outline')
    except Exception:
        print("Error parsing xml file")
        sys.exit(1)
    return item_res


def clean_xml(filename, errors):
    # Rewrite the OPML file without the outlines whose feed URL failed
    xmldoc = minidom.parse(filename)
    doc_root = xmldoc.getElementsByTagName('outline')
    for j in doc_root:
        if j.getAttribute('type') == "rss" and j.getAttribute('xmlUrl') in errors:
            print("Removing... %s" % (j.getAttribute('xmlUrl')))
            j.parentNode.removeChild(j)
    try:
        with open(filename + "_clean", "w") as f:
            f.write(xmldoc.toxml())
        print("Creating %s_clean file..." % filename)
    except IOError:
        print("Error creating %s_clean file" % filename)


if len(sys.argv) != 2:
    usage()
else:
    input_file = sys.argv[1]
    outline_items = load_xml(input_file)
    results = parse_feed(outline_items)
    print("%d entries found in %s" % (len(results), input_file))
    for site in results:
        check_http_response(site)
    print("%d wrong entries found" % (len(errors)))
    clean_xml(input_file, errors)
    sys.exit(0)
It's been a quick exercise (POC?), so there is a lot of room for improvement...
- It is not fast, because it is single-threaded and it has to wait for each HTTP connection to time out.
- Any status code of 400 or above will cause the link to be removed... Some 5xx codes are temporary server errors, but those feeds will be deleted as well... It'd need some extra work (a rough sketch of both ideas follows below).
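Just to give an idea of where I would take it, here is an untested sketch (standard library only) that runs the checks in parallel with concurrent.futures and treats only 4xx responses as dead links. check_one() and check_all() are hypothetical helpers, not part of the script above; check_all() expects the same list of {name: xmlUrl} dicts that parse_feed() returns.

import http.client
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

def check_one(name, rss):
    # Return the feed URL if it looks dead, None otherwise
    addr = urlparse(rss)
    conn_cls = http.client.HTTPSConnection if addr.scheme == "https" else http.client.HTTPConnection
    conn = conn_cls(addr.netloc, timeout=3)
    try:
        conn.request("HEAD", addr.path or "/")
        status = conn.getresponse().status
        # Only client errors (4xx) are treated as dead; 5xx may be temporary
        return rss if 400 <= status < 500 else None
    except (http.client.HTTPException, OSError):
        return rss  # unreachable host: treat as dead
    finally:
        conn.close()

def check_all(feeds, workers=20):
    # feeds is a list of {name: xmlUrl} dicts, as returned by parse_feed()
    pairs = [(name, rss) for feed in feeds for name, rss in feed.items()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: check_one(*p), pairs)
    return [url for url in results if url is not None]

With 20 workers the 3-second timeouts overlap instead of adding up, so a 400-feed list should take seconds instead of minutes.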
\\psgonza