File: cleansite.py

"""
=======================================================================
use python's html and url parser libs to try to isolate and move
unused files in a web site directory;  run me in the directory of
the site's root html file(s) (default=[index.html]); 

this is heuristic: it assumes that referenced files are in this site 
if they exist here;  it also may incorrectly classify some files as 
unused if they are referenced only from files which cause Python's
html parser to fail -- you should inspect the run log and unused file
directory manually after a run, to see if parse failures occurred;
more lenient html parsers exist for Python, but all seem 2.X-only; 
other parse options might avoid failures too: re.findall() pattern 
matches for '(?s)href="(.*?)"' and 'src=...'? (see Example 19-9);

see chapters 19 and 14 for html parsers, chapter 13 for url parsing;

to do: extend me to delete the unused files from remote site via ftp:
not done because unused files require verification if parse failures;

caveat: assumes site is one dir, doesn't handle subdirs (improve me);
=======================================================================
"""

import os, sys, html.parser, urllib.parse

def findUnusedFiles(rootfiles=['index.html'], dirunused='Unused', skipfiles=[]):
    """
    find and move files referenced by rootfiles and by any html they
    reach, ignoring any in skipfiles, and moving unused to dirunused;
    """
    usedFiles = set(rootfiles)
    for rootfile in rootfiles:
        parseFileRefs(rootfile, usedFiles, skipfiles, 0)
    moveUnusedFiles(usedFiles, dirunused)
    return usedFiles

def moveUnusedFiles(usedFiles, dirunused, trace=print):
    """
    move unused files to a temp directory
    """
    print('-' * 80)
    if not os.path.exists(dirunused):               # tbd: clean if present?
        os.mkdir(dirunused)
    for filename in os.listdir('.'):
        if filename not in usedFiles:
            if not os.path.isfile(filename):
                print('Not a file:', filename)
            else:
                trace('Moving...', filename)
                os.rename(filename, os.path.join(dirunused, filename))

def parseFileRefs(htmlfile, usedFiles, skipFiles, indent, trace=print):
    """
    find files referenced in root, recur for html files
    """
    trace('%sParsing:' % ('.' * indent), htmlfile)
    parser = MyParser(usedFiles, skipFiles, indent)
    text   = open(htmlfile).read()
    try:
        parser.feed(text)
    except html.parser.HTMLParseError as E:
        print('==>FAILED:', E)                   # file's refs may be missed!
    parser.close()

class MyParser(html.parser.HTMLParser):
    """
    use Python stdlib html parser to scan files; could nest this in
    parseFileRefs for enclosing scope, but would remake class per call;
    """
    def __init__(self, usedFiles, skipFiles, indent):
        self.usedFiles = usedFiles
        self.skipFiles = skipFiles
        self.indent    = indent
        super().__init__()           # vs html.parser.HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        """
        callback on tag open during parse: check links and images
        """
        if tag == 'a':
            url = [value for (name, value) in attrs if name.lower() == 'href']
            if url:
                self.notefile(url[0])
        elif tag == 'img':
            url = [value for (name, value) in attrs if name.lower() == 'src']
            if url:
                self.notefile(url[0])

    def notefile(self, url):
        """
        note used file found, and recur to a nested parse if html
        """
        urlparts = urllib.parse.urlparse(url)
        (scheme, server, filepath, parms, query, frag) = urlparts
        filename = os.path.basename(filepath)
        if (os.path.exists(filename)       and                # is it here?
            filename not in self.skipFiles and                # ignore it?
            filename not in self.usedFiles):                  # skip repeats?
            self.usedFiles.add(filename)                      # add in-place
            if filename.endswith(('.html', '.htm')):          # recur for html
                parseFileRefs(
                    filename, self.usedFiles, self.skipFiles, self.indent + 3)

def deleteUnusedRemote(localUnusedDir, ftpsite, ftpuser, ftppswd, ftpdir='.'):
    """
    to do: delete unused files from remote site too? see Chapter 13 for ftp;
    not used because unused dir requires manual inspection if parse failures
    """
    from ftplib import FTP
    connection = FTP(ftpsite)
    connection.login(ftpuser, ftppswd)
    connection.cwd(ftpdir)
    for filename in os.listdir(localUnusedDir):
        connection.delete(filename)

if __name__== '__main__':
    htmlroot = sys.argv[1] if len(sys.argv) > 1 else 'index.html'
    moveto   = sys.argv[2] if len(sys.argv) > 2 else 'PossiblyUnused'
    ignore   = sys.argv[3] if len(sys.argv) > 3 else 'whatsnew.html'
    usedFiles = findUnusedFiles([htmlroot], moveto, [ignore])
    moveFiles = os.listdir(moveto)
    print('-' * 80)
    print('**Summary**\n')
    print('%d unused files moved to:\n\t%s\n' %
               (len(moveFiles), os.path.abspath(moveto)))
    print('%d used files in this site: ' % len(usedFiles))
    for F in sorted(usedFiles): print('\t', F)
    """
    if input('delete remotely?') in 'yY':
        deleteUnusedRemote(moveto, input('site?'), input('user?'), input('pswd?'))
    """
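
The module docstring mentions re.findall() pattern matching as a possible alternative to the html parser. The following is only a rough sketch of that idea, not part of cleansite.py: the names HREF_PATTERN, SRC_PATTERN, and findRefsByPattern are hypothetical, and the patterns assume attribute values are double-quoted. It would miss single-quoted or unquoted values and would also match text inside comments or scripts.

```python
import os
import re
import urllib.parse

# Hypothetical patterns (assumptions, not from cleansite.py): match
# double-quoted href= and src= values; (?i) ignores case, (?s) lets
# '.' span line breaks inside a value.
HREF_PATTERN = re.compile(r'(?is)href\s*=\s*"(.*?)"')
SRC_PATTERN  = re.compile(r'(?is)src\s*=\s*"(.*?)"')

def findRefsByPattern(text):
    """
    return the set of file basenames named in href/src attributes of text
    """
    urls = HREF_PATTERN.findall(text) + SRC_PATTERN.findall(text)
    names = set()
    for url in urls:
        filepath = urllib.parse.urlparse(url).path    # drop scheme, query, fragment
        filename = os.path.basename(filepath)         # one-dir site: basename only
        if filename:                                  # skip bare '/' and empty paths
            names.add(filename)
    return names

if __name__ == '__main__':
    sample = '<a href="about.html#top">About</a> <IMG SRC="pics/logo.png">'
    print(sorted(findRefsByPattern(sample)))
```

A caller could apply the same checks that notefile makes (file exists here, not skipped, not already seen) to each returned name, which would avoid dropping a file's references when the parser raises.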



©M.Lutz