Script to Find Referenced Files in HTML

Some rip-off contractor swine dumped a trash-filled ColdFusion codebase on some totally naive (or crooked) dumb-ass bosses. Not one process ran without errors, and the web server logs were full of 404 errors. They told me to make it work. To document the bugs, I ran a series of lint-type searches over the code, one for each type of error I found. They turned up thousands of errors, hundreds of unused files, empty files, etc. This is what you can expect from crooks who neither package nor test code before delivery, and from idiots who accept crap and pay before seeing results.

A project that uses common software engineering processes (specifications, testing to specifications, bug tracking, packaging code for release, regression testing and release testing) would not find much with a lint script; obviously this project was run by morons and crooks. Lint-type programs usually produce a lot of false positives, but this code had so many real errors that I would have been happy to see a false positive. The code was so bad that rewriting it would probably have been quicker.

Here is an example of a bash shell script that finds missing files referenced from "img" tags in HTML. Modify it for other file reference tags, such as "link" tags for CSS, "script" tags for JavaScript, "form" tags, "a" tags, ColdFusion tags, etc.

#######################
# find HTML files whose img tags reference a missing image file
# Uses dirname to resolve each reference relative to the referencing file

# root directory of files
LOCATION=docs

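find_img () {

cd "${LOCATION}" || return 1
pwd

# list the HTML files that contain at least one img tag
CHECK_FILES=$(find . -name "*.htm*" -exec grep -li '<img' {} +)

for FILE in ${CHECK_FILES}
do
     # skip HTML comments, then pull the src value out of each img tag
     REFERENCES=$(awk '/<!--/, /-->/ {next;}
     /<[Ii][Mm][Gg] /, />/ {
                              if (!sub(/^.*[Ss][Rr][Cc]=/,"")) next;
                              $0=$1;
                              gsub(/[">]/,"");
                              print;}' "${FILE}" |
                                 grep -Ei 'gif|jpg|jpeg|png' | sort -u  )

     # report every referenced image that does not exist on disk
     for REF_FILE in ${REFERENCES}
     do
         DIRECTORYROOT=$(dirname "${FILE}")
         [ -e "${DIRECTORYROOT}/${REF_FILE}" ] ||
             echo "file:${FILE}     missing:${REF_FILE}"
     done
done
}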

find_img

# Here are some other types of file references that can be checked
# with a modified form of this script:
#cfinclude template = "../cfdocs/dochome.htm"
#link rel="prefetch" href="/images/big.jpeg"
#link rel="StyleSheet" href="CSS/default.css" type="text/css" 
#form method="get" action="/some/form_script/form_stuff.php"
#script src="missing_javascript_file.js" type="text/javascript"
#a href="missing_link" ...
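
One way to cover the other reference types listed above is to pass the tag and attribute names as parameters instead of hard-coding "img" and "src". The function below is only a sketch of that idea, not part of the original script: the name find_refs and its three arguments are made up for illustration. It assumes each tag sits on a single line, that attribute values are double-quoted, and that grep supports the common (non-POSIX) -o option. It skips absolute URLs and site-root paths such as /images/big.jpeg, because those cannot be resolved without knowing the web server's document root.

```shell
#!/bin/bash
# find_refs ROOT TAG ATTR   (hypothetical helper, for illustration only)
# Prints one line for every TAG whose ATTR value names a file that does
# not exist relative to the HTML file containing the tag.
find_refs () {
    local root=$1 tag=$2 attr=$3 file ref dir

    # every HTML file under root that mentions the tag at all
    find "$root" -name "*.htm*" -exec grep -li "<${tag}[[:space:]]" {} + |
    while IFS= read -r file
    do
        dir=$(dirname "$file")

        # 1. isolate each complete tag, 2. isolate attr="value",
        # 3. keep only the text between the double quotes
        grep -Eio "<${tag}[[:space:]][^>]*>" "$file" |
        grep -Eio "[[:space:]]${attr}[[:space:]]*=[[:space:]]*\"[^\"]*\"" |
        cut -d'"' -f2 |
        sort -u |
        while IFS= read -r ref
        do
            ref=${ref%%[?#]*}          # drop any query string or fragment
            case $ref in
                ''|*://*|/*|mailto:*) continue ;;   # not a relative file path
            esac
            [ -e "$dir/$ref" ] || echo "file:${file}     missing:${ref}"
        done
    done
}
```

For example, `find_refs docs script src` checks "script" tags, and `find_refs docs link href` checks stylesheet links. Reading the file names with `while read` rather than a `for` loop also keeps paths containing spaces intact, which the img version does not attempt.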