Category Archives: Python

Comment Spam

Within a week of switching to WordPress for my blogging software, I started receiving a lot of comment spam. I found this surprising because I have had a blog for a few years now without any problems. I have had the occasional spam comment, but lately I have been receiving 3-7 of them a day. I know this is very little compared to high-volume sites, but it seems like a lot for a small site like mine. For the most part, the Akismet spam plugin that WordPress ships with does an amazing job. It has let a few slip by, but that is no big deal.

This whole comment spam problem reminded me of a research paper I read a year or so ago called Defending Against an Internet-based Attack on the Physical World. It described the threat of using APIs such as Google’s SOAP API to automatically fill out request forms for catalogues and other material on thousands of sites, all addressed to a single victim. The victim’s physical mail would become overloaded and very hard to manage. Imagine hundreds or thousands of pieces of mail being delivered to your house every day. My point is that I figure spammers use a similar technique to find WordPress blogs and then spam them automatically.

I decided to see how easy it was. First I went to see if I could sign up for Google’s SOAP API, but I found out that they no longer offer this service. Without it, automating the searches is a lot harder. Setting the API problem aside, I decided to find a search string that would turn up comment pages on WordPress blogs. I was amazed at how easy this was. I just went to a blog using the default WordPress theme and looked for keywords that would always be there. After about a second I came up with this search string:

"Leave a Reply" Name Mail Website "proudly powered by WordPress"

Typing this into Google found over 1,000,000 pages! Clicking a few of these verified that they were in fact WordPress comment pages. Now I needed to write a program to automate parsing these links. Without the search API, I was stuck doing it manually. After about an hour I came up with this Python script. The script submits the search string above to Google, parses the first 100 results from the page, then submits a search for the next 100, and so on. While testing the script I noticed Google started blocking my searches, which is a good thing. I found a way around this by using different User-Agent strings and adding some timeouts. Because of this, the script defaults to saving only the first 100 links. I have left out the code to fill out the comment forms because I feel that piece of code would do more harm than good.
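To illustrate only the link-parsing step, here is a minimal sketch written for Python 3. It is not the script linked below: the names LinkCollector and extract_links and the sample HTML are made up for this example, and it works on a saved page held in a string instead of sending any queries.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, up to a fixed limit."""

    def __init__(self, limit=100):
        super().__init__()
        self.limit = limit
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a" or len(self.links) >= self.limit:
            return
        for name, value in attrs:
            # Keep only absolute http(s) links; skip anchors and relative paths.
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


def extract_links(html, limit=100):
    parser = LinkCollector(limit)
    parser.feed(html)
    return parser.links


# Hypothetical saved page content, not real search output.
SAMPLE = """
<html><body>
<a href="http://example.com/blog/?p=1">Post one</a>
<a href="#top">Top</a>
<a href="http://example.org/2007/03/post/">Post two</a>
</body></html>
"""

if __name__ == "__main__":
    for link in extract_links(SAMPLE):
        print(link)
```

The limit argument mirrors the idea of saving only the first 100 links per page.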

Anyway, I think there is a huge problem with comment spam that needs to be fixed. The fact that so many pages can be found in a single search is amazing. Google blocking queries when it detects a bot is definitely a step in the right direction. The fact that I was able to get around this so easily is not.

Files:
http://www.mattweber.org/files/wp-link-finder.py

Python script: rename.py

I like to have my music, movie, and picture files named a certain way. When I download files from the internet, they usually don’t follow my naming convention. I found myself manually renaming each file to fit my style. This got old really fast, so I decided to write a program to do it for me.

This program can convert the filename to all lowercase, replace strings in the filename with whatever you want, and trim any number of characters from the front or back of the filename. Here is the usage output:

usage: rename.py [options] file1 ... fileN

options:
  -h, --help            show this help message and exit
  -v, --verbose         Use verbose output
  -l, --lowercase       Convert the filename to lowercase
  -fNUM, --trim-front=NUM
                        Trims NUM of characters from the front of the filename
  -bNUM, --trim-back=NUM
                        Trims NUM of characters from the back of the filename
  -rOLDVAL NEWVAL, --replace=OLDVAL NEWVAL
                        Replaces OLDVAL with NEWVAL in the filename

Here are a few examples of what this program can do.

]$ ls -l
total 0
-rw-r--r--   1 matt  matt  0 Mar  4 14:03 01-BandName_-_SongName-group.mp3
-rw-r--r--   1 matt  matt  0 Mar  4 14:03 02-BandName_-_SongName2-group.mp3
-rw-r--r--   1 matt  matt  0 Mar  4 14:03 03-BandName_-_SongName3-group.mp3
]$ rename.py -f3 -r "_-_" "-" -r "-group" "" *.mp3
]$ ls -l
total 0
-rw-r--r--   1 matt  matt  0 Mar  4 14:03 BandName-SongName.mp3
-rw-r--r--   1 matt  matt  0 Mar  4 14:03 BandName-SongName2.mp3
-rw-r--r--   1 matt  matt  0 Mar  4 14:03 BandName-SongName3.mp3
]$ rename.py --replace="Band" "" -lv *.mp3
BandName-SongName.mp3 -> name-songname.mp3
BandName-SongName2.mp3 -> name-songname2.mp3
BandName-SongName3.mp3 -> name-songname3.mp3
]$ ls -l
total 0
-rw-r--r--   1 matt  matt  0 Mar  4 14:03 name-songname.mp3
-rw-r--r--   1 matt  matt  0 Mar  4 14:03 name-songname2.mp3
-rw-r--r--   1 matt  matt  0 Mar  4 14:03 name-songname3.mp3
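
The transformations in the session above can be sketched as a single pure function. This is a rough illustration in Python 3 and not the code from rename.py: the name transform_name and its parameters are invented here, the sketch assumes trimming applies to the whole filename including the extension, and it assumes the order trim, replace, lowercase (the second example only tells us that replacing happens before lowercasing).

```python
def transform_name(name, lowercase=False, trim_front=0, trim_back=0, replacements=()):
    """Return the new filename; nothing on disk is touched."""
    # Trim characters from the front, then from the back.
    if trim_front:
        name = name[trim_front:]
    if trim_back:
        # Guarded because name[:-0] would return an empty string.
        name = name[:-trim_back]
    # Apply each (old, new) replacement in the order given.
    for old, new in replacements:
        name = name.replace(old, new)
    # Lowercase last, so replacements still match mixed-case text.
    if lowercase:
        name = name.lower()
    return name


if __name__ == "__main__":
    first = transform_name(
        "01-BandName_-_SongName-group.mp3",
        trim_front=3,
        replacements=[("_-_", "-"), ("-group", "")],
    )
    print(first)
    print(transform_name(first, lowercase=True, replacements=[("Band", "")]))
```

Keeping the name logic separate from the actual rename call makes it easy to print a preview, which is roughly what the verbose option does.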

Files:
http://www.mattweber.org/files/rename.py

Northwind Database Client Server Project

Here is a project I wrote last semester for a client/server systems class. Basically the professor assigned us a table from the Northwind Database. Our first assignment was to take that table, export it to a text file, then create a program to parse that text file and create a random access binary file. We were then to create a GUI that will read, write, and modify records from that binary file. I decided to write my project assignments in Python since that is my programming language of choice.

We spent half the semester working on this part of the project instead of learning anything related to client/server systems. Creating the GUI was extremely easy using Qt3 Designer and pyuic, so most of my work went into the reading and writing of the binary file.
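
For anyone unfamiliar with random access binary files, here is a minimal Python 3 sketch of the idea using the struct module. It is not the project code: the record layout (an integer id plus two fixed-width text fields) and every function name are assumptions made for illustration. Because each record has the same size, record N always starts at byte N times the record size.

```python
import io
import struct

# Hypothetical layout: 4-byte int id, 40-byte name, 30-byte contact.
RECORD = struct.Struct("<i40s30s")


def pack_record(rec_id, name, contact):
    # struct pads short byte strings with NULs and truncates long ones.
    return RECORD.pack(rec_id, name.encode("utf-8"), contact.encode("utf-8"))


def unpack_record(data):
    rec_id, name, contact = RECORD.unpack(data)
    # Strip the NUL padding; "replace" guards against a truncated character.
    return (
        rec_id,
        name.rstrip(b"\0").decode("utf-8", "replace"),
        contact.rstrip(b"\0").decode("utf-8", "replace"),
    )


def write_record(f, index, rec_id, name, contact):
    # Jump straight to the record's slot and overwrite it.
    f.seek(index * RECORD.size)
    f.write(pack_record(rec_id, name, contact))


def read_record(f, index):
    f.seek(index * RECORD.size)
    data = f.read(RECORD.size)
    if len(data) != RECORD.size:
        raise IndexError("no record at index %d" % index)
    return unpack_record(data)


if __name__ == "__main__":
    f = io.BytesIO()  # stands in for a file opened with mode "r+b"
    write_record(f, 0, 1, "Example Co", "A. Person")
    write_record(f, 1, 2, "Sample Ltd", "B. Person")
    write_record(f, 0, 1, "Example Co", "C. Person")  # modify in place
    print(read_record(f, 0))
    print(read_record(f, 1))
```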

Finally, about 3/4 of the way through the semester, we were assigned to modify our program to read, write, and modify the binary file over the network. The server contained the binary file, and the client connected to the server to get information from the binary file. This part of the program was actually fun to implement.
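
The client/server split can be sketched roughly like this in Python 3. Again, this is not the project code: the one-field request format, the record size, and the function names are all assumptions, there is no error handling for out-of-range indexes, and socket.socketpair() stands in for a real TCP connection so the example runs in a single process.

```python
import socket
import struct
import threading

RECORD_SIZE = 16               # hypothetical fixed record length in bytes
REQUEST = struct.Struct("!I")  # hypothetical request: one unsigned record index


def recv_exact(conn, size):
    """Read exactly size bytes, since recv may return fewer."""
    chunks = []
    while size > 0:
        chunk = conn.recv(size)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        size -= len(chunk)
    return b"".join(chunks)


def serve_one(conn, data):
    """Server side: answer a single read request from the in-memory 'file'."""
    (index,) = REQUEST.unpack(recv_exact(conn, REQUEST.size))
    start = index * RECORD_SIZE
    conn.sendall(data[start:start + RECORD_SIZE])


def fetch_record(conn, index):
    """Client side: ask for one record and wait for its bytes."""
    conn.sendall(REQUEST.pack(index))
    return recv_exact(conn, RECORD_SIZE)


def demo():
    data = b"record-number-00" + b"record-number-01"
    server_end, client_end = socket.socketpair()
    worker = threading.Thread(target=serve_one, args=(server_end, data))
    worker.start()
    try:
        return fetch_record(client_end, 1)
    finally:
        worker.join()
        server_end.close()
        client_end.close()


if __name__ == "__main__":
    print(demo())
```

The point of the fixed record size is that the client knows in advance how many bytes to expect, so no length prefix is needed in the reply.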

Our very last assignment was to merge our client and server with another student’s. I partnered with Eric Gaumer since he was also working in Python. The following code is the result of our work. We took my GUI, added the features his client needed, and then added his server code to mine.

Simply start the server, then start the client and enter the IP address of the machine where the server is running.

Code:

http://www.mattweber.org/files/nwdb.tar.gz

Note: PyQt and PyKDE are required to run this code.

Homework: Generate All Permutations

Here is a homework question from my Problem Solving Strategies class, along with my answer written as a Python generator.

Design and code a decrease and conquer algorithm for generating all permutations of N elements in an array. Use the decrease-by-one method given in class where the parameters are: a prefix string, the number of elements whose permutations are to be concatenated with the prefix string, and the set of those elements. Turn in the code and the results for running it with a 4 element set containing A, B, C, and D.

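def AllPermutations(elements, numElem=None, prefix=None):
    # Default to permuting every element, starting from an empty prefix.
    if numElem is None:
        numElem = len(elements)
    if prefix is None:
        prefix = []
    if numElem == 0:
        # Nothing left to place: the prefix is a complete permutation.
        yield prefix
    else:
        for index in range(numElem):
            # Extend a copy of the prefix with the chosen element...
            newPrefix = prefix + [elements[index]]
            # ...and recurse on the remaining elements (decrease by one).
            newElements = elements[:index] + elements[index+1:]
            for perm in AllPermutations(newElements, numElem - 1, newPrefix):
                yield perm

if __name__ == '__main__':
    for perm in AllPermutations(list('ABCD')):
        print(perm)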
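
As a sanity check on the results, the standard library’s itertools.permutations can produce the reference output for the same 4 element set. This snippet is an addition for comparison and was not part of the homework answer; a 4 element set should give 4! = 24 permutations.

```python
from itertools import permutations
from math import factorial

elements = list("ABCD")

# itertools yields tuples; convert to lists to match the generator's output.
perms = [list(p) for p in permutations(elements)]

# A set of n distinct elements has n! permutations.
assert len(perms) == factorial(len(elements))

for perm in perms[:3]:
    print(perm)
```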

Files:
http://www.mattweber.org/files/perms.py

PyBlosxom Plugin: googlestats.py

This plugin keeps statistics on googlebot visits to your blog entries. It was inspired by the WP-GoogleStats plugin for WordPress. When enabled, this plugin checks whether the visitor is the googlebot and, if it is, updates the number of visits, last visit date, and last visit time template variables. You can use these template variables in the body of your posts to create custom messages based on visit statistics, or use the built-in $googlestats template variable to display a default message.
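
The core idea can be sketched in a few lines of Python 3. This is a standalone illustration, not the plugin source and not the PyBlosxom API: the function names, the stats dictionary, and its keys are hypothetical, and the check is a plain User-Agent substring match, which can be spoofed.

```python
from datetime import datetime


def is_googlebot(user_agent):
    # Simple substring check on the User-Agent header.
    return "googlebot" in (user_agent or "").lower()


def record_visit(stats, entry_id, user_agent, now=None):
    """Update stats[entry_id] in place if the visitor looks like the googlebot."""
    if not is_googlebot(user_agent):
        return False
    now = now or datetime.now()
    entry = stats.setdefault(
        entry_id, {"visits": 0, "last_date": None, "last_time": None}
    )
    entry["visits"] += 1
    entry["last_date"] = now.strftime("%Y-%m-%d")
    entry["last_time"] = now.strftime("%H:%M:%S")
    return True


if __name__ == "__main__":
    stats = {}
    bot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    record_visit(stats, "python/rename", bot)
    record_visit(stats, "python/rename", "Mozilla/5.0 (X11; Linux x86_64)")
    print(stats)
```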

Download:

http://www.mattweber.org/files/googlestats.py

PyBlosxom Plugins Update

Today I updated all my plugins to the newest versions from Contributed Plugins Pack 1.2.2. Everything went pretty well except for Will Guaraldi’s pycategories plugin which displayed an extra / sign when processing the root directory. I modified the plugin and made a patch that fixes this problem:

--- pycategories.py	2005-06-21 11:23:14.000000000 -0700
+++ pycategories_fixed.py	2005-06-30 13:37:08.838820024 -0700
@@ -185,6 +185,8 @@
                   "flavour":      flavour,
                   "count":        num,
                   "indent":       tab }
+            if item == "":
+                d["fullcategory"] = item

             # and we toss it in the thing
             output.append(item_t % d)

One last note concerns the newest version of the comments plugin. If you want to view comments you need to append “viewcomments=yes” to the querystring. I assume this is to allow showing comments only when viewing the blog entry and not when viewing a directory with only one entry. If this is the case, I believe checking “bl_type” == “file” is a better method. I plan on editing the plugin to work this way because I believe the querystring method is not very good and not backwards compatible.
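
To make the difference concrete, here is a small Python sketch of the two checks. The plain dictionaries are stand-ins for whatever the plugin actually receives, and both function names are invented for this example.

```python
def show_comments_by_query(query):
    # Behaviour described above: comments appear only when the
    # querystring carries viewcomments=yes.
    return query.get("viewcomments") == "yes"


def show_comments_by_type(data):
    # Proposed alternative: show comments whenever a single entry
    # file is being rendered, whatever the querystring says.
    return data.get("bl_type") == "file"


if __name__ == "__main__":
    # An old permalink without the new parameter loses its comments
    # under the first check but keeps them under the second.
    print(show_comments_by_query({}))
    print(show_comments_by_type({"bl_type": "file"}))
    # A directory that happens to contain one entry shows no comments.
    print(show_comments_by_type({"bl_type": "dir"}))
```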

Files:
pycategories.diff

PyBlosxom Plugin: robots.py

The robots.py PyBlosxom plugin inserts the Robots META tag into your blog entries. This is my first PyBlosxom plugin, so any comments or suggestions are appreciated.

Download:

http://www.mattweber.org/files/robots.py

Directing the Googlebot

While setting up PyBlosxom there were a few things I wanted to be able to do. The most important was being able to direct bots around my site, more specifically, the googlebot. I did some research and found a few sites that explain how the googlebot works and how you can guide it through your site.

I found Scribbling.net’s article, “Help the Googlebot understand your web site”, which describes how the googlebot should index a blog. Basically, you want Google to index your posts, not your main page. That way people find the actual post about a topic, not a main page that has most likely changed since the googlebot last indexed your site. The article shows that you can use meta tags to tell bots whether or not to index a page.
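
As a rough sketch of that idea in Python, the tag can be chosen by page type. The function name is hypothetical, and the choice of “noindex,follow” for listing pages (so bots skip the page itself but still follow its links to the posts) is my assumption about a sensible setting, not a quote from the article.

```python
def robots_meta_tag(is_post):
    # Posts get indexed; main, category, and date listings do not,
    # but their links are still followed so the posts can be found.
    content = "index,follow" if is_post else "noindex,follow"
    return '<meta name="robots" content="%s" />' % content


if __name__ == "__main__":
    print(robots_meta_tag(True))
    print(robots_meta_tag(False))
```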

To do this using PyBlosxom you can use the comments plugin: put the meta tag telling the googlebot to index the page in the “comment-story” flavour file, and the tag telling it not to index the page in the regular “story” flavour file. Out of the box, the comments plugin would display comments any time you viewed a page with one post. This is a problem when using the calendar and categories plugins because it would show the comments when viewing categories or dates with only one post, even though you were not viewing the actual post. We do not want this because it means we would be telling Google to index directories, not post pages. To fix this I modified the comments plugin so that it only shows comments when viewing an actual post. Here is my modified comments plugin for anyone interested in doing this with their blog.