Author Topic: Ideas on IP Cache routine  (Read 106 times)

William C

  • Guest
Ideas on IP Cache routine
« on: February 03, 2012, 10:43:54 PM »
If IP Caches is enabled (and it should be if any of the external spamlists are set to be checked (you don't know how slow these sites are!)), then  avhfdas_ipcache is the most important table in AVH FDAS.  As such, I'd like to present to you some of my ideas:

  • Drop the added and lastseen indexes, because they are used only ONCE A DAY, but get updated (HUNDRED OF) THOUSANDS OF TIMES per day.  The overhead delay they cause during INSERTs is unjustified. If you drop these indexes, avhfdas_ipcache will also need much less disk space. No change in your plugin code is required.
  • Set the lastseen column to ON UPDATE CURRENT_TIMESTAMP. You will then spare some plugin code and let MySQL do its thing faster.
  • Drop the added column. Does the plugin use it? I can't find it used anywhere. If you don't want to drop it...., then set the added column to DEFAULT CURRENT_TIMESTAMP. You will then spare some plugin code and let MySQL do its thing faster.
  • I have an analogy. Think of the 3rd party spamlists as the hard drive of a computer. Think of the IP Cache table as its RAM. Do you know what we need, Peter? A CPU Cache (L1, L2, L3 cache, etc.), of course. :) Store the recent ham visitor IPs in a serialized array saved through the WordPress Transients API.  Yes, create a temporary, first-in-first-out, whitelist.  The max_size of this array is up to you!  Subsequent pageviews of ham visitors (and I hope they prolong their visit!) won't even require a MySQL query.
  • Finally, a point to ponder. Storing spambot IP addresses is justifiable, but are we sure we should store ALL our visitor IPs in unsecured form (plain text) in our database? What if that information, their information, is stolen? What if the blog has politically or morally sensitive information and the thief incriminates a site visitor because his IP appears in avhfdas_ipcache? Shouldn't we at least anonymize or hash visitor IPs?

Good job on this plugin, Peter. My comment spam plugins and I are very pleased with your work.

Peter

  • Administrator
  • Regular
  • *****
  • Posts: 439
    • http://blog.avirtualhome.com
Re: Ideas on IP Cache routine
« Reply #1 on: February 04, 2012, 08:00:29 AM »
Cool ideas William,

I will definitely look into the ideas, and I might implement them. Thanks for the suggestions.
Peter van der Does
AVH Plugins developer

Peter

  • Administrator
  • Regular
  • *****
  • Posts: 439
    • http://blog.avirtualhome.com
Re: Ideas on IP Cache routine
« Reply #2 on: February 13, 2012, 10:08:53 AM »
The two indexes are used by the IP Cache Log page. You can sort on either column and so using the index would be beneficial, especially when you decided to keep the cache for a long period of time.

By using the plugin code for the MySQL time I take in account the local time set by the administrator of the blog instead of the local time of the server. This would be consistent with the comment time stamp.

The added column is used to determine whether or not to remove the IP from the cache.

As every IP is stored in a database, for the use of the IP cache log, it will be much faster to look up the ip in table then it is going through the array you would get from the transient interface. Using the transient method also means a bigger impact on the memory usage.  The transient API also uses a MySQL call to load the transient option.

If the IP would be hashed or anonymized the IP Cache log would be be much harder to use. Currently you can search Google for IP's you see in your log and manually determine if they are ham or spam by checking with other parties. Or by using the IP you can cross check with your HTPP access log. The IP's in the IP cache database are stored in a ip2long format unlike when a visitor make a comment on your blog. WordPress stores the IP in the database as IPv4 plain text.

Peter van der Does
AVH Plugins developer

William C

  • Guest
Re: Ideas on IP Cache routine
« Reply #3 on: March 20, 2012, 02:43:21 AM »
1. As I said, the added and lastseen indexes are used only ONCE A DAY. The admin might go to the IP Cache Log page and sort on those columns but how often will this happen? Once a week, once a month, once ever?  Meanwhile, those indexes' binary trees are rebuilt or rebuilt (HUNDREDS OF) THOUSANDS OF TIMES per day.  The overhead does not justify the presumed benefit.  These two indexes slow the site down. A lot.

2. Point taken; You are right.

3. Ah I didn't look at avh-fdas.public.php before..... Sorry, why would you delete the IPs that have potentially just revisited an hour ago?  If I visit the blog consistently once a day, why would you delete my IP after daystokeep?  Sorry I don't understand; Deleting by lastseen makes a lot more sense. 

4. I didn't tell you to eliminate the IP cache log. I meant supplement the SQL-based IP cache log with a memory-based array.  I didn't tell you to store the IP cache log in memory.  I meant cache a few IPs using the Transients API.  The memory tradeoff is very small; each array element is just a few bytes long. 10KB of RAM can cache at least 30 IPs. If the blog has less than 30 unique visitors at a time, you save up to 97% of your MySQL calls.  Even if lastseen won't be accurate, the improvement is worth it.
 
4. Using the Transients API does not produce additional MySQL calls.  When WordPress loads, WordPress calls wp_load_alloptions which loads all autoloaded WordPress settings (including AVHFDAS's) and also all transient API data.  Furthermore, a memcached plugin, for example, would make WordPress store transient values in fast memory instead of in the database.

4. Searching through an in-memory array is always faster than a MySQL call.  Always.

5. I know. :) I just gave you a point to ponder that we are storing our visitor IPs in plaintext and they are readable by and vulnerable to any third-party plugin.