From College Publisher to WordPress: My own daily WTF


Up until June 2011, The Signpost (Weber State University’s student news organization) was hosted by College Publisher, a content management system designed for university newspapers like The Signpost. In June 2011, the paper made the jump to WordPress.

The first order of business was to import past stories from College Publisher into our new WordPress-hosted site. Easy, right? Get a SQL dump from CP, format the columns for the WordPress database, upload the new SQL file. Bada-boom, bada-bing. If only…
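For reference, here is roughly what that “easy” import would have looked like: load the export into a staging table, then map its columns onto WordPress’s wp_posts table. This is only a sketch under assumed names; the cp_* columns and the cp_archive_staging table are invented, since the export never got clean enough to try it.

    <?php
    // Hypothetical straightforward import: copy rows from a staging table
    // (loaded from the College Publisher dump) into WordPress's wp_posts.
    // The cp_* column names and the staging table are assumptions.
    $pdo = new PDO('mysql:host=localhost;dbname=signpost_wp', 'wp_user', 'secret');

    $pdo->exec("
        INSERT INTO wp_posts
            (post_title, post_content, post_excerpt, post_date, post_status,
             post_type, to_ping, pinged, post_content_filtered)
        SELECT cp_title, cp_body, '', cp_pub_date, 'publish', 'post', '', '', ''
        FROM cp_archive_staging
    ");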

After nearly two weeks of waiting for a response, College Publisher provided us with a temporary FTP site containing the archived file export:

Oh, Jupiter! Why?!

Working with large Excel files can be difficult; working with a nearly 4 GB Excel file can be damn near impossible, especially on an office computer with 4 GB of memory. After closing all non-essential processes, I was able to open the file, only to find 60,000 rows, many of which contained classified ads stored alongside stories, incomplete records, and other corrupted gobbledygook.

Sorting by story size and filtering by a few keywords let me weed out 20,000 bogus records, bringing the Excel file down to about 40,000 rows. Upon further inspection, nearly all of the remaining stories had duplicate headlines. A typical duplicate set contained four records, only one of which was the whole, correct story.

40,000 records, at roughly four duplicates per unique story, works out to about 10,000 unique stories: more than one person could feasibly sort through in any reasonable amount of time.
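That sorting problem is easy to express in SQL, though. Once the surviving rows land in a database (the import described below), a single grouping query will surface every duplicate set, and the longest copy in each set is usually the complete one. The archive_stories table and its headline/body columns in this sketch are names I’m inventing for illustration.

    <?php
    // Sketch: list every headline that appears more than once, with the
    // number of copies and the length of the longest copy (usually the
    // complete story). Table/column names are illustrative only.
    $pdo = new PDO('mysql:host=localhost;dbname=signpost_archive', 'archive_user', 'secret');

    $dupes = $pdo->query("
        SELECT headline,
               COUNT(*) AS copies,
               MAX(CHAR_LENGTH(body)) AS longest_copy
        FROM archive_stories
        GROUP BY headline
        HAVING COUNT(*) > 1
        ORDER BY copies DESC
    ");

    foreach ($dupes as $row) {
        printf("%s (%d copies, longest %d chars)\n",
            $row['headline'], $row['copies'], $row['longest_copy']);
    }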

The eventual solution? I generated SQL insert scripts from the Excel file and imported them into a MySQL database. Then I whipped up a PHP application that allows searching by article title, author, and body content. Duplicates are displayed to the user, along with a Reddit-esque voting mechanism for voting correct stories up and bad ones down. As former authors search for their old work, they improve the system, “crowd-sourcing” the work of filtering out bad data. To date, 7,810 records (about 18%) have been voted up or down.
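The voting piece is less code than it sounds. Below is a rough sketch of the idea rather than the actual application: every candidate row carries a running score, a vote nudges it up or down, and the copies of a duplicated headline are listed highest-scoring first. As before, the table and column names are assumptions.

    <?php
    // Rough sketch of the crowd-sourced voting, not the production code.
    // Assumed schema: archive_stories(id, headline, body, score).
    $pdo = new PDO('mysql:host=localhost;dbname=signpost_archive', 'archive_user', 'secret');

    // Record a vote: $delta is +1 for an upvote, -1 for a downvote.
    function cast_vote(PDO $pdo, $storyId, $delta)
    {
        $stmt = $pdo->prepare('UPDATE archive_stories SET score = score + ? WHERE id = ?');
        $stmt->execute([$delta, $storyId]);
    }

    // Fetch all copies of a headline, best-scoring copy first.
    function duplicates_for_headline(PDO $pdo, $headline)
    {
        $stmt = $pdo->prepare(
            'SELECT id, headline, body, score
             FROM archive_stories
             WHERE headline = ?
             ORDER BY score DESC'
        );
        $stmt->execute([$headline]);
        return $stmt->fetchAll(PDO::FETCH_ASSOC);
    }

That score-based ordering is also what decides which copy shows up first on the live tool at http://archives.wsusignpost.com.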

Thanks, College Publisher!

Oh, and as for actually importing the stories into WordPress? I guess we can tackle that in another few years, once the archives have been filtered.

  • Rudolf Olah

    That’s a pretty cool system, how much work was it to do the filtering and such? Is the code open-sourced somewhere? It’d be nice to see a clean example of the logic for this sort of process.

    • http://www.facebook.com/philip.bjorge Philip Bjorge

      He probably used the built-in filtering tools in Microsoft Excel 2011 (and probably earlier and later versions as well).

      • mdjasper

        Yes, this is correct. I used the filtering tools in Excel to filter by keywords that were similar to the corrupted rows, and sorted by size to find incomplete entries.

        The application I wrote for crowd-source voting is located at http://archives.wsusignpost.com. When duplicate posts are displayed, it orders them by their current vote score.

        Trivial stuff, really, but a big WTF on College Publisher for providing the data in that fashion.

  • Eric TF Bat

    The WordPress API isn’t too bad. Once you have your articles in a database, splatting them into WP isn’t all that painful. I wrote something to auto-generate a bunch of individual product pages for a shopping cart site, and I can basically regenerate the content from an Emacs .org file any time I want to change something. Much easier!

  • xxy

    You are going from one pile of shit to another.

    • mdjasper

      Yeah, but the second pile provides unique URLs, searchability, and public access. I’m open to suggestions, though.