[personal profile] hirez
The most useful tool I have found for keeping an offline copy of an LJ + comments is the mildly-obviously named 'LJArchive'. It also manages to provide a fairly rapid full text search of entries + comments, which is really very useful indeed.

As far as I can tell (-> . <- this far) it's written in M$ C#, has been abandoned by the author and hasn't worked for some number of months due to $Random-XML-error which appears to be inside the comment-parsing code.

I was going to chunter on about this being unfixable b/c it would require spending ££ on the relevant part of the M$ toolchain, but it seems that the 'Express' version is free for the download (and presumably in exchange for all sorts of details that M$ can use to sell me things).

So instead of whining about it, I'd better bag yon thingy and see if the code is amenable to tinkering by a Unix Curmudgeon.

Date: 2011-12-14 02:36 pm (UTC)
From: [identity profile] quercus.livejournal.com
If it's a question of changing and fixing it, you'd have to make a copy of the dev tools work there too. Sounds like firing up a Windows box would be simpler.

Otherwise just blag the high-level design and knowledge of how to poke the LJ server (which is the hard part), then clean-room it in Scala, Python or $FAVOURITE_THING_THIS_WEEK. Things that can fall over with XML errors are usually indicative of brain-dead roll-your-own-DOM coding in the first place.
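A minimal sketch of the "don't roll your own DOM" point, assuming Python and its standard `xml.etree.ElementTree` parser. Each comment chunk is parsed on its own, and a malformed one is recorded and skipped rather than taking the whole run down. The `<comment>`/`<body>` layout and the `parse_comments` helper are hypothetical, for illustration only; they are not LJ's actual export format or anything from LJArchive.

```python
import xml.etree.ElementTree as ET


def parse_comments(chunks):
    """Parse an iterable of XML strings, one per comment (hypothetical layout).

    Returns (parsed, failed): parsed is a list of dicts, failed is a list of
    (index, error message) pairs for chunks the parser rejected.
    """
    parsed, failed = [], []
    for i, chunk in enumerate(chunks):
        try:
            node = ET.fromstring(chunk)
        except ET.ParseError as err:
            # Note the bad chunk and carry on with the rest of the archive.
            failed.append((i, str(err)))
            continue
        parsed.append({
            "id": node.get("id"),
            "body": node.findtext("body", default=""),
        })
    return parsed, failed


# Made-up sample data: the second chunk has an unescaped ampersand.
chunks = [
    '<comment id="1"><body>First &amp; foremost</body></comment>',
    '<comment id="2"><body>unescaped & ampersand</body></comment>',
    '<comment id="3"><body>Fine again</body></comment>',
]
parsed, failed = parse_comments(chunks)
print([c["id"] for c in parsed])
print([i for i, _ in failed])
```

The point of the design is only that one bad comment costs you one comment, not the whole sync.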

Date: 2011-12-14 02:43 pm (UTC)
From: [identity profile] hirez.livejournal.com
Yep. Indeed, there's at least one Perl noddy-script that just spiders LJ as $punter and re-creates that locally, which is Good Enough. LJArchive's USP is the search.

Date: 2011-12-14 05:02 pm (UTC)
From: [identity profile] quercus.livejournal.com
So do it in Java and hang Lucene in there. You'll get the best search this side of Google, and it's easy to do special magic for searching tags etc.
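For a feel of what a search layer does underneath, here is a toy inverted index in Python. This is not Lucene and uses none of its API; it is a sketch of the general idea only (map each term to the documents containing it, then intersect the sets for a multi-word query). The document ids and sample text are made up.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """docs: dict of doc id -> text. Returns term -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return sorted ids of docs containing every term in the query (AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return sorted(hits)


# Entries and comments share one index, keyed by a made-up id scheme.
docs = {
    "entry-1": "Keeping an offline copy of a journal",
    "comment-1a": "Offline search is the useful part",
    "comment-1b": "Just spider the journal and keep a copy",
}
index = build_index(docs)
print(search(index, "offline copy"))
print(search(index, "journal copy"))
```

A real engine adds ranking, stemming, on-disk index segments and so on, which is where the memory goes.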

Date: 2011-12-14 08:28 pm (UTC)
From: [identity profile] hirez.livejournal.com
Lucene appears to eat all the memory in the world and frag its indexes at the first sign of trouble... No, wait. The other one. Solr.
Edited Date: 2011-12-14 08:30 pm (UTC)

Date: 2011-12-15 10:34 am (UTC)
From: [identity profile] quercus.livejournal.com
Lucene and Solr are much the same thing internally. Lucene is the nuts and bolts for building a search engine; Solr is one I built earlier, all in a box. I've only ever used Lucene, so I don't really know where the boundary between Solr and Nutch lies.

Mostly I've sat Lucene on top of Oracle, so the underlying indexes were well-behaved anyway.

Date: 2011-12-23 03:12 pm (UTC)
From: [personal profile] reddragdiva
Correct. Lucene is the worst text search except all the others. It is fat and slow and horrible and works pretty well. Any serious use requires a separate box with several gig of memory just for search.
