hirez | Unexpected consequences of a closed-source toolchain

The most useful tool I have found for keeping an offline copy of an LJ + comments is the mildly-obviously named 'LJArchive'. It also manages to provide a fairly rapid full text search of entries + comments, which is really very useful indeed.

As far as I can tell (-> . <- this far) it's written in M$ C#, has been abandoned by the author and hasn't worked for some number of months due to $Random-XML-error which appears to be inside the comment-parsing code.

I was going to chunter on about this being unfixable b/c it would require spending ££ on the relevant part of the M$ toolchain, but it seems that the 'Express' version is free for the download (and presumably in exchange for all sorts of details that M$ can use to sell me things).

So instead of whining about it, I'd better bag yon thingy and see if the code is amenable to tinkering by a Unix Curmudgeon.

Current Mood: RMS farts softly for you
Current Location: BANES
Current Music: Be strong, be wrong

From:

quercus.livejournal.com

If it's a question of changing and fixing it, you'd have to get a copy of the dev tools working on Unix too. Sounds like firing up a Windows box would be simpler.

Otherwise just blag the high-level design and knowledge of how to poke the LJ server (which is the hard part), then clean-room it in Scala, Python or $FAVOURITE_THING_THIS_WEEK. Things that can fall over with XML errors are usually indicative of brain-dead roll-your-own-DOM coding in the first place.
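For what it's worth, a minimal sketch of the "don't roll your own DOM" point in Python, using the standard library's parser and turning a parse failure into one clear error. The XML layout here (`comments`, `comment`, `body`, and the `id`/`poster` attributes) is made up for illustration; it is an assumption, not the actual LJ export format.

```python
import xml.etree.ElementTree as ET

# Hypothetical comment export; the real LJ format is not shown in this thread.
SAMPLE = """<comments>
  <comment id="1" poster="quercus"><body>Sounds like a Windows box would be simpler.</body></comment>
  <comment id="2" poster="hirez"><body>LJArchive's USP is the search.</body></comment>
</comments>"""


def parse_comments(xml_text):
    """Return a list of (id, poster, body) tuples; raise ValueError on malformed XML."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Surface the parser's own message instead of failing somewhere downstream.
        raise ValueError(f"malformed comment XML: {exc}") from exc
    comments = []
    for node in root.iter("comment"):
        body = node.findtext("body", default="")
        comments.append((int(node.get("id")), node.get("poster"), body))
    return comments


if __name__ == "__main__":
    for comment_id, poster, body in parse_comments(SAMPLE):
        print(comment_id, poster, body)
```

A real archiver would also need the server-poking part, which is not sketched here.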

From:

hirez.livejournal.com

Yep. Indeed, there's at least one Perl noddy-script that just spiders LJ as $punter and re-creates that locally, which is Good Enough. LJArchive's USP is the search.

From:

quercus.livejournal.com

So do it in Java and hang Lucene in there. You'll get the best search this side of Google, and it's easy to do special magic for searching tags etc.
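To show roughly what the search side involves, here is a toy inverted index in plain Python. This is not Lucene and does none of its ranking or analysis; it is only a sketch of the basic idea, and the function names (`tokenize`, `build_index`, `search`) and the document-id scheme are assumptions for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids (entries or comments) containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every term in the query (AND search)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Usage would be along the lines of `search(build_index(docs), "lucene search")`, where `docs` is a dict of id to text. Searching tags specially would mean keeping a second index keyed on tag rather than on body text.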

From:

hirez.livejournal.com

Lucene appears to eat all the memory in the world and frag its indexes at the first sign of trouble... No, wait. The other one. Solr.

Edited Date: 2011-12-14 08:30 pm (UTC)

From:

quercus.livejournal.com

Lucene and Solr are much the same thing internally. Lucene is the nuts and bolts for building a search engine; Solr is one I built earlier, all in a box. I've only ever used Lucene, so I don't really know where the boundary between Solr and Nutch lies.

Mostly I've sat Lucene on top of Oracle, so the underlying indexes were well-behaved anyway.

From:

reddragdiva

Correct. Lucene is the worst text search except all the others. It is fat and slow and horrible and works pretty well. Any serious use requires a separate box with several gig of memory just for search.

Profile

Julia Rez

Remarkably.placid.horse
