[personal profile] hirez
The most useful tool I have found for keeping an offline copy of an LJ + comments is the mildly-obviously named 'LJArchive'. It also manages to provide a fairly rapid full text search of entries + comments, which is really very useful indeed.

As far as I can tell (-> . <- this far) it's written in M$ C#, has been abandoned by the author and hasn't worked for some number of months due to $Random-XML-error which appears to be inside the comment-parsing code.

I was going to chunter on about this being unfixable b/c it would require spending ££ on the relevant part of the M$ toolchain, but it seems that the 'Express' version is free for the download (and presumably in exchange for all sorts of details that M$ can use to sell me things).

So instead of whining about it, I'd better bag yon thingy and see if the code is amenable to tinkering by a Unix Curmudgeon.

Date: 2011-12-14 12:52 pm (UTC)
From: [personal profile] diffrentcolours
Given it's C#, can you not munge it in Mono?

Date: 2011-12-14 01:05 pm (UTC)
From: [identity profile] hirez.livejournal.com
Oh, probably. However I'd rather just fling it & project files at its native build environment in order to minimize the pissing about.

Really I'd rather not bother at all, but no bugger else looks like they're padding up and windmilling a Stuart Surridge ('Stepping up to the plate' is an Americanism with which I shall have no truck). This despite the fact that I know root(fuck-all) about C#, merrily detest XML and haven't written production Winders code since, er, 1991.

Date: 2011-12-14 01:10 pm (UTC)
From: [identity profile] jarkman.livejournal.com
It's just Java with a funny hat on, mostly. I don't suppose it will give you too much trouble. Let me know if it does.

Date: 2011-12-14 09:52 pm (UTC)
From: [identity profile] hirez.livejournal.com
[FX: Installs VS2010-Peon Edition]
[FX: Converts code from VS-2005]
[FX: Can't build debug version for reason which I'm sure makes perfect sense]

Hm. Ok. Good. Existing version is throwing the right XML error again. However, I need to see the XML it's attempting to parse.

[FX: Installs Wireshark...]

Date: 2011-12-14 10:19 pm (UTC)
From: [identity profile] jarkman.livejournal.com
Top stuff. I'm sure it will all be fixed by morning.

Date: 2011-12-14 10:19 pm (UTC)
From: [identity profile] hirez.livejournal.com
[FX: ... Capture of failing session]
[FX: Grovelling through results]

Huh? It appears to fail while trying to parse a DTD from w3.org. WTF?

Date: 2011-12-14 10:27 pm (UTC)
From: [identity profile] jarkman.livejournal.com
I suspect you'll be wanting to get the debug target building next.. :-)

Date: 2011-12-14 11:43 pm (UTC)
From: [identity profile] quercus.livejournal.com
Which DTDs are even at the W3C? Probably the HTML ones, which it's generally a bad idea to depend upon anyway.

Date: 2011-12-14 11:52 pm (UTC)
From: [identity profile] hirez.livejournal.com
XHTML, I think.

It blows up big-style if I point www.w3.org at 127.0.0.1

I wonder if one could hack those bits out?

Date: 2011-12-15 10:24 am (UTC)
From: [identity profile] quercus.livejournal.com
Easiest thing (wronger than a wrong thing) would be to point w3.org at 192.168.1.some_handy_apache_box and stick local copies of the DTD up on it, at the right path.

DTDs though, especially those for HTML, just don't need to be retrieved from the canonical w3 each time. It's not uncommon, but it's still crappy coding to rely on this.

I presume that the w3 site here gets hammered so much they must front-end it with a squid the size of Cthulhu.

Date: 2011-12-15 10:31 am (UTC)
From: [identity profile] hirez.livejournal.com
Ugh. It's a Winders app, so I don't think that's going to work. I would suspect that a less-worse option would be to hoover down the DTDs and pull them from file://

Which might make working out which part of the DTD it's failing to parse somewhat simpler. Might.

Date: 2011-12-15 11:27 am (UTC)
From: [identity profile] quercus.livejournal.com
The DTDs _should_ be embedded into the exe by some convenient means. If these are the HTML DTDs (or &Raggett forbid, the XHTML DTDs), then they aren't changing any time soon.

I don't see how just being Windows would break the ability to frob DTD retrieval by spoofing the public identifier?
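A rough sketch of the "resolve by public identifier" idea, in Python rather than C# purely for illustration. It reads the public and system identifiers out of a DOCTYPE and maps the public identifier to a local file, so nothing has to be fetched from w3.org. The `dtd` directory, the `CATALOG` table and the function names are all made up for this example; how the resulting path gets handed to a real parser depends on the parser and isn't shown.

```python
import re
from pathlib import Path

# Hypothetical directory holding local copies of the DTDs (an assumption;
# the thread does not name a location).
LOCAL_DTD_DIR = Path("dtd")

# Public identifiers of the three XHTML 1.0 DTDs, mapped to local file names.
CATALOG = {
    "-//W3C//DTD XHTML 1.0 Strict//EN": "xhtml1-strict.dtd",
    "-//W3C//DTD XHTML 1.0 Transitional//EN": "xhtml1-transitional.dtd",
    "-//W3C//DTD XHTML 1.0 Frameset//EN": "xhtml1-frameset.dtd",
}

# Matches: <!DOCTYPE root PUBLIC "public-id" "system-id"
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+\S+\s+PUBLIC\s+"([^"]*)"\s+"([^"]*)"',
    re.IGNORECASE,
)


def doctype_ids(xml_text):
    """Return (public_id, system_id) from the DOCTYPE, or None if absent."""
    match = DOCTYPE_RE.search(xml_text)
    if match is None:
        return None
    return match.group(1), match.group(2)


def resolve_dtd(public_id, system_id, dtd_dir=LOCAL_DTD_DIR):
    """Return the local path for a known public identifier, else None.

    The system identifier (the w3.org URL) is deliberately ignored for known
    DTDs; returning None leaves the fallback decision to the caller.
    """
    file_name = CATALOG.get(public_id)
    if file_name is None:
        return None
    return dtd_dir / file_name
```

Keying on the public identifier rather than the URL means the lookup still works if the system identifier in a given document is slightly different or unreachable.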

Date: 2011-12-15 12:06 pm (UTC)
From: [identity profile] hirez.livejournal.com
Right. I think what has happened is that assumptions made in, er, 2004 about the content (or encoding?) of the DTDs have turned out to be somewhat less than optimal.

Google seems to show that this is A Thing for C#/.NET
Edited Date: 2011-12-15 12:10 pm (UTC)

Date: 2011-12-14 02:36 pm (UTC)
From: [identity profile] quercus.livejournal.com
If it's a question of changing and fixing it, you'd have to make a copy of the dev tools work there too. Sounds like firing up a Windows box would be simpler.

Otherwise just blag the high-level design and knowledge of how to poke the LJ server (which is the hard part), then clean-room it in Scala, Python or $FAVOURITE_THING_THIS_WEEK. Things that can fall over with XML errors are usually indicative of brain-dead roll-your-own-DOM coding in the first place.

Date: 2011-12-14 02:43 pm (UTC)
From: [identity profile] hirez.livejournal.com
Yep. Indeed, there's at least one Perl noddy-script that just spiders LJ as $punter and re-creates that locally, which is Good Enough. LJArchive's USP is the search.

Date: 2011-12-14 05:02 pm (UTC)
From: [identity profile] quercus.livejournal.com
So do it in Java and hang Lucene in there. You'll get the best search this side of Google, and it's easy to do special magic for searching tags etc.

Date: 2011-12-14 08:28 pm (UTC)
From: [identity profile] hirez.livejournal.com
Lucene appears to eat all the memory in the world and frag its indexes at the first sign of trouble... No, wait. The other one. Solr.
Edited Date: 2011-12-14 08:30 pm (UTC)

Date: 2011-12-15 10:34 am (UTC)
From: [identity profile] quercus.livejournal.com
Lucene and Solr are much the same thing internally. Lucene is nuts and bolts for building one, Solr is one I built earlier, all in a box. I've only ever used Lucene, so I don't know really what the boundary between Solr and Nutch is.

Mostly I've sat Lucene on top of Oracle, so the underlying indexes were well-behaved anyway.
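For a feel of what "nuts and bolts" means here, a toy inverted index in Python: each word maps to the set of entries or comments containing it, and a query returns the items that contain every query word. This is only an illustration of the general idea, not how Lucene or LJArchive actually do it; `TinyIndex`, `tokenize` and the document ids are invented for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index: term -> set of ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Index one entry or comment under the given id."""
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return the ids of documents containing every term in the query."""
        terms = tokenize(query)
        if not terms:
            return set()
        # Copy the first posting set so intersecting never mutates the index.
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result
```

Usage would be along the lines of `idx = TinyIndex()`, then `idx.add("entry-1", text)` for each entry and comment, then `idx.search("xml error")`. Ranking, stemming and on-disk storage are the parts a real library adds on top.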

Date: 2011-12-23 03:12 pm (UTC)
From: [personal profile] reddragdiva
Correct. Lucene is the worst text search except all the others. It is fat and slow and horrible and works pretty well. Any serious use requires a separate box with several gig of memory just for search.

Date: 2011-12-14 01:06 pm (UTC)
From: [identity profile] venta.livejournal.com
Hmm. My LJArchive is still working fine. I wonder if now I know it has a bug it will stop working...

Date: 2011-12-14 01:10 pm (UTC)
From: [identity profile] hirez.livejournal.com
Interesting... Do you archive comments?

Date: 2011-12-14 01:23 pm (UTC)
From: [identity profile] venta.livejournal.com
I do archive comments. A quick run-through suggests that, although I don't get anything which looks overtly like an XML parsing error, I do get a pop-up saying that the server doesn't support exporting comments.

It certainly used to: older posts have comments attached, but nothing recent does. I hadn't noticed that before, and I don't remember seeing the pop-up before, but posts downloaded prior to this conversation are also minus their comments.

I feel it's rather cheeky to say can I have a copy if you do fix it :) The nearest I've come thus far to making C#'s acquaintance is a few unpleasant brushes with Managed C++, so I fear I'm quite unlikely to be able to offer particularly useful assistance.

Date: 2011-12-14 01:39 pm (UTC)
From: [identity profile] hirez.livejournal.com
Hm. Looking at mithering elsewhere, it seems that a LJ code-update b0rked something, they knew what it was and they were going to fix it. Perhaps the fix was yon pop-up...

If (massive if) I get the thing working, I'll no doubt jabber about it. Although the last time something similar happened, it turned out my fixing was entirely redundant and gave me a migraine.

Date: 2011-12-14 01:56 pm (UTC)
From: [identity profile] venta.livejournal.com
"Perhaps the fix was yon pop-up..."

That would explain it :(

I really should remember that utilities like this are often things whose innards one can poke if one wants. It just never occurs to me. I don't know why not.

Date: 2011-12-14 08:39 pm (UTC)
From: [identity profile] hirez.livejournal.com
Gah. On trying it just now, I got a wee box telling me my session had been dropped. I suspect comment-export is one of the things they drop first while under DDoS.

Date: 2011-12-14 07:02 pm (UTC)
From: [personal profile] miss_squiddy
It's borked for me too so if you can get a working version or fix the code you have, I would very much like to know.
