Re: RFC: yup enhancements


From: Hollis Blanchard (hollis-lists@austin.rr.com)
Date: Fri Oct 19 2001 - 19:13:23 MDT


on 15/10/01 11:42 PM, Jeremy Katz at katzj@linuxpower.org wrote:

> Frightening the way these things come back to haunt you... I'll bring
> in the $0.03 worth of things which I brought up with Brian many months
> ago of pie in the sky ideas which I never got around to doing as well as
> some more thoughts on your ideas (since they were on the list too :)

Great, thanks for the email! :) Sorry for the delay in replying, my cable
modem has been down all week. :(

> On Monday, October 15 2001, Hollis Blanchard said:
>
>> 2. it can't handle multiple architectures at all, including "source" (as in
>> .src.rpm).
>
> Source RPMs are this weird special case that doesn't get handled
> properly. But multiple architectures in general work fine with what is,
> I believe, in CVS, assuming that all of the architectures in a tree are
> "compatible".

I haven't seen any support for eg 'yup install <package> -a <arch>'. Could
you give me an example of the support you're talking about?

>> 3. yup has a pretty big memory footprint, and is a bit slow to start up.
>
> Some of which can't be helped due to the size of RPM transaction sets.

True, but my biggest complaint is with startup time. It's slow on my G3; it
was almost unusable on my 604, especially when, after all that startup, it
says something like "no packages found" and I must correct a typo and start
over.

I suspect parsing and creating objects from a 3MB text file is a serious
performance problem.

> I have some interesting ideas about breaking transaction sets down into
> minimally sized sets but yup isn't likely to be my experimentation
> target anymore. Once I have some progress on this, though, it should
> hopefully be easy to move it either up into rpm-python itself or use it
> elsewhere.
>
>> Here are some yup internals:
> [snip internals]
>> The reason for the .diff file is clear: bandwidth. Whenever a package in
>> yup.db.list is updated, users would have to download the entire 3+ MB list
>> again. By using the diff, they can download only the package information
>> that has changed.
>
> Yep, this is one nice advantage over, say, apt for example. Especially
> for users who don't have ridiculously high bandwidth connections.

And this is something I think can easily be preserved by downloading an
object at a time rather than the full list.

>> I think this could help us in a couple ways:
>> - pickles would eliminate parsing time (replacing it with much faster load
>> time).
>> - pickles would reduce yup.db.list size significantly because they would not
>> be human-readable. (All the yup.db files should really be transparently
>> gzipped already, but that doesn't matter now...)
>> - pickles could also allow us to avoid keeping both yup.db.list and
>> yup.db.init around (detailed below).
>>
>> Pickles would *only* be used to hold package header information - not the
>> packages themselves. In fact it may be best to pickle each rpm's python
>> header structure itself (I haven't looked at it yet).
>
> Going down this road, you might be better off just dropping the actual
> header of the RPM (easily retrievable via hdr.unload() once you read the
> RPM header during yup-arch or replacement run and then you can reread it
> on the client side with rpm.headerLoad(string)... this is quite likely
> even smaller than pickling)

Yes, this sounds like an excellent idea.

> Unfortunately, neither of the above addresses the issue that just having
> 1000 headers loaded into memory entails a large footprint.

That's true, but I'm more concerned about loading and parsing a 3 MB text
file at the moment. And as Dan pointed out, Red Hat's Anaconda installer
(written in python) doesn't seem to have the same memory problems.

> Some of this
> can probably be addressed by looking at all of the tags in the RPM
> headers and just including the "necessary" ones in your header
> structure. You do gain an advantage in using regular RPM headers in
> that you can handle all version comparisons within RPM instead of
> needing custom logic to do so.

I recently discovered that the rpm python module allows for version
comparison simply by passing an (epoch, version, release) tuple. There is of
course a separate call for passing a raw rpm header object. In other words,
with either method -- a custom-built object or the raw rpm header -- we
could avoid writing our own comparison logic.

Stripping the rpm headers somehow seems very likely to break something in
rpm's python bindings, and I think that would also mean we'd have to
construct our own objects again (rather than using the raw rpm header).

>> The (server) directory structure I'm considering is something like this:
>>
>> yup.version: [text file]
>> 0.8
>> yup.arch.ppc/list: [text file]
>> ElectricFence 2.2.2-5
>> ImageMagick 5.2.7-2
>> ...
>> yup.arch.ppc/pkginfo/ [directory of pickled objects]
>> ElectricFence.pickle
>> ImageMagick.pickle
>> ...
>> yup.arch.ppc/pkg/ [directory of rpm's]
>> ElectricFence-2.2.2-5.ppc.rpm
>> ImageMagick-5.2.7-2.ppc.rpm
>> ...
>>
>> I'm not sure any other files outside a yup.arch directory are needed. Please
>> correct me if I've overlooked something.
>
> A base config file will still be needed to continue to support things
> such as server-side exclusion of packages and other site options (eg
> package groups are supported like this).

Ah, I was wondering how groups were supposed to work. :)

When would server-side package exclusion be useful?

> The listing of package
> information seems reasonable, although you need to make sure that you
> include epoch information (canonically represented as E:V-R)

Good to point out; I've always found rpm documentation to be lacking and so
have never really understood the usefulness of an epoch...
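
For what it's worth, an epoch lets a packager force a package to compare as
newer even when its version string would compare lower. A minimal sketch of
splitting the canonical E:V-R form into its parts follows. The function name
parse_evr is hypothetical, and returning None for a missing epoch is an
assumption (how a missing epoch is treated is left to the caller):

```python
def parse_evr(label):
    """Split 'E:V-R' into (epoch, version, release).

    epoch is an int, or None when the label has no 'E:' prefix.
    """
    epoch = None
    rest = label
    if ":" in label:
        epoch_text, rest = label.split(":", 1)
        epoch = int(epoch_text)
    # The release is everything after the last hyphen.
    version, sep, release = rest.rpartition("-")
    if not sep:
        raise ValueError("expected V-R in %r" % label)
    return epoch, version, release


with_epoch = parse_evr("1:2.2.2-5")
without_epoch = parse_evr("5.2.7-2")
```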

>> Before executing any action, yup does the following (for each repository):
>> 1. yup would always download the 'list' file for each arch.
>> 2. yup would compare the version of each package in 'list' with the version
>> of the local pickle.
>
> You might compare against the locally installed version as opposed to
> the pickle available and only download if there's a newer remote package
> than you have locally. Why download things you don't need? This could
> actually also be extended to downloading them in general.

The idea is for yup to download (piecemeal) the full remote repository. If
yup did not have a full repository to consider locally, how could it resolve
dependencies correctly?

Example: If you consider only what you have installed,
- you have gnome 1.0 and do not have libpng
- only libpng 1.0 was available at the distribution release
- gnome 2.0 requires libpng 2.0
Now yup will only know about libpng 1.0, even though libpng 2.0 is available
remotely. Because libpng was not installed locally, yup never bothered to
download the 2.0 headers. Thus 'yup install gnome2.0' won't work.

Please correct me if I'm misunderstanding your idea.
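
The example above can be sketched in a few lines. This is only an
illustration under assumed data: the dictionaries, the "name-version" keys
and the function missing_requirements are all hypothetical, and real
dependency resolution in rpm is considerably more involved:

```python
def missing_requirements(package, headers):
    """Return the requirements of `package` that no known header satisfies."""
    return [req for req in headers[package]["requires"] if req not in headers]


# Everything the remote repository offers.
full_repo = {
    "gnome-2.0": {"requires": ["libpng-2.0"]},
    "libpng-1.0": {"requires": []},
    "libpng-2.0": {"requires": []},
}

# What yup would know if it never fetched the newer libpng header
# because libpng was not installed locally.
partial = {
    "gnome-2.0": {"requires": ["libpng-2.0"]},
    "libpng-1.0": {"requires": []},
}

unresolved_partial = missing_requirements("gnome-2.0", partial)
unresolved_full = missing_requirements("gnome-2.0", full_repo)
```

With the partial view the requirement on libpng 2.0 stays unresolved; with
the full repository nothing is missing.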

>> 3. If the remote version is more current, yup downloads the updated pickle.
>>
>> In this way, yup keeps the most up-to-date pickles locally at all times,
>> which it can then use to quickly make decisions regarding the availability
>> of updates, dependencies, etc. There also is no single 'master' list - the
>> pickles can be organized by repository. (In theory this organization could
>> be added to the current yup source, but I believe there are too many
>> assumptions that only one package list exists.)
>>
>> There are optimizations that can and should be made, but does this general
>> idea sound ok? Any comments welcome.
>
> Generally sounds reasonable with the caveats above... when thinking
> about this in the past, my thoughts were pickling data, raw RPM headers,
> or db3. A *very* quick look shows that you might be better off using
> the raw RPM header instead of pickling it; further investigation would
> help there though.

That makes sense to me. I confess my thoughts on pickling were only in
reaction to an immediate problem; they probably weren't the best approach
possible.
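
Whichever payload format is used (pickle or raw header), the update check in
steps 1 to 3 could look roughly like the sketch below. The function names
and the dictionary representing the local cache are assumptions, and the
check uses plain inequality rather than a real rpm version comparison:

```python
def parse_list(text):
    """Parse 'name version-release' lines from a 'list' file into a dict."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, version = line.split(None, 1)
        packages[name] = version
    return packages


def stale_packages(remote, local):
    """Names whose cached entry is missing or differs from the remote list."""
    return sorted(
        name for name, version in remote.items() if local.get(name) != version
    )


remote = parse_list("ElectricFence 2.2.2-5\nImageMagick 5.2.7-2\n")
local = {"ElectricFence": "2.2.2-5", "ImageMagick": "5.2.6-1"}
to_fetch = stale_packages(remote, local)
```

Only the entries named in to_fetch would need to be downloaded, which keeps
the bandwidth advantage of the current .diff approach.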

> Other longer-term things which were on my list:
> * Redo the config file format because it's utterly inane and especially
> if the yup list stuff is going away, it makes no sense to keep :P Just
> using something that can be parsed with ConfigParser would make it a
> billion times better

ConfigParser sounds great. I recall being horrified at the yup config file
parsing...
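
As a rough idea of what a ConfigParser-style file might look like, here is a
sketch. The section names, option names and URL are invented for
illustration and are not yup's actual format; the module is spelled
configparser in Python 3, which is what this sketch assumes:

```python
import configparser

# Hypothetical config layout: one [main] section plus one section per repository.
SAMPLE = """\
[main]
cachedir = /var/cache/yup

[repo:base]
url = ftp://example.org/yup/
arch = ppc
exclude = kernel glibc
"""


def load_repos(text):
    """Return {repo_name: settings} for every [repo:NAME] section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    repos = {}
    for section in parser.sections():
        if section.startswith("repo:"):
            repos[section[len("repo:"):]] = {
                "url": parser.get(section, "url"),
                "arch": parser.get(section, "arch"),
                # 'exclude' is optional; default to an empty list.
                "exclude": parser.get(section, "exclude", fallback="").split(),
            }
    return repos


repos = load_repos(SAMPLE)
```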

> * Just use urllib for all file transfers instead of directly invoking
> ftplib and httplib; should be simple and mainly helps in terms of KISS
> and improving reliability

There was a reason I avoided urllib when I redid the file transfer layer,
and I believe it was the callbacks... I can't recall the details right now
though.

Actually, it may also have been because I would *really* like to be able to
cache ftp connections, since often that's the slowest part of an ftp
transfer (authentication, and even a delay leading up to authentication). I
don't believe urllib allowed this.
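
A minimal sketch of the kind of connection caching I mean, built directly on
ftplib. The class FTPCache and its methods are hypothetical, and the sketch
does not check whether a cached connection has timed out on the server side:

```python
import ftplib


class FTPCache:
    """Reuse one logged-in FTP connection per (host, user) pair."""

    def __init__(self, connect=None):
        # `connect` can be replaced, e.g. with a fake for testing.
        self._connect = connect or self._login
        self._open = {}

    @staticmethod
    def _login(host, user, password):
        conn = ftplib.FTP(host)
        conn.login(user, password)
        return conn

    def get(self, host, user="anonymous", password=""):
        """Return a cached connection, connecting only on first use."""
        key = (host, user)
        conn = self._open.get(key)
        if conn is None:
            conn = self._connect(host, user, password)
            self._open[key] = conn
        return conn

    def close_all(self):
        """Close every cached connection and empty the cache."""
        for conn in self._open.values():
            conn.quit()
        self._open.clear()
```

Each transfer would call get() and pay for connection setup and
authentication only once per server.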

> * More cleanups of how all of the output is handled separating backend
> from front-end and general good things like that

Sure, that could certainly use improvement.

-Hollis


