Re: RFC: yup enhancements


Subject: Re: RFC: yup enhancements
From: Jeremy Katz (katzj@linuxpower.org)
Date: Mon Oct 15 2001 - 22:42:55 MDT


Frightening the way these things come back to haunt you... I'll throw
in my $0.03: some pie-in-the-sky ideas which I brought up with Brian
many months ago and never got around to doing, as well as some more
thoughts on your ideas (since they were on the list too :)

On Monday, October 15 2001, Hollis Blanchard said:
> 1. it only understands one (main) repository. You can't have multiple
> repositories to draw from, and of course you can't have a repository which
> is incomplete.

Yep, this was one of the things on my list and is definitely something
which would be useful. I had some implementation notes at some point,
but who knows where those are now.

> 2. it can't handle multiple architectures at all, including "source" (as in
> .src.rpm).

Source RPMs are a weird special case that doesn't get handled
properly. Multiple architectures in general, though, work just fine
with what is, I believe, in CVS, assuming that all of the architectures
in a tree are "compatible". If it isn't in CVS, I'll have to diff the
tree being used at NCSU with current code, but I've been wary of that
for a variety of reasons. All of this is handled using standard RPM
archcompat stuff, so it should work fine on PPC as well, although some
additional information may need to be pushed up into RPM about
compatible PPC architectures.

> 3. yup has a pretty big memory footprint, and is a bit slow to start up.

Some of which can't be helped due to the size of RPM transaction sets.
I have some interesting ideas about breaking transaction sets down into
minimally sized sets but yup isn't likely to be my experimentation
target anymore. Once I have some progress on this, though, it should
hopefully be easy either to move it up into rpm-python itself or to use
it elsewhere.

> Here are some yup internals:
[snip internals]
> The reason for the .diff file is clear: bandwidth. Whenever a package in
> yup.db.list is updated, users would have to download the entire 3+ MB list
> again. By using the diff, they can download only the package information
> that has changed.

Yep, this is one nice advantage over, say, apt for example. Especially
for users who don't have ridiculously high bandwidth connections.
 
> Ok, so that's how it is right now.
>
> Python (the language yup is written in) has an interesting feature: it
> allows you to save an object to a file and load it again later. This is
> called pickling an object.
>
> I think this could help us in a couple ways:
> - pickles would eliminate parsing time (replacing it with much faster load
> time).
> - pickles would reduce yup.db.list size significantly because they would not
> be human-readable. (All the yup.db files should really be transparently
> gzipped already, but that doesn't matter now...)
> - pickles could also allow us to avoid keeping both yup.db.list and
> yup.db.init around (detailed below).
>
> Pickles would *only* be used to hold package header information - not the
> packages themselves. In fact it may be best to pickle each rpm's python
> header structure itself (I haven't looked at it yet).

Going down this road, you might be better off just dumping the actual
header of the RPM. It's easily retrievable via hdr.unload() once you've
read the RPM header during the yup-arch (or replacement) run, and you
can then reread it on the client side with rpm.headerLoad(string).
This is quite likely even smaller than pickling.

Unfortunately, neither of the above addresses the issue that just having
1000 headers loaded into memory entails a large footprint. Some of this
can probably be addressed by looking at all of the tags in the RPM
headers and just including the "necessary" ones in your header
structure. You do gain an advantage in using regular RPM headers in
that you can handle all version comparisons within RPM instead of
needing custom logic to do so.
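
A rough sketch of the "only keep the necessary tags" idea follows. Plain
dicts stand in for RPM headers so that it runs without rpm-python, and
both the tag names and the choice of which tags count as "necessary" are
assumptions made up for illustration, not anything yup does today.

```python
# Plain dicts stand in for RPM headers. The tag names and the set of
# "necessary" tags below are illustrative assumptions only.
NECESSARY_TAGS = ("name", "epoch", "version", "release", "arch",
                  "requires", "provides")


def trim_header(header, keep=NECESSARY_TAGS):
    """Return a copy of `header` holding only the tags named in `keep`."""
    return {tag: header[tag] for tag in keep if tag in header}


# A made-up header with some bulky tags that dependency solving never reads.
full = {
    "name": "ElectricFence",
    "epoch": None,
    "version": "2.2.2",
    "release": "5",
    "arch": "ppc",
    "requires": ["libc.so.6"],
    "provides": ["ElectricFence"],
    "description": "Long free-form text. " * 20,
    "changelog": ["entry"] * 50,
}

slim = trim_header(full)
```

Whether the trimmed structure is still something RPM itself can compare
is a separate question; the sketch only shows the memory-saving half.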

> The (server) directory structure I'm considering is something like this:
>
> yup.version: [text file]
> 0.8
> yup.arch.ppc/list: [text file]
> ElectricFence 2.2.2-5
> ImageMagick 5.2.7-2
> ...
> yup.arch.ppc/pkginfo/ [directory of pickled objects]
> ElectricFence.pickle
> ImageMagick.pickle
> ...
> yup.arch.ppc/pkg/ [directory of rpm's]
> ElectricFence-2.2.2-5.ppc.rpm
> ImageMagick-5.2.7-2.ppc.rpm
> ...
>
> I'm not sure any other files outside a yup.arch directory are needed. Please
> correct me if I've overlooked something.

A base config file will still be needed to continue to support things
such as server-side exclusion of packages and other site options (e.g.
package groups are supported like this). The listing of package
information seems reasonable, although you need to make sure that you
include epoch information (canonically represented as E:V-R).
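
To make the E:V-R point concrete, here is a minimal sketch of splitting
such a string apart. The function names are hypothetical, treating a
missing epoch as "0" is an assumption, and the "name E:V-R" line layout
is just the proposed list format with an epoch added. Actual version
comparison is deliberately left out, since that is better done by RPM.

```python
def parse_evr(evr):
    """Split 'E:V-R' into (epoch, version, release) strings.

    Assumption: a missing epoch is reported as "0".
    """
    epoch, sep, rest = evr.partition(":")
    if not sep:
        # No ':' present, so the whole string is V-R.
        epoch, rest = "0", evr
    version, _, release = rest.rpartition("-")
    if not version:
        # No '-' present either: treat everything as the version.
        version, release = rest, ""
    return epoch, version, release


def parse_list_line(line):
    """Parse one 'name E:V-R' line of a hypothetical list file."""
    name, evr = line.split(None, 1)
    return name, parse_evr(evr.strip())
```

For example, parse_list_line("ImageMagick 1:5.2.7-2") gives
("ImageMagick", ("1", "5.2.7", "2")).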

> Before executing any action, yup does the following (for each repository):
> 1. yup would always download the 'list' file for each arch.
> 2. yup would compare the version of each package in 'list' with the version
> of the local pickle.

You might compare against the locally installed version as opposed to
the pickle available and only download if there's a newer remote package
than you have locally. Why download things you don't need? This could
actually also be extended to downloading them in general.

> 3. If the remote version is more current, yup downloads the updated pickle.

> In this way, yup keeps the most up-to-date pickles locally at all times,
> which it can then use to quickly make decisions regarding the availability
> of updates, dependencies, etc. There also is no single 'master' list - the
> pickles can be organized by repository. (In theory this organization could
> be added to the current yup source, but I believe there are too many
> assumptions that only one package list exists.)
>
> There are optimizations that can and should be made, but does this general
> idea sound ok? Any comments welcome.

Generally sounds reasonable with the caveats above... when thinking
about this in the past, my thoughts were pickling data, raw RPM headers,
or db3. A *very* quick look shows that you might be better off using
the raw RPM header instead of pickling it; further investigation would
help there though.

Other longer-term things which were on my list:
* Redo the config file format because it's utterly inane; especially
if the yup list stuff is going away, it makes no sense to keep :P Just
using something that can be parsed with ConfigParser would make it a
billion times better
* Just use urllib for all file transfers instead of directly invoking
ftplib and httplib; should be simple and mainly helps in terms of KISS
and improving reliability
* More cleanups of how all of the output is handled, separating backend
from front-end and general good things like that
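
For the first two items, something along these lines is what I have in
mind. It is only a sketch, written for current Python, where the modules
are spelled configparser and urllib.request. The section and option
names, the example.com URLs and the helper names are all invented for
illustration, not an actual yup config format.

```python
import configparser
import urllib.request

# Hypothetical config layout: one [repo:NAME] section per repository.
SAMPLE = """\
[main]
cachedir = /var/cache/yup

[repo:base]
url = http://example.com/yup/base
exclude = kernel* XFree86*

[repo:updates]
url = ftp://example.com/yup/updates
"""


def load_repos(text):
    """Return {repo name: {"url": ..., "exclude": [...]}} from config text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    repos = {}
    for section in parser.sections():
        if section.startswith("repo:"):
            repos[section[len("repo:"):]] = {
                "url": parser.get(section, "url"),
                "exclude": parser.get(section, "exclude", fallback="").split(),
            }
    return repos


def fetch(url):
    """Fetch a URL; urllib picks the http or ftp handler from the scheme."""
    with urllib.request.urlopen(url) as response:
        return response.read()


repos = load_repos(SAMPLE)
```

The same fetch() call then serves both http and ftp repositories, which
is the whole point of not calling ftplib and httplib directly.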

Jeremy

-- 
Jeremy Katz
katzj@redhat.com        | jlkatz@eos.ncsu.edu
katzj@linuxpower.org    | Developer, NCSU Realm Kit for Red Hat Linux
GPG fingerprint: 367E 8B6B 5E57 2BDB 972A 4D73 C83C B4E8 89FE 392D


