Recent comments posted to this site:
TL;DR, the patch
diff --git a/Remote/Git.hs b/Remote/Git.hs
index 6b7dc77d98..4faaea082d 100644
--- a/Remote/Git.hs
+++ b/Remote/Git.hs
@@ -482,7 +482,12 @@ inAnnex' repo rmt st@(State connpool duc _ _ _ _) key
keyUrls :: GitConfig -> Git.Repo -> Remote -> Key -> [String]
keyUrls gc repo r key = map tourl locs'
where
- tourl l = Git.repoLocation repo ++ "/" ++ l
+ tourl l = Git.repoLocation repo ++ "/" ++ escapeURIString escchar l
+ -- Escape characters that are not allowed unescaped in a URI
+ -- path component, but don't escape '/' since the location
+ -- is a path with multiple components.
+ escchar '/' = True
+ escchar c = isUnescapedInURIComponent c
-- If the remote is known to not be bare, try the hash locations
-- used for non-bare repos first, as an optimisation.
locs
seems to work well. Built in https://github.com/datalad/git-annex/pull/251 (CI tests are still running), tested locally:
❯ /usr/bin/git-annex version
git-annex version: 10.20260115+git119-g43a3f3aaf2-1~ndall+1
build flags: Assistant Webapp Inotify DBus DesktopNotify TorrentParser MagicMime Benchmark Feeds Testsuite S3 WebDAV Servant
dependency versions: aws-0.24.1 bloomfilter-2.0.1.2 crypton-0.34 DAV-1.3.4 feed-1.3.2.1 ghc-9.6.6 http-client-0.7.17 torrent-10000.1.3 uuid-1.3.15 yesod-1.6.2.1
...
❯ /usr/bin/git-annex get --from origin video.mkv
get video.mkv (from origin...) ok
(recording state in git...)
and confirmed it to work. Here is Claude's analysis which led it to the fix:
Bug Analysis: fails_to_get_from_apache2_server_URL_backend_file
Root Cause
The bug is in Remote/Git.hs:485 — the keyUrls function constructs URLs by simple string concatenation without URL-encoding the path components:
tourl l = Git.repoLocation repo ++ "/" ++ l
How the failure occurs
1. Key: URL--yt:https://www.youtube.com/watch,63v,613ZXfZfnRfyM
2. keyFile encoding (Annex/Locations.hs:783-795) converts : → &c and / → %:
URL--yt&chttps&c%%www.youtube.com%watch,63v,613ZXfZfnRfyM
3. keyUrls concatenates this directly into the URL path:
https://datasets.datalad.org/.../.git//annex/objects/zZ/3v/URL--yt&chttps&c%%www.youtube.com%watch,63v,613ZXfZfnRfyM/...
4. parseURIRelaxed (Utility/Url/Parse.hs:45-47) tries to parse this URL. It calls escapeURIString isAllowedInURI first, but % is allowed in URIs (it's the percent-encoding introducer), so it passes through unescaped.
5. parseURI then sees %%w and %wa which are invalid percent-encoding sequences (% must be followed by two hex digits). The parse fails, returning Nothing.
6. download' (Utility/Url.hs:389-391) hits the Nothing branch and returns "invalid url".
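To make steps 4-6 concrete, here is a small standalone sketch using the network-uri package (the base URL below is shortened and made up; only the key path matters):

    import Network.URI (parseURI, escapeURIString, isAllowedInURI)

    -- A URL built the way keyUrls does it, with the keyFile-encoded key
    -- from step 2 concatenated in unescaped (hypothetical base URL).
    badUrl :: String
    badUrl = "https://example.org/ds/.git/annex/objects/zZ/3v/"
          ++ "URL--yt&chttps&c%%www.youtube.com%watch,63v,613ZXfZfnRfyM"

    main :: IO ()
    main = do
      -- Step 4: '%' counts as allowed in a URI, so this escaping leaves
      -- the bare '%' characters alone.
      let relaxed = escapeURIString isAllowedInURI badUrl
      -- Step 5: "%%w" is not a valid percent-encoded sequence, so the
      -- whole URL fails to parse.
      print (parseURI relaxed)   -- Nothing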
Why SHA256E keys work
SHA256E keys (like SHA256E-s107998--4545...) contain only alphanumeric characters, '-', and '.'. The keyFile encoding produces no '%' or '&' characters, so the concatenated URL is always valid.
The fix
keyUrls in Remote/Git.hs:485 needs to URL-encode the path components. Other remotes already do this:
- S3 (Remote/S3.hs:1221-1229): uses escapeURIString with a custom predicate keeping / but encoding everything else
- WebDAV (Remote/WebDAV/DavLocation.hs:35): uses escapeURIString isUnescapedInURI
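For comparison, a sketch of what the patched tourl produces, using the same predicate as the patch (same hypothetical base URL as in the sketch above):

    import Network.URI (parseURI, escapeURIString, isUnescapedInURIComponent)

    -- The patch's predicate: keep '/' unescaped since the location is a
    -- path with multiple components, escape everything else that is not
    -- allowed unescaped in a URI path component.
    escchar :: Char -> Bool
    escchar '/' = True
    escchar c = isUnescapedInURIComponent c

    main :: IO ()
    main = do
      let loc = "annex/objects/zZ/3v/"
             ++ "URL--yt&chttps&c%%www.youtube.com%watch,63v,613ZXfZfnRfyM"
          url = "https://example.org/ds/.git/" ++ escapeURIString escchar loc
      -- '%' becomes %25, '&' becomes %26, ',' becomes %2C, and the URL
      -- now parses.
      print (parseURI url)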
A colleague used a wrong config, which pointed to the minio console rather than the S3 endpoint. When they ran initremote, the console wrongly replied 200-OK when PUTting the annex-uuid file, and the same when they then pushed the data. The minio console always redirects to a login page and doesn't fail on PUT (which is non-compliant). So the dataset recorded all the data as being present in that remote, while there was no trace of any buckets or objects in the S3 storage.
steps to reproduce:
git init test_s3
cd test_s3/
git-annex init
export AWS_ACCESS_KEY_ID=john AWS_SECRET_ACCESS_KEY=doe
git annex initremote -d test_remote host="play.min.io" bucket="test_bucket" type=S3 encryption=none autoenable=true port=9443 protocol=https chunk=1GiB requeststyle=path
echo test > test_annexed_file
git-annex add test_annexed_file
git commit -m 'add annexed file'
git-annex copy --fast --to test_remote
I am showing it with the --fast flag here, as this is what datalad uses by default. Without --fast, it fails with (HeaderException {headerErrorMessage = "ETag missing"}), which is better.
So to sum it up, the unfortunate circumstances are:
- the initremote PUT of annex-uuid does not check that the annex-uuid file was actually pushed to the bucket.
- the minio console replies 200-OK to all HTTP requests.
- datalad uses push --fast by default, which records files as pushed without performing a HEAD after the push. I guess that's for performance reasons, but it is dangerous if a server or reverse proxy ends up responding 200-OK to all requests after init.
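Not git-annex code, but a minimal http-client sketch of the kind of post-store check that would have caught this (the endpoint, bucket and object names are made up from the repro above):

    {-# LANGUAGE OverloadedStrings #-}
    import Network.HTTP.Client
    import Network.HTTP.Client.TLS (tlsManagerSettings)
    import Network.HTTP.Types.Status (statusCode)
    import qualified Data.ByteString.Lazy.Char8 as L

    -- Hypothetical object URL (the misconfigured console endpoint).
    objectUrl :: String
    objectUrl = "https://play.min.io:9443/test_bucket/annex-uuid"

    main :: IO ()
    main = do
      mgr <- newManager tlsManagerSettings
      req <- parseRequest objectUrl
      putResp <- httpLbs req { method = "PUT"
                             , requestBody = RequestBodyLBS (L.pack "uuid") } mgr
      -- A 200 here proves nothing if the endpoint answers 200 to every
      -- request (e.g. a console redirecting everything to its login page).
      print (statusCode (responseStatus putResp))
      -- Reading the object back (HEAD or GET) after the PUT would expose
      -- the misconfiguration instead of silently recording the copy.
      headResp <- httpLbs req { method = "HEAD" } mgr
      print (statusCode (responseStatus headResp))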
Thanks for your help!
My remote (version 1) uses a database that is multithreaded but has process-level locking. Despite using async, multiple remote processes are still being started in testremote. Right now I have it working with POSIX advisory locks and open/close the database for each operation in each thread in each process, but that's a lot of overhead. Is there a better way to do this? I could make them coordinate via IPC, or have them release the lock only when idle/when others are waiting, but it seems like it shouldn't be that complex.
I get clear failures when I use testremote. On real workloads (with -j 24) it is more confusing. There are no errors, but at some point the git-annex command hangs. Quite possibly a bug in my code, given that testremote is failing.
I guess my question is: is there a way to force git-annex to use only one special remote process, either by configuration or by having all but the first return "use the other one" (without always forcing -j 1)? And does the way this is handled differ between actual use and testremote?
Or to put it another way: how do you envision one should design a special remote that supports concurrency and relies on a database with process-level locking?
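For context, here is roughly what my current per-operation locking looks like, sketched with the filelock package (which uses flock rather than fcntl locks, but the shape is the same; the names are made up and the real database layer is more involved):

    import System.FileLock (SharedExclusive(Exclusive), withFileLock)

    -- Serialize one database operation across threads and processes by
    -- holding an exclusive advisory lock on a sidecar lock file only for
    -- the duration of that operation; open/close of the database happens
    -- inside op.
    withDbOp :: FilePath -> (FilePath -> IO a) -> IO a
    withDbOp dbPath op =
      withFileLock (dbPath ++ ".lock") Exclusive (\_lock -> op dbPath)

The overhead comes from taking and dropping the lock (and reopening the database) on every single operation, which is why I'm wondering about holding it longer and releasing only when another process is waiting.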
This is due to the assistant not supporting submodules. Nothing has ever been done to make it support them.
When git check-ignore --stdin is passed a path in a submodule, it exits.
We can see this happen near the top of the log:
fatal: Pathspec 'code/containers/.codespellrc' is in submodule 'code/containers'
git check-ignore EOF: user error
The subsequent "resource vanished (Broken pipe)" errors occur each time git-annex tries to talk to git check-ignore.
Indeed, looking at the source code to check-ignore, if it's passed a path inside a submodule, it errors out, and so won't be listening to stdin for any more paths:
joey@darkstar:~/tmp/t>git check-ignore --stdin
r/x
fatal: Pathspec 'r/x' is in submodule 'r'
- exit 128
And I was able to reproduce this by having a submodule with a file in it, and starting the assistant.
In some cases, the assistant still added files despite check-ignore having crashed. (It will even add gitignored files when check-ignore has crashed.) In other cases not. The problem probably extends beyond check-ignore to staging files as well. Eg, "git add submodule/foo bar" will error out on the file in the submodule and never get to the point of adding the second file.
Fixing this would need an inexpensive way to query git about whether a file is in a submodule. Passing the files that the assistant gathers through git ls-files --modified --others might be the only way to do that.
Using that at all efficiently would need some other changes, because it needs to come before the ignore check, which is currently done for each file event. The ignore check would need to be moved to the point where a set of files has been gathered, so ls-files can be run once on that set of files.
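A rough sketch of that filtering step (the helper name is made up, and whether ls-files copes with pathspecs inside submodules, rather than erroring like check-ignore does, would still need checking):

    import System.Process (readProcess)
    import qualified Data.Set as S

    -- Keep only the gathered paths that git ls-files reports as modified
    -- or untracked in this repository; a file living inside a submodule
    -- would not be listed, so it gets dropped here.
    filterOwnFiles :: [FilePath] -> IO [FilePath]
    filterOwnFiles paths = do
      -- simplification: one path per output line (a NUL-separated variant
      -- would be needed to cope with newlines in filenames)
      out <- readProcess "git"
               (["ls-files", "--modified", "--others", "--"] ++ paths) ""
      let listed = S.fromList (lines out)
      return (filter (`S.member` listed) paths)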
"ideally there should be no locking for the entire duration of get since there could be hundreds of clients trying to get that file"
It's somewhat more complex than that, but git-annex's locking does take concurrency into account.
The transfer locking specifically is there to avoid issues like git-annex get of the same file being run in the same repo in 2 different terminals. So it intentionally does not allow concurrency, except in this particular case where multiple clients are downloading.
Reproducer worked for me.
This seems specific to using a local git remote; it will not happen over ssh.
In Remote.Git, copyFromRemote calls runTransfer in the remote repository. That should be alwaysRunTransfer, as is usually used when git-annex is running as a server to send files, which avoids this problem.
I can't follow the link as it appears to be broken, so I couldn't check whether there was a fix for the high RAM usage.
My borg repo is only about 100 GB, but it has a large amount of files:
local annex keys: 850128
local annex size: 117.05 gigabytes
annexed files in working tree: 1197089
The first sync action after my first borg backup (only one snapshot) is currently using about 26 GB of my 32 GB of RAM and possibly still climbing.
Ideally there should be no locking for the entire duration of get, since there could be hundreds of clients trying to get that file. If locking is needed to update the git-annex branch, it had better be journalled and flushed all at once, or locked only for the git-annex branch edit (anyway, it's better to debounce multiple operations).