Recent comments posted to this site:
TL;DR, the patch
diff --git a/Remote/Git.hs b/Remote/Git.hs
index 6b7dc77d98..4faaea082d 100644
--- a/Remote/Git.hs
+++ b/Remote/Git.hs
@@ -482,7 +482,12 @@ inAnnex' repo rmt st@(State connpool duc _ _ _ _) key
keyUrls :: GitConfig -> Git.Repo -> Remote -> Key -> [String]
keyUrls gc repo r key = map tourl locs'
where
- tourl l = Git.repoLocation repo ++ "/" ++ l
+ tourl l = Git.repoLocation repo ++ "/" ++ escapeURIString escchar l
+ -- Escape characters that are not allowed unescaped in a URI
+ -- path component, but don't escape '/' since the location
+ -- is a path with multiple components.
+ escchar '/' = True
+ escchar c = isUnescapedInURIComponent c
-- If the remote is known to not be bare, try the hash locations
-- used for non-bare repos first, as an optimisation.
locs
seems to work well. Built in https://github.com/datalad/git-annex/pull/251 (CI tests are still running), tested locally:
❯ /usr/bin/git-annex version
git-annex version: 10.20260115+git119-g43a3f3aaf2-1~ndall+1
build flags: Assistant Webapp Inotify DBus DesktopNotify TorrentParser MagicMime Benchmark Feeds Testsuite S3 WebDAV Servant
dependency versions: aws-0.24.1 bloomfilter-2.0.1.2 crypton-0.34 DAV-1.3.4 feed-1.3.2.1 ghc-9.6.6 http-client-0.7.17 torrent-10000.1.3 uuid-1.3.15 yesod-1.6.2.1
...
❯ /usr/bin/git-annex get --from origin video.mkv
get video.mkv (from origin...) ok
(recording state in git...)
and confirmed it to work. Here is Claude's analysis which led it to the fix:
Bug Analysis: fails_to_get_from_apache2_server_URL_backend_file
Root Cause
The bug is in Remote/Git.hs:485 — the keyUrls function constructs URLs by simple string concatenation without URL-encoding the path components:
tourl l = Git.repoLocation repo ++ "/" ++ l
How the failure occurs
1. Key: URL--yt:https://www.youtube.com/watch,63v,613ZXfZfnRfyM
2. keyFile encoding (Annex/Locations.hs:783-795) converts : → &c and / → %:
URL--yt&chttps&c%%www.youtube.com%watch,63v,613ZXfZfnRfyM
3. keyUrls concatenates this directly into the URL path:
https://datasets.datalad.org/.../.git//annex/objects/zZ/3v/URL--yt&chttps&c%%www.youtube.com%watch,63v,613ZXfZfnRfyM/...
4. parseURIRelaxed (Utility/Url/Parse.hs:45-47) tries to parse this URL. It calls escapeURIString isAllowedInURI first, but % is allowed in URIs (it's the percent-encoding introducer), so it passes through unescaped.
5. parseURI then sees %%w and %wa which are invalid percent-encoding sequences (% must be followed by two hex digits). The parse fails, returning Nothing.
6. download' (Utility/Url.hs:389-391) hits the Nothing branch and returns "invalid url".
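To make steps 4-6 concrete, here is a small standalone sketch using the network-uri package (the base URL below is shortened and made up; only the key path matters):

    import Network.URI (parseURI, escapeURIString, isAllowedInURI)

    -- A URL built the way keyUrls does it, with the keyFile-encoded key
    -- from step 2 concatenated in unescaped (hypothetical base URL).
    badUrl :: String
    badUrl = "https://example.org/ds/.git/annex/objects/zZ/3v/"
          ++ "URL--yt&chttps&c%%www.youtube.com%watch,63v,613ZXfZfnRfyM"

    main :: IO ()
    main = do
      -- Step 4: '%' counts as allowed in a URI, so this escaping leaves
      -- the bare '%' characters alone.
      let relaxed = escapeURIString isAllowedInURI badUrl
      -- Step 5: "%%w" is not a valid percent-encoded sequence, so the
      -- whole URL fails to parse.
      print (parseURI relaxed)   -- Nothing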
Why SHA256E keys work
SHA256E keys (like SHA256E-s107998--4545...) contain only alphanumeric characters, '-', and '.'. The keyFile encoding produces no '%' or '&' characters, so the concatenated URL is always valid.
The fix
keyUrls in Remote/Git.hs:485 needs to URL-encode the path components. Other remotes already do this:
- S3 (Remote/S3.hs:1221-1229): uses escapeURIString with a custom predicate keeping / but encoding everything else
- WebDAV (Remote/WebDAV/DavLocation.hs:35): uses escapeURIString isUnescapedInURI
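For comparison, a sketch of what the patched tourl produces, using the same predicate as the patch (same hypothetical base URL as in the sketch above):

    import Network.URI (parseURI, escapeURIString, isUnescapedInURIComponent)

    -- The patch's predicate: keep '/' unescaped since the location is a
    -- path with multiple components, escape everything else that is not
    -- allowed unescaped in a URI path component.
    escchar :: Char -> Bool
    escchar '/' = True
    escchar c = isUnescapedInURIComponent c

    main :: IO ()
    main = do
      let loc = "annex/objects/zZ/3v/"
             ++ "URL--yt&chttps&c%%www.youtube.com%watch,63v,613ZXfZfnRfyM"
          url = "https://example.org/ds/.git/" ++ escapeURIString escchar loc
      -- '%' becomes %25, '&' becomes %26, ',' becomes %2C, and the URL
      -- now parses.
      print (parseURI url)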
A colleague used a wrong config, which pointed to the minio console rather than the S3 endpoint. When they ran initremote, the console wrongly replied 200-OK when PUTting the annex-uuid file, and the same when they then pushed the data. The minio console always redirects to a login page and doesn't fail on PUT (which is non-compliant). So the dataset recorded all the data as being present in that remote, while there was no trace of any buckets or objects in the S3 storage.
steps to reproduce:
git init test_s3
cd test_s3/
git-annex init
export AWS_ACCESS_KEY_ID=john AWS_SECRET_ACCESS_KEY=doe
git annex initremote -d test_remote host="play.min.io" bucket="test_bucket" type=S3 encryption=none autoenable=true port=9443 protocol=https chunk=1GiB requeststyle=path
echo test > test_annexed_file
git-annex add test_annexed_file
git commit -m 'add annexed file'
git-annex copy --fast --to test_remote
I am showing it with the --fast flag here, as this is what datalad uses by default. Without --fast, it fails with (HeaderException {headerErrorMessage = "ETag missing"}), which is better.
So to sum it up, the unfortunate circumstances are:
- the initremote PUT of annex-uuid does not check that the annex-uuid file was actually pushed to the bucket.
- the minio console replies 200-OK to all HTTP requests.
- datalad uses push --fast by default, which records files as pushed without performing a HEAD after the push. I guess that's for performance reasons, but it is dangerous if a server or reverse proxy ends up responding 200-OK to all requests after init.
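Not git-annex code, but a minimal http-client sketch of the kind of post-store check that would have caught this (the endpoint, bucket and object names are made up from the repro above):

    {-# LANGUAGE OverloadedStrings #-}
    import Network.HTTP.Client
    import Network.HTTP.Client.TLS (tlsManagerSettings)
    import Network.HTTP.Types.Status (statusCode)
    import qualified Data.ByteString.Lazy.Char8 as L

    -- Hypothetical object URL (the misconfigured console endpoint).
    objectUrl :: String
    objectUrl = "https://play.min.io:9443/test_bucket/annex-uuid"

    main :: IO ()
    main = do
      mgr <- newManager tlsManagerSettings
      req <- parseRequest objectUrl
      putResp <- httpLbs req { method = "PUT"
                             , requestBody = RequestBodyLBS (L.pack "uuid") } mgr
      -- A 200 here proves nothing if the endpoint answers 200 to every
      -- request (e.g. a console redirecting everything to its login page).
      print (statusCode (responseStatus putResp))
      -- Reading the object back (HEAD or GET) after the PUT would expose
      -- the misconfiguration instead of silently recording the copy.
      headResp <- httpLbs req { method = "HEAD" } mgr
      print (statusCode (responseStatus headResp))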
Thanks for your help!
My remote (version 1) uses a database that is multithreaded but has process-level locking. Despite using async, multiple remote processes are still being started in testremote. Right now I have it working with POSIX advisory locks and open/close the database for each operation in each thread in each process, but that's a lot of overhead. Is there a better way to do this? I could make them coordinate via IPC, or have them release the lock only when idle/when others are waiting, but it seems like it shouldn't be that complex.
I get clear failures when I use testremote. On real workloads (with -j 24) it is more confusing. There are no errors, but at some point the git-annex command hangs. Quite possibly a bug in my code, given that testremote is failing.
I guess my question is: is there a way to force git-annex to use only one special remote process, either by configuration or by having all but the first return "use the other one" (without always forcing -j 1)? And does the way this is handled differ between actual use and testremote?
Or to put it another way: how do you envision one should design a special remote that supports concurrency and relies on a database with process-level locking?
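For context, here is roughly what my current per-operation locking looks like, sketched with the filelock package (which uses flock rather than fcntl locks, but the shape is the same; the names are made up and the real database layer is more involved):

    import System.FileLock (SharedExclusive(Exclusive), withFileLock)

    -- Serialize one database operation across threads and processes by
    -- holding an exclusive advisory lock on a sidecar lock file only for
    -- the duration of that operation; open/close of the database happens
    -- inside op.
    withDbOp :: FilePath -> (FilePath -> IO a) -> IO a
    withDbOp dbPath op =
      withFileLock (dbPath ++ ".lock") Exclusive (\_lock -> op dbPath)

The overhead comes from taking and dropping the lock (and reopening the database) on every single operation, which is why I'm wondering about holding it longer and releasing only when another process is waiting.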
This is due to the assistant not supporting submodules. Nothing has ever been done to make it support them.
When git check-ignore --stdin is passed a path in a submodule, it exits.
We can see this happen near the top of the log:
fatal: Pathspec 'code/containers/.codespellrc' is in submodule 'code/containers'
git check-ignore EOF: user error
The subsequent "resource vanished (Broken pipe)" errors occur each time git-annex tries to talk to git check-ignore.
Indeed, looking at the source code to check-ignore, if it's passed a path inside a submodule, it errors out, and so won't be listening to stdin for any more paths:
joey@darkstar:~/tmp/t>git check-ignore --stdin
r/x
fatal: Pathspec 'r/x' is in submodule 'r'
- exit 128
And I was able to reproduce this by having a submodule with a file in it, and starting the assistant.
In some cases, the assistant still added files despite check-ignore having crashed. (It will even add gitignored files when check-ignore has crashed.) In other cases not. The problem probably extends beyond check-ignore to staging files as well. Eg, "git add submodule/foo bar" will error out on the file in the submodule and never get to the point of adding the second file.
Fixing this would need an inexpensive way to query git about whether a file is in a submodule. Passing the files that the assistant gathers through git ls-files --modified --others might be the only way to do that.
Using that at all efficiently would need some other changes, because it needs to come before the ignore check, which is currently done for each file event. The ignore check would need to be moved to the point where a set of files has been gathered, so ls-files can be run once on that set of files.
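A rough sketch of that filtering step (the helper name is made up, and whether ls-files copes with pathspecs inside submodules, rather than erroring like check-ignore does, would still need checking):

    import System.Process (readProcess)
    import qualified Data.Set as S

    -- Keep only the gathered paths that git ls-files reports as modified
    -- or untracked in this repository; a file living inside a submodule
    -- would not be listed, so it gets dropped here.
    filterOwnFiles :: [FilePath] -> IO [FilePath]
    filterOwnFiles paths = do
      -- simplification: one path per output line (a NUL-separated variant
      -- would be needed to cope with newlines in filenames)
      out <- readProcess "git"
               (["ls-files", "--modified", "--others", "--"] ++ paths) ""
      let listed = S.fromList (lines out)
      return (filter (`S.member` listed) paths)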
"ideally there should be no locking for the entire duration of get since there could be hundreds of clients trying to get that file"
It's somewhat more complex than that, but git-annex's locking does take concurrency into account.
The transfer locking specifically is there to avoid issues like git-annex get of the same file being run in the same repo in 2 different terminals. So it intentionally does not allow concurrency, except in this particular case where multiple clients are downloading.
Reproducer worked for me.
This seems specific to using a local git remote; it will not happen over ssh.
In Remote.Git, copyFromRemote calls runTransfer in the remote repository. That should be alwaysRunTransfer, as is usually used when git-annex is running as a server to send files, which avoids this problem.
I can't follow the link as it appears to be broken, so I couldn't check whether there was a fix for the high RAM usage.
My borg repo is only about 100 GB, but it has a large amount of files:
local annex keys: 850128
local annex size: 117.05 gigabytes
annexed files in working tree: 1197089
The first sync action after my first borg backup (only one snapshot) is currently using about 26 GB of my 32 GB of RAM and possibly still climbing.
Ideally there should be no locking for the entire duration of get, since there could be hundreds of clients trying to get that file. If locking is needed to update the git-annex branch, it had better be journalled and flushed all at once, or locked only for the git-annex branch edit (anyway, it's better to debounce multiple operations).