If we have thousands of syntax nodes on both sides, we can end
up attempting to preallocate a very large hashmap.
In #542, a user hit an issue with two JSON files where the LHS had
33,000 syntax nodes and the RHS had 34,000 nodes, so we'd attempt to
preallocate a hashmap of capacity 1,122,000,000. This required
allocating 70,866,960,400 bytes (roughly 66 GiB).
Impose a sensible limit on the hashmap.
Fixes#542
Show the hunk count and detected language in a dimmed style. This
information is less important than the diff content itself, so this
change makes the important information more prominent.
First part of #544
I've observed PDF files that have sufficiently large headers that they
were detected as text, which wasn't helpful.
Also improve logging to report how many invalid bytes were found.
Previously we didn't check the state of children, which was an
oversight from the original implementation. As a result, we fixed
nested sliders in fewer situations.
Fixes#535
This is important when comparing short string literals. This change
has improved several cases in sample_files/ but I've added a new
example that made the previous unwanted behaviour much more obvious.
This saves us searching the hash map twice. This is a modest
performance improvement: an instruction count reduction of 4% on
slow_before.rs, and 1% reduction on typing_before.ml.
This is a 4% reduction in instructions for typing_before.ml, but a
0.2% increase instructions for slow_before.rs. This seems like a win
overall, and it also keeps the codebase more consistent and simpler.
This was intended to allow usage of .entry_ref(), but it's already a
performance win without using that API! It's around a 9% reduction in
instructions in slow_before.rs, and 2% reduction in typing_before.ml.
If a string is replaced with another, apply subword highlighting
similar to how we handle replaced comments.
Co-authored-by: Wilfred Hughes <me@wilfred.me.uk>
Inside text files, it seems to be better to be conservative and
consider abc123def as one word rather than three.
This is noticeable when looking at changes to the compare.expected
file, which contains hashes. 123c456 and 345c789 don't really have a
`c` in common, so subword highlighting is ugly.
The contiguous penalty was an attempt to fix the slider problem:
// Old
A B
C D
// New
A B
A B
C D
// Unwanted diff
A +B+
+A+ B
C D
However, it doesn't make sense for Dijkstra, which is stateless. The
best route from vertex X is independent of how we got to vertex X.
This worked by dumb luck: in some circumstances we terminate early
rather than fully executing Dijkstra's algorithm. This cost tweak
improved results on a few test files. However, the post-processing
slider logic is a proper, general solution. This was added much later.
There's no reason to keep the contiguous penalty now. It's confusing,
and makes adding new edge costs with consistent 'X costs more than Y'
behaviours more difficult.
Performance is essentially neutral: a small decrease in
typing_before.ml, a small increase in slow_before.rs.
This ensures that choosing a unchanged non-punctuation atom with some
novel atoms is better than choosing punctuation and some changed
comments. This produces better results in general, see
comma_and_comment_after.js for an example.
This will be more noticeable after the next commit, where costs of
novel atoms are in a smaller range of values.
This is harder to reason about, and
2e6666041f did not include a motivating
test case.
Removing contiguous status is a minor perf improvement (2% reduction
in instructions), makes the code simpler, and does not significantly
affect diffing results.
Of the two sample files that have changed, the erlang_before.erl file
has improved and nest_before.rs is neutral.
This is equivalent (increased cost on unchanged nodes vs decreased
cost on changed nodes), but easier to reason about.
Previously we have multiple notions of changed atoms: NovelAtomLHS,
NovelAtomRHS, and ReplacedComment. We want to consider punctuation as
less desirable even when e.g. comments arereplaced.
This makes constructing hunks harder to reason about.
This change doesn't affect output, but helps when debugging, as it
makes multiline atoms much less common.
Fix an incorrect assumption about how LANG_FILE_NAMES works.
We want the alire.toml file to still be treated as TOML (and not Ada),
so adding an entry here is not necessary for Ada support.
Implement support in difftastic for the Ada programming language
using the treesitter grammar provided in 'briot/tree-sitter-ada'.
Language detection depends on the following suffixes:
* adb
* ads
* ada
The presence of the alire TOML file (alire.toml) is also used as
a heuristic.
Line numbers may be less than .max_line(), as .max_line() trims
whitespace. Ensure pad_after() is robust to this, and add a test.
I could only reproduce the crash in inline display mode, but in
principle this could be an issue in all modes.
Fixes#452
This substantially improves performance on text files where there are
few lines in common.
For example, 10,000 line files with no lines in common is more than 10x
faster (8.5 seconds to 0.49 seconds on my machine), and
sample_files/huge_cpp_before.cpp is nearly 2% faster.
Fixes the case mentioned by @quackenbush in #236.
This is inspired by the heuristics discussions at
https://github.com/mitsuhiko/similar/issues/15
Currently it contains a nested string node, even though it's a fixed
set of known types. This was preventing us from applying good syntax
highlighting.
This was particularly noticeable with `string`, which wasn't
previously highlighted as a type.
This allows given nodes (configurable per-language, using tree-sitter's
query syntax) to be re-parsed as other languages. The canonical example
is CSS or JavaScript inside HTML, which normally would be a single token
but now can get the full range of syntax highlighting and tree diffing.
The config sets this up for only two languages: HTML (contains CSS or
JavaScript in <script> or <style> tags; we don't support style="" or
onclick="" etc. at this point), and Makefiles (contains Bash in
$(shell ...) commands). The latter is fairly obscure; the big win is
in the former.
It would be nice to also have this support for PHP; however, the HTML
parser seems to be a bit confused when asked to parse the partial HTML
blocks we get if we just mark the "text" blocks as HTML, so for this
to work well, probably the PHP blocks should be parsed as sub-languages
of HTML instead of vice versa.
Also, as a minor quibble, there should be support for bash in Perl's
backticks (similar to in Makefiles), but the tree-sitter Perl parser
does not support backticks at all (it goes into error recovery).
There may have been languages that I've missed, e.g. some languages
might have nodes that contain e.g. SQL.
Fixes#382. Potentially relevant to #376.
Previously we highlighted changed whitespace, which led to ugly
results if the number of words changed (there was a different number
of whitespace characters so some were highlighted).
Also treat _ and - as word constituents, as it produces nicer results
when people write example CLI invocations in comments.
@QuarticCat observed that popping delimiters is unnecessary, and saw a
speedup in PR #401. This reduces the number of nodes in typical graphs
by ~20%, reducing runtime and memory usage.
This works because there is only one thing we can do at the end of a
list: pop the delimiter. The syntax node on the other side does not
give us more options, we have at most one. Popping all the delimiters
as soon as possible is equivalent, and produces the same graph route.
This change has also slightly changed the output of
samples_files/slow_after.rs, producing a better (more minimal)
diff. This is probably luck, due to the path-dependent nature of the
route solving logic, but it's a positive sign.
A huge thanks to @QuarticCat for their contributions, this is a huge
speedup.
Co-authored-by: QuarticCat <QuarticCat@pm.me>
These can still be queried, but they don't appear in parse tree
results. This makes the corpus tests a little simpler to maintain,
and it's more in line with other markers in the grammar
Tree-sitter does not like parsing empty chars, which is what a null
byte registers as.
This fixes that by forcing the *_NAMERSPACED_NAMES to require
at least one character to be picked up
See
https://github.com/tree-sitter/tree-sitter/issues/98
This commit was squashed
This is the commit message 2:
Don't require at least 1 char for namespaced symbol names.
This allows for symbols like `this-one/` to be parsed without error.
They occur frequently in leiningen templates, which are not valid
clojure files, but are still common enough to account for. Note that
clojure repls will not accpet them as valid symbols. The reader throws
and error
This is the commit message 3
Revert "Don't require at least 1 char for namespaced symbol names."
The leiningnen templates are not valid clojure code, so we won't make
a special exception in the parsing for them, they can just be invalid.
Parsing still works, we just get an error node in the `sym_lit` nodes
when we fail to matach a `sym_name`
Adds runtime conflict resolution for keywords, when one keywords can
match both keyword rules, treesitter will prefer using the
_kwd_qualified rule over _kwd_unqualified
Accounts for things like '(+ - * /)
but creates individual symbol tokens for invalid clojure code like
(/////) ;; 1 token per /
/asdf ;; 2 tokens: / and asdf
/asdf/hjkl ;; 2 token: / and asfd/hjkl
; Correct comment, squash this commit later
This helps skimming the results when multiple files are changed with
multiple hunks. It makes the file changing more prominent than just
going from e.g. 5/5 to 1/10.
Fixes#400
Acked-by: Wilfred Hughes <me@wilfred.me.uk>