Java: splitting a comma-separated string but ignoring commas in quotes

itsource 2022. 7. 14. 20:58

I have a string vaguely like this:

foo,bar,c;qual="baz,blurb",d;junk="quux,syzygy"

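that I want to split by commas, but I need to ignore commas inside quotes. How can I do this? A regexp approach seems likely to fail; I suppose I could scan manually and switch to a different mode when I see a quote, but it would be nice to use a preexisting library. (Edit: I meant libraries that are already part of the JDK or of a commonly used library such as Apache Commons.)

The above string should split into: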

foo
bar
c;qual="baz,blurb"
d;junk="quux,syzygy"

Note: this is NOT a CSV file. It is a single string contained in a file with a larger overall structure.

Try:

public class Main { 
    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
        String[] tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
        for(String t : tokens) {
            System.out.println("> "+t);
        }
    }
}

Output:

> foo
> bar
> c;qual="baz,blurb"
> d;junk="quux,syzygy"

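In other words: split on a comma only if it has zero or an even number of quotes ahead of it (that is, between the comma and the end of the string).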
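The even-quote rule can also be spelled out without a regex, which may make it easier to see why it works. The sketch below is only an illustration: the class name `EvenQuoteRule` and its helper `quotesAfter` are hypothetical, and it assumes quotes are balanced and never escaped.

```java
import java.util.ArrayList;
import java.util.List;

final class EvenQuoteRule {
    /** Counts the double quotes strictly after position index. */
    static int quotesAfter(String s, int index) {
        int count = 0;
        for (int i = index + 1; i < s.length(); i++) {
            if (s.charAt(i) == '"') {
                count++;
            }
        }
        return count;
    }

    /** Splits on every comma that has an even number of quotes ahead of it. */
    static List<String> split(String s) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == ',' && quotesAfter(s, i) % 2 == 0) {
                tokens.add(s.substring(start, i));
                start = i + 1;
            }
        }
        tokens.add(s.substring(start)); // last token, possibly empty
        return tokens;
    }

    public static void main(String[] args) {
        for (String t : split("foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"")) {
            System.out.println("> " + t);
        }
    }
}
```

A comma inside a quoted section always has an odd number of quotes after it (the closing quote plus any complete pairs), so it is never chosen as a split point. Rescanning the tail for every comma is quadratic, so this is meant as an explanation rather than a replacement.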

Or, a bit friendlier on the eyes:

public class Main { 
    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";

        String otherThanQuote = " [^\"] ";
        String quotedString = String.format(" \" %s* \" ", otherThanQuote);
        String regex = String.format("(?x) "+ // enable comments, ignore white spaces
                ",                         "+ // match a comma
                "(?=                       "+ // start positive look ahead
                "  (?:                     "+ //   start non-capturing group 1
                "    %s*                   "+ //     match 'otherThanQuote' zero or more times
                "    %s                    "+ //     match 'quotedString'
                "  )*                      "+ //   end group 1 and repeat it zero or more times
                "  %s*                     "+ //   match 'otherThanQuote'
                "  $                       "+ // match the end of the string
                ")                         ", // stop positive look ahead
                otherThanQuote, quotedString, otherThanQuote);

        String[] tokens = line.split(regex, -1);
        for(String t : tokens) {
            System.out.println("> "+t);
        }
    }
}

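This produces the same output as the first example.

EDIT

As mentioned by @MikeFHay in the comments:

I prefer using Guava's Splitter, as it has saner defaults (see the discussion above about empty matches being trimmed by String#split()), so I did: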
Splitter.on(Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))
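To make the point about trimmed empty matches concrete, here is a small JDK-only sketch (Guava is deliberately left out so the example has no dependencies). The class and method names are made up for illustration; the regex is the one from the answer.

```java
import java.util.Arrays;

final class SplitLimitDemo {
    static final String REGEX = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    /** Default split: trailing empty strings are removed from the result. */
    static String[] splitDefault(String line) {
        return line.split(REGEX);
    }

    /** Negative limit: every field is kept, including trailing empty ones. */
    static String[] splitKeepingEmpties(String line) {
        return line.split(REGEX, -1);
    }

    public static void main(String[] args) {
        String line = "a,\"b,c\",,";
        System.out.println(Arrays.toString(splitDefault(line)));
        System.out.println(Arrays.toString(splitKeepingEmpties(line)));
    }
}
```

With the sample input, the default call yields two tokens while the call with limit -1 yields four, the last two being empty strings. That difference is why the first example in this answer passes -1.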

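While I do like regular expressions in general, for this kind of state-dependent tokenization I believe a simple parser (which in this case is much simpler than that word might make it sound) is probably a cleaner solution, in particular with regard to maintainability: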

String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
List<String> result = new ArrayList<String>();
int start = 0;
boolean inQuotes = false;
for (int current = 0; current < input.length(); current++) {
    if (input.charAt(current) == '\"') inQuotes = !inQuotes; // toggle state
    else if (input.charAt(current) == ',' && !inQuotes) {
        result.add(input.substring(start, current));
        start = current + 1;
    }
}
result.add(input.substring(start));

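If you don't care about preserving the commas inside the quotes, you could simplify this approach (no handling of the start index, no special case for the last token) by replacing the commas inside quotes with something else and then splitting at commas: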

String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
StringBuilder builder = new StringBuilder(input);
boolean inQuotes = false;
for (int currentIndex = 0; currentIndex < builder.length(); currentIndex++) {
    char currentChar = builder.charAt(currentIndex);
    if (currentChar == '\"') inQuotes = !inQuotes; // toggle state
    if (currentChar == ',' && inQuotes) {
        builder.setCharAt(currentIndex, ';'); // or '♡', and replace later
    }
}
List<String> result = Arrays.asList(builder.toString().split(","));
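If the commas inside quotes do need to survive, the "replace later" idea from the code comment can be sketched as follows. The placeholder character and the class name are assumptions made for this example; the placeholder must be something that never occurs in real input.

```java
import java.util.ArrayList;
import java.util.List;

final class PlaceholderCommaSplit {
    // Hypothetical placeholder; assumed never to appear in the input.
    private static final char PLACEHOLDER = '\u0000';

    static List<String> split(String input) {
        StringBuilder builder = new StringBuilder(input);
        boolean inQuotes = false;
        for (int i = 0; i < builder.length(); i++) {
            char c = builder.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes; // toggle state
            } else if (c == ',' && inQuotes) {
                builder.setCharAt(i, PLACEHOLDER); // hide the quoted comma
            }
        }
        List<String> result = new ArrayList<>();
        // limit -1 keeps trailing empty tokens
        for (String token : builder.toString().split(",", -1)) {
            result.add(token.replace(PLACEHOLDER, ',')); // restore the comma
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(split("foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\""));
    }
}
```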

http://sourceforge.net/projects/javacsv/

https://github.com/pupi1985/JavaCSV-Reloaded (a version of the previous library that allows the generated output to use Windows line terminators, \r\n)

http://opencsv.sourceforge.net/

CSV API for Java

Can you recommend a Java library for reading (and possibly writing) CSV files?

Java lib or app to convert CSV to XML file?

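I would not recommend the regex answer from Bart; I find a parsing solution better in this particular case (as Fabian proposed). I tried both the regex solution and my own parsing implementation, and found that:

Parsing is much faster than splitting with a lookahead-based regex: roughly 20 times faster for short strings and 40 times faster for long strings.
The regex fails to find the empty string after the last comma. That was not part of the original question, though; it was my own requirement.

My solution and test are below: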

String tested = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\",";
long start = System.nanoTime();
String[] tokens = tested.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
long timeWithSplitting = System.nanoTime() - start;

start = System.nanoTime(); 
List<String> tokensList = new ArrayList<String>();
boolean inQuotes = false;
StringBuilder b = new StringBuilder();
for (char c : tested.toCharArray()) {
    switch (c) {
    case ',':
        if (inQuotes) {
            b.append(c);
        } else {
            tokensList.add(b.toString());
            b = new StringBuilder();
        }
        break;
    case '\"':
        inQuotes = !inQuotes;
        // intentional fall-through: the quote itself is appended as well
    default:
        b.append(c);
        break;
    }
}
tokensList.add(b.toString());
long timeWithParsing = System.nanoTime() - start;

System.out.println(Arrays.toString(tokens));
System.out.println(tokensList.toString());
System.out.printf("Time with splitting:\t%10d\n",timeWithSplitting);
System.out.printf("Time with parsing:\t%10d\n",timeWithParsing);

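Of course, you are free to change the switch to else-ifs in this snippet if its ugliness bothers you. Note the missing break after the quote case: it deliberately falls through to default so that the quote character is appended as well. StringBuilder was chosen over StringBuffer by design, for speed, since thread safety is irrelevant here.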
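For readers who prefer the else-if form, one possible rewrite of the parsing loop is sketched below. The class and method names are made up for illustration; the behaviour is intended to match the switch version, including the empty token after a trailing comma.

```java
import java.util.ArrayList;
import java.util.List;

final class ElseIfTokenizer {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : input.toCharArray()) {
            if (c == ',' && !inQuotes) {
                // separator outside quotes: finish the current token
                tokens.add(current.toString());
                current.setLength(0);
            } else {
                if (c == '"') {
                    inQuotes = !inQuotes; // the quote itself is still appended below
                }
                current.append(c);
            }
        }
        tokens.add(current.toString()); // last token, empty after a trailing comma
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\","));
    }
}
```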

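You're in that annoying boundary area where regexps almost won't do (as Bart pointed out, escaping the quotes would make life hard), and yet a full-blown parser seems like overkill.

If you are likely to need greater complexity any time soon, I would go looking for a parser library. For example, this one.

I was impatient and chose not to wait for answers. For reference, it doesn't look that hard to do something like this (which works for my application; I don't need to worry about escaped quotes, because the quoted content is limited to a few constrained forms):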

final static private Pattern splitSearchPattern = Pattern.compile("[\",]"); 
private List<String> splitByCommasNotInQuotes(String s) {
    if (s == null)
        return Collections.emptyList();

    List<String> list = new ArrayList<String>();
    Matcher m = splitSearchPattern.matcher(s);
    int pos = 0;
    boolean quoteMode = false;
    while (m.find())
    {
        String sep = m.group();
        if ("\"".equals(sep))
        {
            quoteMode = !quoteMode;
        }
        else if (!quoteMode && ",".equals(sep))
        {
            int toPos = m.start(); 
            list.add(s.substring(pos, toPos));
            pos = m.end();
        }
    }
    if (pos < s.length())
        list.add(s.substring(pos));
    return list;
}

(Exercise for the reader: extend this to handle escaped quotes by looking for backslashes as well.)
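One way the exercise might be approached is to let the search pattern also match a backslash followed by any character, and to skip such matches. This is a sketch under the assumption that a backslash escapes exactly the next character; the class name `EscapeAwareSplitter` is hypothetical, and escape sequences are left verbatim in the returned tokens.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapeAwareSplitter {
    // Backslash + any char is tried first, so an escaped quote or comma
    // is consumed as one unit and never seen as a separator.
    private static final Pattern SEARCH = Pattern.compile("\\\\.|[\",]", Pattern.DOTALL);

    static List<String> split(String s) {
        if (s == null) {
            return Collections.emptyList();
        }
        List<String> list = new ArrayList<>();
        Matcher m = SEARCH.matcher(s);
        int pos = 0;
        boolean quoteMode = false;
        while (m.find()) {
            String sep = m.group();
            if (sep.charAt(0) == '\\') {
                continue; // escaped character: ignore it
            }
            if ("\"".equals(sep)) {
                quoteMode = !quoteMode;
            } else if (!quoteMode && ",".equals(sep)) {
                list.add(s.substring(pos, m.start()));
                pos = m.end();
            }
        }
        if (pos < s.length()) {
            list.add(s.substring(pos));
        }
        return list;
    }

    public static void main(String[] args) {
        System.out.println(split("a,\"b\\\",c\",d"));
    }
}
```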

Try a lookaround like (?!\"),(?!\"). This should match a , that is not surrounded by ".
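A quick check suggests that this lookaround only inspects the character right after the comma, so it does not seem to cover the string from the question. The sketch below (class name made up for illustration) shows the behaviour on a shortened sample.

```java
import java.util.Arrays;

final class LookaroundCheck {
    /** Splits with the suggested pattern: a comma not directly followed by a quote. */
    static String[] split(String line) {
        return line.split("(?!\"),(?!\")");
    }

    public static void main(String[] args) {
        // The comma between baz and blurb is followed by 'b', so it is
        // still treated as a separator even though it is inside quotes.
        System.out.println(Arrays.toString(split("foo,bar,c;qual=\"baz,blurb\",d")));
    }
}
```

On this input the quoted section is cut in two, giving five tokens instead of four, so the pattern appears to help only when every quoted field starts immediately after a comma.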

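The simplest approach is not to match the delimiters, i.e. commas, with complex additional logic just to exclude false delimiters, but rather to match the intended data (which may be quoted strings) in the first place.

The pattern consists of two alternatives: a quoted string ("[^"]*" or ".*?") or everything up to the next comma ([^,]+). To support empty cells, the unquoted item must be allowed to be empty and the pattern must consume the next comma, if any, using the \\G anchor: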

Pattern p = Pattern.compile("\\G\"(.*?)\",?|([^,]*),?");

The pattern also contains two capturing groups to get either the quoted string's content or the plain content.

Then, with Java 9, we can get an array like this:

String[] a = p.matcher(input).results()
    .map(m -> m.group(m.start(1)<0? 2: 1))
    .toArray(String[]::new);

whereas older Java versions need a loop like this:

for(Matcher m = p.matcher(input); m.find(); ) {
    String token = m.group(m.start(1)<0? 2: 1);
    System.out.println("found: "+token);
}

Adding the items to a List or an array is left as an exercise for the reader.
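A possible take on that exercise is sketched below, using the first pattern from this answer. The class name `CellCollector` and the early exit are assumptions of this sketch: the second alternative can also match an empty string at the very end of the input, so the loop stops once a match did not consume a trailing comma.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CellCollector {
    private static final Pattern CELL = Pattern.compile("\\G\"(.*?)\",?|([^,]*),?");

    static List<String> cells(String input) {
        List<String> cells = new ArrayList<>();
        Matcher m = CELL.matcher(input);
        while (m.find()) {
            // group 1 holds quoted content, group 2 holds plain content
            cells.add(m.group(m.start(1) < 0 ? 2 : 1));
            // Added guard (not in the original loop): if this match did not
            // end with a delimiter comma, it was the last cell.
            boolean consumedComma = m.end() > m.start()
                    && input.charAt(m.end() - 1) == ',';
            if (!consumedComma) {
                break;
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        System.out.println(cells("a,\"b,c\",,d"));
    }
}
```

With this guard, a trailing comma still produces a final empty cell, while input that ends in a regular cell does not get an extra empty entry appended.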

On Java 8, you can write your own results() implementation and then proceed as in the Java 9 solution.

For mixed content with embedded strings, as in the question, you can simply use:

Pattern p = Pattern.compile("\\G((\"(.*?)\"|[^,])*),?");

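But then the strings are kept in their quoted form.

Rather than using lookahead and other crazy regex, just pull out the quotes first. That is, for every quoted group, replace it with __IDENTIFIER_1 or some other marker, and record the mapping in a map of string to string.

After you split on commas, replace all mapped identifiers with the original string values.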
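The mask, split, and restore steps could look roughly like this. It is a sketch, not a definitive implementation: the class name, the exact marker format (`__IDENTIFIER_n__`, with a closing suffix so that marker 1 is not mistaken for a prefix of marker 10), and the assumption that the marker never occurs in real input are all choices made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuoteMaskingSplit {
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"");
    private static final Pattern MARKER = Pattern.compile("__IDENTIFIER_\\d+__");

    static List<String> split(String input) {
        // Step 1: replace each quoted group with a marker and remember it.
        Map<String, String> originals = new HashMap<>();
        StringBuffer masked = new StringBuffer();
        Matcher q = QUOTED.matcher(input);
        while (q.find()) {
            String marker = "__IDENTIFIER_" + (originals.size() + 1) + "__";
            originals.put(marker, q.group());
            q.appendReplacement(masked, marker);
        }
        q.appendTail(masked);

        // Step 2: split on commas, then put the quoted groups back.
        List<String> result = new ArrayList<>();
        for (String token : masked.toString().split(",", -1)) {
            Matcher m = MARKER.matcher(token);
            StringBuffer restored = new StringBuffer();
            while (m.find()) {
                String original = originals.getOrDefault(m.group(), m.group());
                m.appendReplacement(restored, Matcher.quoteReplacement(original));
            }
            m.appendTail(restored);
            result.add(restored.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(split("foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\""));
    }
}
```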

What about a one-liner using String.split()?

String s = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] split = s.split( "(?<!\".{0,255}[^\"]),|,(?![^\"].*\")" );

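A regular expression is not capable of handling escaped characters. For my application, I needed the ability to escape quotes and spaces (my separator is a space, but the code is the same).

Here is my solution in Kotlin (the language of this particular application), based on the one from Fabian Steeg: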

fun parseString(input: String): List<String> {
    val result = mutableListOf<String>()
    var inQuotes = false
    var inEscape = false
    val current = StringBuilder()
    for (i in input.indices) {
        // If this character is escaped, add it without looking
        if (inEscape) {
            inEscape = false
            current.append(input[i])
            continue
        }
        when (val c = input[i]) {
            '\\' -> inEscape = true // escape the next character, \ isn't added to result
            ',' -> if (inQuotes) {
                current.append(c)
            } else {
                result += current.toString()
                current.clear()
            }
            '"' -> inQuotes = !inQuotes
            else -> current.append(c)
        }
    }
    if (current.isNotEmpty()) {
        result += current.toString()
    }
    return result
}

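I don't think this is a place for regular expressions. Contrary to other opinions, I don't think a parser is overkill. It's about 20 lines and fairly easy to test.

I would do something like this: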

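// One possible runnable reading of the idea; currentString is a sample value.
String currentString = "foo,bar";
String[] split;

// Check whether the string contains a double quote at all.
boolean foundQuote = currentString.indexOf('"') >= 0;

if (foundQuote) {
    // do nothing: a plain split would also cut at commas inside quotes
    split = new String[] { currentString };
} else {
    split = currentString.split(",");
}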

Source URL: https://stackoverflow.com/questions/1757065/java-splitting-a-comma-separated-string-but-ignoring-commas-in-quotes
